Generator hints and tips
Here are some hints and tips about choosing generators and addressing some specific use cases.
Tonic provides several options for de-identifying the names of individuals. The method that you select depends on the specific use case, including how realistic the output must be and the privacy needs.
The following are a few of the generator options and how and why you might use them.
- Name generator: Randomly returns a name from a dictionary of primarily Westernized names, unrelated to the original value. Can provide complete privacy, unless you use Consistency. The output is realistic because the returned values are real names.
- Categorical generator: "Shuffles" all of the values in the field. The output contains realistic-looking names, because it uses the names from the original data set. This can be beneficial if the original data contains, for example, names that are common to a particular region and that should be maintained. When you use this generator with the Differential Privacy option, it ensures that the output is secure from re-identification. However, if the source data set is small or each name is highly unique, Tonic might prevent you from using this option.
- Custom Categorical: Allows you to provide your own dictionary of values. These values are included in the output at the same frequency that the original values occur in the source data.
- Character Scramble: Randomly replaces characters with other characters. The output does not provide realistic-looking names, but it provides a high level of privacy that prevents recovery of the original data. It preserves whitespace, punctuation (such as the hyphen in hyphenated names), and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
- Character Substitution: Similar to Character Scramble, but uses a single character mapping throughout the generated data. This reduces the privacy level, but ensures consistency and uniqueness. This generator also supports additional Unicode blocks, so that the output characters more closely match the input. This might be helpful if the input includes names with characters outside of the basic Latin (a-z, A-Z) characters.
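To make the difference between the two character-level options concrete, here is a minimal Python sketch. It is not Tonic's implementation. The function names (`scramble`, `build_substitution_table`, `substitute`) are hypothetical, and the sketch only handles basic Latin letters, passing every other character through unchanged.

```python
import random
import string


def scramble(value: str, rng: random.Random) -> str:
    """Replace each basic Latin letter with a random letter of the same case.

    Whitespace, punctuation, and all other characters pass through unchanged,
    so the length and capitalization pattern of the input are preserved.
    """
    out = []
    for ch in value:
        if ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def build_substitution_table(rng: random.Random) -> dict:
    """Build one fixed letter-to-letter mapping (a permutation per case)."""
    upper = list(string.ascii_uppercase)
    lower = list(string.ascii_lowercase)
    rng.shuffle(upper)
    rng.shuffle(lower)
    return str.maketrans(
        string.ascii_uppercase + string.ascii_lowercase,
        "".join(upper) + "".join(lower),
    )


def substitute(value: str, table: dict) -> str:
    """Apply the single shared mapping, so equal inputs give equal outputs."""
    return value.translate(table)


if __name__ == "__main__":
    rng = random.Random()
    print(scramble("Mary-Ann O'Neil", rng))    # different on each call
    table = build_substitution_table(rng)
    print(substitute("Mary-Ann O'Neil", table))  # stable for a given table
```

Because the substitution table is a permutation, two different inputs never map to the same output, which is where the uniqueness comes from. The same property is what lowers the privacy level: anyone who learns part of the mapping can reverse it.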
Rows of data often have multiple date or timestamp fields that have a logical dependency, such as START_DATE and END_DATE. In this case, a randomly generated date is not viable, because it could produce nonsensical output where events occur chronologically out of order.
The following generator options handle these scenarios:
- Timestamp Shift generator (with Consistency): To solve the problem described above, ensure that two or more timestamps are randomly shifted by the same amount instead of independently of each other. The key is to use the consistency option. For example, a row of data represents an individual who is identified by a primary key of PERSON_ID. The row also contains START_DATE and END_DATE columns. You can apply a timestamp shift to the START_DATE and END_DATE columns within a desired range, and make both columns consistent to PERSON_ID. Whenever the generator encounters the same PERSON_ID value, it shifts the dates by the same amount.
- Event Timestamps generator: You can apply the Event Timestamps generator to multiple date columns on the same table. You can link them to follow the underlying distribution of dates. For more information, see the blog post Simulating event pipelines for fun and profit (and for testing too).
- Date Truncation generator: This generator can sometimes address the described problem. You can configure this generator to truncate the input to the year, month, day, hour, minute, or second. It guarantees that a secondary event does not occur before a primary event. However, truncation might cause the two events to have the same date or timestamp value. Whether you can use this generator for this purpose depends on the typical time separation between the two events relative to the truncation option, and on whether truncation provides an adequate level of privacy for the particular use case.
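The consistent-shift idea can be sketched in Python as follows. This illustrates the concept only and is not how Tonic computes shifts. The helper names and the use of an unkeyed SHA-256 hash are assumptions made for the example. A production tool would normally mix in a secret so that the shift cannot be recomputed from the key alone.

```python
import hashlib
from datetime import date, timedelta


def consistent_shift_days(key: str, max_days: int) -> int:
    """Map a key to a shift in [-max_days, max_days].

    The shift is derived from a hash of the key, so the same key always
    produces the same shift (hypothetical scheme, for illustration only).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_days


def shift_row(row: dict, max_days: int = 30) -> dict:
    """Shift START_DATE and END_DATE by one shared amount tied to PERSON_ID."""
    delta = timedelta(days=consistent_shift_days(str(row["PERSON_ID"]), max_days))
    shifted = dict(row)
    shifted["START_DATE"] = row["START_DATE"] + delta
    shifted["END_DATE"] = row["END_DATE"] + delta
    return shifted


if __name__ == "__main__":
    row = {
        "PERSON_ID": 1001,
        "START_DATE": date(2021, 3, 1),
        "END_DATE": date(2021, 3, 15),
    }
    print(shift_row(row))
```

Because both columns receive the same offset, the interval between START_DATE and END_DATE is unchanged, so the end can never move before the start.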
Most Tonic generators preserve NULL values that are in the data.
They do not automatically preserve empty values.
To make sure that any empty values stay empty in the destination database:
- 1.
- 2. For the default generator, select the generator to apply to the non-empty values.
- 3. Create a condition to look for empty values. You can either:
  - Use the regex comparison against the regex whitespace value (\s*).
  - Use the = operator and leave the value empty or empty except for a single space.
  If you are not sure which characters the empty strings use, the regex option is more flexible. However, it is less efficient.
- 4.
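The trade-off between the two conditions can be seen in a small Python sketch. The function names are hypothetical, and the sketch assumes that the regex comparison must match the whole value. With a partial match, \s* would match every string, because it can match zero characters.

```python
import re

# Matches values that consist only of whitespace (including the empty string).
EMPTY_PATTERN = re.compile(r"\s*")


def is_empty_by_regex(value: str) -> bool:
    """Flexible check: any mix of spaces, tabs, or newlines counts as empty."""
    return EMPTY_PATTERN.fullmatch(value) is not None


def is_empty_by_equality(value: str) -> bool:
    """Cheaper check: only the exact values "" and " " count as empty."""
    return value in ("", " ")


def mask(value, generator):
    """Pass NULL and empty values through; apply the generator to the rest."""
    if value is None or is_empty_by_regex(value):
        return value
    return generator(value)


if __name__ == "__main__":
    for sample in ["", " ", "\t\n", "Josh"]:
        print(repr(sample), is_empty_by_regex(sample), is_empty_by_equality(sample))
```

A tab-only value illustrates the difference: the regex check treats it as empty, and the equality check does not.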
You sometimes might want to apply the same generator to all of the text values in a JSON, HTML, or XML value. For example, you might want to apply the Character Scramble to all of the text.
Instead of creating separate path expressions for each path, you can use one or two path expressions that capture all of the values.
For the Array JSON Mask or JSON Mask generator, the path expression $..* captures all of the text values. You can then select the generator to apply to the values.
For XML or HTML values, you use two XPath expressions:
- //text() gets all of the text nodes.
- //@* gets all of the attribute values.
You apply the generator to each expression.
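As a rough illustration of what these wildcard expressions select, the following Python sketch walks a JSON value and an XML tree using only the standard library. It approximates the effect of the expressions and does not evaluate JSONPath or XPath. The `redact` placeholder and both function names are hypothetical stand-ins for whichever generator you apply.

```python
import xml.etree.ElementTree as ET


def redact(text: str) -> str:
    """Placeholder generator: keeps the length, hides the content."""
    return "x" * len(text)


def mask_json_strings(node, generator):
    """Apply the generator to every string value at any depth."""
    if isinstance(node, dict):
        return {key: mask_json_strings(val, generator) for key, val in node.items()}
    if isinstance(node, list):
        return [mask_json_strings(item, generator) for item in node]
    if isinstance(node, str):
        return generator(node)
    return node  # numbers, booleans, and null are left unchanged


def mask_xml(root: ET.Element, generator) -> None:
    """Mask text nodes and attribute values on every element, in place."""
    for element in root.iter():
        if element.text and element.text.strip():
            element.text = generator(element.text)
        if element.tail and element.tail.strip():
            element.tail = generator(element.tail)
        for name, value in list(element.attrib.items()):
            element.set(name, generator(value))


if __name__ == "__main__":
    doc = {"name": "Josh", "tags": ["a1", "b2"], "age": 41}
    print(mask_json_strings(doc, redact))
    root = ET.fromstring('<customer id="42"><name>Josh</name></customer>')
    mask_xml(root, redact)
    print(ET.tostring(root, encoding="unicode"))
```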
Sub-generators are applied sequentially. You can apply the wildcard paths in addition to more specific paths and generators.
For example, one path expression references a specific name or address and uses the Name or Address generator. The wildcard path expressions use the Character Scramble generator to mask any unknown fields in the document that could contain sensitive information.
As another example, you might assign the Passthrough generator to specific known fields that never contain sensitive information.
When your XML includes namespaces, specify each element in the path expression, including its namespace prefix, as:
*[name()='namespace:elementName']
For example, for the following XML:
<ns0:Message xmlns:ns0=".">
<ns0:Payload>
<ns1:Customer xmlns:ns1=".">
<ns1:name>
Josh
</ns1:name>
</ns1:Customer>
</ns0:Payload>
</ns0:Message>
A working XPath to mask the name value is:
/*[name()='ns0:Message']/*[name()='ns0:Payload']/*[name()='ns1:Customer']/*[name()='ns1:name']
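Writing these expressions by hand is error-prone for deeply nested elements. As a convenience, a small helper can assemble the expression from a list of prefixed element names. The helper name is hypothetical, and the sketch only builds the string; it does not evaluate the XPath.

```python
def namespaced_xpath(qualified_names):
    """Build an absolute XPath that matches each element by its prefixed name."""
    return "".join(f"/*[name()='{name}']" for name in qualified_names)


if __name__ == "__main__":
    path = namespaced_xpath(
        ["ns0:Message", "ns0:Payload", "ns1:Customer", "ns1:name"]
    )
    print(path)
```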
You might sometimes set default date values to the absolute minimum and maximum values that are allowed by the database. For example, for SQL Server, these values are January 1, 1753 and December 31, 9999.
When you assign the Timestamp Shift generator, the minimum value cannot be shifted backward and the maximum value cannot be shifted forward.
To skip those default values and shift the other values:
- 1.
- 2.
- 3. Create conditions to look for the minimum or maximum values.
- 4.
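The logic behind these conditions can be sketched in Python. The constants use the SQL Server values mentioned above, and the function name is hypothetical. The sketch shows only the pass-through idea, not Tonic's configuration.

```python
from datetime import date, timedelta

# Placeholder "no value" dates, using the SQL Server limits as an example.
SENTINEL_MIN = date(1753, 1, 1)
SENTINEL_MAX = date(9999, 12, 31)


def shift_unless_sentinel(value: date, days: int) -> date:
    """Leave the sentinel dates untouched and shift every other date."""
    if value in (SENTINEL_MIN, SENTINEL_MAX):
        return value
    return value + timedelta(days=days)


if __name__ == "__main__":
    print(shift_unless_sentinel(date(2020, 5, 17), 10))
    print(shift_unless_sentinel(SENTINEL_MAX, 10))
    # Without the check, shifting the maximum forward fails outright,
    # because Python dates cannot exceed year 9999.
    try:
        SENTINEL_MAX + timedelta(days=10)
    except OverflowError as error:
        print("cannot shift:", error)
```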
You might sometimes want to add values that one generator produces to the result of another generator's transformation of the same value.
For example, you use Character Scramble to mask a username. You might also want to prefix the value with a fixed constant value, or append a sequential integer.
To accomplish this:
- 1.
- 2. In addition to the capture groups that are specific to your data:
  - Use (^) as a capture group for a prefix.
  - Use ($) as a capture group for a suffix.
  - Use () as an empty group at any point in the regex pattern.
- 3. Apply the relevant generators to each capture group.
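So to implement the example above (prefix with a constant, scramble the value, append a sequential integer), you provide the expression (^)(.*)()($). This produces four capture groups: the prefix group, the group that captures the original value, an empty group, and the suffix group.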
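To see how the pattern splits a value, the following Python sketch applies it to a sample username and rebuilds the output group by group. The assignment used here is an assumption made for illustration: a constant for the first group, a scramble for the second, a sequential integer for the third, and the fourth left as is. The `transform` function is hypothetical, and single-line values are assumed.

```python
import itertools
import random
import re
import string

PATTERN = re.compile(r"(^)(.*)()($)")


def transform(value: str, prefix: str, counter, rng: random.Random) -> str:
    """Rebuild a value from the four capture groups of PATTERN."""
    match = PATTERN.fullmatch(value)
    if match is None:
        # "." does not match a newline, so multi-line values are rejected.
        raise ValueError("value must be a single line")
    _start, body, _empty, end = match.groups()
    scrambled = "".join(
        rng.choice(string.ascii_lowercase) if ch.isalpha() else ch for ch in body
    )
    # Group 1 -> constant prefix, group 2 -> scrambled text,
    # group 3 -> next integer in the sequence, group 4 -> unchanged (empty).
    return prefix + scrambled + str(next(counter)) + end


if __name__ == "__main__":
    print(PATTERN.fullmatch("jsmith").groups())  # ('', 'jsmith', '', '')
    counter = itertools.count(1)
    rng = random.Random()
    for username in ["jsmith", "a.lee"]:
        print(transform(username, "usr_", counter, rng))
```

The three zero-width groups capture empty strings. They exist only as insertion points, so that a generator has a position at which to add its output.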