These hints and tips can help you to choose generators and address some specific use cases.
Tonic Structural provides several options for de-identifying names of individuals names. The method that you select depends on the specific use case, including the required realism of the output and privacy needs.
The following are a few of the generator options and how and why you might use them.
Name generator Randomly returns a name from a dictionary of primarily Westernized names, unrelated to the original value. Can provide complete privacy, unless you use Consistency. The output is realistic because the values returned are real names.
Categorical generator This generator shuffles all of the values in the field while preserving the overall frequency of the values. It ensures that the output contains realistic-looking names, and that the output uses the names from the original data set. This can be beneficial if the original data contains, for example, names that are common to a particular region and that should be maintained. When you use this generator with the Differential Privacy option, it ensures the output is secure from re-identification. However, if the source data set is small or each name is highly unique, Structural might prevent you from using this option.
Custom Categorical Allows you to provide your own dictionary of values. These values are included in the output at the same frequency that the original values occur in the source data.
Character Scramble Randomly replaces characters with other characters. The output does not provide realistic looking names, but it provides a high level of privacy that prevents recovery of the original data. It does preserve whitespace, punctuation (such as hyphenated names), and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
Character Substitution Similar to Character Scramble, but uses a single character mapping throughout the generated data. This reduces the privacy level, but ensures consistency and uniqueness. This generator also has more support for additional unicode blocks to ensure that the output characters more closely match the input. This might be helpful if the input includes names with characters outside of the basic Latin (a-z, A-Z) characters.
Rows of data often have multiple date or timestamp fields that have a logical dependency, such as START_DATE
and END_DATE
.
In this case, a randomly generated date is not viable, because it could produce a nonsensical output where events occur chronologically out of order.
The following generator options handle these scenarios:
Timestamp Shift generator (with Consistency)
To solve the problem described above, you ensure that two or more timestamps are randomly shifted by the same amount instead of independently from each other.
The key is to use the consistency option.
For example, a row of data represents an individual that is identified by a primary key of PERSON_ID
. The row also contains START_DATE
and END_DATE
columns. You can apply a timestamp shift to the START_DATE
and END_DATE
columns within a desired range, and make both columns consistent to PERSON_ID
.
Whenever the generator encounters the same PERSON_ID
value, it shifts the dates by the same amount.
Event Timestamps generator You can apply the Event Timestamps generator to multiple date columns on the same table. You can link them to follow the underlying distribution of dates. For more information, go to the blog post Simulating event pipelines for fun and profit (and for testing too).
Date Truncation generator This generator can sometimes address the described problem. You can configure this generator to truncate the input to the year, month, day, hour, minute, or second. It guarantees that a secondary event does not occur BEFORE a primary event. However, truncation might cause them to become the same date value or timestamp. Whether you can use this generator for this purpose depends on the typical time separation between the two events relative to the truncation option, and whether truncation provides an adequate level of privacy for the particular use case.
Free text refers to text fields in the source database that might come from an "uncontrolled" source such as user text entry. In these cases, any record might or might not contain sensitive information.
Some possible examples include:
Notes from a doctor or healthcare provider that contain Protected Health Information (PHI)
Other personally identifiable information, such as a Social Security number or telephone number, that a user enters into an open-ended text entry form
Structural provides several suitable options. The method that you select depends on the specific use case, including the required realism of the output and any privacy requirements.
Here are a few generator options for free text fields, with information on how and why you might use them.
Character Scramble generator Randomly replaces characters with other characters. The output does not contain meaningful text, but it provides a high level of privacy that prevents recovery of the original data. The Character Scramble generator does preserve whitespace, punctuation, and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
Regex Mask generator
Uses regular expressions to parse strings. It then replaces specified substrings with the output of selected generators. The parts of the string to replace are specified in unnamed top-level capture groups.
The Regex Mask generator can preserve more realism of the underlying text, but introduces privacy risks. Any sensitive information that does not conform to a known and configured pattern is not captured and replaced.
As an example of matching specific formats, a configuration that includes the following two patterns would replace both telephone numbers that use the ###-###-####
format, and SSNs that use the ###-##-####
format, but leave the surrounding text unmodified:
SSN: ([0-9]{3}-[0-9]{2}-[0-9]{4})
Telephone Number: ([0-9]{3}-[0-9]{3}-[0-9]{4})
You can configure multiple regular expression patterns to handle all known or expected sensitive information formats. You cannot use this method to replace values that you cannot use a regular expression to reliably identify, such as names within free text.
When you use this option, make sure to enable Replace all matches for each pattern.
Constant, Custom Categorical, and Null generators Each of these options provides the highest level of privacy, because they completely remove or replace the original text. You might use each one for different reasons:
Null: If the field is nullable and the use case does not require any data in the field, you can use the Null generator to replace the values with NULL.
Constant: Allows you to provide a fixed value to replace all of the source value. For example, you could provide a "Lorem ipsum" string or other dummy value that is appropriate for your data set.
Custom Categorical: Similar to the Constant generator, it replaces the original value with a fixed value. To increase the cardinality of the output, you enter a list of possible values. The values are randomly used on the output records.
Most Structural generators preserve NULL values that are in the data.
They do not automatically preserve empty values.
To make sure that any empty values stay empty in the destination database:
Assign the Conditional generator to the column.
For the default generator, select the generator to apply to the non-empty values.
Create a condition to look for empty values. You can either:
Use the regex comparison against the regex whitespace value (\s*
).
Use the =
operator and leave the value empty or empty except for a single space.
If you are not sure which characters the empty strings use, the regex option is more flexible. However, it is less efficient.
For the empty value condition, set the generator to Passthrough.
You sometimes might want to apply the same generator to all of the text values in a JSON, HTML, or XML value. For example, you might want to apply the Character Scramble to all of the text.
Instead of creating separate path expressions for each path, you can use one or two path expressions that capture all of the values.
For the Array JSON Mask or JSON Mask generator, the path expression $..*
captures all of the text values. You can then select the generator to apply to the values.
For the HTML Mask and XML Mask generators, you create two path expressions:
//text()
gets all of the text nodes.
//@*
gets all of the attribute values.
You apply the generator to each expression.
Sub-generators are applied sequentially. You can apply the wildcard paths in addition to more specific paths and generators.
For example, one path expression references a specific name or address and uses the Name or Address generator. The wildcard path expressions use the Character Scramble generator to mask any unknown fields in the document that could contain sensitive information.
As another example, you might assign the Passthrough generator to specific known fields that never contain sensitive information.
When your XML includes namespaces, then to include the namespaces in the path expression, specify the elements as:
For example, for the following XML:
A working XPath to mask the name value is:
You might sometimes set default date values to the absolute minimum and maximum values that are allowed by the database. For example, for SQL Server, these values are January 1, 1753 and December 31, 9999.
When you assign the Timestamp Shift generator, the minimum value cannot be shifted backward and the maximum value cannot be shifted forward.
To skip those default values and shift the other values:
Assign the Conditional generator to the column.
For the default generator, select the Timestamp Shift generator.
Create conditions to look for the minimum or maximum values.
For those conditions, set the generator to Passthrough.
You might sometimes want to add values that are the output of a generator to the results of the transformation by another generator.
For example, you use Character Scramble to mask a username. You might also want to prefix the value with a fixed constant value, or append a sequential integer.
To accomplish this:
Apply the Regex Mask generator to the column.
In addition to the capture groups that are specific to your data:
Use (^)
as a capture group for a prefix.
Use ($)
as a capture group for a suffix.
Use ()
as an empty group at any point in the regex pattern.
Apply the relevant generators to each capture group.
So to implement the example above (prefix with a constant, scramble the value, append a sequential integer), you provide the expression (^)(.*)()($)
.
This produces four capture groups:
Group 0 is for the prefix. You assign the Constant generator and provide the value to use as the prefix.
Group 1 captures all of the original values. You assign the Character Scramble generator.
Group 2 captures any empty values. You assign the Constant generator to provide a value to use for those values.
Group 3 is for the suffix. You assign the Sequential Integer generator.
A table that contains user data might include both name and email address columns. If a user's email address is based on their name, then in the destination data, you might want to also tie the email addresses to the names.
For example, your email addresses might use the format firstName.lastName@mycompany.com
. In the source data, the email address for John Smith is John.Smith@mycompany.com. In the destination data, assuming John Smith is replaced by Michael Jones, you want the email address to be Michael.Jones@mycompany.com.
At a high level, to line up name and email address columns:
Assign the Name generator to the name fields. Make the Name generator consistent with an identifier column.
Assign the Regex Mask generator to the email address field.
Create a regular expression that extracts to capture groups the name portion of the email address. The specific expression varies based on the email address format.
Assign the Name generator to each name capture group. Make the Name generator consistent with the same identifier column.
In this example, the source data contains userId, firstName
, lastName
, and emailAddress
fields, and the email address is firstName
.lastName
@mycompany.com.
To ensure that the destination data email addresses are aligned to the destination data names:
For the firstName
field, assign the Name generator, configured to produce a first name. Make the generator consistent with the userId
column.
For the lastName
field, assign the Name generator, configured to produce a last name. Make the generator consistent with the userId
column.
For the emailAddress
field, assign the Regex Mask generator. Use the following regular expression to extract the parts of the email address to capture groups:
([a-zA-Z]+).([a-zA-Z]+)@(.*)
For the first name and last name capture groups:
Assign the Name generator, configured to produce the first and last names.
Make the Name generator consistent with the userId
column.