Generators transform the data in a source database column. You assign the generators to use. Tonic Structural offers a variety of generators to transform different types of data.
For Enterprise instances, generator presets allow you to configure custom configurations of generators that you can then assign to columns.
You can also view this video overview of generators and how they work.
Generator summary
Summary list of generators.
Generator reference
Details about the characteristics and configuration options for each generator.
Generator API reference
Details about the structure of each generator assignment in the API.
Generator characteristics
Common generator characteristics to be aware of, such as consistency and linking.
Composite generators
Composite generators apply a generator to a specific data element or based on a condition.
Primary key generators
Learn about generators that you can apply to primary key columns.
Assigning and configuring generators
Assign a generator to a column.
Managing generator presets
Set the default configuration for generators and create custom configurations of generators.
Hints and tips for assigning generators
Learn more about choosing and configuring the appropriate generator.
Using Structural data encryption
Enable Structural to decrypt source data and encrypt destination data.
Using the API
Use the Structural API to assign and configure generators.
The following table summarizes the available generators. It indicates whether each generator can be made consistent, can be linked, and is differentially private.
In the Consistency column, the table also indicates whether the generator can be made self-consistent only, or can be made either self-consistent or consistent with another column.
The Description column includes:
For generators that can be data-free, whether the generator is always data-free, or only data-free when consistency is disabled.
The possible privacy rankings for the generator. For details about the available privacy rankings, go to #privacy-report-privacy-ranking-about.
Generator | Description | Consistency | Linking | Differential Privacy |
---|---|---|---|---|
Generates a random string to replace a specific part of a mailing address. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
Yes
Yes if not consistent
Uses deep neural networks for high-fidelity data mimicking. By default, not available. Privacy ranking: 3
No
No
No
Identifies the algebraic relationship between 3 or more numeric values (at least one non-integer) and generates new values to match. Privacy ranking: 3
No
Yes
No
Generates unique alphanumeric strings of the same length as the input. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Within an array, replaces letters with random other letters, and numbers with random other numbers. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Runs a selected generator on values that match a user-specified JSONPath. Privacy ranking: 5
--
--
--
Runs a selected generator on values that match a regular expression. Privacy ranking: 5
--
--
--
Generates unique alphanumeric strings based on any printable ASCII characters. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates a random company name-like string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Creates values at the same frequency as the values in the underlying data. Privacy ranking: - 2 if differential privacy enabled - 3 if differential privacy not enabled
No
Yes
Configurable
Replaces letters with random other letters and numbers with random other numbers. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Replaces characters randomly, but preserves formatting. Privacy ranking: 4
Yes - Implicitly consistent
No
No
Company Name (Deprecated)
This generator is deprecated. Use the Business Name generator instead. Generates a random company name-like string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Applies different generators to rows conditionally based on any value in the table. Privacy ranking: If a fallback generator is selected, then the lower of either 5 or the fallback generator. 5 if no fallback generator is selected.
No
No
No
Uses a single specified value to mask all values in the column. Data-free. Privacy ranking: 1
No
No
Yes
Generates a continuous distribution to fit the underlying data. Privacy ranking: - 2 if differential privacy enabled - 3 if differential privacy not enabled
No
Yes
Configurable
Populates the column using the sum of the values in other columns. Privacy ranking: 3
No
No
No
Masks a text column. Parses the text as a row for which the columns are delimited by a specified character. Privacy ranking: 5
--
--
--
Selects from values you provide. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Truncates dates or timestamps to a specific date or time part. Privacy ranking: 5
No
No
No
Scrambles characters in an email address. Preserves the formatting and keeps the @ and . characters. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates timestamps that fit an event distribution. Privacy ranking: 3
No
Yes
No
Scrambles characters in a file name. Preserves the formatting and the file extension. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Replaces all instances of the find string with the replace string. Privacy ranking: 5
No
No
No
Transforms Norwegian national identity numbers. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self or other
No
No
Masks columns that contain latitude and longitude values. Privacy ranking: 3
No
No
No
Can be used to generate cities, states, zip codes, and latitude/longitude values that follow HIPAA guidelines for safe harbor. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates random host names, based on the English language. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Runs selected generators on specified key values in an HStore column in a PostgreSQL database. Privacy ranking: 5
--
--
--
Masks text columns. Parses the contents as HTML, and applies sub-generators to the specified path expressions. Privacy ranking: 5
--
--
--
Generates unique integer values. By default, the generated values are within the range of the column’s data type. You can also specify a range for the generated values. The source values must be within that range. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self
No
Yes if not consistent
Generates a random IP address-formatted string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Runs a generator on values that match a user-specified JSONPath. Privacy ranking: 5
--
--
--
Generates a random MAC address-formatted string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self
No
Yes if not consistent
Generates unique MongoDB objectId values. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates a random name string from a dictionary of first and last names. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Masks values in numeric columns. Adds or multiplies the original value by random noise. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self or other
No
No
Generates NULL values to fill the rows of the specified column. Data-free. Privacy ranking: 1
No
No
Yes
Generates unique numeric strings of the same length as the input. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Default generator. Does not perform any action on the source data. Privacy ranking: 6
No
No
No
Generates a random phone number that matches the country or region and format of the input phone number. Privacy ranking: 3
Yes - Self
No
No
Generates a random boolean value. Data-free. Privacy ranking: 1
No
No
Yes
Generates a random double number between the specified min and max. Data-free. Privacy ranking: 1
No
No
Yes
Generates a random hash string. Data-free. Privacy ranking: 1
No
No
Yes
Returns a random integer between the specified min and max. Data-free. Privacy ranking: 1
No
No
Yes
Generates random dates, times, and timestamps. Data-free. Privacy ranking: 1
No
No
Yes
Generates a random new UUID string. Data-free. Privacy ranking: 1
No
No
Yes
Uses regular expressions to parse strings. Replaces specified substrings with output from selected sub-generators. Privacy ranking: 5
--
--
--
Generates a column of unique integer values that start with a specified value and increment by 1. Privacy ranking: 3
No
Yes
No
Generates values of ISO 6346 compliant shipping container codes. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Generates a new valid Canadian Social Insurance Number. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self
No
Yes if not consistent
Generates a new valid United States Social Security Number. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Can apply other generators on specific StructFields within a StructType in Spark databases (Databricks and Amazon EMR). Privacy ranking: 5
--
--
--
Shifts timestamps by a random amount of a specific unit of time, within a set range. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self or other
No
No
Generates unique email addresses. Replaces the username with a randomly generated GUID, and masks the domain with a character scramble. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
A substitution cipher that preserves formatting but keeps the URL scheme and top-level domain intact. Privacy ranking: 3
No
No
No
Generates UUIDs on primary key columns. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Runs a selected generator on values that match a user-specified XPath. Privacy ranking: 5
--
--
--
Here are the details for the supported generators in Tonic Structural.
The table for each generator includes:
The privacy ranking
The generator ID to use in the Tonic API
Generates a random address-like string.
You can indicate which part of an address string that the column contains. For example, the column might contain only the street address or the city, or it might contain the full address.
To configure the generator:
From the Link To dropdown list, select the columns to link this column to. You can link columns that use the Address generator to mask one of the following address components:
City
City State
Country
Country Code
State
State Abbreviation
Zip Code
Latitude
Longitude
Note that when linked to another address column, a country or country code is always the United States.
From the address component dropdown list, select the address component that this column contains. The available options are:
Building Number
Cardinal Direction (North, South, East, West)
City
City Prefix (Examples: North, South, East, West, Port, New)
City Suffix (Examples: land, ville, furt, town)
City with State (Example: Spokane, Washington)
City with State Abbr (Example: Houston, TX)
Country (Examples: Spain, Canada)
Country Code (Uses the 2-character country code. Examples: ES, CA)
County
Direction (Examples: North, Northeast, Southwest, East)
Full Address
Latitude (Examples: 33.51, 41.32)
Longitude (Examples: -84.05, -74.21)
Ordinal Direction (Examples: Northeast, Southwest)
Secondary Address (Examples: Apt 123, Suite 530)
State (Examples: Alabama, Wisconsin)
State Abbr (Examples: AL, WI)
Street Address (Example: 123 Main Street)
Street Name (Examples: Broad, Elm)
Street Suffix (Examples: Way, Hill, Drive)
US Address
US Address with Country
Zip Code (Example: 12345)
Toggle the Consistency setting to indicate whether to make the column consistent. By default, the consistency is disabled.
If consistency is enabled, then by default, the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the Address generator is consistent with itself, then the same value in the source database is always mapped to the same destination value. For example, for a column that contains a state name, Alabama is always mapped to Illinois.
When the Address generator is consistent with another column, then the same value in the other column always results in the same destination value for the address column. For example, if the address column is consistent with a name column, then every instance of John Smith in the name column in the source database has the same address value in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
For the Address generator, Spark workspaces (Amazon EMR, Databricks, and self-managed Spark clusters) only support the following address parts:
Building Number
City
Country
Country Code
Full Address
Latitude
Longitude
State
State Abbr
Street Address
Street Name
Street Suffix
US Address
US Address with Country
Zip Code
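The two consistency modes described above for the Address generator can be sketched as follows. This is only an illustration under stated assumptions, not how Structural implements consistency: it assumes that a keyed hash selects the replacement value, and the names SECRET, STATES, and pick_state are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret and replacement pool; neither comes from Structural.
SECRET = b"example-secret"
STATES = ["Alabama", "Illinois", "Ohio", "Texas", "Wisconsin"]

def pick_state(key_value: str) -> str:
    """Deterministically map a key value to a replacement state name."""
    digest = hmac.new(SECRET, key_value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(STATES)
    return STATES[index]

rows = [
    {"name": "John Smith", "state": "Alabama"},
    {"name": "Jane Doe", "state": "Alabama"},
    {"name": "John Smith", "state": "Ohio"},
]

# Self-consistent: the key is the column's own source value.
self_consistent = [pick_state(row["state"]) for row in rows]

# Consistent with another column: the key is the value in the name column.
consistent_with_name = [pick_state(row["name"]) for row in rows]
```

In the self-consistent list, both Alabama rows receive the same replacement. In the list that is consistent with the name column, both John Smith rows receive the same replacement, regardless of their source state.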
Within a table, the AI Synthesizer uses the columns that are assigned the AI Synthesizer to train a model and generate the synthetic data.
It uses deep neural networks for high-fidelity data mimicking.
By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Structural web server container, set the environment setting TONIC_NN_GENERATOR_ENABLED to true. Go to Configuring environment settings.
The privacy ranking is 3.
For details, go to Using the AI Synthesizer.
The Algebraic generator identifies the algebraic relationship between three or more numeric values and generates new values to match. At least one of the values must be a non-integer.
If a relationship cannot be found, then the generator defaults to the Categorical generator.
This generator can be linked with other Algebraic generators.
To configure the generator, from the Link To dropdown list, select the columns to link this column to. You can select other columns that are assigned the Algebraic generator.
You must select at least three columns.
The column values must be numeric. At least one of the columns must contain a value other than an integer.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates unique alphanumeric strings of the same length as the input. For example, for the origin value ABC123, the output value is a six-character alphanumeric string such as D24N05.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
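The behavior described above (unique output values that keep the length of the input) can be sketched roughly as follows. This is not Structural's algorithm. It assumes an alphabet of uppercase letters and digits, based on the D24N05 example, and it assumes that uniqueness is enforced by retrying on collision. The function name unique_alphanumeric_keys is hypothetical.

```python
import secrets
import string

# Assumed output alphabet: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits

def unique_alphanumeric_keys(values):
    """Return one random same-length alphanumeric string per input, with no repeats.

    Assumes that the inputs are non-empty and that enough distinct strings
    of each length exist to cover the inputs of that length.
    """
    used = set()
    result = []
    for value in values:
        while True:
            candidate = "".join(secrets.choice(ALPHABET) for _ in value)
            if candidate not in used:
                used.add(candidate)
                result.append(candidate)
                break
    return result

masked = unique_alphanumeric_keys(["ABC123", "XYZ789", "Q1"])
```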
A version of the Character Scramble generator that can be used for array values.
This generator replaces letters with random other letters, and numbers with random other numbers. Punctuation and whitespace are preserved.
For example, for the following array value:
["ABC.123", 3, "last week"]
The output might be something like:
["KFR.860", 7, "sdrw mwoc"]
This generator securely masks letters and numbers. There is no way to recover the original data.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
A version of the JSON Mask generator that can be used for array values.
Runs a selected generator on values that match a user-specified JSONPath.
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell JSON field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the JSONPath expression to identify the value to apply the generator to. To populate a path expression, you can also click a value in the Cell JSON field. Matched JSON Values shows the result from the value in Cell JSON.
By default, the selected generator is applied to any value that matches the expression. To limit the types of values to apply the generator to, from the Type Filter, specify the applicable types. You can select Any, or you can select any combination of String, Number, and Null.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
A version of the Regex Mask generator that can be used for array values.
Uses regular expressions to parse strings and replace specified substrings with the output of specified generators. The parts of the string to replace are specified inside unnamed top-level capture groups.
To configure the generator:
To add a regular expression:
Click Add Regex. On the configuration panel, Cell Value shows a sample value from the source database. You can use the previous and next options to navigate through the values.
By default, Replace all matches is enabled. To only match the first occurrence of a pattern, toggle Replace all matches to the off position.
In the Pattern field, enter a regular expression. If the expression is valid, then Structural displays the capture groups for the expression.
For each capture group, to select and configure the generator to apply, click the selected generator. You cannot select another composite generator.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Regexes list:
To edit a regex, click the edit icon.
To remove a regex, click the delete icon.
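As a rough sketch of the idea, the following code replaces the text inside each capture group with the output of a sub-generator, and applies that to every string in an array. It is an illustration that uses the Python re module, not Structural's implementation. It assumes that the capture groups are not nested and that every group participates in each match. The names regex_mask and mask_digits are hypothetical.

```python
import re

def regex_mask(value, pattern, generators, replace_all=True):
    """Replace the text of each capture group with the output of its generator.

    generators[i] is applied to capture group i + 1. Text outside the
    capture groups is kept as-is.
    """
    regex = re.compile(pattern)

    def rebuild(match):
        pieces = []
        cursor = match.start()
        for number, generator in enumerate(generators, start=1):
            start, end = match.span(number)
            pieces.append(match.string[cursor:start])   # text before the group
            pieces.append(generator(match.group(number)))  # masked group text
            cursor = end
        pieces.append(match.string[cursor:match.end()])  # text after the last group
        return "".join(pieces)

    # count=0 replaces every match; count=1 replaces only the first match.
    return regex.sub(rebuild, value, count=0 if replace_all else 1)

def mask_digits(text):
    """Stand-in sub-generator: replace each character with 0."""
    return "0" * len(text)

values = ["id-123 id-456", "no match"]
masked = [regex_mask(v, r"id-(\d+)", [mask_digits]) for v in values]
first_only = regex_mask("id-123 id-456", r"id-(\d+)", [mask_digits], replace_all=False)
```

The replace_all argument mirrors the Replace all matches toggle: when it is off, only the first occurrence of the pattern in each value is masked.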
Generates unique alphanumeric strings based on any printable ASCII characters. The length of the source string is not preserved. You can choose to exclude lowercase letters from the generated values.
To configure the generator:
To exclude lowercase letters from the generated values, toggle Exclude Lowercase Alphabet to the on position.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a random company name-like string.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If consistency is enabled, then by default it is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the generator is consistent with itself, then a given source value is always mapped to the same destination value. For example, My Business is always mapped to New Business.
When the generator is consistent with another column, then a given source value in that other column always results in the same destination value for the company name column. For example, if the company name column is consistent with a name column, then every instance of John Smith in the name column in the source database has the same company name in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
The Categorical generator shuffles the existing values within a field while maintaining the overall frequency of the values. It disassociates the values from other pieces of data. Note that NULL is considered a separate value.
For example, a column contains the values Small, Medium, and Large. Small appears 3 times, Medium appears 4 times, and Large appears 5 times. In the output data, each value still appears the same number of times, but the values are shuffled to different rows.
This generator is optimized for categories with fewer than 10,000 unique values. If your underlying data has more unique values (for example, your field is populated by freeform text entry), we recommend that you use the Character Scramble or Custom Categorical generator.
To configure the generator:
From the Link To dropdown, select the columns to link to the current column. You can select from other columns that use the Categorical generator.
Toggle the Differential Privacy setting to indicate whether to make the output data differentially private. By default, differential privacy is disabled.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
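The Small, Medium, and Large example for the Categorical generator can be sketched as a shuffle that keeps the value counts unchanged. This is a simplified illustration only. It does not model linking or differential privacy, and the function name categorical_shuffle is hypothetical.

```python
import random
from collections import Counter

def categorical_shuffle(values, rng):
    """Return the same values in a random row order, so frequencies are unchanged."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

source = ["Small"] * 3 + ["Medium"] * 4 + ["Large"] * 5
output = categorical_shuffle(source, random.Random())

# The counts per value are identical before and after the shuffle.
source_counts = Counter(source)
output_counts = Counter(output)
```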
This generator replaces letters with random other letters and numbers with random other numbers. Punctuation, whitespace, and mathematical symbols are preserved.
For example, for the following input string:
ABC.123 123-456-789 Go!
The output would be something like:
PRX.804 296-915-378 Ab!
This generator securely masks letters and numbers. There is no way to recover the original data.
Character Scramble is similar to Character Substitution, with a couple of key differences. While you can enable consistency for the entire value, Character Scramble does not always replace the same source character with the same destination character. Because there is no guarantee of unique values, you cannot use Character Scramble on unique columns. Character Substitution, however, does always map the same source character to the same destination character. Character Substitution is always consistent, which makes it less secure than Character Scramble. You can use Character Substitution on unique columns.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
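A minimal sketch of the Character Scramble behavior, assuming ASCII input: each letter is replaced with a different random letter, each digit with a different random digit, and everything else is kept. Preserving the case of each letter is an assumption drawn from the example above, not a documented guarantee. The function name character_scramble is hypothetical, and this is not Structural's implementation.

```python
import random
import string

POOLS = (string.ascii_uppercase, string.ascii_lowercase, string.digits)

def character_scramble(text, rng):
    """Replace letters and digits with other random ones; keep all other characters."""
    out = []
    for ch in text:
        for pool in POOLS:
            if ch in pool:
                # Choose a replacement from the same pool, excluding the original.
                out.append(rng.choice([c for c in pool if c != ch]))
                break
        else:
            # Punctuation, whitespace, and symbols are preserved.
            out.append(ch)
    return "".join(out)

original = "ABC.123 123-456-789 Go!"
scrambled = character_scramble(original, random.Random())
```

Because each character is drawn independently, two different inputs can produce the same output, which is consistent with the restriction on unique columns.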
Performs a random character replacement that preserves formatting (spaces, capitalization, and punctuation).
Characters are replaced with other characters from within the same Unicode Block. A given source character is always mapped to the same destination character. For example, M might always map to V.
For example, for the following input string:
Miami Store #162
The output would be something like:
Vgkjg Gmlvf #681
Note that for a numeric column, when a generated number starts with a 0, the starting 0 is removed. This could result in matching output values in different columns. For example, one column is changed to 113 and the other to 0113, which also becomes 113.
Character Substitution is similar to Character Scramble, with a couple of key differences. Because Character Substitution always maps the same source character to the same destination character, it is always consistent. It also can be used for unique columns. In Character Scramble, the character mapping is random, which makes Character Scramble slightly more secure. However, Character Scramble cannot be used for unique columns.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
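The fixed character mapping behind Character Substitution can be illustrated with a small substitution table. This sketch assumes ASCII letters and digits only and builds the table from a seeded shuffle. Structural's actual mapping and its handling of Unicode blocks are not shown here, and the names build_substitution_table and character_substitution are hypothetical.

```python
import random
import string

def build_substitution_table(seed):
    """Build a one-to-one character mapping within each pool of characters."""
    rng = random.Random(seed)
    table = {}
    for pool in (string.ascii_uppercase, string.ascii_lowercase, string.digits):
        shuffled = list(pool)
        rng.shuffle(shuffled)
        table.update(zip(pool, shuffled))
    return table

# Hypothetical seed; a fixed seed keeps the mapping stable between runs.
TABLE = build_substitution_table(seed=42)

def character_substitution(text):
    """Map each character through the table; unmapped characters are kept."""
    return "".join(TABLE.get(ch, ch) for ch in text)

first = character_substitution("Miami Store #162")
second = character_substitution("Miami Store #162")
```

Because the table is a one-to-one mapping, distinct inputs always produce distinct outputs, which is why this approach can work on unique columns. In this sketch, a shuffle can leave a character mapped to itself.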
This generator is deprecated. Use the Business Name generator instead.
Generates a random company name-like string.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If consistency is enabled, then by default it is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the generator is consistent with itself, then a given source value is always mapped to the same destination value. For example, My Company is always mapped to New Company.
When the generator is consistent with another column, then a given source value in that other column always results in the same destination value for the company name column. For example, if the company name column is consistent with a name column, then every instance of John Smith in the name column in the source database has the same company name in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
Applies different generators to the value conditionally based on any value in the table.
For example, a Users table contains Name, Username, and Role columns. For the Username column, you can use a conditional generator to indicate that if the value of Role is something other than Test, then use the Character Scramble generator for the Username value. For Test users, the username is not masked.
The generator consists of a list of options. Each option includes the required conditions and the generator to use if those conditions are met.
The generator always contains a Default option. The Default option is used if the value does not meet any of the conditions. To configure the Default option:
From the Default dropdown list, select the generator to use by default.
Configure the selected generator.
To add a condition option:
Click + Conditional Generator.
To add a condition:
Click + Condition.
From the column list, select the column for which to check the value.
Select the comparison type.
Enter the column value to check for.
To remove a condition, click the delete icon for the condition.
From the Generator dropdown list, select the generator to run on the current column if the conditions are met. You cannot select another composite generator.
Choose the configuration options for the selected generator.
To view details for and edit a condition option, click the expand icon for that option.
To remove a condition option, click the delete icon for the option.
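The Users table example can be sketched as a list of condition options plus a default. This is an illustration only. It assumes that the first option whose condition matches is the one that is used, and it uses a simple redact function as a stand-in for a masking generator such as Character Scramble. All of the names in the code are hypothetical.

```python
def passthrough(value):
    """Leave the value unchanged."""
    return value

def redact(value):
    """Stand-in for a masking generator such as Character Scramble."""
    return "x" * len(value)

# Each option pairs a condition on the row with the generator to run.
OPTIONS = [
    (lambda row: row["Role"] != "Test", redact),
]

# Used when no condition option matches.
DEFAULT = passthrough

def conditional_generator(row, column):
    """Apply the generator of the first matching option, or the default."""
    for condition, generator in OPTIONS:
        if condition(row):
            return generator(row[column])
    return DEFAULT(row[column])

users = [
    {"Name": "Ann", "Username": "ann01", "Role": "Admin"},
    {"Name": "QA", "Username": "qa_bot", "Role": "Test"},
]
masked = [conditional_generator(user, "Username") for user in users]
```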
Uses a single value to mask all of the values in the column.
For example, you can replace every value in a string column with the value String1. Or you can replace every value in a numeric column with the value 12345.
To configure the generator, in the Constant Value field, provide the value to use.
The value must be compatible with the field type. For example, you cannot provide a string value for an integer column.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a continuous distribution to fit the underlying data.
This generator can be linked to other Continuous generators to create multivariate distributions and can be partitioned by other columns.
To configure the generator:
From the Link To drop-down list, select the other Continuous generator columns to link to. The linking creates a multivariate distribution.
From the Partition By drop-down list, select one or more columns to use to partition the data. The selected columns must have the generator set to either Passthrough or Categorical. For more information about partitioning and how it works, go to Partitioning a column.
Toggle the Differential Privacy setting to indicate whether to make the output data differentially private. By default, the generator is not differentially private.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Links columns in two tables. This column value is the sum of the values in a column in another table.
This generator does not provide a preview. The sums are not computed until the other table is generated.
For example, a Customers table contains a Total_Sales column. The Transactions table uses a foreign key Customer_ID column to identify the customer who made the transaction, and an Amount column that contains the amount of the sale. The Customer_ID value in the Transactions table is a value from the ID primary key column in the Customers table.
You assign the Cross Table Sum generator to the Total_Sales column. In the generator configuration, you indicate that the value is the sum of the Amount values for the Customer_ID value that matches the primary key ID value for the current row.
For the Customers row for ID 123, the Total_Sales column contains the sum of the Amount column for Transactions rows where Customer_ID is 123.
To configure the generator:
From the Foreign Table dropdown list, select the table that contains the column for which to sum the values.
From the Foreign Key dropdown list, select the foreign key. The foreign key identifies the row from the current table that is referred to in the foreign table.
From the Sum Over dropdown list, select the column for which to sum the values.
From the Primary Key dropdown list, select the primary key for the current table.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
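The Customers and Transactions example can be sketched with in-memory rows. The four arguments after the row lists correspond to the Primary Key, Foreign Key, and Sum Over settings and to the column that receives the sum. This is a simplified illustration, not Structural's implementation, and the function name cross_table_sum and the sample amounts are hypothetical.

```python
from collections import defaultdict

customers = [
    {"ID": 123, "Total_Sales": None},
    {"ID": 456, "Total_Sales": None},
]
transactions = [
    {"Customer_ID": 123, "Amount": 10.0},
    {"Customer_ID": 123, "Amount": 5.5},
    {"Customer_ID": 456, "Amount": 7.0},
]

def cross_table_sum(primary_rows, primary_key, foreign_rows, foreign_key, sum_over, target):
    """Set target in each primary row to the sum of sum_over across matching foreign rows."""
    totals = defaultdict(float)
    for row in foreign_rows:
        totals[row[foreign_key]] += row[sum_over]
    # Rows with no matching foreign rows receive 0.0.
    return [{**row, target: totals[row[primary_key]]} for row in primary_rows]

result = cross_table_sum(
    customers, "ID", transactions, "Customer_ID", "Amount", "Total_Sales"
)
```

The sketch sums whatever Amount values it is given. In Structural, the sums are not computed until the other table is generated.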
This is a composite generator.
Masks text columns by parsing the values as rows whose columns are delimited by a specified character.
You can assign specific generators to specific indexes. You can also use the generator that is assigned to a specific index as the default. This applies the generator to every index that does not have an assigned generator.
The output value maintains the quotes around the index values.
For example, a column contains the following value:
"first","second","third"
You assign the Character Scramble generator to index 0 and assign Passthrough to index 2. You select index 0 as the index to use for the default generator.
In the output, the first and second values are masked by the Character Scramble generator. The third value is not masked. The output looks something like:
"wmcop", "xjorsl", "third"
To configure the generator:
In the Delimiter field, type the delimiter that is used as a separator for the value.
For example, for the value "first","second","third", the delimiter is a comma.
You can configure a generator for any or all of the indexes. To add a sub-generator for an index:
Under Sub-Generators, click Add Generator. On the add generator dialog, the Cell CSV field contains a sample value from the source data. You can use the navigation icons to page through the values.
In the CSV Index field, type the index to assign a generator to. The index numbers start with 0. You cannot use an index that already has an assigned generator. Matched CSV values shows the value at that index for the current sample column value.
Under Generator Configuration, from the Select a Generator dropdown list, select the generator to use for the selected index. You cannot select another composite generator. To remove the selection, click the delete icon.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another index, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
After you configure a generator for at least one index, the Default Link dropdown list is displayed. From the Default Link dropdown list, select the index to use to determine how to mask values for indexes that do not have an assigned generator. For example, you assign the Character Scramble generator to index 2. If you set Default Link to 2, then all indexes that do not have an assigned generator use the Character Scramble generator.
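The per-index behavior described above can be illustrated with a minimal Python sketch. This is not Structural's implementation. The names `mask_delimited`, `character_scramble`, and `passthrough` are hypothetical, and the parsing assumes simple quoted values that do not contain the delimiter.

```python
import random
import string

def character_scramble(value, rng):
    """Replace letters with random lowercase letters and digits with random digits."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)

def passthrough(value, rng):
    """Return the value unchanged."""
    return value

def mask_delimited(cell, delimiter, sub_generators, default_index, rng=None):
    """Apply the generator assigned to each index; other indexes use the default."""
    rng = rng if rng is not None else random.Random()
    default = sub_generators[default_index]
    masked = []
    for index, part in enumerate(cell.split(delimiter)):
        generator = sub_generators.get(index, default)
        quoted = len(part) >= 2 and part[0] == '"' and part[-1] == '"'
        inner = part[1:-1] if quoted else part
        result = generator(inner, rng)
        # Keep the quotes around the value, as the generator description states.
        masked.append(f'"{result}"' if quoted else result)
    return delimiter.join(masked)

masked = mask_delimited(
    '"first","second","third"',
    ",",
    {0: character_scramble, 2: passthrough},
    default_index=0,
)
print(masked)
```

Index 1 has no assigned generator, so it falls back to the generator at the default index (index 0), and index 2 is passed through unchanged.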
A version of the Categorical generator that selects from values that you provide instead of shuffling the original values.
To configure the generator:
From the Link To dropdown list, select the columns to link this column to. You can only select other columns that use the Custom Categorical generator.
In the Custom Categories text area, enter the list of values that the generator can choose from.
Put each value on a separate line.
To add a NULL value to the list, use the keyword {NULL}.
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given source value in that column always results in the same value for the current column in the destination database. For example, a department column is consistent with a username column. For each instance of User1 in the source database, the value in the department column is the same.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
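As a rough illustration of the category list and of self-consistency, the following Python sketch parses a Custom Categories value and picks an output. The hash-based selection is only an assumption used to show that the same source value always maps to the same category. The source does not describe how Structural implements consistency, and the function names are hypothetical.

```python
import hashlib
import random

NULL_KEYWORD = "{NULL}"

def parse_categories(text):
    """Read one category per line; the {NULL} keyword becomes None."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            values.append(None if line == NULL_KEYWORD else line)
    return values

def pick_category(categories, source_value=None, consistent=False, rng=None):
    """Pick a category at random, or deterministically from the source value."""
    if consistent:
        # Assumption: derive a stable index from a hash of the source value.
        digest = hashlib.sha256(str(source_value).encode("utf-8")).digest()
        return categories[int.from_bytes(digest[:8], "big") % len(categories)]
    rng = rng if rng is not None else random.Random()
    return rng.choice(categories)

categories = parse_categories("Sales\nEngineering\n{NULL}")
first = pick_category(categories, "User1", consistent=True)
second = pick_category(categories, "User1", consistent=True)
print(first == second)
```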
Truncates a date value or a timestamp to a specific part.
For a date or a timestamp, you can truncate to the year, month, or day.
For a timestamp, you can also truncate to the hour, minute, or second.
To configure the generator:
From the dropdown list, select the part of the date or timestamp to truncate to. For both date and timestamp values, you can truncate to the year, month, or day. When you select one of these options, the time portion of a timestamp is set to 00:00:00. For the date, the values below the selected truncation value are set to 01. For example, when you truncate to month, the day value is set to 01, and the timestamp is set to 00:00:00. For a timestamp value, you also can truncate to the hour, minute, or second. The date values remain the same as the original data. The time values below the selected truncation value are set to 00. For example, when you truncate to minute, the seconds value is set to 00.
Toggle the Birth Date option. When you enable Birth Date, the generator shifts dates that are more than 90 years before the generation date to the date exactly 90 years before the generation date. For example, a generation occurs on January 1, 2023. Any date that occurs before January 1, 1933 is changed to January 1, 1933.
This is mostly intended for birthdate values, to group birthdates for everyone who is older than 89 into a single year. This is used to comply with HIPAA Safe Harbor.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Here are examples of date and time values and how the selected truncation affects the output:
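The truncation and Birth Date rules can also be expressed as a minimal Python sketch. The function names `truncate` and `cap_birth_date` are hypothetical, and the sketch only mirrors the behavior described in the configuration steps. It is not Structural's code.

```python
from datetime import datetime

PARTS = ["year", "month", "day", "hour", "minute", "second"]

def truncate(value, part):
    """Keep fields down to `part`; reset lower date fields to 01 and time fields to 00."""
    keep = PARTS.index(part) + 1
    fields = [value.year, value.month, value.day,
              value.hour, value.minute, value.second]
    lowest = [1, 1, 1, 0, 0, 0]
    return datetime(*(fields[:keep] + lowest[keep:]))

def cap_birth_date(value, generation_date):
    """Move dates more than 90 years before the generation date up to that cutoff.

    Note: replace() raises ValueError if generation_date is February 29 and the
    cutoff year is not a leap year; this sketch does not handle that case.
    """
    cutoff = generation_date.replace(year=generation_date.year - 90)
    return max(value, cutoff)

ts = datetime(2021, 7, 19, 14, 35, 52)
print(truncate(ts, "month"))   # day reset to 01, time reset to 00:00:00
print(truncate(ts, "minute"))  # seconds reset to 00
print(cap_birth_date(datetime(1920, 5, 4), datetime(2023, 1, 1)))
```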
This generator scrambles the characters in an email address. It preserves formatting and keeps the @ and . characters.
For example, for the following input value:
johndoe@company.com
The output value would be something like:
brwomse@xorwxlt.slt
By default, the generator scrambles the domain. You can configure the generator to not mask specific domains. You can also specify a domain to use for all of the output email addresses.
For example, if you configure the generator to not scramble the domain company.com, then the output for johndoe@company.com would look something like:
brwomse@company.com
This generator securely masks letters and numbers. There is no way to recover the original data.
To configure the generator:
In the Email Domain field, enter a domain to use for all of the output values.
For example, use @mycompany.com for all of the generated values. The generator scrambles the content before the @.
In the Excluded Email Domains field, enter a comma-separated list of domains for which email addresses are not masked in the output values. This allows you, for example, to maintain internal or testing email addresses that are not considered sensitive.
Toggle the Replace invalid emails setting to indicate whether to replace an invalid email address with a generated valid email address. By default, invalid email addresses are not replaced. In the replacement values, the username is generated. If you specify a value for Email Domain, then the email addresses use that domain. Otherwise, the domain is generated.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
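The following Python sketch shows one way the Email Domain and Excluded Email Domains options could interact. It follows the earlier example, where an excluded domain is kept and the username is still scrambled. The function names, and the choice to let Email Domain take precedence over the excluded list, are assumptions and are not confirmed by the source.

```python
import random
import string

def scramble(text, rng):
    """Swap each letter or digit for a random one of the same kind; keep the rest."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # characters such as '@' and '.' are preserved
    return "".join(out)

def mask_email(email, email_domain=None, excluded_domains=(), rng=None):
    """Scramble the username; scramble the domain unless it is fixed or excluded."""
    rng = rng if rng is not None else random.Random()
    user, _, domain = email.partition("@")
    if email_domain:
        # Assumption: a configured Email Domain replaces the domain for every value.
        domain = email_domain.lstrip("@")
    elif domain.lower() not in {d.lower() for d in excluded_domains}:
        domain = scramble(domain, rng)
    return scramble(user, rng) + "@" + domain

print(mask_email("johndoe@company.com"))
print(mask_email("johndoe@company.com", excluded_domains=["company.com"]))
print(mask_email("johndoe@company.com", email_domain="@mycompany.com"))
```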
Generates timestamps fitting an event distribution. The source timestamp must include a date. It cannot be a time-only value.
Link columns to create a sequence of events across multiple columns. This generator can be partitioned by other columns.
To configure the generator:
From the Link To dropdown list, select the other Event Timestamps generator columns to link this column to. Linking creates a sequence across multiple columns.
From the Partition dropdown list, select one or more columns to use to partition the data. The selected columns must have their generator set to either Passthrough or Categorical. For more information about partitioning and how it works, go to Partitioning a column.
The Options list displays the current column and linked columns. Use the Up and Down buttons to configure the column sequence.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This generator scrambles characters while preserving formatting and keeping the file extension intact.
For example, for the following input value:
DataSummary1.pdf
The output value would look something like:
RsnoPwcsrtv5.pdf
This generator securely masks letters and numbers. There is no way to recover the original data.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This generator replaces all instances of the find string with the replace string.
For example, you can indicate to replace all instances of abc with 123.
To configure the generator:
In the Find field, type the string to look for in the source column value.
To use a regular expression to identify the source value, check the Use Regex checkbox.
If you use a regular expression, use backslash (\) as the escape character.
In the Replace field, type the string to replace the matching string with.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
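The difference between a literal find string and a regular expression can be sketched in Python as follows. The function name `find_and_replace` is hypothetical, and Python's `re` syntax is used only for illustration. The regular expression dialect that Structural uses might differ.

```python
import re

def find_and_replace(value, find, replace, use_regex=False):
    """Replace every match of `find` in `value` with `replace`."""
    if use_regex:
        return re.sub(find, replace, value)
    return value.replace(find, replace)

# Literal match: every instance of abc becomes 123.
print(find_and_replace("abc-abc", "abc", "123"))

# Regex match: the backslash escapes the dot so that it matches a literal period.
print(find_and_replace("a.c abc", r"a\.c", "123", use_regex=True))

# Without the escape, the dot matches any character.
print(find_and_replace("a.c abc", "a.c", "123", use_regex=True))
```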
The FNR generator transforms Norwegian national identity numbers. In Norwegian, the term for national identity number abbreviates to FNR.
The first six digits of an FNR reflect the person's birthdate. You can choose to preserve the birthdates from the source values in the destination values. If you do not preserve the source values, the destination values are still within the same date range as the source values.
Another digit in an FNR indicates whether the person is male or female. You can specify whether to preserve in the generated value the gender indicated in the source value.
The last digits in an FNR are a checksum value. The last digits in the destination value are not a checksum - the values are random.
To configure the generator:
To preserve the gender from the source value in the destination value, toggle Preserve Gender to the on position.
To preserve the birthdate from the source value in the destination value, toggle Preserve Birthdate to the on position.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given value for that other column in the source database results in the same value in the destination database. For example, if the FNR column is consistent with a Name column, then every instance of John Smith in the source database results in the same FNR in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This generator can be used to mask columns of latitude and longitude.
The Geo generator divides the globe into grids that are approximately 4.9 x 4.9 km. It then counts the number of points within each grid.
During data generation, each (latitude, longitude) pair is mapped to its grid.
If the grid contains a sufficient number of points to preserve privacy, then the generator returns a randomly chosen point in that grid.
If the grid does not contain enough points to preserve privacy, then the generator returns a random coordinate from the nearest grid that contains enough points.
To configure the generator:
From the Link To dropdown list, select the column to link to this one. You typically assign the Geo generator to both the latitude and longitude column, then link those columns.
From the value type dropdown, select whether this column contains a latitude value or a longitude value.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This generator can be used to generate cities, states, and zip codes that follow HIPAA guidelines for safe harbor.
Zip Codes
How the HIPAA Address generator handles zip codes is based on whether the Replace zeros in truncated Zip Code toggle in the generator configuration is off or on.
By default, the setting is off. In this case, the last two digits of the zip code in the column are replaced with zeros, unless the zip code is a low population area as designated by the current census. For a low population area, all of the digits in the zip code are replaced with zeros.
If the setting is on, then the generator selects a real zip code that starts with the same three digits as the original zip code. For a low population area, if a state is linked, then the generator selects a random zip code from within that state. Otherwise the generator selects a random zip code from the United States.
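The default behavior (setting off) can be sketched in a few lines of Python. The function name and the `low_population_prefixes` parameter are hypothetical, and the prefix in the example is a placeholder, not real census data. The sketch assumes 5-digit zip codes.

```python
def truncate_zip(zip_code, low_population_prefixes):
    """Zero out the last two digits, or every digit for a low population prefix."""
    prefix = zip_code[:3]
    if prefix in low_population_prefixes:
        return "00000"
    return prefix + "00"

# "999" is a placeholder for a low population 3-digit prefix.
placeholder_prefixes = {"999"}
print(truncate_zip("30305", placeholder_prefixes))
print(truncate_zip("99901", placeholder_prefixes))
```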
Cities
When a zip code column is not linked, a random city is chosen in the United States. When a zip code is already added to the link, a city is chosen at random that has at least some overlap with the zip code.
If the original zip code is designated as a low population area and a State column is linked, then a random city is chosen within the state. If no State column is linked, then a random city within the United States is chosen.
For example, if the original city and zip code were (Atlanta, 30305), the zip code would be replaced with 30300. Many cities contain zip codes that begin with 303, such as Atlanta, Decatur, Chamblee, Hapeville, Dunwoody, and College Park. One of these cities is chosen at random, so the final value might be (Chamblee, 30300).
States
HIPAA guidelines allow for information at the state level to be kept. Therefore, these values are passed through.
Latitude and longitude (GPS) coordinates
GPS coordinates are randomly generated based on the linked HIPAA address components, in the following order of precedence:
If a zip code is linked, and its 3-digit zip code prefix is not designated a low population area, then a random point within the same 3-digit zip code prefix is generated. If the prefix is designated a low population area, then the linked state is used.
If a state is available and a zip code and city are not, or the zip code or city are in a 3-digit zip code prefix that is designated a low population area, then a random GPS coordinate is generated somewhere within the state.
If no zip code, city, or state is linked, or one or more of them were provided, but there was a problem generating a random GPS coordinate within the linked areas, then a GPS coordinate is generated at a random location within the United States.
Note: If the city component of the HIPAA address is linked with latitude and/or longitude, the GPS coordinate components are randomly generated independently of the city.
Other address parts
All other address parts are generated randomly and hence their value is not influenced at all by the underlying value in the column.
To configure the generator:
From the Link To dropdown list, select the other columns to link to. You can only select columns that are also assigned the HIPAA Address generator.
From the address part dropdown list, select the type of address value that is in the column.
Toggle the Replace zeros in truncated Zip Code setting to indicate how to generate zip codes. If the setting is off, then the last two digits are replaced with zeros. For low population areas, the entire zip code is populated with zeros. If the setting is on, then a real zip code is selected that starts with the first three digits of the original zip code. For low population areas, if a state is linked, a random zip code from the state is used. Otherwise, a random zip code from the United States is used.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
For the HIPAA Address generator, Spark workspaces (Amazon EMR, Databricks, and self-managed Spark clusters) only support the following address parts:
City
City with State
City with State Abbr
State
State Abbr
US Address
US Address with Country
Zip Code
The Address generator provides support for additional address parts in Spark workspaces.
Generates random host names, based on the English language.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from Consistent to, select the column.
When the generator is consistent with itself, then a given value in the source database is mapped to the same value in the destination database. For example, Host123 in the source database always produces MyHostABC in the destination database.
When the generator is consistent with another column, then a given source value in the other column results in the same host name value in the destination database. For example, a host name column is consistent with a department column. Every instance of Sales in the source data is given the same host name in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
Runs selected generators on specified key values in an HStore column in a PostgreSQL database. HStore columns contain a set of key-value pairs.
To configure the generator:
To assign a generator to a key:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell HStore field contains a sample value from the source database. You can use the previous and next icons to page through different values.
Under Enter a key, enter the name of a key from the column value.
For example, for the column value:
"pages"=>"446", "title"=>"The Iliad", "category"=>"mythology"
To apply a generator to the title, you would enter title as the key.
Matched HStore Values shows the result from the value in Cell HStore.
From the Generator Configuration dropdown list, select the generator to apply to the key value. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another key, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
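To illustrate how a sub-generator applies to a single key, the following Python sketch rewrites the value for one key in an HStore-formatted string. The regular expression assumes that every key and value is double-quoted and not NULL. The function name `mask_hstore` and the placeholder sub-generator are hypothetical.

```python
import re

# Matches "key"=>"value" pairs where both sides are double-quoted.
PAIR = re.compile(r'"((?:[^"\\]|\\.)*)"\s*=>\s*"((?:[^"\\]|\\.)*)"')

def mask_hstore(cell, sub_generators):
    """Apply the assigned sub-generator to each matching key; leave other keys alone."""
    def replace_pair(match):
        key, value = match.group(1), match.group(2)
        generator = sub_generators.get(key)
        if generator is not None:
            value = generator(value)
        return f'"{key}"=>"{value}"'
    return PAIR.sub(replace_pair, cell)

cell = '"pages"=>"446", "title"=>"The Iliad", "category"=>"mythology"'
# Placeholder sub-generator: replace every character with X.
masked = mask_hstore(cell, {"title": lambda v: "X" * len(v)})
print(masked)
```

Only the title value changes. The pages and category values are left as they are because no generator is assigned to those keys.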
This is a composite generator.
Masks text columns by parsing the contents as HTML, and applying sub-generators to specified path expressions.
If applying a sub-generator fails because of an error, the generator selected as the fallback generator is applied instead.
Path expressions are defined using the XPath syntax.
For example, for the following HTML:
To get the value of h1, the expression is //h1/text().
To get the value of the first list item, the expression is //ul/li[1]/text().
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell HTML field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. Matched HTML Values shows the result from the value in Cell HTML.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
From the Fallback Generator dropdown list, select the generator to use if the assigned generator for a path expression fails. The options are:
Generates unique integer values. By default, the generated values are within the range of the column’s data type.
You can also specify a range for the generated values. The source values must be within that range.
This generator cannot be used to transform negative numbers.
To configure the generator:
In the Minimum field, enter the minimum value to use for an output value. The minimum value cannot be larger than any of the values in the source data.
In the Maximum field, enter the maximum value to use for an output value. The maximum value cannot be smaller than any of the values in the source data.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a random IP address formatted string.
To configure the generator:
In the Percent IPv4 field, type the percentage of output values that are IPv4 addresses.
For example, if you set this to 60, then 60% of the generated IP addresses are IPv4 addresses, and 40% of the generated IP addresses are IPv6 addresses.
If you set this to 100, then all of the generated IP addresses are IPv4 addresses.
If you set this to 0, then all of the generated IP addresses are IPv6 addresses.
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given source value in that column always results in the same IP address value in the destination database. For example, an IP address column is consistent with a username column. For each instance of User1 in the source database, the value in the IP address column is the same.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
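The effect of the Percent IPv4 setting can be sketched with Python's standard `ipaddress` module. The function name `random_ip` is hypothetical, and the sketch draws addresses uniformly from the whole address space, which is an assumption about the output distribution.

```python
import ipaddress
import random

def random_ip(percent_ipv4, rng=None):
    """Return an IPv4 address with probability percent_ipv4 / 100, else IPv6."""
    rng = rng if rng is not None else random.Random()
    # rng.random() is in [0, 1), so 100 always yields IPv4 and 0 always yields IPv6.
    if rng.random() * 100 < percent_ipv4:
        return str(ipaddress.IPv4Address(rng.getrandbits(32)))
    return str(ipaddress.IPv6Address(rng.getrandbits(128)))

print(random_ip(60))
print(random_ip(100))
print(random_ip(0))
```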
This is a composite generator.
Runs a selected generator on values that match a user-specified JSONPath.
If an error occurs, the selected fallback generator is used for the entirety of the JSON value.
Sub-generators are applied sequentially, from the sub-generator at the top of the list to the sub-generator at the bottom of the list.
If multiple JSONPath expressions point to the same key, the most recently added generator takes priority.
JSON paths can also contain regular expressions and comparison logic, which allows the configured sub-generators to be applied only when there are properties that satisfy the query.
For example, a column contains this JSON:
[ { file_name: "foo.txt", b: 10 }, ... ]
The following JSON path only applies to array elements that contain a file_name key for which the value ends in .txt:
$.[?(@.file_name =~ /^.*.txt$/)]
A JSON path can also be used to point to a key name recursively. For example, a column contains this JSON:
The following JSON path applies to all properties for which the key is first_name:
$..first_name
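The recursive match can be pictured with a short Python sketch that walks a parsed JSON value and applies a generator wherever the key matches. This hand-written traversal only imitates what a path such as $..first_name selects. It is not a JSONPath engine, and the document and the function name `mask_key_recursively` are hypothetical.

```python
def mask_key_recursively(node, key, generator):
    """Apply `generator` to every value whose property name equals `key`, at any depth."""
    if isinstance(node, dict):
        return {
            k: generator(v) if k == key else mask_key_recursively(v, key, generator)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [mask_key_recursively(item, key, generator) for item in node]
    return node

# Hypothetical document, for illustration only.
doc = {
    "first_name": "Ann",
    "spouse": {"first_name": "Bob", "age": 40},
    "children": [{"first_name": "Cy"}],
}
masked = mask_key_recursively(doc, "first_name", lambda v: "REDACTED")
print(masked)
```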
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell JSON field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. To create a path expression, you can also click the value in Cell JSON that you want the expression to point to. Matched JSON Values shows the result from the value in Cell JSON.
By default, the selected generator is applied to any value that matches the expression. To limit the types of values to apply the generator to, from the Type Filter, specify the applicable types. You can select Any, or you can select any combination of String, Number, Boolean, and Null.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
From the Fallback Generator dropdown list, select the generator to use if the assigned generator for a path expression fails. The options are:
Generates a random MAC address formatted string.
To configure the generator:
In the Bytes Preserved field, enter the number of bytes to preserve in the generated address.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates unique object identifiers.
Can be assigned to text columns that contain MongoDB ObjectId values. The column value must be 12 bytes long.
To configure the generator:
A MongoDB ObjectId consists of an epoch timestamp, a random value, and an incremented counter. To only change the random value portion of the identifier, but keep the timestamp and counter portions, toggle Preserve Timestamp and Incremental Counter to the on position.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a random name string from a dictionary of first and last names.
You specify the name information that is contained in the column. A column might only contain a first name or last name, or might contain a full name. A full name might be first name first or last name first.
For example, a Name column contains a full name in the format Last, First. For the input value Smith, John, the output value would be something like Jones, Mary.
To configure the generator:
From the name format dropdown list, select the type of name value that the column contains:
First. This also is commonly used for standalone middle name fields.
Last
First Last
First Middle Last
First Middle Initial Last
Last, First
Last, First Middle
Middle Initial
Toggle the Preserve Capitalization setting to indicate whether to preserve the capitalization of the column value. By default, the capitalization is not preserved.
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Masks values in numeric columns. Adds or multiplies the original value by random noise.
The additive noise generator draws noise from an interval around 0 that is scaled to the magnitude of the original value. For example, the default scale is 10% of the underlying value. The larger the value, the larger the amount of noise that is added.
The multiplicative noise generator multiplies the original value by a random scaling factor that falls within a specified range.
To configure the generator:
To use the additive noise generator:
From the dropdown list, choose Additive.
In the Relative noise scale field, type the percentage of the underlying value to scale the noise to. The default value is 10.
Tonic samples the additive noise from the range [-(scale / 100) * |value|, (scale / 100) * |value|), where scale is the noise scale, and value is the original data value.
The lower value of the range is inclusive, and the upper value of the range is exclusive.
For example, for the default noise scale of 10, and a data value of 20, the additive noise range would be [-.1 * 20, .1 * 20). In other words, between -2 (inclusive) and 2 (exclusive).
To use the multiplicative noise generator:
From the dropdown list, choose Multiplicative.
In the Min field, type the minimum value for the scaling factor. The minimum value is inclusive. The default value is 0.5.
In the Max field, type the maximum value for the scaling factor. The maximum value is exclusive. The default value is 5.
Tonic multiplies the original value by a factor from the range [min, max), where min is the minimum scaling factor, and max is the maximum scaling factor.
For example, for the default values of 0.5 and 5, Tonic multiplies the original data value by a value from between 0.5 (inclusive) and 5 (exclusive).
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. If the generator is self-consistent, then a given value in the source database is masked in exactly the same way to produce the value in the destination database. If the generator is consistent with another column, then for a given value in that other column, the column that is assigned the Noise generator is always masked in exactly the same way in the destination database. For example, a field containing a salary value is assigned the Noise Generator and is consistent with the username field. For each instance of User1, the Noise Generator masks the salary value in exactly the same way.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
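The additive and multiplicative ranges can be checked with a minimal Python sketch. The function names are hypothetical, and drawing uniformly from each range is an assumption, because the source states only the bounds and not the distribution.

```python
import random

def additive_noise(value, scale=10, rng=None):
    """Add noise drawn from [-(scale/100)*|value|, (scale/100)*|value|)."""
    rng = rng if rng is not None else random.Random()
    bound = (scale / 100) * abs(value)
    # rng.random() is in [0, 1), so the noise is in [-bound, bound),
    # up to floating-point rounding.
    return value + (-bound + 2 * bound * rng.random())

def multiplicative_noise(value, min_factor=0.5, max_factor=5, rng=None):
    """Multiply the value by a factor drawn from [min_factor, max_factor)."""
    rng = rng if rng is not None else random.Random()
    factor = min_factor + (max_factor - min_factor) * rng.random()
    return value * factor

# With the defaults and a value of 20, additive output stays within 18 to 22,
# and multiplicative output stays within 10 to 100.
print(additive_noise(20))
print(multiplicative_noise(20))
```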
Generates NULL values to fill the rows of the specified column.
The Null generator has no configuration options.
Generates unique numeric strings of the same length as the input value.
For example, for the input value 123456, the output value would be something like 832957.
You can apply this generator only to columns that contain numeric strings.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Passthrough is the default option.
It passes through the value from the source database to the destination database without masking it.
Passthrough has no configuration options.
Generates a random phone number that matches the country or region of the input phone number while maintaining the format. For example, (123) 456-7890 or 123-456-7890.
If the input is not a valid phone number, the generator randomly replaces numeric characters. You can also replace invalid numbers with valid numbers.
By default, the numbers are United States phone numbers. Generated numbers pass Google's libphonenumber verification if the input is a valid phone number or if you replace invalid numbers.
To configure the generator:
Toggle the Replace invalid numbers setting to indicate whether to replace invalid input values with a valid output value. By default, the generator does not replace invalid values. It randomly replaces numeric characters.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, consistency is disabled.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a random boolean value.
To configure the generator, in the Percent True field, enter the percentage of values to set to True in the output.
For example, if you set this to 60, then 60% of the output values are True, and 40% of the output values are False.
If you set this to 100, then all of the output values are True.
If you set this to 0, then all of the output values are False.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
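As a rough illustration of how the Percent True setting behaves, the following Python sketch draws boolean values with a configurable share of True results. The function name random_boolean and the rng parameter are hypothetical and are not part of Structural. The sketch only mirrors the documented behavior at 0, 60, and 100.

```python
import random

def random_boolean(percent_true: float, rng: random.Random) -> bool:
    """Return True with probability percent_true / 100 (illustrative only)."""
    if not 0 <= percent_true <= 100:
        raise ValueError("percent_true must be between 0 and 100")
    # rng.random() is in [0.0, 1.0), so 100 always yields True
    # and 0 always yields False.
    return rng.random() * 100 < percent_true

rng = random.Random(0)
sample = [random_boolean(60, rng) for _ in range(10_000)]
share_true = sum(sample) / len(sample)  # close to 0.60 for a large sample
```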
Generates a random double number between the specified minimum (inclusive) and maximum (exclusive).
To configure the generator:
In the Minimum field, type the minimum value to use in the output values. The minimum value is inclusive. The output values can be that value or higher.
In the Maximum field, type the maximum value to use in the output values. The maximum value is exclusive. The output values are lower than that value.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a random hash string.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Returns a random integer between the specified minimum (inclusive) and maximum (exclusive).
For example, for a column that contains a percentage value, you can set the minimum to 0 and the maximum to 101. Because the maximum is exclusive, the output values range from 0 through 100.
To configure the generator:
In the Minimum field, type the minimum value to use in the output values. The minimum value is inclusive. The output values can be that value or higher.
In the Maximum field, type the maximum value to use in the output values. The maximum value is exclusive. The output values are lower than that value.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
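The inclusive minimum and exclusive maximum form a half-open range. The following Python sketch shows the percentage example under that rule. The function name random_integer is hypothetical and is not part of Structural.

```python
import random

def random_integer(minimum: int, maximum: int, rng: random.Random) -> int:
    """Return an integer in [minimum, maximum) (illustrative only)."""
    # randrange excludes the stop value, which matches an exclusive maximum.
    return rng.randrange(minimum, maximum)

rng = random.Random(0)
# Minimum 0 and maximum 101 produce values from 0 through 100.
percentages = [random_integer(0, 101, rng) for _ in range(10_000)]
```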
Generates random dates, times, and timestamps that fall within a specified range.
For example, you might want the output dates to all fall within a specific year or month.
To configure the generator, in the Range fields, provide the start and end dates, times, or timestamps to use for the output values.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a random new UUID string.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
Uses regular expressions to parse strings and replace specified substrings with the output of specified generators. The parts of the string to replace are specified inside unnamed top-level capture groups.
Defining multiple expressions allows you to attach completely different sets of sub-generators to a given cell, depending on the cell's value.
If multiple regular expressions match a given string, the regular expressions and their associated generators are applied in the order that they are specified. The first expression defined that matches has the selected sub-generators applied.
With the Replace all matches option, the Regex Mask generator behaves similarly to a traditional regex parser. It matches all occurrences of a pattern before the next pattern is encountered. For example, the pattern ^(a)$ applied to the string aaab matches every occurrence of the letter a, instead of just the first.
Note that for Spark-based data connectors, depending on your environment, there might be slight differences in the regular expression support. To ensure consistent results across all data connectors, use regular expression patterns that are compatible with both Java and C#.
For more information about regular expressions in C#, go to this reference. For more information about regular expressions in Java, go to this reference.
Example expressions
In a cell that contains the string ProductId:123-BuyerId:234, to mask the substrings 123 and 234, specify the regular expression:
^ProductId:([0-9]{3})-BuyerId:([0-9]{3})$
This captures the two occurrences of three-digit numbers in the pattern ProductId:xxx-BuyerId:xxx. This makes it possible to define a sub-generator on neither, either, or both of these captured substrings.
The following regular expression defines a broader capture that matches more cell values:
^(\w+).(\d+).(\w+).(\d+)$
This captures pairs of words ((\w+)) and numbers ((\d+)) if there is a single character of any value between them, instead of the relatively more specific pattern of the first expression.
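To make the capture group behavior concrete, here is a minimal Python sketch that replaces only the captured substrings of the first example expression. It uses Python's re module, not the C# or Java engines. The mask_groups helper and the replacement values are hypothetical stand-ins for sub-generator output.

```python
import re

PATTERN = re.compile(r"^ProductId:([0-9]{3})-BuyerId:([0-9]{3})$")

def mask_groups(value: str, replacements: dict[int, str]) -> str:
    """Replace the selected capture groups and keep the rest of the string."""
    match = PATTERN.match(value)
    if match is None:
        return value  # this sketch leaves non-matching values unchanged
    result = value
    # Work from the last group to the first so earlier offsets stay valid.
    for group in sorted(replacements, reverse=True):
        start, end = match.span(group)
        result = result[:start] + replacements[group] + result[end:]
    return result

# Mask both captured substrings, or only the second one.
masked = mask_groups("ProductId:123-BuyerId:234", {1: "907", 2: "415"})
only_buyer = mask_groups("ProductId:123-BuyerId:234", {2: "415"})
```

Here masked is ProductId:907-BuyerId:415 and only_buyer is ProductId:123-BuyerId:415, which corresponds to assigning a sub-generator to both groups or to just one.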
To configure the generator:
To add a regular expression:
Click Add Regex. On the configuration panel, Cell Value shows a sample value from the source database. You can use the previous and next options to navigate through the values.
By default, Replace all matches is enabled. To only match the first occurrence of a pattern, toggle Replace all matches to the off position.
In the Pattern field, enter a regular expression. If the expression is valid, then Tonic displays the capture groups for the expression.
For each capture group, to select and configure the generator to apply, click the selected generator. You cannot select another composite generator.
To save the configuration and immediately add another regular expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Regexes list:
To edit a regex, click the edit icon.
To remove a regex, click the delete icon.
Generates a column of unique integer values. The values increment by 1.
To configure the generator:
From the Link To dropdown list, select the other columns to link to the current column. You can only select columns that also use the Sequential Integer generator.
In the Starting Point field, type the number to use as the starting point.
By default, the starting point is 0. This means that the column value in the first processed row is 0. The value in the next processed row is 1. The generator continues to increment the value by 1 in each row that it processes.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates values of ISO 6346 compliant shipping container codes. All generated codes are in the freight category ("U").
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database.
When the generator is consistent with another column, then a given value for the other column in the source database always results in the same shipping container code value in the destination database. For example, a shipping container column is consistent with an owner column. Every instance of an owner column from the source database has the same shipping container value in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
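For readers who want to verify that an output value is ISO 6346 compliant, the following Python sketch computes the standard check digit. It is an independent illustration of the ISO 6346 rule, not Structural's implementation, and the function names are hypothetical. Letters map to the values 10 through 38 with multiples of 11 skipped, each of the first ten characters is weighted by a power of 2, and the check digit is the sum modulo 11, then modulo 10.

```python
import string

def _letter_values() -> dict[str, int]:
    """Map A-Z to 10..38, skipping multiples of 11, as ISO 6346 specifies."""
    values, n = {}, 10
    for letter in string.ascii_uppercase:
        if n % 11 == 0:
            n += 1
        values[letter] = n
        n += 1
    return values

LETTER_VALUES = _letter_values()

def check_digit(first_ten: str) -> int:
    """Compute the check digit for owner code + category + serial number."""
    total = 0
    for position, char in enumerate(first_ten):
        value = int(char) if char.isdigit() else LETTER_VALUES[char]
        total += value * (2 ** position)
    return total % 11 % 10

def is_valid_container_code(code: str) -> bool:
    """True when the code has 11 characters and a matching check digit."""
    return (
        len(code) == 11
        and code[10].isdigit()
        and check_digit(code[:10]) == int(code[10])
    )
```

For example, is_valid_container_code("CSQU3054383") returns True, and the fourth character U is the freight category identifier.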
Generates a new valid Canadian Social Insurance Number that preserves the formatting of the original value.
For example, the original value might be 123456789, 123 456 789, or 123-456-789. The output value uses the same format.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates a new valid United States Social Security Number.
You specify the percentage of values for which to include the dashes.
To configure the generator:
In the Percent with -'s field, type the percentage of output values for which to include dashes in the format.
For example, if you set this to 60, then 60% of the output values are formatted 123-45-6789, and 40% are formatted 123456789.
If you set this to 100, then all of the output values are formatted 123-45-6789.
If you set this to 0, then all of the output values are formatted 123456789.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given value for that other column in the source database results in the same SSN in the destination database. For example, if the SSN column is consistent with a Name column, then every instance of John Smith in the source database results in the same SSN in the destination database.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
Applies selected generators to specific StructFields within a StructType in a Spark database (Databricks and Amazon EMR).
For example, for the following StructType:
To get the value of the occupation field, you would use the expression root.occupation.
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell Struct field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. Matched Struct Values shows the result from the value in Cell Struct.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
Shifts timestamps by a random amount of a specific unit of time within a set range.
For date-only values, the Timestamp Shift Generator supports the following date formats. The example values are all for February 23, 2021.
MM/dd/yyyy - 02/23/2021
MM/dd/yy - 02/23/21
MM-dd-yyyy - 02-23-2021
yyyyMMdd - 20210223
yyyy/MM/dd - 2021/02/23
MMddyyyy - 02232021
To configure the generator:
From the Date Part dropdown list, select the unit of time to use for the minimum and maximum shift.
In the Minimum Shift field, type the minimum amount the value can be shifted from the original value.
Use negative numbers to indicate to shift the date to the past.
For example, assume that the date part is Day. -3 indicates that the date cannot be shifted earlier than 3 days before the original date. 3 indicates that the date cannot be shifted earlier than 3 days after the original date.
In the Maximum Shift field, type the maximum amount by which the value can be shifted from the original value.
For example, assume that the date part is Day. 5 indicates that the date cannot be shifted later than 5 days after the original date.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a column is consistent with itself, then the same date part value is always shifted by the same amount.
When a column is consistent with another column, then for the same value in the other column, the date part value is always shifted by the same amount. For example, for the same value of username, the birthdate column value is always shifted by the same amount.
If multiple columns that use the Timestamp Shift generator are consistent with the same other column, then for those columns, the date part value shifts by the same amount. For example, the startdate and enddate columns are both consistent with the username column. Both startdate and enddate use the Timestamp Shift generator. For the same value of username, both startdate and enddate are shifted by the same amount.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
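The startdate and enddate example can be sketched in Python as follows. This is an illustration under stated assumptions, not Structural's algorithm: the shift is derived from a keyed hash of the other column's value, the maximum shift is treated as inclusive, and the names consistent_shift_days, shift_timestamp, and secret are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

def consistent_shift_days(key: str, minimum: int, maximum: int, secret: bytes) -> int:
    """Derive a shift in [minimum, maximum] days from key (illustrative only)."""
    digest = hmac.new(secret, key.encode("utf-8"), hashlib.sha256).digest()
    span = maximum - minimum + 1
    # The same key always yields the same digest, and so the same shift.
    return minimum + int.from_bytes(digest[:8], "big") % span

def shift_timestamp(value: datetime, key: str, minimum: int, maximum: int,
                    secret: bytes) -> datetime:
    """Shift value by the number of days derived from key."""
    return value + timedelta(days=consistent_shift_days(key, minimum, maximum, secret))

secret = b"hypothetical-secret"
# Both columns are consistent with the same username value.
start = shift_timestamp(datetime(2021, 2, 23), "User1", -3, 5, secret)
end = shift_timestamp(datetime(2021, 3, 1), "User1", -3, 5, secret)
```

Because both columns derive the shift from the same username, the six-day interval between the original startdate and enddate is preserved in the output.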
Generates unique email addresses. Replaces the username with a randomly generated GUID, and masks the domain with a character scramble.
This generator only guarantees uniqueness if the underlying column is unique.
To configure the generator:
In the Email Domain field, enter a domain to use for all of the output values.
For example, use @mycompany.com for all of the generated values.
If you do not provide a value, then the generator uses a character scramble on the domain.
In the Excluded Email Domains field, enter a comma-separated list of domains for which email addresses are not masked in the output values. This allows you, for example, to maintain internal or testing email addresses that are not considered sensitive.
Toggle the Replace invalid emails setting to indicate whether to replace an invalid email address with a generated valid email address. By default, invalid email addresses are not replaced. In the replacement values, the username is generated. If you specify a value for Email Domain, then that value is used for the domain. Otherwise, the domain is generated.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, consistency is disabled.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a substitution cipher that preserves formatting, but keeps the URL scheme and top-level domain intact.
For example, for the following input value:
http://www.example.com/products/clothes
The output value would be something like:
http://www.example.com/sowrmsl/kwctlsn
This mask is not secure.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates UUIDs on primary key columns.
All foreign key columns that reference the configured column automatically have their UUID values masked.
To configure the generator:
To preserve the version and variant bits from the source UUID in the output value, toggle Preserve Version and Variant to the on position.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, the generator is not consistent.
If Structural data encryption is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a composite generator.
Runs a selected generator on values that match a user specified path expression.
Path expressions are defined using the XPath syntax.
For example, for the following XML content:
To get the first_name value, you would use /household/member/first_name.
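As a small illustration of applying a generator at a path expression, the following Python sketch replaces every first_name value in an XML string. The XML content here is hypothetical, as are the mask_path helper and the fixed replacement value. Python's xml.etree.ElementTree supports only a subset of XPath and resolves paths relative to the root element, so /household/member/first_name is written as ./member/first_name.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML value, made up for this sketch.
CELL_XML = """<household>
  <member><first_name>Jane</first_name><last_name>Doe</last_name></member>
  <member><first_name>John</first_name><last_name>Doe</last_name></member>
</household>"""

def mask_path(xml_text: str, relative_path: str, replacement: str) -> str:
    """Replace the text of every element that matches relative_path."""
    root = ET.fromstring(xml_text)
    for node in root.findall(relative_path):
        node.text = replacement  # stand-in for a sub-generator's output
    return ET.tostring(root, encoding="unicode")

# Equivalent to /household/member/first_name, relative to the <household> root.
masked = mask_path(CELL_XML, "./member/first_name", "MASKED")
```

Only the matched elements change. The last_name values are left as they were.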
You can also select a fallback generator to run on the entire XML value if there is any error during data generation.
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell XML field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. Matched XML Values shows the result from the value in Cell XML.
From the Generator Configuration dropdown list, select the generator to apply to the value at the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
From the Fallback Generator dropdown list, select the generator to use if any error occurs in the generation. The fallback generator is then used for the entire XML value. The options are:
The linking option for a generator allows multiple columns within the same table to use a single generator.
At a high level, consider using linking when columns share a strong interdependency or correlation.
When you link columns, you tell Tonic Structural that the columns are related to each other, and that Structural should take this relationship into account when it generates new data.
To link columns, you first assign the same generator to those columns.
After you assign the generator, then on the generator configuration panel for any of the columns, you can link the columns.
Categorical generators support linking and can be used to preserve hierarchical data. Examples of hierarchical data include:
City, State, Zip
Job Title, Department
Day of Month, Month, Year
To illustrate how linking works, we'll use an example of city and state columns. Here is the original data:
The below image shows the results when you apply the Categorical generator to city and state columns, but do not link the columns. Because the columns are not linked, the values in each column are shuffled independently. In the output, the city and state combinations are not valid. For example, Phoenix is not in Florida and Baltimore is not in Tennessee.
The next image shows the results when you apply the Categorical generator to and link the city and state columns. This preserves the data hierarchy and ensures that the city and state combinations are valid.
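The difference between unlinked and linked columns can be sketched in Python. This is a simplified illustration that shuffles existing values, not Structural's Categorical generator. The sample rows and the function names are hypothetical.

```python
import random

# Hypothetical source rows of (city, state).
rows = [
    ("Phoenix", "Arizona"),
    ("Baltimore", "Maryland"),
    ("Miami", "Florida"),
    ("Memphis", "Tennessee"),
]

def shuffle_unlinked(rows, rng):
    """Shuffle each column independently, which can break city/state pairs."""
    cities = [city for city, _ in rows]
    states = [state for _, state in rows]
    rng.shuffle(cities)
    rng.shuffle(states)
    return list(zip(cities, states))

def shuffle_linked(rows, rng):
    """Shuffle whole rows, so every city stays with its own state."""
    pairs = list(rows)
    rng.shuffle(pairs)
    return pairs

rng = random.Random(0)
unlinked = shuffle_unlinked(rows, rng)
linked = shuffle_linked(rows, rng)
```

Every pair in linked is a valid combination from the source rows, whereas unlinked can contain combinations such as Phoenix with Florida.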
The following generators can be linked:
When you consider which generator to use, it helps to be familiar with these generator characteristics.
Consistency is an option for some generators that, when turned on, maps the same input to the same output across an entire database.
Consistency can also be maintained across multiple databases of varying types. For example, if consistency is turned on for a name generator, it always maps the same input name (for example, Albert Einstein) to the same output (for example, Richard Feynman).
The primary reasons for using consistency are to:
Enable joining on columns that don't have explicit database constraints in the schema. This is often seen with values such as email addresses. With consistency, you can completely anonymize an email address and still use it in a join.
Match duplicated data across 1 or more databases. For example, you have a user database that contains a username in both a column and a JSON blob, and another database that contains their website activity, identified by the same username values. To anonymize the username, but still have the username be the same in all locations/databases, use consistency.
Self-consistency indicates that the value in the destination database is consistent with the value of the same column in the source database.
For example, a column contains a first name. You make the assigned generator self-consistent. A given first name in the source database is always replaced by the same first name in the destination database. For example, the first name value John is always replaced by the value Michael.
Consistency with another column indicates that the value in the destination database is consistent with the value of a different column in the source database.
For example, a column contains an IP address. You make the assigned generator consistent with the username column. Every row that has the username User1 in the input database has the same IP address in the destination database.
To enable consistency, on the generator configuration panel, toggle the Consistency switch.
Not all generators support consistency.
Consistency is a function of both the data type and the value.
For example, a numeric field contains the value 123. A string/varchar field contains the value "123".
Both fields have consistent generators applied.
The output is not consistent between the two fields.
To demonstrate the effect of consistency on the output, we'll use a column that contains a first name, and that uses the Name generator.
Here is the sample input and output when consistency is not enabled:
In this sample data, the first name Melissa appears twice, but is mapped to Walton the first time and Linn the second time.
Here is the sample input and output when consistency is enabled:
In this case, the first name Melissa is mapped to Rosella both times.
A consistent generator ensures that the same input value always produces the same output value.
It does not guarantee that two different input values produce two different output values.
Consistent generators are not 1:1 mappings.
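The following Python sketch illustrates both properties: the same input always maps to the same output, and two different inputs can map to the same output. It assumes a keyed hash over a small, hypothetical pool of output names. It does not describe how Structural implements consistency, and consistent_name and NAMES are made-up names.

```python
import hashlib
import hmac

NAMES = ["Michael", "Rosella", "Walton", "Linn"]  # hypothetical output pool

def consistent_name(value: str, seed: int) -> str:
    """Map value to a name deterministically for a given seed (illustrative only)."""
    # Encode the seed as a 32-bit signed integer to use as the hash key.
    key = seed.to_bytes(4, "big", signed=True)
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return NAMES[int.from_bytes(digest[:8], "big") % len(NAMES)]

inputs = ["Melissa", "John", "Susan", "Jane", "Albert"]
outputs = [consistent_name(name, seed=42) for name in inputs]
```

With five inputs and only four possible outputs, at least two inputs must share an output, so the mapping cannot be 1:1. Repeated calls with the same input and seed always return the same name.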
Consistency reduces the privacy of your data, because it reveals something about the frequency of the data values.
However, Tonic Structural does not store mappings of the source data to the destination data. In other words, someone can see that in the destination data the name Susan appears 20 times and the name John appears 3 times. But they cannot determine that Susan is mapped from Jane and John is mapped from Michael.
Any column, regardless of which table it resides in, is consistent with any other column that uses the same consistent generator.
For example, your database includes a Customers table and an Employees table. Each table contains a column for the first name of the customer or employee. You assign the Name generator to both columns to generate a first name, and make the generators consistent. The same first name value in either column is mapped to the same destination value. For example, the first name John is always mapped to Michael, whether the name John appears in the Customers table or the Employees table.
However, by default, consistency is not guaranteed between data generation runs, even if the run is on the same database.
By default, consistency is only guaranteed across a single data generation for a single workspace.
For example, for a column that contains a first name value, you assign the Name generator and configure the generator to be consistent. The first time you run data generation, all instances of the name John might be replaced with Michael. The next time you run data generation, all instances of the name John might instead be replaced with Gregory.
You can enable consistency across runs and workspaces so that, for example, every time you run a data generation, John is always replaced with Michael.
To do this, you configure a seed value. You can either:
Configure a seed value for a workspace. This ensures consistency across all data generation runs for that workspace, as well as across other workspaces that have the same seed value.
Disable cross-data generation consistency for a workspace. This indicates to not have consistency across data generation runs or with other workspaces.
To ensure consistency across all data generations and workspaces, add the following environment setting to the Structural worker and web server containers:
TONIC_STATISTICS_SEED: <ANY 32-BIT SIGNED INTEGER>
When you configure a value for this environment setting, then consistency is across all data generations for all workspaces that do not either:
Have a workspace seed value configured.
Have disabled consistency across data generations.
For an individual workspace, you can override the Structural seed value. When you override the Structural seed value, you can either:
Disable consistency across data generation runs for the workspace.
Provide a seed value for the workspace.
When a workspace has a configured seed value, then consistency is across the data generation runs for that workspace.
Consistency is also across all of the data generations for all of the workspaces that have the same seed value.
On the workspace details view, to override the Structural seed value:
Toggle Override Statistics Seed to the on position.
To disable consistency across data generations, click Don't use consistency.
To provide a seed value for the workspace:
Click Consistency value.
In the field, enter the seed value. It must be a 32-bit signed integer. The value defaults to the current value of TONIC_STATISTICS_SEED.
The following generators can be made consistent to themselves. This means that the same input value in the column always produces the same output value.
The following generators can be made consistent either to themselves or to other columns.
When a column is consistent to another column, the output value is based on the other column.
For example, a column contains a company name. You assign the Company Name generator, and make it consistent with the username column. Every row that has the username User1 in the input database has the same company name in the destination database.
Option | Date value | Timestamp value |
---|---|---|
In a child workspace, if you change the configuration of a linked column, the columns that it is linked to also are marked as having overrides to the parent workspace configuration.
Note that you cannot configure linking as part of a generator preset. You can only configure linking when you configure specific columns.
Preserve the approximate cardinality of a column. For example, a city column contains 50 different cities. To randomize this column but still have ~50 cities, you can use consistency to maintain the approximate cardinality. Because consistent generators are not 1:1 mappings, the cardinality might change. However, it is guaranteed to not increase. If unique 1-to-1 mappings are required, a generator should be used.
When you select a generator as the sub-generator for a composite generator, in most cases you cannot configure the generator to be consistent with another column. Only the Conditional generator and the Regex Mask generator allow a sub-generator to be consistent with another column.
Note that consistency with another column cannot be configured in a generator preset. You can only configure it when you configure an individual column.
Configure the Structural TONIC_STATISTICS_SEED. This ensures consistency across all workspaces and data generation runs.
(Deprecated)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
Yes, can be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
Yes, can be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
Yes
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Determined by the specified sub-generators.
Linking
Determined by the specified sub-generators.
Differential privacy
Determined by the specified sub-generators.
Data-free
Determined by the specified sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
Yes
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
Yes, can be linked.
Differential privacy
Configurable
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
2 if differential privacy enabled
3 if differential privacy not enabled
Generator ID (for the API)
Consistency
Yes, can be made self-consistent
Linking
No, cannot be linked
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
This generator is implicitly self-consistent. You do not specify whether the generator is consistent. Every occurrence of a character always maps to the same substitute character. Because of this, it can be used to preserve a join between two text columns, such as a join on a name or email.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
4
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Determined by the selected generators.
Linking
Determined by the selected generators.
Differential privacy
Determined by the selected generators.
Data-free
Determined by the selected generators.
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
If a fallback generator is selected, then the lower of either 5 or the fallback generator.
5 if no fallback generator is selected
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
Yes, can be linked.
Differential privacy
Configurable
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
2 if differential privacy enabled
3 if differential privacy not enabled
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
Yes, can be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Option | Date value | Datetime value
Original value | 2021-12-20 | 2021-12-20 13:42:55
Truncate to year | 2021-01-01 | 2021-01-01 00:00:00
Truncate to month | 2021-12-01 | 2021-12-01 00:00:00
Truncate to day | 2021-12-20 | 2021-12-20 00:00:00
Truncate to hour | Not applicable | 2021-12-20 13:00:00
Truncate to minute | Not applicable | 2021-12-20 13:42:00
Truncate to second | Not applicable | 2021-12-20 13:42:55
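The truncation behavior shown in the example values above can be sketched in a few lines of Python. This is only an illustration of the listed results, not the Structural implementation. The function name `truncate` and the unit strings are hypothetical.

```python
from datetime import date, datetime

# Hypothetical unit names, ordered from coarsest to finest.
UNITS = ("year", "month", "day", "hour", "minute", "second")

def truncate(value, unit):
    """Reset every component of `value` that is finer than `unit`."""
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit}")
    level = UNITS.index(unit)
    month = value.month if level >= 1 else 1
    day = value.day if level >= 2 else 1
    if not isinstance(value, datetime):
        # Date-only values have no time part: "Not applicable" above.
        if level >= 3:
            return None
        return value.replace(month=month, day=day)
    return value.replace(
        month=month,
        day=day,
        hour=value.hour if level >= 3 else 0,
        minute=value.minute if level >= 4 else 0,
        second=value.second if level >= 5 else 0,
        microsecond=0,
    )
```

For example, `truncate(datetime(2021, 12, 20, 13, 42, 55), "hour")` keeps the date and the hour and resets the minutes and seconds to zero.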
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
Yes, can be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
Yes, can be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
Yes, can be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
Yes
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column. Note that all Name generator columns that have the same consistency configuration are automatically consistent with each other. The columns must either be all self-consistent or all consistent with the same other column. For example, you can use this to ensure that a first name and last name column value always match the first name and last name in a full name column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
Yes
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
6
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
Yes
Data-free
Yes
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
1
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
Yes, can be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No, cannot be made differentially private.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
Yes, if consistency is not enabled.
Data-free
Yes, if consistency is not enabled.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
1 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Consistency
Yes, can be made self-consistent or consistent with another column.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
No, cannot be made consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
No
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
No
Privacy ranking
3
Generator ID (for the API)
Consistency
Yes, can be made self-consistent.
Linking
No, cannot be linked.
Differential privacy
No
Data-free
No
Allowed for primary keys
Yes
Allowed for unique columns
Yes
Uses format-preserving encryption (FPE)
Yes
Privacy ranking
3 if not consistent
4 if consistent
Generator ID (for the API)
Consistency
Determined by the selected sub-generators.
Linking
Determined by the selected sub-generators.
Differential privacy
Determined by the selected sub-generators.
Data-free
Determined by the selected sub-generators.
Allowed for primary keys
No
Allowed for unique columns
No
Uses format-preserving encryption (FPE)
No
Privacy ranking
5
Generator ID (for the API)
Some generators can be data-free. When a generator is data-free, it means that the output data is completely unrelated to the source data. There is no way to use the output data to uncover the source data. Data-free generators implicitly have differential privacy. A generator is not data-free if consistency is enabled.
The following generators are always data-free:
The following generators are data-free only when consistency is disabled:
Company Name (deprecated)
Differential privacy is one technique that Tonic Structural uses to ensure the privacy of your data.
Differential privacy limits the effect of a single source record or user on the destination data. Someone who views the output of a process that has differential privacy cannot determine whether a particular individual's information was used to generate that output.
Data that is protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised.
Any generator that does not use the underlying data at all is considered "data-free". A data-free generator always has differential privacy.
Several Structural generators are either always data-free, or are data-free if consistency is not enabled.
The configuration options for the Categorical and Continuous generators include a Differential Privacy toggle to enable or disable differential privacy.
The Categorical generator shuffles the values of a column while preserving the overall frequency of the values. Note that NULL is considered its own category of value.
Differential privacy (disabled by default) further protects the privacy of your data by:
First, adding noise to the frequencies of categories.
After that, if needed, removing rare categories from the possible samples.
Differential privacy is not appropriate when the data in each row is unique or nearly unique. As a general rule of thumb, categories that are represented by fewer than 15 rows are at risk of being suppressed.
Structural warns you when a column isn’t suitable for differential privacy. A column is not suitable for differential privacy if most or all categories have fewer than 15 rows.
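The two steps described above (add noise to the category frequencies, then drop rare categories) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Structural implementation: the use of Laplace noise, the noise scale, the `epsilon` value, and the threshold of 15 (borrowed from the rule of thumb above) are all assumptions, and the function names are hypothetical.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_category_weights(values, epsilon=1.0, threshold=15.0, seed=None):
    """Step 1: add noise to each category count. Step 2: drop rare categories."""
    rng = random.Random(seed)
    counts = Counter(values)  # None (NULL) is counted as its own category
    noisy = {cat: n + laplace_noise(1.0 / epsilon, rng) for cat, n in counts.items()}
    return {cat: weight for cat, weight in noisy.items() if weight >= threshold}

def sample_categories(weights, k, seed=None):
    """Draw k output values in proportion to the noisy frequencies."""
    if not weights:
        raise ValueError("every category was suppressed")
    rng = random.Random(seed)
    categories = list(weights)
    return rng.choices(categories, weights=[weights[c] for c in categories], k=k)
```

With this sketch, a category that appears in only a handful of rows almost never survives the threshold, which is why a column of unique or nearly unique values is a poor fit for differential privacy.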
The Continuous generator produces samples that preserve the individual column distributions and correlations between columns.
When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [4].
Suppose we want to count the number of users in a database that have some sensitive property, for example, the number of users with a particular medical diagnosis.
In [2], Dwork, McSherry, Nissim, and Smith introduced the Laplace mechanism as a way to publish these counts securely, by adding noise sampled from the Laplace distribution.
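A minimal sketch of the Laplace mechanism for a single count is shown below. It assumes that one row changes the count by at most 1, so the noise scale is `1 / epsilon`. The function names are hypothetical and the code is not taken from Structural.

```python
import math
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng=None):
    """Publish a count with Laplace noise of scale 1 / epsilon."""
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)

def laplace_pdf(x, center, scale):
    """Density of the noisy output x when the true count is `center`."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)
```

Comparing `laplace_pdf(y, 100, 1.0)` with `laplace_pdf(y, 101, 1.0)` for any `y` shows that the two densities never differ by more than a factor of `math.exp(1.0)`, which is the differential privacy guarantee for `epsilon = 1`.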
A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise that is drawn from a wider array of distributions than the Laplace distribution.
For example, the AnalyzeGauss mechanisms of [4], and differentially private gradient descent of [1], use Gaussian noise as a fundamental ingredient, which requires the following relaxation:
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI:https://doi.org/10.1145/2976749.2978318
[2] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI:https://doi.org/10.1007/11681878_14
[3] Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI:https://doi.org/10.1561/0400000042
[4] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI:https://doi.org/10.1145/2591796.2591883
Format-preserving encryption (FPE) encrypts data in such a way that the output is in the same format as the input. For example, a number in the input produces a number in the generated output.
For the following generators, Tonic Structural uses FPE to encrypt the generated values. Note that the Structural implementation of FPE might not guarantee compliance with standards. For example, the ASCII Key generator does not guarantee that the length of the output data matches the length of the input data.
Each generator supports a specific input character set or domain.
If you see encryption errors, then it probably means that the column contains values that are incompatible with the selected generator. To address this, you need to choose a different generator.
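To make the idea of format preservation and input domains concrete, here is a deliberately simplified toy cipher over digit strings. It is not secure, it is not a standardized FPE scheme, and it is not how Structural implements FPE. The function names and the keystream construction are assumptions made for this sketch. It only shows three properties discussed above: the output has the same format as the input, the same input always produces the same output, and a value outside the supported domain raises an error.

```python
import hashlib
import hmac

def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from the key (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def fpe_digits(value: str, key: bytes, decrypt: bool = False) -> str:
    """Shift each digit by a key-derived amount, modulo 10."""
    if not (value.isascii() and value.isdigit()):
        # Values outside the digit-only domain cannot be processed.
        raise ValueError("value is outside the supported domain (digits only)")
    sign = -1 if decrypt else 1
    stream = _keystream(key, len(value))
    return "".join(str((int(ch) + sign * k) % 10) for ch, k in zip(value, stream))
```

Calling `fpe_digits("12a4", key)` raises a `ValueError`, which mirrors the kind of encryption error that occurs when a column contains values that the selected generator does not support.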
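For the Categorical generator, the noise and suppression steps ensure that a single row of source data has limited influence on the output values. By default, the privacy budget for this generator is specified by $\varepsilon$ and $\delta$ values that are expressed in terms of $n$, the number of rows.
For the Continuous generator, the default privacy budget is specified by an $\varepsilon$ value with a $\delta$ value.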
Differential privacy is a property of a randomized algorithm $\mathcal{M}$, which takes as input a database $D$ and produces some output $\mathcal{M}(D)$. The outputs could be counts, summary statistics, or synthetic databases. The specific type is not important for this formulation.
For this formulation, we say two databases $D$ and $D'$ are neighbors if they differ by a single row.
For a given $\varepsilon > 0$, we say that $\mathcal{M}$ is $\varepsilon$-differentially private if, for all neighboring databases $D$ and $D'$ and all subsets of outputs $S$, we have:
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]$$
When the definition is relaxed to include an additional non-zero parameter $\delta$, the algorithm is sometimes called approximately differentially private.
The parameter $\varepsilon$ is the privacy budget of the algorithm, and quantifies in a precise sense an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database.
Suppose an attacker suspects that our secret database $D$ is one of two possible neighboring databases $D_1$ and $D_2$, with some fixed odds.
If $\mathcal{M}$ is $\varepsilon$-differentially private, then observing $\mathcal{M}(D)$ updates the attacker's log odds of $D_1$ vs $D_2$ by at most $\varepsilon$.
The closer $\varepsilon$ is to $0$, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from $\mathcal{M}(D)$.
Conversely, larger values of $\varepsilon$ mean that an attacker can possibly learn significant information by observing $\mathcal{M}(D)$.
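The log-odds statement above can be checked with a short derivation for a discrete output space. The notation is chosen for this sketch: $\mathcal{M}$ is the randomized algorithm, $D_1$ and $D_2$ are the two candidate neighboring databases, $o$ is the observed output, and $\varepsilon$ is the privacy budget. By Bayes' rule, the attacker's posterior odds are the prior odds multiplied by a likelihood ratio, and differential privacy bounds that ratio:

```latex
\begin{align*}
\frac{\Pr[D = D_1 \mid \mathcal{M}(D) = o]}{\Pr[D = D_2 \mid \mathcal{M}(D) = o]}
  &= \frac{\Pr[\mathcal{M}(D_1) = o]}{\Pr[\mathcal{M}(D_2) = o]}
     \cdot \frac{\Pr[D = D_1]}{\Pr[D = D_2]}, \\
e^{-\varepsilon}
  &\le \frac{\Pr[\mathcal{M}(D_1) = o]}{\Pr[\mathcal{M}(D_2) = o]}
   \le e^{\varepsilon}, \\
\left| \log \frac{\Pr[D = D_1 \mid \mathcal{M}(D) = o]}{\Pr[D = D_2 \mid \mathcal{M}(D) = o]}
     - \log \frac{\Pr[D = D_1]}{\Pr[D = D_2]} \right|
  &\le \varepsilon .
\end{align*}
```

The middle line applies the differential privacy inequality twice, once in each direction, because the neighbor relation is symmetric.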
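This noise affords us plausible deniability. If the underlying count changed by $1$, then the probability of observing the same noisy output does not change by much. For Laplace noise with scale $1/\varepsilon$, the density $p_c(y)$ of observing the noisy value $y$ when the true count is $c$ satisfies:
$$\frac{p_c(y)}{p_{c+1}(y)} = e^{\varepsilon \, (|y - c - 1| - |y - c|)} \le e^{\varepsilon}$$
We illustrate this visually, showing the probability density function (pdf) of the observed values given three adjacent true counts (blue, orange, and green).
The blue shaded region shows that the densities of the noisy count values for the two neighboring true counts lie within a factor of $e^{\varepsilon}$ of the density for the central true count, so this mechanism is $\varepsilon$-differentially private.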
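For a given $\varepsilon > 0$ and $\delta \ge 0$, we say that a randomized algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all neighboring databases $D$ and $D'$ and all subsets of outputs $S$, we have:
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta$$
The parameter $\delta$ is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism that reveals a sensitive database with probability $\delta$, in practice this is not a plausible outcome with carefully designed mechanisms. Also, taking $\delta$ to be small relative to the size of the database ensures that the risk of disclosure is low.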
When a generator attempts to process data that is not within the expected domain, it results in encryption errors. For example, a generator that expects numeric input cannot process a string that includes non-numeric characters such as letters or symbols, and a generator that expects UUIDs cannot process any value that is not a valid UUID.
One option is a generator that has very few restrictions on the allowed values.
Another option is to use the Conditional generator, which allows you to assign different generators based on column values.
Most Tonic Structural generators consume source data and perform an operation on it to produce destination data. For example, the Character Scramble generator takes the original data from the source database, replaces the letters and numbers with random letters and numbers, and then writes the result to the destination database.
Composite generators do not generate data directly.
Structural provides the following composite generators:
Most composite generators treat the input as structured data that the generator parses using a domain-specific syntax, such as:
XPath for XML or HTML
JSONPath for JSON or a Spark StructType
Regular expressions for text
These generators allow you to select a sub-value of the input, and then configure a specific generator to apply to only that sub-value. This means that you can take your original structured data and selectively mask the content.
For example, for the following structured content:
{ name: { first: "Tj", last: "Bass" } }
You indicate to use the Name generator to replace the value of last. The result is something like:
{ name: { first: "Tj", last: "Pine" } }
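The idea of applying a generator to only one sub-value can be sketched in Python. This illustration uses a plain list of keys in place of a JSONPath expression, and a hypothetical `replace_last_name` function in place of the Name generator. None of these names come from Structural.

```python
import copy
import random

# Hypothetical replacement values, standing in for the Name generator.
SURNAMES = ["Pine", "Oak", "Birch"]

def replace_last_name(_original):
    """Return a replacement surname that ignores the original value."""
    return random.choice(SURNAMES)

def mask_path(document, path, generator):
    """Return a copy of `document` with `generator` applied at `path` only."""
    result = copy.deepcopy(document)
    parent = result
    for key in path[:-1]:
        parent = parent[key]
    parent[path[-1]] = generator(parent[path[-1]])
    return result

record = {"name": {"first": "Tj", "last": "Bass"}}
masked = mask_path(record, ["name", "last"], replace_last_name)
```

Only the selected sub-value changes. The `first` value and the original `record` are left untouched.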
The Conditional generator is slightly different. It allows you to apply a specific generator when the column value matches a specific condition. For example, you can indicate to apply a Character Scramble generator only if the column value is something other than "test".
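A minimal sketch of that conditional behavior follows. The scramble function is an assumption about what a character scramble might look like (letters become random letters of the same case, digits become random digits, other characters are kept), and all function names are hypothetical.

```python
import random
import string

def character_scramble(value, rng=random):
    """Replace letters with random letters and digits with random digits."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # punctuation and spaces are kept as-is
    return "".join(out)

def conditional(value, condition, generator):
    """Apply `generator` only when `condition(value)` is true."""
    return generator(value) if condition(value) else value

def not_test(value):
    return value != "test"
```

With these helpers, `conditional("test", not_test, character_scramble)` returns the value unchanged, and any other value is scrambled.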
You cannot configure generator presets for composite generators from the Generator Presets view. The Generator Presets view does not have access to data to use for path expressions or conditions. From a column configuration panel, you can save the current configuration as the new baseline configuration, and reset the configuration to the current baseline.
For any composite generator, when you select the generator to apply to a selected sub-value or based on a specified condition, you cannot select another composite generator. For example, you cannot apply a Conditional or XML Mask generator to the value of a specified path expression.
For composite generators other than the Conditional or Regex Mask generators, you cannot configure a sub-generator to be consistent with another column.
These topics talk about groups of related generators that have similar functions and configurations.
The AI Synthesizer generator is intended for use cases that require high-fidelity mimicked data. It can be used instead of the Continuous or Categorical generators.
This generator uses deep neural networks to learn models of your data, which can be sampled to generate new synthetic rows that faithfully mimic the statistical properties of your data.
The expressiveness of deep neural networks allows this generator to capture subtle relationships in the data that may be difficult to express using linking and partitioning generators. The relationships are learned from the data, instead of specified by the user.
Because this generator uses neural networks to learn from the data, performance is limited by the time required to train a model.
The privacy ranking is 3.
For the Tonic Structural API, the generator ID is NnGenerator.
By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Structural web server container, set the environment setting TONIC_NN_GENERATOR_ENABLED to true. For more information, go to Configuring environment settings.
Within each table, to configure the AI Synthesizer:
Assign the AI Synthesizer generator to the columns to use in the model. You also determine the type of data in each column.
Determine whether the table contains event data. For event data, you must select the primary entity and order columns.
For each table, you assign the AI Synthesizer generator to each column that you want to include in the trained model. AI Synthesizer trains one model per table.
You can assign the AI Synthesizer generator to columns that contain categorical, numeric, or location data. You cannot assign the AI Synthesizer to a datetime column.
Structural identifies the type of the column, but you can make adjustments to these assignments. For example:
A numeric column might actually be an enum, which would make it a categorical column.
A city name might be designated categorical, but is actually a location.
On the generator configuration panel for the column, from the type dropdown list, select the column type.
A table might contain event data, meaning that you want to preserve relationships between both rows and columns. For example, you might want to track financial transactions across time for each user.
To indicate that a table contains event data, on the generator dialog for any of the columns, check the Event Data checkbox.
The checkbox applies to the entire table.
For event data, you specify:
The column to use to identify the row (primary entity). For example, to track activity for users, you might use a column that contains a user name or identifier.
The column to use to sort the rows (order). This column should contain a numeric representation of a datetime value.
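If the source table stores the event time as a datetime string, one way to obtain a numeric order column is to convert it to Unix epoch seconds. This is a sketch of one possible conversion, not a Structural requirement. The function name is hypothetical, and treating values without a time zone as UTC is an assumption.

```python
from datetime import datetime, timezone

def to_order_value(timestamp: str) -> int:
    """Convert an ISO-style datetime string to Unix epoch seconds."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption: values without a time zone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

Later events produce larger numbers, so sorting on the converted column preserves the order of events for each primary entity.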
On the generator configuration panel:
To identify the current column as the primary entity, from the type dropdown list, select Primary Entity.
To identify the current column as the column to use for ordering, from the type dropdown list, select Order.
The Primary Entity and Order options are only available when Event Data is checked. The Order option is only available for numeric columns.
When the AI Synthesizer generator is assigned to at least one column in the table, then in Table View for that table, the AI Synthesizer panel displays.
The panel displays the list of columns that are included, and, for each column, the selected encoding type.
To remove a column, click the delete icon. The column is removed from the list, and the column generator is reset to Passthrough. For event data, if you remove the primary column or the order column, then you must assign that role to a different column.
To configure the model training, click the settings icon. The settings on the settings panel are slightly different depending on whether the model contains event data.
On the settings panel, the following parameters are common to all models:
In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.
In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.
In the Reconstruction Loss Factor field, enter the reconstruction loss factor for the model. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” function and a regularization term. A higher value can help to produce decoded samples that are close to encoded samples, but also can make latent representations more complicated and reduce the diversity of synthetic samples.
In the Latent Dimension field, enter the dimension of latent representation. The default is 128. This latent dimension represents the complexity of the data. If the specified value is much higher than the dimensionality of the issue that you want to analyze, it can reduce the quality of the results.
In the Maximum Categorical Dimension field, enter the dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
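The effect of the Maximum Categorical Dimension setting can be illustrated with a small one-hot encoder. This is a sketch under assumptions, not the AI Synthesizer code: it assumes that the shared slot for the remaining categories counts toward the maximum dimension, and the names `build_vocabulary`, `one_hot`, and `OTHER` are hypothetical.

```python
from collections import Counter

OTHER = "<other>"  # hypothetical label for the shared slot

def build_vocabulary(values, max_dim=35):
    """Keep the most frequent categories and reserve one slot for the rest."""
    counts = Counter(values)
    if len(counts) <= max_dim:
        return [category for category, _ in counts.most_common()]
    return [category for category, _ in counts.most_common(max_dim - 1)] + [OTHER]

def one_hot(value, vocabulary):
    """Encode `value`; infrequent categories all map to the OTHER slot."""
    target = value if value in vocabulary else OTHER
    vector = [0] * len(vocabulary)
    # Raises ValueError if the value is unknown and there is no OTHER slot.
    vector[vocabulary.index(target)] = 1
    return vector
```

Capping the vocabulary this way keeps the encoded width of a column bounded no matter how many distinct values the column contains.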
For event data, to configure the RNN-VAE Parameters:
In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Structural considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.
In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.
In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.
In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.
In the Sequence Length Loss Factor field, enter the loss factor for sequencing for the model. The default is 2.0. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, the AI Synthesizer uses more of the model's capacity to capture the statistical properties of sequence lengths.
In the Order Column Loss Factor field, enter the loss factor for the column value order. The default is 1024.0. The order column loss factor determines how important it is to predict the order of the column values. Similar to the sequence length loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
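The roles of the primary entity column, the order column, and the Maximum Sequence Length setting can be sketched as a preprocessing step. This is an illustration only. The function name is hypothetical, and keeping the earliest events when a sequence is truncated is an assumption, because the text above does not say which end is dropped.

```python
from collections import defaultdict

def build_sequences(rows, entity_key, order_key, max_length=20):
    """Group rows by primary entity, sort by the order column, and truncate."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[entity_key]].append(row)
    return {
        entity: sorted(events, key=lambda r: r[order_key])[:max_length]
        for entity, events in grouped.items()
    }

rows = [
    {"user": "u1", "ts": 3, "amount": 30},
    {"user": "u1", "ts": 1, "amount": 10},
    {"user": "u2", "ts": 1, "amount": 99},
    {"user": "u1", "ts": 2, "amount": 20},
]
sequences = build_sequences(rows, "user", "ts", max_length=2)
```

Here the three events for `u1` are ordered by `ts` and cut to the first two, while the single event for `u2` is kept as a sequence of length one.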
For non-event data, to configure the VAE Parameters:
In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
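To illustrate how a comma-separated value maps to a number of layers and a size for each layer, the following minimal sketch parses the field value. The function name parse_layer_sizes is hypothetical and is not part of Structural.

```python
def parse_layer_sizes(value: str) -> list[int]:
    """Parse a comma-separated field value such as "256,256,256"."""
    sizes = []
    for part in value.split(","):
        size = int(part.strip())  # raises ValueError for non-integer input
        if size < 0:
            raise ValueError(f"layer size must be non-negative: {size}")
        sizes.append(size)
    return sizes


layers = parse_layer_sizes("256,256,256")
print(len(layers), layers)  # 3 [256, 256, 256]
```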
In a child workspace, the AI Synthesizer panel under Model indicates whether the configuration is inherited from the parent workspace.
The inheritance stops if you make any changes to the AI Synthesizer configuration. When the configuration overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.
Model training starts when you start the generation job.
This can take some time, depending on the size of the table and the number of columns that use the AI Synthesizer generator.
For example, a table that has 30 AI Synthesizer columns and 200,000 rows can take 2.5 hours to train.
The status information on the Jobs page includes the status of the model training.
After the model is trained, the new synthetic data writes to the destination database.
For self-hosted Enterprise instances, the selected generator is a generator preset. A generator preset provides a specific configuration for a generator. Whenever a user selects the preset, the generator automatically uses the saved configuration for the preset, which we call the baseline configuration. Tonic Structural provides a built-in preset for most generators. You can also create custom presets.
After you select the preset, you can:
Override the baseline generator preset configuration. For example, if the built-in preset for the Name generator uses the First Last format, but the column contains a first name, you can change the format to First.
Remove the overrides to the baseline configuration.
Save the updated configuration as the new baseline for the generator preset.
Save the updated configuration as a new custom generator preset.
Required license to manage generator presets: Enterprise
For Basic and Professional instances, users select and configure generators separately for each column.
Required workspace permission: Configure column generators
From the Generator Type dropdown, select the generator to assign to the column.
The list contains the names of the generators that can be applied to the column.
Use the filter field to search by generator name.
For self-hosted Enterprise instances, the generator names represent built-in and custom generator presets. When you select a generator preset, the configuration is updated to match the current baseline configuration for that preset.
To remove the selected generator and set the generator to Passthrough, click the delete icon next to the generator dropdown list.
Overriding the configuration does not affect the baseline configuration for the generator preset.
A column is also considered to have overrides when someone changed the baseline configuration of the generator preset after it was assigned to the column.
Note that the following configuration options are not part of the preset configuration:
On the column configuration panel, you use the Reset to baseline button to remove any overrides to the current baseline configuration for the generator preset.
From the column configuration panel, you can save the updated configuration as the baseline configuration for the generator preset.
To do this, click Preset Options, then select Update baseline configuration. On the confirmation panel, click Confirm.
When you update the baseline configuration for the generator preset, Structural does not change the configuration of other columns that use the previous baseline configuration.
Whenever you select a generator preset, it uses the current baseline configuration.
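The relationship between a baseline configuration and column overrides can be sketched as a small data model. This is only an illustration of the behavior described above. The Preset and Column classes and their methods are hypothetical and do not reflect how Structural stores presets.

```python
class Preset:
    """Hypothetical stand-in for a generator preset."""

    def __init__(self, name, baseline):
        self.name = name
        self.baseline = dict(baseline)


class Column:
    """Hypothetical stand-in for a column with an assigned preset."""

    def __init__(self, name, preset):
        self.name = name
        self.preset = preset
        # Selecting a preset copies its current baseline configuration.
        self.config = dict(preset.baseline)

    def has_overrides(self):
        # True when the column differs from the preset's current baseline.
        return self.config != self.preset.baseline

    def reset_to_baseline(self):
        self.config = dict(self.preset.baseline)

    def update_baseline(self):
        # Save this column's configuration as the new baseline.
        self.preset.baseline = dict(self.config)


name_preset = Preset("Name", {"format": "First Last"})
first_name = Column("first_name", name_preset)
full_name = Column("full_name", name_preset)

first_name.config["format"] = "First"  # override on one column only
print(first_name.has_overrides(), full_name.has_overrides())  # True False

first_name.update_baseline()  # other columns keep their configuration
print(first_name.has_overrides(), full_name.has_overrides())  # False True

full_name.reset_to_baseline()
print(full_name.config)  # {'format': 'First'}
```

In this sketch, the second column counts as having overrides only because the baseline changed after the preset was assigned.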
From the generator configuration panel, you can save the current configuration as a new custom generator preset.
When you create a new custom generator preset, it is selected as the generator preset for the column.
To do this:
Click Preset Options, then select Create a new generator preset.
On the Create New Preset dialog, in the New Preset Name field, provide a name for the new custom generator preset.
Click Create.
Required license for workspace inheritance: Enterprise
The inheritance stops if you select a different generator or change the generator configuration.
The inheritance stops if you select a different generator or generator preset (including the Passthrough generator) or change the configuration.
When the column overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.
Privacy Hub, Database View, and Table View all provide an option to assign a generator to a column.
For more information about generator presets, go to Managing generator presets.
After you select a generator preset, you can change the generator configuration. For details about the available configuration options for each generator, see the Generator reference.
In a child workspace, the configuration panel indicates whether the column currently inherits the configuration from the parent workspace.
Required license: Professional or Enterprise
Not available on Tonic Structural Cloud.
Required global permission: Configure Tonic data encryption
A common use case for custom processing is encrypted source data. The data might need to be decrypted before a generator is applied, and encrypted before it is saved to the destination database.
Structural data encryption allows you to configure decryption and encryption to use during data generation. The data encryption process supports AES encryption with the CBC, ECB, or CFB cipher mode.
When Structural data encryption is enabled, the configuration panel for each column includes a toggle to use Structural data encryption for that column.
For columns that use both Structural data encryption and a custom value processor:
Decryption occurs before a pre-processing custom value processor.
Encryption occurs after a post-processing custom value processor.
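The order of operations for a single value can be sketched as follows. This is a minimal illustration, not Structural code. The function transform_value and the toy steps are hypothetical, and the "enc:" prefix only stands in for real encryption.

```python
from typing import Callable, Optional

Step = Optional[Callable[[str], str]]


def transform_value(value: str, generator: Callable[[str], str],
                    decrypt: Step = None, pre_process: Step = None,
                    post_process: Step = None, encrypt: Step = None) -> str:
    # Order: decrypt, pre-process, generate, post-process, encrypt.
    # Steps that are not configured are skipped.
    for step in (decrypt, pre_process, generator, post_process, encrypt):
        if step is not None:
            value = step(value)
    return value


# Toy stand-ins only. The "enc:" prefix is NOT real encryption.
def fake_decrypt(v: str) -> str:
    return v[len("enc:"):]


def fake_encrypt(v: str) -> str:
    return "enc:" + v


result = transform_value(
    "enc: alice ",
    generator=str.upper,          # stand-in for the assigned generator
    decrypt=fake_decrypt,
    pre_process=str.strip,        # stand-in for a pre-processing processor
    post_process=lambda v: v + "!",  # stand-in for a post-processing processor
    encrypt=fake_encrypt,
)
print(result)  # enc:ALICE!
```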
You enable and configure the data encryption from the Data Encryption tab of the Tonic Settings view. To display the Tonic Settings view, in the Tonic heading, click Tonic Settings.
To use Structural data encryption, you must provide:
A Base64-encoded decryption key as the value of the TONIC_DATA_DECRYPTION_KEY environment setting.
A Base64-encoded encryption key as the value of the TONIC_DATA_ENCRYPTION_KEY environment setting.
Both key values must use the same key size: 128, 192, or 256 bits.
For more information, go to Configuring environment settings.
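As a rough sketch of what a Base64-encoded key of a valid size looks like, the following Python example generates a random key and checks the size of an encoded key. The helper names are hypothetical. In practice, the decryption key must be the key that the source data was encrypted with, so only the encoding and the size check apply to it.

```python
import base64
import secrets

VALID_KEY_BITS = (128, 192, 256)


def generate_key_b64(bits: int = 256) -> str:
    """Return a random key of the given size, Base64-encoded."""
    if bits not in VALID_KEY_BITS:
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    return base64.b64encode(secrets.token_bytes(bits // 8)).decode("ascii")


def key_size_bits(key_b64: str) -> int:
    """Return the size in bits of a Base64-encoded key."""
    return len(base64.b64decode(key_b64, validate=True)) * 8


encryption_key = generate_key_b64(256)
print(key_size_bits(encryption_key))  # 256
```

You would then set the encoded strings as the values of the two environment settings, after confirming that both report the same size.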
Structural validates whether the values are set correctly. Structural enables the rest of the Data Encryption tab settings only if the keys are set correctly.
By default, Structural data encryption is disabled. To enable it, toggle Enable Data Encryption to the on position.
When you enable Structural data encryption, you choose whether to use decryption, encryption, or both.
You use decryption if the source data is encrypted and must be decrypted before the generators are applied.
You use encryption to encrypt the transformed data before saving it to the destination database.
To use decryption only, select Use Decryption.
To use encryption only, select Use Encryption.
To both decrypt and encrypt data, select Use Decryption and Encryption.
Structural only supports AES encryption. The AES Encryption setting shows the current key size.
The key size is based on the values you provided for the decryption and encryption key environment settings.
From the Cipher Mode dropdown list, select the cipher mode to use for Structural data encryption. The available cipher modes are:
CBC
ECB
CFB
Before it decrypts or encrypts data, Structural applies an initialization vector.
By default, Structural generates a random initialization vector, and Use custom Initialization Vector (IV) is in the off position.
To provide custom initialization vectors for Structural to use:
Toggle Use custom Initialization Vector (IV) to the on position.
If the Structural data encryption configuration includes encryption, then in the Encryption IV field, enter the static initialization vector to use to encrypt data.
If the Structural data encryption configuration includes decryption, then in the Decryption IV field, enter the static initialization vector to use to decrypt data.
After it encrypts the destination data, but before it stores it, Structural can prepend a string to the encrypted data.
To configure Structural data encryption to prepend a string:
Toggle Prepend value to encrypted data to the on position.
In the Custom Value field, enter the string to prepend.
After you complete the configuration, the Preview Results panel allows you to test the decryption and encryption.
If the configuration is incomplete, you cannot run the test.
If the configuration is for decryption only:
In the Ciphertext field, enter an encrypted text string.
Click Run Test.
Verify that the text in the Plaintext Result field is correct.
If the configuration is for encryption only:
In the Plaintext field, enter an unencrypted text string.
Click Run Test.
Verify that the text in the Ciphertext Result field is correct.
If the configuration is for both decryption and encryption, then you provide an encrypted string. The test decrypts the string into plain text, then re-encrypts that string.
In the Ciphertext field, enter an encrypted text string.
Click Run Test.
Verify that the text in the Plaintext Result field and the Ciphertext Result field is correct.
To save the configuration, click Save.
To revert any changes since you last saved the configuration, click Revert.
Required workspace permission: Configure column generators
The Tonic Structural sensitivity scan identifies specific types of sensitive data. For each sensitivity type that it detects, Structural can have a recommended generator. For example, for a value that the sensitivity scan identifies as a Social Security Number, Structural recommends the SSN generator. For a first name, Structural recommends the Name generator configured with First as the value type.
From Privacy Hub and Database View, you can review and apply the recommended generators.
In Privacy Hub, on the settings view of the column details panel, for a detected sensitive column that does not have an applied generator, and that has a recommended generator, Structural displays a button for the recommended generator.
To apply the recommended generator, click the button.
On Database View, for a detected sensitive column that does not have an applied generator, the generator name tag displays the type of sensitive data, such as a first name or an email address.
To apply the recommended generator:
Click the generator name tag.
On the recommended generator panel, click Apply recommendation.
When there are detected sensitive columns that are not protected, Privacy Hub displays a Sensitivity Recommendations banner. The banner displays the number of detected, unprotected columns.
To review the recommended generators, and determine whether to apply them, click Review Recommendations.
The Recommended Generators by Sensitivity Type panel displays the list of sensitivity types for which there are detected, unprotected columns.
To display the columns for a sensitivity type, click the expand icon for that type.
To hide the column list, click the collapse icon.
For each column, the list includes the following information:
The table and schema name
The column name, with the column data type
An example value from the source data (Original Data), with a corresponding destination value when the recommended generator is applied (Expected Output).
To display a larger sample of source and destination values, click the view icon in the Expected Output column.
To filter the lists, you can use any of the following:
Schema name
Table name
Column name
Start to type text in the schema, table, or column name. As you type, Structural applies the filter to all of the lists.
When you first display the panel, all of the columns are selected. The selected columns are the columns that are affected when you apply recommended generators or ignore columns.
Within each sensitivity type, you can select or deselect individual columns.
You can use the checkbox in the column heading to select or deselect all of the columns for a sensitivity type.
To apply the recommended generator to the selected columns for a sensitivity type, click the Apply option for that sensitivity type.
When you apply the recommended generator, Structural removes the column from the list.
If the recommended generator is incorrect, then you can ignore the recommendation.
To ignore the recommended generator for the selected columns in a sensitivity type:
Click the Ignore option for the sensitivity type.
In the Ignore dropdown list, click Ignore generator recommendation.
When you ignore the generator recommendation:
The column is removed from the list.
The recommended generator is removed. This includes the recommendation on the Privacy Hub column configuration panel.
The column continues to be marked as sensitive.
Required workspace permission: Configure column sensitivity
You can mark selected columns for a sensitivity type as not sensitive. For example, a value might be correctly identified as a first name, but be a test value that is not actually sensitive and does not need to be transformed.
To mark selected columns in a sensitivity type as not sensitive:
Click the Ignore option for the sensitivity type.
In the Ignore dropdown list, click Mark as not sensitive.
When you mark a column as not sensitive, it is removed from the list.
To apply the recommended generators to all of the selected columns across all of the sensitivity types, click Apply All.
On Database View, the Bulk Edit option includes an option to apply the recommended generators to the selected columns for which there is an available recommendation.
From Database View, to apply recommended generators to multiple columns:
Check the checkbox for each column to update.
Click Bulk Edit.
On the bulk editing panel, click Apply Recommendations.
These hints and tips can help you to choose generators and address some specific use cases.
Tonic Structural provides several options for de-identifying the names of individuals. The method that you select depends on the specific use case, including the required realism of the output and privacy needs.
The following are a few of the generator options and how and why you might use them.
Rows of data often have multiple date or timestamp fields that have a logical dependency, such as START_DATE and END_DATE.
In this case, a randomly generated date is not viable, because it could produce a nonsensical output where events occur chronologically out of order.
The following generator options handle these scenarios:
Free text refers to text fields in the source database that might come from an "uncontrolled" source such as user text entry. In these cases, any record might or might not contain sensitive information.
Some possible examples include:
Notes from a doctor or healthcare provider that contain Protected Health Information (PHI)
Other personally identifiable information, such as a Social Security number or telephone number, that a user enters into an open-ended text entry form
Structural provides several suitable options. The method that you select depends on the specific use case, including the required realism of the output and any privacy requirements.
Here are a few generator options for free text fields, with information on how and why you might use them.
Null: If the field is nullable and the use case does not require any data in the field, you can use the Null generator to replace the values with NULL.
Constant: Allows you to provide a fixed value to replace all of the source values. For example, you could provide a "Lorem ipsum" string or other dummy value that is appropriate for your data set.
Custom Categorical: Similar to the Constant generator, it replaces the original value with a fixed value. To increase the cardinality of the output, you enter a list of possible values. The values are randomly used on the output records.
Most Structural generators preserve NULL values that are in the data.
They do not automatically preserve empty values.
To make sure that any empty values stay empty in the destination database:
For the default generator, select the generator to apply to the non-empty values.
Create a condition to look for empty values. You can either:
Use the regex comparison against the regex whitespace value (\s*).
Use the = operator and leave the value empty or empty except for a single space.
If you are not sure which characters the empty strings use, the regex option is more flexible. However, it is less efficient.
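The following sketch shows the idea behind the empty-value condition in Python. It assumes that the regex comparison is applied to the whole value, which the text above does not state, so it uses fullmatch. The function names are hypothetical.

```python
import re

EMPTY = re.compile(r"\s*")


def is_empty(value: str) -> bool:
    # Assumption: the pattern must match the entire value.
    return EMPTY.fullmatch(value) is not None


def transform(value, default_generator):
    if value is None:
        return None   # NULL values are preserved
    if is_empty(value):
        return value  # empty values pass through unchanged
    return default_generator(value)


for sample in ["", "   ", "\t", " a ", None]:
    print(repr(sample), "->", repr(transform(sample, str.upper)))
```

The pattern also treats tabs and other whitespace characters as empty, which is the extra flexibility of the regex option.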
Instead of creating separate path expressions for each path, you can use one or two path expressions that capture all of the values.
//text() gets all of the text nodes.
//@* gets all of the attribute values.
You apply the generator to each expression.
Sub-generators are applied sequentially. You can apply the wildcard paths in addition to more specific paths and generators.
When your XML includes namespaces, then to include the namespaces in the path expression, specify the elements as:
For example, for the following XML:
A working XPath to mask the name value is:
You might sometimes set default date values to the absolute minimum and maximum values that are allowed by the database. For example, for SQL Server, these values are January 1, 1753 and December 31, 9999.
To skip those default values and shift the other values:
Create conditions to look for the minimum or maximum values.
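The conditional logic can be sketched in Python as follows. This is an illustration only. The function shift_unless_sentinel and the fixed shift amount are hypothetical, and the sentinel values are the SQL Server values mentioned above.

```python
from datetime import date, timedelta

SQL_SERVER_MIN = date(1753, 1, 1)
SQL_SERVER_MAX = date(9999, 12, 31)


def shift_unless_sentinel(value: date, shift_days: int) -> date:
    # Condition: pass the default minimum and maximum values through.
    if value in (SQL_SERVER_MIN, SQL_SERVER_MAX):
        return value
    # Default: shift every other value.
    return value + timedelta(days=shift_days)


print(shift_unless_sentinel(date(2020, 1, 1), 5))     # 2020-01-06
print(shift_unless_sentinel(date(9999, 12, 31), 5))   # 9999-12-31
```

Without the condition, adding days to date(9999, 12, 31) raises OverflowError in Python, because the result is outside the supported date range.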
You might sometimes want to add values that are the output of a generator to the results of the transformation by another generator.
To accomplish this:
In addition to the capture groups that are specific to your data:
Use (^) as a capture group for a prefix.
Use ($) as a capture group for a suffix.
Use () as an empty group at any point in the regex pattern.
Apply the relevant generators to each capture group.
So to implement the example above (prefix with a constant, scramble the value, append a sequential integer), you provide the expression (^)(.*)()($).
This produces four capture groups:
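To see what the expression captures, the following Python sketch matches a sample value and rebuilds it from the groups. It is an illustration under assumptions: scramble_stub only reverses the text as a stand-in for a scrambling generator, the constant prefix "user_" is invented, and the empty middle group is left unchanged.

```python
import re
from itertools import count

PATTERN = re.compile(r"(^)(.*)()($)")


def scramble_stub(text: str) -> str:
    # Stand-in for a scrambling generator: reverses the text.
    return text[::-1]


def mask_username(value: str, sequence) -> str:
    match = PATTERN.fullmatch(value)
    if match is None:  # for example, a value that contains a newline
        return value
    prefix, body, middle, suffix = match.groups()
    # Group 1 (prefix) -> constant, group 2 -> scrambled value,
    # group 3 (empty) -> unchanged, group 4 (suffix) -> sequential integer.
    return "user_" + scramble_stub(body) + middle + str(next(sequence))


print(PATTERN.fullmatch("jsmith").groups())  # ('', 'jsmith', '', '')

seq = count(1)
print(mask_username("jsmith", seq))  # user_htimsj1
print(mask_username("adoe", seq))    # user_eoda2
```

Only the second group contains source text. The other three groups are empty positions where generator output can be inserted.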
Required license: Enterprise
On Basic or Professional instances, you select and configure generators separately for each column.
Required global permission: Create and manage generator presets
A generator preset is a saved configuration for a generator.
Tonic Structural provides a built-in preset for every generator. You can update the configuration of the built-in presets.
You can also create custom generator presets that have different configurations. For example, for the Address generator, you can have one generator preset to use for city columns, and another generator preset to use for full addresses. You can edit and delete the custom generator presets. The custom generator presets are available to assign to columns throughout the Structural instance.
Generator presets allow you to standardize the configuration for generators, and save your users from having to replicate the same configuration selections across different columns, tables, and workspaces. For example, you might modify the generator preset for the Integer Key generator to enable consistency. Whenever a user assigns the Integer Key generator to a column, consistency is enabled.
The Generator Presets view contains the list of built-in generator presets for the entire Structural instance. The configured presets are not specific to a workspace or a user.
To display the Generator Presets view, in the Tonic heading, click Generator Presets.
For each generator preset, the list provides the following information:
The name of the generator preset. For the built-in presets, the generator preset name always matches the generator name.
Whether the generator preset is built-in or custom.
The number of occurrences. Includes the number of occurrences that use the current baseline configuration, and the number of occurrences that have overrides to the baseline configuration.
An occurrence has an override if, after a user assigns the generator preset to a column, one of the following occurs:
A user changes the generator configuration options for that occurrence.
A user changes the baseline configuration for the generator preset.
When the preset configuration was most recently modified.
You cannot create or configure generator presets for generators that do not have any configuration options. For example, the Null generator does not have any configuration options.
For composite generators, you cannot create or configure generator presets from Generator Presets view. Generator Presets does not have access to data from which to create path expressions. You can create a new preset or update a preset baseline configuration from a column configuration panel in Privacy Hub, Database View, or Table View.
The list indicates when a generator does not allow you to configure a preset.
You can filter the list of generator presets by the preset name, whether it is built-in or custom, and by the underlying generator type.
To filter by the preset name, begin typing text from the name. As you type, Structural filters the list to only include the matching presets.
To filter the list based on whether the preset is built-in or custom:
Click Filter by Type.
In the dropdown list: To only include built-in presets, click Built-in. To only include custom presets, click Custom.
Structural adds the selection to the selected filters.
Every generator preset is based on a Structural generator type. For example, there is a built-in generator preset for the Address generator, and you can also create custom generator presets based on the Address generator.
To filter the list based on the generator type:
Click Filter by Generator.
In the generator list, click a generator to include. You can use the search field to search for a specific generator. When you click the generator name, Structural adds the generator to the selected filters.
You can sort the generator preset list by the preset name and by the modification date.
To sort the generator preset list by a column, click the column heading. To reverse the sort order, click the column heading again.
To create a new custom generator preset, you can either create a completely new preset, or copy an existing preset.
For composite generators such as JSON Mask, you cannot create a generator preset from Generator Presets view. Generator Presets view does not have access to data to use for path expressions. You can create presets for composite generators from a column configuration panel in Privacy Hub, Database View, or Table View.
You cannot create a custom preset at all for the AI Synthesizer, or for a generator that has no configuration options. For example, you cannot create a custom preset for the Null generator.
To create a completely new custom generator preset:
On the Generator Presets view, click Create Preset.
On the Create Preset panel, configure the generator preset.
Click Create.
When you copy an existing generator preset, the new generator preset by default inherits the configuration from the copied generator preset.
To copy an existing generator preset:
On the Generator Presets view, click the copy icon for the generator preset that you want to copy.
On the Copy Preset dialog, enter a name for the new generator preset, then click Copy. The new preset is added to the Generator Presets list, and the details panel is displayed to allow you to change the new preset configuration.
After you update the configuration, click Save and Apply.
On the confirmation panel, click Confirm.
To edit a preset, you must be either an editor or owner of at least one workspace in the Structural instance. If you are not an editor or owner of a workspace, then you can view the list of presets, but you cannot edit the presets.
When you change the configuration of a generator preset, the updated configuration becomes the new baseline configuration for the generator preset.
The baseline configuration is used whenever you select the generator preset. Existing occurrences of the generator preset keep their current configuration. You can reset those occurrences to use the current baseline configuration.
A change to the generator preset description is not considered a change to the baseline configuration.
For composite generators such as JSON Mask, you cannot update a generator preset from Generator Presets view. Generator Presets view does not have access to data to use for path expressions. You can update the baseline configuration from a column configuration panel in Privacy Hub, Database View, or Table View.
To update the baseline configuration of a generator preset:
On the Generator Presets view, click the edit icon for the preset.
On the Configuration tab of the Edit Preset panel, update the configuration. You cannot change the selected generator for the preset.
Click Save and Apply.
On the confirmation panel, click Confirm.
Each generator preset includes the following configuration:
Preset Name - The name of the generator preset. You cannot change the name of built-in presets. Built-in presets always use the generator name.
Preset Description - A longer description of the generator preset and how it is intended to be used.
Generator Type - Used to select the generator for a new generator preset. When you copy or edit a generator preset, you cannot change the selected generator type.
The following items are not included in the generator preset configuration. They are always configured for individual columns after you select the generator preset:
On the generator preset details panel, the Occurrences tab indicates where the generator preset is used. You can also see whether each occurrence overrides the current baseline configuration.
The Occurrences tab displays the list of workspaces that contain occurrences of the preset. Each workspace indicates the total number of occurrences that use the current baseline configuration and that have overrides to the current baseline configuration.
For workspaces that you have access to:
You can expand the workspace to display the list of columns that use the generator preset. For each column, the entry indicates whether the column uses the current baseline configuration.
You can click the Database View icon to navigate to Database View.
For workspaces that you do not have access to, you can only see the total number of occurrences. You cannot display the column list or navigate to Database View.
You can delete custom generator presets. You cannot delete built-in generator presets.
When you delete a custom generator preset, existing occurrences are assigned the built-in generator preset for that generator. If the current configuration does not match the baseline configuration for the built-in generator preset, then the occurrences also are marked as having overrides.
For example, a column is assigned a custom generator preset for the Name generator. The custom generator preset is deleted. The column is then assigned the built-in generator preset for the Name generator, and is marked as having overrides.
To delete a custom generator preset:
On the Generator Presets view, click the delete icon for the generator preset.
On the confirmation dialog, click Delete Preset.
Some data values require custom processing before or after the generator is applied.
If you require custom processing for data values, Tonic.ai can work with you to develop and deploy custom value processors for your Tonic Structural instance. Once a custom value processor is deployed, you can select the processor as part of the generator configuration for each column.
One common use case for custom processing is to decrypt source data before applying a generator, and encrypt destination data before writing it to the destination database.
Randomly returns a name from a dictionary of primarily Westernized names, unrelated to the original value. Can provide complete privacy, unless you use . The output is realistic because the values returned are real names.
This generator shuffles all of the values in the field while preserving the overall frequency of the values. It ensures that the output contains realistic-looking names, and that the output uses the names from the original data set. This can be beneficial if the original data contains, for example, names that are common to a particular region and that should be maintained. When you use this generator with the option, it ensures the output is secure from re-identification. However, if the source data set is small or each name is highly unique, Structural might prevent you from using this option.
Allows you to provide your own dictionary of values. These values are included in the output at the same frequency that the original values occur in the source data.
Randomly replaces characters with other characters. The output does not provide realistic-looking names, but it provides a high level of privacy that prevents recovery of the original data. It does preserve whitespace, punctuation (such as hyphenated names), and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
Similar to Character Scramble, but uses a single character mapping throughout the generated data. This reduces the privacy level, but ensures consistency and uniqueness. This generator also supports additional Unicode blocks, so that the output characters more closely match the input. This might be helpful if the input includes names with characters outside of the basic Latin (a-z, A-Z) characters.
(with )
To solve the problem described above, you ensure that two or more timestamps are randomly shifted by the same amount instead of independently from each other.
The key is to use the consistency option.
For example, a row of data represents an individual that is identified by a primary key of PERSON_ID. The row also contains START_DATE and END_DATE columns. You can apply a timestamp shift to the START_DATE and END_DATE columns within a desired range, and make both columns consistent to PERSON_ID.
Whenever the generator encounters the same PERSON_ID value, it shifts the dates by the same amount.
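The effect of a consistent shift can be sketched in Python. This is not how Structural implements consistency. The hashing scheme, the seed string, and the function names are all assumptions chosen only to show that the same key value always produces the same shift.

```python
import hashlib
from datetime import datetime, timedelta


def consistent_shift_days(key: str, max_days: int, seed: str = "demo-seed") -> int:
    # Derive a repeatable shift in [-max_days, max_days] from the key.
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_days


def shift(ts: datetime, key: str, max_days: int = 30) -> datetime:
    return ts + timedelta(days=consistent_shift_days(key, max_days))


person_id = "42"
start_date = datetime(2023, 1, 10)
end_date = datetime(2023, 2, 1)

new_start = shift(start_date, person_id)
new_end = shift(end_date, person_id)

# Both columns move by the same amount, so the order and the
# interval between the two dates are preserved.
print(new_end - new_start == end_date - start_date)  # True
```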
You can apply the Event Timestamps generator to multiple date columns on the same table. You can link them to follow the underlying distribution of dates. For more information, go to the blog post .
This generator can sometimes address the described problem. You can configure this generator to truncate the input to the year, month, day, hour, minute, or second. It guarantees that a secondary event does not occur BEFORE a primary event. However, truncation might cause them to become the same date value or timestamp. Whether you can use this generator for this purpose depends on the typical time separation between the two events relative to the truncation option, and whether truncation provides an adequate level of privacy for the particular use case.
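The trade-off can be seen in a short Python sketch that truncates to the month. The function truncate_to_month is a hypothetical helper, not a Structural API.

```python
from datetime import datetime


def truncate_to_month(ts: datetime) -> datetime:
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


start = datetime(2023, 3, 5, 9, 30)
end_same_month = datetime(2023, 3, 28, 17, 0)
end_next_month = datetime(2023, 4, 2, 8, 0)

# The order is never reversed, but close events can collapse together.
print(truncate_to_month(start) == truncate_to_month(end_same_month))  # True
print(truncate_to_month(start) < truncate_to_month(end_next_month))   # True
```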
Randomly replaces characters with other characters. The output does not contain meaningful text, but it provides a high level of privacy that prevents recovery of the original data. The Character Scramble generator does preserve whitespace, punctuation, and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
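A character-level replacement of this kind can be sketched as follows. This is a simplified illustration, not the generator's actual algorithm. It only handles ASCII letters and digits, and the choice to replace digits with digits is an assumption.

```python
import random
import string


def scramble(text: str, rng: random.Random) -> str:
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # whitespace, punctuation, and other characters
    return "".join(out)


source = "Mary-Jane O'Neil, room 42"
masked = scramble(source, random.Random())
print(masked)
print(len(masked) == len(source))  # True
```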
Uses regular expressions to parse strings. It then replaces specified substrings with the output of selected generators. The parts of the string to replace are specified in unnamed top-level capture groups.
The Regex Mask generator can preserve more realism of the underlying text, but introduces privacy risks. Any sensitive information that does not conform to a known and configured pattern is not captured and replaced.
As an example of matching specific formats, a configuration that includes the following two patterns would replace both telephone numbers that use the ###-###-#### format, and SSNs that use the ###-##-#### format, but leave the surrounding text unmodified:
SSN: ([0-9]{3}-[0-9]{2}-[0-9]{4})
Telephone Number: ([0-9]{3}-[0-9]{3}-[0-9]{4})
You can configure multiple regular expression patterns to handle all known or expected sensitive information formats. This method cannot replace values that a regular expression cannot reliably identify, such as names within free text.
When you use this option, make sure to enable Replace all matches for each pattern.
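To see how the two patterns behave, the following Python sketch applies them with the standard re module. It assumes that "SSN:" and "Telephone Number:" above are labels and that only the parenthesized expressions are the patterns. The mask_digits replacement is a stand-in for whichever generator you assign to the capture group. Python's sub replaces every match by default, which mirrors the Replace all matches setting.

```python
import re

# Patterns copied from the example; each has one top-level capture group.
SSN_PATTERN = re.compile(r"([0-9]{3}-[0-9]{2}-[0-9]{4})")
PHONE_PATTERN = re.compile(r"([0-9]{3}-[0-9]{3}-[0-9]{4})")


def mask_digits(match: re.Match) -> str:
    """Stand-in replacement: turn every digit of the captured group into '#'."""
    return re.sub(r"[0-9]", "#", match.group(1))


def regex_mask(text: str) -> str:
    """Apply each pattern in turn, replacing all matches and leaving other text as-is."""
    for pattern in (SSN_PATTERN, PHONE_PATTERN):
        text = pattern.sub(mask_digits, text)
    return text


masked = regex_mask("Call 555-123-4567 or 555-987-6543; SSN 123-45-6789.")
```

Text that matches neither pattern, such as a name next to the telephone number, passes through unchanged, which is the privacy risk noted above.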
, , and generators Each of these options provides the highest level of privacy, because they completely remove or replace the original text. You might use each one for different reasons:
Assign the generator to the column.
For the empty value condition, set the generator to .
You sometimes might want to apply the same generator to all of the text values in a JSON, HTML, or XML value. For example, you might want to apply the to all of the text.
For the or generator, the path expression $..* captures all of the text values. You can then select the generator to apply to the values.
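In JSONPath, $..* selects every descendant of the document root. The following Python sketch approximates the effect on a JSON value by walking the parsed document and transforming every string that it finds. The mask_all_text function and the sample document are hypothetical and only illustrate the idea. Non-text values, such as numbers, are left alone.

```python
import json


def mask_all_text(node, transform):
    """Recursively apply `transform` to every string value in a parsed JSON document."""
    if isinstance(node, dict):
        return {key: mask_all_text(value, transform) for key, value in node.items()}
    if isinstance(node, list):
        return [mask_all_text(value, transform) for value in node]
    if isinstance(node, str):
        return transform(node)
    return node  # numbers, booleans, and null are not text


doc = json.loads(
    '{"name": "Ana", "tags": ["vip", "beta"], "address": {"city": "Lyon", "zip": 69001}}'
)
masked_doc = mask_all_text(doc, lambda s: "x" * len(s))
```

The keys and the structure of the document are preserved. Only the string values change.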
For the and generators, you create two path expressions:
For example, one path expression references a specific name or address and uses the or generator. The wildcard path expressions use the generator to mask any unknown fields in the document that could contain sensitive information.
As another example, you might assign the generator to specific known fields that never contain sensitive information.
When you assign the generator, the minimum value cannot be shifted backward and the maximum value cannot be shifted forward.
Assign the generator to the column.
For the default generator, select the generator.
For those conditions, set the generator to .
For example, you use to mask a username. You might also want to prefix the value with a fixed constant value, or append a sequential integer.
Apply the generator to the column.
Group 0 is for the prefix. You assign the generator and provide the value to use as the prefix.
Group 1 captures all of the original values. You assign the generator.
Group 2 captures any empty values. You assign the generator to provide a value to use for those values.
Group 3 is for the suffix. You assign the generator.
For information about assigning and updating generator presets for a column, go to .
You can also view the .
Generator configuration - The configuration options for the selected generator. For details on the specific configuration options for each generator, go to the .
Structural data encryption allows you to set up decryption and encryption to apply to columns. For more information, go to .
Partitioning allows the value of a column to be based on the values of other related columns. It is one way to generate more realistic destination values.
The following generators support partitioning:
Note that partitioning cannot be configured as part of a generator preset. You can only configure partitioning when you configure a specific column.
To enable partitioning, from the Partition by dropdown list, you choose one or more columns to partition by.
You can only choose columns that have the generator set to Passthrough or Categorical.
For each value or combination of values in the partitioning columns, Tonic Structural generates a distribution of values for the original column.
For example, you assign the Continuous generator to an Income column, and partition it by an Occupation column. For each Occupation value, Structural generates a distribution of Income values. In other words, it generates a range of incomes for each occupation, such as Doctor and Construction Worker.
If you choose multiple columns, then the distribution is for each combination of column values. For example, you partition by both Occupation and Region. Structural creates a distribution of income values for each combination of occupation and region. So there is a distribution for Doctor and Northeast, and a different distribution for Doctor and Southeast.
In the destination database, Structural sets the value of the partitioned column to a value from the appropriate distribution. The distribution that Structural uses is based on the value of the partitioning columns in the destination database, not the original value of the partitioning columns in the source database.
To continue our example, assume that the Occupation column uses the Categorical generator. During data generation, Structural assigns to each record a random occupation value from the current values. For one of the records, the occupation value is Doctor in the source database and Construction Worker in the destination database.
For the Income column for that record, Structural assigns a value from the distribution of income values for the Construction Worker occupation. In other words, it assigns an income value that is realistic for the destination occupation value based on the source data.
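The following Python sketch walks through the same example. It is a simplification: it resamples the observed source values for each partition instead of fitting a distribution, and the income figures, function names, and random seed are invented for the illustration.

```python
import random
from collections import defaultdict


def build_distributions(rows, partition_cols, value_col):
    """Group the source values of `value_col` by each combination of partition values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_cols)
        groups[key].append(row[value_col])
    return dict(groups)


def sample_partitioned(distributions, destination_key, rng):
    """Pick a value from the distribution that matches the DESTINATION partition values."""
    return rng.choice(distributions[destination_key])


# Invented source data for the sketch.
source = [
    {"Occupation": "Doctor", "Income": 210000},
    {"Occupation": "Doctor", "Income": 185000},
    {"Occupation": "Construction Worker", "Income": 52000},
    {"Occupation": "Construction Worker", "Income": 61000},
]
distributions = build_distributions(source, ["Occupation"], "Income")

# The record was a Doctor in the source but is a Construction Worker in the destination.
rng = random.Random(7)
income = sample_partitioned(distributions, ("Construction Worker",), rng)
```

The lookup key is the destination occupation, so the generated income comes from the Construction Worker group even though the source record was a Doctor. To partition by both Occupation and Region, you would pass both column names, and the key becomes the pair of values.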
The partitioning option works well when you partition by only one or two columns.
To create a more complex model across several columns, instead of partitioning, use the AI Synthesizer.
A column that has a uniqueness constraint must have a unique value for every record.
Primary key columns automatically require uniqueness. Uniqueness can also be required for other columns. For example, in a users table, userid is the primary key column, but username also must be unique.
The following generators can be used with columns that have uniqueness constraints:
Generators that are applied to primary key columns are different from other generators in the following ways:
The generated data must be unique, so that it does not break constraints.
The generators are consistent (same input → same output), so that when the generator is applied to a primary key column and its linked foreign key columns, no links are broken.
This is accomplished using format preserving encryption.
For more information on this, and details on how to provide your own encryption key, contact support@tonic.ai.
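To show why an encryption-style mapping gives both properties, the following Python sketch builds a toy keyed permutation over fixed-width numeric keys. It uses a small Feistel network with cycle walking. This is an illustration of the general idea only. It is not Structural's implementation and not a standardized format-preserving encryption scheme, and it should not be used to protect real data. The key, the function names, and the 3-digit width are assumptions.

```python
import hashlib
import hmac

KEY = b"example-key"  # hypothetical key for the sketch


def _round(key: bytes, round_no: int, value: int, half_bits: int) -> int:
    """Keyed round function: HMAC of the round number and half-block, cut to half_bits."""
    msg = bytes([round_no]) + value.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def _feistel(key: bytes, value: int, half_bits: int, rounds: int = 4) -> int:
    """A Feistel network is a permutation of all 2 * half_bits-bit integers."""
    mask = (1 << half_bits) - 1
    left, right = value >> half_bits, value & mask
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right, half_bits)
    return (left << half_bits) | right


def encrypt_key(key: bytes, value: int, digits: int) -> int:
    """Map a `digits`-wide number to another `digits`-wide number, one-to-one."""
    domain = 10 ** digits
    if not 0 <= value < domain:
        raise ValueError("value is outside the key domain")
    half_bits = (domain.bit_length() + 1) // 2
    out = _feistel(key, value, half_bits)
    while out >= domain:  # cycle walking: re-encrypt until the result is in range
        out = _feistel(key, out, half_bits)
    return out


# The same mapping applied to a primary key and to a foreign key that references it.
user_id = 42
order_user_id = 42
new_user_id = encrypt_key(KEY, user_id, digits=3)
new_order_user_id = encrypt_key(KEY, order_user_id, digits=3)
```

Because the mapping is a permutation of the key domain, distinct inputs always produce distinct outputs, so uniqueness is preserved. Because it depends only on the key and the input, the foreign key value maps to the same output as the primary key value, so the relationship survives.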
You apply a primary key generator in the same way as you do any other generator.
Tonic Structural then automatically applies the same generator to all foreign key columns that reference the primary key.
Foreign keys are either defined by the source schema or added from the Foreign Key Relationships page. For more information, go to Viewing and adding foreign keys.
Structural currently supports the following generators for primary key columns:
ASCII Key - The ASCII Key generator does not preserve the format of the input value. It uses the ASCII alphabet for input and the alphanumeric alphabet for output. This leads to output values that are longer than the input values.
If you need support for additional types, contact support@tonic.ai.
Primary key generators are not supported in the Scale table mode. The process requires control over the key columns to make sure that all of the relationships are maintained.
You also cannot assign a primary key generator on a table that is related to a Scale mode table through a foreign key.