Generators that are applied to primary key columns are different from other generators in the following ways:
The generated data must be unique, so that it does not violate uniqueness constraints.
The generators are consistent (same input → same output), so that when a generator is applied to a primary key column and to the foreign key columns that are linked to it, no links are broken.
This is accomplished using format-preserving encryption.
For more information on this, and details on how to provide your own encryption key, contact support@tonic.ai.
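The following sketch illustrates why an encryption-style mapping gives both properties at once. It is not Structural's implementation and it is not a secure cipher: it is a toy Feistel network over fixed-width decimal numbers, and the names `toy_encrypt_key` and `_round_value`, the key, and the digit width are all hypothetical. Because every Feistel round is invertible, the mapping is one-to-one (unique outputs), and because it depends only on the key and the input, it is consistent.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_index: int, half: int, modulus: int) -> int:
    """Keyed pseudo-random function used inside one Feistel round."""
    message = f"{round_index}:{half}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def toy_encrypt_key(value: int, key: bytes, digits: int = 6, rounds: int = 8) -> int:
    """Map a `digits`-digit integer to another `digits`-digit integer.

    Toy illustration only: deterministic and one-to-one, but NOT secure.
    """
    if digits % 2 != 0:
        raise ValueError("this sketch only handles an even number of digits")
    half_modulus = 10 ** (digits // 2)
    if not 0 <= value < half_modulus ** 2:
        raise ValueError("value does not fit in the requested number of digits")

    # Split the value into two halves and run a balanced Feistel network.
    left, right = divmod(value, half_modulus)
    for round_index in range(rounds):
        mixed = (left + _round_value(key, round_index, right, half_modulus)) % half_modulus
        left, right = right, mixed
    return left * half_modulus + right


key = b"example-key"  # hypothetical key, for illustration only
masked = toy_encrypt_key(123456, key)

# Consistent: the same primary key value always maps to the same output,
# so a foreign key column masked with the same key still matches.
assert masked == toy_encrypt_key(123456, key)
```

In this sketch, the output stays within the same numeric range as the input, which is the "format-preserving" part. A production system would use a vetted format-preserving encryption mode instead of this toy construction.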
You apply a primary key generator in the same way as you do any other generator.
Tonic Structural then automatically applies the same generator to all foreign key columns that reference the primary key.
Foreign keys are either defined by the source schema or added from the Foreign Key Relationships page. For more information, go to Viewing and adding foreign keys.
Structural currently supports the following generators for primary key columns:
ASCII Key: The ASCII Key generator does not preserve the format of the input value. It uses the ASCII alphabet for input and the alphanumeric alphabet for output. This leads to output values that are longer than the input values.
If you need support for additional types, contact support@tonic.ai.
Primary key generators are not supported in the Scale table mode. The process requires control over the key columns to make sure that all of the relationships are maintained.
You also cannot assign a primary key generator on a table that is related to a Scale mode table through a foreign key.
These topics describe groups of related generators that have similar functions and configurations.
Composite generators
Composite generators apply a generator to a specific data element or based on a condition.
Primary key generators
Learn about generators that you can apply to primary key columns.
Most Tonic Structural generators consume source data and perform an operation on it to produce destination data. For example, the Character Scramble generator takes the original data from the source database, replaces the letters and numbers with random letters and numbers, and then writes the result to the destination database.
Composite generators do not generate data directly.
Structural provides the following composite generators:
Most composite generators treat the input as structured data that the generator parses using a domain-specific syntax, such as:
XPath for XML or HTML
JSONPath for JSON or a Spark StructType
Regular expressions for text
These generators allow you to select a sub-value of the input, and then configure a specific generator to apply to only that sub-value. This means that you can take your original structured data and selectively mask the content.
For example, for the following structured content:
{ name: { first: "Tj", last: "Bass" } }
You configure the Name generator to replace the value of last. The result is something like:
{ name: { first: "Tj", last: "Pine" } }
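As a rough illustration of the idea, the sketch below selects one sub-value of a nested structure and applies a generator to only that sub-value. It does not use Structural or a real JSONPath library: the path is a simple list of keys, and `mask_sub_value` and `stand_in_name_generator` are hypothetical names. The stand-in generator returns a fixed value so that the result matches the example above.

```python
import copy
from typing import Any, Callable, Sequence


def mask_sub_value(
    document: dict,
    path: Sequence[str],
    generator: Callable[[Any], Any],
) -> dict:
    """Return a copy of `document` with the value at `path` replaced by generator(value)."""
    result = copy.deepcopy(document)  # leave the source data untouched
    parent = result
    for key in path[:-1]:
        parent = parent[key]
    parent[path[-1]] = generator(parent[path[-1]])
    return result


def stand_in_name_generator(_original: str) -> str:
    """Hypothetical stand-in for a name generator; always returns the same name."""
    return "Pine"


source = {"name": {"first": "Tj", "last": "Bass"}}
masked = mask_sub_value(source, ["name", "last"], stand_in_name_generator)
print(masked)  # {'name': {'first': 'Tj', 'last': 'Pine'}}
```

Only the selected sub-value changes. The rest of the structure, including first, passes through unmodified.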
The Conditional generator is slightly different. It applies a specific generator only when the column value matches a specific condition. For example, you can apply a Character Scramble generator only if the column value is something other than "test".
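The conditional behavior can be sketched as a wrapper that applies a generator only when a predicate on the value is true. This is an illustration, not Structural's implementation. The function names are hypothetical, and the scramble shown here (which keeps letter case and leaves punctuation alone) is an assumption about details that this page does not specify.

```python
import random
import string
from typing import Callable


def character_scramble(value: str, rng: random.Random) -> str:
    """Replace letters with random letters and digits with random digits."""
    scrambled = []
    for ch in value:
        if ch in string.ascii_lowercase:
            scrambled.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            scrambled.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            scrambled.append(rng.choice(string.digits))
        else:
            scrambled.append(ch)  # assumption: other characters are kept as-is
    return "".join(scrambled)


def apply_conditional(
    value: str,
    condition: Callable[[str], bool],
    generator: Callable[[str], str],
) -> str:
    """Apply `generator` only when `condition(value)` is true; otherwise pass through."""
    return generator(value) if condition(value) else value


rng = random.Random(0)


def is_not_test(value: str) -> bool:
    return value != "test"


def scramble(value: str) -> str:
    return character_scramble(value, rng)


unchanged = apply_conditional("test", is_not_test, scramble)      # stays "test"
scrambled = apply_conditional("Alice-42", is_not_test, scramble)  # letters and digits replaced
```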
You cannot configure generator presets for composite generators from the Generator Presets view. The Generator Presets view does not have access to data to use for path expressions or conditions. From a column configuration panel, you can save the current configuration as the new baseline configuration, and reset the configuration to the current baseline.
For any composite generator, when you select the generator to apply to a selected sub-value or based on a specified condition, you cannot select another composite generator. For example, you cannot apply a Conditional or XML Mask generator to the value of a specified path expression.
For composite generators other than the Conditional or Regex Mask generators, you cannot configure a sub-generator to be consistent with another column.
The AI Synthesizer generator is intended for use cases that require high-fidelity mimicked data. It can be used instead of the continuous or categorical generators.
This generator uses deep neural networks to learn models of your data, which can be sampled to generate new synthetic rows that faithfully mimic the statistical properties of your data.
The expressiveness of deep neural networks allows this generator to capture subtle relationships in the data that may be difficult to express using linking and partitioning generators. The relationships are learned from the data, instead of specified by the user.
Because this generator uses neural networks to learn from the data, performance is limited by the time required to train a model.
The privacy ranking is 3.
For the Tonic Structural API, the generator ID is NnGenerator.
By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Structural web server container, set the environment setting TONIC_NN_GENERATOR_ENABLED to true. For more information, go to Configuring environment settings.
Within each table, to configure the AI Synthesizer:
Assign the AI Synthesizer generator to the columns to use in the model. You also determine the type of data in each column.
Determine whether the table contains event data. For event data, you must select the primary entity and order columns.
For each table, you assign the AI Synthesizer generator to each column that you want to include in the trained model. AI Synthesizer trains one model per table.
You can assign the AI Synthesizer generator to columns that contain categorical, numeric, or location data. You cannot assign the AI Synthesizer to a datetime column.
Structural identifies the type of the column, but you can make adjustments to these assignments. For example:
A numeric column might actually be an enum, which would make it a categorical column.
A city name might be designated categorical, but is actually a location.
On the generator configuration panel for the column, from the type dropdown list, select the column type.
A table might contain event data, meaning that you want to preserve relationships between both rows and columns. For example, you might want to track financial transactions across time for each user.
To indicate that a table contains event data, on the generator dialog for any of the columns, check the Event Data checkbox.
The checkbox applies to the entire table.
For event data, you specify:
The column to use to identify the row (primary entity). For example, to track activity for users, you might use a column that contains a user name or identifier.
The column to use to sort the rows (order). This column should contain a numeric representation of a datetime value.
On the generator configuration panel:
To identify the current column as the primary entity, from the type dropdown list, select Primary Entity.
To identify the current column as the column to use for ordering, from the type dropdown list, select Order.
The Primary Entity and Order options are only available when Event Data is checked. The Order option is only available for numeric columns.
When the AI Synthesizer generator is assigned to at least one column in the table, then in Table View for that table, the AI Synthesizer panel displays.
The panel displays the list of columns that are included, and, for each column, the selected encoding type.
To remove a column, click the delete icon. The column is removed from the list, and the column generator is reset to Passthrough. For event data, if you remove the primary entity column or the order column, then you must assign that role to a different column.
To configure the model training, click the settings icon. The settings on the settings panel are slightly different depending on whether the model contains event data.
On the settings panel, the following parameters are common to all models:
In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.
In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.
In the Reconstruction Loss Factor field, enter the factor to apply to the reconstruction loss for the model. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” function and a regularization term. A higher value can help to produce decoded samples that are close to encoded samples, but also can make latent representations more complicated and reduce the diversity of synthetic samples.
In the Latent Dimension field, enter the dimension of the latent representation. The default is 128. This latent dimension represents the complexity of the data. If the specified value is much higher than the dimensionality of the data that you want to model, it can reduce the quality of the results.
In the Maximum Categorical Dimension field, enter the dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
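To make the Maximum Categorical Dimension behavior more concrete, the following sketch one-hot encodes a column, and gives a separate slot to the most frequent categories and one shared slot to everything else. It is an illustration only. The function names are hypothetical, and the choice to count the shared slot toward the limit (so that the vector width never exceeds the limit) is an assumption, not documented behavior.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def build_category_index(
    values: Sequence[Hashable], max_dim: int
) -> Tuple[Dict[Hashable, int], Optional[int]]:
    """Assign a slot to each kept category; return (index, shared_slot).

    shared_slot is None when every category fits within max_dim.
    """
    counts = Counter(values)
    if len(counts) <= max_dim:
        ranked = [category for category, _ in counts.most_common()]
        return {category: slot for slot, category in enumerate(ranked)}, None
    # Assumption: keep the (max_dim - 1) most frequent categories and
    # reserve the final slot for all of the remaining categories.
    ranked = [category for category, _ in counts.most_common(max_dim - 1)]
    return {category: slot for slot, category in enumerate(ranked)}, max_dim - 1


def one_hot(value: Hashable, index: Dict[Hashable, int], shared_slot: Optional[int]) -> List[int]:
    """Encode one value as a one-hot vector using the index built above."""
    width = len(index) + (1 if shared_slot is not None else 0)
    slot = index.get(value, shared_slot)
    if slot is None:
        raise KeyError(value)
    vector = [0] * width
    vector[slot] = 1
    return vector


column = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
index, shared_slot = build_category_index(column, max_dim=3)

print(one_hot("a", index, shared_slot))  # [1, 0, 0]
print(one_hot("c", index, shared_slot))  # [0, 0, 1]
print(one_hot("d", index, shared_slot))  # [0, 0, 1]
```

The rare categories c and d share a single slot, so the vector width stays at 3 no matter how many rare categories the column contains.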
For event data, to configure the RNN-VAE Parameters:
In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Structural considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.
In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.
In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.
In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.
In the Sequence Length Loss Factor field, enter the loss factor for sequencing for the model. The default is 2.0. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, the AI Synthesizer uses more of the model's capacity to capture the statistical properties of sequence lengths.
In the Order Column Loss Factor field, enter the loss factor for the column value order. The default is 1024.0. The order column loss factor determines how important it is to predict the order of the column values. Similar to the sequence length loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
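The roles of the primary entity column, the order column, and the maximum sequence length can be sketched as a preprocessing step: group rows by entity, sort each group by the order column, and truncate long sequences. This is an illustration under assumptions, not Structural's code. The function name and row layout are hypothetical, and keeping the earliest steps when truncating is an assumption, because this page does not say which end of a long sequence is dropped.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Sequence


def build_sequences(
    rows: Sequence[Dict[str, Any]],
    entity_column: str,
    order_column: str,
    max_sequence_length: int = 20,
) -> Dict[Hashable, List[Dict[str, Any]]]:
    """Group rows per entity, sort by the order column, and truncate each sequence."""
    grouped: Dict[Hashable, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        grouped[row[entity_column]].append(row)

    sequences = {}
    for entity, entity_rows in grouped.items():
        ordered = sorted(entity_rows, key=lambda row: row[order_column])
        # Assumption: keep the first max_sequence_length steps.
        sequences[entity] = ordered[:max_sequence_length]
    return sequences


transactions = [
    {"user_id": "u1", "timestamp": 3, "amount": 12.5},
    {"user_id": "u2", "timestamp": 1, "amount": 7.0},
    {"user_id": "u1", "timestamp": 1, "amount": 40.0},
    {"user_id": "u1", "timestamp": 2, "amount": 3.2},
]
sequences = build_sequences(transactions, "user_id", "timestamp", max_sequence_length=2)
print([row["timestamp"] for row in sequences["u1"]])  # [1, 2]
```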
For non-event data, to configure the VAE Parameters:
In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
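As a small illustration of how a value such as 256,256,256 describes both the number of layers and the size of each layer, the sketch below parses the comma-separated text into a list of integers. The function name is hypothetical, and the validation shown here is an assumption about reasonable input handling, not a description of how Structural parses the field.

```python
from typing import List


def parse_layer_sizes(text: str) -> List[int]:
    """Parse a comma-separated list such as '256,256,256' into layer sizes."""
    sizes = []
    for part in text.split(","):
        part = part.strip()
        # Accept plain ASCII digits only, which also rejects signs and empty entries.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a non-negative integer: {part!r}")
        sizes.append(int(part))
    return sizes


layers = parse_layer_sizes("256,256,256")
print(len(layers), layers)  # 3 [256, 256, 256]
```

The length of the list is the number of layers, and each entry is the size of the corresponding layer.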
In a child workspace, the AI Synthesizer panel under Model indicates whether the configuration is inherited from the parent workspace.
The inheritance stops if you make any changes to the AI Synthesizer configuration. When the configuration overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.
Model training starts when you start the generation job.
This can take some time, depending on the size of the table and the number of columns that use the AI Synthesizer generator.
For example, a table that has 30 AI Synthesizer columns and 200,000 rows can take 2.5 hours to train.
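To get a feel for how the Epochs and Batch Size settings relate to the amount of training work, the following back-of-the-envelope sketch counts training steps. It assumes that each training step processes one batch, that each epoch passes over every row once, and that a final partial batch still counts as a step. The function name is hypothetical, and the result is a step count, not a prediction of training time.

```python
import math


def training_steps(rows: int, batch_size: int = 500, epochs: int = 300) -> int:
    """Estimate the total number of training steps for one table."""
    steps_per_epoch = math.ceil(rows / batch_size)
    return steps_per_epoch * epochs


# 200,000 rows with the default settings:
# 400 steps per epoch * 300 epochs = 120,000 steps.
print(training_steps(200_000))  # 120000
```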
The status information on the Jobs page includes the status of the model training.
After the model is trained, the new synthetic data is written to the destination database.