Using the AI Synthesizer
The AI Synthesizer generator is intended for use cases that require high-fidelity mimicked data. It can be used instead of the continuous or categorical generators.
This generator uses deep neural networks to learn models of your data, which can be sampled to generate new synthetic rows that faithfully mimic the statistical properties of your data.
The expressiveness of deep neural networks allows this generator to capture subtle relationships in the data that may be difficult to express using linking and partitioning generators. The relationships are learned from the data, instead of specified by the user.
Because this generator uses neural networks to learn from the data, performance is limited by the time required to train a model.
By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Tonic web server container, set the environment variable
true. See Setting environment variables.
Within each table, to configure the AI Synthesizer:
- 1. Assign the AI Synthesizer generator to the columns to use in the model. You also determine the type of data in each column.
- 2. Determine whether the table contains event data. For event data, you must select the primary entity and order columns.
For each table, you assign the AI Synthesizer generator to each column that you want to include in the trained model. AI Synthesizer trains one model per table.
You can assign the AI Synthesizer generator to columns that contain categorical, numeric, or location data. You cannot assign the AI Synthesizer to a datetime column.
Tonic identifies the type of the column, but you can make adjustments to these assignments. For example:
- A numeric column might actually be an enum, which would make it a categorical column.
- A city name might be designated categorical, but is actually a location.
On the generator configuration panel for the column, from the type dropdown list, select the column type.
Dropdown list to select the column type for an AI Synthesizer column
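When you review the detected column types, a quick cardinality check can help you spot numeric columns that behave like enums. The following is a minimal sketch, not how Tonic detects types. The function name `looks_categorical` and the threshold `max_distinct` are hypothetical and chosen only for illustration.

```python
def looks_categorical(values, max_distinct=20):
    """Return True when a numeric column has only a few distinct values.

    max_distinct is an arbitrary illustrative threshold, not a Tonic setting.
    """
    distinct = {v for v in values if v is not None}
    return 0 < len(distinct) <= max_distinct


status_codes = [1, 2, 2, 3, 1, 1, 3, 2]      # enum-like: three distinct codes
amounts = [i * 1.37 for i in range(1000)]    # continuous measurements

print(looks_categorical(status_codes))  # True
print(looks_categorical(amounts))       # False
```

A column that passes this check is a candidate for the categorical type. The final choice still depends on what the values mean.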
A table might contain event data, meaning that you want to preserve relationships between both rows and columns. For example, you might want to track financial transactions across time for each user.
To indicate that a table contains event data, on the generator dialog for any of the columns, check the Event Data checkbox.
Event Data checkbox on the AI Synthesizer configuration panel
The checkbox applies to the entire table.
For event data, you specify:
- The column to use to identify the row (primary entity). For example, to track activity for users, you might use a column that contains a user name or identifier.
- The column to use to sort the rows (order). This column should contain a numeric representation of a datetime value.
On the generator configuration panel:
- 1. To identify the current column as the primary entity, from the type dropdown list, select Primary Entity.
- 2. To identify the current column as the column to use for ordering, from the type dropdown list, select Order.
The Primary Entity and Order options are only available when Event Data is checked. The Order option is only available for numeric columns.
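Because the Order option requires a numeric column, source data that stores datetime values needs a numeric representation first. One common choice is seconds since the Unix epoch. The following sketch assumes ISO 8601 timestamp strings and treats timestamps without a time zone as UTC. The function name `to_epoch_seconds` is hypothetical, and Tonic does not require this specific encoding.

```python
from datetime import datetime, timezone


def to_epoch_seconds(timestamp: str) -> int:
    """Convert an ISO 8601 timestamp to whole seconds since the Unix epoch.

    Assumption: timestamps without a time zone are interpreted as UTC.
    """
    dt = datetime.fromisoformat(timestamp)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(to_epoch_seconds("2023-01-01T00:00:00"))  # 1672531200
```

Epoch seconds preserve chronological order, so sorting on the converted column produces the same sequence as sorting on the original datetime values.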
When the AI Synthesizer generator is assigned to at least one column in the table, then in Table View for that table, the AI Synthesizer panel displays.
AI Synthesizer panel for a table
The panel displays the list of columns that are included, and, for each column, the selected encoding type.
To remove a column, click the delete icon. The column is removed from the list, and the column generator is reset to Passthrough. For event data, if you remove the primary entity column or the order column, then you must assign that role to a different column.
To configure the model training, click the settings icon. The settings on the settings panel are slightly different depending on whether the model contains event data.
On the settings panel, the following parameters are common to all models:
- 1. In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.
- 2. In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.
- 3. In the Reconstruction Loss Factor field, enter the weight to apply to the reconstruction loss. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” function and a regularization term. A higher value can help to produce decoded samples that are close to the samples that were encoded, but also can make latent representations more complicated and reduce the diversity of synthetic samples.
- 4. In the Latent Dimension field, enter the dimension of the latent representation. The default is 128. The latent dimension represents the complexity of the data. If the specified value is much higher than the dimensionality of the problem that you want to model, it can reduce the quality of the results.
- 5. In the Maximum Categorical Dimension field, enter the maximum dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
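The effect of the Maximum Categorical Dimension setting can be illustrated with a small sketch. This is not Tonic's implementation. It assumes that the combined bucket occupies one of the available slots, and the names `build_vocabulary`, `one_hot`, and `OTHER` are hypothetical.

```python
from collections import Counter

OTHER = "__other__"  # hypothetical label for the combined bucket


def build_vocabulary(values, max_dim):
    """Keep the most frequent categories and fold the rest into one bucket."""
    counts = Counter(values)
    if len(counts) <= max_dim:
        return sorted(counts)
    # Assumption: the shared bucket uses one of the max_dim slots.
    top = [category for category, _ in counts.most_common(max_dim - 1)]
    return top + [OTHER]


def one_hot(value, vocabulary):
    """Encode a value as a one-hot vector over the vocabulary.

    Values outside the vocabulary map to the OTHER slot when it exists.
    """
    key = value if value in vocabulary else OTHER
    return [1 if category == key else 0 for category in vocabulary]


cities = ["Paris"] * 5 + ["Lima"] * 3 + ["Oslo"] * 2 + ["Kyiv", "Doha"]
vocabulary = build_vocabulary(cities, max_dim=3)

print(vocabulary)                    # ['Paris', 'Lima', '__other__']
print(one_hot("Paris", vocabulary))  # [1, 0, 0]
print(one_hot("Oslo", vocabulary))   # [0, 0, 1]
```

With five distinct cities and a limit of 3, the encoded width stays at 3 no matter how many rare cities the column contains.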
For event data, to configure the RNN-VAE Parameters:
- 1. In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Tonic considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.
- 2. In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.
- 3. In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.
- 4. In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.
- 5. In the Sequence Length Loss Factor field, enter the loss factor for the sequence length. The default is 2.0. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, the AI Synthesizer uses more of the model's capacity to capture the statistical properties of sequence lengths.
- 6. In the Order Column Loss Factor field, enter the loss factor for the order column values. The default is 1024.0. The order column loss factor determines how important it is to predict the order column values. Similar to the sequence length loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
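To make the roles of the primary entity, the order column, and the Maximum Sequence Length concrete, the following sketch groups event rows into per-entity sequences. It is an illustration only. The function name `build_sequences` and the sample column names are hypothetical, and the choice to keep the earliest events when truncating is an assumption, because the text above does not say which end of a long sequence is dropped.

```python
from collections import defaultdict


def build_sequences(rows, entity_key, order_key, max_length=20):
    """Group rows by entity, sort each group, and cap the sequence length."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[entity_key]].append(row)

    sequences = {}
    for entity, events in grouped.items():
        events.sort(key=lambda row: row[order_key])
        # Assumption: truncation keeps the earliest events in each sequence.
        sequences[entity] = events[:max_length]
    return sequences


transactions = [
    {"user_id": "a", "ts": 30, "amount": 12.0},
    {"user_id": "a", "ts": 10, "amount": 5.5},
    {"user_id": "b", "ts": 5, "amount": 99.0},
    {"user_id": "a", "ts": 20, "amount": 7.25},
]

sequences = build_sequences(transactions, "user_id", "ts", max_length=2)
print([row["ts"] for row in sequences["a"]])  # [10, 20]
print([row["ts"] for row in sequences["b"]])  # [5]
```

Here `user_id` plays the role of the primary entity and `ts` plays the role of the order column.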
For non-event data, to configure the VAE Parameters:
- 1. In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
- 2. In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
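The layer size format described above can be read as one integer per layer. The following sketch shows one way to interpret such a value. The function name `parse_layer_sizes` is hypothetical, and the validation is an assumption about reasonable input rather than a description of what the Tonic UI accepts.

```python
def parse_layer_sizes(text):
    """Parse a value such as "256,256,256" into a list of layer sizes."""
    sizes = []
    for part in text.split(","):
        part = part.strip()
        # Accept only plain non-negative integers.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"invalid layer size: {part!r}")
        sizes.append(int(part))
    return sizes


default_layers = parse_layer_sizes("256,256,256")
print(len(default_layers), default_layers)  # 3 [256, 256, 256]
```

A value such as `512, 256` would describe two layers, the first with size 512 and the second with size 256.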
In a child workspace, the AI Synthesizer panel under Model indicates whether the configuration is inherited from the parent workspace.
The inheritance stops if you make any changes to the AI Synthesizer configuration. When the configuration overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.
Model training starts when you start the generation job.
This can take some time, depending on the size of the table and the number of columns that use the AI Synthesizer generator.
For example, a table that has 30 AI Synthesizer columns and 200,000 rows can take 2.5 hours to train.
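The Epochs and Batch Size settings give a rough sense of how much work a training run involves. The following back-of-the-envelope sketch assumes that each epoch passes over every row once and that each training step consumes one batch. Event models may count examples differently, and the function name `training_steps` is hypothetical.

```python
import math


def training_steps(rows, batch_size=500, epochs=300):
    """Return (steps per epoch, total steps) for a training run."""
    steps_per_epoch = math.ceil(rows / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


print(training_steps(200_000))  # (400, 120000)
```

With the default settings, a table with 200,000 rows works out to 400 steps per epoch and 120,000 steps in total. Halving the number of epochs halves the total number of steps.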
The status information on the Jobs page includes the status of the model training.
After the model is trained, the new synthetic data is written to the destination database.