Using the AI Synthesizer

About the AI Synthesizer

The AI Synthesizer generator is intended for use cases that require high-fidelity mimicked data. It can be used instead of the continuous or categorical generators.

This generator uses deep neural networks to learn models of your data, which can be sampled to generate new synthetic rows that faithfully mimic the statistical properties of your data.

The expressiveness of deep neural networks allows this generator to capture subtle relationships in the data that may be difficult to express using linking and partitioning generators. The relationships are learned from the data, instead of specified by the user.

Because this generator uses neural networks to learn from the data, performance is limited by the time required to train a model.

The privacy ranking is 3.

For the Tonic Structural API, the generator ID is NnGenerator.

By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Structural web server container, set the environment setting TONIC_NN_GENERATOR_ENABLED to true. For more information, go to Configuring environment settings.

Overview of the AI Synthesizer configuration

Within each table, to configure the AI Synthesizer:

  1. Assign the AI Synthesizer generator to the columns to use in the model. You also determine the type of data in each column.

  2. Determine whether the table contains event data. For event data, you must select the primary entity and order columns.

Selecting the columns for the AI Synthesizer

For each table, you assign the AI Synthesizer generator to each column that you want to include in the trained model. AI Synthesizer trains one model per table.

You can assign the AI Synthesizer generator to columns that contain categorical, numeric, or location data. You cannot assign the AI Synthesizer to a datetime column.

Structural identifies the type of the column, but you can make adjustments to these assignments. For example:

  • A numeric column might actually be an enum, which would make it a categorical column.

  • A city name might be designated categorical, but is actually a location.

On the generator configuration panel for the column, from the type dropdown list, select the column type.
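
When you review the detected column types, a quick check of the source values can help you decide whether a numeric column is really an enum. The following Python sketch is only an illustration: the function name, the 20-value threshold, and the sample data are hypothetical, and this is not how Structural itself detects column types.

```python
def suggest_column_type(values, max_distinct=20):
    """Suggest "categorical" or "numeric" for a column of numeric values.

    Hypothetical heuristic: a small set of whole-number values often
    indicates an enum (for example, status codes), not a measurement.
    """
    distinct = set(values)
    all_whole_numbers = all(float(v).is_integer() for v in distinct)
    if all_whole_numbers and len(distinct) <= max_distinct:
        return "categorical"
    return "numeric"


# Made-up sample columns for illustration.
status_codes = [1, 2, 2, 3, 1, 1, 3, 2]
prices = [19.99, 5.25, 102.0, 7.5, 19.99]

print(suggest_column_type(status_codes))
print(suggest_column_type(prices))
```

A result of "categorical" is only a hint. The final choice still depends on what the values mean in your data.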

Indicating that a table contains event data

A table might contain event data, meaning that you want to preserve relationships between rows as well as between columns. For example, you might want to track financial transactions across time for each user.

To indicate that a table contains event data, on the generator dialog for any of the columns, check the Event Data checkbox.

The checkbox applies to the entire table.

For event data, you specify:

  • The column to use to identify the row (primary entity). For example, to track activity for users, you might use a column that contains a user name or identifier.

  • The column to use to sort the rows (order). This column should contain a numeric representation of a datetime value.

On the generator configuration panel:

  1. To identify the current column as the primary entity, from the type dropdown list, select Primary Entity.

  2. To identify the current column as the column to use for ordering, from the type dropdown list, select Order.

The Primary Entity and Order options are only available when Event Data is checked. The Order option is only available for numeric columns.
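
Because the order column must be numeric, a datetime value needs a numeric representation before it can be used for ordering. One common choice is Unix epoch seconds. The following Python sketch assumes ISO 8601 timestamp strings and treats timestamps without a time zone as UTC. The column names (user_id, occurred_at, order_value) are hypothetical.

```python
from datetime import datetime, timezone


def to_order_value(timestamp_text):
    """Convert an ISO 8601 timestamp string to Unix epoch seconds."""
    parsed = datetime.fromisoformat(timestamp_text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without a time zone are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# Made-up event rows. user_id would be the primary entity column,
# and order_value would be the order column.
rows = [
    {"user_id": "u1", "occurred_at": "2023-01-01T00:00:00"},
    {"user_id": "u1", "occurred_at": "2023-01-01T00:05:00"},
    {"user_id": "u2", "occurred_at": "2023-01-02T12:30:00"},
]
for row in rows:
    row["order_value"] = to_order_value(row["occurred_at"])

print(rows[0]["order_value"])  # 1672531200
```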

Configuring the model training for a table

When the AI Synthesizer generator is assigned to at least one column in the table, then in Table View for that table, the AI Synthesizer panel displays.

The panel displays the list of columns that are included, and, for each column, the selected encoding type.

To remove a column, click the delete icon. The column is removed from the list, and the column generator is reset to Passthrough. For event data, if you remove the primary entity column or the order column, then you must assign that role to a different column.

To configure the model training, click the settings icon. The settings on the settings panel are slightly different depending on whether the model contains event data.

Configuring general model parameters

On the settings panel, the following parameters are common to all models:

  1. In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.

  2. In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.

  3. In the Reconstruction Loss Factor field, enter the factor to apply to the reconstruction loss for the model. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” function and a regularization term. A higher value can help to produce decoded samples that are close to encoded samples, but also can make latent representations more complicated and reduce the diversity of synthetic samples.

  4. In the Latent Dimension field, enter the dimension of the latent representation. The default is 128. The latent dimension represents the complexity of the data. If the specified value is much higher than the dimensionality of the problem that you want to analyze, it can reduce the quality of the results.

  5. In the Maximum Categorical Dimension field, enter the dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
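
The effect of the Maximum Categorical Dimension parameter can be sketched in Python as follows. This is an illustration only, not the Structural implementation. It assumes that the limit counts the combined slot, so that the most frequent (limit minus 1) categories are kept and everything else shares one slot. The "__other__" label and the function name are hypothetical.

```python
from collections import Counter

OTHER = "__other__"  # Hypothetical label for the combined slot.


def bucket_categories(values, max_dimension=35):
    """Collapse infrequent categories so that at most max_dimension remain.

    Assumption: the combined slot counts toward the limit, so only the
    max_dimension - 1 most frequent categories keep their own slot.
    """
    counts = Counter(values)
    if len(counts) <= max_dimension:
        return list(values)
    keep = {category for category, _ in counts.most_common(max_dimension - 1)}
    return [value if value in keep else OTHER for value in values]


# Made-up column with five distinct cities and a limit of three.
cities = ["Paris"] * 5 + ["Rome"] * 4 + ["Oslo"] * 3 + ["Lima", "Doha"]
bucketed = bucket_categories(cities, max_dimension=3)
print(sorted(set(bucketed)))  # ['Paris', 'Rome', '__other__']
```

Each remaining category then corresponds to one position in a one-hot vector, which keeps the encoded width of the column bounded.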

Configuring RNN-VAE parameters (event data)

For event data, to configure the RNN-VAE Parameters:

  1. In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Structural considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.

  2. In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.

  3. In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.

  4. In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.

  5. In the Sequence Length Loss Factor field, enter the loss factor for sequencing for the model. The default is 2.0. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, the AI Synthesizer uses more of the model's capacity to capture the statistical properties of sequence lengths.

  6. In the Order Column Loss Factor field, enter the loss factor for the column value order. The default is 1024.0. The order column loss factor determines how important it is to predict the order of the column values. Similar to the sequence length loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
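
To make the Maximum Sequence Length setting more concrete, the following Python sketch groups event rows by the primary entity, sorts each group by the order column, and truncates long sequences. It is a sketch under stated assumptions, not the Structural implementation. In particular, the source does not say which end of a long sequence is dropped, so keeping the earliest events is an assumption. The column names are hypothetical.

```python
from collections import defaultdict


def build_sequences(rows, entity_key, order_key, max_sequence_length=20):
    """Group rows by entity, sort by order value, and truncate each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[entity_key]].append(row)

    sequences = {}
    for entity, events in grouped.items():
        events.sort(key=lambda event: event[order_key])
        # Assumption: truncation keeps the earliest events.
        sequences[entity] = events[:max_sequence_length]
    return sequences


# Made-up transactions. "user" is the primary entity and "ts" is the order.
transactions = [
    {"user": "u1", "ts": 30, "amount": 12.0},
    {"user": "u2", "ts": 10, "amount": 99.5},
    {"user": "u1", "ts": 10, "amount": 5.0},
    {"user": "u1", "ts": 20, "amount": 7.25},
]
sequences = build_sequences(transactions, "user", "ts", max_sequence_length=2)
print([event["ts"] for event in sequences["u1"]])  # [10, 20]
```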

Configuring VAE parameters (non-event data)

For non-event data, to configure the VAE Parameters:

  1. In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.

  2. In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
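
To see why more or larger layers call for more training data, it can help to estimate how many trainable parameters a layer list implies. The following Python sketch parses a value such as 256,256,256 and counts the weights and biases of a plain stack of fully connected layers. The input width of 40 is a made-up example, and the actual network in Structural might include components that this estimate ignores.

```python
def parse_layer_sizes(text):
    """Parse a comma-separated list such as "256,256,256" into integers."""
    sizes = [int(part.strip()) for part in text.split(",") if part.strip()]
    if any(size < 0 for size in sizes):
        raise ValueError("Layer sizes must be non-negative integers.")
    return sizes


def dense_parameter_count(input_dim, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for size in layer_sizes:
        total += previous * size + size  # Weights plus biases.
        previous = size
    return total


layers = parse_layer_sizes("256,256,256")
print(layers)  # [256, 256, 256]

# Hypothetical encoded input width of 40 features.
print(dense_parameter_count(40, layers))
```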

Restoring inheritance from the parent workspace

In a child workspace, the AI Synthesizer panel under Model indicates whether the configuration is inherited from the parent workspace.

The inheritance stops if you make any changes to the AI Synthesizer configuration. When the configuration overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.

Training the model

Model training starts when you start the generation job.

This can take some time, depending on the size of the table and the number of columns that use the AI Synthesizer generator.

For example, a table that has 30 AI Synthesizer columns and 200,000 rows can take 2.5 hours to train.
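
A rough way to reason about training time is to count training steps. The following Python sketch assumes non-event data, where each epoch passes over every row once in batches of the configured size. The function name is hypothetical, and the actual time per step depends on your hardware and model settings.

```python
import math


def training_steps(row_count, batch_size=500, epochs=300):
    """Return (steps per epoch, total steps) for the given settings."""
    steps_per_epoch = math.ceil(row_count / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# The 200,000-row example with the default batch size and epochs.
print(training_steps(200_000))  # (400, 120000)
```

Under these assumptions, halving the number of epochs halves the total number of steps, at the cost of a less thoroughly trained model.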

The status information on the Jobs page includes the status of the model training.

After the model is trained, the new synthetic data is written to the destination database.
