Smart Linking

The Smart Linking generator is intended for use cases that require high-fidelity mimicked data. It uses deep neural networks to learn a model of your data, which can then be sampled to generate new synthetic rows that faithfully mimic the statistical properties of the original. The expressiveness of deep neural networks allows this generator to capture subtle relationships in the data that may be difficult to express with the linking and partitioning generators. Moreover, these relationships are learned from the data rather than specified by the user. Because the generator must train a neural network, its performance is limited by the time required to train the model.

Smart Linking can be used in place of the continuous and categorical generators to build high-fidelity synthetic data. Select multiple columns of numeric, categorical, or location data, then select the Smart Linking generator. Tonic trains a single model that represents all of the selected columns in a given table. Note that at most one Smart Linking model is trained per table.

Using the Smart Linking generator

Tonic identifies the type of each column automatically, but you can change these type annotations. For example, if a numeric column represents an enum, you can mark that column as categorical. Categorical columns can also be annotated as Location.

Annotating column type

Model training starts when you click Generate. Training may take some time, depending on the size of your table and the number of tables that use the Smart Linking generator. For example, a table with 30 Smart Linking columns and 200,000 rows can require 2.5 hours of training. You can check the status of model training in the Job view.

Job Status during model training

Once the model has been trained, the new synthetic data is written quickly to your output database.

Environment Variables

TONIC_NN_EPOCHS

Defaults to 300. This is the number of epochs, or iterations over the training data, that each Smart Linking model is trained for.
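
To make the meaning of an epoch concrete, the following is a minimal Python sketch, not Tonic's implementation. It reads TONIC_NN_EPOCHS with the documented default of 300 and estimates how many batch updates a training run would perform. The helper names and the batch size are hypothetical; this page does not state the batch size that Tonic uses.

```python
import math
import os


def read_epochs(env=None, default=300):
    """Return the epoch count from TONIC_NN_EPOCHS, or the documented default."""
    env = os.environ if env is None else env
    raw = env.get("TONIC_NN_EPOCHS")
    if raw is None or raw.strip() == "":
        return default
    epochs = int(raw)
    if epochs < 1:
        raise ValueError("TONIC_NN_EPOCHS must be a positive integer")
    return epochs


def total_training_steps(n_rows, epochs, batch_size):
    """One epoch is one full pass over the rows, processed batch by batch."""
    batches_per_epoch = math.ceil(n_rows / batch_size)
    return epochs * batches_per_epoch


# batch_size=500 is an assumption chosen only for illustration.
steps = total_training_steps(n_rows=200_000, epochs=read_epochs({}), batch_size=500)
```

Under these assumptions, the number of updates grows linearly with the epoch count, which is why lowering TONIC_NN_EPOCHS shortens training.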

TONIC_NN_MAX_CATEGORICAL_DIMENSION

Defaults to 35. This is the maximum number of distinct values in a categorical column that the generative model will try to model. If a column contains more distinct values than this cutoff, the less-frequent values are lumped together by the model. To preserve the character of your data, each lumped value in the output is then converted back to a value sampled from the combined values.
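
The lumping and restoring behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Tonic's actual code: the function names and the placeholder token are hypothetical, the sketch assumes the most frequent values up to the cutoff are kept as-is, and it assumes lumped values are restored by sampling in proportion to their original frequency.

```python
import random
from collections import Counter

OTHER = "__OTHER__"  # hypothetical placeholder for the lumped bucket


def lump_rare_categories(values, max_dimension=35):
    """Replace all but the most frequent values with a single bucket."""
    counts = Counter(values)
    if len(counts) <= max_dimension:
        return list(values), {}
    # Assumption: the top `max_dimension` values are modeled individually.
    kept = {value for value, _ in counts.most_common(max_dimension)}
    rare_counts = {v: c for v, c in counts.items() if v not in kept}
    lumped = [v if v in kept else OTHER for v in values]
    return lumped, rare_counts


def restore_rare_categories(lumped, rare_counts, rng=None):
    """Swap each bucket entry for a rare value sampled by original frequency."""
    if not rare_counts:
        return list(lumped)
    rng = rng or random.Random()
    population = list(rare_counts)
    weights = [rare_counts[v] for v in population]
    return [
        rng.choices(population, weights=weights)[0] if v == OTHER else v
        for v in lumped
    ]


column = ["a"] * 5 + ["b"] * 4 + ["c"] * 2 + ["d"]
lumped, rare = lump_rare_categories(column, max_dimension=2)
restored = restore_rare_categories(lumped, rare, random.Random(0))
```

In this sketch, "c" and "d" are lumped during modeling and then reappear in the output roughly in their original proportions, so rare values are not lost from the synthetic data.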