Required workspace permission: Configure, train, and export models
The data model configuration includes the following elements.
Run a SQL query
The query results provide the underlying data for the model.
Configure the model parameters
The model parameters guide the model training.
Adjust the column types
Update the column types for the model data as needed.
On the model configuration view, the Advanced tab at the left contains the options that Tonic Structural uses during the data training and generation process.
By default, models are tabular. A tabular model focuses on the relationships between columns.
However, a model might be event driven, meaning that you want to capture relationships across rows as well as between columns. For example, you might want to track financial transactions across time for each user.
For an event driven model, you specify:
The column to use to identify the row. For example, to track activity for users, you might use a column that contains a user name or identifier.
The column to use to sort the rows. Typically, this column contains a numeric representation of a datetime value.
Optionally, columns to use to provide conditions for sampling the data. When you sample the data, you specify the column values to use in the generated events. For example, you choose to condition the data based on a region column. When you sample the data, you can specify the regions for which to generate events.
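As a rough illustration of how these column roles relate to the data, the following sketch groups rows by an entity column and sorts each group by an order column. It is not Structural's implementation. The column names (user_id, event_time, region, amount) and the build_sequences helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; the column names are illustrative only.
rows = [
    {"user_id": "u1", "event_time": 1700000300, "region": "west", "amount": 12.5},
    {"user_id": "u2", "event_time": 1700000100, "region": "east", "amount": 40.0},
    {"user_id": "u1", "event_time": 1700000100, "region": "west", "amount": 7.25},
]


def build_sequences(rows, entity_col, order_col):
    """Group rows by the entity column, then sort each group by the order column."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[entity_col]].append(row)
    for events in sequences.values():
        events.sort(key=lambda r: r[order_col])
    return dict(sequences)


sequences = build_sequences(rows, entity_col="user_id", order_col="event_time")
for entity, events in sequences.items():
    print(entity, [e["amount"] for e in events])
```

In this sketch, region plays the part of a condition column: it has the same value for every event that belongs to a given user.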
To indicate that a model is event driven:
From the Model drop-down list, select Event Driven.
From the Primary Entity drop-down list, select the column to use to identify the row.
From the Order drop-down list, select the column to use to sort the rows. The order column can be a numeric column, a date column, or a datetime column.
Under Condition On, to configure a list of columns for conditional sampling:
To add a column, begin to type the column name. From the list of matching columns, select the column to add. You can use only categorical columns. The columns should also contain static data, meaning values that do not change from one event to the next. For example, for a transaction, the account type is static. It is not affected by the transaction. The transaction type and remaining balance are dynamic. They are specific to an individual transaction.
To remove a column, click its delete icon.
The parameters under General Parameters are common to all models:
In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.
Use the Early Stopping toggle to indicate whether to use early stopping for model training. If Early Stopping is turned on, then the model training does not have to run the full number of epochs. It stops running when the model begins to overfit to the training data. If Early Stopping is turned off, then the model training runs the full number of configured epochs.
In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.
In the Reconstruction Loss Factor field, enter the weight to apply to the reconstruction loss. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” function and a regularization term. A higher value can help to produce decoded samples that are close to the encoded samples, but can also make latent representations more complicated and reduce the diversity of synthetic samples.
In the Latent Dimension field, enter the dimension of the latent representation. The default is 128. The latent dimension represents the complexity of the data. If the specified value is much higher than the dimensionality of the problem that you want to analyze, it can reduce the quality of the results.
In the Maximum Categorical Dimension field, enter the dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
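The effect of Maximum Categorical Dimension can be sketched as follows. This is an illustration, not Structural's code. It assumes that the total one-hot length is capped at the configured value, so the most frequent categories take all but one slot and the remaining categories share the last slot. The function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(values, max_dim):
    """Map categories to one-hot slots, using at most max_dim slots.

    Returns (vocab, other_index). other_index is None when every
    category gets its own slot.
    """
    counts = Counter(values)
    if len(counts) <= max_dim:
        vocab = {cat: i for i, (cat, _) in enumerate(counts.most_common())}
        return vocab, None
    # Assumption: reserve the last slot for all of the less frequent categories.
    kept = [cat for cat, _ in counts.most_common(max_dim - 1)]
    vocab = {cat: i for i, cat in enumerate(kept)}
    return vocab, max_dim - 1


def one_hot(value, vocab, other_index):
    """Encode a single value as a one-hot list."""
    size = len(vocab) + (1 if other_index is not None else 0)
    index = vocab.get(value, other_index)
    if index is None:
        raise KeyError(f"unknown category: {value!r}")
    vector = [0] * size
    vector[index] = 1
    return vector


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
vocab, other_index = build_vocabulary(values, max_dim=3)
print(one_hot("a", vocab, other_index))
print(one_hot("d", vocab, other_index))
```

With four distinct categories and a limit of 3, the two most frequent categories keep their own slots, and the rare categories "c" and "d" share the final slot.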
For an event driven model, to configure the RNN-VAE Parameters:
In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Tonic considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.
In the Maximum Order Dimension field:
If the order column is numeric, then the order column is discretized. Set Maximum Order Dimension to the number of pieces to discretize the order column into.
If the order column is a date or datetime, set Maximum Order Dimension to the maximum number of distinct dates that the model considers. For datetime values, the time is ignored. If the number of dates in the data exceeds Maximum Order Dimension, then the model training fails.
In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.
In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.
In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.
In the Sequence Length Loss Factor field, enter the loss factor for sequencing for the model. The default is 128. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, Structural uses more of the model's capacity to capture the statistical properties of sequence lengths.
In the Order Column Loss Factor field, enter the loss factor for the order column values. The default is 128. The order column loss factor determines how important it is to predict the order column values. Similar to the sequence length loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
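To make Maximum Sequence Length and Maximum Order Dimension more concrete, here is a minimal sketch of the preprocessing that they describe. It is not Structural's implementation. It assumes that truncation keeps the earliest events and that numeric order values are split into equal-width bins. Neither detail is confirmed by this page, and the helper names are hypothetical.

```python
from datetime import datetime


def truncate_sequence(events, max_len):
    """Keep at most max_len events (assumption: the earliest ones are kept)."""
    return events[:max_len]


def discretize(values, num_bins):
    """Assign each numeric value to one of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def count_distinct_dates(datetimes):
    """Count distinct calendar dates, ignoring the time of day."""
    return len({d.date() for d in datetimes})


print(truncate_sequence(list(range(25)), max_len=20)[-1])
print(discretize([0, 5, 10], num_bins=2))
print(count_distinct_dates([
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 17, 30),
    datetime(2024, 1, 2, 9, 0),
]))
```

A check like count_distinct_dates can help you choose a Maximum Order Dimension that is at least as large as the number of distinct dates in the query results.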
For a tabular model, to configure the VAE Parameters:
In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
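The following sketch shows how a layer sizes value such as 256,256,256 can be read, and why more or larger layers increase model capacity. It assumes plain fully connected layers with bias terms, which this page does not confirm. The input dimension of 10 and both function names are hypothetical.

```python
def parse_layer_sizes(text):
    """Parse a comma-separated list such as '256,256,256' into integers."""
    sizes = [int(part.strip()) for part in text.split(",")]
    if any(size < 0 for size in sizes):
        raise ValueError("layer sizes must be non-negative integers")
    return sizes


def dense_parameter_count(input_dim, layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for size in layer_sizes:
        total += previous * size + size  # weight matrix plus bias vector
        previous = size
    return total


sizes = parse_layer_sizes("256,256,256")
print(len(sizes), "layers:", sizes)
print(dense_parameter_count(input_dim=10, layer_sizes=sizes))
```

Under these assumptions, the parameter count grows with both the number of layers and the size of each layer, which is one reason that larger configurations need more training data.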
In the query results, the column headings include the identified column type. The training process uses numeric, categorical, and location columns.
Numeric columns contain a number value.
Categorical columns contain a specific set of values. For example, a categorical column might identify the marital status of a person represented in the data.
Location columns identify a physical location. For example, a location column might contain a zip code or a city name.
Tonic Structural assigns initial types when it runs the query. Typically:
String columns are assigned as categorical.
Numeric columns are assigned as numeric.
Datetime value columns are assigned as datetime. Ideally, in your SQL query you converted datetime values to a numeric representation of time such as epoch time. The columns are then assigned as numeric.
You can make adjustments to these assignments. For example:
A numeric column might actually be an enum, which would make it a categorical column.
A city name might be designated categorical, but is actually a location.
To change the designation of a column:
Click the drop-down arrow next to the current type.
From the popup menu, select the type.
For columns other than numeric columns, you can designate the column as a categorical column or a location column.
For columns that were initially assigned as numeric, you can also restore the column type to numeric.
In the query editor, provide a SQL query to identify the subset of data to obtain from the source database. The query must be deterministic. It must return the same data every time it runs.
You can use the table and column list on the Source tab at the left as a reference. If you uploaded CSV files, then each file becomes a table, with the file name (minus the extension) as the table name. For example, you upload a file named my_model_data.csv. This becomes a table named my_model_data.
If the model contains event data, then make sure that the query results include a numeric column that can be used to sort the data based on a datetime value. You might need to transform a datetime column to use a numeric format.
To run the query, either click Run Query or press Shift-Enter. The query results are used to populate the table below the query editor and the Schema list on the model details view.
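As an illustration of a deterministic query that converts a datetime column to a numeric sort column, the following sketch runs against an in-memory SQLite database from Python. The transactions table, its columns, and the SQLite strftime('%s', ...) conversion are assumptions for the example. Your source database probably uses a different function to produce epoch time.

```python
import sqlite3

# Hypothetical source table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id TEXT, created_at TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("u1", "2024-01-02 10:00:00", 12.5),
        ("u2", "2024-01-01 09:30:00", 40.0),
        ("u1", "2024-01-01 08:00:00", 7.25),
    ],
)

# No random functions, and an explicit ORDER BY, so that repeated runs
# return the same rows in the same order.
QUERY = """
SELECT user_id,
       CAST(strftime('%s', created_at) AS INTEGER) AS event_time,
       amount
FROM transactions
ORDER BY user_id, event_time
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Here event_time is epoch seconds, so it can serve as the numeric column that sorts event data.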