Running a data generation job

Required workspace permission: Run data generation

The data generation job uses the configured table modes and generators to transform the data in the source database or source files. The transformed data is used to create the destination database or to write transformed files to file storage.

Types of data generation processes

Simple data generation

In the simplest type of data generation, Tonic Structural uses the configured table modes and generators to transform data in the source database and write the transformed data to the destination location. The destination location is usually a database server, but might also be:

  • A storage location such as an S3 bucket

  • A container repository

  • A Tonic Ephemeral snapshot

For a file connector workspace, the data generation job uses the configured generators for each file group to transform the data in the source files. The transformed data is used to create output files that correspond to the source files.

Subsetting data generation

When subsetting is enabled, Structural first identifies the tables and rows to include in the subset. It uses the configured table modes and generators to transform the data. It then writes the transformed data to the destination location.
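The order of these steps can be sketched in a few lines of Python. This is only an illustration of the sequence (select, transform, write), not Structural's implementation. The `generate_subset` helper, the row format, and the masking function are all hypothetical, and a real subset spans multiple related tables rather than a single list of rows.

```python
from typing import Callable

Row = dict[str, object]


def generate_subset(
    source_rows: list[Row],
    in_subset: Callable[[Row], bool],
    generators: dict[str, Callable[[object], object]],
) -> list[Row]:
    """Hypothetical helper: select rows, transform them, return the output rows."""
    # Step 1: identify the rows to include in the subset.
    selected = [row for row in source_rows if in_subset(row)]

    # Step 2: apply the generator assigned to each column, if there is one.
    # Columns without a generator pass through unchanged.
    transformed = [
        {
            column: generators[column](value) if column in generators else value
            for column, value in row.items()
        }
        for row in selected
    ]

    # Step 3: the returned list stands in for the write to the destination.
    return transformed


source = [
    {"id": 1, "email": "ann@example.com", "region": "EU"},
    {"id": 2, "email": "bob@example.com", "region": "US"},
]
subset = generate_subset(
    source,
    in_subset=lambda row: row["region"] == "EU",
    generators={"email": lambda value: "user@masked.invalid"},
)
```

In this sketch, the source rows are never modified. Only the selected and transformed copies reach the output.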

Upsert data generation

Required license: Professional or Enterprise

When upsert is enabled, Structural runs a data generation job that writes the transformed data to an intermediate database. The data generation can include subsetting.

After the initial data generation, Structural runs an upsert job to add or update the appropriate records from the intermediate database to the destination database. The upsert job only adds and updates records. It does not remove any records from previous data generation jobs.
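The add-or-update behavior can be illustrated with a minimal Python sketch. It assumes that records are matched on a single integer key and models each database as a dictionary. The `upsert` function is hypothetical and is not part of any Structural API.

```python
def upsert(destination: dict[int, dict], intermediate: dict[int, dict]) -> None:
    """Hypothetical upsert: add or update records by key, and never delete."""
    for key, record in intermediate.items():
        # A new key is added. An existing key is overwritten.
        destination[key] = dict(record)
    # Keys that exist only in the destination are left untouched.


destination = {1: {"name": "old-a"}, 2: {"name": "old-b"}}
intermediate = {2: {"name": "new-b"}, 3: {"name": "new-c"}}
upsert(destination, intermediate)
```

After the call, record 1 from the earlier run is still present, record 2 is updated, and record 3 is added.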

Before Structural can run an upsert job, the destination database must already exist and have the correct schema defined. To initialize the destination database:

  1. Disable upsert.

  2. Run a regular data generation.

  3. Re-enable upsert.

Selecting the data generation option

To start the data generation, at the top right of the workspace management view, click Generate Data.

As you configure the data generation options, Structural runs checks to verify that you can use the current configuration to generate data.

If any of these checks do not pass, then when you click Generate Data, Structural displays information about why you cannot run the data generation job.

If all of these checks pass and there are no warnings, then when you click Generate Data, the Confirm Generation panel displays.

Warning for non-conflicting schema changes

Data generation is always blocked by conflicting schema changes.

The workspace configuration includes whether to block data generation for all schema changes, including non-conflicting changes.

If this setting is turned off and there are non-conflicting schema changes, then when you click Generate Data, a warning displays. Non-conflicting schema changes include new tables and columns. If a new column contains sensitive data and you do not assign a generator to it before you generate data, then that sensitive data is written to the destination database.

If you are sure that the data in the new tables and columns is not sensitive, then to continue to the Confirm Generation panel, click Continue to Data Generation.
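The rules in this section amount to a small decision table, sketched below in Python. The function and enum names are hypothetical and exist only to summarize the behavior described above. They do not correspond to a Structural API.

```python
from enum import Enum


class Outcome(Enum):
    BLOCKED = "blocked"   # Data generation cannot start.
    WARNING = "warning"   # A warning displays; the user can choose to continue.
    CONFIRM = "confirm"   # The Confirm Generation panel displays directly.


def schema_change_outcome(
    has_conflicting: bool,
    has_non_conflicting: bool,
    block_all_changes: bool,
) -> Outcome:
    """Hypothetical summary of how schema changes affect Generate Data."""
    # Conflicting changes always block, regardless of the workspace setting.
    if has_conflicting:
        return Outcome.BLOCKED
    # Non-conflicting changes block only when the workspace is set to
    # block on all schema changes. Otherwise they produce a warning.
    if has_non_conflicting:
        return Outcome.BLOCKED if block_all_changes else Outcome.WARNING
    return Outcome.CONFIRM


result = schema_change_outcome(
    has_conflicting=False, has_non_conflicting=True, block_all_changes=False
)
```

Here `result` is `Outcome.WARNING`, which matches the case where new tables or columns exist and the workspace does not block on all schema changes.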

Confirming the generation details

The Confirm Generation panel allows you to confirm the details for the data generation. If subsetting is configured, you can determine whether to generate the subset. Structural can also provide tips on how to improve the data generation performance.

Indicating whether to generate a subset

If you configured subsetting, then you can indicate whether to only generate the subset.

To create a subset based on the current subsetting configuration, toggle Use Subsetting to the on position.

The initial setting matches the current setting in the subsetting configuration. If Use Subsetting is turned on in the Subsetting view, then it is on by default on the Confirm Generation panel.

When you change the setting on the Confirm Generation panel, the setting on the Subsetting view is also updated.

Indicating whether to use upsert

If upsert is enabled for the workspace, then you can also determine whether to use upsert for data generation.

If upsert is enabled for the workspace, then by default Use Upsert is in the on position.

To not use upsert, toggle Use Upsert to the off position. When upsert is turned off, the data generation is a simple data generation that directly populates and replaces the destination database.

PostgreSQL only - Enabling the new data generation process

Tonic.ai has released an improved version of the data generation process. We are enrolling Structural instances in the new process. Tonic.ai will contact you before we enroll your instance.

After your instance is enrolled, your PostgreSQL workspaces always use the new data generation process. For the new process, the job type is Data Pipeline Generation instead of Data Generation.

If your instance is not yet enrolled, then on the Confirm Generation panel, to use the new data generation process, toggle Data Pipeline V2 to the on position.

Verifying the intermediate database connection information (for upsert)

When upsert is enabled, the Confirm Generation panel provides access to the connection information for the intermediate database. To display the intermediate database connection details, click Intermediate Upsert Database.

If the intermediate database information is incorrect, to navigate to the workspace configuration view to make updates, click Edit Intermediate.

Viewing the destination location

The Confirm Generation panel provides the destination information for the workspace. To display the destination details, click Destination Settings.

Depending on the workspace configuration and data connector type, the destination information is either:

  • Connection information for a database server

  • A storage location such as an S3 bucket

  • Configuration for an Ephemeral snapshot

  • Information to create container artifacts

If the destination information is incorrect, to navigate to the workspace configuration view to make updates, click Edit Destination Settings.

For a file connector workspace, if the source files came from a local file system, then the destination files are written to the large file store in the Structural application database. You can download the most recently generated files.

Enabling diagnostic logging for the job

Required global permission: Enable diagnostic logging

If the data connector is not configured to use diagnostic logging, then you can choose whether to enable diagnostic logging for an individual data generation job. The option is also available for data connectors that do not have a diagnostic logging setting.

On the Confirm Generation panel, to enable diagnostic logging for the job, toggle Enable Diagnostic Logging to the on position.

Access to diagnostic logs is also controlled by the Enable diagnostic logging global permission. If you do not have this permission, then you cannot download diagnostic logs.

Viewing generation performance tips

For data generation, assigning Truncate table mode to tables that you don't need data for can improve generation performance.

For subsetting, if an upstream table is very large and its foreign key columns are not indexed, then the subsetting process can run more slowly.
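To see why an index on a foreign key column helps, the following sketch uses Python's built-in `sqlite3` module. The `customers` and `orders` tables and the index name are made up for illustration. The exact statement and the query planner output differ between database systems, so treat this as a generic demonstration and not as the command that Structural suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    """
)


def lookup_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's query plan text for a lookup on the foreign key column."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (1,)
    ).fetchall()
    # The last field of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in rows)


# Without an index, the lookup has to scan the whole orders table.
plan_before = lookup_plan(conn)

# Index the foreign key column (hypothetical index name).
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

# With the index in place, the planner can search the index instead.
plan_after = lookup_plan(conn)
```

Comparing `plan_before` and `plan_after` shows the planner switching from a table scan to a search that uses `idx_orders_customer_id`. On a very large table, that is the difference that affects how long row lookups by foreign key take.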

The Want faster generations? message displays at the bottom of the Confirm Generation panel. It displays for all non-subsetting jobs. For subsetting jobs, the message displays only if Structural identified columns that you should consider indexing.

To display information about tips for faster generation, click Generation Tips.

Viewing suggested columns to index

On the Generation Tips panel for subsetting jobs, the Add Indexes panel displays the first few columns that you might consider indexing.

To display a panel with a suggested SQL command to add the index, click the information icon next to the column.

On the panel, to copy the command to the clipboard, click Copy SQL to Clipboard.

If there are additional columns that are not listed, then to display the full list of columns to index, click Show all columns.

On the full list, to download the list to a CSV file, click Download list of columns (.csv).

Hint to truncate tables

On the Generation Tips panel for non-subsetting jobs, the Truncate Tables panel displays the hint to truncate tables that contain data that you do not need in the destination database.

To navigate to Database View to change the current configuration, click Go to Database View.

Starting the generation job

On the Confirm Generation panel, after you confirm the generation details, to start the data generation, click Run Generation.

When upsert is enabled, to start the data generation and upsert jobs:

  1. Click the Run Generation + Upsert button.

  2. In the menu, click Run Generation + Upsert.

Structural displays a notification that the job has started. To track the progress of the data generation job and view the results, click the View Job button on the notification, or go to the Job History view.

Starting an upsert job based on the most recent data generation

If upsert is enabled for a workspace, then on the Confirm Generation panel, the more common option is to run both data generation and upsert.

After you run at least one successful data generation to the intermediate database, you can also choose to run only the upsert process.

For example, if the data generation succeeds but the upsert process fails, then after you address the issues that caused the upsert to fail, you can run the upsert process again.

You also must start the upsert job manually if you turn off Automatically Start Upsert After Successful Data Generation in the workspace settings.

From the Confirm Generation panel, to run upsert only:

  1. Click the Run Generation + Upsert button.

  2. In the menu, click Run Upsert Only.

When you run upsert only, the process uses the results of the most recent data generation.

Issues that prevent data generation and subsetting

The following issues prevent a data generation or subsetting job.

Structural and workspace access or permissions

Source and destination data availability

Table mode and generator configuration

Schema changes

Subsetting configuration

The following errors occur when you attempt to generate a subset. They do not apply if Use Subsetting is turned off.

Upsert issues

When upsert is enabled, the following issues cause the upsert job to fail.
