Managing data generation performance

During Tonic Structural data generation, performance bottlenecks typically come from one of the following sources:

Network IO - Specifically, the bandwidth capacity of the network that connects Structural to the database instances.
Disk IO - The disk IO of the databases.
Tonic server and workspace configuration - Structural performs several complex data computations and transformations. Depending on your workspace selections, these tasks can take a long time to perform.

In most cases, slow data generation times are caused by disk IO and network IO.

Network IO

When possible, ensure that Structural has a fast network pipe between Structural and each source and destination database.

It is always advisable to install Structural on or near the hardware that runs your database instances.

Disk IO

This is normally limited by the database hardware.

If you run in a public cloud, you can configure options to access faster disks.

For SQL Server, you can increase your write speeds on your destination database. For details, go to SQL Server.

Reducing data loads

To reduce the required disk and network IO, you can copy less data from the source to the destination.

In some cases, you don't need the data from every table, or from specific columns within a table. Or you might be happy with the data that is already in the destination, and so you don't need to copy it again from the source.

Here are some tips to reduce the data load:

Put large tables that contain unneeded data into Truncate mode. In Truncate mode, Structural does not copy any of the table data to the destination database.
For example, audit or transaction tables might not be needed for typical QA testing.
Avoid copying over large columns such as varchar(max), blob, XML, and JSON columns.
If you do not need the data in a column, then to reduce the required IO, either:
- If the column is nullable, apply the NULL generator.
- Apply the Constant generator.
For subsequent generation runs from the same source database:
- For large tables that have not changed, use Preserve Destination mode. In Preserve Destination mode, Structural does not copy the table over, but instead uses the existing data in the destination database.
- For large tables that have very few changes, use Incremental mode. In Incremental mode, Structural only copies over the changes that occurred since the previous generation.

Configuring parallel processing

When you believe that the Structural server is the bottleneck, then to improve performance, you can tune the following settings that control parallel processing.

You apply these settings as environment settings in your tonic_worker container. For more information on configuring environment settings, go to Configuring environment settings.

You can also configure overrides for some of these settings in individual workspaces. For more information on how to configure override values, go to Advanced workspace overrides.

Settings that are not data connector-specific

The following settings are not limited to specific data connectors:

Setting

Description

TONIC_CONSTRAINT_PARALLELISM Default: 8

The number of constraints that a worker can apply in parallel during a job. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_PROCESS_PARALLELISM Default: 1

The number of threads to devote to performing the data transformations.

Certain Structural configurations can introduce CPU bottlenecks. This typically occurs when you configure composite generators such as JSON Mask or XML Mask with a large number of paths.

If your workspace has a very high number of generators, or a large number of JSON Mask, XML Mask, Integer Primary Key, or Alphanumeric Primary Key generators, then you should increase this value to at least 2. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_TABLE_PARALLELISM Default: 1

The number of tables that Structural operates on at the same time.

For subsetting, the number of subsetting steps that a worker processes in parallel during a subsetting job. For more information, go to Enabling parallel processing for subsetting.

If your Structural server has enough CPU, and your source and target databases are not fully utilized, then we recommend that you to increase this variable to 2.

Depending on your hardware, you can even increase it higher. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_WRITE_PARALLELISM Default: 2

The number of threads to devote to writing rows to the output database. For Data Pipeline V2 on PostgreSQL, this should be a factor of TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS. For example, if TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS is 8, then TONIC_WRITE_PARALLELISM should be 1, 2, or 4.

You can configure this setting from Structural Settings, and override it in individual workspaces.

Data connector-specific settings

The following settings apply to specific data connectors:

Setting and default value

Description

TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS

Default: 16

Only applies to the Data Pipeline V2 data generation process.

The maximum number of connections to the destination database.

Each action requires at least one connection.

We recommend that you set this value to the number of CPUs on the destination database server.

You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_JOBFLOW_MAX_SOURCE_CONNECTIONS

Default: 8

Only applies to the Data Pipeline V2 data generation process.

The maximum number of connections to the source database.

Each action requires at least one connection.

We recommend that you set this value to the number of CPUs on the source database server.

You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_READ_RANGES_PARALLELISM Default: 8

Snowflake only.

The number of ranges per table to read in parallel. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_BIGQUERY_READ_PARALLELISM

Default: 2

Google BigQuery only.

The number of read threads per table for Google BigQuery. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_MYSQL_COPY_TABLE_WRITE_PARALLELISM

Default: 1

MySQL only.

The number of tables that a worker can copy in parallel during a job.

TONIC_ORACLE_DATA_PUMP_PARALLELISM

Default: 0

Oracle only, and only on Oracle Enterprise Edition databases.

The maximum number of processes of active execution for Data Pump to use. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_PARTITION_PARALLELISM Default: 1

MySQL and SQL Server only.

The number of table partitions per table that are read from concurrently during a job. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_FILE_GROUP_PARALLELISM Default: 6

File connector only. The number of file groups that Structural operates on concurrently. You can configure this setting from Structural Settings, and override it in individual workspaces.

TONIC_FILE_GROUP_READ_SCHEMA_PARALLELISM Default: 4

File connector only. Only applies to cloud storage workspaces. The number of file groups that Structural reads the schemas of concurrently. Has the most effect on file types that have serialized schemas, such as Parquet and Avro.

TONIC_FILE_SAMPLE_PARALLELISM Default: 4

File connector only. Within a file group, the number of files that Structural samples concurrently. For very large files, a larger sample value can increase memory usage.

TONIC_FILE_READ_PARALLELISM Default: 2

File connector only. The number of files in a cloud storage file group that Structural fetches concurrently.

Last updated 3 days ago

Was this helpful?