Managing Tonic Structural data generation performance

During Tonic Structural data generation, performance bottlenecks typically come from one of the following sources:

  • Network IO. Specifically, the bandwidth capacity of the network that connects Structural to the database instances.

  • Disk IO. The disk IO of the databases.

  • Tonic server and workspace configuration. Structural performs several complex data computations and transformations. Depending on your workspace selections, these tasks can take a long time to perform.

In most cases, slow data generation times are caused by disk IO and network IO.

Network IO

When possible, ensure that Structural has a fast network pipe between Structural and each source and destination database.

It is always advisable to install Structural on or near the hardware that runs your database instances.

Disk IO

This is normally limited by the database hardware.

If you run in a public cloud, you can configure options to access faster disks.

For SQL Server, you can increase your write speeds on your destination database. For details, go to SQL Server.

Reducing data loads

To reduce the required disk and network IO, you can copy less data from the source to the destination.

In some cases, you don't need the data from every table, or from specific columns within a table. Or you might be happy with the data that is already in the destination, and so you don't need to copy it again from the source.

Here are some tips to reduce the data load:

  • Put large tables that contain unneeded data into Truncate mode. In Truncate mode, Structural does not copy any of the table data to the destination database.

    For example, audit or transaction tables might not be needed for typical QA testing.

  • Avoid copying over large columns such as varchar(max), blob, XML, and JSON columns.

    If you do not need the data in a column, then to reduce the required IO, either:

    • If the column is nullable, apply a NULL generator.

    • Apply a Constant generator

  • For subsequent generation runs from the same source database:

    • For large tables that have not changed, use Preserve Destination mode. In Preserve Destination mode, Structural does not copy the table over, but instead uses the existing data in the destination database.

    • For large tables that have very few changes, use Incremental mode. In Incremental mode, Structural only copies over the changes that occurred since the previous generation.

Configuring parallel processing

When you believe that the Structural server is the bottleneck, then to improve performance, you can tune the following settings that control parallel processing.

You apply these settings as environment settings in your tonic_worker container. For more information on configuring environment settings, go to Configuring environment settings.

Settings that are not data connector-specific

The following settings are not limited to specific data connectors:

SettingDescription

TONIC_CONSTRAINT_PARALLELISM Default: 8

The number of constraints that a worker can apply in parallel during a job. You can configure this setting from Tonic Settings.

TONIC_PROCESS_PARALLELISM Default: 1

The number of threads to devote to performing the data transformations.

Certain Structural configurations can introduce CPU bottlenecks. This typically occurs when you configure composite generators such as JSON Mask or XML Mask with a large number of paths.

If your workspace has a very high number of generators, or a large number of JSON Mask, XML Mask, Integer Primary Key, or Alphanumeric Primary Key generators, then you should increase this value to at least 2. You can configure this setting from Tonic Settings.

TONIC_TABLE_PARALLELISM Default: 1

The number of tables that Structural operates on at the same time.

For subsetting, the number of subsetting steps that a worker processes in parallel during a subsetting job. For more information, go to Enabling parallel processing for subsetting.

If your Structural server has enough CPU, and your source and target databases are not fully utilized, then we recommend that you to increase this variable to 2.

Depending on your hardware, you can even increase it higher. You can configure this setting from Tonic Settings.

TONIC_WRITE_PARALLELISM Default: 2

The number of threads to devote to writing rows to the output database. For Data Pipeline V2 on PostgreSQL, this should be a factor of TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS. For example, if TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS is 8, then TONIC_WRITE_PARALLELISM should be 1, 2, or 4.

You can configure this setting from Tonic Settings.

Data connector-specific settings

The following settings apply to specific data connectors:

Setting and default valueDescription

TONIC_BIGQUERY_READ_PARALLELISM

Default: 2

Google BigQuery only.

The number of read threads per table for Google BigQuery.

TONIC_INDEX_RESTORATION_PARALLELISM

Default: 1

MySQL and PostgreSQL only.

At the end of the data generation run, the number of indexes to restore concurrently in the destination database.

TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS

Default: 16

Only applies to the Data Pipeline V2 data generation process for PostgreSQL.

The maximum number of connections to the destination database.

Each action requires at least one connection.

We recommend that you set this value to the number of CPUs on the destination database server.

You can configure this setting from Tonic Settings.

TONIC_JOBFLOW_MAX_SOURCE_CONNECTIONS

Default: 8

Only applies to the Data Pipeline V2 data generation process for PostgreSQL.

The maximum number of connections to the source database.

Each action requires at least one connection.

We recommend that you set this value to the number of CPUs on the source database server.

You can configure this setting from Tonic Settings.

TONIC_MYSQL_COPY_TABLE_WRITE_PARALLELISM

Default: 1

MySQL only.

The number of tables that a worker can copy in parallel during a job.

TONIC_ORACLE_DATA_PUMP_PARALLELISM

Default: 0

Oracle only, and only on Oracle Enterprise Edition databases.

The maximum number of processes of active execution for Data Pump to use.

TONIC_PARTITION_PARALLELISM Default: 1

MySQL and SQL Server only.

The number of table partitions per table that are read from concurrently during a job.

TONIC_READ_RANGES_PARALLELISM Default: 8

PostgreSQL only.

The number of ranges per table to read in parallel.

Last updated