1 of 11

Subsetting data

Required license: Professional or Enterprise

Required workspace permission: Configure subsetting

Subsetting generates a representative sample of your data in a way that maintains referential integrity.

You can also view this video overview of subsetting.

Subsetting overview and table filtering

Viewing and updating the subset configuration

Viewing subsetting steps and run results

Other resources for subset configuration

About subsetting

Required license: Professional or Enterprise

What is subsetting?

Subsetting allows you to intelligently reduce the size of your destination database. It takes a representative sample of the data that preserves the data's referential integrity.

You configure how Tonic Structural generates the subset. When you generate the output data, you decide whether to enable the subsetting process.

For example, you can configure subsetting to get 5% of all transactions, or all of the data that is associated with customers who live in California.

Here are a few examples where subsetting data might be important or necessary:

You want to use your production database in staging or test environments, without the PII. Because the database is very large, you want to use only a portion of it.
To reproduce a bug, you want a test database that contains a few specific rows from production, and related rows from other tables.
You want to share data with others, but you don’t want them to have all of it. For example, you can provide developers an anonymized subset that also enables them to run the test database locally on their local machines.

To learn more about our approach to subsetting, go to the following technical blog posts:

Components of subsetting

Subsetting uses foreign keys to determine the relationships in the data. These relationships enable the subsetting process to traverse the database as it builds the subset.

Foreign keys are either configured in your source data, or configured using the Structural virtual foreign key tool. For more information, go to Subsetting and foreign keys.

For subsetting, each table in the source database falls into one of the following categories:

Target tables
Lookup tables
Related tables
Out-of-subset tables

Target tables

Target tables are the seed tables that provide the initial set of rows to include in the subset. Structural retrieves the initial subset of data from the target tables. Structural then uses those rows to identify the information to pull from related tables.

A target table typically contains an important object that is well connected to everything else in the source data. For example, users, transactions, or claims. A subset should usually have a very small number of target tables.

When you identify a target table, you specify how to retrieve the subset of the data that you want from the table. You can request a percentage of the data, or use a WHERE clause to identify a specific subset of data.

For more information, go to Identifying and configuring target tables.

Lookup tables

A lookup table contains a static set of values that is used in other tables in your subset. For example, a lookup table might contain a list of postal codes or country names that are referenced in other tables.

Structural always retrieves all of the data in a lookup table. It does not check whether or where the lookup values are used.

It does not pull records from related tables based on lookup table values. Relationships with lookup tables are ignored during the subsetting process.

For more information, go to Identifying lookup tables.

Related tables are tables that are connected by direct or indirect relationships with a target table, and that are not identified as lookup tables.

Downstream tables have data that is required to maintain referential integrity for the subset. These tables have primary keys that are referenced by foreign keys in related tables.
Upstream tables contain data that has a foreign key that references a primary key in the target table. For large upstream tables, if the foreign key columns are not indexed, the subsetting process can be significantly slower. These upstream records are not required to maintain referential integrity, but can contain useful information. In the subset configuration, you can filter these upstream records either by date or by using a WHERE clause.

Some related tables are both downstream and upstream. In that case, you can provide a filter that applies only to the upstream records. Because the downstream records are required for referential integrity, they cannot be filtered.

For example, a transactions table contains a foreign key column to identify the customer. The value is the primary key of a record in the customers table. The customers table is downstream of the transactions table - the transaction data is incomplete without the customer information. The transactions table is upstream of the customers table.

Structural pulls data from related tables in order to preserve referential integrity in the output data subset.

In many cases, the relationship is direct. For example, a target table contains a list of events. The events table identifies the user that hosted the event. The user is identified using a foreign key relationship from the events table to the users table. The users table is a related table. The subset includes all the users that the events refer to.

The relationship also might be indirect. To continue the example, the events table identifies a user from the users table. The users table identifies the company that each user belongs to. The company is identified using a foreign key relationship from the users table to the companies table. The companies table is also a related table. The subset needs to include all of the companies that are referred to by the users that the events table refers to.

For an example of how Structural identifies related tables, view the example diagram in How Structural creates a subset.

Out-of-subset tables

Tables other than target tables, lookup tables, or related tables are not part of the subset.

By default, Structural copies only the table schema of out-of-subset tables. It does not populate any of the data.

You can also choose to process the tables using the table mode that is assigned to each table.

For more information, go to Determining how to process tables that are not in the subset.

How Structural creates a subset

Structural creates the subset before it applies any transformations to the source data.

To provide a basic overview of how Structural creates the subset, we'll use the following simple example schema:

The Events table is the target table for the subset. The Events table includes information about the event hosts (Hosts table) and the event venue (Venues table). For each host, the data includes the company that the host belongs to (Companies table).

The Attendees table includes the event that the attendee registered for.

The Hosts, Companies, Venues, and Attendees tables are all related tables for the subset.

The States table provides a lookup of state values to use for the company, venue, and attendee addresses. It is a lookup table for the subset. A subset always includes all of the data in a lookup table.

When you enable subsetting for a data generation job:

To create the basis of the subset, Structural gets data from the target tables based on the configured filters - either a percentage or a WHERE clause. In our example, Structural gets the subset of data from the Events table. Structural then traverses your database based on the relationships that originate from the target tables.
Structural first goes upstream. For the upstream process, Structural traverses through tables that reference a target table, based on the data collected in step 1. In other words, the value of the primary key for a target table record is the value of a foreign key column in the upstream table. This step continues until there are no remaining upstream tables to process. To continue our example, Structural retrieves the attendees for the event records that are in the subset.
Next, Structural goes downstream. Structural traverses all of the tables to look for foreign key columns for which the value is the primary key of an upstream table record. To continue our example, Structural retrieves the hosts and venues that are referred to in the event records that it retrieved in the first or second pass on the events table. It also retrieves the companies that are referred to in the host records. During this downstream step, Structural considers both upstream and downstream tables to ensure that the subset includes every connected table. For example, if the Venues table included a foreign key column that referenced a primary key from the Attendees table, Tonic would have to return to the Attendees table to get those attendee records.

Other notes about retrieving subset data

You might want to be aware of how Structural retrieves subset data in the following cases, which can result in either more or less data than you might expect.

If there are multiple target tables, and the tables are related to each other, Structural takes the union of the required data for both the target table configuration and the table relationships.

For example, table A contains a foreign key column that refers to table B. You configure both tables as target tables. For table B, Structural pulls both the directly targeted set of records, and the records that the targeted table A records refer to.

Tables that are upstream of multiple target tables

If a table is upstream of multiple target tables, then Structural only pulls records from that table that contain references to targeted records in all of the target tables.

For example, in related table Child1, column1 is a foreign key that refers to a primary key in target table Parent1. column2 is a foreign key that refers to a primary key in target table Parent2.

If column1 and column2 both refer to targeted records in Parent1 and Parent2, then that Child1 record is included in the subset. If only one of those columns refers to a targeted record in Parent1 or Parent2, then that Child1 record is not included.

Using table filtering for data warehouses and Spark-based data connectors

Some Tonic Structural data connectors do not support subsetting.

However, for the following connectors that do not support subsetting, to generate a smaller set of data, you instead can add table filters.

The following data connectors support both subsetting and table filtering:

You can only filter tables that use De-identify table mode. The filter identifies the rows from the source database to process and include in the destination database.

Note that unlike subsetting, table filters do not guarantee referential integrity.

To add a filter, in the Table Filter text area on the table mode panel, provide the WHERE clause for the filter, then click Apply.

For Databricks workspaces where the source database uses Delta files, the filter WHERE clause can only refer to columns that have partitions.

For Amazon EMR and Google BigQuery, the filter WHERE clause can refer to columns without partitions. However, the performance is better when the referenced columns have partitions.

On the workspace configuration for Amazon EMR and Databricks, the Enable partition filter validation toggle determines whether Structural validates the WHERE clause when you create it. By default, the toggle is in the on position, and the WHERE clause is validated.

For Amazon Redshift, Google BigQuery, and Snowflake, Structural always validates the WHERE clause.

Viewing the current subsetting configuration

To display Subsetting view, either:

On the workspace management view, in the workspace navigation bar, click Subsetting.
On Workspaces view, from the dropdown menu in the Name column, select Subsetting.
On Workspaces view, click the subsetting icon for the workspace.

The Configuration tab on Subsetting view shows the current subsetting configuration.

It consists of:

Subsetting summary
Table View and Graph View. Both views display the source data tables, show the current subsetting configuration, and allow you to update the configuration. Table View displays a tabular list of tables. Graph View displays a diagram that shows the relationships between the tables.
Configuration to enable subsetting for data generation.
Configuration for handling out-of-subset tables.
Results of the most recent subsetting data generation.

Subsetting summary

The panels at the top of the Configuration tab provide a clickable summary of the current subsetting configuration.

The summary includes the following values:

Target shows the number of target tables.
Lookup shows the number of lookup tables.
In Subset shows the number of tables that are in the subset. This includes target tables, lookup tables, and related tables.
Out of Subset shows the number of tables that are not in the subset.

When you click a summary panel:

On Table View, the table list is filtered to only display matching tables. For example, when you click Target, the list is filtered to only include target tables.
On Graph View, the matching tables are highlighted with a shadow behind the table objects.

Subsetting data generation results

After you run data generation with subsetting, on Table View, the Latest Results tab displays on the Configuration tab.

Before you run data generation with subsetting, the Latest Results tab does not display.

The Latest Results tab displays details for the most recent data generation with subsetting. It ignores data generation runs that do not use subsetting.

The subsetting results include:

The job status (successful, failed, canceled).
The amount of time it took to complete the run.
The percentage of the source data that is included in the subset destination data.
The volume of data in the source data and the subset destination data.
The percent reduction from the original source data to the subset destination data.
When the job began and ended.

To display the details for the data generation job, click View Job Details.

Table View

Information in the table list

The Configuration tab contains the list of tables in the source database. It shows how each table is affected by the most recently completed subsetting configuration.

For each table, the table list includes:

Whether the table is a target table, a lookup table, or a related table that is filtered.
Whether the table is in or out of the subset. Target and lookup tables are always in the subset. Related tables also are in the subset. Other tables are out of the subset.

Sorting the list

You can sort the list based on values in a selected column. To use a column to sort the list, click the column heading. To switch the sort order, click the column heading again.

The Sort by dropdown list provides the following options to sort the list:

Rows pre-subset - Sort by the number of rows in the table before subsetting.
Rows post-subset - Sort by the number of rows in the table after subsetting. Before you run a data generation job to create the subset, this value is unknown.
Inbound relationships - Sort the list based on the number of inbound relationships.
Outbound relationships - Sort the list based on the number of outbound relationships.
Total relationships - Sort the list based on the total number of inbound and outbound relationships.

By default, the drop-down sort options sort the table list in descending order. For example, when you select Rows pre-subset, the table that currently has the largest number of rows is at the top of the list. To change the sort direction, select the option again.

Filtering the list

You can filter the list based on:

The table name.
Whether the table is in or out of the subset.
Whether the table is a target or lookup table.

To filter by table name, in the filter field, begin to type text from the table name. As you type, the list is filtered to only include tables whose names contain the filter text.

To filter the list to show only target tables, lookup tables, in-subset tables, or out-of-subset tables, do one of the following:

Click the panel at the top of the tab.
From the Filter Tables drop-down list, select the filter option.

To remove a table subset status filter, click the delete icon.

You can combine a name filter and a table subset status filter. For example, you can filter the list to show in-subset tables that contain the text "test".

You cannot combine the table subset status filters. When you select a different filter, the current filter is replaced.

Graph View

Graph View displays a diagram of the source data tables and the relationships between them. It also indicates:

Whether each table is in the subset.
Whether the subsetting status for the table changed since the last subsetting data generation.

Information on the Graph View table blocks

Each table block provides the following information about the table:

At the top left:
- The name of the table.
- The name of the schema that contains the table.
At the top right, the status of the table in the context of the subset. A table might be a target table, a lookup table, a related table that is in the subset, or a table that is out of the subset.

How relationships display on Graph View

The Graph View diagram connects tables that are related to each other based on a foreign key relationship. The position of the tables indicates the type of relationship.

Tables that have an upstream relationship with another table are displayed above the table.
Tables that have a downstream relationship with another table are displayed below the table.

The Attendees table refers to the event for the attendee. Attendees is upstream of Events, and would display above the Events table in Graph View.
The Events table refers to a venue from the Venues table. Venues is downstream of Events, and would display below the Events table in Graph View.

Focusing on a specific table

To find and focus on a specific table:

In the search field, begin to type text in the table name. As you type, Tonic Structural filters the list to display matching tables.
When you see the table that you want, click the table name. Structural highlights the connections to other tables and displays the table details panel.

Other Graph View navigation options

To pan around the Graph View graph, click and drag.

To zoom in and out, use the navigation tools at the bottom left of Graph View.

How Structural determines the number of pre-subset and post-subset rows

For tables that contain fewer than 1,000 rows, the pre-subset number of rows is displayed as <1k.

For tables that are in the subset, the resulting rows are based on the target table and related table configuration.

If the data generation job hasn't run yet, or the details from the job are not yet available, then the number of rows after the subset is marked as unknown.

If you updated the configuration for a table since the most recent data generation, then on Table View, an information icon displays next to the post-subset value.

Viewing table details

When you click a table in either Table View or Graph View, the table details panel displays to the right of the table.

The table details include:

Whether the table is in the subset.
The number of outbound and inbound relationships.
The list of inbound and outbound relationships with other tables. When you click a table name, Structural selects and displays the details for that table.

Subsetting and foreign keys

How subsetting uses foreign keys

The subsetting process uses foreign keys to navigate the relationships in your data. It uses these relationships to identify the data to include in the subset.

Without foreign keys, Tonic Structural does not know how to navigate the relationships in your data. Properly configured foreign keys allow Structural to select the necessary rows from other tables, which ensures referential integrity.

Foreign keys are often set up directly within the source database. You can also set up virtual foreign keys within Structural. For example, a foreign key relationship might be missing, or your database might not use foreign keys. If your database uses polymorphic keys, then you must use the foreign key upload to add those keys manually.

For information about Foreign Keys view, including how to create and upload virtual foreign keys, go to Viewing and adding foreign keys.

Adding virtual foreign keys from Subsetting view

You can also add virtual foreign keys from Subsetting view.

For example, on Graph View, you might notice that a relationship between tables is missing. You can immediately add a virtual foreign key to establish that relationship.

To create a virtual foreign key from Subsetting view:

Display the table details panel for the table that contains the foreign key.
Click Create Virtual Foreign Key.
Under Foreign Key from this Table, select the column in the current table that contains the foreign key. To find the foreign key column, begin to type the column name.
Under Primary Key in Another Table, select the column that contains the primary key. To find the primary key column, begin to type the column name or the name of the table.
Click Save.

Identifying and managing circular dependencies

About circular dependencies

Foreign key relationships can sometimes have circular dependencies, also referred to as cyclical dependencies.

In the simplest case, a circular dependency occurs when two tables each contain a foreign key that references the other table. In the following example, the Employees table contains a department_id foreign key column that references the Departments table, and the Departments table contains a manager_id foreign key column that references the Employees table.

Circular dependencies can also come from a much longer chain of references, where you follow references through several tables before returning to the original table.

Circular dependencies can also occur when a table references itself. In the following example, the Employees table contains a manager_id foreign key column that contains an employee ID value from the id column.

How Structural breaks a circular dependency

During subsetting, if the circular dependency isn't broken, then there is an endless loop of going back and forth between the tables that reference each other.

To break a circular dependency, Structural identifies a foreign key column that is NULLable, and sets its values to NULL. When the process reaches a NULL value, it stops looking for additional related records. Structural applies the minimum number of NULL values that are needed to break the circular dependencies.

If none of the foreign key values are NULLable, then the circular dependency cannot be broken, and the subset generation fails.

How to identify when Structural breaks a circular dependency

Tonic can detect circular dependencies before you run subsetting.

When a table contains foreign keys that are part of a circular dependency that Structural breaks:

On Graph View, a Cycle Break marker is added to the table object. The marker includes the name of the foreign key column.

The table details panel also indicates that there is a cycle break and lists the affected columns.

Configuring subsetting

Identifying and configuring target tables

A target table is a table for which you specify a subset of the data to include in the destination database.

Options to specify the data subset

To identify the subset of data to include, you can either:

Specify a percentage of the table to include in the destination database. You can use this option when you care about the specific volume of data, but not the specific rows. Tonic Structural converts the percentage to a filter or a WHERE clause, depending on your database type. Depending on how your tables are related, the target tables in the final subset might contain more rows than the percentage that you specified. These additional rows are required to maintain referential integrity. To view the tables that contribute to the additional rows, see the subset steps. For additional assistance, reach out to your Tonic.ai contact.

Provide a WHERE clause to specify the subset of data to include in the destination database. The WHERE clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.

To combine a specified set of records with a random set of the remaining records, use a WHERE clause. For example, to get all users that are from Alabama, and 5 percent of the other records, use the following WHERE clause:

state = "Alabama" OR random() < 0.05

Configuring a percentage target table

To identify and configure subsetting for a target table:

In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Percentage).
In the Target Percentage field, type the percentage of the data to include in the destination database. The default is 5, which indicates to use 5% of the rows in the table. You can specify a decimal value, including a value that is less than 1. For example, you might configure the subset to include .5 percent of the rows, or 33.33 percent of the rows.

Configuring a WHERE clause target table

In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Where Clause).
In the Target Where Clause field, type the WHERE clause to use to identify the data to include in the destination database. For example, the target table contains a column called event_id. To select all rows where event_id is greater than 1000, add the following WHERE clause: event_id > 1000
For a more complex WHERE clause, you can display an editor with a larger text area.
1. Click Open in Editor.
2. In the text area, enter the WHERE clause.
3. Click Save.

You can query across tables within the WHERE clause. For example, you configure the customers table as a target table, but you also want to use information from the customers_legacy table to identify the target records in customers.

In the following example query, the matching records in customers have a Customer_Key value that matches a CustomerKey value in customers_legacy, and where the value of Occupation in customers_legacy is Detective:

"Customer_Key" IN (
    SELECT "Customer_Key"
    FROM customers_legacy
    WHERE customers_legacy."Customer_Key"=customers."Customer_Key"
    AND customers_legacy."Occupation" = 'Detective'
)

You can also create a query that selects a random percentage of a specified set of data.

For example, in PostgreSQL, to select 50% of the records that have an identifier that is divisible by 3, you could use the following WHERE clause:

id % 3 = 0 and random() < 0.5
order by id

Removing a target table

To remove a target table:

In Table View or Graph View, click the table.
On the table details panel, from the table type dropdown list, select Remove.

Identifying lookup tables

A lookup table contains a list of values that are used to populate columns in other tables. For example, a list of states, countries, or currencies. Lookup tables are sometimes referred to as reference tables.

Structural always copies lookup tables to the destination database in their entirety. If you do not configure a table as a lookup table, then Structural treats the table as a related table, and copies only rows that are used in the subset data. Structural also pulls in rows from other tables that refer to the table, but that are not necessarily related to the target tables. This could result in an unexpectedly large subset.

For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.

Properties of a lookup table

Here are some typical properties of a lookup table:

It is fairly small and rarely updated.
Many tables point to the table, but it does not point to another table.
The table contains a set of unique values.

Identifying an individual lookup table

To identify an individual table as a lookup table:

In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Lookup Table.

Identifying multiple lookup tables

To identify multiple tables as lookup tables:

On Table View, check the checkbox for each table to identify as a lookup table.
From the Actions dropdown list, select Add Lookup Tables.

Removing lookup tables

To remove the lookup designation for a table:

In Table View or Graph View, click the table.
On the table details panel, from the dropdown list, select Remove.

Filtering optional records

The Salesforce data connector does not support filtering optional records.

Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.

To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:

Use a date column to specify an amount of time before the current date for which to include records. For example, you can only include records for which the update date is one week before the current date.
Use a WHERE clause to identify the records to include.

You can filter a table that contains both upstream and downstream records. However, the filter only applies to the optional upstream records.

In the table list, when an upstream table is filtered, a Filtered icon displays.

Filtering by date

The date filter allows you to filter optional records based on the value of a date-based column.

To filter an upstream table by date:

In Table View or Graph View, click the table. On the table details panel, under Filter Optional Tables, Type is set by default to No Filter Applied, which indicates that the table is not filtered.
From the Type dropdown list, select Filter By Date Column.
From the Date Column dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.
Under Get data from the last, from the time unit dropdown list, select the unit of time to use for the filter. You can filter records based on their age in days, weeks, months, or years.
In the field, enter the number of the selected unit before the current date for which to include the upstream records. For example, you select days as the unit, and set the number to 4. Structural then pulls related records for which the date column value is up to 4 days before the current date.

Filtering with a WHERE clause

To filter the upstream records, you can also use a WHERE clause.

To use a WHERE clause to filter the upstream records:

In Table View or Graph View, click the table. On the table details panel, under Filter Optional Tables, Type is set by default to No Filter Applied, which indicates that the table is not filtered.
From the Type dropdown list, select Filter by Where Clause.
In the Where Clause text area, enter the WHERE clause to use to filter the related records.
Note that if the WHERE clause evaluates to false, then Structural excludes all of the upstream records from the table. For most data connectors, to exclude all of the upstream records, you can simply set the text of the WHERE clause to false. Otherwise, use an obviously false statement such as 1=0.
For a more complex WHERE clause, you can display an editor with a larger text area.
1. Click Open in Editor.
2. In the text area, enter the WHERE clause.
3. Click Save.
To copy the WHERE clause to the clipboard, click Copy To Clipboard.

Removing a filter

To remove an upstream filter, from the Type dropdown list, select No filter applied.

Identifying configuration changes since the most recent subsetting run

As you make changes to the subsetting configuration, Table View and Graph View indicate how the changes affect the next run of the subsetting generation when compared to the most recent subsetting generation.

When a table's inclusion in the subset is affected, on Graph View, a colored marker is added to the bottom of the table box.

On Table View, a colored icon displays next to the table. A tooltip indicates the type of change.

The possible types of changes are:

Added to the subset. For example:
- A new target table
- A table that is newly included because it is related to a new target table
- A new lookup table
Removed from the subset. For example:
- A removed target table
- A table that is removed because it is related to a removed target table
- A removed lookup table
Modified in the subset. This usually reflects a change to a target table configuration. You might:
- Change the type of target table (percentage or WHERE clause)
- Change the percentage
- Change the WHERE clause
- Change the upstream filter

When you run a subsetting generation, Tonic clears the markers.

Determining how to process tables that are not in the subset

Tables other than target tables, lookup tables, or related tables are not in the subset.

The subsetting configuration includes how to copy all of these tables to the destination database.

You can either:

Use the table modes that are assigned to the out-of-subset tables.
Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.

On Table View, on the Configuration tab, you use the Process tables that are out of subset toggle to determine how to handle these tables. After you run subsetting data generation, the toggle is on the Options tab.

By default, the setting is turned off, and Structural truncates the out-of-subset tables.

To use the assigned table mode to process each table, toggle the setting to the on position.

Determining whether to use subsetting during data generation

If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.

On Table View, on the Configuration tab, the Use Subsetting toggle indicates whether to generate a subset. After you run subsetting data generation, the Use Subsetting toggle is on the Options tab.

By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination data dataset.

To instead generate a subset, toggle Use subsetting to the on position.

When you run a data generation job, you are also prompted to confirm whether to generate the entire dataset or a subset. These two toggles are synchronized. If you turn on the Use Subsetting toggle on the Configuration tab, then it is on by default on the generation confirmation panel.

Enabling parallel processing for subsetting

You can sometimes use parallel processing to improve the performance of the subsetting process. Parallel processing allows multiple subsetting steps to be processed at the same time. The steps cannot rely on the output of other steps that are processed in parallel.

To enable parallel processing for subsetting, set the environment setting TONIC_TABLE_PARALLELISM to a number greater than 1 (the default). You can configure this setting from Structural Settings. This setting determines the maximum number of subsetting steps that Structural can process in parallel. For regular data generation, it also determines the number of tables that Structural operates on at the same time.

The effect of subsetting parallelism on performance depends on:

Your subsetting configuration.
The layout of your schema.
The performance characteristics of the machine that runs Structural.
The performance characteristics of your databases.

We recommend that you start with a relatively small number such as 4, and then run a data generation job to see how it affects performance. If performance improves, you can increase the number incrementally until the performance no longer improves.

The environment setting only controls the maximum number of steps that can be processed in parallel. Performance should not degrade if your system cannot support parallelism or won't benefit from using it.

If you have any other questions, contact support@tonic.ai.

Viewing and managing configuration inheritance

Required license: Enterprise

By default, a child workspace inherits the subsetting configuration from its parent workspace. Any changes to the subsetting configuration in the parent workspace are copied to the child workspace.

If you make any change to the subsetting configuration in the child workspace, then it no longer inherits the subsetting configuration. The changes to the parent workspace do not affect the child workspace.

You can reset the child workspace to restore the inheritance.

Viewing and updating the inheritance status for a child workspace

For a child workspace, the Configuration tab on Table View indicates whether the child workspace currently inherits the configuration from the parent workspace.

Inherits parent configuration means that the child workspace inherits the parent configuration.

Overrides parent configuration means that the child workspace overrides the parent configuration. Changes to the parent configuration are not copied to the child workspace.

For a child workspace that overrides its parent configuration, you can reset the inheritance. When you reset the inheritance, the overrides are removed from the subsetting configuration. Changes to the parent configuration are once again copied to the child workspace.

To reset the inheritance, in the Overrides parent configuration notice, click Reset, then on the confirmation dialog, click Reset again.

The configuration overrides are removed. The child workspace inherits any subsequent configuration changes from the parent workspace.

Viewing the inheritance status for children of a parent workspace

For a parent workspace, you can view the current inheritance status of all of the child workspaces. When you change the subsetting configuration for a parent workspace, it applies to the child workspaces that do not override the subsetting configuration.

The Child Workspaces tab contains the list of child workspaces.

For each workspace, the list includes:

The workspace name.
The inheritance status. Inheriting indicates that the child workspace inherits the configuration from the parent. Overriding indicates that the child workspace overrides the configuration and does not inherit it from the parent.
Your role in the child workspace.
The owner of the child workspace.

You cannot reset the inheritance status from the Child Workspaces tab. If you have access to a child workspace, to switch to that workspace, click the arrow icon in the rightmost column.

Viewing the subset creation steps

The Subset Steps tab outlines the steps that Tonic Structural uses to create the subset based on the current configuration.

The steps include the processing of the target, lookup, and related tables. However, the list is not necessarily a one-to-one correspondence with the in-subset tables. To satisfy referential integrity, a table might appear in multiple steps.

The steps do not include the out-of-subset tables.

Each step includes:

Action Step and Table Name

Identifies the table, and indicates whether the table is a target table or a lookup table.

Status

For target tables and lookup tables, Status is Direct. This indicates that the subsetting process pulls data directly from the table. For target tables, this is based on the percentage or WHERE clause. For lookup tables, the subsetting process copies the entire table. For related tables, the status is either Downstream or Upstream.

Contributing Tables

For related tables, the Contributing Tables column indicates the number of tables that affect the data that Structural pulls from the table. To display the contributing tables and how the current table is affected by those tables, click the information icon.

Source/Destination Rows

The number of rows in the source data and in the subset. For tables that contain fewer than 1,000 rows, the pre-subset value is <1k. Before you run the data generation, the number of rows in the subset is unknown. Otherwise, the number reflects the results of the most recent data generation.

Viewing previous subsetting data generation runs

The Configuration tab displays the results of the most recent subsetting run, as well as the subsetting configuration that was in place during each run. You can use the Previous Subsetting Runs tab to view details for any of the previous 100 subsetting runs.

You can use information about previous runs to see how changes to the subsetting configuration affected the subset results.

Selecting the run to view

From the Select a previous subset run dropdown list, select the subsetting run to display the details for.

The runs are identified by the run date and time.

Information displayed for the selected run

The details for a selected run include the following:

A summary of the run results.
Details about the subsetting configuration that was in place at the time of the run.

Summary of the run results

The panel at the left of the tab summarizes the results of the selected run. The run summary includes the following:

The status of the run (successful, failed, canceled).
The amount of time it took to complete the run.
For successful runs:
- The percentage of source data included in the destination database
- The volume of data in the source database
- The volume of data in the destination database

To display the job details for the run, click View Job Details.

Subsetting configuration

The subsetting configuration reflects the configuration that was in place at the time of the selected run. It is read-only. To make adjustments to the subsetting configuration, return to the Configuration tab.

Previous Subsetting Runs only displays Table View.

The panels above the table list show the number of target tables, lookup tables, related tables, and out-of-subset tables.

For each table, the list indicates whether the table is in the subset. It identifies the target and lookup tables. To view the subset configuration for an individual table, click the row.

Generating cohesive subset data from related databases

Your data might be stored in separate but related databases. In Tonic Structural, each database provides the source data for a different workspace.

For example, a Users database contains a list of users. Each service also has a separate database. The Service1 and Service2 databases refer to identifiers of users from the Users database, but there are no direct foreign key relationships.

When you generate a subset from each database, you might want to ensure that the resulting data is complete and cohesive. For example, your application connects to and pulls data from each database. This means that your end-to-end testing also requires corresponding data from each database.

To continue the previous example, you generate subsets from the Users, Service1, and Service2 databases. Your application pulls data from each database. For the data to be complete and have referential integrity, the subsets from the Service1 and Service2 service databases should only contain records that refer to the users in the subset from the Users database.

Here are some options to generate subset from separate databases to produce data that is complete and cohesive:

Use deterministic WHERE clauses
Use pre-computed record lists
Run the subsetting jobs serially
Using database links or linked servers

In all cases, when you generate subsets across different databases, you must use consistency to ensure that common columns have the same values in each subset.

Using deterministic WHERE clauses

One way to produce complete and cohesive data across databases is to use deterministic WHERE clauses in your target table configuration. A deterministic WHERE clause always produces the same results, and is never random.

A percentage is not deterministic. Structural selects a specific number of records, but selects those records at random.

Not all WHERE clauses are deterministic. For example, the Users, Service1, and Service2 databases each have a TotalValue column that reflects the total spent as a whole and for each service. Filtering based on TotalValue does not guarantee that you get a cohesive set of records.

Instead, provide a WHERE clause that can be used in each database to produce a cohesive set of records across the databases. For example, use a WHERE clause to look for a specific set of UUID values in each database.

In our example, if we target the same set of user UUIDs in the Users, Service1, and Service2 databases, we produce a complete and cohesive set of records for those users.

When you use a deterministic WHERE clause in each database, you can run the subsetting jobs independently.

Using pre-computed record lists

This is somewhat similar to using a deterministic WHERE clause. It is one way to provide input to create a deterministic WHERE clause.

You can run a query outside of Structural, and then use the results as input to the subset configuration. For example, you could run a query to identify users that are located in the United States.

One way to use the results would be to store the results somewhere in a database that is accessible to each workspace and that you can reference in the WHERE clauses. You could also return the result as a hard-coded list, and create WHERE clauses that use an IN() filter that contains a long list of these hard-coded values. You could even use the Structural API to update the WHERE clause values as a part of an automated process.

This method allows you to run the subsetting jobs independently.

Running subsetting jobs serially

Another option is to run the jobs on the workspaces serially. The results of a job on one workspace feed into the job on the next workspace.

To do this, you run the first job, which can have a target percentage or a non-deterministic WHERE clause.

After this job completes, use the results as input to a WHERE clause in the second workspace. For example, the results might be a set of user ID values.

Depending on how the databases are set up, you might be able to query the results directly. For fully isolated databases, you could export the list and hard-code it in the WHERE clause of the second workspace.

You can only use this option if the relevant column values are not changed by the generation process. If the column has a generator applied, then the output column values from the first database do not exist in the source column values in the second database.

Using database links or linked servers

The previous options are ideal for when the related databases are completely isolated from each other.

However, in some cases you can connect different database instances directly to query across them. Many database engines provide this capability, such as:

Oracle database links
SQL Server linked servers
PostgreSQL foreign-data wrapper with foreign server

If your environment allows and supports these mechanisms, then you can directly reference the external server in a query.

Ensuring consistent column values between databases

For columns that are common across all of the databases, you must ensure that a specific value in the source databases results in the same value in all of the destination databases.

To do this, you must assign a generator that supports consistency, and enable consistency on the column.

You must also configure Structural to ensure consistency across databases.

For more information, go to Enabling consistency.

Other subsetting hints and tips

If you are having trouble with your subsetting configuration or results, the following hints and tips might be helpful.

Remember that subsetting is an iterative process. It can take multiple attempts to get the exact subset of data that you want.

Start with a very small subset

One way to troubleshoot a subsetting configuration is to start with a very small subset. With a small subset, it is easier to verify that the data generation returns the data you are expecting.

For example, start with a , and use a query to limit the subset to a single record. Add the necessary lookup tables.

After you verify the results in this very small subset, gradually increase the subset size in subsequent iterations. Verify the results after each iteration. Continue to increase the subset size until you reach the full subset.

For each iteration, verify that:

The subset includes the target table records.
The subset includes all of the lookup table data.
The subset includes all of the related records - records that each target table record refers to, and records that refer to the target table record.

Minimize the number of target tables

An ideal subset contains a small number of . A target table typically contains an important object that is well connected to everything else in the source data.

Tables that contain static values that are used in multiple other tables should be , not target tables.

For , allow the subsetting process to identify the necessary rows to include in the subset based on the . Do not add them as target tables.

Make circular foreign key columns NULLable

If your source data includes circular foreign keys, make sure that at least one of those columns is NULLable.

Improve subsetting performance

Note that subsetting performance does not improve in a linear fashion.

For example, if you subset 10% of a target table, the generation does not take 10% of the time of a standard de-identification run. It might take 90% of the time.

To improve performance:

Make sure to properly configure all relevant tables as lookup tables. Lookup tables are automatically copied to the subset in their entirety, and do not require processing to identify the rows to include. Structural also does not look for records that are upstream of a lookup table.
For upstream tables, in particular large upstream tables, add indexes to virtual foreign key columns. If there are no indexes on the foreign key columns, subsetting can be much slower.
Alternatively, instead of adding indexes, assign Truncate mode to the upstream tables.

Prune related subset tables

Here are some other options to further reduce the size of the subset. These options reduce the amount of related data for the included target table records.

You can truncate related tables in a subset, which removes that entire section of the subset. When you truncate a related table, Structural removes both that table and any tables that are included because of that table.

For example, in this subsetting graph, the products table is a related table. For the records in products, the subset also includes related records from wholesale_orders, vendors, and wo_date.

After you truncate the products table, the subsetting graph indicates that during the next data generation, the products table and its connected tables will all be removed from the subset.

Note that for a related table that contains primary keys that are used to populate a foreign key column in an included table, you can only truncate the related table if the foreign key column is nullable.

For example, the product_id primary key column in the products table populates the purchased_product_id foreign key column in the transactions table. The transactions table is included in the subset. If purchased_product_id is not nullable, then you cannot truncate products.

Exclude data from upstream tables

Upstream tables contain related data that is not required for referential integrity. To completely remove the table, you can truncate it.

To keep the upstream table, but exclude all of the related records, create a WHERE clause filter where the WHERE clause evaluates to false. For many data connectors, you can directly set the WHERE clause text to false. Otherwise, use an obviously false statement such as 1=0.

When you exclude all of the records in an upstream table, the subset also excludes all records from its related tables.

For example, in this subsetting graph, before you add an upstream filter to the wholesale_orders table, the subset includes records from wholesale_orders, vendors, and wo_date.

After you add a WHERE clause filter that is set to false to the wholesale_orders table, the subset includes zero records from wholesale_orders, vendors, and wo_date.

About subsetting

Required license: Professional or Enterprise

What is subsetting?

Subsetting allows you to intelligently reduce the size of your destination database. It takes a representative sample of the data that preserves the data's referential integrity.

You configure how Tonic Structural generates the subset. When you generate the output data, you decide whether to enable the subsetting process.

For example, you can configure subsetting to get 5% of all transactions, or all of the data that is associated with customers who live in California.

Here are a few examples where subsetting data might be important or necessary:

You want to use your production database in staging or test environments, without the PII. Because the database is very large, you want to use only a portion of it.
To reproduce a bug, you want a test database that contains a few specific rows from production, and related rows from other tables.
You want to share data with others, but you don’t want them to have all of it. For example, you can provide developers an anonymized subset that also enables them to run the test database locally on their local machines.

To learn more about our approach to subsetting, go to the following technical blog posts:

Components of subsetting

Subsetting uses foreign keys to determine the relationships in the data. These relationships enable the subsetting process to traverse the database as it builds the subset.

Foreign keys are either configured in your source data, or configured using the Structural virtual foreign key tool. For more information, go to Subsetting and foreign keys.

For subsetting, each table in the source database falls into one of the following categories:

Target tables
Lookup tables
Related tables
Out-of-subset tables

Target tables

For more information, go to Identifying and configuring target tables.

Lookup tables

Structural always retrieves all of the data in a lookup table. It does not check whether or where the lookup values are used.

It does not pull records from related tables based on lookup table values. Relationships with lookup tables are ignored during the subsetting process.

For more information, go to Identifying lookup tables.

Related tables are tables that are connected by direct or indirect relationships with a target table, and that are not identified as lookup tables.

Downstream tables have data that is required to maintain referential integrity for the subset. These tables have primary keys that are referenced by foreign keys in related tables.
Upstream tables contain data that has a foreign key that references a primary key in the target table. For large upstream tables, if the foreign key columns are not indexed, the subsetting process can be significantly slower. These upstream records are not required to maintain referential integrity, but can contain useful information. In the subset configuration, you can filter these upstream records either by date or by using a WHERE clause.

Structural pulls data from related tables in order to preserve referential integrity in the output data subset.

For an example of how Structural identifies related tables, view the example diagram in How Structural creates a subset.

Out-of-subset tables

Tables other than target tables, lookup tables, or related tables are not part of the subset.

By default, Structural copies only the table schema of out-of-subset tables. It does not populate any of the data.

You can also choose to process the tables using the table mode that is assigned to each table.

For more information, go to Determining how to process tables that are not in the subset.

How Structural creates a subset

Structural creates the subset before it applies any transformations to the source data.

To provide a basic overview of how Structural creates the subset, we'll use the following simple example schema:

The Attendees table includes the event that the attendee registered for.

The Hosts, Companies, Venues, and Attendees tables are all related tables for the subset.

When you enable subsetting for a data generation job:

To create the basis of the subset, Structural gets data from the target tables based on the configured filters - either a percentage or a WHERE clause. In our example, Structural gets the subset of data from the Events table. Structural then traverses your database based on the relationships that originate from the target tables.
Structural first goes upstream. For the upstream process, Structural traverses through tables that reference a target table, based on the data collected in step 1. In other words, the value of the primary key for a target table record is the value of a foreign key column in the upstream table. This step continues until there are no remaining upstream tables to process. To continue our example, Structural retrieves the attendees for the event records that are in the subset.
Next, Structural goes downstream. Structural traverses all of the tables to look for foreign key columns for which the value is the primary key of an upstream table record. To continue our example, Structural retrieves the hosts and venues that are referred to in the event records that it retrieved in the first or second pass on the events table. It also retrieves the companies that are referred to in the host records. During this downstream step, Structural considers both upstream and downstream tables to ensure that the subset includes every connected table. For example, if the Venues table included a foreign key column that referenced a primary key from the Attendees table, Tonic would have to return to the Attendees table to get those attendee records.

Other notes about retrieving subset data

You might want to be aware of how Structural retrieves subset data in the following cases, which can result in either more or less data than you might expect.

If there are multiple target tables, and the tables are related to each other, Structural takes the union of the required data for both the target table configuration and the table relationships.

Tables that are upstream of multiple target tables

If a table is upstream of multiple target tables, then Structural only pulls records from that table that contain references to targeted records in all of the target tables.

Configuring subsetting

Identifying and configuring target tables

A target table is a table for which you specify a subset of the data to include in the destination database.

Options to specify the data subset

To identify the subset of data to include, you can either:

Specify a percentage of the table to include in the destination database. You can use this option when you care about the specific volume of data, but not the specific rows. Tonic Structural converts the percentage to a filter or a WHERE clause, depending on your database type. Depending on how your tables are related, the target tables in the final subset might contain more rows than the percentage that you specified. These additional rows are required to maintain referential integrity. To view the tables that contribute to the additional rows, see the subset steps. For additional assistance, reach out to your Tonic.ai contact.

Provide a WHERE clause to specify the subset of data to include in the destination database. The WHERE clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.

state = "Alabama" OR random() < 0.05

Configuring a percentage target table

To identify and configure subsetting for a target table:

In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Percentage).
In the Target Percentage field, type the percentage of the data to include in the destination database. The default is 5, which indicates to use 5% of the rows in the table. You can specify a decimal value, including a value that is less than 1. For example, you might configure the subset to include .5 percent of the rows, or 33.33 percent of the rows.

Configuring a WHERE clause target table

In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Where Clause).
In the Target Where Clause field, type the WHERE clause to use to identify the data to include in the destination database. For example, the target table contains a column called event_id. To select all rows where event_id is greater than 1000, add the following WHERE clause: event_id > 1000
For a more complex WHERE clause, you can display an editor with a larger text area.
1. Click Open in Editor.
2. In the text area, enter the WHERE clause.
3. Click Save.

"Customer_Key" IN (
    SELECT "Customer_Key"
    FROM customers_legacy
    WHERE customers_legacy."Customer_Key"=customers."Customer_Key"
    AND customers_legacy."Occupation" = 'Detective'
)

You can also create a query that selects a random percentage of a specified set of data.

For example, in PostgreSQL, to select 50% of the records that have an identifier that is divisible by 3, you could use the following WHERE clause:

id % 3 = 0 and random() < 0.5
order by id

Removing a target table

To remove a target table:

In Table View or Graph View, click the table.
On the table details panel, from the table type dropdown list, select Remove.

Identifying lookup tables

For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.

Properties of a lookup table

Here are some typical properties of a lookup table:

It is fairly small and rarely updated.
Many tables point to the table, but it does not point to another table.
The table contains a set of unique values.

Identifying an individual lookup table

To identify an individual table as a lookup table:

In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Lookup Table.

Identifying multiple lookup tables

To identify multiple tables as lookup tables:

On Table View, check the checkbox for each table to identify as a lookup table.
From the Actions dropdown list, select Add Lookup Tables.

Removing lookup tables

To remove the lookup designation for a table:

In Table View or Graph View, click the table.
On the table details panel, from the dropdown list, select Remove.

Filtering optional records

The Salesforce data connector does not support filtering optional records.

Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.

To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:

Use a date column to specify an amount of time before the current date for which to include records. For example, you can only include records for which the update date is one week before the current date.
Use a WHERE clause to identify the records to include.

You can filter a table that contains both upstream and downstream records. However, the filter only applies to the optional upstream records.

In the table list, when an upstream table is filtered, a Filtered icon displays.

Filtering by date

The date filter allows you to filter optional records based on the value of a date-based column.

To filter an upstream table by date:

In Table View or Graph View, click the table. On the table details panel, under Filter Optional Tables, Type is set by default to No Filter Applied, which indicates that the table is not filtered.
From the Type dropdown list, select Filter By Date Column.
From the Date Column dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.
Under Get data from the last, from the time unit dropdown list, select the unit of time to use for the filter. You can filter records based on their age in days, weeks, months, or years.
In the field, enter the number of the selected unit before the current date for which to include the upstream records. For example, you select days as the unit, and set the number to 4. Structural then pulls related records for which the date column value is up to 4 days before the current date.

Filtering with a WHERE clause

To filter the upstream records, you can also use a WHERE clause.

To use a WHERE clause to filter the upstream records:

In Table View or Graph View, click the table. On the table details panel, under Filter Optional Tables, Type is set by default to No Filter Applied, which indicates that the table is not filtered.
From the Type dropdown list, select Filter by Where Clause.
In the Where Clause text area, enter the WHERE clause to use to filter the related records.
Note that if the WHERE clause evaluates to false, then Structural excludes all of the upstream records from the table. For most data connectors, to exclude all of the upstream records, you can simply set the text of the WHERE clause to false. Otherwise, use an obviously false statement such as 1=0.
For a more complex WHERE clause, you can display an editor with a larger text area.
1. Click Open in Editor.
2. In the text area, enter the WHERE clause.
3. Click Save.
To copy the WHERE clause to the clipboard, click Copy To Clipboard.

Removing a filter

To remove an upstream filter, from the Type dropdown list, select No filter applied.

Identifying configuration changes since the most recent subsetting run

When a table's inclusion in the subset is affected, on Graph View, a colored marker is added to the bottom of the table box.

On Table View, a colored icon displays next to the table. A tooltip indicates the type of change.

The possible types of changes are:

Added to the subset. For example:
- A new target table
- A table that is newly included because it is related to a new target table
- A new lookup table
Removed from the subset. For example:
- A removed target table
- A table that is removed because it is related to a removed target table
- A removed lookup table
Modified in the subset. This usually reflects a change to a target table configuration. You might:
- Change the type of target table (percentage or WHERE clause)
- Change the percentage
- Change the WHERE clause
- Change the upstream filter

When you run a subsetting generation, Tonic clears the markers.

Determining how to process tables that are not in the subset

Tables other than target tables, lookup tables, or related tables are not in the subset.

The subsetting configuration includes how to copy all of these tables to the destination database.

You can either:

Use the table modes that are assigned to the out-of-subset tables.
Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.

By default, the setting is turned off, and Structural truncates the out-of-subset tables.

To use the assigned table mode to process each table, toggle the setting to the on position.

Determining whether to use subsetting during data generation

If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.

By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination data dataset.

To instead generate a subset, toggle Use subsetting to the on position.