Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Some Tonic Structural data connectors do not support subsetting.
However, for the following connectors that do not support subsetting, to generate a smaller set of data, you instead can add table filters.
The following data connectors support both subsetting and table filtering:
You can only filter tables that use De-identify table mode. The filter identifies the rows from the source database to process and include in the destination database.
Note that unlike subsetting, table filters do not guarantee referential integrity.
To add a filter, in the Table Filter text area on the table mode panel, provide the WHERE
clause for the filter, then click Apply.
For Databricks workspaces where the source database uses Delta files, the filter WHERE
clause can only refer to columns that have partitions.
For Amazon EMR and Google BigQuery, the filter WHERE
clause can refer to columns without partitions. However, the performance is better when the referenced columns have partitions.
On the workspace configuration for Amazon EMR and Databricks, the Enable partition filter validation toggle determines whether Structural validates the WHERE
clause when you create it. By default, the toggle is in the on position, and the WHERE
clause is validated.
For Amazon Redshift, Google BigQuery, and Snowflake, Structural always validates the WHERE
clause.
Required license: Professional or Enterprise
Required workspace permission: Configure subsetting
Subsetting generates a representative sample of your data in a way that maintains referential integrity.
You can also view this video overview of subsetting.
A target table is a table for which you specify a subset of the data to include in the destination database.
To identify the subset of data to include, you can either:
Specify a percentage of the table to include in the destination database.
You can use this option when you care about the specific volume of data, but not the specific rows.
Tonic Structural converts the percentage to a filter or a WHERE
clause, depending on your database type.
Depending on how your tables are related, the target tables in the final subset might contain more rows than the percentage that you specified. These additional rows are required to maintain referential integrity. To view the tables that contribute to the additional rows, see the . For additional assistance, reach out to your Tonic.ai contact.
Provide a WHERE
clause to specify the subset of data to include in the destination database.
The WHERE
clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.
To combine a specified set of records with a random set of the remaining records, use a WHERE
clause. For example, to get all users that are from Alabama, and 5 percent of the other records, use the following WHERE
clause:
To identify and configure subsetting for a target table:
In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Percentage).
In the Target Percentage field, type the percentage of the data to include in the destination database. The default is 5, which indicates to use 5% of the rows in the table. You can specify a decimal value, including a value that is less than 1. For example, you might configure the subset to include .5 percent of the rows, or 33.33 percent of the rows.
In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Where Clause).
In the Target Where Clause field, type the WHERE
clause to use to identify the data to include in the destination database.
For example, the target table contains a column called event_id
. To select all rows where event_id
is greater than 1000, add the following WHERE
clause:
event_id > 1000
For a more complex WHERE
clause, you can display an editor with a larger text area.
Click Open in Editor.
In the text area, enter the WHERE
clause.
Click Save.
You can query across tables within the WHERE
clause.
For example, you configure the customers
table as a target table, but you also want to use information from the customers_legacy
table to identify the target records in customers
.
In the following example query, the matching records in customers
have a Customer_Key
value that matches a CustomerKey
value in customers_legacy
, and where the value of Occupation
in customers_legacy
is Detective
:
You can also create a query that selects a random percentage of a specified set of data.
For example, in PostgreSQL, to select 50% of the records that have an identifier that is divisible by 3, you could use the following WHERE
clause:
To remove a target table:
In Table View or Graph View, click the table.
On the table details panel, from the table type dropdown list, select Remove.
A lookup table contains a list of values that are used to populate columns in other tables. For example, a list of states, countries, or currencies. Lookup tables are sometimes referred to as reference tables.
Structural always copies lookup tables to the destination database in their entirety. If you do not configure a table as a lookup table, then Structural treats the table as a related table, and copies only rows that are used in the subset data. Structural also pulls in rows from other tables that refer to the table, but that are not necessarily related to the target tables. This could result in an unexpectedly large subset.
For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.
Here are some typical properties of a lookup table:
It is fairly small and rarely updated.
Many tables point to the table, but it does not point to another table.
The table contains a set of unique values.
To identify an individual table as a lookup table:
In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Lookup Table.
To identify multiple tables as lookup tables:
On Table View, check the checkbox for each table to identify as a lookup table.
From the Actions dropdown list, select Add Lookup Tables.
To remove the lookup designation for a table:
In Table View or Graph View, click the table.
On the table details panel, from the dropdown list, select Remove.
The Salesforce data connector does not support filtering optional records.
Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.
To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:
Use a date column to specify an amount of time before the current date for which to include records. For example, you can only include records for which the update date is one week before the current date.
Use a WHERE
clause to identify the records to include.
You can filter a table that contains both upstream and downstream records. However, the filter only applies to the optional upstream records.
In the table list, when an upstream table is filtered, a Filtered icon displays.
The date filter allows you to filter optional records based on the value of a date-based column.
To filter an upstream table by date:
In Table View or Graph View, click the table. On the table details panel, under Filter Optional Tables, Type is set by default to No Filter Applied, which indicates that the table is not filtered.
From the Type dropdown list, select Filter By Date Column.
From the Date Column dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.
Under Get data from the last, from the time unit dropdown list, select the unit of time to use for the filter. You can filter records based on their age in days, weeks, months, or years.
In the field, enter the number of the selected unit before the current date for which to include the upstream records. For example, you select days as the unit, and set the number to 4. Structural then pulls related records for which the date column value is up to 4 days before the current date.
You can also filter the upstream records using a WHERE
clause.
To filter the upstream records using a WHERE
clause:
In Table View or Graph View, click the table. On the table details panel, under Filter Optional Tables, Type is set by default to No Filter Applied, which indicates that the table is not filtered.
From the Type dropdown list, select Filter by Where Clause.
In the Where Clause text area, enter the WHERE
clause to use to filter the related records.
Note that if the WHERE
clause evaluates to false
, then Structural excludes all of the upstream records from the table.
For most data connectors, you can simply set the text of the WHERE
clause to false
. Otherwise, use an obviously false statement such as 1=0
.
For a more complex WHERE
clause, you can display an editor with a larger text area.
Click Open in Editor.
In the text area, enter the WHERE
clause.
Click Save.
To copy the WHERE
clause to the clipboard, click Copy To Clipboard.
To remove an upstream filter, from the Type dropdown list, select No filter applied.
As you make changes to the subsetting configuration, Table View and Graph View indicates how the changes affect the next run of the subsetting generation when compared to the most recent subsetting generation.
When a table's inclusion in the subset is affected, on Graph View, a colored marker is added to the bottom of the table box.
On Table View, a colored icon displays next to the table. A tooltip indicates the type of change.
The possible types of changes are:
Added to the subset. For example:
A new target table
A table that is newly included because it is related to a new target table
A new lookup table
Removed from the subset. For example:
A removed target table
A table that is removed because it is related to a removed target table
A removed lookup table
Modified in the subset. This usually reflects a change to a target table configuration. You might:
Change the type of target table (percentage or WHERE
clause)
Change the percentage
Change the WHERE
clause
Change the upstream filter
When you run a subsetting generation, Tonic clears the markers.
Tables other than target tables, lookup tables, or related tables are not in the subset.
The subsetting configuration includes how to copy all of these tables to the destination database.
You can either:
Use the table modes that are assigned to the out-of-subset tables.
Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.
On Table View, on the Configuration tab, you use the Process tables that are out of subset toggle to determine how to handle these tables. After you run subsetting data generation, the toggle is on the Options tab.
By default, the setting is turned off, and Structural truncates the out-of-subset tables.
To use the assigned table mode to process each table, toggle the setting to the on position.
If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.
On Table View, on the Configuration tab, the Use Subsetting toggle indicates whether to generate a subset. After you run subsetting data generation, the Use Subsetting toggle is on the Options tab.
By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination data dataset.
To instead generate a subset, toggle Use subsetting to the on position.
When you run a data generation job, you are also prompted to confirm whether to generate the entire dataset or a subset. These two toggles are synchronized. If you turn on the Use Subsetting toggle on the Configuration tab, then it is on by default on the generation confirmation panel.
You can sometimes use parallel processing to improve the performance of the subsetting process. Parallel processing allows multiple subsetting steps to be processed at the same time. The steps cannot rely on the output of other steps that are processed in parallel.
The effect of subsetting parallelism on performance depends on your subsetting configuration, the layout of your schema, the performance characteristics of the machine that runs Structural, and the performance characteristics of your databases.
We recommend that you start with a relatively small number such as 4, and then run a data generation job to see how it affects performance. If performance improves, you can increase the number incrementally until the performance no longer improves.
The environment setting only controls the maximum number of steps that can be processed in parallel. Performance should not degrade if your system cannot support parallelism or won't benefit from using it.
If you have any other questions, contact support@tonic.ai.
Required license: Enterprise
By default, a child workspace inherits the subsetting configuration from its parent workspace. Any changes to the subsetting configuration in the parent workspace are copied to the child workspace.
If you make any change to the subsetting configuration in the child workspace, then it no longer inherits the subsetting configuration. The changes to the parent workspace do not affect the child workspace.
You can reset the child workspace to restore the inheritance.
For a child workspace, the Configuration tab on Table View indicates whether the child workspace currently inherits the configuration from the parent workspace.
Inherits parent configuration means that the child workspace inherits the parent configuration.
Overrides parent configuration means that the child workspace overrides the parent configuration. Changes to the parent configuration are not copied to the child workspace.
For a child workspace that overrides its parent configuration, you can reset the inheritance. When you reset the inheritance, the overrides are removed from the subsetting configuration. Changes to the parent configuration are once again copied to the child workspace.
To reset the inheritance, in the Overrides parent configuration notice, click Reset, then on the confirmation dialog, click Reset again.
The configuration overrides are removed. The child workspace inherits any subsequent configuration changes from the parent workspace.
For a parent workspace, you can view the current inheritance status of all of the child workspaces. When you change the subsetting configuration for a parent workspace, it applies to the child workspaces that have not overridden the subsetting configuration.
The Child Workspaces tab contains the list of child workspaces.
For each workspace, the list includes:
The workspace name.
The inheritance status. Inheriting indicates that the child workspace inherits the configuration from the parent. Overriding indicates that the child workspace overrides the configuration and does not inherit it from the parent.
Your role in the child workspace.
The owner of the child workspace.
You cannot reset the inheritance status from the Child Workspaces tab. If you have access to a child workspace, to switch to that workspace, click the arrow icon in the rightmost column.
Required license: Professional or Enterprise
Subsetting allows you to intelligently reduce the size of your destination database. It takes a representative sample of the data while preserving the data's referential integrity.
You configure how Tonic Structural should generate a subset. When you generate the output data, you decide whether to enable the subsetting process.
For example, you can configure subsetting to get 5% of all transactions, or all of the data that is associated with customers who live in California.
Here are a few examples where subsetting data might be important or necessary:
You want to use your production database in staging or test environments, without the PII. The database is very large, so you want to use only a portion of it.
You want a test database that contains a few specific rows from production (and related rows from other tables) so that you can reproduce a bug.
You want to share data with others but you don’t want them to have all of it. For example, you can provide developers an anonymized subset that also enables them to run the test database locally on their local machines.
To learn more about our approach to subsetting, go to the following technical blog posts:
Subsetting uses foreign keys to determine the relationships in the data. These relationships enable the subsetting process to traverse the database as it builds the subset. Foreign keys are either configured in your source data, or configured using the Structural virtual foreign key tool. For more information, go to .
For subsetting, each table in the source database falls into one of the following categories:
Target tables are the seed tables that provide the initial set of rows to include in the subset. Structural retrieves the initial subset of data from the target tables. Structural then uses those rows to identify the information to pull from related tables.
A target table typically contains an important object that is well connected to everything else in the source data. For example, users, transactions, or claims. A subset should usually have a very small number of target tables.
When you identify a target table, you specify how to retrieve the subset of the data that you want from the table. You can request a percentage of the data, or use a WHERE
clause to identify a specific subset of data.
A lookup table contains a static set of values that is used in other tables in your subset. For example, a lookup table might contain a list of postal codes or country names that are referenced in other tables.
Structural always retrieves all of the data in a lookup table. It does not check whether or where the lookup values are used.
It does not pull records from related tables based on lookup table values. Relationships with lookup tables are ignored during the subsetting process.
Related tables are tables that are connected by direct or indirect relationships with a target table, and that are not identified as lookup tables.
Downstream tables have data that is required to maintain referential integrity for the subset. These tables have primary keys that are referenced by foreign keys in related tables.
Some related tables are both downstream and upstream. In that case, you can provide a filter that applies only to the upstream records. Because the downstream records are required for referential integrity, they cannot be filtered.
For example, a transactions table contains a foreign key column to identify the customer. The value is the primary key of a record in the customers table. The customers table is downstream of the transactions table - the transaction data is incomplete without the customer information. The transactions table is upstream of the customers table.
Structural pulls data from related tables in order to preserve referential integrity in the output data subset.
In many cases, the relationship is direct. For example, a target table contains a list of events. The events table identifies the user that hosted the event. The user is identified using a foreign key relationship from the events table to the users table. The users table is a related table. The subset includes all the users that the events refer to.
The relationship also might be indirect. To continue the example, the events table identifies a user from the users table. The users table identifies the company that each user belongs to. The company is identified using a foreign key relationship from the users table to the companies table. The companies table is also a related table. The subset needs to include all of the companies that are referred to by the users that the events table refers to.
Tables other than target tables, lookup tables, or related tables are not part of the subset.
By default, Structural copies only the table schema of out-of-subset tables. It does not populate any of the data.
You can also choose to process the tables using the table mode that is assigned to each table.
Structural creates the subset before it applies any transformations to the source data.
To provide a basic overview of how Structural creates the subset, we'll use the following simple example schema:
The Events table is the target table for the subset. The Events table includes information about the event hosts (Hosts table) and the event venue (Venues table). For each host, the data includes the company that the host belongs to (Companies table).
The Attendees table includes the event that the attendee registered for.
The Hosts, Companies, Venues, and Attendees tables are all related tables for the subset.
The States table provides a lookup of state values to use for the company, venue, and attendee addresses. It is a lookup table for the subset. A subset always includes all of the data in a lookup table.
When you enable subsetting for a data generation job:
To create the basis of the subset, Structural gets data from the target tables based on the configured filters, either a percentage or a WHERE
clause.
In our example, Structural gets the subset of data from the Events table.
Structural then traverses your database based on the relationships that originate from the target tables.
Structural first goes upstream. For the upstream process, Structural traverses through tables that reference a target table, based on the data collected in step 1. In other words, the value of the primary key for a target table record is the value of a foreign key column in the upstream table. This step continues until there are no remaining upstream tables to process. To continue our example, Structural retrieves the attendees for the event records that are in the subset.
Next, Structural goes downstream. Structural traverses all of the tables to look for foreign key columns for which the value is the primary key of an upstream table record. To continue our example, Structural retrieves the hosts and venues that are referred to in the event records that it retrieved in the first or second pass on the events table. It also retrieves the companies that are referred to in the host records. During this downstream step, Structural considers both upstream and downstream tables to ensure that the subset includes every connected table. For example, if the Venues table included a foreign key column that referenced a primary key from the Attendees table, Tonic would have to return to the Attendees table to get those attendee records.
You might want to be aware of how Structural retrieves subset data in the following cases, which can result in either more or less data than you might expect.
If there are multiple target tables, and the tables are related to each other, Structural takes the union of the required data for both the target table configuration and the table relationships.
For example, table A contains a foreign key column that refers to table B. You configure both tables as target tables. For table B, Structural pulls both the directly targeted set of records, and the records that the targeted table A records refer to.
If a table is upstream of multiple target tables, then Structural only pulls records from that table that contain references to targeted records in all of the target tables.
For example, in related table Child1
, column1
is a foreign key that refers to a primary key in target table Parent1
. column2
is a foreign key that refers to a primary key in target table Parent2
.
If column1
and column2
both refer to targeted records in Parent1
and Parent2
, then that Child1
record is included in the subset. If only one of those columns refers to a targeted record in Parent1
or Parent2
, then that Child1
record is not included.
To enable parallel processing for subsetting, set the TONIC_TABLE_PARALLELISM
to a number greater than 1 (the default). You can configure this setting from Structural Settings. This setting determines the maximum number of subsetting steps that Structural can process in parallel. For regular data generation, it also determines the number of tables that Structural operates on at the same time.
For more information, go to .
For more information, go to .
Upstream tables contain data that has a foreign key that references a primary key in the target table. For large upstream tables, if the foreign key columns are not indexed, the subsetting process can be significantly slower.
These upstream records are not required to maintain referential integrity, but can contain useful information. In the subset configuration, you can either by date or by using a WHERE
clause.
For an example of how Structural identifies related tables, view the example diagram in .
For more information, go to .
About subsetting
Learn about subsetting, its components, and how subsetting generation works.
View the subset configuration
Review the current subset configuration.
Configure subsetting
Select and configure target and lookup tables.
View the subset generation steps
Review the path that Tonic Structural takes to identify the subset records.
View previous subset generation runs
See the results and the configuration for previous runs of subsetting generation.
Generate subset data from related databases
Subsetting hints and tips
Hints and tips for subset configuration.
Your data might be stored in separate but related databases. In Tonic Structural, each database provides the source data for a different workspace.
For example, a Users database contains a list of users. Each service also has a separate database. The Service1 and Service2 databases refer to identifiers of users from the Users database, but there are no direct foreign key relationships.
When you generate a subset from each database, you might want to ensure that the resulting data is complete and cohesive. For example, your application connects to and pulls data from each database. This means that your end-to-end testing also requires corresponding data from each database.
To continue the previous example, you generate subsets from the Users, Service1, and Service2 databases. Your application pulls data from each database. For the data to be complete and have referential integrity, the subsets from the Service1 and Service2 service databases should only contain records that refer to the users in the subset from the Users database.
Here are some options to generate subset from separate databases to produce data that is complete and cohesive:
In all cases, when you generate subsets across different databases, you must use consistency to ensure that common columns have the same values in each subset.
One way to produce complete and cohesive data across databases is to use deterministic WHERE
clauses in your target table configuration. A deterministic WHERE
clause always produces the same results, and is never random.
A percentage is not deterministic. Structural selects a specific number of records, but selects those records at random.
Not all WHERE
clauses are deterministic. For example, the Users, Service1, and Service2 databases each have a TotalValue
column that reflects the total spent as a whole and for each service. Filtering based on TotalValue
does not guarantee that you get a cohesive set of records.
Instead, provide a WHERE
clause that can be used in each database to produce a cohesive set of records across the databases. For example, use a WHERE
clause to look for a specific set of UUID values in each database.
In our example, if we target the same set of user UUIDs in the Users, Service1, and Service2 databases, we produce a complete and cohesive set of records for those users.
When you use a deterministic WHERE
clause in each database, you can run the subsetting jobs independently.
This is somewhat similar to using a deterministic WHERE
clause. It is one way to provide input to create a deterministic WHERE
clause.
You can run a query outside of Structural, and then use the results as input to the subset configuration. For example, you could run a query to identify users that are located in the United States.
One way to use the results would be to store the results somewhere in a database accessible to each workspace that you can reference in the WHERE
clauses. You could also return the result as a hard-coded list, and create WHERE
clauses that use an IN()
filter that contains a long list of these hard-coded values. You could even use the Structural API to update the WHERE
clause values as a part of an automated process.
This method allows you to run the subsetting jobs independently.
Another option is to run the jobs on the workspaces serially. The results of a job on one workspace feed into the job on the next workspace.
To do this, you run the first job, which can have a target percentage or a non-deterministic WHERE
clause.
After this job completes, use the results as input to a WHERE
clause in the second workspace. For example, the results might be a set of user ID values.
Depending on how the databases are set up, you might be able to query the results directly. For fully isolated databases, you could export the list and hard-code it in the WHERE
clause of the second workspace.
You can only use this option if the relevant column values are not changed by the generation process. If the column has a generator applied, then the output column values from the first database do not exist in the source column values in the second database.
The previous options are ideal for when the related databases are completely isolated from each other.
However, in some cases you can connect different database instances directly to query across them. Many database engines provide this capability, such as:
Oracle database links
SQL Server linked servers
PostgreSQL foreign-data wrapper with foreign server
If your environment allows and supports these mechanisms, then you can directly reference the external server in a query.
For columns that are common across all of the databases, you must ensure that a specific value in the source databases results in the same value in all of the destination databases.
To do this, you must assign a generator that supports consistency, and enable consistency on the column.
You must also configure Structural to ensure consistency across databases.
For more information, go to Enabling consistency.
The subsetting process uses foreign keys to navigate the relationships in your data. It uses these relationships to identify the data to include in the subset. Without foreign keys, Tonic Structural does not know how to navigate the relationships in your data. Properly configured foreign keys allow Structural to select the necessary rows from other tables, which ensures referential integrity.
Foreign keys are often set up directly within the source database. You can also set up virtual foreign keys within Structural. For example, a foreign key relationship might be missing, or your database might not use foreign keys. If your database uses polymorphic keys, then you must use the foreign key upload to add those keys manually.
For information about Foreign Keys view, including how to create and upload virtual foreign keys, go to Viewing and adding foreign keys.
You can also add virtual foreign keys from Subsetting view.
For example, on Graph View, you might notice that a relationship between tables is missing. You can immediately add a virtual foreign key to establish that relationship.
To create a virtual foreign key from Subsetting view:
Display the table details panel for the table that contains the foreign key.
Click Create Virtual Foreign Key.
Under Foreign Key from this Table, select the column in the current table that contains the foreign key. To find the foreign key column, begin to type the column name.
Under Primary Key in Another Table, select the column that contains the primary key. To find the primary key column, begin to type the column name or the name of the table.
Click Save.
Foreign key relationships can sometimes have circular dependencies, also referred to as cyclical dependencies.
In the simplest case, a circular dependency occurs when two tables each contain a foreign key that references the other table. In the following example, the Employees
table contains a department_id
foreign key column that references the Departments
table, and the Departments
table contains a manager_id
foreign key column that references the Employees
table.
Circular dependencies can also come from a much longer chain of references, where you follow references through several tables before returning to the original table.
Circular dependencies can also occur when a table references itself. In the following example, the Employees
table contains a manager_id
foreign key column that contains an employee ID value from the id
column.
During subsetting, if the circular dependency isn't broken, then there is an endless loop of going back and forth between the tables that reference each other.
To break a circular dependency, Structural identifies a foreign key column that is NULL
able, and sets its values to NULL
. When the process reaches a NULL
value, it stops looking for additional related records. Structural applies the minimum number of NULL
values that are needed to break the circular dependencies.
If none of the foreign key values are NULL
able, then the circular dependency cannot be broken, and the subset generation fails.
Tonic can detect circular dependencies before you run subsetting.
When a table contains foreign keys that are part of a circular dependency that Structural breaks:
On Graph View, a Cycle Break marker is added to the table object. The marker includes the name of the foreign key column.
The table details panel also indicates that there is a cycle break and lists the affected columns.
If you are having trouble with your subsetting configuration or results, the following hints and tips might be helpful.
Remember that subsetting is an iterative process. It can take multiple attempts to get the exact subset of data that you want.
One way to troubleshoot a subsetting configuration is to start with a very small subset. With a small subset, it is easier to verify that the data generation returns the data you are expecting.
For example, start with a single target table, and use a query to limit the subset to a single record. Add the necessary lookup tables.
After you verify the results in this very small subset, you can gradually increase the subset size in subsequent iterations. Verify the results after each iteration, and continue to increase the subset size until you reach the full subset.
For each iteration, verify that:
The subset includes the target table records.
The subset includes all of the lookup table data.
The subset includes all of the related records - records that each target table record refers to, and records that refer to the target table record.
An ideal subset contains a small number of target tables. A target table typically contains an important object that is well connected to everything else in the source data.
Tables that contain static values that are used in multiple other tables should be lookup tables, not target tables.
For other tables that are related to the target table, allow the subsetting process to identify the necessary rows to include in the subset based on the foreign key configuration. Do not add them as target tables.
When Tonic Structural detects a circular foreign key dependency, to break the dependency, it sets all of the values of one of the columns to NULL
. For more information, go to #circular-dependencies.
If your source data includes circular foreign keys, make sure that at least one of those columns is NULL
able.
Note that subsetting performance does not improve in a linear fashion.
For example, if you subset 10% of a target table, the generation does not take 10% of the time of a standard de-identification run. It might take 90% of the time.
To improve performance:
Make sure to properly configure all relevant tables as lookup tables. Lookup tables are automatically copied to the subset in their entirety, and do not require processing to identify which rows to include. Structural also does not look for records that are upstream of a lookup table.
For upstream tables, in particular large upstream tables, add indexes to virtual foreign key columns. If there are no indexes on the foreign key columns, subsetting can be much slower.
Alternatively, instead of adding indexes, assign Truncate mode to the upstream tables.
Here are some other options to further reduce the size of the subset. These options reduce the amount of related data for the included target table records.
You can truncate related tables in a subset, which removes that entire section of the subset. When you truncate a related table, Structural removes both that table and any tables that are included because of that table.
For example, in this subsetting graph, the products
table is a related table. For the records in products
, the subset also includes related records from wholesale_orders
, vendors
, and wo_date
.
After you truncate the products
table, the subsetting graph indicates that during the next data generation, the products
table and its connected tables will all be removed from the subset.
Note that for a related table that contains primary keys that are used to populate a foreign key column in an included table, you can only truncate the related table if the foreign key column is nullable.
For example, the product_id
primary key column in the products
table populates the purchased_product_id
foreign key column in the transactions
table. The transactions
table is included in the subset. If purchased_product_id
is not nullable, then you cannot truncate products
.
Upstream tables contain related data that is not required for referential integrity. To completely remove the table, you can truncate it.
You can also configure a filter for an upstream table that is based on either a date column or a WHERE
clause. When you filter the table, the subset only includes related records that match the filter.
To keep the upstream table, but exclude all of the related records, create a WHERE
clause filter where the WHERE
clause evaluates to false
. For many data connectors, you can directly set the WHERE
clause text to false
. Otherwise, use an obviously false statement such as 1=0
.
When you exclude all of the records in an upstream table, the subset also excludes all records from its related tables.
For example, in this subsetting graph, before you add an upstream filter to the wholesale_orders
table, the subset includes records from wholesale_orders
, vendors
, and wo_date
.
After you add a WHERE
clause filter that is set to false
to the wholesale_orders
table, the subset includes zero records from wholesale_orders
, vendors
, and wo_date
.
To display the Subsetting view, either:
On the workspace management view, in the workspace navigation bar, click Subsetting.
On Workspaces view, from the dropdown menu in the Name column, select Subsetting.
On Workspaces view, click the subsetting icon for the workspace.
The Configuration tab on the Subsetting view shows the current subsetting configuration.
It consists of:
Subsetting summary
Table View and Graph View. Both views display the source data tables and show the current subsetting configuration, and allow you to update the configuration. Table View displays a tabular list of tables. Graph View displays a diagram that shows the relationships between the tables.
Configuration to enable subsetting for data generation
Configuration for handling out-of-subset tables
Results of the most recent subsetting data generation
The panels at the top of the Configuration tab provide a clickable summary of the current subsetting configuration.
The summary includes the following values:
Target shows the number of target tables.
Lookup shows the number of lookup tables.
In Subset shows the number of tables that are in the subset. This includes target tables, lookup tables, and related tables.
Out of Subset shows the number of tables that are not in the subset.
When you click a summary panel:
On Table View, the table list is filtered to only display matching tables. For example, when you click Target, the list is filtered to only include target tables.
On Graph View, the matching tables are highlighted with a shadow behind the table objects.
After you run data generation with subsetting, on Table View, the Latest Results tab displays on the Configuration tab.
Before you run data generation with subsetting, the Latest Results tab does not display.
The Latest Results tab displays details for the most recent data generation with subsetting. It ignores data generation runs that do not use subsetting.
The subsetting results include:
The job status (successful, failed, canceled).
The amount of time it took to complete the run.
The percentage of the source data that is included in the subset destination data.
The volume of data in the source data and the subset destination data.
The percent reduction from the original source data to the subset destination data.
When the job began and ended.
To display the details for the data generation job, click View Job Details.
The Configuration tab contains the list of tables in the source database. It shows how each table is affected by the most recently completed subsetting configuration.
For each table, the table list includes:
Whether the table is a target table, a lookup table, or a related table that is filtered.
Whether the table is in or out of the subset. Target and lookup tables are always in the subset. Related tables also are in the subset. Other tables are out of the subset.
The number of rows in the table before and after the subset is created. For tables that are in the subset, the percentage of table data that is in the subset. For more information, go to #subsetting-view-calculate-pre-post-subset-rows.
The number of direct inbound (downstream) and outbound (upstream) relationships for the table. An inbound relationship means that a primary key from another table is used as a value in the current table. An outbound relationship means that the primary key of the current table is a foreign key in another table. You can filter the upstream records to only include the records that you need. For target tables, the relationships are used to determine the related tables that are included in the subset. The related tables can also include other tables where the relationship is indirect.
You can sort the list based on values in a selected column. To use a column to sort the list, click the column heading. To switch the sort order, click the column heading again.
The Sort by dropdown list provides the following options to sort the list:
Rows pre-subset - Sort by the number of rows in the table before subsetting.
Rows post-subset - Sort by the number of rows in the table after subsetting. Before you run a data generation job to create the subset, this value is unknown.
Inbound relationships - Sort the list based on the number of inbound relationships.
Outbound relationships - Sort the list based on the number of outbound relationships.
Total relationships - Sort the list based on the total number of inbound and outbound relationships.
By default, the drop-down sort options sort the table list in descending order. For example, when you select Rows pre-subset, the table that currently has the largest number of rows is at the top of the list. To change the sort direction, select the option again.
You can filter the list based on:
The table name
Whether the table is in or out of the subset
Whether the table is a target or lookup table
To filter by table name, begin typing the name text into the filter field. As you type, the list is filtered to only include tables whose names contain the filter text.
To filter the list to show only target tables, lookup tables, in-subset tables, or out-of-subset tables, do one of the following:
Click the panel at the top of the tab.
From the Filter Tables drop-down list, select the filter option.
To remove a table subset status filter, click the delete icon.
You can combine a name filter and a table subset status filter. For example, you can filter the list to show in-subset tables that contain the text "test".
You cannot combine the table subset status filters. When you select a different filter, the current filter is replaced.
Graph View displays a diagram of the source data tables and the relationships between them. It also indicates:
Whether each table is in the subset.
Whether the subsetting status for the table changed since the last subsetting data generation.
Each table block provides the following information about the table:
At the top left:
The name of the table
The name of the schema that contains the table
At the top right, the status of the table in the context of the subset. A table might be a target table, a lookup table, a related table that is in the subset, or a table that is out of the subset.
At the bottom, the number of rows in the table before and after the subset is created. For more information, go to #subsetting-view-calculate-pre-post-subset-rows.
It also indicates the effect on the table of subset configuration changes that occurred since the most recent subsetting generation. For more information, go to #subsetting-config-identify-changes-since-last-run.
The Graph View diagram connects tables that are related to each other based on a foreign key relationship. The position of the tables indicates the type of relationship.
Tables that have an upstream relationship with another table are displayed above the table.
Tables that have a downstream relationship with another table are displayed below the table.
In the example schema from #subsetting-how-tonic-creates, the Events table contains a list of events:
The Attendees table refers to the event for the attendee. Attendees is upstream of Events, and would display above the Events table in Graph View.
The Events table refers to a venue from the Venues table. Venues is downstream of Events, and would display below the Events table in Graph View.
To find and focus on a specific table:
In the search field, begin to type text in the table name. As you type, Tonic Structural filters the list to display matching tables.
When you see the table that you want, click the table name. Structural highlights the connections to other tables and displays the table details panel.
To navigate around Graph View, you can click and drag to pan around the graph.
You can also use the navigation tools at the bottom left of Graph View to zoom in and zoom out.
For tables that contain fewer than 1,000 rows, the pre-subset number of rows is displayed as <1k.
For tables that are in the subset, the resulting rows are based on the target table and related table configuration.
For tables that are not in the subset, the resulting rows are based on whether you enable Process tables that are out of subset. For more information, go to #subsetting-config-out-of-subset.
If the data generation job hasn't run yet, or the details from the job are not yet available, then the number of rows after the subset is marked as unknown.
If you updated the configuration for a table since the most recent data generation, then on Table View, an information icon displays next to the post-subset value.
When you click a table in either Table View or Graph View, the table details panel displays to the right of the table.
The table details include:
Whether the table is in the subset
The number of rows before and after the subsetting. For more information, go to #subsetting-view-calculate-pre-post-subset-rows.
The number of outbound and inbound relationships
For target tables, the subset configuration
The list of inbound and outbound relationships with other tables. When you click a table name, Structural selects and displays the details for that table.
The Subset Steps tab outlines the steps that Tonic Structural uses to create the subset based on the current configuration.
The steps include the processing of the target, lookup, and related tables. However, the list is not necessarily a one-to-one correspondence with the in-subset tables. A table might appear in multiple steps in order to satisfy referential integrity.
The steps do not include the out-of-subset tables.
Each step includes:
Action Step and Table Name
Identifies the table, and indicates whether the table is a target table or a lookup table.
Status
For target tables and lookup tables, Status is Direct. This indicates that the subsetting process pulls data directly from the table.
For target tables, this is based on the percentage or WHERE
clause.
For lookup tables, the subsetting process copies the entire table.
For related tables, the status is either Downstream or Upstream.
Contributing Tables
For related tables, the Contributing Tables column indicates the number of tables that affect the data that Structural pulls from the table. To display the contributing tables and how the current table is affected by those tables, click the information icon.
Source/Destination Rows
The number of rows in the source data and in the subset. For tables that contain fewer than 1,000 rows, the pre-subset value is <1k. Before you run the data generation, the number of rows in the subset is unknown. Otherwise, the number reflects the results of the most recent data generation.
The Configuration tab displays the results of the most recent subsetting run, as well as the subsetting configuration that was in place during each run. You can use the Previous Subsetting Runs tab to view details for any of the previous 100 subsetting runs.
You can use information about previous runs to see how changes to the subsetting configuration affect the subset results.
From the Select a previous subset run dropdown list, select the subsetting run to display the details for.
The runs are identified by the run date and time.
The details for a selected run include the following:
A summary of the run results.
Details about the subsetting configuration that was in place at the time of the run.
The panel at the left of the tab summarizes the results of the selected run. The run summary includes the following:
The status of the run (successful, failed, canceled).
The amount of time it took to complete the run.
For successful runs:
The percentage of source data included in the destination database
The volume of data in the source database
The volume of data in the destination database
To display the job details for the run, click View Job Details.
The subsetting configuration reflects the configuration that was in place at the time of the selected run. It is read-only. To make adjustments to the subsetting configuration, return to the Configuration tab.
Previous Subsetting Runs only displays Table View.
The panels above the table list show the number of target tables, lookup tables, related tables, and out-of-subset tables.
For each table, the list indicates whether the table is in the subset. It identifies the target and lookup tables. To view the subset configuration for an individual table, click the row.