About subsetting
What is subsetting?
Subsetting allows you to intelligently reduce the size of your destination database. It takes a representative sample of the data that preserves the data's referential integrity.
You configure how Tonic Structural generates the subset. When you generate the output data, you decide whether to enable the subsetting process.
For example, you can configure subsetting to get 5% of all transactions, or all of the data that is associated with customers who live in California.
Here are a few examples where subsetting data might be important or necessary:
You want to use your production database in staging or test environments, without the PII. Because the database is very large, you want to use only a portion of it.
To reproduce a bug, you want a test database that contains a few specific rows from production, and related rows from other tables.
You want to share data with others, but you don’t want them to have all of it. For example, you can provide developers an anonymized subset that they can also run as a test database on their local machines.
Components of subsetting
Subsetting uses foreign keys to determine the relationships in the data. These relationships enable the subsetting process to traverse the database as it builds the subset.
Foreign keys are either configured in your source data, or configured using the Structural virtual foreign key tool. For more information, go to Subsetting and foreign keys.
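To illustrate how foreign keys that are declared in the data and foreign keys that are supplied separately can feed the same relationship list, here is a minimal Python sketch that uses SQLite. The table names and the VIRTUAL_FOREIGN_KEYS list are hypothetical. The list only stands in for relationships that you would define with the virtual foreign key tool, and the sketch does not reflect how Structural reads or stores foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    -- No declared foreign key here, even though the column is one logically.
    CREATE TABLE refunds (id INTEGER PRIMARY KEY, transaction_id INTEGER);
    """
)


def declared_foreign_keys(conn, table):
    """Return (child table, child column, parent table, parent column) tuples."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # PRAGMA columns: id, seq, table, from, to, on_update, on_delete, match
    return [(table, row[3], row[2], row[4]) for row in rows]


# Assumption: a hand-maintained list that plays the role of virtual foreign keys.
VIRTUAL_FOREIGN_KEYS = [("refunds", "transaction_id", "transactions", "id")]

all_foreign_keys = []
for name in ("customers", "transactions", "refunds"):
    all_foreign_keys.extend(declared_foreign_keys(conn, name))
all_foreign_keys.extend(VIRTUAL_FOREIGN_KEYS)
```

In this sketch, the refunds relationship is only visible to a traversal because it was added to the list by hand.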
For subsetting, each table in the source database falls into one of the following categories:
Target tables
Target tables are the seed tables that provide the initial set of rows to include in the subset. Structural retrieves the initial subset of data from the target tables. Structural then uses those rows to identify the information to pull from related tables.
A target table typically contains an important object that is well connected to everything else in the source data, such as users, transactions, or claims. A subset should usually have a very small number of target tables.
When you identify a target table, you specify how to retrieve the rows that you want from the table. You can request a percentage of the rows in the table, or use a WHERE clause to identify a specific set of rows from that table.
For more information, go to Identifying and configuring target tables.
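The two ways to select target table rows can be sketched as follows. This is an illustrative Python example against an in-memory SQLite table, not Structural's implementation. The function pick_target_rows, the customers table, and its columns are assumptions made for the example. The SQL is built with string formatting only to keep the sketch short.

```python
import random
import sqlite3


def pick_target_rows(conn, table, pk, percentage=None, where=None, seed=None):
    """Return the primary keys of the seed rows for one target table.

    Pass either `where` (a SQL condition) or `percentage` (0 to 100).
    """
    if where is not None:
        # WHERE-based selection: keep exactly the rows that match the filter.
        rows = conn.execute(f"SELECT {pk} FROM {table} WHERE {where}").fetchall()
        return sorted(row[0] for row in rows)
    if percentage is None:
        raise ValueError("provide either a percentage or a WHERE condition")
    # Percentage-based selection: a random sample of the table's rows.
    keys = [row[0] for row in conn.execute(f"SELECT {pk} FROM {table}")]
    count = round(len(keys) * percentage / 100)
    return sorted(random.Random(seed).sample(keys, count))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "CA" if i % 4 == 0 else "NY") for i in range(1, 201)],
)

by_percentage = pick_target_rows(conn, "customers", "id", percentage=5, seed=1)
by_filter = pick_target_rows(conn, "customers", "id", where="state = 'CA'")
```

With 200 rows in the table, the 5% request yields 10 randomly chosen rows, while the WHERE condition yields every row that matches, regardless of how many there are.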
Lookup tables
A lookup table contains a static set of values that is used in other tables in your subset. For example, a lookup table might contain a list of postal codes or country names that are referenced in other tables.
Structural always retrieves all of the data in a lookup table. It does not check whether or where the lookup values are used.
It does not pull records from related tables based on lookup table values. Relationships with lookup tables are ignored during the subsetting process.
For more information, go to Identifying lookup tables.
Related tables
Related tables are tables that are connected by direct or indirect relationships with a target table, and that are not identified as lookup tables.
Downstream tables have data that is required to maintain referential integrity for the subset. These tables have primary keys that are referenced by foreign keys in related tables.
Upstream tables contain data that has a foreign key that references a primary key in the target table. For large upstream tables, if the foreign key columns are not indexed, the subsetting process can be significantly slower. These upstream records are not required to maintain referential integrity, but can contain useful information. In the subset configuration, you can filter these upstream records either by date or by using a WHERE clause.
Some related tables are both downstream and upstream. In that case, you can provide a filter that applies only to the upstream records. Because the downstream records are required for referential integrity, they cannot be filtered.
For example, a transactions table contains a foreign key column to identify the customer. The value is the primary key of a record in the customers table. The customers table is downstream of the transactions table, because the transaction data is incomplete without the customer information. The transactions table is upstream of the customers table.
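The direction of a relationship follows from which table holds the foreign key. The following Python sketch shows that rule for the transactions and customers example. The direction function and the tuple layout are hypothetical names chosen for illustration.

```python
# Each entry is (child table, foreign key column, parent table).
# The child holds the foreign key. The parent holds the referenced primary key.
FOREIGN_KEYS = [("transactions", "customer_id", "customers")]


def direction(table, relative_to, foreign_keys):
    """Describe where `table` sits relative to `relative_to`."""
    # `table` is downstream when `relative_to` holds a foreign key that points to it.
    is_downstream = any(
        child == relative_to and parent == table for child, _, parent in foreign_keys
    )
    # `table` is upstream when it holds a foreign key that points to `relative_to`.
    is_upstream = any(
        child == table and parent == relative_to for child, _, parent in foreign_keys
    )
    if is_downstream and is_upstream:
        return "both"
    if is_downstream:
        return "downstream"
    if is_upstream:
        return "upstream"
    return "unrelated"


customers_position = direction("customers", "transactions", FOREIGN_KEYS)
transactions_position = direction("transactions", "customers", FOREIGN_KEYS)
```

Here customers_position is "downstream" and transactions_position is "upstream", which matches the description above.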
Structural pulls data from related tables in order to preserve referential integrity in the output data subset.
In many cases, the relationship is direct. For example, a target table contains a list of events. The events table identifies the user that hosted the event. The user is identified using a foreign key relationship from the events table to the users table. The users table is a related table. The subset includes all the users that the events refer to.
The relationship also might be indirect. To continue the example, the events table identifies a user from the users table. The users table identifies the company that each user belongs to. The company is identified using a foreign key relationship from the users table to the companies table. The companies table is also a related table. The subset needs to include all of the companies that are referred to by the users that the events table refers to.
For an example of how Structural identifies related tables, view the example diagram in How Structural creates a subset.
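Direct and indirect relationships can be pictured as a walk over the foreign key graph. The sketch below is a simplified Python illustration, not Structural's algorithm. It treats every relationship as undirected, skips relationships that involve a lookup table, and uses a hypothetical countries lookup table and hypothetical column names.

```python
from collections import deque

# Each entry is (child table, foreign key column, parent table).
FOREIGN_KEYS = [
    ("events", "host_user_id", "users"),
    ("users", "company_id", "companies"),
    ("companies", "country_code", "countries"),
]
LOOKUP_TABLES = {"countries"}


def related_tables(targets, foreign_keys, lookup_tables):
    """Return the tables reachable from the target tables through foreign keys."""
    neighbors = {}
    for child, _, parent in foreign_keys:
        if child in lookup_tables or parent in lookup_tables:
            continue  # relationships with lookup tables are ignored
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)

    seen = set(targets)
    queue = deque(targets)
    while queue:
        table = queue.popleft()
        for other in neighbors.get(table, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return sorted(seen - set(targets))


related = related_tables(["events"], FOREIGN_KEYS, LOOKUP_TABLES)
```

The result is ["companies", "users"]: users is related directly, companies is related indirectly through users, and the lookup table is left out of the walk.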
Out-of-subset tables
Tables other than target tables, lookup tables, or related tables are not part of the subset.
By default, Structural copies only the table schema of out-of-subset tables. It does not populate any of the data.
You can also choose to process the tables using the table mode that is assigned to each table.
For more information, go to Determining how to process tables that are not in the subset.
How Structural creates a subset
Structural creates the subset before it applies any transformations to the source data.
To provide a basic overview of how Structural creates the subset, we'll use the following simple example schema:

The Events table is the target table for the subset. The Events table includes information about the event hosts (Hosts table) and the event venue (Venues table). For each host, the data includes the company that the host belongs to (Companies table).
The Attendees table includes the event that the attendee registered for.
The Hosts, Companies, Venues, and Attendees tables are all related tables for the subset.
The States table provides a lookup of state values to use for the company, venue, and attendee addresses. It is a lookup table for the subset. A subset always includes all of the data in a lookup table. Our example subset automatically contains the 50 records from States.
Step 1. Get the initial rows from the target tables
When you enable subsetting for a data generation job, to create the basis of the subset, Structural gets data from the target tables based on the configured filters: either a percentage or a WHERE clause.
In our example, Structural first gets a set of rows from the Events table.
For example, if the subset is configured to get 5% of the Events records, and the Events table contains 200 records, then Structural gets a random set of 10 records from Events.
Structural then traverses your database based on the relationships that originate from the target tables.
Step 2. Get upstream records
Structural first goes upstream.
For the upstream process, Structural traverses through tables that reference a target table, based on the rows collected in step 1. In other words, the value of the primary key for a target table record is the value of a foreign key column in the upstream table.
This step continues until there are no remaining upstream tables to process.
To continue our example, Structural retrieves the attendees for those initial 10 events. Assuming that each event has 15 attendees, Structural retrieves the 150 Attendees records that are linked to those events.
Step 3. Get downstream records and other connected records
Next, Structural goes downstream.
Structural traverses all of the tables to look for foreign key columns for which the value is the primary key of an upstream table record.
To continue our example, Structural retrieves the Hosts and Venues rows that are referred to in the Events rows that it retrieved in the first step. It also retrieves the Companies rows that those Hosts rows refer to.
Assuming that the events occurred at 5 different venues and had 10 different hosts, each of which represents a different company, Structural retrieves:
The 5 Venues rows for the original 10 events.
The 10 Hosts rows for the original 10 events.
The 10 Companies rows for those 10 hosts.
During this downstream step, Structural considers both upstream and downstream tables to ensure that the subset includes every connected table.
For example, if the Venues table included a foreign key column that referenced a primary key from the Attendees table, Structural would have to return to the Attendees table to get those attendee records.
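The three steps can be sketched on a toy version of the example schema. This Python example is a simplified illustration under stated assumptions, not Structural's implementation. The rows live in memory, every table has an id primary key, the column names are hypothetical, and the seed rows are passed in directly instead of being sampled.

```python
from collections import defaultdict

# Each entry is (child table, foreign key column, parent table).
FOREIGN_KEYS = [
    ("events", "host_id", "hosts"),
    ("events", "venue_id", "venues"),
    ("hosts", "company_id", "companies"),
    ("attendees", "event_id", "events"),
]

TABLES = {
    "events": [
        {"id": 1, "host_id": 1, "venue_id": 1},
        {"id": 2, "host_id": 2, "venue_id": 1},
        {"id": 3, "host_id": 3, "venue_id": 2},
    ],
    "hosts": [
        {"id": 1, "company_id": 1},
        {"id": 2, "company_id": 1},
        {"id": 3, "company_id": 2},
    ],
    "venues": [{"id": 1}, {"id": 2}],
    "companies": [{"id": 1}, {"id": 2}],
    "attendees": [
        {"id": 1, "event_id": 1},
        {"id": 2, "event_id": 1},
        {"id": 3, "event_id": 2},
        {"id": 4, "event_id": 3},
    ],
}


def build_subset(target_table, target_ids):
    """Return {table: sorted primary keys} for the rows in the subset."""
    subset = defaultdict(set)
    subset[target_table] = set(target_ids)  # Step 1: seed rows from the target

    # Step 2: go upstream. Add rows whose foreign key points to a collected
    # row, and repeat until no table contributes new rows.
    changed = True
    while changed:
        changed = False
        for child, column, parent in FOREIGN_KEYS:
            for row in TABLES[child]:
                if row[column] in subset[parent] and row["id"] not in subset[child]:
                    subset[child].add(row["id"])
                    changed = True

    # Step 3: go downstream. Add every row that a collected row refers to,
    # and repeat until all references are satisfied.
    changed = True
    while changed:
        changed = False
        for child, column, parent in FOREIGN_KEYS:
            for row in TABLES[child]:
                if row["id"] in subset[child] and row[column] not in subset[parent]:
                    subset[parent].add(row[column])
                    changed = True

    return {table: sorted(ids) for table, ids in subset.items()}


subset = build_subset("events", [1])
```

Starting from event 1, the sketch collects attendees 1 and 2 in the upstream step, then host 1, venue 1, and company 1 in the downstream step. Events 2 and 3 stay out of the subset even though event 2 shares a venue with event 1.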
The data relationships can also affect the number of target table rows, and in some cases cause circular dependencies.
For example, if the Venues table contains a FirstEvent column that is a foreign key to the Events table, then in addition to the original set of Events rows based on the target table percentage configuration, the subset would also include additional Events rows based on the FirstEvent column for the original event venues. For those additional events, Structural would then retrieve the attendees, venues, hosts, and companies. For any additional venues, it would retrieve the events that their FirstEvent columns refer to, which starts the process again.
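A circular dependency like this can be handled by repeating the lookup until no new rows appear. The following Python sketch shows only that idea, for the Events and Venues cycle. The data, the column names, and the close_over_cycle function are hypothetical, and the sketch leaves out attendees, hosts, and companies.

```python
# Events refer to venues, and venues refer back to events through first_event_id.
EVENTS = {
    1: {"venue_id": 10},
    2: {"venue_id": 10},
    3: {"venue_id": 20},
    4: {"venue_id": 20},
}
VENUES = {
    10: {"first_event_id": 3},
    20: {"first_event_id": 4},
}


def close_over_cycle(seed_events):
    """Expand the seed events until the events and venues stop growing."""
    events, venues = set(seed_events), set()
    while True:
        # Venues that the collected events refer to.
        new_venues = {EVENTS[e]["venue_id"] for e in events} - venues
        venues |= new_venues
        # Events that the collected venues refer to through first_event_id.
        new_events = {VENUES[v]["first_event_id"] for v in venues} - events
        events |= new_events
        if not new_venues and not new_events:
            return sorted(events), sorted(venues)


events, venues = close_over_cycle([1])
```

Seeding with event 1 pulls venue 10, whose first event is event 3, which pulls venue 20, whose first event is event 4. The loop ends because each pass can only add rows and the tables are finite. Event 2 is never referenced, so it stays out.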
Other notes about retrieving subset data
Here are some items to be aware of, to help explain why a subset might contain either more or less data than you expect.
For hints and tips on how to reduce the size of a subset, and improve performance, go to Other subsetting hints and tips.
Subset percentage of rows is not the target table percentage of rows
When you configure a target table to use a percentage of the target table rows, the percentage only applies to the rows in the target table. It does not mean that the entire subset consists of the same percentage of rows from the entire database.
To continue the example from How Structural creates a subset, assuming the total number of rows in each table is as follows:
Events: 200
Attendees: 300
Venues: 7
Hosts: 50
Companies: 35
States: 50
That makes a total of 642 rows in the database.
Based on the subset configuration (5% of the rows in the Events target table), and the relationships with the other tables, the number of rows in the subset from each table is:
Events: 10 (5% of 200)
Attendees: 150 (50% of 300)
Venues: 5 (71% of 7)
Hosts: 10 (20% of 50)
Companies: 10 (29% of 35)
States: 50 (100% of 50)
The subset contains a total of 235 rows out of the original 642.
That means that even though the target table is configured to retrieve 5% of the target table rows, the subset actually contains around 36% of the rows in the database.
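The arithmetic behind these figures can be checked with a short Python calculation. The row counts are the ones from the example, and the variable names are only for illustration.

```python
source_rows = {
    "Events": 200, "Attendees": 300, "Venues": 7,
    "Hosts": 50, "Companies": 35, "States": 50,
}
subset_rows = {
    "Events": 10, "Attendees": 150, "Venues": 5,
    "Hosts": 10, "Companies": 10, "States": 50,
}

# Share of each table that ends up in the subset, as a whole-number percentage.
per_table = {
    table: round(100 * subset_rows[table] / source_rows[table])
    for table in source_rows
}

total_source = sum(source_rows.values())
total_subset = sum(subset_rows.values())
overall_percentage = 100 * total_subset / total_source

for table, share in per_table.items():
    print(f"{table}: {subset_rows[table]} of {source_rows[table]} rows ({share}%)")
print(f"Overall: {total_subset} of {total_source} rows ({overall_percentage:.1f}%)")
```

The overall share works out to 235 / 642, or roughly 36.6%, even though only 5% of the target table was requested.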
Percentage of rows is not the percentage of data size
The percentage of rows is based on the count of rows. It is not tied to the size of the data.
To continue our example, while the subset contains 36% of the count of rows from the source database (235 out of 642), the subset data volume is not necessarily 36% of the original data volume.
The data volume of the subset depends on the data that the subset rows contain and the size in bytes of the rows. The subset data volume might be much smaller than the original data volume. However, the subset also might not significantly reduce the size of the original data.
Target tables related to other target tables
If there are multiple target tables, and the tables are related to each other, Structural takes the union of the required data for both the target table configuration and the table relationships.
For example, table A contains a foreign key column that refers to table B. You configure both tables as target tables. For table B, Structural pulls both the directly targeted set of records, and the records that the targeted table A records refer to.
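The union can be expressed with set operations. In this minimal Python sketch, the rows of table A, the b_id column, and the targeted key sets are made-up values that only illustrate the rule.

```python
# Table A rows, keyed by primary key. b_id is a foreign key to table B.
TABLE_A = {
    1: {"b_id": 7},
    2: {"b_id": 8},
    3: {"b_id": 9},
}

targeted_a = {1, 2}   # rows selected by the target configuration for table A
targeted_b = {9, 10}  # rows selected by the target configuration for table B

# Table B rows that the targeted table A rows refer to.
referenced_b = {TABLE_A[a_id]["b_id"] for a_id in targeted_a}

# Table B contributes the union of both sets.
subset_b = targeted_b | referenced_b
```

Table B ends up with rows 7, 8, 9, and 10: two from its own target configuration and two because targeted rows in table A refer to them.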
Tables that are upstream of multiple target tables
If a table is upstream of multiple target tables, then Structural only pulls records from that table that contain references to targeted records in all of the target tables.
For example, in related table Child1, column1 is a foreign key that refers to a primary key in target table Parent1. column2 is a foreign key that refers to a primary key in target table Parent2.
If column1 and column2 both refer to targeted records in Parent1 and Parent2, then that Child1 record is included in the subset.
If only one of those columns refers to a targeted record in Parent1 or Parent2, then that Child1 record is not included.
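This rule amounts to a logical AND across the foreign key columns. The Python sketch below applies it to a few made-up Child1 rows. The row values and the targeted key sets are assumptions for illustration.

```python
CHILD1 = [
    {"id": 1, "column1": 1, "column2": 1},  # both parents are targeted
    {"id": 2, "column1": 1, "column2": 2},  # only the Parent1 record is targeted
    {"id": 3, "column1": 2, "column2": 2},  # neither parent is targeted
]

targeted_parent1 = {1}  # targeted primary keys in Parent1
targeted_parent2 = {1}  # targeted primary keys in Parent2

# A Child1 row is kept only when both of its references are targeted.
included = [
    row["id"]
    for row in CHILD1
    if row["column1"] in targeted_parent1 and row["column2"] in targeted_parent2
]
```

Only row 1 is included. Row 2 is left out even though it refers to a targeted Parent1 record, because its Parent2 reference is not targeted.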