Other subsetting hints and tips
If you are having trouble with your subsetting configuration or results, the following hints and tips might be helpful.
Remember that subsetting is an iterative process. It can take multiple attempts to get the exact subset of data that you want.
Start with a very small subset
One way to troubleshoot a subsetting configuration is to start with a very small subset. With a small subset, it is easier to verify that the data generation returns the data you are expecting.
For example, start with a single target table, and use a query to limit the subset to a single record. Add the necessary lookup tables.
After you verify the results in this very small subset, you can gradually increase the subset size in subsequent iterations. Verify the results after each iteration, and continue to increase the subset size until you reach the full subset.
For each iteration, verify that:
The subset includes the target table records.
The subset includes all of the lookup table data.
The subset includes all of the related records - records that each target table record refers to, and records that refer to the target table record.
Minimize the number of target tables
An ideal subset contains a small number of target tables. A target table typically contains an important object that is well connected to everything else in the source data.
Tables that contain static values that are used in multiple other tables should be lookup tables, not target tables.
For other tables that are related to the target table, allow the subsetting process to identify the necessary rows to include in the subset based on the foreign key configuration. Do not add them as target tables.
Make circular foreign key columns NULLable
If your source data includes circular foreign keys, make sure that at least one of those columns is NULL
able.
Improve subsetting performance
Note that subsetting performance does not improve in a linear fashion.
For example, if you subset 10% of a target table, the generation does not take 10% of the time of a standard de-identification run. It might take 90% of the time.
To improve performance:
Make sure to properly configure all relevant tables as lookup tables. Lookup tables are automatically copied to the subset in their entirety, and do not require processing to identify which rows to include. Structural also does not look for records that are upstream of a lookup table.
For upstream tables, in particular large upstream tables, add indexes to virtual foreign key columns. If there are no indexes on the foreign key columns, subsetting can be much slower.
Alternatively, instead of adding indexes, assign Truncate mode to the upstream tables.
Prune related subset tables
Here are some other options to further reduce the size of the subset. These options reduce the amount of related data for the included target table records.
Truncate related tables
You can truncate related tables in a subset, which removes that entire section of the subset. When you truncate a related table, Structural removes both that table and any tables that are included because of that table.
For example, in this subsetting graph, the products
table is a related table. For the records in products
, the subset also includes related records from wholesale_orders
, vendors
, and wo_date
.
After you truncate the products
table, the subsetting graph indicates that during the next data generation, the products
table and its connected tables will all be removed from the subset.
Note that for a related table that contains primary keys that are used to populate a foreign key column in an included table, you can only truncate the related table if the foreign key column is nullable.
For example, the product_id
primary key column in the products
table populates the purchased_product_id
foreign key column in the transactions
table. The transactions
table is included in the subset. If purchased_product_id
is not nullable, then you cannot truncate products
.
Exclude data from upstream tables
Upstream tables contain related data that is not required for referential integrity. To completely remove the table, you can truncate it.
You can also configure a filter for an upstream table that is based on either a date column or a WHERE
clause. When you filter the table, the subset only includes related records that match the filter.
To keep the upstream table, but exclude all of the related records, create a WHERE
clause filter where the WHERE
clause evaluates to false
. For many data connectors, you can directly set the WHERE
clause text to false
. Otherwise, use an obviously false statement such as 1=0
.
When you exclude all of the records in an upstream table, the subset also excludes all records from its related tables.
For example, in this subsetting graph, before you add an upstream filter to the wholesale_orders
table, the subset includes records from wholesale_orders
, vendors
, and wo_date
.
After you add a WHERE
clause filter that is set to false
to the wholesale_orders
table, the subset includes zero records from wholesale_orders
, vendors
, and wo_date
.
Last updated