Other subsetting hints and tips

If you are having trouble with your subsetting configuration or results, the following hints and tips might be helpful.

Remember that subsetting is an iterative process. It can take multiple attempts to get the exact subset of data that you want.

Start with a very small subset

One way to troubleshoot a subsetting configuration is to start with a very small subset. With a small subset, it is easier to verify that the data generation returns the data you are expecting.

For example, start with a single target table, and use a query to limit the subset to a single record. Add the necessary lookup tables.

After you verify the results in this very small subset, gradually increase the subset size in subsequent iterations. Verify the results after each iteration. Continue to increase the subset size until you reach the full subset.

For each iteration, verify that:

  • The subset includes the target table records.

  • The subset includes all of the lookup table data.

  • The subset includes all of the related records - records that each target table record refers to, and records that refer to the target table record.

Minimize the number of target tables

An ideal subset contains a small number of target tables. A target table typically contains an important object that is well connected to everything else in the source data.

Tables that contain static values that are used in multiple other tables should be lookup tables, not target tables.

For other tables that are related to the target table, allow the subsetting process to identify the necessary rows to include in the subset based on the foreign key configuration. Do not add them as target tables.

Make circular foreign key columns NULLable

If your source data includes circular foreign keys, make sure that at least one of those columns is NULLable.

Improve subsetting performance

Note that subsetting performance does not improve in a linear fashion.

For example, if you subset 10% of a target table, the generation does not take 10% of the time of a standard de-identification run. It might take 90% of the time.

To improve performance:

  • Make sure to properly configure all relevant tables as lookup tables. Lookup tables are automatically copied to the subset in their entirety, and do not require processing to identify the rows to include. Structural also does not look for records that are upstream of a lookup table.

  • For upstream tables, in particular large upstream tables, add indexes to virtual foreign key columns. If there are no indexes on the foreign key columns, subsetting can be much slower.

  • Alternatively, instead of adding indexes, assign Truncate mode to the upstream tables.

Prune related subset tables

Here are some other options to further reduce the size of the subset. These options reduce the amount of related data for the included target table records.

You can truncate related tables in a subset, which removes that entire section of the subset. When you truncate a related table, Structural removes both that table and any tables that are included because of that table.

For example, in this subsetting graph, the products table is a related table. For the records in products, the subset also includes related records from wholesale_orders, vendors, and wo_date.

After you truncate the products table, the subsetting graph indicates that during the next data generation, the products table and its connected tables will all be removed from the subset.

Note that for a related table that contains primary keys that are used to populate a foreign key column in an included table, you can only truncate the related table if the foreign key column is nullable.

For example, the product_id primary key column in the products table populates the purchased_product_id foreign key column in the transactions table. The transactions table is included in the subset. If purchased_product_id is not nullable, then you cannot truncate products.

Exclude data from upstream tables

Upstream tables contain related data that is not required for referential integrity. To completely remove the table, you can truncate it.

You can also configure a filter for an upstream table that is based on either a date column or a WHERE clause. When you filter the table, the subset only includes related records that match the filter.

To keep the upstream table, but exclude all of the related records, create a WHERE clause filter where the WHERE clause evaluates to false. For many data connectors, you can directly set the WHERE clause text to false. Otherwise, use an obviously false statement such as 1=0.

When you exclude all of the records in an upstream table, the subset also excludes all records from its related tables.

For example, in this subsetting graph, before you add an upstream filter to the wholesale_orders table, the subset includes records from wholesale_orders, vendors, and wo_date.

After you add a WHERE clause filter that is set to false to the wholesale_orders table, the subset includes zero records from wholesale_orders, vendors, and wo_date.

Last updated