Other subsetting hints and tips

If you are having trouble with your subsetting configuration or results, the following hints and tips might be helpful.

Remember that subsetting is an iterative process. It can take multiple attempts to get the exact subset of data that you want.

Start with a very small subset

One way to troubleshoot a subsetting configuration is to start with a very small subset. With a small subset, it is easier to verify that the data generation returns the data you are expecting.

For example, start with a single target table, and use a query to limit the subset to a single record. Add the necessary lookup tables.

After you verify the results in this very small subset, you can gradually increase the subset size in subsequent iterations. Verify the results after each iteration, and continue to increase the subset size until you reach the full subset.

For each iteration, verify that:

  • The subset includes the target table records.

  • The subset includes all of the lookup table data.

  • The subset includes all of the related records - records that each target table record refers to, and records that refer to the target table record.

Minimize the number of target tables

An ideal subset contains a small number of target tables. A target table typically contains an important object that is well connected to everything else in the source data.

Tables that contain static values that are used in multiple other tables should be lookup tables, not target tables.

For other tables that are related to the target table, allow the subsetting process to identify the necessary rows to include in the subset based on the foreign key configuration. Do not add them as target tables.

Make circular foreign key columns NULLable

If your source data includes circular foreign keys, make sure that at least one of those columns is NULLable.

Improve subsetting performance

Note that subsetting performance does not improve in a linear fashion.

For example, if you subset 10% of a target table, the generation does not take 10% of the time of a standard de-identification run. It might take 90% of the time.

To improve performance:

  • Make sure to properly configure all relevant tables as lookup tables. Lookup tables are automatically copied to the subset in their entirety, and do not require processing to identify which rows to include. Structural also does not look for records that are upstream of a lookup table.

  • For upstream tables, in particular large upstream tables, add indexes to virtual foreign key columns. If there are no indexes on the foreign key columns, subsetting can be much slower.

  • Alternatively, instead of adding indexes, assign Truncate mode to the upstream tables.

Last updated