Subsetting

Background on Subsetting

Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of all transactions or pull all data associated with customers who live in California. If you do this naively, your database will break foreign key constraints or you’ll end up with a statistically non-representative data sample. Here are a few situations in which you might find subsetting data to be important or necessary:

  1. You’d like to use your production database in staging or test environments (sans the PII) but the database is very large so you want to use only a portion of it.

  2. You’d like a test database that contains a few specific rows from production (and related rows from other tables) so you can reproduce a bug.

  3. You want to share data with others but you don’t want them to have all of it.

You can learn more about our approach to subsetting in these two technical blog posts here and here. Our commercial subsetter included in the Tonic platform is a highly optimized and more performant version of our open source subsetter: https://github.com/TonicAI/condenser.

Subsetting Configuration

When subsetting is enabled the database is subsetted before any table or column transformations are applied. All subsetting occurs before data is transferred to the destination database to ensure that no sensitive data is leaked.

In order to subset, your source database needs either foreign key constraints or you need to manually defined relationships per the instructions in the Workspace section

  1. Subsetting must first be enabled

  2. Specify your target table(s) and either a Percentage or Where Clause to limit the records from this table that you want to keep in the output. You can pick as many targets are you want, Tonic will automatically figure out which rows from other tables need to be included to hit these targets.

  3. A list of all tables that will have data included in the output subset based on foreign key constraints

  4. A list of all tables that will be truncated in the output subset

Subsetting does not drop excluded tables from the output schema. The entire schema is copied to the output regardless of what data is included in the subset.