Subsetting data

About subsetting

Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of all transactions or pull all data associated with customers who live in California. If you do this naively, your database will break foreign key constraints or you’ll end up with a statistically non-representative data sample. Here are a few situations in which you might find subsetting data to be important or necessary:
  1. 1.
    You’d like to use your production database in staging or test environments (sans the PII) but the database is very large so you want to use only a portion of it.
  2. 2.
    You’d like a test database that contains a few specific rows from production (and related rows from other tables) so you can reproduce a bug.
  3. 3.
    You want to share data with others but you don’t want them to have all of it. A common scenario is providing developers an anonymized subset which also enables them to run the test database locally on their own machines.
You can learn more about our approach to subsetting in these two technical blog posts here and here.

Subsetting prerequisites

You’ve identified the need to subset so let’s get you set up with the following four prerequisites:
  1. 1.
    Subsetting is enabled. This can be toggled on or off. See 1 in Figure 1.
  2. 2.
    Target Tables are defined with an appropriate condition to get you the desired output dataset. You can select any of the masked tables from the dropdown and specify a percentage or a custom where clause to determine which rows will be part of the output. It's often useful to think of your most important object that's well connected to everything else — this could be a users table, transactions table, claims table, etc. See 3 in Figure 1 .
  3. 3.
    Reference tables should also be specified at this point, but is optional. Reference tables will be included in the subset in its entirety and will not proceed to grab additional data as part of the subset. For example, you could have a table that represents states in the US — one row per state. When coming across the states table, setting it as a reference table will ensure all 50 states are pulled. Otherwise, if you were to say subset on users in CA, it would only pull the state of California. See 5 in Figure 1.
  4. 4.
    Foreign Keys are set up directly within the source database or created within Tonic, otherwise Tonic would not know how to crawl the relationships of your data. Setting this properly will allow Tonic to select the necessary rows from other tables - ultimately ensuring referential integrity. In situations where a cyclic relationship is present, Tonic will break the cyclic dependency by setting a NULLable column to NULL to avoid grabbing more data than necessary.

Subsetting process

Once subsetting has been enabled and the target tables, reference tables, and foreign keys have been defined - you’re now ready to start a subsetting job. Below are the steps Tonic takes in order to produce a representative sample of the source database:
  1. 1.
    Tonic forms the basis of the dataset by grabbing data from the target tables based on how you chose to filter them (percentage or custom where clause). It will then traverse your database based on the relationships originating from those target tables. Eg. We may want to target 50% of the Events table in Figure 2.
  2. 2.
    Once the basis has been built, Tonic will traverse across the database. Tonic begins by going upstream. The upstream process involves Tonic traversing through tables that have a foreign key relationship pointing towards or referencing a target table based on the data collected in step 1. This step continues until there are no remaining upstream tables to process. Eg. The Data table is upstream of the Events table.
  3. 3.
    Tonic continues the subsetting process by going downstream. Tonic will traverse downwards to grab data that is referenced by the data collected in step 2. Eg. The Facts, Figs, and Users table would be downstream of the Events table.
  4. 4.
    During the downstream phase, Tonic will consider both upstream and downstream tables to subset in order to ensure every connected table is included in the subset. Eg. The downstream process would continue to the Meta table to complete the subsetting job.
Figure 2: A schema depicting foreign key relationships between tables (i.e. Events has an FK to the Users table)
For further visibility, you can preview the steps in the Subset Preview found at the bottom of the page. You’ll see the following details:
  • In Subset: A list of all tables that will have data included in the output subset based on foreign key constraints.
  • Out of Subset: A list of all tables that will be empty in the output subset. If you do not want the tables empty and instead want Tonic to process them based on their set table mode and configuration, you can toggle the ability to process tables out of subset. See 4 from the screenshot above.
  • Subset Steps: The steps that Tonic will take as part of the subsetting process. It will provide you with which tables are processed when going upstream and/or downstream, and what table relationships are determining each step.

Target where clause

If you would like to select a custom set of rows to be the core of your subset, use a target where clause for your target table. For example, if your target table has a column called event_id and you would like to select all rows where event_id is greater than 1000, you would input the following where clause:
event_id > 1000
When subsetting is enabled, the database is subsetted before any table or column transformations are applied. All subsetting occurs before the data is transferred to the destination database to ensure that no sensitive data is leaked.