Search…
⌃K
Links

Configuring subsetting

Identifying and configuring target tables

A target table is a table for which you specify a subset of the data to include in the destination database.

Options to specify the data subset

To identify the subset of data to include, you can either:
  • Specify a percentage of the table to include in the destination database. You can use this option when you care about the specific volume of data, but not the specific rows. Tonic converts the percentage to a filter or a WHERE clause, depending on your database type. Depending on how your tables are related, the target tables in the final subset might contain more rows than the percentage that you specified. These additional rows are required to maintain referential integrity. To view the tables that contribute to the additional rows, see the subset steps. For additional assistance, reach out to your Tonic contact.
Configuration for a percentage target table
  • Provide a WHERE clause to specify the subset of data to include in the destination database. The WHERE clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.
Configuration for a where clause target table

Configuring a percentage target table

To identify and configure subsetting for a target table:
  1. 1.
    In the table list, click the table.
  2. 2.
    On the table details panel, from the Select Table Type dropdown list, select Target Table (Percentage).
  3. 3.
    In the Target Percentage field, type the percentage of the data to include in the destination database. The default is 5, which indicates to use 5% of the rows in the table.

Configuring a WHERE clause target table

  1. 1.
    In the table list, click the table.
  2. 2.
    On the table details panel, from the Select Table Type dropdown list, select Target Table (Where Clause).
  3. 3.
    In the Target Where Clause field, type the WHERE clause to use to identify the data to include in the destination database. For example, the target table contains a column called event_id. To select all rows where event_id is greater than 1000, add the following WHERE clause: event_id > 1000
  4. 4.
    For a more complex WHERE clause, you can display an editor with a larger text area.
    1. 1.
      Click Open in Editor.
    2. 2.
      In the text area, enter the WHERE clause.
    3. 3.
      Click Save.
You can query across tables within the WHERE clause. For example, you configure the customers table as a target table, but you also want to use information from the customers_legacy table to identify the target records in customers.
In the following example query, the matching records in customers have a Customer_Key value that matches a CustomerKey value in customers_legacy, and where the value of Occupation in customers_legacy is Detective:
"Customer_Key" IN (
SELECT "Customer_Key"
FROM customers_legacy
WHERE customers_legacy."Customer_Key"=customers."Customer_Key"
AND customers_legacy."Occupation" = 'Detective'
);

Removing a target table

To remove a target table:
  1. 1.
    In the table list, click the table.
  2. 2.
    On the table details panel, from the table type dropdown list, select Remove.

Identifying lookup tables

A lookup table contains a list of values that are used to populate columns in other tables. For example, a list of states, countries, or currencies. Lookup tables are sometimes referred to as reference tables.
Tonic always copies lookup tables to the destination database in their entirety. If you do not configure a table as a lookup table, then Tonic treats the table as a related table, and copies only rows that are used in the subset data. Tonic also pulls in rows from other tables that refer to the table, but that are not necessarily related to the target tables. This could result in an unexpectedly large subset.
For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.

Properties of a lookup table

Here are some typical properties of a lookup table:
  • It is fairly small and rarely updated.
  • Many tables point to the table, but it does not point to another table.
  • The table contains a set of unique values.

Identifying an individual lookup table

To identify an individual table as a lookup table:
  1. 1.
    In the table list, click the table.
  2. 2.
    On the table details panel, from the Select Table Type dropdown list, select Lookup Table.

Identifying multiple lookup tables

To identify multiple tables as lookup tables:
  1. 1.
    Check the check box for each table to identify as a lookup table.
  2. 2.
    From the Actions dropdown list, select Add Lookup Tables.
Actions menu for selected rows

Removing lookup tables

To remove the lookup designation for a table:
  1. 1.
    In the table list, click the table.
  2. 2.
    On the table details panel, from the dropdown list, select Remove.

Filtering upstream records

Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.
To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:
  • Use a date column to specify an amount of time before the current date for which to include records. For example, you can only include records for which the update date is one week before the current date.
  • Use a WHERE clause to identify the records to include.
You can filter a table that contains both upstream and downstream records. However, the filter only applies to the upstream records.
In the table list, when an upstream table is filtered, a Filtered icon displays.

Filtering by date

The date filter allows you to filter upstream records based on the value of date-based column.
Filter Table fields for date filter
To filter an upstream table by date:
  1. 1.
    In the table list, click the table. On the table details panel, under Filter Table, Select table filter is set by default to None, which indicates that the table is not filtered.
  2. 2.
    From Select table filter, select Filter By Date Column.
  3. 3.
    From the Date Column dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.
  4. 4.
    Under Get data from the last, from the time unit dropdown list, select the unit of time to use for the filter. You can filter records based on their age in days, weeks, months, or years.
  5. 5.
    In the field, enter the number of the selected unit before the current date for which to include the upstream records. For example, you select days as the unit, and set the number to 4. Tonic then pulls related records for which the date column value is up to 4 days before the current date.

Filtering with a WHERE clause

You can also filter the upstream records using a WHERE clause.
Filter Table fields for WHERE clause
To filter the upstream records using a WHERE clause:
  1. 1.
    In the table list, click the table. On the table details panel, under Filter Table, Select table filter is set by default to None, which indicates that the table is not filtered.
  2. 2.
    From Select table filter, select Filter by Where Clause.
  3. 3.
    In the Where Clause text area, enter the WHERE clause to use to filter the related records.
  4. 4.
    For a more complex WHERE clause, you can display an editor with a larger text area.
    1. 1.
      Click Open in Editor.
    2. 2.
      In the text area, enter the WHERE clause.
    3. 3.
      Click Save.
  5. 5.
    To copy the WHERE clause to the clipboard, click Copy To Clipboard.

Removing a filter

To remove an upstream filter, from Select table filter, select None.

Determining how to process tables that are not in the subset

Tables other than target tables, lookup tables, or related tables are not in the subset.
The subsetting configuration includes how to copy all of these tables to the destination database.
You can either:
  • Use the table modes that are assigned to the out-of-subset tables.
  • Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.
On the Configuration tab, you use the Process tables that are out of subset toggle to determine how to handle these tables. After you run subsetting data generation, the toggle is on the Options tab.
Configuration for how to process tables that are not in the subset
By default, the setting is turned off, and Tonic truncates the out-of-subset tables.
To use the assigned table mode to process each table, toggle the setting to the on position.

Determining whether to use subsetting during data generation

If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.
On the Configuration tab, the Use Subsetting toggle indicates whether to generate a subset. After you run subsetting data generation, the Use Subsetting toggle is on the Options tab.
By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination data dataset.
To instead generate a subset, toggle Use subsetting to the on position.
When you run a data generation job, you are also prompted to confirm whether to generate the entire dataset or a subset. These two toggles are synchronized. If you turn on the Use Subsetting toggle on the Configuration tab, then it is on by default on the generation confirmation panel.

Enabling parallel processing for subsetting

You can sometimes use parallel processing to improve the performance of the subsetting process. Parallel processing allows multiple subsetting steps to be processed at the same time. The steps cannot rely on the output of other steps that are processed in parallel.
To enable parallel processing for subsetting, set the environment variable TONIC_TABLE_PARALLELISM to a number greater than 1 (the default). This setting determines the maximum number of subsetting steps that Tonic can process in parallel. For regular data generation, it also determines the number of tables that Tonic operates on at the same time.
The effect of subsetting parallelism on performance depends on your subsetting configuration, the layout of your schema, the performance characteristics of the machine that runs Tonic, and the performance characteristics of your databases.
We recommend that you start with a relatively small number such as 4, and then run a data generation job to see how it affects performance. If performance improves, you can increase the number incrementally until the performance no longer improves.
The environment variable only controls the maximum number of steps that can be processed in parallel. Performance should not degrade if your system cannot support parallelism or won't benefit from using it.
If you have any other questions, contact [email protected]