Configuring subsetting

Identifying and configuring target tables

A target table is a table for which you specify a subset of the data to include in the destination database.

Options to specify the data subset

To identify the subset of data to include, you can either:

  • Specify a percentage of the table to include in the destination database. You can use this option when you care about the specific volume of data, but not the specific rows. Tonic Structural converts the percentage to a filter or a WHERE clause, depending on your database type. Depending on how your tables are related, the target tables in the final subset might contain more rows than the percentage that you specified. These additional rows are required to maintain referential integrity. To view the tables that contribute to the additional rows, see the subset steps. For additional assistance, reach out to your Tonic.ai contact.

  • Provide a WHERE clause to specify the subset of data to include in the destination database. The WHERE clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.

To combine a specified set of records with a random set of the remaining records, use a WHERE clause. For example, to get all users that are from Alabama, and 5 percent of the other records, use the following WHERE clause:

state = "Alabama" OR random() < 0.05 

Configuring a percentage target table

To identify and configure subsetting for a target table:

  1. In Table View or Graph View, click the table.

  2. On the table details panel, from the Select Table Type dropdown list, select Target Table (Percentage).

  3. In the Target Percentage field, type the percentage of the data to include in the destination database. The default is 5, which indicates to use 5% of the rows in the table. You can specify a decimal value, including a value that is less than 1. For example, you might configure the subset to include .5 percent of the rows, or 33.33 percent of the rows.

Configuring a WHERE clause target table

  1. In Table View or Graph View, click the table.

  2. On the table details panel, from the Select Table Type dropdown list, select Target Table (Where Clause).

  3. In the Target Where Clause field, type the WHERE clause to use to identify the data to include in the destination database. For example, the target table contains a column called event_id. To select all rows where event_id is greater than 1000, add the following WHERE clause: event_id > 1000

  4. For a more complex WHERE clause, you can display an editor with a larger text area.

    1. Click Open in Editor.

    2. In the text area, enter the WHERE clause.

    3. Click Save.

You can query across tables within the WHERE clause. For example, you configure the customers table as a target table, but you also want to use information from the customers_legacy table to identify the target records in customers.

In the following example query, the matching records in customers have a Customer_Key value that matches a CustomerKey value in customers_legacy, and where the value of Occupation in customers_legacy is Detective:

"Customer_Key" IN (
    SELECT "Customer_Key"
    FROM customers_legacy
    WHERE customers_legacy."Customer_Key"=customers."Customer_Key"
    AND customers_legacy."Occupation" = 'Detective'
)

You can also create a query that selects a random percentage of a specified set of data.

For example, in PostgreSQL, to select 50% of the records that have an identifier that is divisible by 3, you could use the following WHERE clause:

id % 3 = 0 and random() < 0.5
order by id

Removing a target table

To remove a target table:

  1. In Table View or Graph View, click the table.

  2. On the table details panel, from the table type dropdown list, select Remove.

Identifying lookup tables

A lookup table contains a list of values that are used to populate columns in other tables. For example, a list of states, countries, or currencies. Lookup tables are sometimes referred to as reference tables.

Structural always copies lookup tables to the destination database in their entirety. If you do not configure a table as a lookup table, then Structural treats the table as a related table, and copies only rows that are used in the subset data. Structural also pulls in rows from other tables that refer to the table, but that are not necessarily related to the target tables. This could result in an unexpectedly large subset.

For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.

Properties of a lookup table

Here are some typical properties of a lookup table:

  • It is fairly small and rarely updated.

  • Many tables point to the table, but it does not point to another table.

  • The table contains a set of unique values.

Identifying an individual lookup table

To identify an individual table as a lookup table:

  1. In Table View or Graph View, click the table.

  2. On the table details panel, from the Select Table Type dropdown list, select Lookup Table.

Identifying multiple lookup tables

To identify multiple tables as lookup tables:

  1. On Table View, check the checkbox for each table to identify as a lookup table.

  2. From the Actions dropdown list, select Add Lookup Tables.

Removing lookup tables

To remove the lookup designation for a table:

  1. In Table View or Graph View, click the table.

  2. On the table details panel, from the dropdown list, select Remove.

Filtering optional records

Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.

To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:

  • Use a date column to specify an amount of time before the current date for which to include records. For example, you can only include records for which the update date is one week before the current date.

  • Use a WHERE clause to identify the records to include.

You can filter a table that contains both upstream and downstream records. However, the filter only applies to the optional upstream records.

In the table list, when an upstream table is filtered, a Filtered icon displays.

Filtering by date

The date filter allows you to filter optional records based on the value of a date-based column.

To filter an upstream table by date:

  1. In Table View or Graph View, click the table. On the table details panel, under Filter Table, Select table filter is set by default to None, which indicates that the table is not filtered.

  2. From Select table filter, select Filter By Date Column.

  3. From the Date Column dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.

  4. Under Get data from the last, from the time unit dropdown list, select the unit of time to use for the filter. You can filter records based on their age in days, weeks, months, or years.

  5. In the field, enter the number of the selected unit before the current date for which to include the upstream records. For example, you select days as the unit, and set the number to 4. Structural then pulls related records for which the date column value is up to 4 days before the current date.

Filtering with a WHERE clause

You can also filter the upstream records using a WHERE clause.

To filter the upstream records using a WHERE clause:

  1. In Table View or Graph View, click the table. On the table details panel, under Filter Table, Select table filter is set by default to None, which indicates that the table is not filtered.

  2. From Select table filter, select Filter by Where Clause.

  3. In the Where Clause text area, enter the WHERE clause to use to filter the related records.

  4. For a more complex WHERE clause, you can display an editor with a larger text area.

    1. Click Open in Editor.

    2. In the text area, enter the WHERE clause.

    3. Click Save.

  5. To copy the WHERE clause to the clipboard, click Copy To Clipboard.

Removing a filter

To remove an upstream filter, from Select table filter, select None.

Identifying configuration changes since the most recent subsetting run

As you make changes to the subsetting configuration, Table View and Graph View indicates how the changes affect the next run of the subsetting generation when compared to the most recent subsetting generation.

When a table's inclusion in the subset is affected, on Graph View, a colored marker is added to the bottom of the table box.

On Table View, a colored icon displays next to the table. A tooltip indicates the type of change.

The possible types of changes are:

  • Added to the subset. For example:

    • A new target table

    • A table that is newly included because it is related to a new target table

    • A new lookup table

  • Removed from the subset. For example:

    • A removed target table

    • A table that is removed because it is related to a removed target table

    • A removed lookup table

  • Modified in the subset. This usually reflects a change to a target table configuration. You might:

    • Change the type of target table (percentage or WHERE clause)

    • Change the percentage

    • Change the WHERE clause

    • Change the upstream filter

When you run a subsetting generation, Tonic clears the markers.

Determining how to process tables that are not in the subset

Tables other than target tables, lookup tables, or related tables are not in the subset.

The subsetting configuration includes how to copy all of these tables to the destination database.

You can either:

  • Use the table modes that are assigned to the out-of-subset tables.

  • Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.

On Table View, on the Configuration tab, you use the Process tables that are out of subset toggle to determine how to handle these tables. After you run subsetting data generation, the toggle is on the Options tab.

By default, the setting is turned off, and Structural truncates the out-of-subset tables.

To use the assigned table mode to process each table, toggle the setting to the on position.

Determining whether to use subsetting during data generation

If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.

On Table View, on the Configuration tab, the Use Subsetting toggle indicates whether to generate a subset. After you run subsetting data generation, the Use Subsetting toggle is on the Options tab.

By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination data dataset.

To instead generate a subset, toggle Use subsetting to the on position.

When you run a data generation job, you are also prompted to confirm whether to generate the entire dataset or a subset. These two toggles are synchronized. If you turn on the Use Subsetting toggle on the Configuration tab, then it is on by default on the generation confirmation panel.

Enabling parallel processing for subsetting

You can sometimes use parallel processing to improve the performance of the subsetting process. Parallel processing allows multiple subsetting steps to be processed at the same time. The steps cannot rely on the output of other steps that are processed in parallel.

To enable parallel processing for subsetting, set the environment setting TONIC_TABLE_PARALLELISM to a number greater than 1 (the default). You can configure this setting from Tonic Settings. This setting determines the maximum number of subsetting steps that Structural can process in parallel. For regular data generation, it also determines the number of tables that Structural operates on at the same time.

The effect of subsetting parallelism on performance depends on your subsetting configuration, the layout of your schema, the performance characteristics of the machine that runs Structural, and the performance characteristics of your databases.

We recommend that you start with a relatively small number such as 4, and then run a data generation job to see how it affects performance. If performance improves, you can increase the number incrementally until the performance no longer improves.

The environment setting only controls the maximum number of steps that can be processed in parallel. Performance should not degrade if your system cannot support parallelism or won't benefit from using it.

If you have any other questions, contact support@tonic.ai.

Last updated