# Configuring subsetting

## Identifying and configuring target tables <a href="#subsetting-configure-target-tables" id="subsetting-configure-target-tables"></a>

A target table is a table for which you specify the initial set of rows to include in the destination database.

The subset also includes rows from related tables and lookup tables.

### Options to specify the data subset <a href="#target-table-subset-options" id="target-table-subset-options"></a>

To identify the subset of data to include, you can either:

* **Specify a percentage of the table to include in the destination database.**\
  \
  You can use this option when you care about the specific volume of data from the target table, but not the specific rows.\
  \
  Tonic Structural converts the percentage to a filter or a `WHERE` clause, depending on your database type.

![Configuration for a percentage target table](https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2F2XDXlXo5w4b8x4sc8wgv%2FSubsettingTargetTablePercentage.png?alt=media\&token=82389c1e-c107-4322-a97e-981d81412eb1)

* **Provide a `WHERE` clause to specify the subset of data to include in the destination database.**\
  \
  The `WHERE` clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.

![Configuration for a where clause target table](https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FbD6K4JZVfY2JmWzkNIiY%2FSubsettingTargetTableWhereClause.png?alt=media\&token=61f8e9fc-9a2f-4b06-aaa2-885b8eac034e)

To combine a specified set of records with a random set of the remaining records, use a `WHERE` clause. For example, to get all users that are from Alabama, and 5 percent of the other records, use the following `WHERE` clause:

```
state = "Alabama" OR random() < 0.05 
```

### Configuring a percentage target table <a href="#target-table-percentage" id="target-table-percentage"></a>

To identify and configure subsetting for a target table:

1. In **Table View** or **Graph View**, click the table.
2. On the table details panel, from the **Select Table Type** dropdown list, select **Target Table (Percentage)**.
3. In the **Target Percentage** field, type the percentage of the data to include in the destination database.\
   \
   The default is 5, which indicates to use 5% of the rows in the table.\
   \
   You can specify a decimal value, including a value that is less than 1. For example, you might configure the subset to include .5 percent of the rows, or 33.33 percent of the rows.

{% hint style="info" %}
Depending on how your tables are related, the target tables in the final subset might contain a larger number of rows than the percentage that you specified.

The additional rows are rows that are required to maintain referential integrity.

To view the tables that contribute to the additional rows, go to the [subset steps](https://docs.tonic.ai/app/generation/subsetting/subsetting-view-steps). For additional assistance, reach out to your Tonic.ai contact.
{% endhint %}

### Configuring a WHERE clause target table <a href="#target-table-where-clause" id="target-table-where-clause"></a>

1. In **Table View** or **Graph View**, click the table.
2. On the table details panel, from the **Select Table Type** dropdown list, select **Target Table (Where Clause)**.
3. In the **Target** **Where Clause** field, type the `WHERE` clause to use to identify the data to include in the destination database.\
   \
   For example, the target table contains a column called `event_id`. To select all rows where `event_id` is greater than 1000, add the following `WHERE` clause:\
   \
   `event_id > 1000`
4. For a more complex `WHERE` clause, you can display an editor with a larger text area.
   1. Click **Open in Editor**.
   2. In the text area, enter the `WHERE` clause.
   3. Click **Save**.

You can query across tables within the `WHERE` clause.\
\
For example, you configure the `customers` table as a target table, but you also want to use information from the `customers_legacy` table to identify the target records in `customers`.

In the following example query, the matching records in `customers` have a `Customer_Key` value that matches a `CustomerKey` value in `customers_legacy`, and where the value of `Occupation` in `customers_legacy` is `Detective`:

```
"Customer_Key" IN (
    SELECT "Customer_Key"
    FROM customers_legacy
    WHERE customers_legacy."Customer_Key"=customers."Customer_Key"
    AND customers_legacy."Occupation" = 'Detective'
)
```

You can also create a query that selects a random percentage of a specified set of data.

For example, in PostgreSQL, to select 50% of the records that have an identifier that is divisible by 3, you could use the following `WHERE` clause:

```
id % 3 = 0 and random() < 0.5
order by id
```

### Removing a target table <a href="#subset-config-target-table-remove" id="subset-config-target-table-remove"></a>

To remove a target table:

1. In **Table View** or **Graph View**, click the table.
2. On the table details panel, from the table type dropdown list, select **Remove**.

## Identifying lookup tables

A lookup table contains a list of values that are used to populate columns in other tables. For example, a list of states, countries, or currencies. Lookup tables are sometimes referred to as reference tables.

Structural always copies lookup tables to the destination database in their entirety. If you do not configure a table as a lookup table, then Structural treats the table as a related table, and copies only rows that are used in the subset data. Structural also pulls in rows from other tables that refer to the table, but that are not necessarily related to the target tables. This could result in an unexpectedly large subset.

For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.

### Properties of a lookup table <a href="#subset-config-lookup-table-properties" id="subset-config-lookup-table-properties"></a>

Here are some typical properties of a lookup table:

* It is fairly small and rarely updated.
* Many tables point to the table, but it does not point to another table.
* The table contains a set of unique values.

### Identifying an individual lookup table <a href="#subset-config-lookup-table-individual" id="subset-config-lookup-table-individual"></a>

To identify an individual table as a lookup table:

1. In **Table View** or **Graph View**, click the table.
2. On the table details panel, from the **Select Table Type** dropdown list, select **Lookup Table**.

### Identifying multiple lookup tables <a href="#subset-config-lookup-tables-multiple" id="subset-config-lookup-tables-multiple"></a>

To identify multiple tables as lookup tables:

1. On **Table View**, check the checkbox for each table to identify as a lookup table.
2. From the **Actions** dropdown list, select **Add Lookup Tables**.

![Actions menu for selected rows](https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FXTttSioHOujELwAu1AV4%2FSubsettingActionsMenu.png?alt=media\&token=d408e8b8-3e77-4bc3-87f7-5796302a6d42)

### Removing lookup tables <a href="#subset-config-lookup-tables-remove" id="subset-config-lookup-tables-remove"></a>

To remove the lookup designation for a table:

1. In **Table View** or **Graph View**, click the table.
2. On the table details panel, from the dropdown list, select **Remove**.

## Filtering optional records <a href="#subsetting-config-filter-upstream-records" id="subsetting-config-filter-upstream-records"></a>

{% hint style="info" %}
The Salesforce data connector does not support filtering optional records.
{% endhint %}

Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.

To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:

* Use a date column to specify an amount of time before the current date for which to include records.\
  \
  For example, you can only include records for which the update date is one week before the current date.
* Use a `WHERE` clause to identify the records to include.

You can filter a table that contains both upstream and downstream records. However, the filter only applies to the optional upstream records.

In the table list, when an upstream table is filtered, a **Filtered** icon displays.

### Filtering by date <a href="#subsetting-upstream-filter-by-date" id="subsetting-upstream-filter-by-date"></a>

The date filter allows you to filter optional records based on the value of a date-based column.

<figure><img src="https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FkKYwsuFCG48oARhKNuUF%2FSubsettingUpstreamFilteringDate.png?alt=media&#x26;token=fb11deb8-06db-425b-956b-b9cc44ccb00f" alt=""><figcaption><p>Filter Optional Tables for date column</p></figcaption></figure>

To filter an upstream table by date:

1. In **Table View** or **Graph View**, click the table.\
   \
   On the table details panel, under **Filter Optional Tables**, **Type** is set by default to **No Filter Applied**, which indicates that the table is not filtered.
2. From the **Type** dropdown list, select **Filter By Date Column**.
3. From the **Date Column** dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.
4. Under **Get data from the last**, from the time unit dropdown list, select the unit of time to use for the filter.\
   \
   You can filter records based on their age in days, weeks, months, or years.
5. In the field, enter the number of the selected unit before the current date for which to include the upstream records.\
   \
   For example, you select days as the unit, and set the number to 4. Structural then pulls related records for which the date column value is up to 4 days before the current date.

### Filtering with a WHERE clause <a href="#subset-config-upstream-filter-where-clause" id="subset-config-upstream-filter-where-clause"></a>

To filter the upstream records, you can also use a `WHERE` clause.

<figure><img src="https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FHLbycJF7hnsGJJ9SrZRW%2FSubsettingUpstreamFilteringWhereClause.png?alt=media&#x26;token=2f8e981b-82eb-4963-9ad7-f440d21d0306" alt=""><figcaption><p>Filter Optional Tables fields for WHERE clause</p></figcaption></figure>

To use a `WHERE` clause to filter the upstream records:

1. In **Table View** or **Graph View**, click the table.\
   \
   On the table details panel, under **Filter Optional Tables**, **Type** is set by default to **No Filter Applied**, which indicates that the table is not filtered.
2. From the **Type** dropdown list, select **Filter by Where Clause**.
3. In the **Where Clause** text area, enter the `WHERE` clause to use to filter the related records.

   \
   Note that if the `WHERE` clause evaluates to `false`, then Structural excludes all of the upstream records from the table.\
   \
   For most data connectors, to exclude all of the upstream records, you can simply set the text of the `WHERE` clause to `false`. Otherwise, use an obviously false statement such as `1=0`.
4. For a more complex `WHERE` clause, you can display an editor with a larger text area.
   1. Click **Open in Editor**.
   2. In the text area, enter the `WHERE` clause.
   3. Click **Save**.
5. To copy the `WHERE` clause to the clipboard, click **Copy To Clipboard**.

### Removing a filter <a href="#subset-config-upstream-filter-remove" id="subset-config-upstream-filter-remove"></a>

To remove an upstream filter, from the **Type** dropdown list, select **No filter applied**.

## Identifying configuration changes since the most recent subsetting run <a href="#subsetting-config-identify-changes-since-last-run" id="subsetting-config-identify-changes-since-last-run"></a>

As you make changes to the subsetting configuration, **Table View** and **Graph View** indicate how the changes affect the next run of the subsetting generation when compared to the most recent subsetting generation.

When a table's inclusion in the subset is affected, on **Graph View**, a colored marker is added to the bottom of the table box.

<figure><img src="https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FQN7HPvUANsf8RNPXnbMH%2FSubsettingGraphViewChangeMarker.png?alt=media&#x26;token=7703d7b6-3bcb-4968-a272-90bb5d4c8458" alt=""><figcaption><p>Graph View table box<br>marked as changed</p></figcaption></figure>

On **Table View**, a colored icon displays next to the table. A tooltip indicates the type of change.

<figure><img src="https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FUB6VxVCDEFGvy2gCVNgJ%2FSubsettingTableViewChangeMarker.png?alt=media&#x26;token=cda7af4e-e05f-4c1d-acae-38309296b5ba" alt=""><figcaption><p>Change marker for a table in the<br>Table View list</p></figcaption></figure>

The possible types of changes are:

* Added to the subset. For example:
  * A new target table
  * A table that is newly included because it is related to a new target table
  * A new lookup table
* Removed from the subset. For example:
  * A removed target table
  * A table that is removed because it is related to a removed target table
  * A removed lookup table
* Modified in the subset. This usually reflects a change to a target table configuration. You might:
  * Change the type of target table (percentage or `WHERE` clause)
  * Change the percentage
  * Change the `WHERE` clause
  * Change the upstream filter

When you run a subsetting generation, Tonic clears the markers.

## Determining how to process tables that are not in the subset <a href="#subsetting-config-out-of-subset" id="subsetting-config-out-of-subset"></a>

Tables other than target tables, lookup tables, or related tables are not in the subset.

The subsetting configuration includes how to copy all of these tables to the destination database.

You can either:

* Use the table modes that are assigned to the out-of-subset tables.
* Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.

On **Table View**, on the **Configuration** tab, you use the **Process tables that are out of subset** toggle to determine how to handle these tables. After you run subsetting data generation, the toggle is on the **Options** tab.

![Configuration for how to process tables that are not in the subset](https://3378426797-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LSQCLFQ4bslJ-HYc8c3%2Fuploads%2FZDvyme9lKfzpJK8RlVsZ%2FSubsettingConfigureOptionsTab.png?alt=media\&token=ff5e723e-9ce7-4eff-a09a-f34e35d2a401)

By default, the setting is turned off, and Structural truncates the out-of-subset tables.

To use the assigned table mode to process each table, toggle the setting to the on position.

## **Determining whether to use subsetting during data generation** <a href="#subsetting-config-use-subsetting" id="subsetting-config-use-subsetting"></a>

If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.

On **Table View**, on the **Configuration** tab, the **Use Subsetting** toggle indicates whether to generate a subset. After you run subsetting data generation, the **Use Subsetting** toggle is on the **Options** tab.

By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination data dataset.

To instead generate a subset, toggle **Use subsetting** to the on position.

When you [run a data generation job](https://docs.tonic.ai/app/workflows/data-generation-run-job/data-generation-manual), you are also prompted to confirm whether to generate the entire dataset or a subset. These two toggles are synchronized. If you turn on the **Use Subsetting** toggle on the **Configuration** tab, then it is on by default on the generation confirmation panel.

## Enabling parallel processing for subsetting <a href="#subsetting-parallelism" id="subsetting-parallelism"></a>

You can sometimes use parallel processing to improve the performance of the subsetting process. Parallel processing allows multiple subsetting steps to be processed at the same time. The steps cannot rely on the output of other steps that are processed in parallel.

To enable parallel processing for subsetting, set the [environment setting](https://docs.tonic.ai/app/admin/environment-variables-setting) `TONIC_TABLE_PARALLELISM` to a number greater than 1 (the default). You can configure this setting from **Structural Settings**. This setting determines the maximum number of subsetting steps that Structural can process in parallel. For regular data generation, it also determines the number of tables that Structural operates on at the same time.

The effect of subsetting parallelism on performance depends on:

* Your subsetting configuration.
* The layout of your schema.
* The performance characteristics of the machine that runs Structural.
* The performance characteristics of your databases.

We recommend that you start with a relatively small number such as 4, and then run a data generation job to determine how it affects performance. If performance improves, you can increase the number incrementally until the performance no longer improves.

The environment setting only controls the maximum number of steps that can be processed in parallel. Performance should not degrade if your system cannot support parallelism or won't benefit from using it.

If you have any other questions, contact <support@tonic.ai>.
