About subsetting


Required license: Professional or Enterprise

What is subsetting?

Subsetting allows you to intelligently reduce the size of your destination database. It takes a representative sample of the data that preserves the data's referential integrity.

You configure how Tonic Structural generates the subset. When you generate the output data, you decide whether to enable the subsetting process.

For example, you can configure subsetting to get 5% of all transactions, or all of the data that is associated with customers who live in California.

Here are a few examples where subsetting data might be important or necessary:

  • You want to use your production database in staging or test environments, without the PII. Because the database is very large, you want to use only a portion of it.

  • To reproduce a bug, you want a test database that contains a few specific rows from production, and related rows from other tables.

  • You want to share data with others, but you don’t want them to have all of it. For example, you can provide developers with an anonymized subset that is small enough to run on their local machines.

To learn more about our approach to subsetting, go to the following technical blog posts:

  • Honey I shrunk the (Postgres) database

  • Database subsetting is not a piece of cake, so we baked Condenser 2.0 just for you

Components of subsetting

Subsetting uses foreign keys to determine the relationships in the data. These relationships enable the subsetting process to traverse the database as it builds the subset.

Foreign keys are either configured in your source data, or configured using the Structural virtual foreign key tool. For more information, go to Subsetting and foreign keys.

For subsetting, each table in the source database falls into one of the following categories:

Target tables

Target tables are the seed tables that provide the initial set of rows to include in the subset. Structural retrieves the initial subset of data from the target tables. Structural then uses those rows to identify the information to pull from related tables.

A target table typically contains an important entity that is well connected to everything else in the source data, such as users, transactions, or claims. A subset should usually have a very small number of target tables.

When you identify a target table, you specify how to retrieve the subset of the data that you want from the table. You can request a percentage of the data, or use a WHERE clause to identify a specific subset of data.

For more information, go to Identifying and configuring target tables.
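To make the two filter options concrete, here is a minimal sketch, in Python, of how a percentage filter and a WHERE-clause filter could translate into the query that seeds the subset. This is not Structural's implementation; the table names, filter values, and sampling syntax are illustrative.

```python
# Illustrative sketch only: how a target table filter could map to the
# query that pulls the initial rows. Real engine syntax varies.

def target_table_query(table: str, percent: float | None = None,
                       where: str | None = None) -> str:
    """Build a SELECT that retrieves the initial rows from a target table."""
    if where is not None:
        # WHERE-clause filter: pull a specific slice of the data.
        return f"SELECT * FROM {table} WHERE {where}"
    if percent is not None:
        # Percentage filter: pull a random sample of roughly `percent`
        # of the rows (shown with ORDER BY random(); engines differ).
        return (
            f"SELECT * FROM {table} ORDER BY random() "
            f"LIMIT (SELECT CAST(count(*) * {percent / 100.0} AS int) FROM {table})"
        )
    raise ValueError("A target table needs a percentage or a WHERE clause")

# The examples from this page:
print(target_table_query("transactions", percent=5))
print(target_table_query("customers", where="state = 'CA'"))
```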

Lookup tables

A lookup table contains a static set of values that is used in other tables in your subset. For example, a lookup table might contain a list of postal codes or country names that are referenced in other tables.

Structural always retrieves all of the data in a lookup table. It does not check whether or where the lookup values are used.

It does not pull records from related tables based on lookup table values. Relationships with lookup tables are ignored during the subsetting process.

For more information, go to Identifying lookup tables.

Related tables

Related tables are tables that are connected by direct or indirect relationships with a target table, and that are not identified as lookup tables.

  • Downstream tables have data that is required to maintain referential integrity for the subset. These tables have primary keys that are referenced by foreign keys in related tables.

  • Upstream tables contain data that has a foreign key that references a primary key in the target table. These upstream records are not required to maintain referential integrity, but can contain useful information. For large upstream tables, if the foreign key columns are not indexed, the subsetting process can be significantly slower. In the subset configuration, you can filter these upstream records either by date or by using a WHERE clause.

Some related tables are both downstream and upstream. In that case, you can provide a filter that applies only to the upstream records. Because the downstream records are required for referential integrity, they cannot be filtered.

For example, a transactions table contains a foreign key column that identifies the customer. The value is the primary key of a record in the customers table. The customers table is downstream of the transactions table, because the transaction data is incomplete without the customer information. The transactions table is upstream of the customers table.
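As a quick illustration of the direction of these relationships, here is a minimal Python sketch (the table and column names are hypothetical) that classifies tables as upstream or downstream of a given table based on its foreign key edges:

```python
# A foreign key edge: (referencing table, fk column, referenced table).
# Relative to a table T:
#   - tables that T references through its foreign keys are downstream of T
#   - tables whose foreign keys reference T are upstream of T
foreign_keys = [("transactions", "customer_id", "customers")]

def downstream_of(table, fks):
    return {ref for (src, _col, ref) in fks if src == table}

def upstream_of(table, fks):
    return {src for (src, _col, ref) in fks if ref == table}

assert downstream_of("transactions", foreign_keys) == {"customers"}
assert upstream_of("customers", foreign_keys) == {"transactions"}
```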

Structural pulls data from related tables in order to preserve referential integrity in the output data subset.

In many cases, the relationship is direct. For example, a target table contains a list of events. The events table identifies the user that hosted the event. The user is identified using a foreign key relationship from the events table to the users table. The users table is a related table. The subset includes all the users that the events refer to.

The relationship also might be indirect. To continue the example, the events table identifies a user from the users table. The users table identifies the company that each user belongs to. The company is identified using a foreign key relationship from the users table to the companies table. The companies table is also a related table. The subset needs to include all of the companies that are referred to by the users that the events table refers to.

For an example of how Structural identifies related tables, view the example diagram in How Structural creates a subset.

Out-of-subset tables

Tables other than target tables, lookup tables, or related tables are not part of the subset.

By default, Structural copies only the table schema of out-of-subset tables. It does not populate any of the data.

You can also choose to process out-of-subset tables using the table mode that is assigned to each table.

For more information, go to Determining how to process tables that are not in the subset.

How Structural creates a subset

Structural creates the subset before it applies any transformations to the source data.

To provide a basic overview of how Structural creates the subset, we'll use the following simple example schema:

[Example schema diagram]

The Events table is the target table for the subset. The Events table includes information about the event hosts (Hosts table) and the event venue (Venues table). For each host, the data includes the company that the host belongs to (Companies table).

The Attendees table includes the event that the attendee registered for.

The Hosts, Companies, Venues, and Attendees tables are all related tables for the subset.

The States table provides a lookup of state values to use for the company, venue, and attendee addresses. It is a lookup table for the subset. A subset always includes all of the data in a lookup table.

When you enable subsetting for a data generation job:

  1. To create the basis of the subset, Structural gets data from the target tables based on the configured filters: either a percentage or a WHERE clause. In our example, Structural gets the subset of data from the Events table. Structural then traverses your database based on the relationships that originate from the target tables.

  2. Structural first goes upstream. In the upstream pass, Structural traverses the tables that reference a target table, based on the data collected in step 1. In other words, the value of the primary key for a target table record is the value of a foreign key column in the upstream table. This step continues until there are no remaining upstream tables to process. To continue our example, Structural retrieves the attendees for the event records that are in the subset.

  3. Next, Structural goes downstream. Structural traverses all of the tables to look for foreign key columns for which the value is the primary key of an upstream table record. To continue our example, Structural retrieves the hosts and venues that are referred to in the event records that it retrieved in the first or second pass on the events table. It also retrieves the companies that are referred to in the host records. During this downstream step, Structural considers both upstream and downstream tables to ensure that the subset includes every connected table. For example, if the Venues table included a foreign key column that referenced a primary key from the Attendees table, Structural would have to return to the Attendees table to get those attendee records. A simplified sketch of this traversal appears after these steps.
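The following Python sketch models the upstream and downstream passes on the example schema. It is a simplified illustration, not Structural's implementation: tables are dictionaries of rows keyed by primary key, relationships to the States lookup table are omitted (lookup tables are always copied whole), and the real process also applies the configured filters.

```python
# Foreign key edges: (referencing table, fk column, referenced table).
foreign_keys = [
    ("Events", "host_id", "Hosts"),
    ("Events", "venue_id", "Venues"),
    ("Hosts", "company_id", "Companies"),
    ("Attendees", "event_id", "Events"),
]

def build_subset(tables, fks, seed):
    # seed: primary keys pulled from the target tables in step 1,
    # for example {"Events": {1}}.
    subset = {t: set(ids) for t, ids in seed.items()}

    # Step 2: upstream pass. Add rows whose foreign key points at a row
    # that is already in the subset; repeat until nothing new is found.
    changed = True
    while changed:
        changed = False
        for src, col, ref in fks:
            for pk, row in tables[src].items():
                if row.get(col) in subset.get(ref, set()):
                    if pk not in subset.setdefault(src, set()):
                        subset[src].add(pk)
                        changed = True

    # Step 3: downstream pass. Every row already in the subset drags in
    # the rows that its foreign keys reference. Repeating until stable
    # also covers indirect references (Events -> Hosts -> Companies).
    changed = True
    while changed:
        changed = False
        for src, col, ref in fks:
            for pk in list(subset.get(src, set())):
                target = tables[src][pk].get(col)
                if target is not None and target not in subset.setdefault(ref, set()):
                    subset[ref].add(target)
                    changed = True
    return subset

tables = {
    "Events":    {1: {"host_id": 10, "venue_id": 20}, 2: {"host_id": 11, "venue_id": 21}},
    "Hosts":     {10: {"company_id": 30}, 11: {"company_id": 31}},
    "Venues":    {20: {}, 21: {}},
    "Companies": {30: {}, 31: {}},
    "Attendees": {40: {"event_id": 1}, 41: {"event_id": 2}},
}
# Target the subset at event 1 only:
print(build_subset(tables, foreign_keys, {"Events": {1}}))
# {'Events': {1}, 'Attendees': {40}, 'Hosts': {10}, 'Venues': {20}, 'Companies': {30}}
```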

Other notes about retrieving subset data

You might want to be aware of how Structural retrieves subset data in the following cases, which can result in either more or less data than you might expect.

Target tables related to other target tables

If there are multiple target tables, and the tables are related to each other, Structural takes the union of the required data for both the target table configuration and the table relationships.

For example, table A contains a foreign key column that refers to table B. You configure both tables as target tables. For table B, Structural pulls both the directly targeted set of records and the records that the targeted table A records refer to.
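In set terms, this is a union; a minimal sketch with illustrative row values:

```python
# Table B is a target table, and target table A has a foreign key into B.
# B's subset is the union of B's own targeted rows and the B rows that
# A's targeted rows refer to.
targeted_b = {1, 2, 3}           # rows that match B's own filter
referenced_from_a = {3, 4, 5}    # B rows referenced by targeted A rows
subset_b = targeted_b | referenced_from_a
print(subset_b)  # {1, 2, 3, 4, 5}
```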

Tables that are upstream of multiple target tables

If a table is upstream of multiple target tables, then Structural pulls only the records from that table that reference targeted records in all of the target tables.

For example, in related table Child1, column1 is a foreign key that refers to a primary key in target table Parent1. column2 is a foreign key that refers to a primary key in target table Parent2.

If column1 and column2 both refer to targeted records in Parent1 and Parent2, then that Child1 record is included in the subset. If only one of those columns refers to a targeted record in Parent1 or Parent2, then that Child1 record is not included.
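A minimal sketch of this rule (the row values are illustrative):

```python
# A Child1 row is included only if every foreign key into a target table
# points at a targeted row: an AND across the references, not an OR.
targeted_parent1 = {1, 2}
targeted_parent2 = {7, 8}

child1_rows = [
    {"column1": 1, "column2": 7},  # both references targeted -> included
    {"column1": 1, "column2": 9},  # column2 misses Parent2 -> excluded
]

subset_child1 = [
    row for row in child1_rows
    if row["column1"] in targeted_parent1 and row["column2"] in targeted_parent2
]
print(subset_child1)  # [{'column1': 1, 'column2': 7}]
```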
