The Tonic Structural synthetic data platform combines sensitive data detection and data transformation to allow users to create safe, secure, and compliant datasets.
Common Structural use cases include creating staging and development environments and trying out a new cloud provider without complex data agreements. Structural allows you to reduce bug counts, shorten testing life cycles, and share data with partners, all while helping to ensure security and compliance with the latest regulations, from GDPR to CCPA.
You can use Structural APIs to integrate with CI/CD pipelines or to create automated processes that ensure that the generated data is available on demand.
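A CI/CD integration of this kind might look like the sketch below, which constructs (but does not send) the HTTP request that triggers a data generation job. The endpoint path, the workspaceId query parameter, and the "Apikey" authorization scheme are assumptions for illustration; confirm the exact routes and header format in the Structural API documentation for your instance.

```python
import json
import urllib.request

# Hypothetical sketch of triggering a Structural data generation job from a
# CI/CD pipeline. The endpoint path, query parameter, and "Apikey" header
# scheme are assumptions -- check your instance's API documentation for the
# exact routes and authentication format.
def build_generation_request(base_url: str, workspace_id: str, api_token: str):
    """Construct (but do not send) the request that starts a generation job."""
    url = f"{base_url}/api/GenerateData/start?workspaceId={workspace_id}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Authorization": f"Apikey {api_token}",
            "Content-Type": "application/json",
        },
        data=json.dumps({}).encode("utf-8"),
    )

req = build_generation_request("https://app.tonic.ai", "my-workspace-id", "TOKEN")
# A pipeline step would send this with urllib.request.urlopen(req), then poll
# the job status until the generated data is available.
```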
The Tonic Structural platform creates safe, realistic datasets to use in staging environments or for local development. It includes a web application and API that can be used by engineers, data analysts, or security experts.
Structural connects to source databases that contain sensitive data such as personally identifiable information (PII) or protected health information (PHI). To protect that data, Structural transforms the sensitive values and writes the transformed data to a destination location.
New to Structural? Review the Tonic Structural workflow overview. Go to Getting started with the Structural free trial for information on how to create a Structural account and start a Structural free trial.
Want to know what's in the latest Structural releases? Go to the Tonic Structural release notes.
The Structural application heading includes a feature updates icon, which displays a summary of the newest features, including a link to the Structural release notes.
Structural data generation workflow
Overview of the Structural steps to generate de-identified data
Structural deployment types
You can use Structural Cloud or set up a self-hosted Structural instance
Structural implementation roles
Functions that participate in a Structural implementation
Structural license plans
View the license options and their available features
Workspaces
A workspace contains the data connections and data generation or data science mode configuration.
Data connectors
Each data connector allows Structural to read from and write to a specific type of data source.
Privacy Hub
View and update the current protection status based on the sensitivity scan and workspace configuration.
Database View
Configure transformation options for tables and columns.
Generators
A generator is assigned to a column and performs a data transformation.
Subsetting
Configure a subset of source data to include in the transformed destination data.
Generate data
Run the data generation process to produce transformed destination data.
Schema changes
Review and address changes to the source data schema.
User access
Manage who has access to your instance.
Monitoring and logging
Monitor Structural services and share logs with Tonic.ai.
Updating Structural
Upgrade to the latest version of Structural.
When you go to Tonic Structural for the first time, you create an account. How you create an account depends on the type of user you are.
A new Structural user can be one of the following:
A completely new user who is starting a Structural 14-day free trial. Free trial users use Structural Cloud to explore and experiment with Structural before they decide whether to purchase it.
A new user on a self-hosted Structural instance. Self-hosted instances are installed on-premises. The customer administers the Structural users.
A new user in an existing Structural Cloud organization. New users are added to existing organizations based on their email domain.
Tonic Structural data generation combines sensitive data detection and data transformation to create safe, secure, and compliant datasets.
The Structural data generation workflow involves the following steps:
You can also view this video overview of the Structural data generation workflow.
To get started, you create a data generation workspace. When you create a data generation workspace, you identify the type of source data, such as PostgreSQL or MySQL, and establish the connections to the source database and the destination location. The source database contains the original data that you want to synthesize. The destination location is where Structural stores the synthesized data. It might be a database, a storage location, a container repository, or an Ephemeral database.
Next, you analyze the results of the initial sensitivity scan. The sensitivity scan identifies columns that contain sensitive data. These columns need to be protected by a generator.
Based on the sensitivity scan results, you configure the data generation. The configuration includes:
Assigning table modes to tables. The table mode controls the number of rows and columns that are copied to the destination database.
Indicating column sensitivity. You can make adjustments to the initial sensitivity assignments. For example, you can mark additional columns as sensitive that the initial scan did not identify as sensitive.
Assigning and configuring column generators. To protect the data in a column, especially a sensitive column, you assign a generator to it. The generator replaces the source value with a different value in the destination database. For example, the generator might scramble the characters or assign a random value of the same type.
After you complete the configuration, you run the data generation job. The data generation job uses the configured table modes and generators to transform the data from the source database and write the transformed data to the destination location. You can track the job progress and view the job results.
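The workflow above — scan for sensitive columns, assign generators, then transform — can be illustrated with a toy, in-memory version of the same pipeline. This is an illustration of the concepts only, not Structural's implementation; the hash-based email replacement is a stand-in for a real generator.

```python
import hashlib

# Toy illustration of the workflow: scan columns for sensitive data, assign a
# "generator" to each sensitive column, then transform the rows.
rows = [
    {"email": "ada@example.com", "status": "active"},
    {"email": "alan@example.com", "status": "inactive"},
]

# Step 1: a minimal sensitivity scan -- flag columns whose values look like
# email addresses.
sensitive = {col for col in rows[0] if all("@" in str(r[col]) for r in rows)}

# Step 2: assign a generator to each sensitive column. This stand-in generator
# deterministically derives a fake address from a hash, so equal source values
# map to equal destination values.
def email_generator(value: str) -> str:
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    return f"user_{digest}@example.org"

generators = {col: email_generator for col in sensitive}

# Step 3: "data generation" -- apply generators to sensitive columns and pass
# the other columns through unchanged.
output = [
    {col: generators.get(col, lambda v: v)(val) for col, val in row.items()}
    for row in rows
]
```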
If you are a user who wants to set up an account in an existing Tonic Structural Cloud or self-hosted organization, go to Creating a new account in an existing organization.
The Structural 14-day free trial allows you to explore and experiment in Structural Cloud before you decide whether to purchase Structural.
When you sign up for a free trial, Structural automatically creates a sample workspace for you to use. You can also create a workspace that uses your own database or files.
The free trial provides tools to introduce you to Structural and to guide you through configuring and completing a data generation.
Structural tracks and displays the amount of time remaining in your free trial. You can request a demonstration and contact support.
When the free trial period ends, you can continue to use Structural to configure workspaces. You can no longer generate data or train models. Contact Tonic.ai to discuss purchasing a Structural license, or select the option to start a Structural Cloud pay-as-you-go subscription.
To start a new free trial of Structural:
Go to app.tonic.ai.
Click Create Account.
On the Create your account dialog, do one of the following:
To use a corporate Google email address to create the account, click Create account using Google.
To create a new Structural account, enter your email address, create and confirm a Structural password, then click Create Account. You cannot use a public email address for a free trial account.
Structural sends an activation link to your email address.
After you activate your account and log in, Structural next prompts you to select the use case that best matches why you are exploring Structural. If none of the provided use cases fits, use the Other option to tell us about your use case.
After you select a use case, click Next. The Create your workspace panel displays.
When you sign up for a free trial, Structural automatically creates a sample PostgreSQL workspace that you can use to explore how to configure and run data generation.
You can also choose to create a workspace that uses your own data, either from local files or from a database.
On the Create your workspace panel:
To use the sample workspace, click Use a sample workspace, then click Next. Structural displays Privacy Hub, which summarizes the protection status for the source data. It also displays the Getting Started Guide panel and the quick start checklist.
To create a workspace that uses local files as the source data, click Upload Files, then click Next. Go to #uploading-files.
To create a new workspace that uses your own data, click Bring your own data, then click Next. Go to #connecting-to-a-database.
The Upload files option creates a local files file connector workspace. The source data consists of groups of files selected from a local file system. The files in a file group must have the same type and structure. Each file group becomes a "table" in the source data.
For other workspaces that you create during the free trial, you can also create a file connector workspace that uses files from cloud storage (Amazon S3 or Google Cloud Storage).
After you select Upload files and click Next, you are prompted to provide a name for the workspace.
In the field provided, enter the name to use for the workspace, then click Next.
Structural displays the File Groups view, where you can set up the file groups for the workspace.
It also displays the Getting Started Guide panel with links to resources to help you get started.
After you create at least one file group, you can start to use the other Structural features and functions.
If you choose to create a workspace with your own data, then the first step is to provide a name for the workspace.
In the field provided, enter the name to use for your first workspace, then click Next.
The Invite others to Tonic panel displays.
Under Invite others to Tonic, you can optionally invite other users with the same corporate email domain to start their own Structural free trial. The users that you invite are able to view and edit your workspace.
For example, you might want to invite other users if you don't have access to the connection information for the source data. You can invite a user who does have access. They can then update the workspace configuration to add the connection details.
To continue without inviting other users, click Skip this step.
To invite users:
For each user to invite, enter the email address, then press Enter. The email addresses must have the same corporate email domain as your email address.
After you create the list of users to invite, click Next.
The Add source data connection view displays.
The final step in the workspace creation is to provide the source data to use for your workspace.
Structural provides data connectors that allow you to connect to an existing database. Each data connector allows you to connect to a specific type of database. Structural supports several types of application databases, data warehouses, and Spark data solutions.
For the first workspace that you create using the free trial wizard, you can choose:
For subsequent workspaces that you create from Workspaces view, you can also choose Databricks.
To connect to an existing database, on the Add source data connection panel, click the data connector to use, then click Add connection details.
The panel also includes a Local files option, which creates a local files file connector workspace, the same as the Upload files option.
Use the connection details fields to provide the connection information for your source data. The specific fields depend on the type of data connector that you select.
After you provide the connection details, to test the connection, click Test Connection.
To save your workspace, click Save.
Structural displays Privacy Hub, which summarizes the protection status for the source data.
It also displays the Getting Started Guide panel with links to resources to help you get started.
The Structural free trial includes resources that introduce you to Structural and guide you through the tasks for your first data generation.
The Getting Started Guide panel provides access to Structural information and support resources.
The Getting Started Guide panel displays automatically when you first start the free trial. To display the Getting Started Guide panel manually, in the Structural heading, click Getting Started.
The Getting Started Guide panel provides links to Structural instructional videos and this Structural documentation. It also contains links to request a Structural demo, contact Tonic.ai support, and purchase a Structural Cloud pay-as-you-go subscription.
For each free trial workspace, Structural provides access to a workspace checklist.
The checklist appears at the bottom left of the workspace management view, and displays automatically when you open that view. To hide the checklist, click the minimize icon. To display the checklist again, click the checklist icon.
The checklist provides a basic list of tasks to complete a Structural data generation.
Each checklist task is linked to the Structural location where you can complete that task. Structural automatically detects and marks when a task is completed.
The checklist tasks are slightly different based on the type of workspace.
For workspaces that are connected to a database, including the sample PostgreSQL workspace and workspaces that you connect to your own data, the checklist contains:
Connect a source database - Set the connection to the source database. In most cases, you set the source connection when you create the workspace. When you click this step, Structural navigates to the Source Settings section of the workspace details view.
Connect to destination database - Set the location where Structural writes the transformed data. When you click this step, Structural navigates to the Destination Settings section of the workspace details view.
Apply generators to modify dataset - Configure how Structural transforms at least one column in the source data. When you click this step:
If there are available generator recommendations, then Structural navigates to Privacy Hub and displays the generator recommendations panel.
If there are no available generator recommendations, then Structural navigates to Database View.
Generate data - Run the data generation to produce the destination data. When you click this item, Structural navigates to the Confirm Generation panel.
For workspaces that use data from local files, the checklist contains:
Create a file group - Create a file group with files that you upload from a local file system. Each file group becomes a table in the workspace. When you click this step, Structural navigates to the File Groups view for the workspace.
Apply generators to modify dataset - Configure how Structural transforms at least one column in the source files. When you click this step:
If there are available generator recommendations, then Structural navigates to Privacy Hub and displays the generator recommendations panel.
If there are no available generator recommendations, then Structural navigates to Database View.
Generate data - Run the data generation to produce transformed versions of the source files. When you click this step, Structural navigates to the Confirm Generation panel.
Download your dataset - Download the transformed files from the Structural application database.
For workspaces that use data from files in cloud storage (Amazon S3 or Google Cloud Storage), the checklist contains:
Configure output location - Configure the cloud storage location where Structural writes the transformed files. When you click this step, Structural navigates to the Output location section of the workspace details view.
Create a file group - Create a file group that contains files selected from cloud storage. When you click this step, Structural navigates to the File Groups view for the workspace.
Apply generators to modify dataset - Configure how Structural transforms at least one column in the source data. When you click this step:
If there are available generator recommendations, then Structural navigates to Privacy Hub and displays the generator recommendations panel.
If there are no available generator recommendations, then Structural navigates to Database View.
Generate data - Run the data generation to produce transformed versions of the source files. When you click this step, Structural navigates to the Confirm Generation panel.
In addition to the workspace checklists, Structural uses next step hints to help guide you through the workspace configuration and data generation.
When a next step hint is available, it displays as an animated marker next to the suggested next action.
When you hover over the highlighted action, Structural displays a help text popup that explains the recommended action.
When you click the highlighted action, the hint is removed, and the next hint is displayed.
For a file connector workspace, to identify the source data, you create file groups. A file group is a set of files of the same type and with the same structure. Each file group becomes a table in the workspace. For CSV files, each column becomes a table column. For XML and JSON file groups, the table contains a single XML or JSON column.
On the File Groups view, click Create File Group.
For a file connector workspace that uses local files, you can either drag and drop files from your local file system to the file group, or you can search for and select files to add. For more information, go to #adding-files-from-a-local-file-system.
For a file connector workspace that uses cloud storage, you select the files to include in the file group. For more information, go to #adding-files-from-amazon-s3-or-gcs.
For files that contain CSV content, you also configure the delimiters and other file settings. For more information, go to #configuring-delimiters-and-file-settings-for-.csv-files.
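Delimiter settings matter because the same bytes parse into different columns depending on the delimiter you configure. As an aside (using Python's standard csv module, not Structural itself), delimiter detection of this kind can be illustrated like this:

```python
import csv
import io

# Aside: Python's standard csv module illustrates why delimiter settings
# matter -- the same content parses into different columns per delimiter.
sample = "id;name;city\n1;Ada;London\n2;Alan;Manchester\n"

dialect = csv.Sniffer().sniff(sample)           # detect the delimiter
rows = list(csv.reader(io.StringIO(sample), dialect))
```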
To get value out of the data generation process, you assign generators to the data columns.
A generator indicates how to transform the data in a column. For example, for a column that contains a name value, you might assign the Name generator, which indicates how to generate a replacement name in the generation output.
For sensitive columns that Structural detects, Structural can also provide a recommended generator configuration.
When there are recommendations available, Privacy Hub displays a link to review all of the recommendations.
The Recommended Generators by Sensitivity Type panel displays a list of sensitive columns that Structural detected, along with the suggested generators to apply.
After reviewing, to apply all of the suggested generators, click Apply All. For more information about using this panel, go to Reviewing and applying recommended generators.
You can also choose to apply an individual generator manually. You can do this from Privacy Hub, Database View, or Table View.
To display Database View, on the workspace management view, click Database View.
On Database View, in the column list, the Applied Generator column lists the currently assigned generator for each column. For a new workspace, the columns are all assigned the Passthrough generator. The Passthrough generator simply passes the source value through to the destination data without masking it.
Click a column that is marked as Passthrough. For example, in the sample workspace, click the customers.Marital_Status column. The column configuration panel displays. To select a generator, click the generator dropdown. The list contains generators that can be assigned to the column based on the column data type. For customers.Marital_Status, the Categorical generator is a good option.
For Passthrough columns that Structural identified as containing sensitive data, the column displays the type of sensitive data, such as a name, email address, or location.
In Database View, click one of those columns. For example, in the sample workspace, the customers.Email column is marked as containing an email address.
For customers.Email, click the Email label. Instead of the column configuration panel, you see a panel that indicates the type of sensitive data and the recommended generator. For customers.Email, the recommended generator is Email. To assign the Email generator, click Apply recommendation. The column configuration panel displays with the generator assigned.
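To make the generator concept concrete, here is a toy categorical-style transformation in Python (an illustration only, not Structural's implementation): each destination value is drawn from the set of categories observed in the source column, so the column stays realistic while individual rows no longer carry their original values through.

```python
import random

# Toy illustration of a categorical-style generator: destination values are
# drawn from the categories observed in the source column, so the column
# remains realistic without copying each row's source value through.
def categorical_generator(values, seed=0):
    categories = sorted(set(values))
    rng = random.Random(seed)          # seeded for reproducible demo output
    return [rng.choice(categories) for _ in values]

source = ["Single", "Married", "Married", "Divorced", "Single"]
masked = categorical_generator(source)
```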
To run a data generation, Structural must have a destination for the transformed data.
For a local files workspace, Structural saves the transformed files to the application database.
For workspaces that use data from a database, and for workspaces that use cloud storage files, you configure where Structural writes the output data.
Note that in non-free trial accounts, the Ephemeral option writes the output to an Ephemeral user snapshot instead of a database.
For more information about writing output to Ephemeral, go to Writing data generation output to a Tonic Ephemeral snapshot.
The other output options are:
For database-based data connectors, you can write the transformed data to a destination database.
For some Structural data connectors, Structural can write the transformed data to a data volume in a container repository.
For file connector workspaces that transform files from cloud storage (Amazon S3 or Google Cloud Storage), you configure the cloud storage location where Structural writes the transformed files.
To display the destination configuration for the workspace:
Click the Settings tab.
Scroll to the Destination Settings section or, for a file connector workspace that uses cloud storage files, scroll to the Output location section.
By default, the sample workspace, as well as any other PostgreSQL, MySQL, or SQL Server workspace, writes the transformed data to an Ephemeral database. The database expires after 48 hours. If you do not already have an Ephemeral account, then Structural creates an Ephemeral free trial account for you.
After you run data generation, Tonic provides the credentials that you need to connect to the database. If it created a new Ephemeral free trial account, then it also sends you an activation email message.
For this option, you do not need to change the workspace configuration.
You can also choose to write the transformed data either to a destination database or to a container repository.
To write the data to a destination database, click Database Server. Structural displays the configuration fields for the destination database.
For information on how to configure the destination information for a specific data connector, go to the workspace configuration information for that data connector. The data connector summary contains a list of the available data connectors, and provides a link to the documentation for each data connector.
To write the data to a data volume in a container repository, click Container Repository. Structural displays the configuration fields to select a base image and provide the details about the repository.
For more information, go to Writing data generation output to a container repository.
For a file connector workspace that uses files from cloud storage (Amazon S3 or Google Cloud Storage), you configure the cloud storage output location where Structural writes the transformed files. The configuration includes the required credentials to use.
For more information, go to Configuring the file connector storage type and output options.
After you complete the workspace and generator configuration, you can run your first data generation.
The data generation process uses the assigned generators to transform the source data. It writes the transformed data to the configured destination location.
For a local files workspace, it writes the files to the Structural application database.
The Generate Data option is at the top right of the Structural heading.
When you click Generate Data, Structural displays the Confirm Generation panel.
The Confirm Generation panel provides access to the current destination configuration, along with other advanced generation options such as subsetting and upsert. It also indicates if there are any issues that prevent you from starting the data generation. For example, if the workspace does not have a configured destination, then Structural cannot run the data generation.
To start the data generation, click Run Generation. For more information about running data generation, go to Running a data generation job.
For a new Tonic Ephemeral account, the first time that you run data generation, you also receive an activation email message for the account.
To view the job status and details:
Click Job History.
In the list, click the data generation job.
For a data generation that writes the output to an Ephemeral database, the Data Available in Tonic Ephemeral panel provides access to the database connection information.
To display the connection details, click Connecting to your database.
The connection details include:
The database location and credentials. Each field contains a copy icon to allow you to copy the value.
SSH tunnel information, including instructions on how to create an SSH tunnel from your local machine to the Ephemeral database.
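An SSH local port forward of the general form sketched below is what such instructions typically describe. Every host, port, and user name here is a placeholder; substitute the values shown in the connection details panel.

```python
import subprocess  # subprocess.run(cmd) would start the tunnel

# Sketch of opening an SSH local port forward to an Ephemeral database.
# All values below are placeholders -- use the host, port, and user shown in
# the Ephemeral connection details panel instead.
local_port = 5432
remote_host = "ephemeral-db.example.internal"
remote_port = 5432
bastion = "user@bastion.example.com"

cmd = [
    "ssh",
    "-N",                                               # forward only, no shell
    "-L", f"{local_port}:{remote_host}:{remote_port}",  # local -> remote forward
    bastion,
]
# After subprocess.run(cmd) starts the tunnel, clients connect to
# localhost:5432 as if it were the Ephemeral database.
```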
The first time that you complete all of the steps in a checklist, Structural displays a panel with options to chat with our sales team, schedule a demo, or purchase a subscription.
You can also continue to get to know Structural and experiment with other Structural features such as subsetting or using composite generators to mask more complex values such as JSON or XML.
If your free trial has expired, to get an extension, you can reach out to us using either the in-app chat or an email message.
From the User Settings view, you can manage settings for your individual Tonic Structural account.
To display the User Settings view:
Click your user image at the top right.
In the menu, click User Settings.
The User Settings view includes options to:
You can select an image to associate with your account. The image is displayed next to your name and email address throughout Structural.
If your instance uses Google or Azure single sign-on (SSO) to manage Structural users, then by default your Structural account image is the image from the SSO.
Otherwise, the default image displays your initials.
To change your user image, click Upload, then select the image file.
Required license: Professional or Enterprise
From the Comment Notification Settings section of User Settings, you can configure when to receive email notifications for comments.
The available options are:
I am an owner, editor, auditor, or am being replied to - This is the default option. You receive email notifications when comments are made on columns in a workspace that you are an owner, editor, or auditor for. You also receive an email notification when someone replies to a comment that you made.
I am @ mentioned - You only receive an email notification if someone specifically mentions you in a comment.
Never - You never receive email notifications for column comments.
Before you can use the Structural API, you must create an API token. From the User API Tokens section of the User Settings view, you can create and revoke API tokens.
To create an API token:
Click Create Token.
On the Create New Token dialog, enter a name for the new token.
Click Confirm. The token is added to the list.
To revoke a token, click the Revoke link for the token.
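Once created, the token authenticates Structural API calls. A common pattern, sketched below, is to keep the token in an environment variable rather than in code; the "Apikey" authorization scheme shown is an assumption, so confirm the exact header format in your instance's API documentation.

```python
import os

# Read a Structural API token from the environment rather than hard-coding it.
# The "Apikey" authorization scheme below is an assumption -- confirm the
# exact header format in your instance's API documentation.
os.environ.setdefault("TONIC_API_TOKEN", "example-token")  # demo value only

def auth_headers() -> dict:
    token = os.environ["TONIC_API_TOKEN"]
    return {"Authorization": f"Apikey {token}"}

headers = auth_headers()  # attach to each Structural API request
```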
If your Structural account is not managed using SSO, then from User Settings, you can change your Structural password.
If your Structural instance uses SSO to manage users, then your user credentials are managed in the SSO system. You cannot change your user password in Structural.
Under Password Change, to change your Structural password:
In the Old Password field, type your current Structural password.
In the New Password field, type your new Structural password.
In the Repeat New Password field, type your new Structural password again.
Click Confirm.
From User Settings, you can delete your Structural account. If your instance uses SSO to manage users, then deleting your account only affects your access to Structural.
You cannot delete your Structural account if you are the owner of a workspace for which other users are granted access. Before you can delete your Structural account, you must either:
To delete your Structural account, click Delete Account.
When you delete your account, you are logged out of Structural.
Structural Cloud is our secure hosted environment. On Structural Cloud, Tonic handles monitoring Structural services and updating Structural.
Structural Cloud does not include:
Access to the following data connectors:
Each Structural Cloud user belongs to a Structural Cloud organization, which is determined either by the user's email domain or by a workspace invitation. Structural Cloud users do not have any access to workspaces or users from other organizations.
Each free trial user is in a separate organization, along with any users that they invite to have access to a free trial workspace.
A Tonic Structural implementation can involve the following roles - from those who set up the Structural environment to the consumers of the data that Structural processes.
For self-hosted instances of Structural.
Infrastructure engineers set up the Structural application and its relevant dependencies. They are typically DevOps, Site Reliability Engineering (SRE), or Kubernetes cluster administrators.
Infrastructure engineers perform the following Structural-related tasks:
Create Structural-processed data pipelines for development and testing workflows.
For both self-hosted instances of Structural and Structural Cloud.
Database administrators ensure that source databases are available to Structural, and that Structural can write to destination databases.
Set up the required Structural access to source databases.
Set up destination databases for Structural to write transformed data to.
Structural users are the actual users of the Structural application.
Depending on the use case, Structural users might be compliance analysts, DevOps, or data engineers.
Structural users perform the following Structural-related tasks:
Work with data consumers to produce usable data.
Data consumers are the end users of transformed destination data or trained data models.
They are typically QA testers, developers, or analysts.
Data consumers perform the following Structural-related tasks:
Validate the usability of the destination data.
Provide guidance on application-specific requirements for data.
Security and compliance specialists ensure and validate that the data that Structural produces meets expectations, and that Structural complies with other security-related processes.
Security and compliance specialists perform the following Structural-related tasks:
Provide guidance on what data is sensitive.
Sign off on proposed approaches to mask sensitive data.
Approve data access and permissions.
Tonic Structural provides different license plans to accommodate organizations of different sizes and levels of data architecture complexity.
Access to Structural data science mode is granted through individual licenses. It is not based on a Structural license plan.
The Basic license is designed for very small organizations with simple data architectures. It provides access to Structural's core de-identification and data generation features.
The Basic license allows access for a single user, with an option to purchase two additional users.
With a Basic license, you can create workspaces for one data connector type. The data connector type must be one of the following:
With a Basic license, your Structural instance can have only one Structural worker. This means that only one sensitivity scan or data generation job can run at the same time.
With a Basic license, you can create and configure workspaces, and run data generation for those workspaces.
The Basic license does NOT provide access to the following features:
Custom generators
With a Basic license, you only have access to the basic version of the Structural API.
You cannot use the basic Structural API to perform the following API tasks, which require the advanced API:
The Professional license is designed for larger organizations that have more complex data architectures. The organization might have a larger team that supports multiple databases.
The Professional license provides access to a larger set of Structural features than the Basic license.
The Professional license allows up to 10 users. You can purchase access for unlimited users as an add-on.
With a Professional license, you can create workspaces for up to two types of data connectors. You can purchase one additional data connector type as an add-on.
With a Professional license, your Structural instance can have more than one Structural worker.
This means that you can run multiple jobs from different workspaces at the same time. You can never run multiple jobs from the same workspace at the same time.
With a Professional license, you can do the following:
Create and configure workspaces, and run data generation for those workspaces
The Professional license does NOT provide access to the following features:
With a Professional license, you only have access to the basic version of the Structural API.
You cannot use the basic Structural API to perform the following API tasks, which require the advanced API:
The Enterprise license is ideal for very large organizations that have multiple teams that support very large and complex data structures, and that might have more requirements related to scale and compliance.
It provides full access to all Structural features.
An Enterprise instance does not limit the number of users.
You can use any number of any of the available data connectors.
The following features are exclusive to the Enterprise license:
The Enterprise license provides exclusive access to the advanced API.
The advanced Structural API provides access to all of the available API tasks, including the following tasks that are not available in the basic API:
The following table compares the available features for the Structural license plans.
If the data connector supports Tonic Ephemeral, then the default option is to write the transformed data to an Ephemeral database. The Ephemeral database expires after 48 hours. When you run the data generation, Structural also creates an Ephemeral free trial account for you if needed. For more information about Tonic Ephemeral, go to the Tonic Ephemeral documentation.
Structural allows users to provide comments on columns.
For a self-hosted instance, Structural provides administrator tools.
On a self-hosted instance, based on your license plan, you have access to the full set of supported data connectors.
Structural Cloud uses a single configuration.
Most Spark-based data connectors
Structural Cloud also supports a pay-as-you-go plan, where free trial users can move on to set up a monthly subscription.
The Account Admin permission set allows a Structural Cloud user to manage organization users and workspaces.
Note that these roles are not related to role-based access control (RBAC) within Structural.
Ensure that the proper infrastructure is ready for Structural installation.
Work with Tonic.ai support as needed.
Perform routine maintenance of Structural, its dependencies, and the Structural environment as needed.
Database administrators integrate Structural into your data architecture.
They perform the following Structural-related tasks:
Use Structural to configure the logic used to transform source data and to generate the transformed data.
Use data science mode to configure and train data models that are based on source data.
You can view the current sensitivity status based on the current workspace configuration.
- Can view foreign keys from the data, but cannot add virtual foreign keys
You can manage your Structural users.
Those data connectors can be of any type except for Oracle and Db2 for LUW.
View the current sensitivity status for your workspace configuration.
The Professional license does not allow you to assign the built-in Viewer and Auditor permission sets.
The comments can trigger email notifications.
Use subsetting to generate a smaller destination database.
Use upsert to add destination database records and update existing destination database records, while keeping unchanged destination database records in place. The Professional license does not allow you to connect to migration scripts.
View and address both conflicting and non-conflicting changes to the source data schema.
Have Structural decrypt source data, encrypt destination data, or both.
Request custom generators, which are primarily developed to preserve encryption that can't be managed using Structural data encryption. You can also purchase additional custom generators.
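The upsert behavior described above, where new records are added, existing records are updated, and untouched records stay in place, can be sketched with a plain dictionary merge (the record values here are hypothetical, and this is a conceptual illustration rather than Structural internals):

```python
# Conceptual sketch of upsert semantics, not Structural internals:
# new records are added, existing records are updated, and
# destination records that are not in the batch stay in place.
destination = {1: "Alice", 2: "Bob", 3: "Carol"}
incoming = {2: "Bobby", 4: "Dan"}  # transformed records from a generation run

destination.update(incoming)  # id 2 updated, id 4 inserted, ids 1 and 3 untouched
print(sorted(destination.items()))
```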
The Enterprise license provides exclusive access to the Oracle and Db2 for LUW data connectors.
Feature | Basic | Professional | Enterprise |
---|---|---|---|
The minimum screen width is 1120 pixels.
If the locally running database that you want to connect to runs in a Docker container:
Run docker inspect <container>, replacing <container> with the container name or ID.
In the networks section of the results, find the Gateway IP address.
Use this IP address as the server address in Structural.
If the locally running database does NOT run in a container, but runs on the machine, then:
On Windows or Mac, use host.docker.internal
On Linux, use 172.17.0.1, which is the IP address of the docker0 interface.
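As a quick illustration of the gateway lookup above, the following sketch parses a trimmed, hypothetical sample of `docker inspect` output (the container name `my-postgres` and the sample values are placeholders) and prints the gateway address that you would enter as the server address in Structural:

```python
import json

# Hypothetical, trimmed sample of the output of: docker inspect my-postgres
sample = """
[{"NetworkSettings": {"Networks": {"bridge": {"Gateway": "172.17.0.1",
                                              "IPAddress": "172.17.0.2"}}}}]
"""

networks = json.loads(sample)[0]["NetworkSettings"]["Networks"]
for name, net in networks.items():
    # The Gateway value is the server address to enter in Structural
    print(f"{name}: gateway={net['Gateway']}")
```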
If you use Structural Cloud, and your database only allows connections from allowlisted IP addresses, then you need to allowlist Structural static IP addresses.
This is not required for self-hosted instances of Structural.
For the United States-based instance (app.tonic.ai), the static IP addresses are:
54.92.217.68
52.22.13.250
The following IP addresses are used if needed for scaling or failover:
44.215.74.226
3.232.203.148
3.224.2.189
44.230.136.147
44.230.79.194
For the Europe-based instance (app-de.tonic.ai), the static IP addresses are:
18.159.127.160
3.69.249.144
The following IP addresses are used if needed for scaling or failover:
18.159.179.95
3.120.214.225
3.75.12.1
16.16.71.42
16.170.51.237
The URL https://telemetry.tonic.ai/ is used for our Amplitude telemetry. Allowlist the URL or the following IP addresses:
75.2.74.76
99.83.246.105
Telemetry sharing is required. These metrics are valuable to us as we debug issues, plan product roadmaps, and determine feature viability.
No customer data is included. For more information, go to Data that Tonic.ai collects.
To support the one-click update option, Structural needs to be able to retrieve information about the latest Structural version.
For more information, go to #tonic-updating-allowlist-for-version-info.
Click your user image at the top right. The menu includes the Tonic version.
We recommend that you use a static copy of your production database that was restored from a backup.
If that's not possible, consider the following when you connect Structural to your source data:
Structural cannot guarantee referential integrity of the output data if the source database is written to while data is generated. For this reason we recommend that you connect to a static copy of production data.
Read replicas and fast followers can be problematic for Structural because of how long some queries take to run. Read replicas tend to have short query timeout limits, which causes the queries to time out. Read replicas also reflect recent writes, which means that we cannot guarantee the referential integrity of the output.
For details about the types of data that Tonic.ai does and does not collect, go to Data that Tonic.ai collects.
Number of users | 1 (2 additional users available as an add-on) | 10 (unlimited users available as an add-on) | Unlimited |
Data connectors | 1 data connector (PostgreSQL or MySQL) | 2 data connectors (1 additional available as an add-on; any data connector except for Oracle or Db2 for LUW) | Unlimited number from any available data connector |
Workspace permission sets | Manager | Manager, Editor | Manager, Editor, Auditor, Viewer |
Custom generators | | Available for purchase | 2 included; additional available for purchase |
Concurrent jobs (more than 1 worker) | | ✓ | ✓ |
Structural API | Basic | Basic | Advanced |
A Tonic Structural workspace provides a context within which to either:
Configure and generate transformed data
Configure, train, and export data models
For data generation, a workspace represents a path between the source and transformed output data. For example, postgres-prod-copy to postgres-staging. A data generation workspace includes:
Where to find the source data to transform during data generation.
Where to write the transformed data.
The rules for the transformation.
For data science mode, a workspace is used to configure and train data models based on source data. Each data science model workspace includes:
Where to find the source data to base the models on.
Model configurations.
When you create a new workspace, you can either:
Create a copy of an existing workspace. The copy initially uses the configuration from the original workspace. After the copy is created, it is completely independent from the original workspace.
Create a child of an existing workspace. Child workspaces inherit configuration from the parent workspace. They continue to be updated automatically when the parent workspace is updated. See About workspace inheritance.
You can also view this video overview of how to create a workspace.
Required global permission: Create workspaces
To create a completely new workspace, on Workspaces view, click Create Workspace > New Workspace.
Required workspace permission: Copy workspace (in the workspace to copy)
Or
Required global permission: Copy any workspace
To create a workspace based on an existing workspace, either:
On the workspace management view of the workspace to copy, from the workspace actions menu, select Duplicate Workspace.
On Workspaces view, click the actions menu for the workspace, then select Duplicate Workspace.
When you create a copy of a workspace, the copy initially inherits the following workspace configuration:
Source and destination database connections
Sensitivity designations, including manual designations that override the sensitivity scan results
Table mode assignments
Generator configuration
Subsetting configuration
Post-job scripts
Required license: Enterprise
Required workspace permission: Create child workspaces (in the parent workspace)
You can create a workspace that is a child of an existing workspace. You cannot create a child workspace of another child workspace.
The parent workspace must have a source database configured. You cannot create a child workspace from a workspace that uses the Databricks, Spark (Amazon EMR or self-managed Spark cluster), or MongoDB data connector.
To create a child workspace, either:
On Workspaces view:
Click Create Workspace > Child Workspace.
Click the actions menu for the parent workspace, then select Create Child Workspace.
On the workspace management view, from the workspace actions menu, select Create Child Workspace.
On the New Workspace view, under Child Workspace, Parent Workspace identifies the parent workspace.
If you used the Create Workspace > Child Workspace option to create the child workspace, then Parent Workspace is not populated. From the Parent Workspace dropdown list, select the parent workspace for the new child workspace.
If you selected the child workspace option for a specific workspace, then Parent Workspace is set to that workspace.
If you originally chose to create a completely new workspace, then on the New Workspace view:
To change to a child workspace, select Create Child Workspace from the Create a child workspace panel at the right. Structural adds the Child Workspace panel to the New Workspace view.
From the Parent Workspace dropdown list, select the parent workspace for the new child workspace.
Required workspace permission: Configure workspace settings
To edit the configuration for an existing workspace, either:
On the workspace management view:
On the workspace navigation bar, click Settings.
From the workspace actions menu, select Settings.
On Workspaces view, click the actions menu for the workspace, then select Settings.
Required workspace permission: Delete workspace
You can delete workspaces that you no longer need.
You cannot delete a parent workspace. You must first delete all of its child workspaces.
To delete a workspace:
On the workspace management view, from the workspace actions menu, select Delete Workspace.
On the Workspaces view, click the actions menu for the workspace, then select Delete.
The workspace details for a new or edited workspace specify information about the workspace and the workspace data.
All workspaces have the following fields, used to identify the workspace and indicate the connector type:
In the Workspace name field, enter the name of the workspace.
In the Workspace description field, provide a brief description of the workspace. The description can contain up to 200 characters.
Depending on your Tonic Structural license agreement, you can create data generation workspaces, data science mode workspaces, or both.
Under Data Science Mode, the Enable Data Science Mode toggle determines whether the workspace is a data generation workspace or a data science mode workspace.
If your instance only supports data generation workspaces, then the toggle is not displayed.
If your instance only supports data science mode workspaces, then the toggle is displayed and locked in the on position.
If your instance supports both data generation and data science mode workspaces, then the toggle is displayed. By default, it is in the off position, indicating to create a data generation workspace. To create a data science mode workspace, toggle Enable Data Science Mode to the on position.
For data generation, the source and destination databases are always of the same type.
The Basic and Professional licenses limit the number and type of data connectors you can use.
A Basic instance can only use one data connector type, which can be either PostgreSQL or MySQL. After you create your first workspace, any subsequent workspaces must use the same data connector type.
A Professional instance can use up to two different data connector types, which can be any type other than Oracle or Db2 for LUW. After you create workspaces that use two different data connector types, any subsequent workspaces must use one of those data connector types.
If you don't see the database that you want to connect to, or you want to have different database types for your source and destination database, contact support@tonic.ai.
After you select the connector type, you first configure the connection to the source data.
For a data generation workspace, the Destination Settings section provides information about where and how Structural writes the output data from data generation.
For a data science mode workspace, you do not configure destination information.
For data connectors other than the file connector, depending on the connector type, you can write to either:
Destination database - Writes the output data to a destination database on a database server.
Ephemeral snapshot - Writes the output data to a Tonic Ephemeral user snapshot.
Container repository - Writes the output data to a data volume in a container repository.
For the file connector, you might need to provide a cloud storage location for the transformed files.
When you write the output to a destination database, the destination database must be of the same type as the source database.
Structural does not create the destination database. It must exist before you generate data.
If available, the Copy Settings from Source option allows you to copy the source connection details to the destination database, if both databases are in the same location. Structural does not copy the connection password.
For data connectors that support upsert, when you write the output to a destination database, the connection details include an Upsert section to allow you to enable and configure upsert.
Upsert is not available for output to an Ephemeral database or to a container repository.
If Ephemeral supports your workspace database type, then you can choose to write the destination data to a snapshot in Ephemeral. For data larger than 10 GB, this option is recommended instead of writing to a container repository.
From Ephemeral, you can use the snapshot to start new Ephemeral databases.
Some data connectors allow you to choose to write the transformed data to a data volume in a container repository instead of to a database server.
You can use the resulting data volume to create a database in Tonic Ephemeral. If you do plan to use the data to start an Ephemeral database, and the size of the data is larger than 10 GB, then the recommendation is to write the data to an Ephemeral user snapshot instead.
For a file connector workspace that transforms files from cloud storage (Amazon S3 or Google Cloud Storage), you provide the output location.
Whenever you provide connection details for a database server, Structural provides a Test Connection button to test the connection, and verify that Structural can use the connection details to connect to the database. Structural uses the connection details to try to reach the database, and indicates whether it succeeded or failed. We strongly recommend that you test the connections.
Most data generation workspaces have a Block data generation if schema changes detected toggle. The setting is usually in the Source Settings section.
By default, the option is turned off. When the option is off, Structural only blocks data generation when there are conflicting schema changes. Structural does not block data generation when there are non-conflicting schema changes.
For generators where consistency is enabled, a statistics seed enables consistency across data generation runs. The Structural-wide statistics seed value ensures consistency across both data generation runs and workspaces.
In the workspace configuration, under Destination Settings, use the Override Statistics Seed setting to override the Structural-wide statistics seed value. You can either disable consistency across data generations, or provide a seed value for the workspace. The workspace seed value ensures consistency across data generation runs for that workspace, and across other workspaces that have the same seed value.
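The role of a shared statistics seed can be illustrated with a small conceptual sketch (this is not Structural's implementation; the name-mapping function and value lists are hypothetical). The same seed deterministically maps a source value to the same output, so runs and workspaces that share a seed stay consistent:

```python
import hashlib
import random

def masked_name(value: str, seed: str) -> str:
    """Map a source value to a fake name, deterministically per (seed, value)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    rng = random.Random(digest)  # per-value RNG derived from the shared seed
    first = rng.choice(["Alice", "Bob", "Carol", "Dave"])
    last = rng.choice(["Smith", "Jones", "Lee", "Patel"])
    return f"{first} {last}"

# The same seed yields the same output across runs (and across workspaces
# that share the seed); a different seed yields an independent mapping.
a = masked_name("jdoe@example.com", seed="shared-seed")
b = masked_name("jdoe@example.com", seed="shared-seed")
print(a == b)  # True
```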
For a data science mode workspace, instead of connecting to a database, you can upload one or more CSV files that contain the data that you want to use. Each file that you upload becomes a table in your source data. You can then issue model queries against the data.
To indicate to use CSV files to provide the source data, for Connection Type, under Upload your own data, click CSV.
Under Add dataset files, to add files to the list, either:
Click Select files to upload, then select the files.
Drag and drop the files from your machine.
You cannot upload a file with the same name as an existing file in the list. To replace the data in an existing file, you must delete the file and then upload the updated file.
To configure the options for a file:
If the file includes a heading row, then toggle Treat first row as column header to the on position.
In the Column Delimiter field, provide the character that is used as delimiter. The default is a comma.
In the Escape Character field, provide the character that is used to escape characters. The default is a backslash (\).
In the Quote Character field, provide the character that is used to quote text. The default is the double quote.
In the NULL Character field, provide the text used to indicate a null value. The default is \N.
To display a preview of the data in the file, click Expand.
To remove a file, click Remove.
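To see how the default file options above are interpreted, here is a small sketch using Python's csv module as a stand-in parser (the sample rows are hypothetical): a comma delimiter, double-quote quote character, the first row treated as a column header, and \N treated as NULL.

```python
import csv
import io

# Sample file content using the default options described above: comma
# delimiter, double-quote quote character, and \N for NULL values.
raw = 'id,name,note\n1,"Doe, Jane",\\N\n2,Bob,hello\n'

reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"')
header = next(reader)  # "treat first row as column header" turned on
rows = [[None if cell == r"\N" else cell for cell in row] for row in reader]
print(header)
print(rows)
```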
Requires Kubernetes.
Not compatible with upsert.
Not compatible with Preserve Destination or Incremental table modes.
Only supported for PostgreSQL, MySQL, and SQL Server.
You can configure a workspace to write destination data to a container repository instead of a database server.
Under Destination Settings, to indicate to write the destination data to container artifacts, click Container Repository.
You can switch between writing to a database server and writing to a container repository at any time. Structural preserves the configuration details for both options. When you run data generation, it uses the currently selected option for the workspace.
From the Database Image dropdown list, select the image to use to create the container artifacts.
Select an image version that is compatible with the version of the database that is used in the workspace.
For a MySQL workspace, you can provide a customization file that helps to ensure that the temporary destination database is configured correctly.
To provide the customization details:
Toggle Use customization to the on position.
In the text area, paste the contents of the customization file.
To provide the location where Structural publishes the container artifacts:
In the Registry field, type the path to the container registry where Structural publishes the data volume.
In the Repository Path field, provide the path within the registry where Structural publishes the data volume.
You next provide the credentials that Structural uses to read from and write to the registry.
When you provide the registry, Structural detects whether the registry is from Amazon Elastic Container Registry (Amazon ECR), Google Artifact Registry (GAR), or a different container solution.
It displays the appropriate fields based on the registry type.
For a registry other than an Amazon ECR or a GAR registry, the credentials can be either a username and access token, or a secret.
The option to use a secret is not available on Structural Cloud.
In general, the credentials must be for a user that has read and write permissions for the registry.
To use a username and access token:
Click Access token.
In the Username field, provide the username.
In the Access Token field, provide the access token.
To use a secret:
Click Secret name.
In the Secret Name field, provide the name of the secret.
For Azure Container Registry (ACR), the provided credentials must be for a service principal that has sufficient permissions on the registry.
Structural only supports Google Artifact Registry (GAR). It does not support Google Container Registry (GCR).
For a GAR registry, you upload a service account file, which is a JSON file that contains credentials that provide access to Google Cloud Platform (GCP).
The associated service account must have the Artifact Registry Writer role.
For Service Account File, to search for and select the file, click Browse.
For an Amazon ECR registry, you can either:
Provide the AWS access and secret key that is associated with the IAM user that will connect to the registry
(Self-hosted only) Use the credentials configured in the Structural environment settings TONIC_AWS_ACCESS_KEY_ID and TONIC_AWS_SECRET_ACCESS_KEY.
(Self-hosted only) If Structural is deployed in Amazon Elastic Kubernetes Service (Amazon EKS), then you can use the AWS credentials that live on the EC2 instance.
On Structural Cloud, you must provide an AWS access key and secret key.
On a self-hosted instance, you can choose the source of the credentials. The default is Access Keys.
To provide an AWS access key and secret key, click Access Keys.
To use the credentials configured in the environment settings, click Environment Variables.
To use the AWS credentials from the EC2 instance, click Instance Profile.
The IAM user must have permission to list, push, and pull images from the registry. The following example policy includes the required permissions.
For additional security, a repository name filter allows you to limit access to only the repositories that are used in Structural. You need to make sure that the repositories that you create for Structural match the filter.
For example, you could prefix Structural repository names with tonic-. In the policy, you include a filter based on the tonic- prefix:
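A policy along the following lines is an illustrative sketch, not an official Tonic.ai policy: the account ID 111122223333 is a placeholder, and your deployment may need a different action list. It grants ecr:GetAuthorizationToken globally and limits the repository actions to repositories whose names start with tonic-:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchCheckLayerAvailability",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:*:111122223333:repository/tonic-*"
    }
  ]
}
```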
In the Tags field, provide the tag values to apply to the container artifacts. You can also change the tag configuration for individual data generation jobs.
Use commas to separate the tags.
A tag cannot contain spaces. Structural provides the following built-in values for you to use in tags:
{workspaceId} - The identifier of the workspace.
{workspaceName} - The name of the workspace.
{timestamp} - The timestamp when the data generation job that created the artifact completed.
{jobId} - The identifier of the data generation job that created the artifact.
For example, the following creates a tag that contains the workspace name, job identifier, and timestamp:
{workspaceName}_{jobId}_{timestamp}
To also tag the artifacts as latest, check the Tag as "latest" in your repository checkbox.
You can also optionally configure custom resource values for the Kubernetes pods. You can specify the ephemeral storage, memory, and CPU millicores.
To provide custom resources:
Toggle Set custom pod resources to the on position.
Under Storage Size:
In the field, provide the number of megabytes or gigabytes of storage.
From the dropdown list, select the unit to use.
The storage can be between 32MB and 25GB.
Under Memory Size:
In the field, provide the number of megabytes or gigabytes of RAM.
From the dropdown list, select the unit to use.
The memory can be between 512MB and 4 GB.
Under Processor Size:
In the field, provide the number of millicores.
From the dropdown list, select the unit.
The processor size can be between 250m and 1000m.
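For readers familiar with Kubernetes, the settings above correspond roughly to a pod resources block like the hypothetical one below, with values inside the documented ranges (storage 32 MB to 25 GB, memory 512 MB to 4 GB, CPU 250m to 1000m). In Structural you enter these values in the workspace UI rather than writing YAML:

```yaml
# Illustrative values only; Structural configures these through the UI.
resources:
  requests:
    ephemeral-storage: "2Gi"
    memory: "1Gi"
    cpu: "500m"
  limits:
    ephemeral-storage: "2Gi"
    memory: "1Gi"
    cpu: "500m"
```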
In the Tags field, provide a comma-separated list of tags to assign to the workspace.
Under Connection Type, select the type of database to connect to. You cannot change the connection type on a child workspace.
For a data science mode workspace, there is also a CSV option, which allows you to use uploaded CSV files as the source of your model data.
When you select a connector type, Structural updates the view to display the connection fields used for that connector type. The specific fields vary based on the data connector.
For a workspace that connects to a database, the Source Settings section provides connection information for the source database. For information about the source connection fields for a specific data connector, go to the workspace configuration topic for that data connector.
For a file connector workspace, which uses files for source data, the File Location section indicates where the source files are obtained from: a local file system, Amazon S3, or Google Cloud Storage.
You cannot change the source data configuration for a child workspace.
In Destination Settings, you provide the connection information for the destination database. For information about the destination database connection fields for a specific data connector, go to the workspace configuration topic for that data connector.
Tonic Ephemeral is a separate Tonic.ai product that allows you to create temporary databases to use for testing and demos. For more information about Tonic Ephemeral, go to the Tonic Ephemeral documentation.
The TONIC_TEST_CONNECTION_TIMEOUT_IN_SECONDS environment setting determines the number of seconds before a connection test times out. You can configure this setting from the Environment Settings tab on Tonic Settings. By default, the connection test times out after 15 seconds.
If this option is turned on, then when Structural detects any changes at all to the schema, data generation is blocked until you resolve the schema changes.
You can use seed values to ensure consistency across data generation runs and databases.
For self-hosted Docker deployments, you can install and configure a separate Kubernetes cluster to use.
When it writes data generation output to a repository, Structural writes the destination data to a container volume. From the list of container artifacts, you can copy the volume digest, and download a Docker Compose file that provides connection settings for the database on the volume. Structural generates the Compose file when you make the request to download it.
You can also use the data volume to start a Tonic Ephemeral database. However, if the data is larger than 10 GB, we recommend that you write the data to an Ephemeral user snapshot instead.
For a Structural instance that is deployed on Docker, the Container Repository option is hidden unless you install and configure a separate Kubernetes cluster.
The secret is the name of a Kubernetes secret that lives on the pod that the Structural worker runs on. The secret type must be kubernetes.io/dockerconfigjson. The Kubernetes documentation provides more information about this secret type.
For Structural, the service principal must at least have the required permissions on the registry.
Workspaces view
View the list of workspaces that you have access to.
Create, edit, and delete workspaces
Add and remove workspaces, or update a workspace configuration.
Export and import workspace configuration
Save an existing configuration to apply to a workspace.
Workspace settings
Includes the name, description, and data connections.
Workspace management view
Provides access to workspace configuration and generation tools.
Workspace inheritance
Create child workspaces that inherit source data and configuration from their parent workspace.
Required license: Enterprise
If you have multiple workspaces, then it is likely that many of the workspace components and configurations are the same or similar. It can be difficult to maintain that consistency across separate, independent workspaces.
When you copy a workspace, the new workspace is completely independent of the original workspace. There is no visibility into or inheritance of changes from the original workspace.
Workspace inheritance allows you to create workspaces that are children of a selected workspace. Unlike a copy of a workspace, a child workspace remains tied to its parent workspace.
By default, a child workspace configuration is synchronized with the configuration of the parent. In other words, any changes to the parent workspace are copied to the child workspaces. Child workspaces can also override some of the parent configuration. You can track the child workspaces and how they are customized from the parent workspace.
For example, you might want separate workspaces for different development teams. Each team can make adjustments to suit their specific projects - such as different subsets - but inherit everything else.
By default, a child workspace inherits all of the configuration from the parent workspace, except for the following:
Workspace name - A child workspace has its own name.
Workspace description - A child workspace has its own description.
Tags - A child workspace has its own tags.
Destination database - A child workspace writes output data to its own destination database. You can copy the destination database from the parent workspace.
Intermediate database - For upsert, a child workspace does not inherit the intermediate database.
Webhooks - A child workspace has its own webhooks.
When you change the configuration of a parent workspace, the configuration is also updated in the child workspaces.
The exception is when a child workspace overrides the configuration. If the configuration is overridden, then the child workspace does not inherit the change.
Tonic Structural indicates on both the parent and child workspaces when the configuration is overridden.
A child workspace can override the following configuration items.
Table modes - A child workspace can override the table mode for individual tables. The other tables continue to inherit the table mode that is configured in the parent workspace.
Column generators - A child workspace can override the generator for individual columns. The other columns continue to inherit the generator that is configured in the parent workspace. For linked columns, a change to any of the linked columns overrides the inheritance for all of the columns.
Subsetting - A child workspace can override the subsetting configuration from the parent workspace. Any change in the child workspace means that the child workspace no longer inherits any changes to the subsetting configuration from the parent workspace. For example, if you change the percentage setting on a single target table from 5 to 6, that eliminates the subsetting inheritance. The child workspace keeps the subsetting configuration that it already has, but it is not updated when the parent workspace is updated.
Post-job scripts - A child workspace can override the post-job scripts. Any change to the post-job scripts in the child workspace means that the child workspace no longer inherits any changes to the post-job scripts configuration.
Statistics seed - A child workspace can override the statistics seed configuration.
From each view, you can eliminate the overrides and restore the inheritance.
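Conceptually, the effective configuration of a child workspace is the parent configuration with the child's permitted overrides applied on top. A minimal sketch, assuming plain dictionaries and illustrative field names (not Structural's actual schema):

```python
def effective_config(parent: dict, child_overrides: dict) -> dict:
    """Sketch: a child inherits every parent setting it does not override."""
    merged = dict(parent)           # start from the inherited parent settings
    merged.update(child_overrides)  # overridden items win and stop inheriting
    return merged

# Hypothetical settings: the child overrides only the subsetting percentage.
parent = {"table_mode": "De-identify", "subset_percent": 5}
child = effective_config(parent, {"subset_percent": 6})
```

Restoring inheritance for an item corresponds to deleting it from the overrides, after which the parent's value applies again.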
A child workspace cannot override the following configuration items:
Data connector type and source database - A child workspace always uses the same source data as the parent workspace.
Foreign keys - A child workspace always uses the same foreign key configuration as the parent workspace.
Sensitivity designation for a column - A child workspace cannot change whether a column is marked as sensitive.
For removed tables and columns, when a child workspace overrides the parent workspace configuration for the table or column, you must resolve the change in the child workspace.
If the parent workspace configuration contains a conflicting change for the removed table or column, then regardless of whether the configuration is inherited, you must resolve that change in the parent workspace before the change can be resolved for the child workspace.
For changes to column nullability or data type, you resolve the change separately in the child and parent workspaces.
You also dismiss notifications (new tables and columns) separately in the parent and child workspaces.
Required workspace permission: Configure workspace settings
You can associate custom tags with each workspace. Tags can help to organize workspaces and provide an at-a-glance view of the workspace configuration.
Tags can be seen by every user that has access to the workspace.
Tags are stored in the workspace JSON, and are included in the workspace export. You can also use the API to get access to tags.
You can add and edit tags in the Tags field on the New Workspace and Settings pages.
To add tags, enter a comma-separated list of the tags to add.
To remove a tag, click its delete icon.
You can also manage tags directly from Workspaces view.
To add tags to a workspace that does not currently have tags:
Hover over the Tags column for the workspace.
Click Add Tags.
In the tag input field, type a comma-separated list of tags to apply.
Press Enter.
To edit the assigned tags:
Click the Tags column for the workspace.
In the tag input field, to remove a tag, click its delete icon.
To add tags, type a comma-separated list of the tags to add.
To save the tag changes, press Enter.
Required license: Professional or Enterprise
Not compatible with writing output to a container repository or a Tonic Ephemeral snapshot.
By default, Tonic Structural data generation replaces the existing destination database with the transformed data from the current job.
Upsert allows you to add and update rows in the destination database, but keep all other existing rows intact. For example, you might have a standard set of test records that you do not want to have to replace every time you generate data in Structural.
If you enable upsert, then you cannot write the destination data to a container repository or to a Tonic Ephemeral snapshot. You must write the data to a database server.
Upsert is currently only supported for the following data connectors:
MySQL
Oracle
PostgreSQL
SQL Server
For an overview of upsert, you can also view the video tutorial.
When upsert is enabled, the data generation job writes the generated data to an intermediate database. Structural then runs the upsert job to write the new and updated records to the destination database.
The destination database must already exist. Structural cannot run an upsert job to an empty destination database.
The upsert job adds and updates records based on the primary keys.
If the primary key for a record already exists in the destination database, the upsert job updates the record.
If the primary key for a record does not exist in the destination database, the upsert job inserts a new row.
To only update or insert records that Structural creates based on source records, and ignore other records that are already in the destination database, ensure that the primary keys for each set of records operate on different ranges. For example, allocate the integer range 1-1000 for existing destination database records that you add manually. Then ensure that the source database records, and by extension the records that Structural creates during data generation, use a different range.
Also note that when upsert is enabled, the Truncate table mode does not actually truncate the destination table. Instead, it works more like Preserve Destination table mode, which preserves existing records in the destination table.
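The primary-key rule above can be sketched with SQLite's INSERT OR REPLACE. The users table, its columns, and the key ranges are illustrative, and this is not Structural's implementation:

```python
import sqlite3

# Hypothetical table that stands in for the destination database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'old@example.com')")       # existing row
conn.execute("INSERT INTO users VALUES (900, 'manual@example.com')")  # manual test record

# Rows produced by data generation. PK 1 already exists, so its row is
# updated; PK 2 does not exist, so a new row is inserted. PK 900 is never
# touched because manually added records use a disjoint key range.
generated = [(1, "masked-1@example.com"), (2, "masked-2@example.com")]
conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", generated)
```

Keeping manually added records on a key range that generated records never use is what guarantees they survive every upsert.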
To enable upsert, in the Upsert section of the workspace details, toggle Enable Upsert to the on position.
When you enable upsert for a workspace, you are prompted to configure the upsert processing and provide the connection details for the intermediate database.
When you enable upsert, Structural displays the following settings to configure the upsert process.
Required license: Enterprise
The intermediate database must have the same schema as the destination database. If the schemas do not match, then the upsert process fails.
To ensure that schema changes are automatically reflected in the intermediate database, you can connect the workspace to your own database migration script or tool. Structural then runs the migration script or tool whenever you run upsert data generation.
When you start an upsert data generation job:
If migration is enabled, Structural calls the endpoint to start the migration.
Structural cannot start the upsert data generation until the migration completes successfully. It regularly calls the status check endpoint to check whether the migration is complete.
When the migration is complete, Structural starts the upsert data generation.
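The start-then-poll flow above can be sketched as follows. The two callables stand in for HTTP requests to your migration service's endpoints; all names and the status strings' handling are assumptions based on the values listed in this section:

```python
import time

TERMINAL_FAILURES = ("Failed", "Canceled")

def run_migration(start_migration, get_status, poll_interval=5, timeout=600):
    """Start the migration, then poll its status until a terminal state."""
    task_id = start_migration()          # POST Start Schema Changes
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(task_id)     # GET Status of Schema Change
        if status == "Completed":
            return task_id               # upsert data generation can begin
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Migration {task_id} ended with status {status}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Migration {task_id} did not finish within {timeout}s")
```

Because data generation cannot start until the migration completes, any non-Completed terminal status must abort the run rather than fall through.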
Required. Structural calls this endpoint to start the migration process specified by the provided URL.
The request includes:
Any custom parameter values that you add.
The connection information for the intermediate database.
The request uses the following format:
The response contains the identifier of the migration task.
The response uses the following format:
Required. Structural calls this endpoint to check the current status of the migration process.
The request includes the task identifier that was returned when the migration process started. The request URL must be able to pass the task identifier as either a path or a query parameter.
The response provides the current status of the migration task. The possible status values are:
Unknown
Queued
Running
Canceled
Completed
Failed
The response uses the following format:
Optional. Structural calls this endpoint to retrieve the log entries for the migration process. It adds the migration logs to the upsert logs.
The request includes the task identifier that was returned when the migration process started. The request URL must be able to pass the task identifier as either a path or a query parameter.
The response body is text/plain and contains the raw logs.
Optional. Structural calls this endpoint to cancel the migration process.
The request includes the task identifier that was returned when the migration process started. The request URL must be able to pass the task identifier as either a path or a query parameter.
To enable the migration process, toggle Enable Migration Service to the on position.
When you enable the migration process, you must configure the POST Start Schema Changes and GET Status of Schema Change endpoints. You can optionally configure the GET Schema Change Logs and DELETE Cancel Schema Changes endpoints.
To configure the endpoints:
To configure the POST Start Schema Changes endpoint:
In the URL field, provide the URL of the migration script.
Optionally, in the Parameters field, provide any additional parameter values that your migration scripts need.
To configure the GET Status of Schema Change endpoint, in the URL field, provide the URL for the status check. The URL must include an {id} placeholder, which is used to pass the identifier that is returned from the Start Schema Changes endpoint.
To configure the GET Schema Change Logs endpoint, in the URL field, provide the URL to use to retrieve the logs. The URL must include an {id} placeholder, which is used to pass the identifier that is returned from the Start Schema Changes endpoint.
To configure the DELETE Cancel Schema Changes endpoint, in the URL field, provide the URL to use for the cancellation. The URL must include an {id} placeholder, which is used to pass the identifier that is returned from the Start Schema Changes endpoint.
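The {id} placeholder can appear as either a path or a query parameter; the task identifier is substituted before the URL is called. A sketch with hypothetical URLs:

```python
# Hypothetical migration-service URLs; only the {id} placeholder is significant.
path_style = "https://migrate.example.com/tasks/{id}/status"   # path parameter
query_style = "https://migrate.example.com/status?task={id}"   # query parameter

task_id = "3f2a"  # identifier returned by the Start Schema Changes endpoint
status_url = path_style.format(id=task_id)
cancel_url = query_style.format(id=task_id)
```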
When you enable upsert, you must provide the connection information for the intermediate database.
For details, go to the workspace configuration information for the data connector.
Workspaces view lists the workspaces that you have access to. To display Workspaces view, in the Tonic Structural heading, click Workspaces.
The workspace list contains:
Workspaces that you own
Workspaces that you are granted access to
If you have the global permission Copy any workspace or Manage user access to Tonic and to any workspace, then you see the complete list of workspaces.
The Permissions column lists the workspace permission sets that you are granted in each workspace. The permission sets include both permission sets that were granted to you directly as a user, and permission sets that were granted to an SSO group that you are a member of.
Child workspaces always display under their parent workspace. You can only see child workspaces that you have access to. If you have access to a child workspace, but not to its parent workspace, then the parent workspace is grayed out. You cannot select it.
You can filter the workspaces based on the following information:
Name - In the filter field, begin to type text that is in the name of the workspaces to display in the list.
Owner - From the Filter by Owner dropdown list, select the owner of the workspaces to display in the list.
Database type - From the Filter by Database Type dropdown list, select the type of database for the workspaces to display in the list.
Generation status - In the Generation Status column heading, click the filter icon. Check the checkbox next to the generation status values for the workspaces to display in the list.
Tags - In the Tags column heading, click the filter icon. By default, the workspaces are not filtered by tag, and all of the checkboxes are unchecked. To only include workspaces that have specific tags, check the checkbox next to each tag to include. To uncheck all of the selected tags, click Reset Tags. When you filter by tag, Structural checks whether each workspace contains any of the selected tags.
Permissions - In the Permissions column heading, click the filter icon. You can check and uncheck checkboxes to include or exclude specific permission sets. For example, you can filter the list to only display workspaces for which the Editor permission set is granted either to you or to an SSO group that you belong to. For users that have the global permission Copy any workspace, the Permissions filter panel also contains an Any permissions checkbox. By default, Any permissions is unchecked, and the list includes workspaces for which you are not assigned any workspace permission sets. To display all of the workspaces for which you have any assigned workspace permission sets, check Any permissions. If you filter the list based on a specific permission set, to clear the filter and show all workspaces for which you have any permission set, check Any permissions. To display all workspaces, including workspaces that you do not have any permissions for, uncheck Any permissions.
You can combine different filters. For example, you can filter the list to only include workspaces that use PostgreSQL and for which the generation status is Canceled or Failed.
Child workspaces always display under their parent workspace, even if the parent workspace does not match the filter.
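The tag filter's any-of matching described above amounts to a set intersection. A sketch of the assumed behavior:

```python
def matches_tag_filter(workspace_tags, selected_tags):
    """A workspace passes the filter if no tags are selected (the default),
    or if it carries at least one of the selected tags (OR semantics)."""
    if not selected_tags:
        return True
    return bool(set(workspace_tags) & set(selected_tags))
```

Combining this with other filters (database type, generation status) is an AND across filters, while the tag filter itself is an OR across the selected tags.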
You can sort the workspace list by name, status, or owner.
By default, the list is sorted alphabetically by name.
To sort by a column, click the column heading. To reverse the order of the sort, click the column heading again.
Child workspaces always display under their parent workspace. The child workspaces are sorted within the parent.
Workspaces view provides the following information about each workspace:
Name - Contains the name and database type for the workspace. To view the workspace description, hover over the name.
Generation status - The status of the most recent generation job. To display the job details, click the job status. To see the date, time, and duration of the job, hover over the generation timestamp. If the most recent job failed, the column also indicates how long the job has been failing: the date of the first failure in the continuous series of failures.
Schema changes - Indicates whether Structural detected changes to the source database schema. If there are changes, the column shows the number of changes. Hover over the column value to display additional details, and to navigate to the Schema Changes page. See Viewing and resolving schema changes.
Tags - The tags that are assigned to the workspace.
Permissions - The permission sets that are assigned to you for the workspace.
Owner - The name and email address of the workspace owner.
On Workspaces view, when you click the workspace name, the workspace management view for the workspace is displayed. The Privacy Hub tab is selected.
The Name column also provides access to a menu of workspace configuration options. When you select an option, the workspace management view is displayed, open to the view for the selected option.
The last column in the workspaces list provides additional workspace options:
Subsetting icon - Displays the subsetting configuration for the workspace. See Viewing the current subsetting configuration.
Post-job actions icon - Displays the post-job actions for the workspace. For more information, go to Post-job scripts and Webhooks.
Actions menu - Provides access to additional options.
The Actions menu at the top left of the workspaces list allows you to perform bulk actions on multiple workspaces. It is enabled when you check one or more of the checkboxes in the first column of each row. The Actions menu provides options for the selected workspaces.
You use the workspace management view to configure and run data generation for an individual workspace.
When you log in to Tonic Structural, it displays the workspace management view for the workspace that was selected when you logged out.
The workspace management view includes the following components.
The top left of the workspace management view provides information about the workspace, including:
The workspace name
When the workspace was last updated
The user who last updated the workspace
The top right of the workspace management view provides general options for working with the workspace, including:
Undo and redo options for configuration changes
The workspace download menu to:
Download sensitivity scan and privacy reports
The workspace actions menu
The workspace navigation bar provides access to workspace configuration options.
To display the workspace management view for a workspace:
On Workspaces view, in the Name column, either:
Click the workspace name. The workspace management view opens to Privacy Hub.
Click the dropdown icon, then select a workspace management option.
Click the search field at the top, then begin to type the name of the workspace. As you type, Structural displays a list of matching workspaces. In the list, click the workspace name.
To reduce the amount of vertical space used by the heading of the workspace management view, you can collapse it.
To collapse the heading, click the collapse icon in the Structural heading.
When you collapse the workspace management heading:
The workspace information is hidden. The workspace name is displayed in the search field.
The workspace options are moved up into the Structural heading.
The workspace navigation bar remains visible.
When you collapse the heading, the collapse icon changes to an expand icon. To restore the full heading, click the expand icon.
Required workspace permission: Export and import workspace
You can export a workspace configuration to a JSON file, and import configuration from a workspace configuration JSON file.
For example, you might want to preserve a version of the workspace configuration before you test other changes. You can then use the exported file to restore the original configuration.
Or you might want to use a script to make changes to an exported configuration file. You can then import the updated file to update the workspace configuration.
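A script that edits an exported configuration file might look like the following. The key names here are hypothetical, since the actual export schema is defined by Structural:

```python
import json
from pathlib import Path

# Create a stand-in export file. In practice this file comes from the
# Export Workspace option; the keys below are illustrative only.
src = Path("workspace_export.json")
src.write_text(json.dumps({"description": "baseline", "tags": ["dev"]}))

# Scripted change: load, modify, and write an updated configuration
# that can then be imported back into the workspace.
config = json.loads(src.read_text())
config["tags"].append("restored")
Path("workspace_updated.json").write_text(json.dumps(config, indent=2))
```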
For data generation workspaces, the workspace JSON configuration file includes the following information:
Sensitivity designations that you assigned to columns
Assigned table modes
Assigned column generators
Subsetting configuration
Post-job script configuration
For data science mode workspaces, the workspace JSON configuration file includes the model details.
To export the workspace configuration, either:
On the workspace management view, from the download menu, select Export Workspace.
On Workspaces view, click the actions menu for the workspace, then select Export.
When you export a child workspace, the exported workspace does not retain any of the inheritance information. The exported information is the same for all exported workspaces.
To import a workspace configuration file:
Select the import option. Either:
On the workspace management view, from the download menu, select Import Workspace.
On Workspaces view, click the actions menu for the workspace, then select Import.
On the Import Workspace dialog, to select the file to import, click Browse.
After you select the file, click Import.
When you import a workspace configuration into a child workspace, Tonic Structural only updates the configuration that can be overridden. If a configuration item must be inherited from the parent workspace, then it is not affected by the imported configuration.
Disable Triggers
Indicates whether to disable any user-defined triggers before the upsert job runs. This prevents duplicate rows from being added to the destination database. By default, this is enabled.
Automatically Start Upsert After Successful Data Generation
Indicates whether to immediately run the upsert job after the initial data generation to the intermediate database. By default, this is enabled. If you turn this off, then after the initial data generation, you must start the upsert job manually.
Persist Conflicting Data Tables
When an upsert job cannot process rows with unique constraint conflicts, as well as rows that have foreign keys to those rows, this setting indicates whether to preserve the temporary tables that contain those rows. By default, this is disabled. Structural only keeps the applicable temporary tables from the most recent upsert job.
Warn on Mismatched Constraints
Indicates whether to treat mismatched foreign key and unique constraints between the source and destination databases as warnings instead of errors, so that the upsert job does not fail. By default, this is disabled.
When you create a workspace, you become the owner of the workspace, and by default are assigned the built-in Manager workspace permission set for the workspace. The Manager permission set provides full access to the workspace configuration, data, and results.
With a Professional or Enterprise license, you can also assign workspace permission sets to other users and to SSO groups. You can also transfer a workspace to a different owner.
If you are granted access to any workspace permission set for a workspace, then you can see all of the workspace management views for that workspace. However, you can only perform tasks that you have permission for in that workspace.
Workspace access is managed from the Workspaces view. You cannot assign workspace permission sets from Tonic Settings view.
You can also view an overview video tutorial about workspace access.
Required license: Professional or Enterprise
Required permission
Global permission: View organization users. This permission is only required for the Tonic Structural application. It is not needed when you use the Structural API.
Either:
Workspace permission: Share workspace access
Global permission: Manage user access to Tonic and to any workspace
Tonic Structural uses workspace permission sets for role-based access control (RBAC) of each workspace.
You cannot remove the owner workspace permission set from the workspace owner. By default, the owner permission set is the built-in Manager permission set.
To change the current access to the workspace:
To manage access to a single workspace, either:
On the workspace management view, in the heading, click the share icon.
On Workspaces view, click the actions menu for the workspace, then select Share.
To manage access for multiple workspaces:
Check the checkbox for each workspace to grant access to.
From the Actions menu, select Share Workspaces.
The workspace access panel contains the current list of users and groups that have access to the workspace. To add a user or group, begin to type the user email address or group name, then select the user or group from the list of matches.
Free trial users can invite other users to start their own free trial. Provide the email addresses of the users to invite. The email addresses must have the same corporate email domain as your email address. When the invited users sign up for the free trial, they are added to the Structural organization of the free trial user that invited them, and have access to the workspace.
For a user or group, to change the assigned workspace permission sets:
Click Access. The dropdown list is populated with the list of custom and built-in workspace permission sets. If you selected multiple workspaces, then on the initial display of the workspace sharing panel, for each permission set that a user or group currently has access to, the list shows the number of workspaces for which the user or group has that permission set. For example, you select three workspaces. A user currently has Editor access for one workspace and Viewer access for the other two. The Editor permission set has 1 next to it, and the Viewer permission set has 2 next to it.
Under Custom Permission Sets, check the checkbox next to each workspace permission set to assign to the user or group. Uncheck the checkbox next to each workspace permission set to remove from the user or group.
Under Built-In Permission Sets, check the workspace permission set to assign to the user or group. You can only select one built-in permission set to assign. By default, for an added user or group, the Editor permission set is selected. To select a built-in workspace permission set that is lower in access than the currently selected permission set, you must first uncheck the selected permission set. For example, if Editor is currently checked, then to change the selection to Viewer, you must first uncheck Editor.
To remove all access for a user or group, and remove the user or group from the list, click Access, then click Revoke.
To save the new access, click Save.
Tonic Structural runs the following types of jobs on a workspace:
Sensitivity scans, which analyze the source database to identify sensitive data.
Collection scans, which analyze the source data for a MongoDB workspace to determine the available fields in each collection, the field types, and how prevalent the fields are.
Data generation, data pipeline generation, and containerized generation jobs, which generate the destination data from the source data.
Upsert data generation jobs, which generate the intermediate database from the source database.
Upsert jobs, which use data from the intermediate database to add new rows to and update changed rows in the destination database. If the migration process is enabled, then it is a step in the upsert job.
SDK table statistics jobs. These jobs only run when you use the SDK to generate data in a Spark workspace, and the assigned generators require the statistics.
Model training jobs. These jobs only run on data science mode workspaces. A model training job shows the results of a model being trained. A trained model can be used to generate synthetic data.
You can view a list of jobs that ran on the workspace, and view details for individual jobs.
The Job History view displays the list of jobs that ran on the workspace. The list includes the 100 most recent jobs.
To display the Job History view:
On the workspace management view, in the workspace navigation bar, click Jobs.
On Workspaces view, from the dropdown menu in the Name column, select Jobs.
For each job, the job list includes the following information:
Job ID - The identifier of the job. To copy the job ID, click the icon at the left of the row.
Type - The type of job.
Submitted - The date and time when the job was submitted.
Completed - The date and time when the job finished running.
A job can have one of the following statuses:
Queued - The job is queued to run, but has not yet started. A job is queued for one of the following reasons:
Another job is currently running on the same workspace. For example, you cannot run a sensitivity scan and a data generation, or multiple data generations, at the same time on the same workspace. This is true regardless of the number of workers on the instance.
There is no available worker on the instance to run the job. A Structural instance with one worker can only run one job at a time. If a job from one workspace is currently running, a job from another workspace cannot start until the first job finishes.
To view information about why a job is queued, click the status value.
Running - The job is in progress.
Canceled - The job is canceled.
Completed - The job completed successfully.
Failed - The job failed to complete.
Each of these statuses has a corresponding "with warnings" variant, such as Running with warnings or Completed with warnings. A "with warnings" status indicates that the job produced at least one warning.
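The two queueing conditions described under the Queued status amount to a simple admission rule. A sketch of the assumed behavior:

```python
def can_start(job_workspace, running_workspaces, worker_count):
    """A queued job starts only when its workspace has no running job
    and the instance has a free worker."""
    workspace_busy = job_workspace in running_workspaces
    worker_free = len(running_workspaces) < worker_count
    return not workspace_busy and worker_free
```

This is why adding workers helps jobs from different workspaces run concurrently, but never allows two jobs to run at once on the same workspace.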
You can filter the list by either the type or the status.
To filter the list by the job type:
Click the filter icon in the Type column heading. By default, all types are included, and none of the checkboxes are checked.
To only include specific types of jobs, check the checkbox next to each type to include. Checking all of the checkboxes has the same effect as unchecking all of the checkboxes.
To filter the list by the job status:
Click the filter icon in the Status column heading. The status panel displays all of the statuses that are currently in the list. For example, if there are no Queued jobs, then the Queued status is not in the list. By default, all of the statuses are included, and none of the checkboxes are checked.
To only include jobs that have specific statuses, check the checkbox next to each status to include. Checking all of the checkboxes has the same effect as unchecking all of the checkboxes.
You can sort the jobs by either the submission or completion timestamp.
To sort by submission date, click the Submitted column heading. To reverse the sort order, click the heading again.
To sort by completion date, click the Completed column heading. To reverse the sort order, click the heading again.
For jobs other than Queued jobs, you can display details about the workspace and the job progress.
From the Job History view, to display the details for a job, click the job row.
The left side of the job details view contains the workspace information.
For a sensitivity scan, the workspace information is limited to the owner, database type, and worker version.
For a data generation job, the workspace information also includes:
Whether subsetting, post-job scripts, or webhooks are used.
The number of schemas, tables, and columns in the source database.
The number of schemas, tables, and columns in the destination database.
The Job Log tab shows the start date, start time, and duration of the job, followed by the list of job process steps.
For data generation jobs, the Privacy Report tab displays the number of at-risk, protected, and not sensitive columns in the source database.
At-risk columns contain sensitive data, but still have Passthrough as the assigned generator.
Protected columns have an assigned generator other than Passthrough.
Not sensitive columns have Passthrough as the assigned generator, but do not contain sensitive data.
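The three categories follow from two attributes of a column: whether it is sensitive, and whether a generator other than Passthrough is assigned. A sketch of the classification rules above:

```python
def privacy_status(is_sensitive: bool, generator: str) -> str:
    """Classify a column for the Privacy Report, per the rules above."""
    if generator != "Passthrough":
        return "Protected"          # a transforming generator is assigned
    return "At-risk" if is_sensitive else "Not sensitive"
```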
For a data generation that writes the output to Ephemeral, the Data Available in Tonic Ephemeral panel provides access to the database or snapshot.
To navigate to Ephemeral and view the details for an Ephemeral snapshot, click View Snapshot in Tonic Ephemeral.
To display the connection details for an Ephemeral database, click View connection info.
For an Ephemeral database, the connection details include:
The database location and credentials. Each field contains a copy icon to allow you to copy the value.
SSH tunnel information, including instructions on how to create an SSH tunnel from your local machine to the Ephemeral database.
For a new Ephemeral account, you also receive an activation email message.
The job identifier is a unique identifier for the job. To copy the job ID, either:
You can cancel Queued or Running jobs.
For jobs with those statuses, the rightmost column in the job list contains a cancel icon.
To cancel the job, click the icon.
Required workspace permission: Download job logs
To download diagnostic logs, you must have the Enable diagnostic logging global permission.
For all jobs, the job logs provide detailed information about the job processing. Tonic.ai support might request the job logs to help diagnose issues.
For upsert jobs where the migration process is enabled, and you configured the GET Schema Change Logs endpoint, the upsert job logs include the migration process logs.
You can download the job logs from the Job History view or the job details view. The download includes up to 1MB of log entries.
On the Job History view, to download the logs for a job, click the download icon in the rightmost column.
On the job details view, to download the logs for a job, click Download, then select Job Logs.
To access diagnostic log files, you must have the Enable diagnostic logging global permission.
If a job used diagnostic logging and you do not have the Enable diagnostic logging global permission, then you cannot download the logs for that job. The download option is disabled.
Required workspace permission: View and download Privacy Report
From the job details view, you can download a Privacy Report file that provides an overview of the current protection status of the database columns based on the workspace configuration at the time that the job ran.
You can download either:
The Privacy Report .csv file, which provides details about the table columns, the column content, and the current protection configuration.
The Privacy Report PDF file, which provides charts that summarize the privacy ranking scores for the table columns. It also includes the table from the .csv file.
To display the download options, click Download. In the download menu:
To download the Privacy Report .csv file, click Privacy Report CSV.
To download the Privacy Report PDF file, click Privacy Report PDF.
For workspaces that are connected to Amazon Redshift or Snowflake on AWS databases, the data generation job requires multiple calls to a Lambda function. For these data generation jobs, the CloudWatch logs track the progress of these Lambda function calls and record any errors.
To download the CloudWatch logs for a data generation job, on the job details view, click Download, then select CloudWatch Logs.
The CloudWatch Logs option only displays for Amazon Redshift and Snowflake on AWS data generation jobs.
Required workspace permission: Download SqlLdr Files
For an Oracle data generation, if both of the following are true:
The data generation job ran SQL Loader (sqlldr).
sqlldr either failed or succeeded with errors.
Then to download the sqlldr log files, click Download, then select sqlldr Logs.
For a data generation from a file connector workspace that uses local files, you can download the transformed files for that job.
The download is a .zip file that contains the files for a selected file group.
On the job details view, when files are available to download, the Data available for file groups panel displays.
To download the files for a file group:
Click Download Results.
From the list, select the file group. Use the filter field to filter the list by the file group name.
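The archive itself is an ordinary .zip bundle of the transformed files. The sketch below is illustrative only, not Structural's implementation; it uses Python's standard zipfile module to pack a hypothetical file group into an in-memory archive:

```python
import io
import zipfile

def bundle_file_group(files: dict) -> bytes:
    """Pack the transformed files of one file group into a .zip archive.

    `files` maps file names to transformed file contents (bytes).
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

# Hypothetical transformed output for a one-file group:
archive = bundle_file_group({"customers.csv": b"id,name\n1,Alice\n"})
```

Unpacking the returned bytes with zipfile yields the original file names and transformed contents.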
Privacy Hub tracks the current protection status of source data columns based on the column sensitivity (from the most recent sensitivity scan or from manual assignments), the assigned generators, and the assigned table modes.
To display Privacy Hub, either:
On the workspace management view, in the workspace navigation bar, click Privacy Hub.
On Workspaces view, click the workspace name.
From Privacy Hub, you can:
Review and apply the recommended generators for all detected sensitive columns
View the current protection status of columns
Manually mark columns as sensitive or not sensitive
Configure protection for sensitive columns
Download a preview Privacy Report
Run a new sensitivity scan
The sensitivity scan detects specific types of sensitive data.
If your workspace contains any columns that the sensitivity scan identified, and that you have not either:
Assigned a generator
Marked as not sensitive
Then Tonic Structural displays a Sensitivity Recommendations banner that contains a count of those columns.
The count only includes sensitive columns that the sensitivity scan detected. Columns that you manually mark as sensitive are not included in the count.
On the banner, the Review Recommendations option allows you to review the detected columns and the recommended generators for each detected sensitive data type.
You can then apply the recommended generators or ignore the recommendation. When you ignore a recommendation, you either:
Indicate to remove the generator recommendation for the column.
Indicate that the column data is not sensitive.
The protection status panels at the top of Privacy Hub provide an overview of the current protection status of the columns in the source data.
Each panel displays:
The number of columns that are in that category
The estimated percentage of columns that are in that category
The column counts do not include columns that do not have data in the destination database. For example, if a table is assigned Truncate table mode, then Privacy Hub ignores the columns in that table.
The information on these panels updates automatically as you change whether columns are sensitive and assign generators to columns.
The At-Risk Columns panel reflects columns that:
Are populated in the destination database.
Are marked as sensitive.
Have the generator set to Passthrough, which indicates that Structural does not perform any transformation on the data.
The goal is to have 0 at-risk columns.
The Protected Columns panel reflects columns that:
Are populated in the destination database.
Are assigned a generator other than Passthrough.
It includes both sensitive and non-sensitive columns.
Note that a column is considered protected based solely on the assigned generator. Some more complex generators, such as JSON Mask or Conditional, allow you to apply different generators to specific portions of a value or based on a specific condition. However, the protection status does not reflect these sub-generators. An applied sub-generator could be Passthrough.
The Not Sensitive Columns panel reflects columns that:
Are populated in the destination database.
Are marked as not sensitive.
Have the generator set to Passthrough.
The Database Tables list shows the protection status for each table in the source database. You can view the number of columns that have each protection status, and update the column configuration.
The list does not include tables where the table mode is Truncate or Preserve Destination. Truncated tables are not populated in the destination database. For Preserve Destination tables, the existing data in the destination database does not change.
For each table, Database Tables provides the following information:
Privacy Status - Indicates the current protection status of the columns in the table. It provides the same view and configuration options as the protection status panels at the top of Privacy Hub.
You can filter the Database Tables list either by the table name or by the schema.
To filter the list by table name, in the filter field, begin typing text in the table name. As you type, Structural updates the list to only display matching tables.
To filter the list to only include tables that belong to a specific schema:
Click Filter by Schema.
From the schema dropdown list, select the schema.
When you select a schema, Structural adds it to the filter field.
You can sort the Database Tables list by any column except for the Privacy Status column.
To sort by a column, click the column heading. To reverse the sort order, click the heading again.
The Privacy Status column in the Database Tables list indicates the protection status of the columns in the table.
Each protection status panel displays a series of boxes to represent the columns that apply to that status. For example, if the source data contains four at-risk columns, then the At-Risk Columns panel displays four boxes, one for each column.
The Privacy Status column in the Database Tables list displays the same set of boxes for the columns in an individual table.
If the number of columns is too large to fit, then the last box shows the number of additional columns that apply. For example, if there are 15 columns that don't fit, then the last box is labeled +15.
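One plausible reading of this overflow behavior, as a sketch (the function name and box limit are illustrative, not Structural's implementation):

```python
def status_boxes(columns: list, max_boxes: int) -> list:
    """One box per column; collapse the overflow into a single "+N" box.

    When there are more columns than boxes, the last box shows the
    number of additional columns that did not fit.
    """
    if len(columns) <= max_boxes:
        return columns
    shown = columns[: max_boxes - 1]
    return shown + ["+{}".format(len(columns) - len(shown))]

status_boxes(["a", "b", "c", "d"], max_boxes=3)  # ['a', 'b', '+2']
```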
When you hover over a box, the column name displays in a tooltip.
When you click a box, the details panel for that column displays.
When you click the box for the remaining columns, the details panel for the first of those columns displays.
You can use the next and previous icons at the bottom right of the details panel to display the details for the next or previous column.
The column details panel opens to the settings view. The settings view contains the following information:
The table and column name.
Whether the column is flagged as sensitive.
The type of PII that the column contains.
The data type for the column data.
The generator that is assigned to the column.
For a child workspace, whether the column configuration is inherited from the parent workspace. For columns that have overrides, you can reset to the parent configuration.
Required workspace permission: Configure column sensitivity
From the settings view of the column details, you can configure the column sensitivity.
As you change the column sensitivity, Structural updates the protection status panels.
To change whether the column is sensitive, toggle the Sensitive option. The column is moved if needed to reflect its new status. However, you remain on the current panel.
For example, from the At-Risk Columns panel, you change a column to be not sensitive. The column is moved to the Not Sensitive Columns panel. When you click the next or previous icons, you view the details for the next or previous column on the At-Risk Columns panel.
Required workspace permission: Configure column generators
From the column details, you can assign and configure the column generator.
When you change the column generator, Structural updates the protection status panels.
If the column generator was previously Passthrough, then the column is moved to the Protected Columns panel. However, you remain on the current panel. For example, you assign a generator to a column that is on the At-Risk Columns panel. The column is moved to the Protected Columns panel, but when you click the next or previous icons, you view the details for the next or previous column on the At-Risk Columns panel.
For sensitive columns that are not protected, Structural displays the recommended generator as a button.
For self-hosted instances that have an Enterprise license, the recommended generator is the built-in generator preset.
To assign the recommended generator to the column, click the button.
Otherwise, select the generator from the Generator Type dropdown list.
If the selected generator requires additional configuration, then below the Generator Type dropdown list is an Edit Generator Options link.
To display the configuration fields for the generator, click Edit Generator Options.
After you configure the generator, to return to the settings view, click Back.
Required workspace permission:
Source data: Preview source data
Destination data: Preview destination data
From the column details, you can display sample data for the column. The sample data allows you to compare the source and destination versions of the column values.
To display the sample data, click the view sample (magnifying glass) icon.
On the sample data view of the column details:
The Original Data tab shows the values in the source data.
The Protected Output tab shows the values that the generator produced.
Required license: Professional or Enterprise
From the column details, you can view and add comments on the column. You might use a comment to explain why you selected a particular generator or marked a column as sensitive or not sensitive.
From the column details, to display the comments for the column, click the comment icon.
The comments view displays any existing comments on the column. The most recent comment is at the bottom of the list. Each comment includes the name of the user who made the comment.
To add the first comment to a column, type the comment into the comment text area, then click Comment.
To add an additional comment, type the comment into the comment text area, then click Reply.
Required license: Enterprise
The Privacy Report files that you download from Privacy Hub or the workspace download menu provide an overview of the current protection status based on the current configuration.
This is different from the Privacy Report files that you download from the data generation job details, which show the protection status after the data generation.
You can download either:
The Privacy Report .csv file, which provides details about the table columns, the column content, and the current protection configuration.
The Privacy Report PDF file, which provides charts that summarize the privacy ranking scores for the table columns. It also includes the table from the .csv file.
From the workspace management view, click the download icon. In the download menu:
To download the Privacy Report PDF file, click Download Privacy Report PDF.
To download the Privacy Report .csv file, click Download Privacy Report CSV.
From Privacy Hub, click Download, then
To download the Privacy Report .csv file, click Privacy Report CSV.
To download the Privacy Report PDF file, click Privacy Report PDF.
Required workspace permission: Run sensitivity scan
You add columns to the source database. The new scan identifies whether the new columns contain sensitive data.
The data in a column changes significantly, and a column that Structural originally marked as not sensitive might now contain sensitive data.
To run a new sensitivity scan, click Run Sensitivity Scan.
When Structural runs a new sensitivity scan:
Structural analyzes and determines the sensitivity of any new columns.
It does not change the sensitivity of existing columns that you marked as sensitive or not sensitive.
For existing columns that you did not change the sensitivity of:
Structural does not change the sensitivity of existing columns that the original scan marked as sensitive.
It can change the sensitivity of existing columns that the original scan marked as not sensitive.
The protection status panels are updated to reflect the results of the new scan.
A workspace permission set is a set of workspace permissions. Each permission provides access to a specific workspace feature or function.
Structural provides built-in workspace permission sets. Enterprise instances can also create custom workspace permission sets.
You can assign workspace permission sets to users and, if you use SSO to manage Structural users, to SSO groups. Before you assign a workspace permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group. For information on how to configure Structural to filter the allowed SSO groups, go to .
Status - The current status of the job, and how long ago the job reached that status. When you hover over the status, a tooltip displays the actual timestamp for the status change, and a summary of how long the job ran. For queued jobs, to display a panel with information about why the job is queued, click the status value.
A workspace can write output to a Tonic Ephemeral database. Workspaces that are not part of a free trial can also write the output to an Ephemeral snapshot, with an option to preserve the temporary Ephemeral database that is used to create the snapshot.
From the Job History view, click the copy icon in the leftmost column.
From the job details view, click the copy icon next to the job ID.
For workspaces that are configured to write destination data to container artifacts, the Job History view also provides access to those artifacts. For more information, go to .
By default, Structural redacts sensitive values from the job logs. To help support troubleshooting, you can configure data connectors or an individual data generation job to create unredacted versions of the log files, referred to as diagnostic logs. For more information, go to .
For more information about the Privacy Report files and their content, go to .
Column sensitivity, either from the most recent sensitivity scan or from manual assignments
Assigned table modes
Assigned generators
You can also track the history of changes to column sensitivity and the assigned column generators. For more information, go to .
From each panel, you can open the affected columns in Database View.
Click Open in Database View to navigate to Database View. The column list is filtered to show columns that are at risk.
Click Open in Database View to navigate to Database View. The column list is filtered to show all included columns that are protected.
Click Open in Database View to navigate to Database View. The column list is filtered to show included columns that are not sensitive and are not protected.
Name - The table name. For a file connector workspace, each table corresponds to a file group.
Not Sensitive - The number of not sensitive columns in the table. Not sensitive columns are not marked as sensitive and have Passthrough as the generator. Click the value to navigate to Database View, filtered to display the not sensitive columns for the table.
Protected - The number of protected columns in the table. Protected columns have an assigned generator. A protected column can be either sensitive or not sensitive. Click the value to navigate to Database View, filtered to display the protected columns for the table.
At-Risk - The number of at-risk columns in the table. These columns are marked as sensitive, but have Passthrough as the generator. The goal is to have 0 unprotected sensitive columns. Click the value to navigate to Database View, filtered to display the at-risk columns for the table.
This column provides the same view and configuration options as the protection status panels at the top of Privacy Hub, but is limited to the columns in a specific table.
You cannot change the sensitivity of columns in a child workspace. A child workspace always inherits the sensitivity from its parent workspace. For more information, go to About workspace inheritance.
For more information about selecting a generator, go to Assigning and configuring generators.
For information about configuring a selected generator or generator preset, go to the Generator reference.
For more information about the Privacy Report files and their content, go to .
Privacy Hub provides an option to manually start a new sensitivity scan. For example, you might want to run a new sensitivity scan when:
You cannot run a sensitivity scan on a child workspace. Child workspaces always inherit the sensitivity results from their parent workspace.
Assign workspace permission sets
The assigned permission sets determine the level of access to the workspace.
Transfer ownership of a workspace
Make another user the workspace owner. You can also assign yourself workspace permission sets.
Required permission
Global permission: View organization users. This permission is only required for the Tonic Structural application. It is not needed when you use the Structural API.
Either:
Workspace permission: Transfer workspace ownership
Global permission: Manage access to Tonic and to any workspace
To grant yourself access after the transfer:
Workspace permission: Share workspace access
Every workspace has an owner. The owner is always a user.
The user who creates the workspace is automatically the owner of the workspace.
By default, the workspace owner is assigned the built-in Manager workspace permission set. On Enterprise instances, you can choose a different workspace permission set to assign to all workspace owners.
You cannot remove that permission set from the workspace owner.
You can transfer a workspace to a different owner. The new owner is assigned the owner permission set. If the previous owner was not otherwise granted that permission set, then it is removed from the previous owner.
To transfer workspace ownership:
To transfer ownership of a single workspace, from the workspace actions menu, select Transfer Ownership.
To transfer ownership of multiple workspaces:
Check the checkbox for each workspace to transfer.
From the Actions menu, select Transfer Ownership.
On the transfer ownership panel, from the User dropdown list, select the new owner.
If you are the current owner of the workspace, then to grant yourself non-owner access after you transfer the ownership:
Toggle Receive access to workspace to the on position.
Select the workspace permission set to assign to yourself.
Click Transfer Ownership.
Tonic Structural uses sensitivity scans to identify source data columns that contain sensitive information. You can also manually mark a column as sensitive.
Structural runs sensitivity scans automatically. You can also run a manual sensitivity scan.
Structural automatically runs a sensitivity scan when you create a completely new workspace and connect a data source.
Structural also runs a new sensitivity scan when you change the data connection details for the source database.
For a file connector workspace, Structural runs a sensitivity scan when you add a file group.
A child workspace always inherits the sensitivity designations from its parent workspace.
When you copy a workspace, Structural runs a new sensitivity scan on the copy to identify sensitive columns. However, it keeps the sensitivity designation for columns that you specifically marked as sensitive or not sensitive.
In addition to the automatic scans, from Privacy Hub, you can start a sensitivity scan manually.
To identify that a column contains sensitive information, Structural looks at the data type, column name, and column values. To help identify sensitive column values, the scan uses regex matching and dictionary lookups.
This process cannot guarantee perfect precision and recall. We strongly recommend that a human reviews the sensitivity scan results and the broader dataset to ensure that nothing sensitive was missed.
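To make the regex-and-dictionary idea concrete, here is a deliberately simplified sketch. The patterns, majority threshold, and dictionary below are illustrative assumptions, not Structural's actual detection rules:

```python
import re

# Illustrative patterns only. A real scan combines many more signals:
# data type, column name, value patterns, and dictionary lookups.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
}
US_STATES = {"alabama", "alaska", "arizona"}  # dictionary lookup (truncated)

def detect_type(values):
    """Return the first type for which most sample values match."""
    for name, pattern in PATTERNS.items():
        if sum(bool(pattern.match(v)) for v in values) > len(values) / 2:
            return name
    if sum(v.lower() in US_STATES for v in values) > len(values) / 2:
        return "us_state"
    return None

detect_type(["123-45-6789", "987-65-4321"])  # "ssn"
```

As the surrounding text notes, pattern matching of this kind cannot be perfect, which is why a human review of the results is still recommended.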
Structural identifies the following types of sensitive values. These include information types that are covered by many privacy standards and frameworks, such as HIPAA, GDPR, CCPA, and PCI.
Names
First
Last
Full
Organization
Location
Street address
ZIP
PO Box
City
State name and two-letter abbreviation
Country
Postal code
Contact information
Email address
Phone number
Password
Financial information
Credit card number
International bank account number (IBAN)
SWIFT code for bank transfers
BTC (Bitcoin) address
Identification
Social Security Number
Birth dates
Gender
Network location
IP address
IPv6 address
MAC address
International Mobile Equipment Identity (IMEI)
Vehicle identification number (VIN)
ICD-9 and ICD-10 codes (used to identify diseases)
To download the log of the most recent sensitivity scan:
On the workspace management view, from the download menu, select Download Sensitivity Scan Log.
On Privacy Hub, click Download, then select Scan Log.
The log tracks the progress of the scan.
For improved performance, sensitivity scans can use parallel processing.
For relational databases such as PostgreSQL and SQL Server, to configure parallel processing, you use the environment setting TONIC_PII_SCAN_PARALLELISM_RDBMS. The default value is 4.
For document-based databases such as MongoDB, you use the environment setting TONIC_PII_SCAN_PARALLELISM_DOCUMENTDB. The default value is 1.
For information about how to configure environment settings, go to Configuring environment settings.
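Assuming a scanner that can process tables independently, a setting like this might be consumed as follows. This is a sketch, not Structural's implementation; scan_table is a placeholder for the per-table work:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Default mirrors the documented value for relational databases (4).
rdbms_workers = int(os.environ.get("TONIC_PII_SCAN_PARALLELISM_RDBMS", "4"))

def scan_table(table):
    # Placeholder for the per-table sensitivity scan.
    return table, "scanned"

tables = ["users", "orders", "payments"]
with ThreadPoolExecutor(max_workers=rdbms_workers) as pool:
    results = dict(pool.map(scan_table, tables))
```

A higher parallelism value scans more tables concurrently at the cost of more load on the source database, which is presumably why the document-database default is lower.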
For each type of detected sensitive data, Structural suggests a recommended generator. For example, for a Social Security number, Structural recommends the SSN generator. For a first name, Structural recommends the Name generator configured with First as the value type.
From Privacy Hub, you can review and apply the recommended generators to columns that the sensitivity scan detected.
For more information, go to Reviewing and applying recommended generators.
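Conceptually, the recommendation is a lookup from detected data type to a generator and its configuration. The mapping below is a hypothetical sketch covering only the two examples from the text; the keys and option names are assumptions, not Structural's API:

```python
# Hypothetical mapping from detected type to (generator, options).
RECOMMENDED = {
    "ssn": ("SSN", {}),
    "first_name": ("Name", {"value_type": "First"}),
}

def recommend(detected_type):
    """Return the (generator, options) pair for a detected sensitive
    data type, or None when there is no recommendation."""
    return RECOMMENDED.get(detected_type)

recommend("first_name")  # ("Name", {"value_type": "First"})
```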
The sensitivity scan provides an initial assessment of which column values are sensitive.
You can also indicate manually that a column is sensitive or not sensitive.
Privacy Hub, Database View, and Table View all provide options to indicate whether a column is sensitive or not sensitive.
The Structural API also provides endpoints to designate columns as sensitive or not sensitive.
Table View displays source or preview data for a single table. For a file connector workspace, each table corresponds to a file group.
Required workspace permission:
Source data: Preview source data
Destination data: Preview destination data
If you do not have either of these permissions, then you cannot display Table View.
From Table View, you can:
View information about the column data types and protection status.
Child workspaces inherit all table and column configuration from their parent workspace. For child workspaces, Table View is read-only. For more information, go to About workspace inheritance.
To display Table View:
On the workspace management view, click Table View.
On Workspaces view, from the dropdown menu in the Name column, select Table View.
You can also display Table View for a table in Database View. To display Table View, either click the arrow icon for the table, or click a row in the table.
When you display Table View from Database View, it displays the data for the selected table.
When you display Table View from the workspace management view or Workspaces view, it displays the most recently displayed table.
If Table View was never displayed before, then it displays the first table in the workspace. To change the selected table, from the Table dropdown list, select the table to view.
Required license: Enterprise
By default, a child workspace inherits the configuration from the parent workspace. You can override the table mode or column generator.
In a child workspace, each Model entry indicates whether the configuration overrides the parent configuration.
When a column overrides the parent configuration, an Overriding label displays above the column.
To filter Table View to only display columns with overrides, toggle Show Overrides Only to the on position.
On the column configuration or Model entry, to reset the configuration to match the parent workspace, click Reset.
Required workspace permission: Assign table modes
To change the table mode that is assigned to the table:
Click the current table mode.
On the table mode panel, from the Table Mode dropdown list, select the new table mode.
When you change the table mode, Tonic Structural updates the preview data as needed. For example, if you change the table mode to Truncate, then the preview data is empty.
For a child workspace, the table mode selection panel indicates whether the selected table mode is inherited from the parent workspace.
If the child workspace currently overrides the parent workspace configuration, then to reset the table mode to the table that is assigned in the parent workspace, click Reset.
The Model section of Table View displays the configured generators for the table columns.
The header for each Model entry is the column name. Linked columns and AI Synthesizer columns share an entry.
For linked columns, the heading is a comma-separated list of the linked columns.
For the AI Synthesizer, the heading is AI Synthesizer.
Each entry contains the following information:
The column and generator, in the format Column Name >> Generator Name. For example, First_Name >> Name indicates that the First_Name column has the Name generator applied.
For linked columns and the AI Synthesizer, there is a Column Name >> Generator Name entry for each column.
The selected configuration options for the generator.
By default, a child workspace inherits the configuration from its parent workspace. You can also override the configuration. For a child workspace, each Model entry indicates whether the configuration overrides the parent configuration. For configurations that override the parent, to remove the overrides and restore the inheritance, click Reset.
The Model entry also indicates when Tonic data encryption is enabled for the column.
To remove the generator from a column, click the delete icon.
For an AI Synthesizer entry, to display the model training settings, click the settings icon. For more information, go to the AI Synthesizer model training configuration documentation.
The columns section of Table View displays a sample set of data for the table.
The column heading background color indicates the column's protection status.
Red - At risk - The column is marked as sensitive, but the generator is still Passthrough.
Orange - Protected - The column has an assigned generator other than Passthrough. Protected columns might be either sensitive or not sensitive.
Gray - Not sensitive - The column is not marked as sensitive and the generator is Passthrough.
The Preview toggle at the top right of Table View allows you to choose whether to display original source data or the transformed data. You can switch back and forth to understand exactly how Structural transforms the data based on the table and column configuration.
By default, the Preview toggle is in the on position, and the displayed data reflects the selected table mode and the assigned generators. For tables that use Truncate mode, the preview data is empty. Truncated tables do not have data in the destination database.
To display the original source data, toggle Preview to the off position.
You can provide a query to filter the source data. The query is always against the source data, not the preview data, regardless of whether the Preview toggle is off or on.
For example, you configure a first name field to use the Name generator and enable consistency. You can then query the source data for a specific first name value to check that the preview data uses the same destination value for all of those records.
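The consistency check described above can be sketched as follows. Structural's generators are not a plain hash; the deterministic mapping here is only an assumption used to illustrate the property that the query lets you verify:

```python
import hashlib

FAKE_NAMES = ["Avery", "Blake", "Casey", "Drew"]

def consistent_name(value):
    """Deterministic replacement: the same source value always maps to
    the same destination value, which is what consistency guarantees."""
    digest = hashlib.sha256(value.encode()).digest()
    return FAKE_NAMES[digest[0] % len(FAKE_NAMES)]

rows = [("Alice", 1), ("Bob", 2), ("Alice", 3)]
# Equivalent of a WHERE first_name = 'Alice' filter on the source data:
filtered = [r for r in rows if r[0] == "Alice"]
outputs = {consistent_name(name) for name, _ in filtered}
len(outputs)  # 1 -- every 'Alice' row gets the same destination value
```

Querying the source for one value and confirming a single destination value, as above, is exactly the check the preview filter supports.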
To apply a query to the source data:
Click the query filter icon, located between the table name and the table mode.
On the Table Filter dialog, provide the where clause for the query.
To apply the query, click Apply.
To close the dialog, click Close.
To clear an applied query, on the Table Filter dialog, click Clear.
If no filter is applied, then the query filter icon has a white background.
If a valid filter is applied, then the query filter icon has a gray background.
If the provided where clause is not valid, then the query filter icon has a red background.
In addition to the column name, the column heading identifies primary keys and foreign keys, and indicates the type of data.
Primary key columns are indicated by a gold key icon.
Foreign key columns are indicated by a black key icon.
For other columns, the icon reflects the type of data that is in the column, such as text, numeric values, or datetime values.
To display the configuration panel for a column, click the dropdown icon.
From the configuration panel, you can:
Required workspace permission: Configure column sensitivity
On the column configuration panel, the sensitivity toggle at the top right indicates whether the column is marked as sensitive.
To mark a column as sensitive, toggle the setting to the Sensitive position.
To mark a column as not sensitive, toggle the setting to the Not Sensitive position.
In a child workspace, you cannot configure whether a column is sensitive. A child workspace always inherits the sensitivity designation from its parent workspace.
When you copy a workspace, Structural performs a new sensitivity scan on the copy. It does not copy the sensitivity designations from the original workspace.
Required workspace permission: Configure column generators
On the column configuration panel, from the Generator Type dropdown list, select the generator to assign to the column.
When you select a generator, Structural displays the available configuration options for that generator. For details about the configuration options for each generator, go to the Generator reference.
To remove the selected generator or generator preset, and reset the generator to Passthrough, click the delete icon next to the Generator Type dropdown.
For more information about selecting and configuring generators and generator presets, go to Assigning and configuring generators.
Database View provides a complete view of your source database structure and configuration.
It consists of:
On the left, the list of tables in the source database.
On the right, the list of columns in those tables.
To display Database View, either:
On the workspace management view, in the workspace navigation bar, click Database View.
On Workspaces view, from the dropdown menu in the Name column, select Database View.
From Database View, you can assign table modes to tables, assign generators to columns, and determine column sensitivity.
The table list is grouped by schema. You can expand and collapse the list of tables in each schema. This does not affect the displayed columns.
For a file connector workspace, each table corresponds to a file group.
For each table, the table list includes the following information:
The name of the table.
The number of columns that have an assigned generator (a generator other than Passthrough). The number does not display if none of the table columns has an assigned generator.
The assigned table mode. The table list only shows the first letter of the table mode:
D = De-identify
S = Scale
T = Truncate
P = Preserve Destination
I = Incremental
For a child workspace, if the selected table mode overrides the parent workspace configuration, then the override icon displays.
To display Table View for a table, click the arrow icon to the right of the table entry.
You can filter the table list by name and by the assigned table mode. You can also filter the tables based on whether any of the columns have assigned generators.
As you filter the table list, the column list is also filtered to only include the columns for the filtered tables.
To filter the table list by name, in the filter field, begin to type text that is in the table name.
As you type, Tonic Structural filters the list to only display tables with names that contain the filter text.
To filter the table list based on the assigned table mode:
Click Filters.
On the filter panel, check the checkbox next to each table mode to include. By default, the list includes all of the table modes. As you check and uncheck the table mode checkboxes, Structural adds and removes the associated tables from the list.
You can filter the table list to only display tables that have no assigned generators:
Click Filters.
On the filter panel, to only show tables that do not have assigned generators, check the No Generators Applied checkbox.
Required workspace permission: Assign table modes
The table mode determines the number of rows and columns in the destination database. For details about the available table modes and how they work, go to Table modes.
To change the assigned table mode for a single table:
Click the table mode dropdown next to the table name.
From the table mode dropdown list, select the table mode.
For a child workspace, the table mode selection panel indicates whether the selected table mode is inherited from the parent workspace. If the child workspace currently overrides the parent workspace configuration, then to reset the table mode to the table mode that is assigned in the parent workspace, click Reset.
To change the assigned table mode for multiple tables:
Check the checkbox for each table to change the table mode for. To select a continuous range of tables, click the first table in the range, then Shift-click the last table in the range. To select all of the tables in a schema, click the schema name.
Click Bulk Edit.
On the panel, click the radio button for the table mode to assign to the selected tables.
The column list contains the following columns:
The Column column contains:
The name of the column, in the format table_name.column_name. When you click the column name, Table View for the column's table displays.
The name of the schema that contains the table.
The data type for the column.
Access to view example source and destination values, based on the assigned generator. For more information, go to #database-view-columns-sample-data.
When a generator other than Passthrough is assigned, or when Passthrough is assigned to a non-sensitive column, the generator name tag displays the name of the assigned generator. To display the column configuration panel, click the generator name tag.
For sensitive columns that are assigned Passthrough:
If the Structural sensitivity scan marked the column as sensitive, then the generator name tag displays the type of sensitive information that Structural detected, such as a first name or a street address. The first time you click the generator name tag, you choose whether to assign or ignore the recommended generator. For more information, go to #database-view-single-column-recommended-generator.
If the column was marked manually as sensitive, then the generator name tag displays At-Risk. To display the column configuration panel, click the generator name tag.
The generator name tag is color-coded to indicate the sensitivity and protection status.
Protected columns use blue.
Unprotected sensitive columns use red.
Unprotected non-sensitive columns use gray.
If the table is assigned Truncate or Preserve Destination mode, then the generator name tag is hidden, unless you assigned a generator before you set the table mode.
Foreign key columns do not display a generator name tag.
For primary key fields, the Applied Generator column contains a primary key tag.
For foreign keys, the foreign key tag replaces the generator name tag. Foreign key columns automatically inherit the value from the associated primary key column.
If the table is assigned a table mode other than De-Identify, then the Applied Generator column displays a table mode name tag.
In a child workspace, when the assigned generator or generator configuration overrides the parent workspace, then an Override tag displays in the column.
If the table mode overrides the parent workspace, then the table mode tag displays the override icon. When the child workspace overrides the table mode, the Applied Generator column always displays the table mode, including the De-Identify table mode.
The Applied Generator column also provides access to view and add comments. For more information, go to #database-view-columns-commenting.
To filter the column list, you can:
Use the table list to filter the displayed columns based on the table that the columns belong to.
Use the filter field to filter the columns by table or column name.
Use the Filters panel to filter the columns based on column attributes and generator configuration.
You can use column filters to quickly find columns that you want to verify or update the configuration for.
To filter the column list to only include columns for specific tables:
Check the checkbox for each table to display columns for.
To filter the column list by table or column name, in the filter field, begin to type text that is in the table or column name.
As you type, Structural filters the column list.
The Filters panel provides access to column filters other than the table and column name.
To display the Filters panel, click Filters.
To search for a filter or a filter value, in the search field, start to type the value. The search looks for text in the individual settings.
For each filter, the Filters panel indicates the number of matching columns, based on the selected tables and the current filters.
To add a filter, depending on the filter type, either check the checkbox or select a filter option. As you add filters, Structural applies them to the column list. Above the list, Structural displays tags for the selected filters.
To clear all of the currently selected filters, click Clear All.
To only display detected sensitive columns for which there is a recommended generator, on the Filters panel, check Has Generator Recommendation.
An at-risk column:
Is marked as sensitive
Is included in the destination data.
Is assigned the Passthrough generator.
To only display at-risk columns, on the Filters panel, check At-Risk Column.
When you check At-Risk Column, Structural adds the following filters under Privacy Settings:
Sets the sensitivity filter to Sensitive
Sets the protection status filter to Not protected
Sets the column inclusion filter to Included
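In other words, the at-risk condition is the conjunction of those three filters. The following is a minimal sketch of that predicate; the Column fields are hypothetical, not Structural's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Column:
    sensitive: bool   # marked sensitive, by the scan or manually
    included: bool    # populated in the destination data
    generator: str    # name of the assigned generator

def is_at_risk(col: Column) -> bool:
    # A column is at risk when it is sensitive, is included in the
    # destination data, and is still assigned the Passthrough generator.
    return col.sensitive and col.included and col.generator == "Passthrough"
```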
You can filter the columns based on the column sensitivity.
On the Filters panel, under Privacy Settings, the sensitivity filter is by default set to All, which indicates to display both sensitive and non-sensitive columns.
To only display sensitive columns, click Sensitive.
To only display non-sensitive columns, click Not sensitive.
Note that when you check At-Risk Column, Structural automatically selects Sensitive.
You can filter the columns based on whether they have any generator other than Passthrough assigned. To filter the columns based on specific assigned generators, use the Applied Generator filter.
On the Filters panel, under Privacy Settings, the column protection filter is by default set to All, which indicates to display both protected and not protected columns.
To only display columns that have an assigned generator, click Protected.
To only display columns that do not have an assigned generator, click Not protected.
Note that when you check At-Risk Column, Structural automatically selects Not protected.
You can filter the columns based on whether they are populated in the destination database. For example, if a table is truncated, then the columns in that table are not populated.
On the Filters panel, under Privacy Settings, the column inclusion filter is by default set to All, which indicates to display both included and not included columns.
To only display columns that are populated in the destination database, click Included.
To only display columns that are not populated in the destination database, click Not included.
Note that when you check At-Risk Column, Structural automatically selects Included.
To only display columns that are assigned specific generators, on the Filters panel, under Applied Generator, check the checkbox for each generator to include.
The list of generators only includes generators that are assigned to the currently displayed columns and that are compatible with other applied filters.
To search for a specific generator, in the Filters search field, begin to type the generator name.
You can filter the columns by the column data type. For example, you can display only varchar columns, or only columns that contain either numeric or integer values.
To only display columns that have specific data types, on the Filters panel, under Database Data Types, check the checkbox for each data type to include.
The list of data types only includes data types that are present in the currently displayed columns and that are compatible with other applied filters.
To search for a specific data type, in the Filters search field, begin to type the data type.
When the source database schema changes, you might need to update the configuration to reflect those changes. If you do not resolve the schema changes, then the data generation might fail. The data generation fails if there are unresolved conflicting changes, or if you configure Structural to always fail data generation when there are any unresolved changes.
For more information about schema changes, go to Viewing and resolving schema changes.
To only display columns that have unresolved schema changes, on the Filters panel, check Unresolved Schema Changes.
For detected sensitive columns, the sensitivity type indicates the type of data that was detected. Examples of sensitivity types include First Name, Address, and Email.
To only display columns that contain specific sensitivity types, on the Filters panel, under Sensitivity Type, check the checkbox for each sensitivity type to include.
The list of sensitivity types only includes sensitivity types that are present in the currently displayed columns.
To search for a specific sensitivity type, in the Filters search field, type the sensitivity type.
You can filter the column list based on whether the column is nullable.
On the Filters panel, under Data Attributes, the nullability filter is by default set to All, which indicates to display both nullable and non-nullable columns.
To only display columns that are nullable, click Nullable.
To only display columns that are not nullable, click Non-nullable.
You can filter the column list based on whether the column must be unique.
On the Filters panel, under Data Attributes, the uniqueness filter is by default set to All, which indicates to display both unique and not unique columns.
To only display columns that must be unique, click Unique.
To only display columns that do not require uniqueness, click Not unique.
You can filter the column list to indicate whether to include:
Columns that are not primary or foreign keys.
Columns that are foreign keys.
Columns that are primary keys.
On the Filters panel, under Column Type:
To display columns that are neither a primary key nor a foreign key, check Non-keyed.
To display columns that are primary keys, check Primary key.
To display columns that are foreign keys, check Foreign key.
In a child workspace, to only display columns that override the generator configuration that is in the parent workspace, on the Filters panel, check Overrides Inheritance.
You can enable Structural data encryption, a configuration that allows Structural to:
Decrypt source data before applying the generator
Encrypt generated data before writing it to the destination database
For more information, go to Configuring and using Tonic Structural data encryption.
When Structural data encryption is enabled, the generator configuration panel includes an option to use Structural data encryption.
To only display columns that are configured to use Structural data encryption, on the Filters panel, check Uses Data Encryption.
By default, the column list is sorted first by table name, then by column name. The columns for each table display together. Within each table, the columns are in alphabetical order.
You can also sort the column list by column name first, then by table. Columns that have the same name display together. Those columns are sorted by the name of the table.
The button at the right of the Column column heading indicates the current sort order.
T.C indicates that the table is sorted by table, then by column
C.T indicates that the table is sorted by column, then by table
To switch the sort order, click the button.
From the column list, to display the column configuration panel, click the generator name tag.
For a column that Structural detected as sensitive and that does not have an assigned generator (is assigned Passthrough), the generator name tag displays the type of sensitive data.
The first time that you click the generator name tag, Structural displays a panel that contains the following information:
The type of sensitive data that was detected
The recommended generator
Sample source and destination values based on the recommended generator
From the panel, you choose whether to assign or ignore the recommended generator for that type.
To assign the recommended generator, click Apply recommendation. Structural displays the column configuration panel with the recommended generator selected. You can then adjust the configuration or select a different generator.
To ignore the recommendation, click Ignore. Structural displays the column configuration panel to allow you to select the generator to assign to the column.
Required workspace permission: Configure column sensitivity
The Structural sensitivity scan provides an initial indication of whether a column is sensitive and, if it is sensitive, the type of sensitive data that is in the column. For more information, go to Identifying sensitive data.
From the column configuration panel, you can change whether a column is sensitive.
In a child workspace, you cannot configure whether a column is sensitive. A child workspace always inherits the sensitivity designation from its parent workspace.
On the column configuration panel, the sensitivity information is at the top right.
To indicate that a column is sensitive, toggle the Sensitivity setting to the on position.
To indicate that the column is not sensitive, toggle the Sensitivity setting to the off position.
Required workspace permission: Configure column generators
To change the generator that is assigned to a selected column:
Click the generator name tag for the column.
On the column configuration panel, from the Generator Type dropdown list, select the generator.
Configure the generator options.
To reset an assigned generator to Passthrough:
Click the generator name tag.
On the column configuration panel, click the delete icon next to the generator dropdown.
For details about the configuration options for each generator, go to the Generator reference.
For more information about selecting and configuring generators and generator presets, go to Assigning and configuring generators.
The bulk edit option allows you to configure multiple columns at the same time. From the bulk editing panel, you can:
Mark the selected columns as sensitive or not sensitive.
Assign a generator to the selected columns.
Apply the recommended generator to the selected columns.
Reset the generator configuration to the baseline. Requires that all of the selected columns are assigned the same preset.
To select the columns and display the bulk edit option:
Check the checkbox next to each column to update.
Click Bulk Edit.
Required workspace permission: Configure column sensitivity
On the Bulk Edit panel, under Edit Sensitivity:
To mark the selected columns as sensitive, click Sensitive.
To mark the selected columns as not sensitive, click Not Sensitive.
Required workspace permission: Configure column generators
On the Bulk Edit panel, under Bulk Edit Applied Generator, select and configure the generator to assign to the selected columns.
Required workspace permission: Configure column generators
If any of the selected columns are unprotected sensitive columns that have a recommended generator, then on the Bulk Edit panel, to assign the recommended generators, click Apply Recommendations.
Required workspace permission: Configure column generators
For a generator preset, the baseline configuration is the configuration that is saved for that preset. The baseline configuration determines the default configuration that is used when you assign the preset to a column. After you select the preset, you can override the baseline configuration.
If all of the selected columns are assigned the same preset, then to restore the baseline configuration for all of the columns, click Reset to Baseline.
Required license: Professional or Enterprise
You can add comments to columns. For example, you might use a comment to explain why you selected a particular generator or marked a column as sensitive or not sensitive.
If a column does not have any comments, then to add a comment:
In the Applied Generator column, click the comment icon.
In the comment field, type the comment text.
Click Comment.
When a column has existing comments, the comment icon is green. To add comments:
Click the comment icon. The comments panel shows the previous comments. Each comment includes the comment user.
In the comment field, type the comment text.
Click Reply.
Required workspace permission:
Source data: Preview source data
Destination data: Preview destination data
For each column, you can display a sample list of the column values.
For columns that have an assigned generator, the sample shows both the current values and the possible values after the generator is applied.
To display the sample values, in the Column column, click the magnifying glass icon.
If the generator is Passthrough, then the sample data panel contains only Original Data.
If a different generator is assigned, then the sample data panel contains both Original Data and Protected Output.
Generators transform the data in a source database column. You assign the generators to use. Tonic Structural offers a variety of generators to transform different types of data.
For Enterprise instances, generator presets allow you to configure custom configurations of generators that you can then assign to columns.
Each table is assigned a table mode. The table mode determines at a high level how the table is populated in the destination database.
Required workspace permission: Assign table modes
Both Database View and Table View allow you to view and update the selected table mode for a table.
This is the default table mode for new tables.
In this mode, Tonic Structural copies over all of the rows to the destination database.
For columns that have the generator set to Passthrough, Structural copies the original source data to the destination database.
For columns that are assigned a generator other than Passthrough, Structural uses the generator to replace the column data in the destination database.
This mode drops all data for the table in the destination database.
For data connectors other than Spark-based data connectors, the table schema and any constraints associated with the table are included in the destination database.
Any existing data in the destination database is removed. For example, if you change the table mode to Truncate after an initial data generation, the next data generation clears the table data. For Spark-based data connectors, the table is removed.
If you assign Truncate mode to a table that has a foreign key constraint, then data generation fails. If this is a requirement, contact support@tonic.ai for assistance.
This mode preserves the data in the destination database for this table. It does not add or update any records.
This feature is primarily used for very large tables that don't need to be de-identified during subsequent runs after the data exists in the destination database.
When you assign Preserve Destination mode to a table, Structural locks the generator configuration for the table columns.
The destination database must have the same schema as the source database.
You cannot use Preserve Destination mode when you:
Enable upsert for a workspace.
Write destination data to a container artifact.
Write destination data to an Ephemeral snapshot.
Incremental mode only processes the changes that occurred to the source table since the most recent data generation or other update to the destination table. This can greatly reduce the generation time for large tables that don't have a lot of changes.
For Incremental mode to work, the following conditions must be satisfied:
The table must exist in the destination database. Either Structural created the table during data generation, or the table was created and populated in some other way.
A reliable updated date column must be present. When you select Incremental mode for a table, Structural prompts you to select the updated date column to use.
The table must have a primary key.
To maximize performance, we recommend that you have an index on the updated date column.
For tables that use Incremental mode, Structural checks the source database for records that have an updated date that is greater than the maximum date in that column in the destination database.
When identifying records to update, Structural only checks the updated date. It does not check for other updates. Records where the generator configuration is changed are not updated if they do not meet the updated date requirement.
For the identified records, Structural checks for primary key matches between the source and destination databases, then does one of the following:
If the primary key value exists in the destination database, then Structural overwrites the record in the destination database.
If the primary key value does not exist in the destination database, then Structural adds a new record to the destination database.
This mode currently only updates and adds records. Rows that are deleted from the source database remain in the destination database.
To ensure accurate incremental processing of records, we recommend that you do not directly modify the destination database. A direct modification might cause the maximum updated date in the destination database to be after the date of the last data generation. This could prevent records from being identified for incremental processing.
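The incremental logic described above can be sketched in Python. This is a conceptual illustration only; the row structures and the transform function are hypothetical, not Structural's implementation:

```python
def incremental_generate(source_rows, dest_rows, transform):
    """Conceptual sketch of Incremental table mode.

    source_rows / dest_rows: dicts that map primary key -> row dict,
    where each row dict has an "updated" date value.
    transform: applies the configured generators to a source row.
    """
    # Records qualify only when their updated date is greater than the
    # maximum updated date already present in the destination table.
    max_dest_updated = max(
        (row["updated"] for row in dest_rows.values()), default=None
    )
    for pk, row in source_rows.items():
        if max_dest_updated is not None and row["updated"] <= max_dest_updated:
            continue  # not identified for incremental processing
        # Primary key match -> overwrite; no match -> add a new record.
        dest_rows[pk] = transform(row)
    # Note: rows that were deleted from the source are never removed here,
    # which mirrors the documented behavior.
    return dest_rows
```

For example, with a destination whose maximum updated date is 2, only source rows updated after 2 are rewritten.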
You cannot use Incremental mode when you:
Enable upsert for a workspace.
Write destination data to a container artifact.
Write destination data to an Ephemeral snapshot.
In this mode, Structural generates an arbitrary number of new rows, as specified by the user, using the generators that are assigned to the table columns.
You can use linking and partitioning to create complex relationships between columns.
Structural generates primary and foreign keys that reflect the distribution (1:1 or 1:many) between the tables in the source database.
You cannot use Scale mode when you enable upsert for a workspace.
For the Databricks data connector, the table mode configuration includes an Error on Overwrite setting. The setting indicates whether to return an error when Structural attempts to write data to a destination table that already contains data. The option is not available when you write destination data to Databricks Delta tables.
To return the error, toggle the setting to the on position.
To not return the error, toggle the setting to the off position.
For workspaces that use certain data connectors, the table mode configuration for De-Identify mode includes an option to apply a filter to the table.
On the table mode configuration panel, you can use the Repartition or Coalesce option to indicate a number of partitions to generate.
By default, the destination database uses the same partitioning as the source database. The partition option is set to Neither.
The Repartition option allows you to provide a specific number of partitions to generate.
To use the Repartition option:
Click Repartition.
In the field, enter the number of partitions.
The Coalesce option allows you to provide a maximum number of partitions to generate. If the source data has fewer partitions than the number that you specify, then Structural keeps the smaller source partition count. The Coalesce option is typically more efficient than the Repartition option.
To use the Coalesce option:
Click Coalesce.
In the field, enter the number of partitions.
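The difference between the two options can be summarized in a small sketch. The semantics are assumed here, modeled on Spark-style repartition and coalesce:

```python
def repartition(source_partition_count: int, n: int) -> int:
    # Repartition produces exactly the requested number of partitions,
    # regardless of how many partitions the source data has.
    return n

def coalesce(source_partition_count: int, n: int) -> int:
    # Coalesce treats n as a maximum: if the source already has fewer
    # partitions, the smaller source count is kept. Because it avoids a
    # full shuffle, coalesce is generally cheaper than repartition.
    return min(source_partition_count, n)
```

For example, coalescing 100 source partitions down to 10 yields 10 partitions, but coalescing 4 source partitions with a maximum of 10 leaves 4.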
For Spark-based data connectors, the table is ignored completely.
For a file connector workspace, file groups are treated as tables. When a file group is assigned Truncate mode, the data generation process ignores the files that are in that file group.
When upsert is enabled, the Truncate table mode does not actually truncate the destination table. Instead, it works more like Preserve Destination table mode, which preserves existing records in the destination table.
Incremental mode is currently supported on PostgreSQL, MySQL, and SQL Server. If you want to use this table mode with another database type, contact support@tonic.ai.
Table filters provide a way to generate a smaller set of data when a data connector does not support subsetting.
When you consider which generator to use, it helps to be familiar with these generator characteristics.
The following table summarizes the available generators. It indicates whether each generator can be made consistent, can be linked, and is differentially private.
In the Consistency column, the table also indicates whether the generator can be made self-consistent only, or can be made either self-consistent or consistent with another column.
The Description column includes:
For generators that can be data-free, whether the generator is always data-free, or only data-free when consistency is disabled.
The possible privacy rankings for the generator. For details about the available privacy rankings, go to #privacy-report-privacy-ranking-about.
Consistency is an option for some generators. When consistency is turned on, the generator maps the same input to the same output across an entire database.
Consistency can also be maintained across multiple databases of varying types. For example, if consistency is turned on for a name generator, it always maps the same input name (for example, Albert Einstein) to the same output (for example, Richard Feynman).
You can also view this video overview of consistency.
The primary reasons for using consistency are to:
Enable joining on columns that don't have explicit database constraints in the schema. This is often seen with values such as email addresses. With consistency, you can completely anonymize an email address and still use it in a join.
Preserve the approximate cardinality of a column. For example, a city column contains 50 different cities. To randomize this column but still have ~50 cities, you can use consistency to maintain the approximate cardinality. Because consistency does not guarantee uniqueness, the cardinality might change. However, it is guaranteed to not increase. If unique 1-to-1 mappings are required, a Key generator should be used.
Match duplicated data across 1 or more databases. For example, you have a user database that contains a username in both a column and a JSON blob, and another database that contains their website activity, identified by the same username values. To anonymize the username, but still have the username be the same in all locations/databases, use consistency.
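To illustrate these points, a consistent mapping can be pictured as a deterministic function of the input value. This is an assumed sketch, not Structural's documented algorithm; the replacement pool and seed are hypothetical:

```python
import hashlib

REPLACEMENTS = ["Michael", "Rosella", "Walton", "Linn", "Gregory"]

def consistent_name(value: str, seed: str = "workspace-seed") -> str:
    # The same (seed, value) pair always hashes to the same index, so a
    # given input name maps to the same output name everywhere in the
    # database. That determinism is what preserves joins on the value.
    digest = hashlib.sha256((seed + ":" + value).encode()).hexdigest()
    return REPLACEMENTS[int(digest, 16) % len(REPLACEMENTS)]
```

Because every column that applies this mapping produces identical output for identical input, a join on an anonymized email or username still succeeds, and the number of distinct outputs can never exceed the number of distinct inputs.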
Self-consistency indicates that the value in the destination database is consistent with the value of the same column in the source database.
For example, a column contains a first name. You make the assigned generator self-consistent. A given first name in the source database is always replaced by the same first name in the destination database. For example, the first name value John is always replaced by the value Michael.
Consistency with another column indicates that the value in the destination database is consistent with the value of a different column in the source database.
For example, a column contains an IP address. You make the assigned generator consistent with the username column. Every row that has the username User1 in the source database has the same IP address in the destination database.
When you select a generator as the sub-generator for a composite generator, in most cases you cannot configure the generator to be consistent with another column. Only the Conditional generator and the Regex Mask generator allow a sub-generator to be consistent with another column.
Note that consistency with another column cannot be configured in a generator preset. You can only configure it when you configure an individual column.
To enable consistency, on the generator configuration panel, toggle the Consistency switch.
Not all generators support consistency.
Consistency is a function of both the data type and the value.
For example, a numeric field contains the value 123, and a string/varchar field contains the value "123". Even if both fields have consistent generators applied, the output is not consistent between the two fields, because the data types differ.
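As an illustrative sketch (not Structural's actual algorithm), a consistency mapping that keys on both the data type and the value treats the integer 123 and the string "123" as distinct inputs:

```python
import hashlib

def consistent_token(value, seed: str = "seed") -> str:
    # The mapping key includes the Python type name, so the integer 123
    # and the string "123" are distinct inputs and produce independent
    # outputs, even though they print the same way.
    key = f"{type(value).__name__}:{value!r}:{seed}"
    return hashlib.sha256(key.encode()).hexdigest()[:8]
```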
To demonstrate the effect of consistency on the output, we'll use a column that contains a first name, and that uses the Name generator.
Here is the sample input and output when consistency is not enabled:
In this sample data, the first name Melissa appears twice, but is mapped to Walton the first time and Linn the second time.
Here is the sample input and output when consistency is enabled:
In this case, the first name Melissa is mapped to Rosella both times.
A consistent generator ensures that the same input value always produces the same output value.
It does not guarantee that two different input values produce two different output values.
Consistent generators are not 1:1 mappings.
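A toy sketch of this property, using an assumed hash-based mapping with a deliberately tiny output pool:

```python
import hashlib

POOL = ["Ann", "Bob"]  # deliberately tiny pool, to force collisions

def consistent_pick(value: str) -> str:
    # Deterministic: the same input always produces the same output.
    h = int(hashlib.sha256(value.encode()).hexdigest(), 16)
    return POOL[h % len(POOL)]

# Two different inputs can land on the same output (a collision), so the
# mapping is consistent but not 1:1 -- cardinality can decrease, but it
# can never increase.
```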
Consistency reduces the privacy of your data, because it reveals something about the frequency of the data values.
However, Tonic Structural does not store mappings of the source data to the destination data. In other words, someone can see that in the destination data the name Susan appears 20 times and the name John appears 3 times. But they cannot determine that Susan is mapped from Jane and John is mapped from Michael.
Any column, regardless of which table it resides in, is consistent with any other column that uses the same consistent generator.
For example, your database includes a Customers table and an Employees table. Each table contains a column for the first name of the customer or employee. You assign the Name generator to both columns to generate a first name, and make the generators consistent. The same first name value in either column is mapped to the same destination value. For example, the first name John is always mapped to Michael, whether the name John appears in the Customers table or the Employees table.
However, by default, consistency is not guaranteed between data generation runs, even if the run is on the same database.
By default, consistency is only guaranteed across a single data generation for a single workspace.
For example, for a column that contains a first name value, you assign the Name generator and configure the generator to be consistent. The first time you run data generation, all instances of the name John might be replaced with Michael. The next time you run data generation, all instances of the name John might instead be replaced with Gregory.
You can enable consistency across runs and workspaces so that, for example, every time you run a data generation, John is always replaced with Michael.
To do this, you configure a seed value. You can either:
Configure the Structural environment setting TONIC_STATISTICS_SEED. This ensures consistency across all workspaces and data generation runs.
Configure a seed value for a workspace. This ensures consistency across all data generation runs for that workspace, as well as across other workspaces that have the same seed value.
Disable cross-data generation consistency for a workspace. The workspace then has no consistency across data generation runs or with other workspaces.
To ensure consistency across all data generations and workspaces, add the following environment setting to the Structural worker and web server containers:
TONIC_STATISTICS_SEED: <ANY 32-BIT SIGNED INTEGER>
When you configure a value for this environment setting, consistency applies across all data generations for all workspaces, except for workspaces that either:
Have a workspace seed value configured.
Have disabled consistency across data generations.
For an individual workspace, you can override the Structural seed value. When you override the Structural seed value, you can either:
Disable consistency across data generation runs for the workspace.
Provide a seed value for the workspace.
When a workspace has a configured seed value, consistency applies across all of the data generation runs for that workspace.
Consistency also applies across the data generations for all of the workspaces that share the same seed value.
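The precedence between the environment seed and a workspace override can be summarized as a small sketch. This is an illustration of the rules described above, not Structural code; the function name and the "disabled" sentinel are hypothetical:

```python
from typing import Optional, Union

def effective_seed(env_seed: Optional[int],
                   workspace_override: Union[None, str, int]) -> Optional[int]:
    # Resolve the seed used for a data generation run.
    # workspace_override is None (no override), "disabled", or an integer seed.
    # Returning None means no cross-run consistency.
    if workspace_override == "disabled":
        return None
    if isinstance(workspace_override, int):
        return workspace_override  # workspace seed wins over the env seed
    return env_seed                # fall back to TONIC_STATISTICS_SEED
```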
On the workspace details view, to override the Structural seed value:
Toggle Override Statistics Seed to the on position.
To disable consistency across data generations, click Don't use consistency.
To provide a seed value for the workspace:
Click Consistency value.
In the field, enter the seed value. It must be a 32-bit signed integer. The value defaults to the current value of TONIC_STATISTICS_SEED.
The following generators can be made consistent to themselves. This means that the same input value in the column always produces the same output value.
The following generators can be made consistent either to themselves or to other columns.
When a column is consistent to another column, the output value is based on the other column.
For example, a column contains a company name. You assign the Company Name generator, and make it consistent with the username column. Every row that has the username User1 in the input database has the same company name in the destination database.
Company Name (Deprecated)
Here are the details for the supported generators in Tonic Structural.
The table for each generator includes:
The generator ID to use in the Tonic API
Generates a random address-like string.
You can indicate which part of an address string that the column contains. For example, the column might contain only the street address or the city, or it might contain the full address.
To configure the generator:
From the Link To dropdown list, select the columns to link this column to. You can link columns that use the Address generator to mask one of the following address components:
City
City State
Country
Country Code
State
State Abbreviation
Zip Code
Latitude
Longitude
Note that when linked to another address column, a country or country code is always the United States.
From the address component dropdown list, select the address component that this column contains. The available options are:
Building Number
Cardinal Direction (North, South, East, West)
City
City Prefix (Examples: North, South, East, West, Port, New)
City Suffix (Examples: land, ville, furt, town)
City with State (Example: Spokane, Washington)
City with State Abbr (Example: Houston, TX)
Country (Examples: Spain, Canada)
Country Code (Uses the 2-character country code. Examples: ES, CA)
County
Direction (Examples: North, Northeast, Southwest, East)
Full Address
Latitude (Examples: 33.51, 41.32)
Longitude (Examples: -84.05, -74.21)
Ordinal Direction (Examples: Northeast, Southwest)
Secondary Address (Examples: Apt 123, Suite 530)
State (Examples: Alabama, Wisconsin)
State Abbr (Examples: AL, WI)
Street Address (Example: 123 Main Street)
Street Name (Examples: Broad, Elm)
Street Suffix (Examples: Way, Hill, Drive)
US Address
US Address with Country
Zip Code (Example: 12345)
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If consistency is enabled, then by default, the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the Address generator is consistent with itself, the same value in the source database is always mapped to the same destination value. For example, for a column that contains a state name, Alabama is always mapped to Illinois.
When the Address generator is consistent with another column, the same value in the other column always results in the same destination value for the address column. For example, if the address column is consistent with a name column, then every instance of John Smith in the name column in the source database has the same address value in the destination database.
For the Address generator, Spark workspaces (Amazon EMR, Databricks, and self-managed Spark clusters) only support the following address parts:
Building Number
City
Country
Country Code
Full Address
Latitude
Longitude
State
State Abbr
Street Address
Street Name
Street Suffix
US Address
US Address with Country
Zip Code
Within a table, the AI synthesizer uses the columns that are assigned the AI Synthesizer generator to train a model and generate the synthetic data.
It uses deep neural networks for high-fidelity data mimicking.
The privacy ranking is 3.
The Algebraic generator identifies the algebraic relationship between three or more numeric values and generates new values to match. At least one of the values must be a non-integer.
This generator can be linked with other Algebraic generators.
To configure the generator, from the Link To dropdown list, select the columns to link this column to. You can select other columns that are assigned the Algebraic generator.
You must select at least three columns.
The column values must be numeric. At least one of the columns must contain a value other than an integer.
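As a conceptual sketch (not Structural's implementation), once the relationship among the linked columns is known, fresh values can be drawn for some columns and the remaining column recomputed so that the relationship still holds. Here the relationship c = a + b is assumed for illustration:

```python
import random

def regenerate(rows, rel=lambda a, b: a + b, seed=None):
    # rows: (a, b, c) triples where c = rel(a, b) holds in the source data.
    # Draw fresh a and b values, then recompute c so that the algebraic
    # relationship still holds in the output.
    rng = random.Random(seed)
    out = []
    for _ in rows:
        a = round(rng.uniform(0.0, 100.0), 2)
        b = round(rng.uniform(0.0, 100.0), 2)
        out.append((a, b, rel(a, b)))
    return out
```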
Generates unique alphanumeric strings of the same length as the input. For example, for the origin value ABC123, the output value is a six-character alphanumeric string such as D24N05.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
This generator replaces letters with random other letters, and numbers with random other numbers. Punctuation and whitespace are preserved.
For example, for the following array value:
["ABC.123", 3, "last week"]
The output might be something like:
["KFR.860", 7, "sdrw mwoc"]
This generator securely masks letters and numbers. There is no way to recover the original data.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
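The per-character replacement described above can be sketched in Python. This is an illustration of the behavior, not Structural code; the handling of numeric array elements here is a hypothetical simplification:

```python
import random
import string

def scramble_value(value, rng):
    # Letters become random letters (case preserved), digits become
    # random digits; punctuation and whitespace are preserved.
    if isinstance(value, str):
        return "".join(
            rng.choice(string.ascii_lowercase) if c.islower()
            else rng.choice(string.ascii_uppercase) if c.isupper()
            else rng.choice(string.digits) if c.isdigit()
            else c
            for c in value)
    if isinstance(value, int):
        return rng.randint(0, 9)  # hypothetical handling of numbers
    return value

def scramble_array(values, seed=None):
    rng = random.Random(seed)
    return [scramble_value(v, rng) for v in values]
```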
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell JSON field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the JSONPath expression to identify the value to apply the generator to. To populate a path expression, you can also click a value in the Cell JSON field. Matched JSON Values shows the result from the value in Cell JSON.
By default, the selected generator is applied to any value that matches the expression. To limit the types of values to apply the generator to, from the Type Filter, specify the applicable types. You can select Any, or you can select any combination of String, Number, and Null.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
Uses regular expressions to parse strings and replace specified substrings with the output of specified generators. The parts of the string to replace are specified inside unnamed top-level capture groups.
To configure the generator:
To add a regular expression:
Click Add Regex. On the configuration panel, Cell Value shows a sample value from the source database. You can use the previous and next options to navigate through the values.
By default, Replace all matches is enabled. To only match the first occurrence of a pattern, toggle Replace all matches to the off position.
In the Pattern field, enter a regular expression. If the expression is valid, then Structural displays the capture groups for the expression.
For each capture group, to select and configure the generator to apply, click the selected generator. You cannot select another composite generator.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Regexes list:
To edit a regex, click the edit icon.
To remove a regex, click the delete icon.
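The mechanics of replacing only the text inside unnamed capture groups, while leaving the rest of the match intact, can be sketched as follows. This is an illustration, not Structural's implementation; the callables in group_generators stand in for the configured sub-generators:

```python
import re

def regex_mask(value, pattern, group_generators, replace_all=True):
    # group_generators: one callable per unnamed capture group; each
    # receives the matched group text and returns its replacement.
    regex = re.compile(pattern)

    def repl(m):
        s, offset = m.group(0), m.start()
        pieces, cursor = [], 0
        for i, gen in enumerate(group_generators, start=1):
            gstart, gend = m.span(i)
            pieces.append(s[cursor:gstart - offset])  # text outside groups
            pieces.append(gen(m.group(i)))            # masked group text
            cursor = gend - offset
        pieces.append(s[cursor:])
        return "".join(pieces)

    return regex.sub(repl, value, count=0 if replace_all else 1)
```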
Generates unique alphanumeric strings based on any printable ASCII characters. The length of the source string is not preserved. You can choose to exclude lowercase letters from the generated values.
To configure the generator:
To exclude lowercase letters from the generated values, toggle Exclude Lowercase Alphabet to the on position.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, the generator is not consistent.
Generates a random company name-like string.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If consistency is enabled, then by default it is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the generator is consistent with itself, then a given source value is always mapped to the same destination value. For example, My Business is always mapped to New Business.
When the generator is consistent with another column, then a given source value in that other column always results in the same destination value for the company name column. For example, if the company name column is consistent with a name column, then every instance of John Smith in the name column in the source database has the same company name in the destination database.
The Categorical generator shuffles the existing values within a field while maintaining the overall frequency of the values. It disassociates the values from other pieces of data. Note that NULL is considered a separate value.
For example, a column contains the values Small, Medium, and Large. Small appears 3 times, Medium appears 4 times, and Large appears 5 times. In the output data, each value still appears the same number of times, but the values are shuffled to different rows.
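The frequency-preserving shuffle can be sketched in a few lines of Python (an illustration, not Structural's implementation):

```python
import random
from collections import Counter

def categorical_shuffle(values, seed=None):
    # Shuffle the column's values across rows: each distinct value keeps
    # its overall frequency, but is no longer associated with its
    # original row.
    rng = random.Random(seed)
    out = list(values)
    rng.shuffle(out)
    return out
```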
To configure the generator:
From the Link To dropdown, select the columns to link to the current column. You can select from other columns that use the Categorical generator.
Toggle the Differential Privacy setting to indicate whether to make the output data differentially private. By default, differential privacy is disabled.
This generator replaces letters with random other letters and numbers with random other numbers. Punctuation, whitespace, and mathematical symbols are preserved.
For example, for the following input string:
ABC.123 123-456-789 Go!
The output would be something like:
PRX.804 296-915-378 Ab!
This generator securely masks letters and numbers. There is no way to recover the original data.
Character Scramble is similar to Character Substitution, with a couple of key differences. Character Scramble does not always replace the same source character with the same destination character, although you can enable consistency for the entire value. Because there is no guarantee of unique output values, you cannot use Character Scramble on unique columns. Character Substitution, by contrast, always maps the same source character to the same destination character. Because Character Substitution is always consistent, it is less secure than Character Scramble, but you can use it on unique columns.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
Performs a random character replacement that preserves formatting (spaces, capitalization, and punctuation).
Characters are replaced with other characters from within the same Unicode block. A given source character is always mapped to the same destination character. For example, M might always map to V.
For example, for the following input string:
Miami Store #162
The output would be something like:
Vgkjg Gmlvf #681
Note that for a numeric column, when a generated number starts with a 0, the starting 0 is removed. This could result in matching output values in different columns. For example, one column is changed to 113 and the other to 0113, which also becomes 113.
Character Substitution is similar to Character Scramble, with a couple of key differences. Because Character Substitution always maps the same source character to the same destination character, it is always consistent. It also can be used for unique columns. In Character Scramble, the character mapping is random, which makes Character Scramble slightly more secure. However, Character Scramble cannot be used for unique columns.
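The fixed per-character mapping can be sketched as a seeded substitution table. This is an illustration only; the real generator substitutes within Unicode blocks, while this sketch is limited to ASCII letters and digits:

```python
import random
import string

def make_substitution(seed=0):
    # Build one fixed table per run: letters map to letters, digits to
    # digits; everything else (punctuation, whitespace) passes through.
    rng = random.Random(seed)
    table = {}
    for alphabet in (string.ascii_lowercase,
                     string.ascii_uppercase,
                     string.digits):
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        table.update(str.maketrans(alphabet, "".join(shuffled)))
    # The same source character always maps to the same destination character.
    return lambda value: value.translate(table)
```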
Generates a random company name-like string.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If consistency is enabled, then by default it is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the generator is consistent with itself, then a given source value is always mapped to the same destination value. For example, My Company is always mapped to New Company.
When the generator is consistent with another column, then a given source value in that other column always results in the same destination value for the company name column. For example, if the company name column is consistent with a name column, then every instance of John Smith in the name column in the source database has the same company name in the destination database.
Applies different generators to the value conditionally based on any value in the table.
For example, a Users table contains Name, Username, and Role columns. For the Username column, you can use a conditional generator to indicate that if the value of Role is something other than Test, then use the Character Scramble generator for the Username value. For Test users, the username is not masked.
The generator consists of a list of options. Each option includes the required conditions and the generator to use if those conditions are met.
The generator always contains a Default option. The Default option is used if the value does not meet any of the conditions. To configure the Default option:
From the Default dropdown list, select the generator to use by default.
Configure the selected generator.
To add a condition option:
Click + Conditional Generator.
To add a condition:
Click + Condition.
From the column list, select the column for which to check the value.
Select the comparison type.
Enter the column value to check for.
To remove a condition, click the delete icon for the condition.
From the Generator dropdown list, select the generator to run on the current column if the conditions are met. You cannot select another composite generator.
Choose the configuration options for the selected generator.
To view details for and edit a condition option, click the expand icon for that option.
To remove a condition option, click the delete icon for the option.
Uses a single value to mask all of the values in the column.
For example, you can replace every value in a string column with the string String1, or replace every value in a numeric column with the value 12345.
To configure the generator, in the Constant Value field, provide the value to use.
The value must be compatible with the field type. For example, you cannot provide a string value for an integer column.
Generates a continuous distribution to fit the underlying data.
This generator can be linked to other Continuous generators to create multivariate distributions and can be partitioned by other columns.
To configure the generator:
From the Link To dropdown list, select the other Continuous generator columns to link to. Linking creates a multivariate distribution.
Toggle the Differential Privacy setting to indicate whether to make the output data differentially private. By default, the generator is not differentially private.
Links columns in two tables. This column value is the sum of the values in a column in another table.
This generator does not provide a preview. The sums are not computed until the other table is generated.
For example, a Customers table contains a Total_Sales column. The Transactions table uses a foreign key Customer_ID column to identify the customer who made the transaction, and an Amount column that contains the amount of the sale. The Customer_ID value in the Transactions table is a value from the ID primary key column in the Customers table.
You assign the Cross Table Sum generator to the Total_Sales column. In the generator configuration, you indicate that the value is the sum of the Amount values for the Customer_ID value that matches the primary key ID value for the current row.
For the Customers row for ID 123, the Total_Sales column contains the sum of the Amount column for Transactions rows where Customer_ID is 123.
To configure the generator:
From the Foreign Table dropdown list, select the table that contains the column for which to sum the values.
From the Foreign Key dropdown list, select the foreign key. The foreign key identifies the row from the current table that is referred to in the foreign table.
From the Sum Over dropdown list, select the column for which to sum the values.
From the Primary Key dropdown list, select the primary key for the current table.
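The Total_Sales example above amounts to a group-by sum over the foreign table. A minimal sketch (an illustration, not Structural code; the dictionary keys mirror the example's column names):

```python
from collections import defaultdict

def cross_table_sum(transactions, key="Customer_ID", amount="Amount"):
    # Sum the Amount values per Customer_ID, the way Total_Sales is
    # derived from the Transactions table.
    totals = defaultdict(float)
    for row in transactions:
        totals[row[key]] += row[amount]
    return dict(totals)
```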
Masks text columns by parsing the values as rows whose columns are delimited by a specified character.
You can assign specific generators to specific indexes. You can also use the generator that is assigned to a specific index as the default. This applies the generator to every index that does not have an assigned generator.
The output value maintains the quotes around the index values.
For example, a column contains the following value:
"first","second","third"
You assign the Character Scramble generator to index 0 and assign Passthrough to index 2. You select index 0 as the index to use for the default generator.
In the output, the first and second values are masked by the Character Scramble generator. The third value is not masked. The output looks something like:
"wmcop", "xjorsl", "third"
To configure the generator:
In the Delimiter field, type the delimiter that is used as a separator for the value.
For example, for the value "first","second","third", the delimiter is a comma.
You can configure a generator for any or all of the indexes. To add a sub-generator for an index:
Under Sub-Generators, click Add Generator. On the add generator dialog, the Cell CSV field contains a sample value from the source data. You can use the navigation icons to page through the values.
In the CSV Index field, type the index to assign a generator to. The index numbers start with 0. You cannot use an index that already has an assigned generator. Matched CSV values shows the value at that index for the current sample column value.
Under Generator Configuration, from the Select a Generator dropdown list, select the generator to use for the selected index. You cannot select another composite generator. To remove the selection, click the delete icon.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another index, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
After you configure a generator for at least one index, the Default Link dropdown list is displayed. From the Default Link dropdown list, select the index to use to determine how to mask values for indexes that do not have an assigned generator. For example, you assign the Character Scramble generator to index 2. If you set Default Link to 2, then all indexes that do not have an assigned generator use the Character Scramble generator.
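The per-index masking with a default link can be sketched as follows. This is an illustration of the behavior described above, not Structural's implementation; the callables stand in for the configured sub-generators:

```python
import csv
import io

def csv_mask(value, delimiter, index_generators, default_index=None):
    # index_generators: {index: callable}. Indexes without their own
    # generator fall back to the generator at default_index, if set.
    row = next(csv.reader(io.StringIO(value), delimiter=delimiter))
    default = index_generators.get(default_index)
    masked = []
    for i, field in enumerate(row):
        gen = index_generators.get(i, default)
        masked.append(gen(field) if gen else field)
    # Quotes around the index values are maintained in the output.
    return delimiter.join(f'"{f}"' for f in masked)
```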
To configure the generator:
From the Link To dropdown list, select the columns to link this column to. You can only select other columns that use the Custom Categorical generator.
In the Custom Categories text area, enter the list of values that the generator can choose from.
Put each value on a separate line.
To add a NULL value to the list, use the keyword {NULL}.
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given source value in that column always results in the same value for the current column in the destination database. For example, a department column is consistent with a username column. For each instance of User1 in the source database, the value in the department column is the same.
Truncates a date value or a timestamp to a specific part.
For a date or a timestamp, you can truncate to the year, month, or day.
For a timestamp, you can also truncate to the hour, minute, or second.
To configure the generator:
From the dropdown list, select the part of the date or timestamp to truncate to. For both date and timestamp values, you can truncate to the year, month, or day. When you select one of these options, the time portion of a timestamp is set to 00:00:00. For the date, the values below the selected truncation value are set to 01. For example, when you truncate to month, the day value is set to 01, and the timestamp is set to 00:00:00. For a timestamp value, you also can truncate to the hour, minute, or second. The date values remain the same as the original data. The time values below the selected truncation value are set to 00. For example, when you truncate to minute, the seconds value is set to 00.
Toggle the Birth Date option. When you enable Birth Date, the generator shifts dates that are more than 90 years before the generation date to the date exactly 90 years before the generation date. For example, a generation occurs on January 1, 2023. Any date that occurs before January 1, 1933 is changed to January 1, 1933.
This is mostly intended for birthdate values, to group the birthdates of everyone who is older than 89 into a single year, in order to comply with HIPAA Safe Harbor.
Here are examples of date and time values and how the selected truncation affects the output:
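The truncation rules and the Birth Date clamp can be sketched in Python (a conceptual illustration, not Structural code; the field names come from Python's datetime):

```python
from datetime import datetime

PARTS = ["year", "month", "day", "hour", "minute", "second"]
FLOORS = {"month": 1, "day": 1, "hour": 0, "minute": 0,
          "second": 0, "microsecond": 0}

def truncate(ts, part):
    # Reset every field below the selected part to its floor value
    # (01 for month/day, 00 for the time fields).
    keep = PARTS[:PARTS.index(part) + 1]
    return ts.replace(**{f: v for f, v in FLOORS.items() if f not in keep})

def clamp_birth_date(d, generation_date):
    # Birth Date option: dates more than 90 years before the generation
    # date shift to exactly 90 years before it.
    floor = generation_date.replace(year=generation_date.year - 90)
    return max(d, floor)
```

For example, truncating 2023-05-17 14:30:45 to month yields 2023-05-01 00:00:00, and truncating it to minute yields 2023-05-17 14:30:00.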
This generator scrambles the characters in an email address. It preserves formatting and keeps the @ and . characters.
For example, for the following input value:
johndoe@company.com
The output value would be something like:
brwomse@xorwxlt.slt
By default, the generator scrambles the domain. You can configure the generator to not mask specific domains. You can also specify a domain to use for all of the output email addresses.
For example, if you configure the generator to not scramble the domain company.com, then the output for johndoe@company.com would look something like:
brwomse@company.com
This generator securely masks letters and numbers. There is no way to recover the original data.
To configure the generator:
In the Email Domain field, enter a domain to use for all of the output values.
For example, use @mycompany.com for all of the generated values. The generator scrambles the content before the @.
In the Excluded Email Domains field, enter a comma-separated list of domains for which email addresses are not masked in the output values. This allows you, for example, to maintain internal or testing email addresses that are not considered sensitive.
Toggle the Replace invalid emails setting to indicate whether to replace an invalid email address with a generated valid email address. By default, invalid email addresses are not replaced. In the replacement values, the username is generated. If you specify a value for Email Domain, then the email addresses use that domain. Otherwise, the domain is generated.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
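The domain handling described above (scramble by default, excluded domains pass through, a fixed domain overrides) can be sketched as follows. An illustration only, not Structural's implementation:

```python
import random
import string

def mask_email(email, email_domain=None, excluded_domains=(), seed=None):
    rng = random.Random(seed)
    local, _, domain = email.partition("@")

    def scramble(text):
        # Letters and digits are replaced; "." and other punctuation
        # survive because only alphanumerics are touched.
        return "".join(
            rng.choice(string.ascii_lowercase) if c.isalpha()
            else rng.choice(string.digits) if c.isdigit()
            else c
            for c in text)

    if domain in excluded_domains:
        return scramble(local) + "@" + domain
    if email_domain:
        return scramble(local) + "@" + email_domain.lstrip("@")
    return scramble(local) + "@" + scramble(domain)
```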
Generates timestamps fitting an event distribution. The source timestamp must include a date. It cannot be a time-only value.
Link columns to create a sequence of events across multiple columns. This generator can be partitioned by other columns.
To configure the generator:
From the Link To dropdown list, select the other Event Timestamps generator columns to link this column to. Linking creates a sequence across multiple columns.
The Options list displays the current column and linked columns. Use the Up and Down buttons to configure the column sequence.
This generator scrambles characters while preserving formatting and keeping the file extension intact.
For example, for the following input value:
DataSummary1.pdf
The output value would look something like:
RsnoPwcsrtv5.pdf
This generator securely masks letters and numbers. There is no way to recover the original data.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
This generator replaces all instances of the find string with the replace string.
For example, you can indicate to replace all instances of abc with 123.
To configure the generator:
In the Find field, type the string to look for in the source column value.
To use a regular expression to identify the source value, check the Use Regex checkbox.
If you use a regular expression, use backslash ( \ ) as the escape character.
In the Replace field, type the string to replace the matching string with.
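The two modes (literal find-and-replace versus regular expression) can be sketched in a few lines (an illustration, not Structural code):

```python
import re

def find_replace(value, find, replace, use_regex=False):
    # Replace every occurrence of the find string (or regex pattern)
    # with the replace string.
    if use_regex:
        return re.sub(find, replace, value)
    return value.replace(find, replace)
```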
The FNR generator transforms Norwegian national identity numbers. In Norwegian, the term for national identity number abbreviates to FNR.
The first six digits of an FNR reflect the person's birthdate. You can choose to preserve the birthdates from the source values in the destination values. If you do not preserve the source values, the destination values are still within the same date range as the source values.
Another digit in an FNR indicates whether the person is male or female. You can specify whether to preserve in the generated value the gender indicated in the source value.
The last digits in an FNR are a checksum value. The last digits in the destination value are not a valid checksum; they are random.
To configure the generator:
To preserve the gender from the source value in the destination value, toggle Preserve Gender to the on position.
To preserve the birthdate from the source value in the destination value, toggle Preserve Birthdate to the on position.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given value for that other column in the source database results in the same value in the destination database. For example, if the FNR column is consistent with a Name column, then every instance of John Smith in the source database results in the same FNR in the destination database.
This generator can be used to mask columns of latitude and longitude.
The Geo generator divides the globe into grids that are approximately 4.9 x 4.9 km. It then counts the number of points within each grid.
During data generation, each (latitude, longitude) pair is mapped to its grid.
If the grid contains a sufficient number of points to preserve privacy, then the generator returns a randomly chosen point in that grid.
If the grid does not contain enough points to preserve privacy, then the generator returns a random coordinate from the nearest grid that contains enough points.
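The grid mapping can be sketched as follows. This is a conceptual illustration only; the actual grid construction (and the conversion of 4.9 km to degrees) is an assumption of this sketch, and the privacy-threshold fallback is omitted:

```python
import random

CELL_DEG = 4.9 / 111.0  # ~4.9 km expressed in degrees of latitude (assumption)

def grid_cell(lat, lon):
    # Map a (latitude, longitude) pair to its grid cell index.
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

def random_point_in_cell(cell, seed=None):
    # Return a uniformly random point inside the given cell.
    rng = random.Random(seed)
    r, c = cell
    return (r * CELL_DEG + rng.random() * CELL_DEG,
            c * CELL_DEG + rng.random() * CELL_DEG)
```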
To configure the generator:
From the Link To dropdown list, select the column to link to this one. You typically assign the Geo generator to both the latitude and longitude column, then link those columns.
From the value type dropdown, select whether this column contains a latitude value or a longitude value.
This generator can be used to generate cities, states, and zip codes that follow HIPAA guidelines for safe harbor.
Zip Codes
How the HIPAA Address generator handles zip codes is based on whether the Replace zeros in truncated Zip Code toggle in the generator configuration is off or on.
By default, the setting is off. In this case, the last two digits of the zip code in the column are replaced with zeros, unless the zip code is a low population area as designated by the current census. For a low population area, all of the digits in the zip code are replaced with zeros.
If the setting is on, then the generator selects a real zip code that starts with the same three digits as the original zip code. For a low population area, if a state is linked, then the generator selects a random zip code from within that state. Otherwise the generator selects a random zip code from the United States.
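The default (toggle off) zip code handling can be sketched as a simple truncation. In the product, the low-population determination comes from census data; here it is passed in as a flag, and the toggle-on lookup of a real replacement zip code is omitted.

```python
def truncate_zip(zip_code: str, low_population: bool) -> str:
    """Default zip handling with Replace zeros in truncated Zip Code off.

    Low-population areas (a census-based determination in the real
    product, supplied as a flag here) are zeroed out entirely;
    otherwise only the last two digits are replaced with zeros.
    """
    if low_population:
        return "00000"
    return zip_code[:3] + "00"
```

For example, 30305 becomes 30300 unless 303xx is designated a low population area, in which case it becomes 00000.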
Cities
When a zip code column is not linked, a random city in the United States is chosen. When a zip code is already added to the link, a city is chosen at random that has at least some overlap with the zip code.
If the original zip code is designated as a low population area, and a State column is linked, then a random city is chosen within the state. If a State column is not linked, a random city within the United States is chosen.
For example, if the original city and zip code were (Atlanta, 30305), the zip code would be replaced with 30300. Many cities contain zip codes that begin with 303, such as Atlanta, Decatur, Chamblee, Hapeville, Dunwoody, and College Park. One of these cities is chosen at random, so the final value might be (Chamblee, 30300).
States
HIPAA guidelines allow for information at the state level to be kept. Therefore, these values are passed through.
Latitude and longitude (GPS) coordinates
GPS coordinates are randomly generated in descending order of dependence of the linked HIPAA address components:
If a zip code is linked and its 3-digit prefix is not designated a low population area, a random point is generated within the same 3-digit zip code prefix. If the prefix is a low population area, the linked state is used instead.
If a state is available and a zip code and city are not, or the zip code or city are in a 3-digit zip code prefix that is designated a low population area, then a random GPS coordinate is generated somewhere within the state.
If no zip code, city, or state is linked, or if one or more of them are provided but a random GPS coordinate cannot be generated within the linked areas, then a GPS coordinate is generated at a random location within the United States.
Note: If the city component of the HIPAA address is linked with latitude and/or longitude, the GPS coordinate components are randomly generated independently of the city.
Other address parts
All other address parts are generated randomly; their values are not influenced by the underlying values in the column.
To configure the generator:
From the Link To dropdown list, select the other columns to link to. You can only select columns that are also assigned the HIPAA Address generator.
From the address part dropdown list, select the type of address value that is in the column.
Toggle the Replace zeros in truncated Zip Code setting to indicate how to generate zip codes. If the setting is off, then the last two digits are replaced with zeros. For low population areas, the entire zip code is populated with zeros. If the setting is on, then a real zip code is selected that starts with the first three digits of the original zip code. For low population areas, if a state is linked, a random zip code from the state is used. Otherwise, a random zip code from the United States is used.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
For the HIPAA Address generator, Spark workspaces (Amazon EMR, Databricks, and self-managed Spark clusters) only support the following address parts:
City
City with State
City with State Abbr
State
State Abbr
US Address
US Address with Country
Zip Code
Generates random host names, based on the English language.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from Consistent to, select the column.
When the generator is consistent with itself, then a given value in the source database is mapped to the same value in the destination database. For example, Host123 in the source database always produces MyHostABC in the destination database.
When the generator is consistent with another column, then a given source value in the other column results in the same host name value in the destination database. For example, a host name column is consistent with a department column. Every instance of Sales in the source data is given the same host name in the destination database.
Runs selected generators on specified key values in an HStore column in a PostgreSQL database. HStore columns contain a set of key-value pairs.
To configure the generator:
To assign a generator to a key:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell HStore field contains a sample value from the source database. You can use the previous and next icons to page through different values.
Under Enter a key, enter the name of a key from the column value.
For example, for the column value:
"pages"=>"446", "title"=>"The Iliad", "category"=>"mythology"
To apply a generator to the title, you would enter title as the key.
Matched HStore Values shows the result from the value in Cell HStore.
From the Generator Configuration dropdown list, select the generator to apply to the key value. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another key, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
This is a composite generator.
Masks text columns by parsing the contents as HTML, and applying sub-generators to specified path expressions.
If applying a sub-generator fails because of an error, the generator selected as the fallback generator is applied instead.
For example, for the following HTML:
To get the value of h1, the expression is //h1/text().
To get the value of the first list item, the expression is //ul/li[1]/text().
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell HTML field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. Matched HTML Values shows the result from the value in Cell HTML.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
From the Fallback Generator dropdown list, select the generator to use if the assigned generator for a path expression fails. The options are:
Generates unique integer values. By default, the generated values are within the range of the column’s data type.
You can also specify a range for the generated values. The source values must be within that range.
This generator cannot be used to transform negative numbers.
To configure the generator:
In the Minimum field, enter the minimum value to use for an output value. The minimum value cannot be larger than any of the values in the source data.
In the Maximum field, enter the maximum value to use for an output value. The maximum value cannot be smaller than any of the values in the source data.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
Generates a random IP address formatted string.
To configure the generator:
In the Percent IPv4 field, type the percentage of output values that are IPv4 addresses.
For example, if you set this to 60, then 60% of the generated IP addresses are IPv4 addresses, and 40% of the generated IP addresses are IPv6 addresses.
If you set this to 100, then all of the generated IP addresses are IPv4 addresses.
If you set this to 0, then all of the generated IP addresses are IPv6 addresses.
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given source value in that column always results in the same IP address value in the destination database. For example, an IP address column is consistent with a username column. For each instance of User1 in the source database, the value in the IP address column is the same.
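The Percent IPv4 split can be sketched as follows. The `random_ip` helper and its address formatting are assumptions for illustration; the product's exact generation logic is not documented here.

```python
import random

def random_ip(percent_ipv4: float, rng: random.Random) -> str:
    """Return an IPv4 address percent_ipv4% of the time, else IPv6.

    rng.uniform(0, 100) is in [0, 100), so 100 always yields IPv4
    and 0 always yields IPv6.
    """
    if rng.uniform(0, 100) < percent_ipv4:
        return ".".join(str(rng.randint(0, 255)) for _ in range(4))
    return ":".join(f"{rng.randint(0, 0xFFFF):x}" for _ in range(8))
```

Over many rows, roughly the configured percentage of outputs are dotted-quad IPv4 strings and the rest are colon-separated IPv6 strings.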
If an error occurs, the selected fallback generator is used for the entirety of the JSON value.
Sub-generators are applied sequentially, from the sub-generator at the top of the list to the sub-generator at the bottom of the list.
If multiple JSONPath expressions point to the same key, the most recently added generator takes priority.
JSON paths can also contain regular expressions and comparison logic, which allows the configured sub-generators to be applied only when there are properties that satisfy the query.
For example, a column contains this JSON:
[ { file_name: "foo.txt", b: 10 }, ... ]
The following JSON path only applies to array elements that contain a file_name key for which the value ends in .txt:
$.[?(@.file_name =~ /^.*\.txt$/)]
A JSON path can also be used to point to a key name recursively. For example, a column contains this JSON:
The following JSON path applies to all properties for which the key is first_name:
$..first_name
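The recursive selection that $..first_name performs can be illustrated with a plain-Python walk that applies a transform to every matching key, at any depth. This mirrors the behavior for illustration; it is not Tonic's implementation, and the `mask_key` helper is an assumption.

```python
def mask_key(node, key, transform):
    """Apply transform to every value whose key matches, recursively.

    Dicts and lists are rebuilt, so the original structure is not
    mutated; all other values pass through unchanged.
    """
    if isinstance(node, dict):
        return {k: transform(v) if k == key else mask_key(v, key, transform)
                for k, v in node.items()}
    if isinstance(node, list):
        return [mask_key(item, key, transform) for item in node]
    return node
```

Every first_name value is replaced, no matter how deeply it is nested, which is exactly what the recursive descent operator selects.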
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell JSON field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. To create a path expression, you can also click the value in Cell JSON that you want the expression to point to. Matched JSON Values shows the result from the value in Cell JSON.
By default, the selected generator is applied to any value that matches the expression. To limit the types of values to apply the generator to, from the Type Filter, specify the applicable types. You can select Any, or you can select any combination of String, Number, Boolean, and Null.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
From the Fallback Generator dropdown list, select the generator to use if the assigned generator for a path expression fails. The options are:
Generates a random MAC address formatted string.
To configure the generator:
In the Bytes Preserved field, enter the number of bytes to preserve in the generated address.
Toggle the Consistency setting to indicate whether to make the column self-consistent. By default, consistency is disabled.
Generates unique object identifiers.
Can be assigned to text columns that contain MongoDB ObjectId values. The column value must be 12 bytes long.
To configure the generator:
A MongoID object identifier consists of an epoch timestamp, a random value, and an incremented counter. To only change the random value portion of the identifier, but keep the timestamp and counter portions, toggle Preserve Timestamp and Incremental Counter to the on position.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, the generator is not consistent.
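The Preserve Timestamp and Incremental Counter behavior can be sketched against the documented ObjectId layout (4-byte timestamp, 5-byte random value, 3-byte counter, 24 hex characters in total). The `mask_object_id` helper is an assumption for illustration.

```python
import secrets

def mask_object_id(oid_hex: str, preserve_ts_and_counter: bool = True) -> str:
    """Replace the 5-byte random portion of a 12-byte ObjectId hex string.

    Hex layout: chars 0-7 = timestamp, 8-17 = random value,
    18-23 = incrementing counter.
    """
    ts, counter = oid_hex[:8], oid_hex[18:]
    new_random = secrets.token_hex(5)  # 5 bytes -> 10 hex chars
    if preserve_ts_and_counter:
        return ts + new_random + counter
    return secrets.token_hex(4) + new_random + secrets.token_hex(3)
```

With the toggle on, only the middle 10 hex characters change, so insertion-time ordering implied by the timestamp and counter is preserved.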
Generates a random name string from a dictionary of first and last names.
You specify the name information that is contained in the column. A column might only contain a first name or last name, or might contain a full name. A full name might be first name first or last name first.
For example, a Name column contains a full name in the format Last, First. For the input value Smith, John, the output value would be something like Jones, Mary.
To configure the generator:
From the name format dropdown list, select the type of name value that the column contains:
First. This also is commonly used for standalone middle name fields.
Last
First Last
First Middle Last
First Middle Initial Last
Last, First
Last, First Middle
Middle Initial
Toggle the Preserve Capitalization setting to indicate whether to preserve the capitalization of the column value. By default, the capitalization is not preserved.
Toggle the Consistency setting to indicate whether to make the column consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
Masks values in numeric columns. Adds or multiplies the original value by random noise.
The additive noise generator draws noise from an interval around 0 scaled to the magnitude of original value. For example, the default scale is 10% of the underlying value. The larger the value, the larger the amount of noise that is added.
The multiplicative noise generator multiplies the original value by a random scaling factor that falls within a specified range.
To configure the generator:
To use the additive noise generator:
From the dropdown list, choose Additive.
In the Relative noise scale field, type the percentage of the underlying value to scale the noise to. The default value is 10.
Tonic samples the additive noise from the range [-(scale/100) * |value|, (scale/100) * |value|), where scale is the noise scale and value is the original data value.
The lower value of the range is inclusive, and the upper value of the range is exclusive.
For example, for the default noise scale of 10 and a data value of 20, the additive noise range would be [-.1 * 20, .1 * 20). In other words, between -2 (inclusive) and 2 (exclusive).
To use the multiplicative noise generator:
From the dropdown list, choose Multiplicative.
In the Min field, type the minimum value for the scaling factor. The minimum value is inclusive. The default value is 0.5.
In the Max field, type the maximum value for the scaling factor. The maximum value is exclusive. The default value is 5.
Tonic multiplies the original value by a scaling factor drawn from the range [min, max), where min is the minimum scaling factor and max is the maximum scaling factor.
For example, for the default values of 0.5 and 5, Tonic multiplies the original data value by a value between 0.5 (inclusive) and 5 (exclusive).
Toggle the Consistency setting to indicate whether to make the column consistent. By default, the consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. If the generator is self-consistent, then a given value in the source database is masked in exactly the same way to produce the value in the destination database. If the generator is consistent with another column, then for a given value in that other column, the column that is assigned the Noise generator is always masked in exactly the same way in the destination database. For example, a field containing a salary value is assigned the Noise Generator and is consistent with the username field. For each instance of User1, the Noise Generator masks the salary value in exactly the same way.
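The two noise modes described above can be sketched as follows. Because `rng.random()` returns a value in [0, 1), both ranges are inclusive at the lower bound and exclusive at the upper bound, matching the documented behavior. This is an illustration, not Tonic's implementation.

```python
import random

def additive_noise(value: float, scale: float, rng: random.Random) -> float:
    """Add noise from [-(scale/100)*|value|, (scale/100)*|value|)."""
    bound = (scale / 100) * abs(value)
    return value + (-bound + rng.random() * 2 * bound)

def multiplicative_noise(value: float, min_factor: float, max_factor: float,
                         rng: random.Random) -> float:
    """Multiply by a scaling factor from [min_factor, max_factor)."""
    return value * (min_factor + rng.random() * (max_factor - min_factor))
```

With the defaults, a value of 20 gains additive noise in [-2, 2), and the multiplicative mode scales a value by a factor in [0.5, 5).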
Generates NULL values to fill the rows of the specified column.
The Null generator has no configuration options.
Generates unique numeric strings of the same length as the input value.
For example, for the input value 123456, the output value would be something like 832957.
You can apply this generator only to columns that contain numeric strings.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
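The same-length replacement can be sketched as follows. The uniqueness guarantee across the column requires bookkeeping that is omitted here; this is an illustration only.

```python
import random

def mask_numeric_string(value: str, rng: random.Random) -> str:
    """Replace each digit with a random digit, keeping the length.

    The real generator also guarantees that outputs are unique
    across the column; that bookkeeping is omitted in this sketch.
    """
    return "".join(str(rng.randint(0, 9)) for _ in value)
```

A 6-character numeric string such as 123456 always produces another 6-character numeric string.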
Passthrough is the default option.
It passes through the value from the source database to the destination database without masking it.
Passthrough has no configuration options.
Generates a random phone number that matches the country or region of the input phone number while maintaining the format. For example, (123) 456-7890 or 123-456-7890.
If the input is not a valid phone number, the generator randomly replaces numeric characters. You can also replace invalid numbers with valid numbers.
To configure the generator:
Toggle the Replace invalid numbers setting to indicate whether to replace invalid input values with a valid output value. By default, the generator does not replace invalid values. It randomly replaces numeric characters.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, consistency is disabled.
Generates a random boolean value.
To configure the generator, in the Percent True field, enter the percentage of values to set to True in the output.
For example, if you set this to 60, then 60% of the output values are True, and 40% of the output values are False.
If you set this to 100, then all of the output values are True.
If you set this to 0, then all of the output values are False.
Generates a random double number between the specified minimum (inclusive) and maximum (exclusive).
To configure the generator:
In the Minimum field, type the minimum value to use in the output values. The minimum value is inclusive. The output values can be that value or higher.
In the Maximum field, type the maximum value to use in the output values. The maximum value is exclusive. The output values are lower than that value.
Generates a random hash string.
Returns a random integer between the specified minimum (inclusive) and maximum (exclusive).
For example, for a column that contains a percentage value, you can indicate to use a value between 0 and 101.
To configure the generator:
In the Minimum field, type the minimum value to use in the output values. The minimum value is inclusive. The output values can be that value or higher.
In the Maximum field, type the maximum value to use in the output values. The maximum value is exclusive. The output values are lower than that value.
Generates random dates, times, and timestamps that fall within a specified range.
For example, you might want the output dates to all fall within a specific year or month.
To configure the generator, in the Range fields, provide the start and end dates, times, or timestamps to use for the output values.
Generates a random new UUID string.
This is a composite generator.
Uses regular expressions to parse strings and replace specified substrings with the output of specified generators. The parts of the string to replace are specified inside unnamed top-level capture groups.
Defining multiple expressions allows you to attach completely different sets of sub-generators to a given cell, depending on the cell's value.
If multiple regular expressions match a given string, the regular expressions and their associated generators are applied in the order that they are specified. The first expression defined that matches has the selected sub-generators applied.
With the Replace all matches option, the Regex Mask generator behaves similarly to a traditional regex parser. It matches all occurrences of a pattern before the next pattern is encountered. For example, the pattern ^(a)$ applied to the string aaab matches every occurrence of the letter a, instead of just the first.
Note that for Spark-based data connectors, depending on your environment, there might be slight differences in the regular expression support. To ensure consistent results across all data connectors, use regular expression patterns that are compatible with both Java and C#.
Example expressions
In a cell that contains the string ProductId:123-BuyerId:234, to mask the substrings 123 and 234, specify the regular expression:
^ProductId:([0-9]{3})-BuyerId:([0-9]{3})$
This captures the two occurrences of three-digit numbers in the pattern ProductId:xxx-BuyerId:xxx. This makes it possible to define a sub-generator on neither, either, or both of these captured substrings.
The following regular expression defines a broader capture that matches more cell values:
^(\w+).(\d+).(\w+).(\d+)$
This captures pairs of words ((\w+)) and numbers ((\d+)) if there is a single character of any value between them, instead of the relatively more specific pattern of the first expression.
To configure the generator:
To add a regular expression:
Click Add Regex. On the configuration panel, Cell Value shows a sample value from the source database. You can use the previous and next options to navigate through the values.
By default, Replace all matches is enabled. To only match the first occurrence of a pattern, toggle Replace all matches to the off position.
In the Pattern field, enter a regular expression. If the expression is valid, then Tonic displays the capture groups for the expression.
For each capture group, to select and configure the generator to apply, click the selected generator. You cannot select another composite generator.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Regexes list:
To edit a regex, click the edit icon.
To remove a regex, click the delete icon.
Generates a column of unique integer values. The values increment by 1.
To configure the generator:
From the Link To dropdown list, select the other columns to link to the current column. You can only select columns that also use the Sequential Integer generator.
In the Starting Point field, type the number to use as the starting point.
By default, the starting point is 0. This means that the column value in the first processed row is 0. The value in the next processed row is 1. The generator continues to increment the value by 1 in each row that it processes.
Generates ISO 6346-compliant shipping container codes. All generated codes are in the freight category ("U").
To configure the generator, toggle the Consistency setting to indicate whether to make the generator consistent.
By default, the generator is not consistent.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column.
When the generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database.
When the generator is consistent with another column, then a given value for the other column in the source database always results in the same shipping container code value in the destination database. For example, a shipping container column is consistent with an owner column. Every instance of an owner column from the source database has the same shipping container value in the destination database.
Generates a new valid Canadian Social Insurance Number that preserves the formatting of the original value.
For example, the original value might be 123456789, 123 456 789, or 123-456-789. The output value uses the same format.
To configure the generator, toggle the Consistency setting to indicate whether to make the generator self-consistent.
By default, the generator is not consistent.
Generates a new valid United States Social Security Number.
You specify the percentage of values for which to include the dashes.
To configure the generator:
In the Percent with -'s field, type the percentage of output values for which to include dashes in the format.
For example, if you set this to 60, then 60% of the output values are formatted 123-45-6789, and 40% are formatted 123456789.
If you set this to 100, then all of the output values are formatted 123-45-6789.
If you set this to 0, then all of the output values are formatted 123456789.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a generator is self-consistent, then a given value in the source database is always mapped to the same value in the destination database. When a generator is consistent with another column, then a given value for that other column in the source database results in the same SSN in the destination database. For example, if the SSN column is consistent with a Name column, then every instance of John Smith in the source database results in the same SSN in the destination database.
Applies selected generators to specific StructFields within a StructType in a Spark database (Databricks and Amazon EMR).
For example, for the following StructType:
To get the value of the occupation field, you would use the expression root.occupation.
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell Struct field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. Matched Struct Values shows the result from the value in Cell Struct.
From the Generator Configuration dropdown list, select the generator to apply to the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
Shifts timestamps by a random amount of a specific unit of time within a set range.
For date-only values, the Timestamp Shift Generator supports the following date formats. The example values are all for February 23, 2021.
MM/dd/yyyy - 02/23/2021
MM/dd/yy - 02/23/21
MM-dd-yyyy - 02-23-2021
yyyyMMdd - 20210223
yyyy/MM/dd - 2021/02/23
MMddyyyy - 02232021
To configure the generator:
From the Date Part dropdown list, select the unit of time to use for the minimum and maximum shift.
In the Minimum Shift field, type the minimum amount the value can be shifted from the original value.
Use negative numbers to indicate to shift the date to the past.
For example, assume that the date part is Day. -3 indicates that the day cannot be shifted earlier than 3 days before the original day. 3 indicates that the day cannot be shifted earlier than 3 days after the original day.
In the Maximum Shift field, type the maximum amount by which the value can be shifted from the original value.
For example, assume that the date part is Day. 5 indicates that the date cannot be shifted later than 5 days after the original day.
Toggle the Consistency setting to indicate whether to make the generator consistent. By default, consistency is disabled.
If you enable consistency, then by default the generator is self-consistent. To make the generator consistent with another column, from the Consistent to dropdown list, select the column. When a column is consistent with itself, then the same date part value is always shifted by the same amount.
When a column is consistent with another column, then for the same value in the other column, the date part value is always shifted by the same amount. For example, for the same value of username, the birthdate column value is always shifted by the same amount.
If multiple columns that use the Timestamp Shift generator are consistent with the same other column, then for those columns, the date part value shifts by the same amount. For example, the startdate and enddate columns are both consistent with the username column. Both startdate and enddate use the Timestamp Shift generator. For the same value of username, both startdate and enddate are shifted by the same amount.
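The basic shift, before any consistency is applied, can be sketched for the Day date part as follows. The `shift_date` helper is an assumption for illustration; other units (hours, months, and so on) work the same way with a different time delta.

```python
import datetime
import random

def shift_date(value: datetime.date, min_shift: int, max_shift: int,
               rng: random.Random) -> datetime.date:
    """Shift a date by a random whole number of days in [min_shift, max_shift].

    Negative values shift into the past, as described for the
    Minimum Shift setting.
    """
    return value + datetime.timedelta(days=rng.randint(min_shift, max_shift))
```

With a minimum of -3 and a maximum of 5, February 23, 2021 always lands between February 20 and February 28, 2021.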
Generates unique email addresses. Replaces the username with a randomly generated GUID, and masks the domain with a character scramble.
This generator only guarantees uniqueness if the underlying column is unique.
To configure the generator:
In the Email Domain field, enter a domain to use for all of the output values.
For example, use @mycompany.com for all of the generated values.
If you do not provide a value, then the generator uses a character scramble on the domain.
In the Excluded Email Domains field, enter a comma-separated list of domains for which email addresses are not masked in the output values. This allows you, for example, to maintain internal or testing email addresses that are not considered sensitive.
Toggle the Replace invalid emails setting to indicate whether to replace an invalid email address with a generated valid email address. By default, invalid email addresses are not replaced. In the replacement values, the username is generated. If you specify a value for Email Domain, then that value is used for the domain. Otherwise, the domain is generated.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, consistency is disabled.
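The GUID-username-plus-scrambled-domain behavior can be sketched as follows. The character scramble here is a simple random remapping of letters, and the Excluded Email Domains handling is omitted; both the `mask_email` helper and its scramble are assumptions for illustration, not the product's exact algorithm.

```python
import random
import uuid

def mask_email(email: str, email_domain: str = "", rng=random) -> str:
    """Replace the username with a GUID; scramble or replace the domain.

    If email_domain (e.g. "@mycompany.com") is provided, it is used
    for every output; otherwise each letter of the original domain
    is replaced with a random letter, preserving dots and length.
    """
    _, _, domain = email.partition("@")
    new_local = uuid.uuid4().hex  # 32-character GUID username
    if email_domain:
        return new_local + email_domain
    scrambled = "".join(
        rng.choice("abcdefghijklmnopqrstuvwxyz") if c.isalpha() else c
        for c in domain)
    return new_local + "@" + scrambled
```

Because the username is a fresh GUID, outputs are unique as long as the underlying column is unique, as noted above.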
This is a substitution cipher that preserves formatting and keeps the URL scheme and top-level domain intact.
For example, for the following input value:
http://www.example.com/products/clothes
The output value would be something like:
http://www.example.com/sowrmsl/kwctlsn
This mask is not secure.
All foreign key columns that reference the configured column automatically have their UUID values masked.
To configure the generator:
To preserve the version and variant bits from the source UUID in the output value, toggle Preserve Version and Variant to the on position.
Toggle the Consistency setting to indicate whether to make the generator self-consistent. By default, the generator is not consistent.
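The version and variant of a UUID live in fixed bit positions (the high nibble of byte 6 and the top bits of byte 8), so preserving them means copying those bits from the source value into an otherwise random UUID. A minimal sketch, not Structural's implementation:

```python
import uuid

def mask_uuid(source: uuid.UUID, preserve_version_and_variant: bool = True) -> uuid.UUID:
    """Generate a random UUID, optionally copying the source's version and variant bits."""
    fresh = uuid.uuid4()
    if not preserve_version_and_variant:
        return fresh
    b = bytearray(fresh.bytes)
    src = source.bytes
    # Byte 6's high nibble holds the version; byte 8's top bits hold the variant.
    b[6] = (src[6] & 0xF0) | (b[6] & 0x0F)
    b[8] = (src[8] & 0xC0) | (b[8] & 0x3F)
    return uuid.UUID(bytes=bytes(b))

src = uuid.uuid1()          # a version-1 UUID
out = mask_uuid(src)
assert out.version == src.version  # version bits preserved
```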
Runs a selected generator on values that match a user-specified path expression.
For example, for the following XML content:
To get the first_name value, you would use /household/member/first_name.
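As an illustration of how a path expression selects the value to mask, here is a minimal sketch using Python's ElementTree with hypothetical sample XML. Structural's actual XPath handling is richer than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML cell value; /household/member/first_name selects "Ada".
cell = """
<household>
  <member>
    <first_name>Ada</first_name>
    <last_name>Lovelace</last_name>
  </member>
</household>
"""

def apply_subgenerator(xml_text, path, generator):
    """Run `generator` on the text of every element matched by `path`."""
    root = ET.fromstring(xml_text)
    # ElementTree paths are relative to the root element, so the absolute
    # expression /household/member/first_name becomes "member/first_name".
    for node in root.findall(path):
        node.text = generator(node.text)
    return ET.tostring(root, encoding="unicode")

masked = apply_subgenerator(cell, "member/first_name", lambda v: "X" * len(v))
assert "<first_name>XXX</first_name>" in masked
```

Only the matched value is rewritten; the rest of the XML passes through unchanged, which is why a fallback generator is needed for parse or generation errors on the whole value.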
You can also select a fallback generator to run on the entire XML value if there is any error during data generation.
To configure the generator:
To assign a generator to a path expression:
Under Sub-generators, click Add Generator. On the sub-generator configuration panel, the Cell XML field contains a sample value from the source database. You can use the previous and next icons to page through different values.
In the Path Expression field, type the path expression to identify the value to apply the generator to. Matched XML Values shows the result from the value in Cell XML.
From the Generator Configuration dropdown list, select the generator to apply to the value at the path expression. You cannot select another composite generator.
Configure the selected generator. You cannot configure the selected generator to be consistent with another column.
To save the configuration and immediately add a generator for another path expression, click Save and Add Another. To save the configuration and close the add generator panel, click Save.
From the Sub-Generators list:
To edit a generator assignment, click the edit icon.
To remove a generator assignment, click the delete icon.
To move a generator assignment up or down in the list, click the up or down arrow.
From the Fallback Generator dropdown list, select the generator to use if any error occurs in the generation. The fallback generator is then used for the entire XML value. The options are:
The linking option for a generator allows multiple columns within the same table to use a single generator.
At a high level, consider using linking when columns share a strong interdependency or correlation.
When you link columns, you tell Tonic Structural that the columns are related to each other, and that Structural should take this relationship into account when it generates new data.
To link columns, you first assign the same generator to those columns.
After you assign the generator, then on the generator configuration panel for any of the columns, you can link the columns.
Categorical generators support linking and can be used to preserve hierarchical data. Examples of hierarchical data include:
City, State, Zip
Job Title, Department
Day of Month, Month, Year
To illustrate how linking works, we'll use an example of city and state columns. Here is the original data:
The image below shows the results when you apply the Categorical generator to the city and state columns, but do not link the columns. Because the columns are not linked, the values in each column are shuffled independently. In the output, the city and state combinations are not valid. For example, Phoenix is not in Florida, and Baltimore is not in Tennessee.
The next image shows the results when you apply the Categorical generator to and link the city and state columns. This preserves the data hierarchy and ensures that the city and state combinations are valid.
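The difference between unlinked and linked shuffling can be sketched as follows. This is an illustrative model only; Structural's Categorical generator does more than a plain shuffle.

```python
import random

rows = [
    ("Phoenix", "Arizona"),
    ("Baltimore", "Maryland"),
    ("Memphis", "Tennessee"),
    ("Miami", "Florida"),
]

def shuffle_unlinked(rows, seed=0):
    """Shuffle each column independently; combinations may be invalid."""
    rng = random.Random(seed)
    cities = [r[0] for r in rows]
    states = [r[1] for r in rows]
    rng.shuffle(cities)
    rng.shuffle(states)
    return list(zip(cities, states))

def shuffle_linked(rows, seed=0):
    """Shuffle whole (city, state) tuples; every output pair is a real pair."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out

valid = set(rows)
# Linked shuffling only ever emits combinations that exist in the source.
assert all(pair in valid for pair in shuffle_linked(rows))
```

Shuffling the tuples as units is what preserves the hierarchy: a city can only ever appear with its own state.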
The following generators can be linked:
Some generators can be data-free. When a generator is data-free, it means that the output data is completely unrelated to the source data. There is no way to use the output data to uncover the source data. Data-free generators implicitly have differential privacy. A generator is not data-free if consistency is enabled.
The following generators are always data-free:
The following generators are data-free only when consistency is disabled:
Differential privacy is one technique that Tonic Structural uses to ensure the privacy of your data.
Differential privacy limits the effect of a single source record or user on the destination data. Someone who views the output of a process that has differential privacy cannot determine whether a particular individual's information was used to generate that output.
Data that is protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised.
Any generator that does not use the underlying data at all is considered "data-free". A data-free generator always has differential privacy.
Several Structural generators are either always data-free, or are data-free if consistency is not enabled.
The Categorical generator shuffles the values of a column while preserving the overall frequency of the values. Note that NULL is considered its own category of value.
Differential privacy (disabled by default) further protects the privacy of your data by:
First, adding noise to the frequencies of categories.
After that, if needed, removing rare categories from the possible samples.
Differential privacy is not appropriate when the data in each row is unique or nearly unique. As a general rule of thumb, categories that are represented by fewer than 15 rows are at risk of being suppressed.
Structural warns you when a column isn’t suitable for differential privacy. A column is not suitable for differential privacy if most or all categories have fewer than 15 rows.
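The two steps above can be sketched as follows. This is an illustrative model, not Structural's implementation: the Laplace noise is drawn as the difference of two exponential samples, and the threshold of 15 echoes the rule of thumb above.

```python
import random
from collections import Counter

def dp_category_frequencies(values, epsilon=1.0, threshold=15, seed=0):
    """Noise category counts with Laplace(1/epsilon), then drop rare categories."""
    rng = random.Random(seed)
    noisy = {}
    for category, count in Counter(values).items():
        # Lap(1/epsilon) noise, sampled as a difference of exponentials.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        if count + noise >= threshold:
            noisy[category] = count + noise
    return noisy

data = ["a"] * 100 + ["b"] * 40 + ["rare"] * 2
freqs = dp_category_frequencies(data)
assert "rare" not in freqs  # rare category suppressed
```

Well-represented categories survive with slightly perturbed counts, while categories with only a handful of rows are suppressed, which is why near-unique columns are poor candidates for this option.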
The Continuous generator produces samples that preserve the individual column distributions and correlations between columns.
Suppose we want to count the number of users in a database that have some sensitive property. For example, the number of users with a particular medical diagnosis.
A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise drawn from a wider array of distributions than the Laplace distribution.
Generator | Description | Consistency | Linking | Differential Privacy |
---|---|---|---|---|
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Structural web server container, set the environment setting TONIC_NN_GENERATOR_ENABLED to true.
If a relationship cannot be found, then the generator defaults to the generator.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
A version of the generator that can be used for array values.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
A version of the generator that can be used for array values.
Runs a selected generator on values that match a user-specified .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
A version of the generator that can be used for array values.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This generator is optimized for categories with fewer than 10,000 unique values. If your underlying data has more unique values (for example, your field is populated by freeform text entry), we recommend that you use the or generator.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This generator is deprecated. Use the Business Name generator instead.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
From the Partition By drop-down list, select one or more columns to use to partition the data. The selected columns must have the generator set to either Passthrough or Categorical. For more information about partitioning and how it works, go to .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
A version of the generator that selects from values that you provide instead of shuffling the original values.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Option | Date value | Timestamp value |
---|---|---|
From the Partition drop-down list, select one or more columns to use to partition the data. The selected columns must have their generator set to either Passthrough or Categorical. For more information about partitioning and how it works, go to .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
The provides support for additional address parts in Spark workspaces.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
Path expressions are defined using the .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
Runs a selected generator on values that match a user-specified .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
By default, the numbers are United States phone numbers. Generated numbers pass Google's verification if the input is a valid phone number or if you replace invalid numbers.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
For more information about regular expressions in C#, go to . For more information about regular expressions in Java, go to .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
Generates UUIDs on primary key columns.
If is enabled, then to use it for this column, toggle Use data encryption process to the on position.
This is a .
Path expressions are defined using the .
In a , if you change the configuration of a linked column, the columns that it is linked to also are marked as having overrides to the parent workspace configuration.
Note that you cannot configure linking as part of a generator preset. You can only configure linking when you configure specific columns.
(deprecated)
The configuration options for the Categorical and Continuous generators include a Differential Privacy toggle to enable or disable differential privacy.
These steps ensure that a single row of source data has limited influence on the output values. By default, the privacy budget for this generator is an (ε, δ) pair in which δ shrinks with n, the number of rows.
When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [Dwork et al. 2014].
The default privacy budget for this generator is a fixed (ε, δ) pair.
Differential privacy is a property of a randomized algorithm M, which takes as input a database D and produces some output M(D). The outputs could be counts, summary statistics, or synthetic databases; the specific type is not important for this formulation.
For this formulation, we say two databases D and D′ are neighbors if they differ by a single row.
For a given ε ≥ 0, we say that M is ε-differentially private if, for all subsets of outputs S, we have:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

When the relaxation parameter δ (defined below) is non-zero, this is sometimes called approximate differential privacy.
The parameter ε is the privacy budget of the algorithm, and quantifies in a precise sense an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database.
Suppose an attacker suspects that our secret database is one of two possible neighboring databases D and D′, with some fixed odds.
If M is ε-differentially private, then observing M(D) updates the attacker's log odds of D vs D′ by at most ε.
The closer ε is to 0, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from M(D).
Conversely, larger values of ε mean that an attacker can possibly learn significant information by observing M(D).
Dwork, McSherry, Nissim, and Smith introduced the Laplace Mechanism [Dwork et al. 2006] as a way to publish these counts in a secure way, by adding noise sampled from the Laplace distribution.
This noise affords us plausible deniability. If the underlying count c changes by 1, then the probability of observing the same noisy output does not change by much:

Pr[c + Lap(1/ε) ∈ S] ≤ e^ε · Pr[(c + 1) + Lap(1/ε) ∈ S]

(The original page illustrates this with a plot of the probability density functions of the observed noisy values for three neighboring true counts.)
The noisy output densities for neighboring counts lie within a factor of e^ε of each other, so this mechanism is ε-differentially private.
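The e^ε guarantee of the Laplace mechanism can be checked numerically from the density of the Laplace distribution. In this sketch (illustrative values only), the ratio of output densities for two neighboring counts never exceeds e^ε at any point:

```python
import math

def laplace_pdf(x, center, epsilon):
    """Density of center + Lap(1/epsilon) evaluated at x."""
    return (epsilon / 2) * math.exp(-epsilon * abs(x - center))

epsilon = math.log(3)  # the bound is then a factor of 3

# Ratio of output densities for neighboring true counts 0 and 1,
# evaluated on a grid of points. It is bounded by e^epsilon everywhere.
ratios = [
    laplace_pdf(x / 10, 0, epsilon) / laplace_pdf(x / 10, 1, epsilon)
    for x in range(-100, 100)
]
assert max(ratios) <= math.exp(epsilon) + 1e-9
```

The bound follows directly from the triangle inequality on the exponents: |x - 1| - |x| never exceeds 1, so the density ratio never exceeds e^ε.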
For example, the AnalyzeGauss mechanism of [Dwork et al. 2014], and the differentially private gradient descent of [Abadi et al. 2016], use Gaussian noise as a fundamental ingredient, which requires the following relaxation:
For a given ε ≥ 0 and δ ≥ 0, we say that M is (ε, δ)-differentially private if, for all subsets of outputs S, we have:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ

The parameter δ is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism that reveals a sensitive database with probability δ, in practice this is not a plausible outcome with carefully designed mechanisms. Also, taking δ to be small relative to the size of the database ensures that the risk of disclosure is low.
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI:
Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. 2006 Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography. (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI:
Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI:
Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI:
Generates a random string to replace a specific part of a mailing address. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
Yes
Yes if not consistent
Uses deep neural networks for high-fidelity data mimicking. By default, not available. Privacy ranking: 3
No
No
No
Identifies the algebraic relationship between 3 or more numeric values (at least one non-integer) and generates new values to match. Privacy ranking: 3
No
Yes
No
Generates unique alphanumeric strings of the same length as the input. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Within an array, replaces letters with random other letters, and numbers with random other numbers. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Runs a selected generator on values that match a user-specified JSONPath. Privacy ranking: 5
--
--
--
Runs a selected generator on values that match a regular expression. Privacy ranking: 5
--
--
--
Generates unique alpha-numeric strings based on any printable ASCII characters. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates a random company name-like string.
Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Creates values at the same frequency as the values in the underlying data. Privacy ranking: - 2 if differential privacy enabled - 3 if differential privacy not enabled
No
Yes
Configurable
Replaces letters with random other letters and numbers with random other numbers. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Replaces characters randomly, but preserves formatting. Privacy ranking: 4
Yes - Implicitly consistent
No
No
Company Name (Deprecated)
This generator is deprecated. Use the Business Name generator instead. Generates a random company name-like string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Applies different generators to rows conditionally based on any value in the table. Privacy ranking: If a fallback generator is selected, then the lower of either 5 or the fallback generator. 5 if no fallback generator is selected.
No
No
No
Uses a single specified value to mask all values in the column. Data-free. Privacy ranking: 1
No
No
Yes
Generates a continuous distribution to fit the underlying data. Privacy ranking: - 2 if differential privacy enabled - 3 if differential privacy not enabled
No
Yes
Configurable
Populates the column using the sum of the values in other columns. Privacy ranking: 3
No
No
No
Masks a text column.
Parses the text as a row for which the columns are delimited by a specified character. Privacy ranking: 5
--
--
--
Selects from values you provide. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Truncates dates or timestamps to a specific date or time part. Privacy ranking: 5
No
No
No
Scrambles characters in an email address. Preserves the formatting and keeps the @ and . characters. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates timestamps that fit an event distribution. Privacy ranking: 3
No
Yes
No
Scrambles characters in a file name.
Preserves the formatting and the file extension. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Replaces all instances of the find string with the replace string. Privacy ranking: 5
No
No
No
Transforms Norwegian national identity numbers. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self or other
No
No
Masks columns that contain latitude and longitude values. Privacy ranking: 3
No
No
No
Can be used to generate cities, states, zip codes, and latitude/longitude values that follow HIPAA guidelines for safe harbor. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates random host names, based on the English language. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Runs selected generators on specified key values in an HStore column in a PostgreSQL database. Privacy ranking: 5
--
--
--
Masks text columns.
Parses the contents as HTML, and applies sub-generators to the specified path expressions. Privacy ranking: 5
--
--
--
Generates unique integer values.
By default, the generated values are within the range of the column’s data type.
You can also specify a range for the generated values. The source values must be within that range. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self
No
Yes if not consistent
Generates a random IP address-formatted string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Runs a generator on values that match a user specified JSONPath. Privacy ranking: 5
--
--
--
Generates a random MAC address formatted string. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self
No
Yes if not consistent
Generates unique MongoDB objectId values. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Generates a random name string from a dictionary of first and last names. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Masks values in numeric columns.
Adds or multiplies the original value by random noise. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self or other
No
No
Generates NULL values to fill the rows of the specified column. Data-free. Privacy ranking: 1
No
No
Yes
Generates unique numeric strings of the same length as the input. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Default generator. Does not perform any action on the source data. Privacy ranking: 6
No
No
No
Generates a random phone number that matches the country or region and format of the input phone number. Privacy ranking: 3
Yes - Self
No
No
Generates a random boolean value. Data-free. Privacy ranking: 1
No
No
Yes
Generates a random double number between the specified min and max. Data-free. Privacy ranking: 1
No
No
Yes
Generates a random hash string. Data-free. Privacy ranking: 1
No
No
Yes
Returns a random integer between the specified min and max. Data-free. Privacy ranking: 1
No
No
Yes
Generates random dates, times, and timestamps. Data-free. Privacy ranking: 1
No
No
Yes
Generates a random new UUID string. Data-free. Privacy ranking: 1
No
No
Yes
Uses regular expressions to parse strings.
Replaces specified substrings with output from selected sub-generators. Privacy ranking: 5
--
--
--
Generates a column of unique integer values that start with a specified value and increment by 1. Privacy ranking: 3
No
Yes
No
Generates values of ISO 6346 compliant shipping container codes. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Generates a new valid Canadian Social Insurance Number. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self
No
Yes if not consistent
Generates a new valid United States Social Security Number. Data-free if not consistent. Privacy ranking: - 1 if not consistent - 4 if consistent
Yes - Self or other
No
Yes if not consistent
Can apply other generators on specific StructFields within a StructType in Spark databases (Databricks and Amazon EMR). Privacy ranking: 5
--
--
--
Shifts timestamps by a random amount of a specific unit of time, within a set range. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self or other
No
No
Generates unique email addresses.
Replaces the username with a randomly generated GUID, and masks the domain with a character scramble. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
A substitution cipher that preserves formatting but keeps the URL scheme and top-level domain intact. Privacy ranking: 3
No
No
No
Generates UUIDs on primary key columns. Privacy ranking: - 3 if not consistent - 4 if consistent
Yes - Self
No
No
Runs a selected generator on values that match a user-specified XPath. Privacy ranking: 5
--
--
--
Consistency
Maps the same input values to the same output values across multiple columns, tables, and databases.
Linking
Identifies columns that use the same generator and that are interdependent or correlated.
Differential privacy
Ensures that the output does not reveal anything that is attributable to a specific member of the source data.
Data-free generators
Indicates that the generator output is completely unrelated to the input.
Column partitioning
Bases the value of a column on other related columns.
Uniqueness constraints
Generators that you can use on columns that have uniqueness constraints.
Format-preserving encryption (FPE)
Encrypts data in such a way that the output is in the same format as the input.
Option | Date value | Timestamp value |
---|---|---|
Original value | 2021-12-20 | 2021-12-20 13:42:55 |
Truncate to year | 2021-01-01 | 2021-01-01 00:00:00 |
Truncate to month | 2021-12-01 | 2021-12-01 00:00:00 |
Truncate to day | 2021-12-20 | 2021-12-20 00:00:00 |
Truncate to hour | Not applicable | 2021-12-20 13:00:00 |
Truncate to minute | Not applicable | 2021-12-20 13:42:00 |
Truncate to second | Not applicable | 2021-12-20 13:42:55 |
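The truncation rules in the table can be sketched as follows (illustrative only): every date or time part finer than the selected unit is reset to its minimum value.

```python
from datetime import datetime

def truncate(ts: datetime, part: str) -> datetime:
    """Truncate a timestamp to the given date or time part."""
    parts = ["year", "month", "day", "hour", "minute", "second"]
    keep = parts.index(part) + 1
    fields = [ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second]
    defaults = [1, 1, 1, 0, 0, 0]  # January 1, midnight
    # Keep the coarse fields; reset everything finer to its minimum.
    return datetime(*(fields[:keep] + defaults[keep:]))

ts = datetime(2021, 12, 20, 13, 42, 55)
assert truncate(ts, "month") == datetime(2021, 12, 1, 0, 0, 0)
```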
Partitioning allows the value of a column to be based on the values of other related columns. It is one way to generate more realistic destination values.
The following generators support partitioning:
Note that partitioning cannot be configured as part of a generator preset. You can only configure partitioning when you configure a specific column.
To enable partitioning, from the Partition by dropdown list, you choose one or more columns to partition by.
You can only choose columns that have the generator set to Passthrough or Categorical.
For each value or combination of values in the partitioning columns, Tonic Structural generates a distribution of values for the original column.
For example, you assign the Continuous generator to an Income column, and partition it by an Occupation column. For each Occupation value, Structural generates a distribution of Income values. In other words, it generates a range of incomes for each occupation, such as Doctor and Construction Worker.
If you choose multiple columns, then the distribution is for each combination of column values. For example, you partition by both Occupation and Region. Structural creates a distribution of income values for each combination of occupation and region. So there is a distribution for Doctor and Northeast, and a different distribution for Doctor and Southeast.
In the destination database, Structural sets the value of the partitioned column to a value from the appropriate distribution. The distribution that Structural uses is based on the value of the partitioning columns in the destination database, not the original value of the partitioning columns in the source database.
To continue our example, assume that the Occupation column uses the Categorical generator. During data generation, Structural assigns to each record a random occupation value from the current values. For one of the records, the occupation value is Doctor in the source database and Construction Worker in the destination database.
For the Income column for that record, Structural assigns a value from the distribution of income values for the Construction Worker occupation. In other words, it assigns an income value that is realistic for the destination occupation value based on the source data.
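The Occupation and Income example can be sketched as follows. This is an illustrative model, not Structural's implementation; here each partition's distribution is simply the observed range of incomes for that occupation.

```python
import random
from collections import defaultdict

# Toy source table: (occupation, income).
source = [
    ("Doctor", 220_000), ("Doctor", 180_000), ("Doctor", 200_000),
    ("Construction Worker", 55_000), ("Construction Worker", 48_000),
]

# Build one income distribution per partitioning value.
by_occupation = defaultdict(list)
for occupation, income in source:
    by_occupation[occupation].append(income)

def sample_income(destination_occupation, rng=random.Random(0)):
    """Draw an income from the distribution for the *destination* occupation."""
    incomes = by_occupation[destination_occupation]
    return rng.uniform(min(incomes), max(incomes))

# A record whose occupation becomes Construction Worker in the destination
# gets an income realistic for that occupation, not its source occupation.
income = sample_income("Construction Worker")
assert 48_000 <= income <= 55_000
```

The key point the sketch captures is that the distribution is chosen by the destination value of the partitioning column, so the generated income always matches the generated occupation.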
The partitioning option works well when you partition by only one or two columns.
To create a more complex model across several columns, instead of partitioning, use the AI Synthesizer.
A column that has a uniqueness constraint must have a unique value for every record.
Primary key columns automatically require uniqueness. Uniqueness can also be required for other columns. For example, in a users table, userid is the primary key column, but username must also be unique.
The following generators can be used with columns that have uniqueness constraints:
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | Yes, can be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | Yes, can be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | Yes |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Determined by the specified sub-generators. |
Linking | Determined by the specified sub-generators. |
Differential privacy | Determined by the specified sub-generators. |
Data-free | Determined by the specified sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | Yes |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | Yes, can be linked. |
Differential privacy | Configurable |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | This generator is implicitly self-consistent. You do not specify whether the generator is consistent. Every occurrence of a character always maps to the same substitute character. Because of this, it can be used to preserve a join between two text columns, such as a join on a name or email. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 4 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Determined by the selected generators. |
Linking | Determined by the selected generators. |
Differential privacy | Determined by the selected generators. |
Data-free | Determined by the selected generators. |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | Yes, can be linked. |
Differential privacy | Configurable |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | Yes, can be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | Yes, can be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | Yes, can be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | Yes, can be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | Yes |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. Note that all Name generator columns that have the same consistency configuration are automatically consistent with each other. The columns must either be all self-consistent or all consistent with the same other column. For example, you can use this to ensure that a first name and last name column value always match the first name and last name in a full name column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | Yes |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 6 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | Yes |
Data-free | Yes |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 1 |
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | Yes, can be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No, cannot be made differentially private. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | Yes, if consistency is not enabled. |
Data-free | Yes, if consistency is not enabled. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent or consistent with another column. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | No, cannot be made consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | No |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 3 |
Generator ID (for the API) |
Consistency | Yes, can be made self-consistent. |
Linking | No, cannot be linked. |
Differential privacy | No |
Data-free | No |
Allowed for primary keys | Yes |
Allowed for unique columns | Yes |
Uses format-preserving encryption (FPE) | Yes |
Privacy ranking |
|
Generator ID (for the API) |
Consistency | Determined by the selected sub-generators. |
Linking | Determined by the selected sub-generators. |
Differential privacy | Determined by the selected sub-generators. |
Data-free | Determined by the selected sub-generators. |
Allowed for primary keys | No |
Allowed for unique columns | No |
Uses format-preserving encryption (FPE) | No |
Privacy ranking | 5 |
Generator ID (for the API) |
Most Tonic Structural generators consume source data and perform an operation on it to produce destination data. For example, the Character Scramble generator takes the original data from the source database, replaces the letters and numbers with random letters and numbers, and then writes the result to the destination database.
Composite generators do not generate data directly.
Structural provides several composite generators.
Most composite generators treat the input as structured data that the generator parses using a domain-specific syntax, such as:
XPath for XML or HTML
JSONPath for JSON or a Spark StructType
Regular expressions for text
These generators allow you to select a sub-value of the input, and then configure a specific generator to apply to only that sub-value. This means that you can take your original structured data and selectively mask the content.
For example, for the following structured content:
{ name: { first: "Tj", last: "Bass" } }
You indicate that Structural should use the Name generator to replace the value of last. The result is something like:
{ name: { first: "Tj", last: "Pine" } }
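The selective replacement above can be sketched in plain Python. This is an illustration only: Structural evaluates a JSONPath expression itself, and the mask_path helper below is invented for the sketch.

```python
import json

# Illustrative sketch: walk a hand-written path to one sub-value and replace
# only that value, leaving the rest of the document intact. The mask_path
# helper is invented for this example; it is not part of Structural.

def mask_path(doc, path, fn):
    node = doc
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = fn(node[path[-1]])
    return doc

doc = json.loads('{"name": {"first": "Tj", "last": "Bass"}}')
# Stand-in for the Name generator: always returns "Pine".
mask_path(doc, ["name", "last"], lambda original: "Pine")
print(json.dumps(doc))
# {"name": {"first": "Tj", "last": "Pine"}}
```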
The Conditional generator is slightly different. It applies a specific generator when the column value matches a specified condition. For example, you can apply a Character Scramble generator only if the column value is something other than "test".
You cannot configure generator presets for composite generators from the Generator Presets view. The Generator Presets view does not have access to data to use for path expressions or conditions. From a column configuration panel, you can save the current configuration as the new baseline configuration, and reset the configuration to the current baseline.
For any composite generator, when you select the generator to apply to a selected sub-value or based on a specified condition, you cannot select another composite generator. For example, you cannot apply a Conditional or XML Mask generator to the value of a specified path expression.
For composite generators other than the Conditional or Regex Mask generators, you cannot configure a sub-generator to be consistent with another column.
Format-preserving encryption (FPE) encrypts data in such a way that the output has the same format as the input. For example, a number in the input produces a number in the generated output.
For the following generators, Tonic Structural uses FPE to encrypt the generated values. Note that the Structural implementation of FPE might not guarantee compliance with standards. For example, the ASCII Key generator does not guarantee that the length of the output data matches the length of the input data.
Each generator supports a specific input character set or domain.
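As a toy illustration of the format-preserving property only (this is not Structural's implementation and not a secure cipher; real FPE schemes such as NIST FF1 are far more involved), the sketch below maps a digit string to another digit string of the same length, reversibly under a key:

```python
import hashlib

# Toy keyed shift cipher over the digit domain, for illustration only.
# It demonstrates two FPE properties: digits in, the same number of digits
# out, and the mapping is reversible with the key. It is NOT secure.

def _offset(key: bytes, length: int) -> int:
    digest = hashlib.sha256(key + length.to_bytes(2, "big")).digest()
    return int.from_bytes(digest, "big") % 10**length

def encrypt_digits(plain: str, key: bytes) -> str:
    if not plain.isdigit():
        # Out-of-domain input is exactly what causes encryption errors.
        raise ValueError("value is outside the generator's domain (digits only)")
    n = len(plain)
    return str((int(plain) + _offset(key, n)) % 10**n).zfill(n)

def decrypt_digits(cipher: str, key: bytes) -> str:
    n = len(cipher)
    return str((int(cipher) - _offset(key, n)) % 10**n).zfill(n)

key = b"example-key"
token = encrypt_digits("4125551234", key)
assert len(token) == 10 and token.isdigit()        # format preserved
assert decrypt_digits(token, key) == "4125551234"  # reversible with the key
```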
If you see encryption errors, it probably means that the column contains values that are incompatible with the selected generator. To address this, choose a different generator.
When a generator attempts to process data that is not within the expected domain, it results in encryption errors. For example, the generator cannot process a string that includes non-numeric characters such as letters or symbols. The generator cannot process any value that is not a valid UUID.
One option is the generator, which has very few restrictions on the allowed values.
Another option is to use the generator, which allows you to assign different generators based on column values.
Generators that are applied to primary key columns are different from other generators in the following ways:
The generated data must be unique, in order to not break constraints.
The generators are consistent (same input → same output), so that when the same generator is applied to a primary key column and its linked foreign key columns, no links are broken.
This is accomplished using format-preserving encryption.
For more information on this, and details on how to provide your own encryption key, contact support@tonic.ai.
You apply a primary key generator in the same way as you do any other generator.
Tonic Structural then automatically applies the same generator to all foreign key columns that reference the primary key.
Foreign keys are either defined by the source schema or added from the Foreign Key Relationships page. For more information, go to Viewing and adding foreign keys.
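The idea can be sketched as follows. The hash-based transform_key function is an invented stand-in for Structural's encryption-based key generator; the point is only that applying the same deterministic transform to a primary key and to every foreign key that references it keeps the join intact. (A hash can in principle collide; real FPE is a bijection, so uniqueness is guaranteed.)

```python
import hashlib

# Illustrative only: transform_key is a deterministic stand-in for the real
# key generator. Because the same input always yields the same output,
# transformed foreign keys still point at the transformed primary keys.

def transform_key(value: int, key: bytes = b"demo") -> int:
    digest = hashlib.sha256(key + str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

users = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]

masked_users = [{**u, "id": transform_key(u["id"])} for u in users]
masked_orders = [{**o, "user_id": transform_key(o["user_id"])} for o in orders]

masked_ids = {u["id"] for u in masked_users}
assert all(o["user_id"] in masked_ids for o in masked_orders)  # joins intact
```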
Structural currently supports the following generators for primary key columns:
ASCII Key: The ASCII Key generator does not preserve the format of the input value. It uses the ASCII alphabet for input and the alphanumeric alphabet for output. This leads to output values that are longer than the input values.
If you need support for additional types, contact support@tonic.ai.
Primary key generators are not supported in the Scale table mode. The process requires control over the key columns to make sure that all of the relationships are maintained.
You also cannot assign a primary key generator on a table that is related to a Scale mode table through a foreign key.
These hints and tips can help you to choose generators and address some specific use cases.
Tonic Structural provides several options for de-identifying the names of individuals. The method that you select depends on the specific use case, including the required realism of the output and privacy needs.
The following are a few of the generator options and how and why you might use them.
Name generator: Randomly returns a name from a dictionary of primarily Westernized names, unrelated to the original value. Can provide complete privacy, unless you enable consistency. The output is realistic because the returned values are real names.
Categorical generator: Shuffles all of the values in the field while preserving the overall frequency of the values. It ensures that the output contains realistic-looking names, and that the output uses the names from the original data set. This can be beneficial if the original data contains, for example, names that are common to a particular region and that should be maintained. When you use this generator with the Differential Privacy option, it ensures that the output is secure from re-identification. However, if the source data set is small, or each name is highly unique, Structural might prevent you from using this option.
Custom Categorical generator: Allows you to provide your own dictionary of values. These values are included in the output at the same frequency that the original values occur in the source data.
Character Scramble generator: Randomly replaces characters with other characters. The output does not provide realistic-looking names, but it provides a high level of privacy that prevents recovery of the original data. It does preserve whitespace, punctuation (such as hyphenated names), and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
Character Substitution generator: Similar to Character Scramble, but uses a single character mapping throughout the generated data. This reduces the privacy level, but ensures consistency and uniqueness. This generator also supports additional Unicode blocks, so that the output characters more closely match the input. This might be helpful if the input includes names with characters outside of the basic Latin (a-z, A-Z) characters.
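The difference a single fixed mapping makes can be sketched like this (illustrative only; the seeded shuffle below is not Structural's mapping):

```python
import random

# One fixed character mapping, Character Substitution style: every occurrence
# of a character maps to the same substitute, so equal inputs produce equal
# outputs and joins on the column survive. Characters outside the mapping
# (spaces, hyphens, digits) pass through unchanged.

lower = "abcdefghijklmnopqrstuvwxyz"
rng = random.Random(42)  # fixed seed = one fixed mapping for the whole run
shuffled = "".join(rng.sample(lower, len(lower)))
MAPPING = str.maketrans(lower + lower.upper(), shuffled + shuffled.upper())

def substitute(value: str) -> str:
    return value.translate(MAPPING)

assert substitute("Anna") == substitute("Anna")         # deterministic
assert substitute("Anna")[0] == substitute("Alice")[0]  # same map for 'A'
assert substitute("Mary-Jane")[4] == "-"                # punctuation kept
```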
Rows of data often have multiple date or timestamp fields that have a logical dependency, such as START_DATE and END_DATE.
In this case, a randomly generated date is not viable, because it could produce nonsensical output in which events occur out of chronological order.
The following generator options handle these scenarios:
Timestamp Shift generator (with Consistency)
To solve the problem described above, you ensure that two or more timestamps are randomly shifted by the same amount, instead of independently of each other.
The key is to use the consistency option.
For example, a row of data represents an individual who is identified by a primary key of PERSON_ID. The row also contains START_DATE and END_DATE columns. You can apply a timestamp shift to the START_DATE and END_DATE columns within a desired range, and make both columns consistent with PERSON_ID.
Whenever the generator encounters the same PERSON_ID value, it shifts the dates by the same amount.
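A minimal sketch of a consistent timestamp shift, assuming a hypothetical shift_days helper that derives the shift deterministically from PERSON_ID (this is not Structural's implementation):

```python
import hashlib
from datetime import datetime, timedelta

# Illustrative: shift_days derives a deterministic shift in the range
# [-max_days, +max_days] from PERSON_ID, so START_DATE and END_DATE for the
# same person always move by the same amount, preserving their ordering and
# the interval between them.

def shift_days(person_id: int, max_days: int = 30, key: bytes = b"demo") -> int:
    digest = hashlib.sha256(key + str(person_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2 * max_days + 1) - max_days

def shift(ts: datetime, person_id: int) -> datetime:
    return ts + timedelta(days=shift_days(person_id))

start, end = datetime(2023, 3, 1), datetime(2023, 3, 15)
new_start, new_end = shift(start, person_id=42), shift(end, person_id=42)

assert new_end - new_start == end - start  # interval and ordering preserved
```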
Event Timestamps generator: You can apply the Event Timestamps generator to multiple date columns on the same table. You can link the columns so that they follow the underlying distribution of dates. For more information, go to the blog post Simulating event pipelines for fun and profit (and for testing too).
Date Truncation generator: This generator can sometimes address the described problem. You can configure it to truncate the input to the year, month, day, hour, minute, or second. It guarantees that a secondary event does not occur before a primary event. However, truncation might cause both events to become the same date or timestamp value. Whether you can use this generator for this purpose depends on the typical time separation between the two events relative to the truncation option, and whether truncation provides an adequate level of privacy for the particular use case.
Free text refers to text fields in the source database that might come from an "uncontrolled" source such as user text entry. In these cases, any record might or might not contain sensitive information.
Some possible examples include:
Notes from a doctor or healthcare provider that contain Protected Health Information (PHI)
Other personally identifiable information, such as a Social Security number or telephone number, that a user enters into an open-ended text entry form
Structural provides several suitable options. The method that you select depends on the specific use case, including the required realism of the output and any privacy requirements.
Here are a few generator options for free text fields, with information on how and why you might use them.
Character Scramble generator: Randomly replaces characters with other characters. The output does not contain meaningful text, but it provides a high level of privacy that prevents recovery of the original data. The Character Scramble generator does preserve whitespace, punctuation, and capitalization. Because it is a character-level replacement, it preserves the length of the input string.
Regex Mask generator: Uses regular expressions to parse strings, and then replaces specified substrings with the output of selected generators. The parts of the string to replace are specified in unnamed top-level capture groups. The Regex Mask generator can preserve more of the realism of the underlying text, but introduces privacy risks. Any sensitive information that does not conform to a known and configured pattern is not captured and replaced.
As an example of matching specific formats, a configuration that includes the following two patterns replaces both telephone numbers that use the ###-###-#### format and SSNs that use the ###-##-#### format, but leaves the surrounding text unmodified:
SSN: ([0-9]{3}-[0-9]{2}-[0-9]{4})
Telephone Number: ([0-9]{3}-[0-9]{3}-[0-9]{4})
You can configure multiple regular expression patterns to handle all known or expected sensitive information formats. You cannot use this method to replace values that you cannot use a regular expression to reliably identify, such as names within free text.
When you use this option, make sure to enable Replace all matches for each pattern.
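A minimal sketch of this approach, using the two patterns above (illustrative; in Structural each capture group is replaced by the output of a configured generator, while here a fixed token stands in):

```python
import re

# Each top-level capture group would normally be replaced by a configured
# generator's output; a fixed token stands in for that here.

SSN = re.compile(r"([0-9]{3}-[0-9]{2}-[0-9]{4})")
PHONE = re.compile(r"([0-9]{3}-[0-9]{3}-[0-9]{4})")

def mask(text: str) -> str:
    # "Replace all matches": sub() rewrites every occurrence, not just the first.
    text = SSN.sub("XXX-XX-XXXX", text)
    return PHONE.sub("XXX-XXX-XXXX", text)

note = "Patient SSN 123-45-6789, callback 415-555-0199 tomorrow."
print(mask(note))
# Patient SSN XXX-XX-XXXX, callback XXX-XXX-XXXX tomorrow.
```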
Constant, Custom Categorical, and Null generators: Each of these options provides the highest level of privacy, because they completely remove or replace the original text. You might use each one for different reasons:
Null: If the field is nullable and the use case does not require any data in the field, you can use the Null generator to replace the values with NULL.
Constant: Allows you to provide a fixed value to replace all of the source values. For example, you could provide a "Lorem ipsum" string or another dummy value that is appropriate for your data set.
Custom Categorical: Similar to the Constant generator, it replaces the original value with a fixed value. To increase the cardinality of the output, you enter a list of possible values. The values are randomly used on the output records.
Most Structural generators preserve NULL values that are in the data.
They do not automatically preserve empty values.
To make sure that any empty values stay empty in the destination database:
Assign the Conditional generator to the column.
For the default generator, select the generator to apply to the non-empty values.
Create a condition to look for empty values. You can either:
Use the regex comparison against the regex whitespace value (\s*).
Use the = operator and leave the value empty or empty except for a single space.
If you are not sure which characters the empty strings use, the regex option is more flexible. However, it is less efficient.
For the empty value condition, set the generator to Passthrough.
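The steps above can be sketched as follows (illustrative; the scramble function is a stand-in for whatever default generator you select, not Structural's engine):

```python
import random
import re
import string

# Conditional pattern: NULL and empty/whitespace values pass through
# unchanged, everything else goes to the default generator stand-in.

EMPTY = re.compile(r"^\s*$")

def scramble(value: str) -> str:
    # Stand-in for the default generator applied to non-empty values.
    return "".join(random.choice(string.ascii_lowercase) if c.isalpha() else c
                   for c in value)

def conditional(value):
    if value is None or EMPTY.match(value):
        return value  # Passthrough: empties stay empty in the destination
    return scramble(value)

assert conditional("") == ""
assert conditional("   ") == "   "
assert conditional(None) is None
assert len(conditional("Alice")) == 5
```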
You sometimes might want to apply the same generator to all of the text values in a JSON, HTML, or XML value. For example, you might want to apply the Character Scramble generator to all of the text.
Instead of creating separate path expressions for each path, you can use one or two path expressions that capture all of the values.
For the Array JSON Mask or JSON Mask generator, the path expression $..* captures all of the text values. You can then select the generator to apply to the values.
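A plain-Python equivalent of what the $..* wildcard accomplishes for JSON (illustrative; Structural evaluates the JSONPath itself): every string value anywhere in the document is masked, while keys, structure, and non-string values are preserved.

```python
import json

# Recursively visit every value; apply the masking function to strings only.
def mask_all_strings(node, fn):
    if isinstance(node, dict):
        return {k: mask_all_strings(v, fn) for k, v in node.items()}
    if isinstance(node, list):
        return [mask_all_strings(v, fn) for v in node]
    if isinstance(node, str):
        return fn(node)  # stand-in for the selected generator
    return node

doc = json.loads('{"name": {"first": "Tj", "last": "Bass"}, "age": 41}')
masked = mask_all_strings(doc, lambda s: "*" * len(s))
print(json.dumps(masked))
# {"name": {"first": "**", "last": "****"}, "age": 41}
```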
For the HTML Mask and XML Mask generators, you create two path expressions:
//text() gets all of the text nodes.
//@* gets all of the attribute values.
You apply the generator to each expression.
Sub-generators are applied sequentially. You can apply the wildcard paths in addition to more specific paths and generators.
For example, one path expression references a specific name or address and uses the Name or Address generator. The wildcard path expressions use the Character Scramble generator to mask any unknown fields in the document that could contain sensitive information.
As another example, you might assign the Passthrough generator to specific known fields that never contain sensitive information.
When your XML includes namespaces, then to include the namespaces in the path expression, specify the elements as:
For example, for the following XML:
A working XPath to mask the name value is:
You might sometimes set default date values to the absolute minimum and maximum values that are allowed by the database. For example, for SQL Server, these values are January 1, 1753 and December 31, 9999.
When you assign the Timestamp Shift generator, the minimum value cannot be shifted backward and the maximum value cannot be shifted forward.
To skip those default values and shift the other values:
Assign the Conditional generator to the column.
For the default generator, select the Timestamp Shift generator.
Create conditions to look for the minimum or maximum values.
For those conditions, set the generator to Passthrough.
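The logic of the steps above can be sketched as follows, assuming SQL Server's datetime bounds. Values at the bounds fall through unchanged (Passthrough), and everything else is shifted; the shift range here is illustrative.

```python
import datetime
import random

# SQL Server's allowed datetime range; values at these bounds cannot be
# shifted further backward or forward.
MIN_DT = datetime.datetime(1753, 1, 1)
MAX_DT = datetime.datetime(9999, 12, 31)

def shift_timestamp(value: datetime.datetime) -> datetime.datetime:
    """Mimic the Conditional generator: pass sentinel values through,
    shift everything else by a random number of days."""
    if value in (MIN_DT, MAX_DT):
        return value  # Passthrough condition
    return value + datetime.timedelta(days=random.randint(-30, 30))

print(shift_timestamp(MIN_DT) == MIN_DT)  # True
```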
You might sometimes want to add values that are the output of a generator to the results of the transformation by another generator.
For example, you use Character Scramble to mask a username. You might also want to prefix the value with a fixed constant value, or append a sequential integer.
To accomplish this:
Apply the Regex Mask generator to the column.
In addition to the capture groups that are specific to your data:
Use (^) as a capture group for a prefix.
Use ($) as a capture group for a suffix.
Use () as an empty group at any point in the regex pattern.
Apply the relevant generators to each capture group.
So to implement the example above (prefix with a constant, scramble the value, append a sequential integer), you provide the expression (^)(.*)()($).
This produces four capture groups:
Group 0 is for the prefix. You assign the Constant generator and provide the value to use as the prefix.
Group 1 captures all of the original values. You assign the Character Scramble generator.
Group 2 captures any empty values. You assign the Constant generator to provide a value to use for those values.
Group 3 is for the suffix. You assign the Sequential Integer generator.
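The group assignments above can be sketched with Python's re module. Note that Tonic's UI numbers the groups from 0, while Python numbers capture groups from 1; the reversal stands in for the Character Scramble generator, and the "user_" prefix is a hypothetical constant.

```python
import itertools
import re

counter = itertools.count(1)              # stand-in for Sequential Integer
PATTERN = re.compile(r"(^)(.*)()($)")

def scramble(text: str) -> str:
    # Stand-in for the Character Scramble generator.
    return text[::-1]

def mask_username(value: str) -> str:
    m = PATTERN.fullmatch(value)
    # Python group 1 = prefix position -> Constant ("user_")
    # Python group 2 = original value   -> Character Scramble
    # Python group 3 = unused empty group
    # Python group 4 = suffix position  -> Sequential Integer
    return "user_" + scramble(m.group(2)) + str(next(counter))

print(mask_username("alice"))  # user_ecila1
print(mask_username("bob"))    # user_bob2
```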
Required workspace permission: Configure column generators
The Tonic Structural sensitivity scan identifies specific types of sensitive data. For each sensitivity type that it detects, Structural can have a recommended generator. For example, for a value that the sensitivity scan identifies as a Social Security Number, Structural recommends the SSN generator. For a first name, Structural recommends the Name generator configured with First as the value type.
From Privacy Hub and Database View, you can review and apply the recommended generators.
In Privacy Hub, on the settings view of the column details panel, for a detected sensitive column that does not have an applied generator, and that has a recommended generator, Structural displays a button for the recommended generator.
To apply the recommended generator, click the button.
On Database View, for a detected sensitive column that does not have an applied generator, the generator name tag displays the type of sensitive data, such as a first name or an email address.
To apply the recommended generator:
Click the generator name tag.
On the recommended generator panel, click Apply recommendation.
When there are detected sensitive columns that are not protected, Privacy Hub displays a Sensitivity Recommendations banner. The banner displays the number of detected, unprotected columns.
To review the recommended generators, and determine whether to apply them, click Review Recommendations.
The Recommended Generators by Sensitivity Type panel displays the list of sensitivity types for which there are detected, unprotected columns.
To display the columns for a sensitivity type, click the expand icon for that type.
To hide the column list, click the collapse icon.
For each column, the list includes the following information:
The table and schema name
The column name, with the column data type
An example value from the source data (Original Data), with a corresponding destination value when the recommended generator is applied (Expected Output).
To display a larger sample of source and destination values, click the view icon in the Expected Output column.
To filter the lists, you can use either:
Schema name
Table name
Column name
Start to type text in the schema, table, or column name. As you type, Structural applies the filter to all of the lists.
When you first display the panel, all of the columns are selected. The selected columns are the columns that are affected when you apply recommended generators or ignore columns.
Within each sensitivity type, you can select or deselect individual columns.
You can use the checkbox in the column heading to select or deselect all of the columns for a sensitivity type.
To apply the recommended generator to the selected columns for a sensitivity type, click the Apply option for that sensitivity type.
When you apply the recommended generator, Structural removes the column from the list.
If the recommended generator is incorrect, then you can ignore the recommendation.
To ignore the recommended generator for the selected columns in a sensitivity type:
Click the Ignore option for the sensitivity type.
In the Ignore dropdown list, click Ignore generator recommendation.
When you ignore the generator recommendation:
The column is removed from the list.
The recommended generator is removed. This includes the recommendation on the Privacy Hub column configuration panel.
The column continues to be marked as sensitive.
Required workspace permission: Configure column sensitivity
You can mark selected columns for a sensitivity type as not sensitive. For example, a value might be correctly identified as a first name, but be a test value that is not actually sensitive and does not need to be transformed.
To mark selected columns in a sensitivity type as not sensitive:
Click the Ignore option for the sensitivity type.
In the Ignore dropdown list, click Mark as not sensitive.
When you mark a column as not sensitive, it is removed from the list.
To apply the recommended generators to all of the selected columns across all of the sensitivity types, click Apply All.
On Database View, the Bulk Edit option includes an option to apply the recommended generators to the selected columns for which there is an available recommendation.
From Database View, to apply recommended generators to multiple columns:
Check the checkbox for each column to update.
Click Bulk Edit.
On the bulk editing panel, click Apply Recommendations.
Privacy Hub, Database View, and Table View all provide an option to assign a generator to a column.
For self-hosted Enterprise instances, the selected generator is a generator preset. A generator preset provides a specific configuration for a generator. Whenever a user selects the preset, the generator automatically uses the saved configuration for the preset, which we call the baseline configuration. Tonic Structural provides a built-in preset for most generators. You can also create custom presets.
After you select the preset, you can:
Override the baseline generator preset configuration. For example, if the built-in preset for the Name generator uses the First Last format, but the column contains a first name, you can change the format to First.
Remove the overrides to the baseline configuration.
Save the updated configuration as the new baseline for the generator preset.
Save the updated configuration as a new custom generator preset.
For more information about generator presets, go to Managing generator presets.
Required license to manage generator presets: Enterprise
For Basic and Professional instances, users select and configure generators separately for each column.
Required workspace permission: Configure column generators
From the Generator Type dropdown, select the generator to assign to the column.
The list contains the names of the generators that can be applied to the column.
Use the filter field to search by generator name.
For self-hosted Enterprise instances, the generator names represent built-in and custom generator presets. When you select a generator preset, the configuration is updated to match the current baseline configuration for that preset.
To remove the selected generator and set the generator to Passthrough, click the delete icon next to the generator dropdown list.
After you select a generator preset, you can change the generator configuration. For details about the available configuration options for each generator, see the Generator reference.
Overriding the configuration does not affect the baseline configuration for the generator preset.
A column is also considered to have overrides when someone changed the baseline configuration of the generator preset after it was assigned to the column.
Note that the following configuration options are not part of the preset configuration:
On the column configuration panel, you use the Reset to baseline button to remove any overrides to the current baseline configuration for the generator preset.
From the column configuration panel, you can save the updated configuration as the baseline configuration for the generator preset.
To do this, click Preset Options, then select Update baseline configuration. On the confirmation panel, click Confirm.
When you update the baseline configuration for the generator preset, Structural does not change the configuration of other columns that use the previous baseline configuration.
Whenever you select a generator preset, it uses the current baseline configuration.
From the generator configuration panel, you can save the current configuration as a new custom generator preset.
When you create a new custom generator preset, it is selected as the generator preset for the column.
To do this:
Click Preset Options, then select Create a new generator preset.
On the Create New Preset dialog, in the New Preset Name field, provide a name for the new custom generator preset.
Click Create.
Required license for workspace inheritance: Enterprise
In a child workspace, the configuration panel indicates whether the column currently inherits the configuration from the parent workspace.
The inheritance stops if you select a different generator or generator preset (including the Passthrough generator) or change the configuration.
When the column overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.
The AI Synthesizer generator is intended for use cases that require high-fidelity mimicked data. It can be used instead of the continuous or categorical generators.
This generator uses deep neural networks to learn models of your data, which can be sampled to generate new synthetic rows that faithfully mimic the statistical properties of your data.
The expressiveness of deep neural networks allows this generator to capture subtle relationships in the data that may be difficult to express using linking and partitioning generators. The relationships are learned from the data, instead of specified by the user.
Because this generator uses neural networks to learn from the data, performance is limited by the time required to train a model.
The privacy ranking is 3.
For the Tonic Structural API, the generator ID is NnGenerator.
By default, the AI Synthesizer is not available. To enable the AI Synthesizer, in the Structural web server container, set the environment setting TONIC_NN_GENERATOR_ENABLED to true. For more information, go to Configuring environment settings.
Within each table, to configure the AI Synthesizer:
Assign the AI Synthesizer generator to the columns to use in the model. You also determine the type of data in each column.
Determine whether the table contains event data. For event data, you must select the primary entity and order columns.
For each table, you assign the AI Synthesizer generator to each column that you want to include in the trained model. AI Synthesizer trains one model per table.
You can assign the AI Synthesizer generator to columns that contain categorical, numeric, or location data. You cannot assign the AI Synthesizer to a datetime column.
Structural identifies the type of the column, but you can make adjustments to these assignments. For example:
A numeric column might actually be an enum, which would make it a categorical column.
A city name might be designated categorical, but is actually a location.
On the generator configuration panel for the column, from the type dropdown list, select the column type.
A table might contain event data, meaning that you want to preserve relationships between both rows and columns. For example, you might want to track financial transactions across time for each user.
To indicate that a table contains event data, on the generator dialog for any of the columns, check the Event Data checkbox.
The checkbox applies to the entire table.
For event data, you specify:
The column to use to identify the row (primary entity). For example, to track activity for users, you might use a column that contains a user name or identifier.
The column to use to sort the rows (order). This column should contain a numeric representation of a datetime value.
On the generator configuration panel:
To identify the current column as the primary entity, from the type dropdown list, select Primary Entity.
To identify the current column as the column to use for ordering, from the type dropdown list, select Order.
The Primary Entity and Order options are only available when Event Data is checked. The Order option is only available for numeric columns.
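Because the Order column must contain a numeric representation of a datetime value, you may need to convert timestamps before using them. One common choice (an assumption here, not a Tonic requirement) is Unix epoch seconds:

```python
import datetime

def to_order_value(ts: str) -> float:
    """Convert an ISO datetime string to Unix epoch seconds, a numeric
    representation suitable for an Order column (illustrative choice)."""
    dt = datetime.datetime.fromisoformat(ts).replace(tzinfo=datetime.timezone.utc)
    return dt.timestamp()

events = ["2023-05-02T09:30:00", "2023-05-01T17:45:00", "2023-05-02T08:00:00"]
ordered = sorted(events, key=to_order_value)
print(ordered[0])  # 2023-05-01T17:45:00
```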
When the AI Synthesizer generator is assigned to at least one column in the table, then in Table View for that table, the AI Synthesizer panel displays.
The panel displays the list of columns that are included, and, for each column, the selected encoding type.
To remove a column, click the delete icon. The column is removed from the list, and the column generator is reset to Passthrough. For event data, if you remove the primary column or the order column, then you must assign that role to a different column.
To configure the model training, click the settings icon. The settings on the settings panel are slightly different depending on whether the model contains event data.
On the settings panel, the following parameters are common to all models:
In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.
In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.
In the Reconstruction Loss Factor field, enter the reconstruction loss factor for the model. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” function and a regularization term. A higher value can help to produce decoded samples that are close to encoded samples, but also can make latent representations more complicated and reduce the diversity of synthetic samples.
In the Latent Dimension field, enter the dimension of the latent representation. The default is 128. The latent dimension represents the complexity of the data. If the specified value is much higher than the intrinsic dimensionality of the problem that you want to analyze, it can reduce the quality of the results.
In the Maximum Categorical Dimension field, enter the dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
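The category-capping behavior described for Maximum Categorical Dimension can be sketched as follows. Whether the combined "other" bucket counts toward the limit is an implementation detail not confirmed by this documentation; this sketch assumes it does.

```python
from collections import Counter

def cap_categories(values, max_dim=35):
    """Keep the (max_dim - 1) most frequent categories as-is and fold the
    rest into a single OTHER bucket, approximating what the Maximum
    Categorical Dimension setting does before one-hot encoding."""
    top = {cat for cat, _ in Counter(values).most_common(max_dim - 1)}
    return [v if v in top else "<OTHER>" for v in values]

cities = ["NYC"] * 5 + ["LA"] * 3 + ["Oslo", "Lima"]
print(cap_categories(cities, max_dim=3))
# NYC and LA survive; the two rare cities collapse into <OTHER>
```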
For event data, to configure the RNN-VAE Parameters:
In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Structural considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.
In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.
In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.
In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.
In the Sequence Length Loss Factor field, enter the loss factor for sequencing for the model. The default is 2.0. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, the AI Synthesizer uses more of the model's capacity to capture the statistical properties of sequence lengths.
In the Order Column Loss Factor field, enter the loss factor for the column value order. The default is 1024.0 The order column loss factor determines how important it is to predict the order of the column values. Similar to the sequence loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
For non-event data, to configure the VAE Parameters:
In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
In a child workspace, the AI Synthesizer panel under Model indicates whether the configuration is inherited from the parent workspace.
The inheritance stops if you make any changes to the AI Synthesizer configuration. When the configuration overrides the parent configuration, to reset to the parent configuration and restore the inheritance, click Reset.
Model training starts when you start the generation job.
This can take some time, depending on the size of the table and the number of columns that use the AI Synthesizer generator.
For example, a table that has 30 AI Synthesizer columns and 200,000 rows can take 2.5 hours to train.
The status information on the Jobs page includes the status of the model training.
After the model is trained, the new synthetic data writes to the destination database.
Required license: Enterprise
On Basic or Professional instances, you select and configure generators separately for each column.
Required global permission: Create and manage generator presets
A generator preset is a saved configuration for a generator.
Tonic Structural provides a built-in preset for every generator. You can update the configuration of the built-in presets.
You can also create custom generator presets that have different configurations. For example, for the Address generator, you can have one generator preset to use for city columns, and another generator preset to use for full addresses. You can edit and delete the custom generator presets. The custom generator presets are available to assign to columns throughout the Structural instance.
Generator presets allow you to standardize the configuration for generators, and save your users from having to replicate the same configuration selections across different columns, tables, and workspaces. For example, you might modify the generator preset for the Integer Key generator to enable consistency. Whenever a user assigns the Integer Key generator to a column, consistency is enabled.
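Consistency, as mentioned above, means that the same source value always produces the same masked value. A common way to get this behavior (an illustrative sketch, not Tonic's algorithm) is to derive the output from a keyed hash of the input:

```python
import hashlib

def consistent_integer_key(value: int, seed: str = "workspace-seed") -> int:
    """Illustrative 'consistency' behavior: the same source value always
    maps to the same masked integer, derived from a keyed hash.
    The seed name is hypothetical."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    return int(digest[:12], 16)

print(consistent_integer_key(42) == consistent_integer_key(42))  # True
```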
For information about assigning and updating generator presets for a column, go to Assigning and configuring generators.
You can also view the video tutorial about generator presets.
The Generator Presets view contains the list of generator presets for the entire Structural instance. The configured presets are not specific to a workspace or a user.
To display the Generator Presets view, in the Tonic heading, click Generator Presets.
For each generator preset, the list provides the following information:
The name of the generator preset. For the built-in presets, the generator preset name always matches the generator name.
Whether the generator preset is built-in or custom.
The number of occurrences. Includes the number of occurrences that use the current baseline configuration, and the number of occurrences that have overrides to the baseline configuration.
An occurrence has an override if, after a user assigns the generator preset to a column, one of the following occurs:
A user changes the generator configuration options for that occurrence.
A user changes the baseline configuration for the generator preset.
When the preset configuration was most recently modified.
You cannot create or configure generator presets for generators that do not have any configuration options. For example, the Null generator does not have any configuration options.
For composite generators, you cannot create or configure generator presets from Generator Presets view. Generator Presets does not have access to data from which to create path expressions. You can create a new preset or update a preset baseline configuration from a column configuration panel in Privacy Hub, Database View, or Table View.
The list indicates when a generator does not allow you to configure a preset.
You can filter the list of generator presets by the preset name, whether it is built-in or custom, and by the underlying generator type.
To filter by the preset name, begin typing text from the name. As you type, Structural filters the list to only include the matching presets.
To filter the list based on whether the preset is built-in or custom:
Click Filter by Type.
In the dropdown list: To only include built-in presets, click Built-in. To only include custom presets, click Custom.
Tonic adds the selection to the selected filters.
Every generator preset is based on a Structural generator type. For example, there is a built-in generator preset for the Address generator, and you can also create custom generator presets based on the Address generator.
To filter the list based on the generator type:
Click Filter by Generator.
In the generator list, click a generator to include. You can use the search field to search for a specific generator. When you click the generator name, Structural adds the generator to the selected filters.
You can sort the generator preset list by the preset name and by the modification date.
To sort the generator preset list by a column, click the column heading. To reverse the sort order, click the column heading again.
To create a new custom generator preset, you can either create a completely new preset, or copy an existing preset.
For composite generators such as JSON Mask, you cannot create a generator preset from Generator Presets view. Generator Presets view does not have access to data to use for path expressions. You can create presets for composite generators from a column configuration panel in Privacy Hub, Database View, or Table View.
You cannot create a custom preset at all for the AI Synthesizer, or for a generator that has no configuration options. For example, you cannot create a custom preset for the Null generator.
To create a completely new custom generator preset:
On the Generator Presets view, click Create Preset.
On the Create Preset panel, configure the generator preset.
Click Create.
When you copy an existing generator preset, the new generator preset by default inherits the configuration from the copied generator preset.
To copy an existing generator preset:
On the Generator Presets view, click the copy icon for the generator preset that you want to copy.
On the Copy Preset dialog, enter a name for the new generator preset, then click Copy. The new preset is added to the Generator Presets list, and the details panel is displayed to allow you to change the new preset configuration.
After you update the configuration, click Save and Apply.
On the confirmation panel, click Confirm.
To edit a preset, you must be either an editor or owner of at least one workspace in the Structural instance. If you are not an editor or owner of a workspace, then you can view the list of presets, but you cannot edit the presets.
When you change the configuration of a generator preset, the updated configuration becomes the new baseline configuration for the generator preset.
The baseline configuration is used whenever you select the generator preset. Existing occurrences of the generator preset keep their current configuration. You can reset those occurrences to use the current baseline configuration.
A change to the generator preset description is not considered a change to the baseline configuration.
For composite generators such as JSON Mask, you cannot update a generator preset from Generator Presets view. Generator Presets view does not have access to data to use for path expressions. You can update the baseline configuration from a column configuration panel in Privacy Hub, Database View, or Table View.
To update the baseline configuration of a generator preset:
On the Generator Presets view, click the edit icon for the preset.
On the Configuration tab of the Edit Preset panel, update the configuration. You cannot change the selected generator for the preset.
Click Save and Apply.
On the confirmation panel, click Confirm.
Each generator preset includes the following configuration:
Preset Name - The name of the generator preset. You can change the name of built-in presets. Built-in presets always use the generator name.
Preset Description - A longer description of the generator preset and how it is intended to be used.
Generator Type - Used to select the generator for a new generator preset. When you copy or edit a generator preset, you cannot change the selected generator type.
Generator configuration - The configuration options for the selected generator. For details on the specific configuration options for each generator, go to the Generator reference.
The following items are not included in the generator preset configuration. They are always configured for individual columns after you select the generator preset:
On the generator preset details panel, the Occurrences tab indicates where the generator preset is used. You can also see whether each occurrence overrides the current baseline configuration.
The Occurrences tab displays the list of workspaces that contain occurrences of the preset. Each workspace indicates the total number of occurrences that use the current baseline configuration and that have overrides to the current baseline configuration.
For workspaces that you have access to:
You can expand the workspace to display the list of columns that use the generator preset. For each column, the entry indicates whether the column uses the current baseline configuration.
You can click the Database View icon to navigate to Database View.
For workspaces that you do not have access to, you can only see the total number of occurrences. You cannot display the column list or navigate to Database View.
You can delete custom generator presets. You cannot delete built-in generator presets.
When you delete a custom generator preset, existing occurrences are assigned the built-in generator preset for that generator. If the current configuration does not match the baseline configuration for the built-in generator preset, then the occurrences also are marked as having overrides.
For example, a column is assigned a custom generator preset for the Name generator. The custom generator preset is deleted. The column is then assigned the built-in generator preset for the Name generator, and is marked as having overrides.
To delete a custom generator preset:
On the Generator Presets view, click the delete icon for the generator preset.
On the confirmation dialog, click Delete Preset.
Some data values require custom processing before or after the generator is applied.
If you require custom processing for data values, Tonic.ai can work with you to develop and deploy custom value processors for your Tonic Structural instance. Once a custom value processor is deployed, you can select the processor as part of the generator configuration for each column.
One common use case for custom processing is to decrypt source data before applying a generator, and encrypt destination data before writing it to the destination database.
Required license: Professional or Enterprise
Not available on Tonic Structural Cloud.
Required global permission: Configure Tonic data encryption
A common use case for custom processing is encrypted source data. The data might need to be decrypted before a generator is applied, and encrypted before it is saved to the destination database.
Structural data encryption allows you to configure decryption and encryption to use during data generation. The data encryption process supports AES encryption, and allows you to use the CBC, ECB, or CFB cipher mode.
When Structural data encryption is enabled, the configuration panel for each column includes a toggle to use Structural data encryption for that column.
For columns that use both Structural data encryption and a custom value processor:
Decryption occurs before a pre-processing custom value processor.
Encryption occurs after a post-processing custom value processor.
You enable and configure the data encryption from the Data Encryption tab of the Tonic Settings view. To display the Tonic Settings view, in the Tonic heading, click Tonic Settings.
To use Structural data encryption, you must provide:
A Base64-encoded decryption key as the value of the TONIC_DATA_DECRYPTION_KEY environment setting.
A Base64-encoded encryption key as the value of the TONIC_DATA_ENCRYPTION_KEY environment setting.
Both key values must use the same key size: 128, 192, or 256 bits.
Structural validates whether the values are set correctly. Structural enables the rest of the Data Encryption tab settings only if the keys are set correctly.
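Generating and Base64-encoding a key of the required size can be done with the Python standard library. This sketch produces a 256-bit key suitable for either environment setting:

```python
import base64
import os

# Generate a 256-bit key and Base64-encode it for use as the value of
# TONIC_DATA_ENCRYPTION_KEY or TONIC_DATA_DECRYPTION_KEY.
key = os.urandom(32)  # 32 bytes = 256 bits
encoded = base64.b64encode(key).decode("ascii")

print(len(base64.b64decode(encoded)))  # 32 — round-trips to the same key size
```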
By default, Structural data encryption is disabled. To enable it, toggle Enable Data Encryption to the on position.
When you enable Structural data encryption, you choose whether to use decryption, encryption, or both.
You use decryption if the source data is encrypted and must be decrypted before the generators are applied.
You use encryption to encrypt the transformed data before saving it to the destination database.
To use decryption only, select Use Decryption.
To use encryption only, select Use Encryption.
To both decrypt and encrypt data, select Use Decryption and Encryption.
Structural only supports AES encryption. The AES Encryption setting shows the current key size.
The key size is based on the values you provided for the decryption and encryption key environment settings.
From the Cipher Mode dropdown list, select the cipher mode to use for Structural data encryption. The available cipher modes are:
CBC
ECB
CFB
Before it decrypts or encrypts data, Structural applies an initialization vector.
By default, Structural generates a random initialization vector, and Use custom Initialization Vector (IV) is in the off position.
To provide custom initialization vectors for Structural to use:
Toggle Use custom Initialization Vector (IV) to the on position.
If the Structural data encryption configuration includes encryption, then in the Encryption IV field, enter the static initialization vector to use to encrypt data.
If the Structural data encryption configuration includes decryption, then in the Decryption IV field, enter the static initialization vector to use to decrypt data.
After it encrypts the destination data, but before it stores it, Structural can prepend a string to the encrypted data.
To configure Structural data encryption to prepend a string:
Toggle Prepend value to encrypted data to the on position.
In the Custom Value field, enter the string to prepend.
After you complete the configuration, the Preview Results panel allows you to test the decryption and encryption.
If the configuration is incomplete, you cannot run the test.
If the configuration is for decryption only:
In the Ciphertext field, enter an encrypted text string.
Click Run Test.
Verify that the text in the Plaintext Result field is correct.
If the configuration is for encryption only:
In the Plaintext field, enter an unencrypted text string.
Click Run Test.
Verify that the text in the Ciphertext Result field is correct.
If the configuration is for both decryption and encryption, then you provide an encrypted string. The test decrypts the string into plain text, then re-encrypts that string.
In the Ciphertext field, enter an encrypted text string.
Click Run Test.
Verify that the text in the Plaintext Result field and the Ciphertext Result field is correct.
To save the configuration, click Save.
To revert any changes since you last saved the configuration, click Revert.
Structural data encryption allows you to set up decryption and encryption to apply to columns.
Some Tonic Structural data connectors do not support subsetting.
However, for the following connectors that do not support subsetting, you can instead add table filters to generate a smaller set of data.
The following data connectors support both subsetting and table filtering:
You can only filter tables that use De-identify table mode. The filter identifies the rows from the source database to process and include in the destination database.
Note that unlike subsetting, table filters do not guarantee referential integrity.
To add a filter, in the Table Filter text area on the table mode panel, provide the WHERE clause for the filter, then click Apply.
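For example, a filter that limits the destination data to rows created on or after a given date might use a WHERE clause like the following (the created_date column here is hypothetical):

```sql
created_date >= '2023-01-01'
```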
For Databricks workspaces where the source database uses Delta files, the filter WHERE clause can only refer to columns that have partitions.
For Amazon EMR, Google BigQuery, and Spark with Livy, the filter WHERE clause can refer to columns without partitions. However, the performance is better when the referenced columns have partitions.
On the workspace configuration for Amazon EMR, Databricks, and Spark with Livy, the Enable partition filter validation toggle determines whether Structural validates the WHERE clause when you create it. By default, the toggle is in the on position, and the WHERE clause is validated.
For Amazon Redshift, Google BigQuery, and Snowflake, Structural always validates the WHERE clause.
Required license: Professional or Enterprise
Subsetting allows you to intelligently reduce the size of your destination database. It takes a representative sample of the data while preserving the data's referential integrity.
You configure how Tonic Structural should generate a subset. When you generate the output data, you decide whether to enable the subsetting process.
For example, you can configure subsetting to get 5% of all transactions, or all of the data that is associated with customers who live in California.
Here are a few examples where subsetting data might be important or necessary:
You want to use your production database in staging or test environments, without the PII. The database is very large, so you want to use only a portion of it.
You want a test database that contains a few specific rows from production (and related rows from other tables) so that you can reproduce a bug.
You want to share data with others but you don’t want them to have all of it. For example, you can provide developers an anonymized subset that also enables them to run the test database locally on their local machines.
To learn more about our approach to subsetting, go to the following technical blog posts:
Subsetting uses foreign keys to determine the relationships in the data. These relationships enable the subsetting process to traverse the database as it builds the subset. Foreign keys are either configured in your source data, or configured using the Structural virtual foreign key tool. For more information, go to Subsetting and foreign keys.
For subsetting, each table in the source database falls into one of the following categories:
Target tables are the seed tables that provide the initial set of rows to include in the subset. Structural retrieves the initial subset of data from the target tables. Structural then uses those rows to identify the information to pull from related tables.
A target table typically contains an important object that is well connected to everything else in the source data. For example, users, transactions, or claims. A subset should usually have a very small number of target tables.
When you identify a target table, you specify how to retrieve the subset of the data that you want from the table. You can request a percentage of the data, or use a WHERE clause to identify a specific subset of data.
For more information, go to #subsetting-configure-target-tables.
A lookup table contains a static set of values that is used in other tables in your subset. For example, a lookup table might contain a list of postal codes or country names that are referenced in other tables.
Structural always retrieves all of the data in a lookup table. It does not check whether or where the lookup values are used.
It does not pull records from related tables based on lookup table values. Relationships with lookup tables are ignored during the subsetting process.
For more information, go to #identifying-lookup-tables.
Related tables are tables that are connected by direct or indirect relationships with a target table, and that are not identified as lookup tables.
Downstream tables have data that is required to maintain referential integrity for the subset. These tables have primary keys that are referenced by foreign keys in related tables.
Upstream tables contain data that has a foreign key that references a primary key in the target table. For large upstream tables, if the foreign key columns are not indexed, the subsetting process can be significantly slower.
These upstream records are not required to maintain referential integrity, but can contain useful information. In the subset configuration, you can filter these upstream records either by date or by using a WHERE clause.
Some related tables are both downstream and upstream. In that case, you can provide a filter that applies only to the upstream records. Because the downstream records are required for referential integrity, they cannot be filtered.
For example, a transactions table contains a foreign key column to identify the customer. The value is the primary key of a record in the customers table. The customers table is downstream of the transactions table - the transaction data is incomplete without the customer information. The transactions table is upstream of the customers table.
Structural pulls data from related tables in order to preserve referential integrity in the output data subset.
In many cases, the relationship is direct. For example, a target table contains a list of events. The events table identifies the user that hosted the event. The user is identified using a foreign key relationship from the events table to the users table. The users table is a related table. The subset includes all the users that the events refer to.
The relationship also might be indirect. To continue the example, the events table identifies a user from the users table. The users table identifies the company that each user belongs to. The company is identified using a foreign key relationship from the users table to the companies table. The companies table is also a related table. The subset needs to include all of the companies that are referred to by the users that the events table refers to.
For an example of how Structural identifies related tables, view the example diagram in #subsetting-how-tonic-creates.
Tables other than target tables, lookup tables, or related tables are not part of the subset.
By default, Structural copies only the table schema of out-of-subset tables. It does not populate any of the data.
You can also choose to process the tables using the table mode that is assigned to each table.
For more information, go to #subsetting-config-out-of-subset.
Structural creates the subset before it applies any transformations to the source data.
To provide a basic overview of how Structural creates the subset, we'll use the following simple example schema:
The Events table is the target table for the subset. The Events table includes information about the event hosts (Hosts table) and the event venue (Venues table). For each host, the data includes the company that the host belongs to (Companies table).
The Attendees table includes the event that the attendee registered for.
The Hosts, Companies, Venues, and Attendees tables are all related tables for the subset.
The States table provides a lookup of state values to use for the company, venue, and attendee addresses. It is a lookup table for the subset. A subset always includes all of the data in a lookup table.
When you enable subsetting for a data generation job:
To create the basis of the subset, Structural gets data from the target tables based on the configured filters, either a percentage or a WHERE clause.
In our example, Structural gets the subset of data from the Events table.
Structural then traverses your database based on the relationships that originate from the target tables.
Structural first goes upstream. For the upstream process, Structural traverses through tables that reference a target table, based on the data collected in step 1. In other words, the value of the primary key for a target table record is the value of a foreign key column in the upstream table. This step continues until there are no remaining upstream tables to process. To continue our example, Structural retrieves the attendees for the event records that are in the subset.
Next, Structural goes downstream. Structural traverses all of the tables to look for foreign key columns for which the value is the primary key of an upstream table record. To continue our example, Structural retrieves the hosts and venues that are referred to in the event records that it retrieved in the first or second pass on the events table. It also retrieves the companies that are referred to in the host records. During this downstream step, Structural considers both upstream and downstream tables to ensure that the subset includes every connected table. For example, if the Venues table included a foreign key column that referenced a primary key from the Attendees table, Tonic would have to return to the Attendees table to get those attendee records.
You might want to be aware of how Structural retrieves subset data in the following cases, which can result in either more or less data than you might expect.
If there are multiple target tables, and the tables are related to each other, Structural takes the union of the required data for both the target table configuration and the table relationships.
For example, table A contains a foreign key column that refers to table B. You configure both tables as target tables. For table B, Structural pulls both the directly targeted set of records, and the records that the targeted table A records refer to.
If a table is upstream of multiple target tables, then Structural only pulls records from that table that contain references to targeted records in all of the target tables.
For example, in related table Child1, column1 is a foreign key that refers to a primary key in target table Parent1, and column2 is a foreign key that refers to a primary key in target table Parent2.
If column1 and column2 both refer to targeted records in Parent1 and Parent2, then that Child1 record is included in the subset. If only one of those columns refers to a targeted record in Parent1 or Parent2, then that Child1 record is not included.
If you are having trouble with your subsetting configuration or results, the following hints and tips might be helpful.
Remember that subsetting is an iterative process. It can take multiple attempts to get the exact subset of data that you want.
One way to troubleshoot a subsetting configuration is to start with a very small subset. With a small subset, it is easier to verify that the data generation returns the data you are expecting.
For example, start with a single target table, and use a query to limit the subset to a single record. Add the necessary lookup tables.
After you verify the results in this very small subset, you can gradually increase the subset size in subsequent iterations. Verify the results after each iteration, and continue to increase the subset size until you reach the full subset.
For each iteration, verify that:
The subset includes the target table records.
The subset includes all of the lookup table data.
The subset includes all of the related records - records that each target table record refers to, and records that refer to the target table record.
An ideal subset contains a small number of target tables. A target table typically contains an important object that is well connected to everything else in the source data.
Tables that contain static values that are used in multiple other tables should be lookup tables, not target tables.
For other tables that are related to the target table, allow the subsetting process to identify the necessary rows to include in the subset based on the foreign key configuration. Do not add them as target tables.
When Tonic Structural detects a circular foreign key dependency, to break the dependency, it sets all of the values of one of the columns to NULL. For more information, go to #circular-dependencies.
If your source data includes circular foreign keys, make sure that at least one of those columns is nullable.
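For example, in PostgreSQL, you could drop a NOT NULL constraint from one of the circular foreign key columns. The table and column names here are hypothetical:

```sql
ALTER TABLE employees
  ALTER COLUMN manager_id DROP NOT NULL;
```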
Note that subsetting performance does not improve in a linear fashion.
For example, if you subset 10% of a target table, the generation does not take 10% of the time of a standard de-identification run. It might take 90% of the time.
To improve performance:
Make sure to properly configure all relevant tables as lookup tables. Lookup tables are automatically copied to the subset in their entirety, and do not require processing to identify which rows to include. Structural also does not look for records that are upstream of a lookup table.
For upstream tables, in particular large upstream tables, add indexes to virtual foreign key columns. If there are no indexes on the foreign key columns, subsetting can be much slower.
Alternatively, instead of adding indexes, assign Truncate mode to the upstream tables.
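Adding an index to a foreign key column looks like the following in most SQL databases. The table, column, and index names here are hypothetical:

```sql
CREATE INDEX ix_transactions_customer_id
  ON transactions (customer_id);
```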
Required license: Enterprise
By default, a child workspace inherits the subsetting configuration from its parent workspace. Any changes to the subsetting configuration in the parent workspace are copied to the child workspace.
If you make any change to the subsetting configuration in the child workspace, then it no longer inherits the subsetting configuration. The changes to the parent workspace do not affect the child workspace.
You can reset the child workspace to restore the inheritance.
For a child workspace, the Configuration tab on Table View indicates whether the child workspace currently inherits the configuration from the parent workspace.
Inherits parent configuration means that the child workspace inherits the parent configuration.
Overrides parent configuration means that the child workspace overrides the parent configuration. Changes to the parent configuration are not copied to the child workspace.
For a child workspace that overrides its parent configuration, you can reset the inheritance. When you reset the inheritance, the overrides are removed from the subsetting configuration. Changes to the parent configuration are once again copied to the child workspace.
To reset the inheritance, in the Overrides parent configuration notice, click Reset, then on the confirmation dialog, click Reset again.
The configuration overrides are removed. The child workspace inherits any subsequent configuration changes from the parent workspace.
For a parent workspace, you can view the current inheritance status of all of the child workspaces. When you change the subsetting configuration for a parent workspace, it applies to the child workspaces that have not overridden the subsetting configuration.
The Child Workspaces tab contains the list of child workspaces.
For each workspace, the list includes:
The workspace name.
The inheritance status. Inheriting indicates that the child workspace inherits the configuration from the parent. Overriding indicates that the child workspace overrides the configuration and does not inherit it from the parent.
Your role in the child workspace.
The owner of the child workspace.
You cannot reset the inheritance status from the Child Workspaces tab. If you have access to a child workspace, to switch to that workspace, click the arrow icon in the rightmost column.
To display the Subsetting view, either:
On the workspace management view, in the workspace navigation bar, click Subsetting.
On Workspaces view, from the dropdown menu in the Name column, select Subsetting.
On Workspaces view, click the subsetting icon for the workspace.
The Configuration tab on the Subsetting view shows the current subsetting configuration.
It consists of:
Subsetting summary
Table View and Graph View. Both views display the source data tables and show the current subsetting configuration, and allow you to update the configuration. Table View displays a tabular list of tables. Graph View displays a diagram that shows the relationships between the tables.
Configuration to enable subsetting for data generation
Configuration for handling out-of-subset tables
Results of the most recent subsetting data generation
The panels at the top of the Configuration tab provide a clickable summary of the current subsetting configuration.
The summary includes the following values:
Target shows the number of target tables.
Lookup shows the number of lookup tables.
In Subset shows the number of tables that are in the subset. This includes target tables, lookup tables, and related tables.
Out of Subset shows the number of tables that are not in the subset.
When you click a summary panel:
On Table View, the table list is filtered to only display matching tables. For example, when you click Target, the list is filtered to only include target tables.
On Graph View, the matching tables are highlighted with a shadow behind the table objects.
After you run data generation with subsetting, on Table View, the Latest Results tab displays on the Configuration tab.
Before you run data generation with subsetting, the Latest Results tab does not display.
The Latest Results tab displays details for the most recent data generation with subsetting. It ignores data generation runs that do not use subsetting.
The subsetting results include:
The job status (successful, failed, canceled).
The amount of time it took to complete the run.
The percentage of the source data that is included in the subset destination data.
The volume of data in the source data and the subset destination data.
The percent reduction from the original source data to the subset destination data.
When the job began and ended.
To display the details for the data generation job, click View Job Details.
The Configuration tab contains the list of tables in the source database. It shows how each table is affected by the most recently completed subsetting configuration.
For each table, the table list includes:
Whether the table is a target table, a lookup table, or a related table that is filtered.
Whether the table is in or out of the subset. Target and lookup tables are always in the subset. Related tables also are in the subset. Other tables are out of the subset.
The number of rows in the table before and after the subset is created. For tables that are in the subset, the percentage of table data that is in the subset. For more information, go to #subsetting-view-calculate-pre-post-subset-rows.
The number of direct inbound (downstream) and outbound (upstream) relationships for the table. An inbound relationship means that a primary key from another table is used as a value in the current table. An outbound relationship means that the primary key of the current table is a foreign key in another table. You can filter the upstream records to only include the records that you need. For target tables, the relationships are used to determine the related tables that are included in the subset. The related tables can also include other tables where the relationship is indirect.
You can sort the list based on values in a selected column. To use a column to sort the list, click the column heading. To switch the sort order, click the column heading again.
The Sort by dropdown list provides the following options to sort the list:
Rows pre-subset - Sort by the number of rows in the table before subsetting.
Rows post-subset - Sort by the number of rows in the table after subsetting. Before you run a data generation job to create the subset, this value is unknown.
Inbound relationships - Sort the list based on the number of inbound relationships.
Outbound relationships - Sort the list based on the number of outbound relationships.
Total relationships - Sort the list based on the total number of inbound and outbound relationships.
By default, the drop-down sort options sort the table list in descending order. For example, when you select Rows pre-subset, the table that currently has the largest number of rows is at the top of the list. To change the sort direction, select the option again.
You can filter the list based on:
The table name
Whether the table is in or out of the subset
Whether the table is a target or lookup table
To filter by table name, begin typing the name text into the filter field. As you type, the list is filtered to only include tables whose names contain the filter text.
To filter the list to show only target tables, lookup tables, in-subset tables, or out-of-subset tables, do one of the following:
Click the panel at the top of the tab.
From the Filter Tables drop-down list, select the filter option.
To remove a table subset status filter, click the delete icon.
You can combine a name filter and a table subset status filter. For example, you can filter the list to show in-subset tables that contain the text "test".
You cannot combine the table subset status filters. When you select a different filter, the current filter is replaced.
Graph View displays a diagram of the source data tables and the relationships between them. It also indicates:
Whether each table is in the subset.
Whether the subsetting status for the table changed since the last subsetting data generation.
Each table block provides the following information about the table:
At the top left:
The name of the table
The name of the schema that contains the table
At the top right, the status of the table in the context of the subset. A table might be a target table, a lookup table, a related table that is in the subset, or a table that is out of the subset.
At the bottom, the number of rows in the table before and after the subset is created. For more information, go to #subsetting-view-calculate-pre-post-subset-rows.
It also indicates the effect on the table of subset configuration changes that occurred since the most recent subsetting generation. For more information, go to #subsetting-config-identify-changes-since-last-run.
The Graph View diagram connects tables that are related to each other based on a foreign key relationship. The position of the tables indicates the type of relationship.
Tables that have an upstream relationship with another table are displayed above the table.
Tables that have a downstream relationship with another table are displayed below the table.
In the example schema from #subsetting-how-tonic-creates, the Events table contains a list of events:
The Attendees table refers to the event for the attendee. Attendees is upstream of Events, and would display above the Events table in Graph View.
The Events table refers to a venue from the Venues table. Venues is downstream of Events, and would display below the Events table in Graph View.
To find and focus on a specific table:
In the search field, begin to type text from the table name. As you type, Tonic Structural filters the list to display matching tables.
When you see the table that you want, click the table name. Structural highlights the connections to other tables and displays the table details panel.
To navigate around Graph View, you can click and drag to pan around the graph.
You can also use the navigation tools at the bottom left of Graph View to zoom in and zoom out.
For tables that contain fewer than 1,000 rows, the pre-subset number of rows is displayed as <1k.
For tables that are in the subset, the resulting rows are based on the target table and related table configuration.
For tables that are not in the subset, the resulting rows are based on whether you enable Process tables that are out of subset. For more information, go to #subsetting-config-out-of-subset.
If the data generation job hasn't run yet, or the details from the job are not yet available, then the number of rows after the subset is marked as unknown.
If you updated the configuration for a table since the most recent data generation, then on Table View, an information icon displays next to the post-subset value.
When you click a table in either Table View or Graph View, the table details panel displays to the right of the table.
The table details include:
Whether the table is in the subset
The number of rows before and after the subsetting. For more information, go to #subsetting-view-calculate-pre-post-subset-rows.
The number of outbound and inbound relationships
For target tables, the subset configuration
The list of inbound and outbound relationships with other tables. When you click a table name, Structural selects and displays the details for that table.
A target table is a table for which you specify a subset of the data to include in the destination database.
To identify the subset of data to include, you can either:
Specify a percentage of the table to include in the destination database.
You can use this option when you care about the specific volume of data, but not the specific rows.
Tonic Structural converts the percentage to a filter or a WHERE clause, depending on your database type.
Depending on how your tables are related, the target tables in the final subset might contain more rows than the percentage that you specified. These additional rows are required to maintain referential integrity. To view the tables that contribute to the additional rows, see the subset steps. For additional assistance, reach out to your Tonic.ai contact.
Provide a WHERE clause to specify the subset of data to include in the destination database.
The WHERE clause allows you to be more specific about the data to include. For example, you might want to only include data for a specific user or date range.
To combine a specified set of records with a random set of the remaining records, use a WHERE clause. For example, to get all users that are from Alabama, and 5 percent of the other records, use the following WHERE clause:
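A sketch of such a WHERE clause, assuming PostgreSQL and a hypothetical state column:

```sql
state = 'Alabama' OR random() < 0.05
```

Every Alabama record matches the first condition, and random() < 0.05 passes roughly 5 percent of the remaining records.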
To identify and configure subsetting for a target table:
In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Percentage).
In the Target Percentage field, type the percentage of the data to include in the destination database. The default is 5, which indicates to use 5% of the rows in the table. You can specify a decimal value, including a value that is less than 1. For example, you might configure the subset to include .5 percent of the rows, or 33.33 percent of the rows.
In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Target Table (Where Clause).
In the Target Where Clause field, type the WHERE clause to use to identify the data to include in the destination database.
For example, the target table contains a column called event_id. To select all rows where event_id is greater than 1000, add the following WHERE clause:
event_id > 1000
For a more complex WHERE clause, you can display an editor with a larger text area.
Click Open in Editor.
In the text area, enter the WHERE clause.
Click Save.
You can query across tables within the WHERE clause.
For example, you configure the customers table as a target table, but you also want to use information from the customers_legacy table to identify the target records in customers.
In the following example query, the matching records in customers have a Customer_Key value that matches a CustomerKey value in customers_legacy, where the value of Occupation in customers_legacy is Detective:
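A sketch of that WHERE clause for the customers target table (column names as in the example above; exact identifier quoting depends on your database):

```sql
Customer_Key IN (
    SELECT CustomerKey
    FROM customers_legacy
    WHERE Occupation = 'Detective'
)
```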
You can also create a query that selects a random percentage of a specified set of data.
For example, in PostgreSQL, to select 50% of the records that have an identifier that is divisible by 3, you could use the following WHERE clause:
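One possible sketch, assuming an integer id column:

```sql
id % 3 = 0 AND random() < 0.5
```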
To remove a target table:
In Table View or Graph View, click the table.
On the table details panel, from the table type dropdown list, select Remove.
A lookup table contains a list of values that are used to populate columns in other tables. For example, a list of states, countries, or currencies. Lookup tables are sometimes referred to as reference tables.
Structural always copies lookup tables to the destination database in their entirety. If you do not configure a table as a lookup table, then Structural treats the table as a related table, and copies only rows that are used in the subset data. Structural also pulls in rows from other tables that refer to the table, but that are not necessarily related to the target tables. This could result in an unexpectedly large subset.
For example, in a Users table, every user record refers to a state in the States table. If you do not identify States as a lookup table, then the subset would include every record in the Users table.
Here are some typical properties of a lookup table:
It is fairly small and rarely updated.
Many tables point to the table, but it does not point to another table.
The table contains a set of unique values.
To identify an individual table as a lookup table:
In Table View or Graph View, click the table.
On the table details panel, from the Select Table Type dropdown list, select Lookup Table.
To identify multiple tables as lookup tables:
On Table View, check the checkbox for each table to identify as a lookup table.
From the Actions dropdown list, select Add Lookup Tables.
To remove the lookup designation for a table:
In Table View or Graph View, click the table.
On the table details panel, from the dropdown list, select Remove.
Records that reference required subset records are considered upstream records. Unlike downstream records, upstream records are not required for referential integrity. Upstream records are optional.
To reduce the size of the subset, you can apply a filter to these optional records. To filter the records, you can either:
Use a date column to specify an amount of time before the current date for which to include records. For example, you can only include records for which the update date is one week before the current date.
Use a WHERE clause to identify the records to include.
You can filter a table that contains both upstream and downstream records. However, the filter only applies to the optional upstream records.
In the table list, when an upstream table is filtered, a Filtered icon displays.
The date filter allows you to filter optional records based on the value of a date-based column.
To filter an upstream table by date:
In Table View or Graph View, click the table. On the table details panel, under Filter Table, Select table filter is set by default to None, which indicates that the table is not filtered.
From Select table filter, select Filter By Date Column.
From the Date Column dropdown list, select the date column to use for the filter. To improve performance, select a column that is indexed.
Under Get data from the last, from the time unit dropdown list, select the unit of time to use for the filter. You can filter records based on their age in days, weeks, months, or years.
In the field, enter the number of the selected unit before the current date for which to include the upstream records. For example, you select days as the unit, and set the number to 4. Structural then pulls related records for which the date column value is up to 4 days before the current date.
You can also filter the upstream records using a WHERE clause.
To filter the upstream records using a WHERE clause:
In Table View or Graph View, click the table. On the table details panel, under Filter Table, Select table filter is set by default to None, which indicates that the table is not filtered.
From Select table filter, select Filter by Where Clause.
In the Where Clause text area, enter the WHERE clause to use to filter the related records.
For a more complex WHERE clause, you can display an editor with a larger text area.
Click Open in Editor.
In the text area, enter the WHERE clause.
Click Save.
To copy the WHERE clause to the clipboard, click Copy To Clipboard.
To remove an upstream filter, from Select table filter, select None.
As you make changes to the subsetting configuration, Table View and Graph View indicate how the changes will affect the next subsetting generation compared with the most recent subsetting generation.
When a table's inclusion in the subset is affected, on Graph View, a colored marker is added to the bottom of the table box.
On Table View, a colored icon displays next to the table. A tooltip indicates the type of change.
The possible types of changes are:
Added to the subset. For example:
A new target table
A table that is newly included because it is related to a new target table
A new lookup table
Removed from the subset. For example:
A removed target table
A table that is removed because it is related to a removed target table
A removed lookup table
Modified in the subset. This usually reflects a change to a target table configuration. You might:
Change the type of target table (percentage or WHERE clause)
Change the percentage
Change the WHERE clause
Change the upstream filter
When you run a subsetting generation, Tonic clears the markers.
Tables other than target tables, lookup tables, or related tables are not in the subset.
The subsetting configuration determines how Structural copies these tables to the destination database.
You can either:
Use the table modes that are assigned to the out-of-subset tables.
Truncate all of the out-of-subset tables. The table schema is preserved, but none of the data is copied to the destination database.
On Table View, on the Configuration tab, you use the Process tables that are out of subset toggle to determine how to handle these tables. After you run subsetting data generation, the toggle is on the Options tab.
By default, the setting is turned off, and Structural truncates the out-of-subset tables.
To use the assigned table mode to process each table, toggle the setting to the on position.
If you configured subsetting, then when you run a data generation job, you can either generate the entire dataset, or use the subsetting configuration to generate a subset.
On Table View, on the Configuration tab, the Use Subsetting toggle indicates whether to generate a subset. After you run subsetting data generation, the Use Subsetting toggle is on the Options tab.
By default, the toggle is in the off position. When you run a data generation job, it generates the entire destination dataset.
To instead generate a subset, toggle Use subsetting to the on position.
When you run a data generation job, you are also prompted to confirm whether to generate the entire dataset or a subset. These two toggles are synchronized. If you turn on the Use Subsetting toggle on the Configuration tab, then it is on by default on the generation confirmation panel.
You can sometimes use parallel processing to improve the performance of the subsetting process. Parallel processing allows multiple subsetting steps to be processed at the same time. The steps cannot rely on the output of other steps that are processed in parallel.
To enable parallel processing for subsetting, set the environment setting TONIC_TABLE_PARALLELISM to a number greater than 1 (the default). You can configure this setting from Tonic Settings. This setting determines the maximum number of subsetting steps that Structural can process in parallel. For regular data generation, it also determines the number of tables that Structural operates on at the same time.
The effect of subsetting parallelism on performance depends on your subsetting configuration, the layout of your schema, the performance characteristics of the machine that runs Structural, and the performance characteristics of your databases.
We recommend that you start with a relatively small number such as 4, and then run a data generation job to see how it affects performance. If performance improves, you can increase the number incrementally until the performance no longer improves.
The environment setting only controls the maximum number of steps that can be processed in parallel. Performance should not degrade if your system cannot support parallelism or won't benefit from using it.
If you have any other questions, contact support@tonic.ai.
Your data might be stored in separate but related databases. In Tonic Structural, each database provides the source data for a different workspace.
For example, a Users database contains a list of users. Each service also has a separate database. The Service1 and Service2 databases refer to identifiers of users from the Users database, but there are no direct foreign key relationships.
When you generate a subset from each database, you might want to ensure that the resulting data is complete and cohesive. For example, your application connects to and pulls data from each database. This means that your end-to-end testing also requires corresponding data from each database.
To continue the previous example, you generate subsets from the Users, Service1, and Service2 databases. Your application pulls data from each database. For the data to be complete and have referential integrity, the subsets from the Service1 and Service2 service databases should only contain records that refer to the users in the subset from the Users database.
Here are some options for generating subsets from separate databases that produce data that is complete and cohesive:
In all cases, when you generate subsets across different databases, you must use consistency to ensure that common columns have the same values in each subset.
One way to produce complete and cohesive data across databases is to use deterministic WHERE clauses in your target table configuration. A deterministic WHERE clause always produces the same results, and is never random.
A percentage is not deterministic. Structural selects a specific number of records, but selects those records at random.
Not all WHERE clauses are deterministic. For example, the Users, Service1, and Service2 databases each have a TotalValue column that reflects the total spent as a whole and for each service. Filtering based on TotalValue does not guarantee that you get a cohesive set of records.
Instead, provide a WHERE clause that can be used in each database to produce a cohesive set of records across the databases. For example, use a WHERE clause to look for a specific set of UUID values in each database.
In our example, if we target the same set of user UUIDs in the Users, Service1, and Service2 databases, we produce a complete and cohesive set of records for those users.
When you use a deterministic WHERE clause in each database, you can run the subsetting jobs independently.
This is somewhat similar to using a deterministic WHERE clause. It is one way to provide input to create a deterministic WHERE clause.
You can run a query outside of Structural, and then use the results as input to the subset configuration. For example, you could run a query to identify users that are located in the United States.
One way to use the results is to store them in a database location that is accessible to each workspace, and then reference that location in the WHERE clauses. You could also return the results as a hard-coded list, and create WHERE clauses that use an IN() filter that contains the hard-coded values. You could even use the Structural API to update the WHERE clause values as part of an automated process.
This method allows you to run the subsetting jobs independently.
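As an illustrative sketch of the hard-coded-list approach, the external query results can be turned into an IN() filter string to apply to each workspace's WHERE clause. The function and column names here are hypothetical, not part of Structural:

```python
def build_in_clause(column: str, values: list) -> str:
    """Build a SQL IN() filter from a list of literal values.

    Values are quoted as string literals. This sketch is intended for
    trusted, internally generated ID lists, not for arbitrary input.
    """
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"{column} IN ({quoted})"

# Example: user UUIDs selected by an external query (hypothetical values)
clause = build_in_clause("user_id", ["a1b2", "c3d4"])
```

The same clause string can then be set in each workspace, either manually or through the Structural API as part of an automated process.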
Another option is to run the jobs on the workspaces serially. The results of a job on one workspace feed into the job on the next workspace.
To do this, you run the first job, which can have a target percentage or a non-deterministic WHERE clause.
After this job completes, use the results as input to a WHERE clause in the second workspace. For example, the results might be a set of user ID values.
Depending on how the databases are set up, you might be able to query the results directly. For fully isolated databases, you could export the list and hard-code it in the WHERE clause of the second workspace.
You can only use this option if the relevant column values are not changed by the generation process. If the column has a generator applied, then the output column values from the first database do not exist in the source column values in the second database.
The previous options are ideal for when the related databases are completely isolated from each other.
However, in some cases you can connect different database instances directly to query across them. Many database engines provide this capability, such as:
Oracle database links
SQL Server linked servers
PostgreSQL foreign-data wrapper with foreign server
If your environment allows and supports these mechanisms, then you can directly reference the external server in a query.
For columns that are common across all of the databases, you must ensure that a specific value in the source databases results in the same value in all of the destination databases.
To do this, you must assign a generator that supports consistency, and enable consistency on the column.
You must also configure Structural to ensure consistency across databases.
For more information, go to Enabling consistency.
The Subset Steps tab outlines the steps that Tonic Structural uses to create the subset based on the current configuration.
The steps include the processing of the target, lookup, and related tables. However, the steps do not necessarily correspond one-to-one to the in-subset tables. A table might appear in multiple steps in order to satisfy referential integrity.
The steps do not include the out-of-subset tables.
Each step includes:
The Configuration tab displays the results of the most recent subsetting run, as well as the subsetting configuration that was in place during each run. You can use the Previous Subsetting Runs tab to view details for any of the previous 100 subsetting runs.
You can use information about previous runs to see how changes to the subsetting configuration affect the subset results.
From the Select a previous subset run dropdown list, select the subsetting run to display the details for.
The runs are identified by the run date and time.
The details for a selected run include the following:
A summary of the run results.
Details about the subsetting configuration that was in place at the time of the run.
The panel at the left of the tab summarizes the results of the selected run. The run summary includes the following:
The status of the run (successful, failed, canceled).
The amount of time it took to complete the run.
For successful runs:
The percentage of source data included in the destination database
The volume of data in the source database
The volume of data in the destination database
To display the job details for the run, click View Job Details.
The subsetting configuration reflects the configuration that was in place at the time of the selected run. It is read-only. To make adjustments to the subsetting configuration, return to the Configuration tab.
Previous Subsetting Runs only displays Table View.
The panels above the table list show the number of target tables, lookup tables, related tables, and out-of-subset tables.
For each table, the list indicates whether the table is in the subset. It identifies the target and lookup tables. To view the subset configuration for an individual table, click the row.
The subsetting process uses foreign keys to navigate the relationships in your data. It uses these relationships to identify the data to include in the subset. Without foreign keys, Tonic Structural does not know how to navigate the relationships in your data. Properly configured foreign keys allow Structural to select the necessary rows from other tables, which ensures referential integrity.
Foreign keys are often set up directly within the source database. You can also set up virtual foreign keys within Structural. For example, a foreign key relationship might be missing, or your database might not use foreign keys. If your database uses polymorphic keys, then you must use the foreign key upload to add those keys manually.
For information about Foreign Keys view, including how to create and upload virtual foreign keys, go to Viewing and adding foreign keys.
You can also add virtual foreign keys from Subsetting view.
For example, on Graph View, you might notice that a relationship between tables is missing. You can immediately add a virtual foreign key to establish that relationship.
To create a virtual foreign key from Subsetting view:
Display the table details panel for the table that contains the foreign key.
Click Create Virtual Foreign Key.
Under Foreign Key from this Table, select the column in the current table that contains the foreign key. To find the foreign key column, begin to type the column name.
Under Primary Key in Another Table, select the column that contains the primary key. To find the primary key column, begin to type the column name or the name of the table.
Click Save.
Foreign key relationships can sometimes have circular dependencies, also referred to as cyclical dependencies.
In the simplest case, a circular dependency occurs when two tables each contain a foreign key that references the other table. In the following example, the Employees table contains a department_id foreign key column that references the Departments table, and the Departments table contains a manager_id foreign key column that references the Employees table.
Circular dependencies can also come from a much longer chain of references, where you follow references through several tables before returning to the original table.
Circular dependencies can also occur when a table references itself. In the following example, the Employees table contains a manager_id foreign key column that contains an employee ID value from the id column.
During subsetting, if the circular dependency isn't broken, then there is an endless loop of going back and forth between the tables that reference each other.
To break a circular dependency, Structural identifies a foreign key column that is nullable, and sets its values to NULL. When the process reaches a NULL value, it stops looking for additional related records. Structural applies the minimum number of NULL values that are needed to break the circular dependencies.
If none of the foreign key columns are nullable, then the circular dependency cannot be broken, and the subset generation fails.
Tonic can detect circular dependencies before you run subsetting.
When a table contains foreign keys that are part of a circular dependency that Structural breaks:
On Graph View, a Cycle Break marker is added to the table object. The marker includes the name of the foreign key column.
The table details panel also indicates that there is a cycle break and lists the affected columns.
Foreign keys define relationships between tables. The value of a foreign key column in a table is the primary key of a row from a different table. For example, a transactions table includes a customer_id column. The value of customer_id is a primary key value from the id column in the customers table. A table can also have composite foreign keys that consist of multiple columns.
Tonic Structural uses foreign keys when it generates subsets and when it applies generators to primary or foreign keys.
During data generation, when generators are assigned to primary key columns, Structural ensures that the foreign keys are synchronized with the primary keys.
When Structural creates a subset, it uses foreign keys to identify the related tables and rows to include in the subset.
Often, foreign key relationships are defined in the source database. When you have missing relationships or cannot define them in the source database, Structural offers a virtual foreign key tool to allow you to add additional foreign keys to ensure that all relationships are maintained. Structural only uses these virtual foreign keys during the generation process. It does not write the virtual foreign keys to the destination database.
From the Foreign Keys view, you can view the current foreign keys (all license tiers) and add virtual foreign keys (Professional and Enterprise tier only). To display the Foreign Keys view:
On the workspace management view, in the workspace navigation bar, click Foreign Keys.
On Workspaces view, from the dropdown menu in the Name column, select Foreign Keys.
On the Foreign Keys view, the View Foreign Key Relationships tab contains the list of foreign keys in the source database.
For each foreign key:
Foreign Key contains the names of the columns (tableName.columnName) that contain the foreign key values.
Primary Key contains the name of the column (tableName.columnName) that contains the primary key value used to populate the foreign key column.
Virtual foreign keys that you added are displayed with a checkbox.
You can delete those keys. You cannot delete keys that are defined in the source database.
You can filter the foreign keys by the name of the foreign key column or the primary key column. In the filter field, begin to type text that is in the column name. As you type, Structural filters the list.
You can sort the foreign keys by the name of the foreign key column or primary key column.
To sort the list:
Click the Sort dropdown for the column that you want to use to sort the list.
On the sort panel, click the sort order to use.
Required license: Professional or Enterprise
Required workspace permission: Configure virtual foreign keys
Tonic allows you to add virtual foreign keys to your source database. You would use this feature to add a specific foreign key that is missing, or if your source database does not use foreign keys.
You can add the foreign keys one at a time from the Add Foreign Key Relationships tab, or you can upload a JSON file that contains the foreign keys.
If your database uses polymorphic keys (typically if you have a Ruby on Rails application), then you must use the JSON file upload to configure those keys.
You cannot create virtual foreign keys from a child workspace. You can only create virtual foreign keys from a parent workspace.
You can also create virtual foreign keys from a table details panel in Subsetting view.
You can configure virtual foreign keys from the Add Foreign Key Relationships tab. You cannot configure polymorphic keys here. Polymorphic keys must be uploaded from a JSON file.
To add virtual foreign keys to your source database:
Under Select Foreign Keys, check the checkboxes to identify the foreign key fields. These are the fields that contain a value that is a primary key from another table.
The Select Foreign Keys list contains the columns that are not already configured as foreign key columns.
The top level of the Select Foreign Keys list displays the unique column names. This is the column name only, without the table name. Next to each column name is the number of times that it appears in the source database.
You can use the sort dropdown list to sort the list either by the column name or by the number of times the column appears.
You expand the column name to display the list of columns that have that name. This list uses the tableName.columnName format.
For example, a database has a customer_id column in both the sales and customers tables. On the Select Foreign Keys tab, the top-level entry is customer_id. Under customer_id are entries for sales.customer_id and customers.customer_id.
As you select and deselect columns, they are added to or removed from the Foreign Key Preview list. Under Create New Foreign Key, the number of keys to add is also updated. From Foreign Key Preview, to remove a selected column, click its delete icon. This performs the same function as unchecking the checkbox in the Select Foreign Keys list.
From the Select Primary Key dropdown list, select the column that provides the values for the selected foreign key columns.
To create the virtual foreign keys, click Create n foreign keys. n is the number of keys that are created, based on the number of foreign key columns that you selected.
You can upload a JSON file that contains the virtual foreign keys. For example, you can create a JSON file that can be used to populate virtual foreign keys in multiple workspaces that have the same source data structure.
If you already have configured virtual foreign keys, then the uploaded virtual foreign keys replace the existing ones.
The virtual foreign key JSON also allows you to add polymorphic keys. You cannot add polymorphic keys from the Add Foreign Key Relationships tab.
On the Foreign Key Relationships view, to upload a foreign keys file:
Click Upload Foreign Key JSON. If you already have virtual foreign keys configured, then the button is Update Foreign Key JSON.
On the upload dialog, to search for and select the file, click Browse.
After you select the file, click Upload.
The uploaded keys are added to the View Foreign Key Relationships list. Those keys replace any existing virtual foreign keys.
The foreign key JSON is an array of foreign key entries. Here is an example of a foreign key file that contains a single entry:
To illustrate the field values, we'll use the following example, which reflects the example entry above. A paystubs table lists the pay stubs that were issued to employees. paystubs contains an employee_id field, which identifies the employee that received the pay stub. employee_id is a foreign key: it contains the value of the id field in the employees table, which is the primary key field for that table. Both paystubs and employees are in the public schema.
In the foreign keys JSON, each entry contains the following fields.
fk_schema - The name of the schema for the table that contains the foreign key. For our example, fk_schema is public.
fk_table - The name of the table that contains the foreign key. For our example, fk_table is paystubs.
fk_columns - An array that contains the names of the foreign key columns. In our example, the fk_columns array contains a single value, employee_id.
target_schema - The name of the schema for the table that contains the referenced primary key. In our example, target_schema is public.
target_table - The name of the table that contains the referenced primary key. In our example, target_table is employees.
target_columns - An array that contains the names of the primary key columns. In our example, the target_columns array contains a single value, id.
The ability to provide multiple columns in fk_columns and target_columns is used to support composite foreign keys. fk_columns and target_columns must contain the same number of columns. The corresponding columns must be in the same order in both arrays.
For example, a sales table contains sales_person_id and sales_manager_id, which refer to the id and manager_id columns in the employees table.
In the JSON:
fk_table is sales, and fk_columns is [sales_person_id, sales_manager_id].
target_table is employees, and target_columns is [id, manager_id].
The entry for this example would look like:
Some application types have polymorphic keys. Polymorphic keys allow a single column in one table to contain foreign key values that refer to primary keys from multiple other tables. These types of keys cannot be represented in a traditional relational database, but are common in application frameworks such as Ruby on Rails.
For example, a person can have multiple addresses, and a company can have multiple addresses. To support this without complicated joins between tables, the addresses table includes the following columns:
A column that contains the identifier of the company or the person that the address belongs to.
Another column that identifies whether the identifier is a company or a person.
For example, the people table and the companies table each contain an id column that uniquely identifies the person or company. The addresses table contains the address details, along with the two owner columns described next.
In the addresses table, to identify the address owner, the address_owner_id column contains an id value from either the people or companies table. The address_owner_type column identifies whether the identifier is a person or a company.
In the example, the value of address_owner_id is 1 for both address records. However, address 1 belongs to John Doe (a person), and address 2 belongs to My Company (a company).
Each entry in the polymorphic keys JSON identifies the fields that contain the foreign key values and the foreign key type. It also lists the foreign key types, and identifies the source of the identifier for that foreign key type.
The following is the JSON for the example above:
Each entry contains the following fields:
fk_table - The name of the table that contains the foreign key values. In our example, this is the addresses table.
fk_schema - The name of the schema for the table that contains the foreign key. In our example, the schema is public.
fk_columns - An array that contains the names of the columns that contain the foreign key values. In our example, the value is address_owner_id.
nullable - Whether the foreign key column is nullable.
polymorphic_target - Identifies the target types and the identifier source for each type. polymorphic_target contains the following fields:
fk_type_column - In the table that contains the foreign key, the name of the column that contains the foreign key type. In our example, this is the address_owner_type column in addresses.
types - A list of the target types. Each entry in types identifies the name of the type. In our example, the types are Person and Company. Note that these are the values of the type column in the polymorphic table, not necessarily the names of the tables that they point to. For example, the Person type refers to the people table.
Each type has the following attributes:
target_schema - The schema that contains the target table. In our example, the tables for both types belong to the public schema.
target_table - The table that contains the primary key value. In our example, for the Person type, the target table is people. For the Company type, the target table is companies.
target_columns - An array that contains the column that contains the primary key value. In our example, the name of the identifier column in both tables is id.
If you created virtual foreign keys, then you can download those keys to a JSON file. For example, you might want to upload the same set of virtual foreign keys to another workspace that uses the same source data.
To download the virtual foreign keys, click Download Foreign Key JSON.
You can delete virtual foreign keys. You cannot delete foreign keys that are defined in the source database.
To delete an individual virtual foreign key, click its delete icon.
To delete multiple virtual foreign keys:
Check the checkbox next to each virtual foreign key to delete.
Click Bulk Delete.
Required license: Enterprise
Required workspace permission: View the Protection Audit Trail
On Privacy Hub, the Protection Audit Trail tracks the following actions related to detecting and protecting sensitive data:
A Tonic Structural sensitivity scan flags a column as sensitive.
A user changes the assigned table mode for a table.
A user manually flags a column as either sensitive or not sensitive.
A user changes the assigned generator for a column.
From Privacy Hub, a user uses the Sensitivity Recommendations option to either:
Apply the recommended generator to selected columns
Ignore the recommended generator for selected columns
Mark selected columns as not sensitive
For these updates, there is an entry for each updated column.
Target and lookup tables are added to or removed from the subsetting configuration.
A post-job script is added or removed.
For child workspaces, the list includes updates to the parent workspace that the child workspace inherits. The list also indicates when the child workspace either breaks or restores inheritance.
In addition to the workspace-specific updates, the Protection Audit Trail also tracks when a generator preset is created, updated, or deleted.
The Protection Audit Trail entries are grouped by the date on which an action occurred.
By default, the list shows 10 actions per page. To change the number of actions per page, select an option from the View dropdown list.
Each entry in the Protection Audit Trail list provides the following information:
The left side of the entry shows the affected area and the type of action. Depending on the action, the affected area is either:
A table
A column
Subsetting
Post-Job Scripts
The right side of the entry shows who performed the action.
For a sensitivity scan, this is Privacy Scan.
For an action that a user performed, this is the user email address. For Structural users, the entry also indicates the user's role in the workspace.
For child workspaces, for an update that the child workspace inherited, this is Inherits from parent configuration.
Required license: Enterprise
Required workspace permission: Download Privacy Report (to download the report)
In Tonic Structural, data privacy measures how well data is protected based on the applied generator and the generator configuration.
The Privacy Report captures details about the level of data protection for the data in a workspace.
As you configure the data protection, you can use a preview Privacy Report as a checkpoint to review the generators that you applied or to look for at-risk data.
You can export the preview from Structural before you run a generation, to increase your confidence or to confirm that the de-identification configuration is complete.
Every time you run a data generation job, Structural creates a Privacy Report to reflect the protection level at the time the job ran.
The Privacy Report consists of the following:
A .csv list of columns that includes column properties along with the privacy status and ranking
A set of charts that summarizes the privacy rankings for the columns
The Privacy Report includes the privacy status and the privacy ranking.
The privacy status reflects:
Whether a column is sensitive.
Whether a generator other than Passthrough is applied.
Whether the column is included in the destination data.
The possible values for privacy status are:
At-Risk - The column is sensitive, but has Passthrough as the assigned generator.
Protected - The column has a generator other than Passthrough assigned. A protected column could be either sensitive or not sensitive.
Non-Sensitive - The column is not sensitive, and has Passthrough as the assigned generator.
Not Included - The column is not included in the destination database. For example, for a truncated table, the columns are not included.
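These rules can be read as a simple decision procedure. The following Python sketch is purely illustrative (the function and parameter names are ours, not Structural's):

```python
def privacy_status(is_sensitive: bool, generator: str, included: bool) -> str:
    """Illustrative mapping from column attributes to the Privacy Report status."""
    if not included:
        return "Not Included"            # e.g. a column in a truncated table
    if generator != "Passthrough":
        return "Protected"               # protected columns may or may not be sensitive
    return "At-Risk" if is_sensitive else "Non-Sensitive"

print(privacy_status(True, "Passthrough", True))   # → At-Risk
print(privacy_status(False, "Name", True))         # → Protected
```

Note that sensitivity only matters for Passthrough columns: any applied generator yields Protected regardless of sensitivity.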
Privacy ranking indicates the level of protection for a column based on the assigned generator and the generator configuration. Privacy ranking does not consider whether the column is sensitive or not sensitive.
The privacy ranking for a column can be a number from 1 to 6. 1 indicates the highest level of data privacy, and 6 the lowest level.
The ranking is based on the following attributes:
Whether the generator uses differential privacy
Whether the generator is data-free
Whether the generator has consistency enabled
Whether the generator transforms all of the data in the column
The following table describes the rankings, and shows how generator attributes correspond to the rankings.
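Read as a decision procedure, the attributes determine the ranking roughly as follows. This is an illustrative Python sketch of the table's logic, not Structural's implementation; the function and parameter names are ours:

```python
def privacy_rank(differentially_private: bool, data_free: bool,
                 consistent: bool, all_data_transformed: bool,
                 passthrough: bool) -> int:
    """Illustrative reading of the ranking table: 1 = strongest, 6 = weakest."""
    if passthrough:
        return 6                         # no generator assigned: data is not protected
    if not all_data_transformed:
        return 5                         # sub-field generators may leave data untouched
    if consistent:
        return 4                         # consistent output is theoretically reversible
    if data_free:
        return 1                         # output is fully unlinked from the source
    if differentially_private:
        return 2                         # differential privacy obscures individual points
    return 3                             # irreversible, but may reveal existing values

print(privacy_rank(False, True, False, True, False))   # → 1
```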
The Privacy Report .csv file contains summary statistics and field level details. The table is also included in the downloadable PDF that contains the privacy ranking charts.
Here is a stylized version of the report that shows the column groupings:
The fields for each row in the Privacy Report fall into the following categories.
The Privacy Report includes all of the schema detail that is viewable in the Structural application, such as in Database View and Table View. The schema in the destination matches the schema in the source.
The schema information is contained in the following columns:
Schema - Schema name from the source database.
Table - Table name from the source database.
TableMode - The table mode that is currently applied to the table.
Column - Column name from the source database.
DataType - Data type that is detected in the source database.
Data sensitivity reflects attributes such as:
Whether the data includes personally identifiable information (PII)
Whether the data is regulated by law
Whether the data is business confidential
It affects decisions on how to protect the data.
During the sensitivity scan, Structural identifies suspected sensitive fields. You can also manually indicate that a column is sensitive or not sensitive.
The data sensitivity information is contained in the following columns:
Tonic Detected Sensitivity - Indicates whether the Structural sensitivity scan identified the column as sensitive. TRUE indicates that Structural identified the column as sensitive. FALSE indicates that Structural did not identify the column as sensitive.
Current Sensitivity - Indicates whether the column is currently identified as sensitive. If you did not make a manual change to the sensitivity, then Current Sensitivity matches Tonic Detected Sensitivity. TRUE indicates that the column is currently identified as sensitive. FALSE indicates that the column is currently identified as not sensitive.
SensitiveType - For fields that Structural identifies as sensitive, the detected data type. For example, Structural detects a field of type Address that might be sensitive. For fields that you manually identify as sensitive, SensitiveType is Manual.
Structural generators protect sensitive information while maintaining usefulness of the data for data consumers.
The protection section of the Privacy Report provides key details about how the masking transformations protect data.
The protection information is contained in the following columns:
Generator - The generator that is currently applied to the column. For information about how each generator transforms data, go to the Generator reference.
ProtectionType - Indicates the level of protection provided by the assigned generator and generator configuration. The possible protection type values are:
Masked - Applied to columns that have a generator other than Passthrough assigned. The selected generator provides some protection against viewing the source data. If both IsDifferentiallyPrivate and IsDataFree are FALSE, then ProtectionType is Masked. Consistency decreases the protection level. If consistency is enabled, then ProtectionType is Masked.
Anonymized - Applied to columns for which the assigned generator and generator configuration are guaranteed against reverse engineering. The assigned generator either uses differential privacy, or is considered data-free, where the output data is completely unlinked from the source data. The assigned generator does not have consistency enabled.
IsDifferentiallyPrivate - Indicates whether the assigned generator supports differential privacy and differential privacy is enabled. TRUE indicates that both of these are true. FALSE indicates that either the assigned generator does not support differential privacy, or differential privacy is not enabled.
Differential privacy guarantees the highest level of privacy, and eliminates the ability to re-identify the data.
IsDataFree - Indicates whether the assigned generator uses the underlying data. If the output data is completely unlinked to the source data, the generator is considered data-free, with a high degree of protection.
IsConsistent - Indicates whether consistency is enabled for a given field. This is also set to TRUE if the generator is always consistent. Consistency ensures that a given input always results in the same output. It retains data utility at the cost of a lower level of protection. When consistency is on, ProtectionType is Masked instead of Anonymized. For more information, go to Privacy Status.
ConsistencyColumn - In some cases, a column is configured to be consistent to another column. If the consistency is to another column, then ConsistencyColumn contains the name of that column.
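As an illustration of the consistency concept, a keyed hash gives the same masked token for the same input on every run, which is what preserves joins across tables. This is a conceptual sketch only, not how Structural generators are implemented, and the key value is hypothetical:

```python
import hashlib
import hmac

def consistent_token(value: str, key: bytes = b"workspace-secret") -> str:
    """Same input + same key always produces the same output token."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:12]

# The same source value maps to the same masked value on every run,
# so joins on this column across tables still line up.
assert consistent_token("alice@example.com") == consistent_token("alice@example.com")
assert consistent_token("alice@example.com") != consistent_token("bob@example.com")
```

The trade-off the documentation describes is visible here: determinism keeps the data useful, but a repeatable mapping is weaker protection than output that is fully unlinked from the source.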
Privacy indicates how well the protection measures actually protect the source data.
The privacy information is included in the following columns:
ColumnPrivacyStatus - The privacy status of the column. Reflects whether a generator is applied, whether the column is sensitive, and whether the column is included in the destination database.
ColumnPrivacyRank - The privacy ranking of the column. Reflects the applied generator and the generator configuration. Does not reflect whether the column is sensitive or included.
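Because the report is a flat .csv file, it is straightforward to post-process. The following hypothetical Python sketch (the file name is an assumption; the column headers follow the field names described above) lists any remaining at-risk columns, which could gate a CI/CD step:

```python
import csv

def at_risk_columns(path: str) -> list[str]:
    """Return 'Schema.Table.Column' for rows whose ColumnPrivacyStatus is At-Risk."""
    with open(path, newline="") as f:
        return [
            f"{row['Schema']}.{row['Table']}.{row['Column']}"
            for row in csv.DictReader(f)
            if row["ColumnPrivacyStatus"] == "At-Risk"
        ]

# Example: fail a pipeline step if the report still lists unprotected sensitive columns.
# risky = at_risk_columns("privacy_report.csv")
# assert not risky, f"Unprotected sensitive columns: {risky}"
```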
The Privacy Report privacy ranking charts summarize the privacy ranking values for the workspace data.
The privacy ranking charts are provided in a downloadable PDF file. The file also includes the Privacy Report table, which contains the same content as the .csv file.
The first page of the file contains definitions of the privacy ranking values.
The PDF then contains two sets of charts:
The first set of charts summarizes the privacy ranking values for all columns. It includes all of the privacy rankings from 1-6.
The second set of charts summarizes the privacy ranking values for columns that have an assigned generator. It does not include privacy ranking 6, which is assigned to columns that do not have an assigned generator.
Each set of charts contains:
A donut chart that displays the number of columns and the relative number of columns with each privacy ranking.
A bar chart that shows the number of columns with each privacy ranking.
For each privacy ranking, a summary that includes:
The percentage of columns with that ranking.
The number of columns with that ranking.
On the job details view, the Privacy Report tab summarizes the privacy status for the columns that are included in the destination data. It does not reflect columns that were excluded, such as columns in truncated tables.
It shows the number of columns that are At-Risk, Protected, and Non-Sensitive.
From Privacy Hub and the workspace download menu, you can download a Privacy Report .csv or PDF file that reflects the current workspace configuration.
These reports indicate how well your data would be protected if you generated data with that configuration.
From the workspace management view, click the download icon, then:
To download the Privacy Report PDF file, click Download Privacy Report PDF.
To download the Privacy Report .csv file, click Download Privacy Report CSV.
From Privacy Hub, click Download, then:
To download the Privacy Report .csv file, click Privacy Report CSV.
To download the Privacy Report PDF file, click Privacy Report PDF.
From the job details view for a data generation job, you can download a Privacy Report .csv or PDF file that reflects the workspace configuration at the time of data generation.
These reports indicate how well your data was protected by that configuration.
On the job details view, to display the download options, click Download.
In the download menu:
To download the Privacy Report .csv file, click Privacy Report CSV.
To download the Privacy Report PDF file, click Privacy Report PDF.
Required workspace permission: Resolve schema change warnings
A database schema can evolve over time. For example, a table is added, a column is removed, or a column data type changes.
It's important that you are aware of these changes and that you update your data generation configuration to address these changes.
In some cases, if you don't update the configuration, then sensitive data might be leaked. For example, when a new column is added, by default the generator is Passthrough. If you do not assign a different generator, then the next time you generate data, the source data is copied to the destination database without being masked.
In other cases, the data generation fails if you don't update the configuration. For example, a column changes its data type from integer to string. If the column is assigned the Random Integer generator, the data generation fails.
Tonic Structural monitors your source database to look for changes to the data schema. It alerts you to those changes, and allows you to acknowledge or resolve the changes. You can also configure your workspace so that you cannot generate data when there are unacknowledged or unresolved schema changes.
Structural detects the following schema changes.
Conflicting schema issues can cause the data generation to fail if they are not resolved.
Structural detects the following conflicting schema issues:
Table is removed from the schema
Column is removed from the schema
Column changes data type
Column changes nullability, for columns that are assigned the NULL generator
A column that has an assigned generator becomes a foreign key. Foreign key columns must inherit the generator from the primary key.
Required license: Professional or Enterprise
Non-conflicting schema changes do not cause the data generation to fail. However, to prevent leakage of sensitive data, you should address these changes before you generate data.
Structural detects the following non-conflicting schema changes:
Table is added to the schema. This includes new file groups that you add to a file connector workspace.
Column is added to the schema
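The distinction between conflicting and non-conflicting changes can be illustrated by diffing two snapshots of a table's columns. This is a hypothetical sketch; Structural's actual detection logic is internal:

```python
def classify_changes(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare {column_name: data_type} snapshots of one table.

    Removed columns and type changes can break generation (conflicting);
    added columns default to Passthrough and risk leaking data (non-conflicting).
    """
    conflicting = [c for c in old if c not in new]                    # removed columns
    conflicting += [c for c in old if c in new and old[c] != new[c]]  # type changed
    non_conflicting = [c for c in new if c not in old]                # added columns
    return {"conflicting": conflicting, "non_conflicting": non_conflicting}

print(classify_changes({"id": "int", "age": "int"},
                       {"id": "int", "age": "text", "email": "text"}))
# → {'conflicting': ['age'], 'non_conflicting': ['email']}
```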
When you navigate to a workspace in Structural, Structural always runs a scan to check for schema changes.
Structural can also run a periodic schema change detection scan in the background.
For databases other than Databricks, Snowflake on AWS, Snowflake on Azure, and MongoDB, Structural by default runs a background scan every two hours.
For Databricks, Snowflake on AWS, Snowflake on Azure, and MongoDB, Structural does not run any periodic scans. The data structure for these databases makes it expensive to run them. Instead, you can enable a daily schema change detection scan.
For information on how to configure whether and when Structural runs the periodic or daily detection scans, go to #schema-changes-detection-configure.
For data connectors other than Databricks, Snowflake on AWS, Snowflake on Azure, and MongoDB, you use the following environment settings to configure the periodic schema change detection. You configure the settings in the web server container:
TONIC_ENABLE_QUICK_PERIODIC_SCHEMA_CHANGE_SCANS - Boolean that indicates whether to enable the periodic background schema change scan. The default is true.
TONIC_PERIODIC_QUICK_SCHEMA_CHANGE_SCAN_INTERVAL_IN_MINUTES - If the periodic background schema change detection is enabled, the number of minutes between scans. The default is 120, which runs the schema change detection every two hours.
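Because these are ordinary environment variables on the web server container, the effective values, including the documented defaults, could be inspected as follows. This is an illustrative sketch; only the setting names and default values come from the documentation:

```python
import os

# Defaults mirror the documented behavior: scans enabled, every 120 minutes.
scans_enabled = os.environ.get(
    "TONIC_ENABLE_QUICK_PERIODIC_SCHEMA_CHANGE_SCANS", "true").lower() == "true"
interval_minutes = int(os.environ.get(
    "TONIC_PERIODIC_QUICK_SCHEMA_CHANGE_SCAN_INTERVAL_IN_MINUTES", "120"))

print(scans_enabled, interval_minutes)
```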
For Databricks, Snowflake on AWS, Snowflake on Azure, and MongoDB, use the following environment settings to enable and configure the daily schema change detection scan.
TONIC_ENABLE_DAILY_EXPENSIVE_SCHEMA_CHANGE_SCANS - Boolean that indicates whether to enable the daily schema change detection scan. The default is false.
TONIC_DAILY_EXPENSIVE_SCHEMA_CHANGE_SCANS_HOUR - If the daily schema change detection scan is enabled, the hour at which to run the scan. The value is an integer between 0 and 23, where 0 is midnight and 23 is 11:00 PM. For example, a value of 14 runs the scan at 2:00 PM every day. The default is 0.
Conflicting schema issues always prevent data generation. By default, non-conflicting schema changes do not block data generation.
However, you can configure Structural to always prevent data generation whenever there are any unacknowledged or unresolved schema changes.
To block data generation for any schema changes, on the Edit Workspace page, under Source Settings, toggle the Block data generation if schema changes detected setting to the on position.
The Workspaces view provides a summary of the unaddressed schema changes for each workspace. The Schema Changes view contains the complete list.
On the Workspaces view, the Schema Changes column shows the number of conflicting and non-conflicting schema changes.
To display a more detailed summary of the schema change detection, hover over the column. The summary includes the timestamp of the last schema scan, and a link to the Schema Changes view.
To display the Schema Changes view, either:
On the workspace management view, in the workspace navigation bar, click Schema Changes.
On Workspaces view, from the dropdown menu in the Name column, select Schema Changes.
To resolve an issue, you must have permission to perform the associated action.
The Conflicting Schema Issues list contains the schema changes that make your current workspace configuration invalid and that you have not yet resolved.
An issue is resolved when either:
You resolve the issue from the Conflicting Schema Issues list.
For columns that have nullability or data type changes, you change the assigned generator in Privacy Hub, Database View, or Table View.
If there are any unresolved conflicting schema issues, then data generation is blocked. If there are no conflicting schema issues, then the Conflicting Schema Issues section is not displayed.
For parent and child workspaces, for removed tables and columns, when a child workspace overrides the parent workspace configuration for the table or column, you must resolve the change in the child workspace.
If there is a conflicting change for the removed table or column in the parent workspace configuration, then regardless of whether the configuration is inherited, you must resolve the change in the parent workspace before the change is resolved for the child workspace.
For changes to column nullability or data type, you resolve the change separately in the child and parent workspaces. Depending on the configuration, the conflict might only exist in one of the workspaces.
For each issue, the list includes:
Table name
Column name, if the change affects a specific column
Description of the schema change
For changes to column data type or nullability, a link to Database View. The link filters Database View to display only that column.
Resolve button or Select dropdown list. For changes to column data type or nullability, the Select dropdown list allows you to either resolve the issue or update the column configuration. For a child workspace, if the issue must be resolved in the parent workspace, the button is Go to Parent. If you do not have access to the parent workspace, then the button is disabled.
The list does not include changed or removed columns for which the assigned generator is Passthrough.
For the following types of issues, you can only resolve the issue. Resolving the issues allows Structural to do the required cleanup to reflect the removal. For more information, go to #schema-changes-resolution-steps.
Removed table
Removed column
For these issues, to resolve the issue, click Resolve.
For a column that changed nullability or data type, you can use the Select dropdown list to either:
Resolve the issue. For more information, go to #schema-changes-resolution-steps.
Assign a different generator to the column and then resolve the issue.
For these issues:
Click Select.
To have Structural resolve the issue:
Select Reset to Passthrough.
On the confirmation dialog, click Resolve.
To select a different generator for the column:
Select Apply New Generator.
On the generator configuration panel, select and configure the generator. For detailed configuration options for each generator, go to the Generator reference.
When you change the generator configuration, the Mark Resolved button is enabled. To close the panel and also resolve the issue, click Mark Resolved.
For a child workspace, if the issue must also be resolved in the parent workspace, then the button changes to Go to Parent.
To resolve all of the issues:
Click Resolve All Issues.
On the confirmation dialog, click Resolve All.
For more information, go to #schema-changes-resolution-steps.
For a child workspace, for issues that must also be resolved in the parent workspace, the button changes to Go to Parent.
To resolve conflicting issues, other than for the columns that you assign a new generator to, Structural takes the following actions:
Removes the configuration for the affected table or column.
For a column that has a changed data type or nullability, Structural resets the generator to Passthrough.
Removes the links to the affected columns. The columns that were linked to the affected columns otherwise keep their current configuration.
Required license: Professional or Enterprise
For Basic license users, if you know that there are non-conflicting changes, you can run a new sensitivity scan to get the protection status of the new columns.
The Notifications list contains schema changes that do not make the current configuration invalid. These changes are new tables and new columns.
If there are non-conflicting schema changes, then data generation is blocked only if you configured your workspace to block data generation for all unaddressed schema changes.
If there are no non-conflicting schema changes, then the Notifications list is not displayed.
Structural automatically dismisses a notification when:
You assign Truncate or Preserve Destination table mode to a new table.
You assign a generator other than Passthrough to a new column.
Structural does not automatically dismiss non-conflicting schema changes in a child workspace, even if the parent workspace configuration is updated. You always dismiss the changes separately in the parent and child workspaces.
Dismissed notifications are removed from the list. Dismissing a notification does not change your workspace configuration.
For each notification, the list includes:
Table name
Column name, for new columns
Description of the schema change
A link to Database View. The link automatically filters Database View to only display the affected table or column.
Dismiss button or Select dropdown list. For new columns, the Select dropdown list allows you to either dismiss the notification or assign a generator to the column.
For a new table, the only option is to dismiss the notification. To dismiss the notification, click Dismiss.
For a new column, you can use the Select dropdown list to either:
Dismiss the notification.
Assign a different generator to the column and then dismiss the notification.
For a new column:
Click Select.
To dismiss the notification, select Dismiss Notification.
To assign a generator for the column:
Select Apply New Generator.
On the generator configuration panel, select and configure the generator. For detailed configuration options for each generator, go to the Generator reference.
When you change the generator configuration, the Dismiss button is enabled. To close the panel and also dismiss the notification, click Dismiss.
To dismiss all of the notifications:
Click Dismiss All Notifications.
On the confirmation dialog, click Dismiss All.
Whenever there are schema changes, especially new tables and columns, it is important to determine whether those new tables and columns contain sensitive data.
By default, Structural copies all rows from a new table, and new columns are assigned the Passthrough generator, which copies the source data as is to the destination database.
From Privacy Hub, you can run a new sensitivity scan. You can then use the updated results to guide the table and column configuration.
Required workspace permission: Run data generation
The data generation job uses the configured table modes and generators to transform the data in the source database or source files. The transformed data is used to create the destination database or to write transformed files to file storage.
In the simplest type of data generation, Tonic Structural uses the configured table modes and generators to transform data in the source database and write the transformed data to the destination location. The destination location is usually a database server, but might also be:
A storage location such as an S3 bucket
A container repository
A Tonic Ephemeral snapshot
Required license: Professional or Enterprise
After the initial data generation, Structural runs an upsert job to add or update the appropriate records from the intermediate database to the destination database. The upsert job only adds and updates records. It does not remove any records from previous data generation jobs.
Before Structural can run an upsert job, the destination database must already exist and have the correct schema defined. To initialize the destination database:
Disable upsert.
Run a regular data generation.
Re-enable upsert.
To start the data generation, at the top right of the workspace management view, click Generate Data.
As you configure the data generation options, Structural runs checks to verify that you can use the current configuration to generate data.
If any of these checks do not pass, then when you click Generate Data, Structural displays information about why you cannot run the data generation job.
If all of those checks pass, then when you click Generate Data, if there are no warnings, the Confirm Generation panel displays.
Data generation is always blocked by conflicting schema changes.
The workspace configuration includes whether to block data generation for all schema changes, including non-conflicting changes.
If this setting is turned off and there are non-conflicting schema changes, then when you click Generate Data, a warning displays. Non-conflicting schema changes include new tables and columns. If new columns contain sensitive data and you do not assign generators before you generate data, that sensitive data will be copied to the destination database.
If you are sure that the data in the new tables and columns is not sensitive, then to continue to the Confirm Generation panel, click Continue to Data Generation.
The Confirm Generation panel allows you to confirm the details for the data generation. If subsetting is configured, you can determine whether to generate the subset. Structural can also provide tips on how to improve the data generation performance.
If you configured subsetting, then you can indicate whether to only generate the subset.
To create a subset based on the current subsetting configuration, toggle Use Subsetting to the on position.
The initial setting matches the current setting in the subsetting configuration. If Use subsetting is turned on in the Subsetting view, then it is on by default on the Confirm Generation panel.
When you change the setting on the Confirm Generation panel, it also updates the setting on the Subsetting view.
If upsert is enabled for the workspace, then you can also determine whether to use upsert for data generation.
If upsert is enabled for the workspace, then by default Use Upsert is in the on position.
To not use upsert, toggle Use Upsert to the off position. When upsert is turned off, the data generation is a simple data generation that directly populates and replaces the destination database.
Tonic.ai has released an improved version of the data generation process. We are enrolling Structural instances in the new process. Tonic.ai will contact you before we enroll your instance.
After your instance is enrolled, your PostgreSQL workspaces always use the new data generation process. For the new process, the job type is Data Pipeline Generation instead of Data Generation.
If your instance is not yet enrolled, then on the Confirm Generation panel, to use the new data generation process, toggle Data Pipeline V2 to the on position.
When upsert is enabled, the Confirm Generation panel provides access to the connection information for the intermediate database. To display the intermediate database connection details, click Intermediate Upsert Database.
If the intermediate database information is incorrect, to navigate to the workspace configuration view to make updates, click Edit Intermediate.
The Confirm Generation panel provides the destination information for the workspace. To display the destination database connection details, click Destination Settings.
Depending on the workspace configuration and data connector type, the destination information is either:
Connection information for a database server
A storage location such as an S3 bucket
Configuration for an Ephemeral snapshot
Information to create container artifacts
If the destination information is incorrect, to navigate to the workspace configuration view to make updates, click Edit Destination Settings.
Required global permission: Enable diagnostic logging
If the data connector is not configured to use diagnostic logging, then you can choose whether to enable diagnostic logging for an individual data generation job. The option is also available for data connectors that do not have a diagnostic logging setting.
On the Confirm Generation panel, to enable diagnostic logging for the job, toggle Enable Diagnostic Logging to the on position.
Access to diagnostic logs is also controlled by the Enable diagnostic logging global permission. If you do not have this permission, then you cannot download diagnostic logs.
For data generation, assigning Truncate table mode to tables that you don't need data for can improve generation performance.
For subsetting, if an upstream table is very large, and the foreign key columns are not indexed, then it can make the subsetting process run more slowly.
The Want faster generations? message displays at the bottom of the Confirm Generation panel. It displays for all non-subsetting jobs. For subsetting jobs, the message only displays if Structural identified columns that you should consider indexing.
To display information about tips for faster generation, click Generation Tips.
On the Generation Tips panel for subsetting jobs, the Add Indexes panel displays the first few columns that you might consider indexing.
To display a panel with a suggested SQL command to add the index, click the information icon next to the column.
On the panel, to copy the command to the clipboard, click Copy SQL to Clipboard.
If there are additional columns that are not listed, then to display the full list of columns to index, click Show all columns.
On the full list, to download the list to a CSV file, click Download list of columns (.csv).
On the Generation Tips panel for non-subsetting jobs, the Truncate Tables panel displays the hint to truncate tables that contain data that you do not need in the destination database.
To navigate to Database View to change the current configuration, click Go to Database View.
On the Confirm Generation panel, after you confirm the generation details, to start the data generation, click Run Generation.
When upsert is enabled, to start the data generation and upsert jobs:
Click the Run Generation + Upsert button.
In the menu, click Run Generation + Upsert.
If upsert is enabled for a workspace, then on the Confirm Generation panel, the more common option is to run both data generation and upsert.
After you run at least one successful data generation to the intermediate database, then you can also choose to run only the upsert process.
For example, if the data generation succeeds but the upsert process fails, then after you address the issues that caused the upsert to fail, you can run the upsert process again.
You also must start the upsert job manually if you turn off Automatically Start Upsert After Successful Data Generation in the workspace settings.
From the Confirm Generation panel, to run upsert only:
Click the Run Generation + Upsert button.
In the menu, click Run Upsert Only.
When you run upsert only, the process uses the results of the most recent data generation.
The following issues prevent a data generation or subsetting job.
The following errors occur when you attempt to generate a subset. They do not apply if Use subsetting is turned off.
When upsert is enabled, the following issues cause the upsert job to fail.
For a file connector workspace, the data generation job uses the configured generators for each file group to transform the data in the source files. The transformed data is used to create output files that correspond to the source files.
When subsetting is enabled, Structural first identifies the tables and rows to include in the subset. It uses the configured table modes and generators to transform the data. It then writes the transformed data to the destination location.
When upsert is enabled, Structural runs a data generation job that writes the transformed data to an intermediate database. The data generation can include subsetting.
For a file connector workspace, if the source files came from a local file system, then the destination files are written to the large file store in the Structural application database. You can .
If the destination data is written as container artifacts, then from the Confirm Generation panel, you can configure custom tag values to use for the artifacts that are generated by the data generation job. For information about how to configure the tag values, go to .
By default, Structural redacts sensitive values from the logs. To help support troubleshooting, some Structural data connectors can be configured to use diagnostic logging, which generates unredacted versions of the log files. For details, go to .
Structural displays a notification that the job has started. To track the progress of the data generation job and view the results, click the View Job button on the notification, or go to the .
Action Step and Table Name - Identifies the table, and indicates whether the table is a target table or a lookup table.
Status - For target tables and lookup tables, Status is Direct, which indicates that the subsetting process pulls data directly from the table. For target tables, this is based on the percentage or WHERE clause. For lookup tables, the subsetting process copies the entire table. For related tables, the status is either Downstream or Upstream.
Contributing Tables - For related tables, indicates the number of tables that affect the data that Structural pulls from the table. To display the contributing tables and how the current table is affected by those tables, click the information icon.
Source/Destination Rows - The number of rows in the source data and in the subset. For tables that contain fewer than 1,000 rows, the pre-subset value is <1k. Before you run the data generation, the number of rows in the subset is unknown. Otherwise, the number reflects the results of the most recent data generation.
People:

ID | First name | Last name
---|---|---
1 | John | Doe
2 | Mary | Smith

Companies:

ID | Name
---|---
1 | My Company
2 | Example Company

Addresses:

ID | Street | Owner ID | Owner type
---|---|---|---
1 | 123 Main Street | 1 | Person
2 | 234 Elm Street | 1 | Company
Rank | Description | Differentially private | Data-free | Reversible | Transformed
---|---|---|---|---|---
1 | The generator is data-free and irreversible. There is no way to uncover information about the original data from the output data. Examples: Random Boolean, Random Integer, Constant, Null | True | True | False | True
2 | Uses the original data in a way that obscures the original data points. Changing individual data points in the original data does not change the output data. However, the shape of the output data can provide information about the input data. Examples: Continuous and Categorical generators, when set to differentially private | True | False | False | True
3 | Uses the underlying data in a way that cannot be reversed, but can identify values that exist in the original data. Example: Categorical generator, when not set to differentially private | False | False | False | True
4 | Data is transformed in a secure way that is theoretically reversible. Examples: Name generator with consistency, Integer Key generator with consistency | False | False | True | True
5 | Data might be unprotected. Primarily applies to generators that have sub-fields, where there is always a chance that the data is not protected. Examples: HTML Mask, JSON Mask, Regex Mask, XML Mask | False | False | True or False | Maybe (might be only partially transformed)
6 | Data is not protected. The Passthrough generator is applied. | False | False | Not applicable | False
Issue | Description | How to resolve
---|---|---
Destination database not populated | The destination database is empty. | Update the workspace to disable upsert, then run a regular data generation job to populate the destination database. You can then re-enable upsert.
Invalid table mode | A table is assigned a mode other than De-Identify or Truncate. | Change the table mode to De-Identify or Truncate.
Unable to connect to the intermediate database | The intermediate database connection is either missing or incomplete. | Edit the workspace configuration to complete the intermediate database connection.
Unresolved schema conflicts | There are schema conflicts between the source and destination databases. | Update the source or destination database schema to resolve the conflicts.
For a workspace that writes the destination data to container artifacts, the Job History view displays a list of the generated data volumes. For each data volume, it also provides access to:

- The digest that you use to download the data volume from the registry
- A Docker Compose file template to help you stand up a database with a volume mount that uses the generated data volume

For these workspaces, the Job History view is divided into the following tabs:
The Jobs tab contains the list of jobs.
The Container Artifacts tab lists the data volumes that were created by data generation jobs.
On the Container Artifacts tab, each entry represents a data volume created by a data generation job. The volumes that were generated most recently are at the top of the list.
For each volume, the list contains:
The image that the volume was based on
The assigned tags
The identifier of the job
When the artifact was created
The user who ran the job that created the artifact
To view the details about a data volume, click the details icon. The details panel contains:
The job identifier
The assigned tags
When the volume was created
Who created the volume
The full registry and reference path to the volume
From the artifact details panel, you can:
Copy the job identifier
Display the job details view
Download the Compose file for the volume
Copy the digest for the volume. You use the digest to download the data volume from the registry.
For each data volume, you use the digest to retrieve the data volume from the registry. The Docker Compose file provides authentication information for the data on the data volume.
To copy the volume digest for a generated data volume:
On the Container Artifacts tab, click the details icon for the data volume.
On the details panel, click the copy digest button.
Use the volume digest to pull the data volume from the registry.
Because this is a data volume and not an image, you must use a tool such as the ORAS CLI.
The data volume downloads as a .tar.gz file.
Extract the downloaded file to your local machine.
Here is a basic example of downloading and extracting the volume:
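The following is a minimal sketch of the download-and-extract flow. The registry path and digest are hypothetical, so the `oras pull` step is shown as a comment; the extraction step is simulated with a locally created archive.

```shell
# In a real run, pull the data volume from your registry by digest with the
# ORAS CLI (registry path and digest below are hypothetical placeholders):
#   oras pull registry.example.com/my-workspace/volume@sha256:<digest>

# The artifact downloads as a .tar.gz file. Simulate one here, then extract it:
mkdir -p volume_src && echo "demo data" > volume_src/data.txt
tar -czf data-volume.tar.gz -C volume_src .

# Extract the downloaded archive to a local directory
mkdir -p extracted
tar -xzf data-volume.tar.gz -C extracted
ls extracted
```

After extraction, the local directory contains the generated data files, ready to be mounted into a database container.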
You can download the Docker Compose file for a job from the jobs list, the job details view, or the Container Artifacts list.
From the job list:
Click the download icon in the far right column.
Select Volume Compose File.
From the job details view, under Generated Artifacts, click Compose File.
From the Container Artifacts list:
Click the details icon for the data volume.
On the details panel, click the download icon.
In the `volumes` section of the Compose file, replace the template path value with the path to the extracted data volume.
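As a sketch, an edited `volumes` section might look like the following. The service name, image, and paths are hypothetical; your downloaded Compose file template will differ.

```yaml
services:
  db:
    image: postgres:16
    volumes:
      # Replace the template path with the path to your extracted data volume
      - ./extracted-data-volume:/var/lib/postgresql/data
```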
Issue | Description | How to resolve
---|---|---
Structural license expired | The current Structural license is expired. |
No workspace configured | The Structural instance does not have any workspaces to generate data from. |
Insufficient workspace permissions | You do not have permission to run data generation on this workspace. |
Unable to connect to the source database | The source database connection is either missing or incomplete. | Make sure to test the connection.
Unable to connect to the destination database | The destination database connection is either missing or incomplete. | Make sure to test the connection.
Scale mode - Invalid generators | A table that uses Scale mode has columns with assigned generators that are not valid for Scale mode. | Change the selected generator for the columns.
Scale mode - Passthrough generator or sub-generator for the Conditional generator | A table that uses Scale mode has columns that use the Conditional generator, and that are assigned Passthrough as a generator or sub-generator. | Change the selected generator, sub-generator, or default generator.
Preserve Destination - cannot resolve foreign key references | A table that uses Preserve Destination mode is referenced from another table. |
Truncate - cannot resolve foreign key references | A table that uses Truncate mode is referenced from another table. |
Cross Table Sum - incomplete generation | A column is assigned the Cross Table Sum generator, but some required configuration field values are missing. |
Incremental mode - circular foreign key dependency | There is a circular foreign key dependency between tables that use Incremental mode. |
AI Synthesizer configured when it is disabled | A table is configured to use the AI Synthesizer, but the AI Synthesizer is not enabled. | Remove the AI Synthesizer from the generator configuration.
Unresolved schema changes | There are detected schema changes that are not resolved. |
No target tables configured | The subsetting configuration does not include any target tables. |
Invalid target table configuration | A target table has an invalid percentage value or WHERE clause. |
In-subset table uses Scale mode | A table that is in the subset uses Scale table mode. |
In-subset table uses Truncate mode | A table that is in the subset uses Truncate table mode. |
In-subset table uses Preserve Destination mode | A table that is in the subset uses Preserve Destination table mode. |
In-subset table uses Incremental mode | A table that is in the subset uses Incremental table mode. |
During Tonic Structural data generation, performance bottlenecks typically come from one of the following sources:
Network IO. Specifically, the bandwidth capacity of the network that connects Structural to the database instances.
Disk IO. The disk IO of the databases.
Tonic server and workspace configuration. Structural performs several complex data computations and transformations. Depending on your workspace selections, these tasks can take a long time to perform.
In most cases, slow data generation times are caused by disk IO and network IO.
When possible, ensure that Structural has a fast network pipe between Structural and each source and destination database.
It is always advisable to install Structural on or near the hardware that runs your database instances.
Disk IO is normally limited by the database hardware.
If you run in a public cloud, you can configure options to access faster disks.
For SQL Server, you can increase your write speeds on your destination database. For details, go to SQL Server.
To reduce the required disk and network IO, you can copy less data from the source to the destination.
In some cases, you don't need the data from every table, or from specific columns within a table. Or you might be happy with the data that is already in the destination, and so you don't need to copy it again from the source.
Here are some tips to reduce the data load:
Put large tables that contain unneeded data into Truncate mode. In Truncate mode, Structural does not copy any of the table data to the destination database.
For example, audit or transaction tables might not be needed for typical QA testing.
Avoid copying over large columns such as varchar(max), blob, XML, and JSON columns.
If you do not need the data in a column, then to reduce the required IO, either:

- If the column is nullable, apply the Null generator.
- Apply the Constant generator.
For subsequent generation runs from the same source database:
For large tables that have not changed, use Preserve Destination mode. In Preserve Destination mode, Structural does not copy the table over, but instead uses the existing data in the destination database.
For large tables that have very few changes, use Incremental mode. In Incremental mode, Structural only copies over the changes that occurred since the previous generation.
When you believe that the Structural server is the bottleneck, then to improve performance, you can tune the following settings that control parallel processing.
You apply these settings as environment settings in your `tonic_worker` container. For more information on configuring environment settings, go to Configuring environment settings.
The following settings are not limited to specific data connectors:
The following settings apply to specific data connectors:
Required license: Professional or Enterprise
Tonic Structural can execute custom SQL scripts on the destination database when a database generation job is complete.
Post-job scripts allow you to make adjustments to the destination database. For example, you might have a set of regular demo users that you always want to have available. You can use a post-job script to add these demo users to the destination database after each data generation run.
You manage post-job scripts from the Post-Job Actions view. To display the Post-Job Actions view, either:
On the workspace management view, in the workspace navigation bar, click Post-Job Actions.
On Workspaces view, from the dropdown menu in the Name column, select Post-Job Actions.
On the Post-Job Actions view, the Scripts list contains the list of post-job scripts.
For each post-job script, the list contains:
A toggle to enable or disable the script
The name of the script
The user who created the script
The date and time when the script was most recently updated
Options to edit or delete the script
Required workspace permission: Configure post-job scripts and webhooks
To create a post-job script, in the Post-Job Scripts panel, click Create Post-Job Script.
On the script configuration dialog, provide the script details, then click Save.
On the script configuration dialog:
In the Script Name field, provide a name for the script.
In the SQL Script field, type or paste the SQL script.
For a MySQL database, you must explicitly pass a `USE` statement to define the database.
To format the script for readability, click Beautify.
By default, if a post-job script fails, then the entire data generation job fails. To instead register a warning without failing the data generation job, toggle Enable Warnings to the on position.
To save the script configuration, click Save.
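For example, a post-job script for the standing demo users scenario described earlier might look like the following. The table and column names are hypothetical; adjust them for your destination schema.

```sql
-- Re-create standing demo users after each data generation run.
-- Table and column names below are illustrative, not from a real schema.
INSERT INTO users (id, email, display_name)
VALUES
  (900001, 'demo.one@example.com', 'Demo User One'),
  (900002, 'demo.two@example.com', 'Demo User Two');
```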
To edit a post-job script:
In the Scripts list, click the edit icon for the script.
On the script configuration dialog, make the updates to the script.
Click Save.
Required workspace permission: Configure post-job scripts and webhooks
To delete a post-job script configuration:
In the Scripts list, click the delete icon for the script.
On the confirmation dialog, click Delete.
Required workspace permission: Configure post-job scripts and webhooks
In the Scripts list, the scripts are displayed in the order in which they are executed. The script at the top of the list is executed first, and the others follow in order from top to bottom.
To change the execution sequence, change the list order.
To execute a script earlier, drag it to a higher location in the list.
To execute a script later, drag it to a lower location in the list.
You use the toggle at the left of each script to control whether the script is enabled.
When the toggle is in the on position, the script runs.
When the toggle is in the off position, the script does not run.
Required license: Enterprise
By default, a child workspace inherits the configured post-job scripts from its parent workspace. If you make any changes to the child workspace configuration, including adding, editing, or deleting a script, the inheritance is removed. The child workspace no longer inherits any post-job script changes from its parent workspace.
For a child workspace, the Post-Job Scripts view indicates the current inheritance status.
Inherits parent configuration means that the child workspace inherits the post-job scripts from the parent workspace.
Overrides parent configuration means that the child workspace does not inherit the post-job scripts from the parent workspace.
To reset the inheritance, in the Overrides parent configuration notice, click Reset, then on the confirmation dialog, click Reset again.
The overrides are removed. The child workspace inherits any subsequent configuration changes from the parent workspace.
For a parent workspace, you can view the current inheritance status of the child workspaces.
The Child Workspaces tab contains the list of child workspaces.
For each workspace, the list includes:
The workspace name.
The inheritance status. Inheriting indicates that the child workspace inherits the configuration from the parent. Overriding indicates that the child workspace overrides the configuration and does not inherit it from the parent.
Your role in the child workspace.
The owner of the child workspace.
You cannot reset the inheritance status from the Child Workspaces tab. If you have access to a child workspace, to switch to that workspace, click the arrow icon in the rightmost column.
Tonic Structural data science mode allows you to create data models that provide views of underlying data from SQL query results. A model represents a downstream analysis or a data science task to answer a specific question.
Based on the defined model parameters, the model training process generates a set of synthesized data with values that correspond to those in the original data.
Users can export the results to a Jupyter notebook, and use Jupyter analysis and visualizations to verify that the synthesized data corresponds accurately to the source data. You can also export generated model data to a CSV file to use the trained data for other analysis.
In Structural, the data science mode workflow involves the following steps:
To get started, you create a data science mode workspace. After you create the workspace, to identify it as a data science mode workspace, toggle Enable data science mode to the on position. In the workspace configuration, you identify the source of the data to use to create the model. You can connect to an existing database, or you can upload CSV files that contain the data.
Next, you create and configure the model. The model configuration starts with a SQL query to retrieve the set of data to use in the model. You then configure the model parameters to guide the model training. You can also adjust the column data types in the query results.
After you complete the model configuration, you train the model. When it trains a model, Structural uses the model configuration to generate new, de-identified data that is based on the SQL query results.
You then analyze the resulting model data. The Model Synthesis Report contains visualizations that provide insight into how well the generated data replicates the shape of the original data.
You can export the model to use for further analysis. The exported model allows you to generate samples of synthetic data in your Python workflow. You can export the model to a Jupyter notebook that is based on a template that Structural provides. You can export a code snippet to use as a starting point for your own Jupyter notebook. You can also generate and export a CSV file containing the generated model data. From the Jupyter notebook or CSV file, you can sample the generated model data to use in other analysis tools.
Required license: Professional or Enterprise
Tonic Structural allows you to set up webhooks to fire HTTP POST requests when a data generation or upsert job completes successfully, fails, or is canceled.
Webhooks are only supported for data generation jobs and for upsert jobs, if upsert is enabled. You cannot trigger a webhook after other jobs such as sensitivity scans.
Webhooks enable Structural to integrate more seamlessly into your workflow. These requests can pass information about the data generation job, and can be used to trigger actions in other systems.
One common use of the Structural webhooks feature is to post a message to a Slack channel.
Child workspaces never inherit the webhooks configuration from their parent workspace. Child workspaces always have their own webhooks.
Webhooks require access to the Structural notifications server. The notifications server URL and port are set as the value of the environment setting `TONIC_NOTIFICATIONS_URL`.

On a Docker deployment, the default value is `https://tonic_notifications:7001`. For a Kubernetes deployment that uses Structural's provided Helm chart, the default value is `https://tonic-notifications:7001`.

If the notifications server on your instance does not match the default value, then you must update the value of `TONIC_NOTIFICATIONS_URL`.
Before you create a webhook, make sure that you have the required information.
Each webhook requires a webhook URL. This is the URL that receives the webhook message.
The application that you send the webhook to should provide information about how to obtain the URL. For example, for information on how to generate the webhook URL for a Slack notification, go to Sending messages using Incoming Webhooks in the Slack documentation.
Check whether the webhook requires any header values.
For example, an application might require:
- A `content-type` header. For example, `Content-type: application/json`.
- The version of an API to use. This might be needed to send an API call to perform an action based on the job status. For example, `Accept: application/vnd.pagerduty+json;version=2`.
- Authorization for a third-party service. For example, `Authorization: Bearer <token value>`.
By default, the webhook message contains the workspace identifier and name, the job identifier, and the job status.
You also determine whether your application requires any other properties.
For example, for a Slack notification webhook, you provide a `text` property that contains the text of the Slack notification.
You manage webhooks from the Post-Job Actions view. To display the Post-Job Actions view, either:
On the workspace management view, in the workspace navigation bar, click Post-Job Actions.
On Workspaces view, from the dropdown menu in the Name column, select Post-Job Actions.
On the Post-Job Actions view, the Webhooks list contains the list of webhooks.
For each webhook, the list contains:
A toggle to enable or disable the webhook
The name of the webhook
The job statuses that trigger the webhook
The webhook URL
The user who created the webhook
The date and time when the webhook was most recently updated
Required workspace permission: Configure post-job scripts and webhooks
To create a webhook, in the Webhooks panel, click Create Webhook.
On the webhook configuration dialog, you can set up, preview, and test the webhook.
To save the webhook, click Save. The webhook is added to the Webhooks list.
On the Settings & Headers tab, you set most of the webhook configuration, except for the message body.
In the Webhook Name field, provide a name for the webhook.
In the Webhook URL field, provide the URL to send the webhook request to.
By default, a webhook requires SSL certificate validation. To bypass the validation, and trust the server certificate, check Trust the Server Certificate (bypass SSL certificate validation). You can use this option if the server has a trustworthy self-signed certificate.
Under Trigger Events, select the data generation job events that trigger the webhook. The webhook can be triggered when a job succeeds, a job fails, or a job is canceled. To trigger a webhook in response to an event, check the event's checkbox. For example, to trigger the webhook when a job is canceled, check the Job Cancelled checkbox.
Under Trigger Job Types, select the types of jobs that trigger the webhook. You can trigger a webhook after a data generation job or after an upsert job.
The header list always contains a `Content-Type` header. The default value is `application/json`. You cannot delete the `Content-Type` header, but you can change the value.
To add custom header values for the webhook request:
To add a header row, click Add Header.
In the Header Name field for each header, provide the header name.
In the Header Value field for each header, provide the header value.
To remove a header row, click its delete icon.
From the Message Body tab, you can customize the body of the request. The message body is sent as a JSON payload that consists of a set of keys and values.
For each property, the Property Name field contains the key, and the Property Value field contains the value.
By default, the message body contains the following properties. The values are variables that are replaced by the actual values for the triggering event. You can use these variables in the values of your custom properties.
- `jobId` - The identifier of the job. To include the job ID in a custom property value, use the `{jobId}` variable.
- `jobStatus` - The status of the job. To include the job status in a custom property value, use the `{jobStatus}` variable.
- `jobType` - The type of job (data generation or upsert). To include the job type in a custom property value, use the `{jobType}` variable.
- `workspaceId` - The identifier of the workspace. To include the workspace ID in a custom property value, use the `{workspaceId}` variable.
- `workspaceName` - The name of the workspace. To include the workspace name in a custom property value, use the `{workspaceName}` variable.
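Taken together, the default message body is a JSON payload along the following lines. This is a sketch of the shape, shown with the variables unexpanded; the exact default body on your instance may differ.

```json
{
  "jobId": "{jobId}",
  "jobStatus": "{jobStatus}",
  "jobType": "{jobType}",
  "workspaceId": "{workspaceId}",
  "workspaceName": "{workspaceName}"
}
```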
You can also add other properties to the message body that are needed for the particular webhook. For example, for a Slack notification webhook, you provide a `text` property that contains the text of the notification.
To add a property:
Click Add Property.
In the Property Name field, provide the key name.
In the Property Value field, provide the value.
You can include the default variables in the value. The following example of a `text` value for a Slack notification includes the job type, job identifier, workspace name, and job status:

`{jobType} job {jobId} for workspace {workspaceName} completed with a status of {jobStatus}.`
To remove a property, click its delete icon.
The Preview tab contains a preview of the JSON body of the request. In the preview, the variables are replaced by sample values.
To copy the JSON to the clipboard, click Copy to clipboard. You can then, for example, use the copied JSON to test the webhook request in another tool such as Postman.
From the webhook configuration dialog, you can send a test request. The test request includes the configured headers and message body. The message body uses sample values for the variables.
To send a test request, click Test Webhook.
To edit a webhook:
In the Webhooks list, click the edit icon for the webhook.
On the webhook configuration dialog, update the webhook configuration.
Click Save.
Required workspace permission: Configure post-job scripts and webhooks
To delete a webhook:
In the Webhooks list, click the delete icon for the webhook.
On the confirmation dialog, click Delete.
Required workspace permission: Configure post-job scripts and webhooks
You use the toggle at the left of each webhook to determine whether the webhook is enabled.
When the toggle is in the on position, the webhook is enabled. It is triggered by the selected generation job statuses.
When the toggle is in the off position, the webhook is not enabled, and is not triggered by the selected generation job statuses.
Before you can train a data science model, you must complete the following required tasks. These tasks are required for both Tonic Structural Cloud and for self-hosted Structural instances.
The Tonic data science library provides Python libraries that allow you to use tools such as Jupyter Notebook to sample and train your data models.
To install the data science library on your system, run the following command from your terminal:
The Python libraries:

- Allow you to access and sample trained models.
- Allow you to assess the fidelity and privacy of a trained model.
To train a data science model, you must have a Structural API token. For details about how to generate an API token, go to the API token documentation.
The Models view lists the current models in the workspace. To display the Models view, in the workspace navigation bar, click Models.
For each model, the model list provides the following information:
Model identifier.
Model name.
Model type, which indicates whether the model contains event data.
Training status.
For a model that was never trained, the status is No jobs run. Otherwise, the training status indicates the number of training jobs that are running, queued, and completed.
The date and time when the model was most recently trained.
From the model list, you can:
View model details.
Edit a model configuration.
Copy a model configuration.
Delete a model configuration.
Train a model.
You can use the model name to filter or sort the list. You can also sort the list based on the model type.
To filter the list, in the search field, type text that is in the model name.
To sort the list, click the column heading of the Model Name or Type column. To reverse the order of the sort, click the column heading again.
Setting and default value | Description
---|---
`TONIC_CONSTRAINT_PARALLELISM` Default: 8 | The number of constraints that a worker can apply in parallel during a job. You can configure this setting from Tonic Settings.
`TONIC_PROCESS_PARALLELISM` Default: 1 | The number of threads to devote to performing the data transformations. Certain Structural configurations can introduce CPU bottlenecks. This typically occurs when you configure composite generators such as JSON Mask or XML Mask with a large number of paths. If your workspace has a very high number of generators, or a large number of JSON Mask, XML Mask, Integer Primary Key, or Alphanumeric Primary Key generators, then you should increase this value to at least 2. You can configure this setting from Tonic Settings.
`TONIC_TABLE_PARALLELISM` Default: 1 | The number of tables that Structural operates on at the same time. For subsetting, the number of subsetting steps that a worker processes in parallel during a subsetting job. For more information, go to #subsetting-parallelism. If your Structural server has enough CPU, and your source and target databases are not fully utilized, then we recommend that you increase this value to 2. Depending on your hardware, you can increase it even higher. You can configure this setting from Tonic Settings.
`TONIC_WRITE_PARALLELISM` Default: 2 | The number of threads to devote to writing rows to the output database. For Data Pipeline V2 on PostgreSQL, this should be a factor of `TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS`. For example, if `TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS` is 8, then `TONIC_WRITE_PARALLELISM` should be 1, 2, or 4. You can configure this setting from Tonic Settings.
Setting and default value | Description
---|---
`TONIC_BIGQUERY_READ_PARALLELISM` Default: 2 | Google BigQuery only. The number of read threads per table for Google BigQuery.
`TONIC_INDEX_RESTORATION_PARALLELISM` Default: 1 | MySQL and PostgreSQL only. At the end of the data generation run, the number of indexes to restore concurrently in the destination database.
`TONIC_JOBFLOW_MAX_DESTINATION_CONNECTIONS` Default: 16 | Only applies to the Data Pipeline V2 data generation process for PostgreSQL. The maximum number of connections to the destination database. Each action requires at least one connection. We recommend that you set this value to the number of CPUs on the destination database server. You can configure this setting from Tonic Settings.
`TONIC_JOBFLOW_MAX_SOURCE_CONNECTIONS` Default: 8 | Only applies to the Data Pipeline V2 data generation process for PostgreSQL. The maximum number of connections to the source database. Each action requires at least one connection. We recommend that you set this value to the number of CPUs on the source database server. You can configure this setting from Tonic Settings.
`TONIC_MYSQL_COPY_TABLE_WRITE_PARALLELISM` Default: 1 | MySQL only. The number of tables that a worker can copy in parallel during a job.
`TONIC_ORACLE_DATA_PUMP_PARALLELISM` Default: 0 | Oracle only, and only on Oracle Enterprise Edition databases. The maximum number of processes of active execution for Data Pump to use.
`TONIC_PARTITION_PARALLELISM` Default: 1 | MySQL and SQL Server only. The number of table partitions per table that are read from concurrently during a job.
`TONIC_READ_RANGES_PARALLELISM` Default: 8 | PostgreSQL only. The number of ranges per table to read in parallel.
Required workspace permission: Configure, train, and export models
To create a model configuration, you can either create a completely new configuration or make a copy of an existing configuration.
From the Models view, to create a new model configuration, click Create Model.
The model configuration page displays. From the model configuration page, you:
In the Model Name field, set the model name. The model name cannot contain spaces.
In the Model Description field, provide a longer description of the model.
Before you can save the new model configuration, you must provide a name and a SQL query.
To save the model configuration, click Save.
To return to the Models view, click All Models.
You can use an existing model as the basis for a new model configuration.
When you copy a model, the new model only inherits the configuration. It does not inherit any training results. The model is not trained.
To create a copy of an existing model configuration:
On the Models view, click the actions icon (...) for the model configuration to copy.
In the actions menu, click Copy.
On the Copy Model dialog, enter a name for the new model.
Click Copy.
Required workspace permission: Configure, train, and export models
The data model configuration includes the following elements.
Run a SQL query
The query results provide the underlying data for the model.
Configure the model parameters
The model parameters guide the model training.
Adjust the column types
Update the column types for the model data as needed.
Required workspace permission: Configure, train, and export models
When you finish configuring the model, you can train it.
To train a model, either:
On the Models view, click the Train option for the model.
On the model details view, click Train Model.
To view a list of training jobs, either:
In the workspace navigation bar, click Jobs. The Job History view lists the training jobs for all of the models in the workspace. To display the details for a job, click the job ID.
On the Models view, click the model row. The model details view contains the list of training jobs that were run for that model. For more information, go to Reviewing the training results.
You can also use the Structural Python API to generate code to train the model.
The standard model library allows you to configure and train a model.
The reporting library allows you to assess the fidelity and privacy of a trained model.
Required workspace permission: Configure, train, and export models
From the Models view, a workspace owner or editor can view the current model details, edit the model configuration, or delete the model.
To view the details for a model, on the Models view, click the model row.
The model details view displays the list of training jobs for the model.
From the model details view, you can edit the model configuration and train the model based on the current configuration.
To edit the configuration of a model:
To display the configuration view for a model, either:
On the Models view, in the model actions menu, click Edit.
On the model details view, click Edit Model.
On the model configuration view, make your changes to the model configuration. For details, go to Configuring a model.
To save the changes, click Save.
To delete a model, on the Models view, in the model actions menu, click Delete.
Required workspace permission: Configure, train, and export models
The model details view displays the list of training jobs that were run against the model. The job list can include information about the job itself, as well as about the model configuration that was in place when the job was run.
For jobs that are running, queued, or failed, you can view the job details. For queued and running jobs, you can cancel the job.
For completed jobs, you can view a visual summary of the results for a specific job, and compare jobs.
You can configure the columns to include in the jobs list. By default, the jobs list includes:
The job identifier
The job status
The model version that the job ran against. When you change either the query or the column types, Tonic Structural updates the model version. If the updates cause the model configuration to match an existing version, the model is assigned that existing version number. Structural only assigns new version numbers to unique versions. Note that for training jobs that ran before we introduced model versioning, the model version is always 0.
When the job was submitted
When the job was completed
The general model parameter values that were used
For a tabular model, you can also display the tabular-specific parameters. For an event-driven model, you can also display the event-specific parameters.
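The version-reuse rule described above, where identical model configurations share a version number and only unique configurations receive new numbers, resembles content-addressed versioning. The following hypothetical sketch illustrates the idea; it is not Structural's implementation, and the configuration fields are illustrative.

```python
# Sketch: identical configurations share a version number; only unique
# configurations receive a new one. Illustrative only, not Structural code.
def assign_version(config, seen, counter):
    # Hashable fingerprint of the configuration (query + column types).
    key = (config["query"], tuple(sorted(config["column_types"].items())))
    if key not in seen:
        counter[0] += 1
        seen[key] = counter[0]
    return seen[key]

seen, counter = {}, [0]
v1 = assign_version(
    {"query": "SELECT a FROM t", "column_types": {"a": "numeric"}},
    seen, counter)
v2 = assign_version(
    {"query": "SELECT a FROM t", "column_types": {"a": "categorical"}},
    seen, counter)
# Reverting to the first configuration reuses version 1.
v3 = assign_version(
    {"query": "SELECT a FROM t", "column_types": {"a": "numeric"}},
    seen, counter)
print(v1, v2, v3)
```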
To manage the displayed columns, click the list icon at the top right of the table. The column list contains the full list of available columns, and indicates whether each column is currently displayed.
To change whether a column is currently displayed, click the column name.
You can use the job status to filter the list. To filter the list:
In the Training Status column heading, click the filter icon.
On the filter panel, check the checkbox for each status to include.
As you check and uncheck the checkboxes, Structural updates the list.
You can use the following columns to sort the list:
Status
Model version
Job submission
Job completion
To sort by a column, click the column heading. To reverse the sort order, click the column heading again.
The Model Synthesis Report for a completed model training job provides a visual summary of the training results. It allows you to see how well the values in the generated data correspond to those in the original data. This indicates how realistic the generated data is.
Structural produces a Model Synthesis Report for each completed model training job.
From the model details view, to display the Model Synthesis Report for a previous training job, click the Synthesis Report option for that job. The option is only available for completed jobs.
From the job details view, to display the Model Synthesis Report for the job, click Synthesis Report.
The Model Synthesis Report contains the following sets of visualizations.
For each categorical column, the Categorical section shows the distribution of each value in both the original data and the generated data.
For example, the possible values for a contract column are Month-to-month, Two year, and One year. In the Categorical section, the visualization for contract shows the number of original and generated rows that have each value.
The closer the value counts match, the more realistic the generated data.
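The comparison that the Categorical section visualizes can be sketched generically with pandas. The contract column and its value counts here are illustrative, not taken from Structural.

```python
# Sketch: comparing categorical value distributions between original and
# generated data. Column and value names are illustrative.
import pandas as pd

original = pd.Series(
    ["Month-to-month"] * 55 + ["Two year"] * 25 + ["One year"] * 20,
    name="contract",
)
generated = pd.Series(
    ["Month-to-month"] * 53 + ["Two year"] * 27 + ["One year"] * 20,
    name="contract",
)

# Normalized value counts give the share of rows per category.
comparison = pd.DataFrame({
    "original": original.value_counts(normalize=True),
    "generated": generated.value_counts(normalize=True),
})
# The closer the two columns are, the more realistic the generated data.
print(comparison)
```

Normalizing the counts makes the two datasets comparable even when they contain different numbers of rows.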
For each numeric column, the Continuous section shows the distribution of values in the original data and the generated data.
The closer the distributions match, the more realistic the generated data.
The Correlations section contains a correlation matrix for the original data and a correlation matrix for the generated data.
Each correlation matrix shows how the values in each numeric column correspond to the values in the other numeric columns. For example, as the tenure for a customer increases, does their bill amount also increase?
The correlation is displayed using a color code that represents a value between -1 and 1. -1 indicates that an increase in one value always corresponds to a decrease in the other value. 0 indicates that there is no correlation between the values. 1 indicates that an increase in one value always corresponds to an increase in the other value.
The blocks that correlate a column to itself always have a correlation of 1.
The more similar the correlations between the matrices, the more realistic the generated data.
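The correlation comparison described above can be sketched with pandas. The tenure and monthly_bill columns and the data are illustrative, not Structural's.

```python
# Sketch: correlation matrices for original vs. generated numeric columns.
# The data is randomly generated for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tenure = rng.uniform(1, 72, size=500)
original = pd.DataFrame({
    "tenure": tenure,
    # Bill amount loosely increases with tenure: positive correlation.
    "monthly_bill": 20 + 0.8 * tenure + rng.normal(0, 5, size=500),
})
tenure_g = rng.uniform(1, 72, size=500)
generated = pd.DataFrame({
    "tenure": tenure_g,
    "monthly_bill": 20 + 0.8 * tenure_g + rng.normal(0, 5, size=500),
})

# Each matrix has values in [-1, 1]; the diagonal is always 1.
orig_corr = original.corr()
gen_corr = generated.corr()
max_diff = (orig_corr - gen_corr).abs().to_numpy().max()
print(orig_corr.round(2))
print(gen_corr.round(2))
# A small difference between matrices suggests realistic generated data.
print(f"max correlation difference: {max_diff:.3f}")
```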
The Measure of Privacy section shows how closely each generated record matches the most similar original record. It also plots how closely each original record matches the most similar other original record.
While the overall shape of the data should be similar between the original and generated data, the generated data should not replicate actual records.
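The distance-to-closest-record idea behind the Measure of Privacy section can be sketched with NumPy. The data here is randomly generated for illustration, and Structural's actual metric may differ.

```python
# Sketch: distance from each record to its closest counterpart. If generated
# records sit unusually close to original records (closer than original
# records sit to each other), the model may be memorizing real data.
import numpy as np

rng = np.random.default_rng(1)
original = rng.normal(0, 1, size=(200, 3))
generated = rng.normal(0, 1, size=(200, 3))

def nearest_distances(a, b, exclude_self=False):
    """For each row of `a`, the distance to the closest row of `b`."""
    # Pairwise Euclidean distances via broadcasting.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if exclude_self:
        np.fill_diagonal(d, np.inf)  # ignore each record's match with itself
    return d.min(axis=1)

gen_to_orig = nearest_distances(generated, original)
orig_to_orig = nearest_distances(original, original, exclude_self=True)

# Comparable medians suggest the generated data is not replicating records.
print(f"median generated->original: {np.median(gen_to_orig):.3f}")
print(f"median original->original:  {np.median(orig_to_orig):.3f}")
```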
To compare the model configuration and results for multiple jobs:
Check the checkbox for each job to include in the comparison.
Click Compare Jobs.
The comparison page displays a panel for each job.
At the top of the panel are the job start and end times.
Below that are tabs that summarize the results and contain the configuration that was in place when the training job ran:
Parameters shows the model version and the model parameter values
Schema contains the data schema
Query contains the query used to produce the model data
From the actions menu at the top right of the panel, you can:
Display the job details
View the Model Synthesis Report for the job
Required workspace permission: Configure, train, and export models
You can export a trained model to a Jupyter notebook, export generated model data to a CSV file, or copy a code snippet that you can use as a starting point for your own Jupyter notebook.
To display the Export Model panel for a completed training job:
On the Models view, click the model to export from.
On the model details view, click the Export option for the training job to use for the export.
The Export model panel contains the code snippet, CSV download, and Jupyter notebook export options. If you did not previously install the data science libraries, it also provides the command to do that.
Tonic Structural provides a code snippet that you can use as a starting point to create your own Jupyter notebooks.
The results include:
Sample source data
The resulting synthetic data
Visualizations to compare the source data to the synthetic data, to help you to analyze the quality of the synthetic data
On the Export model panel, the Code Snippet section contains the snippet.
The template contains the following code. Values in <> are populated automatically with values from your Structural instance and model.
You can generate and download CSV files of model data records.
Structural stores generated files for 14 days in an S3 bucket that you choose. You configure the S3 bucket as the value of the environment setting TONIC_S3_BUCKET_FOR_SYNTHETIC_DATA_CSVS. For more information, go to Configuring environment settings.
The generated file cannot be larger than 1GB. Structural automatically truncates the generated file if needed to stay within the limit.
On the Export model panel, under Download synthetic data directly to a CSV, to generate a new file:
Under How many rows, enter the number of rows of data to generate for the file.
Optionally, under Random Seed, enter a seed value to use for the data generation.
Providing a seed value guarantees a consistent set of results every time you generate a CSV file. Without the seed value, the result set is random.
Click Generate CSV.
Structural generates the file and saves it to the configured S3 bucket. It also adds the file to the list of previously generated files. The list displays up to 10 previously generated files.
To download a generated file, in the file list, click the file name.
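The effect of the Random Seed option in the steps above follows the general behavior of seeded random number generators, which this minimal sketch illustrates. It is not Structural's sampling code.

```python
# Sketch: why a fixed seed yields the same generated rows on every run.
import random

def generate_rows(n, seed=None):
    rng = random.Random(seed)  # seeded generator -> reproducible sequence
    return [round(rng.uniform(0, 100), 2) for _ in range(n)]

run_a = generate_rows(5, seed=42)
run_b = generate_rows(5, seed=42)
run_c = generate_rows(5)  # no seed: results differ between runs

assert run_a == run_b  # same seed, same rows every time
print(run_a)
```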
You can export a trained model to a Jupyter notebook file. You can then use the notebook to analyze the model.
From the Export model panel, to export a model to a Jupyter notebook file:
Under Export Jupyter Notebook, from the Choose API Token drop-down list, choose the API token to use to access the model data.
Click Download Notebook. Structural generates and downloads the Jupyter notebook file.
The following diagram shows how data and requests flow within the Tonic Structural application:
The Structural application database is a PostgreSQL database that stores the workspace and Structural configuration.
The configuration includes:
Which users have administrative access to the Structural instance.
Configuration of the workspaces in the Structural instance. Each workspace includes:
Data connections
Users and roles
Generation configuration (table modes, generators, subsetting)
Post-job actions
Job tracking
For a file connector workspace that uses files uploaded from a local file system, the Structural application database stores the encrypted source files. It also stores the generated destination files.
Each workspace is connected to a source database and a destination location. Depending on the workspace, the destination location might be a database, a Tonic Ephemeral data snapshot, a container repository, or a file system.
Source databases and destination locations are external to Structural, but Structural must be able to read from the source database and write to the destination location. For a file connector workspace that uses local files, the destination location is the Structural application database.
The source database contains the original data. We recommend that you use a static copy of your production database that was restored from a backup.
The destination location contains the results of the Structural data generation.
Each of the components below can have one or more containers.
Runs the Structural user interface. It also runs scans to ascertain the structure of the source data, including the scans to detect schema changes.
The web server displays preview data from the source database, and pulls the configuration from the application database.
The web server also receives and processes requests and calls from the Structural API.
The web server also handles the migration of the Structural database when a new Structural version makes changes to it.
Sends email notifications to notify users about comments on source database columns. The commenting feature requires an enterprise tier license.
The notification component also processes Structural webhooks. Webhooks perform specific actions when a data generation job completes, fails, or is canceled.
A Structural instance can have multiple workers. Additional workers allow you to process multiple jobs at the same time.
Structural workers run sensitivity scans. During a sensitivity scan, Structural looks for specific types of sensitive information in the source database. For details, go to #sensitive-data-types.
The workers also run data generation. The data generation process pulls data from the source database and applies the configured generators and subsetting. The configuration is in the application database.
The workers write the resulting data to the destination location.
Runs the AI Synthesizer generator. If a workspace uses AI Synthesizer, then the Structural worker calls the Structural machine learning component during data generation.
The SMTP server processes the notifications sent by the Structural notification component.
All communication between Structural components, between Structural components and the application database, and between the notification component and the SMTP server use TLS encryption.
TLS encryption is also used by default to encrypt communication between Structural and the source and destination locations.
To generate de-identified data, Tonic Structural requires access to customer data that might be sensitive in nature, protected by regulation or contract, or that otherwise requires special handling to meet processing obligations.
To ensure the security of your data, when you configure and use the Structural application, Structural advises that you use industry best practices for secure data handling.
We have compiled the following recommendations for using Structural securely. This list of suggestions is not comprehensive, and is based on a general use case.
Your use case might require additional considerations depending on the type of data that is processed, your underlying systems, and other legal and organizational requirements.
The following recommendations apply both to Structural Cloud and to self-hosted instances of Structural.
You should grant Structural accounts to users based on the principle of least privilege. Each user should only have access to the workspaces and datasets that they need to perform their required tasks.
Structural produces de-identified data that is stored in destination databases. Some end users might not need access to the Structural application at all, but still need access to the destination data.
Restricting access to Structural includes restricting access to the application and API keys that provide access to the Structural API.
Periodically review the current user access to Structural to ensure that the current access levels are appropriate.
Maintain protective measures for data as it moves from your data store to Structural.
Configure databases that the Structural application connects to (source and destination databases) to only accept encrypted connections that use industry standard cryptographic algorithms.
Make sure that there are physical security and environmental controls for all of your devices and access points.
This includes devices that are used by remote or home-based employees who use Structural.
If you have a Professional or Enterprise license, use an external identity provider to manage access to Tonic.
When you use an external identity provider, you can control the password, multifactor, location, and other authentication requirements to meet your specific use case.
For self-hosted instances, the following additional recommendations apply.
Deploy Structural in an environment that prevents unauthorized and accidental access from outside the system.
This can include:
Configuring and using web application and network firewalls
Using AWS Security Groups, Azure Network Security Groups, or Google Cloud firewall rules to control access to Structural and to control Structural access to other networked devices
Using firewalls or stateless access control lists to deny traffic on unapproved ports or based on the traffic direction or type
If applicable, allowlisting end-user traffic to IP addresses within a network or VPN
Maintain protective measures for data as it moves from your end users to the Structural application. Configure your infrastructure deployment to use encryption-in-transit. Structural can be configured in multiple ways to use and enforce encryption-in-transit.
Tonic.ai recommends that all customers who deploy Structural enforce encrypted communication.
Inbound traffic to the Structural application can be handled by a load balancer that is configured with TLS termination. Some customers either do not want to or cannot use a load balancer. In that case, when you set up Structural, you install a certificate to encrypt inbound traffic to the Structural application. You can also use this configuration to ensure encrypted communication between the load balancer and the application.
For outbound traffic (traffic from the Structural application to source and destination databases), you can configure Tonic to enforce SSL/TLS communication.
For increased security, ensure that the Tonic web server only listens on https and not on http. To configure this, set the environment setting TONIC_HTTPS_ONLY to true. See Configuring environment settings.
Because of its access to sensitive data, you should configure and monitor network traffic for environments that run the Structural application.
At a minimum, Tonic.ai suggests that you use industry standard IDS/IPS systems to detect unauthorized access.
Use industry standard disk encryption on all of the underlying storage that is associated with your Structural instances and the associated databases.
Collect logs from Structural components and analyze them for anomalies that indicate malicious acts, natural disasters, and errors. Analyze anomalies to determine whether they represent security events.
Enable log sharing with Tonic.ai to allow Tonic.ai staff to monitor these logs. Tonic.ai staff can apply their domain knowledge of Structural to the log analysis.
Tonic.ai releases updates to the Structural software multiple times a week. Updates can include fixes to improve Tonic security.
We recommend that you upgrade Structural at least once every two weeks. For details, go to Updating Structural.
In the query results, the column headings include the identified column type. The training process uses numeric, categorical, and location columns.
Numeric columns contain a number value.
Categorical columns contain a specific set of values. For example, a categorical column might identify the marital status of a person represented in the data.
Location columns identify a physical location. For example, a location column might contain a zip code or a city name.
Tonic Structural assigns initial types when it runs the query. Typically:
String columns are assigned as categorical.
Numeric columns are assigned as numeric.
Datetime value columns are assigned as datetime. Ideally, in your SQL query you converted datetime values to a numeric representation of time such as epoch time. The columns are then assigned as numeric.
You can make adjustments to these assignments. For example:
A numeric column might actually be an enum, which would make it a categorical column.
A city name might be designated categorical, but is actually a location.
To change the designation of a column:
Click the dropdown arrow next to the current type.
From the popup menu, select the type.
For columns other than numeric columns, you can designate the column as a categorical column or a location column.
For numeric columns, you can also restore the column type to numeric.
For a self-hosted instance of Tonic Structural, you install Structural in your own environment.
The self-hosted version of Structural is only available to customers who purchased Structural or are undergoing a formal evaluation of Structural. For details, contact sales@tonic.ai. For information about the Structural free trial, go to the Tonic.ai web site.
In the query editor, provide a SQL query to identify the subset of data to obtain from the source database. The query must be deterministic - it must return the same data every time it runs.
You can use the table and column list on the Source tab at the left as a reference. If you uploaded CSV files, then each file becomes a table, with the file name (minus the extension) as the table name. For example, if you upload a file named my_model_data.csv, it becomes a table named my_model_data.
If the model contains event data, then make sure that the query results include a numeric column that can be used to sort the data based on a datetime value. You might need to transform a datetime column to use a numeric format.
To run the query, either click Run Query or press Shift-Enter. The query results are used to populate the table below the query editor and the Schema list on the model details view.
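Where the model needs a numeric sort column, a datetime value can be converted to epoch time. In practice you would perform this conversion in the SQL query itself; this sketch shows the equivalent idea in Python, and the column names are illustrative.

```python
# Sketch: converting datetime values to a numeric epoch representation so an
# event column can be sorted numerically. Column names are illustrative.
from datetime import datetime, timezone

events = [
    {"event_time": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"event_time": datetime(2023, 6, 15, 8, 30, tzinfo=timezone.utc)},
]

for row in events:
    # Epoch seconds: a numeric value the model can treat as a number.
    row["event_time_epoch"] = int(row["event_time"].timestamp())

events.sort(key=lambda r: r["event_time_epoch"])
print([r["event_time_epoch"] for r in events])
```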
Structural system requirements
Overall system requirements for Structural deployment.
Deploy Structural on Docker
How to use Docker Compose to deploy Structural on Docker.
Deploy Structural on Kubernetes
How to use Helm to deploy Structural on Kubernetes.
Enter and update your license key
How to enter a new key or update an existing key.
Set up host integration
Host integration is required to monitor Structural services and update Structural.
For a self-hosted Tonic Structural instance, you deploy Structural to a public cloud account (for example, AWS, GCP, or Azure) or data center.
Use this checklist to prepare to install Structural. Structural architecture includes a diagram of the Structural components and how they are connected.
Provision a server that meets the required specifications.
You deploy Structural to either a Kubernetes cluster or a Docker container. Ensure that the Kubernetes or Docker environment meets the required specifications:
Provision a PostgreSQL database that meets the required specifications.
Determine whether to host the Structural application database in Docker.
To ensure a smooth installation and configuration process, all of the Structural components must have the appropriate network configurations.
Source databases contain the original data for Structural data generation or data science mode. For Structural data generation, Structural writes the transformed data to a destination database.
Overview for database administrators contains an overview of the requirements for Structural source and destination databases.
Structural application server -> Structural application database
The Structural application server must have a valid network path to the Structural application database.
Structural application server -> quay.io
The Structural application server must have access to download the Structural application images from quay.io. Ensure that any proxies or firewalls that might block access are configured to allow access.
Structural users -> Structural web application
The Structural application server runs a web server (HTTPS/port 443 and HTTP/port 80). Ensure that all Structural users can reach the Structural application from their browser.
Structural application server -> Source and destination databases
The Structural application server must have a valid network path to the source and destination databases.
Structural application database remote access
If the Structural application database is not hosted on Docker, then it must be accessible and allow remote access.
For PostgreSQL and MySQL workspaces, you can configure Tonic Structural to write destination data to a container artifact instead of to a database server. For more information, go to Writing data generation output to a container repository.
If Structural is deployed on Kubernetes, then the option is supported automatically.
If Structural is deployed on Docker, then to enable the option, you can set up a separate Kubernetes cluster to use specifically for that purpose.
Set up a Kubernetes cluster
On a Docker instance, set up a separate Kubernetes cluster for Structural to use
Grant required permissions
Ensure that Structural has the required permissions to write destination data to container artifacts
A Docker instance of Tonic Structural does not automatically support the option to write destination data to a container artifact.
To enable this option, you can set up a separate Kubernetes cluster. You then configure Structural environment settings to enable Structural to use that Kubernetes cluster as the destination location.
You can install the Kubernetes cluster on the same server where Docker is installed, or on a remote host that has network access to the Docker server.
You can use any compatible Kubernetes distribution. Here are links to the installation instructions for a few different options that will work:
The Structural service account must have the permissions listed in Required access to write destination data to container artifacts.
In the kubeconfig file, you must change the server property value from localhost to either:
If the cluster is remote, the Kubernetes host IP address or hostname
If the cluster is on the same host, host.docker.internal
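The kubeconfig change described above can be sketched as a simple text substitution. A real kubeconfig is YAML and the cluster address here is an example value; in practice, edit the file with any editor.

```python
# Sketch: updating the `server` value in a kubeconfig file. The address is
# an example; a real kubeconfig contains your cluster's endpoint.
kubeconfig = """\
clusters:
- cluster:
    server: https://localhost:6443
  name: local-cluster
"""

# For a cluster on the same host as Docker, point at host.docker.internal.
updated = kubeconfig.replace(
    "https://localhost:6443",
    "https://host.docker.internal:6443",
)
print(updated)
```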
To allow Structural to connect to the Kubernetes cluster and to write destination data to it, you must configure the following environment settings.
You can add these settings manually to the list on the Environment Settings tab of Tonic Settings.
To allow Structural to write output data to the Kubernetes cluster, Structural also needs the path where kubeconfig is mounted to the Structural worker.
In the Docker Compose file, to specify the kubeconfig path, add the KUBECONFIG environment variable to the tonic_worker environment section.
CONTAINERIZATION_USE_REMOTE_KUBERNETES
Whether Structural can write destination data to a remote Kubernetes cluster.
Set this to true.
CONTAINERIZATION_PULL_SECRET
A base64 encoded Docker secret used to pull datapacker images.
This should be the same pull secret that you use to pull other images from Tonic.
CONTAINERIZATION_IMAGE_REPOSITORY
The repository where the base images are located.
If you use the images provided by Structural, then you do not need to set this.
CONTAINERIZATION_REMOTE_KUBERNETES_HOST
IP address or hostname of the host for the Kubernetes cluster. If you installed Kubernetes on the same host as Docker, then you do not need to set this.
CONTAINERIZATION_MANAGE_NAMESPACE
Whether to allow Structural to manage the remote namespace.
If you set this to true, then you can include {workspaceId} and {jobId} as placeholders in the value of CONTAINERIZATION_NAMESPACE.
You must also add an RBAC grant to enable the Structural service account to work with namespaces.
CONTAINERIZATION_NAMESPACE
The namespace where Structural writes the destination data.
If CONTAINERIZATION_MANAGE_NAMESPACE is true, then the namespace can include the placeholders {workspaceId} and {jobId} to represent the specific workspace identifier and data generation job identifier.
To enable Tonic Structural to write destination data to container artifacts, the Structural service account requires specific levels of access to Kubernetes.
The required access applies both on a Kubernetes cluster where Structural is deployed and, for Docker instances, on the separate Kubernetes cluster that you install.
On the Kubernetes cluster, the Structural service account must have a rolebinding that grants the following access to the Structural Kubernetes cluster:
On a Kubernetes instance of Structural, you can allow Structural to create the rolebinding automatically. In the Structural Helm chart, the following setting determines whether to have Structural automatically create and grant the rolebinding. By default, the setting is true.
If your access management method does not allow you to use this default configuration, then:
Change the setting to false.
Create and grant the rolebinding.
For a separate Kubernetes cluster, the environment setting CONTAINERIZATION_MANAGE_NAMESPACE indicates whether to allow Structural to manage the remote namespace.
If the setting is true, then you must add the following RBAC grant to enable the Structural service account to manage namespaces.
When you start your Tonic Structural instance for the first time, you must provide your license key in order to activate Structural.
You also update the license key when it expires or when you upgrade to a new license tier.
On a new instance of Structural, to provide the Structural license key and activate Structural:
On the Welcome to Tonic panel, click Get Started.
On the Input License Key panel, in the License Key text area, paste your license key.
Click Activate.
Structural verifies the license key. If the license key is valid, the Structural login screen is displayed.
Required global permission: Update the Tonic license key
When your license key expires, Structural displays a banner across the top of the application. The banner contains a message to indicate that the license is expired.
The expired license banner contains an Update Tonic License button.
The Update Tonic License button also displays on the System Status tab of Tonic Settings view. You can use this option to update an expired license key or to provide a new key when you upgrade to a higher license plan.
You obtain your updated license key from Tonic.ai. To update the license key:
Click Update Tonic License.
On the Update License Key panel, in the License Key text area, paste the new license key.
Click Update.
Structural verifies the license key. If the license is valid, the license key is updated, and the expired license banner is removed.
The Tonic Settings option to update the license key is only available to users who have the Update the Tonic license key global permission. All self-hosted instances should have at least one user with this permission.
If your instance does not have a user with the Update the Tonic license key global permission, then you can set the Structural license key as the value of the TONIC_LICENSE environment setting.
If your instance has a user with the Update the Tonic license key global permission, then Structural ignores the TONIC_LICENSE environment setting. You must use the Tonic Settings option to update the license.
These topics describe groups of related generators that have similar functions and configurations.
Only available for PostgreSQL, MySQL, and SQL Server.
Not compatible with upsert.
Not compatible with Preserve Destination or Incremental table modes.
If Ephemeral supports your workspace database type, then you can choose to write the destination data to a snapshot in Ephemeral. You can then use the snapshot to start Ephemeral databases.
To write the transformed data to Ephemeral, under Destination Settings, click Ephemeral Database.
Structural can write the data snapshot to either Ephemeral Cloud or to a self-hosted instance of Ephemeral. By default, Structural writes the data snapshot to Ephemeral Cloud.
For Ephemeral Cloud, Structural writes the snapshot to the account for the user who runs the data generation job. If that user has an Ephemeral account on Ephemeral Cloud, then Structural uses that account. If the user does not have an account, then Structural creates a two-week Ephemeral free trial account for the user.
Note that if you are on a self-hosted instance of Ephemeral, then you must always provide an Ephemeral API key.
To write a snapshot to Ephemeral Cloud:
Click Tonic Ephemeral cloud.
If you are on a self-hosted instance of Structural, in the API Key field, provide an Ephemeral API key from your Ephemeral account.
To write the snapshot to a self-hosted instance of Ephemeral:
Click Tonic Ephemeral self-hosted.
In the API Key field, provide an Ephemeral API key from your Ephemeral account. Structural writes the snapshot to the Ephemeral account that is associated with the API key.
In the Tonic Ephemeral URL field, provide the URL to your self-hosted Ephemeral instance.
If you do not configure any advanced settings, then:
The snapshot uses the same name as the workspace, and has no description.
The snapshot size allocation is determined by the source data size.
Structural discards the temporary Ephemeral database that is created during the data generation.
To change any of these settings, click Advanced settings.
By default, the snapshot name uses the workspace name.
When you run data generation, if a snapshot with the same name already exists in Ephemeral, then Structural overwrites that snapshot with the new snapshot.
Under Advanced settings:
In the Snapshot name field, provide the name of the snapshot. The snapshot name can use the following placeholder values to help identify the snapshot:
{workspaceName}
- Inserts the name of the workspace.
{workspaceId}
- Inserts the identifier of the workspace.
{jobId}
- Inserts the identifier of the data generation job that created the snapshot.
{timestamp}
- Inserts the timestamp when the snapshot was created.
Including the job ID or timestamp ensures that a data generation job does not overwrite a previous snapshot.
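As a sketch of how these placeholders behave (this is illustrative Python, not Structural's implementation; the timestamp format is an assumption):

```python
from datetime import datetime, timezone

def expand_snapshot_name(template, workspace_name, workspace_id, job_id):
    # {timestamp} stands in for the snapshot creation time.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return (template
            .replace("{workspaceName}", workspace_name)
            .replace("{workspaceId}", workspace_id)
            .replace("{jobId}", job_id)
            .replace("{timestamp}", timestamp))

# A name pattern that includes the job ID yields a distinct snapshot per job:
print(expand_snapshot_name("{workspaceName}-{jobId}",
                           "payments-staging", "ws-123", "job-42"))
# → payments-staging-job-42
```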
Optionally, in the Snapshot description field, provide a longer description of the snapshot.
By default, the Ephemeral size allocation for the snapshot is based on the size of the source data.
To instead provide a custom data size allocation, under Advanced settings:
Toggle Custom data size allocation to the on position.
In the field, enter the size allocation in gigabytes.
When Structural creates the Ephemeral snapshot, it creates a temporary Ephemeral database.
By default, Structural deletes that database when the data generation is complete.
To instead keep the database, under Advanced settings, toggle Keep database active in Tonic Ephemeral after data generation to the on position.
For a MySQL workspace, you can provide a customization file that helps to ensure that the temporary Ephemeral database is configured correctly.
To provide the customization details:
Toggle Use custom configuration to the on position.
In the text area, paste the contents of the customization file.
On the model configuration view, the Advanced tab at the left contains the options that Tonic Structural uses during the data training and generation process.
By default, models are tabular. A tabular model focuses on the relationships between columns.
However, a model might instead be event driven, meaning that it captures relationships across both rows and columns. For example, you might want to track financial transactions over time for each user.
For an event driven model, you specify:
The column to use to identify the row. For example, to track activity for users, you might use a column that contains a user name or identifier.
The column to use to sort the rows. This column contains a numeric representation of a datetime value.
Optionally, columns to use to provide conditions for sampling the data. When you sample the data, you specify the column values to use in the generated events. For example, you choose to condition the data based on a region column. When you sample the data, you can specify the regions for which to generate events.
To indicate that a model is event driven:
From the Model drop-down list, select Event Driven.
From the Primary Entity drop-down list, select the column to use to identify the row.
From the Order drop-down list, select the column to use to sort the rows. The order column can be a numeric column, a date column, or a datetime column.
Under Condition On, to configure a list of columns for conditional sampling:
To add a column, begin to type the column name. From the list of matching columns, select the column to add. You can only use categorical columns. The columns also should contain static data. For example, for a transaction, the account type is static. It is not affected by the transaction. The transaction type and remaining balance are dynamic. They are specific to an individual transaction.
To remove a column, click its delete icon.
The parameters under General Parameters are common to all models:
In the Epochs field, enter the number of times that the training process goes over the data. The default is 300. A higher value can increase the accuracy of the training results. However, it increases the amount of time that it takes to complete the training. It can also decrease the privacy of the results.
Use the Early Stopping toggle to indicate whether to use early stopping for model training. If Early Stopping is turned on, then the model training does not have to run the full number of epochs. It stops running when the model begins to overfit to the training data. If Early Stopping is turned off, then the model training runs the full number of configured epochs.
In the Batch Size field, enter the number of examples to use during each training step. The default is 500. A higher value can make the training more regular, but might require more epochs to converge to similar results.
In the Reconstruction Loss Factor field, enter the weight of the reconstruction loss term. The default is 2. The loss function for a variational autoencoder is essentially the sum of a “reconstruction loss” term and a regularization term. A higher value can help to produce decoded samples that are close to the encoded samples, but can also make latent representations more complicated and reduce the diversity of synthetic samples.
In the Latent Dimension field, enter the dimension of the latent representation. The default is 128. The latent dimension represents the complexity of the data. If the specified value is much higher than the intrinsic dimensionality of the data that you want to analyze, it can reduce the quality of the results.
In the Maximum Categorical Dimension field, enter the dimension for columns that have categorical or location encoding. The default is 35. If a column contains more distinct categories than this parameter, the most frequent categories are embedded as distinct one-hot vectors. The remaining categories are combined into a single one-hot vector. This limit prevents the model size from becoming extremely large and generally improves data quality.
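The Maximum Categorical Dimension behavior described above can be sketched as follows (illustrative only; Structural embeds categories as one-hot vectors, whereas this sketch uses a string label as a stand-in for the combined slot):

```python
from collections import Counter

def cap_categories(values, max_dim):
    """Keep the (max_dim - 1) most frequent categories; fold the rest
    into a single combined category, mirroring the shared one-hot vector."""
    counts = Counter(values)
    top = {cat for cat, _ in counts.most_common(max_dim - 1)}
    return [v if v in top else "__other__" for v in values]

data = ["a", "a", "b", "b", "c", "d"]
print(cap_categories(data, 3))
# → ['a', 'a', 'b', 'b', '__other__', '__other__']
```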
For an event driven model, to configure the RNN-VAE Parameters:
In the Maximum Sequence Length field, enter the maximum number of steps in a sequence that Tonic considers when it trains the event model. The default is 20. Longer source sequences are truncated to the maximum length. The resulting synthetic sequences have a length up to this value. Long sequences take longer to process, and can reduce the quality of the results.
In the Maximum Order Dimension field:
If the order column is numeric, then the order column is discretized. Set Maximum Order Dimension to the number of pieces to discretize the order column into.
If the order column is a date or datetime, set Maximum Order Dimension to the maximum number of distinct dates that the model considers. For datetime values, the time is ignored. If the number of dates in the data exceeds Maximum Order Dimension, then the model training fails.
In the RNN Encoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the encoder network. The default is 256.
In the RNN Decoder Hidden Size field, enter the number of parameters in the RNN internal states to use for the decoder network. The default is 256.
In the RNN Decoder Fully Connected Size field, enter the value to represent the complexity of the decoder’s fully connected layer. The default is 128. The hidden state passes through the fully connected layer to generate samples at each time interval.
In the Sequence Length Loss Factor field, enter the loss factor for sequencing for the model. The default is 128. The sequence length loss factor indicates how important it is to predict the sequence length. When you increase this number, Structural uses more of the model's capacity to capture the statistical properties of sequence lengths.
In the Order Column Loss Factor field, enter the loss factor for the column value order. The default is 128. The order column loss factor determines how important it is to predict the order of the column values. Similar to the sequence loss factor, when you increase this factor, it increases the realism of the synthetic order column values. The scale is different because order column values use different encodings.
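The discretization described for Maximum Order Dimension with a numeric order column can be sketched as follows (illustrative; equal-width binning is an assumption for the sketch, not a statement about Structural's algorithm):

```python
def discretize(values, max_order_dim):
    """Map numeric order-column values into max_order_dim equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / max_order_dim or 1  # guard against a constant column
    # Clamp the top edge so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), max_order_dim - 1) for v in values]

print(discretize([0, 5, 10, 15, 20], 4))
# → [0, 1, 2, 3, 3]
```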
For a tabular model, to configure the VAE Parameters:
In the Encoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the encoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
In the Decoder Layer Sizes field, type a comma-separated list of non-negative integers to specify the number of layers and the size of each layer for the decoder. The default is 256,256,256, which indicates that there are three layers, and that the size of each layer is 256. A higher number of layers or larger layer size increases the expressive capacity of the model. However, to produce good results, you must start with a larger dataset.
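To illustrate how the comma-separated setting maps to a network shape (illustrative Python; the pairing of the final layer with the Latent Dimension setting is an assumption for the sketch):

```python
def parse_layer_sizes(setting):
    # "256,256,256" → [256, 256, 256]: three layers of size 256.
    return [int(s) for s in setting.split(",")]

def encoder_shape(input_dim, layer_sizes_setting, latent_dim):
    """Return the (in, out) dimension pairs of an encoder built from the
    setting, ending at the latent dimension."""
    sizes = parse_layer_sizes(layer_sizes_setting)
    dims = [input_dim] + sizes + [latent_dim]
    return list(zip(dims[:-1], dims[1:]))

print(encoder_shape(100, "256,256,256", 128))
# → [(100, 256), (256, 256), (256, 256), (256, 128)]
```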
You install Tonic Structural either in a Docker container or on a Kubernetes cluster.
You cannot deploy Structural on Mac computers with Apple silicon (M1, M2).
The server or cluster that you deploy Structural to must at a minimum have access to the following resources:
If your source database is larger than 500 GB, or contains "large" values (approximately 1 MB or more) such as JSONB, XML, or NVARCHAR(max), then:
Increase the virtual CPUs to 8.
Add 16GB to the minimum memory.
If you have questions about the number of resources to allocate based on your source databases, contact support@tonic.ai.
For Docker, our recommendation is to use Linux with Docker and Docker Compose installed.
When you deploy Structural using Docker:
Both Docker and Docker Compose must be installed on the machine.
The Docker Daemon must be running.
The minimum required Docker version for future Structural compatibility is 20.10.10.
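A quick way to check the installed version against that minimum (the installed value below is a placeholder; on a real host, obtain it with `docker version --format '{{.Server.Version}}'`):

```shell
required="20.10.10"
installed="24.0.7"   # placeholder; substitute the output of: docker version --format '{{.Server.Version}}'

# sort -V orders version strings numerically; if the required version sorts
# first (or equal), the installed version meets the minimum.
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
  echo "Docker version OK"
else
  echo "Docker version too old"
fi
```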
Note that by default, Docker sets its MTU (maximum transmission unit) to 1500. Make sure that the MTU for the Docker network matches your environment. Some networks, such as the Google Cloud Platform (GCP) VPCs, have a lower default MTU (1460 in the case of GCP VPCs), which causes network problems for your Structural instance.
To change the Docker MTU setting, you must:
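For example, one common approach (an assumption about a Docker Engine on Linux setup, not Structural-specific guidance) is to set the MTU in the Docker daemon configuration. This sketch writes to a local file for illustration; on a real host the file is /etc/docker/daemon.json, and you must restart the Docker daemon afterwards:

```shell
# Illustrative only: the real target is /etc/docker/daemon.json.
# The "mtu" key sets the MTU for Docker's default bridge network.
cat > daemon.json <<'EOF'
{
  "mtu": 1460
}
EOF
cat daemon.json
```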
You can deploy Structural manually to a Kubernetes cluster. For a manual deployment, you must have:
A cluster created and configured.
A namespace to deploy Structural.
Both kubectl and helm must be installed on the machine. The minimum acceptable versions are:
kubectl: 1.17+
helm: 3+
You can run Structural on Amazon Elastic Container Service (Amazon ECS) on either Fargate or Amazon Elastic Compute Cloud (Amazon EC2) hosts.
Depending on your requirements, you can run it as either a single task definition or as multiple task definitions.
For more information about Structural in Amazon ECS or example task definition files, contact Tonic.ai support.
The Structural images are obtained from quay.io. For manual deployment, Structural provides the required credentials for you to use.
If possible, allowlist *.quay.io.
If you cannot allowlist based on DNS names, you can allowlist the IP addresses. To get the IP addresses for a URL, run the URL through nslookup (for example, nslookup cdn01.quay.io). You then allowlist those IP addresses.
The Structural application database (sometimes referred to as the metadata database) is a PostgreSQL database that stores the workspace and Structural configuration.
In most cases, the Structural application database is an external database that is hosted on a separate server.
For an external database, one small host (for example, an RDS t3.small on AWS with at least 100 GB of storage) can serve as the PostgreSQL server.
If you plan to create file connector workspaces that use files from a local file system, make sure that the storage space can accommodate the uploaded and generated files.
To prevent the loss of Structural metadata, keep regular backups of the PostgreSQL instance.
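One way to schedule such a backup (the host, user, and database names here are hypothetical placeholders; `pg_dump` is the standard PostgreSQL logical-backup tool). This sketch writes the crontab line to a file for illustration:

```shell
# Hypothetical connection details; substitute your own values.
# In cron, % must be escaped as \% inside the command.
echo '0 2 * * * pg_dump -h appdb.example.internal -U tonic -d tonic_app > /backups/tonic_app_$(date +\%F).sql' > tonic-backup.cron
cat tonic-backup.cron
```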
You should keep your PostgreSQL version relatively up-to-date with the current PostgreSQL LTS. The current recommended PostgreSQL version is 13+.
Tonic.ai might periodically conduct a campaign to request updates of self-hosted PostgreSQL instances before a scheduled update in the minimum supported version.
For the Structural application database, the current minimum supported version is PostgreSQL 10+.
The user credentials that you provide to Structural for the application database must have permission to create a database, create tables, insert, and select.
You must either:
Grant the account that Structural uses the necessary permissions to create the extension.
For a deployment to Docker, you have the option to run PostgreSQL in a Docker container on the Structural application server.
If you use this configuration, mount the data directory for the PostgreSQL container on the host machine and schedule regular backups.
To enable accelerated processing, the server where Structural is deployed must have access to an NVIDIA GPU with 16GB of GPU RAM.
To use GPU resources:
Platform-specific notes:
To make future updates easier, fork this repository.
The repository readme includes more detail on how to set the required and optional configuration parameters.
Structural notifies you when the current version is more than 10 versions behind the most recent release. The notification is on the System Status tab of Tonic Settings view.
When you make changes to your deployment with Helm, if your tonicVersion tag is latest, make sure that you update all of your individual pods/containers to the same version.
To make future updates easier, fork this repository.
The repository readme includes more detail on how to set environment settings. It also provides information on how to determine which containers are required for your deployment. For example, whether you use Docker to deploy the Structural application PostgreSQL database.
On the machine where you plan to deploy Structural, log in to Quay.io with credentials that Structural provides:
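The login step uses the standard `docker login` command against quay.io; the username and password below are placeholders for the credentials that Tonic provides:

```shell
QUAY_USER="your-quay-username"       # placeholder; use the credentials Tonic provides
QUAY_PASSWORD="your-quay-password"   # placeholder

# --password-stdin avoids exposing the password in the process list.
if command -v docker >/dev/null 2>&1; then
  echo "$QUAY_PASSWORD" | docker login quay.io -u "$QUAY_USER" --password-stdin
else
  echo "docker not found; run this on the deployment machine"
fi
```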
If you run Structural in a cloud environment, we strongly suggest that you enable SSL, or that you use some other mechanism to protect traffic to the machine. For example, you might make the instance available only over VPN. Your cloud provider should have instructions on how to accomplish this.
Structural notifies you when the current version is more than 10 versions behind the most recent release. The notification is on the System Status tab of Tonic Settings view.
At a minimum, to update Structural, run the following:
To free additional disk space before you complete the update, you can optionally include commands to remove unused images and volumes.
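A typical minimal update sequence for a Docker Compose deployment looks like the following (an assumption about a standard Compose setup, run from the directory that contains docker-compose.yml; this sketch writes the sequence to a script for illustration):

```shell
cat > update-structural.sh <<'EOF'
#!/bin/sh
# Pull the latest images and recreate the containers.
docker compose pull
docker compose up -d
# Optional: free disk space by removing unused images and volumes.
docker image prune -f
docker volume prune -f
EOF
chmod +x update-structural.sh
echo "wrote update-structural.sh"
```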
Tonic Ephemeral is a separate Tonic.ai product that allows you to create temporary databases to use for testing and demos. For more information about Ephemeral, go to the Ephemeral documentation.
Change it in the
On the Structural Docker network, set the driver bridge option
If you cannot use a wildcard, then you can allowlist specific quay.io URLs. For example: quay.io, cdn.quay.io, cdn01.quay.io, cdn02.quay.io, cdn03.quay.io. The actual URLs are controlled by Quay. The following includes a list of quay.io URLs.
Instead of an external database, a deployment to Docker also provides an option to run the application database in a Docker container on the Structural application server.
The Structural application database requires the extension
Install the extension.
To enable this, uncomment the that Structural provides.
For a Kubernetes deployment, Tonic Settings view includes an option to .
To enable this option, Structural requires access to .
For workspaces, the model training process can use GPU acceleration.
Ensure that the correct for your instance.
If deploying on Kubernetes, follow the instructions at .
If deploying on Docker, follow the .
AWS ships several Amazon Machine Images (AMIs) with NVIDIA drivers pre-installed.
On Azure, you can add an NVIDIA GPU driver extension to the virtual machine.
For information on installing NVIDIA drivers on GCP, see the .
A Tonic Structural Helm chart is located at: .
During the onboarding period, you are provided access credentials to our image repository. If you require new credentials, or you experience issues accessing the repository, contact Tonic.ai support to get access to our Quay.io Docker repository.
Review the .
To deploy and validate access to Structural, follow the .
To get the latest Structural version, users with the Update Tonic global permission can use the . Alternatively, if you need to specify a particular version of Structural to use, set , then run the following:
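A hedged sketch of a version-pinned Helm upgrade (the release name, chart reference, and namespace below are placeholders, not confirmed values from the documentation; the tonicVersion value follows the tag mentioned earlier on this page). The sketch writes the sequence to a script for illustration:

```shell
cat > upgrade-structural.sh <<'EOF'
#!/bin/sh
# <chart-reference>, <namespace>, and <version> are placeholders.
helm repo update
helm upgrade tonic-structural <chart-reference> \
  --namespace <namespace> \
  --set tonicVersion=<version>
EOF
echo "wrote upgrade-structural.sh"
```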
A Tonic Structural Docker Compose repository is located here: .
When you sign up with Structural, you are provided access credentials to our image repository. If you require new credentials, or you are unable to access the repository, contact Tonic.ai support to get access to our Quay.io Docker repository.
Review the .
To deploy and validate access to Structural, follow the .
If you use the standalone version of Compose, then use the docker-compose syntax instead of docker compose.
Composite generators
Composite generators apply a generator to a specific data element, or apply a generator based on a condition.
Primary key generators
Learn about generators that you can apply to primary key columns.
| Resource | Requirement |
| --- | --- |
| CPUs | 4 virtual CPUs. Must use x86 CPU architecture (Structural does not support ARM architecture). |
| Memory | Minimum of 16GB. 32GB recommended. |
| Available hard drive space | Minimum 100GB. 250GB recommended. If you use subsetting, then we recommend a non-burstable storage class, such as AWS io1, that is provisioned with and can sustain high input/output operations per second (IOPS). This can provide a significant performance improvement. |