Configuring Databricks workspace data connections

During workspace creation, under Connection Type, select Databricks.

Identifying the source database

In the Source Server section:

  1. By default, the workspace includes a single schema. To enable the option to provide multiple schemas, toggle Use multiple Unity Catalog source schemas to the on position.

  2. In the Catalog Name field, provide the name of the catalog where the source database is located. If you do not provide a catalog name, then the default catalog is used. For Unity Catalog, this is the catalog that you configured as the default. For earlier versions that do not support Unity Catalog, the default is hive_metastore.

  3. If you did not enable multiple schemas, then in the Database Name field, provide the name of the source database schema. If you did enable multiple schemas, then in the Database Names field, for each database schema to include, type the database name, then press Enter. You must provide at least two database schemas. For one way to confirm the catalog and schema names before you enter them, see the sketch after this list.
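If you are not sure which values to enter, you can list the available catalogs and schemas directly in Databricks. The following is a minimal sketch, assuming that you run it in a Databricks notebook where the spark session is predefined; the catalog name my_catalog is a placeholder, not a value that Structural requires.

```python
# Run in a Databricks notebook, where `spark` is already defined.

# List the catalogs that the cluster can see (Unity Catalog).
spark.sql("SHOW CATALOGS").show(truncate=False)

# List the schemas (databases) in a specific catalog.
# `my_catalog` is a placeholder; substitute the catalog that you plan to use
# as the workspace Catalog Name.
spark.sql("SHOW SCHEMAS IN my_catalog").show(truncate=False)
```

The schema names that this returns are the values to enter in Database Name or Database Names.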

Enabling validation of table filters

For Databricks workspaces, you can provide where clauses to filter tables. For details, go to Applying a filter to tables.

The Enable partition filter validation toggle indicates whether Tonic Structural validates those filters when you create them.

By default, the setting is in the on position, and Structural validates the filters. To disable the validation, toggle Enable partition filter validation to the off position.
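For reference, a table filter is an ordinary WHERE-clause predicate over the table's columns. The snippet below is purely illustrative; the column name and the 90-day window are hypothetical, and the supported syntax is described in Applying a filter to tables.

```python
# Illustrative only: the kind of WHERE-clause predicate you might attach to a
# large Databricks table so that Structural reads only recent data.
# `transaction_date` is a hypothetical column name.
table_filter = "transaction_date >= date_sub(current_date(), 90)"
```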

Blocking data generation on all schema changes

By default, data generation is not blocked for schema changes that do not conflict with your workspace configuration.

To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.

Connecting to the Databricks cluster

In the Databricks Cluster section, you provide the connection information for the cluster.

  1. Under Databricks Type, select whether to use Databricks on AWS or Azure Databricks.

  2. In the Host URL field, provide the URL for the cluster host.

  3. In the HTTP Path field, provide the path to the cluster.

  4. In the Port field, provide the port to use to access the cluster.

  5. In the API Token field, provide the API token for Databricks. For information on how to generate an API token, go to the Databricks documentation.

  6. By default, data generation jobs run on the specified cluster. To instead run data generation jobs on an ephemeral Databricks job cluster:

    1. Toggle Use Databricks Job Cluster to the on position.

    2. In the Cluster Information text area, provide the details for the job cluster.

  7. For clusters that use Databricks runtime 10.4 or earlier, Structural installs a cluster initialization script, which is stored as a Databricks workspace file. By default, the script is uploaded to the /Shared workspace directory. To upload the script to a different directory, set Workspace Path to an absolute path in the workspace tree. Structural must have access to that directory.

  8. To test the connection to the cluster, click Test Cluster Connection. For a similar connectivity check from outside Structural, see the sketch after this list.
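If you want to sanity-check the host URL, HTTP path, and API token outside of Structural, one option is the databricks-sql-connector Python package, which accepts the same values. This is a minimal sketch, assuming the cluster is reachable over the default HTTPS port; the hostname, path, and token shown are placeholders.

```python
# pip install databricks-sql-connector
from databricks import sql

# Placeholders: use the same values that you enter in the Host URL,
# HTTP Path, and API Token fields of the workspace configuration.
SERVER_HOSTNAME = "adb-1234567890123456.7.azuredatabricks.net"
HTTP_PATH = "sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh"
ACCESS_TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=HTTP_PATH,
    access_token=ACCESS_TOKEN,
) as connection:
    with connection.cursor() as cursor:
        # A trivial query that succeeds on any reachable cluster.
        cursor.execute("SELECT 1")
        print(cursor.fetchall())
```

If this query succeeds, the same values should also pass the Test Cluster Connection check.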

Connecting to the destination server

In the Destination Settings section, you specify where Structural writes the destination database.

Selecting the output type

Under Output Storage Type, select the type of storage to use for the destination data:

  • To use Databricks Delta tables, click Databricks.

  • To use Amazon S3, click Amazon S3 Files.

  • To use Azure, click Azure Data Lake Storage Gen2 Files.

Configuring the output settings for Databricks Delta tables

If you selected Databricks as the output type:

  1. In the Catalog Name field, provide the name of the catalog that contains the database. If the Databricks cluster connection supports multiple catalogs (Unity Catalog) and you do not specify a catalog, then Structural uses the default catalog. For connections that use the legacy metastore, you can leave the field blank, or set it to hive_metastore. Note that if you specify a catalog that does not already exist, then the user that is associated with the API token must have permission to create the catalog.

  2. If you did not provide multiple source database schemas, then in the Database Name field, provide the name of the database. If you do not specify a database, Structural uses the database named default in the active catalog. If you did provide multiple source database schemas, then the Database Name field does not display. Structural automatically creates destination schemas that match the source schemas.

  3. The Skip Destination Database Schema Creation option determines whether Structural creates the destination database schema during data generation.

    Your Structural administrator determines whether the option is available and what its default setting is.

    When the setting is in the on position, Structural does not create the schema, and you must manage it yourself (see the sketch after this list for one way to pre-create it). When the setting is in the off position, Structural creates the schema.
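If Skip Destination Database Schema Creation is turned on, the destination catalog and schema must exist before data generation runs. One way to pre-create them is from a Databricks notebook, as in the minimal sketch below; the catalog and schema names are placeholders, CREATE CATALOG requires Unity Catalog, and the user needs the corresponding privileges.

```python
# Run in a Databricks notebook, where `spark` is already defined.
# Placeholders: use the Catalog Name and Database Name values from the
# workspace destination settings.
catalog = "my_destination_catalog"  # hypothetical name
schema = "my_destination_schema"    # hypothetical name

# CREATE CATALOG is only available with Unity Catalog. For a legacy
# hive_metastore connection, create only the schema.
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
```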

Configuring the output settings for Amazon S3 or Azure

If you selected either Amazon S3 Files or Azure Data Lake Storage Gen2 Files as the output type:

  1. In the Output Location field, provide the location in either Amazon S3 or Azure for the destination data.

  2. By default, Structural writes the results of each data generation to a different folder. To create the folder, it appends a GUID to the end of the output location. To instead always write the results to the specified output location, and overwrite the results of the previous job, toggle Create job specific destination folder to the off position.

    If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Structural Settings. Note that any defined table-level Error on Override setting takes precedence over these settings.

    • TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.

    • TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. When this is set to a value other than null, which is the default, this setting takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite.

  3. By default, each output table is written in the format used by the corresponding input table. To instead write all output tables to a single format:

    1. Toggle Write all output to a specific type to the on position.

    2. From the Select output type dropdown list, select the output format to use. The options are:

      • Avro

      • JSON

      • Parquet

      • Delta

      • CSV

      • ORC

    3. If you select CSV, you also configure the file format. The sketch after this list shows how these settings map to Spark reader options.

      1. To treat the first row as a header, check Treat first row as a column header. The box is checked by default.

      2. In the Column Delimiter field, type the character to use to separate the columns. The default is a comma (,).

      3. In the Escape Character field, type the character to use to escape special characters. The default is a backslash (\).

      4. In the Quoting Character field, type the character to use to quote text values. The default is a double quote (").

      5. In the NULL Value Replacement String field, type the string to use to represent null values. The default is an empty string.
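The CSV settings above correspond directly to Spark reader options, which can be useful when you consume the generated files. The sketch below reads the output back with PySpark using the default values listed above; the output path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path: the configured Output Location, plus the job-specific
# GUID folder if you left that option enabled.
output_path = "s3://my-bucket/tonic-output/<job-guid>/my_table"

# These options mirror the CSV format settings above (defaults shown).
df = (
    spark.read
    .option("header", True)   # Treat first row as a column header
    .option("sep", ",")       # Column Delimiter
    .option("escape", "\\")   # Escape Character
    .option("quote", '"')     # Quoting Character
    .option("nullValue", "")  # NULL Value Replacement String
    .csv(output_path)
)
df.show()
```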
