Configuring Databricks workspace data connections

During workspace creation, under Connection Type, select Databricks.

Identifying the source database

In the Source Server section:

  1. In the Catalog Name field, provide the name of the catalog where the source database is located. If you do not provide a catalog name, then the default catalog is used. For Unity Catalog, this is the catalog that you configured as the default. For earlier versions that do not support Unity Catalog, the default is hive_metastore.

  2. In the Database Name field, provide the name of the source database.
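
If you are not sure of the exact catalog or database names to enter, one way to confirm them outside of Structural is to query the cluster metadata with the databricks-sql-connector package (installed separately with pip install databricks-sql-connector). This is a minimal sketch, not part of Structural; the connection values are placeholders and are the same host, HTTP path, and API token that you enter later under Databricks Cluster.

```python
# Sketch: list the catalogs and databases (schemas) visible to the connection,
# to confirm the values to enter above. All connection values are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder host
    http_path="sql/protocolv1/o/1234567890123456/0123-456789-abcde123",  # placeholder path
    access_token="dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX",  # placeholder token
) as connection:
    with connection.cursor() as cursor:
        cursor.catalogs()  # catalogs visible to this connection (Unity Catalog)
        for row in cursor.fetchall():
            print(row)

        cursor.schemas(catalog_name="main")  # databases in one catalog; "main" is a placeholder
        for row in cursor.fetchall():
            print(row)
```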

Enabling validation of table filters

For Databricks workspaces, you can provide WHERE clauses to filter tables. For details, go to Applying a filter to tables.

The Enable partition filter validation toggle indicates whether Tonic Structural should validate those filters when you create them.

By default, the setting is in the on position, and Structural validates the filters. To disable the validation, toggle Enable partition filter validation to the off position.
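
Before you enter a filter, you may want to confirm that the WHERE clause is valid against the source table. The sketch below is one manual way to do that, assuming the databricks-sql-connector package; it is not how Structural itself performs validation, and the catalog, database, table, and filter text are hypothetical.

```python
# Sketch: manually sanity-check a filter clause against the source table.
# Everything here is a hypothetical example, not Structural code.
from databricks import sql

filter_clause = "transaction_date >= '2024-01-01'"  # hypothetical filter text

with sql.connect(
    server_hostname="<host url>",   # same values as the Databricks Cluster section
    http_path="<http path>",
    access_token="<api token>",
) as connection:
    with connection.cursor() as cursor:
        # LIMIT 0 checks that the clause parses and references real columns
        # without pulling any rows back.
        cursor.execute(
            "SELECT * FROM my_catalog.my_database.transactions "
            f"WHERE {filter_clause} LIMIT 0"
        )
        print("Filter clause is valid for this table.")
```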

Blocking data generation on all schema changes

By default, data generation is not blocked as long as schema changes do not conflict with your workspace configuration.

To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.

Connecting to the Databricks cluster

In the Databricks Cluster section, you provide the connection information for the cluster.

  1. Under Databricks Type, select whether to use Databricks on AWS or Azure Databricks.

  2. In the API Token field, provide the API token for Databricks. For information on how to generate an API token, go to the Databricks documentation.

  3. In the Host URL field, provide the URL for the cluster host.

  4. In the HTTP Path field, provide the path to the cluster.

  5. In the Port field, provide the port to use to access the cluster.

  6. By default, data generation jobs run on the specified cluster. To instead run data generation jobs on an ephemeral Databricks job cluster:

    1. Toggle Use Databricks Job Cluster to the on position.

    2. In the Cluster Information text area, provide the details for the job cluster (an example specification appears after this list).

  7. For clusters that use Databricks Runtime 10.4 and below, Structural installs a cluster initialization script, which is stored as a Databricks workspace file. By default, this script is uploaded to the /Shared workspace directory. To upload the script to a different directory, set Workspace Path to an absolute path in the workspace tree. Structural must have access to the directory.

  8. To test the connection to the cluster, click Test Cluster Connection.
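
The exact format that the Cluster Information text area expects is defined by Structural; the sketch below assumes it accepts a JSON cluster specification along the lines of the Databricks Jobs API new_cluster object. Check the Structural and Databricks documentation for the exact format. The runtime version, node type, and worker count are placeholder values.

```python
# Sketch: one possible job cluster specification, expressed as JSON.
# This assumes a Databricks Jobs API-style new_cluster object; all values
# are placeholders.
import json

job_cluster_spec = {
    "spark_version": "13.3.x-scala2.12",  # placeholder Databricks Runtime version
    "node_type_id": "i3.xlarge",          # placeholder node type (AWS example)
    "num_workers": 2,                     # placeholder worker count
}

print(json.dumps(job_cluster_spec, indent=2))
```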

Connecting to the destination server

In the Destination Settings section, you specify where Structural writes the destination database.

Selecting the output type

Under Output Storage Type, select the type of storage to use for the destination data:

  • To use Databricks Delta tables, click Databricks.

  • To use Amazon S3, click Amazon S3 Files.

  • To use Azure Data Lake Storage Gen2, click Azure Data Lake Storage Gen2 Files.

Configuring the output settings for Databricks Delta tables

If you selected Databricks as the output type:

  1. In the Catalog Name field, provide the name of the catalog that contains the database. If the Databricks cluster connection supports multiple catalogs (Unity Catalog) and you do not specify a catalog, then Structural uses the default catalog. For connections that use the legacy metastore, you can leave the field blank, or set it to hive_metastore. Note that if you specify a catalog that does not already exist, then the user that is associated with the API token must have permission to create the catalog.

  2. In the Database Name field, provide the name of the database.

  3. The Skip Destination Database Schema Creation option determines whether Structural creates the destination database schema during data generation.

    Your Structural administrator determines whether the option is available and the default setting.

    When the setting is in the on position, Structural does not create the schema, and you must manage it yourself (one way to create the catalog and schema manually is sketched at the end of this section). When the setting is in the off position, Structural creates the schema.

If you do not specify a database, Structural uses the database named default in the active catalog.
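
If you manage the destination catalog and database yourself, one way to create them manually is to run SQL through the databricks-sql-connector package, as sketched below. The catalog and database names are placeholders, and the user that is associated with the API token needs the corresponding CREATE privileges.

```python
# Sketch: manually creating the destination catalog and database (schema)
# when Skip Destination Database Schema Creation is on. Names are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<host url>",   # same values as the Databricks Cluster section
    http_path="<http path>",
    access_token="<api token>",
) as connection:
    with connection.cursor() as cursor:
        # Unity Catalog: create the destination catalog if it does not exist.
        cursor.execute("CREATE CATALOG IF NOT EXISTS tonic_destination")
        # Create the destination database (schema) inside that catalog.
        cursor.execute("CREATE SCHEMA IF NOT EXISTS tonic_destination.tonic_output")
```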

Configuring the output settings for Amazon S3 or Azure

If you selected either Amazon S3 Files or Azure Data Lake Storage Gen2 Files as the output type:

  1. In the Output Location field, provide the location in either Amazon S3 or Azure for the destination data.

  2. By default, Structural writes the results of each data generation to a different folder. To create the folder, it appends a GUID to the end of the output location. To instead always write the results to the specified output location and overwrite the results of the previous job, toggle Create job specific destination folder to the off position.

    If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Structural Settings. Note that any defined table-level Error on Override setting takes precedence over these settings.

    • TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.

    • TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. The default is null. When this is set to any other value, it takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite (see the save mode sketch at the end of this section).

  3. By default, each output table is written in the format used by the corresponding input table. To instead write all output tables to a single format:

    1. Toggle Write all output to a specific type to the on position.

    2. From the Select output type dropdown list, select the output format to use. The options are:

      • Avro

      • JSON

      • Parquet

      • Delta

      • CSV

      • ORC

    3. If you select CSV, you also configure the file format (see the CSV sketch at the end of this section):

      1. To treat the first row as a header, check Treat first row as a column header. The box is checked by default.

      2. In the Column Delimiter field, type the character to use to separate the columns. The default is a comma (,).

      3. In the Escape Character field, type the character to use to escape special characters. The default is a backslash (\).

      4. In the Quoting Character field, type the character to use to quote text values. The default is a double quote (").

      5. In the NULL Value Replacement String field, type the string to use to represent null values. The default is an empty string.
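
The TONIC_WORKSPACE_DEFAULT_SAVE_MODE values correspond to the standard Spark save modes. The sketch below is not Structural code; it only illustrates, with hypothetical PySpark on a Databricks cluster (where a spark session already exists), how each mode behaves when the destination folder already contains data.

```python
# Sketch: how the four save mode values behave when the destination folder
# already contains data. Illustrative PySpark only; the data and path are
# hypothetical, and `spark` is the session available on a Databricks cluster.
df = spark.range(10)                                     # hypothetical data
output_path = "s3://my-bucket/tonic-output/customers"    # hypothetical non-job-specific folder

df.write.mode("overwrite").parquet(output_path)          # Overwrite: replace whatever is already there

df.write.mode("append").parquet(output_path)             # Append: add files alongside the existing data

df.write.mode("ignore").parquet(output_path)             # Ignore: data already exists, so the write is skipped

try:
    df.write.mode("errorifexists").parquet(output_path)  # ErrorIfExists: data already exists, so this raises
except Exception as err:
    print(f"Write blocked: {err}")
```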
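
The CSV settings above parallel the options of a Spark CSV writer. The sketch below is illustrative only, not Structural code; the DataFrame and output path are hypothetical, and the option values shown are the defaults described above.

```python
# Sketch: the CSV file format settings expressed as Spark CSV writer options.
# The data and output location are hypothetical.
df = spark.createDataFrame(
    [(1, "Ada, Lovelace", None), (2, "Grace Hopper", "Amazing Grace")],
    ["id", "full_name", "nickname"],
)

(
    df.write
    .mode("overwrite")
    .option("header", True)      # Treat first row as a column header
    .option("sep", ",")          # Column Delimiter
    .option("escape", "\\")      # Escape Character
    .option("quote", '"')        # Quoting Character
    .option("nullValue", "")     # NULL Value Replacement String
    .csv("s3://my-bucket/tonic-output/customers_csv")  # hypothetical location
)
```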
