Before you run your first Tonic Structural generation job on your Databricks cluster, you must complete the following steps.
These steps are not necessary if you only want to connect to Databricks and view data in the Structural application.
Structural requires a standard or single node all-purpose cluster. High Concurrency clusters cannot be used to generate data, because they do not support running Scala workloads. Structural requires the ability to run Scala workloads on the cluster.
For Databricks Runtime versions earlier than 11.1, Structural also requires the ability to run Python workloads on the cluster.
SQL Warehouses (formerly SQL Endpoints) are also supported if the Use Databricks Jobs Cluster for running jobs option is enabled. In this case, the SQL Warehouse is used to power the Structural application, and the Job Cluster is used to run data generation.
The cluster access mode must also support Scala and JAR jobs:
If you use Unity Catalog, then you must use Single user access mode.
If you do not use Unity Catalog, then you can use either:
Single user
No isolation shared
If Cluster, Pool, and Jobs Access Control is enabled on your instance of Databricks, then the user whose API token is used in the Configuring Databricks workspace data connections steps must have Can Manage permissions on the cluster.
The Structural application must be able to install our library on the cluster and to restart the cluster after the library is installed. A new library is installed on initial run and after you upgrade to a newer version of Structural.
If you configure a workspace to write to a catalog that does not already exist, then the user must also have permission to create the catalog.
Structural does not require you to set any specific Spark configuration parameters. However, your environment might require some. You can set the parameters on the cluster details in your Databricks portal.
To add these, on the cluster configurations page:
Expand the Advanced Options.
Select the Spark tab.
In the Spark Config field, enter the configurations.
If you use a jobs cluster, you can provide these in the spark_conf section of the payload.
You might need Spark configuration parameters in the following cases:
Authenticating to an S3 bucket if not using an instance profile.
Authenticating to ADLSv2 if not using direct access.
Legacy Spark date compatibility.
If you require legacy Spark date compatibility, set each of the following optional properties to CORRECTED:
spark.sql.legacy.parquet.datetimeRebaseModeInRead
spark.sql.legacy.parquet.datetimeRebaseModeInWrite
spark.sql.legacy.avro.datetimeRebaseModeInRead
spark.sql.legacy.avro.datetimeRebaseModeInWrite
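For example, in the cluster's Spark Config field, each property and its value are entered on one line, separated by a space. A sketch of the entries for the properties above:

```
spark.sql.legacy.parquet.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite CORRECTED
```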
To create clusters, you can use the Databricks API endpoint /api/2.0/clusters/create.
For information about the Databricks API, including how to authenticate, see the AWS or Azure documentation.
Below is a sample payload that you can use as a starting point to create a Structural-compatible cluster.
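The payload below is a minimal sketch for an AWS workspace. The cluster name, Spark version, node type, and instance profile ARN are placeholder values to replace with ones that are appropriate for your environment. On Azure, use an Azure node type and omit the aws_attributes section. Add a spark_conf section if your environment requires Spark configuration parameters.

```json
{
  "cluster_name": "structural-data-generation",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 1,
  "autotermination_minutes": 60,
  "data_security_mode": "SINGLE_USER",
  "aws_attributes": {
    "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<instance-profile-name>"
  }
}
```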
Before you create a workspace that uses the Databricks data connector, you must complete the following configuration.
On AWS, Tonic Structural reads data from external tables that use Amazon S3 as the storage location. It writes files to S3 buckets.
On Azure, Structural reads data from external tables that use Azure Data Lake Storage Gen2 (ADLSv2). It writes the files to ADLSv2.
The Databricks cluster must be granted appropriate permissions to access the storage locations.
For information on how to configure an instance profile for access to Amazon S3, go to Configuring S3 access with instance profiles in the Databricks documentation.
This is the recommended method to grant the cluster access to the S3 buckets.
The instance profile definition that Databricks provides assumes that the cluster reads from and writes to the same S3 bucket.
If your source and destination S3 buckets are different, you can use an instance profile similar to the following, which separates the read and write permissions.
Replace <source-bucket> and <destination-bucket> with your S3 bucket names.
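As an illustration, a policy of this shape might look like the following sketch. Confirm the exact actions against the Databricks instance profile documentation and your own security requirements.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSourceBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<source-bucket>",
        "arn:aws:s3:::<source-bucket>/*"
      ]
    },
    {
      "Sid": "ReadWriteDestinationBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<destination-bucket>",
        "arn:aws:s3:::<destination-bucket>/*"
      ]
    }
  ]
}
```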
If your S3 buckets are owned by the same account in which the Databricks cluster is provisioned, you do not need an S3 bucket policy.
If your S3 buckets are in a separate account, then to allow the cluster access to the S3 buckets, you must create an S3 bucket policy as a cross-account trust relationship.
Similar to the instance profile, if you use separate S3 buckets for the source and destination, you can split the Databricks-provided definitions for the source and destination as shown in the following examples.
Source S3 bucket policy
This policy limits the instance profile to read-only (Get, List) access to the source S3 bucket.
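A sketch of such a policy, where <aws-account-id-databricks> and <iam-role-for-s3-access> are placeholders for the account and IAM role that back the instance profile:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SourceBucketReadOnly",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": ["s3:GetObject", "s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<source-bucket>",
        "arn:aws:s3:::<source-bucket>/*"
      ]
    }
  ]
}
```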
Destination S3 bucket policy
This policy grants the instance profile both read and write access to the destination S3 bucket.
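A sketch of such a policy, using the same placeholders as the source bucket policy; adjust the actions to match your requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DestinationBucketReadWrite",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<destination-bucket>",
        "arn:aws:s3:::<destination-bucket>/*"
      ]
    }
  ]
}
```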
If you cannot or do not want to configure an instance profile, you can instead directly grant the cluster access to the S3 bucket.
To do this, you use your AWS access key ID and AWS secret access key to set the following Spark configuration properties and values.
fs.s3n.awsAccessKeyId: <AWS Access Key ID>
fs.s3n.awsSecretAccessKey: <AWS Secret Access Key>
spark.fs.s3a.access.key: <AWS Access Key ID>
spark.fs.s3a.secret.key: <AWS Secret Access Key>
You enter the Spark configuration parameters when you set up your cluster.
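If you use a jobs cluster instead, the same properties can go in the spark_conf section of the payload. A sketch of that fragment, with placeholder values:

```json
"spark_conf": {
  "fs.s3n.awsAccessKeyId": "<AWS Access Key ID>",
  "fs.s3n.awsSecretAccessKey": "<AWS Secret Access Key>",
  "spark.fs.s3a.access.key": "<AWS Access Key ID>",
  "spark.fs.s3a.secret.key": "<AWS Secret Access Key>"
}
```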
Azure provides several options for accessing ADLSv2 from Databricks.
For details, go to Access Azure Data Lake Storage Gen2 and Blob Storage in the Azure Databricks documentation.
For some of the methods, you must set various Spark configuration properties. The Azure Databricks documentation provides Python examples that use spark.conf.set(<property>, <value>).
For Structural, you must provide these in the cluster configuration. Several of the methods recommend the use of secrets. To reference a secret, follow these instructions. You enter the Spark configuration parameters when you set up your cluster.
Structural uses the abfss driver.
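As an illustration, one of the documented methods (OAuth 2.0 with a service principal) sets Spark properties similar to the following in the cluster's Spark Config field. The property names follow the Azure Databricks documentation for that method; the storage account, application (client) ID, tenant ID, and secret scope and key are placeholders. Check the Azure Databricks documentation for the exact properties for the method that you choose.

```
fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope-name>/<secret-name>}}
fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token
```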
By default, during data generation to Databricks Delta tables, Structural creates the database schema for the destination database tables, then populates the database tables based on the workspace configuration.
You can also choose to manage the database schema yourself. You can:
Configure Structural to always skip the schema creation
Enable a workspace configuration option to skip the schema creation
You can add these settings to the Environment Settings list on Structural Settings.
To manage the database schema yourself, set TONIC_DATABRICKS_SKIP_CREATE_DB to true.
When the environment setting TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB is true, then TONIC_DATABRICKS_SKIP_CREATE_DB determines the default value in the workspace configuration.
When TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB is false, then TONIC_DATABRICKS_SKIP_CREATE_DB determines for the entire instance whether Structural skips the schema creation.
By default, the workspace configuration includes the Skip Destination Database Schema Creation option, which allows you to determine for the specific workspace whether to skip the database schema creation.
The default configuration is based on the value of TONIC_DATABRICKS_SKIP_CREATE_DB.
To exclude the option from the workspace configuration, set TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB to false.
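For example, the following combination, shown here as name-value pairs to add to the Environment Settings list, skips the schema creation for the entire instance and removes the option from the workspace configuration:

```
TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB=false
TONIC_DATABRICKS_SKIP_CREATE_DB=true
```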