Setting up your Databricks cluster
Before you run your first Tonic Structural data generation job on your Databricks cluster, you must complete the following steps.
These steps are not necessary if you only want to connect to Databricks and view data in the Structural application.
Cluster requirements
Cluster type
Structural requires a standard or single node all-purpose cluster, because it must be able to run Scala workloads on the cluster. High Concurrency clusters cannot be used to generate data, because they do not support Scala workloads.
For versions earlier than 11.1, Structural also requires the ability to run Python workloads on the cluster.
SQL Warehouses (formerly SQL Endpoints) are also supported if the Use Databricks Jobs Cluster for running jobs option is enabled. In this case, the SQL Warehouse is used to power the Structural application, and the Job Cluster is used to run data generation.
Cluster access mode
The cluster access mode also must support Scala, as well as JAR jobs:
If you use Unity Catalog, then you must use Single user access mode.
If you do not use Unity Catalog, then you can use either:
Single user
No isolation shared
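The access mode rules above can be expressed as a small check. This is only a sketch: the function names are hypothetical, and the strings are the access mode labels from the list above, not Databricks API identifiers.

```python
from typing import List

# Access mode labels as listed above (UI labels, not API values).
SINGLE_USER = "Single user"
NO_ISOLATION_SHARED = "No isolation shared"


def allowed_access_modes(uses_unity_catalog: bool) -> List[str]:
    """Return the access modes that the rules above permit."""
    if uses_unity_catalog:
        # Unity Catalog requires Single user access mode.
        return [SINGLE_USER]
    # Without Unity Catalog, either mode can be used.
    return [SINGLE_USER, NO_ISOLATION_SHARED]


def check_access_mode(mode: str, uses_unity_catalog: bool) -> None:
    """Raise ValueError if the chosen mode is not permitted."""
    allowed = allowed_access_modes(uses_unity_catalog)
    if mode not in allowed:
        raise ValueError(
            f"Access mode {mode!r} is not supported here; use one of {allowed}"
        )
```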
Cluster permissions
If Cluster, Pool, and Jobs Access Control is enabled on your instance of Databricks, then the user whose API token is used in the Configuring Databricks workspace data connections steps must have Can Manage permissions on the cluster.
The Structural application must be able to install our library on the cluster and to restart the cluster after the library is installed. A new library is installed during the initial run and after you upgrade to a newer version of Structural.
If you configure a workspace to write to a catalog that does not already exist, then the user must also have permission to create the catalog.
Setting cluster Spark configuration parameters
Structural does not require you to set any specific Spark configuration parameters. However, your environment might require some. You can set the parameters on the cluster details in your Databricks portal.
To add Spark configuration parameters, on the cluster configuration page:
Expand the Advanced Options.
Select the Spark tab.
In the Spark Config field, enter the configurations.
If you use a jobs cluster, you can provide these in the spark_conf section of the payload.
You might need Spark configuration parameters in the following cases:
Authenticating to an S3 bucket if not using an instance profile.
Authenticating to ADLSv2 if not using direct access.
Legacy Spark date compatibility.
If you require legacy Spark date compatibility, set the following optional properties:
spark.sql.legacy.parquet.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead CORRECTED
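As a sketch, the four legacy date properties can be kept in one mapping and used either as the spark_conf section of a jobs cluster payload or rendered as text for the Spark Config field. The variable and function names are hypothetical. The rendering assumes that the Spark Config field takes one property per line, with the key and value separated by a space.

```python
# The four optional legacy date compatibility properties listed above.
LEGACY_DATE_SPARK_CONF = {
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.avro.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.avro.datetimeRebaseModeInRead": "CORRECTED",
}


def to_spark_config_text(conf: dict) -> str:
    """Render a mapping as 'key value' lines for the Spark Config field."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # Paste the printed lines into the Spark Config field, or pass the
    # dictionary itself as the spark_conf section of a jobs cluster payload.
    print(to_spark_config_text(LEGACY_DATE_SPARK_CONF))
```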
Creating a cluster using the API
To create clusters, you can use the Databricks API on the endpoint /api/2.0/clusters/create.
For information about the Databricks API, including how to authenticate, see the AWS or Azure documentation.
Below is a sample payload that you can use as a starting point to create a Structural-compatible cluster.
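As an illustration only, and not necessarily the exact sample payload that Tonic provides, the following Python sketch assembles a payload for a Single user cluster and builds the POST request for /api/2.0/clusters/create. The function names, the cluster name, and the placeholder values for spark_version, node_type_id, and single_user_name are assumptions. Replace them with values that are valid in your workspace, and check the field names against the Databricks API documentation for your cloud.

```python
import json
import urllib.request


def build_cluster_payload(cluster_name, spark_version, node_type_id,
                          single_user_name, num_workers=1, spark_conf=None):
    """Assemble a clusters/create payload (illustrative values only)."""
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,    # placeholder, workspace-specific
        "node_type_id": node_type_id,      # placeholder, cloud-specific
        "num_workers": num_workers,
        # Single user access mode, which Unity Catalog requires.
        "data_security_mode": "SINGLE_USER",
        "single_user_name": single_user_name,
        # Optional Spark configuration parameters, if your environment needs them.
        "spark_conf": dict(spark_conf or {}),
    }


def build_create_request(workspace_url, token, payload):
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    payload = build_cluster_payload(
        cluster_name="structural-cluster",
        spark_version="<spark-version>",
        node_type_id="<node-type-id>",
        single_user_name="<user-email>",
    )
    print(json.dumps(payload, indent=2))
    # To submit the request, pass the result of build_create_request(...)
    # to urllib.request.urlopen with a real workspace URL and API token.
```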