Initial Cluster Setup

Modifications that must be made to your cluster before first Tonic generation

This page describes the steps that you must take before you run your first Tonic generation job on your Databricks cluster. These steps are not required if you only want to connect to Databricks and view data in the Tonic UI.

Initialization Script

Tonic requires an initialization script to run against your Databricks cluster on startup. To add the script, navigate to your cluster in the Databricks console, expand "Advanced Options", go to the "Init Scripts" tab, and add the script dbfs:/spark-dotnet/tonic-init.sh.

Adding Tonic's initialization script to the Databricks cluster
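If the script is not already present in DBFS, you can upload it through the DBFS REST API. Below is a minimal sketch in Python; it assumes you have the script locally as tonic-init.sh and that your workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (both names are illustrative, not required by Tonic).

# Sketch: upload tonic-init.sh to DBFS via the Databricks REST API.
# Assumes DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com)
# and DATABRICKS_TOKEN are set in the environment.
import base64
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# The DBFS put endpoint takes file contents as a base64 string.
with open("tonic-init.sh", "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/spark-dotnet/tonic-init.sh",  # destination referenced above
        "contents": contents,
        "overwrite": True,
    },
)
resp.raise_for_status()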

Environment Variables

Tonic requires several Spark configuration values to be set on the cluster. These are set under "Advanced Options", on the "Spark" tab, in the "Spark Config" text box. The following values must be set:

Variable name                                        Value
spark.sql.legacy.parquet.datetimeRebaseModeInRead    CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite   CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead       CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite      CORRECTED
fs.s3n.awsAccessKeyId                                <AWS Access Key ID>
spark.fs.s3a.access.key                              <AWS Access Key ID>
fs.s3n.awsSecretAccessKey                            <AWS Secret Access Key>
spark.fs.s3a.secret.key                              <AWS Secret Access Key>
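The "Spark Config" text box takes one configuration per line, with the key and value separated by a space. Pasted directly (replace the bracketed placeholders with your own credentials), the values above look like this:

spark.sql.legacy.parquet.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite CORRECTED
fs.s3n.awsAccessKeyId <AWS Access Key ID>
spark.fs.s3a.access.key <AWS Access Key ID>
fs.s3n.awsSecretAccessKey <AWS Secret Access Key>
spark.fs.s3a.secret.key <AWS Secret Access Key>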

Creating a cluster via the API

Clusters can be created via the Databricks API using the /api/2.0/clusters/create endpoint. Documentation for the Databricks API, including how to authenticate, is available in the Databricks REST API reference.

Below is a sample payload that you can use as a starting point to stand up your own Tonic-compatible cluster:

{
  "cluster_name": "newcluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "driver_node_type_id": "i3.xlarge",
  "spark_conf": {
    "spark.fs.s3a.access.key": "<AWS Access Key ID>",
    "fs.s3n.awsAccessKeyId": "<AWS Access Key ID>",
    "spark.fs.s3a.secret.key": "<AWS Secret Access Key>",
    "fs.s3n.awsSecretAccessKey": "<AWS Secret Access Key>",
    "spark.sql.legacy.avro.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.avro.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED"
  },
  "aws_attributes": {
    "availability": "ON_DEMAND",
    "zone_id": "us-east-1d",
    "ebs_volume_count": 1,
    "ebs_volume_type": "GENERAL_PURPOSE_SSD",
    "ebs_volume_size": 100
  },
  "num_workers": 1,
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "init_scripts": [
    {
      "dbfs": {
        "destination": "dbfs:/spark-dotnet/tonic-init.sh"
      }
    }
  ],
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}
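To submit the payload, POST it to the endpoint with a bearer token. A minimal Python sketch follows; it assumes the payload above is saved as cluster.json and that DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment (both names are illustrative).

# Sketch: create a Tonic-compatible cluster via the Clusters API.
import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Load the sample payload shown above.
with open("cluster.json") as f:
    payload = json.load(f)

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
# On success, the API returns the new cluster's ID.
print("Created cluster:", resp.json()["cluster_id"])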
