Initial Cluster Setup

Modifications that must be made to your cluster before first Tonic generation

This page describes the steps that you must take before you run your first Tonic generation job on your Databricks cluster. These steps are not required if you only want to connect to Databricks and view data in the Tonic UI.

Initialization Script

Tonic requires an initialization script to run against your Databricks cluster on startup. To add the script, navigate to your cluster in the Databricks console, expand "Advanced Options", go to the "Init Scripts" tab, and add the script dbfs:/spark-dotnet/tonic-init.sh.

Adding Tonic's initialization script to the Databricks cluster
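If the script is not already present in DBFS, you can upload it through the DBFS REST API. Below is a minimal sketch in Python; it assumes you have the script locally as tonic-init.sh and that your workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (both names are illustrative, not required by Tonic).

# Sketch: upload tonic-init.sh to DBFS via the Databricks REST API.
# Assumes DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com)
# and DATABRICKS_TOKEN are set in the environment.
import base64
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# The DBFS put endpoint takes file contents as a base64 string.
with open("tonic-init.sh", "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/spark-dotnet/tonic-init.sh",  # destination referenced above
        "contents": contents,
        "overwrite": True,
    },
)
resp.raise_for_status()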

Environment Variables

Tonic requires several Spark configuration values to be set on the cluster. These are set under "Advanced Options", on the "Spark" tab, in the "Spark Config" text box. The following values must be set:

Variable name                                        Value
spark.sql.legacy.parquet.datetimeRebaseModeInRead    CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite   CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead       CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite      CORRECTED
fs.s3n.awsAccessKeyId                                <AWS Access Key ID>
spark.fs.s3a.access.key                              <AWS Access Key ID>
fs.s3n.awsSecretAccessKey                            <AWS Secret Access Key>
spark.fs.s3a.secret.key                              <AWS Secret Access Key>
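The "Spark Config" text box takes one configuration per line, with the key and value separated by a space. Pasted directly (replace the bracketed placeholders with your own credentials), the values above look like this:

spark.sql.legacy.parquet.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite CORRECTED
fs.s3n.awsAccessKeyId <AWS Access Key ID>
spark.fs.s3a.access.key <AWS Access Key ID>
fs.s3n.awsSecretAccessKey <AWS Secret Access Key>
spark.fs.s3a.secret.key <AWS Secret Access Key>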

Creating a cluster via the API

Clusters can be created via the Databricks API using the /api/2.0/clusters/create endpoint. Documentation for the Databricks API, including how to authenticate, is available in the Databricks REST API reference.

Below is a sample payload that you can use as a starting point to stand up your own Tonic-compatible cluster:

{
  "cluster_name": "newcluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "driver_node_type_id": "i3.xlarge",
  "spark_conf": {
    "spark.fs.s3a.access.key": "<AWS Access Key ID>",
    "fs.s3n.awsAccessKeyId": "<AWS Access Key ID>",
    "spark.fs.s3a.secret.key": "<AWS Secret Access Key>",
    "fs.s3n.awsSecretAccessKey": "<AWS Secret Access Key>",
    "spark.sql.legacy.avro.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.avro.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED"
  },
  "aws_attributes": {
    "availability": "ON_DEMAND",
    "zone_id": "us-east-1d",
    "ebs_volume_count": 1,
    "ebs_volume_type": "GENERAL_PURPOSE_SSD",
    "ebs_volume_size": 100
  },
  "num_workers": 1,
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "init_scripts": [
    {
      "dbfs": {
        "destination": "dbfs:/spark-dotnet/tonic-init.sh"
      }
    }
  ],
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}
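To submit the payload, POST it to the endpoint with a bearer token. A minimal Python sketch follows; it assumes the payload above is saved as cluster.json and that DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment (both names are illustrative).

# Sketch: create a Tonic-compatible cluster via the Clusters API.
import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Load the sample payload shown above.
with open("cluster.json") as f:
    payload = json.load(f)

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
# On success, the API returns the new cluster's ID.
print("Created cluster:", resp.json()["cluster_id"])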
