Before you run your first Tonic Structural generation job on your Databricks cluster, you must complete the following steps.
These steps are not necessary if you only want to connect to Databricks and view data in the Structural application.
Structural requires a standard or single node all-purpose cluster. High Concurrency clusters cannot be used to generate data, because they do not support running Scala workloads. Structural requires the ability to run Scala workloads on the cluster.
For Databricks Runtime versions earlier than 11.1, Structural also requires the ability to run Python workloads on the cluster.
SQL Warehouses (formerly SQL Endpoints) are also supported if the Use Databricks Jobs Cluster for running jobs option is enabled. In this case, the SQL Warehouse is used to power the Structural application, and the Job Cluster is used to run data generation.
The cluster access mode must also support Scala and JAR jobs:
If you use Unity Catalog, then you must use Single user access mode.
If you do not use Unity Catalog, then you can use either:
Single user
No isolation shared
If Cluster, Pool, and Jobs Access Control is enabled on your instance of Databricks, then the user whose API token is used in the Configuring Databricks workspace data connections steps must have Can Manage permissions on the cluster.
The Structural application must be able to install our library on the cluster and to restart the cluster after the library is installed. A new library is installed on initial run and after you upgrade to a newer version of Structural.
If you configure a workspace to write to a catalog that does not already exist, then the user must also have permission to create the catalog.
Structural does not require you to set any specific Spark configuration parameters. However, your environment might require some. You can set the parameters on the cluster details in your Databricks portal.
To add these, on the cluster configurations page:
Expand the Advanced Options.
Select the Spark tab.
In the Spark Config field, enter the configurations.
If you use a jobs cluster, you can provide these in the spark_conf section of the payload.
You might need Spark configuration parameters in the following cases:
Authenticating to an S3 bucket if not using an instance profile.
Authenticating to ADLSv2 if not using direct access.
Legacy Spark date compatibility.
If you require legacy Spark date compatibility, set each of the following optional properties to CORRECTED:
spark.sql.legacy.parquet.datetimeRebaseModeInRead
spark.sql.legacy.parquet.datetimeRebaseModeInWrite
spark.sql.legacy.avro.datetimeRebaseModeInRead
spark.sql.legacy.avro.datetimeRebaseModeInWrite
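For example, in the cluster's Spark Config field, each property and its value are entered on one line, separated by a space. A sketch of the entries for the properties above:

```
spark.sql.legacy.parquet.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInRead CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite CORRECTED
```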
To create clusters, you can use the Databricks API endpoint /api/2.0/clusters/create.
For information about the Databricks API, including how to authenticate, see the AWS or Azure documentation.
Below is a sample payload that you can use as a starting point to create a Structural-compatible cluster.
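The payload below is a minimal sketch for an AWS workspace. The cluster name, Spark version, node type, and instance profile ARN are placeholder values to replace with ones that are appropriate for your environment. On Azure, use an Azure node type and omit the aws_attributes section. Add a spark_conf section if your environment requires Spark configuration parameters.

```json
{
  "cluster_name": "structural-data-generation",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 1,
  "autotermination_minutes": 60,
  "data_security_mode": "SINGLE_USER",
  "aws_attributes": {
    "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<instance-profile-name>"
  }
}
```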
Before you create a workspace that uses the Databricks data connector, you must complete the following configuration.
On AWS, Tonic Structural reads data from external tables that use Amazon S3 as the storage location. It writes files to S3 buckets.
On Azure, Structural reads data from external tables that use Azure Data Lake Storage Gen2 (ADLSv2). It writes the files to ADLSv2.
The Databricks cluster must be granted appropriate permissions to access the storage locations.
For information on how to configure an instance profile for access to Amazon S3, go to Configuring S3 access with instance profiles in the Databricks documentation.
This is the recommended method to grant the cluster access to the S3 buckets.
The instance profile definition that Databricks provides assumes that the cluster reads from and writes to the same S3 bucket.
If your source and destination S3 buckets are different, you can use an instance profile similar to the following, which separates the read and write permissions.
Replace <source-bucket> and <destination-bucket> with your S3 bucket names.
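As an illustration, a policy of this shape might look like the following sketch. Confirm the exact actions against the Databricks instance profile documentation and your own security requirements.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSourceBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<source-bucket>",
        "arn:aws:s3:::<source-bucket>/*"
      ]
    },
    {
      "Sid": "ReadWriteDestinationBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<destination-bucket>",
        "arn:aws:s3:::<destination-bucket>/*"
      ]
    }
  ]
}
```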
If your S3 buckets are owned by the same account in which the Databricks cluster is provisioned, you do not need an S3 bucket policy.
If your S3 buckets are in a separate account, then to allow the cluster access to the S3 buckets, you must create an S3 bucket policy as a cross-account trust relationship.
Similar to the instance profile, if you use separate S3 buckets for the source and destination, you can split the Databricks-provided definitions for the source and destination as shown in the following examples.
Source S3 bucket policy
This policy limits the instance profile to read-only (Get, List) access to the source S3 bucket.
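A sketch of such a policy, where <aws-account-id-databricks> and <iam-role-for-s3-access> are placeholders for the account and IAM role that back the instance profile:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SourceBucketReadOnly",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": ["s3:GetObject", "s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<source-bucket>",
        "arn:aws:s3:::<source-bucket>/*"
      ]
    }
  ]
}
```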
Destination S3 bucket policy
This policy grants the instance profile both read and write access to the destination S3 bucket.
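A sketch of such a policy, using the same placeholders as the source bucket policy; adjust the actions to match your requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DestinationBucketReadWrite",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<destination-bucket>",
        "arn:aws:s3:::<destination-bucket>/*"
      ]
    }
  ]
}
```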
If you cannot or do not want to configure an instance profile, you can instead directly grant the cluster access to the S3 bucket.
To do this, you use your AWS access key ID and AWS secret access key to set the following Spark configuration properties and values.
fs.s3n.awsAccessKeyId: <AWS Access Key ID>
fs.s3n.awsSecretAccessKey: <AWS Secret Access Key>
spark.fs.s3a.access.key: <AWS Access Key ID>
spark.fs.s3a.secret.key: <AWS Secret Access Key>
You enter the Spark configuration parameters when you set up your cluster.
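If you use a jobs cluster instead, the same properties can go in the spark_conf section of the payload. A sketch of that fragment, with placeholder values:

```json
"spark_conf": {
  "fs.s3n.awsAccessKeyId": "<AWS Access Key ID>",
  "fs.s3n.awsSecretAccessKey": "<AWS Secret Access Key>",
  "spark.fs.s3a.access.key": "<AWS Access Key ID>",
  "spark.fs.s3a.secret.key": "<AWS Secret Access Key>"
}
```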
Azure provides several options for accessing ADLSv2 from Databricks.
For details, go to Access Azure Data Lake Storage Gen2 and Blob Storage in the Azure Databricks documentation.
For some of the methods, you must set various Spark configuration properties. The Azure Databricks documentation provides Python examples that use spark.conf.set(<property>, <value>).
For Structural, you must provide these in the cluster configuration. Several of the methods recommend the use of secrets. To reference a secret, follow these instructions. You enter the Spark configuration parameters when you set up your cluster.
Structural uses the abfss driver.
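As an illustration, one of the documented methods (OAuth 2.0 with a service principal) sets Spark properties similar to the following in the cluster's Spark Config field. The property names follow the Azure Databricks documentation for that method; the storage account, application (client) ID, tenant ID, and secret scope and key are placeholders. Check the Azure Databricks documentation for the exact properties for the method that you choose.

```
fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope-name>/<secret-name>}}
fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token
```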
By default, during data generation to Databricks Delta tables, Structural creates the database schema for the destination database tables, then populates the database tables based on the workspace configuration.
You can also choose to manage the database schema yourself. You can:
Configure Structural to always skip the schema creation
Enable a workspace configuration option to skip the schema creation
You can add these settings to the Environment Settings list on Structural Settings.
To manage the database schema yourself, set TONIC_DATABRICKS_SKIP_CREATE_DB to true.
When the environment setting TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB is true, then TONIC_DATABRICKS_SKIP_CREATE_DB determines the default value in the workspace configuration.
When TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB is false, then TONIC_DATABRICKS_SKIP_CREATE_DB determines for the entire instance whether Structural skips the schema creation.
By default, the workspace configuration includes the Skip Destination Database Schema Creation option, which allows you to determine for the specific workspace whether to skip the database schema creation.
The default configuration is based on the value of TONIC_DATABRICKS_SKIP_CREATE_DB.
To exclude the option from the workspace configuration, set TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB to false.
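For example, the following combination, shown here as name-value pairs to add to the Environment Settings list, skips the schema creation for the entire instance and removes the option from the workspace configuration:

```
TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB=false
TONIC_DATABRICKS_SKIP_CREATE_DB=true
```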