1 of 9

Databricks

Databricks is a cloud-based platform for big data processing.

Tonic Structural can run Spark jobs on Databricks on AWS and Azure Databricks.

Structural process overview for Databricks

The following high-level diagram describes how Tonic Structural data generation is processed for Databricks.

For a Databricks workspace, the source data comes from a Databricks database.

The Databricks API coordinates the running of the data generation job on the Databricks cluster.

The destination data is written to either an S3 bucket or to a location in Azure Data Lake Storage Gen2.

System requirements for Databricks

Supported versions of Databricks

Tonic Structural supports Spark 2.4.x, Spark 3.0.x, and Spark 3.1.x. Spark 2.4.2 is not supported.

Any version of Databricks that runs one of those Spark versions should be compatible. Structural specifically has been tested against Databricks versions 9.1, and 10.4.

Supported providers

Structural supports the following data providers:

Source Provider

Output Provider

Supported table types

Databricks supports both MANAGED and EXTERNAL tables.

MANAGED tables store all of their data within Databricks.
EXTERNAL tables store their data on a separate file system (often S3).

Structural can read from both table types. When it writes output data, Structural only writes to EXTERNAL tables.

Structural differences and limitations with Databricks

Required license: Professional or Enterprise

No workspace inheritance

Databricks workspaces do not support workspace inheritance.

Table mode limitations

You can only assign the De-Identify or Truncate table modes.

For Truncate mode, the table is ignored completely. The table does not exist in the destination database.

Generator limitations

Based on the version of Databricks, a Databricks workspace can only use the following generators:

Databricks 10.4 and earlier

Databricks 11.3 and later

No subsetting, but support for table filtering

Databricks workspaces do not support subsetting.

However, for tables that use the De-Identify table mode, you can provide a WHERE clause to filter the table. For details, go to Using table filtering for data warehouses and Spark-based data connectors.

No upsert

Databricks workspaces do not support upsert.

No output to container artifacts

For Databricks workspaces, you cannot write the destination data to container artifacts.

No output to an Ephemeral snapshot

For Databricks workspaces, you cannot write the destination data to an Ephemeral snapshot.

Before you create a Databricks workspace

Before you create a workspace that uses the Databricks data connector, you must complete the following configuration.

Granting access to storage

On AWS, Tonic Structural reads data from external tables that use Amazon S3 as the storage location. It writes files to S3 buckets.

On Azure, Structural reads data from external tables that use Azure Data Lake Storage Gen2 (ADLSv2). It writes the files to ADLSv2.

The Databricks cluster must be granted appropriate permissions to access the storage locations.

When you use AWS Databricks with Amazon S3

For information on how to configure an instance profile for access to Amazon S3, go to .

This is the recommended method to grant the cluster access to the S3 buckets.

Modifications to the Databricks instructions

Instance profile for separate source and destination S3 buckets

The that Databricks provides assumes that the cluster reads from and writes to the same S3 bucket.

If your source and destination S3 buckets are different, you can use an instance profile similar to the following, which separates the read and write permissions.

Replace <source-bucket> and <destination-bucket> with your S3 bucket names.

S3 bucket policy for cross-account access

If your S3 buckets are in a separate account, then to allow the cluster access to the S3 buckets, you must create an S3 bucket policy as a cross-account trust relationship.

Similar to the instance profile, if you use separate S3 buckets for the source and destination, you can split the Databricks-provided definitions for the source and destination as shown in the following examples.

Source S3 bucket policy

This policy limits the instance profile to read-only (Get, List) access to the source S3 bucket.

Destination S3 bucket policy

This policy grants the instance profile both read and write access to the destination S3 bucket.

Alternatives to the instance profile

If you cannot or do not want to configure an instance profile, you can instead directly grant the cluster access to the S3 bucket.

To do this, you use your AWS Access Key and AWS Secret Access Key to set the following Spark configuration properties and values.

When you use Azure Databricks with ADLSv2

Azure provides several options for accessing ADLSv2 from Databricks.

For some of the methods, you must set various Spark configuration properties. The Azure Databricks documentation provides Python examples that use spark.conf.set(<property>,<value>).

Structural uses the abfss driver.

Setting up your Databricks cluster

Before you run your first Tonic Structural generation job on your Databricks cluster, you must complete the following steps.

These steps are not necessary if you only want connect to Databricks and view data in the Structural application.

Cluster requirements

Cluster type

Structural requires a standard or single node all-purpose cluster. High Concurrency clusters cannot be used to generate data, because they do not support running Scala workloads. Structural requires the ability to run Scala workloads on the cluster.

For versions earlier than 11.1, Structural also requires the ability to run Python workloads on the cluster.

SQL Warehouses (formerly SQL Endpoints) are also supported if the Use Databricks Jobs Cluster for running jobs option is enabled. In this case, the SQL Warehouse is used to power the Structural application, and the Job Cluster is used to run data generation.

Cluster access mode

The cluster access mode also must support Scala, as well as JAR jobs:

If you use Unity Catalog, then you must use Single user access mode.
If you do not use Unity Catalog, then you can use either:
- Single user
- No isolation shared

Cluster permissions

If Cluster, Pool, and Jobs Access Control is enabled on your instance of Databricks, then the user whose API token is used in the Configuring Databricks workspace data connections steps must have Can Manage permissions on the cluster.

The Structural application must be able to install our library on the cluster and to restart the cluster after the library is installed. A new library is installed on initial run and after you upgrade to a newer version of Structural.

If you configure a workspace to write to a catalog that does not already exist, then the user must also have permission to create the catalog.

Setting cluster Spark configuration parameters

Structural does not require you to set any specific Spark configuration parameters. However, this may be necessary in your environment. You can set the parameters on the cluster details in your Databricks portal.

To add these, on the cluster configurations page:

Expand the Advanced Options.
Select the Spark tab
In the Spark Config field, enter the configurations.

If you use a jobs cluster, you can provide these in the spark_conf section of the payload.

You might need Spark configuration parameters in the following cases:

Authenticating to an S3 bucket if not using an instance profile.
Authenticating to ADLSv2 if not using direct access.
Legacy Spark date compatibility.

If you require legacy Spark date compatibility, set the following optional properties:

Creating a cluster using the API

To create clusters, you can use the Databricks API on the endpoint /api/2.0/clusters/create.

For information about the Databricks API, including how to authenticate, see the AWS or Azure documentation.

Below is a sample payload that you can use as a starting point to create a Structural-compatible cluster.

{
    "cluster_name": "newcluster",
    "spark_version": "14.2.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",
    "driver_node_type_id": "i3.xlarge",
    "spark_conf": {
        //Optional - For AWS if not using instance_profile_arn below
        //"spark.fs.s3a.access.key": "<AWS Access Key ID>",
        //"fs.s3n.awsAccessKeyId": "<AWS Access Key ID>",
        //"spark.fs.s3a.secret.key": "<AWS Secret Access Key>",
        //"fs.s3n.awsSecretAccessKey": "<AWS Secret Access Key>",
        //Optional - For Azure, using account key to access ADLSv2
        //spark.hadoop.fs.azure.account.key.<acount-name>.dfs.core.windows.net: <account-key>
        //Optional - For backwards data compatibility with Spark 3.0 (See SPARK-31404)        
        //"spark.sql.legacy.avro.datetimeRebaseModeInRead": "CORRECTED",
        //"spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
        //"spark.sql.legacy.avro.datetimeRebaseModeInWrite": "CORRECTED",
        //"spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED"
    },
    "aws_attributes": {
        "instance_profile_arn": <AWS Databricks Instance Profile ARN>,
        "availability": "ON_DEMAND",
        "zone_id": "us-east-1d",
        "ebs_volume_count": 1,
        "ebs_volumne_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_size": 100
    },
    "num_workers": 1,
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "init_scripts": [
        {
            "dbfs": {
                "destination": "dbfs:/spark-dotnet/tonic-init.sh"
            }
        }
    ],
    "cluster_log_conf": {
        "dbfs": {
          "destination": "dbfs:/cluster-logs"
        }
    }
}

Configuring the destination database schema creation

By default, during data generation to Databricks Delta tables, Structural creates the database schema for the destination database tables, then populates the database tables based on the workspace configuration.

You can also choose to manage the database schema yourself. You can:

Configure Structural to always skip the schema creation
Enable a workspace configuration option to skip the schema creation

You can add these settings to the Environment Settings list on Structural Settings.

Configuring whether to always skip the schema creation

To manage the database schema yourself, set the environment setting TONIC_DATABRICKS_SKIP_CREATE_DB to true.

When the environment setting TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB is true, then TONIC_DATABRICKS_SKIP_CREATE_DB determines the default value in the workspace configuration.

When TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB is false, then TONIC_DATABRICKS_SKIP_CREATE_DB determines for the entire instance whether Structural skips the schema creation.

Configuring whether to provide a workspace configuration option

By default, the workspace configuration includes the Skip Destination Database Schema Creation option, which allows you to determine for the specific workspace whether to skip the database schema creation.

The default configuration is based on the value of TONIC_DATABRICKS_SKIP_CREATE_DB.

To not include the option in the workspace configuration, set TONIC_DATABRICKS_ENABLE_WORKSPACE_SKIP_CREATE_DB to false.

Configuring Databricks workspace data connections

During workspace creation, under Connection Type, select Databricks.

Identifying the source database

In the Source Server section:

In the Catalog Name field, provide the name of the catalog where the source database is located. If you do not provide a catalog name, then the default catalog is used. For Unity Catalog, this is the catalog that you configured as the default. For earlier versions that do not support Unity Catalog, the default is hive_metastore.
In the Database Name field, provide the name of the source database.

Enabling validation of table filters

For Databricks workspaces, you can provide where clauses to filter tables. For details, go to #table-mode-filter-tables.

The Enable partition filter validation toggle indicates whether Tonic Structural should validate those filters when you create them.

By default, the setting is in the on position, and Structural validates the filters. To disable the validation, toggle Enable partition filter validation to the off position.

Blocking data generation on all schema changes

By default, data generation is not blocked as long as schema changes do not conflict with your workspace configuration.

To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration switch Block data generation on schema changes to the on position.

Connecting to the Databricks cluster

In the Databricks Cluster section, you provide the connection information for the cluster.

Under Databricks Type, select whether to use Databricks on AWS or Azure Databricks.
In the API Token field, provide the API token for Databricks. For information on how to generate an API token, go to the Databricks documentation.
In the Host URL field, provide the URL for the cluster host.
In the HTTP Path field, provide the path to the cluster.
In the Port field, provide the port to use to access the cluster.
By default, data generation jobs run on the specified cluster. To instead run data generation jobs on an ephemeral Databricks job cluster:
1. Toggle Use Databricks Job Cluster to the on position.
2. In the Cluster Information text area, provide the details for the job cluster.
For clusters that use Databricks runtime 10.4 and below, Structural installs a cluster initialization script, which is stored as a Databricks workspace file. By default, this script is uploaded to the /Shared workspace directory. To upload the script to a different directory, set Workspace Path to an absolute path in the workspace tree. Structural must have access to the directory.
To test the connection to the cluster, click Test Cluster Connection.

Connecting to the destination server

In the Destination Settings section, you specify where Structural writes the destination database.

Selecting the output type

Under Output Storage Type, select the type of storage to use for the destination data:

To use Databricks Delta tables, click Databricks.
To use Amazon S3, click Amazon S3 Files.
To use Azure, click Azure Data Lake Storage Gen2 Files.

Configuring the output settings for Databricks Delta tables

If you selected Databricks as the output type:

In the Catalog Name field, provide the name of the catalog that contains the database If the Databricks cluster connection supports multiple catalogs (Unity Catalog) and you do not specify a catalog, then Structural uses the default catalog. For connections that use the legacy metastore, you can leave the field blank, or set it to hive_metastore. Note that if you specify a catalog that does not already exist, then the user that is associated with the API token must have permission to create the catalog.
In the Database Name field, provide the name of the database.
The Skip Destination Database Schema Creation option determines whether Structural creates the destination database schema during data generation.
Your Structural administrator determines whether the option is available and the default setting.
When the setting is in the on position, then Structural does not create the schema, and you must manage it yourself. When the setting is in the off position, then Structural does create the schema.

If you do not specify a database, Structural uses the database name default in the active catalog.

Configuring the output settings for Amazon S3 or Azure

If you selected either Amazon S3 Files or Azure Data Lake Storage Gen2 Files as the output type:

In the Output Location field, provide the location in either Amazon S3 or Azure for the destination data.
By default, Structural writes the results of each data generation to a different folder. To create the folder, it appends a GUID to the end of the output location. To instead always write the results to the specified output location, and overwrite the results of the previous job, toggle Create job specific destination folder to the off position.
If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Structural Settings. Note that any defined table-level Error on Override setting takes precedence over these settings.
- TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.
- TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. When this is set to a value other than null, which is the default, then this setting takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite.
By default, each output table is written in the format used by the corresponding input table. To instead write all output tables to a single format:
1. Toggle Write all output to a specific type to the on position.
2. From the Select output type dropdown list, select the output format to use. The options are:
  - Avro
  - JSON
  - Parquet
  - Delta
  - CSV
  - ORC
3. If you select CSV, you also configure the file format.
  1. To treat the first row as a header, check Treat first row as a column header. The box is checked by default.
  2. In the Column Delimiter field, type the character to use to separate the columns. The default is a comma (,).
  3. In the Escape Character field, type the character to use to escape special characters. The default is a backslash (\).
  4. In the Quoting Character field, type the character to use to quote text values. The default is a double quote (").
  5. In the NULL Value Replacement String field, type the string to use to represent null values. The default is an empty string.

Configuring Databricks workspace data connections

During workspace creation, under Connection Type, select Databricks.

Identifying the source database

In the Source Server section:

In the Catalog Name field, provide the name of the catalog where the source database is located. If you do not provide a catalog name, then the default catalog is used. For Unity Catalog, this is the catalog that you configured as the default. For earlier versions that do not support Unity Catalog, the default is hive_metastore.
In the Database Name field, provide the name of the source database.

Enabling validation of table filters

For Databricks workspaces, you can provide where clauses to filter tables. For details, go to #table-mode-filter-tables.

The Enable partition filter validation toggle indicates whether Tonic Structural should validate those filters when you create them.

By default, the setting is in the on position, and Structural validates the filters. To disable the validation, toggle Enable partition filter validation to the off position.

Blocking data generation on all schema changes

By default, data generation is not blocked as long as schema changes do not conflict with your workspace configuration.

Connecting to the Databricks cluster

In the Databricks Cluster section, you provide the connection information for the cluster.

Under Databricks Type, select whether to use Databricks on AWS or Azure Databricks.
In the API Token field, provide the API token for Databricks. For information on how to generate an API token, go to the Databricks documentation.
In the Host URL field, provide the URL for the cluster host.
In the HTTP Path field, provide the path to the cluster.
In the Port field, provide the port to use to access the cluster.
By default, data generation jobs run on the specified cluster. To instead run data generation jobs on an ephemeral Databricks job cluster:
1. Toggle Use Databricks Job Cluster to the on position.
2. In the Cluster Information text area, provide the details for the job cluster.
For clusters that use Databricks runtime 10.4 and below, Structural installs a cluster initialization script, which is stored as a Databricks workspace file. By default, this script is uploaded to the /Shared workspace directory. To upload the script to a different directory, set Workspace Path to an absolute path in the workspace tree. Structural must have access to the directory.
To test the connection to the cluster, click Test Cluster Connection.

Connecting to the destination server

In the Destination Settings section, you specify where Structural writes the destination database.

Selecting the output type

Under Output Storage Type, select the type of storage to use for the destination data:

To use Databricks Delta tables, click Databricks.
To use Amazon S3, click Amazon S3 Files.
To use Azure, click Azure Data Lake Storage Gen2 Files.

Configuring the output settings for Databricks Delta tables

If you selected Databricks as the output type:

In the Catalog Name field, provide the name of the catalog that contains the database If the Databricks cluster connection supports multiple catalogs (Unity Catalog) and you do not specify a catalog, then Structural uses the default catalog. For connections that use the legacy metastore, you can leave the field blank, or set it to hive_metastore. Note that if you specify a catalog that does not already exist, then the user that is associated with the API token must have permission to create the catalog.
In the Database Name field, provide the name of the database.
The Skip Destination Database Schema Creation option determines whether Structural creates the destination database schema during data generation.
Your Structural administrator determines whether the option is available and the default setting.
When the setting is in the on position, then Structural does not create the schema, and you must manage it yourself. When the setting is in the off position, then Structural does create the schema.

If you do not specify a database, Structural uses the database name default in the active catalog.

Configuring the output settings for Amazon S3 or Azure

If you selected either Amazon S3 Files or Azure Data Lake Storage Gen2 Files as the output type:

In the Output Location field, provide the location in either Amazon S3 or Azure for the destination data.
By default, Structural writes the results of each data generation to a different folder. To create the folder, it appends a GUID to the end of the output location. To instead always write the results to the specified output location, and overwrite the results of the previous job, toggle Create job specific destination folder to the off position.
If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Structural Settings. Note that any defined table-level Error on Override setting takes precedence over these settings.
- TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.
- TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. When this is set to a value other than null, which is the default, then this setting takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite.
By default, each output table is written in the format used by the corresponding input table. To instead write all output tables to a single format:
1. Toggle Write all output to a specific type to the on position.
2. From the Select output type dropdown list, select the output format to use. The options are:
  - Avro
  - JSON
  - Parquet
  - Delta
  - CSV
  - ORC
3. If you select CSV, you also configure the file format.
  1. To treat the first row as a header, check Treat first row as a column header. The box is checked by default.
  2. In the Column Delimiter field, type the character to use to separate the columns. The default is a comma (,).
  3. In the Escape Character field, type the character to use to escape special characters. The default is a backslash (\).
  4. In the Quoting Character field, type the character to use to quote text values. The default is a double quote (").
  5. In the NULL Value Replacement String field, type the string to use to represent null values. The default is an empty string.

Setting up your Databricks cluster

Before you run your first Tonic Structural generation job on your Databricks cluster, you must complete the following steps.

These steps are not necessary if you only want connect to Databricks and view data in the Structural application.

Cluster requirements

Cluster type

For versions earlier than 11.1, Structural also requires the ability to run Python workloads on the cluster.

Cluster access mode

The cluster access mode also must support Scala, as well as JAR jobs:

If you use Unity Catalog, then you must use Single user access mode.
If you do not use Unity Catalog, then you can use either:
- Single user
- No isolation shared

Cluster permissions

If you configure a workspace to write to a catalog that does not already exist, then the user must also have permission to create the catalog.

Setting cluster Spark configuration parameters

To add these, on the cluster configurations page:

Expand the Advanced Options.
Select the Spark tab
In the Spark Config field, enter the configurations.

If you use a jobs cluster, you can provide these in the spark_conf section of the payload.

You might need Spark configuration parameters in the following cases:

Authenticating to an S3 bucket if not using an instance profile.
Authenticating to ADLSv2 if not using direct access.
Legacy Spark date compatibility.

If you require legacy Spark date compatibility, set the following optional properties:

Property/Key

Value

Creating a cluster using the API

To create clusters, you can use the Databricks API on the endpoint /api/2.0/clusters/create.

For information about the Databricks API, including how to authenticate, see the AWS or Azure documentation.

Below is a sample payload that you can use as a starting point to create a Structural-compatible cluster.

{
    "cluster_name": "newcluster",
    "spark_version": "14.2.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",
    "driver_node_type_id": "i3.xlarge",
    "spark_conf": {
        //Optional - For AWS if not using instance_profile_arn below
        //"spark.fs.s3a.access.key": "<AWS Access Key ID>",
        //"fs.s3n.awsAccessKeyId": "<AWS Access Key ID>",
        //"spark.fs.s3a.secret.key": "<AWS Secret Access Key>",
        //"fs.s3n.awsSecretAccessKey": "<AWS Secret Access Key>",
        //Optional - For Azure, using account key to access ADLSv2
        //spark.hadoop.fs.azure.account.key.<acount-name>.dfs.core.windows.net: <account-key>
        //Optional - For backwards data compatibility with Spark 3.0 (See SPARK-31404)        
        //"spark.sql.legacy.avro.datetimeRebaseModeInRead": "CORRECTED",
        //"spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
        //"spark.sql.legacy.avro.datetimeRebaseModeInWrite": "CORRECTED",
        //"spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED"
    },
    "aws_attributes": {
        "instance_profile_arn": <AWS Databricks Instance Profile ARN>,
        "availability": "ON_DEMAND",
        "zone_id": "us-east-1d",
        "ebs_volume_count": 1,
        "ebs_volumne_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_size": 100
    },
    "num_workers": 1,
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "init_scripts": [
        {
            "dbfs": {
                "destination": "dbfs:/spark-dotnet/tonic-init.sh"
            }
        }
    ],
    "cluster_log_conf": {
        "dbfs": {
          "destination": "dbfs:/cluster-logs"
        }
    }
}

Granting access to storage

On AWS, Tonic Structural reads data from external tables that use Amazon S3 as the storage location. It writes files to S3 buckets.

On Azure, Structural reads data from external tables that use Azure Data Lake Storage Gen2 (ADLSv2). It writes the files to ADLSv2.

The Databricks cluster must be granted appropriate permissions to access the storage locations.

When you use AWS Databricks with Amazon S3

For information on how to configure an instance profile for access to Amazon S3, go to .

This is the recommended method to grant the cluster access to the S3 buckets.

Modifications to the Databricks instructions

Instance profile for separate source and destination S3 buckets

The that Databricks provides assumes that the cluster reads from and writes to the same S3 bucket.

If your source and destination S3 buckets are different, you can use an instance profile similar to the following, which separates the read and write permissions.

Replace <source-bucket> and <destination-bucket> with your S3 bucket names.

S3 bucket policy for cross-account access

If your S3 buckets are owned by the same account in which the Databricks cluster is provisioned, you do not need an .

If your S3 buckets are in a separate account, then to allow the cluster access to the S3 buckets, you must create an S3 bucket policy as a cross-account trust relationship.

Source S3 bucket policy

This policy limits the instance profile to read-only (Get, List) access to the source S3 bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example source permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<source-s3-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::<source-s3-bucket-name>/*"
    }
  ]
}

Destination S3 bucket policy

This policy grants the instance profile both read and write access to the destination S3 bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example destination permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<destination-s3-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::<destination-s3-bucket-name>/*"
    }
  ]
}

Alternatives to the instance profile

If you cannot or do not want to configure an instance profile, you can instead directly grant the cluster access to the S3 bucket.

To do this, you use your AWS Access Key and AWS Secret Access Key to set the following Spark configuration properties and values.

Property/Key

Value

You enter the Spark configuration parameters when you .

When you use Azure Databricks with ADLSv2

Azure provides several options for accessing ADLSv2 from Databricks.

For details, go to .

For some of the methods, you must set various Spark configuration properties. The Azure Databricks documentation provides Python examples that use spark.conf.set(<property>,<value>).

For Structural, you must provide these in the cluster configuration. Several of the methods recommend the use of . To reference a secret, follow . You enter the Spark configuration parameters when you .

Structural uses the abfss driver.