Granting access to storage

On AWS, Tonic Structural reads data from external tables that use Amazon S3 as the storage location. It writes files to S3 buckets.

On Azure, Structural reads data from external tables that use Azure Data Lake Storage Gen2 (ADLSv2). It writes files to ADLSv2.

The Databricks cluster must be granted appropriate permissions to access the storage locations.

When you use AWS Databricks with Amazon S3

For information on how to configure an instance profile for access to Amazon S3, go to Configuring S3 access with instance profiles in the Databricks documentation.

This is the recommended method to grant the cluster access to the S3 buckets.

Modifications to the Databricks instructions

Instance profile for separate source and destination S3 buckets

The instance profile definition that Databricks provides assumes that the cluster reads from and writes to the same S3 bucket.

If your source and destination S3 buckets are different, you can use an instance profile similar to the following, which separates the read and write permissions.

Replace <source-bucket> and <destination-bucket> with your S3 bucket names.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3SourceRoot",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<source-bucket>"
            ]
        },
        {
            "Sid": "S3SourceSubdirectories",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<source-bucket>/*"
            ]
        },
        {
            "Sid": "S3DestinationRoot",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<destination_bucket>"
            ]
        },
        {
            "Sid": "S3DestinationSubdirectories",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::<destination-bucket>/*"
            ]
        }
    ]
}
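
If you prefer to script this step instead of following the console-based Databricks instructions, the following is a minimal boto3 sketch. It attaches the policy above (saved locally as instance-profile-policy.json) as an inline policy on the IAM role behind the instance profile, then creates the instance profile and adds the role to it. The role name, policy name, and file name are placeholders for illustration, not values that Databricks or Structural require.

import boto3

iam = boto3.client("iam")

# Placeholder names; substitute the role that backs your cluster's instance profile.
ROLE_NAME = "databricks-s3-access-role"
POLICY_NAME = "structural-s3-access"

# Attach the policy document shown above as an inline policy on the role.
with open("instance-profile-policy.json") as f:
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName=POLICY_NAME,
        PolicyDocument=f.read(),
    )

# Create the instance profile and associate the role with it.
# Skip these two calls if the instance profile already exists.
iam.create_instance_profile(InstanceProfileName=ROLE_NAME)
iam.add_role_to_instance_profile(
    InstanceProfileName=ROLE_NAME,
    RoleName=ROLE_NAME,
)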

S3 bucket policy for cross-account access

If your S3 buckets are owned by the same account in which the Databricks cluster is provisioned, you do not need an S3 bucket policy.

If your S3 buckets are in a separate account, then to allow the cluster to access them, you must create an S3 bucket policy that establishes a cross-account trust relationship.

As with the instance profile, if you use separate S3 buckets for the source and destination, you can split the Databricks-provided policy into source and destination versions, as shown in the following examples.

Source S3 bucket policy

This policy limits the instance profile to read-only (Get, List) access to the source S3 bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example source permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<source-s3-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::<source-s3-bucket-name>/*"
    }
  ]
}

Destination S3 bucket policy

This policy grants the instance profile both read and write access to the destination S3 bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example destination permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<destination-s3-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::<destination-s3-bucket-name>/*"
    }
  ]
}
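
Bucket policies are applied in the account that owns the buckets, not in the Databricks account. As a sketch only, assuming the two policies above are saved locally and the bucket names shown are placeholders, you could apply them with boto3 as follows.

import boto3

# Run with credentials for the AWS account that owns the S3 buckets.
s3 = boto3.client("s3")

# Placeholder bucket names mapped to local policy files.
bucket_policies = {
    "my-source-bucket": "source-bucket-policy.json",
    "my-destination-bucket": "destination-bucket-policy.json",
}

for bucket, policy_file in bucket_policies.items():
    with open(policy_file) as f:
        s3.put_bucket_policy(Bucket=bucket, Policy=f.read())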

Alternatives to the instance profile

If you cannot or do not want to configure an instance profile, you can instead grant the cluster direct access to the S3 buckets.

To do this, use your AWS access key ID and AWS secret access key to set the following Spark configuration properties and values.

Property/Key                  Value
fs.s3n.awsAccessKeyId         <AWS Access Key ID>
spark.fs.s3a.access.key       <AWS Access Key ID>
spark.fs.s3a.secret.key       <AWS Secret Access Key>
fs.s3n.awsSecretAccessKey     <AWS Secret Access Key>

You enter the Spark configuration parameters when you set up your cluster.
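
For example, the Spark config section of the cluster definition would contain lines like the following, with one property and value per line, separated by a space. The values here are placeholders for your own keys; Databricks also supports storing the keys in a secret scope and referencing them with the {{secrets/<scope>/<key>}} syntax instead of pasting them in plain text.

fs.s3n.awsAccessKeyId <AWS Access Key ID>
fs.s3n.awsSecretAccessKey <AWS Secret Access Key>
spark.fs.s3a.access.key <AWS Access Key ID>
spark.fs.s3a.secret.key <AWS Secret Access Key>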

When you use Azure Databricks with ADLSv2

Azure provides several options for accessing ADLSv2 from Databricks.

For details, go to Access Azure Data Lake Storage Gen2 and Blob Storage in the Azure Databricks documentation.

For some of the methods, you must set various Spark configuration properties. The Azure Databricks documentation provides Python examples that use spark.conf.set(<property>, <value>).

For Structural, you must instead provide these properties in the cluster configuration. You enter the Spark configuration parameters when you set up your cluster. Several of the methods recommend the use of secrets; to reference a secret from the Spark configuration, follow the instructions in the Databricks documentation.

Structural uses the abfss driver.
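
As an illustration only, one of the documented methods (OAuth 2.0 with a service principal) translates into cluster Spark configuration lines similar to the following. The exact properties depend on the access method that you choose, and <storage-account>, <application-id>, <directory-id>, and the secret scope and key names are placeholders.

fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope>/<service-credential-key>}}
fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token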
