Amazon EMR

Amazon Elastic MapReduce (EMR) is Amazon's managed Spark Cluster.

Tonic Structural uses Amazon EMR to process flat files (such as Parquet, CSV, and Avro) that are stored in Amazon S3.

Structural process overview for Amazon EMR

The following high-level diagram describes how Tonic Structural data generation is processed for Amazon EMR.

For an Amazon EMR workspace, the source data comes from a database in a Glue catalog. The source data is fed to the Structural web server and Structural worker through Amazon Athena.

The Structural worker calls the EMR Steps API to coordinate the data generation job on the Spark cluster.

The destination data is written to an S3 bucket.
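The Steps API interaction described above can be sketched with the boto3 EMR client calls `add_job_flow_steps` and `describe_step`. This is only an illustration of the API shape, not Structural's implementation. The helper names, the step name, and the `spark-submit` arguments are hypothetical. The client is passed in so that the functions work with `boto3.client("emr")` or with a test double.

```python
from typing import Any, Dict, List


def spark_step(name: str, spark_submit_args: List[str]) -> Dict[str, Any]:
    """Build one EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", *spark_submit_args],
        },
    }


def submit_step(emr_client: Any, cluster_id: str, step: Dict[str, Any]) -> str:
    """Submit a step to the cluster and return the new step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


def step_state(emr_client: Any, cluster_id: str, step_id: str) -> str:
    """Return the current state of a step, for example PENDING or COMPLETED."""
    response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
    return response["Step"]["Status"]["State"]
```

These two calls correspond to the `elasticmapreduce:AddJobFlowSteps` and `elasticmapreduce:DescribeStep` permissions that the Structural server role requires.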

System requirements for Amazon EMR

Supported versions of Spark and Amazon EMR

Tonic Structural supports Spark 2.4.x, Spark 3.0.0, Spark 3.0.1, and Spark 3.2.0. However, note that Spark 2.4.2 is not supported.
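The version rules above can be written as a small check. This is a minimal sketch. The function name `is_supported_spark` is hypothetical and is not part of Structural.

```python
# Hypothetical helper that encodes the supported Spark versions listed above.
SUPPORTED_EXACT = {"3.0.0", "3.0.1", "3.2.0"}
UNSUPPORTED_EXACT = {"2.4.2"}


def is_supported_spark(version: str) -> bool:
    """Return True if the Spark version matches the support list above."""
    if version in UNSUPPORTED_EXACT:
        return False
    if version in SUPPORTED_EXACT:
        return True
    # Any other 2.4.x patch release is supported.
    parts = version.split(".")
    return (
        len(parts) == 3
        and parts[0] == "2"
        and parts[1] == "4"
        and parts[2].isdigit()
    )
```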

We suggest using EMR-6.1.0 or EMR-6.2.0 with Spark 3.0.0 or Spark 3.0.1, respectively. Any version between 5.2.8 and 6.0.2 should work.

Supported providers

Structural supports the following data providers:

Source Provider

Output Provider

Parquet

CSV

Avro

JSON

ORC

Metadata catalog

Structural requires a metadata catalog when connecting to your data. Currently only AWS Glue is supported when working with Amazon EMR.

Structural writes data to Amazon S3 only. Structural does not write output data back into a catalog.

Amazon S3 server-side encryption requirements

If your S3 buckets have server-side encryption enabled through AWS KMS, then your Spark cluster must have Hadoop 2.8.1 or later installed.

Structural differences and limitations with Amazon EMR

Required license: Professional or Enterprise

Not available on Structural Cloud.

No workspace inheritance

Amazon EMR workspaces do not support workspace inheritance.

Table mode limitations

You can only assign the De-Identify or Truncate table modes.

For Truncate mode, the table is ignored completely. The table does not exist in the destination database.

Generator limitations

Amazon EMR workspaces cannot use the following generators:

Algebraic
Array Character Scramble
Array JSON Mask
Array Regex Mask
Cross-Table Sum
CSV Mask
Event Timestamps
HTML Mask
JSON Mask
SIN

The following generators are supported, but with restrictions:

Character Scramble is only supported for text columns.
Timestamp Shift is only supported on date column types.
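The generator limits above can be captured in a lookup, for example to review a planned configuration before you apply it. This is a sketch only. The generator names are copied from the lists above, but the function name and the column type labels `"text"` and `"date"` are illustrative assumptions, not Structural identifiers.

```python
# Generators that Amazon EMR workspaces cannot use, from the list above.
UNSUPPORTED_GENERATORS = {
    "Algebraic",
    "Array Character Scramble",
    "Array JSON Mask",
    "Array Regex Mask",
    "Cross-Table Sum",
    "CSV Mask",
    "Event Timestamps",
    "HTML Mask",
    "JSON Mask",
    "SIN",
}

# Generators that are supported only for one kind of column.
# The type labels are assumptions made for this example.
RESTRICTED_GENERATORS = {
    "Character Scramble": "text",
    "Timestamp Shift": "date",
}


def generator_allowed(generator: str, column_type: str) -> bool:
    """Return True if the generator can be used on a column of this type."""
    if generator in UNSUPPORTED_GENERATORS:
        return False
    required_type = RESTRICTED_GENERATORS.get(generator)
    return required_type is None or required_type == column_type
```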

No subsetting, but support for table filtering

Amazon EMR workspaces do not support subsetting.

However, for tables that use the De-Identify table mode, you can provide a WHERE clause to filter the table. For details, go to Using table filtering for data warehouses and Spark-based data connectors.

No upsert

Amazon EMR workspaces do not support upsert.

No output to container artifacts

For Amazon EMR workspaces, you cannot write the destination data to container artifacts.

No output to an Ephemeral snapshot

For Amazon EMR workspaces, you cannot write the destination data to an Ephemeral snapshot.

Limited job logs

The logging of Spark jobs on the job details page is more limited than it is for other data connectors. This is because of how Spark clusters are distributed and managed.

The Jobs view provides information about the job's status as it runs.

After the job starts, it provides a tracking URL. The tracking URL leads to the Spark management portal, where you can find additional, more detailed logs.

Before you create an Amazon EMR workspace

Before you create a workspace that uses the Amazon EMR data connector, you need to complete the following configuration.

Creating IAM roles for Structural and Amazon EMR

Tonic Structural with Amazon EMR uses the following IAM roles:

A role that is used by the Structural server
A role that is used by the Amazon EC2 instance profile for the Amazon EMR cluster

These roles must have the required permissions in order for Structural to run successfully on Amazon EMR data. Each role must be assigned to the appropriate profile.

If the Glue catalog is under a different AWS account from the one used for Structural and Amazon EMR, then use the instructions in Configuration for cross-account setups.

Structural server role for Amazon EMR

Identifying the profile that has the Structural server role

By default, Structural uses the IAM profile that is attached to the instance where Structural runs.

If you do not want to use that IAM profile, then to identify the profile to use:

Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.

For information about how to configure Structural environment settings, go to Configuring environment settings.
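Before you deploy, it can help to confirm that the two settings are provided together. The following is a minimal sketch of such a check. It is not Structural's own logic, and the function name and return labels are hypothetical.

```python
import os
from typing import Mapping, Optional


def credential_source(env: Optional[Mapping[str, str]] = None) -> str:
    """Report which credentials the settings above would select."""
    env = os.environ if env is None else env
    has_key = bool(env.get("TONIC_AWS_ACCESS_KEY_ID"))
    has_secret = bool(env.get("TONIC_AWS_SECRET_ACCESS_KEY"))
    if has_key and has_secret:
        return "environment-settings"
    if has_key or has_secret:
        # Only one of the pair is set, which is almost certainly a mistake.
        raise ValueError(
            "Set both TONIC_AWS_ACCESS_KEY_ID and "
            "TONIC_AWS_SECRET_ACCESS_KEY, or neither."
        )
    # Neither is set, so the IAM profile attached to the instance applies.
    return "instance-profile"
```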

Required permissions for the Structural server role

The Structural server role must have the following permissions:

/*Ability to launch jobs via EMR Steps API and also to view clusters*/
{
    "Sid": "EmrListClustersPerms",
    "Effect": "Allow",
    "Action": "elasticmapreduce:ListClusters",
    "Resource": "*"
},
{
    "Sid": "EmrPerms",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster"
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/<emr-cluster-id>"
    ]
},

/*Ability to query Glue catalog*/
{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/*",
        "arn:aws:glue:<region>:<account-id>:table/*"
    ]
},
{
    "Sid": "S3SourcePermissions",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<source-s3-bucket>",
        "arn:aws:s3:::<source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestPermissions",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},

/*Ability to issue queries against Athena*/
{
    "Sid": "AthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetWorkGroup"
    ],
    "Resource": "arn:aws:athena:<region>:<account-id>:workgroup/tonic-emr-workgroup"
},
{
    "Sid": "AthenaQueryResultPermissions",
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<athena-query-results-bucket>",
        "arn:aws:s3:::<athena-query-results-bucket>/*"
    ]
}
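The statements above are fragments, not a complete policy document. One way to assemble them is sketched below, assuming that you keep the statements as Python dictionaries. The helper name `build_policy` is hypothetical. JSON has no comment syntax, so the `/* ... */` annotations must be left out of the final policy.

```python
import json
from typing import Any, Dict, List


def build_policy(statements: List[Dict[str, Any]]) -> str:
    """Wrap IAM statement fragments in a complete policy document."""
    sids = [statement["Sid"] for statement in statements if "Sid" in statement]
    if len(sids) != len(set(sids)):
        # IAM requires Sid values to be unique within a policy.
        raise ValueError("Sid values must be unique within a policy")
    policy = {"Version": "2012-10-17", "Statement": statements}
    return json.dumps(policy, indent=4)


# Two of the statements above. Replace the <...> placeholders before use.
emr_statements = [
    {
        "Sid": "EmrListClustersPerms",
        "Effect": "Allow",
        "Action": "elasticmapreduce:ListClusters",
        "Resource": "*",
    },
    {
        "Sid": "EmrPerms",
        "Effect": "Allow",
        "Action": [
            "elasticmapreduce:DescribeStep",
            "elasticmapreduce:AddJobFlowSteps",
            "elasticmapreduce:DescribeCluster",
        ],
        "Resource": [
            "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/<emr-cluster-id>"
        ],
    },
]

policy_json = build_policy(emr_statements)
```

The remaining statements can be appended to the list in the same way.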

Amazon EC2 instance profile role

Identifying the Amazon EC2 instance profile

The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.

Required permissions for the Amazon EC2 instance profile role

By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.

However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.

The following permissions reflect the minimum permissions needed for Structural data generation:

{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/*",
        "arn:aws:glue:<region>:<account-id>:table/*"
    ]
},
{
    "Sid": "S3SourceBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<source-s3-bucket>",
        "arn:aws:s3:::<source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>",
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "S3EmrLogBucketPerms",
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": [
        "arn:aws:s3:::<emr-logs-s3-bucket>/*"
    ]
}

For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.

Before you run a Structural data generation, you must either:

Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role

Structural does not otherwise require this permission, and does not explicitly attempt to create a database.

Decrypt and encrypt permissions

You must add decrypt and encrypt permissions to your account on both the source and destination paths.

If you use the Amazon EMR Steps API, then the Amazon EMR role that is assigned to your cluster must have Decrypt access to the AWS KMS key that is used on the output bucket.
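A statement that grants these permissions might look like the following sketch. The `Sid` and the function name are hypothetical. Including `kms:GenerateDataKey` is an assumption based on how Amazon S3 writes objects that are encrypted with AWS KMS, so confirm the exact action list against your own key policy.

```python
from typing import Any, Dict


def kms_statement(key_arn: str) -> Dict[str, Any]:
    """Build an IAM statement that allows use of one AWS KMS key."""
    if not key_arn.startswith("arn:aws:kms:"):
        raise ValueError("Expected an AWS KMS key ARN")
    return {
        "Sid": "KmsEncryptDecryptPerms",  # hypothetical name
        "Effect": "Allow",
        "Action": [
            "kms:Decrypt",
            "kms:Encrypt",
            "kms:GenerateDataKey",  # assumption, see the note above
        ],
        "Resource": [key_arn],
    }
```

You would add the resulting statement to the role that needs access, with the ARN in the form `arn:aws:kms:<region>:<account-id>:key/<key-id>`.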

Creating Athena workgroups

Tonic Structural uses Amazon Athena to run queries against your AWS Glue catalog. These queries mostly support the Structural application interface. Structural also uses them to gather the data that certain generators need during data generation.

To use Athena, you must specify an Amazon S3 location for query results, and also create an Athena workgroup. You can do this from the Athena homepage in the AWS Console.

By default, Structural expects the WorkGroup name to be tonic-emr-workgroup. To override the default value, configure the following environment setting:

If you do override this value, you must override it in both the Structural web server container and on the Amazon EMR cluster.
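As an alternative to the AWS Console, the workgroup can be created with the boto3 Athena client call `create_work_group`. The sketch below assumes that you also want to set the query result location on the workgroup. The helper names are hypothetical, and the client is passed in so that the code works with `boto3.client("athena")` or with a test double.

```python
from typing import Any, Dict

# The workgroup name that Structural expects by default.
DEFAULT_WORKGROUP = "tonic-emr-workgroup"


def workgroup_request(
    results_s3_uri: str, name: str = DEFAULT_WORKGROUP
) -> Dict[str, Any]:
    """Build the keyword arguments for create_work_group."""
    if not results_s3_uri.startswith("s3://"):
        raise ValueError("The query result location must be an s3:// URI")
    return {
        "Name": name,
        "Configuration": {
            "ResultConfiguration": {"OutputLocation": results_s3_uri}
        },
    }


def create_workgroup(
    athena_client: Any, results_s3_uri: str, name: str = DEFAULT_WORKGROUP
) -> None:
    """Create the workgroup through the supplied Athena client."""
    athena_client.create_work_group(**workgroup_request(results_s3_uri, name))
```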

Configuration for cross-account setups

Tonic Structural supports operating on AWS Glue catalogs in AWS accounts different from where Structural and Amazon EMR are configured.

For the instructions in this topic, we'll use the following example:

AWS Account A contains the Amazon EMR Cluster, Athena workgroup, and destination S3 bucket.
AWS Account B contains the AWS Glue data catalog and source S3 bucket.

The following instructions explain how to set up each required AWS component. For cross-account setups, you use these instructions instead of the instructions in Creating IAM roles for Structural and Amazon EMR.

These instructions assume that both accounts reside in the same region. If your accounts belong to different regions, then see the Amazon documentation for instructions on how to set up a VPC for cross-account access.

Granting access to the required resources

The account that has Structural and Amazon EMR must be granted access to the resources in the account that has the AWS Glue catalog.

To continue our example, you must first grant Account A access to Account B's resources. To do this, set up the following resource-based policies for Account B's AWS Glue data catalog and source S3 bucket.

Account B Glue data catalog resource policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<account-A-id>:role/<tonic-role>",
                    "arn:aws:iam::<account-A-id>:role/<emr-ec2-instance-profile-role>"
                ]
            },
            "Action": [
                "glue:GetUserDefinedFunctions",
                "glue:BatchGetPartition",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<account-B-id>:catalog",
                "arn:aws:glue:<region>:<account-B-id>:database/*",
                "arn:aws:glue:<region>:<account-B-id>:table/*"
            ]
        }
    ]
}

Register Account B's AWS Glue data catalog as an Athena data source inside Account A's Athena console. For instructions, see the AWS documentation.

Account B source S3 bucket bucket policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<account-A-id>:role/<tonic-role>",
                    "arn:aws:iam::<account-A-id>:role/<emr-ec2-instance-profile-role>"
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<account-B-source-bucket>",
                "arn:aws:s3:::<account-B-source-bucket>/*"
            ]
        }
    ]
}
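If you manage several environments, you might generate the bucket policy instead of editing the placeholders by hand. The following is a minimal sketch that produces the same policy shape as above. The function name and the sample values in the usage note are hypothetical.

```python
import json
from typing import List


def source_bucket_policy(
    account_a_id: str, role_names: List[str], bucket: str
) -> str:
    """Render the Account B source bucket policy for Account A's roles."""
    principals = [
        f"arn:aws:iam::{account_a_id}:role/{name}" for name in role_names
    ]
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=4)
```

For example, `source_bucket_policy("111122223333", ["tonic-role", "emr-ec2-role"], "example-source-bucket")` returns a policy that lists both roles as principals.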

Account A Amazon EMR cluster

When you create your Amazon EMR cluster, make sure to enable the Use AWS Glue Data Catalog for table metadata option. This allows you to set a default catalog ID that points to Account B.

You must set the following configuration for all instance groups in the Amazon EMR cluster:

[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      "hive.metastore.glue.catalogid": "<account-B-id>"
    }
  }
]
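The same configuration can be generated programmatically, for example when you create clusters from a script. This is a sketch under that assumption. The function name is hypothetical, and the check on the account ID reflects the standard 12-digit AWS account ID format.

```python
import re
from typing import Any, Dict, List


def glue_catalog_configuration(catalog_account_id: str) -> List[Dict[str, Any]]:
    """Build the spark-hive-site classification shown above."""
    if not re.fullmatch(r"\d{12}", catalog_account_id):
        raise ValueError("Expected a 12-digit AWS account ID")
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": (
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
                ),
                "hive.metastore.glue.catalogid": catalog_account_id,
            },
        }
    ]
```

The returned list can be serialized with `json.dumps` for the console, or passed as the `Configurations` argument of the boto3 EMR client call `run_job_flow`.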

Structural server role

Identifying the profile that has the Structural server role

By default, Structural uses the IAM profile that is attached to the instance where Structural runs.

If you do not want to use that IAM profile, then to identify the profile to use:

Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.

For information about how to configure Structural environment settings, go to Configuring environment settings.

Required permissions for the Structural server role

The Structural server role must have the following permissions:

{
    "Sid": "EmrListClustersPerms",
    "Effect": "Allow",
    "Action": "elasticmapreduce:ListClusters",
    "Resource": "*"
},
{
    "Sid": "EmrPerms",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster"
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:<region>:<account-A-id>:cluster/<cluster-id>"
    ]
},
{
    "Sid": "CrossAccountGluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-B-id>:catalog",
        "arn:aws:glue:<region>:<account-B-id>:database/*",
        "arn:aws:glue:<region>:<account-B-id>:table/*"
    ]
},
{
    "Sid": "CrossAccountS3SourcePerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<account-B-source-s3-bucket>",
        "arn:aws:s3:::<account-B-source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestPerms",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "AthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults"
    ],
    "Resource": "arn:aws:athena:<region>:<account-A-id>:workgroup/tonic-emr-workgroup"
},
{
    "Sid": "CrossAccountAthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:GetDataCatalog"
    ],
    "Resource": "arn:aws:athena:<region>:<account-A-id>:datacatalog/<catalog-name>"
},
{
    "Sid": "AthenaQueryResultPerms",
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<athena-query-results-bucket>",
        "arn:aws:s3:::<athena-query-results-bucket>/*"
    ]
}

Amazon EC2 instance profile role

Identifying the profile that has the Amazon EC2 instance role

The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.

Required permissions for the Amazon EC2 instance role

By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.

However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.

The following permissions reflect the minimum permissions needed for Structural data generation:

{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:CreateDatabase",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:UpdatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:CreateUserDefinedFunction",
        "glue:UpdateUserDefinedFunction",
        "glue:DeleteUserDefinedFunction",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
    ],
    "Resource": "*"
},
{
    "Sid": "CrossAccountS3SourceBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<account-B-source-s3-bucket>",
        "arn:aws:s3:::<account-B-source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>",
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "S3EmrLogBucketPerms",
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": [
        "arn:aws:s3:::<s3-emr-log-bucket>/*"
    ]
}

For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.

Before you run a Structural data generation, you must either:

Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role

Structural does not otherwise require this permission, and does not explicitly attempt to create a database.

Configuring Amazon EMR workspace data connections

When you create the workspace, to select Amazon EMR as the data connector:

For Connection Type, choose Spark.
For Cluster Type, choose Amazon EMR.

Connecting to the catalog database

In the Catalog Database section, you provide the details about the source database:

In the Glue Catalog Database field, provide the name of the AWS Glue catalog database that contains the source data.
If the AWS Glue catalog database is in a different AWS account from Tonic Structural and Amazon EMR, then:
1. Toggle Cross Account Access to the on position.
2. In the Glue Catalog Id field, provide the AWS account ID for the account that contains the data catalog.
3. In the Glue Catalog Name field, provide the catalog name of the data source that is attached to Athena.
To test the connection to the catalog, click Test Catalog Connection.

Enabling validation of table filters

For Amazon EMR workspaces, you can provide WHERE clauses to filter tables. For details, go to Using table filtering for data warehouses and Spark-based data connectors.

The Enable partition filter validation toggle indicates whether Structural should validate those filters when you create them.

By default, the toggle is in the on position, and Structural validates the filters. To disable the validation, change the toggle to the off position.

Blocking data generation for all schema changes

By default, data generations are not blocked when schema changes do not conflict with your workspace configuration.

To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.

Identifying the Amazon EMR cluster

Amazon EMR supports launching Spark jobs through the Amazon EMR Steps API. To use the API, Structural needs the Amazon EMR cluster identifier.

Under EMR Cluster, in the EMR Cluster Id field, provide the cluster ID of your Amazon EMR cluster.

You can find the cluster ID on the EMR Clusters console page. The ID always begins with "j-".

To test the connection to the Amazon EMR cluster, click Test EMR Connection.
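To check a cluster ID outside of Structural, you could use the boto3 EMR client call `describe_cluster`, which requires the same `elasticmapreduce:DescribeCluster` permission as the Structural server role. The sketch below is not what the Test EMR Connection button runs. The function names are hypothetical, and the client is passed in so that the code works with `boto3.client("emr")` or with a test double.

```python
from typing import Any


def looks_like_cluster_id(value: str) -> bool:
    """Check the documented 'j-' prefix of an Amazon EMR cluster ID."""
    value = value.strip()
    return value.startswith("j-") and len(value) > 2


def cluster_state(emr_client: Any, cluster_id: str) -> str:
    """Return the cluster state, for example WAITING or RUNNING."""
    if not looks_like_cluster_id(cluster_id):
        raise ValueError("An Amazon EMR cluster ID begins with 'j-'")
    response = emr_client.describe_cluster(ClusterId=cluster_id.strip())
    return response["Cluster"]["Status"]["State"]
```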

Specifying the Amazon S3 location for the destination data

Under Output S3 Location, in the S3 Bucket Path field, provide the path to the location in Amazon S3 where Structural writes the destination data.

By default, Structural writes the output for each data generation to a new folder under the output location, and Create job-specific destination folder is in the on position. To create the folder, Structural appends a GUID to the output location.

To not create a separate folder for each data generation, toggle Create job-specific destination folder to the off position.

To verify that Structural can reach the provided path, click Test S3 Connection.
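The effect of the Create job-specific destination folder toggle can be illustrated as follows. This is a sketch only. The exact path format that Structural produces is not specified here, so the `/<GUID>` layout and the function name are assumptions.

```python
import uuid


def destination_path(output_location: str, job_specific_folder: bool = True) -> str:
    """Return the folder that a data generation would write to."""
    base = output_location.rstrip("/")
    if not job_specific_folder:
        # Every data generation writes to the same location.
        return base
    # A new GUID per data generation keeps each job's output separate.
    return f"{base}/{uuid.uuid4()}"
```

With the toggle off, repeated data generations target the same folder, which is why the overwrite settings matter in that case.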

If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Structural Settings.

TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.
TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. When this is set to a value other than null, which is the default, then this setting takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite.
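The precedence between the two settings can be summarized in a few lines. This is a sketch of the rule as described above, not Structural's code. The function name is hypothetical, and mapping the boolean setting onto the `ErrorIfExists` and `Overwrite` labels is an assumption made for illustration.

```python
from typing import Optional

SAVE_MODES = {"Append", "ErrorIfExists", "Ignore", "Overwrite"}


def effective_save_mode(
    error_on_override: bool = True, default_save_mode: Optional[str] = None
) -> str:
    """Resolve which write behavior applies to a non-job-specific folder."""
    if default_save_mode is not None:
        if default_save_mode not in SAVE_MODES:
            raise ValueError(f"Unknown save mode: {default_save_mode}")
        # TONIC_WORKSPACE_DEFAULT_SAVE_MODE takes precedence when it is set.
        return default_save_mode
    # Otherwise TONIC_WORKSPACE_ERROR_ON_OVERRIDE decides.
    return "ErrorIfExists" if error_on_override else "Overwrite"
```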

Providing Spark configuration variable values

The Spark Configuration section lists the Spark configuration variables that Structural requires to be set.

Configuration for cross-account setups

Tonic Structural supports operating on AWS Glue catalogs in AWS accounts different from where Structural and Amazon EMR are configured.

For the instructions in this topic, we'll use the following example:

AWS Account A contains the Amazon EMR Cluster, Athena workgroup, and destination S3 bucket.
AWS Account B contains the AWS Glue data catalog and source S3 bucket.

Granting access to the required resources

The account that has Structural and Amazon EMR must be granted access to the resources in the account that has the AWS Glue catalog.

Account B Glue data catalog resource policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<account-A-id>:role/<tonic-role>",
                    "arn:aws:iam::<account-A-id>:role/<emr-ec2-instance-profile-role>"
                ]
            },
            "Action": [
                "glue:GetUserDefinedFunctions",
                "glue:BatchGetPartition",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<account-B-id>:catalog",
                "arn:aws:glue:<region>:<account-B-id>:database/*",
                "arn:aws:glue:<region>:<account-B-id>:table/*"
            ]
        }
    ]
}

Register Account B's AWS Glue data catalog as an Athena data source inside Account A's Athena console. For instructions, see the AWS documentation.

Account B source S3 bucket bucket policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<account-A-id>:role/<tonic-role>",
                    "arn:aws:iam::<account-A-id>:role/<emr-ec2-instance-profile-role>"
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<account-B-source-s3-bucket>",
                "arn:aws:s3:::<account-B-source-s3-bucket>/*"
            ]
        }
    ]
}

Account A Amazon EMR cluster

When you create your Amazon EMR cluster, make sure to enable the Use AWS Glue Data Catalog for table metadata option. This allows you to set a default catalog ID that points to Account B.

You must set the following configuration for all instance groups in the Amazon EMR cluster:

[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      "hive.metastore.glue.catalogid": "<account-B-id>"
    }
  }
]

Structural server role

Identifying the profile that has the Structural server role

By default, Structural uses the IAM profile that is attached to the instance where Structural runs.

If you do not want to use that IAM profile, then identify the profile to use as follows:

Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.
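The two settings above are meant to be set together. As an illustration only, the following sketch checks an environment before startup. The helper `credential_source` is hypothetical and is not part of Structural. It also assumes that setting only one of the two values is a configuration mistake, which this topic does not state explicitly.

```python
import os

ACCESS_KEY_VAR = "TONIC_AWS_ACCESS_KEY_ID"
SECRET_KEY_VAR = "TONIC_AWS_SECRET_ACCESS_KEY"


def credential_source(env=None):
    """Report which credentials the environment points to.

    Returns "access-key" when both settings are present and
    "instance-profile" when neither is present (the default described above).
    Raises ValueError when only one of the two is set (an assumption of
    this sketch, not documented Structural behavior).
    """
    env = os.environ if env is None else env
    has_id = bool(env.get(ACCESS_KEY_VAR))
    has_secret = bool(env.get(SECRET_KEY_VAR))
    if has_id != has_secret:
        raise ValueError(
            f"Set both {ACCESS_KEY_VAR} and {SECRET_KEY_VAR}, or neither."
        )
    return "access-key" if has_id else "instance-profile"


print(credential_source({}))
```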

For information about how to configure Structural environment settings, go to Configuring environment settings.

Required permissions for the Structural server role

The Structural server role must have the following permissions:

{
    "Sid": "EmrListClustersPerms",
    "Effect": "Allow",
    "Action": "elasticmapreduce:ListClusters",
    "Resource": "*"
},
{
    "Sid": "EmrPerms",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster"
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:<region>:<account-A-id>:cluster/<cluster-id>"
    ]
},
{
    "Sid": "CrossAccountGluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-B-id>:catalog",
        "arn:aws:glue:<region>:<account-B-id>:database/*",
        "arn:aws:glue:<region>:<account-B-id>:table/*"
    ]
},
{
    "Sid": "CrossAccountS3SourcePerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<account-B-source-s3-bucket>",
        "arn:aws:s3:::<account-B-source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestPerms",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "AthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults"
    ],
    "Resource": "arn:aws:athena:<region>:<account-A-id>:workgroup/tonic-emr-workgroup"
},
{
    "Sid": "CrossAccountAthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:GetDataCatalog"
    ],
    "Resource": "arn:aws:athena:<region>:<account-A-id>:datacatalog/<catalog-name>"
},
{
    "Sid": "AthenaQueryResultPerms",
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<athena-query-results-bucket>",
        "arn:aws:s3:::<athena-query-results-bucket>/*"
    ]
}
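The statements above are listed as comma-separated objects, so they are not a complete policy document on their own. The following sketch shows one way to wrap them in the standard `Version` and `Statement` envelope. The function names are hypothetical, and the fragment in the example is shortened to two of the statements above. Because IAM expects `Sid` values to be unique within a policy, the sketch also rejects duplicates.

```python
import json


def statements_from_fragment(fragment):
    """Parse comma-separated statement objects into a list of dicts."""
    return json.loads(f"[{fragment}]")


def build_policy(statements):
    """Wrap statement objects in a complete IAM policy document."""
    sids = [s["Sid"] for s in statements if "Sid" in s]
    duplicates = sorted({sid for sid in sids if sids.count(sid) > 1})
    if duplicates:
        raise ValueError(f"Duplicate Sid values: {duplicates}")
    return {"Version": "2012-10-17", "Statement": list(statements)}


# Two of the statements from above, still with their placeholders.
fragment = """
{
    "Sid": "EmrListClustersPerms",
    "Effect": "Allow",
    "Action": "elasticmapreduce:ListClusters",
    "Resource": "*"
},
{
    "Sid": "S3DestPerms",
    "Effect": "Allow",
    "Action": ["s3:PutObject"],
    "Resource": ["arn:aws:s3:::<destination-s3-bucket>/*"]
}
"""

policy = build_policy(statements_from_fragment(fragment))
print(json.dumps(policy, indent=4))
```

Replace the placeholders before you attach the resulting document to the role.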

Amazon EC2 instance profile role

Identifying the profile that has the Amazon EC2 instance role

The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.

Required permissions for the Amazon EC2 instance role

By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.

However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.

The following permissions reflect the minimum permissions needed for Structural data generation:

{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:CreateDatabase",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:UpdatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:CreateUserDefinedFunction",
        "glue:UpdateUserDefinedFunction",
        "glue:DeleteUserDefinedFunction",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
    ],
    "Resource": "*"
},
{
    "Sid": "CrossAccountS3SourceBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<account-B-source-s3-bucket>",
        "arn:aws:s3:::<account-B-source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>",
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "S3EmrLogBucketPerms",
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": [
        "arn:aws:s3:::<s3-emr-log-bucket>/*"
    ]
}

For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.

Before you run a Structural data generation, you must either:

Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role

Structural does not otherwise require this permission, and does not explicitly attempt to create a database.
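To confirm that the default database exists before a run, you could use a check like the following. This is a sketch under assumptions: the helper name `default_database_exists` is hypothetical, and it expects a client object that behaves like the boto3 Glue client, with a `get_database` call and an `EntityNotFoundException` error. Whether you need to pass a catalog ID depends on which catalog your cluster uses.

```python
def default_database_exists(glue_client, catalog_id=None):
    """Return True if the Glue catalog contains a database named "default".

    glue_client is expected to behave like boto3.client("glue").
    catalog_id is optional. When omitted, the client's own account
    catalog is queried.
    """
    kwargs = {"Name": "default"}
    if catalog_id:
        kwargs["CatalogId"] = catalog_id
    try:
        glue_client.get_database(**kwargs)
    except glue_client.exceptions.EntityNotFoundException:
        return False
    return True


# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# print(default_database_exists(boto3.client("glue"), catalog_id="<account-B-id>"))
```

If the check returns False, either create the database yourself or make sure that the role has glue:CreateDatabase.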