Tonic Structural with Amazon EMR uses the following IAM roles:
A role that is used by the Structural server
A role that is used by the Amazon EC2 instance profile for the Amazon EMR cluster
These roles must have the required permissions in order for Structural to run successfully on Amazon EMR data. Each role must be assigned to the appropriate profile.
If the Glue catalog is under a different AWS account from the one used for Structural and Amazon EMR, then use the instructions in Configuration for cross-account setups.
By default, Structural uses the IAM profile that is attached to the instance where Structural runs.
If you do not want to use that IAM profile, then to identify the profile to use:
Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.
For information about how to configure Structural environment settings, go to Configuring environment settings.
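As a rough pre-flight check, the following sketch reports which credential source the two settings above would select. It is an illustration only, not Structural's own logic: the function name `credential_source` and the choice to treat a half-configured pair as an error are assumptions.

```python
import os
from typing import Mapping, Optional

ACCESS_KEY_SETTING = "TONIC_AWS_ACCESS_KEY_ID"
SECRET_KEY_SETTING = "TONIC_AWS_SECRET_ACCESS_KEY"


def credential_source(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "access-key" if both settings are present, else "instance-profile".

    Hypothetical helper: raising when only one of the two settings is present
    is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    has_key = bool(env.get(ACCESS_KEY_SETTING))
    has_secret = bool(env.get(SECRET_KEY_SETTING))
    if has_key and has_secret:
        return "access-key"
    if has_key or has_secret:
        missing = SECRET_KEY_SETTING if has_key else ACCESS_KEY_SETTING
        raise ValueError(f"{missing} is not set; set both settings or neither")
    # Neither setting is present, so the attached IAM profile applies.
    return "instance-profile"
```

Calling `credential_source()` with no argument inspects the current process environment, which makes it easy to run inside the container where Structural is deployed.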
The Structural server role must have the following permissions:
The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.
By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.
However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile.
The following permissions reflect the minimum permissions needed for Structural data generation:
For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.
Before you run a Structural data generation, you must either:
Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role
Structural does not otherwise require this permission, and does not explicitly attempt to create a database.
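If you choose to grant the permission, a policy statement along the following lines could be added to the role. This is a minimal sketch: the function name is hypothetical, and scoping the statement to the catalog and the default database (rather than to all resources) is an assumption that you should check against your own security requirements.

```python
from typing import Dict


def create_database_statement(region: str, account_id: str) -> Dict:
    """Build an IAM policy statement that allows glue:CreateDatabase.

    The resource list uses the standard AWS Glue ARN formats for the catalog
    and for a database named "default". The region and account ID are
    supplied by the caller; any values shown in examples are placeholders.
    """
    return {
        "Effect": "Allow",
        "Action": ["glue:CreateDatabase"],
        "Resource": [
            f"arn:aws:glue:{region}:{account_id}:catalog",
            f"arn:aws:glue:{region}:{account_id}:database/default",
        ],
    }
```

The returned dictionary can be serialized with `json.dumps` and placed in the `Statement` array of the role's policy document.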
You must add decrypt and encrypt permissions to your account on both the source and destination paths.
If you use the Amazon EMR Steps API, then the Amazon EMR role that is assigned to your cluster must have Decrypt access to the AWS KMS key that is used on the output bucket.
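The decrypt and encrypt permissions described above can be expressed as an IAM policy statement on the AWS KMS key. The sketch below is illustrative only: the function name and the key ARN passed to it are placeholders, and the action list contains only the two permissions named here. Depending on how your buckets are encrypted, AWS may require additional KMS actions that this sketch does not include.

```python
from typing import Dict


def kms_statement(key_arn: str) -> Dict:
    """Build an IAM policy statement granting decrypt and encrypt on one key.

    key_arn is the full ARN of the AWS KMS key that protects the source or
    destination path. It is supplied by the caller and is not validated.
    """
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:Encrypt"],
        "Resource": [key_arn],
    }
```

If the source and destination paths use different keys, call the function once per key and add both statements to the policy.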
Before you create a workspace that uses the Amazon EMR data connector, you need to complete the following configuration.
Create IAM roles
Create roles for the Tonic Structural server and the EC2 instance profile.
Create Athena workgroups
Enable Amazon Athena to run queries against your Glue catalog.
Configure cross-account setups
Enable the Glue catalog to be in a different AWS account from Structural and Amazon EMR.
Tonic Structural uses Amazon Athena to run queries against your AWS Glue catalog. Most of these queries power the front-end experience. Structural also uses them to gather data that certain generators need during a data generation.
To use Athena, you must set up the required permissions and create an Athena workgroup. You can do this from the Athena homepage in the AWS Console.
By default, Structural expects the workgroup name to be tonic-emr-workgroup. To override the default value, configure the following environment setting:
If you do override this value, you must override it in both the Structural web server container and on the Amazon EMR cluster.
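Because the override has to agree in two places, a small consistency check can catch a mismatch before a data generation runs. The sketch below assumes that you have already read the override value from each location; the function names are hypothetical, and treating an empty or missing override as "use the default" is an assumption.

```python
from typing import Optional

DEFAULT_WORKGROUP = "tonic-emr-workgroup"


def effective_workgroup(override: Optional[str]) -> str:
    """Return the override if one is set, otherwise the default name."""
    name = (override or "").strip()
    return name or DEFAULT_WORKGROUP


def workgroups_match(web_server_override: Optional[str],
                     emr_cluster_override: Optional[str]) -> bool:
    """True when both locations resolve to the same workgroup name."""
    return (effective_workgroup(web_server_override)
            == effective_workgroup(emr_cluster_override))
```

For example, `workgroups_match("custom-group", None)` returns `False`, because the cluster would still use the default name while the web server uses the override.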
Tonic Structural can work with an AWS Glue catalog that is in a different AWS account from the one where Structural and Amazon EMR are configured.
For the instructions in this topic, we'll use the following example:
AWS Account A contains the Amazon EMR cluster, Athena workgroup, and destination S3 bucket.
AWS Account B contains the AWS Glue data catalog and source S3 bucket.
The following instructions explain how to set up each required AWS component. For cross-account setups, you use these instructions instead of the instructions in Creating IAM roles for Structural and Amazon EMR.
These instructions assume that both accounts reside in the same region. If your accounts belong to different regions, then see the Amazon documentation for instructions on how to set up a VPC for cross-account access.
The account that has Structural and Amazon EMR must be granted access to the resources in the account that has the AWS Glue catalog.
To continue our example, you must first grant Account A access to Account B's resources. To do this, set up the following resource-based policies for Account B's AWS Glue data catalog and source S3 bucket.
Register Account B's AWS Glue data catalog as an Athena data source inside Account A's Athena console. For instructions, see the AWS documentation.
When you create your Amazon EMR cluster, make sure to enable the Use AWS Glue Data Catalog for table metadata option. This allows you to set a default catalog ID that points to Account B.
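For reference, AWS documents the property `hive.metastore.glue.catalogid` as the way to point a cluster at a Data Catalog in another account. The sketch below builds an Amazon EMR configuration list in that shape. The function name, the choice of the `hive-site` and `spark-hive-site` classifications, and the example account ID are assumptions for illustration, not the configuration that Structural specifies.

```python
from typing import Dict, List


def glue_catalog_configurations(catalog_account_id: str) -> List[Dict]:
    """Build EMR configuration objects that set the Glue catalog ID.

    catalog_account_id is the AWS account ID that owns the Glue Data Catalog
    (Account B in the running example). Each classification gets its own copy
    of the properties so that later edits to one do not affect the other.
    """
    props = {"hive.metastore.glue.catalogid": catalog_account_id}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]
```

Serializing the result with `json.dumps` produces JSON in the form that the Amazon EMR console and CLI accept for software configurations.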
You must set the following configuration for all instance groups in the Amazon EMR cluster:
By default, Structural uses the IAM profile that is attached to the instance where Structural runs.
If you do not want to use that IAM profile, then to identify the profile to use:
Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.
For information about how to configure Structural environment settings, go to Configuring environment settings.
The Structural server role must have the following permissions:
The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.
By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.
However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile.
The following permissions reflect the minimum permissions needed for Structural data generation: