Creating IAM roles for Structural and Amazon EMR

Tonic Structural with Amazon EMR uses the following IAM roles:

  • A role that is used by the Structural server

  • A role that is used by the Amazon EC2 instance profile for the Amazon EMR cluster

These roles must have the required permissions in order for Structural to run successfully on Amazon EMR data. Each role must be assigned to the appropriate profile.

If the Glue catalog is under a different AWS account from the one used for Structural and Amazon EMR, then use the instructions in Configuration for cross-account setups.

Structural server role for Amazon EMR

Identifying the profile that has the Structural server role

By default, Structural uses the IAM profile that is attached to the instance where Structural runs.

If you do not want to use that IAM profile, then to identify the profile to use:

  1. Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.

  2. Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.

For information about how to configure Structural environment settings, go to Configuring environment settings.
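
The choice between the default instance profile and an explicit access key can be sketched as a small check. This is a minimal illustration, not Structural's actual logic: the helper name `resolve_credential_source` and its return values are hypothetical, and only the two setting names come from the steps above.

```python
import os


def resolve_credential_source(env=None):
    """Report which credential source the environment settings imply.

    Hypothetical helper for illustration only; it is not part of Structural.
    """
    env = os.environ if env is None else env
    key_id = env.get("TONIC_AWS_ACCESS_KEY_ID")
    secret = env.get("TONIC_AWS_SECRET_ACCESS_KEY")

    if key_id and secret:
        # Both settings are present: the access key identifies the profile.
        return "access-key"
    if key_id or secret:
        # Only one of the pair is set, which is likely a configuration mistake.
        raise ValueError(
            "Set both TONIC_AWS_ACCESS_KEY_ID and "
            "TONIC_AWS_SECRET_ACCESS_KEY, or neither."
        )
    # Neither is set: fall back to the profile attached to the instance.
    return "instance-profile"
```

Passing a dictionary instead of the real environment makes the check easy to run before a deployment, for example `resolve_credential_source({})`, which reports the instance profile.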

Required permissions for the Structural server role

The Structural server role must have the following permissions:

/*Ability to launch jobs via EMR Steps API and also to view clusters*/
{
    "Sid": "EmrListClustersPerms",
    "Effect": "Allow",
    "Action": "elasticmapreduce:ListClusters",
    "Resource": "*"
},
{
    "Sid": "EmrPerms",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster"
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/<emr-cluster-id>"
    ]
},

/*Ability to query Glue catalog*/
{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/*",
        "arn:aws:glue:<region>:<account-id>:table/*"
    ]
},
{
    "Sid": "S3SourcePermissions",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<source-s3-bucket>",
        "arn:aws:s3:::<source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestPermissions",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},

/*Ability to issue queries against Athena*/
{
    "Sid": "AthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetWorkGroup"
    ],
    "Resource": "arn:aws:athena:<region>:<account-id>:workgroup/tonic-emr-workgroup"
},
{
    "Sid": "AthenaQueryResultPermissions",
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<athena-query-results-bucket>",
        "arn:aws:s3:::<athena-query-results-bucket>/*"
    ]
}
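
The statements above are fragments. To attach them to a role, they need to be wrapped in a policy document, the `/* ... */` comments need to be dropped (JSON does not allow comments), and the `<...>` placeholders need real values. The following is a minimal sketch of that assembly. The function name `build_policy` and the example values (Region, account ID, cluster ID) are assumptions for illustration.

```python
import json
import re


def build_policy(statements, values):
    """Wrap statement dicts in an IAM policy document and fill placeholders.

    `values` maps a placeholder name without angle brackets, such as
    "region", to its replacement. Values are assumed to be plain
    identifiers that need no JSON escaping.
    """
    text = json.dumps({"Version": "2012-10-17", "Statement": statements})
    for name, value in values.items():
        text = text.replace("<" + name + ">", value)

    # Refuse to return a policy that still contains <placeholder> tokens.
    leftover = sorted(set(re.findall(r"<[a-z0-9-]+>", text)))
    if leftover:
        raise ValueError("Unfilled placeholders: " + ", ".join(leftover))
    return json.loads(text)


# One statement copied from the list above, as a Python dict.
emr_statement = {
    "Sid": "EmrPerms",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster",
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/<emr-cluster-id>"
    ],
}

# Example values only; substitute your own.
policy = build_policy(
    [emr_statement],
    {
        "region": "us-east-1",
        "account-id": "123456789012",
        "emr-cluster-id": "j-EXAMPLE12345",
    },
)
print(json.dumps(policy, indent=4))
```

The same function works for the other statements. Failing on leftover placeholders is a deliberate choice, so that a partly filled policy is never attached to a role.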

Amazon EC2 instance profile role

Identifying the Amazon EC2 instance profile

The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.

Required permissions for the Amazon EC2 instance profile role

By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.

However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.

The following permissions reflect the minimum permissions needed for Structural data generation:

{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/*",
        "arn:aws:glue:<region>:<account-id>:table/*"
    ]
},
{
    "Sid": "S3SourceBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<source-s3-bucket>",
        "arn:aws:s3:::<source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>",
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "S3EmrLogBucketPerms",
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": [
        "arn:aws:s3:::<emr-logs-s3-bucket>/*"
    ]
}

For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.

Before you run a Structural data generation, you must either:

  • Ensure that the default database exists

  • Add glue:CreateDatabase to the list of permissions that are granted to this role

Structural does not otherwise require this permission, and does not explicitly attempt to create a database.
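
If you choose to grant the permission, one way to do it is to extend the existing Glue statement rather than add a new one. The sketch below assumes that the statements are held as Python dicts and that the Glue statement keeps the `GluePerms` Sid shown above. The helper name `with_create_database` is hypothetical.

```python
import copy


def with_create_database(statements, sid="GluePerms"):
    """Return a copy of `statements` with glue:CreateDatabase added.

    The action is appended to the statement whose Sid matches `sid`.
    The input list is left unchanged.
    """
    updated = copy.deepcopy(statements)
    for statement in updated:
        if statement.get("Sid") == sid:
            actions = statement["Action"]
            # IAM allows Action to be a single string or a list.
            if isinstance(actions, str):
                actions = [actions]
            if "glue:CreateDatabase" not in actions:
                actions.append("glue:CreateDatabase")
            statement["Action"] = actions
            return updated
    raise KeyError("No statement with Sid " + repr(sid))


# Shortened version of the Glue statement, for illustration.
glue_statements = [
    {
        "Sid": "GluePerms",
        "Effect": "Allow",
        "Action": ["glue:GetDatabase", "glue:GetTable"],
        "Resource": [
            "arn:aws:glue:<region>:<account-id>:catalog",
            "arn:aws:glue:<region>:<account-id>:database/*",
        ],
    }
]
extended = with_create_database(glue_statements)
print(extended[0]["Action"])
```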

Decrypt and encrypt permissions

You must add decrypt and encrypt permissions to your account on both the source and destination paths.

If you use the Amazon EMR Steps API, then the Amazon EMR role that is assigned to your cluster must have Decrypt access to the AWS KMS key that is used on the output bucket.
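
As a rough sketch, the decrypt and encrypt permissions could be expressed as one more policy statement in the same shape as the statements earlier on this page. The Sid, the helper name `kms_statement`, and the example key ARN are assumptions and are not taken from the Structural documentation. Writes to a bucket that uses SSE-KMS generally also call `kms:GenerateDataKey`, so the helper accepts extra actions. Check the exact set against your own key policy.

```python
import json


def kms_statement(key_arn, extra_actions=(), sid="KmsPerms"):
    """Build an Allow statement for decrypt and encrypt on one KMS key.

    Hypothetical helper; `sid` and `extra_actions` are illustrative.
    """
    actions = ["kms:Decrypt", "kms:Encrypt"]
    for action in extra_actions:
        if action not in actions:
            actions.append(action)
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Action": actions,
        "Resource": [key_arn],
    }


# Example key ARN only; substitute the key used on your bucket.
statement = kms_statement(
    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
    extra_actions=["kms:GenerateDataKey"],
)
print(json.dumps(statement, indent=4))
```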
