Configuration for cross-account setups

Tonic Structural can work with AWS Glue catalogs that are in a different AWS account from the one where Structural and Amazon EMR are configured.

For the instructions in this topic, we'll use the following example:

AWS Account A contains the Amazon EMR Cluster, Athena workgroup, and destination S3 bucket.
AWS Account B contains the AWS Glue data catalog and source S3 bucket.

The following instructions explain how to set up each required AWS component. For cross-account setups, you use these instructions instead of the instructions in Creating IAM roles for Structural and Amazon EMR.

These instructions assume that both accounts are in the same Region. If your accounts are in different Regions, then go to the AWS documentation for instructions on how to set up a VPC for cross-account access.

Granting access to the required resources

The account that has Structural and Amazon EMR must be granted access to the resources in the account that has the AWS Glue catalog.

To continue our example, you must first grant Account A access to Account B's resources. To do this, set up the following resource-based policies for Account B's AWS Glue data catalog and source S3 bucket.

Account B Glue data catalog resource policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<account-A-id>:role/<tonic-role>",
                    "arn:aws:iam::<account-A-id>:role/<emr-ec2-instance-profile-role>"
                ]
            },
            "Action": [
                "glue:GetUserDefinedFunctions",
                "glue:BatchGetPartition",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<account-B-id>:catalog",
                "arn:aws:glue:<region>:<account-B-id>:database/*",
                "arn:aws:glue:<region>:<account-B-id>:table/*"
            ]
        }
    ]
}
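
Filling in the placeholders by hand is error-prone, so it can help to render the policy from a few inputs. The following is a minimal Python sketch that uses only the standard library. The function name `glue_catalog_policy` and its parameters are hypothetical, and the sample Region, account IDs, and role names are made up. The action list is copied from the policy above.

```python
import json

# Read-only AWS Glue actions, copied from the resource policy above.
GLUE_READ_ACTIONS = [
    "glue:GetUserDefinedFunctions",
    "glue:BatchGetPartition",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetTableVersion",
    "glue:GetTableVersions",
]


def glue_catalog_policy(region, account_a_id, account_b_id, role_names):
    """Render the Account B Glue resource policy for the given Account A roles."""
    glue_arn = f"arn:aws:glue:{region}:{account_b_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        f"arn:aws:iam::{account_a_id}:role/{name}"
                        for name in role_names
                    ]
                },
                "Action": GLUE_READ_ACTIONS,
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/*",
                    f"{glue_arn}:table/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Sample values only; substitute your own Region, account IDs, and roles.
    policy = glue_catalog_policy(
        "us-east-1", "111111111111", "222222222222",
        ["tonic-role", "emr-ec2-instance-profile-role"],
    )
    print(json.dumps(policy, indent=4))
```

The printed JSON can then be applied as the catalog's resource policy in Account B, for example in the AWS Glue console or with `aws glue put-resource-policy --policy-in-json`.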

Register Account B's AWS Glue data catalog as an Athena data source in Account A's Athena console. For instructions, go to the AWS documentation.

Account B source S3 bucket policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<account-A-id>:role/<tonic-role>",
                    "arn:aws:iam::<account-A-id>:role/<emr-ec2-instance-profile-role>"
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<account-B-source-s3-bucket>",
                "arn:aws:s3:::<account-B-source-s3-bucket>/*"
            ]
        }
    ]
}
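
The policy lists two resource ARNs because s3:ListBucket is evaluated against the bucket itself, while s3:GetObject is evaluated against the objects in it. As a rough sanity check before you apply the policy, a sketch like the following can flag a missing ARN. The helper name `missing_s3_resources` is hypothetical, and the check is deliberately simple: it compares exact strings and does not interpret wildcards or prefix-scoped resources.

```python
def _as_list(value):
    """IAM policy fields may be a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def missing_s3_resources(statement, bucket):
    """Return the ARNs that the statement's actions need but do not list.

    Exact string comparison only; this is a simplification, not a full
    IAM policy evaluation.
    """
    actions = _as_list(statement["Action"])
    resources = _as_list(statement["Resource"])

    required = []
    if "s3:ListBucket" in actions:
        # Bucket-level action: needs the bucket ARN.
        required.append(f"arn:aws:s3:::{bucket}")
    if "s3:GetObject" in actions:
        # Object-level action: needs the object ARN pattern.
        required.append(f"arn:aws:s3:::{bucket}/*")
    return [arn for arn in required if arn not in resources]


if __name__ == "__main__":
    # Hypothetical statement that forgot the bucket-level ARN.
    statement = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": ["arn:aws:s3:::example-source-bucket/*"],
    }
    print(missing_s3_resources(statement, "example-source-bucket"))
```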

Account A Amazon EMR cluster

When you create your Amazon EMR cluster, make sure to enable the Use AWS Glue Data Catalog for table metadata option. This allows you to set a default catalog ID that points to Account B.

You must set the following configuration for all instance groups in the Amazon EMR cluster:

[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      "hive.metastore.glue.catalogid": "<account-B-id>"
    }
  }
]
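
If you create clusters from a script, the same configuration can be generated instead of edited by hand. The following is a minimal sketch. The function name `spark_hive_site_config` is hypothetical, and the 12-digit check is only a guard against leaving the `<account-B-id>` placeholder in place.

```python
import json
import re

GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def spark_hive_site_config(catalog_account_id):
    """Build the spark-hive-site classification that points at Account B."""
    # AWS account IDs are 12-digit numbers, so an unreplaced placeholder fails here.
    if not re.fullmatch(r"\d{12}", catalog_account_id):
        raise ValueError("expected a 12-digit AWS account ID")
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY,
                "hive.metastore.glue.catalogid": catalog_account_id,
            },
        }
    ]


if __name__ == "__main__":
    # Sample account ID only.
    print(json.dumps(spark_hive_site_config("222222222222"), indent=2))
```

One way to use the output is to save it to a file and pass that file to `aws emr create-cluster` through the `--configurations` option.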

Structural server role

Identifying the profile that has the Structural server role

By default, Structural uses the IAM profile that is attached to the instance where Structural runs.

To use a different IAM profile, provide its credentials in the following environment settings:

Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.

For information about how to configure Structural environment settings, go to Configuring environment settings.
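
Because the two settings only make sense as a pair, a small pre-flight check can catch a half-configured environment before you start Structural. The following sketch is an illustration only and does not describe how Structural itself resolves credentials. The function name `credential_mode` and its return values are hypothetical, and treating a lone key as an error is an assumption.

```python
import os


def credential_mode(env=None):
    """Report which credentials the environment settings point to.

    Returns "explicit-keys" when both settings are present and
    "instance-profile" when neither is. Raises ValueError when only one
    is set (assumption: a lone key is a configuration mistake).
    """
    env = os.environ if env is None else env
    key_id = env.get("TONIC_AWS_ACCESS_KEY_ID")
    secret = env.get("TONIC_AWS_SECRET_ACCESS_KEY")

    if key_id and secret:
        return "explicit-keys"
    if not key_id and not secret:
        return "instance-profile"
    raise ValueError(
        "Set both TONIC_AWS_ACCESS_KEY_ID and TONIC_AWS_SECRET_ACCESS_KEY, or neither"
    )


if __name__ == "__main__":
    try:
        print(credential_mode())
    except ValueError as error:
        print(f"Configuration problem: {error}")
```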

Required permissions for the Structural server role

The Structural server role must have the following permissions:

{
    "Sid": "EmrListClustersPerms",
    "Effect": "Allow",
    "Action": "elasticmapreduce:ListClusters",
    "Resource": "*"
},
{
    "Sid": "EmrPerms",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster"
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:<region>:<account-A-id>:cluster/<cluster-id>"
    ]
},
{
    "Sid": "CrossAccountGluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<account-B-id>:catalog",
        "arn:aws:glue:<region>:<account-B-id>:database/*",
        "arn:aws:glue:<region>:<account-B-id>:table/*"
    ]
},
{
    "Sid": "CrossAccountS3SourcePerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<account-B-source-s3-bucket>",
        "arn:aws:s3:::<account-B-source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestPerms",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "AthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults"
    ],
    "Resource": "arn:aws:athena:<region>:<account-A-id>:workgroup/tonic-emr-workgroup"
},
{
    "Sid": "CrossAccountAthenaPerms",
    "Effect": "Allow",
    "Action": [
        "athena:GetDataCatalog"
    ],
    "Resource": "arn:aws:athena:<region>:<account-A-id>:datacatalog/<catalog-name>"
},
{
    "Sid": "AthenaQueryResultPerms",
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<athena-query-results-bucket>",
        "arn:aws:s3:::<athena-query-results-bucket>/*"
    ]
}
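
The statements above are fragments. Before you attach them to a role, they need to be wrapped in a policy document that has a `Version` and a `Statement` list. The following is a minimal sketch of that wrapping step. The function name `build_policy_document` is hypothetical, and the two sample statements are abbreviated copies of statements from the list above. The duplicate check reflects the IAM rule that `Sid` values must be unique within a policy.

```python
import json


def build_policy_document(statements):
    """Wrap statement fragments in a complete IAM policy document."""
    sids = [s["Sid"] for s in statements if "Sid" in s]
    duplicates = sorted({sid for sid in sids if sids.count(sid) > 1})
    if duplicates:
        raise ValueError(f"duplicate Sid values: {duplicates}")
    return {"Version": "2012-10-17", "Statement": list(statements)}


if __name__ == "__main__":
    # Two of the statements from the list above, for illustration.
    statements = [
        {
            "Sid": "EmrListClustersPerms",
            "Effect": "Allow",
            "Action": "elasticmapreduce:ListClusters",
            "Resource": "*",
        },
        {
            "Sid": "S3DestPerms",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::<destination-s3-bucket>/*"],
        },
    ]
    print(json.dumps(build_policy_document(statements), indent=4))
```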

Amazon EC2 instance profile role

Identifying the profile that has the Amazon EC2 instance role

The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.

Required permissions for the Amazon EC2 instance role

By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.

However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.

The following permissions reflect the minimum permissions needed for Structural data generation:

{
    "Sid": "GluePerms",
    "Effect": "Allow",
    "Action": [
        "glue:CreateDatabase",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:UpdatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:CreateUserDefinedFunction",
        "glue:UpdateUserDefinedFunction",
        "glue:DeleteUserDefinedFunction",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
    ],
    "Resource": "*"
},
{
    "Sid": "CrossAccountS3SourceBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<account-B-source-s3-bucket>",
        "arn:aws:s3:::<account-B-source-s3-bucket>/*"
    ]
},
{
    "Sid": "S3DestBucketPerms",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::<destination-s3-bucket>",
        "arn:aws:s3:::<destination-s3-bucket>/*"
    ]
},
{
    "Sid": "S3EmrLogBucketPerms",
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": [
        "arn:aws:s3:::<s3-emr-log-bucket>/*"
    ]
}

For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.

Before you run a Structural data generation, you must either:

Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role

Structural does not otherwise require this permission, and does not explicitly attempt to create a database.
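
The decision described above can be written as a small check. This is a sketch under stated assumptions: the function name `default_database_status` and its return values are hypothetical, and you supply the list of existing database names and the role's granted Glue actions yourself, for example from the AWS Glue console and the role's policy.

```python
def default_database_status(database_names, role_actions):
    """Classify whether the default-database requirement is met.

    database_names: names of the databases that already exist in the catalog.
    role_actions: Glue actions granted to the EC2 instance profile role.
    """
    if "default" in database_names:
        # Nothing to create, so glue:CreateDatabase is not needed for this.
        return "exists"
    if "glue:CreateDatabase" in role_actions:
        # Amazon EMR can create the missing default database itself.
        return "emr-can-create"
    # Neither condition holds; fix one of them before a data generation.
    return "blocked"


if __name__ == "__main__":
    # Hypothetical inputs.
    print(default_database_status(["sales", "default"], ["glue:GetDatabase"]))
    print(default_database_status(["sales"], ["glue:GetDatabase"]))
```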

