Tonic Structural with Amazon EMR uses the following IAM roles:
A role that is used by the Structural server
A role that is used by the Amazon EC2 instance profile for the Amazon EMR cluster
These roles must have the required permissions in order for Structural to run successfully on Amazon EMR data. Each role must be assigned to the appropriate profile.
If the Glue catalog is under a different AWS account from the one used for Structural and Amazon EMR, then use the instructions in Configuration for cross-account setups.
By default, Structural uses the IAM profile that is attached to the instance where Structural runs.
If you do not want to use that IAM profile, then to identify the profile to use:
Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.
For information about how to configure Structural environment settings, go to Configuring environment settings.
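As a minimal sketch, the two settings could be assembled and checked before they are passed to the Structural environment. The helper name `aws_credential_settings` is hypothetical and the key values are placeholders. Only the two setting names come from the text above.

```python
from __future__ import annotations


def aws_credential_settings(access_key_id: str, secret_access_key: str) -> dict[str, str]:
    """Return the two environment settings that point Structural at explicit AWS keys.

    Hypothetical helper: only the setting names are taken from the documentation.
    """
    if not access_key_id or not secret_access_key:
        raise ValueError("both the access key and its secret key are required")
    return {
        "TONIC_AWS_ACCESS_KEY_ID": access_key_id,
        "TONIC_AWS_SECRET_ACCESS_KEY": secret_access_key,
    }


# Placeholder values, not real credentials.
settings = aws_credential_settings("AKIAEXAMPLEKEYID", "example-secret-key")
for name, value in settings.items():
    # Mask the values so that the secret never reaches a log.
    print(f"{name}={'*' * len(value)}")
```

Both settings are set together because the secret key only has meaning alongside its access key.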
The Structural server role must have the following permissions:
The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.
By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.
However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.
The following permissions reflect the minimum permissions needed for Structural data generation:
For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.
Before you run a Structural data generation, you must either:
Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role
Structural does not otherwise require this permission, and does not explicitly attempt to create a database.
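If you choose the second option, the extra permission could be added as its own statement in the role's policy document. The following is a sketch only: the function name, the statement ID, and the wildcard resource are assumptions, and the empty base policy stands in for whatever policy the role already has.

```python
import json


def with_create_database(policy: dict) -> dict:
    """Return a copy of an IAM policy document that also allows glue:CreateDatabase."""
    statement = {
        "Sid": "AllowDefaultDatabaseCreation",  # hypothetical statement ID
        "Effect": "Allow",
        "Action": ["glue:CreateDatabase"],
        "Resource": "*",  # assumption: narrow this to your catalog and database ARNs
    }
    return {
        "Version": policy.get("Version", "2012-10-17"),
        # Build a new list so that the caller's policy is left unchanged.
        "Statement": list(policy.get("Statement", [])) + [statement],
    }


# Placeholder for the existing policy of the instance profile role.
base_policy = {"Version": "2012-10-17", "Statement": []}
updated = with_create_database(base_policy)
print(json.dumps(updated, indent=2))
```

Keeping the permission in a separate statement makes it easy to remove after the default database exists.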
You must add decrypt and encrypt permissions to your account on both the source and destination paths.
If you use the Amazon EMR Steps API, then the Amazon EMR role that is assigned to your cluster must have Decrypt access to the AWS KMS key that is used on the output bucket.
Tonic Structural supports Spark 2.4.x, Spark 3.0.0, Spark 3.0.1, and Spark 3.2.0. However, note that Spark 2.4.2 is not supported.
We suggest using EMR-6.1.0 or EMR-6.2.0 with Spark 3.0.0 or Spark 3.0.1, respectively. Any version between 5.2.8 and 6.0.2 should work.
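The Spark version rules above can be expressed as a small pre-check. This is a sketch of the stated rules only. The function name is hypothetical, and it does not query a cluster or check the Amazon EMR release.

```python
def is_supported_spark(version: str) -> bool:
    """Check a Spark version string such as "3.0.1" against the supported list."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    major, minor, patch = (int(part) for part in parts)
    if (major, minor) == (2, 4):
        # Every 2.4.x release is supported except 2.4.2.
        return patch != 2
    return (major, minor, patch) in {(3, 0, 0), (3, 0, 1), (3, 2, 0)}


for candidate in ["2.4.2", "2.4.8", "3.0.1", "3.1.0", "3.2.0"]:
    print(candidate, is_supported_spark(candidate))
```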
Structural supports the following data providers:
Parquet
CSV
Avro
JSON
ORC
Structural requires a metadata catalog when connecting to your data. Currently only AWS Glue is supported when working with Amazon EMR.
Structural writes data to Amazon S3 only. Structural does not write output data back into a catalog.
If your S3 buckets have server-side encryption enabled through AWS KMS, then your Spark cluster must have Hadoop 2.8.1 or later installed.
Required license: Professional or Enterprise
Not available on Structural Cloud.
Amazon EMR workspaces do not support workspace inheritance.
You can only assign the De-Identify or Truncate table modes.
For Truncate mode, the table is ignored completely. The table does not exist in the destination database.
Amazon EMR workspaces cannot use the following generators:
Algebraic
Array Character Scramble
Array JSON Mask
Array Regex Mask
Cross-Table Sum
CSV Mask
Event Timestamps
HTML Mask
JSON Mask
SIN
The following generators are supported, but with restrictions:
Character Scramble is only supported for text columns.
Timestamp Shift is only supported on date column types.
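The generator rules above could be checked in a script before you configure a workspace. The following sketch encodes the two lists as data. The function name and the simplified column type labels `"text"` and `"date"` are assumptions, not Structural identifiers.

```python
# Generators that Amazon EMR workspaces cannot use.
UNSUPPORTED_GENERATORS = {
    "Algebraic",
    "Array Character Scramble",
    "Array JSON Mask",
    "Array Regex Mask",
    "Cross-Table Sum",
    "CSV Mask",
    "Event Timestamps",
    "HTML Mask",
    "JSON Mask",
    "SIN",
}

# Generators that are supported only for one kind of column.
# The column type labels are simplified placeholders.
RESTRICTED_GENERATORS = {
    "Character Scramble": "text",
    "Timestamp Shift": "date",
}


def generator_allowed(generator: str, column_type: str) -> bool:
    """Return True if the generator can be used on a column of the given type."""
    if generator in UNSUPPORTED_GENERATORS:
        return False
    required_type = RESTRICTED_GENERATORS.get(generator)
    return required_type is None or column_type == required_type


print(generator_allowed("Character Scramble", "text"))
print(generator_allowed("JSON Mask", "text"))
```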
Amazon EMR workspaces do not support subsetting.
However, for tables that use the De-Identify table mode, you can provide a WHERE clause to filter the table. For details, go to Using table filtering for data warehouses and Spark-based data connectors.
Amazon EMR workspaces do not support upsert.
For Amazon EMR workspaces, you cannot write the destination data to container artifacts.
For Amazon EMR workspaces, you cannot write the destination data to an Ephemeral snapshot.
The logging of Spark jobs on the job details page is more limited than it is for other data connectors. This is because of how Spark clusters are distributed and managed.
The Jobs view provides information about the job's status as it runs.
After the job starts, it provides a tracking URL. The tracking URL leads to the Spark management portal, where you can find additional, more detailed logs.
During workspace creation, to indicate that the workspace uses Amazon EMR:
For Connection Type, choose Spark.
For Cluster Type, choose Amazon EMR.
In the Catalog Database section, you provide the details about the source database:
In the Glue Catalog Database field, provide the name of the AWS Glue catalog database that contains the source data.
If the AWS Glue catalog database is in a different AWS account from Tonic Structural and Amazon EMR, then:
Toggle Cross Account Access to the on position.
In the Glue Catalog Id field, provide the AWS account ID for the account that contains the data catalog.
In the Glue Catalog Name field, provide the catalog name of the data source that is attached to Athena.
To test the connection to the catalog, click Test Catalog Connection.
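Before you paste a value into the Glue Catalog Id field, a quick format check can catch copy errors. AWS account IDs are 12-digit numbers. The function name below is hypothetical, and the check only validates the format, not whether the account exists.

```python
def is_aws_account_id(value: str) -> bool:
    """Return True if the value has the 12-digit form of an AWS account ID."""
    return len(value) == 12 and value.isascii() and value.isdigit()


# Placeholder account IDs for illustration.
print(is_aws_account_id("123456789012"))
print(is_aws_account_id("1234-5678-9012"))
```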
For Amazon EMR workspaces, you can provide WHERE clauses to filter tables. For details, go to Using table filtering for data warehouses and Spark-based data connectors.
The Enable partition filter validation toggle indicates whether Structural should validate those filters when you create them.
By default, the toggle is in the on position, and Structural validates the filters. To disable the validation, change the toggle to the off position.
By default, data generations are not blocked when schema changes do not conflict with your workspace configuration.
To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.
Amazon EMR supports the launching of Spark jobs through the Amazon EMR Steps API. To use the API, Structural needs the Amazon EMR cluster identifier.
Under EMR Cluster, in the EMR Cluster Id field, provide the cluster ID of your Amazon EMR cluster.
You can find the cluster ID on the EMR Clusters console page. The ID always begins with "j-".
To test the connection to the Amazon EMR cluster, click Test EMR Connection.
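A simple prefix check can catch a mistyped cluster ID before you test the connection. This sketch checks only the "j-" prefix described above, because the rest of the ID format is not specified here. The function name and the sample ID are hypothetical.

```python
def looks_like_cluster_id(cluster_id: str) -> bool:
    """Return True if the value starts with "j-" and has at least one more character."""
    return cluster_id.startswith("j-") and len(cluster_id) > 2


# Placeholder ID for illustration only.
print(looks_like_cluster_id("j-EXAMPLE12345"))
```

The Test EMR Connection button remains the real check, because a well-formed ID can still point to a cluster that Structural cannot reach.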
Under Output S3 Location, in the S3 Bucket Path field, provide the path to the location in Amazon S3 where Structural writes the destination data.
By default, Structural writes the output for each data generation to a new folder under the output location, and Create job-specific destination folder is in the on position. To create the folder, Structural appends a GUID to the output location.
To not create a separate folder for each data generation, toggle Create job-specific destination folder to the off position.
To verify that Structural can reach the provided path, click Test S3 Connection.
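The effect of the Create job-specific destination folder toggle can be sketched as follows. The exact folder naming that Structural uses is not documented here, so the path layout, the function name, and the bucket name are assumptions.

```python
import uuid


def destination_path(output_location: str, job_specific: bool = True) -> str:
    """Return the folder that a data generation would write to.

    Assumption: the GUID is appended as one extra path segment.
    """
    base = output_location.rstrip("/")
    if not job_specific:
        # Every data generation writes to the same folder.
        return f"{base}/"
    # Each data generation gets its own folder, named with a new GUID.
    return f"{base}/{uuid.uuid4()}/"


print(destination_path("s3://example-bucket/structural-output"))
print(destination_path("s3://example-bucket/structural-output", job_specific=False))
```

With the toggle off, repeated data generations target the same folder, which is why the overwrite settings matter in that case.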
If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Structural Settings.
TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.
TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. When this is set to a value other than null, which is the default, this setting takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite.
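The precedence between the two settings can be sketched as a small function. Mapping the boolean setting to the ErrorIfExists and Overwrite modes is an interpretation of the description above, not confirmed Structural behavior, and the function name is hypothetical.

```python
from typing import Optional

SAVE_MODES = {"Append", "ErrorIfExists", "Ignore", "Overwrite"}


def effective_save_mode(default_save_mode: Optional[str], error_on_override: bool = True) -> str:
    """Resolve the save mode for a non-job-specific folder.

    default_save_mode stands for TONIC_WORKSPACE_DEFAULT_SAVE_MODE (None means null).
    error_on_override stands for TONIC_WORKSPACE_ERROR_ON_OVERRIDE.
    """
    if default_save_mode is not None:
        if default_save_mode not in SAVE_MODES:
            raise ValueError(f"unknown save mode: {default_save_mode}")
        # A non-null save mode takes precedence over the boolean setting.
        return default_save_mode
    # Assumption: true behaves like ErrorIfExists, false like Overwrite.
    return "ErrorIfExists" if error_on_override else "Overwrite"


print(effective_save_mode(None))
print(effective_save_mode(None, error_on_override=False))
print(effective_save_mode("Append", error_on_override=True))
```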
The Spark Configuration section lists the Spark configuration variables that Structural requires you to set.
The following high-level diagram describes how Tonic Structural data generation is processed for Amazon EMR.
For an Amazon EMR workspace, the source data comes from a database in a Glue catalog. The source data is fed to the Structural web server and Structural worker through Amazon Athena.
The Structural worker calls the EMR Steps API to coordinate the data generation job on the Spark cluster.
The destination data is written to an S3 bucket.
Tonic Structural supports operating on AWS Glue catalogs in AWS accounts different from where Structural and Amazon EMR are configured.
For the instructions in this topic, we'll use the following example:
AWS Account A contains the Amazon EMR Cluster, Athena workgroup, and destination S3 bucket.
AWS Account B contains the AWS Glue data catalog and source S3 bucket.
The following instructions explain how to set up each required AWS component. For cross-account setups, you use these instructions instead of the instructions in Creating IAM roles for Structural and Amazon EMR.
These instructions assume that both accounts reside in the same region. If your accounts belong to different regions, then see the Amazon documentation for instructions on how to set up a VPC for cross-account access.
The account that has Structural and Amazon EMR must be granted access to the resources in the account that has the AWS Glue catalog.
To continue our example, you must first grant Account A access to Account B's resources. To do this, set up the following resource-based policies for Account B's AWS Glue data catalog and source S3 bucket.
Register Account B's AWS Glue data catalog as an Athena data source in Account A's Athena console. For instructions, see the AWS documentation.
When you create your Amazon EMR cluster, make sure to enable the Use AWS Glue Data Catalog for table metadata option. This allows you to set a default catalog ID that points to Account B.
You must set the following configuration for all instance groups in the Amazon EMR cluster:
By default, Structural uses the IAM profile that is attached to the instance where Structural runs.
If you do not want to use that IAM profile, then to identify the profile to use:
Set the environment setting TONIC_AWS_ACCESS_KEY_ID to the AWS access key that is associated with the IAM profile.
Set the environment setting TONIC_AWS_SECRET_ACCESS_KEY to the secret key that is associated with the access key.
For information about how to configure Structural environment settings, go to Configuring environment settings.
The Structural server role must have the following permissions:
The profile is the Amazon EC2 instance profile that you assigned as the value of EC2 instance profile when you created the Amazon EMR cluster.
By default, a new Amazon EMR cluster is assigned the role EMR_EC2_DefaultRole, which contains all of the required permissions, plus additional permissions.
However, AWS recommends that you create a custom IAM role for your Amazon EMR cluster's Amazon EC2 instance profile role.
The following permissions reflect the minimum permissions needed for Structural data generation:
For Amazon EMR, the Glue catalog must contain a default database. If the default database does not exist, then Amazon EMR attempts to create it.
Before you run a Structural data generation, you must either:
Ensure that the default database exists
Add glue:CreateDatabase to the list of permissions that are granted to this role
Structural does not otherwise require this permission, and does not explicitly attempt to create a database.
Tonic Structural uses Amazon Athena to run queries against your AWS Glue catalog. These queries are typically for powering the front-end experience. They are also used in conjunction with gathering data used by certain generators during a data generation.
To use Athena, you must , and also create an Athena Workgroup. You can do this from the Athena homepage in the AWS Console.
By default, Structural expects the workgroup name to be tonic-emr-workgroup. To override the default value, configure the following environment setting:
If you do override this value, you must override it in both the Structural web server container and on the Amazon EMR cluster.
Structural process for Amazon EMR
How Structural data generation works with Amazon EMR.
System requirements
Supported versions of Amazon EMR and other requirements related to Amazon EMR.
Structural differences and limitations
Features that are unavailable or work differently for the Amazon EMR data connector.
Required Amazon EMR configuration
Required configuration in Amazon EMR before you create an Amazon EMR workspace.
Configure the workspace data connections
Data connection settings for Amazon EMR workspaces.
Create IAM roles
Create roles for the Tonic Structural server and the EC2 instance profile.
Create Athena workgroups
Enable AWS Athena to run queries against your Glue catalog.
Configure cross-account setups
Enable the Glue catalog to be in a different AWS account from Structural and Amazon EMR.