Configuring Amazon EMR workspace data connections

During workspace creation, to indicate that the workspace uses Amazon EMR:

  1. For Connection Type, choose Spark.

  2. For Cluster Type, choose Amazon EMR.

Connecting to the catalog database

In the Catalog Database section, you provide the details about the source database:

  1. In the Glue Catalog Database field, provide the name of the AWS Glue catalog database that contains the source data.

  2. If the AWS Glue catalog database is in a different AWS account from Tonic Structural and Amazon EMR, then:

    1. Toggle Cross Account Access to the on position.

    2. In the Glue Catalog Id field, provide the AWS account ID for the account that contains the data catalog.

    3. In the Glue Catalog Name field, provide the catalog name of the data source that is attached to Athena.

  3. To test the connection to the catalog, click Test Catalog Connection.
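
The Test Catalog Connection check can be approximated outside of Structural. Below is a minimal sketch using boto3, assuming AWS credentials with Glue read permissions; the region, database name, and account ID are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Same-account lookup: confirm that the Glue catalog database exists.
glue.get_database(Name="my_source_db")  # placeholder database name

# Cross-account lookup: pass the owning account's ID as the catalog ID,
# mirroring the Glue Catalog Id field.
glue.get_database(Name="my_source_db", CatalogId="111122223333")
```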

Enabling validation of table filters

The Enable partition filter validation toggle indicates whether Structural should validate table filters when you create them.

By default, the toggle is in the on position, and Structural validates the filters. To disable the validation, change the toggle to the off position.
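
For context, as the toggle name suggests, table filters for Amazon EMR workspaces operate as partition filters. The following is a hedged PySpark illustration of the idea, with a hypothetical table partitioned by event_date; the actual filter syntax is whatever you enter in the workspace table filter settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table partitioned by event_date. The WHERE clause acts as a
# partition filter, so Spark only reads the matching partitions.
df = spark.sql(
    "SELECT * FROM my_source_db.events WHERE event_date >= '2024-01-01'"
)
```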

Blocking data generation for all schema changes

By default, Structural blocks data generation only when schema changes conflict with your workspace configuration.

To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.

Identifying the Amazon EMR cluster

Amazon EMR supports launching Spark jobs through the Amazon EMR Steps API. To submit jobs this way, Structural needs the Amazon EMR cluster identifier.
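
For context, a step submission through the Steps API looks roughly like the boto3 sketch below. This is only an illustration of the API that Structural relies on, not Structural's actual submission logic; the cluster ID and script path are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Submit a Spark job to a running cluster as an EMR step.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # the EMR Cluster Id value
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/job.py",  # placeholder script
                ],
            },
        }
    ],
)
```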

Under EMR Cluster, in the EMR Cluster Id field, provide the cluster ID of your Amazon EMR cluster.

You can find the cluster ID on the EMR Clusters console page. The ID always begins with "j-".
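
You can also look up the ID programmatically. A minimal boto3 sketch, assuming credentials with EMR read access:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# List clusters that can accept work; each Id begins with "j-".
resp = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
for cluster in resp["Clusters"]:
    print(cluster["Id"], cluster["Name"])
```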

To test the connection to the Amazon EMR cluster, click Test EMR Connection.

Specifying the Amazon S3 location for the destination data

Under Output S3 Location, in the S3 Bucket Path field, provide the path to the location in Amazon S3 where Structural writes the destination data.

By default, the Create job-specific destination folder toggle is in the on position, and Structural writes the output for each data generation to a new folder under the output location. To create the folder, Structural appends a GUID to the output location.

If you do not want a separate folder for each data generation, toggle Create job-specific destination folder to the off position.

To verify that Structural can reach the provided path, click Test S3 Connection.
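
The Test S3 Connection check is essentially a reachability and permissions test against the path. A minimal boto3 sketch of an equivalent check; the bucket and prefix are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Confirm that the output location is reachable by listing at most one key
# under the configured prefix.
s3.list_objects_v2(
    Bucket="example-output-bucket",  # placeholder bucket
    Prefix="structural/output/",     # placeholder path within the bucket
    MaxKeys=1,
)
```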

If you use non-job-specific folders for destination data, then the following environment settings determine how Structural handles overwrites. You can configure these settings from the Environment Settings tab on Tonic Settings.

  • TONIC_WORKSPACE_ERROR_ON_OVERRIDE. Whether to prevent overwrites of previous writes. By default, this setting is true, and attempts to overwrite return an error. To allow overwrites, set this to false.

  • TONIC_WORKSPACE_DEFAULT_SAVE_MODE. The mode to use to save tables to a non-job-specific folder. The default value is null. When this is set to any other value, it takes precedence over TONIC_WORKSPACE_ERROR_ON_OVERRIDE. The available values are Append, ErrorIfExists, Ignore, and Overwrite.
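
These values correspond to Spark's DataFrame save modes. A minimal PySpark sketch, with a placeholder path, of what each mode means when the destination already contains data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # example data
path = "s3://example-output-bucket/structural/output/table_a"  # placeholder

# Append: add new files alongside any existing data.
df.write.mode("append").parquet(path)

# The other modes differ only in the mode string:
#   "errorifexists" fails if the path already contains data,
#   "ignore" silently skips the write,
#   "overwrite" replaces the existing data.
```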

Providing Spark configuration variable values

The Spark Configuration section lists the Spark configuration variables that Structural requires you to set.
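
For reference, a Spark configuration variable is a key/value property applied to the Spark session. A minimal PySpark sketch with an illustrative, hypothetical property value; the variables that Structural actually requires are the ones listed in this section.

```python
from pyspark.sql import SparkSession

# Spark configuration variables are key/value properties set on the session.
spark = (
    SparkSession.builder
    .appName("structural-emr-example")              # placeholder app name
    .config("spark.sql.shuffle.partitions", "200")  # hypothetical example value
    .getOrCreate()
)
```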
