Configuring Amazon EMR workspace data connections

To use Amazon EMR, during workspace creation, for Connection Type, choose Spark, and then for Cluster Type, choose EMR.

Connecting to the catalog database

In the Catalog Database section, you provide the details about the source database:
  1. In the Glue Catalog Database field, provide the name of the Glue catalog database that contains the source data.
  2. If the Glue catalog database is in a different AWS account from Tonic and EMR, then:
    1. Toggle Cross Account Access to the on position.
    2. In the Glue Catalog Id field, provide the AWS account ID for the account that contains the data catalog.
    3. In the Glue Catalog Name field, provide the catalog name of the data source that is attached to Athena.
  3. To test the connection to the catalog, click Test Catalog Connection.
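
The field rules above can be sketched as a small pre-check to run before you fill in the form. This is only an illustration: the CatalogSettings class and check_catalog_settings function are hypothetical names, not part of Tonic, and treating Glue Catalog Name as required for cross-account access is an assumption based on the steps listed. The 12-digit check reflects the standard format of AWS account IDs.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CatalogSettings:
    """Hypothetical container that mirrors the Catalog Database fields."""
    glue_catalog_database: str
    cross_account_access: bool = False
    glue_catalog_id: Optional[str] = None    # AWS account ID that owns the catalog
    glue_catalog_name: Optional[str] = None  # catalog name attached to Athena


def check_catalog_settings(settings: CatalogSettings) -> List[str]:
    """Return a list of problems; an empty list means the fields look complete."""
    problems = []
    if not settings.glue_catalog_database.strip():
        problems.append("Glue Catalog Database is required.")
    if settings.cross_account_access:
        # AWS account IDs are 12-digit numbers.
        if not re.fullmatch(r"\d{12}", settings.glue_catalog_id or ""):
            problems.append("Glue Catalog Id must be a 12-digit AWS account ID.")
        # Assumption: the catalog name is mandatory once cross-account access is on.
        if not (settings.glue_catalog_name or "").strip():
            problems.append("Glue Catalog Name is required for cross-account access.")
    return problems
```

A check like this only catches incomplete or malformed values. The Test Catalog Connection button remains the way to confirm that the catalog is reachable.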

Enabling validation of table filters

For EMR workspaces, you can provide WHERE clauses to filter tables. See Applying a filter to tables.
The Enable partition filter validation toggle indicates whether Tonic should validate those filters when you create them.
By default, the toggle is in the on position, and Tonic validates the filters. To disable the validation, change the toggle to the off position.

Blocking data generation for all schema changes

By default, data generations are not blocked when schema changes do not conflict with your workspace configuration.
To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.
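
The effect of the toggle can be illustrated with a small decision function. This is a sketch of the behavior described above, not Tonic's implementation: SchemaChange and should_block_generation are hypothetical names, and it assumes that a conflicting schema change blocks generation regardless of the toggle.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SchemaChange:
    """Hypothetical record of one detected schema change."""
    description: str
    conflicts_with_workspace: bool


def should_block_generation(changes: Iterable[SchemaChange],
                            block_on_all_changes: bool = False) -> bool:
    """Decide whether data generation is blocked for the detected changes."""
    changes = list(changes)
    if block_on_all_changes:
        # Toggle on: any schema change at all blocks generation.
        return len(changes) > 0
    # Toggle off (default): only conflicting changes block generation (assumed).
    return any(change.conflicts_with_workspace for change in changes)
```

For example, a newly added column that does not conflict with the workspace configuration leaves generation unblocked by default, but blocks it when the toggle is on.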

Identifying the EMR cluster

Amazon EMR supports launching Spark jobs through the EMR Steps API. To launch jobs this way, Tonic needs the EMR cluster identifier.
Under EMR Cluster, in the EMR Cluster Id field, provide the cluster ID of your EMR cluster.
You can find the cluster ID on the EMR Clusters console page. The ID always begins with "j-".
To test the connection to the EMR cluster, click Test EMR Connection.
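
To make the role of the cluster ID concrete, the sketch below checks the documented "j-" prefix and builds a request body in the shape that the EMR AddJobFlowSteps API accepts. It is not necessarily how Tonic submits its jobs. The function names, the step name, and the ActionOnFailure value are illustrative assumptions.

```python
from typing import Any, Dict, List


def looks_like_emr_cluster_id(value: str) -> bool:
    """Check only the documented property: the ID begins with "j-"."""
    cleaned = value.strip()
    return cleaned.startswith("j-") and len(cleaned) > len("j-")


def build_spark_step_request(cluster_id: str, spark_args: List[str]) -> Dict[str, Any]:
    """Build an AddJobFlowSteps-style request that runs spark-submit on the cluster."""
    if not looks_like_emr_cluster_id(cluster_id):
        raise ValueError(f"EMR cluster IDs begin with 'j-': {cluster_id!r}")
    return {
        "JobFlowId": cluster_id.strip(),
        "Steps": [
            {
                "Name": "example-spark-step",   # illustrative step name
                "ActionOnFailure": "CONTINUE",  # illustrative choice
                "HadoopJarStep": {
                    # command-runner.jar lets an EMR step run a command such as spark-submit.
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", *spark_args],
                },
            }
        ],
    }
```

With boto3 installed and AWS credentials configured, a request like this could be passed to an EMR client as emr_client.add_job_flow_steps(**request).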

Specifying the S3 location for the destination data

Under Output S3 Location, in the S3 Bucket Path field, provide the path to the location in S3 where Tonic writes the destination data.
To verify that Tonic can reach the provided path, click Test S3 Connection.
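
As an illustration of what the path contains, the sketch below splits a path into its bucket and key prefix. It assumes the path is written in the s3://bucket/prefix form, which is an assumption about the expected input rather than something the field description states. The split_s3_path function is a hypothetical helper.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_path(path: str) -> Tuple[str, str]:
    """Split an s3://bucket/prefix path into (bucket, prefix)."""
    parsed = urlparse(path.strip())
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected a path like s3://bucket/prefix, got {path!r}")
    # The bucket is the host part; the prefix is the rest without the leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

Splitting the path this way makes it easier to confirm that the bucket exists and that the prefix points to the intended output folder before you click Test S3 Connection.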

Providing Spark configuration variable values

The Spark Configuration section lists the Spark configuration variables that must be set for Tonic.