Amazon EMR
Amazon Elastic MapReduce (EMR) is Amazon's managed Spark cluster service. Tonic supports processing flat files (Parquet, CSV, Avro, etc.) in S3 via EMR.

Supported versions of Spark

Tonic supports Spark 2.4.x and Spark 3.0.0, with the exception of Spark 2.4.2, which is not supported. We suggest using EMR-6.1.0 with Spark 3.0.0 or EMR-6.2.0 with Spark 3.0.1, but any version between 5.2.8 and 6.0.2 should work.
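The support statement above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of Tonic: the function name is_supported_spark is hypothetical, and treating Spark 3.0.1 as supported is an assumption based on the EMR-6.2.0 recommendation above.

```python
import re


def is_supported_spark(version: str) -> bool:
    """Return True if a Spark version string matches the support statement above.

    Hypothetical helper for illustration only; it is not a Tonic API.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        # Anything that is not a plain MAJOR.MINOR.PATCH string is rejected.
        return False
    major, minor, patch = (int(part) for part in match.groups())
    if (major, minor) == (2, 4):
        # Spark 2.4.x is supported, except for 2.4.2.
        return patch != 2
    # Assumption: 3.0.1 is accepted because EMR-6.2.0 with Spark 3.0.1 is suggested.
    return (major, minor, patch) in {(3, 0, 0), (3, 0, 1)}
```

Running a check like this before pointing Tonic at a cluster can catch an unsupported Spark version early.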

Additional requirements

Tonic uses EMR for compute, but a few other pieces of AWS infrastructure are also required.
Tonic requires a metadata catalog when connecting to your data. At the moment, only AWS Glue is supported when working with EMR. Additionally, Tonic writes output data to S3 only; it does not write output data back into a catalog.

S3 server-side encryption

If your buckets have server-side encryption enabled via KMS, then your Spark cluster must have Hadoop 2.8.1 or later installed.
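One way to check the Hadoop requirement is to compare the cluster's Hadoop version string (for example, as reported by running hadoop version on the master node) against 2.8.1. The sketch below assumes version strings such as "2.10.1" or "2.10.1-amzn-0"; the function name is hypothetical and not part of Tonic.

```python
def hadoop_supports_sse_kms(hadoop_version: str) -> bool:
    """Return True if the Hadoop version is 2.8.1 or later.

    Hypothetical helper. Assumes a dotted numeric version, optionally
    followed by a '-suffix' such as '-amzn-0'. Raises ValueError if the
    numeric part cannot be parsed.
    """
    numeric_part = hadoop_version.strip().split("-")[0]
    numbers = [int(piece) for piece in numeric_part.split(".")[:3]]
    # Pad short versions such as "2.8" to three components: (2, 8, 0).
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers) >= (2, 8, 1)
```

Comparing integer tuples rather than strings avoids the classic mistake of ranking "2.10.1" below "2.8.1".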

Connecting to your EMR cluster

Connecting to EMR via Steps

Amazon EMR supports launching Spark jobs through the EMR Steps API. With this approach, you only need to provide Tonic with the cluster ID of your EMR cluster. You can find the cluster ID on the EMR Clusters console page. The ID always begins with "j-".
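If you prefer to look up the cluster ID programmatically rather than in the console, the sketch below shows one possible approach. It assumes the boto3 library and AWS credentials are available; the two function names are hypothetical, and only the first page of list_clusters results is read.

```python
def looks_like_cluster_id(value: str) -> bool:
    """Cheap sanity check: EMR cluster IDs begin with 'j-'."""
    value = value.strip()
    return value.startswith("j-") and len(value) > 2


def list_active_clusters(emr_client) -> list:
    """Return (id, name) pairs for clusters that are running or waiting.

    emr_client is expected to be a boto3 EMR client, for example
    boto3.client("emr"). Only the first page of results is read.
    """
    response = emr_client.list_clusters(ClusterStates=["RUNNING", "WAITING"])
    return [(cluster["Id"], cluster["Name"]) for cluster in response["Clusters"]]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   for cluster_id, name in list_active_clusters(boto3.client("emr")):
#       print(cluster_id, name)
```

The prefix check only guards against pasting the wrong value; it does not confirm that the cluster exists.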


Logging for Spark jobs is more limited than for the other databases that Tonic supports, because Spark clusters are distributed and managed.

Logging jobs launched via EMR Steps

The Jobs page, available on Tonic's left sidebar, shows whether a job was successfully submitted to the EMR cluster, but it provides no information about the status of the job. To check the job's status, use the EMR console page.
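As an alternative to the console, step status can also be read through the EMR API. The following is a minimal sketch assuming boto3 and AWS credentials; the function name summarize_steps is hypothetical, and only the first page of list_steps results is read.

```python
def summarize_steps(emr_client, cluster_id: str) -> list:
    """Return (step name, state) pairs for the steps on an EMR cluster.

    emr_client is expected to be a boto3 EMR client, for example
    boto3.client("emr"). States reported by EMR include PENDING,
    RUNNING, COMPLETED, CANCELLED, and FAILED.
    """
    response = emr_client.list_steps(ClusterId=cluster_id)
    return [(step["Name"], step["Status"]["State"]) for step in response["Steps"]]


# Example usage (requires boto3 and AWS credentials; the ID is a placeholder):
#   import boto3
#   for name, state in summarize_steps(boto3.client("emr"), "j-XXXXXXXXXXXXX"):
#       print(name, state)
```

The client is passed in as a parameter so the function can be exercised with a stub object, without contacting AWS.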

Logging jobs launched via an SSH connection (including to EMR)

The Jobs page, available on Tonic's left sidebar, shows the job's status as it runs and provides a Tracking URL once the job has started. You can follow this Tracking URL to Spark's management portal, where you can find additional, more detailed logs.