Spark

Tonic supports flat Parquet and Avro file processing via Spark. Currently, this feature is limited to flat files residing in Amazon S3. If you have flat files residing in other systems, such as Azure Data Lake, Azure Blob Store, HDFS, or DBFS, please reach out to [email protected].

Supported versions of Spark

Tonic supports Spark 2.4.x and Spark 3+. However, Spark 2.4.2 is not supported.
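If you're not sure which version your cluster runs, you can check it from a PySpark session. This is just a quick sanity check, not something Tonic requires:

from pyspark.sql import SparkSession

# Attach to (or start) a Spark session and print the running version.
spark = SparkSession.builder.getOrCreate()
print(spark.version)  # e.g. "3.1.2"; any 2.4.x except 2.4.2, or any 3.x, is supported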

Connecting to S3

Tonic requires both a source and an output S3 path. The source path should contain the flat files you intend to process, and the destination path is where the processed files will be placed. Tonic preserves the source folder structure in the output location.

Source path

Tonic assumes that your source path points to a folder or bucket whose direct children are the tables you wish to process. For example, for a customer with two tables, customers and orders, we would expect a folder structure such as:

  • data/

    • customers/

      • file1.parquet

      • file2.parquet

      • ....

    • orders/

      • file1.parquet

      • file2.parquet

      • ...

In the above example, your source path should be s3://<bucket name>/data

Destination path

The destination path should point to a folder in which Tonic will place the output in a sub-folder. The sub-folder is named with a GUID that identifies the specific job that generated the data.

For example, if your output path is s3://tonic-output/v1, then Tonic will create a sub-folder that contains all of the tables. So, in the above example with the customers and orders tables, your output hierarchy will look like this:

  • tonic-output/

    • v1/

      • some-guid/

        • customers/

          • file1.parquet

          • file2.parquet

          • ....

        • orders/

          • file1.parquet

          • file2.parquet

          • ...
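Because the sub-folder name is a GUID generated per job, you may want to enumerate the destination prefix after a run to locate the newest output. A minimal boto3 sketch, using the example bucket and prefix above (swap in your own values):

import boto3

s3 = boto3.client("s3")

# List the immediate "directories" under the destination path; each one is a job GUID.
resp = s3.list_objects_v2(Bucket="tonic-output", Prefix="v1/", Delimiter="/")
for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])  # e.g. "v1/<some-guid>/"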

Partitioned datasets

Additionally, Tonic supports partitioned tables, such as the one shown below:

  • data/

    • customers/

      • file1.parquet

      • file2.parquet

      • ....

    • orders/

      • dt=2020-01-01

        • hr=0

          • file1.parquet

          • file2.parquet

          • ...

        • hr=1

          • file1.parquet

          • file2.parquet

          • ...

      • dt=2020-01-02

        • ...

In the above example, Tonic will automatically recognize that the orders table is partitioned on the "dt" and "hr" columns and will preserve that partitioning in your output folder.
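For reference, the sketch below shows how Spark itself maps a Hive-style dt=/hr= layout to partition columns on read and reproduces it on write. It is only an illustration of the convention, not Tonic's implementation; paths follow the example above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the table root discovers dt and hr as partition columns.
orders = spark.read.parquet("s3a://<bucket name>/data/orders")
orders.printSchema()  # includes dt and hr alongside the columns stored in the files

# Writing with partitionBy reproduces the same dt=/hr= folder hierarchy.
orders.write.partitionBy("dt", "hr").parquet("s3a://<bucket name>/output/orders")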

S3 Server-Side Encryption

If your buckets have server-side encryption enabled via KMS, then your Spark cluster must have Hadoop 2.8.1+ installed.
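If you manage the cluster's Hadoop configuration yourself, SSE-KMS is typically enabled through the S3A connector properties. A sketch assuming you use the s3a:// filesystem; the key ARN is a placeholder:

from pyspark.sql import SparkSession

# Tell the S3A connector (available in Hadoop 2.8.1+) to write objects with SSE-KMS.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .config("spark.hadoop.fs.s3a.server-side-encryption.key", "<kms key arn>")
    .getOrCreate()
)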

Connecting to your Spark Cluster

Tonic can connect to your Spark cluster, whether on-prem or in the cloud, in a variety of ways. Currently, we support Amazon's managed Spark service, EMR, as well as connecting directly to your cluster by SSH-ing into its master node.

Amazon EMR

Amazon EMR is Amazon's managed Spark cluster service. Tonic has been tested on EMR v5.28.0+; however, earlier versions should also work, assuming they come packaged with a supported version of Spark. When possible, please use EMR v6+.

Connecting to EMR via SSH

Tonic can connect to any Spark cluster, including EMR, via SSH. The SSH connection information for EMR can be found on the Summary tab of your EMR cluster's console page.

You'll need to provide the DNS name of the cluster, the user name (always hadoop), the port (always 22), and the SSH private key that you selected when you set up the cluster.
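You can verify these details independently before entering them into Tonic. A minimal sketch using paramiko; the host name and key path are placeholders:

import paramiko

# Confirm the master node accepts the same credentials Tonic will use.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    hostname="<master public DNS>",
    port=22,
    username="hadoop",
    key_filename="/path/to/cluster-key.pem",
)
_, stdout, stderr = client.exec_command("spark-submit --version")
print(stderr.read().decode())  # spark-submit prints its version banner to stderr
client.close()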

Connecting to EMR via Steps

Amazon EMR supports launching Spark jobs through the EMR Steps API. With this approach, you only need to provide Tonic with the cluster ID of your EMR cluster. The cluster ID can be found on the EMR Clusters console page and always begins with "j-".
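For context, submitting a Spark job through the Steps API looks roughly like the sketch below. This is only an illustration of the mechanism, not Tonic's exact submission; the cluster ID, region, and script path are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark job to an existing cluster as an EMR step.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://<bucket name>/jobs/example_job.py"],
            },
        }
    ],
)
print(response["StepIds"])  # poll with emr.describe_step() to track the step's progress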

Generic SSH

Tonic can connect to Spark clusters when given SSH connection information for the cluster's master node.

Permissions

Permissions vary depending on your specific setup. Currently, Tonic only supports permissions based on IAM users and requires you to provide Tonic with the user's IAM access key ID and secret access key.

Permissions for S3 Access

These specific permissions are required regardless of your setup.

{
  "Sid": "VisualEditor0",
  "Effect": "Allow",
  "Action": "s3:ListBucket",
  "Resource": [
    "arn:aws:s3:::<bucket of source path>",
    "arn:aws:s3:::<bucket of destination path>"
  ]
},
{
  "Sid": "VisualEditor1",
  "Effect": "Allow",
  "Action": [
    "s3:GetObject"
  ],
  "Resource": [
    "arn:aws:s3:::<source path>/*",
    "arn:aws:s3:::<source path>/",
    "arn:aws:s3:::<destination path>/*",
    "arn:aws:s3:::<destination path>/"
  ]
},
{
  "Sid": "VisualEditor2",
  "Effect": "Allow",
  "Action": "s3:PutObject",
  "Resource": [
    "arn:aws:s3:::<destination path>/*",
    "arn:aws:s3:::<destination path>"
  ]
}

Permissions for EMR Steps

These permissions are only required if you are using the EMR Steps API.

{
  "Sid": "VisualEditor3",
  "Effect": "Allow",
  "Action": "elasticmapreduce:ListClusters",
  "Resource": "*"
},
{
  "Sid": "VisualEditor4",
  "Effect": "Allow",
  "Action": [
    "elasticmapreduce:DescribeStep",
    "elasticmapreduce:AddJobFlowSteps",
    "elasticmapreduce:DescribeCluster"
  ],
  "Resource": [
    "arn:aws:elasticmapreduce:<aws region, e.g. us-east-1>:<aws account id>:cluster/<cluster id>"
  ]
}

Permissions for AWS Glue

For Hive+Spark with AWS Glue, you'll need the permissions below to connect to your source Hive database.

{
  "Sid": "VisualEditor1",
  "Effect": "Allow",
  "Action": [
    "glue:GetDatabase",
    "glue:GetTables",
    "glue:GetTable",
    "glue:CreateDatabase"
  ],
  "Resource": [
    "arn:aws:glue:<aws region, e.g. us-east-1>:<aws account id>:catalog",
    "arn:aws:glue:<aws region, e.g. us-east-1>:<aws account id>:table/*/*",
    "arn:aws:glue:<aws region, e.g. us-east-1>:<aws account id>:database/*"
  ]
}

You can optionally write your data to an output AWS Glue database by specifying one in Tonic. If you do this, you'll need to add the permissions below as well.

{
  "Sid": "VisualEditor2",
  "Effect": "Allow",
  "Action": [
    "glue:DeleteDatabase"
  ],
  "Resource": [
    "arn:aws:glue:<aws region>:<aws account>:catalog",
    "arn:aws:glue:<aws region>:<aws account>:database/<output database name>",
    "arn:aws:glue:<aws region>:<aws account>:userDefinedFunction/<output database name>/*",
    "arn:aws:glue:<aws region>:<aws account>:table/<output database name>/*"
  ]
}

Server-Side Encryption with KMS

You'll need to add decrypt and encrypt permissions to your account for both the source and destination paths.

Additionally, if you are using the EMR Steps API, then the EMR role assigned to your cluster must be given Decrypt access to the KMS key used on the output bucket.

Logs

Logging of Spark jobs is more limited than for other databases Tonic supports. This is due to the distributed and managed nature of Spark clusters.

Logging jobs launched via EMR Steps

The Jobs page, available on Tonic's left sidebar, will indicate whether a job was successfully submitted to the EMR cluster, but it will not report the status of the job itself. To monitor the job's status, use the EMR console.

Logging jobs launched via an SSH connection (including to EMR)

The Jobs page, available on Tonic's left sidebar, will provide information on the job's status as it runs and will additionally provide a Tracking URL once the job has started. You can follow this Tracking URL to Spark's management portal, where you can find additional, more detailed logs.