Spark SDK
An introduction to Tonic's Spark SDK
Tonic supports Spark through native integrations with Databricks and Amazon EMR, and through an SDK that lets you incorporate Tonic directly into existing Spark programs and workflows.

About the Tonic Spark SDK

Tonic's Spark SDK is written in Java. It can be used in existing Java, PySpark, and Scala Spark programs.

Connecting to a Spark database

The SDK requires a connection to the Tonic web server. To use the SDK from within Tonic, you first connect to a Spark database: you connect to your self-managed Spark cluster, using either Dremio or Hive as your metadata catalog.

Using a Spark program to de-identify data

After you connect to the Spark cluster, you use Tonic as normal: you identify the PII in your dataset and assign the appropriate Tonic generators to the sensitive columns.
When all of the sensitive information is covered by generators, you can use the Tonic SDK to run the de-identification from a Spark program.
From the top right corner of the Tonic UI, click SDK Setup, then follow the instructions to get started.
In short, you need to:
  1. Download the Tonic SDK JAR file and place it on the Spark master node. Your Spark jobs need to include this JAR file.
  2. Create an API token.
  3. Take note of your workspace ID.
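The steps above can be sketched as a job submission. This is a minimal sketch only: the JAR path, main class, and argument order are placeholders, not the actual Tonic SDK interface — take the real values from the SDK Setup instructions in the Tonic UI. The `--jars` flag is standard `spark-submit` usage for adding an extra JAR to the driver and executor classpaths.

```shell
# Hypothetical sketch; replace the placeholders with the values from
# the SDK Setup screen in the Tonic UI.

# API token created in Tonic (step 2) and workspace ID (step 3):
export TONIC_API_TOKEN="<your-api-token>"
export TONIC_WORKSPACE_ID="<your-workspace-id>"

# --jars adds the Tonic SDK JAR (step 1, placed on the master node)
# to the classpath of your own Spark job.
spark-submit \
  --jars /opt/tonic/tonic-spark-sdk.jar \
  --class com.example.DeidentifyJob \
  my-spark-job.jar \
  "$TONIC_API_TOKEN" "$TONIC_WORKSPACE_ID"
```

How your job reads the token and workspace ID (here passed as program arguments) is up to your program; environment variables or a config file work equally well.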