Using Spark to run de-identification of the data

After you connect to the Spark cluster, you can begin to use Tonic Structural as normal. You identify the sensitive data in your dataset and apply the appropriate Structural generators to de-identify your data.

After you assign generators to all of the sensitive columns, you can use the Structural SDK to run the de-identification from a Spark program. For details, go to the Tonic SDK Javadoc.

From the top right corner of the Structural application, click SDK Setup, then follow the instructions to get started.

At a high level:

  1. Download the Structural SDK JAR file and place it on the Spark master node. Your Spark jobs must include this JAR file.

  2. Create an API token.

  3. Take note of your workspace ID.

Here is a very basic example of using the SDK:

// Seed that Structural uses for consistent statistics across runs
val baseStatisticsSeed = 489465

// Connect to the Structural workspace
val workspace = Workspace.createWorkspace("https://path/to/tonic", "<<api-key>>", "<<workspace-id>>", baseStatisticsSeed)

// Read the source data, then apply the workspace's generator configuration to it
val sourceDf = spark.read.parquet("s3://parquet/source/users")
val processedDf = workspace.processDataframe("users", sourceDf)
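Continuing the example above, the de-identified DataFrame is a standard Spark DataFrame, so you can write it out with the usual Spark writer API. The destination path below is a placeholder; any Spark-supported sink works.

// Write the de-identified data to a new location (placeholder path)
processedDf.write
  .mode("overwrite")
  .parquet("s3://parquet/deidentified/users")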
