Overview of the Spark SDK integration

Workflow for the Spark SDK

The following diagram provides a high-level overview of the workflow to create a Spark SDK workspace and generate transformed data.

Workflow for creating and generating data from a Spark SDK workspace
  1. To start, you create a Spark SDK workspace in Structural. In the workspace settings, you provide a connection to a catalog database (Hive or Dremio) that contains the source data.

  2. After you create the workspace, you set up the Spark SDK.

  3. Next, in the workspace, you configure the data generation. This includes:

    • Identifying sensitive data. For Spark SDK workspaces, you run manual or scheduled sensitivity scans.

    • Indicating whether to truncate any tables.

    • Assigning and configuring generators for data columns.

  4. Finally, you use a Spark program to run the data generation and write the output to a specified output location.

Data generation process

The following high-level diagram describes the Tonic Structural data generation process for the Spark SDK.

Overview diagram of the Tonic Structural process for the Spark SDK

Source data

A Spark SDK workspace reads the source data from an existing Spark catalog, either:

  • A Hive catalog, such as the AWS Glue Data Catalog

  • A Dremio catalog

You load the source data into Spark DataFrames in the same way that you would for a standard Spark job.
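
For example, the following sketch loads a catalog table into a DataFrame. The application name and the analytics.customers table name are placeholders, not values from the Structural documentation.

  import org.apache.spark.sql.SparkSession

  // Standard Spark session. For a Hive-backed catalog such as the
  // AWS Glue Data Catalog, enable Hive support so that catalog tables
  // resolve by name.
  val spark = SparkSession.builder()
    .appName("structural-spark-sdk-example") // placeholder application name
    .enableHiveSupport()
    .getOrCreate()

  // Load a source table from the catalog into a DataFrame, exactly as
  // you would for any other Spark job. "analytics.customers" is a
  // placeholder database.table name.
  val sourceDf = spark.read.table("analytics.customers")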

Workflow for the data transformation

The Structural Spark SDK doesn’t send data to Structural for processing. Instead:

  1. The Spark job initializes the Structural SDK.

  2. The SDK retrieves the workspace configuration and generator definitions from the Structural web server.

  3. Spark executes the generator configuration locally, in parallel, across the cluster.

  4. The job generates a transformed Spark DataFrame that reflects the workspace configuration.

This design centralizes the generator definitions in Structural but keeps the execution close to your data for scale and performance.
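
As a rough sketch of how this looks in a Spark program: the class and method names below (TonicSparkSdk, transformDataFrame) are illustrative assumptions rather than the documented SDK API, and the URL, API key, and workspace ID values are placeholders. sourceDf is the DataFrame loaded in the earlier example.

  // Illustrative sketch only. The class and method names below
  // (TonicSparkSdk, transformDataFrame) are assumptions, not the
  // documented SDK API; substitute the names from your SDK version.

  // Hypothetical initialization: point the SDK at the Structural web
  // server and at the workspace whose configuration to apply.
  val sdk = TonicSparkSdk(
    structuralUrl = "https://structural.example.com", // placeholder URL
    apiKey        = sys.env("TONIC_API_KEY"),          // placeholder credential
    workspaceId   = "<workspace-id>"                   // placeholder ID
  )

  // The SDK retrieves the workspace configuration and generator
  // definitions from the Structural web server, then applies the
  // assigned generators to the DataFrame locally on the cluster.
  // No source data is sent to Structural.
  val transformedDf = sdk.transformDataFrame(sourceDf, tableName = "customers")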

Output handling

A Spark SDK workspace is not configured with an output location for the transformed data. Instead, you specify the output location in your Spark program.

For example, you might write the output to Amazon S3, the Hadoop Distributed File System (HDFS), Delta Lake, or any other Spark-supported sink.
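
A sketch of the final write step, using the transformedDf from the sketch above; the bucket names and paths are placeholders.

  // Write the transformed DataFrame to the sink of your choice.
  // The paths below are placeholders; use the URI scheme (s3:// or
  // s3a://) that is appropriate for your cluster.

  // Parquet files on Amazon S3:
  transformedDf.write
    .mode("overwrite")
    .parquet("s3a://example-bucket/transformed/customers/")

  // Or a Delta Lake table (requires the Delta Lake libraries on the cluster):
  transformedDf.write
    .format("delta")
    .mode("overwrite")
    .save("s3a://example-bucket/delta/customers/")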
