Overview of the Spark SDK integration
Workflow for the Spark SDK
The following diagram provides a high-level overview of the workflow to create a Spark SDK workspace and generate transformed data.

To start, you create a Spark SDK workspace in Structural. In the workspace settings, you provide a connection to a catalog database (Hive or Dremio) that contains the source data.
After you create the workspace, you set up the Spark SDK.
Next, in the workspace, you configure the data generation. This includes:
Indicating whether to truncate any tables.
Assigning and configuring generators for data columns.
Finally, you use a Spark program to run the data generation and write the output to a specified output location.
Data generation process
The following high-level diagram describes the Tonic Structural data generation process for the Spark SDK.

Source data
A Spark SDK workspace reads the source data from an existing Spark catalog, either:
A Hive catalog, such as an AWS Glue Data Catalog
A Dremio catalog
You load the source data into Spark DataFrames in the same way that you would for a standard Spark job.
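For example, assuming the source tables are registered in a Hive-compatible catalog (such as an AWS Glue Data Catalog), a minimal read might look like the following sketch. The application name, database, and table names are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Build a session with Hive catalog support so that catalog tables
// (for example, tables in an AWS Glue Data Catalog) are visible to Spark.
val spark = SparkSession.builder()
  .appName("structural-spark-sdk-example")
  .enableHiveSupport()
  .getOrCreate()

// Load a source table into a DataFrame, exactly as in any Spark job.
// "sales_db.customers" is a placeholder database and table name.
val customers: DataFrame = spark.table("sales_db.customers")
```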
Workflow for the data transformation
The Structural Spark SDK doesn’t send data to Structural for processing. Instead:
The Spark job initializes the Structural SDK.
The SDK retrieves the workspace configuration and generator definitions from the Structural web server.
Spark executes the generator configuration locally, in parallel, across the cluster.
The job generates a transformed Spark DataFrame that reflects the workspace configuration.
This design centralizes the generator definitions in Structural, but keeps the execution close to your data for scale and performance.
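The sketch below, which continues from the read example above, only illustrates the shape of this interaction. The StructuralSdk and transform names, the parameter names, and the connection values are hypothetical placeholders, not the documented SDK API.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch only: StructuralSdk, transform, and the parameter
// names below are placeholders, not the documented SDK API.

// Initialize the SDK against the Structural web server so that it can
// retrieve the workspace configuration and generator definitions.
val sdk = StructuralSdk(
  serverUrl   = "https://structural.example.com",  // placeholder URL
  apiKey      = sys.env("STRUCTURAL_API_KEY"),     // placeholder credential
  workspaceId = "my-workspace-id"                  // placeholder workspace ID
)

// Apply the workspace's generator configuration to the DataFrame that was
// loaded earlier (customers). The generators run as ordinary Spark
// transformations, in parallel on the executors; no source data is sent
// to the Structural server.
val transformed: DataFrame = sdk.transform("sales_db.customers", customers)
```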
Output handling
A Spark SDK workspace is not configured with an output location for the transformed data.
You identify the output location in your Spark program.
For example, you might write the output to Amazon S3, the Hadoop Distributed File System (HDFS), Delta Lake, or any other Spark-supported sink.
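Continuing the earlier sketch, writing the transformed DataFrame uses the standard Spark DataFrame writer. The bucket and paths below are placeholders.

```scala
// Write the transformed DataFrame with the standard Spark writer.
// The S3 bucket and prefix are placeholders.
transformed.write
  .mode("overwrite")
  .parquet("s3a://example-bucket/transformed/customers/")

// With the Delta Lake libraries available on the cluster, a Delta table
// works the same way (placeholder path):
// transformed.write.format("delta").mode("overwrite")
//   .save("s3a://example-bucket/delta/customers/")
```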