The SDK requires a connection to the Tonic Structural web server. To access the SDK experience from within Structural, you must create a workspace that uses Spark as its connection type.
When configuring the SDK, the Structural application requires a connection to a catalog database to retrieve table information and data.
Structural supports Hive and Dremio catalogs.
In the workspace configuration, select Spark as the connection type, then select Self-managed as the cluster type.
Under Catalog Database, to connect to a Hive catalog using the SDK:
Under Catalog Type, click Hive.
In the Hive Catalog Database field, enter the name of the database.
In the Server field, provide the server where the database is located.
In the Port field, provide the port to use to connect to the database.
In the Username field, provide the username for the account to use to connect to the database.
In the Password field, provide the password for the specified user.
To test the connection to the Hive catalog database, click Test Hive Connection.
For Spark workspaces, you can provide where clauses to filter tables. For details, go to #table-mode-filter-tables.
The Enable partition filter validation setting indicates whether Structural should validate those filters when you create them.
By default, the setting is in the on position, and Structural validates the filters. To disable the validation, toggle Enable partition filter validation to the off position.
To connect to a Dremio catalog using the SDK:
Under Catalog Type, click Dremio.
Under Connection Method, select either Legacy ODBC or Arrow Flight.
In the Server field, provide the name of the server.
In the Port field, provide the port to use to connect to the database.
In the Username field, provide the name of the user to use to connect to the database.
In the Password field, provide the password for the specified user.
By default, the source data contains all of the schemas. To limit the data to specific schemas, in the Schema(s) field, enter the list of schemas.
If you selected Legacy ODBC as the connection method, then in the Delegation Username field, enter the name of the delegation user.
By default, SSL is enabled, and Enable SSL/TLS is in the on position. We strongly recommend that you do not turn off SSL.
To test the connection to the Dremio catalog, click Test Dremio Connection.
By default, data generation is not blocked as long as schema changes do not conflict with your workspace configuration.
To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.
After you connect to the Spark cluster, you can begin to use Tonic Structural as normal. You identify the sensitive data in your dataset and apply the appropriate Structural generators to de-identify your data.
When all of the sensitive information is properly de-identified, you can use the Structural SDK to de-identify data using a Spark program. For details, go to the Tonic SDK Javadoc.
From the top right corner of the Structural application, click SDK Setup, then follow the instructions to get started.
At a high level:
Download the Structural SDK JAR file and place it on the Spark master node. Your Spark jobs must include this JAR file, as shown in the sketch after this list.
Create an API token.
Take note of your workspace ID.
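As an illustration only, a job might be submitted with the SDK JAR added to the classpath with --jars. The file names, paths, and main class below are placeholders, not the actual name of the downloaded JAR or of your job.

```bash
# Placeholder names: substitute the SDK JAR that you downloaded from SDK Setup,
# your own application JAR, and your job's main class.
spark-submit \
  --class com.example.StructuralExampleJob \
  --jars /opt/tonic/structural-spark-sdk.jar \
  /opt/jobs/structural-example-job.jar
```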
Here is a very basic example of using the SDK:
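The sketch below assumes a Java Spark job. The Structural SDK calls appear only as commented placeholders, because the actual class and method names are defined in the Tonic SDK Javadoc; the Spark code around them is standard.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StructuralExampleJob {
    public static void main(String[] args) {
        // A normal Spark session. Enable Hive support if the workspace reads
        // from a Hive catalog.
        SparkSession spark = SparkSession.builder()
                .appName("structural-sdk-example")
                .enableHiveSupport()
                .getOrCreate();

        // Load a source table that the workspace is configured against.
        Dataset<Row> source = spark.table("sales.customers");

        // Hypothetical sketch of the SDK interaction: create a client from the
        // Structural web server URL, the API token, and the workspace ID, then
        // ask it to apply the workspace's configured generators to the
        // DataFrame. The real entry points and signatures are in the
        // Tonic SDK Javadoc.
        //
        //   TonicApi tonic = new TonicApi(structuralUrl, apiToken, workspaceId);
        //   Dataset<Row> deidentified = tonic.deidentify("sales.customers", source);
        //
        // You choose where the de-identified output goes, for example:
        //
        //   deidentified.write().mode("overwrite").parquet("s3a://example-bucket/output/customers/");

        spark.stop();
    }
}
```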
Required license: Professional or Enterprise.
Not available on Tonic Structural Cloud.
Spark SDK workspaces do not support workspace inheritance.
You can only assign the De-Identify or Truncate table modes.
For Truncate mode, the table is ignored completely. The table does not exist in the destination database.
Spark SDK workspaces only support the following generators:
Address
Categorical
Character Scramble
Company Name
Constant
Continuous
Custom Categorical
Date Truncation
HIPAA Address
Integer Key
JSON Mask
MAC Address
Name
Noise Generator
Null
Random Hash
Random Integer
Random UUID
Regex Mask
SSN
Struct Mask
Timestamp Shift Generator
UUID Key
Spark SDK workspaces do not support subsetting.
Spark SDK workspaces do not support upsert.
For Spark SDK workspaces, you cannot write the destination data to container artifacts.
For Spark SDK workspaces, you cannot write the destination data to an Ephemeral snapshot.
In addition to its native integration with Databricks and Amazon EMR, Tonic Structural also supports Spark through an SDK.
The Spark SDK allows you to incorporate Structural directly into existing Spark programs and workflows.
The Structural Spark SDK is written in Java. It can be used in existing Java, PySpark, and Scala Spark programs.
Structural process for Spark SDK
How Structural data generation works with the Spark SDK.
Structural differences and limitations
Features that are unavailable or work differently in Spark SDK workspaces.
Configure workspace data connections
Data connection settings for Spark SDK workspaces.
Run data generation from Spark
Use the Structural SDK to run data generation on Spark SDK workspaces.
The following high-level overview describes how Tonic Structural data generation is processed for the Spark SDK.
The source data for a Spark SDK workspace comes from either a Hive or Dremio catalog.
The Structural SDK on the Spark cluster calls the Structural web server to apply the configured generators and produce the output data.
A Spark SDK workspace is not configured with an output location for the transformed data. You identify the output location in your Spark program.
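As one illustration, once your program has the de-identified Dataset&lt;Row&gt; back from the SDK (how you obtain it depends on the SDK API; see the Tonic SDK Javadoc), it writes the result wherever you choose using standard Spark output options. The variable name, paths, and table names below are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class WriteOutputExample {

    // "deidentified" stands for the Dataset<Row> that the Structural SDK
    // returned in your job; the variable name, paths, and table names here
    // are placeholders.
    static void writeOutput(Dataset<Row> deidentified) {
        // Write to a file-based location that your program chooses...
        deidentified.write()
                .mode(SaveMode.Overwrite)
                .parquet("s3a://example-bucket/deidentified/customers/");

        // ...or save the result as a table in the catalog instead.
        deidentified.write()
                .mode(SaveMode.Overwrite)
                .saveAsTable("deidentified_db.customers");
    }
}
```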