The SDK requires a connection to the Tonic Structural web server. To access the SDK experience from within Structural, you must create a workspace that uses Spark as its connection type.
When configuring the SDK, the Structural application requires a connection to a catalog database to retrieve table information and data.
Structural supports Hive and Dremio catalogs.
In the workspace configuration, select Spark as the connection type, then select Self-managed as the cluster type.
Under Catalog Database, to connect to a Hive catalog using the SDK:
Under Catalog Type, click Hive.
In the Hive Catalog Database field, enter the name of the database.
In the Server field, provide the server where the database is located.
In the Port field, provide the port to use to connect to the database.
In the Username field, provide the username for the account to use to connect to the database.
In the Password field, provide the password for the specified user.
To test the connection to the Hive catalog database, click Test Hive Connection.
For Spark workspaces, you can provide where clauses to filter tables. For details, go to #table-mode-filter-tables.
The Enable partition filter validation setting indicates whether Structural should validate those filters when you create them.
By default, the setting is in the on position, and Structural validates the filters. To disable the validation, toggle Enable partition filter validation to the off position.
To connect to a Dremio catalog using the SDK:
Under Catalog Type, click Dremio.
Under Connection Method, select either Legacy ODBC or Arrow Flight.
In the Server field, provide the name of the server.
In the Port field, provide the port to use to connect to the database.
In the Username field, provide the name of the user to use to connect to the database.
In the Password field, provide the password for the specified user.
By default, the source data contains all of the schemas. To limit the data to specific schemas, in the Schema(s) field, enter the list of schemas.
If you selected Legacy ODBC as the connection method, then in the Delegation Username field, enter the name of the delegation user.
By default, SSL is enabled, and Enable SSL/TLS is in the on position. We strongly recommend that you do not turn off SSL.
To test the connection to the Dremio catalog, click Test Dremio Connection.
By default, data generation is not blocked as long as schema changes do not conflict with your workspace configuration.
To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.
After you connect to the Spark cluster, you can begin to use Tonic Structural as normal. You identify the sensitive data in your dataset and apply the appropriate Structural generators to de-identify your data.
When all of the sensitive information is properly de-identified, you can use the Structural SDK to de-identify data using a Spark program. For details, go to the Tonic SDK Javadoc.
From the top right corner of the Structural application, click SDK Setup, then follow the instructions to get started.
At a high level:
Download the Structural SDK JAR file and place it on the Spark master node. Your Spark jobs must include this JAR file, as shown in the sketch after this list.
Create an API token.
Take note of your workspace ID.
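As an illustration only, a job might be submitted with the SDK JAR added to the classpath with --jars. The file names, paths, and main class below are placeholders, not the actual name of the downloaded JAR or of your job.

```bash
# Placeholder names: substitute the SDK JAR that you downloaded from SDK Setup,
# your own application JAR, and your job's main class.
spark-submit \
  --class com.example.StructuralExampleJob \
  --jars /opt/tonic/structural-spark-sdk.jar \
  /opt/jobs/structural-example-job.jar
```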
Here is a very basic example of using the SDK:
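The sketch below assumes a Java Spark job. The Structural SDK calls appear only as commented placeholders, because the actual class and method names are defined in the Tonic SDK Javadoc; the Spark code around them is standard.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StructuralExampleJob {
    public static void main(String[] args) {
        // A normal Spark session. Enable Hive support if the workspace reads
        // from a Hive catalog.
        SparkSession spark = SparkSession.builder()
                .appName("structural-sdk-example")
                .enableHiveSupport()
                .getOrCreate();

        // Load a source table that the workspace is configured against.
        Dataset<Row> source = spark.table("sales.customers");

        // Hypothetical sketch of the SDK interaction: create a client from the
        // Structural web server URL, the API token, and the workspace ID, then
        // ask it to apply the workspace's configured generators to the
        // DataFrame. The real entry points and signatures are in the
        // Tonic SDK Javadoc.
        //
        //   TonicApi tonic = new TonicApi(structuralUrl, apiToken, workspaceId);
        //   Dataset<Row> deidentified = tonic.deidentify("sales.customers", source);
        //
        // You choose where the de-identified output goes, for example:
        //
        //   deidentified.write().mode("overwrite").parquet("s3a://example-bucket/output/customers/");

        spark.stop();
    }
}
```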
Required license: Professional or Enterprise.
Not available on Tonic Structural Cloud.
Spark SDK workspaces do not support workspace inheritance.
You can only assign the De-Identify or Truncate table modes.
For Truncate mode, the table is ignored completely. The table does not exist in the destination database.
Spark SDK workspaces only support the following generators:
Address
Categorical
Character Scramble
Company Name
Constant
Continuous
Custom Categorical
Date Truncation
HIPAA Address
Integer Key
JSON Mask
MAC Address
Name
Noise Generator
Null
Random Hash
Random Integer
Random UUID
Regex Mask
SSN
Struct Mask
Timestamp Shift Generator
UUID Key
Spark SDK workspaces do not support subsetting.
Spark SDK workspaces do not support upsert.
For Spark SDK workspaces, you cannot write the destination data to container artifacts.
For Spark SDK workspaces, you cannot write the destination data to an Ephemeral snapshot.
In addition to its native integration with Databricks and Amazon EMR, Tonic Structural also supports Spark through an SDK.
The Spark SDK allows you to incorporate Structural directly into existing Spark programs and workflows.
The Structural Spark SDK is written in Java. It can be used in existing Java, PySpark, and Scala Spark programs.
Structural process for Spark SDK
How Structural data generation works with the Spark SDK.
Structural differences and limitations
Features that are unavailable or work differently in Spark SDK workspaces.
Configure workspace data connections
Data connection settings for Spark SDK workspaces.
Run data generation from Spark
Use the Structural SDK to run data generation on Spark SDK workspaces.
The following high-level overview describes how Tonic Structural data generation is processed for the Spark SDK.
The source data for a Spark SDK workspace comes from either a Hive or Dremio catalog.
The Structural SDK on the Spark cluster calls the Structural web server to apply the configured generators and produce the output data.
A Spark SDK workspace is not configured with an output location for the transformed data. You identify the output location in your Spark program.
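As one illustration, once your program has the de-identified Dataset&lt;Row&gt; back from the SDK (how you obtain it depends on the SDK API; see the Tonic SDK Javadoc), it writes the result wherever you choose using standard Spark output options. The variable name, paths, and table names below are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class WriteOutputExample {

    // "deidentified" stands for the Dataset<Row> that the Structural SDK
    // returned in your job; the variable name, paths, and table names here
    // are placeholders.
    static void writeOutput(Dataset<Row> deidentified) {
        // Write to a file-based location that your program chooses...
        deidentified.write()
                .mode(SaveMode.Overwrite)
                .parquet("s3a://example-bucket/deidentified/customers/");

        // ...or save the result as a table in the catalog instead.
        deidentified.write()
                .mode(SaveMode.Overwrite)
                .saveAsTable("deidentified_db.customers");
    }
}
```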