LogoLogo
Release notesAPI docsDocs homeStructural CloudTonic.ai
  • Tonic Structural User Guide
  • About Tonic Structural
    • Structural data generation workflow
    • Structural deployment types
    • Structural implementation roles
    • Structural license plans
  • Logging into Structural for the first time
  • Getting started with the Structural free trial
  • Managing your user account
  • Frequently Asked Questions
  • Tutorial videos
  • Creating and managing workspaces
    • Managing workspaces
      • Viewing your list of workspaces
      • Creating, editing, or deleting a workspace
      • Workspace configuration settings
        • Workspace identification and connection type
        • Data connection settings
        • Configuring secrets managers for database connections
        • Data generation settings
        • Enabling and configuring upsert
        • Writing output to Tonic Ephemeral
        • Writing output to a container repository
        • Advanced workspace overrides
      • About the workspace management view
      • About workspace inheritance
      • Assigning tags to a workspace
      • Exporting and importing the workspace configuration
    • Managing access to workspaces
      • Sharing workspace access
      • Transferring ownership of a workspace
    • Viewing workspace jobs and job details
  • Configuring data generation
    • Privacy Hub
    • Database View
      • Viewing and configuring tables
      • Viewing the column list
      • Displaying sample data for a column
      • Configuring an individual column
      • Configuring multiple columns
      • Identifying similar columns
      • Commenting on columns
    • Table View
    • Working with document-based data
      • Performing scans on collections
      • Using Collection View
    • Identifying sensitive data
      • Running the Structural sensitivity scan
      • Manually indicating whether a column is sensitive
      • Built-in sensitivity types that Structural detects
      • Creating and managing custom sensitivity rules
    • Table modes
    • Generator information
      • Generator summary
      • Generator reference
        • Address
        • Algebraic
        • Alphanumeric String Key
        • Array Character Scramble
        • Array JSON Mask
        • Array Regex Mask
        • ASCII Key
        • Business Name
        • Categorical
        • Character Scramble
        • Character Substitution
        • Company Name
        • Conditional
        • Constant
        • Continuous
        • Cross Table Sum
        • CSV Mask
        • Custom Categorical
        • Date Truncation
        • Email
        • Event Timestamps
        • File Name
        • Find and Replace
        • FNR
        • Geo
        • HIPAA Address
        • Hostname
        • HStore Mask
        • HTML Mask
        • Integer Key
        • International Address
        • IP Address
        • JSON Mask
        • MAC Address
        • Mongo ObjectId Key
        • Name
        • Noise Generator
        • Null
        • Numeric String Key
        • Passthrough
        • Phone
        • Random Boolean
        • Random Double
        • Random Hash
        • Random Integer
        • Random Timestamp
        • Random UUID
        • Regex Mask
        • Sequential Integer
        • Shipping Container
        • SIN
        • SSN
        • Struct Mask
        • Timestamp Shift Generator
        • Unique Email
        • URL
        • UUID Key
        • XML Mask
      • Generator characteristics
        • Enabling consistency
        • Linking generators
        • Differential privacy
        • Partitioning a column
        • Data-free generators
        • Supporting uniqueness constraints
        • Format-preserving encryption (FPE)
      • Generator types
        • Composite generators
        • Primary key generators
    • Generator assignment and configuration
      • Reviewing and applying recommended generators
      • Assigning and configuring generators
      • Document View for file connector JSON columns
      • Generator hints and tips
      • Managing generator presets
      • Configuring and using Structural data encryption
      • Custom value processors
    • Subsetting data
      • About subsetting
      • Using table filtering for data warehouses and Spark-based data connectors
      • Viewing the current subsetting configuration
      • Subsetting and foreign keys
      • Configuring subsetting
      • Viewing and managing configuration inheritance
      • Viewing the subset creation steps
      • Viewing previous subsetting data generation runs
      • Generating cohesive subset data from related databases
      • Other subsetting hints and tips
    • Viewing and adding foreign keys
    • Viewing and resolving schema changes
    • Tracking changes to workspaces, generator presets, and sensitivity rules
    • Using the Privacy Report to verify data protection
  • Running data generation
    • Running data generation jobs
      • Types of data generation
      • Data generation process
      • Running data generation manually
      • Scheduling data generation
      • Issues that prevent data generation
    • Managing data generation performance
    • Viewing and downloading container artifacts
    • Post-job scripts
    • Webhooks
  • Installing and Administering Structural
    • Structural architecture
    • Using Structural securely
    • Deploying a self-hosted Structural instance
      • Deployment checklist
      • System requirements
      • Deploying with Docker Compose
      • Deploying on Kubernetes with Helm
      • Enabling the option to write output data to a container repository
        • Setting up a Kubernetes cluster to use to write output data to a container repository
        • Required access to write destination data to a container repository
      • Entering and updating your license key
      • Setting up host integration
      • Working with the application database
      • Setting up a secret
      • Setting a custom certificate
    • Using Structural Cloud
      • Structural Cloud notes
      • Setting up and managing a Structural Cloud pay-as-you-go subscription
      • Structural Cloud onboarding
    • Managing user access to Structural
      • Structural organizations
      • Determining whether users can create accounts
      • Creating a new account in an existing organization
      • Single sign-on (SSO)
        • Structural user authentication with SSO
        • Enabling and configuring SSO on Structural Cloud
        • Synchronizing SSO groups with Structural
        • Viewing the list of SSO groups in Tonic Structural
        • AWS IAM Identity Center
        • Duo
        • GitHub
        • Google
        • Keycloak
        • Microsoft Entra ID (previously Azure Active Directory)
        • Okta
        • OpenID Connect (OIDC)
        • SAML
      • Managing Structural users
      • Managing permissions
        • About permission sets
        • Built-in permission sets
        • Available permissions
        • Viewing the lists of global and workspace permission sets
        • Configuring custom permission sets
        • Selecting default permission sets
        • Configuring access to global permission sets
        • Setting initial access to all global permissions
        • Granting Account Admin access for a Structural Cloud organization
    • Structural monitoring and logging
      • Monitoring Structural services
      • Performing health checks
      • Downloading the usage report
      • Tracking user access and permissions
      • Redacted and diagnostic (unredacted) logs
      • Data that Tonic.ai collects
      • Verifying and enabling telemetry sharing
    • Configuring environment settings
    • Updating Structural
  • Connecting to your data
    • About data connectors
    • Overview for database administrators
    • Data connector summary
    • Amazon DynamoDB
      • System requirements and limitations for DynamoDB
      • Structural differences and limitations with DynamoDB
      • Before you create a DynamoDB workspace
      • Configuring DynamoDB workspace data connections
    • Amazon EMR
      • Structural process overview for Amazon EMR
      • System requirements for Amazon EMR
      • Structural differences and limitations with Amazon EMR
      • Before you create an Amazon EMR workspace
        • Creating IAM roles for Structural and Amazon EMR
        • Creating Athena workgroups
        • Configuration for cross-account setups
      • Configuring Amazon EMR workspace data connections
    • Amazon Redshift
      • Structural process overview for Amazon Redshift
      • Structural differences and limitations with Amazon Redshift
      • Before you create an Amazon Redshift workspace
        • Required AWS instance profile permissions for Amazon Redshift
        • Setting up the AWS Lambda role for Amazon Redshift
        • AWS KMS permissions for Amazon SQS message encryption
        • Amazon Redshift-specific Structural environment settings
        • Source and destination database permissions for Amazon Redshift
      • Configuring Amazon Redshift workspace data connections
    • Databricks
      • Structural process overview for Databricks
      • System requirements for Databricks
      • Structural differences and limitations with Databricks
      • Before you create a Databricks workspace
        • Granting access to storage
        • Setting up your Databricks cluster
        • Configuring the destination database schema creation
      • Configuring Databricks workspace data connections
    • Db2 for LUW
      • System requirements for Db2 for LUW
      • Structural differences and limitations with Db2 for LUW
      • Before you create a Db2 for LUW workspace
      • Configuring Db2 for LUW workspace data connections
    • File connector
      • Overview of the file connector process
      • Supported file and content types
      • Structural differences and limitations with the file connector
      • Before you create a file connector workspace
      • Configuring the file connector storage type and output options
      • Managing file groups in a file connector workspace
      • Downloading generated file connector files
    • Google BigQuery
      • Structural differences and limitations with Google BigQuery
      • Before you create a Google BigQuery workspace
      • Configuring Google BigQuery workspace data connections
      • Resolving schema changes for de-identified views
    • MongoDB
      • System requirements for MongoDB
      • Structural differences and limitations with MongoDB
      • Configuring MongoDB workspace data connections
      • Other MongoDB hints and tips
    • MySQL
      • System requirements for MySQL
      • Before you create a MySQL workspace
      • Configuring MySQL workspace data connections
    • Oracle
      • Known limitations for Oracle schema objects
      • System requirements for Oracle
      • Structural differences and limitations with Oracle
      • Before you create an Oracle workspace
      • Configuring Oracle workspace data connections
    • PostgreSQL
      • System requirements for PostgreSQL
      • Before you create a PostgreSQL workspace
      • Configuring PostgreSQL workspace data connections
    • Salesforce
      • System requirements for Salesforce
      • Structural differences and limitations with Salesforce
      • Before you create a Salesforce workspace
      • Configuring Salesforce workspace data connections
    • Snowflake on AWS
      • Structural process overviews for Snowflake on AWS
      • Structural differences and limitations with Snowflake on AWS
      • Before you create a Snowflake on AWS workspace
        • Required AWS instance profile permissions for Snowflake on AWS
        • Other configuration for Lambda processing
        • Source and destination database permissions for Snowflake on AWS
        • Configuring whether Structural creates the Snowflake on AWS destination database schema
      • Configuring Snowflake on AWS workspace data connections
    • Snowflake on Azure
      • Structural process overview for Snowflake on Azure
      • Structural differences and limitations with Snowflake on Azure
      • Before you create a Snowflake on Azure workspace
      • Configuring Snowflake on Azure workspace data connections
    • Spark SDK
      • Structural process overview for the Spark SDK
      • Structural differences and limitations with the Spark SDK
      • Configuring Spark SDK workspace data connections
      • Using Spark to run de-identification of the data
    • SQL Server
      • System requirements for SQL Server
      • Before you create a SQL Server workspace
      • Configuring SQL Server workspace data connections
    • Yugabyte
      • System requirements for Yugabyte
      • Structural differences and limitations with Yugabyte
      • Before you create a Yugabyte workspace
      • Configuring Yugabyte workspace data connections
      • Troubleshooting Yugabyte data generation issues
  • Using the Structural API
    • About the Structural API
    • Getting an API token
    • Getting the workspace ID
    • Using the Structural API to perform tasks
      • Configure environment settings
      • Manage generator presets
        • Retrieving the list of generator presets
        • Structure of a generator preset
        • Creating a custom generator preset
        • Updating an existing generator preset
        • Deleting a generator preset
      • Manage custom sensitivity rules
      • Create a workspace
      • Connect to source and destination data
      • Manage file groups in a file connector workspace
      • Assign table modes and filters to source database tables
      • Set column sensitivity
      • Assign generators to columns
        • Getting the generator IDs and available metadata
        • Updating generator configurations
        • Structure of a generator assignment
        • Generator API reference
          • Address (AddressGenerator)
          • Algebraic (AlgebraicGenerator)
          • Alphanumeric String Key (AlphaNumericPkGenerator)
          • Array Character Scramble (ArrayTextMaskGenerator)
          • Array JSON Mask (ArrayJsonMaskGenerator)
          • Array Regex Mask (ArrayRegexMaskGenerator)
          • ASCII Key (AsciiPkGenerator)
          • Business Name (BusinessNameGenerator)
          • Categorical (CategoricalGenerator)
          • Character Scramble (TextMaskGenerator)
          • Character Substitution (StringMaskGenerator)
          • Company Name (CompanyNameGenerator)
          • Conditional (ConditionalGenerator)
          • Constant (ConstantGenerator)
          • Continuous (GaussianGenerator)
          • Cross Table Sum (CrossTableAggregateGenerator)
          • CSV Mask (CsvMaskGenerator)
          • Custom Categorical (CustomCategoricalGenerator)
          • Date Truncation (DateTruncationGenerator)
          • Email (EmailGenerator)
          • Event Timestamps (EventGenerator)
          • File Name (FileNameGenerator)
          • Find and Replace (FindAndReplaceGenerator)
          • FNR (FnrGenerator)
          • Geo (GeoGenerator)
          • HIPAA Address (HipaaAddressGenerator)
          • Hostname (HostnameGenerator)
          • HStore Mask (HStoreMaskGenerator)
          • HTML Mask (HtmlMaskGenerator)
          • Integer Key (IntegerPkGenerator)
          • International Address (InternationalAddressGenerator)
          • IP Address (IPAddressGenerator)
          • JSON Mask (JsonMaskGenerator)
          • MAC Address (MACAddressGenerator)
          • Mongo ObjectId Key (ObjectIdPkGenerator)
          • Name (NameGenerator)
          • Noise Generator (NoiseGenerator)
          • Null (NullGenerator)
          • Numeric String Key (NumericStringPkGenerator)
          • Passthrough (PassthroughGenerator)
          • Phone (USPhoneNumberGenerator)
          • Random Boolean (RandomBooleanGenerator)
          • Random Double (RandomDoubleGenerator)
          • Random Hash (RandomStringGenerator)
          • Random Integer (RandomIntegerGenerator)
          • Random Timestamp (RandomTimestampGenerator)
          • Random UUID (UUIDGenerator)
          • Regex Mask (RegexMaskGenerator)
          • Sequential Integer (UniqueIntegerGenerator)
          • Shipping Container (ShippingContainerGenerator)
          • SIN (SINGenerator)
          • SSN (SsnGenerator)
          • Struct Mask (StructMaskGenerator)
          • Timestamp Shift (TimestampShiftGenerator)
          • Unique Email (UniqueEmailGenerator)
          • URL (UrlGenerator)
          • UUID Key (UuidPkGenerator)
          • XML Mask (XmlMaskGenerator)
      • Configure subsetting
      • Check for and resolve schema changes
      • Run data generation jobs
      • Schedule data generation jobs
    • Example script: Starting a data generation job
    • Example script: Polling for a job status and creating a Docker package
Powered by GitBook
On this page
  • Generators that automatically have differential privacy
  • Generators for which differential privacy is configurable
  • Categorical generator
  • Continuous generator
  • More details: Mathematical formulation
  • Privacy budget
  • A simple example: counting
  • Approximate differential privacy
  • References

Was this helpful?

Export as PDF
  1. Configuring data generation
  2. Generator information
  3. Generator characteristics

Differential privacy

Last updated 4 months ago

Was this helpful?

Differential privacy is one technique that Tonic Structural uses to ensure the privacy of your data.

Differential privacy limits the effect of a single source record or user on the destination data. Someone who views the output of a process that has differential privacy cannot determine whether a particular individual's information was used to generate that output.

Data that is protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised.

Generators that automatically have differential privacy

Any generator that does not use the underlying data at all is considered "". A data-free generator always has differential privacy.

Several Structural generators are either always data-free, or are data-free if consistency is not enabled.

Generators for which differential privacy is configurable

The configuration options for the and generators include a Differential Privacy toggle to enable or disable differential privacy.

Categorical generator

The Categorical generator shuffles the values of a column, but preserves the overall frequency of the values. Note that NULL is considered its own category of value.

Differential privacy (disabled by default) further protects the privacy of your data. It\y:

  1. First, adds noise to the frequencies of categories.

  2. Next, if needed, removes rare categories from the possible samples.

These steps ensure that a single row of source data has limited influence on the output values. By default, the for this generator is ε=1,\varepsilon = 1,ε=1,with δ=110n\delta = \frac{1}{10n}δ=10n1​, where nnnis the number of rows.

Differential privacy is not appropriate when the data in each row is unique or nearly unique. As a general rule of thumb, categories that are represented by fewer than 15 rows are at risk of being suppressed.

Structural warns you when a column isn’t suitable for differential privacy. A column is not suitable for differential privacy if most or all categories have fewer than 15 rows.

Continuous generator

The Continuous generator produces samples that preserve the individual column distributions and correlations between columns.

More details: Mathematical formulation

Privacy budget

A simple example: counting

Suppose we want to count the number of users in a database that have some sensitive property. For example, the number of users with a particular medical diagnosis.

Approximate differential privacy

A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise drawn that is from a wider array of distributions than the Laplace distribution.

References

When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [].

The default privacy budget for this generator is ε=1,\varepsilon = 1,ε=1,with δ=110n\delta = \frac{1}{10n}δ=10n1​.

Differential privacy is a property of a randomized algorithm M\mathcal{M}M, which takes as input a database DDD and produces some output M(D).\mathcal{M}(D).M(D).The outputs could be counts or summary statistics or synthetic databases — the specific type is not important for this formulation.

For this formulation, we say two databases DDDand D′D'D′are neighbors if they differ by a single row.

For a given ε≥0\varepsilon\geq 0ε≥0, we say that M\mathcal{M}Mis ε\varepsilonεdifferentially private if, for all subset of outputs O\mathcal{O}O, we have:

P⁡(M(D)∈O)≤eεP⁡(M(D′)∈O)\operatorname{P}(\mathcal{M}(D)\in \mathcal{O}) \leq e^\varepsilon \operatorname{P}(\mathcal{M}(D')\in \mathcal{O}) P(M(D)∈O)≤eεP(M(D′)∈O)

When δ\deltaδis non-zero, this is sometimes called approximately differentially private.

The parameter ε\varepsilonε is the privacy budget of the algorithm, and quantifies in a precise sense an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database.

Suppose an attacker suspects that our secret databaseDDDis one of two possible neighboring databases D1,D2D_1, D_2D1​,D2​, with some fixed odds.

IfM\mathcal{M}Mis ε\varepsilonεdifferentially private, then observing M(D)\mathcal{M}(D)M(D)updates the the attacker's log odds of D=D1D=D_1D=D1​vs D=D2D=D_2D=D2​by at most ε\varepsilonε.

The closer ε\varepsilonεis to 000, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from M(D)\mathcal{M}(D)M(D).

Conversely, larger values of ε\varepsilonε mean that an attacker can possibly learn significant information by observing M(D)\mathcal{M}(D)M(D).

Dwork, McSherry, Nissim and Smith introduced in [] the Laplace Mechanism as a way to publish these counts in a secure way, by adding noise sampled from the Laplace distribution.

This noise affords us plausible deniability. If the underlying count changed by ±1\pm 1±1, then the probability of observing the same noisy output does not change by much:

P⁡(Noised=k∣count=n±1)≤exp⁡(1)P⁡(Noised=k∣count=n)\operatorname{P}( \text{Noised} = k | \text{count} = n \pm 1) \leq \operatorname{exp}(1) \operatorname{P}(\text{Noised} = k | \text{count} = n)P(Noised=k∣count=n±1)≤exp(1)P(Noised=k∣count=n)

We illustrate this visually, showing the probability density function (pdf) of the observed values given true counts of nnn (blue), n+1n + 1n+1 (orange), and n−1n - 1n−1 (green).

The Laplace Mechanism

The blue shaded region shows that the the possibly noisy count values for n−1n-1n−1 and n+1n+1n+1 lie within a factor of exp⁡(1)\exp(1)exp(1) of the noisy count values of nnn, so this mechanism is differentially private with ε=1\varepsilon=1ε=1.

For example, the AnalyzeGauss mechanisms of [], and differentially private gradient descent of [], use Gaussian noise as a fundamental ingredient, which requires the following relaxation:

For a given ε≥0\varepsilon\geq 0ε≥0 and δ∈[0,1]\delta\in[0,1]δ∈[0,1], we say that M\mathcal{M}Mis (ε,δ)(\varepsilon,\delta)(ε,δ)differentially private if, for all subset of outputs O\mathcal{O}O, we have:

P⁡(M(D)∈O)≤eεP⁡(M(D′)∈O)+δ\operatorname{P}(\mathcal{M}(D)\in \mathcal{O}) \leq e^\varepsilon \operatorname{P}(\mathcal{M}(D')\in \mathcal{O}) +\deltaP(M(D)∈O)≤eεP(M(D′)∈O)+δ

The parameter δ\deltaδis often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism that reveals a sensitive database with probability δ\deltaδ, in practice this is not a plausible outcome with carefully designed mechanisms. Also, taking δ\deltaδto be small relative to the size of the database ensures that the risk of disclosure is low.

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI:

Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. 2006 Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography. (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI:

Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI:

Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI:

4
2
4
1
https://doi.org/10.1145/2976749.2978318
https://doi.org/10.1007/11681878_14
https://doi.org/10.1561/0400000042
https://doi.org/10.1145/2591796.2591883
data-free
Categorical
Continuous
privacy budget