Consistency
Map the same input values to the same output values across multiple columns, tables, and databases.
Linking
Identify columns that use the same generator and that are inter-dependent or correlated.
Differential privacy
Ensures that the output does not reveal anything that is attributable to a specific member of the source data.
Data-free generators
Indicates that the generator output is completely unrelated to the input.
Column partitioning
Base the value of a column on other related columns.
Uniqueness constraints
Generators that you can use on columns that have uniqueness constraints.
Format-preserving encryption (FPE)
Encrypts data in such a way that the output is in the same format as the input.
Format-preserving encryption (FPE) encrypts data in such a way that the output is in the same format as the input. For example, a number in the input produces a number in the generated output.
For the following generators, Tonic Structural uses FPE to encrypt the generated values. Note that the Structural implementation of FPE might not guarantee compliance with standards. For example, the ASCII Key generator does not guarantee that the length of the output data matches the length of the input data.
Each generator supports a specific input character set or domain.
When a generator attempts to process data that is not within the expected domain, it results in encryption errors. For example, the Numeric String Key generator cannot process a string that includes non-numeric characters such as letters or symbols. The UUID Key generator cannot process any value that is not a valid UUID.
Encryption errors usually mean that the column contains values that are incompatible with the selected generator. To address this, choose a different generator.
One option is the ASCII Key generator, which has very few restrictions on the allowed values.
Another option is to use the Conditional generator, which allows you to assign different generators based on column values.
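As a rough pre-check before you choose a generator, you could scan the column values for anything outside the expected domain. The sketch below is not part of Structural. The function names are hypothetical, and the checks only approximate the domains described above (ASCII digits for the Numeric String Key generator, a parseable UUID for the UUID Key generator).

```python
import uuid


def is_numeric_string(value: str) -> bool:
    """True if the value contains only ASCII digits (assumed Numeric String Key domain)."""
    return value.isascii() and value.isdigit()


def is_uuid(value: str) -> bool:
    """True if Python's uuid module can parse the value as a UUID."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def out_of_domain(values, check):
    """Return the values that fall outside the domain tested by `check`."""
    return [v for v in values if not check(v)]


sample = ["12345", "00712", "12-345", "abc"]
print(out_of_domain(sample, is_numeric_string))  # ['12-345', 'abc']
```

Note that `uuid.UUID` is lenient about formatting (for example, it accepts values without hyphens), so a value that passes this check is not guaranteed to be accepted by a stricter parser.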
Partitioning allows the value of a column to be based on the values of other related columns. It is one way to generate more realistic destination values.
The following generators support partitioning:
Note that partitioning cannot be configured as part of a generator preset. You can only configure partitioning when you configure a specific column.
To enable partitioning, from the Partition by dropdown list, you choose one or more columns to partition by.
You can only choose columns that have the generator set to Passthrough or Categorical.
For each value or combination of values in the partitioning columns, Tonic Structural generates a distribution of values for the original column.
For example, you assign the Continuous generator to an Income column, and partition it by an Occupation column. For each Occupation value, Structural generates a distribution of Income values. In other words, it generates a range of incomes for each occupation, such as Doctor and Construction Worker.
If you choose multiple columns, then the distribution is for each combination of column values. For example, you partition by both Occupation and Region. Structural creates a distribution of income values for each combination of occupation and region. So there is a distribution for Doctor and Northeast, and a different distribution for Doctor and Southeast.
In the destination database, Structural sets the value of the partitioned column to a value from the appropriate distribution. The distribution that Structural uses is based on the value of the partitioning columns in the destination database, not the original value of the partitioning columns in the source database.
To continue our example, assume that the Occupation column uses the Categorical generator. During data generation, Structural assigns to each record a random occupation value from the current values. For one of the records, the occupation value is Doctor in the source database and Construction Worker in the destination database.
For the Income column for that record, Structural assigns a value from the distribution of income values for the Construction Worker occupation. In other words, it assigns an income value that is realistic for the destination occupation value based on the source data.
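The Income and Occupation example can be sketched roughly as follows. This is an illustration only, not how Structural models distributions: the sample rows are made up, the helper names are hypothetical, and a uniform draw between the observed minimum and maximum stands in for the per-partition distribution.

```python
import random
from collections import defaultdict

# Hypothetical source rows: (occupation, income).
source_rows = [
    ("Doctor", 210_000), ("Doctor", 250_000), ("Doctor", 190_000),
    ("Construction Worker", 48_000), ("Construction Worker", 55_000),
    ("Construction Worker", 61_000),
]


def build_partitions(rows):
    """Group source incomes by the partitioning value (occupation)."""
    partitions = defaultdict(list)
    for occupation, income in rows:
        partitions[occupation].append(income)
    return partitions


def generate_income(destination_occupation, partitions, rng):
    """Draw an income from the range observed for the destination occupation."""
    incomes = partitions[destination_occupation]
    return rng.uniform(min(incomes), max(incomes))


rng = random.Random(0)
partitions = build_partitions(source_rows)

# A record that is Doctor in the source but Construction Worker in the
# destination gets an income drawn from the Construction Worker partition.
income = generate_income("Construction Worker", partitions, rng)
```

The lookup uses the destination occupation, which is why the generated income stays realistic for the occupation that actually appears in the output.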
Some generators can be data-free. When a generator is data-free, it means that the output data is completely unrelated to the source data. There is no way to use the output data to uncover the source data. Data-free generators implicitly have differential privacy. A generator is not data-free if consistency is enabled.
The following generators are always data-free:
The following generators are data-free only when consistency is disabled:
Company Name (deprecated)
A column that has a uniqueness constraint must have a unique value for every record.
Primary key columns automatically require uniqueness. Uniqueness can also be required for other columns. For example, in a users table, userid is the primary key column, but username also must be unique.
The following generators can be used with columns that have uniqueness constraints:
Differential privacy is one technique that Tonic Structural uses to ensure the privacy of your data.
Differential privacy limits the effect of a single source record or user on the destination data. Someone who views the output of a process that has differential privacy cannot determine whether a particular individual's information was used to generate that output.
Data that is protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised.
Any generator that does not use the underlying data at all is considered "data-free". A data-free generator always has differential privacy.
Several Structural generators are either always data-free, or are data-free if consistency is not enabled.
The configuration options for the Categorical and Continuous generators include a Differential Privacy toggle to enable or disable differential privacy.
The Categorical generator shuffles the values of a column while preserving the overall frequency of the values. Note that NULL is considered its own category of value.
Differential privacy (disabled by default) further protects the privacy of your data by:
First, adding noise to the frequencies of categories.
After that, if needed, removing rare categories from the possible samples.
These steps ensure that a single row of source data has limited influence on the output values. By default, the privacy budget for this generator is an $\varepsilon$ value paired with a $\delta$ value that is expressed in terms of $n$, where $n$ is the number of rows.
Differential privacy is not appropriate when the data in each row is unique or nearly unique. As a general rule of thumb, categories that are represented by fewer than 15 rows are at risk of being suppressed.
Structural warns you when a column isn’t suitable for differential privacy. A column is not suitable for differential privacy if most or all categories have fewer than 15 rows.
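The two steps (add noise to category frequencies, then drop rare categories) can be sketched as below. This is a minimal illustration under stated assumptions, not Structural's mechanism: the text does not say which noise distribution or cutoff is used, so the Laplace noise, the `1 / epsilon` scale, and the `threshold` parameter are all assumptions, and the sketch is not a calibrated privacy guarantee.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_category_counts(values, epsilon, threshold, rng):
    """Add noise to each category count, then drop categories below the threshold."""
    counts = Counter(values)  # None (NULL) is counted as its own category
    noisy = {cat: n + laplace_noise(1 / epsilon, rng) for cat, n in counts.items()}
    return {cat: n for cat, n in noisy.items() if n >= threshold}


rng = random.Random(42)
column = ["Phoenix"] * 500 + ["Baltimore"] * 300 + ["Tinytown"] * 2
kept = noisy_category_counts(column, epsilon=1.0, threshold=15, rng=rng)
# The two large categories survive; "Tinytown" (2 rows) is suppressed
# unless the noise happens to be very large.
```

Sampling replacement values only from the categories in `kept` is what limits the influence of any single source row on the output.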
The Continuous generator produces samples that preserve the individual column distributions and correlations between columns.
When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [4].
Suppose we want to count the number of users in a database that have some sensitive property, for example, the number of users with a particular medical diagnosis.
In [2], Dwork, McSherry, Nissim, and Smith introduced the Laplace Mechanism as a way to publish these counts securely, by adding noise sampled from the Laplace distribution.
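A minimal sketch of the Laplace Mechanism for a count, using only the Python standard library. The function name, the true count of 100, and the budget of 0.5 are hypothetical. The noise scale of `1 / epsilon` assumes that adding or removing one row changes the count by at most 1.

```python
import random


def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace(0, 1/epsilon) noise added.

    A count changes by at most 1 when one row is added or removed,
    so the noise scale is 1 / epsilon.
    """
    scale = 1 / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(7)
releases = [laplace_count(100, epsilon=0.5, rng=rng) for _ in range(10_000)]
average = sum(releases) / len(releases)
```

Each individual release is perturbed, but the noise has mean zero, so the average over many independent releases stays close to the true count. This is also why repeated releases of the same statistic consume additional privacy budget.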
A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise that is drawn from a wider array of distributions than the Laplace distribution.
For example, the Analyze Gauss mechanism of [4] and the differentially private gradient descent of [1] use Gaussian noise as a fundamental ingredient, which requires the relaxed $(\varepsilon, \delta)$ formulation of differential privacy.
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI:https://doi.org/10.1145/2976749.2978318
[2] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI:https://doi.org/10.1007/11681878_14
[3] Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI:https://doi.org/10.1561/0400000042
[4] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI:https://doi.org/10.1145/2591796.2591883
The Continuous generator also has a default privacy budget, given as an $\varepsilon$ value with a $\delta$ value.
Differential privacy is a property of a randomized algorithm $\mathcal{M}$, which takes as input a database $D$ and produces some output $\mathcal{M}(D)$. The outputs could be counts, summary statistics, or synthetic databases. The specific type is not important for this formulation.
For this formulation, we say two databases $D$ and $D'$ are neighbors if they differ by a single row.
For a given $\varepsilon > 0$, we say that $\mathcal{M}$ is $\varepsilon$-differentially private if, for all neighboring databases $D$ and $D'$ and all subsets of outputs $S$, we have:
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]$$
When the additional parameter $\delta$ of the relaxed definition is non-zero, the algorithm is sometimes called approximately differentially private.
The parameter $\varepsilon$ is the privacy budget of the algorithm. It quantifies, in a precise sense, an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database.
Suppose an attacker suspects that our secret database $D$ is one of two possible neighboring databases $D_1$ and $D_2$, with some fixed odds.
If $\mathcal{M}$ is $\varepsilon$-differentially private, then observing $\mathcal{M}(D)$ updates the attacker's log odds of $D_1$ vs $D_2$ by at most $\varepsilon$.
The closer $\varepsilon$ is to $0$, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from $\mathcal{M}(D)$.
Conversely, larger values of $\varepsilon$ mean that an attacker can possibly learn significant information by observing $\mathcal{M}(D)$.
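The log-odds statement follows from Bayes' rule. A short derivation, with notation assumed here for illustration: $\mathcal{M}$ is the algorithm, $D_1$ and $D_2$ are the two candidate neighboring databases, $o$ is the observed output, and outputs are treated as discrete for simplicity.

```latex
\[
\underbrace{\log\frac{\Pr[D = D_1 \mid \mathcal{M}(D) = o]}{\Pr[D = D_2 \mid \mathcal{M}(D) = o]}}_{\text{posterior log odds}}
=
\underbrace{\log\frac{\Pr[D = D_1]}{\Pr[D = D_2]}}_{\text{prior log odds}}
+
\log\frac{\Pr[\mathcal{M}(D_1) = o]}{\Pr[\mathcal{M}(D_2) = o]}
\]
\[
-\varepsilon \;\le\; \log\frac{\Pr[\mathcal{M}(D_1) = o]}{\Pr[\mathcal{M}(D_2) = o]} \;\le\; \varepsilon
\]
```

The second line applies the differential privacy inequality in both directions, which is valid because the neighbor relation is symmetric. The observation therefore shifts the attacker's log odds by at most $\varepsilon$ in either direction.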
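This noise affords us plausible deniability. If the underlying count changed by $1$, then the probability of observing the same noisy output does not change by much. For Laplace noise with scale $1/\varepsilon$, a true count $c$, and an observed value $x$, the probability densities satisfy:
$$e^{-\varepsilon} \le \frac{p(x \mid \text{count} = c)}{p(x \mid \text{count} = c + 1)} \le e^{\varepsilon}$$
The accompanying figure illustrates this with the probability density function (pdf) of the observed values for three different true counts (shown in blue, orange, and green).
The blue shaded region shows that the densities of the noisy count values for two of the true counts lie within a factor of $e^{\varepsilon}$ of the density for the third, so this mechanism is $\varepsilon$-differentially private.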
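The bounded-ratio property can also be checked numerically. The sketch below is an illustration with hypothetical values (a budget of 0.5 and a true count of 100). It assumes Laplace noise with scale `1 / epsilon` and compares the density of each observed value under two true counts that differ by 1.

```python
import math


def laplace_pdf(x, center, scale):
    """Density of a Laplace distribution centered at `center`."""
    return math.exp(-abs(x - center) / scale) / (2 * scale)


epsilon = 0.5          # hypothetical privacy budget
scale = 1 / epsilon    # noise scale for a count (one row changes it by at most 1)
true_count = 100       # hypothetical true count

# Compare the density of each observed value under neighboring true counts,
# for observed values from 90.0 to 110.0 in steps of 0.1.
worst_ratio = max(
    laplace_pdf(x / 10, true_count, scale) / laplace_pdf(x / 10, true_count + 1, scale)
    for x in range(900, 1101)
)
```

Over this grid, `worst_ratio` never exceeds `math.exp(epsilon)`, which is the factor that the shaded region in the figure represents.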
For a given $\varepsilon > 0$ and $\delta \in [0, 1)$, we say that $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all neighboring databases $D$ and $D'$ and all subsets of outputs $S$, we have:
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta$$
The parameter $\delta$ is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism that reveals a sensitive database with probability $\delta$, in practice this is not a plausible outcome with carefully designed mechanisms. Also, taking $\delta$ to be small relative to the inverse of the database size ensures that the risk of disclosure is low.
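To make the role of $\delta$ concrete, here is a minimal sketch of the classic Gaussian mechanism for a count. It uses the calibration given by Dwork and Roth [3], $\sigma = \sqrt{2 \ln(1.25/\delta)} \cdot \Delta / \varepsilon$, which is stated for $0 < \varepsilon < 1$. The function names and parameter values are hypothetical, and this is not necessarily the calibration that Structural uses.

```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Noise standard deviation for the classic Gaussian mechanism.

    Uses sigma = sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon,
    a calibration that is stated for 0 < epsilon < 1.
    """
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be between 0 and 1")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


def gaussian_count(true_count, epsilon, delta, rng):
    """Release a count with Gaussian noise calibrated to (epsilon, delta)."""
    return true_count + rng.gauss(0.0, gaussian_sigma(epsilon, delta))


# Hypothetical values: delta is chosen well below 1/n for n = 10,000 rows.
sigma = gaussian_sigma(epsilon=0.5, delta=1e-6)
noisy = gaussian_count(100, epsilon=0.5, delta=1e-6, rng=random.Random(11))
```

A smaller $\varepsilon$ or a smaller $\delta$ both increase `sigma`, so stronger privacy is paid for with noisier output.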
Consistency is an option for some generators that, when turned on, maps the same input to the same output across an entire database.
Consistency can also be maintained across multiple databases of varying types. For example, if consistency is turned on for a name generator, it always maps the same input name (for example, Albert Einstein) to the same output (for example, Richard Feynman).
The primary reasons for using consistency are to:
Enable joining on columns that don't have explicit database constraints in the schema. This is often seen with values such as email addresses. With consistency, you can completely anonymize an email address and still use it in a join.
Preserve the approximate cardinality of a column. For example, a city column contains 50 different cities. To randomize this column but still have ~50 cities, you can use consistency to maintain the approximate cardinality. Because consistency does not guarantee uniqueness, the cardinality might change. However, it is guaranteed to not increase. If unique 1-to-1 mappings are required, a Key generator should be used.
Match duplicated data across one or more databases. For example, you have a user database that contains a username in both a column and a JSON blob, and another database that contains their website activity, identified by the same username values. To anonymize the username, but still have the username be the same in all locations and databases, use consistency.
Self-consistency indicates that the value in the destination database is consistent with the value of the same column in the source database.
For example, a column contains a first name. You make the assigned generator self-consistent. A given first name in the source database is always replaced by the same first name in the destination database. For example, the first name value John is always replaced by the value Michael.
Consistency with another column indicates that the value in the destination database is consistent with the value of a different column in the source database.
For example, a column contains an IP address. You make the assigned generator consistent with the username column. Every row that has the username User1 in the input database has the same IP address in the destination database.
When you select a generator as the sub-generator for a composite generator, in most cases you cannot configure the generator to be consistent with another column. Only the Conditional generator and the Regex Mask generator allow a sub-generator to be consistent with another column.
Note that consistency with another column cannot be configured in a generator preset. You can only configure it when you configure an individual column.
To enable consistency, on the generator configuration panel, toggle the Consistency switch.
Not all generators support consistency.
Consistency is a function of both the data type and the value.
For example, a numeric field contains the value 123. A string/varchar field contains the value "123".
Both fields have consistent generators applied.
The output is not consistent between the two fields.
To demonstrate the effect of consistency on the output, we'll use a column that contains a first name, and that uses the Name generator.
Here is the sample input and output when consistency is not enabled:
In this sample data, the first name Melissa appears twice, but is mapped to Walton the first time and Linn the second time.
Here is the sample input and output when consistency is enabled:
In this case, the first name Melissa is mapped to Rosella both times.
A consistent generator ensures that the same input value always produces the same output value.
It does not guarantee that two different input values produce two different output values.
Consistent generators are not 1:1 mappings.
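One way to picture a consistent but not 1:1 mapping is a keyed hash into a fixed pool of replacement values. The sketch below is an assumption made for illustration, not Structural's implementation. The `NAMES` pool, the `consistent_name` helper, and the `seed` parameter are hypothetical.

```python
import hashlib
import hmac

NAMES = ["Michael", "Susan", "Gregory"]  # hypothetical pool of replacement names


def consistent_name(value, seed):
    """Map a value to a replacement name.

    The same (type, value, seed) combination always gives the same name.
    """
    key = str(seed).encode()
    # Include the type name so that 123 (int) and "123" (str) are hashed separately.
    message = f"{type(value).__name__}:{value}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return NAMES[int.from_bytes(digest[:8], "big") % len(NAMES)]


inputs = ["Melissa", "John", "Jane", "Albert", "Melissa"]
outputs = [consistent_name(v, seed=1234) for v in inputs]
```

Both occurrences of Melissa receive the same output. Because there are four distinct inputs and only three possible outputs, at least two different inputs must share an output, which is why a consistent mapping cannot be relied on for uniqueness.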
Consistency can reduce the privacy of your data, because it reveals something about the frequency of the data values.
For example, if someone is familiar with the source data values and frequency, they might be able to connect the source and destination values. For example, they know that Jane appears 20 times and Michael appears 3 times in the source. When they see 20 instances of Susan and 3 instances of John, they might infer that Susan is mapped from Jane and John from Michael.
However, this risk does require some knowledge of the source data. Tonic Structural does not store mappings of the source data to the destination data. In other words, someone can see that in the destination data the name Susan appears 20 times and the name John appears 3 times. But without any knowledge of the source data, they cannot determine that Susan is mapped from Jane and John is mapped from Michael.
Also, the mapping of source to destination values is not guaranteed to be unique. Both Jane and Michael could be mapped to John. In that case there would be 23 instances of John, which would not match the frequency of a specific source value. To guarantee unique values, use a primary key generator.
Any column, regardless of which table it resides in, is consistent with any other column that uses the same consistent generator.
For example, your database includes a Customers table and an Employees table. Each table contains a column for the first name of the customer or employee. You assign the Name generator to both columns to generate a first name, and make the generators consistent. The same first name value in either column is mapped to the same destination value. For example, the first name John is always mapped to Michael, whether the name John appears in the Customers table or the Employees table.
However, by default, consistency is not guaranteed between data generation runs, even if the run is on the same database.
By default, consistency is only guaranteed across a single data generation for a single workspace.
For example, for a column that contains a first name value, you assign the Name generator and configure the generator to be consistent. The first time you run data generation, all instances of the name John might be replaced with Michael. The next time you run data generation, all instances of the name John might instead be replaced with Gregory.
You can enable consistency across runs and workspaces so that, for example, every time you run a data generation, John is always replaced with Michael.
To do this, you configure a seed value. You can either:
Configure the Structural environment setting TONIC_STATISTICS_SEED. This ensures consistency across all workspaces and data generation runs.
Configure a seed value for a workspace. This ensures consistency across all data generation runs for that workspace, as well as across other workspaces that have the same seed value.
Disable cross-data generation consistency for a workspace. This indicates to not have consistency across data generation runs or with other workspaces.
To ensure consistency across all data generations and workspaces, add the following environment setting to the Structural worker and web server containers:
TONIC_STATISTICS_SEED: <ANY 32-BIT SIGNED INTEGER>
When you configure a value for this environment setting, consistency applies across all data generations for all workspaces, except for workspaces that either:
Have a workspace seed value configured.
Have disabled consistency across data generations.
For an individual workspace, you can override the Structural seed value. When you override the Structural seed value, you can either:
Disable consistency across data generation runs for the workspace.
Provide a seed value for the workspace.
When a workspace has a configured seed value, consistency applies across the data generation runs for that workspace.
Consistency also applies across all of the data generations for all of the workspaces that have the same seed value.
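The precedence between the environment setting and a workspace override can be summarized in a few lines. This is a hypothetical model of the rules described above, not Structural code. The function and parameter names are made up for the sketch.

```python
from typing import Optional


def effective_seed(
    environment_seed: Optional[int],
    override_enabled: bool = False,
    workspace_seed: Optional[int] = None,
) -> Optional[int]:
    """Return the seed that a workspace would use, or None for no cross-run consistency.

    override_enabled=True with workspace_seed=None models the case where
    consistency across data generations is disabled for the workspace.
    """
    if override_enabled:
        return workspace_seed
    return environment_seed


# Two workspaces that resolve to the same seed are consistent with each other.
workspace_a = effective_seed(environment_seed=42)
workspace_b = effective_seed(environment_seed=42, override_enabled=True, workspace_seed=7)
workspace_c = effective_seed(environment_seed=42, override_enabled=True)
```

Here `workspace_a` follows the environment setting, `workspace_b` uses its own seed, and `workspace_c` has no seed, so its output is not guaranteed to be consistent between runs.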
On the workspace details view, to override the Structural seed value:
Toggle Override Statistics Seed to the on position.
To disable consistency across data generations, click Don't use consistency.
To provide a seed value for the workspace:
Click Consistency value.
In the field, enter the seed value. It must be a 32-bit signed integer. The value defaults to the current value of TONIC_STATISTICS_SEED.
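A 32-bit signed integer is a whole number from -2147483648 through 2147483647. If you set the seed from a script, a small check such as the following can catch out-of-range values early. The `parse_seed` helper is hypothetical and is not part of Structural.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def parse_seed(raw):
    """Parse a seed string and confirm that it fits in a 32-bit signed integer."""
    seed = int(raw)  # raises ValueError for text that is not an integer
    if not INT32_MIN <= seed <= INT32_MAX:
        raise ValueError(f"seed {seed} is outside the 32-bit signed range")
    return seed


seed = parse_seed("123456789")
```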
The following generators can be made consistent to themselves. This means that the same input value in the column always produces the same output value.
The following generators can be made consistent either to themselves or to other columns.
When a column is consistent to another column, the output value is based on the other column.
For example, a column contains a company name. You assign the Company Name generator, and make it consistent with the username column. Every row that has the username User1 in the input database has the same company name in the destination database.
Company Name (Deprecated)
The linking option for a generator allows multiple columns within the same table to use a single generator.
At a high level, consider using linking when columns share a strong interdependency or correlation.
When you link columns, you tell Tonic Structural that the columns are related to each other, and that Structural should take this relationship into account when it generates new data.
In a child workspace, if you change the configuration of a linked column, the columns that it is linked to also are marked as having overrides to the parent workspace configuration.
Note that you cannot configure linking as part of a generator preset. You can only configure linking when you configure specific columns.
To link columns, you first assign the same generator to those columns.
After you assign the generator, then on the generator configuration panel for any of the columns, you can link the columns.
Categorical generators support linking and can be used to preserve hierarchical data. Examples of hierarchical data include:
City, State, Zip
Job Title, Department
Day of Month, Month, Year
To illustrate how linking works, we'll use an example of city and state columns. Here is the original data:
When you apply the Categorical generator to the city and state columns, but do not link the columns, the values in each column are shuffled independently. In the output, the city and state combinations are not valid. For example, Phoenix is not in Florida and Baltimore is not in Tennessee.
When you apply the Categorical generator to the city and state columns and link the columns, the data hierarchy is preserved and the city and state combinations are valid.
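The difference between unlinked and linked columns can be sketched with a simple shuffle. This is an illustration only, with made-up sample rows and hypothetical function names. It treats the Categorical generator as a plain shuffle and ignores options such as differential privacy.

```python
import random

# Hypothetical source rows: (city, state).
rows = [
    ("Phoenix", "Arizona"),
    ("Baltimore", "Maryland"),
    ("Miami", "Florida"),
    ("Nashville", "Tennessee"),
]


def shuffle_independently(rows, rng):
    """Shuffle each column on its own, which can break city and state pairs."""
    cities = [city for city, _ in rows]
    states = [state for _, state in rows]
    rng.shuffle(cities)
    rng.shuffle(states)
    return list(zip(cities, states))


def shuffle_linked(rows, rng):
    """Shuffle whole (city, state) pairs so that each pair stays valid."""
    linked = list(rows)
    rng.shuffle(linked)
    return linked


rng = random.Random(3)
unlinked = shuffle_independently(rows, rng)
linked = shuffle_linked(rows, rng)
```

Both approaches preserve the frequency of each city and each state. Only the linked shuffle guarantees that every output row is a combination that exists in the source data.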
The following generators can be linked: