Differential privacy
Differential privacy is one technique that Tonic Structural uses to ensure the privacy of your data.
Differential privacy limits the effect of a single source record or user on the destination data. Someone who views the output of a process that has differential privacy cannot determine whether a particular individual's information was used to generate that output.
Data that is protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised.
Generators that automatically have differential privacy
Any generator that does not use the underlying data at all is considered "data-free". A data-free generator always has differential privacy.
Several Structural generators are either always data-free, or are data-free if consistency is not enabled.
Generators for which differential privacy is configurable
The configuration options for the Categorical and Continuous generators include a Differential Privacy toggle to enable or disable differential privacy.
Categorical generator
The Categorical generator shuffles the values of a column while preserving the overall frequency of the values. Note that NULL is considered its own category of value.
Differential privacy (disabled by default) further protects the privacy of your data by:
First, adding noise to the frequencies of categories.
After that, if needed, removing rare categories from the possible samples.
These steps ensure that a single row of source data has limited influence on the output values. By default, this generator uses a fixed privacy budget $\varepsilon$, with $\delta$ chosen as a function of $n$, the number of rows.
Differential privacy is not appropriate when the data in each row is unique or nearly unique. As a general rule of thumb, categories that are represented by fewer than 15 rows are at risk of being suppressed.
Structural warns you when a column isn’t suitable for differential privacy. A column is not suitable for differential privacy if most or all categories have fewer than 15 rows.
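The sketch below, in Python with NumPy, illustrates the shape of those two steps on a toy column. It is only a sketch: the function noisy_category_weights, the Laplace noise scale of $1/\varepsilon$, and the fixed suppression cutoff of 15 are illustrative assumptions, not Structural's actual implementation.

```python
import numpy as np

def noisy_category_weights(values, epsilon=1.0, threshold=15, rng=None):
    """Illustrative sketch (not Structural's implementation): perturb category
    frequencies with Laplace noise, then drop rare categories before sampling."""
    rng = np.random.default_rng() if rng is None else rng
    categories, counts = np.unique(values, return_counts=True)

    # Step 1: add Laplace noise to each category's frequency.
    noisy_counts = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))

    # Step 2: suppress categories whose noisy frequency falls below a cutoff.
    keep = noisy_counts >= threshold
    categories, noisy_counts = categories[keep], noisy_counts[keep]

    # Normalize the surviving noisy frequencies into sampling weights.
    return categories, noisy_counts / noisy_counts.sum()

# Sample destination values from the noisy category distribution.
source = np.array(["A"] * 500 + ["B"] * 300 + ["C"] * 5)  # "C" is rare
cats, weights = noisy_category_weights(source, epsilon=1.0)
destination = np.random.default_rng(0).choice(cats, size=len(source), p=weights)
```

In this toy run, the rare category "C" is likely to be suppressed, which mirrors the rule of thumb above about categories with fewer than 15 rows.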
Continuous generator
The Continuous generator produces samples that preserve the individual column distributions and correlations between columns.
When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [4].
The default privacy budget for this generator is a fixed pair $(\varepsilon, \delta)$.
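As a rough illustration of that idea, the sketch below estimates a mean and covariance, perturbs the covariance with a symmetric Gaussian noise matrix in the spirit of AnalyzeGauss [4], and samples synthetic rows. It assumes the data is already scaled so the covariance has bounded sensitivity, it leaves the mean un-noised, and the noise scale is a simplification; it is not Structural's implementation.

```python
import numpy as np

def noisy_gaussian_sample(data, epsilon=1.0, delta=1e-6, rng=None):
    """Illustrative sketch: add symmetric Gaussian noise to the covariance
    matrix (in the spirit of AnalyzeGauss [4]), then sample synthetic rows."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = data.shape

    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)

    # Draw a symmetric Gaussian noise matrix and perturb the covariance.
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon  # illustrative scale
    noise = rng.normal(scale=sigma, size=(d, d))
    noise = (noise + noise.T) / 2.0
    noisy_cov = cov + noise

    # Project back onto the positive semi-definite cone so we can sample.
    eigvals, eigvecs = np.linalg.eigh(noisy_cov)
    noisy_cov = eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T

    return rng.multivariate_normal(mean, noisy_cov, size=n)

# Example usage with a toy two-column table.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1000)
synthetic = noisy_gaussian_sample(data, epsilon=1.0, delta=1e-6)
```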
More details: Mathematical formulation
Differential privacy is a property of a randomized algorithm $\mathcal{A}$, which takes as input a database $D$ and produces some output $\mathcal{A}(D)$. The output could be a count, a summary statistic, or a synthetic database; the specific type is not important for this formulation.
For this formulation, we say two databases $D$ and $D'$ are neighbors if they differ by a single row.
For a given $\varepsilon \ge 0$, we say that $\mathcal{A}$ is $\varepsilon$-differentially private if, for all subsets of outputs $S$ and all pairs of neighboring databases $D$ and $D'$, we have:

$$\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(D') \in S]$$
A relaxed version of this definition, introduced below, adds a second parameter $\delta$; when $\delta$ is non-zero, the algorithm is sometimes called approximately differentially private.
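As a concrete instance of the definition, consider classic randomized response: report a sensitive bit truthfully with probability 3/4 and flipped with probability 1/4. The probability of any report changes by a factor of at most 3 when the true bit changes, so the mechanism is $\varepsilon$-differentially private with $\varepsilon = \ln 3$. The short check below is written for this page as an illustration; it is not part of Structural.

```python
import math

# Randomized response: report the true bit with probability 3/4 and the
# flipped bit with probability 1/4.
def report_prob(true_bit: int, reported_bit: int) -> float:
    return 0.75 if reported_bit == true_bit else 0.25

# Two neighboring single-row databases differ only in the sensitive bit.
# Check that every output's probability changes by a factor of at most 3.
epsilon = math.log(3)
for reported in (0, 1):
    ratio = report_prob(1, reported) / report_prob(0, reported)
    assert math.exp(-epsilon) <= ratio <= math.exp(epsilon)

print(f"randomized response is epsilon-DP with epsilon = ln 3 ≈ {epsilon:.4f}")
```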
Privacy budget
The parameter $\varepsilon$ is the privacy budget of the algorithm, and quantifies, in a precise sense, an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database.
Suppose an attacker suspects that our secret database $D$ is one of two possible neighboring databases, $D_1$ or $D_2$, with some fixed prior odds.
If $\mathcal{A}$ is $\varepsilon$-differentially private, then observing $\mathcal{A}(D)$ updates the attacker's log odds of $D_1$ versus $D_2$ by at most $\varepsilon$.
The closer $\varepsilon$ is to $0$, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from $\mathcal{A}(D)$.
Conversely, larger values of $\varepsilon$ mean that an attacker can possibly learn significant information by observing $\mathcal{A}(D)$.
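This bound follows directly from Bayes' rule. Writing $S$ for the set of outputs the attacker observes, the posterior odds equal the prior odds multiplied by a likelihood ratio that differential privacy caps at $e^{\varepsilon}$:

$$\frac{\Pr[D = D_1 \mid \mathcal{A}(D) \in S]}{\Pr[D = D_2 \mid \mathcal{A}(D) \in S]} \;=\; \frac{\Pr[\mathcal{A}(D_1) \in S]}{\Pr[\mathcal{A}(D_2) \in S]} \cdot \frac{\Pr[D = D_1]}{\Pr[D = D_2]} \;\le\; e^{\varepsilon} \cdot \frac{\Pr[D = D_1]}{\Pr[D = D_2]}$$

Taking logarithms shows that the log odds move by at most $\varepsilon$.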
A simple example: counting
Suppose we want to count the number of users in a database that have some sensitive property. For example, the number of users with a particular medical diagnosis.
In [2], Dwork, McSherry, Nissim, and Smith introduced the Laplace mechanism as a way to publish these counts securely by adding noise sampled from the Laplace distribution.
This noise affords us plausible deniability. With Laplace noise of scale $1/\varepsilon$, if the underlying count changes by $1$, then the probability of observing the same noisy output does not change by much:

$$e^{-\varepsilon} \;\le\; \frac{\Pr[\text{noisy count} = x \mid \text{true count} = c]}{\Pr[\text{noisy count} = x \mid \text{true count} = c + 1]} \;\le\; e^{\varepsilon}$$
We illustrate this visually, showing the probability density function (pdf) of the observed values for three adjacent true counts (blue, orange, and green).
The blue shaded region shows that the density of the possible noisy count values for the orange and green curves lies within a factor of $e^{\varepsilon}$ of the density for the blue curve, so this mechanism is $\varepsilon$-differentially private.
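The sketch below implements this counting example in NumPy; the true count of 100 and $\varepsilon = 0.5$ are arbitrary illustrative choices, and the assertions check the density ratio bound described above.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng=None):
    """Laplace mechanism for a counting query (sensitivity 1):
    release the true count plus Laplace(1/epsilon) noise."""
    rng = np.random.default_rng() if rng is None else rng
    return true_count + rng.laplace(scale=1.0 / epsilon)

def laplace_pdf(x, center, epsilon):
    """Density of the released value when the true count is `center`."""
    return (epsilon / 2.0) * np.exp(-epsilon * np.abs(x - center))

epsilon = 0.5
xs = np.linspace(90.0, 110.0, 201)

# Neighboring databases change the count by at most 1, so the output densities
# differ pointwise by a factor of at most e^epsilon.
ratio = laplace_pdf(xs, 100, epsilon) / laplace_pdf(xs, 101, epsilon)
assert np.all(ratio <= np.exp(epsilon) + 1e-12)
assert np.all(ratio >= np.exp(-epsilon) - 1e-12)

print("noisy count:", round(laplace_count(100, epsilon), 2))
```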
Approximate differential privacy
A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise drawn from a wider array of distributions than the Laplace distribution.
For example, the AnalyzeGauss mechanism of [4] and the differentially private gradient descent of [1] use Gaussian noise as a fundamental ingredient, which requires the following relaxation:
For a given $\varepsilon \ge 0$ and $\delta \in [0, 1]$, we say that $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private if, for all subsets of outputs $S$ and all pairs of neighboring databases $D$ and $D'$, we have:

$$\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(D') \in S] + \delta$$
The parameter $\delta$ is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism that reveals the entire sensitive database with probability $\delta$, in practice this is not a plausible outcome with carefully designed mechanisms. Taking $\delta$ to be small relative to the size of the database also ensures that the risk of disclosure is low.
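As an example of how $\delta$ enters a concrete mechanism, the sketch below adds Gaussian noise to a numeric answer using the standard calibration from [3], which is valid for $0 < \varepsilon < 1$. The parameter values are illustrative, and this is not Structural's implementation.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=None):
    """Release `value` with Gaussian noise calibrated for (epsilon, delta)-DP.
    Uses sigma = sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon, the
    standard calibration from [3], valid for 0 < epsilon < 1."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return value + rng.normal(scale=sigma)

# Example: a counting query has sensitivity 1.
print(gaussian_mechanism(100, sensitivity=1.0, epsilon=0.5, delta=1e-6))
```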
References
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI: https://doi.org/10.1145/2976749.2978318
[2] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI: https://doi.org/10.1007/11681878_14
[3] Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI: https://doi.org/10.1561/0400000042
[4] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze Gauss: Optimal Bounds for Privacy-Preserving Principal Component Analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI: https://doi.org/10.1145/2591796.2591883