Differential Privacy
Differential privacy is one of the techniques Tonic uses to ensure the privacy of your data. At a high level, differential privacy limits the influence of a single user’s data on Tonic’s output data, ensuring privacy through a robust notion of plausible deniability. Slightly more precisely, differential privacy is a property of a randomized process on databases which stipulates that altering the database by a single record (or user) can only perturb the output in a limited way.
Differentially private processes have some valuable properties. The first, and perhaps most important, is that no amount of post-processing or additional knowledge can break the guarantees of differential privacy. This is not true for other data anonymization techniques, e.g. k-anonymity. Additionally, differentially private data can be combined with other differentially private data without losing its protection. In practice, this means data protected by a differentially private process cannot be reverse engineered, re-identified, or otherwise compromised, no matter the adversary.
Tonic can enable differential privacy for the Categorical and Continuous generators, and many of Tonic's generators enjoy this property for free: any generator that does not use the underlying data at all is differentially private.

Categorical Generator

The categorical generator shuffles the values of a column while preserving the overall frequencies of the values. Differential privacy (enabled by default) will further protect the privacy of your data by:
    1. Adding noise to the frequencies of categories.
    2. Suppressing rare categories.
These steps ensure that a single row of source data has limited influence on the output values. By default, the privacy budget for this generator is $\varepsilon = 1$, with $\delta = \frac{1}{10n}$, where $n$ is the number of rows.
Disabling differential privacy is not recommended, but there is one situation where it may be necessary: when the data in each row is unique or nearly unique. As a general rule of thumb, categories represented by fewer than 15 rows are at risk of being suppressed. Tonic will warn you when a column is not suitable for differential privacy, and in those cases you must disable it.
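Tonic's exact mechanism is not spelled out here, but the two steps above follow a standard noise-then-suppress pattern for private histograms. The Python sketch below is illustrative only: the function name, the use of Laplace noise, and the fixed suppression threshold are assumptions, not Tonic's implementation.

```python
import numpy as np

def dp_category_frequencies(values, epsilon=1.0, threshold=15.0, rng=None):
    """Illustrative noise-then-suppress histogram (not Tonic's implementation).

    In a real mechanism the suppression threshold would be derived from the
    delta parameter rather than fixed at a rule-of-thumb value.
    """
    rng = rng if rng is not None else np.random.default_rng()
    categories, counts = np.unique(np.asarray(values), return_counts=True)
    # Adding or removing one row changes a single category count by 1, so
    # Laplace noise with scale 1/epsilon protects the counts with budget epsilon.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Suppress categories whose noisy count falls below the threshold; rare
    # categories (those with few supporting rows) are the most likely to vanish.
    kept = {c: max(float(n), 0.0) for c, n in zip(categories, noisy) if n >= threshold}
    total = sum(kept.values())
    return {c: n / total for c, n in kept.items()} if total else {}
```

For example, `dp_category_frequencies(["a"] * 500 + ["b"] * 20 + ["c"] * 3)` would typically keep categories a and b but drop the rare category c, mirroring the suppression behavior described above.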

Continuous Generator

The continuous generator produces samples which preserve the individual column distributions and correlations between columns. When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [4]. The default privacy budget for this generator is $\varepsilon = 1$, with $\delta = \frac{1}{10n}$.
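To give a sense of how the noise addition in [4] works, here is a minimal sketch of an AnalyzeGauss-style release of a second-moment matrix. The function name, the row-norm clipping, and the Gaussian-mechanism calibration for the noise scale are assumptions for this sketch; Tonic's actual pipeline is more involved.

```python
import numpy as np

def noisy_second_moment(data, epsilon=1.0, delta=None, rng=None):
    """Sketch of an AnalyzeGauss-style release of A^T A (in the spirit of [4]).

    Assumes each row is clipped to L2 norm at most 1, so a single record can
    only change A^T A by a bounded amount. The noise scale uses the standard
    Gaussian-mechanism calibration (see, e.g., [3]); constants may differ from [4].
    """
    rng = rng if rng is not None else np.random.default_rng()
    a = np.asarray(data, dtype=float)
    n, d = a.shape
    delta = delta if delta is not None else 1.0 / (10 * n)  # document default
    # Clip rows so no single record dominates the second-moment matrix.
    norms = np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1.0)
    a = a / norms
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Symmetric Gaussian noise: i.i.d. on the upper triangle, mirrored below.
    upper = np.triu(rng.normal(scale=sigma, size=(d, d)))
    noise = upper + np.triu(upper, 1).T
    return a.T @ a + noise
```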

More details: Mathematical formulation

Differential privacy is a property of a randomized algorithm $\mathcal{M}$ which takes as input a database $D$ and produces some output $\mathcal{M}(D)$. The outputs could be counts or summary statistics or synthetic databases; the specific type is not important for this formulation. We say two databases $D$ and $D'$ are neighbors if they differ by a single row. For a given $\varepsilon \geq 0$, we say that $\mathcal{M}$ is $\varepsilon$-differentially private if, for all subsets of outputs $\mathcal{O}$, we have

$$\operatorname{P}(\mathcal{M}(D) \in \mathcal{O}) \leq e^{\varepsilon} \operatorname{P}(\mathcal{M}(D') \in \mathcal{O})$$
When $\delta$ is non-zero, the mechanism is sometimes called approximately differentially private. The parameter $\delta$ is often interpreted as an upper bound on the risk of a privacy leak, but this interpretation is pessimistic for carefully designed mechanisms; see the Approximate Differential Privacy section below.

Privacy Budget

The parameter $\varepsilon$ is the privacy budget of the algorithm, and it quantifies, in a precise sense, an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database. Suppose an attacker suspects our secret database $D$ is one of two possible neighboring databases $D_1, D_2$, with some fixed odds. If $\mathcal{M}$ is $\varepsilon$-differentially private, then observing $\mathcal{M}(D)$ updates the attacker's log odds of $D = D_1$ vs. $D = D_2$ by at most $\varepsilon$. The closer $\varepsilon$ is to $0$, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from $\mathcal{M}(D)$. Conversely, larger values of $\varepsilon$ mean an attacker can possibly learn significant information by observing $\mathcal{M}(D)$.
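The log-odds bound can be checked with a few lines of arithmetic. The numbers below are hypothetical, chosen only to illustrate the worst case that an $\varepsilon$-differentially private mechanism allows.

```python
import numpy as np

epsilon = 1.0
prior_odds = 2.0                    # attacker's prior: D1 twice as likely as D2
# epsilon-DP bounds the likelihood ratio P(output | D1) / P(output | D2)
# by exp(epsilon), so the worst-case Bayesian update is:
likelihood_ratio = np.exp(epsilon)
posterior_odds = prior_odds * likelihood_ratio
log_odds_shift = np.log(posterior_odds) - np.log(prior_odds)
print(log_odds_shift)               # 1.0 == epsilon, the maximum possible shift
```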

A simple example: counting

Suppose we want to count the number of users in a database having some sensitive property, for example the number of users with a particular medical diagnosis. Dwork, McSherry, Nissim and Smith introduced in [2] the Laplace Mechanism as a way of publishing these counts in a secure way, by adding noise sampled from the Laplace distribution. This noise affords us plausible deniability: if the underlying count changed by $\pm 1$, then the probability of observing the same noisy output does not change by much:

$$\operatorname{P}(\text{Noised} = k \mid \text{count} = n \pm 1) \leq \exp(1)\, \operatorname{P}(\text{Noised} = k \mid \text{count} = n)$$
We illustrate this visually, showing the probability density function (pdf) of the observed values given true counts of $n$ (blue), $n + 1$ (orange), and $n - 1$ (green).

[Figure: The Laplace Mechanism]
The blue shaded region shows that the probability densities of the noisy count values for $n - 1$ and $n + 1$ lie within a factor of $\exp(1)$ of those for $n$, so this mechanism is differentially private with $\varepsilon = 1$.
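A minimal sketch of this mechanism, together with a numerical check of the density-ratio inequality above, looks like the following (the helper names are ours, not from [2]):

```python
import numpy as np

def laplace_count(true_count, epsilon=1.0, rng=None):
    """Laplace mechanism for a counting query: a count has sensitivity 1
    (one user changes it by at most 1), so Laplace noise with scale
    1/epsilon gives an epsilon-differentially private release."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_count + rng.laplace(scale=1.0 / epsilon)

def laplace_pdf(x, loc, scale):
    return np.exp(-np.abs(x - loc) / scale) / (2.0 * scale)

# Check the inequality for epsilon = 1: for every output k, the densities
# under true counts n - 1, n, and n + 1 differ by at most a factor of exp(1).
n, eps = 100, 1.0
ks = np.linspace(n - 10, n + 10, 401)
for shifted in (n - 1, n + 1):
    ratio = laplace_pdf(ks, shifted, 1.0 / eps) / laplace_pdf(ks, n, 1.0 / eps)
    assert np.all(ratio <= np.exp(eps) + 1e-12)
```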

Approximate Differential Privacy

A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise drawn from a wider array of distributions than the Laplace distribution. For example, the AnalyzeGauss mechanism of [4] and the differentially private gradient descent of [1] use Gaussian noise as a fundamental ingredient, which requires the following relaxation: for a given $\varepsilon \geq 0$ and $\delta \in [0, 1]$, we say that $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all subsets of outputs $\mathcal{O}$, we have

$$\operatorname{P}(\mathcal{M}(D) \in \mathcal{O}) \leq e^{\varepsilon} \operatorname{P}(\mathcal{M}(D') \in \mathcal{O}) + \delta$$
The parameter $\delta$ is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism revealing a sensitive database with probability $\delta$, in practice this is not a plausible outcome with carefully designed mechanisms. Additionally, taking $\delta$ to be small relative to the size of the database ensures the risk of disclosure is low.
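For a concrete instance of an $(\varepsilon, \delta)$-differentially private release, here is a sketch of the Gaussian mechanism applied to the same sensitivity-1 counting query from the previous section. The noise scale uses the standard calibration from [3], valid for $\varepsilon < 1$; the function name is ours.

```python
import numpy as np

def gaussian_count(true_count, epsilon=0.5, delta=1e-5, rng=None):
    """(epsilon, delta)-DP release of a sensitivity-1 count via Gaussian noise.

    Uses the standard calibration sigma = sqrt(2 ln(1.25 / delta)) / epsilon
    (see, e.g., [3]), which applies for epsilon < 1.
    """
    rng = rng if rng is not None else np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return true_count + rng.normal(scale=sigma)
```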

References

    1. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI: https://doi.org/10.1145/2976749.2978318
    2. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI: https://doi.org/10.1007/11681878_14
    3. Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI: https://doi.org/10.1561/0400000042
    4. Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI: https://doi.org/10.1145/2591796.2591883