TOИIC

Search…

The Tonic Platform

Concepts

Differential Privacy

Differential privacy is one of the techniques Tonic uses to ensure the privacy of your data. At a high level, differential privacy limits the influence of a single user’s data on Tonic’s output data, ensuring privacy through a robust notion of plausible deniability. Slightly more precisely, differential privacy is a property of a randomized process on databases which stipulates that altering the database by a single record (or user) can only perturb the output in a limited way.

Differentially private processes have some valuable properties. The first, and perhaps most important, is that no amount of post processing or additional knowledge can break the guarantees of differential privacy. This isn’t true for other data anonymization techniques, e.g. k-anonymity. Additionally, differentially private data can be combined with other differentially private data without losing its protection. **In practice this means data protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised, no matter the adversary.**

Tonic can enable differential privacy for the Categorical and Continuous generators and many of Tonic’s generators enjoy this property for free — any generator that does not use the underlying data at all is considered "data free", which has the property of being differentially private.

Categorical Generator

The categorical generator shuffles the values of a column while preserving the overall frequencies of the values. Differential privacy (enabled by default) will further protect the privacy of your data by:

- 1.Adding noise to the frequencies of categories
- 2.Suppressing rare categories.

These steps ensure that a single row of source data has limited influence on the output values. By default, the privacy budget for this generator is

$\varepsilon = 1,$

with $\delta = \frac{1}{10n}$

, where $n$

is the number of rows.Disabling differential privacy is not recommended, but there is one situation where it may be necessary: when the data in each row is unique or nearly unique — as a general rule of thumb, categories represented by fewer than 15 rows are at risk of being suppressed. Tonic will warn you when a column isn’t suitable for differential privacy, and you must disable it in those cases.

Continuous Generator

The continuous generator produces samples which preserve the individual column distributions and correlations between columns. When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [4]. The default privacy budget for this generator is

$\varepsilon = 1,$

with $\delta = \frac{1}{10n}$

.More details: Mathematical formulation

Differential privacy is a property of a *randomized algorithm* ** **For a given

$\mathcal{M}$

which takes as input a database $D$

and produces some output $\mathcal{M}(D).$

The outputs could be counts or summary statistics or synthetic databases — the specific type is not important for this formulation. For this formulation, we say two databases $D$

and $D'$

are neighbors if they differ by a single row. $\varepsilon\geq 0$

, we say that $\mathcal{M}$

is $\varepsilon$

differentially private if, for all subset of outputs $\mathcal{O}$

, we have$\operatorname{P}(\mathcal{M}(D)\in \mathcal{O}) \leq e^\varepsilon \operatorname{P}(\mathcal{M}(D')\in \mathcal{O})$

When *approximately differentially private*. The parameter

$\delta$

is non-zero, this is sometimes called $\delta$

is often interpreted as an upper bound on the risk a privacy leak, butPrivacy Budget

The parameter *possibly* learn significant information by observing

$\varepsilon$

is the privacy budget of the algorithm, and quantifies in a precise sense, an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database: Suppose an attacker suspects our secret database$D$

is one of two possible neighboring databases $D_1, D_2$

, with some fixed odds. If$\mathcal{M}$

is $\varepsilon$

differentially private, then observing $\mathcal{M}(D)$

updates the the attacker's log odds of $D=D_1$

vs $D=D_2$

by at most $\varepsilon$

. The closer $\varepsilon$

is to $0$

the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from $\mathcal{M}(D)$

. Conversely, larger values of $\varepsilon$

mean an attacker can $\mathcal{M}(D)$

.A simple example: counting

Suppose we want to count the number of users in a database having some sensitive property — for example, the number of users with a particular medical diagnosis. Dwork, McSherry, Nissim and Smith introduced in [2] the Laplace Mechanism as a way of publishing these counts in a secure way, by adding noise sampled from the Laplace distribution. This noise affords us plausible deniability — if the underlying count changed by

$\pm 1$

, then the probability of observing the same noisy output does not change by much:$\operatorname{P}( \text{Noised} = k | \text{count} = n \pm 1) \leq \operatorname{exp}(1) \operatorname{P}(\text{Noised} = k | \text{count} = n)$

We illustrate this visually, showing the probability density function (pdf) of the observed values given true counts of

$n$

(blue), $n + 1$

(orange), and $n - 1$

(green).
The Laplace Mechanism

The blue shaded region shows that the the possibly noisy count values for

$n-1$

and $n+1$

lie within a factor of $\exp(1)$

of the noisy count values of $n$

, so this mechanism is differentially private with $\varepsilon=1$

.Approximate Differential Privacy

A common relaxation, called *approximate differential privacy*, allows for flexible privacy analysis with noise drawn from a wider array of distributions than the Laplace distribution. For example, the AnalyzeGauss mechanisms of [4], and differentially private gradient descent of [1], use Gaussian noise as a fundamental ingredient, which requires the following relaxation: ** **For a given

$\varepsilon\geq 0$

and $\delta\in[0,1]$

, we say that $\mathcal{M}$

is $(\varepsilon,\delta)$

differentially private if, for all subset of outputs $\mathcal{O}$

, we have$\operatorname{P}(\mathcal{M}(D)\in \mathcal{O}) \leq e^\varepsilon \operatorname{P}(\mathcal{M}(D')\in \mathcal{O}) +\delta$

The parameter

$\delta$

is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism revealing a sensitive database with probability $\delta$

, in practice this is not a plausible outcome with carefully designed mechanisms. Additionally, taking $\delta$

to be small relative to the size of the database ensures the risk of disclosure is low.References

- 1.Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI:https://doi.org/10.1145/2976749.2978318
- 2.Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. 2006 Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography. (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI:https://doi.org/10.1007/11681878_14
- 3.Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI:https://doi.org/10.1561/0400000042
- 4.Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI:https://doi.org/10.1145/2591796.2591883

Last modified 4mo ago