Differential Privacy
Differential privacy is one of the techniques Tonic uses to ensure the privacy of your data. At a high level, differential privacy limits the influence of a single user’s data on Tonic’s output data, ensuring privacy through a robust notion of plausible deniability. Slightly more precisely, differential privacy is a property of a randomized process on databases which stipulates that altering the database by a single record (or user) can only perturb the output in a limited way.
Differentially private processes have some valuable properties. The first, and perhaps most important, is that no amount of post-processing or additional knowledge can break the guarantees of differential privacy. This is not true for other data anonymization techniques, e.g. k-anonymity. Additionally, differentially private data can be combined with other differentially private data without losing its protection. In practice, this means data protected by a differentially private process cannot be reverse engineered, re-identified, or otherwise compromised, no matter the adversary.
Tonic can enable differential privacy for the Categorical and Continuous generators, and many of Tonic's generators enjoy this property for free: any generator that does not use the underlying data at all is differentially private.

Categorical Generator

The categorical generator shuffles the values of a column while preserving the overall frequencies of the values. Differential privacy (enabled by default) will further protect the privacy of your data by:
    1. Adding noise to the frequencies of categories.
    2. Suppressing rare categories.
These steps ensure that a single row of source data has limited influence on the output values. By default, the privacy budget for this generator is $\varepsilon = 1$, with $\delta = \frac{1}{10n}$, where $n$ is the number of rows.
Disabling differential privacy is not recommended, but there is one situation where it may be necessary: when the data in each row is unique or nearly unique. As a general rule of thumb, categories represented by fewer than 15 rows are at risk of being suppressed. Tonic will warn you when a column is not suitable for differential privacy, and in those cases you must disable it.
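Tonic's exact mechanism is not spelled out here, but the two steps above follow a standard noise-then-suppress pattern for private histograms. The Python sketch below is illustrative only: the function name, the use of Laplace noise, and the fixed suppression threshold are assumptions, not Tonic's implementation.

```python
import numpy as np

def dp_category_frequencies(values, epsilon=1.0, threshold=15.0, rng=None):
    """Illustrative noise-then-suppress histogram (not Tonic's implementation).

    In a real mechanism the suppression threshold would be derived from the
    delta parameter rather than fixed at a rule-of-thumb value.
    """
    rng = rng if rng is not None else np.random.default_rng()
    categories, counts = np.unique(np.asarray(values), return_counts=True)
    # Adding or removing one row changes a single category count by 1, so
    # Laplace noise with scale 1/epsilon protects the counts with budget epsilon.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Suppress categories whose noisy count falls below the threshold; rare
    # categories (those with few supporting rows) are the most likely to vanish.
    kept = {c: max(float(n), 0.0) for c, n in zip(categories, noisy) if n >= threshold}
    total = sum(kept.values())
    return {c: n / total for c, n in kept.items()} if total else {}
```

For example, `dp_category_frequencies(["a"] * 500 + ["b"] * 20 + ["c"] * 3)` would typically keep categories a and b but drop the rare category c, mirroring the suppression behavior described above.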

Continuous Generator

The continuous generator produces samples which preserve the individual column distributions and correlations between columns. When differential privacy is enabled, noise is added to the individual distributions and the correlation matrix, using the mechanism described in [4]. The default privacy budget for this generator is $\varepsilon = 1$, with $\delta = \frac{1}{10n}$.
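To give a sense of how the noise addition in [4] works, here is a minimal sketch of an AnalyzeGauss-style release of a second-moment matrix. The function name, the row-norm clipping, and the Gaussian-mechanism calibration for the noise scale are assumptions for this sketch; Tonic's actual pipeline is more involved.

```python
import numpy as np

def noisy_second_moment(data, epsilon=1.0, delta=None, rng=None):
    """Sketch of an AnalyzeGauss-style release of A^T A (in the spirit of [4]).

    Assumes each row is clipped to L2 norm at most 1, so a single record can
    only change A^T A by a bounded amount. The noise scale uses the standard
    Gaussian-mechanism calibration (see, e.g., [3]); constants may differ from [4].
    """
    rng = rng if rng is not None else np.random.default_rng()
    a = np.asarray(data, dtype=float)
    n, d = a.shape
    delta = delta if delta is not None else 1.0 / (10 * n)  # document default
    # Clip rows so no single record dominates the second-moment matrix.
    norms = np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1.0)
    a = a / norms
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Symmetric Gaussian noise: i.i.d. on the upper triangle, mirrored below.
    upper = np.triu(rng.normal(scale=sigma, size=(d, d)))
    noise = upper + np.triu(upper, 1).T
    return a.T @ a + noise
```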

More details: Mathematical formulation

Differential privacy is a property of a randomized algorithm $\mathcal{M}$ which takes as input a database $D$ and produces some output $\mathcal{M}(D)$. The outputs could be counts or summary statistics or synthetic databases; the specific type is not important for this formulation. We say two databases $D$ and $D'$ are neighbors if they differ by a single row. For a given $\varepsilon \geq 0$, we say that $\mathcal{M}$ is $\varepsilon$-differentially private if, for all subsets of outputs $\mathcal{O}$, we have

$$\operatorname{P}(\mathcal{M}(D) \in \mathcal{O}) \leq e^{\varepsilon} \operatorname{P}(\mathcal{M}(D') \in \mathcal{O})$$
When $\delta$ is non-zero, the mechanism is sometimes called approximately differentially private. The parameter $\delta$ is often interpreted as an upper bound on the risk of a privacy leak, but this interpretation is pessimistic for carefully designed mechanisms; see the Approximate Differential Privacy section below.

Privacy Budget

The parameter $\varepsilon$ is the privacy budget of the algorithm, and it quantifies, in a precise sense, an upper bound on how much information an adversary can gain from observing the outputs of the algorithm on an unknown database. Suppose an attacker suspects our secret database $D$ is one of two possible neighboring databases $D_1, D_2$, with some fixed odds. If $\mathcal{M}$ is $\varepsilon$-differentially private, then observing $\mathcal{M}(D)$ updates the attacker's log odds of $D = D_1$ vs. $D = D_2$ by at most $\varepsilon$. The closer $\varepsilon$ is to $0$, the better the privacy guarantee, as an attacker is more and more limited in what information they can learn from $\mathcal{M}(D)$. Conversely, larger values of $\varepsilon$ mean an attacker can possibly learn significant information by observing $\mathcal{M}(D)$.
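The log-odds bound can be checked with a few lines of arithmetic. The numbers below are hypothetical, chosen only to illustrate the worst case that an $\varepsilon$-differentially private mechanism allows.

```python
import numpy as np

epsilon = 1.0
prior_odds = 2.0                    # attacker's prior: D1 twice as likely as D2
# epsilon-DP bounds the likelihood ratio P(output | D1) / P(output | D2)
# by exp(epsilon), so the worst-case Bayesian update is:
likelihood_ratio = np.exp(epsilon)
posterior_odds = prior_odds * likelihood_ratio
log_odds_shift = np.log(posterior_odds) - np.log(prior_odds)
print(log_odds_shift)               # 1.0 == epsilon, the maximum possible shift
```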

A simple example: counting

Suppose we want to count the number of users in a database having some sensitive property, for example the number of users with a particular medical diagnosis. Dwork, McSherry, Nissim and Smith introduced in [2] the Laplace Mechanism as a way of publishing these counts in a secure way, by adding noise sampled from the Laplace distribution. This noise affords us plausible deniability: if the underlying count changed by $\pm 1$, then the probability of observing the same noisy output does not change by much:

$$\operatorname{P}(\text{Noised} = k \mid \text{count} = n \pm 1) \leq \exp(1)\, \operatorname{P}(\text{Noised} = k \mid \text{count} = n)$$
We illustrate this visually, showing the probability density function (pdf) of the observed values given true counts of $n$ (blue), $n + 1$ (orange), and $n - 1$ (green).

[Figure: The Laplace Mechanism]
The blue shaded region shows that the probability densities of the noisy count values for $n - 1$ and $n + 1$ lie within a factor of $\exp(1)$ of those for $n$, so this mechanism is differentially private with $\varepsilon = 1$.
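A minimal sketch of this mechanism, together with a numerical check of the density-ratio inequality above, looks like the following (the helper names are ours, not from [2]):

```python
import numpy as np

def laplace_count(true_count, epsilon=1.0, rng=None):
    """Laplace mechanism for a counting query: a count has sensitivity 1
    (one user changes it by at most 1), so Laplace noise with scale
    1/epsilon gives an epsilon-differentially private release."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_count + rng.laplace(scale=1.0 / epsilon)

def laplace_pdf(x, loc, scale):
    return np.exp(-np.abs(x - loc) / scale) / (2.0 * scale)

# Check the inequality for epsilon = 1: for every output k, the densities
# under true counts n - 1, n, and n + 1 differ by at most a factor of exp(1).
n, eps = 100, 1.0
ks = np.linspace(n - 10, n + 10, 401)
for shifted in (n - 1, n + 1):
    ratio = laplace_pdf(ks, shifted, 1.0 / eps) / laplace_pdf(ks, n, 1.0 / eps)
    assert np.all(ratio <= np.exp(eps) + 1e-12)
```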

Approximate Differential Privacy

A common relaxation, called approximate differential privacy, allows for flexible privacy analysis with noise drawn from a wider array of distributions than the Laplace distribution. For example, the AnalyzeGauss mechanism of [4] and the differentially private gradient descent of [1] use Gaussian noise as a fundamental ingredient, which requires the following relaxation: for a given $\varepsilon \geq 0$ and $\delta \in [0, 1]$, we say that $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all subsets of outputs $\mathcal{O}$, we have

$$\operatorname{P}(\mathcal{M}(D) \in \mathcal{O}) \leq e^{\varepsilon} \operatorname{P}(\mathcal{M}(D') \in \mathcal{O}) + \delta$$
The parameter $\delta$ is often described as the risk of a (possibly catastrophic) privacy violation. While this formal definition does allow, for example, a mechanism revealing a sensitive database with probability $\delta$, in practice this is not a plausible outcome with carefully designed mechanisms. Additionally, taking $\delta$ to be small relative to the size of the database ensures the risk of disclosure is low.
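For a concrete instance of an $(\varepsilon, \delta)$-differentially private release, here is a sketch of the Gaussian mechanism applied to the same sensitivity-1 counting query from the previous section. The noise scale uses the standard calibration from [3], valid for $\varepsilon < 1$; the function name is ours.

```python
import numpy as np

def gaussian_count(true_count, epsilon=0.5, delta=1e-5, rng=None):
    """(epsilon, delta)-DP release of a sensitivity-1 count via Gaussian noise.

    Uses the standard calibration sigma = sqrt(2 ln(1.25 / delta)) / epsilon
    (see, e.g., [3]), which applies for epsilon < 1.
    """
    rng = rng if rng is not None else np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return true_count + rng.normal(scale=sigma)
```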

References

    1. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318. DOI: https://doi.org/10.1145/2976749.2978318
    2. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S., Rabin T. (eds) Theory of Cryptography (TCC '06). Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg. DOI: https://doi.org/10.1007/11681878_14
    3. Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI: https://doi.org/10.1561/0400000042
    4. Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC '14). Association for Computing Machinery, New York, NY, USA, 11–20. DOI: https://doi.org/10.1145/2591796.2591883