Insights for Action

Data Privacy: Disclosure control and differential privacy

The role of disclosure control and differential privacy in data privacy and statistical analysis

Data lock and chain. Photo by Pixabay from Pexels.

I’ve been reading a lot about how Differential Privacy is the answer to all our privacy concerns. However, having probed the methodology in detail, I’m not so sure. Indeed, I think we have a silo problem. Differential Privacy has been developed within a Computer Science framework as a means of balancing the need to alter data to meet privacy requirements against the need to keep data close enough to their raw state to yield meaningful answers. But I’m not convinced people working with tools from canonical Computer Science ask the same questions as people working from canonical Statistical Science (I include sociologists, demographers and public health researchers within this latter framework). So I’ve written this post as a first step towards understanding the broader Differential Privacy framework. I can understand specific results for specific methods, but I just don’t see that Differential Privacy magically solves all possible trade-offs between preserving individual privacy and making data available to researchers in a form that can answer their questions with sufficient accuracy.

I also have my own bias, which is that we should avoid bias. That’s a slightly strange statement from a Bayesian Statistician, who is stereotypically willing to accept a little bias if the estimation method has better mean square error (for want of a better metric). In this case, what I mean by bias is that we don’t want to use privacy methods that systematically underestimate the strength of relationships between variables. I fully respect the need to preserve individual privacy. Indeed, I look to governments and census bureaus to lead the way in showing how you can respect privacy and still obtain insight from data. Nevertheless, I would prefer we used methods that can account for any uncertainty we have introduced into an analysis as a result of using a privacy-respecting stage in our data pipeline.

I have worked on many consulting problems where I needed to use publicly available data that had been subject to some disclosure control mechanism. A well-established weakness of many applications of statistical methods is that they only provide uncertainty estimates (confidence / credible intervals) for aleatory uncertainty: the part of our uncertainty due to random sampling. Epistemic uncertainty, such as that arising from non-response bias, model mis-specification or data that have been permuted for privacy reasons, is simply ignored. I have never liked the fact that disclosure control methods introduce uncertainty in a way that cannot be acknowledged in the results of an analysis. To give a concrete example, you don’t want to compare a policy intervention with a baseline and report that the credible intervals for the two did not overlap if you knew that adding an allowance for various sources of epistemic uncertainty would have widened the intervals to the point that they did overlap. You end up recommending policies that are not effective.
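To make the point concrete, here is a minimal sketch in R with entirely made-up numbers: an intervention effect whose interval excludes zero when only sampling (aleatory) error is counted, but not once an assumed extra variance from the privacy step is also propagated. The figures, and the simple variance-addition treatment, are illustrative assumptions rather than anything from a real analysis.

```r
## Hypothetical numbers: an estimated intervention effect of 1.0 with a
## sampling standard error of 0.45, plus an assumed extra variance of 0.15
## attributed to the disclosure control perturbation.
effect      <- 1.0
se_sampling <- 0.45
var_privacy <- 0.15

## Interval acknowledging sampling (aleatory) uncertainty only
effect + c(-1.96, 1.96) * se_sampling
#> roughly ( 0.12, 1.88) -- excludes zero

## Interval after also propagating the privacy-induced (epistemic) variance
se_total <- sqrt(se_sampling^2 + var_privacy)
effect + c(-1.96, 1.96) * se_total
#> roughly (-0.16, 2.16) -- now includes zero
```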

As it happens, most of my engagement with this subject has been in trying to reconstitute multi-way census tables. I know full well this is impossible (because the disclosure control methods make it so). Equally, I know there are methods such as Bonferroni bounds which let you put bounds on the range of count values that any given cell could take (strictly speaking these should be called Bonferroni–Fréchet–Hoeffding bounds, as several workers discovered them independently). But for myself, I’ve always preferred the idea of releasing multiple versions of a table, which allows the uncertainty to be propagated after you’ve applied a method such as iterative proportional fitting to reconstitute the full table from the released micro-data and local census tables.
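For a two-way table those bounds are easy to compute from the margins alone: for a cell with row total \(r\), column total \(c\) and grand total \(N\), the count must lie between \(\max(0, r + c - N)\) and \(\min(r, c)\). A minimal sketch in R follows, with invented totals.

```r
## Fréchet-style bounds on a single suppressed cell of a two-way table,
## computed from its margins (all totals below are invented).
cell_bounds <- function(row_total, col_total, grand_total) {
  c(lower = max(0, row_total + col_total - grand_total),
    upper = min(row_total, col_total))
}

## e.g. 120 cyclists, 85 women and 150 respondents in a hypothetical area
cell_bounds(row_total = 120, col_total = 85, grand_total = 150)
#> lower upper
#>    55    85
```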

The long and the short of this problem is that there is a trade-off between protecting individual privacy and preserving statistical accuracy for researchers. The more noise you add, the less likely it is that individuals can be identified from their data. But the more noise you add, the further the data are from the “truth”. The big problem when working with tables is that permuting the cell contents may weaken the observed relationships between variables.

There are two “classical” methods for ensuring data privacy. The first is cell suppression: you never allow any query if the conditions of that query are met by only one record in the database. The problem with this rule is that you can attack it using set differencing. Suppose I can’t get a cross-tabulation of age, sex and commute method because there is a single female cyclist in a given age band in a given census output area. So I query how many female cyclists there are in that output area for all ages, then for all ages excluding the band of interest, and the difference tells me there is a single respondent. A popular alternative is data swapping: a subset of records is taken and swapped with similar records from elsewhere. So I may think I have identified an output area with a single female cyclist in a given age band; the truth is that I don’t know whether she was swapped in from another area. At a higher level the numbers are consistent, but as my querying becomes more granular the noise may be a bigger component of the counts. There is a well-established R package, sdc, which implements classical disclosure control methodology. The ONS provides extensive guidance on how to apply disclosure control methods, and a common method was chosen for the 2011 census by all UK census authorities.
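Here is a toy version of that set-differencing attack in R, run against an invented micro-dataset (every value below is hypothetical). Each aggregate query matches more than one record, so neither would be suppressed under the rule above, yet their difference isolates a single person.

```r
## Invented micro-data for one output area
census <- data.frame(
  sex     = c("F", "F", "F", "M", "M", "F", "F"),
  age     = c(34, 51, 28, 41, 33, 62, 45),
  commute = c("cycle", "cycle", "car", "cycle", "bus", "cycle", "bus")
)

## Query 1: female cyclists of any age
q_all <- sum(census$sex == "F" & census$commute == "cycle")

## Query 2: the same, but excluding the 30-39 age band
q_excl <- sum(census$sex == "F" & census$commute == "cycle" &
                !(census$age >= 30 & census$age < 40))

q_all           # 3 -- more than one record, so not suppressed
q_excl          # 2 -- likewise not suppressed
q_all - q_excl  # 1 -- the difference pins down a single respondent
```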

Anyway, life moves on and of course the state of the art in ensuring data privacy when releasing data nowadays seems to be Differential Privacy. The DiffPriv package provides differential privacy methods to complement the more traditional disclosure control tools. Despite being the current fashion, the core idea is an old and widely used technique: just add some random noise. For researchers with simple univariate analyses to conduct, the privacy-modified data should have an average value close to the true value. Unlike data swapping, however, the noise is typically added to the final query result rather than to the raw data. Also, unlike traditional disclosure control methods, queries are run repeatedly on the source data and a different perturbation may be applied each time; with more conventional disclosure control, a perturbed set of results is created once and then released to the public. Moreover, under Differential Privacy a perturbation method must be developed for every statistical method, whereas with more traditional disclosure control, once the data have been perturbed they can be used for any kind of analysis.
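For a counting query, the canonical way of adding that noise is the Laplace mechanism. The sketch below is a hand-rolled illustration, not the DiffPriv API: a count has sensitivity one, because adding or removing a single person changes it by at most one, so noise is drawn from a Laplace distribution with scale \(1/\epsilon\), where \(\epsilon\) is the privacy parameter defined formally below, and fresh noise is drawn every time the query is answered.

```r
## Draw from a Laplace(0, scale) distribution by inverting its CDF
rlaplace <- function(n, scale) {
  u <- runif(n, -0.5, 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

## Differentially private answer to a counting query (sensitivity = 1)
dp_count <- function(true_count, epsilon) {
  true_count + rlaplace(1, scale = 1 / epsilon)
}

## The same query answered five times gets five different perturbed answers;
## a conventional disclosure-controlled table would be perturbed only once.
set.seed(1)
replicate(5, dp_count(true_count = 129, epsilon = 0.5))
```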

Differential Privacy is defined in probabilistic terms:

Let \(Q()\) be a randomised query mechanism, and let \(DF\) and \(DF'\) be any two datasets that differ in a single record. Then \(Q()\) is \(\epsilon\)-differentially private if, for any set \(\Omega\) of outputs that can be created by applying \(Q()\) to such a dataset,

$$P(Q(DF) \in \Omega) \leq \exp(\epsilon) \times P(Q(DF') \in \Omega)$$

The smaller we set \(\epsilon\), the stronger the privacy guarantee.

An example (taken from the excellent Differential Privacy: A Primer for a Non-Technical Audience) is as follows:

Consider computing an estimate of the number of HIV-positive individuals in a sample, where the sample contains \(n = 10{,}000\) individuals of whom \(m = 38\) are HIV-positive. In a differentially private version of the computation, random noise \(Y\) is introduced into the count so as to hide the contribution of a single individual. That is, the result of the computation would be \(m' = m + Y = 38 + Y\) instead of \(m = 38\).
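Continuing the hand-rolled sketch above (the rlaplace() helper is my own illustration, not code from the primer), the released value is \(m' = 38 + Y\) with \(Y\) drawn from a Laplace distribution of scale \(1/\epsilon\); the smaller the \(\epsilon\), the wider the noise and the stronger the privacy.

```r
m <- 38
set.seed(42)
round(m + rlaplace(5, scale = 1 / 1.0), 1)  # epsilon = 1: answers stay near 38
round(m + rlaplace(5, scale = 1 / 0.1), 1)  # epsilon = 0.1: answers stray far from 38, possibly below zero
```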

To me, it seems we have several remaining problems. These include the use of Laplace noise to perturb count values: if the noise makes a count negative, you then need to set it to zero, which adds an additional post-processing step and biases small counts upwards. But I also note that we are talking about univariate analyses. I’ve been almost entirely concerned with the associations between categorical variables (using a variety of log-linear models, graphical models or, more recently, graphical causal inference models to provide analytic output). This kind of noise addition is going to reduce the apparent association between categorical variables in published tables. Given that my interests tend to be along the lines of “what is the association between ethnicity and Covid-19 mortality, conditional on a number of other important factors such as age, sex, occupation and so on”, I really don’t want to be given data subject to a Differential Privacy technique which reduces the strength of that association. And ideally, I would like to be able to estimate the epistemic uncertainty associated with the data privacy perturbations. We accept random sampling in statistics and quantify the associated errors; why aren’t we looking for privacy-preserving methods that help us quantify the uncertainty they add to an analysis?
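A quick simulation, again reusing the hand-rolled rlaplace() helper from earlier and an invented small count, illustrates the truncation issue: the raw Laplace mechanism is unbiased on average, but clamping negative releases at zero drags the mean upwards.

```r
true_count <- 2
eps        <- 0.1                  # strong privacy: noise scale 1/eps = 10
set.seed(7)
noisy <- true_count + rlaplace(1e5, scale = 1 / eps)

mean(noisy)           # close to 2: the unclamped mechanism is unbiased
mean(pmax(noisy, 0))  # well above 2: clamping at zero biases small counts upwards
```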

In summary, I still like the idea of being supplied with multiple versions of a table in a way that preserves privacy but lets me quantify the uncertainty these kinds of procedures introduce. This gets talked about constantly in statistical circles. I have to say, I’m not convinced Differential Privacy is the last word on the matter.
