Paul Ohm recently put out an article where he makes the dramatic claim that de-identification has failed (see http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006). I have heard that argument before and the argument’s primary weakness is amplified in this article – therefore I feel compelled to comment.
Paul Ohm’s argument about the failure of anonymization is based on evidence that does not actually support his point. Therefore, his overall argument about de-identification is very questionable. Below I will explain why.
The key point of his argument is that existing re-identification successes demonstrate that de-identification does not work. This, of course, assumes that the datasets that were re-identified were properly anonymized – they were not. One example that Ohm uses to make his case is the insurance database released in Massachusetts more than a decade ago (pre-HIPAA). That database was not properly anonymized, and no professional working in this field would say that it was. The Group Insurance Commission did a lousy job. The second example is AOL, which again is an example of a database that was not properly anonymized. AOL did a lousy job of anonymizing its database. In fact, the examples he cites are cases where the custodian did not use existing re-identification risk measurement techniques and did not use the de-identification techniques that are available in the literature. We know how to de-identify datasets properly (up to a pre-specified threshold), and in none of those examples was this done. There is no example of a properly de-identified database being re-identified.
So I want to make a distinction between lousy practice and good practice. Being a software engineer in a previous life, I will use a software example. There are different levels of maturity in software development: lousy and good. We measure project risk using a maturity scale. If I see a couple of software projects that produce buggy software and do not deliver on time, I would not conclude that all software development is lousy and therefore software engineering is dead and should be abandoned – which is what Ohm’s reasoning would lead me to. It just happened that I selected a couple of low maturity projects, and if I had selected high maturity projects I would have a very different picture.
Ohm has taken examples of poorly de-identified datasets that were re-identified and drawn broad conclusions from them. A truly sophisticated custodian would measure the risk of re-identification (see http://www.jamia.org/cgi/content/abstract/15/5/627 and the references therein for examples), and if it is too high then the custodian would use a contemporary de-identification technique to de-identify the data (see http://www.jamia.org/cgi/content/abstract/M3144v1 and the references therein for examples).
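To make the "measure the risk, then compare it to a threshold" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method from the papers cited above: it assumes risk is taken as one over the size of the smallest group of records sharing the same quasi-identifier values (a simple equivalence-class measure), and the function name, the field names, the sample records and the 0.2 threshold are all hypothetical.

```python
from collections import Counter


def max_reidentification_risk(records, quasi_identifiers):
    """Return the worst-case risk, taken here as 1 / (smallest equivalence class).

    Assumption for illustration: an equivalence class is the set of records
    that share the same values on every quasi-identifier.
    """
    if not records:
        raise ValueError("cannot measure risk on an empty dataset")
    # Count how many records fall into each combination of quasi-identifier values.
    class_sizes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return 1.0 / min(class_sizes.values())


# Hypothetical sample data: three records are indistinguishable, one is unique.
records = [
    {"zip": "021", "birth_year": 1970, "sex": "F"},
    {"zip": "021", "birth_year": 1970, "sex": "F"},
    {"zip": "021", "birth_year": 1970, "sex": "F"},
    {"zip": "022", "birth_year": 1985, "sex": "M"},
]

THRESHOLD = 0.2  # hypothetical pre-specified threshold chosen by the custodian

risk = max_reidentification_risk(records, ["zip", "birth_year", "sex"])
needs_more_deidentification = risk > THRESHOLD
print(f"max risk = {risk:.2f}, needs more de-identification: {needs_more_deidentification}")
```

In this toy dataset the fourth record is unique on its quasi-identifiers, so the measured risk exceeds the threshold and the custodian would generalize or suppress values before disclosure, then measure again.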
If a custodian discloses a dataset that has proper identity and attribute disclosure control (i.e., the risk of re-identification is below a threshold), and an intruder demonstrates that the risk of re-identification is higher than the threshold, then there should be concern. Ohm's article does not demonstrate that at all. However, a valid conclusion from the article would be that if you do lousy de-identification, then the data is easy to re-identify.
Therefore, extreme caution is advised before drawing broad conclusions about de-identification from this article.