
August 19, 2009

Comments

Thank you for reading the paper and for your helpful comments. I must respectfully disagree with your conclusions. Of the five or six examples I cite in my paper, you cite only two, and you neglect to mention the studies by Vitaly Shmatikov, Arvind Narayanan, and Justin Brickell, all of which demonstrate re-identification attacks that worked on data I believe you would call "properly de-identified."

Second, you neglect to mention at all the theoretical work I cite by Shmatikov, Brickell, and Dwork that demonstrates general limits of de-identification.

Third, your failure to cite Shmatikov and Brickell is especially curious because you then point to a paper in JAMIA about k-anonymity and you refer to this as an example of what a "truly sophisticated custodian" would do. The research by Shmatikov and Brickell does a fairly good job showing the significant (some say fatal) limits of k-anonymity.

So thank you once again for the response. One of my most sincere desires is that I can use this paper as a vehicle to reach out to communities in health privacy and medical informatics, so I greatly appreciate the opportunity.

Paul raises a number of valid points in his response, so I will address them below.

I did not comment on the other studies originally because that would have introduced a whole new set of issues into the discussion, but here goes. I must admit at the outset a bias towards health information; that is the lens I use.

In terms of other examples of re-identification not in Paul Ohm’s paper, which are mentioned in our k-anonymity paper (see http://www.jamia.org/cgi/content/abstract/15/5/627), there are the Chicago homicide database, the Illinois Department of Public Health, and the Canadian adverse event database and the CBC. There is also another one, involving a prescription database, mentioned in our IEEE S&P paper (see http://www2.computer.org/portal/web/csdl/doi/10.1109/MSP.2009.47).

So there are many examples of re-identification. All of these were due to inappropriate de-identification of the data before disclosure. So they are all consistent with my earlier point.

The Netflix example is a bit different because it pertains to a ‘transaction’, or very high dimensional, dataset. This presents its own set of difficulties. There are datasets that look like that in a medical context as well (e.g., when one considers diagnoses and drugs). Recently there has been quite a bit of activity on developing techniques for assessing risk and de-identifying transactional datasets, and there is some unpublished work on this in the healthcare context that should be coming out within the next 12 months or so (by our group and others). Therefore, the main point about lousy de-identification before releasing data holds in that example as well.

So, I would not argue that any of these examples represent a “truly sophisticated custodian” at all (at least within the context of the examples).

The Brickell work is very interesting, but it is not the last word on the issue. Here are a few considerations. All of the examples of re-identification that I cited above are instances of identity disclosure, so one can argue that identity disclosure is really what matters, because that is what today’s intruders are doing. The Brickell paper measured risk in terms of attribute disclosure. De-identification criteria like k-anonymity do not address attribute disclosure, so naturally k-anonymity algorithms will not perform well on a criterion they were never designed to meet. The most commonly used value is k = 5, and they did not really examine that case. Of course, it would be nice to see the same results on multiple datasets rather than a single one. We have been doing this for a few years, and our de-identified datasets have been used by researchers, commercial entities, and policy makers (see http://www.cjhp-online.ca/index.php/cjhp/article/view/812 for a recent example). Also, one can quibble with the way the researcher vs. attacker variable selection / workload is defined.
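To make the identity vs. attribute disclosure distinction concrete, here is a toy sketch (entirely invented data, not drawn from any of the studies under discussion): k-anonymity guarantees that every quasi-identifier combination covers at least k records, so no individual can be singled out by those attributes alone, yet a group can still be homogeneous in its sensitive attribute and leak it anyway.

```python
from collections import Counter

# Invented toy records: (age range, postal prefix, sensitive diagnosis).
records = [
    ("45-50", "K1A", "heart disease"),
    ("45-50", "K1A", "heart disease"),
    ("45-50", "K1A", "heart disease"),
    ("30-35", "K2B", "flu"),
    ("30-35", "K2B", "diabetes"),
    ("30-35", "K2B", "asthma"),
]

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination (all fields except the
    last, sensitive one) appears at least k times: identity disclosure
    by those attributes alone is limited."""
    counts = Counter(r[:-1] for r in rows)
    return all(c >= k for c in counts.values())

def homogeneous_groups(rows):
    """Groups whose sensitive attribute is uniform: despite k-anonymity,
    anyone known to be in such a group has their attribute disclosed."""
    groups = {}
    for r in rows:
        groups.setdefault(r[:-1], set()).add(r[-1])
    return {g: v for g, v in groups.items() if len(v) == 1}

print(is_k_anonymous(records, 3))   # the dataset is 3-anonymous
print(homogeneous_groups(records))  # yet the first group leaks its diagnosis
```

The first group satisfies k = 3, yet every 45-50-year-old in K1A is revealed to have heart disease; that is exactly the gap between what k-anonymity promises and what attribute-disclosure criteria measure.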

In practice, de-identification has to be included in a more general risk assessment framework, which is similar to the conclusion that Paul Ohm reaches (although I would keep de-identification as part of the framework). We have developed such a framework (see http://www.ehealthinformation.ca/documents/SecondaryUseFW.pdf) which includes motive, invasion of privacy, and the security and privacy practices of the recipient. This can serve as a starting point for a discussion.

Khaled,

Thank you for the detailed response. As I read it, your response lends much more support to my paper than refutation of it. Let me try to summarize what you have said:

1. I could have cited three other examples of re-identification in my paper, which would have piled on support for my point.

2. The Netflix study involves transaction data, which is directly applicable to medical privacy questions...

3. ...but unpublished studies will demonstrate that we might have new techniques to protect these. Maybe.

4. So although I have pointed to a half-dozen examples of sophisticated, well-resourced companies and government agencies performing woeful anonymization, this demonstrates only that lots of people do a lousy job anonymizing, most of the time.

5. The theoretical work demonstrates that k-anonymity does not work well against attribute disclosure; it does work well against identity disclosure.

[My response to this one: why should this distinction matter to policymakers? If attribute disclosure reduces entropy, which can be used to destroy identity, shouldn't we consider this a huge flaw that deserves a regulatory response? Isn't it as if you are arguing, "who cares that we can destroy privacy by looking through the windows, aren't you impressed by how hard it is to look through the door?"]

6. You agree completely with my prescription: a nuanced risk assessment.

So, in summary, we agree about almost everything, and on the small things on which we might disagree, you principally point to unpublished studies. I'm very happy that we agree about so much.

Why then on another blog (http://www.emergentchaos.com/archives/2009/08/new_on_ssrn.html) do you say, "the case is not as strong as it initially seems"?

And thank you for the link to your framework. I had not seen it, and I will be sure to incorporate it into my next draft. It is impressive, and it once again shows how close we are on this topic.

Paul, I do apologize if I was not very clear.

I think my main point remains that the examples you mention in your paper, plus the ones that I mention, do not support the main title of your paper (and the main thesis of the first half or so). They are not examples of anonymized data being re-identified. They are examples of data that have not been anonymized in any meaningful way being re-identified, which I think leads to a very different conclusion. That is really the key point.

One alternative conclusion from these half-dozen examples would be the need for better guidelines and enforcement of best practices for de-identification: clearly, many people are disclosing data without proper de-identification, and when they do, re-identification is easy.

Regarding the de-identification of transaction data (such as Netflix), there are at least half a dozen different techniques published in the computer science literature on how to de-identify such data already. The unpublished work pertains to applying some of those ideas to health data.
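To give a flavour of the basic idea behind such techniques (a deliberately crude sketch with invented data, not any specific published method): in a high-dimensional transaction dataset, rare items act like quasi-identifiers, so one naive mitigation is to suppress any item shared by fewer than k individuals.

```python
from collections import Counter

# Invented toy transaction data, e.g. the set of drugs per patient.
transactions = [
    {"aspirin", "insulin"},
    {"aspirin", "rare_drug_x"},
    {"aspirin", "insulin"},
    {"insulin"},
]

def suppress_rare_items(txns, k):
    """Crude sketch: drop every item held by fewer than k individuals,
    since rare items can single a person out in high-dimensional data.
    (Real techniques are far more sophisticated; this only conveys the idea.)"""
    counts = Counter(item for t in txns for item in t)
    keep = {item for item, c in counts.items() if c >= k}
    return [t & keep for t in txns]

cleaned = suppress_rare_items(transactions, 2)
print(cleaned)  # "rare_drug_x", held by one person, is suppressed
```

This trades data utility (the second transaction loses an item) for a bound on how distinctive any released transaction can be; the published work is largely about making that trade-off far less blunt.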

I cannot tell you how many times a custodian has told me that a dataset was anonymized, only for close inspection to reveal that they had basically just removed the names. So claims that a dataset is anonymized, without concrete evidence of what was done and how, are, to me at least, meaningless.

Re your response to point 5: fair enough. But the issue around this point is which technique is best, and we can have a much longer conversation on that one at some future point. It still does not support the argument about anonymization failures, as there are techniques to protect against attribute disclosure as well.

(I am using “de-identification” and “anonymization” interchangeably – it is Saturday and I should really be out playing with the kids, so I am taking shortcuts today)

Thank you Khaled. This has been a very fruitful exchange, and I hope it helps others understand the nature of this debate better. I have the feeling you and I will have many other chances to discuss and debate this topic in the coming months and years. I look forward to it!

I, too, should be playing with my kids on this Saturday, but unfortunately, I'm stuck in my office instead!
