Best Practices

Data Scrubbing: Is there a right way?

An article yesterday from Ars Technica got me wondering. In my former position we often “scrubbed” databases for sample data from which to work. And certainly one can see the value in working with data stripped of personally identifiable information for the purposes of business or health-care informatics, service level determinations, quality of service surveys, and so on. Yet, according to a study at Carnegie Mellon University:

87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.

This seems to be the tipping point: three pieces of non-anonymized data are enough to uniquely identify the majority of the population. (Think, “Three Strikes! You’re out!”) So what do we do when we need solid, anonymous data from which to work?

Taking the example of the health records from the article linked above, I would think that the following steps would be enough to scrub the data to the point where “reidentification” is no longer feasible. Since these are medical records we can safely (I feel) assume that randomizing the gender is a non-starter. (“Wow, according to this data 14% of all men went to the Emergency Room with pregnancy-related complications!”)

And since this data is taken from an area where the ZIP codes are known, we are already at two strikes. So why did they not randomize the dates of birth? It would be difficult to do in the case of infants beyond a few days or weeks, since many of their health issues are related to their age in months. But for anyone over the age of 8 it should be simple enough to randomize the month and day of birth, and allow a set of ranges for randomizing the year of birth. If we assume a 20% range up or down, we gain a latitude of possible years of birth which increases the older the patient actually is: a 40-year-old’s year of birth could land anywhere in a sixteen-year window, while an 80-year-old’s spans thirty-two years. Another possibility is to give everyone the same date of birth, differing only in the year (Jan 1, xxxx).
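As a rough sketch of the idea, here is what both variants might look like in Python. The function names and the 20% spread parameter are my own illustration, not anything from the article:

    import random
    from datetime import date

    def scrub_dob(dob, spread=0.2):
        """Randomize a date of birth, preserving only approximate age.

        The month and day are fully randomized, and the year is shifted
        by up to +/- spread (20% here) of the patient's age, so the
        window of possible birth years grows with age.
        """
        age = date.today().year - dob.year
        jitter = max(1, int(age * spread))
        year = dob.year + random.randint(-jitter, jitter)
        # Stick to days 1-28 so we never produce an invalid date.
        return date(year, random.randint(1, 12), random.randint(1, 28))

    def scrub_dob_jan1(dob, spread=0.2):
        """The simpler variant: everyone gets January 1st, jittered year."""
        age = date.today().year - dob.year
        jitter = max(1, int(age * spread))
        return date(dob.year + random.randint(-jitter, jitter), 1, 1)

Note that for an eight-year-old the jitter works out to a single year, which matches the intuition that age matters more precisely for the young.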

This of course means that any reporting done on age is meaningless, but it also means that the data can more safely be widely distributed. In cases where exact age and gender are required for study it would be better to merge data from many different areas, covering as many cities, counties, states and regions as possible. In this case we would still need to weigh the risks, since all three pieces of data would still be available, although reidentification would require a much higher level of trial and error. In the case mentioned by Ars Technica the study covered seven ZIP codes. Perhaps spreading the information over a few hundred would make it much less worth the effort to sort through them all to try to identify individuals, and even then one would expect multiple possible hits.
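One way to weigh that risk before release would be to count how many records share each {ZIP, gender, year of birth} combination; anything unique is a reidentification candidate. A minimal sketch, assuming records are simple dictionaries with field names I have made up:

    from collections import Counter

    def reidentification_candidates(records, keys=("zip", "gender", "birth_year")):
        """Return the records whose quasi-identifier triple is unique."""
        counts = Counter(tuple(r[k] for k in keys) for r in records)
        return [r for r in records if counts[tuple(r[k] for k in keys)] == 1]

    sample = [
        {"zip": "15213", "gender": "F", "birth_year": 1962},
        {"zip": "15213", "gender": "F", "birth_year": 1962},
        {"zip": "15217", "gender": "M", "birth_year": 1948},
    ]
    print(reidentification_candidates(sample))  # only the 15217 record is unique

The more records that share each triple, the more “possible hits” an attacker is left sorting through, which is exactly the effect spreading the data over hundreds of ZIP codes is meant to produce.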

The need for real data for statistical analysis and study is not going to go away. When you are considering scrubbing data to release a “sanitized” version, it would be good to keep the mantra “Three Strikes! You’re out!” in mind. When it comes to data for testing software operation, however, I still think the better method is complete randomization: totally bogus data that has the look and feel of “real” data. (Which is no doubt why all the bogus users in my test dbs live in different cities, in different states, and at addresses ranging from 123 to 999 on Any Street!)
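For the curious, generating that kind of bogus-but-plausible data takes only a few lines. The name and city pools below are invented for illustration; a real test suite would use much bigger pools or a library like Faker:

    import random

    FIRST_NAMES = ["Alice", "Bob", "Carmen", "Dmitri", "Erin"]
    LAST_NAMES = ["Smith", "Nguyen", "Garcia", "Olson", "Patel"]
    CITIES = [("Springfield", "IL"), ("Riverton", "WY"), ("Fairview", "TN")]

    def bogus_user():
        """Build one completely fictional user record for test databases."""
        city, state = random.choice(CITIES)
        return {
            "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
            "address": f"{random.randint(123, 999)} Any Street",  # the classic range
            "city": city,
            "state": state,
        }

    print(bogus_user())

Since every field is drawn from a fiction, there is nothing to reidentify, which is exactly why I prefer this route for software testing over scrubbing production data.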