Blog by Simon Frey

k-anonymity

As you might already have noticed, I am in favor of strong privacy technology to counter the surveillance of our current age. One pillar of achieving this is to get developers thinking about it and to teach methods for building privacy-protecting systems.

In this article you will learn about a theoretical technique called k-anonymity, which obfuscates the data you collected in order to protect the people it describes.

Usage Scenario

Let’s start with the typical example scenario in this space:
A hospital collects data about their patients in order to help research gather insights about diseases and in the end serve their patients even better.
Let’s assume we have the following (intentionally tiny) data set (Source: Wikipedia):

| Name | Age | Gender | State of domicile | Religion | Disease |
| --- | --- | --- | --- | --- | --- |
| Ramsha | 30 | Female | Tamil Nadu | Hindu | Cancer |
| Yadu | 24 | Female | Kerala | Hindu | Viral infection |
| Salima | 28 | Female | Tamil Nadu | Muslim | TB |
| Sunny | 27 | Male | Karnataka | Parsi | No illness |
| Joan | 24 | Female | Kerala | Christian | Heart-related |
| Bahuksana | 23 | Male | Karnataka | Buddhist | TB |
| Rambha | 19 | Male | Kerala | Hindu | Cancer |
| Kishor | 29 | Male | Karnataka | Hindu | Heart-related |
| Johnson | 17 | Male | Kerala | Christian | Heart-related |
| John | 19 | Male | Kerala | Christian | Viral infection |

I assume we agree that it is not a good idea to expose the patients and their diseases by handing this data to researchers as it is. Doing so would violate the patients' privacy and could cause problems for them and for the hospital. (GDPR to the rescue ;))

This is the moment where k-anonymity comes into play.

Definition k-anonymity

Wikipedia states: *“A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k − 1 individuals whose information also appear in the release.”*

We get the following insights from this definition:

To achieve 2-anonymity (k=2), the data set needs to be obfuscated (by some method) so that for every entry there is at least one other entry that looks identical. For 3-anonymity there must be at least two others, and so on.
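
To make the definition a bit more tangible, here is a minimal Python sketch of such a check. The function name `is_k_anonymous` and the toy `release` are my own assumptions for illustration. Each tuple holds only the identifying attributes of one already obfuscated entry.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every record is identical to at least k - 1 others."""
    counts = Counter(records)  # how often each distinct record occurs
    return all(n >= k for n in counts.values())


# Hypothetical toy release, not the hospital data from this article.
release = [
    ("Age ≤ 20", "Male"),
    ("Age ≤ 20", "Male"),
    ("20 < Age ≤ 30", "Female"),
    ("20 < Age ≤ 30", "Female"),
    ("20 < Age ≤ 30", "Female"),
]

print(is_k_anonymous(release, 2))  # True: every entry has at least one twin
print(is_k_anonymous(release, 3))  # False: the "Male" group has only two entries
```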

Obfuscation

The best protection would be to remove as many attributes as possible and only hand out a minimal amount of data. Although this protects the privacy of the patients, it would render the data useless for research. We have to find the sweet spot between patient privacy and research interest by obfuscating just enough information to effectively protect the patients' privacy.

There are two common obfuscation methods for achieving k-anonymity: generalization, which replaces exact values with broader categories, and suppression, which removes the values of an attribute entirely.

Obfuscation applied

Generalization

Our data set would look like the following if we apply generalization to the age attribute with the two categories “Age ≤ 20” and “20 < Age ≤ 30”:

| Name | Age | Gender | State of domicile | Religion | Disease |
| --- | --- | --- | --- | --- | --- |
| Ramsha | 20 < Age ≤ 30 | Female | Tamil Nadu | Hindu | Cancer |
| Yadu | 20 < Age ≤ 30 | Female | Kerala | Hindu | Viral infection |
| Salima | 20 < Age ≤ 30 | Female | Tamil Nadu | Muslim | TB |
| Sunny | 20 < Age ≤ 30 | Male | Karnataka | Parsi | No illness |
| Joan | 20 < Age ≤ 30 | Female | Kerala | Christian | Heart-related |
| Bahuksana | 20 < Age ≤ 30 | Male | Karnataka | Buddhist | TB |
| Rambha | Age ≤ 20 | Male | Kerala | Hindu | Cancer |
| Kishor | 20 < Age ≤ 30 | Male | Karnataka | Hindu | Heart-related |
| Johnson | Age ≤ 20 | Male | Kerala | Christian | Heart-related |
| John | Age ≤ 20 | Male | Kerala | Christian | Viral infection |
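
In code, this kind of generalization is little more than a mapping from an exact value to its category. Here is a small Python sketch using the two categories from the table above. The function name `generalize_age` is an assumption of mine, and rejecting ages above 30 is simply my choice for this tiny data set.

```python
def generalize_age(age):
    """Map an exact age to one of the two categories used in this article."""
    if age <= 20:
        return "Age ≤ 20"
    if age <= 30:
        return "20 < Age ≤ 30"
    # The example only defines two categories, so anything else is an error.
    raise ValueError(f"no category defined for age {age}")


# The ages from the example data set, in table order.
ages = [30, 24, 28, 27, 24, 23, 19, 29, 17, 19]
generalized = [generalize_age(age) for age in ages]

print(generalized.count("20 < Age ≤ 30"))  # 7
print(generalized.count("Age ≤ 20"))       # 3
```

How wide you make the categories is the actual design decision: wider categories protect better but tell researchers less.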

Suppression

Our data set would look like the following if we apply suppression to the Name and the Religion attributes:

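| Name | Age | Gender | State of domicile | Religion | Disease |
| --- | --- | --- | --- | --- | --- |
| | 30 | Female | Tamil Nadu | | Cancer |
| | 24 | Female | Kerala | | Viral infection |
| | 28 | Female | Tamil Nadu | | TB |
| | 27 | Male | Karnataka | | No illness |
| | 24 | Female | Kerala | | Heart-related |
| | 23 | Male | Karnataka | | TB |
| | 19 | Male | Kerala | | Cancer |
| | 29 | Male | Karnataka | | Heart-related |
| | 17 | Male | Kerala | | Heart-related |
| | 19 | Male | Kerala | | Viral infection |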
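
Suppression is just as simple to sketch in Python. The helper `suppress` and the dictionary layout below are assumptions for illustration, and I only include the first two patients to keep it short. Suppressed values become empty strings here; any fixed placeholder would work as well.

```python
def suppress(records, attributes):
    """Return copies of the records with the given attributes blanked out."""
    return [
        {key: ("" if key in attributes else value) for key, value in record.items()}
        for record in records
    ]


# First two rows of the example data set.
patients = [
    {"Name": "Ramsha", "Age": 30, "Gender": "Female",
     "State": "Tamil Nadu", "Religion": "Hindu", "Disease": "Cancer"},
    {"Name": "Yadu", "Age": 24, "Gender": "Female",
     "State": "Kerala", "Religion": "Hindu", "Disease": "Viral infection"},
]

released = suppress(patients, {"Name", "Religion"})
print(released[0]["Name"] == "")  # True: the name is gone
print(released[0]["Age"])         # 30: untouched attributes stay as they are
```

The function returns new dictionaries instead of changing the input, so the original records stay available inside the hospital.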

Combined

Most of the time both methods are used in combination. Combining the two examples, our data set would look like the following:

| Name | Age | Gender | State of domicile | Religion | Disease |
| --- | --- | --- | --- | --- | --- |
| | 20 < Age ≤ 30 | Female | Tamil Nadu | | Cancer |
| | 20 < Age ≤ 30 | Female | Kerala | | Viral infection |
| | 20 < Age ≤ 30 | Female | Tamil Nadu | | TB |
| | 20 < Age ≤ 30 | Male | Karnataka | | No illness |
| | 20 < Age ≤ 30 | Female | Kerala | | Heart-related |
| | 20 < Age ≤ 30 | Male | Karnataka | | TB |
| | Age ≤ 20 | Male | Kerala | | Cancer |
| | 20 < Age ≤ 30 | Male | Karnataka | | Heart-related |
| | Age ≤ 20 | Male | Kerala | | Heart-related |
| | Age ≤ 20 | Male | Kerala | | Viral infection |

Evaluating k-anonymity

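To determine the degree of k-anonymity, we count how many entries are indistinguishable from each other when we look at every attribute except the disease. The disease itself is left out of the comparison, as the goal of our anonymization is to not disclose which person has which disease.

Let’s do this for the four data sets shown above: the original data, the generalization example, the suppression example and the combined example.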

The data set as a whole always gets the lowest k of any of its entries, even if some entries are protected much better. Thus, even though the suppression example contains some 2-anonymous entries, the data set as a whole is still only 1-anonymous, because unique entries remain. Even if only a single unique entry were left and all others fell into higher classes, the data set would stay 1-anonymous.

Only the last example, combining generalization and suppression, offers 2-anonymity, because the weakest protection any of its entries gets is 2-anonymity.
As a result, no person can be linked to a single disease, as there are always at least two entries with the same identifying attributes.
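
If you want to reproduce this count yourself, the following Python sketch runs it over all four versions of the example data. The helper names (`generalize`, `suppress`, `group_sizes`, `k_anonymity`) are assumptions of mine and not taken from any library. The disease is the last column and is excluded from the grouping.

```python
from collections import Counter

# Columns: Name, Age, Gender, State of domicile, Religion, Disease
PATIENTS = [
    ("Ramsha", 30, "Female", "Tamil Nadu", "Hindu", "Cancer"),
    ("Yadu", 24, "Female", "Kerala", "Hindu", "Viral infection"),
    ("Salima", 28, "Female", "Tamil Nadu", "Muslim", "TB"),
    ("Sunny", 27, "Male", "Karnataka", "Parsi", "No illness"),
    ("Joan", 24, "Female", "Kerala", "Christian", "Heart-related"),
    ("Bahuksana", 23, "Male", "Karnataka", "Buddhist", "TB"),
    ("Rambha", 19, "Male", "Kerala", "Hindu", "Cancer"),
    ("Kishor", 29, "Male", "Karnataka", "Hindu", "Heart-related"),
    ("Johnson", 17, "Male", "Kerala", "Christian", "Heart-related"),
    ("John", 19, "Male", "Kerala", "Christian", "Viral infection"),
]


def generalize(row):
    """Replace the exact age with one of the two age categories."""
    name, age, gender, state, religion, disease = row
    age_range = "Age ≤ 20" if age <= 20 else "20 < Age ≤ 30"
    return (name, age_range, gender, state, religion, disease)


def suppress(row):
    """Blank out the Name and Religion attributes."""
    _, age, gender, state, _, disease = row
    return ("", age, gender, state, "", disease)


def group_sizes(rows):
    """Sorted sizes of the groups of rows that agree on everything but the disease."""
    return sorted(Counter(row[:-1] for row in rows).values())


def k_anonymity(rows):
    """The smallest group decides the k of the whole data set."""
    return group_sizes(rows)[0]


data_sets = {
    "original": PATIENTS,
    "generalization": [generalize(row) for row in PATIENTS],
    "suppression": [suppress(row) for row in PATIENTS],
    "combined": [suppress(generalize(row)) for row in PATIENTS],
}

for label, rows in data_sets.items():
    print(f"{label}: {k_anonymity(rows)}-anonymity")
# original: 1-anonymity
# generalization: 1-anonymity
# suppression: 1-anonymity
# combined: 2-anonymity

print(group_sizes(data_sets["suppression"]))  # [1, 1, 1, 1, 1, 1, 2, 2]
print(group_sizes(data_sets["combined"]))     # [2, 2, 3, 3]
```

The group sizes show why the weakest entry decides: the suppression example already has two groups of size two, but the six remaining unique entries keep it at 1-anonymity.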

Possible attacks & caveats

There are possible attacks on k-anonymity to re-identify the information in the data set.

Other problems are:

Conclusion

If you work with data sets, you should always keep the privacy rights of the people in the set in mind. You can use the two techniques presented here, generalization and suppression, to achieve a certain strength of k-anonymity. The process has been shown to be hard to automate and needs time and understanding. Small mistakes can result in de-anonymization.


Please keep in mind that this article is a simplified explanation. If you want to gain deeper knowledge about this topic, the following paper is a good start: https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf