
1) An introduction to Guess Who

Guess Who? Although I am sure most of you know this game from back in the day, here is a brief recap. The goal of the game: discover the name of the cartoon character selected by your opponent by asking yes-or-no questions, such as 'does the person wear a hat?' or 'does the person wear glasses?' Players eliminate candidates based on the opponent's responses and so learn attributes of their opponent's mystery character. The first player to figure out the other player's mystery character wins the game.

You got it: one identifies an individual in a dataset while having access to only the corresponding attributes. In fact, we regularly see this Guess Who concept applied in practice, but then on datasets with rows and columns containing attributes of real people. The main difference when working with data is that people tend to underestimate how easily real individuals can be unmasked with access to only a few attributes.

As the Guess Who game illustrates, someone can identify individuals by having access to only a few attributes. It serves as a simple example of why removing only ‘names’ (or other direct identifiers) from your dataset fails as an anonymization technique. In this blog, we provide four practical cases to inform you about the privacy risks associated with the removal of columns as a means of data anonymization.

2) Linkage attacks: your dataset linked to other (public) data sources

The risk of a linkage attack is the most important reason why solely removing names no longer works as an anonymization method. In a linkage attack, the attacker combines the released data with other accessible data sources in order to uniquely identify an individual and learn (often sensitive) information about this person.
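To make this concrete, below is a minimal sketch of a linkage attack in pandas. The records, column names and values are purely hypothetical; the point is simply that an ordinary join on shared quasi-identifiers re-attaches names to records that had their names removed.

```python
import pandas as pd

# "Anonymized" hospital records: names removed, quasi-identifiers kept.
hospital = pd.DataFrame({
    "zip_code": ["02138", "10027", "60614"],
    "gender": ["F", "M", "F"],
    "date_of_birth": ["1945-07-31", "1982-03-14", "1990-11-02"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

# Public voter registration list: contains names plus the same attributes.
voters = pd.DataFrame({
    "name": ["Jane Doe", "John Smith", "Mary Jones"],
    "zip_code": ["02138", "10027", "60614"],
    "gender": ["F", "M", "F"],
    "date_of_birth": ["1945-07-31", "1982-03-14", "1990-11-02"],
})

# The linkage attack: a plain join on the shared quasi-identifiers
# re-attaches names to the sensitive records.
linked = hospital.merge(voters, on=["zip_code", "gender", "date_of_birth"])
print(linked[["name", "diagnosis"]])
```

The same join works at scale: the more attributes two sources share, the more records end up uniquely matched.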

Key here is the availability of other data sources, whether they exist now or become available in the future. Think about yourself: how much of your personal data can be found on Facebook, Instagram or LinkedIn that could potentially be abused for a linkage attack?

In earlier days, the availability of data was much more limited, which partly explains why removing names was enough to preserve the privacy of individuals: less available data means fewer opportunities for linking data. However, we are now active participants in a data-driven economy, where the amount of data grows at an exponential rate. More data, together with improving technology for gathering it, will only increase the potential for linkage attacks. What will one write in ten years about the risk of a linkage attack?

Illustration 1: Exponentially growing data is a fact

Case study

Sweeney (2002) demonstrated in an academic paper how she was able to identify individuals and retrieve their sensitive medical data by linking a publicly available dataset of hospital visits to the publicly available voter registration list in the United States. Both datasets were assumed to be properly anonymized through the deletion of names and other direct identifiers.

Illustration 2: Linkage attack in practice

Based on only three attributes, (1) ZIP code, (2) gender and (3) date of birth, she showed that 87% of the entire US population could be re-identified by matching these attributes across both datasets. Sweeney then repeated her work with coarser location data and demonstrated that 18% of the entire US population could still be identified with access to only (1) county of residence, (2) gender and (3) date of birth. Think back to the public sources mentioned above, such as Facebook, LinkedIn or Instagram. Are your place of residence, gender and date of birth visible, or can other users deduce them?

Illustration 3: Sweeney's results

Quasi-identifiers                      % of US population (248 million) uniquely identified
5-digit ZIP, gender, date of birth     87%
place, gender, date of birth           53%
county, gender, date of birth          18%

This example demonstrates that it can be remarkably easy to de-anonymize individuals in seemingly anonymous data. First, the magnitude of the risk is huge: 87% of the US population could be identified using just a few characteristics. Second, the exposed medical data was highly sensitive, including ethnicity, diagnosis and medication: attributes that one would rather keep secret from, for example, insurance companies.
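If you want a feel for this risk in your own data, a simple first check is the fraction of records that are unique on a given combination of quasi-identifiers. A minimal sketch in pandas, with a hypothetical DataFrame and column names:

```python
import pandas as pd

def unique_fraction(df: pd.DataFrame, quasi_identifiers: list) -> float:
    """Fraction of records that are unique on the given quasi-identifier columns.

    A record whose combination of quasi-identifier values occurs only once
    is a prime candidate for re-identification via a linkage attack.
    """
    counts = df.groupby(quasi_identifiers).size()  # records per combination
    return float((counts == 1).sum()) / len(df)

# Hypothetical example data.
df = pd.DataFrame({
    "zip_code": ["02138", "02138", "10027", "60614"],
    "gender": ["F", "M", "M", "F"],
    "date_of_birth": ["1945-07-31", "1950-01-01", "1982-03-14", "1990-11-02"],
})

print(unique_fraction(df, ["zip_code", "gender", "date_of_birth"]))  # 1.0: every record is unique
```

The closer this fraction is to one, the more your "anonymized" dataset behaves like a list of fingerprints.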

3) Informed individuals

Another risk of removing only direct identifiers, such as names, arises when informed individuals have superior knowledge about the traits or behavior of specific individuals in the dataset. Based on this knowledge, the attacker may be able to link specific data records to actual people.

Case study

An example of an attack using such superior knowledge is the New York taxi case, in which Atockar (2014) was able to unmask specific individuals. The dataset contained all taxi journeys in New York City, enriched with basic attributes such as pick-up coordinates, drop-off coordinates, the fare and the tip.

An informed individual who knows New York was able to derive taxi trips to the adult club 'Hustler'. By filtering on the drop-off location, he deduced the exact pick-up addresses and thereby identified various frequent visitors. Similarly, one could deduce the taxi rides of anyone whose home address was known. The times and locations of several celebrities were found on gossip sites; after linking this information to the NYC taxi data, it was easy to derive their taxi rides, the amount they paid, and whether they had tipped.
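A minimal sketch of what such a filter could look like in pandas. The trip records and venue coordinates below are invented for illustration; only the filtering idea is taken from the case:

```python
import pandas as pd

# Invented subset of published trip records.
trips = pd.DataFrame({
    "pickup_lat":  [40.7580, 40.7359, 40.6892],
    "pickup_lon":  [-73.9855, -74.0036, -74.0445],
    "dropoff_lat": [40.7644, 40.7644, 40.7061],
    "dropoff_lon": [-74.0011, -74.0011, -73.9969],
    "fare":        [14.5, 9.0, 22.0],
    "tip":         [2.0, 0.0, 4.5],
})

# Placeholder coordinates for a known venue (not the real location).
venue_lat, venue_lon = 40.7644, -74.0011
tolerance = 0.0005  # roughly half a city block, in degrees

# Keep all trips that ended at the venue; the pick-up points then reveal
# where its visitors came from, for example their home addresses.
visitors = trips[
    (trips["dropoff_lat"].sub(venue_lat).abs() < tolerance)
    & (trips["dropoff_lon"].sub(venue_lon).abs() < tolerance)
]
print(visitors[["pickup_lat", "pickup_lon", "fare", "tip"]])
```

The same filter works in reverse: fix a known home address as the pick-up point and the drop-off locations reveal where that person goes.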

Illustration 4: An informed individual (drop-off coordinates near Hustler; tracked taxi trips of Bradley Cooper and Jessica Alba)

4) Data as a fingerprint

A common line of argumentation is 'this data is worthless' or 'no one can do anything with this data'. This is often a misconception. Even the most innocent data can form a unique 'fingerprint' and be used to re-identify individuals. The risk stems from the belief that the data itself is worthless, when it is not.

The risk of identification will only increase as data volumes grow and as AI and other tools and algorithms become better at uncovering complex relationships in data. Consequently, even if your dataset cannot be linked to individuals now, and is presumably useless to unauthorized persons today, it may not be tomorrow.

Case study

A great example is the case in which Netflix intended to crowdsource its R&D by launching an open competition to improve its movie recommendation system: whoever improved the collaborative filtering algorithm for predicting user ratings the most would win a prize of US $1,000,000. In order to support the crowd, Netflix published a dataset containing only the following basic attributes: user ID, movie, date of grade and grade (so no further information about the user or the film itself).

Illustration 5: Dataset structure of the Netflix Prize

UserID       Movie                 Date of grade   Grade
123456789    Mission Impossible    10-12-2008      4

In isolation, the data appeared useless. When the question 'Is there any customer information in the dataset that should be kept private?' was raised, the answer was:

 ‘No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy …’

However, Narayanan and Shmatikov (2008) from the University of Texas at Austin proved otherwise. The combination of grades, grading dates and movies of an individual forms a unique movie fingerprint. Think about your own Netflix behavior: how many people do you think watched the same set of movies? How many watched them at the same time?

The main question: how do you match this fingerprint? It turned out to be rather simple. Based on information from the well-known movie rating website IMDb (Internet Movie Database), a similar fingerprint could be constructed, and consequently individuals could be re-identified.
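A minimal sketch of the matching idea, assuming two hypothetical rating tables (one 'anonymous', one public with real names) and treating a user's set of (movie, date) pairs as the fingerprint. The actual attack by Narayanan and Shmatikov is far more robust, tolerating noisy dates and only partial overlap:

```python
import pandas as pd

# "Anonymous" ratings: only a numeric user ID.
anon = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "movie": ["Mission Impossible", "Bent", "Queer as Folk", "Mission Impossible", "Titanic"],
    "date": ["2008-12-10", "2008-12-11", "2008-12-15", "2008-12-10", "2008-12-20"],
})

# Public ratings (e.g. scraped from a review site): same behaviour, real names.
public = pd.DataFrame({
    "name": ["Alice", "Alice", "Alice", "Bob", "Bob"],
    "movie": ["Mission Impossible", "Bent", "Queer as Folk", "Mission Impossible", "Titanic"],
    "date": ["2008-12-10", "2008-12-11", "2008-12-15", "2008-12-10", "2008-12-20"],
})

def fingerprint(df, key):
    """Map each identifier to its set of (movie, date) pairs."""
    prints = {}
    for _, row in df.iterrows():
        prints.setdefault(row[key], set()).add((row["movie"], row["date"]))
    return prints

anon_fp = fingerprint(anon, "user_id")
public_fp = fingerprint(public, "name")

# Identical viewing behaviour links an anonymous ID to a named profile.
for user_id, fp in anon_fp.items():
    for name, public_set in public_fp.items():
        if fp == public_set:
            print(f"user_id {user_id} re-identified as {name}")
```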

While movie-watching behavior might not seem like sensitive information, think about your own viewing history: would you mind if it became public? Examples Narayanan and Shmatikov give in their paper are political preferences (ratings on 'Jesus of Nazareth' and 'The Gospel of John') and sexual preferences (ratings on 'Bent' and 'Queer as Folk') that could easily be distilled from the data.

5) General Data Protection Regulation (GDPR)

GDPR might not be the most exciting of blog topics, nor a silver bullet. Yet it is helpful to get the definitions straight when processing personal data. Since this blog is about the common misconception that removing columns anonymizes data, and aims to educate you as a data processor, let us start by exploring the definition of anonymization according to the GDPR.

According to Recital 26 of the GDPR, anonymized information is defined as:

‘information which does not relate to an identified or identifiable natural person or personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.’

Since one processes personal data that relates to a natural person, only the second part of the definition is relevant. To comply with the definition, one has to ensure that the data subject (the individual) is not or no longer identifiable. As shown in this blog, however, it is remarkably simple to identify individuals based on just a few attributes. So removing names from a dataset does not meet the GDPR definition of anonymization.

In conclusion

We challenged one commonly considered and, unfortunately, still frequently applied approach to data anonymization: removing names. Through the Guess Who game and four cases covering:

  • Linkage attacks
  • Informed individuals
  • Data as a fingerprint
  • General Data Protection Regulation (GDPR)

we showed that removing names fails as an anonymization technique. Although these are striking cases, each demonstrates how simple re-identification can be and how serious the impact on the privacy of individuals is.

In conclusion, removing names from your dataset does not result in anonymous data, so we should avoid using the two terms interchangeably. I sincerely hope you will not rely on this approach for anonymization. And if you still do, make sure that you and your team fully understand the privacy risks, and are permitted to accept those risks on behalf of the affected individuals.


Data is synthetic, but our team is real!

Contact Syntho and one of our experts will get in touch with you at the speed of light to explore the value of synthetic data!

  • D. Reinsel, J. Gantz and J. Rydning. The Digitization of the World from Edge to Core, Data Age 2025, 2018
  • L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (5), 2002: 557-570
  • L. Sweeney. Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh, 2000
  • P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13 (6), 2001: 1010-1027
  • Atockar. Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset, 2014
  • A. Narayanan and V. Shmatikov. Robust De-anonymization of Large Sparse Datasets. Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP), 2008: 111-125
  • General Data Protection Regulation (GDPR), Recital 26: Not Applicable to Anonymous Data