Why classic anonymization (and pseudonymization) does not result in anonymous data

This blog covers the following topics:

What is classic anonymization?
What are the disadvantages of classic anonymization?
Why do classic anonymization techniques offer a suboptimal combination of data utility and privacy protection?
How is Synthetic Data different?
Why still use personal data if you can use synthetic data?

What is classic anonymization?

By classic anonymization, we mean all methods that manipulate or distort an original dataset to make it harder to trace records back to individuals.

Typical examples of classic anonymization that we see in practice are generalization, suppression / wiping, pseudonymization and row and column shuffling.

The table below shows each technique with a corresponding example.

Technique	Original data	Manipulated data
Generalization	27 years old	Between 25 and 30 years old
Suppression / Wiping	info@syntho.ai	xxxx@xxxxxx.xx
Pseudonymization	Amsterdam	hVFD6td3jdHHj78ghdgrewui6
Row and column shuffling	Aligned	Shuffled
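
To make the table concrete, here is a minimal Python sketch of the four techniques. The function names, the 5-year range width, the salted hash used for pseudonymization, and the `secret` value are all assumptions chosen for illustration; real tools may implement each technique differently.

```python
import hashlib
import random

def generalize_age(age, width=5):
    """Generalization: replace an exact age with a range of `width` years."""
    low = (age // width) * width
    return f"Between {low} and {low + width} years old"

def suppress(value):
    """Suppression / wiping: mask every letter and digit, keep punctuation."""
    return "".join("x" if ch.isalnum() else ch for ch in value)

def pseudonymize(value, secret="hypothetical-secret"):
    """Pseudonymization: map a value to a consistent token.
    A salted SHA-256 hash is one possible choice (an assumption here)."""
    digest = hashlib.sha256((secret + value).encode("utf-8")).hexdigest()
    return digest[:24]

def shuffle_column(rows, column, seed=0):
    """Column shuffling: permute one column so its values no longer
    align with the rows they came from."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, column: v} for row, v in zip(rows, values)]

print(generalize_age(27))          # Between 25 and 30 years old
print(suppress("info@syntho.ai"))  # xxxx@xxxxxx.xx
print(pseudonymize("Amsterdam"))   # the same token on every call
```

Note that every function returns exactly one output record per input record. The manipulated dataset therefore still corresponds row by row (or value by value) to the original.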

What are the disadvantages of classic anonymization?

Manipulating a dataset with classic anonymization techniques has two key disadvantages:

Distorting a dataset decreases data quality (i.e. data utility). Lower-quality input leads to lower-quality analysis: the classic garbage-in, garbage-out principle.
Privacy risk is reduced but never eliminated. The result remains a manipulated version of the original dataset, with 1-1 relations between manipulated and original records.

The following illustration demonstrates both disadvantages, reduced data utility and remaining privacy risk, by applying suppression and generalization.

Note: we use images for illustrative purposes. The same principle holds for structured datasets.

Left: light application of classic anonymization results in a representative illustration. However, the individual can easily be identified, so the privacy risk is significant.

Right: severe application of classic anonymization results in strong privacy protection. However, the illustration becomes useless.

Classic anonymization techniques offer a suboptimal combination of data utility and privacy protection.

This is the trade-off between data utility and privacy protection: the more a dataset is distorted, the stronger the privacy protection but the lower the utility. As a result, classic anonymization techniques always offer a suboptimal combination of both.
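
The same trade-off can be sketched for a structured column. The example below generalizes a toy list of ages into ranges of increasing width and reports two numbers: how much information is lost, and how large the smallest group of indistinguishable records is. The ages, the chosen widths, and both metrics are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy ages; not taken from any real dataset.
AGES = [23, 27, 31, 38, 44, 52, 59, 67]

def to_bin(age, width):
    """Generalize an age to the lower bound of its `width`-year range."""
    return (age // width) * width

def utility_loss(ages, width):
    """Mean absolute error when each age is replaced by its range midpoint."""
    return mean(abs(a - (to_bin(a, width) + width / 2)) for a in ages)

def smallest_group(ages, width):
    """Size of the smallest group of records that share one range.
    A group of size 1 means that record is still unique."""
    return min(Counter(to_bin(a, width) for a in ages).values())

for width in (5, 20, 50):
    print(width, utility_loss(AGES, width), smallest_group(AGES, width))
```

With this toy data, the average error grows from 0.875 years (5-year ranges) to 11.125 years (50-year ranges), while the smallest group only rises above a single record at the widest range. Stronger protection is bought with a large loss of detail.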

Is removing all direct identifiers (such as names) from the dataset a solution?

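No. This is a common misconception: removing direct identifiers does not result in anonymous data, because the remaining attributes (such as postal code, birth year and gender) can often be combined to single out an individual. Do you still apply this as a way to anonymize your dataset? Then this blog is a must-read for you.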
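
A minimal sketch of why this fails: if the remaining columns can be matched against another source that does contain names, records can be linked back to individuals. All records, names, and the choice of linking columns below are made up for illustration.

```python
# Dataset released after removing names; all values are hypothetical.
released = [
    {"zip": "1011", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "1011", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "3511", "birth_year": 1985, "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical auxiliary source that an attacker could hold.
public = [
    {"name": "Alice", "zip": "1011", "birth_year": 1985, "gender": "F"},
    {"name": "Bob", "zip": "1011", "birth_year": 1990, "gender": "M"},
    {"name": "Carol", "zip": "3511", "birth_year": 1985, "gender": "F"},
    {"name": "Dave", "zip": "3511", "birth_year": 1972, "gender": "M"},
]

# Columns assumed to be present in both sources.
LINK_COLUMNS = ("zip", "birth_year", "gender")

def reidentify(released, public, keys=LINK_COLUMNS):
    """Link released rows to named people via the shared columns."""
    index = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])
    hits = {}
    for row in released:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            hits[names[0]] = row["diagnosis"]
    return hits

print(reidentify(released, public))
```

In this toy example all three released records are matched to a name, even though the released dataset contains no names at all.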

How is Synthetic Data different?

Syntho develops software that generates an entirely new dataset of fresh data records. Because synthetic data consists of artificial records generated by software, information that identifies real individuals is not present in a synthetic dataset, resulting in a situation with no privacy risks.

The key difference at Syntho: we apply machine learning. As a result, our solution reproduces the structure and properties of the original dataset in the synthetic dataset, which maximizes data utility. You can therefore obtain the same results when analyzing the synthetic data as when analyzing the original data.
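
To illustrate the general idea of learning properties from a dataset and then generating fresh records, here is a deliberately simple sketch. It is not Syntho's method, which this blog does not describe: it swaps in a basic bivariate normal fit on two made-up columns (age and income), and every name and parameter in it is an assumption.

```python
import math
import random

def fit(xs, ys):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    rho = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
    return mx, my, sx, sy, rho

def sample(params, n, seed=0):
    """Draw n new (x, y) records from a bivariate normal with those parameters."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows

# Hypothetical "original" data: income loosely depends on age.
rng = random.Random(42)
ages = [rng.uniform(20, 65) for _ in range(500)]
incomes = [1000 + 50 * a + rng.gauss(0, 300) for a in ages]

original = fit(ages, incomes)
synthetic = sample(original, n=20000)
refit = fit([x for x, _ in synthetic], [y for _, y in synthetic])

print("original :", [round(v, 2) for v in original])
print("synthetic:", [round(v, 2) for v in refit])
```

The synthetic rows are sampled from the fitted model rather than derived one by one from original rows, yet the means, spreads and correlation measured on them stay close to those of the original columns. A production generator has to capture far richer structure than this toy model does.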

This case study shows highlights from our quality report, which compares various statistics of synthetic data generated by our Syntho Engine with those of the original data.

In conclusion, synthetic data is the preferred solution to overcome the typical suboptimal trade-off between data utility and privacy protection that all classic anonymization techniques offer you.

So, why use real (sensitive) data when you can use synthetic data?

From a data utility and privacy protection perspective, you should always opt for synthetic data when your use case allows it.

Data type	Value for analysis	Privacy risk
Synthetic data	High	None
Real (personal) data	High	High
Manipulated data (through classic ‘anonymization’)	Low-Medium	Medium-High

Synthetic data by Syntho fills the gaps where classic anonymization techniques fall short by maximizing both data-utility and privacy-protection.

Data is synthetic, but our team is real!

Contact Syntho and one of our experts will get in touch with you at the speed of light to explore the value of synthetic data!
