Syntho, an expert in AI-generated synthetic data, aims to turn privacy by design into a competitive advantage. They help organizations build a strong data foundation with easy and fast access to high-quality data, and recently won the Philips Innovation Award.
However, synthetic data generation with AI is a relatively new solution, and it typically raises questions. To answer these, Syntho launched a case study together with SAS, the market leader in advanced analytics and AI software.
In collaboration with the Dutch AI Coalition (NL AIC), they investigated the value of synthetic data by comparing synthetic data generated by the Syntho Engine with the original data via various assessments of data quality, legal validity, and usability.
Your guide into synthetic data generation
Classic anonymization techniques have in common that they manipulate the original data to make it harder to trace back individuals. Examples include generalization, suppression, wiping, pseudonymization, data masking, and shuffling of rows and columns. You can find examples in the table below.
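A minimal sketch of some of these classic manipulations on a toy table (the column names and rules below are illustrative, not taken from the case study):

```python
import pandas as pd

# Toy personal-data table (hypothetical columns).
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 41, 29],
    "city": ["Utrecht", "Delft", "Leiden"],
})

anonymized = df.copy()
# Suppression / masking: replace the identifier entirely.
anonymized["name"] = "***"
# Generalization: coarsen ages into 10-year bands.
anonymized["age"] = (anonymized["age"] // 10) * 10
# Shuffling: permute a column so values no longer line up with their rows.
anonymized["city"] = anonymized["city"].sample(frac=1, random_state=0).values
```

Each of these operations deliberately degrades the data, which is exactly why the resulting dataset loses analytical value.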
Those techniques introduce 3 key challenges:
These points are also assessed via this case study.
For the case study, the target dataset was a telecom dataset provided by SAS containing the data of 56,600 customers. The dataset contains 128 columns, including one column indicating whether a customer has left the company (i.e. ‘churned’) or not. The goal of the case study was to use the synthetic data to train several models to predict customer churn and to evaluate the performance of those trained models. As churn prediction is a classification task, SAS selected four popular classification models to make the predictions, including:
Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might perform when applied to new data.
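The train/holdout split described above can be sketched in Python with scikit-learn (the study itself used SAS software; the toy dataset, columns, and split ratio below are illustrative stand-ins):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the telecom dataset: the real one has
# 56,600 rows and 128 columns, including a churn indicator.
df = pd.DataFrame({
    "tenure_months": range(1000),
    "churn": [i % 2 for i in range(1000)],
})

# Stratify on the target so the churn rate matches in both partitions,
# and fix random_state so the split is reproducible.
train_set, holdout_set = train_test_split(
    df, test_size=0.2, stratify=df["churn"], random_state=42
)
```

Keeping the holdout set out of both synthetic-data generation and model training is what makes the later scoring an unbiased assessment.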
Using the train set as input, Syntho used its Syntho Engine to generate a synthetic dataset. For benchmarking, SAS also created a manipulated version of the train set by applying various anonymization techniques until a certain k-anonymity threshold was reached. These steps resulted in four datasets:
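k-anonymity, the threshold mentioned above, means that every combination of quasi-identifier values is shared by at least k records, so no individual stands out within their group. A minimal check can be sketched as follows (the quasi-identifier columns are hypothetical; the study's actual anonymization pipeline is not detailed here):

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Return the size of the smallest quasi-identifier group: the
    dataset is k-anonymous for this k (and any smaller value)."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy table with illustrative quasi-identifiers.
df = pd.DataFrame({
    "age_band": ["20-30", "20-30", "30-40", "30-40", "30-40"],
    "region":   ["North", "North", "South", "South", "South"],
})
k = k_anonymity(df, ["age_band", "region"])  # smallest group has 2 rows
```

If k is too low, columns are generalized or rows suppressed further, which is precisely the trade-off that erodes data quality.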
Datasets 1, 3, and 4 were used to train each classification model, resulting in 12 (3 x 4) trained models. SAS subsequently used the holdout dataset to measure the accuracy with which each model predicts customer churn. The results are presented below, starting with some basic statistics.
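The train-and-score loop can be sketched with scikit-learn as a stand-in for the SAS pipeline. Only one dataset is shown here, and the four model choices are illustrative, since the models SAS selected are not all named in this text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the study this would be the original train set,
# the synthetic set, and the anonymized set, scored on one holdout set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

datasets = {"original": (X_train, y_train)}  # plus synthetic / anonymized
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "gboost": GradientBoostingClassifier(random_state=0),
}

# Train every model on every dataset and score it on the same holdout.
results = {}
for dname, (X_t, y_t) in datasets.items():
    for mname, model in models.items():
        model.fit(X_t, y_t)
        auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
        results[(dname, mname)] = auc
```

With the study's three training datasets the same loop yields the 12 (3 × 4) trained models, all evaluated against one fixed holdout set.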
Figure: Machine Learning pipeline generated in SAS Visual Data Mining and Machine Learning
Anonymization techniques destroy even basic patterns, business logic, relationships, and statistics (as in the example below). Using anonymized data for basic analytics thus produces unreliable results. In fact, the poor quality of the anonymized data made it almost impossible to use it for advanced analytics tasks (e.g. AI/ML modeling and dashboarding).
Synthetic data generation with AI preserves basic patterns, business logic, relationships, and statistics (as in the example below). Using synthetic data for basic analytics thus produces reliable results. The key question is, does synthetic data hold for advanced analytics tasks (e.g. AI/ML modeling and dashboarding)?
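A minimal way to check that basic statistics are preserved is to compare per-column summary statistics between the original and the synthetic data. The toy values and column name below are illustrative, not from the case study:

```python
import pandas as pd

# Toy original vs. synthetic versions of one column; in practice you
# would compare the generator's output against the real train set
# column by column.
original  = pd.Series([10, 12, 11, 13, 12, 11], name="monthly_charge")
synthetic = pd.Series([11, 12, 10, 13, 11, 12], name="monthly_charge")

# Side-by-side count, mean, std, min, quartiles, and max.
comparison = pd.DataFrame({
    "original": original.describe(),
    "synthetic": synthetic.describe(),
})
comparison["abs_diff"] = (comparison["original"] - comparison["synthetic"]).abs()
```

Small absolute differences across these statistics are a necessary (though not sufficient) sign that the synthetic data mirrors the original distribution.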
Synthetic data not only preserves basic patterns (as shown in the former plots), but also captures the deep ‘hidden’ statistical patterns required for advanced analytics tasks. The latter is demonstrated in the bar chart below, which shows that models trained on synthetic data and models trained on the original data achieve similar accuracy. Furthermore, with an area under the curve (AUC*) close to 0.5, the models trained on anonymized data perform by far the worst. The full report with all advanced analytics assessments of synthetic data in comparison with the original data is available on request.
*AUC: the area under the curve is a measure of the accuracy of advanced analytics models, taking into account true positives, false positives, false negatives, and true negatives. A value of 0.5 means that a model predicts randomly and has no predictive power, and 1 means that the model is always correct and has full predictive power.
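These two reference points of the AUC scale can be illustrated with scikit-learn on toy labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# A model whose scores perfectly separate the classes reaches AUC = 1.0.
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# A model whose scores carry no information about the labels sits at 0.5
# (constant scores are the extreme case: every ranking is a coin flip).
auc_random = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])
```

This is why the anonymized-data models' AUC near 0.5 in the chart above amounts to no predictive power at all.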
Additionally, this synthetic data can be used to understand data characteristics and the main variables needed for the actual training of the models. The inputs selected by the algorithms on synthetic data were very similar to those selected on the original data. Hence, the modeling process can be done on this synthetic version, which reduces the risk of data breaches. However, when inferring on individual records (e.g. a telco customer), retraining on the original data is recommended for explainability, increased acceptance, or simply because of regulation.
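Comparing the inputs selected on synthetic versus original data can be sketched by ranking feature importances from the same model trained on each set. The data and model below are stand-ins; the study's actual variable-selection method is not described in this text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the original train set.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

def top_features(X, y, k=3):
    """Indices of the k most important features for a forest fit on (X, y)."""
    model = RandomForestClassifier(random_state=1).fit(X, y)
    return set(np.argsort(model.feature_importances_)[-k:])

top_original = top_features(X, y)
# In the study, the same ranking on the synthetic set was compared against
# this one, and the selected inputs were found to be very similar.
```

A large overlap between the two top-k sets supports doing exploratory modeling on the synthetic version.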
Figure: AUC by Algorithm grouped by Method
Conclusions:
Having a strong data foundation with easy and fast access to usable, high-quality data is essential to developing models (e.g. dashboards [BI] and advanced analytics [AI & ML]). However, many organizations suffer from a suboptimal data foundation resulting in 3 key challenges:
Synthetic data approach: develop models with as-good-as-real synthetic data to:
This allows organizations to build a strong data foundation with easy and fast access to usable, high-quality data to unlock data and leverage data opportunities.
Testing and development with high-quality test data is essential to deliver state-of-the-art software solutions. Using original production data seems obvious, but is not allowed due to (privacy) regulations. Alternative Test Data Management (TDM) tools introduce “legacy-by-design” in getting the test data right:
Synthetic data approach: test and develop with AI-generated synthetic test data to deliver state-of-the-art software solutions, starting with:
This allows organizations to test and develop with next-level test data to deliver state-of-the-art software solutions!
Interested? For more information about synthetic data, visit the Syntho website or contact Wim Kees Janssen. For more information about SAS, visit www.sas.com or contact kees@syntho.ai.
In this use case, Syntho, SAS, and the NL AIC work together to achieve the intended results. Syntho is an expert in AI-generated synthetic data, and SAS is a market leader in analytics, offering software for exploring, analyzing, and visualizing data.