External evaluation of our synthetic data by the data experts of SAS
Our synthetic data is assessed and approved by the data experts of SAS
Introduction to the external evaluation of our synthetic data by the data experts of SAS
What did we do?
Synthetic data generated by Syntho is assessed, validated and approved from an external and objective point of view by the data experts of SAS.
Why is our synthetic data externally evaluated by the data experts of SAS?
Though Syntho is proud to offer its users an advanced quality assurance report, we also understand the importance of an external, objective evaluation of our synthetic data by industry leaders. That’s why we collaborate with SAS, a leader in analytics, to assess our synthetic data.
SAS conducted thorough evaluations of the data accuracy, privacy protection, and usability of Syntho’s AI-generated synthetic data in comparison to the original data. In conclusion, SAS assessed and approved Syntho’s synthetic data as accurate, secure, and usable in comparison to the original data.
What did SAS do during this assessment?
The target data was a telecom dataset used for customer “churn” prediction. The goal of the evaluation was to use synthetic data to train various churn prediction models and to assess the performance of each model. As churn prediction is a classification task, SAS selected popular classification models to make the predictions, including:
- Random forest
- Gradient boosting
- Logistic regression
- Neural network
Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might do when applied to new data.
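The random split described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the procedure SAS ran: the 30% holdout fraction, the fixed seed and the `customers` records are assumptions made for the example, since the assessment does not state them.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.3, seed=42):
    """Randomly split rows into a train list and a holdout list.

    holdout_fraction and seed are illustrative assumptions, not values
    taken from the SAS assessment.
    """
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    shuffled = list(rows)            # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical customer records, used only to exercise the function.
customers = [{"customer_id": i, "churned": i % 5 == 0} for i in range(100)]
train, holdout = split_train_holdout(customers)
```

Because the holdout rows are set aside before any synthetic or anonymized data is produced, no model and no data generator ever sees them, which is what makes the later scoring unbiased.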
Using the train set as input, Syntho used its Syntho Engine to generate a synthetic dataset. For benchmarking, SAS also created an anonymized version of the train set by applying various anonymization techniques until a certain k-anonymity threshold was reached. These steps resulted in four datasets:
- A train dataset (i.e. the original dataset minus the holdout dataset)
- A holdout dataset (i.e. a subset of the original dataset)
- An anonymized dataset (an anonymized version of the train dataset)
- A synthetic dataset (a synthesized version of the train dataset)
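For readers unfamiliar with the k-anonymity threshold mentioned for the anonymized dataset: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below measures that k for a small table. The column names and records are hypothetical, and the assessment does not say which quasi-identifiers or which threshold SAS used.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values on all quasi-identifier columns (0 for an empty table)."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical, already-generalized records (age bands instead of exact ages).
records = [
    {"age_band": "30-39", "region": "North", "churned": True},
    {"age_band": "30-39", "region": "North", "churned": False},
    {"age_band": "40-49", "region": "South", "churned": False},
    {"age_band": "40-49", "region": "South", "churned": True},
    {"age_band": "40-49", "region": "South", "churned": False},
]

k = k_anonymity(records, ["age_band", "region"])
```

Raising k generally means generalizing or suppressing more values, which is the trade-off between protection and data quality that the benchmark is meant to expose.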
Datasets 1, 3 and 4 (train, anonymized and synthetic) were each used to train all four classification models, resulting in 12 (3 x 4) trained models. SAS subsequently used the holdout dataset to measure the accuracy of each model in the prediction of customer churn.
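The shape of this experiment (fit every model on every training dataset, then score all of them on the same holdout set) can be sketched as below. The two learners are deliberately trivial stand-ins, not the random forest, gradient boosting, logistic regression or neural network models SAS used, and the numbers are made up purely to exercise the code; they say nothing about the actual results.

```python
def fit_majority(rows):
    """Stand-in learner: always predicts the most common training label."""
    majority = int(2 * sum(y for _, y in rows) >= len(rows))
    return lambda x: majority

def fit_stump(rows):
    """Stand-in learner: thresholds the single feature midway between class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > cut)

def evaluate_grid(training_sets, learners, holdout):
    """Fit each learner on each training set; score every fitted model on the holdout set."""
    results = {}
    for data_name, rows in training_sets.items():
        for model_name, fit in learners.items():
            predict = fit(rows)
            hits = sum(predict(x) == y for x, y in holdout)
            results[(data_name, model_name)] = hits / len(holdout)
    return results

# Hypothetical (feature, label) pairs; label 1 means the customer churned.
training_sets = {
    "train": [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1)],
    "anonymized": [(2, 0), (2, 0), (2, 0), (8, 1), (8, 1)],
    "synthetic": [(1.5, 0), (2.5, 0), (3.5, 0), (6.5, 1), (8.5, 1)],
}
learners = {"majority": fit_majority, "stump": fit_stump}
holdout = [(2, 0), (4, 0), (6, 1), (9, 1)]

results = evaluate_grid(training_sets, learners, holdout)
```

With three training datasets and two stand-in learners the grid has six entries; with the four model types listed earlier it would have the 12 entries described above. The key design choice is that the holdout set always consists of original records, so every model is judged on how well it predicts real customers.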
Initial results of the data assessment by SAS
Models trained on synthetic data perform very similarly to models trained on original data
Synthetic data from Syntho captures not only basic patterns but also the deep ‘hidden’ statistical patterns required for advanced analytics tasks. This is demonstrated in the bar chart, which shows that the accuracy of models trained on synthetic data is similar to the accuracy of models trained on original data. Hence, synthetic data can be used for actual training of the models. The inputs and variable importance selected by the algorithms on synthetic data were also very similar to those selected on original data. Hence, it is concluded that the modeling process can be done on synthetic data, as an alternative to using real sensitive data.
Why do models trained on anonymized data score worse?
Classic anonymization techniques have in common that they manipulate the original data to make it harder to trace records back to individuals, and in doing so they destroy information. The more you anonymize, the better your data is protected, but also the more of your data is destroyed. This is especially damaging for AI and modeling tasks where “predictive power” is essential, because poor-quality data results in poor insights from the AI model. SAS demonstrated this: the models trained on anonymized data reached an area under the curve (AUC) close to 0.5, which is the score expected from random guessing, and performed by far the worst.
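To make the 0.5 figure concrete: the AUC equals the probability that a model gives a randomly chosen churner a higher score than a randomly chosen non-churner, so a model whose scores carry no information lands at 0.5. The sketch below computes AUC directly from that definition. It is an illustration with made-up labels and scores, not the SAS Viya computation.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive is scored higher.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = churned, 0 = stayed.
labels = [0, 0, 1, 1]
informative_scores = [0.1, 0.4, 0.35, 0.8]    # mostly ranks churners higher
uninformative_scores = [0.5, 0.5, 0.5, 0.5]   # same score for everyone

print(auc(labels, informative_scores))    # 0.75
print(auc(labels, uninformative_scores))  # 0.5
```

This pairwise form is quadratic in the number of records, which is fine for illustration; production tools typically use a rank-based calculation instead.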
Additional results of synthetic data assessments by SAS
The correlations and relationships between variables were accurately preserved in synthetic data.
The Area Under the Curve (AUC), a metric for measuring model performance, remained consistent between models trained on synthetic data and models trained on original data.
Furthermore, the variable importance, which indicates the predictive power of each variable in a model, remained intact when comparing synthetic data to the original dataset.
Based on these observations, which SAS made using SAS Viya, we can confidently conclude that synthetic data generated by the Syntho Engine is on par with real data in terms of quality. This validates the use of synthetic data for model development, paving the way for advanced analytics with synthetic data.
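One simple way to check the first of these observations, that correlations between variables are preserved, is to compute every pairwise Pearson correlation in both tables and look at the largest gap. The sketch below does this in plain Python. The column names and values are hypothetical, and the assessment does not describe how SAS performed this comparison.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def max_correlation_gap(original, synthetic):
    """Largest absolute difference in pairwise correlation between two
    tables, each given as a dict mapping column name to a list of values."""
    cols = sorted(original)
    gaps = [
        abs(pearson(original[a], original[b]) - pearson(synthetic[a], synthetic[b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
    ]
    return max(gaps)

# Hypothetical telecom columns, made up for illustration.
original = {
    "tenure_months": [1, 2, 3, 4, 5],
    "monthly_charge": [50, 45, 40, 30, 20],
    "support_calls": [5, 3, 4, 1, 1],
}
synthetic = {
    "tenure_months": [1, 2, 3, 5, 6],
    "monthly_charge": [52, 44, 41, 28, 22],
    "support_calls": [4, 4, 3, 2, 1],
}

gap = max_correlation_gap(original, synthetic)
```

A gap near 0 means the synthetic table reproduces the linear relationships of the original; a value approaching 2 would mean at least one relationship has flipped sign entirely.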
Conclusions by the data experts of SAS
- Models trained on synthetic data perform very similarly to models trained on original data
- Models trained on data anonymized with ‘classic anonymization techniques’ perform worse than models trained on the original data or on synthetic data
- Synthetic data generation is easy and fast because the technique works the same way for every dataset and data type
Reference articles
- Assessment by the data experts of SAS: https://blogs.sas.com/content/hiddeninsights/2022/07/07/ai-generated-synthetic-data-easy-and-fast-access-to-high-quality-data/
- Syntho winner of the SAS global hackathon: https://www.linkedin.com/feed/update/urn:li:activity:7070047376249376769/
- Healthcare case study results: https://communities.sas.com/t5/SAS-Hacker-s-Hub/AI-Generated-Synthetic-Data-in-Healthcare/ta-p/863407