FAQ
Frequently Asked Questions about synthetic data
Understandable! Luckily, we have the answers and we’re here to help. Check our frequently asked questions.
Please open up a question below and click the links to find more information. Have a more complicated question that is not covered here? Ask our experts directly!
The most asked questions
Synthetic data refers to data that is artificially generated rather than collected from real-world sources. In general, whereas original data is collected through all your interactions with people (clients, patients, etc.) and via all your internal processes, synthetic data is generated by a computer algorithm.
Synthetic data can also be used to test and evaluate models in a controlled environment, or to protect sensitive information by generating data that is similar to real-world data but does not contain any sensitive information. Synthetic data is often used as an alternative to privacy-sensitive data, for example as test data, for analytics, or to train machine learning models.
Guaranteeing that synthetic data holds the same data quality as the original data can be challenging, and often depends on the specific use case and the methods used to generate the synthetic data. Some methods for generating synthetic data, such as generative models, can produce data that is highly similar to the original data. Key question: how to demonstrate this?
There are some ways to ensure the quality of synthetic data:
- Data quality metrics via our data quality report: One way to ensure that synthetic data holds the same data quality as the original data is to use data quality metrics to compare the synthetic data to the original data. These metrics can measure things like similarity, accuracy, and completeness of the data. Syntho software includes a data quality report with various data quality metrics (a minimal sketch of such a comparison follows after this list).
- External evaluation: since the data quality of synthetic data in comparison to original data is key, we recently did an assessment with the data experts of SAS (market leader in analytics) to demonstrate the data quality of synthetic data by Syntho in comparison to the real data. Edwin van Unen, analytics expert from SAS, evaluated generated synthetic datasets from Syntho via various analytics (AI) assessments and shared the outcomes. Watch a short recap of that video here.
- Testing and evaluation by yourself: synthetic data can be tested and evaluated by comparing it to real-world data, or by using it to train machine learning models and comparing their performance to models trained on real-world data. Why not test the data quality of synthetic data yourself? Ask our experts about the possibilities here.
It’s important to note that synthetic data can never be guaranteed to be 100% identical to the original data, but it can be close enough to be useful for a specific use case. That use case can even be advanced analytics or training machine learning models.
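For the technically inclined, below is a minimal sketch of the kind of statistical comparison a data quality report performs. It is not Syntho's actual QA report; the file and column names are hypothetical, and it uses open-source pandas and SciPy purely to illustrate the idea.

```python
# Minimal sketch: compare a synthetic dataset to the original on basic
# statistics. File and column names are hypothetical; this illustrates
# the idea only and is not Syntho's QA report.
import pandas as pd
from scipy.stats import ks_2samp

def compare_numeric_column(original: pd.Series, synthetic: pd.Series) -> dict:
    """Compare one numeric column on mean, spread and distribution shape."""
    ks_stat, p_value = ks_2samp(original.dropna(), synthetic.dropna())
    return {
        "mean_diff": abs(original.mean() - synthetic.mean()),
        "std_diff": abs(original.std() - synthetic.std()),
        "ks_statistic": ks_stat,  # 0.0 means the distributions look identical
        "ks_p_value": p_value,
    }

original = pd.read_csv("original.csv")    # hypothetical file names
synthetic = pd.read_csv("synthetic.csv")

for col in ["age", "income"]:             # hypothetical numeric columns
    print(col, compare_numeric_column(original[col], synthetic[col]))
```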
Classic ‘anonymization’ is not always the best solution, because:
- Privacy risk – you will always have a privacy risk. Applying those classic anonymization techniques only makes it harder, not impossible, to identify individuals.
- Destroying data – the more you anonymize, the better you protect privacy, but the more you destroy your data. This is not what you want for analytics, because destroyed data results in poor insights.
- Time-consuming – it is a solution that takes a lot of time, because those techniques work differently per dataset and per data type.
Synthetic data aims to solve all of these shortcomings. The difference is so striking that we made a video about it. Watch it here.
Frequently Asked Questions
Synthetic Data
Generally, most of our clients use synthetic data for:
- Software testing & development
- Analytics, model development and advanced analytics (AI & ML)
- Product demos
A synthetic data twin is an algorithm-generated replica of a real-world dataset and/or database. With a Synthetic Data Twin, Syntho aims to mimic an original dataset or database as closely as possible, creating a realistic representation of the original. We aim for synthetic data quality that is on par with the original data. We do this with our synthetic data software, which uses state-of-the-art AI models. Those AI models generate completely new datapoints and model them in such a way that the characteristics, relationships and statistical patterns of the original data are preserved to such an extent that you can use the synthetic data as if it were the original data.
This can be used for a variety of purposes, such as testing and training machine learning models, simulating scenarios for research and development, and creating virtual environments for training and education. Synthetic data twins can be used to create realistic and representative data that can be used in place of real-world data when it is not available or when using the real-world data would be impractical or unethical due to strict data privacy regulations.
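To make the "completely new datapoints that preserve statistical patterns" idea concrete, here is a deliberately simple toy sketch. It is not the Syntho Engine (which uses far more capable AI models); it only shows, with open-source scikit-learn, how a fitted generative model can sample brand-new records that follow the original data's statistics.

```python
# Toy illustration of the generative idea behind a synthetic data twin:
# fit a model on the original data, then sample brand-new datapoints
# that follow the same statistical patterns. NOT the Syntho Engine.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Hypothetical "original" data: two correlated numeric columns
# (e.g. age and income).
original = rng.multivariate_normal(
    mean=[40, 50_000],
    cov=[[100, 9_000], [9_000, 4e8]],
    size=1_000,
)

model = GaussianMixture(n_components=3, random_state=0).fit(original)
synthetic, _ = model.sample(1_000)  # new points, never seen before

print("original means: ", original.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```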
Yes, we do. We offer various value-adding synthetic data optimization and augmentation features, including mockers, to take your data to the next level.
Mock data and AI-generated synthetic data are both types of synthetic data, but they are generated in different ways and serve different purposes.
Mock data is a type of synthetic data that is manually created and is often used for testing and development purposes. It is typically used to simulate the behavior of real-world data in a controlled environment, for example to test the functionality of a system or application. It is often simple, easy to generate, and does not require complex models or algorithms. Mock data is also often referred to as “dummy data” or “fake data”.
AI-generated synthetic data, on the other hand, is generated using artificial intelligence techniques, such as machine learning or generative models. It is used to create realistic and representative data that can be used in place of real-world data when using the real-world data would be impractical or unethical due to strict privacy regulations. It is often more complex and requires more computational resources than manual mock data. As a result, it is much more realistic and mimics the original data as closely as possible.
In summary, mock data is manually created and is typically used for testing and development, while AI-generated synthetic data is created using artificial intelligence techniques and is used to create representative and realistic data.
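As a small illustration of the difference: mock data can be produced with a simple generator such as the open-source `faker` library (or written by hand), with no reference to any real dataset, whereas AI-generated synthetic data comes from a model fitted on the original data, as in the generative sketch above.

```python
# Mock ("dummy" / "fake") data: manually specified and plausible-looking,
# but with no statistical relationship to any real dataset. The `faker`
# library is just one example; hand-written values would work equally well.
from faker import Faker

fake = Faker()
mock_rows = [
    {"name": fake.name(), "email": fake.email(), "age": fake.random_int(18, 90)}
    for _ in range(5)
]
print(mock_rows)
```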
Data Quality
Yes, it is. The synthetic data even holds patterns you did not know were present in the original data.
But don’t just take our word for it. The analytics experts of SAS (global market leader in analytics) did an (AI) assessment of our synthetic data and compared it with the original data. Curious? Watch the whole event here or watch the short version about data quality here.
Yes, we do. Our platform is optimized for databases and, consequently, for the preservation of referential integrity between tables in the database.
Curious to find out more about this?
Privacy
No, we don’t. The Syntho Engine can easily be deployed on-premise or in your private cloud via Docker.
No. We optimized our platform in such a way that it can easily be deployed in the trusted environment of the customer. This ensures that data never leaves that trusted environment. Supported deployment options are on-premise and in the customer’s own cloud environment (private cloud).
Optional: Syntho supports a version that is hosted in the “Syntho cloud”.
No. The Syntho Engine is a self-service platform. As a result, synthetic data can be generated with the Syntho Engine in such a way that, in the end-to-end process, Syntho is never able to see your data and never required to process it.
Yes, we do, via our QA report.
When synthesizing a dataset, it is essential to demonstrate that one is not able to re-identify individuals. In this video, Marijn introduces the privacy measures in our quality report that demonstrate this.
Syntho’s QA report contains three industry-standard metrics for evaluating data privacy. The idea behind each of these metrics is as follows (a minimal sketch of this logic appears after the list):
- Synthetic data (S) shall be “as close as possible”, but “not too close” to the target data (T).
- Randomly selected holdout data (H) determines the benchmark for “too close”.
- A perfect solution generates new synthetic data that behaves exactly like the original data, but hasn’t been seen before (= H).
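Below is a minimal sketch of the distance-based logic behind such metrics, using distance to closest record (DCR) as an example. This is not the exact set of metrics in Syntho's QA report; the data here is randomly generated purely to show how the holdout benchmark works.

```python
# Sketch of the "as close as possible, but not too close" idea using
# distance to closest record (DCR). Illustration only; not the exact
# metrics in Syntho's QA report.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def distance_to_closest_record(candidates, target):
    """For each candidate row, distance to its nearest row in the target data T."""
    nn = NearestNeighbors(n_neighbors=1).fit(target)
    distances, _ = nn.kneighbors(candidates)
    return distances.ravel()

rng = np.random.default_rng(0)
T = rng.normal(size=(1_000, 5))  # target (training) data
H = rng.normal(size=(200, 5))    # holdout: real data the model never saw
S = rng.normal(size=(200, 5))    # stand-in for synthetic data

# If synthetic rows sit systematically closer to T than unseen holdout
# rows do, the generator may have memorized individual records.
print("median DCR, synthetic:", np.median(distance_to_closest_record(S, T)))
print("median DCR, holdout:  ", np.median(distance_to_closest_record(H, T)))
```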
One of the use cases that is specifically highlighted by the Dutch Data Protection Authority is using synthetic data as test data.
Syntho Engine
The Syntho Engine is shipped in a Docker container and can be easily deployed and plugged into your environment of choice (an illustrative run command follows below the list).
Possible deployment options include:
- On-premise
- Any (private) cloud
- Any other environment
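For illustration, deployment can look like a single container run. The image name, tag and port below are placeholders, not Syntho's published values; your Syntho contact provides the actual image and configuration.

```
# Illustrative only: image name, tag and port are placeholders.
docker run -d --name syntho-engine -p 8080:8080 syntho/engine:latest
```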
Syntho enables you to easily connect with your databases, applications, data pipelines or file systems.
We support various integrated connectors, so you can connect with the source environment (where the original data is stored) and the destination environment (where you want to write your synthetic data) for an end-to-end integrated approach.
Connection features that we support:
- Plug-and-play with Docker
- 20+ database connectors
- 20+ filesystem connectors
Naturally, the generation time depends on the size of the database. On average, a table with less than 1 million records is synthesized in less than 5 minutes.
Syntho’s machine learning algorithms generalize features better when more entity records are available, which decreases the privacy risk. A minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3,000 rows.
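The rule of thumb is simple arithmetic; a tiny helper (hypothetical, not part of the Syntho Engine) makes it explicit:

```python
# Recommended minimum rows for the 1:500 column-to-row ratio.
def min_rows_needed(num_columns: int, ratio: int = 500) -> int:
    return num_columns * ratio

print(min_rows_needed(6))  # -> 3000, matching the example above
```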
Not at all. Although it may take some effort to fully understand the advantages, workings and use cases of synthetic data, the process of synthesizing is very simple and anyone with basic computer knowledge can do it. For more information about the synthesizing process, check out this page or request a demo.
The Syntho Engine works best on structured, tabular data (anything that contains rows and columns). Within these structures, we support the following data types:
- Structured data formatted in tables (categorical, numerical, etc.)
- Direct identifiers and PII
- Large datasets and databases
- Geographic location data (like GPS)
- Time series data
- Multi-table databases (with referential integrity)
- Open text data
Complex data support
Next to all regular types of tabular data, the Syntho Engine supports complex data types and complex data structures.
- Time series
- Multi-table databases
- Open text
No, we optimized our platform to minimize computational requirements (e.g. no GPU required), without compromising on the data accuracy. In addition, we support auto scaling, so that one can synthesize huge databases.
Yes. Syntho software is optimized for databases containing multiple tables.
Syntho automatically detects the data types, schemas and formats to maximize data accuracy. For multi-table databases, we support automatic table-relationship inference and synthesis to preserve referential integrity (a minimal integrity check is sketched below).
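As a minimal sketch of what preserved referential integrity means in practice, the pandas check below verifies that every foreign key in a synthetic child table exists in its parent table. Table and column names are hypothetical; the check is generic and not part of the Syntho Engine.

```python
# Verify referential integrity between two synthetic tables.
# Hypothetical names: "customers" (parent) and "orders" (child).
import pandas as pd

customers = pd.read_csv("synthetic_customers.csv")
orders = pd.read_csv("synthetic_orders.csv")

# Every customer_id in the child table must exist in the parent table.
orphans = ~orders["customer_id"].isin(customers["customer_id"])
assert not orphans.any(), f"{orphans.sum()} orders reference missing customers"
```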
Data is synthetic, but our team is real!
Contact Syntho and one of our experts will get in touch with you at the speed of light to explore the value of synthetic data!