Yes, Syntho has a PII text scanner that can identify and mask PII in unstructured text data. For example, it can detect PII in free-text fields, such as doctor’s notes, by tagging sensitive information like names, dates, and SSNs and replacing it with realistic mock values.
More information can be found on this page under the “Introducing the PII text scanner” section.
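To make the idea concrete, here is a minimal Python sketch of tagging and masking PII in free text. It is purely illustrative: the regex patterns and mock values are our own simplified assumptions, and Syntho's actual scanner relies on trained entity recognition rather than hand-written rules.

```python
import re

# Toy patterns and mock values -- real scanners use trained entity
# recognition, not hand-written regexes.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "NAME": re.compile(r"\bDr\. [A-Z][a-z]+\b"),
}
MOCKS = {"SSN": "000-00-0000", "DATE": "01/01/1970", "NAME": "Dr. Doe"}

def mask_pii(text: str) -> str:
    """Replace every detected PII entity with a mock value."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(MOCKS[label], text)
    return text

note = "Dr. Smith saw the patient on 03/14/2024; SSN 123-45-6789."
print(mask_pii(note))
# -> Dr. Doe saw the patient on 01/01/1970; SSN 000-00-0000.
```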
Yes, we facilitate on-premise deployments and all features are available on-premise.
Yes, Syntho’s AI-powered generation automatically captures patterns and complex relationships between columns, reproducing them in the generated synthetic data.
Additionally, Syntho offers rule-based synthetic data methods, including calculated columns, to model business rules from scratch, e.g. for cases where you don’t have any data yet.
It is viewable in the tool, and there is also an option to export it as text.
Syntho’s Test Data Management solutions are designed to mask and de-identify sensitive data at scale, including complex relational datasets. Syntho’s consistent mapping feature is key to preserving consistency and referential integrity in complex relational datasets: it works across tables, across databases, across systems, and even over time.
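The principle behind consistent mapping can be sketched in a few lines: a deterministic, secret-keyed function guarantees that the same original value is always replaced by the same pseudonym, wherever and whenever it occurs. This is a generic illustration under our own assumptions (the HMAC approach and all names are hypothetical), not Syntho's implementation.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret"  # keep stable so the mapping stays consistent over time

def consistent_token(value: str) -> str:
    """Deterministically map an original value to a pseudonym.

    The same input always yields the same output, so keys stay
    consistent across tables, databases, and systems.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"ID_{digest[:10]}"

# The same customer ID appears in two different tables (even two systems)...
orders_row = {"customer_id": "C-1042", "amount": 99}
invoices_row = {"customer_id": "C-1042", "status": "paid"}

# ...and maps to the same pseudonym in both, preserving referential integrity.
assert consistent_token(orders_row["customer_id"]) == consistent_token(invoices_row["customer_id"])
```

Because the mapping is deterministic, it also stays stable over time: re-running it later on new data yields the same pseudonyms for the same original values.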
Syntho offers over 150 mock data generators that accurately mimic real-world data characteristics. Rule-based synthetic data can also be customized to suit specific requirements.
Yes, Syntho can detect and adapt PII data as configured during setup and as demonstrated during the webinar.
More information about our PII scanner can be found here.
More information about our mockers to adapt PII can be found here.
Syntho supports handling Blob data, both by duplication and by exclusion of such columns. Details can be found in our User Documentation. We can dive deeper into this with you, if desired.
The PII scanner detects all PII attributes and identifiers. While a birthdate alone may not uniquely identify an individual, you can customize the scanner to also include non-identifiers such as birthdate and other variables as needed.
The scanner offers both “shallow” and “deep” scans: a shallow scan reviews metadata, such as column names and data types, while a deep scan leverages advanced entity recognition to analyze actual data in depth. This flexibility allows you to specify which PII types to detect.
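As a rough illustration of the difference between the two modes, consider the sketch below. The heuristics are deliberately simplistic and hypothetical; Syntho's deep scan uses advanced entity recognition rather than a single regex.

```python
import re

SENSITIVE_COLUMN_NAMES = {"name", "email", "ssn", "phone", "dob"}
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE)

def shallow_scan(columns: list[str]) -> set[str]:
    """Metadata only: flag columns whose names suggest PII."""
    return {c for c in columns if c.lower() in SENSITIVE_COLUMN_NAMES}

def deep_scan(table: dict[str, list[str]]) -> set[str]:
    """Content-based: inspect actual values and flag columns whose data looks like PII."""
    flagged = set()
    for column, values in table.items():
        if any(EMAIL_RE.fullmatch(v) for v in values[:100]):  # sample first 100 values
            flagged.add(column)
    return flagged

table = {"contact": ["alice@example.com", "bob@example.com"], "notes": ["follow up"]}
print(shallow_scan(list(table)))  # set() -- the column name alone reveals nothing
print(deep_scan(table))           # {'contact'} -- the values do
```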
PII, or Personally Identifiable Information, refers to sensitive data linked to individuals. Privacy regulations make it challenging to use personal data for testing purposes, so it is essential to protect this data accordingly.
Yes, users can identify PII entities manually as an alternative to the PII scanner, and they can likewise apply mockers manually as an alternative to the automatically suggested mockers. However, we optimized our platform so that AI does the work for you, minimizing manual effort and letting you process large data volumes quickly.
To initiate de-identification, identifying columns containing Personally Identifiable Information (PII) is essential. However, this often demands extensive time and manual effort from developers.
Our solution streamlines this process through an automated, AI-powered PII scanner, allowing customers to efficiently identify and de-identify PII. It eliminates manual effort, enhances efficiency, and ensures comprehensive identification of sensitive data automatically.
PII stands for Personally Identifiable Information. PII is unique to an individual: no two people share the same identifying trait. Learn more about the definition of PII here.
Mock data and AI-generated synthetic data are both types of synthetic data, but they are generated in different ways and serve different purposes.
Mock data is a type of synthetic data that is created manually and is typically used for testing and development. It simulates the behavior of real-world data in a controlled environment, for example to test the functionality of a system or application. It is simple and easy to generate and does not require complex models or algorithms. Mock data is also often referred to as “dummy data” or “fake data”.
AI-generated synthetic data, on the other hand, is generated using artificial intelligence techniques, such as machine learning or generative models. It is used to create realistic and representative data that can stand in for real-world data when using the real data would be impractical or unethical due to strict privacy regulations. It is more complex to produce and requires more computational resources than manual mock data. As a result, it is much more realistic and mimics the original data as closely as possible.
In summary, mock data is manually created and is typically used for testing and development, while AI-generated synthetic data is created using artificial intelligence techniques and is used to create representative and realistic data.
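To make the contrast concrete, the sketch below generates mock data with the open-source Faker library, and then uses a simple statistical model as a toy stand-in for an AI generative model: it "learns" the mean and covariance of an original dataset and samples brand-new records from them. All names and numbers here are illustrative.

```python
import numpy as np
from faker import Faker  # pip install faker

# --- Mock data: rule-based generators, no original data needed ---
fake = Faker()
mock_rows = [{"name": fake.name(), "email": fake.email()} for _ in range(3)]
print(mock_rows[0])  # e.g. {'name': 'John Smith', 'email': '...'}

# --- AI-generated synthetic data (toy stand-in for a generative model) ---
# Learn the statistical structure of the original data, then sample new points.
rng = np.random.default_rng(0)
original = rng.multivariate_normal([40, 3000], [[100, 900], [900, 250000]], size=1000)

mean = original.mean(axis=0)                               # "training": estimate parameters
cov = np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)  # sample brand-new records

# The synthetic sample preserves the original's correlation structure.
print(np.corrcoef(original, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```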
Yes, we do. We offer various value-adding synthetic data optimization and augmentation features, including mockers, to take your data to the next level.
A synthetic data twin is an algorithm-generated replica of a real-world dataset and/or database. With a Synthetic Data Twin, Syntho aims to mimic an original dataset or database as closely as possible, creating a realistic representation of the original with synthetic data quality as close to that of the original data as we can achieve. We do this with our synthetic data software, which uses state-of-the-art AI models. Those AI models generate completely new data points and model them in such a way that the characteristics, relationships, and statistical patterns of the original data are preserved to such an extent that you can use the result as if it were the original data.
This can be used for a variety of purposes, such as testing and training machine learning models, simulating scenarios for research and development, and creating virtual environments for training and education. Synthetic data twins can be used to create realistic and representative data that can be used in place of real-world data when it is not available or when using the real-world data would be impractical or unethical due to strict data privacy regulations.
Generally, most of our clients use synthetic data for:
Yes, we do. Our platform is optimized for databases and, consequently, for preserving referential integrity between datasets in the database.
Curious to find out more about this?
Ask our experts directly.
Yes, it is. The synthetic data even holds patterns you did not know were present in the original data.
But don’t just take our word for it. The analytics experts at SAS (global market leader in analytics) did an (AI) assessment of our synthetic data and compared it with the original data. Curious? Watch the whole event here or watch the short version about data quality here.
Guaranteeing that synthetic data holds the same data quality as the original data can be challenging, and often depends on the specific use case and the methods used to generate the synthetic data. Some methods for generating synthetic data, such as generative models, can produce data that is highly similar to the original data. Key question: how to demonstrate this?
There are some ways to ensure the quality of synthetic data:
It’s important to note that synthetic data can never be guaranteed to be 100% identical to the original data, but it can be close enough to be useful for a specific use case. That use case can even be advanced analytics or training machine learning models.
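One generic way to demonstrate quality, sketched below, is to compare per-column distributions (for example with a Kolmogorov–Smirnov test) and cross-column correlations between the original and the synthetic data. This is an illustration of the general idea, not a description of Syntho's QA report.

```python
import numpy as np
from scipy.stats import ks_2samp

def quality_check(original: np.ndarray, synthetic: np.ndarray) -> None:
    """Compare marginal distributions and correlation structure."""
    for i in range(original.shape[1]):
        stat, _ = ks_2samp(original[:, i], synthetic[:, i])
        print(f"column {i}: KS statistic = {stat:.3f} (0 means identical distributions)")
    corr_gap = np.abs(np.corrcoef(original, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False)).max()
    print(f"max correlation difference: {corr_gap:.3f}")

# Toy check: a "synthetic" sample drawn from the same distribution as the original.
rng = np.random.default_rng(1)
quality_check(rng.normal(size=(1000, 3)), rng.normal(size=(1000, 3)))
```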
One of the use cases that is specifically highlighted by the Dutch Data Protection Authority is using synthetic data as test data.
More can be found in this article.
Syntho’s QA report contains three industry-standard metrics for evaluating data privacy. The idea behind each of these metrics is as follows:
Yes, we do this via our QA report.
When synthesizing a dataset, it is essential to demonstrate that one is not able to re-identify individuals. In this video, Marijn introduces privacy measures that are in our quality report to demonstrate this.
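As one example of such a measure, a widely used privacy metric is the distance to closest record (DCR): for every synthetic record, the distance to its nearest neighbor in the original data. Synthetic points that sit exactly on (or suspiciously close to) original points may leak real individuals. The sketch below illustrates the general concept; we are not asserting that this is the exact set of metrics in Syntho's report.

```python
import numpy as np

def distance_to_closest_record(original: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """For each synthetic row, the Euclidean distance to its nearest original row.

    Many zero (or near-zero) distances suggest copied, re-identifiable records.
    """
    diffs = synthetic[:, None, :] - original[None, :, :]   # (n_syn, n_orig, n_cols)
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

rng = np.random.default_rng(2)
original = rng.normal(size=(500, 4))
fresh = rng.normal(size=(500, 4))   # freshly sampled, not copied
leaky = original[:10]               # a "synthetic" set that copies real records

print(distance_to_closest_record(original, fresh).min())   # comfortably above zero
print(distance_to_closest_record(original, leaky).min())   # 0.0 -- exact copies
```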
No. The Syntho Engine is a self-service platform. As a result, synthetic data can be generated with the Syntho Engine in such a way that, in the end-to-end process, Syntho is never able to see, and never required to process, your data.
No. We optimized our platform so that it can be easily deployed in the customer’s trusted environment. This ensures that data never leaves that trusted environment. The supported deployment options are on-premise and in the customer’s own cloud environment (private cloud).
No, we don’t. We can easily deploy the Syntho Engine on-premise or in your private cloud via Docker.
Yes. Syntho software is optimized for databases containing multiple tables.
To support this, Syntho automatically detects data types, schemas, and formats to maximize data accuracy. For multi-table databases, we support automatic table-relationship inference and synthesis to preserve referential integrity.
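In simplified form, preserving referential integrity during synthesis means that synthetic foreign keys are drawn only from synthetic primary keys, so every child row still points at a valid parent. The sketch below uses hypothetical table and column names to illustrate this; it is not Syntho's implementation.

```python
import random

random.seed(0)

# Hypothetical parent/child schema: orders.customer_id references customers.id.
synthetic_customers = [{"id": f"CUST_{i}", "segment": random.choice(["A", "B"])}
                       for i in range(5)]

# Referential integrity: foreign keys are sampled from the *synthetic*
# primary keys, never from the original ones.
valid_keys = [c["id"] for c in synthetic_customers]
synthetic_orders = [{"order_id": n, "customer_id": random.choice(valid_keys)}
                    for n in range(12)]

assert all(o["customer_id"] in valid_keys for o in synthetic_orders)  # no orphan rows
```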
No. We optimized our platform to minimize computational requirements (e.g., no GPU required) without compromising data accuracy. In addition, we support auto-scaling, so you can synthesize huge databases.
The Syntho Engine works best on structured, tabular data (anything that contains rows and columns). Within these structures, we support the following data types:
Complex data support: in addition to all regular types of tabular data, the Syntho Engine supports complex data types and complex data structures.
Not at all. Although it may take some effort to fully understand the advantages, workings and use cases of synthetic data, the process of synthesizing is very simple and anyone with basic computer knowledge can do it. For more information about the synthesizing process, check out this page or request a demo.
Syntho’s machine learning algorithms can better generalize the features with more entity records available, which decreases the privacy risk. A minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
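The recommended minimum follows directly from the 1:500 ratio, as this small helper (illustrative only, not part of the Syntho API) shows:

```python
def recommended_min_rows(n_columns: int, ratio: int = 500) -> int:
    """Minimum row count under the recommended 1:500 column-to-row ratio."""
    return n_columns * ratio

print(recommended_min_rows(6))  # 3000, matching the example above
```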
Naturally, the generation time depends on the size of the database. On average, a table with less than 1 million records is synthesized in less than 5 minutes.
Syntho enables you to easily connect with your databases, applications, data pipelines or file systems.
We support various integrated connectors so that you can connect with the source-environment (where the original data is stored) and the destination environment (where you want to write your synthetic data to) for an end-to-end integrated approach.
Connection features that we support:
The Syntho Engine is shipped in a Docker container and can be easily deployed and plugged into your environment of choice.
Possible deployment options include:
Read more.