Yes, Syntho has a PII text scanner that can identify and mask PII in unstructured text data. For example, it can detect PII in free-text fields, such as doctor’s notes, by tagging sensitive information like names, dates, and SSNs and replacing it with realistic mock values.
More information can be found on this page under the “Introducing the PII text scanner” section.
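To make the idea concrete, here is a minimal Python sketch of tagging and masking PII in free text. It is purely illustrative: the regex patterns and mock values are our own simplified assumptions, and Syntho's actual scanner relies on trained entity recognition rather than hand-written rules.

```python
import re

# Toy patterns and mock values -- real scanners use trained entity
# recognition, not hand-written regexes.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "NAME": re.compile(r"\bDr\. [A-Z][a-z]+\b"),
}
MOCKS = {"SSN": "000-00-0000", "DATE": "01/01/1970", "NAME": "Dr. Doe"}

def mask_pii(text: str) -> str:
    """Replace every detected PII entity with a mock value."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(MOCKS[label], text)
    return text

note = "Dr. Smith saw the patient on 03/14/2024; SSN 123-45-6789."
print(mask_pii(note))
# -> Dr. Doe saw the patient on 01/01/1970; SSN 000-00-0000.
```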
Yes, we facilitate on-premise deployments and all features are available on-premise.
Yes, Syntho’s AI-powered generation automatically captures patterns and complex relationships between columns, reproducing them in the generated synthetic data.
Additionally, Syntho offers rule-based synthetic data methods, including calculated columns, to model business rules from scratch, e.g. for cases where you don’t have any data yet.
It is viewable in the tool, and there is also an option to export it as text.
Syntho’s Test Data Management solutions are designed to mask and de-identify sensitive data at scale, including complex relational datasets. Syntho’s consistent mapping feature is key to preserving consistency and referential integrity in complex relational datasets: it works across tables, across databases, across systems, and even over time.
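The principle behind consistent mapping can be sketched in a few lines: a deterministic, secret-keyed function guarantees that the same original value is always replaced by the same pseudonym, wherever and whenever it occurs. This is a generic illustration under our own assumptions (the HMAC approach and all names are hypothetical), not Syntho's implementation.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret"  # keep stable so the mapping stays consistent over time

def consistent_token(value: str) -> str:
    """Deterministically map an original value to a pseudonym.

    The same input always yields the same output, so keys stay
    consistent across tables, databases, and systems.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"ID_{digest[:10]}"

# The same customer ID appears in two different tables (even two systems)...
orders_row = {"customer_id": "C-1042", "amount": 99}
invoices_row = {"customer_id": "C-1042", "status": "paid"}

# ...and maps to the same pseudonym in both, preserving referential integrity.
assert consistent_token(orders_row["customer_id"]) == consistent_token(invoices_row["customer_id"])
```

Because the mapping is deterministic, it also stays stable over time: re-running it later on new data yields the same pseudonyms for the same original values.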
Syntho offers over 150 mock data generators that accurately mimic real-world data characteristics. Rule-based synthetic data can also be customized to suit specific requirements.
Yes, Syntho can detect and adapt PII data as configured during setup and as demonstrated during the webinar.
More information about our PII scanner can be found here.
More information about our mockers to adapt PII can be found here.
Syntho supports handling Blob data, both by duplication and by exclusion of such columns. Details can be found in our User Documentation. We can dive deeper into this with you, if desired.
The PII scanner detects all PII attributes and identifiers. While a birthdate alone may not uniquely identify an individual, you can customize the scanner to also include non-identifiers such as birthdate and other variables as needed.
The scanner offers both “shallow” and “deep” scans: a shallow scan reviews metadata, such as column names and data types, while a deep scan leverages advanced entity recognition to analyze actual data in depth. This flexibility allows you to specify which PII types to detect.
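As a rough illustration of the difference between the two modes, consider the sketch below. The heuristics are deliberately simplistic and hypothetical; Syntho's deep scan uses advanced entity recognition rather than a single regex.

```python
import re

SENSITIVE_COLUMN_NAMES = {"name", "email", "ssn", "phone", "dob"}
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE)

def shallow_scan(columns: list[str]) -> set[str]:
    """Metadata only: flag columns whose names suggest PII."""
    return {c for c in columns if c.lower() in SENSITIVE_COLUMN_NAMES}

def deep_scan(table: dict[str, list[str]]) -> set[str]:
    """Content-based: inspect actual values and flag columns whose data looks like PII."""
    flagged = set()
    for column, values in table.items():
        if any(EMAIL_RE.fullmatch(v) for v in values[:100]):  # sample first 100 values
            flagged.add(column)
    return flagged

table = {"contact": ["alice@example.com", "bob@example.com"], "notes": ["follow up"]}
print(shallow_scan(list(table)))  # set() -- the column name alone reveals nothing
print(deep_scan(table))           # {'contact'} -- the values do
```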
PII, or Personally Identifiable Information, refers to sensitive data linked to individuals. Privacy regulations make it challenging to use personal data for testing purposes, so it is essential to protect this data accordingly.
Yes, users can identify PII entities manually as an alternative to the PII scanner, and they can likewise apply mockers manually as an alternative to the automatically suggested mockers. However, we optimized our platform so that AI does the work for you, minimizing manual effort and letting you process large data volumes quickly.
To initiate de-identification, identifying columns containing Personally Identifiable Information (PII) is essential. However, this often demands extensive time and manual effort from developers.
Our solution streamlines this process through an automated, AI-powered PII scanner, allowing customers to efficiently identify and de-identify PII. It eliminates manual effort, enhances efficiency, and ensures comprehensive identification of sensitive data automatically.
PII stands for Personally Identifiable Information. PII is unique to an individual: no two people share the same identifying trait. Learn more about the definition of PII here.
Mock data and AI-generated synthetic data are both types of synthetic data, but they are generated in different ways and serve different purposes.
Mock data is a type of synthetic data that is created manually and is typically used for testing and development. It simulates the behavior of real-world data in a controlled environment, for example to test the functionality of a system or application. It is simple and easy to generate and does not require complex models or algorithms. Mock data is also often referred to as “dummy data” or “fake data”.
AI-generated synthetic data, on the other hand, is generated using artificial intelligence techniques, such as machine learning or generative models. It is used to create realistic and representative data that can stand in for real-world data when using the real data would be impractical or unethical due to strict privacy regulations. It is more complex to produce and requires more computational resources than manual mock data. As a result, it is much more realistic and mimics the original data as closely as possible.
In summary, mock data is manually created and is typically used for testing and development, while AI-generated synthetic data is created using artificial intelligence techniques and is used to create representative and realistic data.
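To make the contrast concrete, the sketch below generates mock data with the open-source Faker library, and then uses a simple statistical model as a toy stand-in for an AI generative model: it "learns" the mean and covariance of an original dataset and samples brand-new records from them. All names and numbers here are illustrative.

```python
import numpy as np
from faker import Faker  # pip install faker

# --- Mock data: rule-based generators, no original data needed ---
fake = Faker()
mock_rows = [{"name": fake.name(), "email": fake.email()} for _ in range(3)]
print(mock_rows[0])  # e.g. {'name': 'John Smith', 'email': '...'}

# --- AI-generated synthetic data (toy stand-in for a generative model) ---
# Learn the statistical structure of the original data, then sample new points.
rng = np.random.default_rng(0)
original = rng.multivariate_normal([40, 3000], [[100, 900], [900, 250000]], size=1000)

mean = original.mean(axis=0)                               # "training": estimate parameters
cov = np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)  # sample brand-new records

# The synthetic sample preserves the original's correlation structure.
print(np.corrcoef(original, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```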
Yes, we do. We offer various value-adding synthetic data optimization and augmentation features, including mockers, to take your data to the next level.
A synthetic data twin is an algorithm-generated replica of a real-world dataset and/or database. With a Synthetic Data Twin, Syntho aims to mimic an original dataset or database as closely as possible, creating a realistic representation of the original with synthetic data quality as close to that of the original data as we can achieve. We do this with our synthetic data software, which uses state-of-the-art AI models. Those AI models generate completely new data points and model them in such a way that the characteristics, relationships, and statistical patterns of the original data are preserved to such an extent that you can use the result as if it were the original data.
This can be used for a variety of purposes, such as testing and training machine learning models, simulating scenarios for research and development, and creating virtual environments for training and education. Synthetic data twins can be used to create realistic and representative data that can be used in place of real-world data when it is not available or when using the real-world data would be impractical or unethical due to strict data privacy regulations.
Generally, most of our clients use synthetic data for:
Yes, we do. Our platform is optimized for databases and, consequently, for preserving referential integrity between datasets in the database.
Curious to find out more about this?
Ask our experts directly.
Yes, it is. The synthetic data even holds patterns you did not know were present in the original data.
But don’t just take our word for it. The analytics experts at SAS (global market leader in analytics) did an (AI) assessment of our synthetic data and compared it with the original data. Curious? Watch the whole event here or watch the short version about data quality here.
Guaranteeing that synthetic data holds the same data quality as the original data can be challenging, and often depends on the specific use case and the methods used to generate the synthetic data. Some methods for generating synthetic data, such as generative models, can produce data that is highly similar to the original data. Key question: how to demonstrate this?
There are some ways to ensure the quality of synthetic data:
It’s important to note that synthetic data can never be guaranteed to be 100% identical to the original data, but it can be close enough to be useful for a specific use case. That use case can even be advanced analytics or training machine learning models.
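One generic way to demonstrate quality, sketched below, is to compare per-column distributions (for example with a Kolmogorov–Smirnov test) and cross-column correlations between the original and the synthetic data. This is an illustration of the general idea, not a description of Syntho's QA report.

```python
import numpy as np
from scipy.stats import ks_2samp

def quality_check(original: np.ndarray, synthetic: np.ndarray) -> None:
    """Compare marginal distributions and correlation structure."""
    for i in range(original.shape[1]):
        stat, _ = ks_2samp(original[:, i], synthetic[:, i])
        print(f"column {i}: KS statistic = {stat:.3f} (0 means identical distributions)")
    corr_gap = np.abs(np.corrcoef(original, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False)).max()
    print(f"max correlation difference: {corr_gap:.3f}")

# Toy check: a "synthetic" sample drawn from the same distribution as the original.
rng = np.random.default_rng(1)
quality_check(rng.normal(size=(1000, 3)), rng.normal(size=(1000, 3)))
```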
One of the use cases that is specifically highlighted by the Dutch Data Protection Authority is using synthetic data as test data.
More can be found in this article.
Syntho’s QA report contains three industry-standard metrics for evaluating data privacy. The idea behind each of these metrics is as follows:
Yes, we do this via our QA report.
When synthesizing a dataset, it is essential to demonstrate that one is not able to re-identify individuals. In this video, Marijn introduces privacy measures that are in our quality report to demonstrate this.
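As one example of such a measure, a widely used privacy metric is the distance to closest record (DCR): for every synthetic record, the distance to its nearest neighbor in the original data. Synthetic points that sit exactly on (or suspiciously close to) original points may leak real individuals. The sketch below illustrates the general concept; we are not asserting that this is the exact set of metrics in Syntho's report.

```python
import numpy as np

def distance_to_closest_record(original: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """For each synthetic row, the Euclidean distance to its nearest original row.

    Many zero (or near-zero) distances suggest copied, re-identifiable records.
    """
    diffs = synthetic[:, None, :] - original[None, :, :]   # (n_syn, n_orig, n_cols)
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

rng = np.random.default_rng(2)
original = rng.normal(size=(500, 4))
fresh = rng.normal(size=(500, 4))   # freshly sampled, not copied
leaky = original[:10]               # a "synthetic" set that copies real records

print(distance_to_closest_record(original, fresh).min())   # comfortably above zero
print(distance_to_closest_record(original, leaky).min())   # 0.0 -- exact copies
```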
No. The Syntho Engine is a self-service platform. As a result, synthetic data can be generated with the Syntho Engine in such a way that, in the end-to-end process, Syntho is never able to see, and never required to process, your data.
No. We optimized our platform so that it can be easily deployed in the customer’s trusted environment. This ensures that data never leaves that trusted environment. The supported deployment options are on-premise and in the customer’s own cloud environment (private cloud).
No, we don’t. We can easily deploy the Syntho Engine on-premise or in your private cloud via Docker.
Yes. Syntho software is optimized for databases containing multiple tables.
To support this, Syntho automatically detects data types, schemas, and formats to maximize data accuracy. For multi-table databases, we support automatic table-relationship inference and synthesis to preserve referential integrity.
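In simplified form, preserving referential integrity during synthesis means that synthetic foreign keys are drawn only from synthetic primary keys, so every child row still points at a valid parent. The sketch below uses hypothetical table and column names to illustrate this; it is not Syntho's implementation.

```python
import random

random.seed(0)

# Hypothetical parent/child schema: orders.customer_id references customers.id.
synthetic_customers = [{"id": f"CUST_{i}", "segment": random.choice(["A", "B"])}
                       for i in range(5)]

# Referential integrity: foreign keys are sampled from the *synthetic*
# primary keys, never from the original ones.
valid_keys = [c["id"] for c in synthetic_customers]
synthetic_orders = [{"order_id": n, "customer_id": random.choice(valid_keys)}
                    for n in range(12)]

assert all(o["customer_id"] in valid_keys for o in synthetic_orders)  # no orphan rows
```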
No. We optimized our platform to minimize computational requirements (e.g., no GPU required) without compromising data accuracy. In addition, we support auto-scaling, so you can synthesize huge databases.
The Syntho Engine works best on structured, tabular data (anything that contains rows and columns). Within these structures, we support the following data types:
Complex data support: in addition to all regular types of tabular data, the Syntho Engine supports complex data types and complex data structures.
Not at all. Although it may take some effort to fully understand the advantages, workings and use cases of synthetic data, the process of synthesizing is very simple and anyone with basic computer knowledge can do it. For more information about the synthesizing process, check out this page or request a demo.
Syntho’s machine learning algorithms can better generalize the features with more entity records available, which decreases the privacy risk. A minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
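The recommended minimum follows directly from the 1:500 ratio, as this small helper (illustrative only, not part of the Syntho API) shows:

```python
def recommended_min_rows(n_columns: int, ratio: int = 500) -> int:
    """Minimum row count under the recommended 1:500 column-to-row ratio."""
    return n_columns * ratio

print(recommended_min_rows(6))  # 3000, matching the example above
```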
Naturally, the generation time depends on the size of the database. On average, a table with less than 1 million records is synthesized in less than 5 minutes.
Syntho enables you to easily connect with your databases, applications, data pipelines or file systems.
We support various integrated connectors so that you can connect with the source-environment (where the original data is stored) and the destination environment (where you want to write your synthetic data to) for an end-to-end integrated approach.
Connection features that we support:
The Syntho Engine is shipped in a Docker container and can be easily deployed and plugged into your environment of choice.
Possible deployment options include:
Read more.