We benchmarked 7 publicly available synthetic data generators from 4 distinct providers, using a holdout evaluation on a dataset of 70,000 samples with 4 numerical and 7 categorical features, to assess how well they replicate real-world data characteristics.
Rank | Synthesizer | Best For
---|---|---
1 | YData | Statistically most accurate
2 | MOSTLY AI | Easy to use
3 | Synthetic Data Vault | Open source applications
4 | Gretel | Custom data generation




Below, you can see the benchmark results where we statistically compare the synthetic data generators.
You can find more detailed information about the metrics in the methodology.
Synthetic data generation plays a key role in maintaining privacy while preserving data utility. By design, synthetic data mimics the statistical properties of real data without exposing sensitive information. However, if synthetic data can be reverse-engineered to reconstruct the original dataset, it undermines its fundamental purpose of safeguarding privacy.
Large Language Models (LLMs) are among the largest producers of synthetic data. Numerous benchmarks for state-of-the-art (SOTA) LLMs rely on these models to generate test cases for evaluating other LLMs. Moreover, LLMs themselves are often trained on synthetic data, leveraging the diversity and scale of artificial datasets to improve their performance.
As in most areas of AI, deep learning is integral to synthetic data generation: synthetic data created by deep learning models is, in turn, used to train and improve other deep learning models. AIMultiple explains synthetic data generation techniques, as well as best practices.
Why is synthetic data important for businesses?
Synthetic data is important for businesses for three reasons:
- privacy,
- product testing,
- and training machine learning algorithms.
Industry leaders also started to discuss the importance of data-centric approaches to AI/ML model development, to which synthetic data can add significant value.
When is synthetic data used?
Businesses face a trade-off between data privacy and data utility when selecting a privacy-enhancing technology. Therefore, they need to determine the priorities of their use case before investing. Synthetic data does not contain any personal information; it is sample data with a distribution similar to the original data.
Though synthetic data can be less useful than real data in some cases, it can also be almost as valuable. For example, a team at Deloitte Consulting generated 80% of the training data for a machine-learning model by synthesizing data. The resulting model's accuracy was similar to that of a model trained on real data.
Synthetic data generation can help build accurate machine-learning models, especially when training data is highly imbalanced (e.g., more than 99% of instances belong to one class).
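For the imbalanced case above, one common way to synthesize additional minority-class samples is SMOTE. The following is a minimal sketch applied to a toy dataset, assuming the imbalanced-learn package is installed (an assumption; it is not part of the benchmark tooling):
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Toy dataset where ~99% of instances belong to one class
X, y = make_classification(n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42)
# SMOTE interpolates between minority-class neighbors to create synthetic rows
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)
print(int(y.sum()), int(y_balanced.sum()))  # minority count before vs. after oversampling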
See the list of synthetic data use cases.
How do businesses generate synthetic data?

Businesses can use different methods, such as decision trees, deep learning techniques, and iterative proportional fitting, to execute the data synthesis process. They should choose the method according to the synthetic data requirements and the level of data utility desired for the specific purpose of data generation.
After data synthesis, they should assess the utility of synthetic data by comparing it with real data. The utility assessment process has two stages:
- General-purpose comparisons: Comparing parameters such as distributions and correlation coefficients measured from the two datasets
- Workload-aware utility assessment: Comparing the accuracy of outputs for the specific use case by performing the analysis on the synthetic data (a brief sketch follows this list)
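As a minimal sketch of a workload-aware check, one can train the same model on real and on synthetic data and compare accuracy on a real holdout set ("train on synthetic, test on real"). The file names and the target column below are placeholders, not the benchmark's actual assets:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
real_train = pd.read_csv("real_train.csv")      # placeholder file paths
real_holdout = pd.read_csv("real_holdout.csv")
synthetic = pd.read_csv("synthetic.csv")
target = "label"  # placeholder target column
def fit_and_score(train_df):
    # Train on the given data, then evaluate on the real holdout set
    model = RandomForestClassifier(random_state=42)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    preds = model.predict(real_holdout.drop(columns=[target]))
    return accuracy_score(real_holdout[target], preds)
print("Trained on real data:     ", fit_and_score(real_train))
print("Trained on synthetic data:", fit_and_score(synthetic))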
What are the techniques of synthetic data generation?
Generating according to distribution
For cases where real data does not exist but the data analyst has a comprehensive understanding of how the dataset distribution would look, the analyst can generate a random sample from any distribution, such as the normal, exponential, chi-square, t, lognormal, or uniform distribution. In this technique, the utility of the synthetic data depends on the analyst’s degree of knowledge about the specific data environment.
Fitting real data to a known distribution
If real data exists, businesses can generate synthetic data by determining the best-fit distribution for that data. When the data can be fit to a known distribution and the distribution parameters are known, the Monte Carlo method can be used to generate synthetic data (a brief sketch follows the steps below). Monte Carlo is a computational technique that uses random sampling and statistical modeling to solve problems that may be deterministic in principle but are too complex for direct analytical solutions. The steps are:
- Define the Problem: Specify the problem to be solved, often involving parameters with known or assumed distributions.
- Generate Random Inputs: Use random number generation to create inputs, often based on a probability distribution.
- Run the Model: Perform simulations using these inputs to evaluate the outcome of the system or process.
- Aggregate Results: Collect the results of all simulations and calculate statistical measures such as averages, variances, or probabilities.
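As a minimal sketch of these steps, assuming the real data is a single numerical column and that a lognormal distribution is a plausible fit, SciPy can estimate the parameters and then sample synthetic values from the fitted distribution:
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
real = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)  # stand-in for a real numerical column
# Fit the assumed distribution to the real data
shape, loc, scale = stats.lognorm.fit(real, floc=0)
# Generate synthetic values by sampling the fitted distribution
synthetic = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=5_000, random_state=42)
# Aggregate results: compare summary statistics and distribution distance
print(real.mean(), synthetic.mean())
print(stats.ks_2samp(real, synthetic).statistic)  # two-sample Kolmogorov-Smirnov distance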
The Monte Carlo method helps find a good fit for synthetic data, but it may not always meet business needs. In such cases, machine learning models such as decision trees can capture complex, non-classical distributions, producing synthetic data that is highly correlated with the original data. However, ML models risk overfitting, which can reduce their ability to generalize and predict future observations reliably.
Businesses can use hybrid synthetic data generation for cases where only some part of real data exists. In this case, analysts generate one part of the dataset from theoretical distributions and generate other parts based on real data.
Using deep learning
Deep generative models such as Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) can generate synthetic data.
Variational Autoencoder
VAE is an unsupervised method in which an encoder compresses the original dataset into a more compact latent representation and passes it to a decoder. The decoder then generates an output that approximates the original dataset. The system is trained by minimizing the reconstruction error between input and output, together with a regularization term on the latent space (see the sketch below).
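The following is a minimal, illustrative VAE sketch for numerical tabular data, assuming PyTorch is installed; the dimensions, dataset, and training length are placeholders rather than a production setup:
import torch
import torch.nn as nn
data_dim, latent_dim = 4, 2
encoder = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
real_data = torch.randn(1000, data_dim)  # stand-in for a real numerical dataset
for step in range(1000):
    batch = real_data[torch.randint(0, len(real_data), (64,))]
    mu, logvar = encoder(batch).chunk(2, dim=1)                    # latent mean and log-variance
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)        # reparameterization trick
    recon = decoder(z)
    recon_loss = ((recon - batch) ** 2).mean()                     # reconstruction error
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # latent regularization
    loss = recon_loss + kl
    opt.zero_grad()
    loss.backward()
    opt.step()
# New synthetic rows are produced by decoding random latent vectors
synthetic_rows = decoder(torch.randn(500, latent_dim)).detach()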

Generative Adversarial Network
In the GAN model, two networks, a generator and a discriminator, are trained iteratively against each other. The generator takes random noise as input and produces a synthetic dataset, while the discriminator tries to distinguish the generated data from the real dataset; feedback from the discriminator is used to improve the generator (see the sketch below).
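Below is a minimal, illustrative GAN sketch for numerical tabular data, again assuming PyTorch is installed; the network sizes, dataset, and training loop are placeholders:
import torch
import torch.nn as nn
latent_dim, data_dim = 16, 4
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
real_data = torch.randn(1000, data_dim)  # stand-in for a real numerical dataset
for step in range(1000):
    # Train the discriminator: real samples should score 1, synthetic samples 0
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, latent_dim)).detach()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + loss_fn(discriminator(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Train the generator: try to make the discriminator score synthetic samples as real
    g_loss = loss_fn(discriminator(generator(torch.randn(64, latent_dim))), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
# After training, the generator maps random noise to synthetic samples
synthetic_rows = generator(torch.randn(500, latent_dim)).detach()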

How do you generate synthetic data in Python?
There are several ways to generate synthetic data for simple tasks in Python, and with some effort, these methods can also be adapted for complex tasks.
1. Basic random data generation using NumPy
import numpy as np
# Generate 100 random integers between 0 and 10
random_integers = np.random.randint(0, 11, size=100)
# Generate 50 random floating-point numbers between 0 and 1
random_floats = np.random.rand(50)
2. Generating data with specific distributions using NumPy
import numpy as np
# Generate 100 values from a normal distribution (mean 0, standard deviation 1)
normal_data = np.random.normal(loc=0, scale=1, size=100)
# Generate 50 values from a uniform distribution between -1 and 1
uniform_data = np.random.uniform(low=-1, high=1, size=50)
3. Generating realistic data using the Faker library
from faker import Faker
fake = Faker() # You can also use Faker('fr_FR') for French or Faker('de_DE') for German
# Generate a fake name
name = fake.name()
# Generate a fake address
address = fake.address()
# Generate a fake email
email = fake.email()
# Generate a fake date
date = fake.date()
# Generate a fake credit card number
credit_card = fake.credit_card_number()
4. Generating data for classification and regression problems using scikit-learn
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression
# Generate a synthetic classification dataset
X_classification, y_classification = make_classification(
    n_samples=100,
    n_features=5,
    n_informative=3,
    n_redundant=1,
    random_state=42
)
# Generate a synthetic regression dataset
X_regression, y_regression = make_regression(
    n_samples=100,
    n_features=4,
    noise=0.1,
    random_state=42
)
What are the best practices for synthetic data generation?

1-Ensure that synthetic data mirrors real data for the application
The utility of synthetic data varies depending on the technique you use to generate it. You need to analyze your use case and decide whether the generated synthetic data fits that specific use case.
Real-life example:
JP Morgan has applied synthetic data in finance to generate accurate financial models while safeguarding customer privacy. Their approach includes testing synthetic data to ensure that it reflects the real characteristics of their financial data sets. This is particularly important when using synthetic data to train fraud detection algorithms, where the synthetic data must behave like real transactions to identify fraudulent patterns. 4
2-Work with clean data
Clean data is an essential requirement of synthetic data generation. If you don’t clean and prepare data before synthesis, you can end up with a garbage-in, garbage-out situation. In the data preparation process, make sure you apply the following principles (a brief sketch follows the list):
- Data cleaning: Eliminating inaccurate, improperly formatted, redundant, or incomplete data from a dataset
- Data harmonization: Combining data from several sources into a consistent, comparable view, so that information from different studies can be interpreted in the same way
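As a minimal pandas sketch of the cleaning step, with the file name and column names being placeholders for illustration only:
import pandas as pd
df = pd.read_csv("raw_records.csv")  # placeholder input file
df = df.drop_duplicates()                              # remove redundant rows
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # fix improperly formatted values
df = df.dropna(subset=["age", "diagnosis"])            # drop rows lacking key fields (placeholder columns)
df = df[df["age"].between(0, 120)]                     # remove implausible values
df["region"] = df["region"].str.strip().str.upper()    # harmonize categorical labels across sources
df.to_csv("clean_records.csv", index=False)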
Real-life example:
The Institute for Informatics, Data Science, and Biostatistics (I2DB) at Washington University in St. Louis adopted the MDClone platform in 2018. Through a landmark study, they confirmed that synthetic data can yield the same analytical outcomes as real data while also preserving privacy. This initiative is part of the university’s broader strategy to enhance data-driven research by providing secure and innovative resources to the academic community.5
Before creating synthetic data, patient records are cleaned—this involves removing errors and duplicates and ensuring data consistency. By cleaning and harmonizing Electronic Health Records (EHRs) from different departments, MDClone ensures that synthetic versions of these records maintain the utility of real data for medical research while protecting patient privacy.
3-Rely on vendors if necessary
Identify your organization’s synthetic data capabilities and outsource based on capability gaps. The two important steps are data preparation and data synthesis. Suppliers can automate both steps.
4-Ensure that synthetic data complies with privacy regulations
When generating synthetic data, ensuring the privacy of individuals whose data is used to create the synthetic dataset is critical. Compliance with data privacy regulations, such as GDPR (General Data Protection Regulation) and HIPAA, is mandatory. Synthetic data should be completely detached from any real individual, and there should be no way to trace synthetic data back to the original records.
Real-world datasets often contain sensitive information, especially in the healthcare, finance, and telecommunications industries. If synthetic data is not generated with strong privacy measures, it could lead to legal repercussions and customer trust issues.
5-Mitigate bias
Synthetic data generation should aim to avoid introducing or perpetuating biases present in real-world datasets. This involves monitoring for potential bias in attributes like race, gender, or socioeconomic status, possibly leading to discriminatory outcomes if left unchecked. Bias mitigation is crucial for ensuring fair and equitable models.
Biased data leads to biased models, which in turn can perpetuate social inequalities. For instance, an AI hiring tool trained on biased data may disproportionately reject candidates from certain demographic groups. Ensuring that synthetic data is balanced and unbiased helps develop fairer AI models.
Methodology
We conducted the benchmark to evaluate the performance of 7 publicly available synthetic data generators using a holdout dataset. The dataset consists of 70,000 samples and includes 4 numerical and 7 categorical features.
Data Preparation
The dataset was first homogenized, meaning the distributions of features were consistent across the dataset, eliminating the need for additional preprocessing, such as normalization or handling missing values. To prepare the data for training and evaluation, we performed the following steps:
- Shuffling: The dataset was randomly shuffled to ensure that the data points were distributed uniformly and to avoid any ordering bias.
- Splitting: The shuffled dataset was split roughly into two equal halves:
- Training Data: 35,000 samples were used to train the synthetic data generators.
- Holdout Data: 35,000 samples reserved for evaluation to assess the quality of the generated synthetic data against unseen real data.
Since the dataset was homogenized, the distributions of features in the training and holdout sets were similar, ensuring a fair comparison.
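A minimal sketch of this shuffle-and-split step, with the file name as a placeholder and a 50/50 split mirroring the 35,000/35,000 setup described above:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("benchmark_dataset.csv")  # placeholder file with 70,000 rows
# Shuffle and split into equal halves: training data for the generators, holdout data for evaluation
train_df, holdout_df = train_test_split(df, test_size=0.5, shuffle=True, random_state=42)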
Training Synthetic Data Generators
We selected 7 synthetic data generators for this benchmark: YData, Mostly AI, Gretel, and 4 generators from Synthetic Data Vault. The training process for each generator was as follows:
- SDV and YData: We used the SDKs and models provided by SDV and YData, configured with their default settings to ensure consistency and reproducibility.
- Mostly AI and Gretel: These generators were trained using their platforms, with settings configured as recommended by the providers.
Each generator was trained exclusively on the training data (35,000 samples) to generate synthetic datasets that mimic the characteristics of the real data.
Evaluation
After training, each synthetic data generator produced a synthetic dataset. We evaluated the quality of these synthetic datasets by comparing them against both the training data (to assess how well the generators captured the training distribution) and the holdout data (to evaluate generalization to unseen data).
We evaluated the synthetic data quality using three metrics:
Correlation Distance (Δ)
Measures the absolute difference between correlation matrices of numerical features in real and synthetic datasets, assessing how well relationships (e.g., between house size and energy consumption) are preserved.
- Range: 0 (perfect) to 1 (worst).
- Purpose: Ensures structural relationships are maintained, vital for tasks like regression.
Kolmogorov-Smirnov Distance (K)
Measures the maximum difference between cumulative distribution functions (CDFs) of numerical features, evaluating how well their marginal distributions (e.g., distribution of customer ages) are captured.
- Range: 0 (identical) to 1 (completely different).
- Purpose: Ensures realistic numerical distributions, which are crucial for simulations or statistical analysis.
Total Variation Distance (TVD)
Measures the difference between probability distributions of categorical features, calculated as half the sum of absolute differences (e.g., distribution of customer regions).
- Range: 0 (identical) to 1 (completely different).
- Purpose: Assesses how well categorical distributions are captured, important for tasks like classification.
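The following is a minimal sketch of how these three metrics can be computed for two pandas DataFrames, real and synth; the column lists are placeholders, and this is not the exact implementation used in the benchmark:
import numpy as np
from scipy.stats import ks_2samp
# `real` and `synth` are assumed to be pandas DataFrames loaded elsewhere
numerical = ["num_a", "num_b"]        # placeholder numerical columns
categorical = ["cat_a", "cat_b"]      # placeholder categorical columns
def correlation_distance(real, synth):
    # Mean absolute difference between the numerical correlation matrices (off-diagonal entries)
    diff = (real[numerical].corr() - synth[numerical].corr()).abs()
    return diff.values[np.triu_indices(len(numerical), k=1)].mean()
def ks_distance(real, synth):
    # Average two-sample Kolmogorov-Smirnov statistic over the numerical columns
    return float(np.mean([ks_2samp(real[c], synth[c]).statistic for c in numerical]))
def total_variation_distance(real, synth):
    # Average TVD (half the L1 distance between category frequencies) over the categorical columns
    scores = []
    for c in categorical:
        p = real[c].value_counts(normalize=True)
        q = synth[c].value_counts(normalize=True)
        p, q = p.align(q, fill_value=0)
        scores.append(0.5 * (p - q).abs().sum())
    return float(np.mean(scores))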
Model-based evaluation
Statistical metrics are useful for gaining initial insights into synthetic data quality. However, the true evaluation should be model-based, focusing on the practical utility of synthetic data for model training. It is crucial to measure how synthetic data benefits model performance.
In our evaluation, we assessed the quality of synthetic data generated using the default settings of the synthesizers and checked for overfitting on the original training data.
We also investigated whether augmenting training data with synthetic data could improve accuracy. However, our tests, involving incrementally adding synthetic data from 10% up to 100% of the training set size, showed that model accuracy consistently decreased even with the smallest addition.
This decline correlated with the volume of synthetic data added, indicating increased noise introduction. Notably, synthesizers producing higher-quality data (lower error rates in our benchmark) introduced less noise and resulted in smaller performance drops.
Customizing these generators for specific needs in practical applications could produce significantly better results. Furthermore, we have omitted the performance results of models trained on augmented data. This is because those models were also tested using default configurations without hyperparameter tuning. Presenting such results would not constitute a fair comparison, given that real-world deployment necessitates model optimization.
In model-based evaluations, particularly when using a partial augmentation approach blending synthetic with real data, careful consideration of the synthetic data volume is vital. Adding too much synthetic data can introduce noise and diminish the model’s ability to learn effectively from the real data, potentially leading to an inaccurate assessment of synthetic data’s true benefits.
Synthetic Data FAQ
What is synthetic data?
Synthetic data is artificial data created by using different algorithms that mirror the statistical properties of the original data but do not reveal any information regarding real-world events or people.
For example, data produced by computer simulations would qualify as synthetic data. This includes applications like music synthesizers, medical imaging, economic models, and flight simulators, where the outputs mimic real-world phenomena but are entirely generated through algorithms.
In what fields is synthetic data commonly used?
Synthetic data is widely used in healthcare, finance, autonomous vehicles, gaming, cybersecurity, and any field where data privacy is crucial or real data is scarce or biased.
What are the limitations of synthetic data?
Limitations include potential inaccuracies if the synthetic data doesn’t accurately reflect real-world complexities, the risk of introducing bias, and the need for sophisticated algorithms and expertise to generate high-quality synthetic data.
How does synthetic data relate to data privacy regulations like GDPR?
Synthetic data can help comply with data privacy regulations like GDPR by ensuring that the data used for analysis or AI training doesn’t contain personally identifiable information. However, compliance also depends on the methodology used to generate the synthetic data.
Can synthetic data replace real data?
While synthetic data can supplement real data in many scenarios, especially where privacy or data scarcity is a concern, it’s not always a complete replacement. The decision to use synthetic versus real data depends on the specific use case, the quality of the synthetic data, and the criticality of accuracy.
External Links
- 1. Data Anonymization. The Hashbrown Blog.
- 2. Detailed Explanation of Core Generative AI Technologies: From GANs to Transformers (生成式AI核心技术详解：从GANs到Transformers). Tencent Cloud Developer Community, Tencent Cloud.
- 3. Generative Adversarial Network: Basics of GAN. Darshan Dilipbhai Patel. Medium.
- 4. Synthetic Data for Real Insights.
- 5. MDClone. Informatics, Data Science & Biostatistics, Washington University in St. Louis.