Synthetic data is artificial data generated to preserve privacy, test systems, or create training data for machine learning algorithms. The generation method matters because it largely determines the quality of the synthetic data; for example, synthetic data that can be reverse engineered to identify real records would be useless for privacy protection.

As in most AI-related topics, deep learning comes up in synthetic data generation as well: data created by deep learning algorithms is being used to improve other deep learning algorithms. Below, we explain the main synthetic data generation techniques, along with best practices:

What is synthetic data?

Synthetic data is artificial data created by algorithms that mirror the statistical properties of the original data without revealing any information about real individuals. For more information, feel free to check our comprehensive synthetic data article.

Why is synthetic data important for businesses?

Synthetic data is important for businesses for three reasons: privacy, product testing, and training machine learning algorithms. For more detailed information, please check our ultimate guide to synthetic data.

When to use synthetic data

Businesses face a trade-off between data privacy and data utility when selecting a privacy-enhancing technology, so they need to determine the priorities of their use case before investing. Synthetic data does not contain any personal information; it is sample data with a distribution similar to that of the original data. However, its utility can be lower than that of real data.

How do businesses generate synthetic data?

An illustration of how synthetic data is generated
Source: O’Reilly

Businesses can choose among different methods, such as decision trees, deep learning techniques, and iterative proportional fitting, to execute the data synthesis process. They should pick the method according to their synthetic data requirements and the level of data utility desired for the specific purpose of data generation.

After data synthesis, they should assess the utility of synthetic data by comparing it with real data. The utility assessment process has two stages:

  • General purpose comparisons: Comparing parameters such as distributions and correlation coefficients measured from the two datasets
  • Workload-aware utility assessment: Comparing the accuracy of outputs for the specific use case by performing the same analysis on synthetic data
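
As a sketch of the first stage, the snippet below compares a "real" and a "synthetic" column using summary statistics and a two-sample Kolmogorov-Smirnov test. Both columns are simulated here purely for illustration; in practice, `real` would come from your dataset and `synthetic` from your generator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical example: stand-ins for a real column and its synthetic counterpart.
real = rng.normal(loc=50, scale=10, size=1_000)
synthetic = rng.normal(loc=50, scale=10, size=1_000)

# General-purpose comparison: summary statistics plus a two-sample
# Kolmogorov-Smirnov test on the marginal distributions. A small KS
# statistic (and a large p-value) suggests the marginals are similar.
print("mean diff:", abs(real.mean() - synthetic.mean()))
ks = stats.ks_2samp(real, synthetic)
print("KS statistic:", ks.statistic, "p-value:", ks.pvalue)
```

A workload-aware assessment would go one step further: run the actual downstream analysis (e.g., train the same model) on both datasets and compare the results.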

What are the techniques of synthetic data generation?

Generating according to distribution

For cases where real data does not exist but the analyst has a comprehensive understanding of what the dataset distribution would look like, the analyst can generate a random sample from any distribution, such as normal, exponential, chi-square, t, lognormal, or uniform. With this technique, the utility of the synthetic data depends on the analyst's degree of knowledge about the specific data environment.
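
A minimal sketch of this technique with NumPy, assuming (purely for illustration) that an analyst believes amounts are lognormal, waiting times are exponential, and ages are normal:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10_000
# Each column is drawn directly from the distribution the analyst assumes,
# with hand-picked parameters; no real data is involved.
amounts = rng.lognormal(mean=3.0, sigma=0.5, size=n)  # assumed lognormal
waits = rng.exponential(scale=2.0, size=n)            # assumed exponential
ages = rng.normal(loc=40, scale=12, size=n)           # assumed normal

print(amounts.mean(), waits.mean(), ages.mean())
```

The quality of such data is only as good as the assumed distributions and parameters.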

Fitting real data to a known distribution

If real data exists, businesses can generate synthetic data by determining the best-fit distribution for that data. If they want to fit the real data to a known distribution and know the distribution parameters, they can use the Monte Carlo method to generate synthetic data.

Though the Monte Carlo method can help businesses find the best available fit, that fit may not offer enough utility for their synthetic data needs. In those cases, businesses can consider using machine learning models to fit the distributions. Models such as decision trees allow businesses to capture non-classical distributions, for example multi-modal ones, that do not share the common characteristics of known distributions. With such a machine-learning-fitted distribution, businesses can generate synthetic data that is highly correlated with the original data. However, machine learning models carry a risk of overfitting, failing to fit new data or predict future observations reliably.
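
As a sketch of the fit-then-sample approach with SciPy: fit a candidate distribution to the real data by maximum likelihood, then draw a synthetic sample from the fitted parameters (the Monte Carlo step). The "real" column here is simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical "real" column; in practice this comes from your dataset.
real = rng.gamma(shape=2.0, scale=3.0, size=5_000)

# Fit a candidate distribution (gamma, with location fixed at 0) to the
# real data via maximum likelihood estimation.
shape, loc, scale = stats.gamma.fit(real, floc=0)

# Monte Carlo step: sample synthetic data from the fitted distribution.
synthetic = stats.gamma.rvs(shape, loc=loc, scale=scale, size=5_000,
                            random_state=rng)

print(f"fitted shape={shape:.2f}, scale={scale:.2f}")
print("real mean:", real.mean(), "synthetic mean:", synthetic.mean())
```

In practice, you would try several candidate distributions and pick the one with the best goodness-of-fit for your data.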

For cases where only part of the real data exists, businesses can also use hybrid synthetic data generation. In this case, analysts generate one part of the dataset from theoretical distributions and the rest based on real data.
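
A minimal sketch of a hybrid dataset, assuming (hypothetically) that real observations exist only for an income column while age must come from an assumed theoretical distribution. Here the real-data-based part is generated by simple bootstrap resampling, one of several possible choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the real observations of the one column we do have.
real_income = rng.lognormal(mean=10, sigma=0.4, size=2_000)

n = 2_000
# Part 1: a column with no real data, drawn from an assumed distribution.
age = rng.normal(loc=40, scale=12, size=n)
# Part 2: a column generated from real data (bootstrap resampling here).
income = rng.choice(real_income, size=n, replace=True)

hybrid = np.column_stack([age, income])
print(hybrid.shape)
```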

Using deep learning

Deep generative models such as the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN) can generate synthetic data.

Variational Autoencoder

The VAE is an unsupervised method in which an encoder compresses the original dataset into a more compact latent representation and passes it to a decoder. The decoder then generates an output that approximates the original dataset. The system is trained by minimizing the reconstruction error between input and output, together with a regularization term on the latent space.


An illustration of how deep learning variational autoencoder works
Source: Towards Data Science

Generative adversarial network

In the GAN model, two networks, a generator and a discriminator, are trained iteratively. The generator takes random noise as input and generates a synthetic dataset. The discriminator tries to distinguish the synthetically generated data from the real dataset, while the generator is trained to fool the discriminator; the two improve together over the course of training.

A representation of how GAN deep generative model works
Source: Medium

How to generate synthetic data in Python?

Python is one of the most popular languages, especially for data science. Here are three libraries that data scientists can use to generate synthetic data:

  • Scikit-learn is one of the most widely-used Python libraries for machine learning tasks and it can also be used to generate synthetic data. One can generate data that can be used for regression, classification, or clustering tasks.
  • SymPy is another library that can help generate synthetic data. Users specify symbolic expressions for the data they want to create, which lets them tailor synthetic data to their needs.
  • Pydbgen: Categorical data can also be generated using Python’s Pydbgen library. Users can generate random names, international phone numbers, email addresses etc. easily using the library.
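
As a sketch of the scikit-learn option: `make_classification` and `make_regression` produce labeled synthetic datasets for classification and regression tasks. The sizes and parameters below are arbitrary example values.

```python
from sklearn.datasets import make_classification, make_regression

# Synthetic classification data: 500 samples, 10 features, 2 classes,
# 5 of the features carrying real signal.
X_clf, y_clf = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=2, random_state=0
)

# Synthetic regression data: 500 samples, 10 features, with Gaussian noise.
X_reg, y_reg = make_regression(
    n_samples=500, n_features=10, noise=0.1, random_state=0
)

print(X_clf.shape, y_clf.shape, X_reg.shape)
```

Note that these generators produce statistically plausible toy data for benchmarking models, not privacy-preserving copies of an existing dataset.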

What are the best practices?

  • Work with clean data: Clean data is an essential requirement of synthetic data generation. If you don’t clean and prepare the data before synthesis, you risk a garbage-in, garbage-out situation. In the data preparation process, make sure you apply the following principles:
    • Data cleaning
    • Data harmonization: For example, same attributes from different sources need to be mapped to the same column
  • Assess whether synthetic data is similar enough to real data for its application area: The utility of synthetic data varies depending on the technique used to generate it. You need to analyze your use case and decide whether the generated synthetic data is a good fit for it.
  • Outsource support if necessary: Identify your organization’s synthetic data capabilities and outsource based on the capability gaps. The two important steps are data preparation and data synthesis; both can be automated by suppliers.

What are synthetic data generation tools?

The synthetic data generation process is a two-step process: you need to prepare the data before synthesizing it. There are various vendors in the space for both steps.

If you want to learn about leading data preparation tools, you can check our list of the top 152 data quality software tools.

If you are looking for a synthetic data generator tool, feel free to check our sortable list of synthetic data generator vendors.

Synthetic data is not the only way to prevent data breaches; feel free to read our other security- and privacy-related articles:

Source: O’Reilly, Practical Synthetic Data Generation

