Synthetic data, as the name suggests, is data that is artificially created rather than generated by real-world events. It is typically created with the help of algorithms and is used for a wide range of activities, including as test data for new products and tools, for model validation, and for training AI models.
The questions that this post sets out to answer include:
Why is synthetic data important now?
Synthetic data is important because it can be generated to meet specific needs or conditions that are not available in existing (real) data. This is useful in numerous cases, such as:
- when privacy requirements limit how real data can be accessed or used
- when data is needed to test a product but either does not exist or is not available to the testers
Though synthetic data was first used in the 1990s, the abundance of computing power and storage in the 2010s brought it into more widespread use.
What are its applications?
Business functions that can benefit from synthetic data include:
- Marketing: Synthetic data allows marketing teams to run detailed, individual-level simulations to optimize their marketing spend. Under GDPR, such simulations on real customer data would require user consent; synthetic data, which follows the statistical properties of the real data, can be used reliably in its place.
- Machine learning: Synthetic data is increasingly used to train models; self-driving car simulations pioneered this use.
- Agile development and DevOps: In software testing and quality assurance, artificially generated data (often called 'test data' in this context) is frequently the better choice because it eliminates the need to wait for 'real' data. This can shorten test cycles and increase flexibility and agility during development.
- Clinical and scientific trials: Synthetic data can be used as a baseline for future studies and testing when no real data yet exists.
- Research: To help better understand the format of real data not yet recorded, develop an understanding of its specific statistical properties, tune parameters for related algorithms, or build preliminary models.
- Security: Synthetic data can be used to secure organizations’ online & offline properties. Two methods are commonly used:
- Training data for video surveillance: To take advantage of image recognition, organizations need to create and train neural network models, which faces two bottlenecks: acquiring large volumes of data and manually tagging the objects in it. Synthetic data can help train models at lower cost than acquiring and annotating real training data.
- Deep fakes: Deep fakes can be used to test the robustness of face recognition systems.
Industries that can benefit from synthetic data:
- Automotive: Research to develop autonomous things such as robots, drones, and self-driving cars pioneered the use of synthetic data through simulation.
- Robotics: Real-life testing of robotic systems is expensive and slow. Synthetic data enables companies to test their robotics solutions in thousands of simulations, improving their robots and complementing expensive real-life testing.
- Manufacturing: As Leo Tolstoy writes at the beginning of Anna Karenina, "All happy families are alike; each unhappy family is unhappy in its own way." The same goes for anomalies: it is hard to test whether a system identifies them, since there are infinitely many ways things can go wrong. By generating varied anomalies on demand, synthetic data enables more effective testing of quality control systems, improving their performance.
- Financial services: Fraud protection is a major part of any financial service and with synthetic data, new fraud detection methods can be tested and evaluated for their effectiveness.
- Healthcare: Synthetic data enables healthcare data professionals to allow the public use of record data while still maintaining patient confidentiality.
- Social Media: Facebook is using synthetic data to improve its various networking tools and to fight fake news, online harassment, and political propaganda from foreign governments by detecting bullying language on the platform.
Synthetic data allows us to continue developing new and innovative products and solutions when the data necessary to do so would otherwise be unavailable.
Comparing synthetic and real data performance
Data is used in applications, and the most direct measure of data quality is the data's effectiveness in use. Machine learning is one of the most common use cases for data today. MIT scientists wanted to test whether machine learning models built from synthetic data could perform as well as models built from real data. In a 2017 study, they split data scientists into two groups: one using synthetic data and the other using real data. 70% of the time, the group using synthetic data produced results on par with the group using real data.
Benefits of synthetic data
Being able to generate data that mimics the real thing may seem like a limitless way to create scenarios for testing and development. While there is much truth to this, it is important to remember that any synthetic model derived from data can only replicate specific properties of that data, meaning it will ultimately only be able to simulate general trends.
However, synthetic data has several benefits over real data:
- Overcoming real data usage restrictions: Real data may have usage constraints due to privacy rules or other regulations. Synthetic data can replicate all important statistical properties of real data without exposing real data, thereby eliminating the issue.
- Creating data to simulate not yet encountered conditions: Where real data does not exist, synthetic data is the only solution.
- Immunity to some common statistical problems: These can include item nonresponse, skip patterns, and other logical constraints.
- Focuses on relationships: Synthetic data aims to preserve the multivariate relationships between variables instead of specific statistics alone.
These benefits demonstrate that the creation and usage of synthetic data will only stand to grow as our data becomes more complex and more closely guarded.
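The "focuses on relationships" benefit above can be made concrete with a small sketch: fit a bivariate Gaussian to two correlated variables and sample synthetic records that preserve their correlation. The variable names, means, and correlation below are invented for illustration.

```python
import math
import random

# Hypothetical parameters "fitted" to real data: e.g. customer
# age and monthly spend, with a known correlation to preserve.
random.seed(1)
mu = (50.0, 30.0)   # means
sd = (10.0, 5.0)    # standard deviations
rho = 0.8           # correlation to preserve

def sample_pair():
    # Cholesky-style construction: correlate two standard normals
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = mu[0] + sd[0] * z1
    y = mu[1] + sd[1] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
    return x, y

synthetic = [sample_pair() for _ in range(20000)]
xs, ys = zip(*synthetic)

# empirical Pearson correlation of the synthetic sample
n = len(synthetic)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in synthetic) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vy = sum((y - my) ** 2 for y in ys) / n
corr = cov / math.sqrt(vx * vy)
print(f"synthetic correlation: {corr:.2f}")  # close to 0.8
```

No individual synthetic record corresponds to a real person, yet the multivariate relationship between the two variables survives, which is exactly what downstream analyses need.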
Synthetic data creation 101
When determining the best method for creating synthetic data, it is important to first consider what type of synthetic data you aim to have. There are two broad categories to choose from, each with different benefits and drawbacks:
Fully synthetic: This data does not contain any original data. This means that re-identification of any single unit is almost impossible and all variables are still fully available.
Partially synthetic: Only sensitive values are replaced with synthetic data, imputed from a model fitted to the real data. Because much of the real data remains, the result depends less on the imputation model than fully synthetic data does, but some disclosure is possible owing to the true values that remain in the dataset.
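A minimal sketch of the partially synthetic approach: only the sensitive field is replaced, here by draws from a normal model fitted to the real values. The records, field names, and the choice of a simple Gaussian imputation model are all invented for the example.

```python
import random
import statistics

random.seed(2)

# Made-up "real" records; salary is the sensitive field.
real = [
    {"dept": "sales", "tenure": 3, "salary": 52000},
    {"dept": "sales", "tenure": 7, "salary": 61000},
    {"dept": "eng",   "tenure": 2, "salary": 70000},
    {"dept": "eng",   "tenure": 9, "salary": 95000},
]

# Fit a (very crude) imputation model to the sensitive column.
salaries = [r["salary"] for r in real]
mu, sd = statistics.mean(salaries), statistics.stdev(salaries)

# Replace only the sensitive field; keep everything else as-is.
partially_synthetic = [
    {**r, "salary": round(random.gauss(mu, sd))} for r in real
]
```

Note the trade-off the text describes: the non-sensitive fields (`dept`, `tenure`) keep their true values, so some disclosure risk remains even though salaries are now model-generated.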
Two general strategies for building synthetic data include:
Drawing numbers from a distribution: This method works by observing the statistical distributions of real data and sampling new, fake data from them. It can also include fitting generative models to the real data.
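A minimal sketch of this first strategy, assuming a normal distribution fits the observed values: estimate the distribution's parameters from real data, then draw synthetic values from it. The "real" transaction amounts below are made up for the example.

```python
import random
import statistics

random.seed(3)

# Made-up observed values, e.g. transaction amounts.
real_amounts = [12.5, 18.0, 22.4, 9.9, 15.1, 30.2, 17.8, 21.0]

# Estimate the distribution's parameters from the real data.
mu = statistics.mean(real_amounts)
sd = statistics.stdev(real_amounts)

# Draw as many synthetic values as needed from the fitted model.
synthetic_amounts = [random.gauss(mu, sd) for _ in range(1000)]
print(round(statistics.mean(synthetic_amounts), 1))  # close to mu
```

In practice one would test the distributional assumption first (or use a non-parametric method), but the shape of the strategy is the same: fit, then sample.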
Agent-based modeling: In this method, a model is built that explains an observed behavior, and then random data is reproduced using the same model. It emphasizes understanding the effects of interactions between agents on the system as a whole.
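A toy agent-based sketch of the second strategy: each "customer" agent follows a simple behavioral rule, and the simulation's event log is the synthetic dataset. The behavior model (purchase probability that drifts with past purchases) is invented for the example.

```python
import random

random.seed(4)

def simulate_customer(days=30, buy_prob=0.2):
    """One agent: each day it buys with some probability, and a
    purchase slightly raises the chance it buys again soon."""
    p, log = buy_prob, []
    for day in range(days):
        bought = random.random() < p
        log.append((day, bought))
        # hypothesized behavior rule: habit formation / decay
        p = min(0.9, p + 0.05) if bought else max(0.05, p - 0.01)
    return log

# The combined event stream from many agents is the synthetic data.
events = [e for _ in range(100) for e in simulate_customer()]
print(len(events))  # 100 agents x 30 days = 3000 events
```

Unlike drawing from a fitted distribution, the data here is a by-product of the behavioral model, so the synthetic records carry the temporal structure that the agents' interactions produce.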
Challenges of Synthetic Data
Though synthetic data has various benefits that can ease data science projects for organizations, it also has limitations:
- Outliers may be missing: Synthetic data can only mimic real-world data; it is not an exact replica. Therefore, synthetic data may not cover outliers that the original data contains. Yet outliers can be more important than regular data points, as Nassim Nicholas Taleb explains in depth in his book The Black Swan.
- Quality of the model depends on the data source: The quality of synthetic data is highly correlated with the quality of the input data and the data generation model. Synthetic data may reflect the biases in the source data.
- User acceptance is more challenging: Synthetic data is an emerging concept and it may not be accepted as valid by users who have not witnessed its benefits before.
- Synthetic data generation requires time and effort: Though easier to create than actual data, synthetic data is also not free.
- Output control is necessary: Especially with complex datasets, the best way to ensure the output is accurate is to compare synthetic data with authentic or human-annotated data, because inconsistencies can arise when trying to replicate the complexities of the original dataset.
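One simple form of the output control described above (an illustrative check, not a full validation suite) is to compare summary statistics of the synthetic data against the authentic data it should resemble. The sample values and the gap threshold below are invented for the example.

```python
import statistics

def summary_gap(real, synthetic):
    """Absolute gaps in mean and standard deviation between the
    authentic sample and the synthetic one."""
    return (
        abs(statistics.mean(real) - statistics.mean(synthetic)),
        abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    )

# Made-up authentic and synthetic samples of the same variable.
real = [10, 12, 11, 13, 9, 14, 10, 12]
synthetic = [11, 13, 10, 12, 9, 15, 11, 12]

mean_gap, sd_gap = summary_gap(real, synthetic)
# Flag the synthetic data for review if the gaps grow too large.
assert mean_gap < 1.0 and sd_gap < 1.0
```

Real pipelines typically go further, comparing full distributions and cross-variable relationships, but even a cheap check like this catches gross generation failures early.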
Machine Learning and Synthetic Data: Building AI
The role of synthetic data in machine learning is increasing rapidly. This is because machine learning algorithms are trained with an incredible amount of data which could be difficult to obtain or generate without synthetic data. It can also play an important role in the creation of algorithms for image recognition and similar tasks that are becoming the baseline for AI.
There are several additional benefits to using synthetic data to aid in the development of machine learning:
- Ease in data production once an initial synthetic model/environment has been established
- Accuracy in labeling that would be expensive or even impossible to obtain by hand
- The flexibility of the synthetic environment to be adjusted as needed to improve the model
- Usability as a substitute for data that contains sensitive information
Two synthetic data use cases that are gaining widespread adoption in their respective machine learning communities are:
Self-driving car simulations
Learning by real-life experiment is hard in life, and it is hard for algorithms as well. It is especially hard for the people who end up being hit by self-driving cars, as in Uber's deadly crash in Arizona. While Uber scales back its Arizona operation, it should probably ramp up its simulations to train its models.
Industry leaders such as Google have been relying on simulations to create millions of hours of synthetic driving data to train their algorithms.
Generative Adversarial Networks (GAN)
These networks, also called GANs or generative adversarial neural networks, were introduced by Ian Goodfellow et al. in 2014 and are a recent breakthrough in image generation. A GAN is composed of one generator network and one discriminator network. While the generator produces synthetic images that are as close to reality as possible, the discriminator tries to tell real images from synthetic ones. Both networks improve at their tasks as training progresses, each pushing the other to get better.
While this method is popular in the neural networks used for image tasks, it has uses beyond them and can be applied to other machine learning approaches as well. The general framework is sometimes called Turing learning, as a reference to the Turing test, in which a human converses with an unseen interlocutor and tries to decide whether it is a machine or a human.
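To make the adversarial setup concrete, here is a deliberately tiny, pure-Python sketch of the idea, not the neural-network formulation from Goodfellow et al.: the "real" data is drawn from a 1-D Gaussian, the "generator" is an affine map on noise, and the "discriminator" is a one-feature logistic classifier. All constants are invented for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-x))

# "Real" data ~ N(4, 1.25); generator g(z) = a*z + b on noise z;
# discriminator d(x) = sigmoid(w*x + c).
w, c = 0.1, 0.0   # discriminator parameters
a, b = 1.0, 0.0   # generator parameters
lr = 0.05

for step in range(2000):
    real = random.gauss(4.0, 1.25)
    z = random.gauss(0.0, 1.0)
    fake = a * z + b

    # Discriminator step: push d(real) toward 1 and d(fake) toward 0.
    p_real = sigmoid(w * real + c)
    p_fake = sigmoid(w * fake + c)
    w += lr * ((1 - p_real) * real - p_fake * fake)
    c += lr * ((1 - p_real) - p_fake)

    # Generator step: push d(fake) toward 1 (fool the discriminator).
    p_fake = sigmoid(w * fake + c)
    grad = (1 - p_fake) * w   # gradient of log d(fake) w.r.t. fake
    a += lr * grad * z
    b += lr * grad

# b should drift toward the real mean (4.0) as the two sides compete.
print(f"generator offset b = {b:.2f}")
```

The alternation is the essential GAN pattern: the discriminator's improving judgment is what supplies the training signal that makes the generator's output more realistic.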
Synthetic data case studies
Challenge: To create an augmented reality experience in a mobile app centered on the exterior of an automobile, Laan Labs needed to estimate the position and orientation of the car in real time. This required collecting 10,000+ images, but acquiring that much image data is costly and labor-intensive.
Solution: Laan Labs developed a synthetic data generator for image training. They trained a neural network on photorealistic images rendered from 3D car models with varied background scenes and lighting.
Results: Real image training data is costly and requires labor-intensive labeling. Since the synthetic images did not need manual annotation, Laan Labs saved money and work hours and eliminated the risk of human error during annotation.
Challenge: Manheim is one of the world's leading vehicle auction companies. It was migrating from a batch-processing system to one that operates in near real time in order to accelerate remittances and payments. However, testing this process required large volumes of test data, which Manheim used to create by copying its production datasets, a process that was inefficient, time-consuming, and dependent on specific skill sets.
Solution: As part of its digital transformation, Manheim changed its method of test data generation, purchasing CA Test Data Manager to generate large volumes of data in a short period. With synthetic data, Manheim is able to test its initiatives effectively.
Synthetic data tools
The tools related to synthetic data are often developed to meet one of the following needs:
- Test data for software development and similar
- The creation of machine learning models (referred to in the chart as ‘training data’)
We prepared a regularly updated, comprehensive sortable/filterable list of leading vendors in synthetic data generation software. Some common vendors that are working in this space include:
| Name | Founded | Status | Number of Employees |
| --- | --- | --- | --- |
| CA Technologies Datamaker | 1976 | Public | 10,001+ |
| Deep Vision Data by Kinetic Vision | 1985 | Private | 51-200 |
| Delphix Test Data Management | 2008 | Private | 201-500 |
| Informatica Test Data Management Tool | 1993 | Private | 1,001-5,000 |
These tools are just a small representation of a growing market of tools and platforms related to the creation and usage of synthetic data. For the full list, please refer to our comprehensive list.
Synthetic data is one way that the world is evolving to deal with not only an increasing volume of data, but also with data that is oftentimes sensitive and requires additional protections. To learn more about related topics on data, be sure to see our research on data.