Synthetic data generation tools create artificial data to preserve data privacy, to test systems, or to produce training data for machine learning algorithms. For more detailed information about synthetic data, please check our ultimate guide to synthetic data.
This article collects 26 up-to-date synthetic data statistics from a range of reputable sources. The statistics cover market size forecasts, why synthetic data matters, how protective it is, its benefits, and funding and headcount figures for the top vendors.
Market size forecasts
The synthetic data market can be expected to keep growing at more than 10% per year, since most of it serves two segments with double-digit growth forecasts:
- Test data management, which is expected to grow at a 12.7% CAGR
- AI training data generation, which is expected to grow at a 22.5% CAGR
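To make the growth rates above concrete, here is a minimal sketch of how a compound annual growth rate (CAGR) translates into market size over time. The function names and the starting index of 100 are illustrative assumptions, not figures from the cited forecasts.

```python
import math


def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value at a constant annual growth rate for `years` years."""
    return start_value * (1.0 + cagr) ** years


def doubling_time(cagr: float) -> float:
    """Years needed for a value to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + cagr)


# Hypothetical starting index of 100 for both segments (not a real market size).
segments = {"Test data management": 0.127, "AI training data generation": 0.225}
for name, rate in segments.items():
    after_five = project_value(100.0, rate, 5)
    print(f"{name}: index {after_five:.1f} after 5 years, "
          f"doubles in {doubling_time(rate):.1f} years")
```

Under these assumptions, an index of 100 would reach roughly 182 after five years at 12.7% and roughly 276 at 22.5%.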
Why does synthetic data matter?
- Mostly AI claims that synthetic data can retain 99% of the information and value of the original dataset while protecting sensitive data from re-identification. However, this result comes from a benchmark run by the company's own team, and the underlying data has not been published. (Mostly AI)
- “Companies only need 50% of their original, authentic training data to finish the formal training of their algorithms”, claims Yashar Behzadi, CEO of Neuromation, a synthetic data generation startup. (eeNews)
- When training data is highly imbalanced (e.g. more than 99% of instances belong to one class), synthetic data generation is necessary to build accurate machine learning models. (Tensorflow)
- Another important function of synthetic data is keeping data secure, given that:
  - 17% of the global online population were victims of digital theft over the last few decades, and an estimated 80% of cybercrimes are not reported. (UN)
  - As of 2016, almost 90% of businesses had been victims of some kind of computer hack. (Secure Swiss Data)
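The class imbalance point in the list above can be illustrated with a small sketch. It uses SMOTE-style interpolation between minority-class samples, which is one common way to synthesize extra examples; it is an assumption chosen for illustration and not necessarily the approach in the cited Tensorflow material. The function name, the toy data, and the parameters `k` and `seed` are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy data: 990 majority rows vs. 10 minority rows (a 99% / 1% split).
toy_rng = random.Random(42)
majority = [(toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)) for _ in range(990)]
minority = [(toy_rng.gauss(3, 0.5), toy_rng.gauss(3, 0.5)) for _ in range(10)]

# Add enough synthetic minority rows to match the majority class size.
synthetic_minority = smote_like_oversample(minority, len(majority) - len(minority))
balanced_minority = minority + synthetic_minority
```

Because every synthetic point lies on a segment between two real minority samples, the new rows stay inside the region the minority class already occupies rather than duplicating existing rows exactly.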
How protective is synthetic data?
Synthetic data helps close privacy gaps that traditional anonymization techniques cannot, because anonymized records can often be re-identified from a handful of attributes. An AI-powered synthetic data generation tool can therefore offer better protection for sensitive data. Vendor claims include:
- 80% of credit card owners can be re-identified from 3 transactions when traditional anonymization techniques are used. (Mostly AI)
- 51% of mobile phone owners can be re-identified by 2 antenna signals when traditional anonymization techniques are used. (Mostly AI)
- 87% of all people can be re-identified by their birthday, gender and postcode when traditional anonymization techniques are used. (Mostly AI)
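The re-identification figures above rest on a simple idea: if a combination of seemingly harmless attributes (quasi-identifiers) is unique in a dataset, removing names does not protect the person behind the record. Below is a minimal sketch of how that uniqueness could be measured. The function name, field names, and toy records are hypothetical and are not taken from the cited sources.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"birthday": "1980-01-15", "gender": "F", "postcode": "1010", "diagnosis": "A"},
    {"birthday": "1980-01-15", "gender": "F", "postcode": "1010", "diagnosis": "B"},
    {"birthday": "1975-06-02", "gender": "M", "postcode": "1020", "diagnosis": "C"},
    {"birthday": "1990-11-30", "gender": "F", "postcode": "1030", "diagnosis": "A"},
]

share_all = unique_fraction(records, ["birthday", "gender", "postcode"])
share_gender = unique_fraction(records, ["gender"])
print(f"Unique on birthday+gender+postcode: {share_all:.0%}")
print(f"Unique on gender only: {share_gender:.0%}")
```

In this toy dataset, half of the records are unique on the three attributes combined, while only a quarter are unique on gender alone, which shows how quickly combinations of attributes single people out.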
Synthetic data benefits
Several case studies report that synthetic data can match or improve machine learning model accuracy:
- A team at Deloitte Consulting synthesized 80% of the training data for a model and achieved accuracy similar to that of a model trained on real data. (Deloitte)
- Microsoft generated 2 million synthetic sentences to improve translation of the Levantine dialect of Arabic. (Microsoft)
- A 2020 study shows that using synthetic data improved machine learning model performance by up to 20% when categorizing actions in videos. (American University of Beirut)
- Researchers were able to identify drivers of cars with 87% accuracy by analyzing synthesized sensor data generated by vehicles. (De Gruyter)
- A 2017 study showed that, 70% of the time, predictive models built with synthetic data produced results on par with models built with real data. (MIT News)
- A 2018 study shows that using synthetic data reduced the false-positive rate from 60% to 20% when predicting volcanic eruptions. (ScienceMag)
Top synthetic data vendor funding stats
- TwentyBN raised $12.5M (2 rounds)
- Hazy raised $6.8M (5 rounds)
- Mostly AI raised $6.1M (2 rounds)
- AI.Reverie raised $5.8M (4 rounds)
- DataGen Technologies raised $3.5M (1 round)
Top synthetic data vendors by number of employees
- TwentyBN has 11-50 employees
- Hazy has 11-50 employees
- Mostly AI has 11-50 employees
- AI.Reverie has 1-10 employees
- DataGen Technologies has 11-50 employees
For more detailed information about synthetic data generator vendors, please check our synthetic data vendor selection guide or contact us.
Sources: Mostly AI*, eeNews, Tensorflow, UN, Secure Swiss Data, Mostly AI**, Mostly AI***, Mostly AI****, Deloitte, Microsoft, American University of Beirut, De Gruyter, MIT News, ScienceMag. Funding and employee-count data are from Crunchbase.