To leverage or build generative AI or conversational AI solutions, a large amount of data is required. You can use existing datasets available on the market or hire a data collection service.
Here, we offer a list of different types of existing datasets to train your machine-learning models and some recommendations on how to find the right one.
Table 1. A list of ML datasets and data sources
Dataset Name | Domain / Source of Dataset | Description | Free / Paid |
---|---|---|---|
Clickworker | Custom human-generated datasets | Freshly collected/generated data via a 4.5+ million crowd | Paid |
Amazon Mechanical Turk | Custom human-generated datasets | Freshly collected/generated data via a 0.5+ million crowd | Paid |
Amazon reviews dataset | Natural Language Processing (NLP) | Dataset includes product reviews and metadata | Free |
Appen | Custom human-generated datasets | Freshly collected/generated data via a 1+ million crowd | Paid |
AWS public datasets | Open datasets | Datasets in areas including biology, meteorology, astronomy, and others. | Free |
Baidu ApolloScape Dataset | Image datasets | Annotated images for autonomous driving systems | Free |
COCO dataset | Image datasets | Over 200K labeled images | Free |
Common Voice | Audio datasets | Crowdsourced database for speech recognition data | Free |
Data USA | Public government datasets | Over 47,000 reports | Free |
Data.Gov.uk | Public government datasets | Over 47,000 non-personal UK government datasets | Free |
ESC-50 | Audio datasets | Over 2K environmental audio recordings | Free |
EU open data portal | Public government datasets | Over 1.6 million datasets | Free |
Flickr audio caption corpus | Audio datasets | Over 8K audio recordings of spoken descriptions for images from the Flickr8K dataset | Free |
Free music archive (FMA) | Audio datasets | Over 106K tracks | Free |
GitHub | Others | A comprehensive library of datasets | Free & Paid |
Google dataset search | Open datasets | Open source dataset search engine | Free |
Google’s open images | Image datasets | Over 9 million images | Free |
Hazy | Custom machine-generated datasets | Synthetic data platform | Paid |
Healthdata.gov | Healthcare Datasets | Over 2K health-related datasets from the U.S. government. | Free |
ImageNet | Image datasets | Over 14.1 million annotated images | Free for non-commercial use |
Kaggle datasets | Open datasets | Data provided by companies and students | Free |
MIMIC Critical Care Database | Healthcare Datasets | Health-related data of over 40K patients from the Beth Israel Deaconess Medical Center. | Free |
OpenAI GPT-4 | Custom machine-generated datasets | LLM for AI training data generation | Freemium |
Speech commands dataset | Audio datasets | Over 65,000 crowdsourced speech data units | Free |
Synthesis AI | Custom machine-generated datasets | Synthetic data generation | Paid |
Telus International | Custom human-generated datasets | Freshly collected/generated data via a 1+ million crowd | Paid |
The Big Bad NLP Database (BBNLPDB) | Natural Language Processing (NLP) | Over 300 datasets for NLP models | Free |
Waymo Open Dataset | Image datasets | Image dataset for autonomous vehicle research | Free for non-commercial use |
Wikipedia Links data | Natural Language Processing (NLP) | Cross-document coreference dataset labeled via links to Wikipedia | Free |
- The GitHub row of the table points to a comprehensive library of free and paid datasets that we might have missed. If you think any datasets are missing from this list, let us know in the comments section.
- The list is compiled from dataset websites.
- Quantities in the description column may change over time.
- “Free for non-commercial use” means the dataset is free for researchers and academics.
What are ML datasets?
A machine learning dataset is a collection of data that is used to train a model. The dataset supplies the examples from which the machine learning algorithm learns how to make predictions. The common types of data include:
- Text data
- Image data
- Audio data
- Video data
- Numeric data
The data is usually labeled (annotated) first so that the algorithm can learn what the expected outcome is.
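As a rough illustration of what labeled data can look like, here is a minimal sketch of a tiny text dataset in which each record pairs an input with its annotation. The `LabeledExample` class and the sample reviews are hypothetical and exist only for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """One record of a supervised dataset: the raw input plus its annotation."""
    text: str   # the input the model sees
    label: str  # the outcome a human annotator assigned


# Hypothetical, hand-written reviews used only for illustration.
dataset = [
    LabeledExample("The battery lasts all day", "positive"),
    LabeledExample("Stopped working after a week", "negative"),
    LabeledExample("Great value for the price", "positive"),
]

# Counting labels is a quick sanity check for class imbalance.
label_counts = Counter(example.label for example in dataset)
print(label_counts)
```

Real datasets are far larger, but the structure is usually the same: an input plus the label the model is expected to predict.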
Why prepare datasets for machine learning?
Preparing and choosing the right dataset is one of the most crucial steps in training an AI/ML model. It can be the deciding factor between the success and failure of the AI/ML development project.
There are 3 key purposes of an AI/ML dataset:
- To train the model
- To measure the accuracy of the model once it is trained
- To improve the model once it is deployed in a live setting
Work with a data partner
You can outsource dataset preparation to a data collection or generation service provider, such as a data crowdsourcing platform or company.
You can also select a data partner based on specific data types:
- 10+ Image Data Collection Services
- 10+ Speech Data Collection Services
- 7+ Video Data Collection Services & Selection Criteria
Types of ML datasets
The whole dataset that is collected is separated into 3 subsets, which are as follows:
1. Training dataset
This is one of the most important subsets of the whole dataset, comprising about 60% of the total dataset. This set comprises the data that will initially be used to train the model. In other words, it helps teach the algorithm what to look for in the data.
For instance, a vehicle license plate recognition system is trained on images labeled with the location of the plate (e.g., front or rear of the car) and the plate format, along with images of similar-looking objects, so that it learns what to detect and what to ignore.
Figure 1. Sample dataset for a license plate detection system¹

2. Validation dataset
This subset is about 20% of the total dataset and is used to evaluate the model and tune its settings during and after training. Because the correct outcomes for the validation data are known, it helps identify shortcomings in the model. This data is also used to check whether the model is overfitting or underfitting.
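One common way to spot over- or underfitting is to compare accuracy on the training data with accuracy on the validation data. The sketch below is a rough heuristic only: the function name `diagnose_fit` and both thresholds (`min_acc`, `max_gap`) are assumptions made for illustration, not standard values.

```python
def diagnose_fit(train_acc: float, val_acc: float,
                 min_acc: float = 0.7, max_gap: float = 0.1) -> str:
    """Rough heuristic; both thresholds are illustrative assumptions."""
    # Poor accuracy even on the training data suggests the model is too simple.
    if train_acc < min_acc:
        return "underfitting"
    # A large train/validation gap suggests the model memorized the training data.
    if train_acc - val_acc > max_gap:
        return "overfitting"
    return "ok"


print(diagnose_fit(train_acc=0.99, val_acc=0.75))
print(diagnose_fit(train_acc=0.60, val_acc=0.58))
```

Suitable thresholds depend heavily on the task, so treat the result as a prompt for closer inspection rather than a verdict.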
3. Test dataset
This subset is used at the final stage of the training process and accounts for the last 20% of the dataset. The data in this subset has never been seen by the model and is used to test its accuracy. In simpler words, this dataset shows how much your model has learned from the previous 2 subsets.
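The 60/20/20 split described above can be sketched in a few lines of standard-library Python. The function name `split_dataset`, its parameters, and the fixed seed are assumptions for this example; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle a copy of the records and cut it into train/validation/test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, about 20% here
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters: if the records are ordered (for example by date or by class), slicing without shuffling would give the three subsets different distributions.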

Categories and the best datasets
Sourcing a dataset depends on the requirements and scope of the project. This section highlights some popular sources of acquiring datasets to train AI and machine learning models.
1. Custom human-generated datasets
Datasets can be prepared with fresh data that is collected or generated by humans. Data collection services and companies offer vast pools of workers that help prepare human-generated datasets for machine learning.
Some popular names include:
- Clickworker
- Amazon Mechanical Turk
- Appen
- Telus International
2. Custom machine-generated datasets
Custom machine-generated datasets produced by generative AI tools, such as Generative Adversarial Networks (GANs), have transformed the landscape of data creation and augmentation. Creating datasets using generative AI addresses several challenges in machine learning. Studies have used LLMs to generate training data for machine learning models.²
In situations where collecting real-world data is expensive, time-consuming, or ethically challenging, generative models can supplement or even replace traditional data collection methods.
For instance, medical imaging datasets can be augmented using GANs to generate more samples of rare conditions, making it easier to train models that can detect and diagnose these conditions.
Additionally, in domains like computer vision (CV), generating diverse data helps mitigate model overfitting and improve the robustness of the trained models. This synthesized data, when used carefully alongside real data, can help train more effective and accurate machine learning models, while saving resources and time in the data collection phase.
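To make the idea of supplementing scarce real data more concrete, here is a minimal sketch. Note that it uses simple noise-based jitter as a stand-in, not a GAN or an LLM, and that the function `augment_with_noise`, its parameters, and the sample feature vectors are all hypothetical.

```python
import random


def augment_with_noise(samples, copies=2, noise=0.05, seed=0):
    """Create extra numeric samples by adding small Gaussian noise to real ones.

    This is plain jitter-based augmentation, not a GAN or LLM.
    """
    rng = random.Random(seed)
    synthetic = []
    for features, label in samples:
        for _ in range(copies):
            jittered = [value + rng.gauss(0.0, noise) for value in features]
            synthetic.append((jittered, label))  # the label is kept unchanged
    return synthetic


# Hypothetical feature vectors for a rare class.
real = [([0.9, 0.1, 0.4], "rare_condition"), ([0.8, 0.2, 0.5], "rare_condition")]
combined = real + augment_with_noise(real)
print(len(real), "real samples ->", len(combined), "after augmentation")
```

Generative models aim to produce far more varied and realistic samples than this, but the workflow is similar: synthetic records are added alongside the real ones rather than replacing them outright.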
Some popular names offering synthetic data through generative AI models include:
- Synthesis AI
- Hazy
- LLMs like GPT-4 by OpenAI
3. Natural language processing datasets
NLP datasets are used for speech recognition, text analytics, and language translation. These types of datasets are large in size and require heavy computational power.
Some popular NLP datasets include:
- Amazon Reviews
- The Big Bad NLP Database
- Wikipedia Links Data
4. Open datasets
These ready-to-use datasets are freely available online for anyone to download, modify, and distribute without legal or financial restrictions. These datasets are regularly updated and are compatible with most ML frameworks. The only drawback is that open datasets lack personalization.
Popular open datasets include:
- Google dataset search
- AWS public datasets
- Kaggle datasets
5. Public government datasets
These datasets are used for government projects that are implemented for the public. For example, these datasets can include a certain population’s census or demographic data. These datasets can be used to make policies or train AI/ML models for immigration decision-making, chatbots that answer citizen queries, city infrastructure planning systems, etc.
Popular public government datasets include:
- Data.Gov.uk
- EU open data portal
- Data USA
6. Image datasets
Image datasets include both image and video data. This type of dataset is used to train computer vision systems for facial recognition, autonomous vehicle systems, retail security systems, etc. These datasets require high-quality image annotation before they can be used.
Popular image datasets include:
- Google’s open images
- COCO dataset
- ImageNet
- Baidu ApolloScape Dataset
- Waymo Open Dataset
7. Audio datasets
These datasets are used to train AI/ML models for voice recognition, music recognition, etc.
Popular audio datasets include:
- ESC-50 (environmental audio recordings)
- Speech commands dataset
- Free music archive (FMA)
- Flickr audio caption corpus
- Common Voice
8. Healthcare Datasets
These datasets are used to train medical imaging systems or medical diagnosis systems. They are usually large in size and require heavy computational power and high-quality medical annotation.
Popular healthcare datasets include:
- MIMIC Critical Care Database
- Healthdata.gov
You can also check out our data-driven list of data collection/harvesting services to find the option that best suits your project needs.
FAQs
Where to get datasets for ML?
To find datasets for machine learning, data scientists can explore various data repositories offering diverse datasets, including demographic data, economic and financial data, and public government data. These curated datasets cover a range of applications, such as natural language processing, sentiment analysis, computer vision, and healthcare.
Resources like open datasets, free datasets, and public datasets provide high-quality training data, validation datasets, and test datasets in various data formats like CSV files. Popular sources include government portals, academic institutions, and organizations like the International Monetary Fund, offering extensive collections of datasets for ML projects, predictive models, and deep learning algorithms.
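Since many of these sources distribute data as CSV files, a minimal sketch of reading one with Python's standard `csv` module may help. The CSV content and its column names (`review`, `label`) are hypothetical; a real file would be opened with `open(path, newline="")` instead of the in-memory string used here.

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded dataset file.
raw_csv = """review,label
Great product,positive
Broke in a week,negative
"""

# DictReader uses the header row to name each column.
rows = list(csv.DictReader(io.StringIO(raw_csv)))
texts = [row["review"] for row in rows]
labels = [row["label"] for row in rows]
print(texts, labels)
```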
What kind of dataset is good for machine learning?
A good machine learning dataset is a high-quality, diverse dataset with rich metadata, suitable for specific tasks like natural language processing, image classification, or sentiment analysis, and is often available from public data repositories or open datasets.
Further reading
- Crowdsourced AI Data Collection Benefits & Best Practices
- Data Collection Challenges and Solutions
- Sentiment Analysis: How it Works & Best Practices