Sentiment analysis helps you understand how customers feel about a company and whether those feelings are associated with sales, investments, or agreements. Reliable sentiment analysis depends on many factors, and one of its building blocks is the dataset used to train the models. However, finding the right dataset is easier said than done.
See the top sentiment analysis datasets to train your algorithms for more efficient and accurate sentiment analysis:
Sentiment analysis datasets worth knowing
Dataset | Data type |
---|---|
TweetEval | tweets |
MPQA Opinion Corpus | news articles |
Amazon Review Data | customer reviews, product metadata, and links |
Stanford Sentiment Treebank | movie reviews labelled with positive, negative, and neutral emotional tone |
Financial Phrasebank | financial news labelled with positive, negative, and neutral emotional tone |
Webis-CLS-10 Dataset | customer reviews in 4 languages |
CMU Multimodal Opinion Sentiment and Emotion Intensity | multimodal data extracted from YouTube videos, such as the sentences and the voice tone used |
Yelp Polarity Reviews | businesses, reviews, and user data |
WordStat Sentiment Dictionary | content analysis dictionaries |
Sentiment Lexicons For 81 Languages | dictionary that includes positive and negative sentiment lexicons for 81 languages |
Social Media Sentiment | user-generated content, encompassing text, timestamps, hashtags, countries, likes, and retweets |
Hotel Reviews | hotels and their reviews |
Although the quantity of data is crucial, its quality and relevance are just as essential for reliable results. For instance, if a retail company trains a customer sentiment analysis model on a dataset full of financial jargon, the algorithm may not provide reliable results, because vocabulary and tone learned from financial text may not be a good fit for messages from retail customers.
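One rough way to check whether a candidate training set fits your domain is to compare vocabularies. The sketch below is a minimal illustration rather than a standard metric: the function names, the regex tokenizer, and the toy sentences are all assumptions made for the example.

```python
import re


def vocabulary(texts):
    """Return the set of lowercase word tokens found in a list of texts."""
    tokens = set()
    for text in texts:
        tokens.update(re.findall(r"[a-z']+", text.lower()))
    return tokens


def domain_coverage(training_texts, target_texts):
    """Share of the target-domain vocabulary that also appears in the training data."""
    train_vocab = vocabulary(training_texts)
    target_vocab = vocabulary(target_texts)
    if not target_vocab:
        return 0.0
    return len(target_vocab & train_vocab) / len(target_vocab)


# Toy data (hypothetical): financial training text vs. retail customer messages.
financial = ["Operating profit rose and the dividend yield improved."]
retail = ["The shoes arrived late and the size was wrong."]

print(f"coverage: {domain_coverage(financial, retail):.2f}")
```

A low coverage value suggests that many words your customers use never appear in the training data, which is a hint (not proof) that the dataset is a poor fit.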
So, having the right training dataset is crucial for evaluating reviews and for developing new strategies from the insights you gather. Here we list 12 sentiment analysis datasets to help you train your algorithm and obtain better results.
TweetEval
TweetEval is a comprehensive benchmark designed for multi-class tweet classification, encompassing seven diverse tasks on Twitter data. These tasks include emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection.
The datasets are standardized with fixed training, validation, and test splits, offering a unified framework to evaluate models like RoBERTa re-trained on Twitter data.1
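Fixed splits make scores comparable only when the same metric is used as well. As a minimal sketch, the function below computes macro-averaged recall, a metric often used for three-class tweet sentiment because it weights rare classes equally. The label names and toy predictions are assumptions for illustration and are not TweetEval data.

```python
from collections import defaultdict


def macro_recall(gold, predicted):
    """Average the per-class recall so every class counts equally."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = defaultdict(int)  # correctly predicted examples per gold class
    total = defaultdict(int)    # number of gold examples per class
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    if not total:
        return 0.0
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy test-split labels (hypothetical).
gold = ["negative", "neutral", "neutral", "positive", "positive", "positive"]
pred = ["negative", "neutral", "positive", "positive", "positive", "negative"]

print(round(macro_recall(gold, pred), 3))
```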
MPQA Opinion Corpus
The MPQA Opinion Corpus is a rich dataset comprising news articles and other text documents annotated for opinions, sentiments, beliefs, emotions, and speculations. Its latest version, MPQA 3.0, introduced entity/event-level target (eTarget) annotations, which enable precise identification of attitudes toward specific entities and events, unlike the previous span-based target (sTarget) annotations.
This corpus, with 70 documents, is instrumental for training sentiment analysis systems to recognize nuanced subjectivities in text.2
Amazon Review Data
This dataset contains product information (e.g., color, category, size, and images) and more than 230 million customer reviews from 1996 to 2018.3 Each review comes with a star rating, which is commonly used as a proxy for positive, negative, and neutral emotional tone.
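When reviews come with star ratings, a common preprocessing step is to turn each rating into a coarse sentiment label. The sketch below assumes JSON-lines input with `overall` (star rating) and `reviewText` fields and a 1-2 / 3 / 4-5 split. Both the field names and the thresholds are assumptions to check against the file you actually download.

```python
import json


def rating_to_label(stars):
    """Map a 1-5 star rating to a coarse sentiment label (assumed thresholds)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def label_reviews(json_lines):
    """Parse one JSON object per line and attach a sentiment label to each review."""
    labelled = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        labelled.append(
            {
                "text": record.get("reviewText", ""),
                "label": rating_to_label(record["overall"]),
            }
        )
    return labelled


# Two made-up records in the assumed format.
sample = [
    '{"overall": 5.0, "reviewText": "Works perfectly."}',
    '{"overall": 2.0, "reviewText": "Stopped working after a week."}',
]
for row in label_reviews(sample):
    print(row["label"], "-", row["text"])
```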
Stanford Sentiment Treebank
Most sentiment analysis tools categorize sentences by giving a sentiment score to each word without considering the sentence as a whole. The Stanford Sentiment Treebank instead scores the phrases that make up each sentence as well as the full sentence. It contains almost 10,000 movie reviews with sentiment scores ranging from 1 to 25, where 1 represents the most negative reviews and 25 the most positive ones.4
Figure 1. An example of a movie review and the sentiment score of each aggregate

Source: The Stanford NLP Group5
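A fine-grained score is often collapsed into a handful of classes before training. As a rough sketch, the function below rescales a 1 to 25 score to the 0 to 1 range and bins it into five equal-width classes. The equal-width cutoffs and the label names are assumptions for illustration, so check the dataset's own documentation before relying on them.

```python
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def score_to_class(score, low=1, high=25):
    """Rescale a score in [low, high] to [0, 1] and pick one of five equal bins."""
    if not low <= score <= high:
        raise ValueError(f"score must be between {low} and {high}")
    normalized = (score - low) / (high - low)
    # min() keeps the top score (normalized == 1.0) inside the last bin.
    index = min(int(normalized * len(LABELS)), len(LABELS) - 1)
    return LABELS[index]


for s in (1, 7, 13, 19, 25):
    print(s, "->", score_to_class(s))
```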
Financial Phrasebank
The Financial PhraseBank dataset contains almost 5,000 English sentences from financial news, each classified by emotional tone as positive, negative, or neutral. All the data is annotated by researchers knowledgeable in the finance domain.6
Figure 2. Examples of the sentences from financial news and the corresponding sentiment label class

Source: HuggingFace7
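Before training on a three-class dataset like this one, it helps to check how the labels are distributed. The sketch below assumes each line of a raw text file has the form `sentence@label`. That line format, the helper names, and the sample sentences are assumptions for illustration, so verify them against the files you download.

```python
from collections import Counter

VALID_LABELS = {"positive", "negative", "neutral"}


def parse_line(line):
    """Split 'sentence@label' on the last '@' so '@' inside the sentence is kept."""
    sentence, _, label = line.strip().rpartition("@")
    if not sentence or label not in VALID_LABELS:
        raise ValueError(f"unexpected line format: {line!r}")
    return sentence.strip(), label


def label_distribution(lines):
    """Count how many sentences carry each label, ignoring blank lines."""
    return Counter(parse_line(line)[1] for line in lines if line.strip())


# Invented lines in the assumed format.
sample = [
    "Net sales increased by 12% compared with last year.@positive",
    "The company will cut 200 jobs.@negative",
    "The meeting will be held in Helsinki.@neutral",
    "Profit for the period was up from the previous quarter.@positive",
]
print(label_distribution(sample))
```

A heavily skewed distribution is a signal to use class-aware metrics or resampling rather than plain accuracy.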
Webis-CLS-10 Dataset
The Webis cross-lingual sentiment dataset (Webis-CLS-10) includes 800,000 Amazon product reviews in English, German, French, and Japanese.8 Its multilingual nature allows for reaching more audiences and conducting comprehensive analyses.
CMU Multimodal Opinion Sentiment and Emotion Intensity
Customers’ sentiments regarding services or products appear not only in text; they can also be detected from video and audio. The CMU-MOSEI dataset includes multimodal data extracted from YouTube videos, such as the sentences and the voice tone used.9
Figure 3. The word cloud of the topics mentioned in the videos

Source: Carnegie Mellon University10
Yelp Polarity Reviews
This open-source dataset includes more than 500,000 training samples drawn from consumer reviews, ratings, and recommendations.11 Each review carries a polarity label (positive or negative), and requested keywords can be extracted.
WordStat Sentiment Dictionary
The WordStat Sentiment Dictionary classifies sentiments as negative or positive and combines three dictionaries: the Harvard IV Dictionary, the Regressive Imagery Dictionary, and the Linguistic Inquiry and Word Count (LIWC) dictionary.12 The combination of different dictionaries allows for identifying synonyms and word patterns automatically.
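Dictionary-based scoring is simple to prototype. The sketch below merges several small word lists into one lexicon and scores a sentence by counting positive and negative hits. The word lists are toy stand-ins invented for the example, not entries from the WordStat dictionary, and the sketch ignores negation and word patterns.

```python
import re


def merge_lexicons(*lexicons):
    """Union the positive and negative word sets of several dictionaries."""
    merged = {"positive": set(), "negative": set()}
    for lexicon in lexicons:
        merged["positive"] |= set(lexicon.get("positive", ()))
        merged["negative"] |= set(lexicon.get("negative", ()))
    return merged


def lexicon_score(text, lexicon):
    """Return (positive hits - negative hits) for the tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in lexicon["positive"] for token in tokens)
    negative = sum(token in lexicon["negative"] for token in tokens)
    return positive - negative


# Toy dictionaries (hypothetical contents).
dict_a = {"positive": ["good", "great"], "negative": ["bad"]}
dict_b = {"positive": ["excellent"], "negative": ["poor", "bad"]}
lexicon = merge_lexicons(dict_a, dict_b)

print(lexicon_score("Great room but poor service and bad wifi", lexicon))
```

A score above zero leans positive and a score below zero leans negative. Merging by set union means a word listed in several dictionaries is only counted once.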
Sentiment Lexicons For 81 Languages
Although English is the most spoken language globally, it is also crucial to analyze the sentiment of people who speak other languages.13 This dataset provides sentiment lexicons for 81 languages, including Chinese, Spanish, and German, so it covers a broader range of languages and represents worldwide sentiment better.14
Social Media Sentiment
The Social Media Sentiments Analysis Dataset captures diverse emotions and interactions across social media platforms, including text, timestamps, hashtags, and engagement metrics.15 It is a useful resource for sentiment analysis applications, offering insights into positive and negative words and overall sentiment scores from user-generated content worldwide.
Hotel Reviews
This sentiment analysis dataset contains 1,000 hotels and their reviews, including details like location, rating, and review text.16 It can be used for aspect-based sentiment analysis, market research, and social media monitoring, allowing for the correlation of positive and negative words with review ratings and the computation of sentiment confidence scores.
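To illustrate how word counts can be related to ratings, the sketch below computes the Pearson correlation between a simple positive-minus-negative word score and the star rating. The word lists, reviews, and ratings are invented for the example and are not drawn from the dataset.

```python
import math
import re

POSITIVE = {"clean", "friendly", "great", "comfortable"}
NEGATIVE = {"dirty", "rude", "noisy", "broken"}


def word_score(text):
    """Count positive words minus negative words in a review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (review text, star rating) pairs.
reviews = [
    ("Clean room and friendly staff, great stay", 5),
    ("Comfortable bed but noisy street", 3),
    ("Dirty bathroom, rude reception, broken lift", 1),
]
scores = [word_score(text) for text, _ in reviews]
ratings = [stars for _, stars in reviews]
print(round(pearson(scores, ratings), 3))
```

On real data the correlation will be far from perfect, and a weak value can indicate that the word lists miss the vocabulary guests actually use.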
Training your algorithm on a suitable dataset is essential in sentiment analysis, so working with reliable sources matters.
You can also check our data-driven list of sentiment analysis services.
FAQs
How to get data for sentiment analysis?
To get data for sentiment analysis, you can use existing sentiment analysis datasets such as the IMDB movie reviews dataset or paper reviews data. You can also draw on social media platforms like Twitter, using tweet text, Twitter user IDs, and retweet counts.
Additionally, datasets that contain both positive and negative reviews, together with sentiment lexicons in multiple languages (e.g., sentiment lexicons for 81 languages), can support sentiment analysis built on machine learning and natural language processing methods. For academic purposes, data from computing and informatics conferences can provide valuable insights.
Further Reading
- Consumer Insights: Why Is It Essential & Top 4 Data Sources
- Sentiment Analysis Stock Market: Sources & Challenges
- Top 7 Sentiment Analysis Challenges & Solutions
External Links
- 1. Cardiff NLP · GitHub.
- 2. Arguing Corpus | MPQA.
- 3. Amazon review data. UCSD CSE Research Project.
- 4. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
- 5. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
- 6. takala/financial_phrasebank · Datasets at Hugging Face.
- 7. takala/financial_phrasebank · Datasets at Hugging Face.
- 8. Webis Data Webis-CLS-10. The Web Technology & Information Systems Network.
- 9. CMU-MOSEI Dataset | MultiComp. MultiComp Lab.
- 10. CMU-MOSEI Dataset | MultiComp. MultiComp Lab.
- 11. Open Dataset | Yelp Data Licensing. Yelp for Business.
- 12. Wordstat. Provalis Research. Accessed: 25/July/2024.
- 13. Top 40 Most Spoken Languages in the World 2023. EduDwar.
- 14. Sentiment Lexicons for 81 Languages | Kaggle.
- 15. Social Media Sentiments Analysis Dataset 📊 | Kaggle.
- 16. Hotel Reviews. Data World. Accessed: 25/July/2024.