AIMultiple Research
Updated on Apr 21, 2025

Sentiment Analysis Datasets 2025: Twitter and Amazon

Cem Dilmegani

Sentiment analysis helps companies understand how customers feel about them and whether those feelings are associated with sales, investments, or agreements. Reliable sentiment analysis depends on many factors, and one of its building blocks is the dataset used to train the models. However, finding the right dataset is easier said than done.

See the top sentiment analysis datasets to train your algorithms for more efficient and accurate sentiment analysis:

Sentiment analysis datasets worth knowing

Last updated: 12-24-2024

Dataset | Data type
TweetEval | tweets
MPQA Opinion Corpus | news articles
Amazon Review Data | customer reviews, product metadata, and links
Stanford Sentiment Treebank | movie reviews labelled with positive, negative, and neutral emotional tone
Financial Phrasebank | financial news labelled with positive, negative, and neutral emotional tone
Webis-CLS-10 Dataset | customer reviews in 4 languages
CMU Multimodal Opinion Sentiment and Emotion Intensity | multimodal data extracted from YouTube videos, such as the sentences and the voice tone used
Yelp Polarity Reviews | businesses, reviews, and user data
WordStat Sentiment Dictionary | content analysis dictionaries
Sentiment Lexicons For 81 Languages | dictionary of positive and negative sentiment lexicons for 81 languages
Social Media Sentiment | user-generated content, encompassing text, timestamps, hashtags, countries, likes, and retweets
Hotel Reviews | hotels and their reviews

Although the quantity of data is crucial, its quality and relevance are just as essential for reliable results. For instance, if a retail company trains a customer sentiment analysis model on a dataset full of financial jargon, the model may not produce reliable results, because the vocabulary and context of financial text are a poor fit for messages written by retail customers.
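One rough way to check the domain fit described above is to measure how much of the target vocabulary never appears in the training data. The sketch below is only an illustration: the function names and the sample sentences are hypothetical, and a high out-of-vocabulary rate is a warning sign rather than a definitive test.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(train_texts, target_texts):
    """Share of target-domain tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)

# Hypothetical sample sentences, invented for illustration only.
finance_train = [
    "Operating profit rose compared to the previous quarter",
    "The company issued a profit warning",
]
retail_target = [
    "The shoes arrived late and the box was damaged",
    "Great price and the delivery was quick",
]

print(f"Out-of-vocabulary rate: {oov_rate(finance_train, retail_target):.2f}")
```

The closer the rate is to 1, the less the training data resembles the messages the model will actually see.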

So, having the right training dataset is crucial when evaluating reviews, as the insights you gather can inform new strategies. Here we list twelve sentiment analysis datasets to help you train your algorithm and obtain better results.

TweetEval

TweetEval is a comprehensive benchmark designed for multi-class tweet classification, encompassing seven diverse tasks on Twitter data. These tasks include emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection.

The datasets are standardized with fixed training, validation, and test splits, offering a unified framework to evaluate models like RoBERTa re-trained on Twitter data.1
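Fixed test splits make scores comparable across models. As a minimal sketch, the snippet below computes macro-averaged recall, a metric commonly reported for three-class tweet sentiment because it is not dominated by the largest class. The labels and predictions are hypothetical, and `macro_recall` is an illustrative helper rather than part of any benchmark toolkit.

```python
from collections import Counter

def macro_recall(gold, predicted):
    """Average the per-class recall over every class present in the gold labels."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    totals = Counter(gold)                                   # examples per class
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)  # correct per class
    return sum(hits[label] / count for label, count in totals.items()) / len(totals)

# Hypothetical gold labels and model predictions for a held-out test split.
gold = ["negative", "neutral", "neutral", "positive", "positive", "positive"]
pred = ["negative", "neutral", "positive", "positive", "positive", "negative"]

print(f"Macro-averaged recall: {macro_recall(gold, pred):.3f}")
```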

MPQA Opinion Corpus

The MPQA Opinion Corpus is a rich dataset comprising news articles and other text documents annotated for opinions, sentiments, beliefs, emotions, and speculations. Its latest version, MPQA 3.0, introduced entity/event-level target (eTarget) annotations, which identify attitudes toward specific entities and events more precisely than the previous span-based target (sTarget) annotations.

This corpus, with 70 documents, is instrumental for training sentiment analysis systems to recognize nuanced subjectivities in text.2

Amazon Review Data

This dataset contains product information (e.g., color, category, size, and images) and more than 230 million customer reviews from 1996 to 2018.3 Each review carries a star rating, which is commonly mapped to a positive, negative, or neutral emotional tone.
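One common way to obtain positive, negative, and neutral labels from review data is to map star ratings to classes. The sketch below assumes JSON Lines records with an `overall` rating and a `reviewText` field; both the field names and the thresholds are assumptions for illustration and should be checked against the files you download.

```python
import json

def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a coarse sentiment label (assumed thresholds)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

# Hypothetical records in JSON Lines form; the field names are assumptions.
sample_lines = [
    '{"overall": 5.0, "reviewText": "Works perfectly, would buy again."}',
    '{"overall": 3.0, "reviewText": "Does the job, nothing special."}',
    '{"overall": 1.0, "reviewText": "Broke after two days."}',
]

labeled = []
for line in sample_lines:
    record = json.loads(line)
    labeled.append((record["reviewText"], rating_to_sentiment(record["overall"])))

for text, label in labeled:
    print(f"{label:>8}  {text}")
```

Treating three stars as neutral is a design choice; some projects drop three-star reviews entirely to get a cleaner binary task.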

Stanford Sentiment Treebank

Most sentiment analysis tools score each word separately without considering the sentence as a whole. The Stanford Sentiment Treebank instead assigns sentiment scores to phrases as well as full sentences. It contains almost 10,000 movie reviews with sentiment scores ranging from 1 to 25, where 1 represents the most negative and 25 the most positive.4

Figure 1. An example of a movie review and the sentiment score of each aggregate

Source: The Stanford NLP Group5
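Fine-grained scores like these are often collapsed into a handful of classes before training a classifier. The following is a minimal sketch, assuming five equal-width buckets over the 1 to 25 range; the bucket boundaries and the `score_to_class` helper are illustrative choices, not the treebank's official mapping.

```python
def score_to_class(score, low=1, high=25):
    """Bucket a sentiment score into five ordered classes of equal width."""
    if not low <= score <= high:
        raise ValueError(f"score must be between {low} and {high}")
    labels = ["very negative", "negative", "neutral", "positive", "very positive"]
    normalized = (score - low) / (high - low)  # rescale to the range [0, 1]
    # The top score normalizes to exactly 1.0, so clamp it into the last bucket.
    index = min(int(normalized * len(labels)), len(labels) - 1)
    return labels[index]

for example_score in (1, 6, 13, 20, 25):
    print(example_score, score_to_class(example_score))
```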

Financial Phrasebank

The Financial Phrasebank dataset contains almost 5,000 English sentences from financial news, each classified by emotional tone as positive, negative, or neutral. All the data is annotated by researchers knowledgeable in the finance domain.6

Figure 2. Examples of the sentences from financial news and the corresponding sentiment label class

Source: HuggingFace7

Webis-CLS-10 Dataset

The Webis cross-lingual sentiment dataset includes 800,000 Amazon product reviews in English, German, French, and Japanese.8 Its multilingual nature allows for reaching more audiences and conducting comprehensive analyses.

CMU Multimodal Opinion Sentiment and Emotion Intensity 

Customer sentiment regarding services or products appears not only in text but also in video and audio. The CMU-MOSEI dataset includes multimodal data extracted from YouTube videos, such as the sentences spoken and the tone of voice used.9

Figure 3. The word cloud of the topics mentioned in the videos

Source: Carnegie Mellon University10  

Yelp Polarity Reviews

This open-source dataset includes more than 500,000 training samples consisting of consumer reviews, ratings, and recommendations.11 Each review carries a polarity label (positive or negative) derived from its star rating, and keywords can be extracted from the review text.

WordStat Sentiment Dictionary

The WordStat Sentiment Dictionary classifies words as negative or positive and combines three dictionaries: the Harvard IV Dictionary, the Regressive Imagery Dictionary, and the Linguistic Inquiry and Word Count (LIWC) dictionary.12 Combining different dictionaries allows synonyms and word patterns to be identified automatically.
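To make the dictionary approach concrete, here is a minimal sketch of lexicon-based scoring. The two word sets are tiny hypothetical stand-ins, not entries taken from any of the dictionaries named above, and the scoring formula is one simple choice among many.

```python
import re

# Tiny hypothetical lexicons for illustration; real dictionaries hold thousands of entries.
POSITIVE = {"good", "great", "excellent", "friendly", "clean"}
NEGATIVE = {"bad", "poor", "terrible", "dirty", "rude"}

def lexicon_score(text):
    """Return (positive hits - negative hits) / total hits, or 0.0 when nothing matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def classify(text):
    """Turn the lexicon score into a positive, negative, or neutral label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ("Great room, friendly staff",
                 "Terrible service and rude staff",
                 "The room was dirty but the staff were friendly"):
    print(classify(sentence), "|", sentence)
```

A plain word count like this ignores negation ("not good") and context, which is one reason lexicons are often combined with trained models.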

Sentiment Lexicons For 81 Languages

Although English is the most spoken language globally, it is also crucial to analyze the sentiment of speakers of other languages.13 This dataset covers 81 languages, such as Chinese, Spanish, and German, so it offers lexicon data across many languages and represents worldwide sentiment better.14

Social Media Sentiment

The Social Media Sentiments Analysis Dataset captures diverse emotions and interactions across social media platforms, including text, timestamps, hashtags, and engagement metrics.15 It is a valuable resource for sentiment analysis applications, offering insights into positive and negative words and overall sentiment scores from user-generated content worldwide.

Hotel Reviews

This sentiment analysis dataset contains 1,000 hotels and their reviews, including details like location, rating, and review text.16 It can be used for aspect-based sentiment analysis, market research, and social media monitoring, allowing for the correlation of positive and negative words with review ratings and the computation of sentiment confidence scores.

Training your algorithm on a suitable dataset is essential in sentiment analysis, so working with reliable sources matters.

You can also check our data-driven list of sentiment analysis services. 

FAQs

How to get data for sentiment analysis?

To get data for sentiment analysis, you can use existing sentiment analysis datasets such as the IMDB movie reviews dataset or paper review data. You may also consider social media platforms like Twitter, where you can use tweet text, Twitter user IDs, and retweet counts.
Additionally, datasets containing both positive and negative reviews, along with sentiment lexicons in multiple languages (e.g., sentiment lexicons for 81 languages), can strengthen sentiment analysis based on machine learning and natural language processing methods. For academic purposes, data from computing and informatics conferences can provide valuable insights.

