How to get data for sentiment analysis?

To get data for sentiment analysis, you can start from existing sentiment analysis datasets, such as the IMDB movie reviews dataset or paper reviews data. Social media platforms like Twitter are another source, where you can collect tweet text, Twitter user IDs, and retweet counts. Datasets containing both positive and negative reviews, along with sentiment lexicons in multiple languages (e.g., sentiment lexicons for 81 languages), support sentiment analysis based on machine learning techniques and natural language processing methods. For academic purposes, review data from computing and informatics conferences can also be used.


Sentiment Analysis Datasets

Cem Dilmegani

with Ezgi Arslan, PhD.

updated on Jul 22, 2025


Sentiment analysis is a great way to understand customers’ feelings toward a company and to see whether they are associated with sales, investments, or agreements. Reliable sentiment analysis depends on many factors, and one of its building blocks is the dataset used to train the models. However, finding the right dataset is easier said than done.

See the top sentiment analysis datasets to train your algorithms for more efficient and accurate sentiment analysis:

Sentiment analysis datasets worth knowing

Although the quantity of data is crucial, its quality and relevance are also essential for reliable results. For instance, if a retail company trains a customer sentiment analysis model on a dataset full of financial jargon, the algorithm may not provide reliable results, because the financial language the model learned may not be a good fit for messages from retail customers.
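One rough way to check this kind of domain mismatch before training is to measure how many words in your own texts never appear in the candidate dataset. The sketch below is a minimal illustration under stated assumptions: the tokenizer is deliberately crude, the function names (`tokenize`, `oov_rate`) are hypothetical, and the sample sentences are made up rather than taken from any dataset in this list.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def oov_rate(train_texts: list[str], target_texts: list[str]) -> float:
    """Return the share of target-domain tokens that never occur in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)


# Hypothetical toy samples, not taken from any real dataset.
financial_train = [
    "Quarterly earnings beat analyst estimates",
    "The bond yield fell sharply",
]
retail_target = [
    "The shoes fell apart after a week",
    "Fast delivery and great fit",
]

rate = oov_rate(financial_train, retail_target)
print(f"Out-of-vocabulary rate: {rate:.2f}")
```

A high rate suggests the dataset's vocabulary differs substantially from your domain. It is only a coarse signal, since shared words can still carry different sentiment in different contexts.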

So, having the right training dataset is crucial for evaluating reviews and for developing new strategies from the insights you gather. Here we list the top sentiment analysis datasets to help you train your algorithm to obtain better results.

TweetEval

TweetEval is a comprehensive benchmark designed for multi-class tweet classification, encompassing seven diverse tasks on Twitter data. These tasks include emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection.

The datasets are standardized with fixed training, validation, and test splits, offering a unified framework to evaluate models like RoBERTa re-trained on Twitter data.¹

MPQA Opinion Corpus

The MPQA Opinion Corpus is a rich dataset comprising news articles and other text documents annotated for opinions, sentiments, beliefs, emotions, and speculations. Its latest version, MPQA 3.0, introduced entity/event-level target (eTarget) annotations, which enable precise identification of attitudes toward specific entities and events, unlike the previous span-based target (sTarget) annotations.

This corpus, with 70 documents, is instrumental for training sentiment analysis systems to recognize nuanced subjectivities in text.²

Amazon Review Data

This dataset contains product metadata (e.g., color, category, size, and images) and more than 230 million customer reviews from 1996 to 2018.³ Each review comes with a star rating, which is commonly mapped to a positive, negative, or neutral sentiment label.
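When a review dataset provides star ratings rather than ready-made sentiment labels, one common way to derive positive, negative, and neutral classes is to threshold the rating. The sketch below assumes a 1 to 5 star scale and a 1-2 / 3 / 4-5 split. The thresholds, the function name `rating_to_sentiment`, and the record field names are illustrative assumptions, not part of the dataset's schema.

```python
def rating_to_sentiment(stars: float) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    Assumed convention: below 3 stars is negative, 3 is neutral, above 3 is positive.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating between 1 and 5, got {stars}")
    if stars < 3:
        return "negative"
    if stars > 3:
        return "positive"
    return "neutral"


# Hypothetical records; the field names are illustrative only.
reviews = [
    {"text": "Stopped working after two days.", "stars": 1},
    {"text": "Does the job, nothing special.", "stars": 3},
    {"text": "Exactly what I needed.", "stars": 5},
]

labels = [rating_to_sentiment(r["stars"]) for r in reviews]
print(labels)
```

Whether 3-star reviews should count as neutral, or be dropped altogether, is a design choice that depends on the use case.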

Stanford Sentiment Treebank

Most sentiment analysis tools score each word separately without considering the sentence as a whole. The Stanford Sentiment Treebank instead assigns sentiment scores to the phrases that make up each sentence, as well as to the sentence itself. It contains almost 10,000 movie reviews with sentiment scores ranging from 1 to 25,⁴ where 1 represents the most negative reviews and 25 the most positive ones.

Figure 1. An example of a movie review and the sentiment score of each aggregate

Source: The Stanford NLP Group⁵
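For classification, a fine-grained score like this is usually collapsed into a handful of classes. The following sketch buckets a 1 to 25 score into five equal-width classes. The equal-width cutoffs, the class names, and the function name `score_to_class` are assumptions made for illustration and may not match the exact boundaries used by the treebank's authors.

```python
CLASSES = ["very negative", "negative", "neutral", "positive", "very positive"]


def score_to_class(score: int, lo: int = 1, hi: int = 25) -> str:
    """Bucket a score on the lo..hi scale into five equal-width sentiment classes."""
    if not lo <= score <= hi:
        raise ValueError(f"expected a score between {lo} and {hi}, got {score}")
    normalized = (score - lo) / (hi - lo)  # rescale to the range 0.0 .. 1.0
    # The top score normalizes to exactly 1.0, so clamp it into the last bucket.
    index = min(int(normalized * len(CLASSES)), len(CLASSES) - 1)
    return CLASSES[index]


for s in (1, 6, 13, 20, 25):
    print(s, score_to_class(s))
```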

Financial Phrasebank

The Financial PhraseBank dataset contains almost 5,000 English sentences from financial news, each classified by emotional tone as positive, negative, or neutral. All the data is annotated by researchers knowledgeable in the finance domain.⁶

Figure 2. Examples of the sentences from financial news and the corresponding sentiment label class

Source: HuggingFace⁷

Webis-CLS-10 Dataset

The Webis cross-lingual sentiment dataset includes 800,000 Amazon product reviews in English, German, French, and Japanese.⁸ Its multilingual nature supports cross-lingual analyses and models that reach a wider audience.

CMU Multimodal Opinion Sentiment and Emotion Intensity

Customers’ sentiments regarding services or products appear not only in text but also in video and audio. The CMU-MOSEI dataset includes multimodal data extracted from YouTube videos, such as the spoken sentences and the tone of voice used.⁹

Figure 3. The word cloud of the topics mentioned in the videos


Source: Carnegie Mellon University¹⁰

Yelp Polarity Reviews

This open-source dataset includes more than 500,000 training samples drawn from consumer reviews, ratings, and recommendations.¹¹ Each review carries a polarity label (positive or negative), and the review text can be mined for the keywords you are interested in.

WordStat Sentiment Dictionary

The WordStat Sentiment Dictionary classifies words as negative or positive and combines three dictionaries: the Harvard IV Dictionary, the Regressive Imagery Dictionary, and the Linguistic Inquiry and Word Count (LIWC) dictionary.¹² Combining different dictionaries allows synonyms and word patterns to be identified automatically.
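As a rough illustration of how a combined sentiment dictionary can be used, the sketch below merges several word lists and scores a text by counting positive and negative hits. The three tiny dictionaries are hypothetical stand-ins, not excerpts from the dictionaries named above, and the function names (`merge_lexicons`, `lexicon_score`) are made up for this example. Real dictionary-based tools typically also handle negation and word patterns, which this sketch ignores.

```python
import re

# Tiny hypothetical word lists standing in for three separate dictionaries.
DICT_A = {"positive": {"good", "great"}, "negative": {"bad"}}
DICT_B = {"positive": {"excellent"}, "negative": {"poor", "bad"}}
DICT_C = {"positive": {"love"}, "negative": {"terrible"}}


def merge_lexicons(*lexicons: dict[str, set[str]]) -> dict[str, set[str]]:
    """Union the positive and negative word lists of several lexicons."""
    merged: dict[str, set[str]] = {"positive": set(), "negative": set()}
    for lexicon in lexicons:
        for polarity in merged:
            merged[polarity] |= lexicon.get(polarity, set())
    # Words listed under both polarities are ambiguous, so drop them from both.
    ambiguous = merged["positive"] & merged["negative"]
    merged["positive"] -= ambiguous
    merged["negative"] -= ambiguous
    return merged


def lexicon_score(text: str, lexicon: dict[str, set[str]]) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(tok in lexicon["positive"] for tok in tokens)
    negative_hits = sum(tok in lexicon["negative"] for tok in tokens)
    return positive_hits - negative_hits


LEXICON = merge_lexicons(DICT_A, DICT_B, DICT_C)
print(lexicon_score("Great room but terrible, poor service", LEXICON))
```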

Sentiment Lexicons For 81 Languages

Although English is the most spoken language globally, it is also crucial to analyze the sentiment of speakers of other languages.¹³ This dataset includes 81 languages such as Chinese, Spanish, and German, so it offers a variety of data from different languages and represents the worldwide sentiment better.¹⁴

Social Media Sentiments Analysis Dataset

The Social Media Sentiments Analysis Dataset captures diverse emotions and interactions across social media platforms, including text, timestamps, hashtags, and engagement metrics.¹⁵ It is a valuable resource for sentiment analysis applications, offering insights into positive and negative words and overall sentiment scores from user-generated content worldwide.

Hotel Reviews

This sentiment analysis dataset contains 1,000 hotels and their reviews, including details like location, rating, and review text.¹⁶ It can be used for aspect-based sentiment analysis, market research, and social media monitoring, allowing for the correlation of positive and negative words with review ratings and the computation of sentiment confidence scores.
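To make the idea of correlating sentiment words with review ratings concrete, here is a minimal sketch that computes the Pearson correlation between a per-review net sentiment word count and its rating. The numbers are hypothetical and the helper name `pearson` is an assumption for this example. With real data, the net counts would come from whatever lexicon or model you apply to the review text.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: net (positive minus negative) word count and star rating per review.
net_word_counts = [-3, -1, 0, 2, 4]
ratings = [1, 2, 3, 4, 5]

r = pearson(net_word_counts, ratings)
print(f"Correlation between net sentiment words and rating: {r:.3f}")
```

A value close to 1 would indicate that reviews with more positive than negative words tend to have higher ratings. A weak correlation may suggest that the lexicon does not fit the domain.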

Training your algorithm on a well-suited dataset is essential in sentiment analysis, so working with reliable sources matters.

You can also check our data-driven list of sentiment analysis services.


Reference Links

Cardiff NLP · GitHub

Arguing Corpus | MPQA

Amazon review data. UCSD CSE Research Project

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

takala/financial_phrasebank · Datasets at Hugging Face

Webis Data Webis-CLS-10. The Web Technology & Information Systems Network

CMU-MOSEI Dataset | MultiComp. MultiComp Lab

10. CMU-MOSEI Dataset | MultiComp. MultiComp Lab

11. Open Dataset | Yelp Data Licensing. Yelp for Business

12. Wordstat. Provalis Research. Accessed: 25/July/2024

13. Top 40 Most Spoken Languages in the World 2023. EduDwar

14. Sentiment Lexicons for 81 Languages | Kaggle

15. Social Media Sentiments Analysis Dataset 📊 | Kaggle

16. data.world
