AIMultiple Research

Top 8 Sentiment Analysis Datasets in 2024

Updated on Apr 4
Written by
Cem Dilmegani

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per Similarweb), including 60% of the Fortune 500, every month.

Cem's work focuses on how enterprises can leverage new technologies in AI, automation, cybersecurity (including network security and application security), data collection (including web data collection), and process intelligence.

Sentiment analysis is a great way to understand customers' feelings toward a company and to see whether those feelings are associated with sales, investments, or agreements. Reliable sentiment analysis depends on many factors, and one of its building blocks is the dataset used to train the models. However, finding the right dataset is easier said than done.

This article highlights the top sentiment analysis datasets to train your algorithms for more efficient and accurate sentiment analysis.

Although the quantity of data is crucial, its quality and relevance are just as essential for reliable results. For instance, if a retail company trains a customer sentiment analysis model on a dataset full of financial jargon, the model may not provide reliable results, because the words it has learned to evaluate come from a financial context rather than from customer reviews.
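One rough way to check for this kind of domain mismatch is to measure how much of the target domain's vocabulary never appears in the training data. The sketch below is only an illustration: the function names and the tiny sample texts are hypothetical, and a real check would use a proper tokenizer and far larger samples.

```python
import re


def vocabulary(texts):
    """Return the set of lowercase word tokens found in a list of texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z']+", text.lower()))
    return words


def out_of_vocabulary_rate(training_texts, target_texts):
    """Fraction of distinct target-domain words never seen in the training data."""
    train_vocab = vocabulary(training_texts)
    target_vocab = vocabulary(target_texts)
    if not target_vocab:
        return 0.0
    unseen = target_vocab - train_vocab
    return len(unseen) / len(target_vocab)


# Hypothetical samples: financial training text vs. retail reviews.
financial_training = [
    "Quarterly earnings beat analyst forecasts",
    "The stock fell after the profit warning",
]
retail_reviews = [
    "The shoes fell apart after a week",
    "Fast delivery and the jacket fits great",
]

rate = out_of_vocabulary_rate(financial_training, retail_reviews)
print(f"{rate:.0%} of the retail vocabulary is unseen in the training data")
```

A high rate suggests the training dataset comes from a different domain than the reviews you want to analyze.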

So, having the right training dataset is crucial in evaluating the reviews, as you can develop new strategies with the insights you gather. Here we list the top eight sentiment analysis datasets to help you train your algorithm to obtain better results. 

1. Amazon Review Data

This dataset contains product metadata (e.g., color, category, size, and images) and more than 230 million customer reviews from 1996 to 2018. Each review carries a star rating, which is commonly mapped to a positive, negative, or neutral sentiment label.
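If you work with review data that carries star ratings, sentiment labels are often derived from the rating. The sketch below assumes a 1 to 5 star scale and uses the hypothetical field names `rating` and `text`; the thresholds are a common convention, not something prescribed by the dataset.

```python
def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a coarse sentiment label (assumed thresholds)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    if stars < 3:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Hypothetical records; the field names are illustrative only.
reviews = [
    {"rating": 5, "text": "Exactly what I needed."},
    {"rating": 3, "text": "It works, nothing special."},
    {"rating": 1, "text": "Stopped working after two days."},
]

labeled = [(r["text"], rating_to_sentiment(r["rating"])) for r in reviews]
for text, label in labeled:
    print(f"{label:>8}: {text}")
```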

Sponsored

Clickworker is a crowdsourced data collection expert and can provide the data required to fuel an open-source sentiment analysis tool. It works with 4 million registered data collectors worldwide who have proficiency in 30 languages and cover over 70 target markets. 

They can help your company with sentiment analysis services using a pre-determined training dataset to understand your customers better.

2. Stanford Sentiment Treebank

Most sentiment analysis tools categorize sentences by giving a sentiment score to each word without considering the sentence as a whole. This dataset instead scores the phrases that make up each sentence. Here, you can find almost 10,000 movie reviews with sentiment scores ranging from 1 to 25, where 1 represents the most negative reviews and 25 the most positive ones.

Figure 1. An example of a movie review and the sentiment score of each aggregate

Source: The Stanford NLP Group
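Fine-grained scores like these are often collapsed into a handful of classes before training a classifier. The following sketch assumes integer scores from 1 to 25 and splits them into five equal-width bins; the bin boundaries and label names are illustrative assumptions, not the dataset's official mapping.

```python
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def score_to_class(score):
    """Bucket an integer score in 1..25 into one of five equal-width classes."""
    if not 1 <= score <= 25:
        raise ValueError(f"score out of range: {score}")
    # Scores 1-5 fall in bin 0, 6-10 in bin 1, and so on up to 21-25 in bin 4.
    return LABELS[(score - 1) // 5]


for score in (1, 8, 13, 19, 25):
    print(score, "->", score_to_class(score))
```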

3. Financial Phrasebank

The Financial PhraseBank dataset contains almost 5,000 English sentences from financial news, each classified by its emotional tone as positive, negative, or neutral. All the data is annotated by researchers knowledgeable in the finance domain.

Figure 2. Examples of the sentences from financial news and the corresponding sentiment label class

Source: HuggingFace

4. Webis-CLS-10 Dataset

The Webis cross-lingual sentiment dataset includes 800,000 Amazon product reviews in English, German, French, and Japanese. Its multilingual nature allows for reaching more audiences and conducting comprehensive analyses.

5. CMU Multimodal Opinion Sentiment and Emotion Intensity 

Customers' sentiments regarding services or products are expressed not only in text; they can also be detected in video and audio. The CMU dataset includes multimodal data extracted from YouTube videos, such as the sentences spoken and the tone of voice used.

Figure 3. The word cloud of the topics mentioned in the videos

Source: Carnegie Mellon University

6. Yelp Polarity Reviews

This open-source dataset includes more than 500,000 training samples consisting of consumer reviews, ratings, and recommendations. Each review is assigned a polarity label, and keywords of interest can be extracted from the text.

7. WordStat Sentiment Dictionary

The WordStat Sentiment Dictionary classifies sentiments as negative or positive and combines three dictionaries: the Harvard IV Dictionary, the Regressive Imagery Dictionary, and the Linguistic Inquiry and Word Count (LIWC) dictionary. The combination of different dictionaries allows for identifying synonyms and word patterns automatically.
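To illustrate how a dictionary-based approach works in general, here is a minimal sketch that counts positive and negative word matches. The tiny word lists are hypothetical stand-ins, not entries from the WordStat dictionary, and real sentiment dictionaries also handle negation and word patterns, which this sketch does not.

```python
import re

# Hypothetical miniature lexicon for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}


def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no matches, or positive and negative hits cancel out


for sentence in (
    "Great phone, I love it",
    "Terrible battery and a broken screen",
    "It arrived on Tuesday",
):
    print(lexicon_sentiment(sentence), "<-", sentence)
```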

8. Sentiment Lexicons For 81 Languages

Although English is the most spoken language globally, it is also crucial to analyze the sentiment of speakers of other languages. This dataset includes sentiment lexicons for 81 languages, such as Chinese, Spanish, and German, so it offers data from a variety of languages and represents worldwide sentiment better.

Training your algorithm on a suitable dataset is essential in sentiment analysis, so working with reliable sources matters.

You can also check our data-driven list of sentiment analysis services. 

This article was originally written by former AIMultiple industry analyst Begüm Yılmaz and reviewed by Cem Dilmegani.

Cem Dilmegani
Principal Analyst

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and media that referenced AIMultiple.

Cem's hands-on enterprise software experience contributes to the insights that he generates. He oversees AIMultiple benchmarks in dynamic application security testing (DAST), data loss prevention (DLP), email marketing and web data collection. Other AIMultiple industry analysts and tech team support Cem in designing, running and evaluating benchmarks.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led the technology strategy and procurement of a telco while reporting to the CEO. He has also led the commercial growth of the deep tech company Hypatos, which went from zero to a seven-digit annual recurring revenue and a nine-digit valuation within two years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.

Sources:

AIMultiple.com Traffic Analytics, Ranking & Audience, Similarweb.
Why Microsoft, IBM, and Google Are Ramping up Efforts on AI Ethics, Business Insider.
Microsoft invests $1 billion in OpenAI to pursue artificial intelligence that’s smarter than we are, Washington Post.
Data management barriers to AI success, Deloitte.
Empowering AI Leadership: AI C-Suite Toolkit, World Economic Forum.
Science, Research and Innovation Performance of the EU, European Commission.
Public-sector digitization: The trillion-dollar challenge, McKinsey & Company.
Hypatos gets $11.8M for a deep learning approach to document processing, TechCrunch.
We got an exclusive look at the pitch deck AI startup Hypatos used to raise $11 million, Business Insider.
