
Sentiment Analysis Benchmark Testing: ChatGPT, Claude & DeepSeek

Ezgi Arslan, PhD.
updated on Oct 2, 2025

Precisely labeling emotions and sentiments, and detecting irony, hatefulness, and offensiveness, remains a challenge for large language models, requiring further testing and refinement. We benchmarked eight LLMs (Claude 3.5, Claude 3.7, Claude 4.5, ChatGPT 4.o, ChatGPT 4.5, ChatGPT 5.o, DeepSeek V3, and Grok 4) across five key sentiment-related tasks.

The results highlight clear distinctions between the tools:

  • Claude 3.7 achieved the best overall accuracy (79%).
  • ChatGPT 4.5 and DeepSeek V3 (70%) recorded the lowest overall performance.

Experimental results: sentiment analysis benchmark

[Chart: models ranked by average accuracy across the five tasks]

Ranking: Tools are ranked according to their average accuracy rates aggregated across all tested categories: emotion, hatefulness, irony, offensiveness, and sentiment.

For further details, read the methodology of our benchmark.
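To make the aggregation concrete, here is a minimal sketch of how the ranking can be reproduced: an unweighted mean of the per-category accuracies. The scores below are the approximate percentages reported in the per-task sections of this article, so treat the resulting averages as illustrative rather than exact.

```python
# Minimal sketch: reproduce the overall ranking as the unweighted mean of
# per-category accuracy. Scores are the approximate percentages reported
# in the per-task sections of this article (emotion, hatefulness, irony,
# offensiveness, sentiment), so small rounding differences are expected.
per_task_accuracy = {
    "Claude 3.7":  [80, 78, 96, 77, 68],
    "Claude 4.5":  [82, 50, 95, 81, 69],
    "ChatGPT 5.o": [80, 54, 93, 82, 67],
    "ChatGPT 4.o": [72, 64, 98, 76, 64],
    "Claude 3.5":  [77.5, 67.5, 97, 67, 68],
    "Grok 4":      [80, 65, 83, 67, 60],
    "DeepSeek V3": [76, 52, 92, 69, 64],
    "ChatGPT 4.5": [80, 57, 87, 75, 54],
}

ranking = sorted(
    ((name, sum(s) / len(s)) for name, s in per_task_accuracy.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, avg in ranking:
    print(f"{name}: {avg:.1f}%")  # Claude 3.7 comes out on top at ~79.8%
```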

Overall accuracy

Combining all tasks, the models’ total accuracy scores provide a holistic view of their capabilities:

  • Claude 3.7 achieved the best overall average, nearly 80% across the 5 categories, leading in hatefulness detection and ranking near the top in the rest.
  • Claude 3.5‘s performance ranged between 67% and 97%, with notable improvements in the lower-volume tests.
  • ChatGPT 5.o Auto reached an overall average of 75%, positioning itself as a balanced performer across all categories.
  • Claude 4.5 also achieved an overall accuracy of 75%. It showed strength in emotion, irony, and offensiveness detection but underperformed in hatefulness classification, which lowered its average.
  • ChatGPT 4.o, with labeling accuracy ranging between 64% and 98%, was more successful than any other tool at irony detection.
  • Grok 4 reached an overall accuracy of 71%. While it performed well in emotion detection, its limitations in irony, offensiveness, and sentiment classification reduced its competitiveness.
  • DeepSeek V3‘s accuracy in detecting different emotions/sentiments ranged between 52% and ~92%.
  • ChatGPT 4.5 recorded the weakest overall performance in our sample, averaging 70%, tied with DeepSeek V3.

1. Emotion detection

Emotion detection is a challenging task in sentiment analysis, often requiring models to discern subtle cues in language. Here’s how the models performed:

  • ChatGPT 4.o achieved 72% accuracy when analyzing 50 statements.
  • ChatGPT 4.5 reached a success rate of ~80% when analyzing 50 statements, on par with Claude 3.7.
  • ChatGPT 5.o Auto also scored 80%, matching Claude 3.7 and ChatGPT 4.5.
  • Claude 3.5 scored 77.5%.
  • Claude 3.7 reached ~80% accuracy in emotion detection when analyzing 50 statements.
  • Claude 4.5 slightly outperformed all others in this task, with the top score of 82%.
  • DeepSeek V3 analyzed the 50 given statements with an accuracy of ~76%.
  • Grok 4 demonstrated strong performance, achieving 80% accuracy in emotion detection.

2. Hatefulness detection

Detecting hateful content is crucial for moderating Twitter and other online platforms. The results revealed notable differences:

  • ChatGPT 4.o exhibited an accuracy of 64%.
  • ChatGPT 4.5 reached ~57% accuracy in hatefulness detection in our sample.
  • ChatGPT 5.o Auto showed limited success in this task, with 54% accuracy.
  • Claude 3.5 achieved a success rate of 67.5% in hatefulness detection.
  • Claude 3.7 detected hateful statements with the highest accuracy of all tools, 78%.
  • Claude 4.5 recorded the weakest result among all models, with a 50% accuracy rate in detecting hateful content.
  • DeepSeek V3 scored 52%, the second-lowest result in this category.
  • Grok 4 scored moderately well at 65%.

3. Irony detection

Irony detection is an area where semantic evaluation plays a pivotal role. Most models delivered high accuracy on this task, but GPT-4o emerged as the clear leader:

  • ChatGPT 4.o maintained an exceptional 98% accuracy in identifying ironic expressions. This success can be attributed to its ability to interpret negative polarity within complex text classification scenarios.
  • ChatGPT 4.5, with a success rate of 87%, was among the weaker performers on irony in this comparison.
  • ChatGPT 5.o Auto demonstrated a solid ability to detect irony, achieving 93% accuracy.
  • Claude 3.5 scored slightly lower than ChatGPT 4.o, achieving 97% accuracy with 50 statements.
  • Claude 3.7 detected irony with an accuracy of ~96% for the given text.
  • Claude 4.5 delivered one of the highest performances in irony detection, with an accuracy rate of 95%.
  • DeepSeek V3 achieved a success rate of ~92% in irony detection for the given tweets.
  • Grok 4 fell behind in this area, scoring 83%, the lowest of all models tested.

Given the models’ overall high accuracy, all are well-suited to Twitter messages involving ironic or sarcastic content. However, GPT-4o’s near-perfect score gives it a significant advantage for applications that demand reliable irony detection.

4. Offensiveness detection

Detecting offensive content is critical for maintaining healthy online communities. The models’ sentiment analysis benchmark performances in this task were as follows:

  • ChatGPT 4.o scored 76% on the 50-statement sample, adapting well to variations in data volume.
  • ChatGPT 4.5 achieved a success rate of ~75% in offensiveness detection for the given tweets.
  • ChatGPT 5.o Auto achieved the highest success rate across all tools for offensiveness detection, with an accuracy of 82%.
  • Claude 3.5 presented one of the weakest results, with a success rate of ~67% on 50 statements, tied with Grok 4.
  • Claude 3.7 detected offensiveness with a success rate of ~77%.
  • Claude 4.5 detected offensiveness with 81% accuracy, reinforcing its strength in this task.
  • DeepSeek V3 detected offensive statements with an accuracy of 69%.
  • Grok 4 achieved a modest 67%, ranking among the weaker performers in this category.

These results underscore the importance of context and training in designing models for offensive language detection, where patterns in the dataset can significantly impact outcomes.

5. Sentiment analysis

The overarching sentiment analysis task focused on classifying data into positive, negative, and neutral sentiments. Accuracy scores for this task varied significantly between the models:

  • ChatGPT 4.o scored a 64% success rate.
  • ChatGPT 4.5 presented the lowest accuracy in Twitter sentiment classification, with a success rate of under 54%.
  • ChatGPT 5.o Auto scored 67% in general sentiment classification, placing it in the mid-range compared to other tools.
  • Claude 3.5 showed better performance at 50 statements, with an accuracy of 68%.
  • Claude 3.7, with a ~68% success rate, performed on par with Claude 3.5 in sentiment detection.
  • Claude 4.5 achieved the highest performance with a 69% accuracy rate.
  • DeepSeek V3 scored a 64% accuracy rate in detecting positive, negative, and neutral sentiments.
  • Grok 4 showed low performance, with only 60% accuracy.

None of the models demonstrated strong competence in sentiment classification, with success rates ranging from ~54% to 69%.
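Scoring this three-way task requires mapping each model’s free-text answer onto the dataset’s label ids. The sketch below is a hypothetical helper, not the benchmark’s actual harness; the label mapping (0 = negative, 1 = neutral, 2 = positive) follows the public TweetEval sentiment subset.

```python
# Hypothetical helper for scoring the three-way sentiment task: map a
# model's free-text answer onto TweetEval's sentiment label ids
# (0 = negative, 1 = neutral, 2 = positive).
LABELS = {"negative": 0, "neutral": 1, "positive": 2}

def normalize(model_output: str) -> int | None:
    """Return the first TweetEval sentiment label mentioned in the output."""
    text = model_output.lower()
    for word, label_id in LABELS.items():
        if word in text:
            return label_id
    return None  # unparseable answers can simply be counted as errors

predictions = [normalize(o) for o in ["Positive.", "This tweet is neutral", "negative"]]
assert predictions == [2, 1, 0]
```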

Observations and insights

Impact of input volume

Several models showed improved performance with smaller input volumes in some tasks, suggesting that reducing noise in the input batch matters for tasks like hatefulness detection and sentiment classification.

[Charts: per-task accuracy at 50-tweet vs. 10-tweet input volumes]

Task-specific strengths

GPT-4o dominated irony detection, while Claude 3.7 delivered the strongest average across all tasks. Claude 3.5, slightly less consistent, performed well in tasks like emotion detection, especially with larger input volumes.

Broader implications

These experimental results validate the effectiveness of using benchmark datasets like TweetEval for text classification research. The findings can guide the research community in selecting the right model for a specific use case, whether it involves detecting nuanced sentiment intensity or analyzing negative polarity in Twitter messages.

Benchmark dataset and methodology

Analysis dataset

The TweetEval dataset was selected due to its relevance for sentiment analysis techniques applied to real-world Twitter messages.1 The dataset originates from work published at Association for Computational Linguistics (ACL) venues and is widely used in semantic evaluation and text classification tasks. It consists of pre-labeled training data and test sets covering several dimensions of sentiment and contextual understanding:

  • Emotion detection: Identifying emotional tones such as anger, joy, optimism, or sadness in tweets.

Example tweet and label: The tweet “#Deppression is real. Partners w/ #depressed people truly dont understand the depth in which they affect us. Add in #anxiety &makes it worse” is labeled as sad.2

  • Hatefulness detection: Evaluating the presence of hate speech in given tweets.

Example tweet and label: The tweet “Trump wants to deport illegal aliens with ‘no judges or court cases’ #MeTooI am solidly behind this actionThe thought of someone illegally entering a country & showing no respect for its laws,should be protected by same laws is ludacris!#DeportThemAll” is labeled as hateful.3

  • Irony detection: Recognizing ironic intent in textual content.

Example tweet and label: The tweet “People who tell people with anxiety to “just stop worrying about it” are my favorite kind of people #not #educateyourself” is labeled as irony.4

  • Offensiveness detection: Classifying tweets with offensive language.

Example tweet and label: The tweet “#ConstitutionDay It’s very odd for the alt right conservatives to say that we are ruining the constitution just because we want #GunControlNow but they are the ones ruining the constitution getting upset because foreigners are coming to this land who are not White wanting to live” is labeled as offensive.5

  • Sentiment classification: Assigning positive, negative, or neutral labels to tweets.

Example tweet and label: The tweet “Can’t wait to try this – Google Earth VR – this stuff really is the future of exploration….” is labeled as positive.6

These tasks align with real-world machine-learning applications, making them well-suited for evaluating the eight models.
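For readers who want to inspect the data themselves, here is a minimal sketch of loading the five TweetEval subsets via the Hugging Face datasets library; the subset names match the public tweet_eval configurations.

```python
# Minimal sketch: load the five TweetEval subsets used in this benchmark
# from the Hugging Face Hub (pip install datasets).
from datasets import load_dataset

TASKS = ["emotion", "hate", "irony", "offensive", "sentiment"]

for task in TASKS:
    ds = load_dataset("tweet_eval", task, split="test")
    label_names = ds.features["label"].names  # human-readable label names
    print(f"{task}: {len(ds)} test tweets, labels = {label_names}")
```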

Analysis methodology

This benchmark compares eight state-of-the-art large language models (LLMs): Claude 3.5, Claude 3.7, Claude 4.5, ChatGPT 4.o, ChatGPT 4.5, ChatGPT 5.o, DeepSeek V3, and Grok 4.

Experimental setup

To ensure consistency and reliability in the experiments, the following methodology was employed:

Input volume

  • Two input volumes were tested: 50 tweets and 10 tweets per task.
  • This variation aimed to determine how input size impacts model performance, particularly in tasks like sentiment classification and hatefulness detection, where data volume can influence accuracy.

Task-specific evaluation

Each task from the TweetEval dataset was tested separately. The models’ outputs for each task were compared against the dataset’s labels, and accuracy scores were recorded.
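The article does not publish the exact prompts used, so the following batched prompt builder is an assumption, shown only to illustrate how 50 (or 10) tweets per task could be submitted in a single request:

```python
# Illustrative sketch only: the exact benchmark prompts are not published,
# so this batched format is an assumption. Tweets are sent in groups of
# 50 (or 10), and the model is asked to return one label per line.
def build_prompt(tweets: list[str], labels: list[str]) -> str:
    header = (
        "Classify each tweet below with exactly one of these labels: "
        + ", ".join(labels)
        + ". Reply with one label per line, in order.\n\n"
    )
    body = "\n".join(f"{i + 1}. {tweet}" for i, tweet in enumerate(tweets))
    return header + body

batch = ["Can't wait to try Google Earth VR!", "Worst commute of my life."]
print(build_prompt(batch, ["negative", "neutral", "positive"]))
```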

Metrics used

Accuracy scores were computed for each task to ensure reliable experimental results.
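Since accuracy here appears to be plain exact-match agreement with the gold labels, it reduces to a one-line computation; a minimal sketch:

```python
# Accuracy as used in this benchmark: the share of tweets whose predicted
# label exactly matches the TweetEval ground-truth label.
def accuracy(predicted: list[int], gold: list[int]) -> float:
    assert len(predicted) == len(gold), "prediction/gold length mismatch"
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy([2, 1, 0, 2], [2, 1, 1, 2]))  # 0.75
```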

Setup limitations

We used datasets whose ground truths are publicly available. This may have led to data contamination (i.e., the LLMs having been trained on the ground-truth labels). However, we assumed this was not the case, since accuracies were far from perfect. For the next version, we may consider using tweets whose ground truth has not been published.

Detailed overview of LLMs

All eight tools (ChatGPT 4.o, 4.5, and 5.o; Claude 3.5, 3.7, and 4.5; DeepSeek V3; and Grok 4) represent significant advancements in natural language processing (NLP), with applications spanning from sentiment analysis to conversational AI. These models are among the most widely recognized for their ability to interpret, process, and generate human-like text. Below is a detailed description of each model, highlighting its capabilities and relevance to sentiment classification and related machine-learning tasks.

ChatGPT 4.o

ChatGPT 4.o, developed by OpenAI, is an enhanced version of its predecessor, GPT-4, and features significant improvements in deep learning architecture and language understanding. The model is optimized for a wide range of NLP tasks, including sentiment analysis and aspect-based sentiment analysis.

Applications in sentiment analysis

ChatGPT 4.o is frequently used in the research community and industry for tasks such as:

  • Twitter messages sentiment analysis for social media monitoring.
  • Sentiment classification of customer feedback in e-commerce.
  • Emotion detection in mental health applications.
  • Aspect-based sentiment analysis for product reviews and surveys.

Limitations

Despite its strengths, ChatGPT 4.o can occasionally overfit to specific sentiment patterns, leading to reduced accuracy in highly domain-specific contexts.

ChatGPT 4.5

ChatGPT 4.5, a further development of OpenAI’s GPT series, offers solid performance across various sentiment analysis tasks. It demonstrates a good grasp of emotion categorization, but its performance in hatefulness detection and sentiment classification is relatively lower, which may limit its application in certain highly sensitive contexts.

Applications in sentiment analysis

ChatGPT 4.5 is often used in:

  • Moderation tools for detecting offensive language and hate speech.
  • Irony detection in online discussions and news commentary.
  • Social media sentiment analysis to gauge public opinion on various topics.
  • Customer feedback analysis for e-commerce platforms, with an emphasis on emotions.

Limitations

ChatGPT 4.5’s performance in sentiment analysis is hampered by its relatively lower accuracy in sentiment classification and hatefulness detection.

ChatGPT 5.o

ChatGPT 5.o represents the newest generation of OpenAI’s models, with improvements in contextual reasoning, nuance detection, and content moderation. While its average accuracy matches that of Claude 4.5 (75%), the model stands out for its exceptional performance in offensiveness detection (82%) and irony detection (93%).

Applications in sentiment analysis

ChatGPT 5.o is particularly effective for:

  • Offensiveness detection in online forums and social media platforms, where its accuracy surpasses all other tools.
  • Irony and sarcasm analysis, supporting researchers and businesses in understanding complex user-generated content.
  • Emotion recognition in customer service feedback, mental health monitoring, and social media sentiment tracking.
  • General sentiment classification in large-scale survey data, where balanced performance across categories is preferred.

Limitations

Despite its strengths, ChatGPT 5.o’s weaker results in hatefulness detection (54%) reduce its suitability for high-stakes moderation involving toxic or discriminatory language.

Claude 3.7

Claude 3.7 builds on the strengths of its predecessor, Claude 3.5, offering improvements in context understanding and sentiment accuracy. With a strong focus on safe and ethical AI practices, Claude 3.7 excels in detecting complex sentiment, including emotion, irony, and hateful speech, making it an ideal choice for applications requiring high levels of sensitivity and context.

Applications in sentiment analysis

Claude 3.7 is highly effective for tasks such as:

  • Emotion detection in customer feedback and mental health applications.
  • Hatefulness and offensiveness detection for online content moderation, ensuring safe spaces on digital platforms.
  • Sentiment classification in market research and business intelligence.

Limitations

While Claude 3.7 outperforms all models in key sentiment areas, its performance in highly domain-specific scenarios might still face challenges, especially with subtle forms of sentiment. Additionally, its accuracy in detecting sentiment related to more nuanced or minor contextual cues may require further refinement.

Claude 3.5

Claude 3.5, created by Anthropic, is an NLP model designed with a focus on safety, ethical behavior, and precise text generation. It is particularly well-suited for tasks requiring sensitivity to context and nuanced sentiment analysis techniques.

Applications in sentiment analysis

Claude 3.5 works well in scenarios such as:

  • Hatefulness detection for monitoring social media and online platforms.
  • Offensiveness detection in content moderation systems.
  • Customer service interactions, with an emphasis on sentiment classification to improve user experience.
  • Aspect-based sentiment analysis for identifying sentiment trends in business intelligence.

Limitations

While Claude 3.5 excels in ethical and contextual understanding, it sometimes underperforms in detecting highly subtle or implicit sentiments compared to its competitors. Additionally, its training dataset is less diverse than that of ChatGPT 4.o, which may result in reduced robustness across some benchmark datasets.

Claude 4.5

Claude 4.5 builds on Anthropic’s Claude series with enhancements in contextual sensitivity and interpretability. Averaging 75% across sentiment analysis tasks, Claude 4.5 achieved the highest accuracy in emotion detection (82%), strong performance in irony (95%) and offensiveness detection (81%), but fell short in hatefulness detection (50%), the lowest among all tested models.

Applications in sentiment analysis

Claude 4.5 is well-suited for:

  • Emotion detection in applications where subtle cues are critical, such as healthcare feedback or wellness apps.
  • Irony and sarcasm identification in social media monitoring and opinion mining, where nuanced interpretation is essential.
  • Offensiveness detection in content moderation, providing competitive results for building safer online spaces.
  • Sentiment classification in market research and brand analysis, benefiting from its balanced yet slightly stronger polarity detection (69%).

Limitations

Claude 4.5’s low accuracy in hatefulness detection (50%) significantly limits its utility in scenarios that involve harmful or toxic speech. Moreover, while it excels in certain categories, its performance is uneven across tasks, making it less reliable for projects requiring uniform accuracy across all sentiment dimensions.

DeepSeek V3

DeepSeek V3 offers solid results across a broad range of sentiment analysis tasks, but its overall accuracy lags behind other models, especially in hatefulness detection.

Applications in sentiment analysis

DeepSeek V3 is widely used for:

  • Emotion detection in mental health apps and customer sentiment tracking.
  • Irony detection in casual conversations, including social media platforms and user-generated content.
  • Basic sentiment classification for market research surveys and feedback forms.
  • Content moderation for filtering out offensive language in online forums.

Limitations

DeepSeek V3’s lower performance in detecting hateful content and its relatively weaker overall sentiment classification capabilities make it less suitable for high-stakes applications such as content moderation on sensitive platforms.

Grok 4

Grok 4 is a conversational AI model developed with a focus on humor, social interaction, and dynamic engagement. In our sentiment analysis benchmark, it achieved an average accuracy of 71%, placing it near the bottom of the tested models.

Applications in sentiment analysis

Grok can be applied to:

  • Emotion detection in interactive applications, where identifying tone and mood enhances user engagement.
  • Moderation tools, particularly for detecting hateful content at a moderate accuracy level (65%).
  • Lightweight irony detection in online discourse, though with limitations compared to more advanced models.
  • Exploratory sentiment analysis in creative or informal settings, where conversational flow is prioritized over high precision.

Limitations

Grok’s weakness in sentiment classification (60%) and lower irony detection accuracy (83%) restrict its use in high-precision research or commercial analytics. Its design emphasis on conversational responsiveness over benchmark accuracy makes it less suitable for tasks requiring consistent reliability in sentiment categorization.


Ezgi Arslan, PhD.
Industry Analyst
Ezgi holds a PhD in Business Administration with a specialization in finance and serves as an Industry Analyst at AIMultiple. She drives research and insights at the intersection of technology and business, with expertise spanning sustainability, survey and sentiment analysis, AI agent applications in finance, answer engine optimization, firewall management, and procurement technologies.
