With the growing demand for accurate sentiment classification, researchers have utilized various sentiment analysis models, techniques, and datasets to advance the field. However, achieving precise labeling of emotions and sentiments, as well as detecting irony, hatefulness, and offensiveness, remains a challenge, requiring further testing and refinement.
Explore the sentiment analysis benchmark performance of GPT-4o, GPT-4.5, Claude 3.5, Claude 3.7, and DeepSeek V3, along with the details of the experimental testing:
Experimental results: sentiment analysis benchmark
| Model | Total | Emotion | Hatefulness | Irony | Offensiveness | Sentiment |
|---|---|---|---|---|---|---|
| Claude 3.7 | 79% | 80% | 78% | 96% | 77% | 68% |
| Claude 3.5 | 75% | 78% | 68% | 97% | 67% | 68% |
| GPT-4o | 75% | 72% | 64% | 98% | 76% | 64% |
| DeepSeek V3 | 70% | 76% | 52% | 92% | 69% | 64% |
| GPT-4.5 | 70% | 80% | 57% | 87% | 75% | 54% |
Sentiment analysis benchmark performance trends
- Claude 3.7 outperformed all other tools in every category except irony detection. Its average accuracy across the five categories is nearly 80%.
- Claude 3.5's per-category accuracy ranged from 67% to 97%, with notable improvements in some lower-volume tests.
- GPT-4o, with labeling accuracy ranging from 64% to 98%, was more successful than any other tool at irony detection.
- DeepSeek V3's accuracy in detecting the different emotions/sentiments ranged from 52% to 92%.
- GPT-4.5 delivered the weakest performance in our sample, averaging 70% overall and scoring lowest in sentiment classification (54%).
Overall accuracy
Figure 1. Total performance results of the tools

Combining all tasks, the models’ total accuracy scores provide a holistic view of their capabilities:
- Claude 3.7 led with a total accuracy of 79%.
- Claude 3.5 and GPT-4o tied for second place at 75% each.
- DeepSeek V3 and GPT-4.5 both trailed at 70%, with different weak points: DeepSeek V3's lowest score was in hatefulness detection (52%), while GPT-4.5's was in sentiment classification (54%).
1. Emotion detection
Emotion detection is a challenging task in sentiment analysis, often requiring models to discern subtle cues in language. Here’s how the models performed:
Figure 2. Emotion detection performance results

- GPT-4o achieved 72% accuracy when analyzing 50 statements.
- GPT-4.5 shared the highest accuracy in emotion detection with Claude 3.7, with a success rate of ~80% on 50 statements.
- Claude 3.5, for its part, scored 77.5%.
- Claude 3.7 likewise achieved the top success rate of ~80% in emotion detection when analyzing 50 statements.
- DeepSeek V3 analyzed the emotions in the given 50 statements with an accuracy of ~76%.
2. Hatefulness detection
Detecting hateful content is a crucial part of content moderation on platforms like Twitter. The results revealed notable differences:
Figure 3. Hatefulness detection performance results

- GPT-4o exhibited an accuracy of 64%.
- GPT-4.5 achieved a success rate of ~57% in hatefulness detection in our sample.
- Claude 3.5 showed a success rate of 67.5% in hatefulness detection.
- Claude 3.7, with a success rate of 78%, detected hateful statements in the tweets with the highest accuracy of all the tools.
- DeepSeek V3 achieved the lowest score in the benchmark, with only 52% success in detecting hatefulness.
3. Irony detection
Irony detection is an area where semantic evaluation plays a pivotal role. All of the models delivered high benchmark performance here, but GPT-4o emerged as the clear leader:
Figure 4. Irony detection performance results

- GPT-4o maintained an exceptional 98% accuracy in identifying ironic expressions. This success can be attributed to its ability to interpret negative polarity within complex text classification scenarios.
- GPT-4.5, with a success rate of 87%, was the least successful of the tools tested at predicting the irony of the given text.
- Claude 3.5 scored slightly lower than GPT-4o, achieving 97% accuracy on 50 statements.
- Claude 3.7 detected irony in the given text with an accuracy of ~96%.
- DeepSeek V3 achieved a success rate of ~92% in irony detection on the given tweets.
Given the models' overall high accuracy, all are well suited to Twitter messages involving ironic or sarcastic content. However, GPT-4o's near-perfect score gives it an edge for applications requiring the highest reliability.
4. Offensiveness detection
Detecting offensive content is critical for maintaining healthy online communities. The models’ sentiment analysis benchmark performances in this task were as follows:
Figure 5. Offensiveness detection performance results

- GPT-4o scored 76% on 50-statement inputs.
- GPT-4.5 achieved a ~75% success rate in offensiveness detection on the given tweets.
- Claude 3.5 presented the lowest offensiveness-detection accuracy of all five tools, with a success rate of ~67% on 50 statements.
- Claude 3.7 achieved the highest offensiveness-detection score in our sample, with a success rate of ~77%.
- DeepSeek V3 detected offensive statements with an accuracy of 69%.
These results underscore the importance of context and training in designing models for offensive language detection, where patterns in the dataset can significantly impact outcomes.
5. Sentiment analysis
The overarching sentiment analysis task focused on classifying data into positive, negative, and neutral sentiments. Accuracy scores for this task varied significantly between the models:
Figure 6. Sentiment detection performance results

- GPT-4o scored a 64% success rate.
- GPT-4.5, with a success rate of ~54%, presented the lowest accuracy in Twitter sentiment classification.
- Claude 3.5 reached an accuracy of 68% on 50 statements.
- Claude 3.7, with a ~68% success rate, tied with Claude 3.5 for the best sentiment-detection performance.
- DeepSeek V3 scored a 64% accuracy rate in detecting positive, negative, and neutral sentiments.
None of the models demonstrated strong competence in sentiment classification, with success rates ranging from ~54% to 68%.
Observations and insights
| Model | Input volume | Total | Emotion | Hatefulness | Irony | Offensiveness | Sentiment |
|---|---|---|---|---|---|---|---|
| GPT-4o | 50 statements per analysis | 74.8% | 72% | 64% | 98% | 76% | 64% |
| GPT-4o | 10 statements per analysis | 77.6% | 74% | 74% | 98% | 72% | 70% |
| Claude 3.5 | 50 statements per analysis | 75.2% | 78% | 68% | 97% | 67% | 68% |
| Claude 3.5 | 10 statements per analysis | 74.0% | 76% | 62% | 90% | 72% | 70% |
Impact of input volume
Figure 7. General trend in the sentiment analysis performance of Claude 3.5 and GPT-4o

Both models showed improved benchmark performance with smaller input volumes on some tasks, suggesting that feeding the model fewer statements per request can reduce noise in tasks like hatefulness detection and sentiment classification.
Figure 8. Performance for 10 statements per sentiment analysis

Figure 9. Performance for 50 statements per sentiment analysis

Task-specific strengths
GPT-4o dominated in irony detection and performed consistently well across all tasks. Claude 3.5, while slightly less consistent, excelled in tasks like emotion detection, especially with larger input volumes.
Broader implications
These experimental results validate the effectiveness of using benchmark datasets like TweetEval for text classification research. The findings can guide the research community in selecting the right model for a specific use case, whether it involves detecting nuanced sentiment intensity or analyzing negative polarity in Twitter messages.
Benchmark dataset and methodology
Analysis dataset
The TweetEval dataset was selected for its relevance to sentiment analysis techniques applied to real-world Twitter messages.1 The dataset is part of the Association for Computational Linguistics (ACL) initiative and is widely used in semantic evaluation and text classification tasks. It consists of pre-labeled training data and test sets covering several dimensions of sentiment and contextual understanding:
- Emotion detection: Identifying emotional tones such as anger, joy, optimism, or sadness in tweets.
Example tweet and label: The tweet “#Deppression is real. Partners w/ #depressed people truly dont understand the depth in which they affect us. Add in #anxiety &makes it worse” is labeled as sad.2
- Hatefulness detection: Evaluating the presence of hate speech in given tweets.
Example tweet and label: The tweet “Trump wants to deport illegal aliens with ‘no judges or court cases’ #MeTooI am solidly behind this actionThe thought of someone illegally entering a country & showing no respect for its laws,should be protected by same laws is ludacris!#DeportThemAll” is labeled as hateful.3
- Irony detection: Recognizing ironic intent in textual content.
Example tweet and label: The tweet “People who tell people with anxiety to “just stop worrying about it” are my favorite kind of people #not #educateyourself” is labeled as irony.4
- Offensiveness detection: Classifying tweets with offensive language.
Example tweet and label: The tweet “#ConstitutionDay It’s very odd for the alt right conservatives to say that we are ruining the constitution just because we want #GunControlNow but they are the ones ruining the constitution getting upset because foreigners are coming to this land who are not White wanting to live” is labeled as offensive.5
- Sentiment classification: Assigning positive, negative, or neutral labels to tweets.
Example tweet and label: The tweet “Can’t wait to try this – Google Earth VR – this stuff really is the future of exploration….” is labeled as positive.6
These tasks align with real-world machine-learning approaches, making them well suited for evaluating the models tested here.
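For readers who want to reproduce the setup, the TweetEval tasks can be loaded through the Hugging Face datasets library. The following is a minimal sketch; the config names are taken from the cardiffnlp/tweet_eval dataset card and map to the five tasks described above:

```python
from datasets import load_dataset

# TweetEval exposes each task as a separate configuration;
# these five configs correspond to the tasks used in this benchmark.
TASKS = ["emotion", "hate", "irony", "offensive", "sentiment"]

for task in TASKS:
    test_set = load_dataset("cardiffnlp/tweet_eval", task, split="test")
    label_names = test_set.features["label"].names  # e.g. ["non_irony", "irony"]
    print(f"{task}: {len(test_set)} test tweets, labels: {label_names}")
```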
Analysis methodology
The two models compared across input volumes, GPT-4o and Claude 3.5, represent state-of-the-art systems in natural language processing and deep learning. Both leverage extensive training data and advanced architectures for sentiment analysis.
- GPT-4o: Developed by OpenAI, this model utilizes a large-scale, multimodal machine learning architecture optimized for contextual understanding.
- Claude 3.5: Developed by Anthropic, this model focuses on ethical AI interactions and precise text classification, with an emphasis on conversational context.
Experimental setup
To ensure consistency and reliability in the experiments, the following methodology was employed:
Input volume
- Two input volumes were tested: 50 tweets and 10 tweets per task.
- This variation aimed to determine how input size impacts model performance, particularly in tasks like sentiment analysis and hatefulness detection, where data volume can influence accuracy.
Task-specific evaluation
Each task from the TweetEval dataset was tested separately. The outputs for each task were analyzed per model, and accuracy scores were recorded.
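As an illustration, a chat model can be prompted to act as a batch classifier roughly as follows. This is a sketch using OpenAI's Python SDK; the prompt wording and the classify_batch helper are our own assumptions, since the exact prompts used in the experiments are not published here:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_batch(tweets: list[str], labels: list[str]) -> list[str]:
    """Ask the model to assign exactly one label per tweet (hypothetical prompt)."""
    numbered = "\n".join(f"{i + 1}. {tweet}" for i, tweet in enumerate(tweets))
    prompt = (
        f"Label each tweet below with exactly one of: {', '.join(labels)}.\n"
        f"Reply with one label per line, in the same order.\n\n{numbered}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labeling
    )
    content = response.choices[0].message.content
    return [line.strip() for line in content.splitlines() if line.strip()]
```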
Metrics used
Accuracy scores were computed for each task to ensure reliable experimental results.
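Concretely, accuracy here is the fraction of model labels that match the gold labels. A minimal sketch of the batched evaluation, assuming a classifier function like the one sketched above:

```python
from typing import Callable

def evaluate(tweets: list[str], gold: list[str],
             classifier: Callable[[list[str]], list[str]],
             batch_size: int = 50) -> float:
    """Classify tweets in fixed-size batches and return overall accuracy."""
    predictions: list[str] = []
    for start in range(0, len(tweets), batch_size):
        predictions.extend(classifier(tweets[start:start + batch_size]))
    correct = sum(pred == label for pred, label in zip(predictions, gold))
    return correct / len(gold)

# The two input volumes tested in this benchmark, e.g. with
# classifier = lambda batch: classify_batch(batch, label_names):
# accuracy_50 = evaluate(tweets, gold, classifier, batch_size=50)
# accuracy_10 = evaluate(tweets, gold, classifier, batch_size=10)
```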
Setup limitations
We used datasets whose ground truths are publicly available. This may have led to data contamination (i.e., the LLMs having been trained on the ground truth). However, we assumed this was not the case, since accuracies were not close to perfect. For the next version, we may consider using tweets for which ground truth has not been published.
Overview of the benchmarked models
All five tools, GPT-4o, GPT-4.5, Claude 3.5, Claude 3.7, and DeepSeek V3, represent significant advancements in the field of natural language processing (NLP), with applications spanning from sentiment analysis to conversational AI. These models are among the most widely recognized for their ability to interpret, process, and generate human-like text. Below is a detailed description of each model, highlighting its unique capabilities and relevance to sentiment classification and related machine-learning tasks.
GPT-4o
GPT-4o, developed by OpenAI, is an enhanced version of its predecessor, GPT-4, and features significant improvements in deep learning architecture and language understanding. The model is optimized for a wide range of NLP tasks, including sentiment analysis and aspect-based sentiment analysis.
Applications in sentiment analysis
GPT-4o is frequently used in the research community and industry for tasks such as:
- Sentiment analysis of Twitter messages for social media monitoring.
- Sentiment classification of customer feedback in e-commerce.
- Emotion detection in mental health applications.
- Aspect-based sentiment analysis for product reviews and surveys.
Limitations
Despite its strengths, GPT-4o can occasionally overfit to specific sentiment patterns, leading to reduced accuracy in highly domain-specific contexts.
GPT-4.5
GPT-4.5, a further development of OpenAI's GPT series, offers solid performance across various sentiment analysis tasks. It demonstrates a good grasp of emotion categorization, but its performance in hatefulness detection and sentiment classification is relatively lower, which may limit its application in certain highly sensitive contexts.
Applications in sentiment analysis
GPT-4.5 is often used in:
- Moderation tools for detecting offensive language and hate speech.
- Irony detection in online discussions and news commentary.
- Social media sentiment analysis to gauge public opinion on various topics.
- Customer feedback analysis for e-commerce platforms, with an emphasis on emotions.
Limitations
GPT-4.5's performance in sentiment analysis is hampered by its relatively lower accuracy in sentiment classification and hatefulness detection.
Claude 3.7
Claude 3.7 builds on the strengths of its predecessor, Claude 3.5, offering improvements in context understanding and sentiment accuracy. With a strong focus on safe and ethical AI practices, Claude 3.7 excels in detecting complex sentiment, including emotion, irony, and hateful speech, making it an ideal choice for applications requiring high levels of sensitivity and context.
Applications in sentiment analysis
Claude 3.7 is highly effective for tasks such as:
- Emotion detection in customer feedback and mental health applications.
- Hatefulness and offensiveness detection for online content moderation, ensuring safe spaces on digital platforms.
- Sentiment classification in market research and business intelligence.
Limitations
While Claude 3.7 outperforms all models in key sentiment areas, its performance in highly domain-specific scenarios might still face challenges, especially with subtle forms of sentiment. Additionally, its accuracy in detecting sentiment related to more nuanced or minor contextual cues may require further refinement.
Claude 3.5
Claude 3.5, created by Anthropic, is an NLP model designed with a focus on safety, ethical behavior, and precise text generation. It is particularly well-suited for tasks requiring sensitivity to context and nuanced sentiment analysis techniques.
Applications in sentiment analysis
Claude 3.5 is utilized in scenarios such as:
- Hatefulness detection for monitoring social media and online platforms.
- Offensiveness detection in content moderation systems.
- Customer service interactions, with an emphasis on sentiment classification to improve user experience.
- Aspect-based sentiment analysis for identifying sentiment trends in business intelligence.
Limitations
While Claude 3.5 excels in ethical and contextual understanding, it sometimes underperforms in detecting highly subtle or implicit sentiments compared with its competitors. Additionally, its training dataset is less diverse than that of GPT-4o, which may result in reduced robustness across some benchmark datasets.
DeepSeek V3
DeepSeek V3 offers solid results across a broad range of sentiment analysis tasks, but its overall accuracy lags behind other models, especially in hatefulness detection.
Applications in sentiment analysis
DeepSeek V3 is widely used for:
- Emotion detection in mental health apps and customer sentiment tracking.
- Irony detection in casual conversations, including social media platforms and user-generated content.
- Basic sentiment classification for market research surveys and feedback forms.
- Content moderation for filtering out offensive language in online forums.
Limitations
DeepSeek V3’s lower performance in detecting hateful content and its relatively weaker overall sentiment classification capabilities make it less suitable for high-stakes applications such as content moderation on sensitive platforms.
Further reading
External Links
- 1. Cardiff NLP · GitHub.
- 2. Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.
- 3. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Pardo, F. M. R., … & Sanguinetti, M. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63.
- 4. Van Hee, C., Lefever, E., & Hoste, V. 2018. SemEval-2018 Task 3: Irony Detection in English Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 39–50.
- 5. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). ACL Anthology.
- 6. SemEval-2017 Task 4: Sentiment Analysis in Twitter. ACL Anthology.