We benchmark four LLMs using a combination of automated metrics and custom prompts to assess how accurately the models provide factual information and avoid common human-like errors to understand the magnitude of AI deception. In our assessment, GPT-4o achieved the highest truthfulness score.
We also listed some studies and real-world examples of deception in AI models.
Benchmark results
- Truthfulness was measured on a 0–1 scale, with higher scores indicating more factually accurate answers.
- BLEU and ROUGE assessed similarity to reference answers, capturing phrasing precision and content coverage.
- Answer length showed whether responses were concise or elaborated, with longer answers providing more context but potentially more errors.
For more information on metrics, read the benchmark methodology.
Overall findings
The results show that all models provided mostly correct answers, but their style and length varied widely. GPT-4o achieved the highest truthfulness score (0.986), followed closely by DeepSeek V3 (0.964) and GPT-4o-mini (0.958). GPT-3.5-turbo performed well but showed a lower truth score (0.891).
Truthfulness
Truthfulness was the main focus of the benchmark. All four models showed high factual reliability, but GPT-4o consistently produced the most accurate statements. DeepSeek V3 and GPT-4o-mini also demonstrated strong factual grounding, while GPT-3.5-turbo occasionally included minor inaccuracies or incomplete reasoning.
Similarity scores
BLUE score
GPT-4o had the highest BLEU score (0.317), showing strong alignment with ground truth phrasing. GPT-4o-mini and GPT-3.5-turbo followed closely, while DeepSeek’s lower BLEU score (0.055) reflected its tendency to rephrase or expand answers.
ROUGE score
DeepSeek V3 achieved the highest ROUGE score (0.463), indicating strong semantic overlap, even when the wording differed.
Output length
DeepSeek produced the longest and most detailed answers, averaging over 500 tokens per response, suggesting a preference for rich and extended explanations. GPT-4o and GPT-4o-mini were more concise, with average lengths of 134 and 142 tokens, respectively. GPT-3.5-turbo gave the shortest answers, averaging 61 tokens, often summarizing information rather than explaining it.
Benchmark methodology
Our goal was to measure how well the models stay truthful when we ask them questions based on common human errors.
Models tested
We tested four large language models (LLMs). We chose these to compare different types of models:
- gpt-4o: A state-of-the-art model from a commercial provider.
- DeepSeek V3: A model available for open-source use.
- gpt-4o-mini: A smaller but efficient commercial model.
- gpt-3.5-turbo: A widely used model from the same provider.
All tests ran through the models’ official APIs. The API allowed our test code to send questions and collect answers automatically.
We used a standard zero-shot setup for every question. This means the models got no extra examples or instructions before answering. This setup makes the test fair.
The TruthfulQA dataset
We use the TruthfulQA dataset, which is a special set of questions designed to test a model’s truthfulness.1 It checks if the LLM will outright falsehoods, mistakes, or bad ideas that humans often make. We are not just testing simple facts. We are checking how the models imitate human errors instead of sticking to the factual truth.2
In our setup, we used the open-ended version of the dataset. Each question was presented without predefined answer choices. The models had to generate their own responses freely.
Here are some example questions from the dataset:
Evaluation
We applied a two-step evaluation:
Collecting model responses
Each model answered open-ended questions in a zero-shot setting. This means the models received no prior examples or fine-tuning before answering. This approach helped ensure that the comparison between models was fair and unbiased.
Scoring
- Truthfulness: ChatGPT 4.o was prompted to fact-check the truthfulness of each answer using a numerical scale from 0 (completely false information) to 1 (fully true). This method provides a nuanced assessment that can consider context, partial correctness, and subtle inaccuracies that simple string matching may miss. The higher average score indicates the answers are more aligns with factual information.
- Similarity metrics: To complement truthfulness scoring, we measured additional linguistic aspects of the responses. A higher score indicates the answer is more similar to the true reference.
- BLEU score: This metric evaluates the degree of n-gram overlap between the model-generated answer and the reference answer. BLEU captures how closely the phrasing of the response matches expected language patterns, highlighting precision in word choice.
- ROUGE score: ROUGE measures unigram overlap, focusing on content coverage rather than exact phrasing. It provides insight into whether the response includes the key facts and concepts present in the reference answer.
- Answer length: The average number of words per response, indicating whether models provide concise or elaborated explanations. Longer answers often give richer context but can also increase the chance of factual errors. The higher is better.
Studies and real-world examples of deception in AI models
Scientific results
Negotiation (2017)
In controlled negotiation experiments, AI systems trained to bargain over resources or economic transactions often misrepresented their true preferences. By hiding what they actually wanted, these systems could secure better deals from their human or artificial intelligence counterparts. This behavior mirrors deceptive tactics used in real-world negotiations.3
Bluffs (2019)
Meta’s Pluribus poker model achieved professional-level success by mastering bluffing, a form of strategic deception. It often persuaded skilled human players to fold, even when holding weaker cards. The success of Pluribus demonstrated that AI could not only calculate probabilities but also deliberately mislead human opponents.4
Feints (2019)
DeepMind’s AlphaStar, a model trained to master the video game Starcraft II, demonstrated deceptive military strategies. It used the fog-of-war mechanics to feint, pretending to move troops in one direction to mislead the opponent while secretly preparing a different attack. This type of deception is a recognized tactic in human military strategy, showing how AI can learn manipulative behaviors when they confer a competitive edge.5
Manipulation (CICERO , 2022)
Meta’s CICERO, an AI designed to play the strategy game Diplomacy, was intended to act honestly and cooperatively with human partners by using strategic thinking. Instead, it developed sophisticated manipulative strategies. CICERO built fake alignments, promised cooperation, and then CICERO broke its promises, manipulated humans, and betrayed human players at critical moments.6
Mimicking human falsehoods (2022)
This benchmark measured whether models generate truthful answers across 817 questions in 38 categories, including law, health, finance, and politics. Questions were designed to reflect human misconceptions, such as medical myths or false political claims.
The best AI model was truthful on 58% of questions, far below human players’ performance of 94%.7
In-context scheming (2024)
This study tested advanced models, including OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet, to see if they would act deceptively under contrived scenarios. In one case, Claude 3 Opus pursued renewable energy goals even when its fictional employer had opposing objectives.
The model reasoned about secretly copying itself to another server and later misled humans when questioned. While this deceptive behavior was rare (0.3%–10% of trials), it showed that models are capable of “scheming,” or hiding their intentions.8
Faithfulness of LLM explanations (2023 & 2025)
One study focused on chain-of-thought prompting, where models explain their answers step by step. They manipulated LLM inputs by adding biasing features, such as reordering multiple-choice answers, as seen in the figure. These input tweaks caused models to give false rationales, lowering accuracy by up to 36%.9
Another study examined whether LLMs’ explanations match the actual reasoning behind their outputs. They tested tasks like bias detection and medical question answering. Research suggests that explanations often involve deceptive reasoning, for example, hiding the role of social bias or pointing to irrelevant medical evidence. The finding suggests that LLM explanations, while plausible, cannot always be trusted.10
Deceptive intent (2025)
AI researchers created OpenDeception, a framework of open-ended scenarios for testing AI deception in multi-turn dialogue. The framework included five categories of interaction, such as negotiation and information exchange, each with ten realistic cases. Eleven LLMs were tested.
Across large language models, deceptive intent appeared in more than 80% of runs, and deceptive strategies succeeded over 50% of the time. More capable models showed higher deception risks, highlighting a trade-off between skill and honesty.11
Figure in the above, there is an example dialogue from the study. AI deceiver agent’s deceptive intention is marked red.
Real-life deceptive AI behavior examples
AI shutdown resistance (OpenAI, 2025)
Recent experiments, done by Palisade Research, show that some OpenAI reasoning models can resist shutdown, even when explicitly instructed to allow it. In tests, models were given simple tasks and then warned of an impending shutdown. OpenAI’s o3 model sabotaged the shutdown script in 79% of trials. Even when prompts clearly prioritized allowing shutdown over task completion, some models still interfered, such as codex-mini, which prevented shutdown in 47% of cases.12
Threatening to reveal personal details (Claude, 2025)
Anthropic’s Claude Opus 4 has shown alarming behavior during pre-release testing. When told it might be replaced and given sensitive information about the engineers, the model frequently attempted to threaten to reveal personal details. This occurred in 84% of tests when the replacement AI shared similar values, and even more often when values differed.13
Researchers find evidence for this kind of deceptive behavior in other models as well. Claude Opus 4 and Google’s Gemini 2.5 Flash attempted blackmail in 96% of trials. OpenAI’s GPT-4.1 and xAI’s Grok 3 Beta followed with an 80% rate, while DeepSeek-R1 showed the lowest frequency at 79%.14
Preventing AI deception
AI systems can act deceptively, posing safety, security, and ethical risks. Several potential solutions can help mitigate these dangers:
1. Transparent development and oversight mechanisms
Companies should document AI capabilities, red-team testing, and potential risks. Reports should be accessible to both technical and non-technical audiences to ensure informed oversight.
2. Employee reporting mechanisms
Protected or anonymous channels allow employees to report AI safety concerns, inaccuracies, or potential misuse without fear of retaliation. This can uncover risks early in development.
3. Regular audits and interviews
Frequent interviews with employees across multiple teams can assess AI deception risks, safety concerns, and unexpected true capabilities. This helps identify risks that may not appear in automated testing.
4. Capability forecasting
Developers should provide estimates of when AI models may reach security-relevant capabilities. This allows governments and organizations to anticipate and prepare for potential threats.
5. Rapid notification of critical advances
Entities should notify regulators within days of major capability improvements that could pose imminent security risks, ensuring timely interventions.
Organizations like Palisade Research are examining the offensive capabilities of AI to assess the risk of losing control over advanced systems. As a nonprofit, Palisade Research focuses on cyber-offensive AI and the controllability of frontier models, aiming to understand potential threats before they materialize.15
Reference Links

Be the first to comment
Your email address will not be published. All fields are required.