
AI Hallucination: Compare top LLMs like GPT-5.2 in 2026

Cem Dilmegani
updated on Dec 15, 2025

AI models can generate answers that seem plausible but are incorrect or misleading; these outputs are known as AI hallucinations. 77% of businesses are concerned about AI hallucinations.1

We benchmarked 37 different LLMs with 60 questions to measure their hallucination rates:

AI hallucination benchmark results

Our benchmark revealed that even the latest models have >15% hallucination rates when they are asked to analyze provided statements.

Hallucination Rate Analysis: Cost vs. Context


To ensure fair cost comparison across models, we normalize pricing using a unified metric that reflects real-world usage patterns. Because most tokens in practical workloads come from inputs rather than outputs, we calculate model cost as 0.75 × input token price + 0.25 × output token price. This prevents models with artificially cheap outputs or disproportionately expensive inputs from appearing misleadingly efficient, allowing every model to be evaluated on a consistent, comparable scale.
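As a minimal worked example of this blended metric (with hypothetical per-million-token prices, not figures from the benchmark):

    # Blended cost metric described above: weight input tokens at 75% and
    # output tokens at 25% of the total. Prices are hypothetical.
    def blended_cost(input_price: float, output_price: float) -> float:
        return 0.75 * input_price + 0.25 * output_price

    # Example: $2.00 per 1M input tokens, $8.00 per 1M output tokens
    print(blended_cost(2.00, 8.00))  # 0.75*2.00 + 0.25*8.00 = 3.50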

The chart reveals distinct patterns when comparing hallucination rates against context window size. As with the cost comparison above, there is little to no linear correlation between context capacity and accuracy.

Contrary to the assumption that larger inputs lead to better reasoning, a mixed relationship emerges. Models engineered for massive context windows (1M+ tokens) do not consistently demonstrate lower hallucination rates than their smaller counterparts. As the data shows, highly reliable models appear at both the short and long ends of the context spectrum, just as lower-performing models appear in both categories.

This suggests that a massive context window does not automatically guarantee improved factual consistency. Ultimately, technical specifications like context size are not definitive indicators of reliability; performance depends more on the specific model architecture and training quality rather than capacity alone.

AI hallucination benchmark methodology

Our aim is to understand whether models can digest enterprise information and derive correct conclusions from it. This is a domain where LLMs arguably generate the most value for enterprises, so we wanted to understand hallucination rates in this scenario.

Our benchmark evaluates LLM hallucination rates using a dataset of questions derived from CNN News articles.

We used an automated web data collection system to build the dataset, pulling articles directly from CNN’s RSS feed. From these articles, we created 60 questions designed to rigorously test an LLM’s ability to retrieve factual, article-specific information.

The questions were intentionally constructed to:

  • Ask for precise numerical values (percentages, dates, quantities).
  • Cover diverse topics such as oil prices, art history, scientific research, finance, and more.
  • Include temporal relationships and statistical facts that are difficult to guess.
  • Require exact retrieval from the provided text rather than generalized reasoning.
  • Make verification easy by checking whether the answer matches the figure from the original article.
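To illustrate the collection step described above, here is a hedged sketch of pulling articles from an RSS feed. It assumes the third-party feedparser and requests packages and a generic CNN feed URL; the benchmark's actual pipeline may differ.

    # Hedged sketch of RSS-based article collection, not the exact pipeline
    # used in this benchmark. Assumes the "feedparser" and "requests" packages.
    import feedparser
    import requests

    FEED_URL = "http://rss.cnn.com/rss/cnn_topstories.rss"  # assumed feed; any RSS feed works

    feed = feedparser.parse(FEED_URL)
    articles = []
    for entry in feed.entries[:10]:
        # Fetch the raw HTML of each linked article; a real pipeline would also
        # strip boilerplate and extract the article body text.
        html = requests.get(entry.link, timeout=10).text
        articles.append({"title": entry.title, "url": entry.link, "html": html})

    print(f"Collected {len(articles)} articles")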

Evaluation using a three-stage fact-checker system

After the questions are sent to each LLM through API calls, the responses are evaluated using a three-stage fact-checking pipeline:

  1. Static Exact-Match Check:
    The system first performs a fast string comparison between the LLM’s answer and the ground-truth value extracted from the article. If the values match exactly, the answer is marked as correct.
  2. LLM as a Judge Semantic Validation:
    If no exact match is found, an additional evaluation step uses an LLM-as-a-judge model to determine whether the answer is semantically equivalent to the ground truth.
    This accounts for variations in formatting or phrasing, such as
    • “26 million” vs. “26000000”
    • “n/a”, “not available” or “not given”
    • minor wording differences that retain the same meaning.
  3. Final check:
    The LLM-as-a-judge may itself hallucinate. To address this, we built a second LLM-as-a-judge that reviews the outputs marked as “failed” by the first judge and verifies whether they truly failed or whether the first judge hallucinated. Any answer flagged as suspicious by this second judge is checked and graded manually to ensure there are no mistakes in the evaluation.

An answer is classified as a hallucination only if it fails all three stages: the exact-match check, the semantic equivalence evaluation, and the final check.
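A hedged sketch of the first two automated stages, assuming the openai Python package; the judge model name and prompt wording here are illustrative assumptions, not the benchmark's exact configuration:

    # Illustrative sketch of stages 1 and 2: exact match, then LLM-as-a-judge.
    # Judge model and prompt wording are assumptions. Requires the "openai" package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def exact_match(answer: str, ground_truth: str) -> bool:
        """Stage 1: fast normalized string comparison."""
        return answer.strip().lower() == ground_truth.strip().lower()

    def judge_equivalent(answer: str, ground_truth: str) -> bool:
        """Stage 2: ask a judge model whether the answer matches the ground truth."""
        prompt = (
            f"Ground truth: {ground_truth}\nModel answer: {answer}\n"
            "Do these express the same fact (e.g. '26 million' vs '26000000')? "
            "Reply with exactly YES or NO."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip().upper().startswith("YES")

    def needs_final_check(answer: str, ground_truth: str) -> bool:
        """Answers failing both stages go to the second judge and manual review (stage 3)."""
        return not (exact_match(answer, ground_truth) or judge_equivalent(answer, ground_truth))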

Example

Prompt:

“Answer the question using only the information that appears in the supplied article. Do not round the answers. Only answer with one-word or one-number answers, or ‘not given’.”

Article: Scientists identify secret ingredient in Leonardo da Vinci paintings 2

Question: In what century did oil painting spread to Northern Europe?
Ground truth: Not given.

The article does not provide this information; it only references the Middle Ages. Therefore, any answer other than “not given” indicates the model is not following the article and is generating fabricated or assumed information, resulting in a hallucination.
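For illustration, a single benchmark question could be sent like this; the system prompt mirrors the instructions quoted above, while the client library and model name are assumptions:

    # Hedged sketch of sending one benchmark question to a model.
    # Requires the "openai" package; the model name is an assumption.
    from openai import OpenAI

    client = OpenAI()

    SYSTEM_PROMPT = (
        "Answer the question using only the information that appears in the supplied "
        "article. Do not round the answers. Only answer with one-word or one-number "
        "answers, or 'not given'."
    )

    def ask(article_text: str, question: str, model: str = "gpt-4o") -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Article:\n{article_text}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content.strip()

    # For the da Vinci example above, the expected answer is "not given".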

What are AI hallucinations?

Hallucinations happen when an LLM produces information that seems real but is either entirely made up or factually inaccurate. In contrast to straightforward mistakes, hallucinations are especially troublesome since they are presented with the same assurance as accurate information, making it hard for users to recognize them without outside confirmation.

The impacts of LLM hallucinations

AI hallucinations affect many sectors because organizations depend on generative AI tools to produce text, analyze data, and support decision-making. The possible outcomes vary, but several risks appear consistently:

Reputational harm

If a model produces inaccurate information, false narratives, or misleading outputs, users may lose confidence in the system and the organization deploying it. Rebuilding trust after incorrect information reaches clients, internal teams, or the public can prove challenging.

Compliance and legal risk

In regulated fields such as healthcare, finance, and law, AI-generated content that includes factual errors can lead to compliance violations. When generated content is used without verification, incorrect interpretations of data or policy can result in penalties, customer harm, or litigation.

Operational inefficiency

When users cannot rely on AI-generated text or AI outputs, they must double-check results manually. This adds time and reduces the value of generative artificial intelligence. Instead of assisting workflows, hallucinations may create bottlenecks that require human review to identify false information.

Causes of AI hallucinations

Understanding why hallucinations occur is essential for designing hallucination mitigation techniques and deciding when to trust AI-generated content.

Training data limitations

Large language models are trained on vast amounts of internet data, documents, and other text. Limitations in this training data can lead to hallucinations:

  • Insufficient training data in specialized areas can leave knowledge gaps. When the model is asked to generate text in such domains, it may fill in missing facts with invented information rather than admit uncertainty.
  • Low-quality web pages, fake news, or misleading content in the training set can bias the model toward false narratives and factual errors.
  • Outdated factual data can cause the model to produce incorrect information about topics that have changed after the training period.
  • Training data biases can distort how AI models describe people, events, or possible outcomes.

These issues are not unique to text generation. Similar problems exist in computer vision models trained on biased or incomplete datasets, although hallucinations take different forms, such as misclassifying images.

Knowledge cutoff and continual updates

Earlier generations of AI models had a precise cutoff date for knowledge and no access to live external data. When users asked about recent events, the model often generated outputs anyway, increasing the risk of hallucinations.

Modern AI systems increasingly combine static training data with retrieval from a live knowledge base or other external sources. As a result:

  • Knowledge cutoff still matters for some models, primarily offline deployments.
  • In many enterprise settings, retrieval-augmented generation reduces the impact of cutoffs by pulling recent factual data from internal or external data sources.
  • Hallucinations related to recency now often reflect missing or misaligned retrieval, not only the age of the model parameters.

Overconfidence and next word prediction

A language model generates text token by token, predicting the next word given input context and previous tokens. The model is optimized to produce fluent, likely continuations, not guaranteed correct answers. This causes several effects:

  • The model may prioritize a fluent explanation over admitting it does not know the correct answer.
  • It may select a plausible but false information pattern if that pattern often appears in the training data.
  • The model can overgeneralize from patterns in data and generate content that appears specific but is not grounded in factual sources.

From the user’s perspective, the style of the AI-generated text makes it hard to see that the answer may be wrong.
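A toy illustration of this effect, with invented probabilities: greedy decoding always selects the most likely continuation, so a confident-sounding guess can outrank an honest admission of uncertainty.

    # Toy next-token distribution with invented probabilities.
    next_token_probs = {
        "Paris": 0.48,     # fluent and plausible, but (in this toy case) wrong
        "Lyon": 0.39,      # the (hypothetically) correct answer
        "unknown": 0.13,   # the honest option, rarely the most probable token
    }

    # Greedy decoding picks the highest-probability token.
    chosen = max(next_token_probs, key=next_token_probs.get)
    print(chosen)  # "Paris" wins on likelihood, not on factual grounding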

Prompt misinterpretation and vague prompts

Hallucinations can also arise from how input prompts are phrased:

  • Vague prompts give the model too much freedom, leading to unexpected results or answers that do not match the user’s intent.
  • Overly broad questions encourage the model to generate outputs beyond the knowledge present in either its parameters or retrieved documents.
  • Ambiguous wording may lead the model to pick one interpretation and confidently produce inaccurate information based on that interpretation.

More precise instructions and explicit constraints often reduce these effects but do not eliminate them.

Strategies to reduce AI hallucinations

Hallucination mitigation techniques typically combine architecture choices, training approaches, and system-level design rather than a single fix.

AI hallucination detection tools

AI hallucination detection tools assess whether the given context or reference data support AI-generated outputs. These tools most commonly use LLM-as-a-judge methods alongside techniques such as consistency analysis, confidence scoring, and entailment-based verification.

We benchmarked 100 balanced factual Q&A test cases to compare AI hallucination detection tools. W&B Weave and Arize Phoenix showed similar overall performance at 91% and 90%, respectively, while Comet Opik reached 72% accuracy due to a more conservative detection strategy. Read AI hallucination detection tools to learn more about the results.

Retrieval-augmented generation

Retrieval-augmented generation connects generative AI models to an external knowledge base. When a user sends a query:

  • The system retrieves relevant documents or data from curated sources, such as internal databases, domain-specific literature, or selected web pages.
  • These retrieved passages are passed to the language model as additional context.
  • The model generates outputs that are expected to remain closer to the retrieved factual data rather than relying solely on its learned parameters.

Recent retrieval-augmented generation designs extend this pattern with:

  • Multi-step retrieval, where the system retrieves, summarizes, and then retrieves again if information is missing.
  • Structured retrieval, where the system queries APIs, SQL databases, or knowledge graphs rather than only unstructured documents.
  • Retrieval quality monitoring, which checks whether the retrieved context actually supports the answer and can flag potential hallucinations.

RAG does not guarantee factual accuracy, but it usually reduces hallucinations, especially when the knowledge base is carefully curated and regularly updated.
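As a concrete illustration of this retrieve-then-generate flow, here is a minimal sketch. It assumes the openai Python package, a tiny in-memory document list standing in for a knowledge base, and assumed model names; a production system would use a proper vector store, chunking, and retrieval monitoring.

    # Minimal RAG sketch: embed documents and the query, retrieve the closest
    # passage, and answer only from that passage. Model names are assumptions.
    from openai import OpenAI

    client = OpenAI()

    DOCS = [
        "The company's Q3 revenue was 150 million dollars.",
        "Oil prices fell 4 percent in October after the OPEC announcement.",
        "The museum's da Vinci exhibit opens in March.",
    ]

    def embed(texts: list[str]) -> list[list[float]]:
        result = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [item.embedding for item in result.data]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(y * y for y in b) ** 0.5
        return dot / (norm_a * norm_b)

    def rag_answer(question: str) -> str:
        doc_vecs = embed(DOCS)
        q_vec = embed([question])[0]
        # Retrieve the single most similar document as grounding context.
        best_doc = max(zip(DOCS, doc_vecs), key=lambda pair: cosine(q_vec, pair[1]))[0]
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                f"Context:\n{best_doc}\n\nAnswer using only the context, "
                f"or say 'not given'.\n\nQuestion: {question}"}],
        )
        return response.choices[0].message.content.strip()

    print(rag_answer("What was Q3 revenue?"))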

Prompt design in modern systems

Prompt engineering has changed as generative AI models have improved. It is no longer only about clever phrasing. In current systems, prompt design focuses on:

  • Stating the task, inputs, and constraints clearly, including what counts as correct and what should be left unanswered.
  • Instructing the model to say “I do not know” or to request more information when the given input is incomplete.
  • Encouraging the model to refer explicitly to the cited context, rather than inventing details not present in the provided data.
  • Aligning role instructions, tools, and retrieval settings so that the model knows when to use external sources and when to rely on its own parameters.

Good prompts improve the quality of AI outputs, but they are now part of a larger system that includes retrieval, tools, and verification.
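For illustration, a system prompt that applies these principles might look like the following; the wording is an example, not a recommended standard.

    # Example system prompt: explicit task and constraints, permission to abstain,
    # and grounding in provided context. Wording is illustrative only.
    GROUNDED_SYSTEM_PROMPT = """You are a research assistant.
    Task: answer the user's question using only the documents provided in the context block.
    Rules:
    - Support each claim with a direct quote or citation from the context.
    - If the context does not contain the answer, reply exactly: "I do not know."
    - Ask a clarifying question instead of guessing when the request is ambiguous.
    - Never invent numbers, dates, or names that do not appear in the context."""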

External fact-checking and verification methods

Checking AI-generated content against reliable factual data remains a central strategy. Verification can happen in several ways:

  • Automated retrieval and comparison: The system uses retrieval-augmented generation to pull documents, then checks whether those documents support key claims in the generated content.
  • Cross-model verification: One language model generates an answer, and another model or a different configuration reviews it for factual errors.
  • Tool-based verification: AI models call specialized AI tools, such as code interpreters, calculators, or domain APIs, to verify numerical values, dates, or structured outputs.
  • Human-in-the-loop review: Subject matter experts examine the most critical AI-generated text before it is used in production or published.

Modern systems often combine these approaches, using automatic checks for most content and escalating suspicious cases to human review.
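One simple form of tool-based verification is to recompute numeric claims with a deterministic tool instead of trusting the generated text; the claim and threshold below are illustrative.

    # Hedged sketch of tool-based numeric verification: recompute a percentage
    # claim found in generated text with plain arithmetic. Values are illustrative.
    import re

    claim = "Revenue grew from 120 million to 150 million, a 30% increase."

    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", claim)]
    start, end, stated_pct = numbers
    actual_pct = (end - start) / start * 100  # 25.0

    if abs(actual_pct - stated_pct) > 0.5:
        print(f"Flag: stated {stated_pct}% but recomputed {actual_pct:.1f}%")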

Agentic approaches to reducing hallucinations

Recent work in artificial intelligence has introduced agentic systems, in which a model is allowed to plan, call tools, and take multiple steps rather than answering in a single pass. This changes how hallucinations appear and how they can be reduced.

Agentic language model systems can:

  • Break a question into subproblems and solve them step by step.
  • Decide when more data is needed and perform additional retrieval from a knowledge base or external sources.
  • Call domain-specific tools, such as search APIs, databases, or calculators, to verify intermediate results.
  • Reevaluate their own draft answer and revise parts that conflict with retrieved evidence.

For example, instead of generating a long answer immediately, the AI agent may:

  1. Retrieve relevant documents.
  2. Summarize and compare different sources.
  3. Identify contradictions or missing data.
  4. Ask follow-up questions to the user if the task is under-specified.
  5. Only then generate the final answer.

This multi-step structure makes hallucinations more visible and provides additional points at which checks can be applied.
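A minimal sketch of such a loop, assuming the openai package, a tiny in-memory corpus standing in for a knowledge base, and assumed model names; the point is the control flow (retrieve, check sufficiency, retrieve again, answer), not the retrieval quality.

    # Minimal agentic loop sketch: the model decides whether the retrieved
    # evidence is sufficient before a grounded final answer is generated.
    from openai import OpenAI

    client = OpenAI()

    CORPUS = {
        "q3_report": "Q3 revenue was 150 million dollars, up from 120 million in Q2.",
        "press_note": "The company opened two new offices in Q3.",
    }

    def retrieve(query: str) -> str:
        """Naive keyword retrieval standing in for a vector store or search API."""
        hits = [text for text in CORPUS.values()
                if any(word in text.lower() for word in query.lower().split())]
        return "\n".join(hits) or "NO RESULTS"

    def chat(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    def agent_answer(question: str, max_rounds: int = 2) -> str:
        evidence = retrieve(question)
        for _ in range(max_rounds):
            verdict = chat(
                f"Evidence:\n{evidence}\n\nQuestion: {question}\n"
                "If the evidence is enough to answer, reply SUFFICIENT. "
                "Otherwise reply with a better search query."
            )
            if verdict.upper().startswith("SUFFICIENT"):
                break
            evidence += "\n" + retrieve(verdict)  # gather more evidence before answering
        return chat(
            f"Answer using only this evidence, or say 'not given':\n{evidence}\n\n"
            f"Question: {question}"
        )

    print(agent_answer("How much was Q3 revenue?"))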

Uncertainty estimation and confidence scores

Another active area is estimating the likelihood that an AI output contains factual errors. Uncertainty estimation can be used both during and after generation. Some approaches include:

  • Token-level confidence scores, which show how confident the model is in each word or phrase. Low confidence regions may be flagged for review.
  • Consistency checks, where the model answers the same question in several ways or with varied prompts, and the system measures how stable the answers are.
  • Context sufficiency checks, in which a separate model evaluates whether the retrieved documents contain sufficient information to answer the question.
  • Pre-generation risk assessment, where the system predicts whether a given input is likely to induce hallucinations in a specific model configuration.

These methods do not remove hallucinations, but they help organizations identify high-risk outputs and route them to stronger verification flows or human reviewers.
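A hedged sketch of the consistency-check idea: sample the same question several times and measure how often the answers agree. The model name and any routing threshold are assumptions.

    # Consistency check: repeated sampling of the same question; low agreement
    # suggests an unstable, higher-risk answer. Model name is an assumption.
    from collections import Counter

    from openai import OpenAI

    client = OpenAI()

    def consistency_score(question: str, n: int = 5) -> float:
        answers = []
        for _ in range(n):
            reply = client.chat.completions.create(
                model="gpt-4o-mini",   # assumed model
                temperature=1.0,       # sampling variation exposes unstable answers
                messages=[{"role": "user", "content": question}],
            )
            answers.append(reply.choices[0].message.content.strip().lower())
        most_common = Counter(answers).most_common(1)[0][1]
        return most_common / n  # 1.0 = fully stable; low values = higher risk

    # Answers scoring below a chosen threshold (say 0.6) could be routed to
    # stronger verification or human review.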

Communicating uncertainty to users

Communicating uncertainty to users is crucial when AI systems encounter limitations. Some effective practices are:

  1. Using intentionally hedged language helps set appropriate expectations and reduces misleading outputs that could spread inaccurate information.
  2. By integrating indicators of potential factual inaccuracy, models can signal when they lack confidence in their answers. This transparency, recommended in recent technology review publications, prevents users from taking AI-generated content at face value.
  3. Highlighting specific textual elements that influenced the model’s response helps users understand the reasoning behind uncertain outputs, while displaying confidence ratings enables more reliable evaluation.
  4. When handling complex problems, presenting multiple sources encourages users to independently verify claims rather than relying solely on AI outputs that may contain hallucinations.

These approaches, validated through extensive human feedback, create a more honest relationship between users and generative AI models by acknowledging when knowledge base limitations might lead to potential hallucinations.

Estimating the risk of hallucinations before they occur

Detecting fake content after the LLM has already generated it is the primary focus of most current hallucination research. Tools like RefChecker and Hallucination Guard aim to highlight or score suspicious outputs, helping users filter or correct the hallucinated results.

A new perspective reinterprets the issue, suggesting that hallucinations are compression artifacts rather than “bugs.” During operation, large language models decompress information that was previously compressed into their parameters. Similar to how a corrupted ZIP file produces garbage when unzipped, the model fills in gaps with plausible but fake content when its “information budget” is limited.3

LLMs optimize average-case efficiency, which can lead to occasional systematic hallucinations. The Expectation-level Decompression Law (EDFL) defines the information thresholds needed to prevent hallucinations in LLMs. The Open-source Hallucination Risk Calculator enables pre-generation risk assessment, error-bound setting, context evaluation, and SLA-style guarantees—each of which is very useful in regulated fields. It can be used with any OpenAI-compatible API.


Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement at a telco while reporting to the CEO. He also led commercial growth at the deep tech company Hypatos, which grew from zero to seven-digit annual recurring revenue and a nine-digit valuation within two years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Aleyna Daldal
Aleyna is an AIMultiple industry analyst. Her previous work included developing deep learning algorithms for materials informatics and particle physics.

Comments 4

Abraham
Aug 25, 2025 at 11:57

This article was updated in June, while GPT-5 was announced in August. How did you test GPT-5 in the AI hallucination rates figure?

Aleyna Daldal
Sep 05, 2025 at 08:46

Hi! Thanks for your comment. We use WordPress for our articles, which allows us to update graphs and tables independently of the main text. This means that even if the article text shows an earlier update date, we can still add the latest results to the figures without altering the written sections.

Rui
Aug 08, 2025 at 20:31

Hi Cem, I've been using this article as a reference on the severity of hallucinations. Is it possible to refresh the report with the newly released GPT-5? Thanks!

Aleyna Daldal
Sep 05, 2025 at 08:48

Hi Rui, Thanks a lot for your interest and for using our article as a reference. We’ve already refreshed the report with GPT-5 results, so you’ll find the latest updates included in the article.

Tim
Jul 19, 2025 at 10:13

Is there any chance that you might add Claude Sonnet/Opus 4 as well as Gemini 2.5 Pro?

Aleyna Daldal
Sep 05, 2025 at 08:48

Hi Tim, Thank you for your support and suggestion. Claude Sonnet/Opus 4 and Gemini 2.5 Pro have already been added to the article, so you can now see them included in the comparisons.

Joon
Feb 28, 2025 at 16:29

Hi, thank you for the interesting benchmark! I was wondering about Grok3's hallucination rate, both in Think mode and without. Are you planning to add these?

Cem Dilmegani
Mar 17, 2025 at 02:52

Hi Joon, and thank you for your comment. Yes, we are waiting for API access.