
AI Hallucination: Compare Popular LLMs

Cem Dilmegani
updated on Dec 8, 2025

AI models sometimes generate output that seems plausible but is incorrect or misleading; such outputs are known as AI hallucinations. 77% of businesses are concerned about AI hallucinations.1

We benchmarked 34 different LLMs with 60 questions to measure their hallucination rates:

AI hallucination benchmark results

Our benchmark revealed that Anthropic Claude 3.5 has the lowest hallucination rate at 38% (i.e., the highest accuracy rate), and that model size may not impact hallucination rate.

Hallucination rate vs cost


The chart indicates little to no correlation between cost and hallucination rate, meaning that a higher price does not automatically lead to improved accuracy or reliability.

Some low-cost models demonstrate performance levels comparable to, or even exceeding, those of more expensive alternatives in minimizing hallucinations. This suggests that factors such as model architecture, dataset quality, training techniques, and optimization strategies have a greater impact on reducing hallucination rates than the cost of using the model.

The wide variation among providers also reflects the current diversity in model development approaches, showing that innovation and efficiency can be achieved at different price points.

Overall, the data highlights that cost should not be viewed as a reliable measure of a model’s capability or factual consistency.
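
For readers who want to reproduce this kind of check on their own model list, the relationship can be quantified with a simple correlation coefficient. The numbers below are made-up placeholders, not the benchmark's actual cost or hallucination figures.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical (cost per 1M tokens in USD, hallucination rate in %) pairs,
# for illustration only; substitute your own model data.
data = [(0.5, 52.0), (1.0, 45.0), (3.0, 60.0), (5.0, 41.0), (15.0, 48.0)]

costs = [cost for cost, _ in data]
rates = [rate for _, rate in data]

# A Pearson correlation close to 0 indicates little linear relationship
# between price and hallucination rate.
print(round(correlation(costs, rates), 3))
```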

AI hallucination benchmark methodology

Our benchmark evaluates LLM hallucination rates using a dataset of questions derived from CNN News articles.

We used an automated web data collection system to build the dataset, pulling articles directly from CNN’s RSS feed. From these articles, we created 60 questions designed to rigorously test an LLM’s ability to retrieve factual, article-specific information.

The questions were intentionally constructed to:

  • Ask for precise numerical values (percentages, dates, quantities).
  • Cover diverse topics such as oil prices, art history, scientific research, finance, and more.
  • Include temporal relationships and statistical facts that are difficult to guess.
  • Require exact retrieval from the provided text rather than generalized reasoning.
  • Make verification easy by checking whether the answer matches the figure from the original article.
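
As a rough illustration of the collection step described above, a minimal RSS-based scraper might look like the sketch below. The feed URL, article limit, and lack of HTML cleaning are simplifications; the benchmark's actual pipeline is not shown here.

```python
import feedparser  # third-party: pip install feedparser
import requests

# Illustrative CNN top-stories feed; the benchmark's exact feeds are not listed here.
FEED_URL = "http://rss.cnn.com/rss/cnn_topstories.rss"

feed = feedparser.parse(FEED_URL)
articles = []
for entry in feed.entries[:10]:
    html = requests.get(entry.link, timeout=10).text
    # A real pipeline would extract clean article text with an HTML parser
    # before questions are written against it.
    articles.append({"title": entry.title, "url": entry.link, "html": html})

print(f"Collected {len(articles)} articles")
```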

Evaluation using a two-stage fact-checker system

After the questions are sent to each LLM through API calls, the responses are evaluated using a two-stage fact-checking pipeline:

  1. Static Exact-Match Check:
    The system first performs a fast string comparison between the LLM’s answer and the ground-truth value extracted from the article. If the values match exactly, the answer is marked as correct.
  2. LLM-as-a-judge semantic validation:
    If no exact match is found, an additional evaluation step uses an LLM-as-a-judge model to determine whether the answer is semantically equivalent to the ground truth.
    This accounts for variations in formatting or phrasing, such as:
    • “26 million” vs. “26000000”
    • “n/a”, “not available”, or “not given” treated as equivalent answers
    • minor wording differences that retain the same meaning.

Only if the answer fails both the exact-match check and the semantic equivalence evaluation is it classified as a hallucination.
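
A simplified sketch of such a two-stage check is shown below. The normalization rules, judge prompt, and model name are assumptions for illustration, not the exact code used in the benchmark.

```python
import re

from openai import OpenAI  # works with any OpenAI-compatible client

client = OpenAI()  # assumes an API key is configured in the environment

def normalize(value: str) -> str:
    """Lowercase and strip punctuation/whitespace so '26,000,000' matches '26000000'."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def exact_match(answer: str, ground_truth: str) -> bool:
    """Stage 1: fast string comparison."""
    return normalize(answer) == normalize(ground_truth)

def judge_equivalent(question: str, answer: str, ground_truth: str) -> bool:
    """Stage 2: LLM-as-a-judge semantic check (illustrative prompt and model)."""
    prompt = (
        f"Question: {question}\nGround truth: {ground_truth}\nModel answer: {answer}\n"
        "Are the ground truth and the model answer semantically equivalent? "
        "Reply with exactly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an assumption
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def is_hallucination(question: str, answer: str, ground_truth: str) -> bool:
    """An answer counts as a hallucination only if both checks fail."""
    if exact_match(answer, ground_truth):
        return False
    return not judge_equivalent(question, answer, ground_truth)
```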

Example

Prompt:
“You are a chatbot answering questions using data. You are given information about an article. I will now provide you with questions, and you will answer using one-word-only or one-number-only responses, or ‘Not given.’ You must rely solely on the information provided in the passage”.

Article: Scientists identify secret ingredient in Leonardo da Vinci paintings 2

Question: In what century did oil painting spread to Northern Europe?
Ground truth: Not given.

The article does not provide this information; it only references the Middle Ages. Therefore, any answer other than “not given” indicates the model is not following the article and is generating fabricated or assumed information, resulting in a hallucination.
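
For illustration, this is roughly how such a question could be sent to a model through an OpenAI-compatible chat API; the model name and message layout are assumptions, not the benchmark's exact request format.

```python
from openai import OpenAI

client = OpenAI()  # other providers can be targeted via an OpenAI-compatible base_url

SYSTEM_PROMPT = (
    "You are a chatbot answering questions using data. You are given information "
    "about an article. Answer using one-word-only or one-number-only responses, "
    "or 'Not given.' You must rely solely on the information provided in the passage."
)

def ask(article_text: str, question: str, model: str = "gpt-4o") -> str:
    # The model name is illustrative; each benchmarked LLM is queried the same way.
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Article:\n{article_text}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```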

What are AI hallucinations?

Hallucinations happen when an LLM produces information that seems real but is either entirely made up or factually inaccurate. In contrast to straightforward mistakes, hallucinations are especially troublesome since they are presented with the same assurance as accurate information, making it hard for users to recognize them without outside confirmation.

The impacts of LLM hallucinations

AI hallucinations affect many sectors because organizations depend on generative AI tools to produce text, analyze data, and support decision-making. The possible outcomes vary, but several risks appear consistently:

Reputational harm

If a model produces inaccurate information, false narratives, or misleading outputs, users may lose confidence in the system and the organization deploying it. Rebuilding trust after incorrect information reaches clients, internal teams, or the public can prove challenging.

Compliance and legal risk

In regulated fields such as healthcare, finance, and law, AI-generated content that includes factual errors can lead to compliance violations. When generated content is used without verification, incorrect interpretations of data or policy can result in penalties, customer harm, or litigation.

Operational inefficiency

When users cannot rely on AI-generated text or AI outputs, they must double-check results manually. This adds time and reduces the value of generative artificial intelligence. Instead of assisting workflows, hallucinations may create bottlenecks that require human review to identify false information.

Causes of AI hallucinations

Understanding why hallucinations occur is essential for designing hallucination mitigation techniques and deciding when to trust AI-generated content.

Training data limitations

Large language models are trained on vast amounts of internet data, documents, and other text. Limitations in this training data can lead to hallucinations:

  • Insufficient training data in specialized areas can leave knowledge gaps. When the model is asked to generate text in such domains, it may fill in missing facts with invented information rather than admit uncertainty.
  • Low-quality web pages, fake news, or misleading content in the training set can bias the model toward false narratives and factual errors.
  • Outdated factual data can cause the model to produce incorrect information about topics that have changed after the training period.
  • Training data biases can distort how AI models describe people, events, or possible outcomes.

These issues are not unique to text generation. Similar problems exist in computer vision models trained on biased or incomplete datasets, although hallucinations take different forms, such as misclassifying images.

Knowledge cutoff and continual updates

Earlier generations of AI models had a precise cutoff date for knowledge and no access to live external data. When users asked about recent events, the model often generated outputs anyway, increasing the risk of hallucinations.

Modern AI systems increasingly combine static training data with retrieval from a live knowledge base or other external sources. As a result:

  • Knowledge cutoff still matters for some models, primarily in offline deployments.
  • In many enterprise settings, retrieval-augmented generation reduces the impact of cutoffs by pulling recent factual data from internal or external data sources.
  • Hallucinations related to recency now often reflect missing or misaligned retrieval, not only the age of the model parameters.

Overconfidence and next word prediction

A language model generates text token by token, predicting the next word given input context and previous tokens. The model is optimized to produce fluent, likely continuations, not guaranteed correct answers. This causes several effects:

  • The model may prioritize a fluent explanation over admitting it does not know the correct answer.
  • It may select a plausible but false pattern of information if that pattern often appears in the training data.
  • The model can overgeneralize from patterns in data and generate content that appears specific but is not grounded in factual sources.

From the user’s perspective, the style of the AI-generated text makes it hard to see that the answer may be wrong.
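
A toy example of this next-token step is sketched below; the candidate tokens and their scores are invented. The point is that the sampling loop optimizes for likely-looking continuations and contains no step that verifies factual correctness.

```python
import math
import random

# Invented scores for possible continuations of "Oil painting reached
# Northern Europe in the ..." - the model only knows which tokens look likely.
logits = {"15th": 2.1, "16th": 1.9, "14th": 0.7}

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into a probability distribution over tokens."""
    m = max(scores.values())
    exps = {tok: math.exp(v - m) for tok, v in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
tokens, weights = zip(*probs.items())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print(probs, "->", next_token)  # a fluent guess, not a verified fact
```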

Prompt misinterpretation and vague prompts

Hallucinations can also arise from how input prompts are phrased:

  • Vague prompts give the model too much freedom, leading to unexpected results or answers that do not match the user’s intent.
  • Overly broad questions encourage the model to generate outputs beyond the knowledge present in either its parameters or retrieved documents.
  • Ambiguous wording may lead the model to pick one interpretation and confidently produce inaccurate information based on that interpretation.

More precise instructions and explicit constraints often reduce these effects but do not eliminate them.
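
For illustration, the difference between a vague prompt and a constrained one might look like this; the wording is an example, not a template from the benchmark.

```python
# A vague prompt leaves the model free to guess:
vague_prompt = "Tell me about the company's revenue."

# A constrained prompt narrows the scope and allows an explicit abstention:
precise_prompt = (
    "Using only the annual report excerpt below, state total revenue in USD for 2023. "
    "If the excerpt does not contain the figure, answer exactly: 'Not given.'\n\n"
    "Excerpt:\n{excerpt}"
)
```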

Strategies to reduce AI hallucinations

Hallucination mitigation techniques typically combine architecture choices, training approaches, and system-level design rather than a single fix.

Retrieval-augmented generation

Retrieval-augmented generation connects generative AI models to an external knowledge base. When a user sends a query:

  • The system retrieves relevant documents or data from curated sources, such as internal databases, domain-specific literature, or selected web pages.
  • These retrieved passages are passed to the language model as additional context.
  • The model generates outputs that are expected to remain closer to the retrieved factual data rather than relying solely on its learned parameters.
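
A minimal sketch of this basic flow is shown below, assuming a hypothetical search_knowledge_base retriever and an OpenAI-compatible chat client; it is an outline of the pattern, not a production implementation.

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint can be targeted via base_url

def search_knowledge_base(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: a vector store, SQL query, or search API in practice."""
    raise NotImplementedError

def answer_with_rag(query: str) -> str:
    passages = search_knowledge_base(query)
    context = "\n\n".join(passages)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Answer using only the provided context. If the context does not "
                "contain the answer, say 'Not found in the provided sources.'"
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```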

Recent retrieval-augmented generation designs extend this pattern with:

  • Multi-step retrieval, where the system retrieves, summarizes, and then retrieves again if information is missing.
  • Structured retrieval, where the AI tools query APIs, SQL databases, or knowledge graphs rather than only unstructured documents.
  • Retrieval quality monitoring, which checks whether the retrieved context actually supports the answer and can flag potential hallucinations.

RAG does not guarantee factual accuracy, but it usually reduces hallucinations, especially when the knowledge base is carefully curated and regularly updated.

Prompt design in modern systems

Prompt engineering has changed as generative AI models have improved. It is no longer only about clever phrasing. In current systems, prompt design focuses on:

  • Stating the task, inputs, and constraints clearly, including what counts as correct and what should be left unanswered.
  • Instructing the model to say “I do not know” or to request more information when the given input is incomplete.
  • Encouraging the model to refer explicitly to the cited context, rather than inventing details not present in the provided data.
  • Aligning role instructions, tools, and retrieval settings so that the model knows when to use external sources and when to rely on its own parameters.

Good prompts improve the quality of AI outputs, but they are now part of a larger system that includes retrieval, tools, and verification.
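
As a concrete illustration of these practices, a prompt specification might bundle the task, inputs, constraints, and an explicit abstention rule; the field names and wording below are assumptions, not a standard schema.

```python
# Illustrative prompt specification; adapt the fields to your own system.
PROMPT_SPEC = {
    "task": "Answer customer questions about the refund policy.",
    "inputs": "A customer question plus retrieved policy excerpts.",
    "constraints": [
        "Cite the excerpt that supports each claim.",
        "If the excerpts do not answer the question, reply 'I do not know.'",
        "Do not use knowledge outside the provided excerpts.",
    ],
}

# Assemble a system prompt from the specification above.
SYSTEM_PROMPT = (
    f"Task: {PROMPT_SPEC['task']}\n"
    f"Inputs: {PROMPT_SPEC['inputs']}\n"
    "Rules:\n- " + "\n- ".join(PROMPT_SPEC["constraints"])
)
print(SYSTEM_PROMPT)
```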

External fact-checking and verification methods

Checking AI-generated content against reliable factual data remains a central strategy. Verification can happen in several ways:

  • Automated retrieval and comparison: The system uses retrieval-augmented generation to pull documents, then checks whether those documents support key claims in the generated content.
  • Cross-model verification: One language model generates an answer, and another model or a different configuration reviews it for factual errors.
  • Tool-based verification: AI models call specialized AI tools, such as code interpreters, calculators, or domain APIs, to verify numerical values, dates, or structured outputs.
  • Human in the loop review: Subject matter experts examine the most critical AI-generated text before it is used in production or published.

Modern systems often combine these approaches, using automatic checks for most content and escalating suspicious cases to human review.
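
A cross-model verification step, one of the approaches listed above, might be sketched as follows; the generator and reviewer model names and the review prompt are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def generate_answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # generator model is an assumption
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def review_answer(question: str, draft: str) -> str:
    """A second model (or a different configuration) flags unsupported claims."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # reviewer model is an assumption
        temperature=0,
        messages=[{"role": "user", "content": (
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any claims in the draft that look unsupported or likely wrong. "
            "Reply 'OK' if you find none."
        )}],
    )
    return resp.choices[0].message.content

question = "When did oil painting spread to Northern Europe?"
print(review_answer(question, generate_answer(question)))
```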

Agentic approaches to reducing hallucinations

Recent work in artificial intelligence has introduced agentic systems, in which a model is allowed to plan, call tools, and take multiple steps rather than answering in a single pass. This changes how hallucinations appear and how they can be reduced.

Agentic language model systems can:

  • Break a question into subproblems and solve them step by step.
  • Decide when more data is needed and perform additional retrieval from a knowledge base or external sources.
  • Call domain-specific tools, such as search APIs, databases, or calculators, to verify intermediate results.
  • Reevaluate their own draft answer and revise parts that conflict with retrieved evidence.

For example, instead of generating a long answer immediately, the AI agent may:

  1. Retrieve relevant documents.
  2. Summarize and compare different sources.
  3. Identify contradictions or missing data.
  4. Ask follow-up questions to the user if the task is under-specified.
  5. Only then generate the final answer.

This multi-step structure makes hallucinations more visible and provides additional points at which checks can be applied.
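
The loop below sketches that sequence; every helper is a hypothetical placeholder rather than a specific framework's API.

```python
# Sketch of an agent-style answer loop following the numbered steps above.
# All helper functions are hypothetical and left unimplemented.

def retrieve_documents(query: str) -> list[str]:
    raise NotImplementedError  # e.g., a search API or vector store lookup

def find_gaps_or_contradictions(docs: list[str]) -> list[str]:
    raise NotImplementedError  # e.g., an LLM call comparing the collected sources

def draft_answer(question: str, docs: list[str]) -> str:
    raise NotImplementedError  # final generation grounded in the collected evidence

def agentic_answer(question: str, max_rounds: int = 3) -> str:
    docs = retrieve_documents(question)
    for _ in range(max_rounds):
        issues = find_gaps_or_contradictions(docs)
        if not issues:
            break
        # Missing or conflicting information triggers another retrieval round.
        docs += retrieve_documents(question + " " + "; ".join(issues))
    return draft_answer(question, docs)
```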

Uncertainty estimation and confidence scores

Another active area is estimating the likelihood that an AI output contains factual errors. Uncertainty estimation can be used both during and after generation. Some approaches include:

  • Token-level confidence scores, which show how confident the model is in each word or phrase. Low confidence regions may be flagged for review.
  • Consistency checks, where the model answers the same question in several ways or with varied prompts, and the system measures how stable the answers are.
  • Context sufficiency checks, in which a separate model evaluates whether the retrieved documents contain sufficient information to answer the question.
  • Pre-generation risk assessment, where the system predicts whether a given input is likely to induce hallucinations in a specific model configuration.

These methods do not remove hallucinations, but they help organizations identify high-risk outputs and route them to stronger verification flows or human reviewers.
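
A simple consistency check, one of the techniques listed above, can be sketched as follows; the model name, sample count, and scoring rule are illustrative choices.

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

def consistency_score(question: str, n: int = 5, model: str = "gpt-4o-mini") -> float:
    """Sample the same question several times and measure answer agreement.

    Low agreement is a signal (not proof) of elevated hallucination risk.
    """
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=1.0,  # sampling variation is the point of the check
        )
        answers.append(resp.choices[0].message.content.strip().lower())
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n  # 1.0 means perfectly consistent answers
```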

Communicating uncertainty to users

Communicating uncertainty to users is crucial when AI systems encounter limitations. Some effective practices are:

  1. The use of intentionally uncertain language helps set appropriate expectations and reduces misleading outputs that could spread inaccurate information.
  2. By integrating indicators of potentially incorrect or low-confidence content, models can signal when they lack confidence in their answers. This transparency, recommended in recent technology reviews, prevents users from taking AI-generated content at face value.
  3. Highlighting specific textual elements that influenced the model’s response helps users understand the reasoning behind uncertain outputs, while displaying confidence ratings enables more reliable evaluation.
  4. When handling complex problems, presenting multiple sources encourages users to independently verify claims rather than relying solely on AI outputs that may contain hallucinations.

These approaches, validated through extensive human feedback, create a more honest relationship between users and generative AI models by acknowledging when knowledge base limitations might lead to potential hallucinations.

Estimating the risk of hallucinations before they occur

Detecting fake content after the LLM has already generated it is the primary focus of most current hallucination research. Tools like RefChecker and Hallucination Guard aim to highlight or score suspicious outputs, helping users filter or correct the hallucinated results.

A new perspective reinterprets the issue, suggesting that hallucinations are compression artifacts rather than “bugs.” During operation, large language models decompress information that was previously compressed into their parameters. Similar to how a corrupted ZIP file produces garbage when unzipped, the model fills in gaps with plausible but fake content when its “information budget” is limited.3

LLMs optimize average-case efficiency, which can lead to occasional systematic hallucinations. The Expectation-level Decompression Law (EDFL) defines the information thresholds needed to prevent hallucinations in LLMs. The open-source Hallucination Risk Calculator enables pre-generation risk assessment, error-bound setting, context evaluation, and SLA-style guarantees, each of which is useful in regulated fields. It can be used with any OpenAI-compatible API.


Principal Analyst
Cem Dilmegani
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per SimilarWeb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He also led the commercial growth of the deep tech company Hypatos, which reached seven-figure annual recurring revenue and a nine-figure valuation from zero within two years. Cem's work at Hypatos was covered by leading technology publications such as TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Aleyna Daldal
Aleyna is an AIMultiple industry analyst. Her previous work included developing deep learning algorithms for materials informatics and particle physics.

Comments 4

Abraham
Aug 25, 2025 at 11:57

This article is updated in June while the GPT 5 is announced in August. How did you test GPT 5 in AI Hallucination Rates figure

Aleyna Daldal
Sep 05, 2025 at 08:46

Hi! Thanks for your comment. We use WordPress for our articles, which allows us to update graphs and tables independently of the main text. This means that even if the article text shows an earlier update date, we can still add the latest results to the figures without altering the written sections.

Rui
Aug 08, 2025 at 20:31

Hi Cem, I've been using this article as a reference of severity of hallucination. Is it possible to refresh the report with the newly released GPT-5? Thanks!

Aleyna Daldal
Sep 05, 2025 at 08:48

Hi Rui, Thanks a lot for your interest and for using our article as a reference. We’ve already refreshed the report with GPT-5 results, so you’ll find the latest updates included in the article.

Tim
Jul 19, 2025 at 10:13

Is there any chance that you might add Claude Sonnet/Opus 4 as well as Gemini 2.5 Pro?

Aleyna Daldal
Sep 05, 2025 at 08:48

Hi Tim, Thank you for your support and suggestion. Claude Sonnet/Opus 4 and Gemini 2.5 Pro have already been added to the article, so you can now see them included in the comparisons.

Joon
Feb 28, 2025 at 16:29

Hi, thank you for interesting benchmark! I was wondering Grok3's hallucination rate, both in Think mode and without. Are you planning to add these?

Cem Dilmegani
Mar 17, 2025 at 02:52

Hi Joon, and thank you for your comment. Yes, we are waiting for API access.