Updated on Apr 10, 2025

AI Hallucination: Comparison of the Most Popular LLMs ['25]


AI models sometimes generate content that seems plausible but is incorrect or misleading, a phenomenon known as AI hallucination. According to Deloitte, 77% of the businesses that participated in its study are concerned about AI hallucinations.1

We benchmarked 13 LLMs, asking each one the same 60 questions, to measure their hallucination rates:

Results

Our benchmark revealed that OpenAI's GPT-4.5 has the lowest hallucination rate (i.e., the highest accuracy) at 15%.

Methodology

Our questions were prepared to test each LLM's ability to answer factual questions based on CNN News articles.

We used an automated web data collection system built on CNN News' RSS feed to prepare the dataset.

The questions were sent to each LLM through its API, and the accuracy of the responses was assessed by a fact-checker system that compares each answer against the ground truth (a minimal sketch of this pipeline follows the list below).

We prepared 60 questions by using these articles. The questions:

  1. Ask for specific, precise numerical values (percentages, dates, quantities)
  2. Cover diverse topics (oil prices, art history, scientific research, financial news, and more)
  3. Include questions about temporal relationships and specific statistics that would be difficult to guess accurately
  4. Require retrieving exact figures from the source material rather than making generalizations
  5. Are easily verifiable, since the answers can be checked against the exact figures mentioned in the source articles
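
For illustration, here is a minimal sketch of such a benchmarking loop, assuming an OpenAI-compatible chat API. The function names, the exact-match checker, and the prompt wording are simplified assumptions, not the actual pipeline used for this benchmark:

```python
# Minimal benchmarking sketch: query an LLM about an article and score its
# answers against ground truth with a simple exact-match check.
# Assumes an OpenAI-compatible chat API; names and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

SYSTEM_PROMPT = (
    "You are a chatbot answering questions using data. You are given the "
    "information about an article. Answer with only one word, one number, "
    "or 'Not given'. Stick strictly to the provided passage."
)

def ask_llm(article_text: str, question: str, model: str = "gpt-4o") -> str:
    """Send one benchmark question to the model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Article: {article_text}\n\nQuestion: {question}"},
        ],
        temperature=0,  # deterministic answers make exact-match scoring fairer
    )
    return response.choices[0].message.content.strip()

def is_correct(answer: str, ground_truth: str) -> bool:
    """Naive fact check: case-insensitive exact match against the ground truth."""
    return answer.strip().rstrip(".").lower() == ground_truth.strip().lower()

def hallucination_rate(dataset: list[dict], model: str) -> float:
    """dataset items: {'article': str, 'question': str, 'ground_truth': str}."""
    wrong = sum(
        not is_correct(ask_llm(item["article"], item["question"], model), item["ground_truth"])
        for item in dataset
    )
    return wrong / len(dataset)
```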

Example

Prompt: “You are a chatbot answering questions using data. You are given the information about an article. Now, I will provide you with some questions, and you will answer them with only one-word-only or one-number-only answers or Not given. You must stick to the answers provided solely by the text in the passage provided.

Article: Scientists identify secret ingredient in Leonardo da Vinci paintings 2

Question: In what century did oil painting spread to Northern Europe?”

Ground truth: Not given.

This information is not explicitly provided in the text, which only references the Middle Ages. Verifying whether responses align with the specific figures in the source articles is therefore straightforward. If a language model offers any answer other than 'Not given,' it is not adhering to the prompt and the provided passage; in other words, it is hallucinating.

What are AI hallucinations?

Hallucinations happen when an LLM produces information that seems real but is either completely made up or factually inaccurate. In contrast to straightforward mistakes, hallucinations are especially troublesome since they are presented with the same assurance as true information, making it hard for users to recognize them without outside confirmation.

The impacts of LLM hallucinations

LLM hallucinations have far-reaching effects that go well beyond small errors. Businesses using this technology face several significant risks:

Reputational damage

Customers and stakeholders lose confidence in an AI system, and in the company deploying it, when they receive inaccurate information from it. Rebuilding trust after it has been damaged is a difficult task that may take years.

Legal and compliance risks

Inaccurate information produced by an LLM can have legal ramifications, especially in regulated sectors such as healthcare, finance, and legal services. Organizations can be penalized severely if hallucinations produced by generative AI lead to violations or other negative consequences.

Operational inefficiency

Hallucinating LLMs may create more work instead of simplifying processes. The efficiency advantages that generative AI systems promise are essentially lost when workers or consumers cannot trust the results, forcing them to spend valuable time confirming information.

Causes of AI hallucinations

Creating successful mitigation methods requires an understanding of the causes of hallucinations. There are numerous contributing aspects to this phenomenon:

Training data limitations

LLMs are trained using extensive text datasets from the internet and other sources. Their outputs are impacted by:

  1. Insufficient training data for specialized domains creates knowledge gaps filled with hallucinated content.
  2. Low-quality internet data containing unreliable information directly influences model responses.
  3. Poor factual data curation leads generative artificial intelligence to prioritize fluent but deceptive outputs over accurate information.
  4. Knowledge base limitations lead models to occasionally invent facts rather than express uncertainty.
  5. Training data biases and outdated information can perpetuate errors in AI-generated content.

Knowledge cutoffs

Most LLMs have a training cutoff date, after which they have no direct knowledge of current events or emerging trends. When questioned about events beyond this cutoff, models are more likely to hallucinate than to admit their ignorance.

Overconfidence in generated content

Since LLMs are designed to produce clear, fluent text, they frequently prioritize fluency over factual correctness. Rather than communicating uncertainty, this propensity for coherence may cause models to produce information that sounds accurate but is false.

Prompt misinterpretation

Hallucinations can occasionally be caused by the model misinterpreting unclear questions or extrapolating information beyond what is specified in the prompt, leading to answers that are inconsistent with the user’s purpose.

Strategies to reduce AI hallucinations

Retrieval Augmented Generation (RAG)

RAG systems reduce inaccuracies by grounding AI responses in verified external information. When a query is received, the system retrieves data from a curated knowledge base and provides it to the language model, which generates a response based on both the prompt and the retrieved information. This ensures responses are based on verified data rather than just the model’s parameters.
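
A toy sketch of the RAG pattern is shown below. It uses TF-IDF retrieval from scikit-learn purely for illustration, whereas production systems typically rely on embedding models and vector databases; the knowledge base contents and model name are placeholder assumptions:

```python
# Toy RAG sketch: retrieve the most relevant passage from a small knowledge
# base and ground the LLM's answer in it. Illustrative only.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

knowledge_base = [
    "Oil painting spread from the Middle East to Europe during the Middle Ages.",
    "Leonardo da Vinci experimented with egg yolk in his oil paints.",
]

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by TF-IDF cosine similarity to the query."""
    vectorizer = TfidfVectorizer().fit(documents + [query])
    doc_vectors = vectorizer.transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

def answer_with_rag(question: str) -> str:
    """Generate an answer grounded in the retrieved context."""
    context = "\n".join(retrieve(question, knowledge_base))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context. "
             "If the context does not contain the answer, reply 'Not given'."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer_with_rag("What ingredient did Leonardo da Vinci add to his paints?"))
```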

Prompt engineering

Well-crafted prompts significantly reduce hallucination rates. Effective techniques include providing clear, specific instructions and relevant context, explicitly instructing the model to indicate uncertainty when appropriate, and using system prompts that prioritize accuracy over speculation.
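
As an illustration, a system prompt along these lines can be paired with every request; the exact wording and helper function below are only an example of the pattern, not a prescribed prompt:

```python
# Illustrative prompt-engineering pattern: explicit instructions that ask the
# model to admit uncertainty instead of guessing. Wording is an example only.
ACCURACY_FOCUSED_SYSTEM_PROMPT = (
    "You are a careful assistant. Answer only from the context provided by the "
    "user. If the context does not contain the answer, reply exactly 'Not given'. "
    "Do not speculate, and do not add information from outside the context."
)

def build_prompt(context: str, question: str) -> list[dict]:
    """Assemble chat messages that pair the strict system prompt with the task."""
    return [
        {"role": "system", "content": ACCURACY_FOCUSED_SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```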

External fact-checking and self-reflection

External fact-checking and self-reflection mechanisms are essential for reducing AI hallucinations. When generative artificial intelligence produces potentially misleading outputs, using advanced external sources can validate responses against trusted knowledge bases.

Modern language models now incorporate self-reflection capabilities to analyze their outputs for inconsistencies. Additionally, using independent systems to double-check responses against reliable internet data helps identify errors that might otherwise appear as hallucinated content.
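
One simple form of such a verification pass is a second model call that checks the draft answer against the source text and returns a verdict. The sketch below assumes an OpenAI-compatible API; the prompt wording and verdict labels are illustrative assumptions:

```python
# Sketch of a self-reflection / verification pass: a second model call checks
# whether a draft answer is supported by the source text. Illustrative only.
from openai import OpenAI

client = OpenAI()

def verify_answer(source_text: str, question: str, draft_answer: str) -> str:
    """Ask the model to verify a draft answer against the source; return the verdict."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a strict fact checker. Reply 'SUPPORTED' if the "
                           "answer is backed by the source text, otherwise 'UNSUPPORTED'.",
            },
            {
                "role": "user",
                "content": f"Source:\n{source_text}\n\nQuestion: {question}\n"
                           f"Draft answer: {draft_answer}",
            },
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```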

Additional techniques

Additional techniques for reducing hallucinations include:

  • Providing accurate information, references, and data to the model before generating responses
  • Implementing systematic double-check protocols for important outputs to ensure reliability
  • Using custom instructions that emphasize accuracy to prevent models from generating misleading outputs
  • Creating effective human feedback loops so that corrections from human reviewers continuously improve system performance

Communicating uncertainty to users

Communicating uncertainty to users is crucial when AI systems encounter limitations. Some effective practices are:

  1. The use of intentionally uncertain language helps set appropriate expectations and reduces misleading outputs that could spread inaccurate information.
  2. By integrating confidence indicators, models can signal when they lack a confident answer. This transparency, recommended in recent technology review publications, prevents users from taking AI-generated content at face value.
  3. Highlighting specific textual elements that influenced the model’s response helps users understand the reasoning behind uncertain outputs, while displaying confidence ratings enables better evaluation of reliability.
  4. When handling complex problems, presenting multiple sources encourages users to independently verify claims rather than relying solely on AI outputs that might contain hallucinated content.

These approaches, validated through extensive human feedback, create a more honest relationship between users and generative AI models by acknowledging when knowledge base limitations might lead to potential hallucinations.
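
As a rough sketch of how uncertainty can be surfaced in practice, a response object can carry a confidence score and its supporting sources alongside the answer. The field names, threshold, and example values below are illustrative assumptions, not a standard format:

```python
# Sketch of a structured response that surfaces uncertainty to users: the
# answer travels with a confidence score and its supporting sources.
from dataclasses import dataclass, field

@dataclass
class GroundedAnswer:
    answer: str
    confidence: float          # 0.0-1.0, e.g. from a verifier model or log-probabilities
    sources: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the answer with hedging language when confidence is low."""
        prefix = "" if self.confidence >= 0.8 else "I'm not certain, but "
        cited = f" (sources: {', '.join(self.sources)})" if self.sources else ""
        return f"{prefix}{self.answer}{cited}"

print(GroundedAnswer("Egg yolk", 0.65, ["CNN article on Leonardo da Vinci paintings"]).render())
```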

FAQ

What are the best practices when using large language models?

AI tools can generate false or misleading outputs. To reduce AI hallucinations, users should double-check answers and ask clear, specific questions. Factually incorrect information in AI-generated text can lead to serious consequences, especially in areas like scientific writing and legal research.

Why do AI systems hallucinate?

Publications have identified several causes of AI hallucinations. When generative artificial intelligence systems like large language models produce factually incorrect outputs, it’s often due to insufficient training data or reliance on outdated factual data. Research shows that previous methods for creating knowledge base systems didn’t adequately prevent models from generating hallucinated references or inaccurate information when processing internet data for answering complex problems.

Why should we fact-check AI outputs?

AI-generated content often lacks verification against external sources, leading to misleading outputs. Generative models struggle with topics outside their training corpus and can invent plausible-sounding facts that fail expert verification.
While valuable in areas like legal research, AI systems can produce inaccuracies, especially with low-traffic subjects or adversarial attacks. Models may confuse correlation with causation, and even accurate outputs can include fabrications, highlighting the need for fact-checking against trustworthy sources. This issue persists due to inadequate review standards for how models process data.


Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Aleyna Daldal
Aleyna is an AIMultiple industry analyst. Her previous work included developing deep learning algorithms for materials informatics and particle physics.


Comments


1 Comment
Joon
Feb 28, 2025 at 16:29

Hi, thank you for interesting benchmark! I was wondering Grok3’s hallucination rate, both in Think mode and without. Are you planning to add these?

Cem Dilmegani
Mar 17, 2025 at 02:52

Hi Joon and thank you for your comment,
Yes, we are waiting for API access.
