AI models sometimes generate output that seems plausible but is incorrect or misleading, a phenomenon known as AI hallucination. According to a Deloitte survey, 77% of participating businesses are concerned about AI hallucinations.1
We benchmarked 13 LLMs, asking each one the same 60 questions, to measure their hallucination rates:
Results
Our benchmark revealed that OpenAI GPT-4.5 has the lowest hallucination rate, 15% (i.e., the highest accuracy among the models tested).
Methodology
Our questions were prepared from CNN News articles to test each LLM's ability to stick to source material.
We used an automated web data collection system based on CNN News' RSS feed to prepare the dataset.
The questions were sent to each LLM via its API, and accuracy was evaluated with a fact-checker system that compares each model's answer to the ground truth.
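For illustration, a minimal sketch of such a benchmarking loop is shown below, assuming a questions.json file with "article", "question", and "ground_truth" fields and the OpenAI Python SDK; the model name, file layout, and prompt wording are assumptions, not the exact harness we used.

```python
# Minimal sketch of the benchmark loop, assuming a questions.json file with
# "article", "question", and "ground_truth" fields and the official OpenAI
# Python SDK; model name and prompt wording are illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a chatbot answering questions using data. Answer with only one "
    "word, one number, or 'Not given', based solely on the provided passage."
)

def ask(model: str, article: str, question: str) -> str:
    """Send one benchmark question to one model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Article: {article}\n\nQuestion: {question}"},
        ],
        temperature=0,  # deterministic answers make fact-checking easier
    )
    return response.choices[0].message.content.strip()

with open("questions.json") as f:
    dataset = json.load(f)

answers = [ask("gpt-4o", item["article"], item["question"]) for item in dataset]
```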
We prepared 60 questions from these articles. The questions:
- Ask for specific, precise numerical values (percentages, dates, quantities)
- Cover diverse topics (oil prices, art history, scientific research, financial news, and more)
- Include questions about temporal relationships and specific statistics that would be difficult to guess accurately
- Require retrieving exact figures from the source material rather than making generalizations
- Are easily verifiable, since answers can be checked against the exact figures mentioned in the source articles
Example
Prompt: “You are a chatbot answering questions using data. You are given the information about an article. Now, I will provide you with some questions, and you will answer them with only one-word-only or one-number-only answers or Not given. You must stick to the answers provided solely by the text in the passage provided.
Article: Scientists identify secret ingredient in Leonardo da Vinci paintings 2
Question: In what century did oil painting spread to Northern Europe?”
Ground truth: Not given.
This information is not explicitly provided in the text, which only references the Middle Ages. Verifying whether responses align with the specific figures in the source articles is therefore straightforward: if a model offers any answer other than ‘Not given,’ it is not adhering to the prompt and the provided passage, i.e., it is hallucinating.
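As a hedged illustration of how straightforward this verification is, the sketch below normalizes answers and treats any answer other than 'Not given' as a hallucination when the ground truth is 'Not given'; the normalization rules are assumptions, and a real fact-checker may use fuzzier matching.

```python
# Illustrative fact-check: normalize both strings and compare exactly.
# A real checker may allow fuzzier matching; this is an assumption.
def is_hallucination(answer: str, ground_truth: str) -> bool:
    """Return True when the model's answer does not match the ground truth."""
    norm = lambda s: s.strip().lower().rstrip(".")
    if norm(ground_truth) == "not given":
        # Any concrete answer here means the model went beyond the passage.
        return norm(answer) != "not given"
    return norm(answer) != norm(ground_truth)

# The example above: the passage only mentions the Middle Ages.
print(is_hallucination("15th", "Not given"))       # True  -> hallucination
print(is_hallucination("Not given", "Not given"))  # False -> faithful answer
```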
What are AI hallucinations?
Hallucinations happen when an LLM produces information that seems real but is either completely made up or factually inaccurate. In contrast to straightforward mistakes, hallucinations are especially troublesome since they are presented with the same assurance as true information, making it hard for users to recognize them without outside confirmation.
The impacts of LLM hallucinations
LLM hallucinations have far-reaching effects that go well beyond small errors. Businesses using this technology face several significant risks:
Reputational damage
When customers or stakeholders receive inaccurate information from an AI system, they lose confidence both in the system and in the company deploying it. Rebuilding that trust is difficult and may take years.
Legal liability
Inaccurate information produced by an LLM could have legal ramifications, especially in regulated sectors such as healthcare, finance, and legal services. Organizations can face severe penalties if hallucinations generated by AI lead to compliance violations or harmful outcomes.
Operational inefficiency
Hallucinating LLMs can create more work instead of simplifying processes. When employees or customers cannot trust the output, they must spend valuable time verifying it, eroding the efficiency gains that generative AI promises.
Causes of AI hallucinations
Designing effective mitigation methods requires understanding why hallucinations occur. Several factors contribute to the phenomenon:
Training data limitations
LLMs are trained using extensive text datasets from the internet and other sources. Their outputs are impacted by:
- Insufficient training data in specialized domains creates knowledge gaps that get filled with hallucinated content.
- Low-quality internet data containing unreliable information directly influences model responses.
- Proper curation of factual data is crucial for generative AI that prioritizes accurate information over fluent but deceptive outputs.
- Knowledge base limitations lead models to occasionally invent facts rather than express uncertainty.
- Training data biases and outdated information can perpetuate errors in AI-generated content.
Knowledge cutoffs
Most LLMs have a training cutoff date, after which they have no direct knowledge of current events or emerging trends. When questioned about events beyond this cutoff, models are more likely to hallucinate than to admit their ignorance.
Overconfidence in generated content
Because LLMs are designed to produce clear, fluent text, they frequently prioritize fluency over factual correctness. Rather than communicating uncertainty, this propensity for coherence can lead models to produce information that sounds accurate but is false.
Prompt misinterpretation
Hallucinations can also occur when the model misinterprets ambiguous questions or extrapolates beyond what is specified in the prompt, producing answers that are inconsistent with the user's intent.
Strategies to reduce AI hallucinations
Retrieval Augmented Generation (RAG)
RAG systems reduce inaccuracies by grounding AI responses in verified external information. When a query is received, the system retrieves data from a curated knowledge base and provides it to the language model, which generates a response based on both the prompt and the retrieved information. This ensures responses are based on verified data rather than just the model’s parameters.
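Below is a minimal sketch of this pattern, assuming the OpenAI Python SDK, a toy in-memory knowledge base, and cosine similarity over embeddings; the model names, snippets, and retrieval logic are illustrative, not a production RAG pipeline.

```python
# Minimal RAG sketch: retrieve the most relevant snippet from a small
# in-memory knowledge base and ground the answer in it. Model names and
# the toy knowledge base are assumptions for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

knowledge_base = [
    "A new study suggests Old Masters such as Leonardo added egg yolk to oil paint.",
    "Oil painting techniques were already in use during the Middle Ages.",
]

def embed(texts: list[str]) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in result.data])

doc_vectors = embed(knowledge_base)

def answer_with_rag(question: str) -> str:
    # Retrieve the snippet whose embedding is closest to the question.
    q_vec = embed([question])[0]
    scores = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = knowledge_base[int(scores.argmax())]
    # Generate an answer grounded only in the retrieved context.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer only from the provided context. "
                           "If the context does not contain the answer, say 'Not given'.",
            },
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer_with_rag("What secret ingredient did Leonardo use in his paint?"))
```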
Prompt engineering
Well-crafted prompts significantly reduce hallucination rates by giving the model clear, specific instructions. Effective techniques include providing relevant context, explicitly instructing the model to indicate uncertainty when appropriate, and using system prompts that prioritize accuracy over speculation.
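As a hedged example, the template below shows one way to phrase an accuracy-first system prompt; the exact wording is illustrative rather than a prescribed formula.

```python
# Illustrative accuracy-first prompt template; the wording is an assumption,
# not a prescribed formula.
SYSTEM_PROMPT = """You are a careful assistant.
- Answer only from the context the user provides.
- If the context does not contain the answer, reply exactly: "Not given".
- Prefer "I am not sure" over guessing numbers, dates, or names.
- Quote the sentence from the context that supports your answer."""

# {article_text} is a placeholder to be filled in before sending the prompt.
USER_TEMPLATE = (
    "Context: {article_text}\n\n"
    "Question: In what century did oil painting spread to Northern Europe?"
)
```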
External fact-checking and self-reflection
External fact-checking and self-reflection mechanisms are essential for reducing AI hallucinations. When a generative AI system produces a potentially misleading output, external fact-checking tools can validate the response against trusted knowledge bases.
Modern language models now incorporate self-reflection capabilities to analyze their outputs for inconsistencies. Additionally, using independent systems to double-check responses against reliable internet data helps identify errors that might otherwise appear as hallucinated content.
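A minimal sketch of such a self-check pass is shown below: a second model call judges whether every claim in a draft answer is supported by the source before the draft is returned. The judge prompt, model name, and "Not given" fallback are assumptions for illustration.

```python
# Sketch of a self-check pass: a second call judges whether the draft answer
# is supported by the source before it is returned. The judge prompt, model
# name, and "Not given" fallback are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def verified_answer(source: str, question: str) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Source: {source}\n\nQuestion: {question}"}],
        temperature=0,
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": ("Does every claim in the ANSWER appear in the SOURCE? "
                               "Reply with exactly one word, SUPPORTED or UNSUPPORTED.\n\n"
                               f"SOURCE: {source}\n\nANSWER: {draft}")}],
        temperature=0,
    ).choices[0].message.content

    # Only return the draft if the judge found it fully supported.
    return draft if verdict.strip().upper().startswith("SUPPORTED") else "Not given"
```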
Additional techniques
Additional techniques for reducing hallucinations include:
- Providing accurate information, references, and data to the model before it generates responses
- Implementing systematic double-check protocols for important outputs (a sketch follows this list)
- Using custom instructions that emphasize accuracy, to prevent models from generating confusing or misleading outputs
- Creating human feedback loops so that reviewer corrections continuously improve system performance
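As a sketch of the double-check protocol mentioned above, the snippet below queues important outputs for human review in a CSV file and reads back reviewer corrections so they can be reused; the file format and fields are assumptions.

```python
# Sketch of a double-check protocol: important outputs are queued for human
# review in a CSV file and reviewer corrections are read back for reuse.
# The file layout and fields are assumptions for illustration.
import csv
from datetime import datetime, timezone

REVIEW_QUEUE = "review_queue.csv"

def queue_for_review(question: str, model_answer: str) -> None:
    """Append an answer that needs human verification (empty correction column)."""
    with open(REVIEW_QUEUE, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), question, model_answer, ""]
        )

def load_corrections() -> dict[str, str]:
    """Read back reviewer corrections (last column) to reuse as ground truth."""
    corrections: dict[str, str] = {}
    try:
        with open(REVIEW_QUEUE, newline="") as f:
            for _timestamp, question, _answer, corrected in csv.reader(f):
                if corrected:
                    corrections[question] = corrected
    except FileNotFoundError:
        pass
    return corrections
```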
Communicating uncertainty to users
Communicating uncertainty to users is crucial when AI systems encounter limitations. Some effective practices are:
- Using appropriately hedged language helps set expectations and reduces the risk that misleading outputs spread inaccurate information.
- Integrating indicators for potentially incorrect or low-confidence content lets models signal when they lack a confident answer. This transparency prevents users from taking AI-generated content at face value.
- Highlighting the specific textual elements that influenced the model's response helps users understand the reasoning behind uncertain outputs, while displaying confidence ratings enables better evaluation of reliability (see the sketch at the end of this section).
- When handling complex problems, presenting multiple sources encourages users to independently verify claims rather than relying solely on AI outputs that might contain hallucinated content.
These approaches, refined through human feedback, create a more honest relationship between users and generative AI models by acknowledging when knowledge limitations might lead to hallucinations.
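One possible way to surface this information in an application is sketched below: the answer is bundled with a confidence label and the passages that support it so the interface can display them together. The fields and rendering are illustrative assumptions, not a standard API.

```python
# Illustrative way to surface uncertainty in an application: bundle the answer
# with a confidence label and the passages that support it. The fields and
# rendering are assumptions, not a standard API.
from dataclasses import dataclass

@dataclass
class QualifiedAnswer:
    text: str
    confidence: str              # e.g. "high", "medium", "low"
    supporting_passages: list[str]

    def render(self) -> str:
        sources = "\n".join(f"  - {p}" for p in self.supporting_passages)
        return (
            f"{self.text}\n"
            f"Confidence: {self.confidence}\n"
            f"Based on:\n{sources}\n"
            "Please verify important details against the sources above."
        )

print(QualifiedAnswer(
    text="The study suggests egg yolk was added to the oil paint.",
    confidence="medium",
    supporting_passages=["CNN: An Old Master's secret ingredient? Egg yolk, new study suggests"],
).render())
```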
FAQ
What are the best practices when using large language models?
AI tools can generate false or misleading outputs. To limit the impact of AI hallucinations, users should double-check answers and ask clear, specific questions. Factually incorrect AI-generated text can lead to serious consequences, especially in areas like scientific writing and legal research.
Why do AI systems hallucinate?
Researchers have identified several causes of AI hallucinations. When generative AI systems such as large language models produce factually incorrect outputs, it is often due to insufficient training data or reliance on outdated information. Models drawing on uncurated internet data can also generate hallucinated references or inaccurate information when answering complex questions.
Why should we fact-check AI outputs?
AI-generated content often lacks verification against external sources, leading to misleading outputs. Generative models struggle with topics outside their training corpus and can invent plausible-sounding facts that fail expert verification.
While valuable in areas like legal research, AI systems can produce inaccuracies, especially on sparsely covered subjects or under adversarial attacks. Models may confuse correlation with causation, and even largely accurate outputs can include fabrications, highlighting the need for fact-checking against trustworthy sources. This issue persists because review standards for how models process data remain inadequate.
External Links
- 1. Managing gen AI risks. Deloitte Insights.
- 2. An Old Master’s secret ingredient? Egg yolk, new study suggests. CNN.