RAG (Retrieval-Augmented Generation) improves LLM responses by grounding them in external data sources. We benchmarked several embedding models and, separately, a range of chunk sizes to determine which configurations work best for RAG systems.
Explore top RAG frameworks and tools, learn what RAG is, how it works, its benefits, and its role in today’s LLM landscape.
RAG benchmark results
Embedding models
A RAG system’s performance depends heavily on the quality of its embedding model, which directly determines how accurately and effectively relevant information is retrieved.
To assess this, we evaluated the performance of 4 embedding models:
These results show that Mistral Embed achieved the highest accuracy in our benchmark, underscoring the importance of selecting the right embedding model for RAG systems.
Embeddings directly affect both the relevance of retrieved information and the accuracy of generated responses. To understand our evaluation process, see our embedding methodology.
For our detailed benchmark analysis comparing the accuracy and cost of top providers like OpenAI, Gemini, and Cohere, see our full embedding models benchmark.
Chunk size
Chunk size in RAG systems determines how large the text segments are when they are divided for processing. These segments are then converted into vectors by embedding models and stored in a vector database. When a question is posed, the model retrieves the most relevant segments from the vector database and generates a response based on this information.
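As an illustration, here is a minimal chunking sketch in Python. It assumes the tiktoken tokenizer and treats the embedding model and vector database as placeholders (`embed`, `vector_db`); the exact APIs in a production pipeline will differ.

```python
import tiktoken  # OpenAI's tokenizer library

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into token-based chunks with a small overlap between neighbors."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks

# Each chunk would then be embedded and upserted into a vector database.
# `embed`, `vector_db`, and `article_text` are hypothetical placeholders here.
# for i, chunk in enumerate(chunk_text(article_text)):
#     vector_db.upsert(id=f"doc1-{i}", vector=embed(chunk), metadata={"text": chunk})
```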
Choosing the right combination of chunk size and embedding model is essential to balance retrieval precision and overall system efficiency:
The benchmark results highlight the role of chunk size in RAG systems: it determines how text is segmented and therefore the quality of the retrieved information, so it must be tuned to keep the system both efficient and accurate.
The results indicate that a chunk size of 512 tokens generally delivers the best performance, balancing retrieval precision and efficiency.
In the chunk size benchmark, we used:
- Embedding model: OpenAI text-embedding-3-small
- Vector database: Pinecone
RAG chunk size benchmark methodology
This study was specifically designed to evaluate the performance of Retrieval-Augmented Generation (RAG) systems. To test RAG’s ability to retrieve and generate accurate and relevant information from a vector database, we prepared a dataset based on CNN News articles and formulated questions. The tests focused on examining the impact of critical parameters such as chunk size and embedding models.
- CNN News articles were loaded into a vector database. This database served as the knowledge source for the LLM, ensuring that the model-generated responses were solely based on the provided data.
- Each response generated by the LLM was compared against the ground truth in the source articles. This comparison was performed automatically using an accuracy evaluation system, with the accuracy rate calculated based on the exact match between the responses and the article data.
RAG vs. Context Window
RAG retrieves only the external data relevant to each query, while the long-context approach feeds the model a large block of text in a single prompt. As context windows expand to millions of tokens, some question whether RAG will still be necessary, yet our results show it continues to offer clear accuracy advantages.
We benchmarked RAG against a long-context-window approach:
For context window:
We used Llama 4 Scout’s native context length.
For RAG:
- LLM: Llama 4 Scout
- Vector database: Pinecone
- Embedding model: OpenAI text-embedding-3-large
- Chunk size: 512
Potential reasons behind the performance difference between RAG and the context window approach
Accuracy
RAG achieved higher accuracy because it acts as a strict filter, removing roughly 99% of the irrelevant text before the LLM processes it. This discriminative, hard-attention-like approach compels the model to focus solely on the relevant facts, reducing noise and markedly improving accuracy.
Attention drift
The long-context approach performed worse due to the “lost in the middle” phenomenon, where the LLM’s attention dilutes over very long inputs. The model struggles to prioritize a single relevant fact when it is buried among tens of thousands of tokens of unrelated text.
Why RAG remains effective
RAG systems leverage external knowledge bases like vector databases to retrieve only the most relevant information for a given query. Because the data was segmented into chunks and embedded, Llama 4 Scout could focus on high-quality, contextually relevant data rather than processing an entire lengthy context.
This avoids the clutter of irrelevant data that often overwhelms models in long-context scenarios. RAG helps the model maintain clarity and deliver more accurate responses by focusing on smaller, targeted inputs.
In long context lengths, models often struggle to process and prioritize information effectively, leading to diminished performance.1
Can long context windows replace RAG?
Long context windows can process large datasets in one go. Still, their practical downsides, such as performance drops and computational inefficiency, make RAG a more dependable option for tasks needing high accuracy.
RAG systems address these challenges through tunable parameters such as chunk size and embedding model, balancing efficiency and effectiveness. A context window is limited to whatever fits within the model’s token budget, whereas RAG retrieves relevant external information to enhance response quality. This makes RAG better suited for tasks needing up-to-date or domain-specific knowledge that exceeds the model’s internal training data.
While context windows can work for simpler tasks within the model’s token limit, RAG is more effective when external knowledge is required.
Methodology for RAG vs. context window benchmark
We evaluated the performance of Llama 4 Scout using two approaches: RAG and a long context window. For RAG, we integrated Llama 4 Scout with Pinecone as the vector database, using OpenAI’s text-embedding-3-large model for embeddings and a chunk size of 512.
For the context window approach, we relied solely on Llama 4 Scout’s native context length without external retrieval. Both methods were evaluated using our previously mentioned dataset, with accuracy calculated as the percentage of correct responses to a set of queries.
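The sketch below shows the general shape of such a comparison, not our exact harness. It uses OpenAI’s embeddings endpoint with an in-memory cosine-similarity store in place of Pinecone, and `generate_answer` and `is_correct` are hypothetical helpers standing in for the LLM call and the exact-match grading step.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the query embedding."""
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def compare(questions, answers, chunks, full_corpus):
    chunk_vecs = embed(chunks)
    rag_hits = ctx_hits = 0
    for q, gold in zip(questions, answers):
        # generate_answer / is_correct are placeholders for the LLM call and grading.
        rag_answer = generate_answer(q, context="\n".join(retrieve(q, chunks, chunk_vecs)))
        ctx_answer = generate_answer(q, context=full_corpus)  # whole corpus in the prompt
        rag_hits += is_correct(rag_answer, gold)
        ctx_hits += is_correct(ctx_answer, gold)
    return rag_hits / len(questions), ctx_hits / len(questions)
```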
Why is RAG important now?
The importance of Retrieval-Augmented Generation (RAG) has increased in recent years due to the growing need for AI systems that provide accurate, transparent, and contextually relevant responses. However, many business leaders may not yet know the term, as RAG is a relatively new area (see the figure below).
As businesses and developers seek to overcome the limitations of traditional Large Language Models (LLMs), such as outdated knowledge, lack of transparency, and hallucinated outputs, RAG has emerged as a critical solution.
What are the available RAG models and tools?
Retrieval-Augmented Generation (RAG) models and tools can be divided into three categories:
- LLMs with Built-in RAG Capabilities, which enhance response accuracy by accessing external knowledge.
- RAG libraries and frameworks that can be applied to LLMs for custom implementations.
- Components, such as integration frameworks, vector databases, and retrieval models, that can be combined with each other or with large language models (LLMs) to build RAG systems.
LLMs with Built-in RAG Capabilities
Several LLMs now feature native RAG functionality to enhance their accuracy and relevance by retrieving external knowledge.
- Meta AI: The RAG model from Meta AI integrates retrieval and generation within a single framework, using Dense Passage Retrieval (DPR) for the retrieval process and BART for generation. This model is available on Hugging Face for knowledge-intensive tasks.
- Anthropic’s Claude: Includes a Citations API for models like Claude 3.5 Sonnet and Haiku, enabling source referencing.
- Mistral’s SuperRAG 2.0: This model offers retrieval with integration into Mistral 8x7B v1.
- Cohere’s Command R: Optimized for RAG with multilingual support and citations, accessible via API or Hugging Face model weights.
- Gemini Embedding: Google’s Gemini embedding model for RAG.
- Mistral Embed: Mistral’s embedding model complements its LLM offerings by producing dense vector embeddings optimized for RAG tasks.
- OpenAI Embeddings: OpenAI offers several embedding models, such as text-embedding-3-large, text-embedding-3-small, and text-embedding-ada-002, each suited to different use cases in natural language processing tasks like retrieval-augmented generation.
RAG Libraries and Frameworks
These tools enable developers to add RAG capabilities to existing LLMs, providing flexibility and scalability.
- Haystack: An end-to-end framework by Deepset for building RAG pipelines, focused on document search and question answering.
- LlamaIndex: Specializes in data ingestion and indexing, enhancing LLMs with retrieval systems.
- Weaviate: A vector database with RAG features, supporting scalable search and retrieval workflows.
- DSPy: A declarative programming framework for building and optimizing RAG pipelines with large language models.
- Pathway: A framework for deploying RAG at scale with data connectivity.
- Azure Machine Learning: Provides RAG capabilities through Azure AI Studio and Machine Learning pipelines.
- IBM watsonx.ai: Provides frameworks for developing applications that facilitate the implementation of RAG with large language models.
For a more detailed comparison and analysis, see our RAG frameworks benchmark.
Integration Frameworks for RAG
Integration frameworks streamline the development of context-aware, reasoning-enabled applications powered by LLMs. They offer modular components and pre-configured chains tailored to specific needs while allowing customization.
- LangChain: A framework for creating context-aware applications, commonly used with RAG and LLMs.
- Dust: Facilitates custom AI assistant creation with semantic search and RAG support, enhancing LLM applications.
Users can pair these frameworks with vector databases to fully implement RAG, boosting the contextual depth of LLM outputs.
Vector Databases for RAG
Vector databases (VDs) store and search high-dimensional vector representations of data, such as patient symptoms, blood test results, behaviors, and health metrics, making them vital for RAG systems.
- Deep Lake: A data lake optimized for LLMs, supporting vector storage and integration with tools like LlamaIndex.
- Pinecone: A managed vector database service for RAG setups.
- Weaviate: Combines vector storage with RAG-ready features for retrieval.
- Milvus: An open-source vector database for AI use cases.
- Qdrant: A vector search engine for similarity search.
- Zep Vector Store: An open-source platform that supports a document vector store, where you can upload, embed, and search through documents for RAG.
Other Retrieval Models Supporting RAG
Since RAG leverages sequence-to-sequence and retrieval techniques like DPR, developers can combine these models with LLMs to enable retrieval-augmented generation.
- BART with Retrieval: Integrates BART’s generative power with retrieval mechanisms for RAG.
- BM25: A traditional term-frequency-based retrieval algorithm, widely used for its simplicity.
- ColBERT Model: A retrieval model based on BERT (Bidirectional Encoder Representations from Transformers) that uses late interaction over token-level embeddings, combining fine-grained term matching with the efficiency of dense retrieval.
- DPR (Dense Passage Retrieval) Model: A model used for information retrieval tasks, particularly in the domain of question answering (QA) and search systems.
What is retrieval-augmented generation?
In 2020, researchers at Meta (then Facebook AI Research) introduced RAG models to give generative models more precise access to knowledge. Lewis and colleagues describe RAG as a general-purpose fine-tuning recipe that combines a pre-trained parametric-memory generation model with a non-parametric memory (a dense vector index of external documents).
In simple terms, retrieval-augmented generation (RAG) is a natural language processing (NLP) approach that combines elements of both retrieval and generation models to improve the quality and relevance of generated content. It is a hybrid approach that leverages the strengths of both techniques to address the limitations of purely generative or purely retrieval-based methods.
How do RAG models work?
A RAG system operates in two phases: retrieval and content generation.
In the retrieval phase:
Algorithms search for and retrieve relevant snippets of information based on the user’s prompt or question, using techniques like BM25 (a minimal BM25 sketch follows the list below). The retrieved information becomes the basis for generating coherent and contextually relevant responses.
- In open-domain consumer settings, these facts can be sourced from indexed documents on the internet. In closed-domain enterprise settings, a more restricted set of sources is typically used to enhance the security and reliability of internal knowledge. For example, the RAG system can look for:
- Current contextual factors, such as real-time weather updates and the user’s precise location
- User-centric details, such as their previous orders on the website, their interactions with it, and their current account status
- Relevant factual data in retrieved documents that are either private or were updated after the LLM’s training process.
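As referenced above, a sparse retriever such as BM25 can perform this first-pass search. A minimal sketch using the rank_bm25 package (one common implementation; the exact API may vary by version):

```python
from rank_bm25 import BM25Okapi

documents = [
    "The central bank raised interest rates by 25 basis points on Tuesday.",
    "A new weather system will bring heavy rain to the coast this weekend.",
    "The company reported record quarterly revenue driven by cloud services.",
]

# BM25 operates over tokenized text; simple whitespace tokenization for illustration.
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query = "why did interest rates go up"
scores = bm25.get_scores(query.lower().split())

# The highest-scoring passage becomes the retrieval context for the generator.
top_doc = documents[max(range(len(documents)), key=lambda i: scores[i])]
```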
In the content generation phase:
- After the relevant passages are retrieved, a generative language model, such as a transformer-based model like GPT, takes over. It uses the retrieved context to generate natural language responses, which can be further conditioned or fine-tuned on the retrieved content to ensure contextual accuracy. The system may also include links or references to the sources it consulted for transparency and verification.
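A hedged sketch of how retrieved passages are typically folded into the generation prompt; the prompt template and citation format here are illustrative, not a fixed standard.

```python
def build_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a grounded prompt: numbered sources, then the user's question."""
    sources = "\n".join(
        f"[{i + 1}] ({p['url']}) {p['text']}" for i, p in enumerate(passages)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources by number, and say 'I don't know' if they are insufficient.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    {"url": "https://example.com/refunds", "text": "Refunds are processed within 5 business days."},
]
prompt = build_prompt("How long do refunds take?", passages)
# `prompt` is then sent to the generator LLM (GPT, Llama, etc.) for the final answer.
```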
RAG LLMs use two systems to obtain external data:
- Vector database: Vector databases help find relevant documents using similarity searches. They can either work independently or be part of the LLM application.
- Feature stores: These are systems or platforms to manage and store structured data features used in machine learning and AI applications. They provide organized and accessible data for training and inference processes in machine learning models like LLMs.
What is retrieval-augmented generation in large language models?
RAG addresses several challenges faced by large language models (LLMs). The main problems include:
- Limited knowledge access and manipulation: LLMs struggle to keep their world knowledge up to date, since retraining on new data is costly and infrequent, and they have limited ability to manipulate knowledge precisely. This hurts their performance on knowledge-intensive tasks, where they often fall behind task-specific architectures. For example, LLMs lack domain-specific knowledge because they are trained for generalized tasks.
- Lack of transparency: LLMs struggle to provide transparent information about how they make decisions. It is difficult to trace how and why they arrive at specific conclusions or answers, so they are often considered “black boxes”.
- Hallucinations in answers: Language models can answer questions that appear to be accurate or coherent but that are entirely fabricated or inaccurate. Addressing and reducing hallucinations is a crucial challenge in improving the reliability and trustworthiness of LLM-generated content.
What are the different types of RAG?
Speculative RAG
Speculative RAG leverages a smaller, specialized LM to draft multiple answers from different document subsets in parallel, while a larger generalist LM verifies and selects the best response. This dual-system approach enhances accuracy while reducing latency, making it ideal for high-throughput applications where both speed and accuracy matter.
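A minimal sketch of the draft-then-verify pattern, assuming hypothetical `draft_lm` (small, specialist) and `verifier_lm` (large, generalist) callables and pre-partitioned document subsets; real implementations also run the drafting step in parallel.

```python
def speculative_rag(question: str, doc_subsets: list[list[str]]) -> str:
    # 1. A small specialist LM drafts one candidate answer per document subset.
    drafts = [
        draft_lm(question=question, context="\n".join(subset))  # hypothetical call
        for subset in doc_subsets
    ]
    # 2. A larger generalist LM scores each draft against its supporting subset
    #    and the best-supported draft is returned as the final answer.
    scored = [
        (verifier_lm(question=question, draft=d, context="\n".join(s)), d)  # hypothetical call
        for d, s in zip(drafts, doc_subsets)
    ]
    best_score, best_draft = max(scored, key=lambda pair: pair[0])
    return best_draft
```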
Retrieval-Augmented Fine-Tuning (RAFT)
RAFT combines RAG with supervised fine-tuning to improve domain-specific performance. Think of it as preparing for an open-book exam: instead of just relying on external documents at query time (RAG) or memorizing everything (fine-tuning), RAFT trains the model to “study” the documents beforehand.
How it works:
- Training data includes questions, “oracle” documents (containing the answer), and “distractor” documents (irrelevant noise)
- The model learns to identify relevant information while ignoring distractors
- Chain-of-thought style responses improve reasoning quality
Consideration: Recent research suggests RAFT provides the most significant gains on older LLMs. Newer models may show more modest improvements as they have better built-in retrieval behaviors.
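A sketch of how a RAFT-style training example might be assembled, with one oracle document, sampled distractors, and a chain-of-thought target; the field names are illustrative rather than a prescribed format.

```python
import random

def build_raft_example(question: str, oracle_doc: str, corpus: list[str],
                       cot_answer: str, num_distractors: int = 3) -> dict:
    """Pack one supervised fine-tuning example: question + oracle + distractors."""
    distractors = random.sample([d for d in corpus if d != oracle_doc], num_distractors)
    context_docs = distractors + [oracle_doc]
    random.shuffle(context_docs)  # the model must learn to locate the oracle itself
    return {
        "prompt": "Documents:\n" + "\n---\n".join(context_docs) + f"\n\nQuestion: {question}",
        "completion": cot_answer,  # chain-of-thought reasoning ending in the final answer
    }
```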
Advanced RAG architectures
The RAG landscape has moved beyond the standard “Contextual” and “Speculative” types into sophisticated architectures designed for complex reasoning. The “retrieve-then-generate” baseline is being replaced by loops where the model actively converses with the retriever.
Graph-Based RAG (GraphRAG)
GraphRAG moves beyond retrieving flat text chunks. It constructs a knowledge graph where documents and entities are nodes, allowing the system to retrieve “sub-graphs” or reasoning paths rather than isolated snippets.
- How it works: Instead of ranking passages in isolation, the system identifies relationships (edges) between entities. It can traverse these connections to answer multi-hop questions (e.g., “How does the CEO of Company A relate to the supplier of Company B?”).
- Structure-Awareness: Systems like G-RETRIEVER construct minimal connected sub-graphs that encode multi-hop contexts before the LLM even sees the prompt, improving faithfulness and reducing hallucination.
- Best for: Complex reasoning tasks where relationships between data points matter more than keyword matching.
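A toy sketch of graph-based retrieval using networkx: entities are nodes, relations are edge attributes, and a multi-hop question is answered by extracting the path connecting the entities it mentions. Production GraphRAG systems build and rank such subgraphs automatically; the entities and relations below are invented for illustration.

```python
import networkx as nx

# Tiny knowledge graph: nodes are entities, edge attributes hold the relation text.
g = nx.Graph()
g.add_edge("Alice Chen", "Company A", relation="is CEO of")
g.add_edge("Company A", "Acme Metals", relation="buys components from")
g.add_edge("Acme Metals", "Company B", relation="is the main supplier of")

def subgraph_context(graph: nx.Graph, source: str, target: str) -> str:
    """Return the chain of relations along the shortest path between two entities."""
    path = nx.shortest_path(graph, source, target)
    hops = [f"{a} {graph[a][b]['relation']} {b}" for a, b in zip(path, path[1:])]
    return ". ".join(hops)

# Multi-hop question: "How does the CEO of Company A relate to the supplier of Company B?"
context = subgraph_context(g, "Alice Chen", "Company B")
# -> "Alice Chen is CEO of Company A. Company A buys components from Acme Metals. ..."
```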
Hybrid & Contextual RAG
- Contextual RAG: Enhances standard retrieval by preprocessing chunks with “contextual embeddings” or summaries that explain why a chunk is relevant, reducing retrieval failures.
- Hybrid Retrieval: Combines Dense Retrieval (semantic vectors) with Sparse Retrieval (BM25 keywords). Dense retrieval captures semantic meaning while BM25 catches exact keyword matches that semantic search might miss. This combination is now considered a best practice to mitigate retrieval failures.
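A common way to fuse the dense and sparse rankings is reciprocal rank fusion (RRF); the sketch below assumes you already have two ranked lists of document IDs, one from a vector search and one from BM25.

```python
def reciprocal_rank_fusion(dense_ranking: list[str], sparse_ranking: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked lists of doc IDs; k=60 is the commonly used RRF constant."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc7", "doc2", "doc9"]   # from the vector index (semantic match)
sparse = ["doc2", "doc4", "doc7"]  # from BM25 (exact keyword match)
fused = reciprocal_rank_fusion(dense, sparse)  # doc2 and doc7 rise to the top
```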
Agentic RAG
Agentic pipelines use an LLM controller to orchestrate multiple tools and memory banks. The agent can plan a workflow (e.g., “Retrieve financial data,” then “Use calculator tool,” then “Summarize”).
- Orchestration: Unlike linear RAG, an agentic system uses planning tokens (THOUGHT, ACTION, OBSERVATION) to decide its next move dynamically.
- Tool Use: It can hot-swap tools (e.g., switching from a dense vector index to a SQL database query) depending on the user’s intent.
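A minimal sketch of a THOUGHT/ACTION/OBSERVATION loop; `controller_llm` and the tool registry are hypothetical stand-ins for whatever planner model and tools you wire in.

```python
def run_agent(question: str, tools: dict, controller_llm, max_steps: int = 5) -> str:
    """Loop until the controller emits a final ANSWER or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = controller_llm(transcript)  # hypothetical: returns THOUGHT/ACTION/ANSWER text
        transcript += step + "\n"
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("ACTION:"):
            tool_name, _, tool_input = step.removeprefix("ACTION:").strip().partition(" ")
            observation = tools[tool_name](tool_input)  # e.g. vector_search, sql_query, calculator
            transcript += f"OBSERVATION: {observation}\n"
    return "Could not answer within the step budget."
```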
Iterative & Active RAG
These systems treat retrieval as a conversational loop rather than a one-off step. The model determines when to retrieve and what to keep.
- Active RAG (FLARE): Mechanisms like FLARE (Forward-Looking Active REtrieval) monitor the model’s confidence during generation. If the model generates low-confidence tokens, it pauses to formulate a search query and retrieve new data rather than hallucinating (see the sketch after this list). This is especially effective for long-form generation, where information needs evolve throughout the text.
- Self-RAG: The model generates “reflection tokens” (e.g., Retrieve, ISREL, ISSUP, ISUSE) to critique its own retrieved content. It evaluates whether passages are relevant, whether generated content is supported by evidence, and the overall utility of the response, deciding whether to keep, refine, or discard evidence before generating the final answer.
- Cyclic Refinement: Architectures like Chain-of-Note oblige the LLM to write concise notes on retrieved documents to assess their reliability before synthesizing an answer.
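A sketch of the FLARE idea under simplifying assumptions: the generator exposes per-token log-probabilities, and retrieval is re-triggered whenever a generated sentence contains low-confidence tokens. `generate_with_logprobs` and `retriever` are hypothetical callables.

```python
import math

CONFIDENCE_THRESHOLD = 0.7  # minimum per-token probability before we trust a sentence

def active_generate(question: str, retriever, generate_with_logprobs, max_rounds: int = 5) -> str:
    context = retriever(question)  # initial retrieval
    answer = ""
    for _ in range(max_rounds):
        # Hypothetical call: returns the next sentence and its token log-probabilities.
        sentence, logprobs = generate_with_logprobs(question, context, answer)
        if not sentence:
            break
        if min(math.exp(lp) for lp in logprobs) < CONFIDENCE_THRESHOLD:
            # Low confidence: use the tentative sentence as a lookahead query,
            # retrieve fresh evidence, and regenerate the sentence.
            context = retriever(sentence)
            sentence, _ = generate_with_logprobs(question, context, answer)
        answer += sentence + " "
    return answer.strip()
```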
How to evaluate RAG systems
Evaluating RAG is more complex than standard LLM testing because it requires assessing two distinct components: the Retriever (finding the right data) and the Generator (synthesizing the answer accurately). The research community has moved away from simple surface-level metrics (like BLEU or ROUGE) toward semantic and algorithmic evaluation frameworks that measure three core pillars: Context Relevance, Faithfulness, and Answer Relevance.
1. Component-level metrics
To diagnose performance issues, you must evaluate the retrieval and generation stages separately.
Retrieval metrics (The search phase)
If the retriever fails, the generator has no chance. Key metrics include:
- Precision@k & Recall@k: Precision measures how many of the retrieved documents are actually relevant, while Recall measures if the system found all the relevant documents available in the database.
- Mean reciprocal rank (MRR): This is critical for RAG systems where the LLM pays the most attention to the first few chunks. MRR evaluates how high up the list the first relevant document appears.
- Normalized discounted cumulative gain (nDCG): Unlike binary hit/miss metrics, nDCG accounts for graded relevance, rewarding systems that place the most useful documents at the very top of the context window.
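These retrieval metrics are straightforward to compute once you know which retrieved document IDs are relevant; a minimal sketch of the standard formulas:

```python
import math

def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k, hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved

def ndcg_at_k(retrieved: list[str], relevance: dict[str, int], k: int) -> float:
    """relevance maps doc id -> graded relevance (0 = irrelevant, higher = better)."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# MRR over a query set is the mean of reciprocal_rank across all queries.
```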
Generation metrics (The answer phase)
- Faithfulness (Groundedness): Measures whether the generated answer is derived exclusively from the retrieved context. This is the primary metric for detecting hallucinations; if the model adds information not present in the source, faithfulness drops.
- Answer relevance: Assesses whether the response actually addresses the user’s query, ensuring the model isn’t just summarizing the context without answering the specific question.
- Negative rejection: A critical safety metric that tests the system’s ability to say “I don’t know” when the retrieved context does not contain the answer, rather than hallucinating a plausible-sounding falsehood.
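Faithfulness and negative rejection are usually scored with an LLM-as-a-judge prompt rather than string matching. A hedged sketch, where `judge_llm` is a placeholder for whatever strong model you use as the judge and the prompt wording is illustrative:

```python
FAITHFULNESS_PROMPT = """You are grading a RAG answer.
Context:
{context}

Answer:
{answer}

List every claim in the answer. For each claim, state whether it is supported by the context.
Finish with a single line: SCORE: <number of supported claims> / <total claims>."""

def faithfulness_score(context: str, answer: str, judge_llm) -> float:
    verdict = judge_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    supported, total = verdict.rsplit("SCORE:", 1)[1].split("/")
    return int(supported.strip()) / int(total.strip())

# Negative rejection can be tested the same way: feed context that lacks the answer
# and check that the system's response is a refusal rather than a fabricated claim.
```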
2. Automated evaluation frameworks
Relying solely on human evaluation is slow and expensive. The industry standard has shifted to “LLM-as-a-judge” frameworks, where a strong model evaluates the outputs of your RAG pipeline.
- RAGAS (Reference-Free Evaluation): RAGAS leverages language models under the hood to judge the quality of responses without needing human-labeled “gold standard” answers. It provides a comprehensive set of metrics including Context Precision, Context Recall, Faithfulness, and Answer Relevance. RAGAS is highly operationally efficient and scalable, though it can be sensitive to the specific prompts used for evaluation.
- ARES (Automated RAG Evaluation System): ARES finetunes lightweight LM judges using synthetic training data to assess context relevance, answer faithfulness, and answer relevance. It uses Prediction-Powered Inference (PPI) with a small set (~150+) of human-annotated datapoints to generate confidence intervals. While ARES offers higher precision and remains effective across domain shifts, it requires more setup compared to RAGAS.
3. Advanced benchmarking
Beyond basic accuracy, advanced benchmarks test specific failure modes:
- Noise robustness: Can the model filter out irrelevant documents mixed into the context window?
- Information integration: Can the model synthesize an answer that requires combining clues from multiple distinct documents (multi-hop reasoning)?
- Counterfactual robustness: Can the model identify and correct errors when the retrieved information conflicts with its internal parametric knowledge (or vice versa)?
RAG Evaluation Matrix
What are the benefits of retrieval-augmented generation?
RAG can be applied to various NLP applications, including chatbots, question-answering systems, and content generation, wherever accurate information retrieval and natural language generation are critical. Its key advantages include:
Improved relevance and accuracy
Generative AI statistics show that gen AI tools and models like ChatGPT have the potential to automate knowledge-intensive NLP tasks that make up ~70% of employees’ time. Yet ~60% of business leaders consider AI-generated content biased or inaccurate, which lowers LLM adoption.
By incorporating a retrieval component, RAG models can access external knowledge sources, ensuring the generated text is grounded in accurate and up-to-date information. This leads to more contextually relevant and accurate responses, reducing hallucinations in question answering and content generation.
Contextual coherence
Retrieval-based models provide context for the generation process, making it easier to generate coherent and contextually appropriate text. This leads to more cohesive and understandable responses, as the generation component can build upon the retrieved information.
Handling open-domain queries
RAG models excel at answering open-domain questions where the required information may not be in the training data. The retrieval component can fetch relevant information from a vast knowledge base, allowing the model to provide answers or generate content on a wide range of topics.
Reduced generation bias
Incorporating retrieval can help mitigate some inherent biases in purely generative models. By relying on existing information from a diverse range of sources, RAG models can generate less biased and more objective responses.
Efficient computation
Retrieval-based models can be computationally efficient for tasks where the knowledge base is already available and structured. Instead of generating responses from scratch, they can retrieve and adapt existing information, reducing the computational cost.
Multi-modal capabilities
RAG models can be extended to work with multiple modalities, such as text and images. This allows them to generate text that is contextually relevant to both textual and visual content, opening up possibilities for applications in image captioning, content summarization, and more.
Customization and fine-tuning
RAG models can be customized for specific domains or use cases. This adaptability makes them suitable for various applications, including domain-specific chatbots, customer support, and information retrieval systems.
Human-AI Collaboration
RAG models can assist humans in information retrieval tasks by quickly summarizing and presenting relevant information from a knowledge base, reducing the time and effort required for manual search.
Fine-Tuning vs. Retrieval-Augmented Generation
A foundation model typically acquires new knowledge through two primary methods:
- Fine-tuning: This process adjusts a pre-trained model’s weights using a task- or domain-specific training set.
- RAG: This method introduces knowledge at inference time by inserting retrieved information into the model’s context window.
Fine-tuning has been a common approach, yet it is generally not recommended for improving factual recall; it is better suited to refining a model’s performance on specialized tasks. Here is a comprehensive comparison of the two approaches:
Disclaimers
RAG is an emerging field, which is why there are few sources that can categorize these tools and frameworks. Therefore, AIMultiple relied on public vendor statements for such categorization. AIMultiple will improve this vendor list and categorization as the market grows.
RAG models and libraries listed above are sorted alphabetically on this page since AIMultiple doesn’t currently have access to more relevant metrics to rank these companies.
The vendor lists are not comprehensive.
Further reading
Discover recent developments on LLMs and LLMOps by checking out:
- LLMOPs vs MLOPs: Discover the Best Choice for You
- Comparing 10+ LLMOps Tools: A Comprehensive Vendor Benchmark
- Compare Top 20+ AI Governance Tools: A Vendor Benchmark
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Hybrid RAG: Boosting RAG Accuracy
Reference Links