Updated on Aug 9, 2025

Hybrid RAG: Boosting RAG Accuracy in 2025


Dense vector search is excellent at capturing semantic intent, but it often struggles with queries that demand high keyword accuracy. To quantify this gap, we benchmarked a standard dense-only retriever against a hybrid RAG system that incorporates SPLADE sparse vectors.

Our evaluation, performed on a curated set of 100 challenging, real-world questions, focused on each system’s ability to retrieve and correctly rank the single best answer.

Performance comparison: Dense vs hybrid retrieval

Our benchmark reveals that a well-tuned hybrid search system significantly outperforms a dense-only approach by delivering more accurate and highly ranked results.

  • Better ranking precision (MRR +18.5%): The hybrid system elevated the Mean Reciprocal Rank from 0.410 to 0.486. This substantial improvement is the most compelling result, as it directly translates to a better user experience by significantly increasing the likelihood that the single best answer appears in the top position.
  • Improved retrieval rate (Recall@5 +7.2%): The hybrid model increased the Recall@5 score from 0.655 to 0.702. This demonstrates its ability to find the correct answer within the top 5 results more consistently, successfully surfacing documents that the dense-only approach would have missed entirely.

To understand our evaluation and metrics in detail, see our benchmark methodology for Hybrid RAG.

Accuracy vs. latency: The performance trade-off

While the hybrid system delivers superior accuracy, this enhanced performance comes at a measurable computational cost.

The hybrid system introduces an additional 201ms of latency per query, representing a 24.5% increase in processing time. To understand our latency measurement process and timing methodology in detail, see our latency measurement methodology.

Where does the extra time go?

The 201ms increase in latency for the hybrid system is not distributed evenly across all operations. Our detailed timing analysis reveals precisely where the computational cost lies:

Hybrid latency component | Time (ms) | Percentage of total
Vector generation | 954.01 | 93.2%
Search operations | 69.48 | 6.8%
Fusion (RRF) | 0.25 | <0.1%
Total query time | 1023.73 | 100%

This breakdown clearly shows that the majority of the latency comes from the initial vector generation step, where the system must create both a dense vector and a sparse vector.

The actual search and fusion steps are remarkably fast, together contributing less than 7% of the total time. This insight is crucial for future optimization efforts, which could focus on parallelizing or accelerating the vector generation process.
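To illustrate one such optimization, the sketch below overlaps the remote dense-embedding API call with local SPLADE inference in a thread pool, so the slower of the two dominates instead of their sum. The dense_vector and sparse_vector helpers are hypothetical stand-ins for the benchmark's own vector-generation code.

```python
# Sketch: overlapping the remote dense-embedding call with local SPLADE inference.
# dense_vector() and sparse_vector() are hypothetical helpers wrapping the OpenAI
# API call and local SPLADE inference, respectively.
from concurrent.futures import ThreadPoolExecutor

def generate_query_vectors(query: str, dense_vector, sparse_vector):
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_vector, query)    # network-bound API call
        sparse_future = pool.submit(sparse_vector, query)  # CPU/GPU-bound inference
        return dense_future.result(), sparse_future.result()
```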

Hybrid RAG system architecture

Our hybrid system combines two complementary retrieval approaches, each addressing different query characteristics through a carefully designed parallel processing architecture.

Figure 1: The workflow of our hybrid retrieval system, from initial user query to the final ranked list of documents sent to the LLM.

Dense component: Semantic understanding

  • Model: OpenAI text-embedding-3-small
  • Strength: Captures semantic meaning and context, excelling at understanding user intent even when queries lack specific keywords.
  • Use case: A query like “stomach-friendly pain relief” will successfully match documents mentioning concepts like “gentle on my digestion” or “didn’t cause an upset stomach,” even if the exact word “friendly” is not used.

Sparse component: Keyword precision

  • Model: SPLADE (SParse Lexical and Expansion model)
  • Strength: Identifies and assigns high importance to discriminative keywords, including technical names, model numbers, and specific product attributes that a purely semantic search might overlook.
  • Use case: A query containing a specific term like “acetaminophen” requires an exact keyword match. SPLADE ensures that documents containing this precise term are highly ranked, a task where a dense model might generalize to “pain reliever” and miss the specific ingredient.
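The sketch below shows one way the two query vectors can be produced: the dense embedding via the OpenAI embeddings API and the sparse vector via a public SPLADE checkpoint on Hugging Face. The specific checkpoint named here is an assumption; the benchmark's exact SPLADE variant may differ.

```python
# Sketch: generating the dense and sparse query vectors.
# Assumes the OpenAI Python SDK and a public SPLADE checkpoint from Hugging Face;
# the exact SPLADE variant used in our benchmark may differ.
import torch
from openai import OpenAI
from transformers import AutoModelForMaskedLM, AutoTokenizer

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
splade_name = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint
splade_tokenizer = AutoTokenizer.from_pretrained(splade_name)
splade_model = AutoModelForMaskedLM.from_pretrained(splade_name)

def dense_vector(query: str) -> list[float]:
    """1536-dim semantic embedding from text-embedding-3-small."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=query)
    return resp.data[0].embedding

def sparse_vector(query: str) -> dict[int, float]:
    """SPLADE term expansion: {vocab_index: weight} for non-zero terms."""
    tokens = splade_tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = splade_model(**tokens).logits           # (1, seq_len, vocab_size)
    # Standard SPLADE pooling: max over tokens of log(1 + ReLU(logit))
    weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
    nonzero = weights.nonzero().squeeze(1)
    return {int(i): float(weights[i]) for i in nonzero}
```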

The reciprocal rank fusion (RRF) algorithm

A user query is vectorized by both the OpenAI and SPLADE models simultaneously, resulting in two independent ranked lists. The critical step is combining these lists using Reciprocal Rank Fusion (RRF).

RRF solves the challenge of merging results from systems with incompatible scoring scales (e.g., a dense score of 0.89 vs. a sparse score of 95.4). Instead of using raw scores, it focuses purely on document rank position (1st, 2nd, 3rd).

Example: For the query “natural deodorant without aluminum and parabens”

  • Dense search ranks a review about “organic, chemical-free deodorant” as #1 (semantic relevance)
  • Sparse search ranks a review containing “aluminum-free” and “paraben-free” as #1 (exact keywords)
  • RRF fusion promotes documents appearing high on both lists to the top

A review that’s semantically relevant AND contains the exact keywords gets the highest combined score, ensuring the best overall match ranks #1.

The final score uses the formula:

Score = Σ (1 / (k + rank_i)) 

where k=60 and rank_i is the document’s position in each search result. The sparse_boost parameter (1.2) slightly favors keyword precision without overwhelming semantic understanding.
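A minimal sketch of the fusion step is shown below, using k = 60 and sparse_boost = 1.2 from the text. Applying the boost as a multiplier on the sparse list's reciprocal-rank term is our reading of the setup, not a detail confirmed above.

```python
# Sketch: Reciprocal Rank Fusion with a sparse-side weight.
# k = 60 and sparse_boost = 1.2 follow the values in the text; applying the boost
# as a multiplier on the sparse list's reciprocal-rank term is an assumption.
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str],
             k: int = 60, sparse_boost: float = 1.2) -> list[str]:
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(dense_ids, start=1):    # dense list, weight 1.0
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    for rank, doc_id in enumerate(sparse_ids, start=1):   # sparse list, boosted
        scores[doc_id] = scores.get(doc_id, 0.0) + sparse_boost / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)   # highest fused score first

# Example: document "B" is #2 on the dense list and #1 on the sparse list,
# so it is promoted above documents that rank highly on only one list.
print(rrf_fuse(["A", "B", "C"], ["B", "D", "E"]))  # ['B', 'D', 'E', 'A', 'C']
```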

The role of fusion parameter tuning

A key finding from our research is that simply combining two retrieval systems doesn’t guarantee improved performance. Our initial hybrid configuration actually performed worse than the dense-only baseline, achieving an MRR of only 0.390.

The issue was an improperly tuned fusion parameter:

  • Initial problematic setting: sparse_boost = 3.0
  • Optimized setting: sparse_boost = 1.2

The initial configuration gave keyword matches from SPLADE three times the weight of semantic matches from the dense model. This aggressive weighting caused semantically irrelevant but keyword-rich documents to overwhelm contextually appropriate results, degrading overall performance.

The optimization to sparse_boost = 1.2 provides a slight preference for keyword matches without overriding semantic understanding, a balance that proved critical for achieving the 18.5% MRR improvement.
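Finding that balance amounts to a small grid search over the fusion weight. The sketch below shows the pattern; run_benchmark is a hypothetical helper that runs the 100-query evaluation at a given weight and returns its MRR.

```python
# Sketch: a simple grid search over the sparse_boost fusion weight.
# `run_benchmark(sparse_boost)` is a hypothetical helper that runs the benchmark
# with the given weight and returns its MRR; the candidate values are illustrative.
def tune_sparse_boost(run_benchmark, boosts=(0.8, 1.0, 1.2, 1.5, 2.0, 3.0)) -> float:
    results = {boost: run_benchmark(boost) for boost in boosts}
    for boost, mrr in sorted(results.items()):
        print(f"sparse_boost={boost:<4} MRR={mrr:.3f}")
    return max(results, key=results.get)  # best-performing weight
```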

When hybrid retrieval excels: The multi-constraint query

The performance advantage of hybrid systems becomes apparent in specific query types that challenge dense-only approaches. A common and challenging query from our “Health and Personal Care” dataset is:

“I need a natural deodorant that is both aluminum-free and paraben-free.”

This query has two distinct parts: a broad semantic intent (“natural deodorant”) and two strict keyword constraints (“aluminum-free,” “paraben-free”).

How a dense-only system responds: A dense-only retriever is excellent at understanding the “natural deodorant” intent. It will find reviews discussing “gentle, organic deodorants.” However, it might highly rank a review that talks about being “all-natural” and “aluminum-free” even if it never mentions parabens. The system correctly captures the primary intent but fails on one of the non-negotiable constraints.

How the hybrid system wins: The hybrid system addresses this issue through a dual approach:

  • The sparse search (precision filter): The SPLADE model immediately finds documents containing the exact, high-weight keywords “aluminum-free” and “paraben-free.”
  • The dense search (relevance filter): Simultaneously, the OpenAI model searches for documents that are semantically related to “natural, effective deodorant.”
  • The fusion (RRF): RRF then looks at both ranked lists. A document that appears high on both, for instance, a glowing review that explicitly praises a product for being “natural,” “effective,” “aluminum-free,” and “paraben-free,” receives the highest possible fused score and is promoted to the #1 rank.

Benchmark methodology for hybrid RAG

Our evaluation methodology was designed to ensure a fair, transparent, and reproducible comparison between the dense-only and hybrid retrieval systems.

Test setup and data corpus

  • Knowledge corpus: We used a dataset of 494,094 real-world user reviews from the Amazon Customer Reviews dataset (Health and Personal Care category).
  • Vector database: We utilized Qdrant to host two separate collections.
    • The dense-only collection stored only OpenAI vectors.
    • The hybrid collection used Qdrant’s “named vectors” feature to store both a dense vector (named dense) and a sparse vector (named text-sparse) for each document.
  • Similarity metric: Cosine Similarity was used for all dense vector searches.
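The sketch below shows how the two collections could be created with the qdrant-client library under this setup. The collection names are illustrative; the named vectors (dense, text-sparse), the 1536-dimensional size of text-embedding-3-small, and cosine similarity follow the description above.

```python
# Sketch: creating the two Qdrant collections used in the benchmark.
# Collection names and the local URL are illustrative assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, SparseVectorParams, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

# Dense-only collection: a single OpenAI vector per review.
client.create_collection(
    collection_name="reviews_dense_only",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Hybrid collection: named dense and sparse vectors stored side by side.
client.create_collection(
    collection_name="reviews_hybrid",
    vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
    sparse_vectors_config={"text-sparse": SparseVectorParams()},
)
```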

Test Queries: Selection process

We created a high-quality test set of 100 questions through a three-step, code-driven process to avoid anecdotal or biased evaluation:

  1. Preprocessing: We programmatically cleaned raw Amazon Q&A data, filtering out nonsensical or low-quality questions. We established a “ground truth” answer for each question by selecting the response with the most “helpful” user votes.
  2. Difficulty classification: We applied a rule-based script to score and classify all questions by difficulty. Questions containing comparative language (“difference between,” “vs,” “better than”) or asking for opinions (“experience with”) were scored as more difficult than simple factual questions (“what are the dimensions”).
  3. Final selection: We manually curated the final 100-question benchmark set from the “hard” category. This ensures we are testing the limits of each retrieval system, where the performance differences are most apparent.
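A minimal sketch of the kind of rule-based scorer used in step 2 is shown below. The cue phrases and the "hard" threshold are illustrative assumptions, not the exact rules from our script.

```python
# Sketch: rule-based difficulty scoring for candidate questions.
# The cue phrases and the threshold for "hard" are illustrative assumptions.
COMPARATIVE_CUES = ("difference between", " vs ", "better than")
OPINION_CUES = ("experience with", "would you recommend", "is it worth")

def difficulty_score(question: str) -> int:
    q = question.lower()
    score = 0
    score += 2 * sum(cue in q for cue in COMPARATIVE_CUES)  # comparative language
    score += 2 * sum(cue in q for cue in OPINION_CUES)      # opinion-seeking language
    if q.startswith(("what are the dimensions", "what is the size")):
        score -= 1                                           # simple factual pattern
    return score

def classify(question: str) -> str:
    return "hard" if difficulty_score(question) >= 2 else "easy"
```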

Evaluation metrics

  • Recall@5 (Hit rate): This metric addresses a basic question: “Did the system find the correct information?” It measures the percentage of queries for which the ground-truth answer appeared anywhere within the top 5 search results. A high Recall@5 score indicates an effective system that successfully surfaces relevant information.
  • MRR (Mean reciprocal rank): This is a rank-sensitive metric that answers: “How quickly did the user find the correct information?” It heavily rewards ranking the correct answer first (a score of 1.0), with diminishing scores for lower ranks (0.5 for 2nd, 0.33 for 3rd, etc.). A high MRR is crucial for user experience, as it signifies that the most accurate result is placed at the top.
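For reference, both metrics can be computed from ranked results with a few lines of Python, as in the sketch below. It assumes each query has a single ground-truth document id, matching the one-best-answer setup described above.

```python
# Sketch: computing Recall@5 and MRR from ranked retrieval results.
# Each item pairs the ranked document ids for a query with its ground-truth id.
def recall_at_k(results: list[tuple[list[str], str]], k: int = 5) -> float:
    hits = sum(1 for ranked, truth in results if truth in ranked[:k])
    return hits / len(results)

def mean_reciprocal_rank(results: list[tuple[list[str], str]]) -> float:
    total = 0.0
    for ranked, truth in results:
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # 1.0 for rank 1, 0.5 for rank 2, ...
    return total / len(results)

# Example: two queries, with the ground truth found at rank 1 and rank 3.
runs = [(["d1", "d2", "d3"], "d1"), (["d7", "d8", "d9"], "d9")]
print(recall_at_k(runs), mean_reciprocal_rank(runs))  # 1.0 and (1.0 + 1/3) / 2 ≈ 0.667
```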

Latency measurement

To provide a complete performance analysis, we measured the end-to-end query latency for both the dense-only and hybrid systems. This measurement is critical for understanding the real-world cost of the accuracy gains provided by the hybrid approach.

The process was implemented within our Python evaluation scripts using the high-precision time.perf_counter() function. For each of the 100 test queries, we measured the total elapsed time from the moment a query was submitted to the retrieval function until the final, ranked list of documents was returned.

For the hybrid system, we performed a more granular analysis by timing its three distinct stages independently:

  1. Vector generation: The total time required to generate both the dense vector (via an API call to OpenAI) and the sparse vector (via local SPLADE model inference).
  2. Search operations: The time taken to execute two separate search queries against the Qdrant vector database, one for the dense vector and one for the sparse vector.
  3. Fusion (RRF): The computational time for the Reciprocal Rank Fusion algorithm to merge the two result sets and produce the final, re-ranked list.

The final latency figures reported in our results represent the arithmetic mean of the times recorded across all 100 test queries, converted to milliseconds (ms) for clarity. This approach ensures that our latency metrics are robust and representative of the average user experience.
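The sketch below illustrates this per-stage timing pattern with time.perf_counter(). The embed_query, search_qdrant, and rrf_fuse arguments are hypothetical stand-ins for the benchmark's own vector-generation, search, and fusion functions.

```python
# Sketch: per-stage latency measurement with time.perf_counter().
# embed_query, search_qdrant, and rrf_fuse are hypothetical stand-ins for the
# benchmark's own vector-generation, search, and fusion steps.
import time
from statistics import mean

def timed_hybrid_query(query, embed_query, search_qdrant, rrf_fuse):
    t0 = time.perf_counter()
    dense_vec, sparse_vec = embed_query(query)                       # stage 1: vector generation
    t1 = time.perf_counter()
    dense_hits, sparse_hits = search_qdrant(dense_vec, sparse_vec)   # stage 2: search operations
    t2 = time.perf_counter()
    ranked = rrf_fuse(dense_hits, sparse_hits)                       # stage 3: RRF fusion
    t3 = time.perf_counter()
    stages_ms = {"vectors": (t1 - t0) * 1000, "search": (t2 - t1) * 1000,
                 "fusion": (t3 - t2) * 1000, "total": (t3 - t0) * 1000}
    return ranked, stages_ms

def average_stage_times(all_runs: list[dict]) -> dict:
    """Arithmetic mean per stage (in ms) across all measured queries."""
    return {stage: mean(run[stage] for run in all_runs) for stage in all_runs[0]}
```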

Limitations and scope

Our benchmark focuses specifically on the health and personal care domain using Amazon review data. Performance patterns may differ across other domains with distinct linguistic characteristics or technical terminology requirements.

The evaluation employed document-level granularity, treating each review as a single vector. Results may vary with different chunking strategies or fine-grained retrieval approaches.

Conclusion

This benchmark confirms that a well-tuned hybrid retrieval system offers a significant performance advantage over a dense-only approach for challenging, real-world queries. By intelligently combining semantic and lexical search, the hybrid model delivers a superior user experience with more accurate and highly-ranked results.

Key takeaways from our benchmark include:

  • Hybrid outperforms dense-only: The optimized hybrid system achieved a +7.2% increase in Recall@5 and a substantial +18.5% boost in MRR, proving its superior ability to both find and correctly rank the best answer.
  • Tuning is non-negotiable: Simply combining dense and sparse search is not enough. Our initial, untuned hybrid system underperformed the dense-only baseline. Strategic optimization of the fusion parameters was essential to performance gains.
  • Accuracy comes at a cost: The improved accuracy of the hybrid system introduced a ~201 ms (24.5%) latency increase per query. This trade-off is a critical consideration for system designers, balancing the need for precision against real-time performance requirements.

What is Retrieval-Augmented Generation (RAG), and why is a hybrid approach necessary?

RAG allows a generative model, like a Large Language Model (LLM), to generate responses based on external data rather than relying solely on its internal training data. This improves factual accuracy by grounding answers in retrieved information.
However, not all data is the same. Some queries demand semantic understanding, while others rely on precise keyword matching—especially when dealing with structured queries or entities extracted from complex information. That’s why hybrid retrieval-augmented generation (Hybrid RAG) is essential. It combines dense semantic search with sparse lexical search, providing both contextual relevance and keyword precision. This hybrid nature ensures that the system retrieves context from both structured and unstructured text data, delivering more accurate responses.

How does this Hybrid RAG system handle different types of data?

The current implementation focuses on unstructured text data, such as product reviews, which often contain nuanced opinions, technical details, and varied linguistic patterns. The system uses multiple retrieval techniques to ensure it captures both meaning and exact terms.
Looking forward, Hybrid RAG could be extended to include structured information and graph data, allowing it to answer more complex queries by integrating facts from knowledge graphs with the sentiment or context in reviews. This would result in a unified context that bridges raw data, structured documents, and narrative content, enabling richer context during response generation.

What happens during the information retrieval process in this specific system?

When a user submits a query, the system activates two parallel retrieval components: a dense retriever (semantic) and a sparse retriever (lexical). The dense model captures broad meanings and relationships, while the SPLADE-based sparse model locks onto key terms.
These two result sets are fused using Reciprocal Rank Fusion (RRF), which resolves the scoring incompatibility between different retrieval methods. This hybrid approach allows the system to retrieve multiple documents that satisfy different parts of a query, improving its ability to generate coherent responses based on the most relevant and comprehensive context available.

Are there downsides to using a hybrid system? What are the computational costs?

Yes, the hybrid approach is more resource intensive. It demands more computational resources due to dual vector generation, double search operations, and fusion logic. This means longer query processing times and a need for additional infrastructure to handle large volumes of data.
Despite this, the performance gains—especially in Mean Reciprocal Rank (MRR) and Recall@5—make it a worthwhile trade-off for applications where factual accuracy and completeness matter. When compared in a rigorous benchmark, the hybrid method consistently retrieved more contextually appropriate and precise information than dense-only systems.

How does Hybrid RAG compare to other RAG methods?

Unlike traditional RAG techniques that rely solely on dense embeddings, Hybrid RAG leverages multiple retrieval methods to maximize answer quality. It supports a broader spectrum of query types, from vague to highly specific, thanks to its dual-retrieval design.
Its hybrid nature makes it especially powerful in use cases where multiple constraints must be satisfied—such as combining structured information (e.g., “paraben-free”) with broader intents (e.g., “natural deodorant”). This comparative analysis demonstrates that Hybrid RAG offers a more balanced and adaptive response based on both dense and sparse signals.

Can this system work with graph-based or structured data in the future?

Yes, future directions for Hybrid RAG include incorporating knowledge graphs and structured data alongside text. By doing so, it can respond to structured queries and provide answers that synthesize graph-based relationships (like product categories or ingredient interactions) with freeform user reviews.
This would allow the system to generate responses grounded in both precise factual structures and nuanced human narratives, improving both factual accuracy and user satisfaction.

Why does better accuracy require more processing power?

Because Hybrid RAG performs two types of retrieval and then fuses the results, it naturally uses more computational resources. Vector generation—especially when generating both dense and sparse embeddings—accounts for over 90% of total latency. Compared to a dense-only approach, this leads to an increase in latency (~201ms per query in our benchmark).
This reflects a broader truth in artificial intelligence: more accurate systems often require more computation. But for mission-critical tasks—like extracting structured information from raw data, navigating complex queries, or ensuring high-stakes factual correctness—the accuracy is well worth the cost.

