Dense vector search is excellent at capturing semantic intent, but it often struggles with queries that demand high keyword accuracy. To quantify this gap, we benchmarked a standard dense-only retriever against a hybrid RAG system that incorporates SPLADE sparse vectors.
Our evaluation, performed on a curated set of 100 challenging, real-world questions, focused on each system’s ability to retrieve and correctly rank the single best answer.
Performance comparison: Dense vs hybrid retrieval
Our benchmark reveals that a well-tuned hybrid search system significantly outperforms a dense-only approach, delivering more accurate, higher-ranked results.
- Better ranking precision (MRR +18.5%): The hybrid system elevated the Mean Reciprocal Rank from 0.410 to 0.486. This substantial improvement is the most compelling result, as it directly translates to a better user experience by significantly increasing the likelihood that the single best answer appears in the top position.
- Improved retrieval rate (Recall@5 +7.2%): The hybrid model increased the Recall@5 score from 0.655 to 0.702. This demonstrates its ability to find the correct answer within the top 5 results more consistently, successfully surfacing documents that the dense-only approach would have missed entirely.
To understand our evaluation and metrics in detail, see our benchmark methodology for Hybrid RAG.
Accuracy vs. latency: The performance trade-off
While the hybrid system delivers superior accuracy, this enhanced performance comes at a measurable computational cost.
The hybrid system introduces an additional 201ms of latency per query, representing a 24.5% increase in processing time. To understand our latency measurement process and timing methodology in detail, see our latency measurement methodology.
Where does the extra time go?
The 201ms increase in latency for the hybrid system is not distributed evenly across all operations. Our detailed timing analysis reveals precisely where the computational cost lies:
Hybrid latency component | Time (ms) | Percentage of total
---|---|---
Vector generation | 954.01 | 93.2%
Search operations | 69.48 | 6.8%
Fusion (RRF) | 0.25 | <0.1%
Total query time | 1023.73 | 100%
This breakdown clearly shows that the majority of the latency comes from the initial vector generation step, where the system must create both a dense vector and a sparse vector.
The actual search and fusion steps are remarkably fast, together contributing less than 7% of the total time. This insight is crucial for future optimization efforts, which could focus on parallelizing or accelerating the vector generation process.
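Since the dense embedding comes from a network call while SPLADE inference runs locally, the two generation steps are independent and could overlap. The sketch below illustrates that parallelization idea with stand-in embedding functions (`embed_dense` and `embed_sparse` are placeholders, not the benchmark's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def embed_dense(query: str) -> list[float]:
    # Placeholder for the OpenAI embedding API call (network-bound).
    return [0.1, 0.2, 0.3]

def embed_sparse(query: str) -> dict[int, float]:
    # Placeholder for local SPLADE inference (compute-bound).
    return {101: 1.7, 2048: 0.9}

def embed_parallel(query: str):
    # Run both embedding steps concurrently; the total wait becomes
    # roughly the slower of the two instead of their sum.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(embed_dense, query)
        sparse_future = pool.submit(embed_sparse, query)
        return dense_future.result(), sparse_future.result()
```

Because the dominant cost (the OpenAI call) spends most of its time waiting on the network, a thread pool is sufficient here; no process-level parallelism is needed.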
Hybrid RAG system architecture
Our hybrid system combines two complementary retrieval approaches, each addressing different query characteristics through a carefully designed parallel processing architecture.

Dense component: Semantic understanding
- Model: OpenAI text-embedding-3-small
- Strength: Captures semantic meaning and context, excelling at understanding user intent even when queries lack specific keywords.
- Use case: A query like “stomach-friendly pain relief” will successfully match documents mentioning concepts like “gentle on my digestion” or “didn’t cause an upset stomach,” even when the review shares few or no exact keywords with the query.
Sparse component: Keyword precision
- Model: SPLADE (SParse Lexical and Expansion model)
- Strength: Identifies and assigns high importance to discriminative keywords, including technical names, model numbers, and specific product attributes that a purely semantic search might overlook.
- Use case: A query containing a specific term like “acetaminophen” requires an exact keyword match. SPLADE ensures that documents containing this precise term are highly ranked, a task where a dense model might generalize to “pain reliever” and miss the specific ingredient.
The reciprocal rank fusion (RRF) algorithm
A user query is vectorized by both the OpenAI and SPLADE models simultaneously, resulting in two independent ranked lists. The critical step is combining these lists using Reciprocal Rank Fusion (RRF).
RRF solves the challenge of merging results from systems with incompatible scoring scales (e.g., a dense score of 0.89 vs. a sparse score of 95.4). Instead of using raw scores, it focuses purely on document rank position (1st, 2nd, 3rd).
Example: For the query “natural deodorant without aluminum and parabens”
- Dense search ranks a review about “organic, chemical-free deodorant” as #1 (semantic relevance)
- Sparse search ranks a review containing “aluminum-free” and “paraben-free” as #1 (exact keywords)
- RRF fusion promotes documents appearing high on both lists to the top
A review that’s semantically relevant AND contains the exact keywords gets the highest combined score, ensuring the best overall match ranks #1.
The final score for each document uses the formula:
Score = Σ (w_i / (k + rank_i))
where k = 60, rank_i is the document’s position in ranked list i, and w_i is the weight of that list: 1.0 for the dense results and sparse_boost = 1.2 for the sparse results. The boost slightly favors keyword precision without overwhelming semantic understanding.
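As a sketch, this weighted fusion fits in a few lines of Python (function and variable names are illustrative, not the benchmark's actual code):

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60, sparse_boost=1.2):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc IDs.

    score(d) = 1/(k + rank_dense(d)) + sparse_boost/(k + rank_sparse(d));
    a document missing from one list simply gets no contribution from it.
    """
    scores = {}
    for rank, doc_id in enumerate(dense_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    for rank, doc_id in enumerate(sparse_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + sparse_boost / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Note how rank position, not raw score, drives the result: a document ranked #2 on both lists collects two contributions and outranks documents that are #1 on only one list.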
The role of fusion parameter tuning
A key finding from our research is that simply combining two retrieval systems doesn’t guarantee improved performance. Our initial hybrid configuration actually performed worse than the dense-only baseline, achieving an MRR of only 0.390.
The issue was an improperly tuned fusion parameter:
- Initial problematic setting: sparse_boost = 3.0
- Optimized setting: sparse_boost = 1.2
The initial configuration gave keyword matches from SPLADE three times the weight of semantic matches from the dense model. This aggressive weighting caused semantically irrelevant but keyword-rich documents to overwhelm contextually appropriate results, degrading overall performance.
The optimization to sparse_boost = 1.2 provides a slight preference for keyword matches without overriding semantic understanding, a balance that proved critical for achieving the 18.5% MRR improvement.
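The failure mode is easy to reproduce with toy rankings. In the hypothetical example below (document names and rank positions are invented for illustration), an aggressive boost lets a keyword-stuffed but semantically weak document take the top spot, while the tuned value keeps the genuinely relevant one first:

```python
def fused_scores(dense, sparse, k=60, sparse_boost=1.2):
    # Weighted RRF: the dense list has weight 1.0, the sparse
    # list is multiplied by sparse_boost.
    scores = {}
    for rank, doc in enumerate(dense, start=1):
        scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    for rank, doc in enumerate(sparse, start=1):
        scores[doc] = scores.get(doc, 0.0) + sparse_boost / (k + rank)
    return scores

# "relevant" is semantically strong (dense #1) but only third on keywords;
# "keyword_spam" is keyword-rich (sparse #1) but semantically weak (dense #5).
dense = ["relevant", "a", "b", "c", "keyword_spam"]
sparse = ["keyword_spam", "d", "relevant"]

for boost in (3.0, 1.2):
    scores = fused_scores(dense, sparse, sparse_boost=boost)
    print(f"sparse_boost={boost}: top result = {max(scores, key=scores.get)}")
# → sparse_boost=3.0: top result = keyword_spam
# → sparse_boost=1.2: top result = relevant
```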
When hybrid retrieval excels: The multi-constraint query
The performance advantage of hybrid systems becomes apparent in specific query types that challenge dense-only approaches. A common and challenging query from our “Health and Personal Care” dataset is:
“I need a natural deodorant that is both aluminum-free and paraben-free.”
This query has two distinct parts: a broad semantic intent (“natural deodorant”) and two strict keyword constraints (“aluminum-free,” “paraben-free”).
How a dense-only system responds: A dense-only retriever is excellent at understanding the “natural deodorant” intent. It will find reviews discussing “gentle, organic deodorants.” However, it might highly rank a review that talks about being “all-natural” and “aluminum-free” even if it never mentions parabens. The system correctly captures the primary intent but fails on one of the non-negotiable constraints.
How the hybrid system wins: The hybrid system addresses this issue through a dual approach:
- The sparse search (precision filter): The SPLADE model immediately finds documents containing the exact, high-weight keywords “aluminum-free” and “paraben-free.”
- The dense search (relevance filter): Simultaneously, the OpenAI model searches for documents that are semantically related to “natural, effective deodorant.”
- The fusion (RRF): RRF then looks at both ranked lists. A document that appears high on both, for instance, a glowing review that explicitly praises a product for being “natural,” “effective,” “aluminum-free,” and “paraben-free,” receives the highest possible fused score and is promoted to the #1 rank.
Benchmark methodology for hybrid RAG
Our evaluation methodology was designed to ensure a fair, transparent, and reproducible comparison between the dense-only and hybrid retrieval systems.
Test setup and data corpus
- Knowledge corpus: We used a dataset of 494,094 real-world user reviews from the Amazon Customer Reviews dataset (Health and Personal Care category).
- Vector database: We utilized Qdrant to host two separate collections.
- The dense-only collection stored only OpenAI vectors.
- The hybrid collection used Qdrant’s “named vectors” feature to store both a dense (dense) and a sparse (text-sparse) vector for each document.
- Similarity metric: Cosine Similarity was used for all dense vector searches.
Test queries: Selection process
We created a high-quality test set of 100 questions through a three-step, code-driven process to avoid anecdotal or biased evaluation:
- Preprocessing: We programmatically cleaned raw Amazon Q&A data, filtering out nonsensical or low-quality questions. We established a “ground truth” answer for each question by selecting the response with the most “helpful” user votes.
- Difficulty classification: We applied a rule-based script to score and classify all questions by difficulty. Questions containing comparative language (“difference between,” “vs,” “better than”) or asking for opinions (“experience with”) were scored as more difficult than simple factual questions (“what are the dimensions”).
- Final selection: We manually curated the final 100-question benchmark set from the “hard” category. This ensures we are testing the limits of each retrieval system, where the performance differences are most apparent.
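A rule-based scorer of the kind described in step 2 could look like the following (the patterns and weights here are hypothetical; the benchmark's actual script may differ):

```python
import re

# Illustrative difficulty cues, mirroring the examples in the text
COMPARATIVE = re.compile(r"\b(difference between|vs\.?|versus|better than)\b", re.I)
OPINION = re.compile(r"\b(experience with|do you recommend|is it worth)\b", re.I)
FACTUAL = re.compile(r"\b(what are the dimensions|how many|what size)\b", re.I)

def difficulty(question: str) -> str:
    # Comparative and opinion questions raise the score; simple
    # factual lookups lower it.
    score = 0
    if COMPARATIVE.search(question):
        score += 2
    if OPINION.search(question):
        score += 2
    if FACTUAL.search(question):
        score -= 1
    return "hard" if score >= 2 else "easy"
```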
Evaluation metrics
- Recall@5 (Hit rate): This metric addresses a basic question: “Did the system find the correct information?” It measures the percentage of queries for which the ground-truth answer appeared anywhere within the top 5 search results. A high Recall@5 score indicates an effective system that successfully surfaces relevant information.
- MRR (Mean reciprocal rank): This is a rank-sensitive metric that answers: “How quickly did the user find the correct information?” It heavily rewards ranking the correct answer first (a score of 1.0), with diminishing scores for lower ranks (0.5 for 2nd, 0.33 for 3rd, etc.). A high MRR is crucial for user experience, as it signifies that the most accurate result is placed at the top.
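Both metrics are straightforward to compute from per-query ranked lists; a minimal sketch (data shapes are assumed, not taken from the benchmark code):

```python
def recall_at_k(rankings, ground_truth, k=5):
    # Fraction of queries whose ground-truth doc appears in the top-k results.
    hits = sum(1 for qid, docs in rankings.items() if ground_truth[qid] in docs[:k])
    return hits / len(rankings)

def mean_reciprocal_rank(rankings, ground_truth):
    # Average of 1/rank of the ground-truth doc; a miss contributes 0.
    total = 0.0
    for qid, docs in rankings.items():
        gt = ground_truth[qid]
        if gt in docs:
            total += 1.0 / (docs.index(gt) + 1)
    return total / len(rankings)
```

For example, a system that ranks the correct answer 2nd for one query and 1st for another scores 1.0 on Recall@5 but only (0.5 + 1.0) / 2 = 0.75 on MRR, which is exactly the rank sensitivity described above.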
Latency measurement
To provide a complete performance analysis, we measured the end-to-end query latency for both the dense-only and hybrid systems. This measurement is critical for understanding the real-world cost of the accuracy gains provided by the hybrid approach.
The process was implemented within our Python evaluation scripts using the high-precision time.perf_counter() function. For each of the 100 test queries, we measured the total elapsed time from the moment a query was submitted to the retrieval function until the final, ranked list of documents was returned.
For the hybrid system, we performed a more granular analysis by timing its three distinct stages independently:
- Vector generation: The total time required to generate both the dense vector (via an API call to OpenAI) and the sparse vector (via local SPLADE model inference).
- Search operations: The time taken to execute two separate search queries against the Qdrant vector database, one for the dense vector and one for the sparse vector.
- Fusion (RRF): The computational time for the Reciprocal Rank Fusion algorithm to merge the two result sets and produce the final, re-ranked list.
The final latency figures reported in our results represent the arithmetic mean of the times recorded across all 100 test queries, converted to milliseconds (ms) for clarity. This approach ensures that our latency metrics are robust and representative of the average user experience.
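The pattern can be sketched as a small timing harness (function names are illustrative; time.perf_counter() is the standard-library clock the text refers to):

```python
import time

def time_stage(fn, *args):
    # perf_counter is a monotonic, high-resolution clock, suited to
    # measuring short elapsed intervals.
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

def mean_latency_ms(query_fn, queries):
    # Arithmetic mean across the whole query set, in milliseconds.
    return sum(time_stage(query_fn, q)[1] for q in queries) / len(queries)
```

Wrapping each stage (vector generation, search, fusion) in time_stage separately is what yields the per-component breakdown reported in the latency table.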
Limitations and scope
Our benchmark focuses specifically on the health and personal care domain using Amazon review data. Performance patterns may differ across other domains with distinct linguistic characteristics or technical terminology requirements.
The evaluation employed document-level granularity, treating each review as a single vector. Results may vary with different chunking strategies or fine-grained retrieval approaches.
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Agentic RAG benchmark: multi-database routing and query generation
Conclusion
This benchmark confirms that a well-tuned hybrid retrieval system offers a significant performance advantage over a dense-only approach for challenging, real-world queries. By intelligently combining semantic and lexical search, the hybrid model delivers a superior user experience with more accurate, higher-ranked results.
Key takeaways from our benchmark include:
- Hybrid outperforms dense-only: The optimized hybrid system achieved a +7.2% increase in Recall@5 and a substantial +18.5% boost in MRR, proving its superior ability to both find and correctly rank the best answer.
- Tuning is non-negotiable: Simply combining dense and sparse search is not enough. Our initial, untuned hybrid system underperformed the dense-only baseline. Strategic optimization of the fusion parameters was essential to performance gains.
- Accuracy comes at a cost: The improved accuracy of the hybrid system introduced a ~201 ms (24.5%) latency increase per query. This trade-off is a critical consideration for system designers, balancing the need for precision against real-time performance requirements.
