We benchmarked 8 reranker models on ~145k English Amazon reviews to measure how much a reranking stage improves dense retrieval. We retrieved top-100 candidates with multilingual-e5-base, reranked them with each model, and evaluated the top-10 results against 300 queries, each referencing concrete details from its source review. The best reranker lifted Hit@1 from 62.67% to 83.00% (+20.33pp).
Reranker benchmark results
Metrics explained:
ΔHit@1 / ΔHit@10 show the improvement over the baseline (no reranker) in percentage points (pp). For example, +20.33pp means the reranker improved Hit@1 by 20.33 percentage points compared to the baseline’s 62.67%.
Hit@K measures whether any review with the correct product_id appears in the top-K results. The ground truth is the product_id of the review that generated the query. If a different review of the same product lands in top-K, that counts as a hit. Hit@1 is the strictest test: is the top result from the right product? Hit@10 is more lenient: is the right product somewhere in the first 10 results?
MRR@10 (Mean Reciprocal Rank) averages 1/rank of the first correct result across all queries. If the first matching product_id is at rank 1, the score is 1.0. At rank 2, it is 0.5. At rank 10, it is 0.1. This rewards models that place the correct product as high as possible.
nDCG@10 (Normalized Discounted Cumulative Gain) evaluates the positions of all matching reviews in the top-10, not just the first one. If the same product has multiple reviews in the candidate set and several land in the top-10, nDCG credits each one based on its position. In practice, most products have only 1-2 reviews in the top-100 candidates, so nDCG and MRR track closely.
Recall@10 measures the fraction of matching reviews (same product_id) in the top-10 out of all matching reviews in the full candidate set (top-100). If a product has 3 reviews in the top-100 and the reranker puts 2 of them in the top-10, Recall@10 is 2/3 for that query. Because most products have few duplicate reviews in the candidate set, Recall@10 and Hit@10 are nearly identical in this benchmark.
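The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming each query yields a ranked list of candidate product_ids (best first) and a single gold product_id; the function names are illustrative, not the benchmark's actual code.

```python
import math

def hit_at_k(ranked, gold, k):
    # 1.0 if any of the top-k candidates matches the gold product_id
    return 1.0 if gold in ranked[:k] else 0.0

def mrr_at_k(ranked, gold, k):
    # reciprocal rank of the first match within the top-k, else 0
    for i, pid in enumerate(ranked[:k], start=1):
        if pid == gold:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, gold, k):
    # binary relevance: every review of the gold product counts,
    # each discounted by its position in the top-k
    dcg = sum(1.0 / math.log2(i + 1)
              for i, pid in enumerate(ranked[:k], start=1) if pid == gold)
    n_rel = sum(1 for pid in ranked if pid == gold)
    # ideal ranking: all matching reviews stacked at the top
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(n_rel, k) + 1))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked, gold, k):
    # fraction of all matching reviews in the candidate set that land in top-k
    total = sum(1 for pid in ranked if pid == gold)
    found = sum(1 for pid in ranked[:k] if pid == gold)
    return found / total if total > 0 else 0.0
```

With a single matching review per candidate set, nDCG@10 collapses to a position-discounted version of Hit@10 and Recall@10 equals Hit@10, which is why the metrics track so closely in this benchmark.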
Latency breakdown
Reranking latency measures the time for each cross-encoder to score 100 candidate documents against the query. Vector search time (~20ms) is excluded since it stays constant across all runs and is independent of the reranker.
Latency metrics explained:
Rerank is the time for the cross-encoder to score all 100 candidate documents against the query. This is where models differ: a single forward pass is fast, while autoregressive decoding is slow.
P95 is the 95th percentile total latency. Some queries have longer review texts, which increases tokenization and scoring time. P95 shows the worst-case you should expect for 95% of queries.
Key findings
A 149M model matches a 1.2B model
gte-reranker-modernbert-base has 149M parameters, nemotron-rerank-1b has 1.2B. Both hit 83.00% Hit@1 on English. The ModernBERT architecture is 8x smaller and delivers identical top-line accuracy.
This does not mean model size is irrelevant. nemotron edges ahead on MRR@10 (0.8514 vs 0.8483) and Hit@10 (88.33% vs 88.00%), meaning it ranks relevant documents slightly better across the full top-10. But for most applications where getting the first result right is what counts, the 149M model is enough.
The largest model is not the best
qwen3_reranker_4b has 4B parameters and takes over a second per query. It hits 77.67% Hit@1, placing fourth behind nemotron (1.2B), gte_modernbert (149M), and jina (560M). You pay 4.5x the latency of nemotron for 5.3 percentage points less accuracy.
qwen3’s architecture uses causal language modeling with a yes/no logit approach. The model reads the query-document pair and outputs the probability of “yes, this is relevant.” This is conceptually clean, but inference is expensive because of autoregressive decoding overhead. The SequenceClassification models (gte_modernbert, bge) and nemotron’s prompt-template approach process the pair in a single forward pass, which is fundamentally faster.
Jina offers the best speed-accuracy tradeoff
jina_reranker_v3 hits 81.33% Hit@1 at 188ms. nemotron hits 83.00% at 243ms. If you need sub-200ms total latency per query, Jina is the only model in the top tier that delivers. The 1.67 percentage point gap may not justify the extra 55ms in a production system serving thousands of requests per second.
One reranker makes results worse
mxbai_rerank_xsmall (70M params) scores 64.67% Hit@1. The baseline without any reranker scores 62.67%. The improvement is only 2 percentage points, which is within noise for 300 queries. At 70M parameters, the model lacks the capacity to reliably judge query-document relevance on longer or more nuanced texts.
A reranker is not automatically beneficial. Test it on your actual data before deploying.
The retriever sets the ceiling
All top rerankers converge around 87-88% Hit@10. This ceiling comes from the retriever. If multilingual-e5-base does not place the correct document in the top-100 candidates, no reranker can recover it. The remaining 12% of queries where every reranker fails represent cases where the dense retriever simply missed the relevant document entirely.
Improving beyond this ceiling requires a better retriever, a larger candidate pool, or both. We tested top-250 candidates and found almost no improvement over top-100, meaning e5_base exhausts its useful candidates well before rank 250.
How rerankers work
A dense retriever (bi-encoder) encodes queries and documents independently into vectors. Retrieval is a nearest-neighbor search over these vectors. This is fast because you only encode the query at search time, but the model never sees the query and document together, so it can miss nuanced relevance signals.
A reranker (cross-encoder) takes a query-document pair as a single input. The model attends to both texts jointly, catching relationships that independent encoding misses. The cost is that you must run the model once per candidate, so you can only afford to score a small pool.
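The bi-encoder side of this split can be illustrated with a toy bag-of-words stand-in (no real model involved; in the benchmark, multilingual-e5-base plays the `embed` role). The key property is structural: documents are encoded once, offline, and only the query is encoded at search time.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a bi-encoder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    doc_vecs = [embed(d) for d in docs]   # encoded once, offline
    q = embed(query)                      # only the query at search time
    scores = [cosine(q, d) for d in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [docs[i] for i in order[:k]]
```

Because `embed` never sees the query and document together, it cannot model interactions between them; a cross-encoder instead feeds the concatenated pair through one model with attention across both texts, at the cost of one forward pass per candidate.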
Architectures in this benchmark
We tested four different cross-encoder architectures:
SequenceClassification models (bge_base, bge_v2_m3, mxbai_xsmall, gte_modernbert) take a [query, document] pair as input and output a single logit score. This is the simplest and most common approach.
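A hedged sketch of this pattern via the standard transformers API: BAAI/bge-reranker-base is one of the checkpoint families above, but the padding/truncation settings and prompt-free pair format are assumptions here, so check the model card before copying. The heavy imports are kept inside the function so the pure ordering helper stands alone.

```python
def order_by_score(docs: list[str], scores: list[float]) -> list[str]:
    # Highest relevance score first.
    return [d for d, _ in sorted(zip(docs, scores), key=lambda t: -t[1])]

def rerank(query: str, docs: list[str],
           name: str = "BAAI/bge-reranker-base") -> list[str]:
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()
    # One [query, document] pair per row; a single forward pass yields
    # one relevance logit per pair -- no token-by-token decoding.
    batch = tok([query] * len(docs), docs, padding=True,
                truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        scores = model(**batch).logits.squeeze(-1).tolist()
    return order_by_score(docs, scores)
```

All 100 candidates can be scored in one or a few batched forward passes, which is why this architecture family dominates the latency table.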
Nemotron uses a prompt template format: “question:{q} passage:{p}”. The input looks like plain text rather than a structured pair, but the model still outputs a single relevance score through SequenceClassification. The LLM pretraining (based on Llama) gives it strong language understanding.
Qwen3 rerankers use causal language modeling. The model reads the pair and judges relevance via the next-token logits for “yes” and “no”. The score is P(yes) / (P(yes) + P(no)), a softmax restricted to those two logits. This requires the full autoregressive machinery, which explains the higher latency.
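A sketch of the yes/no scheme. The prompt template below is a simplified assumption (each Qwen3 reranker model card defines the real one), and the 0.6B checkpoint name is used only as the smallest member of the family; the scoring math itself is the part worth noting.

```python
import math

def score_from_logits(yes_logit: float, no_logit: float) -> float:
    # Softmax restricted to the two candidate tokens:
    # P(yes) / (P(yes) + P(no)), computed stably.
    m = max(yes_logit, no_logit)
    ey, en = math.exp(yes_logit - m), math.exp(no_logit - m)
    return ey / (ey + en)

def yes_no_score(query: str, doc: str,
                 name: str = "Qwen/Qwen3-Reranker-0.6B") -> float:
    # Heavy imports kept local; this function needs the model weights.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    # Simplified template -- an assumption, not the official one.
    prompt = f"Query: {query}\nDocument: {doc}\nRelevant (yes or no)? "
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]   # next-token distribution
    yes_id = tok.convert_tokens_to_ids("yes")
    no_id = tok.convert_tokens_to_ids("no")
    return score_from_logits(logits[yes_id].item(), logits[no_id].item())
```

Even though only one token position is scored, the forward pass runs through a multi-billion-parameter decoder per candidate, which is where the latency gap against the SequenceClassification models comes from.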
Jina v3 uses a custom API (model.rerank()) that handles tokenization and scoring internally. The underlying architecture uses cross-attention, but the interface abstracts away the details.
Reranker benchmark methodology
- GPU: NVIDIA H100 PCIe 80GB via Runpod
- Vector database: Qdrant 1.12.0 (local binary), cosine distance
- Retriever: multilingual-e5-base (768-dim). Query prefix: "query: ", document prefix: "passage: "
- Software: transformers 5.2.0, PyTorch 2.8.0, CUDA 12.8.1
- Dataset: English subset of Amazon Reviews Multi (Kaggle). ~145k reviews after filtering for min 100 characters. Each review has a product_id, review text, and star rating.
- Query generation: Claude Sonnet 4.6 via OpenRouter. 300 English queries (5 types: factual, opinion, usage, problem-solving, feature comparison). Each query must reference specific details from its source review; generic questions (specificity score < 4/5) are filtered out.
- Document format: "Review Title: {title}\nReview: {body}"
- Pipeline: Retrieve top-100 candidates with multilingual-e5-base, rerank with cross-encoder, return top-10. Baseline skips reranking and returns the retriever’s top-10 directly.
- Ground truth: product_id exact match only. No cosine similarity fallback. No partial credit for semantically similar products.
- Controlled variable: Only the reranker model changes between experiments. Retriever, candidate count, query set, and evaluation criteria are identical across all runs.
- No fine-tuning: All models evaluated zero-shot with default HuggingFace weights.
- Latency: Measured per query on GPU. Total latency is retrieval (Qdrant vector search, ~20ms) plus reranking (cross-encoder scoring of 100 candidates); per-model comparisons use the reranking time alone, since retrieval is constant across runs.
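The pipeline and ground-truth bullets above condense into a short runnable sketch. `Candidate` and the scoring callback are illustrative stand-ins for the real retriever output and cross-encoder, not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    product_id: str
    text: str  # formatted as "Review Title: {title}\nReview: {body}"

def run_query(query: str,
              candidates: list[Candidate],  # top-100 from the retriever, best first
              rerank_scores: Optional[Callable[[str, list[str]], list[float]]],
              k: int = 10) -> list[Candidate]:
    if rerank_scores is None:
        return candidates[:k]               # baseline: retriever order as-is
    scores = rerank_scores(query, [c.text for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda t: -t[1])
    return [c for c, _ in ranked[:k]]

def hit_at_1(results: list[Candidate], gold_product_id: str) -> bool:
    # Ground truth is exact product_id match, per the methodology above.
    return bool(results) and results[0].product_id == gold_product_id
```

Swapping rerankers means swapping only the `rerank_scores` callable, which is the controlled-variable design the methodology describes.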
Models tested
Limitations
This benchmark uses a single retriever (multilingual-e5-base). A different retriever would produce different candidate sets and could change the reranker rankings. The results reflect how well each reranker works with this specific retriever, not reranker quality in isolation.
We tested on English product reviews from Amazon. Performance on other domains (scientific papers, legal documents, code) or other languages will differ.
The candidate count is fixed at 100. Some rerankers might rank differently with 20 or 200 candidates. We tested 250 candidates and found negligible improvement, suggesting 100 is sufficient for e5_base, but other retrievers may behave differently.
300 queries is a moderate sample size. The top three models (nemotron, gte_modernbert, jina) are separated by less than 2 percentage points. With a larger query set, these rankings could shift. The gap between the top tier and the bottom tier (20+ percentage points) is robust.
Conclusion
Rerankers work. The best model in this benchmark lifts Hit@1 from 62.67% to 83.00% (+20.33pp), meaning 20 out of every 100 queries that previously returned the wrong document first now return the correct one. That is a significant gain for a component that adds under 250ms of latency.
The most useful finding is that model size does not determine reranker quality. gte-reranker-modernbert-base at 149M parameters matches nemotron-rerank-1b at 1.2B on Hit@1. The 4B parameter Qwen3 model finishes fourth. If you are choosing a reranker for a production system, start with the smaller models. You may never need the larger ones.
For latency-sensitive applications, jina-reranker-v3 is the strongest option under 200ms. For maximum accuracy with no latency constraint, nemotron-rerank-1b and gte-reranker-modernbert-base share the top spot. For teams on a GPU budget, gte-modernbert is the clear winner: same accuracy as the 1.2B model at a fraction of the memory footprint.
One pattern held across all experiments: the retriever sets the ceiling. No reranker pushed Hit@10 above 88%, because the remaining 12% of correct documents never appeared in the top-100 candidates. Investing in a better retriever will likely yield larger gains than switching between the top three rerankers.
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top 16 Open Source Embedding Models for RAG
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Agentic RAG benchmark: Multi-database routing and query generation
- Multimodal Embedding Models: Apple vs Meta vs OpenAI
- Hybrid RAG: Boosting RAG Accuracy
- Top 10 Multilingual Embedding Models for RAG