
Top 10 Multilingual Embedding Models for RAG

Ekrem Sarı
updated on Feb 20, 2026

We benchmarked 10 multilingual embedding models on ~606k Amazon reviews across 6 languages (German, English, Spanish, French, Japanese, Chinese). We generated 1,800 queries (300 per language), each referencing concrete details from its source review.

Models trained for search (query vs document separation) outperform larger models trained for general text similarity: e5_base (110M params) outperforms models with 5x to 70x more parameters, while LaBSE (471M params), a widely cited multilingual model, ranks second-to-last.

Multilingual retrieval accuracy

Top-1 measures whether the correct review is the first result returned; Top-10 measures whether it appears anywhere in the first ten.

[Charts: Top-1, Top-3, Top-5, and Top-10 accuracy per model and language]

Metrics explained

  • Top-K accuracy: Whether the correct document (by product_id exact match) appears in the first K results. “Can the model find the right German review when asked a German question among ~130k German reviews?”
  • Top-1/3/5/10: K values tested. Top-1 is the strictest (the correct document must be the first result), Top-10 is the most lenient.
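In code, the Top-K check reduces to a few lines. The sketch below is purely illustrative; the variable and function names are ours, not the benchmark's implementation.

```python
# Illustrative Top-K accuracy: `ranked_ids` holds the product_ids each query
# retrieved (best first); `truth` maps each query to its source product_id.
def top_k_accuracy(ranked_ids: dict, truth: dict, k: int) -> float:
    hits = sum(1 for qid, ranked in ranked_ids.items() if truth[qid] in ranked[:k])
    return hits / len(ranked_ids)

# Example: 2 of 3 queries have their source review within the top 3 results.
ranked_ids = {"q1": ["p9", "p2", "p7"], "q2": ["p5", "p1", "p4"], "q3": ["p3", "p8", "p6"]}
truth = {"q1": "p2", "q2": "p1", "q3": "p0"}
print(top_k_accuracy(ranked_ids, truth, k=3))  # ~0.67
```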

To understand our evaluation and metrics in detail, see our evaluation setup and benchmark methodology for multilingual embedding models.

Corpus: ~606k reviews (min_review_length≥100 chars; ZH: ~17.7k, DE/EN/ES/FR/JA: ~120–145k each), no cosine similarity fallback, product_id exact match only. Evaluated on NVIDIA H100 PCIe 80GB.

Latency & throughput

Latency determines whether a model is viable for production. Models with sub-15ms latency can support real-time search; above 25ms, batching or caching is necessary.
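A quick way to sanity-check these numbers on your own hardware is to time single-query inference, as in the sketch below. The checkpoint name is an assumption (the article does not list exact model IDs); substitute whichever model you are evaluating.

```python
# Rough single-query embedding latency check, mirroring the methodology later in
# this article (embedding inference only, no vector search time included).
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")  # assumed checkpoint
query = "query: Wie lange hält der Akku im Alltag?"  # e5 expects a "query: " prefix

model.encode(query)  # warm-up (model load, CUDA init)
runs = 100
start = time.perf_counter()
for _ in range(runs):
    model.encode(query, normalize_embeddings=True)
elapsed_ms = (time.perf_counter() - start) * 1000 / runs
print(f"{elapsed_ms:.1f} ms per query")
```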

Key findings

1. e5_base leads across all languages

e5_base achieves 16.5% Top-1 average across 6 languages, outperforming the next model (e5_small) by 3.8 percentage points. Its asymmetric query/passage prefix training produces precise embeddings that discriminate well between semantically similar reviews in the same language.

2. LLM-based models are competitive despite their size

qwen3_emb_06b (600M params) and llama_embed_nemotron_8b (8B params) both achieve 10%+ monolingual accuracy. Their massive multilingual pre-training appears to build representations that survive retrieval fine-tuning, keeping them competitive with search-trained models a fraction of their size. nemotron reaches 25.8% at Top-10, the third-best result overall.

3. nomic_embed_v1_5 fails on CJK languages

nomic achieves 0% accuracy in Chinese and only 4% in Japanese, making it the only model to fail entire languages outright. Its English-centric training combined with search_query/search_document prefix asymmetry creates severe coverage gaps for non-European languages, despite working well for English (17% Top-1) and German (9%).

4. LaBSE fails at retrieval despite its reputation

LaBSE was explicitly designed for multilingual semantic similarity and is widely cited in the literature. In this benchmark, it ranks second-to-last (4.8% Top-1). Its training on translation pairs and natural language inference did not build the discriminative precision required for retrieval: distinguishing the exact source review from hundreds of semantically similar products in the same language.

5. Top-10 scaling benefits all models, but especially the stronger ones

Moving from Top-1 to Top-10 doubles recall across the board. nemotron shows the best Top-10 monolingual avg (25.8%) despite ranking 3rd at Top-1 (12.0%), suggesting its 4096-dimensional space has good nearest-neighbor structure at larger K.

6. Spanish and French consistently underperform

Across all models, ES and FR rank consistently lower than DE, EN, JA, and ZH. The pattern holds even for models with explicit multilingual training, suggesting lower representation in the pre-training corpora or domain mismatch for product reviews.

How multilingual embeddings work

An embedding model converts text into a high-dimensional vector (e.g., 384 or 768 numbers) that captures the meaning of the text rather than the specific words. Two texts that are semantically similar should have vectors close together in this space, regardless of language.
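A minimal sketch of this idea with sentence-transformers follows; it assumes the public intfloat/multilingual-e5-base checkpoint, which may differ from the exact models benchmarked here, and the review snippets are illustrative.

```python
# Minimal sketch: semantically equivalent texts in different languages map to
# nearby vectors. Checkpoint and example texts are assumptions for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

texts = [
    "passage: Der Akku hält nur zwei Stunden durch.",        # German
    "passage: The battery only lasts about two hours.",      # English, same meaning
    "passage: Great espresso machine, very easy to clean.",  # unrelated topic
]
vecs = model.encode(texts, normalize_embeddings=True)  # shape (3, 768)

# With L2-normalized vectors, the dot product equals cosine similarity.
print(vecs[0] @ vecs[1])  # high: same meaning, different languages
print(vecs[0] @ vecs[2])  # lower: different product and topic
```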

A multilingual embedding model handles multiple languages in the same vector space. When used for retrieval, the model must find the correct document among tens of thousands of reviews in the same language that often discuss similar products and topics. The challenge is discriminative precision: distinguishing the exact source review from hundreds of semantically similar ones in the same category.

Multilingual evaluation setup

~606k product reviews are indexed in Qdrant (only reviews with ≥100 character body; ZH: ~17.7k, other languages: ~120–145k each). 1,800 queries (300 per language) are generated natively by LLM from reviews meeting the same length threshold. Each query must reference concrete details from its source review (measurements, quantities, brand names, timelines); generic questions are filtered out via a specificity score. Given a query in language X, the task is to find the source review among same-language reviews. Qdrant filters results by language. Accuracy is measured via product_id exact match at Top-1/3/5/10 with no cosine similarity fallback.
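A condensed version of this retrieval step might look like the sketch below. The collection and payload field names are illustrative assumptions, not the benchmark's actual code.

```python
# Sketch of the retrieval step: search only within the query's language and
# count a hit when the source review's product_id is among the top K results.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def found_in_top_k(query_vector, language, source_product_id, k=10):
    hits = client.search(
        collection_name="amazon_reviews_multi",   # assumed collection name
        query_vector=query_vector,
        query_filter=models.Filter(must=[
            models.FieldCondition(key="language",
                                  match=models.MatchValue(value=language)),
        ]),
        limit=k,
    )
    # product_id exact match only -- no cosine-similarity fallback.
    return any(hit.payload["product_id"] == source_product_id for hit in hits)
```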

Example queries from the benchmark cover different languages, domains, and question types, for instance:

  • German (electronics, OPINION)
  • French (drugstore, USAGE)
  • Spanish (industrial_supplies, FACTUAL)

The model must match each query to its exact source review by product_id. A query about WiFi signal loss from an antenna cable could semantically match thousands of electronics reviews discussing connectivity issues; only one describes signal dropping from 60% to 20% after installing this specific cable.

Technical analysis & recommendations

Symmetric vs asymmetric models

The training objective largely predicts retrieval performance:

Why asymmetric models perform best: The query/passage prefix trains the model to embed queries and documents in systematically different regions of the space, creating a retrieval-specific geometry. This produces more discriminative embeddings that separate semantically similar but distinct documents. e5_base achieves this at 110M parameters because the training objective, not model capacity, drives retrieval precision.
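To make the asymmetry concrete, the sketch below encodes a query and candidate passages with the prefixes listed in the methodology section. The checkpoint name and the passage texts are illustrative assumptions; the query is the English example cited later in this article.

```python
# Asymmetric encoding: queries and passages get different prefixes, so the model
# embeds them in systematically different regions of the space.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed checkpoint

query_vec = model.encode(
    "query: How many hours did the 3D printer take to print the cat file from SD card?",
    normalize_embeddings=True,
)
passage_vecs = model.encode(
    [
        "passage: Review Title: Reliable printer\nReview: The cat file from the SD card took about six hours to print...",
        "passage: Review Title: Decent HDMI cable\nReview: Works fine with my TV, no signal issues so far...",
    ],
    normalize_embeddings=True,
)

scores = passage_vecs @ query_vec  # cosine similarities (vectors are L2-normalized)
print(scores)  # the 3D-printer review should score higher than the HDMI cable one
```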

Why LLM-based models are competitive: Massive multilingual pre-training builds rich semantic structure in the model weights. Retrieval fine-tuning adds task-specific alignment on top of this deep language understanding, resulting in competitive performance. The trade-off is latency: nemotron’s 4096-dimensional vectors cost 25ms per query vs 11ms for e5_base.

Why LaBSE fails despite its reputation: LaBSE was trained on translation pairs to bring sentence-level meaning close across languages, a similarity task. Retrieval is fundamentally different: it requires distinguishing the exact source review from hundreds of semantically similar products in the same language. Similarity training optimizes for coarse-grained semantic closeness; retrieval demands fine-grained discrimination between near-duplicates.

Which model should you use?

Best accuracy: e5_base (16.5% Top-1, 11ms latency). Use with a language filter.

Best latency/accuracy trade-off: e5_small (12.7% Top-1, 9.7ms), nearly as fast as minilm with better accuracy.

Best top-10 recall: nemotron (25.8% Top-10) if you can afford the 25ms latency and GPU memory for 4096-dim vectors.

For latency-sensitive production systems: e5_small or minilm at ~10ms. e5_small is strongly preferred (12.7% vs 3.8%).

Always use a language filter when you know the query and document languages match. All models show significant accuracy gains with language-filtered search.

Multilingual embedding models methodology

  • GPU: NVIDIA H100 PCIe 80GB via Runpod
  • Vector database: Qdrant 1.12.0 (local binary)
  • Embedding library: sentence-transformers 5.2.2
  • Query generation: Claude Sonnet 4.6 via OpenRouter. Each question must reference specific details from its source review; generic questions (specificity score < 4/5) are filtered out.
  • Dataset: Amazon Reviews Multi (Kaggle), train.csv. ~606k reviews indexed (min 100 chars; ZH: ~17.7k, others: ~120–145k each). 6 languages: DE, EN, ES, FR, JA, ZH.
  • Queries: 1,800 total (300 per language, 5 question types, natively generated in each language).
  • Document format: "Review Title: {title}\nReview: {body}"
  • Ground truth: product_id exact match only. No cosine similarity fallback.
  • Search: Qdrant vector search with cosine distance. Top-K = 10. Language filter applied for monolingual evaluation.
  • Embedding: L2 normalization. Asymmetric prefixes where applicable: "query: " / "passage: " (e5), "search_query: " / "search_document: " (nomic).
  • No fine-tuning: All models evaluated zero-shot with default weights.
  • Latency: Embedding inference only (single query). Does not include vector search time.
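Putting several of the bullets above together, the indexing side can be sketched roughly as follows; the collection name, point IDs, and the sample review are illustrative assumptions, not the benchmark's actual code or data.

```python
# Indexing sketch: format each review as "Review Title: {title}\nReview: {body}",
# embed with the "passage: " prefix (for e5), L2-normalize, and upsert into Qdrant
# with language and product_id payloads.
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed checkpoint

client.create_collection(
    collection_name="amazon_reviews_multi",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)

reviews = [  # illustrative record, echoing the antenna-cable example above
    {"id": 1, "product_id": "B0001", "language": "de",
     "title": "Enttäuschende Antenne",
     "body": "Das WLAN-Signal fiel nach dem Einbau von 60% auf 20% ..."},
]

points = []
for r in reviews:
    doc = f"Review Title: {r['title']}\nReview: {r['body']}"
    vec = model.encode("passage: " + doc, normalize_embeddings=True)
    points.append(models.PointStruct(
        id=r["id"],
        vector=vec.tolist(),
        payload={"product_id": r["product_id"], "language": r["language"]},
    ))

client.upsert(collection_name="amazon_reviews_multi", points=points)
```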

Models evaluated

Why are scores lower than BEIR/MTEB?

Absolute accuracy numbers in this benchmark should not be compared directly to scores reported on BEIR or MTEB; the setups differ in several structural ways.

The exact-match metric is the largest structural difference. Every query references concrete details from its source review (e.g., “How many hours did the 3D printer take to print the cat file from SD card?”), so each query has a clear unique target, but the metric still awards zero for a semantically relevant review from a different product. Partial-credit metrics like nDCG would yield higher numbers on the same retrieval results. What matters in this benchmark is the relative ranking between models, not the absolute numbers.
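A toy illustration of the gap (the relevance grades here are hypothetical): suppose the top three results are relevant reviews of other products and the exact source review sits at rank four. Exact-match Top-1 scores zero, while a graded metric like nDCG gives substantial partial credit for the same ranking.

```python
# Hypothetical graded relevance for ranks 1..4: three relevant reviews from
# other products (grade 1) and the exact source review at rank 4 (grade 2).
import math

relevance = [1, 1, 1, 2]
ideal = sorted(relevance, reverse=True)

def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

top1_exact_match = 0                        # rank 1 is the wrong product_id
ndcg = dcg(relevance) / dcg(ideal)
print(top1_exact_match, round(ndcg, 2))     # 0 vs roughly 0.84
```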

Limitations

  • Question types may not represent real user queries. LLM-generated questions tend to be well-formed and specific. Real users often write fragmentary or ambiguous queries.
  • Only dense retrieval is tested. Sparse methods (BM25), hybrid retrieval, and reranking pipelines are not evaluated. These may significantly change the ranking between models.
  • 300 queries per language is a moderate sample. Per-language results have reasonably narrow confidence intervals, but rankings near the middle of the table should still be interpreted cautiously.
  • No evaluation of embedding quality beyond retrieval. Clustering quality, semantic similarity accuracy, and other downstream tasks are not measured.

Conclusion

Models trained for search (with separate query and document embeddings) consistently beat models trained for general text similarity, regardless of size. e5_base (110M params) outperforms models 5x to 70x larger. LaBSE (471M params), widely cited for multilingual tasks, ranks second-to-last because its similarity training does not build the fine-grained discrimination that retrieval requires.

LLM-based models (qwen3 at 600M params, nemotron at 8B params) achieve competitive accuracy thanks to deep multilingual pre-training, but they pay for it in latency: nemotron costs 25ms per query vs 11ms for e5_base, with only marginally better Top-10 recall. For most production systems, the smaller search-trained models offer a better trade-off.

For practitioners building multilingual RAG systems, e5_base with a language filter is the clear choice (16.5% Top-1, 11ms latency, and a 3.8 percentage point gap over second place).

Further reading

Explore our other RAG benchmarks.

Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.
