Most embedding benchmarks measure semantic similarity. We measured correctness. We tested 11 open-source models on 490,000 Amazon product reviews, scoring each by whether it retrieved the right product review through exact ASIN matching, not just topically similar documents.
Open source embedding models benchmark overview
We evaluated retrieval accuracy and speed across 100 manually curated queries.
Accuracy: Top-K retrieval performance
What is top-K accuracy?
Top-K accuracy measures how often the correct document appears within the top K retrieved results:
- Top-1: The correct answer is ranked first (most precise)
- Top-3: The correct answer appears in the top 3 results
- Top-5: The correct answer appears in the top 5 results (most relevant for RAG, which typically uses 3-5 context documents)
- Average: Mean accuracy across Top-1, Top-3, and Top-5
Higher accuracy means the model successfully finds the right product review more often.
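Below is a minimal sketch of how Top-K accuracy can be computed from ranked retrieval results; the function, data structure, and ASINs are illustrative and not taken from our benchmark code:

```python
def top_k_accuracy(results, k):
    """Fraction of queries whose correct ASIN appears in the top-k retrieved documents.

    `results` is a list of (ground_truth_asin, ranked_asins) pairs,
    where ranked_asins is ordered by descending similarity score.
    """
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results)

# Illustrative usage with made-up ASINs:
results = [
    ("B0EXAMPLE1", ["B0EXAMPLE1", "B0OTHER111", "B0OTHER222"]),  # hit at rank 1
    ("B0EXAMPLE2", ["B0OTHER333", "B0EXAMPLE2", "B0OTHER444"]),  # hit at rank 2
    ("B0EXAMPLE3", ["B0OTHER555", "B0OTHER666", "B0OTHER777"]),  # miss
]
print(top_k_accuracy(results, 1))  # 0.33...
print(top_k_accuracy(results, 3))  # 0.66...
```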
Key insights from accuracy results:
- Perfect Top-5 performers: Three e5 family models (e5-small, e5-base-instruct, e5-large-instruct) achieved 100% Top-5 accuracy; they never missed the correct answer when allowed 5 attempts.
- Top-1 winner: e5-base-instruct achieved 58% Top-1 accuracy, meaning it ranked the correct answer first more than half the time.
- The 56% cluster: Five models, including jina-v3, qwen3-0.6b, snowflake-arctic, and all-MiniLM-L6-v2, plateaued at 56% Top-5 accuracy, showing a clear performance gap from the leaders.
- Size doesn’t equal accuracy: The smallest model (e5-small, 118M params) matched the performance of models 5× larger.
- all-MiniLM-L6-v2 (200M+ downloads on HuggingFace) achieved only 56% Top-5 accuracy and 28% Top-1, among the lowest scores. Its older architecture can’t compete with modern retrieval-optimized models.
Latency and throughput
What are latency and throughput?
- Latency (ms): Time required for embedding generation only (converting text to vector). Lower is better. Vector search time is not included in these measurements.
- Throughput (QPS): Queries processed per second. Higher is better. Calculated as 1000 ÷ latency_ms. Important for high-volume production systems.
These metrics measure how fast a model can serve users in production.
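A minimal sketch of how these two metrics relate, assuming the sentence-transformers library and the intfloat/e5-small-v2 checkpoint as a stand-in for e5-small (both are assumptions; our benchmark harness is not shown here):

```python
import time
from sentence_transformers import SentenceTransformer  # assumed tooling

model = SentenceTransformer("intfloat/e5-small-v2", device="mps")  # "mps" for Apple Silicon; use "cuda" or "cpu" elsewhere

queries = ["Is this probiotic good for digestion?"] * 100  # repeated query just to get a stable average

start = time.perf_counter()
for q in queries:
    model.encode(q)  # embedding generation only; no vector search included
avg_latency_ms = (time.perf_counter() - start) * 1000 / len(queries)

qps = 1000 / avg_latency_ms  # throughput, exactly as defined above
print(f"avg latency: {avg_latency_ms:.1f} ms  |  throughput: {qps:.1f} QPS")
```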
Key insights from performance results:
- Speed champion: e5-small delivered 16ms embedding latency and 63 QPS, the fastest model tested. It’s 7× faster than qwen3-0.6b (110ms, 9 QPS).
- The 7× performance gap: e5-small processes 7 queries in the time qwen3-0.6b processes 1, while also scoring 44 percentage points higher on Top-5 accuracy.
- Sub-30ms cluster: Five models (e5-small, all-MiniLM-L6-v2, mpnet-base-v2, e5-base-instruct, and bge-m3) achieved <30ms latency, making them suitable for real-time applications.
- Production-ready sweet spot: e5-small and e5-base-instruct combine both high accuracy (100% Top-5) and low latency (<30ms), making them ideal for production RAG systems.
Note: These are pure model inference times without vector database operations.
Open source embedding models’ technical features
Understanding the technical specifications:
- Parameters: The model’s size in millions of trainable weights. Larger models (500M+) have more capacity to learn complex patterns but require more memory and compute.
- Dimension: The length of the vector each text is converted into (e.g., 384 means each document becomes a 384-number vector). Higher dimensions (e.g., 1024) can capture more semantic nuance but require more storage and make similarity calculations slower.
- Max Length: The maximum number of tokens (roughly words) the model can process in a single input. Models with an 8192 max length can handle very long documents without chunking, while 512-token models require splitting longer texts.
Key takeaway: Bigger specifications don’t automatically mean better performance. The e5-small model (118M params, 384 dims, 512 tokens) achieved the best results despite having the smallest specifications in the top tier.
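A quick back-of-the-envelope illustration of why dimension matters at this corpus size, assuming float32 vectors (4 bytes per value) and ignoring index overhead:

```python
# Raw storage for 490K float32 vectors (4 bytes per value), ignoring index overhead.
num_docs = 490_000

for dims in (384, 768, 1024):
    size_gb = num_docs * dims * 4 / 1024**3
    print(f"{dims:>4} dims -> {size_gb:.2f} GB")

# 384 dims -> ~0.70 GB, 768 dims -> ~1.40 GB, 1024 dims -> ~1.87 GB
```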
Benchmark methodology
Corpus & queries
Dataset: 490,000 Amazon customer reviews (Health & Personal Care category)
- Each review = single document vector
- Indexed in Qdrant with cosine similarity
Test Set: 100 manually curated queries
- Real user questions (e.g., “Is this probiotic good for digestion?”)
- Each is mapped to one correct product via ASIN verification
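A minimal sketch of how such a corpus can be indexed in Qdrant with cosine similarity, using the qdrant-client and sentence-transformers libraries; the collection name, model checkpoint, and review text are illustrative, not our actual benchmark code:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")  # checkpoint assumed as a stand-in for "e5-small"
client = QdrantClient(url="http://localhost:6333")   # local Qdrant instance, as in our setup

client.create_collection(
    collection_name="reviews_e5_small",  # illustrative name; we used one isolated collection per model
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # e5-small's native 384 dimensions
)

reviews = [
    {"asin": "B0EXAMPLE1", "text": "This probiotic noticeably improved my digestion."},  # made-up review
]
client.upsert(
    collection_name="reviews_e5_small",
    points=[
        # e5 models usually expect a "passage: " prefix on documents; omitted here for brevity.
        PointStruct(id=i, vector=model.encode(r["text"]).tolist(), payload={"asin": r["asin"]})
        for i, r in enumerate(reviews)
    ],
)
```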
Ground truth matching
Our evaluation uses the product ASIN (Amazon Standard Identification Number) for exact matching:
- Query specifies the target product ASIN
- Model retrieves Top-5 documents (ranked by cosine similarity)
- System checks if any retrieved document matches the ground truth ASIN
- Binary outcome: Match = Hit ✓, No Match = Miss ✗
Example:
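The sketch below uses made-up ASINs and similarity scores to illustrate the hit/miss check; the data structures are illustrative, not our benchmark code.

```python
query = {"text": "Is this probiotic good for digestion?", "ground_truth_asin": "B0EXAMPLE1"}

# Top-5 documents returned by the vector search, ranked by cosine similarity (made-up values):
retrieved = [
    {"asin": "B0OTHER111", "score": 0.84},
    {"asin": "B0EXAMPLE1", "score": 0.82},   # matches the ground truth -> Hit
    {"asin": "B0OTHER222", "score": 0.79},
    {"asin": "B0OTHER333", "score": 0.77},
    {"asin": "B0OTHER444", "score": 0.75},
]

hit = any(doc["asin"] == query["ground_truth_asin"] for doc in retrieved)  # exact string equality
print("Hit" if hit else "Miss")  # -> Hit
```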
This ensures product-level factual correctness, not just semantic similarity.
The role of cosine similarity
Where cosine similarity is used:
- Qdrant internally ranks all 490K documents by similarity to the query
- The top 5 highest-scoring documents are returned
Where it’s NOT used:
- Ground truth verification uses exact ASIN match (string equality)
- High similarity score ≠ correct answer
Why this matters:
A model might retrieve highly similar but factually incorrect documents:
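The following is a hypothetical illustration (invented ASINs, scores, and review snippets), not actual benchmark output:

```python
# Hypothetical Top-2 results for the query "Is this probiotic good for digestion?",
# where the ground-truth ASIN is B0EXAMPLE1:
retrieved = [
    {"asin": "B0OTHER999", "score": 0.91, "snippet": "Great probiotic, cleared up my bloating."},       # same topic, wrong product
    {"asin": "B0EXAMPLE1", "score": 0.83, "snippet": "This probiotic noticeably improved my digestion."},
]
# The top hit is semantically closest but counts as a Miss at Top-1,
# because its ASIN does not equal the ground truth.
```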
This demonstrates why factual correctness is more critical than semantic relevance for RAG systems.
Evaluation setup
Hardware: MacBook Air M4 (16GB RAM, MPS backend)
Vector Database: Qdrant (local instance)
Mode: Zero-shot (no fine-tuning)
Batch Size: 8 (consistent across models)
Fairness guarantees:
- Same 490K corpus for all models
- Same 100 queries
- Same hardware and preprocessing
- Isolated collections (no vector leakage)
- Native embedding dimensions per model
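A minimal sketch of the isolation setup, assuming the qdrant-client library; the collection names and dimension values are illustrative and should be taken from each model card in practice:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Native embedding dimension per model (illustrative values; take them from each model card).
model_dims = {
    "e5-small": 384,
    "all-MiniLM-L6-v2": 384,
    "bge-m3": 1024,
}

# One isolated collection per model, so vectors from different models never share an index.
for name, dim in model_dims.items():
    client.create_collection(
        collection_name=f"reviews_{name.replace('-', '_').lower()}",
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
    )
```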
Metrics
Top-K Accuracy:
Measured at K=1, 3, and 5. Top-5 is most relevant since RAG systems typically use 3-5 context documents.
Performance:
- Average Latency: Mean time for embedding generation only (text → vector conversion)
- Throughput: Queries per second (1000 / avg_latency_ms)
Limitations
Domain specificity: Results reflect Health & Personal Care product retrieval. Performance may differ in legal, finance, or code search domains.
Sample size: 100 queries provide strong directional insights. Larger test sets (500-1000) would narrow confidence intervals.
Hardware dependency: MacBook Air M4 MPS backend. Performance will differ on:
- CUDA GPUs (2-10× faster, INT8 quantization available)
- Cloud CPUs (typically slower)
ASIN-based matching: Our approach measures product-level accuracy. Alternative datasets without unique identifiers would require different verification methods (document IDs, text snippets, or semantic similarity thresholds).
Zero-shot only: Models tested without domain-specific fine-tuning. Fine-tuned models might achieve different rankings.
11 open source embedding models
e5-small
A compact multilingual retrieval encoder optimized for high-throughput semantic search, commonly deployed in real-time RAG, recommendation, and product retrieval. Trained for efficient contrastive retrieval, it is designed to maximize inference speed without sacrificing ranking quality.
In our evaluation, it delivered the best overall balance:
- 100% Top-5 retrieval accuracy
- The lowest latency
- The highest query throughput
e5-base-instruct
Instruction-tuned for query–document alignment, making it a strong fit for task-aware search, AI assistants, and guided retrieval pipelines. Its training objective improves prompt understanding at embedding time, increasing precision for structured queries.
e5-large-instruct
A higher-capacity variant designed for accuracy-first retrieval in enterprise knowledge search, legal discovery, and complex query environments. It benefits from deeper representation learning but comes with larger inference costs.
We observed competitive Top-K accuracy, but meaningful trade-offs in latency and QPS, reinforcing that model scale alone does not guarantee better retrieval in production.
gte-multilingual
A 70+ language dense retrieval model built for cross-lingual search and global content discovery, often used for multilingual customer support and international knowledge bases.
It delivered reliable retrieval accuracy but higher latency than optimization-first models, suggesting that broad language generalization introduces compute overhead even in single-language test conditions.
bge-m3
A multi-representation encoder supporting dense, sparse, and hybrid vector retrieval, designed for long documents and multi-vector search pipelines. Frequently used in hybrid lexical-semantic search systems requiring flexibility.
Despite architectural versatility, it trailed smaller optimized models in Top-K accuracy and incurred higher latency, highlighting that multi-objective embedding design does not always translate to stronger retrieval precision.
nomic-embed-v1.5
An embedding model trained with Matryoshka representation learning for adaptive dimensional reduction, designed for flexible vector storage and efficient inference. Often deployed in cost-sensitive vector search systems that scale embedding dimensions dynamically.
In practice, accuracy remained solid but did not outperform smaller dense-only baselines in speed or correctness, showing that theoretical efficiency gains don’t always translate into retrieval wins.
jina-v3
A multilingual retrieval model built for heterogeneous document search, search APIs, and mixed-format enterprise knowledge retrieval. Engineered for generalization across domains and content types.
It delivered stable accuracy and latency, but did not reach top-tier exact-match performance in entity-level retrieval tasks such as product lookups.
qwen3-0.6b
A multilingual retrieval model optimized for instruction-driven semantic search and clustering, used in conversational search, QA retrieval, and multilingual corpora.
In our benchmark it landed in the mid-tier 56% Top-5 cluster and showed higher inference latency relative to its parameter size, limiting its efficiency in high-QPS deployments.
snowflake-arctic
A retrieval encoder targeting enterprise-scale semantic search and internal knowledge systems, built for stability across very large vector indexes.
While consistent, it was outperformed by smaller retrieval-optimized models in both accuracy and latency, reinforcing that enterprise scale does not inherently equal higher retrieval precision.
all-MiniLM-L6-v2
A lightweight, CPU-friendly dense encoder widely used for local search, prototyping, and edge deployment where compute is constrained.
It achieved excellent latency and QPS but lower Top-K accuracy for exact entity lookup, showing that compact semantic models are not always sufficient for factual product retrieval.
mpnet-base-v2
A transformer trained for semantic similarity and clustering, frequently applied in analytics, recommendations, and semantic deduplication.
Though strong at capturing semantic meaning, it underperformed on exact-match product retrieval and showed slower inference than retrieval-specialized compact models.
Key considerations for deploying embedding models
When deploying an embedding model (whether proprietary or open source), several factors determine whether you achieve optimal performance and efficiency:
Performance and accuracy
The right embedding model must be chosen to suit specific retrieval or classification needs. The goal is to generate embeddings that deliver high retrieval quality for your domain.
- Tips: Always consult established benchmarks to evaluate a model’s performance on tasks relevant to your application (semantic similarity, clustering, etc.).
- Note on model size: Larger models often have more capacity for semantic understanding because they have more parameters to learn complex relationships, but as our benchmark shows, size alone does not guarantee higher retrieval accuracy and must be balanced against deployment constraints.
Latency and scaling
Low embedding latency is crucial for real-time applications (e.g., search-as-you-type or live recommendations). This point focuses on the technical requirements for running the model quickly and reliably.
- Tips: Choose a deployment platform that offers efficient autoscaling and optimized hardware (GPUs/TPUs) to ensure consistently low latency and the ability to handle fluctuating traffic.
- Note on model size: Smaller, more efficient models (like distilled models) are often more suitable when low latency is critical. High latency in the retrieval step of a RAG system directly degrades the final user experience by slowing down the answer generation.
Integration with complex AI systems
Embedding models are often components within larger, compound AI solutions. For example, a RAG system combines a text embedding model with an LLM.
- Tips: Select platforms that natively support multi-model serving, distributed orchestration (managing data flow between models), and observability (monitoring performance across the entire chain). Remember that your deployment strategy should simplify the construction and scaling of these multi-model chains.
What is an open source embedding model?
An open source embedding model is a publicly available AI model that converts text into numerical vectors that can be semantically compared, clustered, and searched. Unlike closed APIs, you can run it on your own infrastructure, inspect or fine-tune it, and adapt it to your domain.
They matter because they give you:
- Full data ownership, meaning no leaking queries to third-party APIs
- Zero or lower long-term cost at scale
- Custom fine-tuning for domain precision (medical, finance, product search, and so on)
- Offline or on-prem deployment for security-sensitive environments
- Freedom to optimize for latency, size, or accuracy trade-offs.
Embedding models use cases
Embedding models convert text (and other data types) into embeddings positioned in a vector space, where proximity between these vector representations denotes semantic similarity. This makes embedding generation crucial for numerous AI applications, such as:
Semantic search
Semantic search leverages embedding models (including specialized text embedding models) to find relevant content based on conceptual meaning rather than keyword matching.
Encoding content into a vector store and ranking results by similarity (often measured with cosine similarity) empowers search engines, delivering significantly better search accuracy than traditional keyword-based methods.
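A minimal sketch of cosine-similarity search over embeddings, assuming the sentence-transformers library and the intfloat/e5-small-v2 model (both are illustrative choices, not tied to any deployment described here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed tooling

model = SentenceTransformer("intfloat/e5-small-v2")  # illustrative model choice

docs = [
    "Return policy for damaged items",
    "How to reset your account password",
]
query = "I forgot my login credentials"

doc_vecs = model.encode(docs)
query_vec = model.encode(query)

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, d) for d in doc_vecs]
print(docs[int(np.argmax(scores))])  # expected: "How to reset your account password"
```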
Real-life examples for open-source embedding models in semantic search
Enterprise knowledge search
Global enterprises using Jina AI’s open-source embedding models (e.g., jina-embeddings-v2) deploy semantic search to power HR skills matching, financial reconciliation, and internal knowledge retrieval.
The model’s 8K token support and multilingual design enable high-coverage enterprise search without API dependency, improving retrieval depth while keeping inference local.1
Real-life examples for closed source embedding models in semantic search
Translated customer queries
Zendesk uses embedding models (bi-encoders) to translate customer queries and help articles into vectors. The final ranking is a hybrid system combining keyword matching (BM25) and vector proximity (cosine similarity) for relevance.
Zendesk reports that the implementation of semantic search resulted in an average increase of 7% in mean reciprocal rank (MRR) for English help centers. This is a direct metric showing customers found the correct answer significantly faster, leading to increased self-service success.2
Personalized recommendations
Netflix uses deep learning to generate embeddings for content and users. These vectors capture nuanced viewing preferences and content characteristics for personalized ranking and recommendation.
The overall system is credited with saving the company over $1 billion per year by driving high customer retention.3
Information retrieval (IR)
Embedding generation is key for IR across large databases. A notable application is retrieval augmented generation (RAG), where data retrieved from the vector store via the embedding model helps large language models (LLMs) generate more accurate and up-to-date content, improving both retrieval accuracy and overall answer quality.
Real-life example for open source embedding models in IR
Call intelligence
AT&T processes 40 million customer support calls annually, using AI to categorize each call into one of 80 service categories to detect churn signals and enable proactive retention.
After initially using GPT-4 for call classification, AT&T replaced it with a hybrid open-source model pipeline combining distilled GPT-4 models, H2O.ai’s Danube, and Meta Llama 70B for complex cases, drastically lowering cost while maintaining production accuracy. The open-source system achieved:
- 35% of the previous GPT-4 operating cost
- 91% relative accuracy compared to GPT-4
- Daily processing time reduced from 15 hours to 5 hours
- ~50,000 customers retained annually through improved churn detection.4
Real-life example for closed source embedding models in IR
RAG chatbot
DoorDash implemented a RAG-based chatbot to automate support for its delivery drivers. The system uses an optimal embedding model within its vector store to achieve high retrieval correctness of knowledge base articles, which is critical for grounding the LLM’s automated advice.
The implementation of the RAG system, combined with their rigorous quality monitoring, successfully reduced LLM hallucinations by 90% and severe compliance issues by 99%.5
Clustering and classification
Embedding models can simplify classifying and organizing content by grouping text embeddings or other data representations in the vector space. This is essential for various downstream tasks like grouping customer feedback by sentiment or categorizing documents by topic.
Real-life example for open-source embedding models in clustering and classifying
AI-driven ticket clustering and classification
ByteDance’s Volcano Engine deployed an AI escalation and routing system in production that clusters, deduplicates, and classifies support tickets at scale using semantic similarity and in-house LLMs (DouBao). The system analyzes support conversations to automatically group recurring issues, assign categories, and route escalations to the right resolution owners without manual tagging.
The deployment was validated on 20,000+ real support tickets, and the system could:
- Process hundreds of new tickets per day
- Reduce operational workload by approximately 10 person-days saved every day
- Apply semantic similarity thresholds of 0.86–0.95 for ticket deduplication and clustering.6
Real-life example for closed-source embedding models in clustering and classifying
AI-driven ticket classification
Gelato, an e-commerce platform, used embedding models built on Google’s Vertex AI to automate the triage and assignment of inbound engineering tickets and customer errors.
The embedding model converts the text description of the issue into a vector. This vector is then classified by a machine learning model into the correct technical category (e.g., “Login Error,” “Payment Failure,” “API Bug”). This way, Gelato increased the ticket assignment accuracy from 60% to 90%.7
Recommendation systems
Embedding models aid these systems by understanding user preferences based on the semantic meaning of their interests and the content available. By measuring the similarity between user and item embeddings, recommendation systems can provide more personalized suggestions.
Real-life example for embedding models in recommendation systems
Dynamic recommendations via CoSeRNN
Spotify leverages embedding models to create vector representations for songs, artists, and users. A key advancement in their recommendation engine is the implementation of the CoSeRNN (Contextual and Sequential Recurrent Neural Network) architecture. This system moves beyond static user profiles to address the dynamic nature of music listening.
The CoSeRNN system models user preferences as a sequence of context-dependent embeddings. These embeddings are influenced by factors like the time of day, the device being used, and the tracks recently played. This helps the model learn to predict a preference vector that maximizes the similarity to other tracks played in the current listening session, enabling highly accurate, moment-to-moment personalization.
The CoSeRNN approach, which relies on generating high-quality sequential user embeddings, performed significantly better than competing approaches, showing gains upwards of 10% on all ranking metrics considered for both session and track recommendation tasks. This improvement directly correlates with user satisfaction and reduces the “skip rate,” as it confirms users are hearing more of what they actually want in that specific context.8
The summary of the embedding model case studies:

| Company | Use case | Model type | Reported outcome |
| --- | --- | --- | --- |
| Global enterprises (Jina AI) | Enterprise knowledge search | Open source (jina-embeddings-v2) | Local, multilingual, 8K-token enterprise search without API dependency |
| Zendesk | Semantic search over translated queries | Closed source (bi-encoders + BM25 hybrid) | +7% MRR for English help centers |
| Netflix | Personalized recommendations | Closed source (in-house deep learning) | Over $1 billion per year saved through retention |
| AT&T | Call intelligence and churn detection | Open source pipeline (distilled GPT-4, Danube, Llama 70B) | 35% of GPT-4 cost, 91% relative accuracy, ~50,000 customers retained annually |
| DoorDash | RAG support chatbot for drivers | Closed source | 90% fewer hallucinations, 99% fewer severe compliance issues |
| ByteDance Volcano Engine | Ticket clustering and classification | Open source / in-house (DouBao) | ~10 person-days saved per day on 20,000+ tickets |
| Gelato | Engineering ticket classification | Closed source (Google Vertex AI) | Assignment accuracy up from 60% to 90% |
| Spotify | Dynamic recommendations (CoSeRNN) | In-house sequential user embeddings | 10%+ gains on all ranking metrics |
💡Conclusion
Our benchmark shows that model size does not guarantee performance, as the 118M-parameter e5-small surpassed models five times larger.
For specialized needs:
- Maximum Top-1 precision → e5-base-instruct
- Multilingual support → gte-multilingual-base or e5-large-instruct
- Budget/popularity ≠ performance → Avoid all-MiniLM-L6-v2 and qwen3-0.6b
Always benchmark on your specific domain and workload before committing to production deployment.
FAQ
Reference Links