Most embedding benchmarks measure semantic similarity. We measured correctness. We tested 11 open-source models on 490,000 Amazon product reviews, scoring each by whether it retrieved the right product review through exact ASIN matching, not just topically similar documents.
Open source embedding models benchmark overview
We evaluated retrieval accuracy and speed across 100 manually curated queries.
Accuracy: Top-K retrieval performance
What is top-K accuracy?
Top-K accuracy measures how often the correct document appears within the top K retrieved results:
- Top-1: The correct answer is ranked first (most precise)
- Top-3: The correct answer appears in the top 3 results
- Top-5: The correct answer appears in the top 5 results (most relevant for RAG, which typically uses 3-5 context documents)
- Average: Mean accuracy across Top-1, Top-3, and Top-5
Higher accuracy means the model successfully finds the right product review more often.
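Below is a minimal sketch of how Top-K accuracy can be computed from ranked retrieval results; the function, data structure, and ASINs are illustrative and not taken from our benchmark code:

```python
def top_k_accuracy(results, k):
    """Fraction of queries whose correct ASIN appears in the top-k retrieved documents.

    `results` is a list of (ground_truth_asin, ranked_asins) pairs,
    where ranked_asins is ordered by descending similarity score.
    """
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results)

# Illustrative usage with made-up ASINs:
results = [
    ("B0EXAMPLE1", ["B0EXAMPLE1", "B0OTHER111", "B0OTHER222"]),  # hit at rank 1
    ("B0EXAMPLE2", ["B0OTHER333", "B0EXAMPLE2", "B0OTHER444"]),  # hit at rank 2
    ("B0EXAMPLE3", ["B0OTHER555", "B0OTHER666", "B0OTHER777"]),  # miss
]
print(top_k_accuracy(results, 1))  # 0.33...
print(top_k_accuracy(results, 3))  # 0.66...
```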
Key insights from accuracy results:
- Perfect Top-5 performers: Three e5 family models (e5-small, e5-base-instruct, e5-large-instruct) achieved 100% Top-5 accuracy; they never missed the correct answer when allowed 5 attempts.
- Top-1 winner: e5-base-instruct achieved 58% Top-1 accuracy, meaning it ranked the correct answer first more than half the time.
- The 56% cluster: Five models, including jina-v3, qwen3-0.6b, snowflake-arctic, and all-MiniLM-L6-v2, plateaued at 56% Top-5 accuracy, showing a clear performance gap from the leaders.
- Size doesn’t equal accuracy: The smallest model (e5-small, 118M params) matched the performance of models 5× larger.
- all-MiniLM-L6-v2 (200M+ downloads on HuggingFace) achieved only 56% Top-5 accuracy and 28% Top-1, among the lowest scores. Its older architecture can’t compete with modern retrieval-optimized models.
Latency and throughput
What are latency and throughput?
- Latency (ms): Time required for embedding generation only (converting text to vector). Lower is better. Vector search time is not included in these measurements.
- Throughput (QPS): Queries processed per second. Higher is better. Calculated as 1000 ÷ latency_ms. Important for high-volume production systems.
These metrics measure how fast a model can serve users in production.
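A minimal sketch of how these two metrics relate, assuming the sentence-transformers library and the intfloat/e5-small-v2 checkpoint as a stand-in for e5-small (both are assumptions; our benchmark harness is not shown here):

```python
import time
from sentence_transformers import SentenceTransformer  # assumed tooling

model = SentenceTransformer("intfloat/e5-small-v2", device="mps")  # "mps" for Apple Silicon; use "cuda" or "cpu" elsewhere

queries = ["Is this probiotic good for digestion?"] * 100  # repeated query just to get a stable average

start = time.perf_counter()
for q in queries:
    model.encode(q)  # embedding generation only; no vector search included
avg_latency_ms = (time.perf_counter() - start) * 1000 / len(queries)

qps = 1000 / avg_latency_ms  # throughput, exactly as defined above
print(f"avg latency: {avg_latency_ms:.1f} ms  |  throughput: {qps:.1f} QPS")
```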
Key insights from performance results:
- Speed champion: e5-small delivered 16ms embedding latency and 63 QPS, the fastest model tested. It’s 7× faster than qwen3-0.6b (110ms, 9 QPS).
- The 7× performance gap: e5-small processes 7 queries in the time qwen3-0.6b processes 1, while also scoring 44 percentage points higher on Top-5 accuracy.
- Sub-30ms cluster: Five models (e5-small, all-MiniLM-L6-v2, mpnet-base-v2, e5-base-instruct, and bge-m3) achieved <30ms latency, making them suitable for real-time applications.
- Production-ready sweet spot: e5-small and e5-base-instruct combine both high accuracy (100% Top-5) and low latency (<30ms), making them ideal for production RAG systems.
Note: These are pure model inference times without vector database operations.
Open source embedding models’ technical features
Understanding the technical specifications:
- Parameters: The model’s size in millions of trainable weights. Larger models (500M+) have more capacity to learn complex patterns but require more memory and compute.
- Dimension: The length of the vector each text is converted into (e.g., 384 means each document becomes a 384-number vector). Higher dimensions (e.g., 1024) can capture more semantic nuance but require more storage and make similarity calculations slower.
- Max Length: The maximum number of tokens (roughly words) the model can process in a single input. Models with an 8192 max length can handle very long documents without chunking, while 512-token models require splitting longer texts.
Key takeaway: Bigger specifications don’t automatically mean better performance. The e5-small model (118M params, 384 dims, 512 tokens) achieved the best results despite having the smallest specifications in the top tier.
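A quick back-of-the-envelope illustration of why dimension matters at this corpus size, assuming float32 vectors (4 bytes per value) and ignoring index overhead:

```python
# Raw storage for 490K float32 vectors (4 bytes per value), ignoring index overhead.
num_docs = 490_000

for dims in (384, 768, 1024):
    size_gb = num_docs * dims * 4 / 1024**3
    print(f"{dims:>4} dims -> {size_gb:.2f} GB")

# 384 dims -> ~0.70 GB, 768 dims -> ~1.40 GB, 1024 dims -> ~1.87 GB
```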
Benchmark methodology
Corpus & queries
Dataset: 490,000 Amazon customer reviews (Health & Personal Care category)
- Each review = single document vector
- Indexed in Qdrant with cosine similarity
Test Set: 100 manually curated queries
- Real user questions (e.g., “Is this probiotic good for digestion?”)
- Each is mapped to one correct product via ASIN verification
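A minimal sketch of how such a corpus can be indexed in Qdrant with cosine similarity, using the qdrant-client and sentence-transformers libraries; the collection name, model checkpoint, and review text are illustrative, not our actual benchmark code:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")  # checkpoint assumed as a stand-in for "e5-small"
client = QdrantClient(url="http://localhost:6333")   # local Qdrant instance, as in our setup

client.create_collection(
    collection_name="reviews_e5_small",  # illustrative name; we used one isolated collection per model
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # e5-small's native 384 dimensions
)

reviews = [
    {"asin": "B0EXAMPLE1", "text": "This probiotic noticeably improved my digestion."},  # made-up review
]
client.upsert(
    collection_name="reviews_e5_small",
    points=[
        # e5 models usually expect a "passage: " prefix on documents; omitted here for brevity.
        PointStruct(id=i, vector=model.encode(r["text"]).tolist(), payload={"asin": r["asin"]})
        for i, r in enumerate(reviews)
    ],
)
```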
Ground truth matching
Our evaluation uses the product ASIN (Amazon Standard Identification Number) for exact matching:
- Query specifies the target product ASIN
- Model retrieves Top-5 documents (ranked by cosine similarity)
- System checks if any retrieved document matches the ground truth ASIN
- Binary outcome: Match = Hit ✓, No Match = Miss ✗
Example:
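The sketch below uses made-up ASINs and similarity scores to illustrate the hit/miss check; the data structures are illustrative, not our benchmark code.

```python
query = {"text": "Is this probiotic good for digestion?", "ground_truth_asin": "B0EXAMPLE1"}

# Top-5 documents returned by the vector search, ranked by cosine similarity (made-up values):
retrieved = [
    {"asin": "B0OTHER111", "score": 0.84},
    {"asin": "B0EXAMPLE1", "score": 0.82},   # matches the ground truth -> Hit
    {"asin": "B0OTHER222", "score": 0.79},
    {"asin": "B0OTHER333", "score": 0.77},
    {"asin": "B0OTHER444", "score": 0.75},
]

hit = any(doc["asin"] == query["ground_truth_asin"] for doc in retrieved)  # exact string equality
print("Hit" if hit else "Miss")  # -> Hit
```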
This ensures product-level factual correctness, not just semantic similarity.
The role of cosine similarity
Where cosine similarity is used:
- Qdrant internally ranks all 490K documents by similarity to the query
- The top 5 highest-scoring documents are returned
Where it’s NOT used:
- Ground truth verification uses exact ASIN match (string equality)
- High similarity score ≠ correct answer
Why this matters:
A model might retrieve highly similar but factually incorrect documents:
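The following is a hypothetical illustration (invented ASINs, scores, and review snippets), not actual benchmark output:

```python
# Hypothetical Top-2 results for the query "Is this probiotic good for digestion?",
# where the ground-truth ASIN is B0EXAMPLE1:
retrieved = [
    {"asin": "B0OTHER999", "score": 0.91, "snippet": "Great probiotic, cleared up my bloating."},       # same topic, wrong product
    {"asin": "B0EXAMPLE1", "score": 0.83, "snippet": "This probiotic noticeably improved my digestion."},
]
# The top hit is semantically closest but counts as a Miss at Top-1,
# because its ASIN does not equal the ground truth.
```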
This demonstrates why factual correctness is more critical than semantic relevance for RAG systems.
Evaluation setup
Hardware: MacBook Air M4 (16GB RAM, MPS backend)
Vector Database: Qdrant (local instance)
Mode: Zero-shot (no fine-tuning)
Batch Size: 8 (consistent across models)
Fairness guarantees:
- Same 490K corpus for all models
- Same 100 queries
- Same hardware and preprocessing
- Isolated collections (no vector leakage)
- Native embedding dimensions per model
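A minimal sketch of the isolation setup, assuming the qdrant-client library; the collection names and dimension values are illustrative and should be taken from each model card in practice:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Native embedding dimension per model (illustrative values; take them from each model card).
model_dims = {
    "e5-small": 384,
    "all-MiniLM-L6-v2": 384,
    "bge-m3": 1024,
}

# One isolated collection per model, so vectors from different models never share an index.
for name, dim in model_dims.items():
    client.create_collection(
        collection_name=f"reviews_{name.replace('-', '_').lower()}",
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
    )
```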
Metrics
Top-K Accuracy:
Measured at K=1, 3, and 5. Top-5 is most relevant since RAG systems typically use 3-5 context documents.
Performance:
- Average Latency: Mean time for embedding generation only (text → vector conversion)
- Throughput: Queries per second (1000 / avg_latency_ms)
Limitations
Domain specificity: Results reflect Health & Personal Care product retrieval. Performance may differ in legal, finance, or code search domains.
Sample size: 100 queries provide strong directional insights. Larger test sets (500-1000) would narrow confidence intervals.
Hardware dependency: MacBook Air M4 MPS backend. Performance will differ on:
- CUDA GPUs (2-10× faster, INT8 quantization available)
- Cloud CPUs (typically slower)
ASIN-based matching: Our approach measures product-level accuracy. Alternative datasets without unique identifiers would require different verification methods (document IDs, text snippets, or semantic similarity thresholds).
Zero-shot only: Models tested without domain-specific fine-tuning. Fine-tuned models might achieve different rankings.
11 open source embedding models
e5-small
A compact multilingual retrieval encoder optimized for high-throughput semantic search, commonly deployed in real-time RAG, recommendation, and product retrieval. Trained for efficient contrastive retrieval, it is designed to maximize inference speed without sacrificing ranking quality.
In our evaluation, it delivered the best overall balance:
- 100% Top-5 retrieval accuracy
- The lowest latency
- The highest query throughput
e5-base-instruct
Instruction-tuned for query–document alignment, making it a strong fit for task-aware search, AI assistants, and guided retrieval pipelines. Its training objective improves prompt understanding at embedding time, increasing precision for structured queries.
e5-large-instruct
A higher-capacity variant designed for accuracy-first retrieval in enterprise knowledge search, legal discovery, and complex query environments. It benefits from deeper representation learning but comes with larger inference costs.
We observed competitive Top-K accuracy, but meaningful trade-offs in latency and QPS, reinforcing that model scale alone does not guarantee better retrieval in production.
gte-multilingual
A 70+ language dense retrieval model built for cross-lingual search and global content discovery, often used for multilingual customer support and international knowledge bases.
It delivered reliable retrieval accuracy but higher latency than optimization-first models, suggesting that broad language generalization introduces compute overhead even in single-language test conditions.
bge-m3
A multi-representation encoder supporting dense, sparse, and hybrid vector retrieval, designed for long documents and multi-vector search pipelines. Frequently used in hybrid lexical-semantic search systems requiring flexibility.
Despite architectural versatility, it trailed smaller optimized models in Top-K accuracy and incurred higher latency, highlighting that multi-objective embedding design does not always translate to stronger retrieval precision.
nomic-embed-v1.5
An embedding model trained with Matryoshka representation learning for adaptive dimensional reduction, designed for flexible vector storage and efficient inference. Often deployed in cost-sensitive vector search systems that scale embedding dimensions dynamically.
In practice, accuracy remained solid but did not outperform smaller dense-only baselines in speed or correctness, showing that theoretical efficiency gains don’t always translate into retrieval wins.
jina-v3
A multilingual retrieval model built for heterogeneous document search, search APIs, and mixed-format enterprise knowledge retrieval. Engineered for generalization across domains and content types.
It delivered stable accuracy and latency, but did not reach top-tier exact-match performance in entity-level retrieval tasks such as product lookups.
qwen3-0.6b
A multilingual retrieval model optimized for instruction-driven semantic search and clustering, used in conversational search, QA retrieval, and multilingual corpora.
In our benchmark it landed in the mid-tier 56% Top-5 cluster and showed higher inference latency relative to its parameter size, limiting its efficiency in high-QPS deployments.
snowflake-arctic
A retrieval encoder targeting enterprise-scale semantic search and internal knowledge systems, built for stability across very large vector indexes.
While consistent, it was outperformed by smaller retrieval-optimized models in both accuracy and latency, reinforcing that enterprise scale does not inherently equal higher retrieval precision.
all-MiniLM-L6-v2
A lightweight, CPU-friendly dense encoder widely used for local search, prototyping, and edge deployment where compute is constrained.
It achieved excellent latency and QPS but lower Top-K accuracy for exact entity lookup, showing that compact semantic models are not always sufficient for factual product retrieval.
mpnet-base-v2
A transformer trained for semantic similarity and clustering, frequently applied in analytics, recommendations, and semantic deduplication.
Though strong at capturing semantic meaning, it underperformed on exact-match product retrieval and showed slower inference than retrieval-specialized compact models.
Key considerations for deploying embedding models
When deploying an embedding model (whether proprietary or open source), several factors determine whether you achieve optimal performance and efficiency:
Performance and accuracy
The right embedding model must be chosen to suit specific retrieval or classification needs. The goal is to generate embeddings that deliver high retrieval quality for your domain.
- Tips: Always consult established benchmarks to evaluate a model’s performance on tasks relevant to your application (semantic similarity, clustering, etc.).
- Note on model size: Larger models often have more capacity for semantic understanding because they have more parameters to learn complex relationships, but as our benchmark shows, size alone does not guarantee higher retrieval accuracy and must be balanced against deployment constraints.
Latency and scaling
Low embedding latency is crucial for real-time applications (e.g., search-as-you-type or live recommendations). This point focuses on the technical requirements for running the model quickly and reliably.
- Tips: Choose a deployment platform that offers efficient autoscaling and optimized hardware (GPUs/TPUs) to ensure consistently low latency and the ability to handle fluctuating traffic.
- Note on model size: Smaller, more efficient models (like distilled models) are often more suitable when low latency is critical. High latency in the retrieval step of a RAG system directly degrades the final user experience by slowing down the answer generation.
Integration with complex AI systems
Embedding models are often components within larger, compound AI solutions. For example, a RAG system combines a text embedding model with an LLM.
- Tips: Select platforms that natively support multi-model serving, distributed orchestration (managing data flow between models), and observability (monitoring performance across the entire chain). Remember that your deployment strategy should simplify the construction and scaling of these multi-model chains.
What is an open source embedding model?
An open source embedding model is a publicly available AI model that converts text into numerical vectors that can be semantically compared, clustered, and searched. Unlike closed APIs, you can run it on your own infrastructure, inspect or fine-tune it, and adapt it to your domain.
They matter because they give you:
- Full data ownership, meaning no leaking queries to third-party APIs
- Zero or lower long-term cost at scale
- Custom fine-tuning for domain precision (medical, finance, product search, and so on)
- Offline or on-prem deployment for security-sensitive environments
- Freedom to optimize for latency, size, or accuracy trade-offs.
Embedding models use cases
Embedding models convert text (and other data types) into embeddings positioned in a vector space, where proximity between these vector representations denotes semantic similarity. This makes embedding generation crucial for numerous AI applications, such as:
Semantic search
Semantic search leverages embedding models (including specialized text embedding models) to find relevant content based on conceptual meaning rather than keyword matching.
Encoding content into a vector store and ranking results by similarity (often measured with cosine similarity) empowers search engines, delivering significantly better search accuracy than traditional keyword-based methods.
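A minimal sketch of cosine-similarity search over embeddings, assuming the sentence-transformers library and the intfloat/e5-small-v2 model (both are illustrative choices, not tied to any deployment described here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed tooling

model = SentenceTransformer("intfloat/e5-small-v2")  # illustrative model choice

docs = [
    "Return policy for damaged items",
    "How to reset your account password",
]
query = "I forgot my login credentials"

doc_vecs = model.encode(docs)
query_vec = model.encode(query)

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, d) for d in doc_vecs]
print(docs[int(np.argmax(scores))])  # expected: "How to reset your account password"
```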
Real-life examples for open-source embedding models in semantic search
Enterprise knowledge search
Global enterprises using Jina AI’s open-source embedding models (e.g., jina-embeddings-v2) deploy semantic search to power HR skills matching, financial reconciliation, and internal knowledge retrieval.
The model’s 8K token support and multilingual design enable high-coverage enterprise search without API dependency, improving retrieval depth while keeping inference local.1
Real-life examples for closed source embedding models in semantic search
Translated customer queries
Zendesk uses embedding models (bi-encoders) to translate customer queries and help articles into vectors. The final ranking is a hybrid system combining keyword matching (BM25) and vector proximity (cosine similarity) for relevance.
Zendesk reports that the implementation of semantic search resulted in an average increase of 7% in mean reciprocal rank (MRR) for English help centers. This is a direct metric showing customers found the correct answer significantly faster, leading to increased self-service success.2
Personalized recommendations
Netflix uses deep learning to generate embeddings for content and users. These vectors capture nuanced viewing preferences and content characteristics for personalized ranking and recommendation.
The overall system is credited with saving the company over $1 billion per year by driving high customer retention.3
Information retrieval (IR)
Embedding generation is key for IR across large databases. A notable application is retrieval augmented generation (RAG), where data retrieved from the vector store via the embedding model helps large language models (LLMs) generate more accurate and up-to-date content, improving both retrieval accuracy and overall answer quality.
Real-life example for open source embedding models in IR
Call intelligence
AT&T processes 40 million customer support calls annually, using AI to categorize each call into one of 80 service categories to detect churn signals and enable proactive retention.
After initially using GPT-4 for call classification, AT&T replaced it with a hybrid open-source model pipeline combining distilled GPT-4 models, H2O.ai’s Danube, and Meta Llama 70B for complex cases, drastically lowering cost while maintaining production accuracy. The open-source system achieved:
- 35% of the previous GPT-4 operating cost
- 91% relative accuracy compared to GPT-4
- Daily processing time reduced from 15 hours to 5 hours
- ~50,000 customers retained annually through improved churn detection.4
Real-life example for closed source embedding models in IR
RAG chatbot
DoorDash implemented a RAG-based chatbot to automate support for its delivery drivers. The system uses an optimal embedding model within its vector store to achieve high retrieval correctness of knowledge base articles, which is critical for grounding the LLM’s automated advice.
The implementation of the RAG system, combined with their rigorous quality monitoring, successfully reduced LLM hallucinations by 90% and severe compliance issues by 99%.5
Clustering and classification
Embedding models can simplify classifying and organizing content by grouping text embeddings or other data representations in the vector space. This is essential for various downstream tasks like grouping customer feedback by sentiment or categorizing documents by topic.
Real-life example for open-source embedding models in clustering and classifying
AI-driven ticket clustering and classification
ByteDance’s Volcano Engine deployed an AI escalation and routing system in production that clusters, deduplicates, and classifies support tickets at scale using semantic similarity and in-house LLMs (DouBao). The system analyzes support conversations to automatically group recurring issues, assign categories, and route escalations to the right resolution owners without manual tagging.
The deployment was validated on 20,000+ real support tickets, and the system could:
- Process hundreds of new tickets per day
- Reduce operational workload by approximately 10 person-days saved every day
- Apply semantic similarity thresholds of 0.86–0.95 for ticket deduplication and clustering.6
Real-life example for closed-source embedding models in clustering and classifying
AI-driven ticket classification
Gelato, an e-commerce platform, used embedding models built on Google’s Vertex AI to automate the triage and assignment of inbound engineering tickets and customer errors.
The embedding model converts the text description of the issue into a vector. This vector is then classified by a machine learning model into the correct technical category (e.g., “Login Error,” “Payment Failure,” “API Bug”). This way, Gelato increased the ticket assignment accuracy from 60% to 90%.7
Recommendation systems
Embedding models aid these systems by understanding user preferences based on the semantic meaning of their interests and the content available. By measuring the similarity between user and item embeddings, recommendation systems can provide more personalized suggestions.
Real-life example for embedding models in recommendation systems
Dynamic recommendations via CoSeRNN
Spotify leverages embedding models to create vector representations for songs, artists, and users. A key advancement in their recommendation engine is the implementation of the CoSeRNN (Contextual and Sequential Recurrent Neural Network) architecture. This system moves beyond static user profiles to address the dynamic nature of music listening.
The CoSeRNN system models user preferences as a sequence of context-dependent embeddings. These embeddings are influenced by factors like the time of day, the device being used, and the tracks recently played. This helps the model learn to predict a preference vector that maximizes the similarity to other tracks played in the current listening session, enabling highly accurate, moment-to-moment personalization.
The CoSeRNN approach, which relies on generating high-quality sequential user embeddings, performed significantly better than competing approaches, showing gains upwards of 10% on all ranking metrics considered for both session and track recommendation tasks. This improvement directly correlates with user satisfaction and reduces the “skip rate,” as it confirms users are hearing more of what they actually want in that specific context.8
The summary of the embedding model case studies:

| Company | Use case | Model type | Reported outcome |
| --- | --- | --- | --- |
| Global enterprises (Jina AI) | Enterprise knowledge search | Open source (jina-embeddings-v2) | Local, multilingual, 8K-token enterprise search without API dependency |
| Zendesk | Semantic search over translated queries | Closed source (bi-encoders + BM25 hybrid) | +7% MRR for English help centers |
| Netflix | Personalized recommendations | Closed source (in-house deep learning) | Over $1 billion per year saved through retention |
| AT&T | Call intelligence and churn detection | Open source pipeline (distilled GPT-4, Danube, Llama 70B) | 35% of GPT-4 cost, 91% relative accuracy, ~50,000 customers retained annually |
| DoorDash | RAG support chatbot for drivers | Closed source | 90% fewer hallucinations, 99% fewer severe compliance issues |
| ByteDance Volcano Engine | Ticket clustering and classification | Open source / in-house (DouBao) | ~10 person-days saved per day on 20,000+ tickets |
| Gelato | Engineering ticket classification | Closed source (Google Vertex AI) | Assignment accuracy up from 60% to 90% |
| Spotify | Dynamic recommendations (CoSeRNN) | In-house sequential user embeddings | 10%+ gains on all ranking metrics |
💡Conclusion
Our benchmark shows that model size does not guarantee performance, as the 118M-parameter e5-small surpassed models five times larger.
For specialized needs:
- Maximum Top-1 precision → e5-base-instruct
- Multilingual support → gte-multilingual-base or e5-large-instruct
- Budget/popularity ≠ performance → Avoid all-MiniLM-L6-v2 and qwen3-0.6b
Always benchmark on your specific domain and workload before committing to production deployment.
FAQ
Reference Links