Updated on Jul 10, 2025

Embedding Models: OpenAI vs Gemini vs Cohere in 2025


The effectiveness of any Retrieval-Augmented Generation (RAG) system depends on the precision of its retriever component.

We benchmarked 10 leading text embedding models, including those from OpenAI, Gemini, Cohere, Snowflake, AWS, Mistral, and Voyage AI, using nearly 500,000 Amazon reviews. Our evaluation focused on each model’s ability to retrieve and rank the correct answer first.

Embedding models comparison: Accuracy vs price

The most critical measure of an embedding model’s success is its accuracy in finding and ranking the single correct document first. We quantified this using our ‘accuracy score’ and plotted it against each model’s pricing. To understand our evaluation approach in detail, see our benchmark methodology of embedding models.

The scatter plot illustrates that higher-priced models don’t necessarily achieve better accuracy. The top-performing models offer the best balance between accuracy and cost.

  • Best accuracy overall: mistral-embed achieved the highest accuracy (77.8%), making it ideal for scenarios prioritizing retrieval accuracy even at a moderate cost.
  • Cost-effective alternatives: voyage-3.5-lite delivered solid accuracy (66.1%) at one of the lowest costs, making it attractive for budget-sensitive implementations, and snowflake-arctic-embed-l-v2.0 (66.6%) also performed well at a relatively low cost.
  • Higher-priced alternative: gemini-embedding-001 reached higher accuracy (71.5%) but at the highest price point, limiting its attractiveness for cost-sensitive projects.
  • Underperforming expensive models: Industry-leading models like OpenAI’s text-embedding-3-large and Cohere embed-v4.0 scored lower accuracy than comparable or lower-priced alternatives.

To understand how we calculated the score, see our accuracy methodology.

Embedding models comparison: Relevance

Beyond accuracy, a model must understand the general meaning and relevance of a query. The “Relevance Score” (average query similarity) measures how semantically aligned the top 5 retrieved documents are with the user’s query.

To understand how we calculated the score, see our relevance methodology.

  • Consistent leaders: The top performers in accuracy, like mistral-embed and gemini-embedding-001, also lead in relevance, indicating a robust and well-rounded semantic understanding.
  • The “relevance trap”: An interesting finding is that some models are good at finding semantically related documents but not necessarily the correct ones. For example, OpenAI’s text-embedding-3-small achieved a respectable relevance score (48.6%) but had one of the lowest accuracy scores (39.2%). This indicates it identifies the general information area but struggles with pinpointing specific answers.

A high relevance score is a necessary but not sufficient condition for a top-tier retriever. The best models excel at both understanding the topic broadly and identifying the correct answer with high precision.

Embedding models price calculator

To help you translate our findings into a practical budget for your own project, you can estimate embedding costs from the number of tokens in your dataset and each provider’s per-million-token rate.
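
The underlying calculation is simple: multiply your token count by the provider’s per-million-token rate. Here is a minimal Python sketch of that calculation (the rates shown are illustrative placeholders, not figures from this benchmark; check current provider pricing):

```python
# Illustrative cost estimate: tokens x per-million-token rate.
# The rates below are placeholders only; substitute current provider pricing.
PRICE_PER_MILLION_TOKENS_USD = {
    "example-model-a": 0.02,
    "example-model-b": 0.13,
}

def embedding_cost_usd(num_tokens: int, model: str) -> float:
    """Estimated cost of embedding `num_tokens` tokens with `model`."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD[model]

# Example: a 50-million-token corpus embedded with a $0.02/M-token model.
print(f"${embedding_cost_usd(50_000_000, 'example-model-a'):.2f}")  # -> $1.00
```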

Understanding key embedding model characteristics

It is crucial to understand the key technical attributes that define an embedding model’s capabilities and resource requirements.

Updated on Jul 9, 2025

| Model | Embedding Dimensions | Max Tokens |
|---|---|---|
| mistral-embed | 1,024 | 8,000 |
| gemini-embedding-001 | 768 | 2,048 |
| snowflake-arctic-embed-l-v2.0 | 1,024 | 8,192 |
| voyage-3.5-lite | 1,024 | 32,000 |
| voyage-3-large | 1,024 | 32,000 |
| gemini-text-embedding-004 | 768 | 2,048 |
| OpenAI text-embedding-3-large | 3,072 | 8,191 |
| Cohere embed-v4.0 | 1,536 | 128,000 |
| OpenAI text-embedding-3-small | 1,536 | 8,191 |
| amazon.titan-embed-text-v2:0 | 1,024 | 8,192 |
  • Embedding dimensions: Vector size produced by the model. The dimensions listed in our table represent the default or optimal size recommended by the provider for general use. Higher dimensions (e.g., OpenAI text-embedding-3-large’s 3,072) capture more semantic nuance but require significantly more storage and computational resources, while lower dimensions (e.g., gemini-text-embedding-004’s 768) are more efficient (see the storage sketch after this list). Our results demonstrate that larger dimensions don’t automatically improve retrieval accuracy.
  • Max tokens: Maximum text sequence length processable in a single pass. A larger context window is advantageous for embedding long documents without chunking. While our document-level approach fits within all models’ limits, this attribute becomes critical when implementing fine-grained chunking strategies with large text segments.
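
To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope estimate for this benchmark’s corpus, assuming uncompressed 32-bit floats (4 bytes per dimension) and ignoring index overhead:

```python
# Approximate raw vector storage: vectors x dimensions x bytes per value.
def collection_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_value / 1024**3

CORPUS = 494_094  # reviews in this benchmark
for dims in (768, 1_024, 3_072):
    print(f"{dims} dims: {collection_size_gb(CORPUS, dims):.2f} GB")
```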

Benchmark methodology of embedding models

Our benchmark provides a fair, transparent, and reproducible evaluation of embedding model performance for RAG.

Test setup & data corpus

  • Knowledge corpus: We used a dataset of 494,094 real-world user reviews from the Amazon reviews dataset as the knowledge base.1
  • Vector database: We utilized Qdrant to host all vector collections, which were explicitly configured for cosine similarity search (see the setup sketch after this list).
  • Test queries: We manually curated a set of 100 challenging, real-world questions from an external Amazon Q&A dataset.2 These questions were selected to test sophisticated reasoning, and each had a user-voted “best answer” to serve as our ground truth. To illustrate the nature of these queries, the test set included complex, multi-constraint questions such as:
    • “Is there an A&H natural antiperspirant that contains a safe alternative to Aluminum and Paraben?”
    • This type of query is particularly challenging as it requires the model to understand multiple constraints simultaneously (Brand: A&H; Attribute: natural; Negative Constraint: no Aluminum/Paraben) and the abstract concept of finding an “alternative.”
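
For illustration, here is a minimal sketch of the setup described above using the qdrant-client Python SDK; the collection name, placeholder vectors, and payload fields are hypothetical, and the vector size assumes a 1,024-dimensional model:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance
COLLECTION = "amazon_reviews_mistral_embed"         # hypothetical collection name

# One isolated collection per model, explicitly configured for cosine similarity.
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Each review becomes a single document-level vector; in the benchmark the vector
# comes from the embedding model under test (a placeholder is used here).
review_vector = [0.0] * 1024
client.upsert(
    collection_name=COLLECTION,
    points=[PointStruct(id=1, vector=review_vector, payload={"text": "example review"})],
)

# Retrieval: embed the query with the same model, then take the top 5 hits.
query_vector = [0.0] * 1024
hits = client.search(collection_name=COLLECTION, query_vector=query_vector, limit=5)
```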

Core evaluation principles

  • Isolated collections & native dimensions: For each model, we embedded the entire corpus into a dedicated, isolated collection. In line with standard benchmarks like MTEB, we evaluated each model using its native, optimal embedding dimensions.3
  • Retrieval granularity: We performed this benchmark at the document-level granularity. We treated each user review as a single document and converted it into a single vector. No fine-grained chunking was applied.
  • Zero-shot evaluation: The test was conducted in a “zero-shot” framework. This means the models were evaluated on a niche dataset they had not seen during their original training. We did not fine-tune or train any model on our specific dataset or queries. 

Evaluation metrics: A two-tiered approach

We employed a two-tiered evaluation to distinguish between broad semantic relevance and precise retrieval accuracy. At the core of both metrics is cosine similarity, a standard method for measuring the similarity between two vectors in the embedding space.

Metric 1: The relevance (“Average query similarity” score)

This metric answers: “Does the model understand the general topic of the query?” It measures the broad semantic relevance of the top 5 retrieved documents to the user’s query.

Calculation: For each query, the following steps were taken:

  1. The query text was converted into a vector using the model being tested.
  2. A search was performed to retrieve the top 5 documents.
  3. For each of the five retrieved documents, we calculated the cosine similarity between the query vector and the document vector.
  4. The final score for the query is the average of these five similarity values.
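
A minimal NumPy sketch of this score (the vectors are stand-ins for the query embedding and the five retrieved document embeddings produced by the model under test):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relevance_score(query_vec: np.ndarray, top5_doc_vecs: list[np.ndarray]) -> float:
    """Average query similarity: mean cosine similarity between the query
    vector and each of the top 5 retrieved document vectors."""
    return float(np.mean([cosine_similarity(query_vec, d) for d in top5_doc_vecs]))
```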

Metric 2: The accuracy (“Ground-truth similarity” score)

This is our primary and most critical metric. It answers the question: “Can the model find the single best answer and present it to the user first?”

Calculation: For each query, we made a precise comparison:

  1. The top-ranked document returned by the retriever was identified.
  2. The pre-defined “ground-truth” answer text was also identified.
  3. Crucially, both the Rank 1 document text and the ground-truth answer text were converted into vectors using the same model being evaluated.
  4. The cosine similarity was then calculated between these two resulting vectors. The similarity of documents ranked 2 through 5 was explicitly ignored.

A high score in this metric directly measures a model’s precision and its ability to distinguish the most useful information from a pool of semantically similar documents.
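
Sketched the same way (here `embed` stands in for whichever provider SDK call produces embeddings for the model being evaluated):

```python
import numpy as np

def accuracy_score(rank1_text: str, ground_truth_text: str, embed) -> float:
    """Ground-truth similarity: cosine similarity between the top-ranked document
    and the ground-truth answer, both embedded with the same model under test.
    Documents ranked 2 through 5 are ignored."""
    a = np.asarray(embed(rank1_text))
    b = np.asarray(embed(ground_truth_text))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```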

Measurement framework: Cosine similarity

Our evaluation uses cosine similarity, a robust metric for measuring the similarity between two vectors.
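
Formally, the cosine similarity between two vectors $A$ and $B$ is:

$$\text{cosine\_similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$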

Instead of measuring the physical distance between vectors, this metric calculates the cosine of the angle between them. In essence, it measures if the vectors are pointing in the same direction, providing a pure measure of orientation, not magnitude. The resulting score ranges from 1 to -1:

  • 1: The vectors are identical in orientation (maximum semantic similarity).
  • 0: The vectors are orthogonal, indicating no semantic relationship.
  • -1: The vectors point in opposite directions (opposite meaning).

For our embedding benchmark, this allows us to reliably quantify how semantically similar a retrieved document is to a user’s query or a ground-truth answer. We used this core calculation to build our two primary metrics.

Limitations of embedding models benchmark

While this benchmark was designed to be objective, it is important to acknowledge its specific scope and limitations. These factors should be considered when interpreting the results:

  • Domain specificity: The results are highly specific to the Amazon review dataset used. The performance hierarchy of these models could change when applied to other domains with different linguistic characteristics, such as legal texts, academic papers, or software code. A model that excels at understanding informal, opinion-based review text may not be the optimal choice for a corpus requiring deep technical or formal language comprehension.
  • Document-level granularity: Our methodology evaluated models at a “document-level” granularity, treating each full review as a single vector. This approach tests a model’s ability to understand the overall context of a document. It does not, however, measure performance on “fine-grained” retrieval tasks that would require splitting documents into smaller chunks (e.g., paragraphs or sentences). A model’s performance may differ with a different chunking strategy.

Conclusion

Based on our evaluation, voyage-3.5-lite emerges as the optimal choice for production RAG systems, delivering a strong balance of accuracy and cost. Gemini’s text-embedding-004 provides a budget-friendly alternative, while snowflake-arctic-embed-l-v2.0 offers another competitive option.

Key findings about embedding model selection:

  • Higher dimensions don’t guarantee better performance
  • Premium pricing doesn’t correlate with superior accuracy
  • Domain-specific benchmarking is essential for embedding model selection

FAQ

How do embedding models work to understand text?

Embedding models provide a way to translate text into a format that machines can understand. They take unstructured text and use a neural network to generate embeddings. The output is a vector—a list of numbers—which serves as a numerical representation of the original text’s meaning. This vector places the text as a point within a high-dimensional mathematical concept called an embedding space, where texts with similar meanings are located close to one another.
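
As an illustration, here is a minimal sketch of generating one such vector with the OpenAI Python SDK (the input sentence is made up; the other benchmarked providers expose the same idea through their own SDKs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="This antiperspirant is aluminum-free and works all day.",
)
vector = response.data[0].embedding  # a list of 1,536 floats encoding the text's meaning
print(len(vector))
```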

What is the difference between Sentence and Word Embedding models?

This is a key distinction in natural language processing (NLP). Traditional word embeddings create a single vector for a word, failing to capture context. Modern sentence embeddings, used by the models in this test, are more advanced. They create contextualized vectors for entire sentences, understanding that a word’s meaning changes based on surrounding text. This allows them to capture much more nuanced semantic relationships.

What are pre-trained embedding models and why do they matter?

Pre-trained models are a type of machine learning model that has been trained on vast amounts of general text data. All high-quality embedding models in our benchmark are pre-trained. This initial training gives them a foundational understanding of language and semantic relationships. Our test then measures how effectively this pre-trained knowledge handles the complex data of our specific domain without requiring additional, custom training data.

What about other types of embedding models, like images or graphs?

While our benchmark focused on natural language processing, the same principles apply to other data types. Specialized machine learning models are designed to handle different forms of complex data. For example, image embedding models are created using convolutional neural networks to capture visual features, while graph embedding models are used to create numerical representations of nodes and their connections in network data. This flexibility is what makes embedding technology so powerful for a wide range of AI systems.

How do you ensure the quality of the embeddings generated?

The quality of the embeddings significantly affects the benchmark’s accuracy. Several factors contribute to generating high-quality embeddings:

  • Model architecture: Using a powerful machine learning model like a Transformer is critical.
  • Data quality: The model’s performance depends heavily on the quality of its original training data and the cleanliness of the input data it’s processing.
  • Methodology: Our use of a “zero-shot” framework on complex data ensures we are testing the model’s true ability to create embeddings that are robust and generalizable.

What is the “embedding space” and how does it relate to retrieval?

The embedding space is the conceptual, multi-dimensional space where all the numerical representations (vectors) generated by a model reside. In this space, the distance and direction between vectors correspond to their semantic relationships. When you perform a search, the query is converted to a vector and placed into this same embedding space. The retriever’s job is to find the nearest neighboring vectors, which represent the most semantically similar documents, making it a cornerstone of how modern AI systems process natural language.

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Ekrem is an industry analyst at AIMultiple, focusing on intelligent automation, AI Agents, and RAG frameworks.
