Updated on Jul 10, 2025

Embedding Models: OpenAI vs Gemini vs Cohere in 2025


The effectiveness of any Retrieval-Augmented Generation (RAG) system depends on the precision of its retriever component.

We benchmarked 10 leading text embedding models, including those from OpenAI, Gemini, Cohere, Snowflake, AWS, Mistral, and Voyage AI, using nearly 500,000 Amazon reviews. Our evaluation focused on each model’s ability to retrieve and rank the correct answer first.

Embedding models comparison: Accuracy vs price

The most critical measure of an embedding model’s success is its accuracy in finding and ranking the single correct document first. We quantified this using our ‘accuracy score’ and plotted it against each model’s pricing. To understand our evaluation approach in detail, see our benchmark methodology of embedding models.

The scatter plot illustrates that higher-priced models don’t necessarily achieve better accuracy. The top-performing models offer the best balance between accuracy and cost.

  • Best accuracy overall: mistral-embed achieved the highest accuracy (77.8%), making it ideal for scenarios prioritizing retrieval accuracy even at a moderate cost.
  • Cost-effective alternatives: voyage-3.5-lite delivered solid accuracy (66.1%) at one of the lowest costs, making it attractive for budget-sensitive implementations, and snowflake-arctic-embed-l-v2.0 (66.6%) also performed well at a relatively low cost.
  • Higher-priced alternative: gemini-embedding-001 reached higher accuracy (71.5%) but at the highest price point, limiting its attractiveness for cost-sensitive projects.
  • Underperforming expensive models: Industry-leading models like OpenAI’s text-embedding-3-large and Cohere embed-v4.0 scored lower accuracy than comparable or lower-priced alternatives.

To understand how we calculated the score, see our accuracy methodology.

Embedding models comparison: Relevance

Beyond accuracy, a model must understand the general meaning and relevance of a query. The “Relevance Score” (average query similarity) measures how semantically aligned the top 5 retrieved documents are with the user’s query.

To understand how we calculated the score, see our relevance methodology.

  • Consistent leaders: The top performers in accuracy, like mistral-embed and gemini-embedding-001, also lead in relevance, indicating a robust and well-rounded semantic understanding.
  • The “relevance trap”: An interesting finding is that some models are good at finding semantically related documents but not necessarily the correct ones. For example, OpenAI’s text-embedding-3-small achieved a respectable relevance score (48.6%) but had one of the lowest accuracy scores (39.2%). This indicates it identifies the general information area but struggles with pinpointing specific answers.

A high relevance score is a necessary but not sufficient condition for a top-tier retriever. The best models excel at both understanding the topic broadly and identifying the correct answer with high precision.

Embedding models price calculator

To help you translate our findings into a practical budget for your own project, you can estimate embedding costs from the number of tokens in your dataset and each provider’s per-million-token rate.
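
The underlying calculation is simple: multiply your token count by the provider’s per-million-token rate. Here is a minimal Python sketch of that calculation (the rates shown are illustrative placeholders, not figures from this benchmark; check current provider pricing):

```python
# Illustrative cost estimate: tokens x per-million-token rate.
# The rates below are placeholders only; substitute current provider pricing.
PRICE_PER_MILLION_TOKENS_USD = {
    "example-model-a": 0.02,
    "example-model-b": 0.13,
}

def embedding_cost_usd(num_tokens: int, model: str) -> float:
    """Estimated cost of embedding `num_tokens` tokens with `model`."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD[model]

# Example: a 50-million-token corpus embedded with a $0.02/M-token model.
print(f"${embedding_cost_usd(50_000_000, 'example-model-a'):.2f}")  # -> $1.00
```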

Understanding key embedding model characteristics

It is crucial to understand the key technical attributes that define an embedding model’s capabilities and resource requirements.

Updated on Jul 9, 2025

| Model | Embedding Dimensions | Max Tokens |
|---|---|---|
| mistral-embed | 1,024 | 8,000 |
| gemini-embedding-001 | 768 | 2,048 |
| snowflake-arctic-embed-l-v2.0 | 1,024 | 8,192 |
| voyage-3.5-lite | 1,024 | 32,000 |
| voyage-3-large | 1,024 | 32,000 |
| gemini-text-embedding-004 | 768 | 2,048 |
| OpenAI text-embedding-3-large | 3,072 | 8,191 |
| Cohere embed-v4.0 | 1,536 | 128,000 |
| OpenAI text-embedding-3-small | 1,536 | 8,191 |
| amazon.titan-embed-text-v2:0 | 1,024 | 8,192 |
  • Embedding dimensions: Vector size produced by the model. The dimensions listed in our table represent the default or optimal size recommended by the provider for general use. Higher dimensions (e.g., OpenAI text-embedding-3-large’s 3,072) capture more semantic nuance but require significantly more storage and computational resources, while lower dimensions (e.g., gemini-text-embedding-004’s 768) are more efficient (see the storage sketch after this list). Our results demonstrate that larger dimensions don’t automatically improve retrieval accuracy.
  • Max tokens: Maximum text sequence length processable in a single pass. A larger context window is advantageous for embedding long documents without chunking. While our document-level approach fits within all models’ limits, this attribute becomes critical when implementing fine-grained chunking strategies with large text segments.
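
To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope estimate for this benchmark’s corpus, assuming uncompressed 32-bit floats (4 bytes per dimension) and ignoring index overhead:

```python
# Approximate raw vector storage: vectors x dimensions x bytes per value.
def collection_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_value / 1024**3

CORPUS = 494_094  # reviews in this benchmark
for dims in (768, 1_024, 3_072):
    print(f"{dims} dims: {collection_size_gb(CORPUS, dims):.2f} GB")
```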

Benchmark methodology of embedding models

Our benchmark provides a fair, transparent, and reproducible evaluation of embedding model performance for RAG.

Test setup & data corpus

  • Knowledge corpus: We used a dataset of 494,094 real-world user reviews from the Amazon reviews dataset as the knowledge base.1
  • Vector database: We utilized Qdrant to host all vector collections, which were explicitly configured for cosine similarity search (see the setup sketch after this list).
  • Test queries: We manually curated a set of 100 challenging, real-world questions from an external Amazon Q&A dataset.2 These questions were selected to test sophisticated reasoning, and each had a user-voted “best answer” to serve as our ground truth. To illustrate the nature of these queries, the test set included complex, multi-constraint questions such as:
    • “Is there an A&H natural antiperspirant that contains a safe alternative to Aluminum and Paraben?”
    • This type of query is particularly challenging as it requires the model to understand multiple constraints simultaneously (Brand: A&H; Attribute: natural; Negative Constraint: no Aluminum/Paraben) and the abstract concept of finding an “alternative.”
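
For illustration, here is a minimal sketch of the setup described above using the qdrant-client Python SDK; the collection name, placeholder vectors, and payload fields are hypothetical, and the vector size assumes a 1,024-dimensional model:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance
COLLECTION = "amazon_reviews_mistral_embed"         # hypothetical collection name

# One isolated collection per model, explicitly configured for cosine similarity.
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Each review becomes a single document-level vector; in the benchmark the vector
# comes from the embedding model under test (a placeholder is used here).
review_vector = [0.0] * 1024
client.upsert(
    collection_name=COLLECTION,
    points=[PointStruct(id=1, vector=review_vector, payload={"text": "example review"})],
)

# Retrieval: embed the query with the same model, then take the top 5 hits.
query_vector = [0.0] * 1024
hits = client.search(collection_name=COLLECTION, query_vector=query_vector, limit=5)
```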

Core evaluation principles

  • Isolated collections & native dimensions: For each model, we embedded the entire corpus into a dedicated, isolated collection. In line with standard benchmarks like MTEB, we evaluated each model using its native, optimal embedding dimensions.3
  • Retrieval granularity: We performed this benchmark at the document-level granularity. We treated each user review as a single document and converted it into a single vector. No fine-grained chunking was applied.
  • Zero-shot evaluation: The test was conducted in a “zero-shot” framework. This means the models were evaluated on a niche dataset they had not seen during their original training. We did not fine-tune or train any model on our specific dataset or queries. 

Evaluation metrics: A two-tiered approach

We employed a two-tiered evaluation to distinguish between broad semantic relevance and precise retrieval accuracy. At the core of both metrics is cosine similarity, a standard method for measuring the similarity between two vectors in the embedding space.

Metric 1: The relevance (“Average query similarity” score)

This metric answers: “Does the model understand the general topic of the query?” It measures the broad semantic relevance of the top 5 retrieved documents to the user’s query.

Calculation: For each query, the following steps were taken:

  1. The query text was converted into a vector using the model being tested.
  2. A search was performed to retrieve the top 5 documents.
  3. For each of the five retrieved documents, we calculated the cosine similarity between the query vector and the document vector.
  4. The final score for the query is the average of these five similarity values.
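
A minimal NumPy sketch of this score (the vectors are stand-ins for the query embedding and the five retrieved document embeddings produced by the model under test):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relevance_score(query_vec: np.ndarray, top5_doc_vecs: list[np.ndarray]) -> float:
    """Average query similarity: mean cosine similarity between the query
    vector and each of the top 5 retrieved document vectors."""
    return float(np.mean([cosine_similarity(query_vec, d) for d in top5_doc_vecs]))
```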

Metric 2: The accuracy (“Ground-truth similarity” score)

This is our primary and most critical metric. It answers the question: “Can the model find the single best answer and present it to the user first?”

Calculation: For each query, we made a precise comparison:

  1. The top-ranked document returned by the retriever was identified.
  2. The pre-defined “ground-truth” answer text was also identified.
  3. Crucially, both the Rank 1 document text and the ground-truth answer text were converted into vectors using the same model being evaluated.
  4. The cosine similarity was then calculated between these two resulting vectors. The similarity of documents ranked 2 through 5 was explicitly ignored.

A high score in this metric directly measures a model’s precision and its ability to distinguish the most useful information from a pool of semantically similar documents.
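
Sketched the same way (here `embed` stands in for whichever provider SDK call produces embeddings for the model being evaluated):

```python
import numpy as np

def accuracy_score(rank1_text: str, ground_truth_text: str, embed) -> float:
    """Ground-truth similarity: cosine similarity between the top-ranked document
    and the ground-truth answer, both embedded with the same model under test.
    Documents ranked 2 through 5 are ignored."""
    a = np.asarray(embed(rank1_text))
    b = np.asarray(embed(ground_truth_text))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```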

Measurement framework: Cosine similarity

Our evaluation uses cosine similarity, a robust metric for measuring the similarity between two vectors.
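
Formally, the cosine similarity between two vectors $A$ and $B$ is:

$$\text{cosine\_similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$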

Instead of measuring the physical distance between vectors, this metric calculates the cosine of the angle between them. In essence, it measures if the vectors are pointing in the same direction, providing a pure measure of orientation, not magnitude. The resulting score ranges from 1 to -1:

  • 1: The vectors are identical in orientation (maximum semantic similarity).
  • 0: The vectors are orthogonal, indicating no semantic relationship.
  • -1: The vectors point in opposite directions (opposite meaning).

For our embedding benchmark, this allows us to reliably quantify how semantically similar a retrieved document is to a user’s query or a ground-truth answer. We used this core calculation to build our two primary metrics.

Limitations of embedding models benchmark

While this benchmark was designed to be objective, it is important to acknowledge its specific scope and limitations. These factors should be considered when interpreting the results:

  • Domain specificity: The results are highly specific to the Amazon review dataset used. The performance hierarchy of these models could change when applied to other domains with different linguistic characteristics, such as legal texts, academic papers, or software code. A model that excels at understanding informal, opinion-based review text may not be the optimal choice for a corpus requiring deep technical or formal language comprehension.
  • Document-level granularity: Our methodology evaluated models at a “document-level” granularity, treating each full review as a single vector. This approach tests a model’s ability to understand the overall context of a document. It does not, however, measure performance on “fine-grained” retrieval tasks that would require splitting documents into smaller chunks (e.g., paragraphs or sentences). A model’s performance may differ with a different chunking strategy.

Conclusion

Based on our evaluation, voyage-3.5-lite emerges as the optimal choice for production RAG systems, delivering a strong balance of accuracy and cost. Gemini’s text-embedding-004 provides a budget-friendly alternative, while snowflake-arctic-embed-l-v2.0 offers another competitive option.

Key findings about embedding model selection:

  • Higher dimensions don’t guarantee better performance
  • Premium pricing doesn’t correlate with superior accuracy
  • Domain-specific benchmarking is essential for embedding model selection

FAQ

How do embedding models work to understand text?

Embedding models provide a way to translate text into a format that machines can understand. They take unstructured text and use a neural network to generate embeddings. The output is a vector—a list of numbers—which serves as a numerical representation of the original text’s meaning. This vector places the text as a point within a high-dimensional mathematical concept called an embedding space, where texts with similar meanings are located close to one another.
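
As an illustration, here is a minimal sketch of generating one such vector with the OpenAI Python SDK (the input sentence is made up; the other benchmarked providers expose the same idea through their own SDKs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="This antiperspirant is aluminum-free and works all day.",
)
vector = response.data[0].embedding  # a list of 1,536 floats encoding the text's meaning
print(len(vector))
```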

What is the difference between Sentence and Word Embedding models?

This is a key distinction in natural language processing (NLP). Traditional word embeddings create a single vector for a word, failing to capture context. Modern sentence embeddings, used by the models in this test, are more advanced. They create contextualized vectors for entire sentences, understanding that a word’s meaning changes based on surrounding text. This allows them to capture much more nuanced semantic relationships.

What are pre-trained embedding models and why do they matter?

Pre-trained models are a type of machine learning model that has been trained on vast amounts of general text data. All high-quality embedding models in our benchmark are pre-trained. This initial training gives them a foundational understanding of language and semantic relationships. Our test then measures how effectively this pre-trained knowledge handles the complex data of our specific domain without requiring additional, custom training data.

What about other types of embedding models, like images or graphs?

While our benchmark focused on natural language processing, the same principles apply to other data types. Specialized machine learning models are designed to handle different forms of complex data. For example, image embedding models are created using convolutional neural networks to capture visual features, while graph embedding models are used to create numerical representations of nodes and their connections in network data. This flexibility is what makes embedding technology so powerful for a wide range of AI systems.

How do you ensure the quality of the embeddings generated?

The quality of the embeddings significantly affects the benchmark’s accuracy. Several factors contribute to generating high-quality embeddings:

  • Model architecture: Using a powerful machine learning model like a Transformer is critical.
  • Data quality: The model’s performance depends heavily on the quality of its original training data and the cleanliness of the input data it’s processing.
  • Methodology: Our use of a “zero-shot” framework on complex data ensures we are testing the model’s true ability to create embeddings that are robust and generalizable.

What is the “embedding space” and how does it relate to retrieval?

The embedding space is the conceptual, multi-dimensional space where all the numerical representations (vectors) generated by a model reside. In this space, the distance and direction between vectors correspond to their semantic relationships. When you perform a search, the query is converted to a vector and placed into this same embedding space. The retriever’s job is to find the nearest neighboring vectors, which represent the most semantically similar documents, making it a cornerstone of how modern AI systems process natural language.

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Ekrem is an industry analyst at AIMultiple, focusing on intelligent automation, AI Agents, and RAG frameworks.
