The effectiveness of any Retrieval-Augmented Generation (RAG) system depends on the precision of its retriever component.
We benchmarked 11 leading text embedding models, including those from OpenAI, Google, Cohere, Snowflake, AWS, Mistral, and Voyage AI, using nearly 500,000 Amazon reviews. Our evaluation focused on each model’s ability to retrieve the correct answer and rank it first.
Embedding models comparison: Accuracy vs price
The most critical measure of an embedding model’s success is its accuracy in finding and ranking the single correct document first. We quantified this using our ‘accuracy score’ and plotted it against each model’s pricing. To understand our evaluation approach in detail, see our benchmark methodology of embedding models.
The scatter plot illustrates that higher-priced models don’t necessarily achieve better accuracy. The top-performing models offer the best balance between accuracy and cost.
- Best accuracy overall: mistral-embed achieved the highest accuracy (77.8%), making it ideal for scenarios prioritizing retrieval accuracy even at a moderate cost.
- Cost-effective alternatives: voyage-3.5-lite delivered solid accuracy (66.1%) at one of the lowest costs, making it attractive for budget-sensitive implementations.
- Moderate cost options: Snowflake (Cortex AI Functions) snowflake-arctic-embed-l-v2.0 (66.6%) offers good accuracy performance at a moderate cost.
- Higher-priced alternative: Google’s (Vertex AI API) gemini-embedding-001 reached high accuracy (71.5%) but at the highest price point, limiting its attractiveness for cost-sensitive projects.
- Underperforming expensive models: Industry-leading options such as OpenAI’s text-embedding-3-large and Cohere’s embed-v4.0 scored lower accuracy than comparably priced or cheaper alternatives.
To understand how we calculated the score, see our accuracy methodology.
Accuracy alone does not tell the whole story: a model must also grasp the general meaning and relevance of a query. The “Relevance Score” (average query similarity) measures how semantically aligned the top 5 retrieved documents are with the user’s query.
To understand how we calculated the score, see our relevance methodology.
- Consistent leaders: The top performers in accuracy, such as mistral-embed and Google’s (Vertex AI API) gemini-embedding-001, also lead in relevance, indicating a robust and well-rounded semantic understanding.
- The “relevance trap”: An interesting finding is that some models are good at finding semantically related documents, but not necessarily the correct ones. For example, OpenAI’s text-embedding-3-small achieved a respectable relevance score (48.6%) but had one of the lowest accuracy scores (39.2%). This indicates that it identifies the general information area but struggles to pinpoint specific answers.
A high relevance score is a necessary but not sufficient condition for a top-tier retriever. The best models excel at both understanding the topic broadly and identifying the correct answer with high precision.
Embedding models pricing calculator
To help you translate our findings into a practical budget for your own project, use the interactive calculator below to estimate embedding costs based on the number of tokens in your dataset.
Note: Snowflake pricing varies by edition and region. Our benchmark was conducted using Snowflake Standard Edition ($0.10 per million tokens). Pricing for other editions: Enterprise ($0.15), Business Critical ($0.20).
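The underlying arithmetic is straightforward: cost = (tokens / 1,000,000) × the provider’s per-million-token rate. Below is a minimal sketch in Python; the per-million rates are illustrative placeholders (except the Snowflake Standard Edition rate quoted above), so substitute each provider’s current pricing before relying on the numbers.

```python
# Rough embedding-cost estimator: cost = (tokens / 1_000_000) * price_per_million_tokens.
# The rates below are illustrative placeholders, not current list prices.
PRICE_PER_MILLION_TOKENS = {
    "mistral-embed": 0.10,                    # placeholder, USD per 1M tokens
    "voyage-3.5-lite": 0.02,                  # placeholder
    "snowflake-arctic-embed-l-v2.0": 0.10,    # Snowflake Standard Edition rate used in our benchmark
}

def estimate_embedding_cost(num_tokens: int, model: str) -> float:
    """Return the estimated one-time cost (USD) of embedding `num_tokens` with `model`."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Example: a 50-million-token corpus embedded with snowflake-arctic-embed-l-v2.0
print(f"${estimate_embedding_cost(50_000_000, 'snowflake-arctic-embed-l-v2.0'):.2f}")  # $5.00
```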
Understanding the embedding model key characteristics
It is crucial to understand the key technical attributes that define an embedding model’s capabilities and resource requirements.
| Provider | Model | Embedding Dimensions | Max Tokens |
|---|---|---|---|
| Mistral | mistral-embed | 1024 | 8,000 |
| Google (Vertex AI API) | gemini-embedding-001 | 3072 (default), 1536 and 768 | 2,048 |
| Snowflake | snowflake-arctic-embed-l-v2.0 | 1024 | 8,192 |
| Voyage AI | voyage-3.5-lite | 1024 (default), 256, 512 and 2048 | 32,000 |
| Voyage AI | voyage-3-large | 1024 (default), 256, 512 and 2048 | 32,000 |
| Google (Vertex AI API) | text-embedding-005 | 768 (default), 256 and 512 | 2,048 |
| OpenAI | text-embedding-3-large | 3072 (default), 1536 and 1024 | 8,191 |
| Cohere | embed-v4.0 | 1536 (default), 256, 512 and 1024 | 128,000 |
| OpenAI | text-embedding-3-small | 1536 (default), 512 and 1024 | 8,191 |
| Amazon Bedrock | amazon.titan-embed-text-v2:0 | 1024 (default), 512 and 256 | 8,192 |
- Embedding dimensions: Vector size produced by the model. The dimensions listed in our table represent the default or optimal size recommended by the provider for general use. Higher dimensions (e.g., OpenAI’s text-embedding-3-large’s 3072) capture more semantic nuance but require significantly more storage and computational resources. Lower dimensions (e.g., Google text-embedding-005’s 768) are more efficient. Our results demonstrate that larger dimensions don’t automatically improve retrieval accuracy.
- Max tokens: Maximum text sequence length processable in a single pass. A larger context window is advantageous for embedding long documents without chunking. While our document-level approach fits within all models’ limits, this attribute becomes critical when implementing fine-grained chunking strategies with large text segments.
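Several of the models above expose output dimensionality as a request parameter, so you can trade semantic nuance for storage without switching models. Here is a minimal sketch, assuming the official OpenAI Python SDK with an API key in the environment; the `dimensions` parameter applies to OpenAI’s text-embedding-3 models, and other providers offer similar options under different parameter names.

```python
# Minimal sketch: requesting a reduced-dimension embedding via the `dimensions`
# parameter of OpenAI's text-embedding-3 models.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Is there an A&H natural antiperspirant without aluminum or parabens?",
    dimensions=1024,  # instead of the default 3072
)
vector = response.data[0].embedding
print(len(vector))  # 1024
```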
Benchmark methodology of embedding models
Our benchmark provides a fair, transparent, and reproducible evaluation of embedding model performance for RAG.
Test setup & data corpus
- Knowledge corpus: We used a dataset of 494,094 real-world user reviews from the Amazon reviews dataset as the knowledge base.1
- Vector database: We used Qdrant to host all vector collections, which were explicitly configured for cosine similarity search (see the configuration sketch after this list).
- Test queries: We manually curated a set of 100 challenging, real-world questions from an external Amazon Q&A dataset.2 These questions were selected to test sophisticated reasoning, and each had a user-voted “best answer” to serve as our ground truth. To illustrate the nature of these queries, the test set included complex, multi-constraint questions such as:
- “Is there an A&H natural antiperspirant that contains a safe alternative to Aluminum and Paraben?”
- This type of query is particularly challenging as it requires the model to understand multiple constraints simultaneously (Brand: A&H; Attribute: natural; Negative Constraint: no Aluminum/Paraben) and the abstract concept of finding an “alternative.”
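A minimal sketch of the Qdrant configuration referenced above, using the qdrant-client Python package; the collection name, local URL, and the 1024-dimension size (mistral-embed’s native dimension) are illustrative values.

```python
# Minimal sketch: creating one isolated Qdrant collection per embedding model,
# explicitly configured for cosine similarity search.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.create_collection(
    collection_name="amazon_reviews_mistral_embed",  # one dedicated collection per model
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # native dimension
)
```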
Core evaluation principles
- Isolated collections & native dimensions: For each model, we embedded the entire corpus into a dedicated, isolated collection (a sketch of this ingestion follows the list). In line with standard benchmarks like MTEB, we evaluated each model using its native, optimal embedding dimensions.3
- Retrieval granularity: We performed this benchmark at document-level granularity, treating each user review as a single document converted into a single vector. No fine-grained chunking was applied.
- Zero-shot evaluation: The test was conducted in a “zero-shot” framework. This means the models were evaluated on a niche dataset they had not seen during their original training. We did not fine-tune or train any model on our specific dataset or queries.
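A minimal sketch of the ingestion loop behind the “isolated collections” principle. The `embed_batch` function is a hypothetical placeholder for each provider’s embedding API, and the IDs and payloads are illustrative.

```python
# Minimal sketch: embedding the review corpus into a dedicated, isolated
# collection per model. `embed_batch` is a hypothetical placeholder for the
# provider-specific embedding API of the model under test.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")

def embed_batch(model_name: str, texts: list[str]) -> list[list[float]]:
    """Hypothetical wrapper: call the provider's embedding API for `model_name`."""
    raise NotImplementedError("Wire this to the SDK of the model under test.")

def index_corpus(model_name: str, collection_name: str, reviews: list[str],
                 batch_size: int = 128) -> None:
    for start in range(0, len(reviews), batch_size):
        batch = reviews[start:start + batch_size]
        vectors = embed_batch(model_name, batch)
        client.upsert(
            collection_name=collection_name,
            points=[
                PointStruct(id=start + i, vector=vec, payload={"text": text})
                for i, (vec, text) in enumerate(zip(vectors, batch))
            ],
        )
```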
Evaluation metrics: A two-tiered approach
We employed a two-tiered evaluation to distinguish between broad semantic relevance and precise retrieval accuracy. At the core of both metrics is cosine similarity, a standard method for measuring the similarity between two vectors in the embedding space.
Metric 1: The relevance (“Average query similarity” score)
This metric answers: “Does the model understand the general topic of the query?” It measures the broad semantic relevance of the top 5 retrieved documents to the user’s query.
Calculation: For each query, the following steps were taken (sketched in code after the list):
- The query text was converted into a vector using the model being tested.
- A search was performed to retrieve the top 5 documents.
- For each of the five retrieved documents, we calculated the cosine similarity between the query vector and the document’s vector.
- The final score for the query is the average of these five similarity values.
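A minimal sketch of this relevance calculation, assuming the Qdrant setup described in the methodology; because the collections use cosine distance, the score Qdrant returns for each hit is the cosine similarity between the query vector and that document’s vector. `embed_one` is a hypothetical wrapper around the provider’s embedding API.

```python
# Minimal sketch: relevance ("average query similarity") for one query.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def embed_one(model_name: str, text: str) -> list[float]:
    """Hypothetical wrapper around the provider's embedding API for the model under test."""
    raise NotImplementedError

def relevance_score(model_name: str, collection_name: str, query: str, top_k: int = 5) -> float:
    query_vector = embed_one(model_name, query)   # 1) embed the query with the model under test
    hits = client.search(                          # 2) retrieve the top 5 documents
        collection_name=collection_name,
        query_vector=query_vector,
        limit=top_k,
    )
    # 3-4) with cosine distance, hit.score is the query-document cosine similarity;
    # the query's relevance score is the average over the five hits.
    return sum(hit.score for hit in hits) / len(hits)
```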
Metric 2: The accuracy (“Ground-truth similarity” score)
This is our primary and most critical metric. It answers the question: “Can the model find the single best answer and present it to the user first?”
Calculation: For each query, we made a precise comparison (sketched in code after the list):
- The top-ranked document returned by the retriever was identified.
- The pre-defined “ground-truth” answer text was also identified.
- Crucially, both the Rank 1 document text and the ground-truth answer text were converted into vectors using the same model being evaluated.
- The cosine similarity was then calculated between these two resulting vectors. The similarity of documents ranked 2 through 5 was explicitly ignored.
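A minimal sketch of the accuracy calculation, again with `embed_one` as a hypothetical wrapper around the embedding API of the model being evaluated.

```python
# Minimal sketch: accuracy ("ground-truth similarity") for one query. Both the
# Rank 1 document and the ground-truth answer are embedded with the same model
# being evaluated, then compared with cosine similarity; ranks 2-5 are ignored.
import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def embed_one(model_name: str, text: str) -> list[float]:
    """Hypothetical wrapper around the provider's embedding API for the model under test."""
    raise NotImplementedError

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accuracy_score(model_name: str, collection_name: str,
                   query: str, ground_truth: str) -> float:
    query_vector = embed_one(model_name, query)
    top_hit = client.search(collection_name=collection_name,
                            query_vector=query_vector, limit=1)[0]  # Rank 1 document only
    rank1_vector = embed_one(model_name, top_hit.payload["text"])
    truth_vector = embed_one(model_name, ground_truth)
    return cosine_similarity(rank1_vector, truth_vector)
```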
A high score in this metric directly measures a model’s precision and its ability to distinguish the most useful information from a pool of semantically similar documents.
Measurement framework: Cosine similarity
Our evaluation uses cosine similarity, a robust metric for measuring the similarity between two vectors.
Instead of measuring the straight-line distance between vectors, this metric calculates the cosine of the angle between them. In essence, it measures whether the vectors point in the same direction, providing a pure measure of orientation rather than magnitude. The resulting score ranges from 1 to -1:
- 1: The vectors are identical in orientation (maximum semantic similarity).
- 0: The vectors are orthogonal, indicating no semantic relationship.
- -1: The vectors point in opposite directions (opposite meaning).
For our embedding benchmark, this allows us to reliably quantify how semantically similar a retrieved document is to a user’s query or a ground-truth answer. We used this core calculation to build our two primary metrics.
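For reference, the cosine similarity of two embedding vectors A and B is their dot product divided by the product of their magnitudes:

$$
\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
$$

This is the quantity computed by the `cosine_similarity` helper in the accuracy sketch above.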
Limitations of embedding models benchmark
While this benchmark was designed to be objective, it is important to acknowledge its specific scope and limitations. These factors should be considered when interpreting the results:
- Domain specificity: The results are highly specific to the Amazon review dataset used. The performance hierarchy of these models could change when applied to other domains with different linguistic characteristics, such as legal texts, academic papers, or software code. A model that excels at understanding informal, opinion-based review text may not be the optimal choice for a corpus requiring deep technical or formal language comprehension.
- Document-level granularity: Our methodology evaluated models at a “document-level” granularity, treating each full review as a single vector. This approach tests a model’s ability to understand the overall context of a document. It does not, however, measure performance on “fine-grained” retrieval tasks that would require splitting documents into smaller chunks (e.g., paragraphs or sentences). A model’s performance may differ with a different chunking strategy.
Further reading
Explore other RAG benchmarks, such as:
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Hybrid RAG: Boosting RAG Accuracy
- Agentic RAG benchmark: multi-database routing and query generation
💡Conclusion
Based on our evaluation, mistral-embed achieved the highest accuracy (77.8%), making it the top choice for scenarios where retrieval precision is paramount, even at a moderate cost.
For cost-conscious implementations, voyage-3.5-lite emerges as the optimal choice for production RAG systems, delivering an excellent accuracy-cost balance with solid performance (66.1%) at one of the lowest price points.
Google’s (Vertex AI API) gemini-embedding-001 provides another high-accuracy option (71.5%), suitable for accuracy-critical applications where premium pricing is acceptable.
For organizations already within the Snowflake ecosystem, Snowflake (Cortex AI Functions) snowflake-arctic-embed-l-v2.0 offers competitive accuracy (66.6%) at a moderate cost.
Key findings about embedding model selection:
- Higher dimensions don’t guarantee better performance
- Premium pricing doesn’t correlate with superior accuracy
- Domain-specific benchmarking is essential for embedding model selection
Reference Links
