
Multimodal Embedding Models: Apple vs Meta vs OpenAI

Cem Dilmegani
updated on Dec 1, 2025

Multimodal embedding models excel at identifying objects but struggle with relationships: current models have difficulty distinguishing “phone on a map” from “map on a phone.” We benchmarked 7 leading models on MS-COCO and Winoground to measure this specific limitation.

To ensure a fair comparison, we evaluated every model under identical conditions using NVIDIA A40 hardware and bfloat16 precision. This deterministic setup reveals which models actually understand scene structure and which ones are simply sophisticated keyword matchers.

Multimodal embedding models benchmark results


Metrics explained

  • T2I R@1 (Text-to-Image recall@1): Given a caption, can the model rank the correct image as number one among 5,000 candidates? This is the hardest retrieval metric because there is no partial credit for ranking second.
  • I2T R@1 (Image-to-Text recall@1): Given an image, can the model rank any of the five ground-truth captions as number one among 25,000? Scores are roughly 20 percentage points higher than T2I because there are five valid answers instead of one.
  • Winoground image: Given two images and two captions that differ only in structure (“phone on a map” versus “map on a phone”), can the model correctly match both pairs? Random chance is 25 percent.
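To make these definitions concrete, here is a minimal sketch of how both recall numbers fall out of a caption-by-image similarity matrix. This is not our evaluation harness; tensor names and shapes are illustrative.

```python
import torch

def t2i_recall_at_1(sim: torch.Tensor, gt_image: torch.Tensor) -> float:
    """Text-to-image R@1.
    sim:      [25_000, 5_000] caption-to-image cosine similarities
    gt_image: [25_000] index of the single correct image for each caption
    """
    top1 = sim.argmax(dim=1)                 # best-ranked image per caption
    return (top1 == gt_image).float().mean().item()

def i2t_recall_at_1(sim: torch.Tensor, gt_captions: list[set[int]]) -> float:
    """Image-to-text R@1.
    gt_captions[i] holds the five valid caption indices for image i,
    which is why I2T is the easier direction: any of the five counts as a hit.
    """
    top1 = sim.t().argmax(dim=1)             # best-ranked caption per image
    hits = [int(top1[i].item() in gt_captions[i]) for i in range(len(gt_captions))]
    return sum(hits) / len(hits)
```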

Key findings

  • Apple DFN5B-H achieves the highest retrieval accuracy (50.1 percent T2I R@1) and the highest compositional reasoning score (35.2 percent on Winoground).
  • Compositional reasoning remains poor across all models. Even Apple’s 35.2 percent performance barely exceeds the 25 percent random baseline.
  • OpenAI CLIP shows its age, trailing modern models by 10 to 16 percentage points despite having a similar architecture.

Note: I2T scores are approximately 20 percentage points higher than T2I due to a protocol artifact. Each image has five valid captions, while each caption maps to only one valid image. See the methodology section for details.

How multimodal embedding models work

Before diving into the benchmark details, it is essential to understand what these models actually do and where they break down.

The core mechanism

A multimodal embedding model converts both images and text into numerical vectors, which are lists of numbers that occupy the same geometric space. Similar concepts cluster together, while dissimilar concepts are farther apart.

To search, you compute which image vector is closest to your text vector. This is why embedding-based search is fast: you are comparing numbers, not “understanding” meaning in a human sense.
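A toy sketch of that search step, using hypothetical 4-dimensional vectors in place of the 512-to-1,280-dimensional embeddings real models produce:

```python
import numpy as np

# Hypothetical embeddings: three images and one text query in a shared space.
image_vectors = np.array([
    [0.90, 0.10, 0.05, 0.10],   # motorcycle under an overhang
    [0.10, 0.85, 0.20, 0.05],   # dog on a beach
    [0.05, 0.15, 0.90, 0.10],   # bowl of fruit on a table
])
text_vector = np.array([0.85, 0.15, 0.05, 0.10])  # query: "a parked motorcycle"

def normalize(x):
    """L2-normalize so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(image_vectors) @ normalize(text_vector)
best = int(scores.argmax())
print(best, scores.round(3))   # index 0 wins: the closest vector, no scene "understanding" required
```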

Where it breaks down

Watch what happens with two compositionally different captions: “there’s a phone on a map” versus “there’s a map on a phone.”

The resulting vectors are nearly identical. Both captions contain the same concepts: {phone, map, on}. The model encodes what is present but loses how things relate.

This is the bag-of-words problem. The model sees the same “ingredients” and produces similar embeddings, even though the scenes are completely different. In one, the phone is on top. In the other, the map is. The relational structure vanishes during encoding.
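You can reproduce this effect with an off-the-shelf CLIP checkpoint. The sketch below uses the OpenAI CLIP-L text encoder via Hugging Face transformers; the exact similarity value varies by model, but swapped captions typically land very close together.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model_id = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_id).eval()
tokenizer = CLIPTokenizer.from_pretrained(model_id)

captions = ["there's a phone on a map", "there's a map on a phone"]
inputs = tokenizer(captions, padding=True, return_tensors="pt")

with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)       # unit-length vectors

print(f"cosine similarity: {(emb[0] @ emb[1]).item():.3f}")
# Typically well above 0.9: the swap barely moves the embedding,
# which is exactly the bag-of-words failure described above.
```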

Evaluation tasks: Retrieval vs reasoning

MS-COCO: Finding a needle in a haystack

The Setup:
A gallery of 5,000 images contains clusters of similar content, including hundreds of outdoor scenes, dozens of vehicles, and numerous storage areas and structures. Each image has five different captions written by different annotators, for a total of 25,000 captions.

The Query: “A motorcycle parked under a wooden structure with other items.”

The image:

The same image might also be described as:

  • “Black motorcycle sitting underneath an overhang outdoors.”
  • “Motorcycle parked under covered area in fenced yard.”

Each caption is tested separately, and the model must find the correct image regardless of how it is phrased.

The task:
Find the single specific image that matches. Not any motorcycle, not any wooden structure, but this exact scene among 5,000 candidates.

The metric: Recall@1
Binary and unforgiving. Correct image ranked #1 = Hit. Ranked #2 = Miss. No partial credit.

Winoground: Understanding who did what to whom

The Setup:
400 adversarial pairs. Each contains 2 images and 2 captions differing only in compositional structure.

The Query:

  • Caption A: “there’s a phone on a map”
  • Caption B: “there’s a map on a phone”

Both captions contain the exact same concepts: {phone, map, on}. The only difference is which object is on top of which.

The image:

The Task:
Match both captions to their correct images simultaneously. Caption A must match Image A (phone resting on map), and Caption B must match Image B (map displayed on phone). No partial credit: getting only one right counts as failure.

The Metric: Image Score
Binary and unforgiving. Both pairs matched correctly = Hit. One or zero correct = Miss. Random chance is 25%.
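A minimal sketch of this scoring rule for a single Winoground example; the similarity values are whatever your embedding model assigns to each caption-image pair.

```python
def winoground_image_score(s_c0_i0: float, s_c0_i1: float,
                           s_c1_i0: float, s_c1_i1: float) -> bool:
    """Image score for one example: each caption must rate its own image
    higher than the other image. Both conditions must hold simultaneously."""
    caption_a_correct = s_c0_i0 > s_c0_i1   # "phone on a map" prefers Image A
    caption_b_correct = s_c1_i1 > s_c1_i0   # "map on a phone" prefers Image B
    return caption_a_correct and caption_b_correct

# Benchmark accuracy is the fraction of the 400 examples scored True;
# a model guessing at random passes both checks 25% of the time.
```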

More examples from Winoground:

Why models fail at composition

The low Winoground scores (30-40% vs. 25% random baseline) indicate that current models struggle with this specific type of compositional reasoning. However, several caveats apply:

  • Small sample size: Winoground contains only 400 examples, yielding approximately ±5 percentage point confidence intervals. This makes it useful as an indicator but not definitive proof of compositional capabilities.
  • Specific but diverse task scope: Winoground tests multiple types of compositional reasoning including spatial relationships (on/above/below), agent-patient swaps (who does what to whom), attribute binding (color/size assignments), quantifiers (more/less, counting), action coordination (sits/stands), temporal ordering (before/after), negation (with/without), and scope ambiguity. This diversity makes Winoground an effective probe of compositional understanding across multiple linguistic phenomena.

Technical analysis & deployment recommendations

Data quality beats model scale

Apple, LAION, and MetaCLIP all use the same ViT-H/14 backbone (630M parameters).

Apple’s +3.8pp advantage appears to stem primarily from its Data Filtering Network (DFN) approach.

  • Automated Curation: Rather than just using synthetic captions, Apple trained a teacher model to aggressively filter the training data. The model learned to identify and discard noisy image-text pairs from the massive web pool.
  • The implication: At the frontier, improvements come from curation quality (picking the right data) rather than raw scale or bigger architectures.

Understanding the 50% performance level

MS-COCO was designed with distinct, curated images where each caption describes a specific scene. While minor ambiguities exist (e.g., two similar parking lot scenes), the dataset creators intentionally selected visually distinguishable images.

The 50% accuracy reflects models genuinely failing to rank the correct image first, not unfair penalization for selecting equally valid alternatives.

Why OpenAI CLIP trails by 10-16pp

OpenAI’s CLIP-L (2021) scores 34.4% T2I R@1, while modern models using similar ViT architectures achieve 44-50%. This 10-16 percentage point gap reflects three years of progress.

While core architectural principles remained similar (vision transformers with contrastive learning), modern models doubled in size. However, most performance gains came from improved data curation and training techniques rather than architectural innovation alone.

ColPali: Trading speed for architectural flexibility

ColPali represents a different architectural approach: instead of encoding each image into a single vector, it produces 1,030 patch embeddings using late interaction. This design choice creates several tradeoffs:

Advantages:

  • More symmetric retrieval: ColPali shows only a 3.9pp gap between I2T (48.8%) and T2I (44.9%), compared to 16-24pp gaps in dense models. This suggests it encodes image structure more uniformly.
  • Architectural flexibility: Late interaction allows fine-grained matching between text tokens and image patches, which may benefit specialized domains.

Disadvantages:

  • Storage overhead: Each image requires 1,030 vectors instead of 1, increasing index size by ~1000×.
  • Lower overall performance: ColPali ranks 4th in our benchmark (44.9% T2I), trailing the top dense models by 5.2pp (vs. Apple DFN5B-H at 50.1%).

Computational cost: Requires 8× smaller batch sizes (4 vs. 32) due to memory overhead from the 1,030 embeddings per image. This translates to slower indexing and higher serving costs at scale.
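The scoring difference behind these tradeoffs looks roughly like this: a sketch of ColBERT-style MaxSim late interaction, which ColPali builds on, next to a single-vector dot product. Shapes are illustrative.

```python
import torch

def late_interaction_score(text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
    """MaxSim: each text token picks its best-matching image patch,
    and the per-token maxima are summed into one relevance score.
    text_tokens:   [num_query_tokens, dim]
    image_patches: [num_patches, dim], e.g. ~1,030 patches per image
    """
    sim = text_tokens @ image_patches.T          # [tokens, patches]
    return sim.max(dim=1).values.sum()

def dense_score(text_vec: torch.Tensor, image_vec: torch.Tensor) -> torch.Tensor:
    """CLIP-style dense scoring: one dot product per image."""
    return text_vec @ image_vec
```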

Which model should you use?

Based on the results above, Apple DFN5B-H is the strongest general-purpose choice, leading on both retrieval accuracy and compositional reasoning. ColPali is worth considering when fine-grained text-to-patch matching matters more than storage and serving cost, while OpenAI CLIP-L trails modern alternatives by 10-16pp.

Methodology

Hardware & software

  • GPU: NVIDIA A40 (48GB VRAM)
  • Precision: bfloat16
  • Framework: PyTorch 2.4.0, CUDA 12.1
  • Libraries: transformers==4.44.0, datasets==2.20.0

Models evaluated

We used the following model weights from the Hugging Face Hub. All models were loaded in bfloat16 precision directly from these repositories without modification.

Inference protocol

Dense models (CLIP/SigLIP) were evaluated with batch size 32, since a single vector per image allows high parallelism. ColPali used batch size 4, as its 1,030 patch embeddings per image require significantly more memory.
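For illustration, a stripped-down version of the dense encoding loop under these settings (bfloat16 weights, batch size 32). The checkpoint and image paths are placeholders, not our exact harness.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"       # placeholder checkpoint
model = CLIPModel.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda").eval()
processor = CLIPProcessor.from_pretrained(model_id)

def encode_images(paths: list[str], batch_size: int = 32) -> torch.Tensor:
    """Encode images in batches of 32 and return L2-normalized feature vectors."""
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt").to("cuda")
        with torch.no_grad():
            f = model.get_image_features(pixel_values=inputs.pixel_values.to(torch.bfloat16))
        feats.append(torch.nn.functional.normalize(f, dim=-1).float().cpu())
    return torch.cat(feats)
```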

Evaluation protocol

  • Zero-Shot: Models evaluated out-of-the-box using Hugging Face weights. No fine-tuning.
  • Deterministic: Random seed fixed to 42. Same dataset order for all models.
  • Standard Splits: yerevann/coco-karpathy test (5,000 images), facebook/winoground validation.
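A minimal sketch of that setup, using the split identifiers listed above; facebook/winoground is gated on the Hub, so an access token may be required.

```python
import random

import numpy as np
import torch
from datasets import load_dataset

SEED = 42                         # fixed seed, same dataset order for every model
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

coco_test = load_dataset("yerevann/coco-karpathy", split="test")        # 5,000 images
winoground = load_dataset("facebook/winoground", split="validation")    # split name as listed above; adjust if the Hub card differs

print(len(coco_test), len(winoground))
```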

The I2T vs. T2I gap

I2T scores are consistently ~20pp higher than T2I because of the evaluation protocol, not model error.

  • T2I (Text-to-Image): The model must find 1 specific image among 5,000. (Target pool = 1).
  • I2T (Image-to-Text): The model can match any of the 5 valid captions associated with that image. (Target pool = 5).

Because the I2T task offers five distinct ‘correct’ answers for every query, the success rate is naturally inflated compared to the strict one-to-one mapping required in T2I.

Limitations

Winoground sample size

400 samples yield ~±5pp confidence intervals at 35% accuracy. Results are indicative, not definitive. Larger benchmarks (ARO, SugarCrepe) exist but require different infrastructure.

Zero-Shot only

No domain fine-tuning. Medical, legal, or satellite applications could see 5-10pp improvements with domain-specific training.

Dataset limitations

MS-COCO and Winoground test specific aspects of multimodal understanding. Performance on these benchmarks doesn’t guarantee similar results on domain-specific tasks or other compositional reasoning tests.

Conclusion

Current multimodal embedding models are good at object recognition but struggle with compositional reasoning.

For standard retrieval (“find photos of motorcycles”), any top-3 model works well. For relational queries (“phone on a map” vs. “map on a phone”), expect 30-40% accuracy at best.

Based on our findings and current research trends, several approaches may improve performance:

  • Data quality over scale: Apple’s +3.8pp advantage using the same ViT-H architecture suggests that training data curation contributes significantly, though this is based on a single comparison.
  • Compositional training data: Including hard negatives with relational variations during training could theoretically improve compositional sensitivity, though this remains largely untested at scale.
  • Hybrid architectures: Two-stage pipelines (dense retrieval → late-interaction reranking) combine speed with precision, though our benchmark shows this doesn’t yet outperform dense models on these tasks.
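As a sketch of the hybrid idea in the last bullet, reusing the dense and late-interaction scorers described earlier (function names and shapes here are hypothetical):

```python
import torch

def hybrid_search(query_vec, query_tokens, image_vecs, image_patch_embs, k: int = 100):
    """Stage 1: cheap dense retrieval narrows 5,000 images to the top-k.
    Stage 2: late-interaction MaxSim re-scores only those k candidates.

    query_vec:        [dim]            single dense query vector
    query_tokens:     [tokens, dim]    per-token query embeddings
    image_vecs:       [N, dim]         one dense vector per image
    image_patch_embs: list of [patches, dim] tensors, one per image
    """
    dense_scores = image_vecs @ query_vec                    # [N]
    candidates = dense_scores.topk(k).indices                # stage 1: recall-oriented

    def maxsim(patches: torch.Tensor) -> torch.Tensor:
        return (query_tokens @ patches.T).max(dim=1).values.sum()

    rerank = torch.stack([maxsim(image_patch_embs[i]) for i in candidates.tolist()])
    return candidates[rerank.argsort(descending=True)]       # stage 2: precision-oriented
```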

Until training paradigms change, compositional understanding remains an open frontier.

Further reading

Explore our other RAG benchmarks.

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.
