We benchmarked 8 leading multimodal AI models on visual reasoning using 98 image-based questions. The evaluation consisted of two tracks: 70 Chart Understanding questions testing data visualization interpretation, and 28 Visual Logic questions assessing pattern recognition and spatial reasoning.
Models tested include GPT-5, Gemini 2.5 Pro and Flash, Claude Sonnet 4.5 and Claude Haiku 4.5, Grok-4-Fast, Qwen3-VL-8B-Thinking, and Llama 4 Maverick.
Visual reasoning benchmark
- Gemini 2.5 Pro and GPT-5 achieved the highest overall accuracy, at 65% and 63% respectively, but demonstrated opposite strengths:
- GPT-5 scored 68% on Visual Logic versus 61% on Chart Understanding, while Gemini 2.5 Pro achieved 67% on Chart Understanding but 61% on Visual Logic.
- These opposing gaps (7 points for GPT-5, 6 points for Gemini 2.5 Pro) suggest task-specific optimization in their training approaches.
- Qwen3-VL-8B-Thinking maintained consistent performance across both categories with 47% on Chart Understanding and 46% on Visual Logic, showing no clear specialization.
- Claude Sonnet 4.5 scored 47% on charts but dropped to 36% on logic tasks, while Gemini 2.5 Flash performed similarly at 44% and 36% respectively.
- Grok-4-Fast was roughly balanced at 37% and 39%, while Claude Haiku 4.5 reversed the pattern with 33% on charts and 39% on logic.
- Llama 4 Maverick displayed the widest performance gap:
- 43% on Chart Understanding but only 18% on Visual Logic.
- This 25-point difference indicates strong numerical data extraction capabilities but significant weakness in abstract pattern recognition and spatial reasoning tasks.
See our benchmark methodology for details of our testing procedures.
What is visual reasoning?
Visual reasoning refers to the ability of a model to interpret an image, connect its elements, extract structure and associations, and produce a correct answer to a visual reasoning question. This capability brings together computer vision and language analysis, enabling an LLM to handle multimodal tasks that require understanding both visual and textual inputs.
Recent research presents multiple frameworks showing that LLMs can interpret objects, edges, positions, and patterns in an image, and then explain how they arrived at an answer through a reasoning process.
For example, the Cola framework organizes multiple vision-language models to answer visual reasoning questions by asking each model to provide captions and plausible answers, then letting an LLM select the correct answer.
Two approaches are presented: Cola-FT, which is instruction-tuned, and Cola-Zero, which relies on in-context learning without extra training. The framework focuses on the reasoning process, using structured prompts to coordinate models and analyze their contributions.
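To make the coordination idea concrete, here is a minimal, hypothetical sketch of a Cola-Zero-style step: each vision-language model contributes a caption and a plausible answer, and those outputs are assembled into a structured prompt for a coordinator LLM. The stub functions and canned outputs are illustrative stand-ins, not the paper's implementation.

```python
# Sketch of a Cola-Zero-style coordination step (hypothetical stubs, not the paper's code).
# Each vision-language model contributes a caption and a plausible answer;
# a coordinator LLM then picks the answer that best fits all the evidence.

from dataclasses import dataclass

@dataclass
class VLMOutput:
    model_name: str
    caption: str
    answer: str

def query_vlm(model_name: str, image_path: str, question: str) -> VLMOutput:
    """Stand-in for a real VLM call; returns canned output for illustration."""
    canned = {
        "vlm_a": VLMOutput("vlm_a", "People paddling a raft on a river.", "rafting"),
        "vlm_b": VLMOutput("vlm_b", "A group in a boat surrounded by water.", "boating"),
    }
    return canned[model_name]

def build_coordinator_prompt(question: str, outputs: list[VLMOutput]) -> str:
    """Assemble the structured prompt the coordinator LLM would receive."""
    lines = [f"Question: {question}"]
    for out in outputs:
        lines.append(f"{out.model_name} caption: {out.caption}")
        lines.append(f"{out.model_name} answer: {out.answer}")
    lines.append("Considering all captions and answers, give the single best answer.")
    return "\n".join(lines)

if __name__ == "__main__":
    question = "What are the people in the middle of the river doing?"
    outputs = [query_vlm(m, "river.png", question) for m in ("vlm_a", "vlm_b")]
    # The assembled prompt would then be sent to the coordinator LLM,
    # which selects or synthesizes the final answer.
    print(build_coordinator_prompt(question, outputs))
```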
Figure 1: Graph showing how Cola leverages a coordinative language model for visual reasoning.1
Another example is the CVR-LLM framework, which improves reasoning by converting images into context-aware descriptions using the CaID method and selecting relevant examples with the CVR-ICL procedure. This framework treats image information as text-based representations, enabling the LLM to analyze associations more effectively across various types of multimodal tasks.2
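The two stages of CVR-LLM can be sketched in the same spirit: produce a question-aware text description of the image, select relevant demonstrations, and let a text-only LLM reason over the result. The function names below are hypothetical stand-ins for CaID and CVR-ICL, not the authors' code.

```python
# Rough sketch of the CVR-LLM idea: the image is turned into a question-aware
# text description, relevant demonstrations are retrieved, and a text-only LLM
# reasons over the result. All functions here are hypothetical stand-ins.

def describe_image_for_question(image_path: str, question: str) -> str:
    """Stand-in for CaID-style captioning conditioned on the question."""
    return "A bar chart comparing quarterly revenue across four regions."

def select_demonstrations(description: str, question: str, k: int = 2) -> list[str]:
    """Stand-in for CVR-ICL-style example selection from a pool."""
    return [
        "Q: Which region grew fastest? Description: ... A: East",
        "Q: Which quarter had the lowest total? Description: ... A: Q1",
    ][:k]

def build_prompt(image_path: str, question: str) -> str:
    description = describe_image_for_question(image_path, question)
    demos = select_demonstrations(description, question)
    return "\n\n".join(demos + [f"Q: {question} Description: {description} A:"])

print(build_prompt("chart.png", "Which region had the highest Q3 revenue?"))
```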
How visual reasoning works in LLMs
LLMs do not perceive images directly. They rely on vision models that convert images into representations the LLM can understand. The mechanisms below summarize how recent models achieve visual reasoning.
Image interpretation and representation
Models first extract structured visual information using a vision encoder. The vision encoder identifies objects, textures, lines, edges, and regions such as the lower or upper part of an image, along with the relations between them. This step is similar to classic image recognition, but the output format is tailored for language models.
LLMs then receive this representation along with the question. The model must combine these inputs to build a reasoning chain. For example, when asked what the people in the middle of a river are doing, the LLM must connect the appearance of the people, the positions of surrounding objects, and the context clues provided.
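A toy example of this encode-project-combine pattern is shown below, with made-up dimensions and random arrays standing in for a real vision encoder and LLM; it illustrates the typical pipeline shape rather than any particular model's architecture.

```python
# Toy illustration of the usual pipeline: a vision encoder turns the image into
# patch embeddings, a projection maps them into the LLM's embedding space, and
# the result is prepended to the text tokens. Dimensions are made up.

import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image: np.ndarray, num_patches: int = 16, dim: int = 64) -> np.ndarray:
    """Stand-in for a ViT-style encoder: one embedding per image patch."""
    return rng.normal(size=(num_patches, dim))

def project_to_llm_space(patch_embeddings: np.ndarray, llm_dim: int = 128) -> np.ndarray:
    """Linear projection ("adapter") from vision space into the LLM's token space."""
    W = rng.normal(size=(patch_embeddings.shape[1], llm_dim))
    return patch_embeddings @ W

image = rng.normal(size=(224, 224, 3))          # fake image tensor
visual_tokens = project_to_llm_space(vision_encoder(image))
text_tokens = rng.normal(size=(12, 128))        # fake embedded question tokens

# The LLM sees one sequence: visual tokens followed by the question tokens.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (28, 128)
```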
Coordination or refinement
Recent research suggests two types of mechanisms for solving complex visual scenarios.
- In the coordination form, an LLM serves as a central hub that integrates outputs from multiple vision models. Each model may focus on different patterns, and the LLM merges these descriptions to reach a final answer. This mechanism allows the LLM to cross-check plausible answers and choose the one that best fits the question and the visual evidence.
- In the refinement form, the LLM guides the image captioner through several steps. The LLM may point out missing information or ask clarifying questions. This feedback loop reprocesses the image, aligning the caption with the multimodal reasoning process. The refined description helps the LLM detect symbols, signs, or associations that the initial caption may have overlooked.
Both mechanisms address limitations found in existing methods where a single model often fails to analyze complex scenarios.
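The refinement form can be sketched as a simple caption-critique cycle. The captioner and critique functions below are hypothetical stand-ins for a real VLM and LLM; the point is the feedback loop, not any specific system.

```python
# A minimal sketch of the refinement loop, assuming hypothetical caption and
# critique functions: the LLM inspects the caption, asks for missing details,
# and the captioner re-describes the image until no gaps remain.

def caption_image(image_path: str, focus: str | None = None) -> str:
    """Stand-in captioner; a real system would re-run a VLM with the new focus."""
    if focus is None:
        return "Several people are in the middle of a river."
    return "Several people in helmets are paddling an inflatable raft through rapids."

def find_missing_detail(caption: str, question: str) -> str | None:
    """Stand-in for the LLM's critique step: return a follow-up focus or None."""
    if "paddling" not in caption:
        return "What are the people holding and doing with their hands?"
    return None

def refine(image_path: str, question: str, max_rounds: int = 3) -> str:
    caption = caption_image(image_path)
    for _ in range(max_rounds):
        follow_up = find_missing_detail(caption, question)
        if follow_up is None:
            break
        caption = caption_image(image_path, focus=follow_up)
    return caption

print(refine("river.png", "What activity is shown?"))
```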
In-context learning for multimodal reasoning
Some frameworks extend the process by retrieving similar examples from training data. The LLM receives a set of demonstrations that match the structure of the target question. These examples provide the model with a template for interpreting similar diagrams or images.
For instance, when a visual reasoning benchmark asks the model to select the correct water-related figure, in-context examples help it understand the underlying task and follow the same reasoning structure.
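A minimal sketch of this retrieval step follows, using toy embeddings and cosine similarity to pick the demonstrations that most resemble the target question; a real system would use a learned embedding model and a much larger example pool.

```python
# Sketch of similarity-based demonstration retrieval, with made-up embeddings:
# the question is embedded, the closest stored examples are selected, and they
# are placed before the target question in the prompt.

import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Stand-in embedding function (random, but fixed per text within a run)."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    v = local.normal(size=dim)
    return v / np.linalg.norm(v)

example_pool = [
    "Which bar is tallest in the chart? Answer: C",
    "Which shape completes the pattern? Answer: B",
    "How many cubes are hidden in the stack? Answer: D",
]

def retrieve_examples(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    scored = sorted(example_pool, key=lambda ex: float(q @ embed(ex)), reverse=True)
    return scored[:k]

question = "Which figure correctly shows the water level after tilting the glass?"
prompt = "\n\n".join(retrieve_examples(question) + [f"{question} Answer:"])
print(prompt)
```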
Producing the final explanation
The LLM then produces an answer supported by a reasoning process. This explanation helps users understand how the model interpreted the image, which part of the image it relied on, and the associations it used.
Business applications of visual reasoning in LLMs
LLMs with visual capabilities can support multiple business scenarios. These applications depend on the model’s ability to analyze images, link them with text data, and produce reliable insights.
Document and content analysis
Businesses handle diagrams, engineering drawings, scientific journal figures, and various forms of visual data. A visual reasoning model can:
- Detect missing or incorrect elements.
- Identify objects or signs in specific regions of a diagram, such as the lower part or corners.
- Connect text and image segments for quality checks.
- Extract structured information for downstream processing or reporting.
This is especially useful for industries where compliance documents include complex image-based sections.
Quality inspection and operations
In manufacturing and logistics, models can inspect products or packages. Visual reasoning helps detect defects, misalignments, or unusual patterns. The model can compare images against a reference and generate an explanation of what changed or what is missing.
Retail and eCommerce
Models can analyze product images, identify key attributes, and match them to catalog data. They can also identify inconsistencies between the written description and the image. This enhances product classification and reduces manual work.
Security and monitoring
Visual reasoning supports video and image inspection tasks. For example, a model can analyze the sequence of frames, find unusual associations between objects, or detect scenarios that require attention. The ability to explain its reasoning improves reliability for high-stakes environments.
Marketing and user experience
Visual reasoning helps teams understand how users interact with digital content. A model can evaluate screenshots or creatives and provide insights about layout, object placement, and potential issues. This is especially relevant when assessing different categories of visual assets.
Comparative landscape: major players and their approaches
Chance AI
Chance AI developed one of the first commercially available tools centered on vision-first understanding. Its visual reasoning system analyzes images from cultural, historical, functional, and aesthetic perspectives. Instead of producing a short label, it offers structured insights that explain why the object, figure, or scene matters. For example, when analyzing an artwork, the system may describe not only what is depicted but also the style, symbolism, and historical context.
This design focuses on user experience, allowing individuals to explore meaning through images without typing queries. In doing so, it moves away from traditional computer vision and toward reasoning that integrates storytelling, interpretation, and human-like explanation. It is especially relevant for creative industries, education, and tourism, where understanding context adds value beyond recognition.3
Meta AI
Meta’s UniBench framework introduced a unified method for evaluating visual reasoning. The framework integrates over fifty benchmarks covering spatial understanding, compositional reasoning, and counting. Through its evaluation of nearly sixty vision-language models, Meta’s research revealed that scaling up data and model parameters improves perception but not reasoning. Even the most advanced models struggled with simple tasks such as recognizing digits or counting objects.
This insight reshaped how progress in visual reasoning is measured. The findings suggest that improvements require better data quality, targeted objectives, and structured learning, rather than simply larger models. For businesses, UniBench provides a transparent way to assess reasoning performance across various types of multimodal tasks, helping them compare tools before deployment.4
Figure 2: The graph shows the median performance of 59 VLMs on 53 benchmarks, revealing that, despite progress, many models still perform near-chance level, particularly on tasks like Winoground, iNaturalist, DSPR, and others (blue: zero-shot median; grey: chance level).5
OpenAI
OpenAI advanced the field further with the o3 and o4-mini models, which introduced the ability to think with images. These models integrate image manipulation directly into their reasoning process. During analysis, they may zoom in, crop, or rotate an image to focus on the lower part of a figure or examine specific patterns. This dynamic exploration reflects how humans adjust their visual attention when analyzing diagrams or drawings.
The models were tested on diverse multimodal benchmarks, including chart interpretation, visual problem-solving, and mathematical reasoning. Results showed substantial improvements in both accuracy and contextual understanding (see Figure 3). However, the experiments also revealed limitations, including inconsistent reasoning paths and occasional perceptual errors. These findings highlight the challenge of achieving reliability and consistency in visual reasoning models.
Figure 3: The graph shows the results of all models evaluated under high “reasoning effort” settings.6
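The general "think with images" pattern can be illustrated as a tool loop in which the model requests crop, zoom, or rotate operations between reasoning steps. The sketch below is a generic illustration of that pattern, not OpenAI's implementation; the decision function is a hypothetical stand-in for the model's tool-calling behavior.

```python
# Illustrative sketch of the "think with images" pattern: between reasoning
# steps the model can request crop/zoom/rotate operations on the working image.
# decide_next_action is a hypothetical stand-in for the model's choices.

from PIL import Image

def crop(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    return img.crop(box)

def zoom(img: Image.Image, factor: int) -> Image.Image:
    return img.resize((img.width * factor, img.height * factor))

def rotate(img: Image.Image, degrees: float) -> Image.Image:
    return img.rotate(degrees, expand=True)

TOOLS = {"crop": crop, "zoom": zoom, "rotate": rotate}

def decide_next_action(step: int):
    """Stand-in for the model choosing a tool; a real model would emit tool calls."""
    plan = [("crop", ((0, 100, 200, 200),)),   # focus on the lower part of the figure
            ("zoom", (2,)),
            None]                               # None means: answer now
    return plan[step]

working_image = Image.new("RGB", (200, 200), "white")  # placeholder image
for step in range(3):
    action = decide_next_action(step)
    if action is None:
        break
    name, args = action
    working_image = TOOLS[name](working_image, *args)
    print(f"step {step}: applied {name}, image is now {working_image.size}")
```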
Academic and open research efforts
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
This paper introduces VisuLogic, a benchmark for evaluating the performance of multimodal models on visual reasoning tasks. It combines over fifty datasets covering various types of reasoning, including spatial relations, compositional logic, and object counting.
The authors analyze dozens of existing models and find that increasing size or data scale improves image recognition but not reasoning. Models often detect patterns without understanding relationships among objects. The paper emphasizes that reasoning-specific training, better data quality, and detailed evaluation are essential for meaningful progress.
VisuLogic offers a unified framework that helps researchers and enterprises analyze reasoning capabilities rather than relying solely on perception metrics, making it a valuable resource for assessing multimodal reasoning systems.7
Explain Before You Answer: A Survey on Compositional Visual Reasoning
This survey reviews current approaches to compositional visual reasoning, focusing on how models combine visual and textual cues to reach a correct answer. It identifies weaknesses in existing methods that rely on recognition rather than structured reasoning.
The authors propose training models to explain before answering, ensuring that each reasoning process is transparent and interpretable. They discuss techniques for aligning visual and linguistic representations so that models can better understand diagrams, figures, and object associations.
The paper concludes that aligned and explainable reasoning enhances reliability and interpretability in multimodal tasks. It highlights that the future of visual reasoning research depends on integrating explanation-based learning into model design.8
Challenges and ethical considerations
The progress in visual reasoning also brings technical and ethical challenges that researchers and companies must address.
- Reliability remains a central concern. Models can generate different results when an image is slightly rotated or when parts of the visual field are obscured or missing.
- Bias and interpretation issues emerge because models may reflect cultural assumptions embedded in their training data.
- Explainability is critical for building trust. Users need clear visualizations or diagrams showing how the reasoning developed.
- Privacy and security concerns arise from using large image datasets that may include sensitive or proprietary information.
Benchmark methodology
All models were evaluated via the OpenRouter API with standardized parameters: temperature set to 0 for deterministic responses and max tokens set to 10,000. Models were instructed to respond with only a single letter (A-E) without explanation, though some models still provided detailed reasoning, which we parsed to extract final answers. Evaluation ran in parallel across all models.
The benchmark consisted of 98 questions split into two categories: Chart Understanding (70 questions) covering bar charts, line graphs, scatter plots, and complex data visualizations, and Visual Logic (28 questions) testing pattern recognition, spatial reasoning, and mathematical visual logic. All questions were presented in multiple-choice format with five options (A-E), requiring models to analyze images and select the correct answer.
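As an illustration, the snippet below shows how a single question could be scored under these settings through OpenRouter's OpenAI-compatible API. The model slug, file path, and answer-parsing rule are simplified placeholders rather than the exact harness we used.

```python
# Simplified sketch of scoring one benchmark question via OpenRouter's
# OpenAI-compatible API with the stated settings (temperature 0, max 10,000
# tokens, single-letter answers). Model slug and file path are placeholders;
# the real question set and parallelization are not reproduced here.

import base64
import os
import re

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask(model: str, image_path: str, question: str) -> str | None:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        max_tokens=10_000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question}\nRespond with only a single letter (A-E), no explanation."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    text = response.choices[0].message.content or ""
    # Some models still explain themselves, so take the last standalone A-E letter.
    letters = re.findall(r"\b([A-E])\b", text)
    return letters[-1] if letters else None

answer = ask("openai/gpt-5", "charts/q01.png", "Which series peaks in Q3?")
print("parsed answer:", answer)
```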
Questions:
1. Chart Understanding (70 Questions)
We evaluated models on their ability to extract, interpret, and analyze information from various data visualizations:
- Bar Charts: Horizontal and vertical configurations, stacked and grouped formats
- Line Graphs: Single and multi-series trends, time-series data
- Scatter Plots: Correlation analysis, pattern identification with labeled axes
- Pie Charts: Percentage distributions and proportional reasoning
- Complex Visualizations: Combination charts, dual-axis graphs, and multi-panel displays
2. Visual Logic (28 Questions)
We assessed abstract reasoning and spatial intelligence through:
- Pattern Recognition: Identifying sequences and completing visual patterns
- Spatial Reasoning: 3D visualization, cube nets, and geometric transformations
- Mathematical Logic: Numerical patterns, algebraic reasoning, and combinatorics
- Abstract Thinking: Symbol manipulation, logical deduction, and rule inference
Question Format
- Total Questions: 98 (70 Chart Understanding + 28 Visual Logic)
- Answer Format: Multiple choice (A, B, C, D, E)
- Image Types: PNG/JPEG with varied complexity levels