We follow ethical norms & our process for objectivity.

This research is not funded by any sponsors.

What are leading LMMs?

What are the latest advancements in multimodal models?

What are open-source large multimodal models?

What is a large multimodal model (LMM)?

What is a multimodal AI agent?

What is the difference between LMMs and LLMs?

What are the data modalities of large multimodal models?

How are large multimodal models trained?

How do LLMs work?

What are the limitations of large language models?

What are leading LMMs?What are the latest advancements in multimodal models?What are open-source large multimodal models?What is a large multimodal model (LMM)?What is a multimodal AI agent?What is the difference between LMMs and LLMs?What are the data modalities of large multimodal models?How are large multimodal models trained?How do LLMs work?What are the limitations of large language models?

Table of contents

LLMs

Updated on Jun 3, 2025

Large Multimodal Models (LMMs) vs LLMs in 2025

Cem Dilmegani

See our ethical norms

Large language models (LLMs) can handle textual tasks well but struggle with non-textual inputs, such as speech or video. In contrast, large multimodal models (LMMs) are emerging to handle various data types, including text, images, and audio.

However, their technical complexity and data demands present potential challenges. Innovations in AI research are aiming to overcome these challenges.

Worldwide search trends for Large Multimodal Models until 07/05/2025

Explore large multimodal models and compare them to large language models:

What are leading LMMs?

General purpose LLMs UI & API features

ChatGPT

Context window (tokens)

128K

Per File Upload Limit (MB)

512

Perplexity

Context window (tokens)

Model-dependent (e.g., 200K via Claude 3.5 Sonnet)

Per File Upload Limit (MB)

Claude

Context window (tokens)

200K

Per File Upload Limit (MB)

Grok

Context window (tokens)

1,000K

Per File Upload Limit (MB)

1024

Gemini

Context window (tokens)

2,000K

Per File Upload Limit (MB)

100

Updated at 04-30-2025

Provider	Context window (tokens)	Per File Upload Limit (MB)
ChatGPT	128K	512
Perplexity	Model-dependent (e.g., 200K via Claude 3.5 Sonnet)	25
Claude	200K	30
Grok	1,000K	1024
Gemini	2,000K	100

Provider

Context window (tokens)

Per File Upload Limit (MB)

ChatGPT

128K

512

Perplexity

Model-dependent
(e.g., 200K via Claude 3.5 Sonnet)

Claude

200K

Grok

1,000K

1024

Gemini

2,000K

100

Vendors are selected among the most popular multimodal LLMs based on comparability, data availability, and timeliness.

LMMs with their price per tokens:

Updated at 03-28-2025

Large multimodal models	Input Price / 1M tokens	Output Price/ 1M tokens
GPT-4.5	$75.00	$150.00
Claude 3 Opus	$15.00	$75.00
Claude 3.5 / 3.7 Sonnet	$3.00	$15.00
GPT-4o	$2.50	$10.00
Gemini 1.5 Pro	$1.25	$5.00
Claude 3.5 Haiku	$0.80	$4.00
GPT-4o mini	$0.15	$0.6

To select the most suitable model, consider factors such as your budget, the required capabilities and performance level, and the expected volume of input/output tokens needed for your specific use case.

What are the latest advancements in multimodal models?

Recent advancements in multimodal models have introduced new capabilities and efficiencies in AI development.

Llama 4 Scout and Llama 4 Maverick by Meta AI

Llama 4 Scout is a multimodal model with 17 billion active parameters and 16 experts. This model outperforms previous generation Llama models and is designed to operate on a single H100 GPU. It features a 10 million token context window for processing large amounts of information. Benchmark results indicate Llama 4 Scout achieves better results than Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a range of widely reported benchmarks.

Llama 4 Maverick is a multimodal model with 17 billion active parameters and 128 experts. This model is presented as a top performer in its class, outperforming GPT-4o and Gemini 2.0 Flash across a range of benchmarks. It achieves comparable performance to DeepSeek v3 in reasoning and coding, while using fewer active parameters. An experimental chat version of Llama 4 Maverick achieved an ELO score of 1417 on the LMArena platform.¹

4o Image Generation by OpenAI

OpenAI’s latest image-generation model, embedded within GPT-4o, integrates text and visual creation into a unified system. This multimodal capability allows GPT-4o to generate images while drawing on its text-based knowledge and chat context, creating an interplay between language and visuals.

Through multi-turn generation, users can refine images conversationally as shown in the figures below. The model builds upon prior text inputs and uploaded images to maintain consistency. By analyzing user-provided visuals and learning in context, GPT-4o adapts to specific details, enhancing its ability to produce context-aware imagery.

**Figure 1**. Prompting the creation of a drawing using references and instructing on text features for the image.

**Figure 2**. Prompting the creation of a photo from the drawing and placing it in a scene.²

Qwen2.5-VL-32B-Instruct by Alibaba

Alibaba’s Qwen2.5-VL-32B-Instruct builds on the Qwen2.5 language model with visual processing features. The 32B parameter model focuses on image understanding and reasoning. It was pretrained on 18T tokens with a 128K token context window and includes multilingual support. The model improves image parsing, content recognition, and visual reasoning, making it useful for applications that combine image and text analysis.

Gemma 3 by Google

Google’s Gemma 3 builds on technology from their Gemini 2.0 models. It comes in four sizes (1B, 4B, 12B, and 27B) for different hardware requirements and offers a 128k-token context window. Gemma 3 performs well on single-accelerator setups and includes text and visual reasoning, function calling, and support for over 35 languages, with pretraining for more than 140. Quantized versions reduce model size and computing needs. The ShieldGemma 2 system provides content safety classification.

Phi-4-multimodal by Microsoft

Microsoft’s Phi-4-multimodal is a 5.6B parameter model that processes speech, vision, and text in a unified architecture. It uses cross-modal learning for context-aware interactions between different input types. The model handles multiple input formats without requiring separate processing systems and is designed for device deployment and edge computing. Applications include smartphone AI, automotive systems, and multilingual services.

What are open-source large multimodal models?

Open source LMMs with their number of GitHub stars:

The graph shows that the GitHub popularity of various open-source LMMs has been increasing, with some models experiencing rapid adoption shortly after their release.

Janus-Series by DeepSeek gained thousands of GitHub stars within days after the release of Janus-Pro on January 27, 2025, surpassing its competitors, which took months to reach similar numbers. This rapid rise was not only due to Janus-Pro’s success but also influenced by the momentum created by DeepSeek-R1.

Gemma 3 by Google: Gemma 3 is a family of lightweight, state-of-the-art open models derived from Gemini 2.0 technology. These models offer advanced text and visual reasoning capabilities, a 128k-token context window, function-calling support, and quantized versions for optimized performance. It includes ShieldGemma 2 for image safety and supports diverse tools and deployment options.³
Janus-Pro by DeepSeek: Janus-Pro is an advanced version of the Janus model, designed to both understand and generate text and images. It features an optimized training strategy, expanded training data, and a larger model size, enhancing its multimodal capabilities.⁴
Qwen2.5-VL by Alibaba: Qwen2.5-VL by Alibaba is a multimodal extension of the Qwen2.5 language model, designed for both text and image understanding. It boasts large-scale pretraining (up to 18T tokens), an extended context window (up to 128K tokens), improved instruction following, and robust multilingual support, making it suitable for tasks like image captioning and visual question answering. ⁵

Building upon the Qwen2.5-VL series, Alibaba optimized and open-sourced Qwen2.5-VL-32B-Instruct, a 32B VL model incorporating enhanced fine-grained image understanding and reasoning. This results in improved performance and detailed analysis within tasks like image parsing, content recognition, and visual logic deduction.⁶
CLIP (Contrastive Language–Image Pretraining) by OpenAI: CLIP is designed to understand images in the context of natural language. It can perform tasks like zero-shot image classification, where it can accurately classify images even in categories it hasn’t explicitly been trained on by understanding text descriptions.⁷
Flamingo by DeepMind: Flamingo is designed to leverage the strengths of both language and visual understanding, making it capable of performing tasks that require interpreting and integrating information from both text and images.⁸

A set of chatbot interactions analyzing images, identifying objects, locations, colors, and shapes, while engaging in reasoning and cognitive tasks. — **Figure 3**. An example taken from Chip Huyen⁹

What is a large multimodal model (LMM)?

A large multimodal model is an advanced type of artificial intelligence model that can process and understand multiple types of data modalities. These multimodal data can include text, images, audio, video, and potentially others. The key feature of a multimodal model is its ability to integrate and interpret information from these different data sources, often simultaneously.

These can be understood as more advanced versions of large language models (LLMs) that can work not only on text but diverse data types. Also, multimodal language model outputs are targeted to be not only textual but visual, auditory, etc.

Multimodal language models are considered to be the next step toward artificial general intelligence.

What is a multimodal AI agent?

Multimodal AI agents are systems designed to interact with the world using various types of data, including images, videos, and text, allowing them to operate in both digital and physical environments. Multimodal models are the core component of these agents, providing them with the ability to perceive and understand information from these diverse sources.

For example, models like Magma utilize vision-language understanding and spatial intelligence, achieved through techniques like Set-of-Mark and Trace-of-Mark during pretraining on multimodal datasets.

This enables the agent to perform tasks ranging from understanding video content and answering questions to navigating user interfaces and controlling robots, demonstrating the versatile capabilities that multimodal models bring to AI agents by leveraging different data modalities. The illustration shows Magma planning robot trajectories to accomplish tasks, showcasing its spatial intelligence in action.¹⁰

Magma's planned robot trajectories for picking up a chip bag, pushing it, and picking up a coke can.

What is the difference between LMMs and LLMs?

Updated at 02-21-2025

Aspect	Large Multimodal Models (LMMs)	Large Language Models (LLMs)
Data Modalities	Can handle and make sense of different types of data, such as text, images, audio, video, and sometimes sensor readings.	Focuses only on text. Doesn’t handle other types of data like images or audio.
Integration Capabilities	Good at combining and understanding various kinds of data at once.	Works only with text and doesn’t combine it with other data types.
Applications and Tasks	Used for tasks that need understanding multiple data types together. For example, analyzing a news article with related photos and videos.	Used for text-based tasks such as writing, translating, answering questions, summarizing, and creating content.
Data Collection and Preparation	Involves collecting diverse media—text, images, audio, and video—requiring careful annotation and normalization for integration.	Involves collecting large amounts of text from books, websites, and other sources, focusing on a wide range of language.
Model Architecture Design	Uses different types of neural networks, like CNNs for images and transformers for text, and combines them to handle various data types.	Typically uses transformer architectures designed specifically for processing and generating text.
Pre-Training	Pre-trained on diverse data to link text with media, enabling tasks like image captioning.	Pre-trained on large text corpora using methods like predicting missing words in sentences.
Fine-Tuning	Fine-tuning involves working with specialized datasets for each data type and learning how different types of data relate to each other.	Fine-tuning uses specific text datasets tailored to particular tasks like answering questions or translating languages.
Evaluation and Iteration	Evaluated on its ability to recognize images, process audio, and integrate diverse data types.	Evaluated based on performance in understanding and generating text, focusing on fluency, coherence, and relevance.

1- Data Modalities

LMMs: They are designed to understand and process multiple types of data inputs, or modalities. This includes text, images, audio, video, and sometimes other data types like sensory data. The key capability of LMMs is their ability to integrate and make sense of these different data formats, often simultaneously.
LLMs: These models are specialized in processing and generating textual data. They are trained primarily on large corpora of text and are adept at understanding and generating human language in a variety of contexts. They do not inherently process non-textual data like images or audio.

2- Applications and Tasks

LMMs: Because of their multimodal nature, these models can be applied to tasks that require understanding and integrating information across different types of data. For example, an LMM could analyze a news article (text), its accompanying photographs (images), and related video clips to gain a comprehensive understanding.
LLMs: Their applications are centered around tasks involving text, such as writing articles, translating languages, answering questions, summarizing documents, and creating text-based content.

What are the data modalities of large multimodal models?

Text

This includes any form of written content, such as books, articles, web pages, and social media posts. The model can understand, interpret, and generate textual content, including natural language processing tasks like translation, summarization, and question-answering.

Images

These models can analyze and generate visual data. This includes understanding the content and context of photographs, illustrations, and other graphic representations. Tasks like image classification, object detection, and even creating images based on textual descriptions fall under this category.

Audio

This encompasses sound recordings, music, and spoken language. Models can be trained to recognize speech, music, ambient sounds, and other auditory inputs. They can transcribe speech, understand spoken commands, and even generate synthetic speech or music.

Video

Combining both visual and auditory elements, video processing involves understanding moving images and their accompanying sounds. This can include analyzing video content, recognizing actions or events in videos, and generating video clips.

While most current large multimodal language models can only process text and images, future research aims to include audio and video data inputs.

How are large multimodal models trained?

Training large multimodal models (LMMs) differs significantly from training large language models (LLMs) in several key aspects:

1- Data Collection and Preparation

LLMs: Focus on text data from books, websites, and written sources, emphasizing linguistic diversity for LLM training data sources.
LMMs: Require text, images, audio, and video data. Collection is more complex due to varied formats. Data annotation and alignment between modalities is essential.

2- Model Architecture Design

LLMs: Use transformer architectures optimized for sequential text processing.
LMMs: Employ more complex architectures that integrate multiple neural network types (CNNs for images, transformers for text) with mechanisms to connect these modalities.

3- Pre-Training

LLMs: Pre-train on text corpora using techniques like masked language modeling.
LMMs: Pre-train across multiple data types, learning to correlate text with images or understand video sequences.

4- Fine-Tuning

LLMs: Fine-tune on specialized text datasets for specific tasks.
LMMs: Require fine-tuning on both modality-specific datasets and cross-modal datasets to establish relationships between different data types.

5- Evaluation and Iteration

LLMs: Evaluation metrics are focused on language understanding and generation tasks like fluency, coherence, and relevance.
LMMs: Assessed on broader metrics covering image recognition, audio processing, and cross-modal integration capabilities.

How do LLMs work?

Large multimodal models are similar to large language models in the training process, design, and operation. They use the same transformer architecture and training strategies. Large multimodal models are trained on:

Text data
Millions or billions of images with text descriptions
Video clips
Audio snippets
Other input data like code

This training involves simultaneous learning of multiple data modalities, enabling the model to:

Recognize a photo of a cat
Identify a woof in an audio clip
Understand concepts and sensory details beyond text

This way, users can upload:

An image to:
- Get a description of what’s going on
- Use the image as part of a prompt to generate text or images
- Ask follow-up questions about specific elements of the image
- Translate the text of the image to a different language (e.g Menu)

In the image, the ChatGPT user is asking large multimodal models to describe the uploaded image. — **Figure 4**. Uploading an image of a cat on ChatGPT to describe it.

Charts and graphs to:
- Ask complicated follow-up questions about what they show
Design mockup to:
- Get the HTML and CSS code necessary to create it.

In the image, the ChatGPT user benefits from large multimodal models to prompt the image in Wes Anderson movie style — **Figure 5**. Prompting the image in Wes Anderson movie-style. ChatGPT feed the prompt into an image generation model (like DALL·E), which interprets the request and produces the styled image.

After the training process, models might incorporate unhealthy stereotypes and toxic ideas. To refine them, techniques like:

Reinforcement learning with human feedback (RLHF)
Supervisory AI models
Red teaming (testing the model’s robustness) can be used.

Also, AI governance tools and responsible AI tools functioning as AI compliance solutions can also enable AI inventory optimization, helping prevent AI bias and other ethical dilemmas. Here is an example of how these tools address gen AI copyright concerns:

The image shows a conversation between ChatGPT and me, illustrating how the tool protects copyrights by rejecting my request to generate a Ghibli Studio anime style. — **Figure 6**. ChatGPT rejects my request due to content policy guidelines to protect copyrights.

The goal is to develop a functional multimodal system capable of handling:

Text to image synthesis
Image captioning
Text-based image retrieval
Visual question answering.

In this way, multimodal AI can integrate various modalities, providing advanced capabilities for tasks that involve both language and vision.

What are the limitations of large language models?

Data requirements and bias: These models require massive, diverse datasets for training. However, the availability and quality of such datasets can be a challenge. Moreover, if the training data contains biases, the model is likely to inherit and possibly amplify these biases, leading to unfair or unethical outcomes.
Computational resources: Training and running large multimodal models require significant computational resources, making them expensive and less accessible for smaller organizations or independent researchers.
Interpretability and explainability: As with a complex AI model, understanding how these models make decisions can be difficult. This lack of transparency can be a critical issue, especially in sensitive applications like healthcare or law enforcement.
Integration of modalities: Effectively integrating different types of data (like text, images, and audio) in a way that truly understands the nuances of each modality is extremely challenging. The model might not always accurately grasp the context or the subtleties of human communication that comes from combining these modalities.
Generalization and overfitting: While these models are trained on vast datasets, they might struggle with generalizing to new, unseen data or scenarios that significantly differ from their training data. Conversely, they might overfit the training data, capturing noise and anomalies as patterns.

For more content on the challenges and risks of generative and language models, check our article.

If you have questions or need help in finding vendors, reach out:

Find the Right Vendors

External Links

Share This Article

Cem Dilmegani

Follow on

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.

Follow on

Comments

Your email address will not be published. All fields are required.

0 Comments

Related research

Text-to-SQL: Comparison of LLM Accuracy in 2025

Jul 37 min read

Compare 10+ LLMs in Healthcare in 2025

Jul 46 min read

Large Multimodal Models (LMMs) vs LLMs in 2025

What are leading LMMs?

General purpose LLMs UI & API features

ChatGPT

Perplexity

Claude

Grok

Gemini

LMMs with their price per tokens:

What are the latest advancements in multimodal models?

What are open-source large multimodal models?

Open source LMMs with their number of GitHub stars:

What is a large multimodal model (LMM)?

What is a multimodal AI agent?

What is the difference between LMMs and LLMs?

1- Data Modalities

2- Applications and Tasks

What are the data modalities of large multimodal models?

How are large multimodal models trained?

1- Data Collection and Preparation

2- Model Architecture Design

3- Pre-Training

4- Fine-Tuning

5- Evaluation and Iteration

How do LLMs work?

What are the limitations of large language models?

External Links

Next to Read

LLM Latency Benchmark by Use Cases in 2025

Top 5 AI Gateways for OpenAI: OpenRouter Alternatives

Compare Top 20 LLM Security Tools & Free Frameworks

Comments

Related research

Text-to-SQL: Comparison of LLM Accuracy in 2025

Compare 10+ LLMs in Healthcare in 2025