Large concept models (LCMs), as introduced by Meta in their work on “Large Concept Models,” represent a fundamental shift away from token-based prediction toward concept-level representation.1
LCMs differ from traditional LLMs in two key ways:
- High-dimensional embedding space: Instead of working with discrete token sequences, LCMs perform all modeling directly in high-dimensional embedding space.
- Concept-level abstraction: Modeling is performed at the level of semantic and abstract concepts, not within a specific language or modality. This makes LCMs inherently language- and modality-agnostic.
Drawing on Meta’s research, we will explore the core components of LCMs, their potential in areas like semantic search and reasoning, and how they perform on benchmarks.
Understanding the limitations of LLMs: From tokens to concepts
The role of tokenization in LLMs: Large language models (LLMs) are trained on tokens, small segments of text. A token can be a full word, part of a word, or even a single character that the model processes as a unit.
Example of tokenization:
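For illustration, here is how a byte-pair-encoding (BPE) tokenizer splits text into token IDs. This minimal sketch uses the open-source tiktoken library as a stand-in; any subword tokenizer behaves similarly:

```python
# pip install tiktoken
import tiktoken

# Load a BPE tokenizer (the vocabulary used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization breaks language into pieces."
token_ids = enc.encode(text)

# Each ID maps back to a small fragment: a word, a subword, or punctuation.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)  # a list of integers; values depend on the vocabulary
print(pieces)     # e.g. ['Token', 'ization', ' breaks', ...] -- exact splits vary
```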
The problem
Tokenization helps models break language into manageable pieces, but it also introduces a constraint. Most LLMs operate over sequences of discrete tokens (e.g., text subwords; visual/audio tokens produced by encoders).
LLMs can ingest multiple modalities, yet their core objective and representation remain sequence-bound, which makes it harder to model meaning directly at a concept level.
Cognition.ai’s findings with Sonnet 4.5 show this clearly: the model senses when its context window is nearly full, rushes to conclusions, and even reports remaining tokens, though inaccurately.2
The solution (Concepts)
Visualization of reasoning in an embedding space of concepts (task of summarization)3
Concepts refer to higher-order representations of meaning. Unlike tokens, they are not tied to any specific language unit and can be derived from text, speech, or other modalities, so the reasoning process remains the same regardless of the input form.
This enables:
- Better handling of long contexts by reasoning over whole ideas instead of fragmented tokens.
- More abstract reasoning since operations are performed at the level of meaning.
- Language- and modality-agnostic process to handle multilingual and multimodal tasks without needing separate processing pipelines for each type of input.
What are large concept models?
In contrast to token-based LLMs, large concept models (LCMs) aim to represent and reason over semantic concepts in a continuous embedding space that is not tied to any single language or modality.
Fundamental architecture of a large concept model (LCM):
Source: Meta4
Core components of LCMs
1. SONAR encoding (turning text or speech into concept embeddings)
SONAR architecture5
The first stage of an LCM is the concept encoder, which converts text or speech into a shared embedding space. Instead of breaking the input into tokens, it represents entire sentences as embeddings that capture their meaning.
LCMs use SONAR, a multilingual and multimodal embedding space that supports over 200 text languages and 76 for speech.
For example, the sentences “I love you” in English and “Te quiero” in Spanish are placed close together in this space because they express the same idea. By operating at this concept level, LCMs gain inclusivity, efficiency, and scalability beyond token-based models.
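To make this concrete, the following minimal sketch measures that closeness with cosine similarity. It uses an off-the-shelf multilingual encoder from the sentence-transformers library as a stand-in for SONAR (Meta distributes the actual SONAR encoder separately):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

# A multilingual sentence encoder, used here as a stand-in for SONAR.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

emb_en, emb_es, emb_other = model.encode(
    ["I love you", "Te quiero", "The stock market fell today"]
)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: closer to 1.0 means closer in embedding space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb_en, emb_es))     # high: same concept, different languages
print(cosine(emb_en, emb_other))  # lower: unrelated concepts
```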
Why is SONAR better than traditional embeddings?
Traditional methods:
- mBERT: Provides multilingual embeddings, but they are not consistently aligned at the sentence level, making cross-language tasks less effective.
SONAR advantages:
- Language-agnostic: 200+ languages for text input and output (building on Meta’s No Language Left Behind project). 76 languages for speech input and English for speech output.
- Cross-lingual alignment: Sentences with the same meaning appear close together, regardless of language.
- Higher-level reasoning: Since the units are sentences (or concepts), models can perform tasks such as summarization or translation by manipulating ideas directly.
- Zero-shot translation: Can translate between languages and modalities without direct training for every pair.
LLMs vs LCMs
2. LCM core processing (reasoning over embeddings)
The LCM core is the reasoning stage, where the model generates new concepts based on context. Unlike LLMs, which predict one token at a time, the LCM core predicts entire sentences or concepts, operating at a higher semantic level.
The challenge lies in producing continuous embeddings conditioned on context. LLMs generate probability distributions over discrete tokens, but LCMs must directly generate vectors that capture meaning.
To address this, researchers have proposed several approaches, including:
- Base-LCM: Standard Transformer predicting embeddings: The simplest method is to train a Transformer to directly predict the next embedding, minimizing mean squared error (MSE) loss; a minimal sketch follows this list. While effective in principle, this approach struggles because a given context can lead to multiple valid, yet semantically distinct, continuations.
Base-LCM6
- Diffusion-based LCM: Structural variations for contextualization and denoising: Inspired by image generation, this variant uses a diffusion process. It autoregressively generates concepts, one at a time, performing denoising steps for each generated concept.
- One-tower: A single Transformer stack handles both contextualization and denoising, keeping the design efficient and compact.
- Two-tower: Splits the process into two parts: a contextualizer for understanding context and a denoiser for refining embeddings, offering more flexibility at the cost of complexity.
Diffusion models in image generation7
- Quantized LCM: Discretized embeddings: Another option is to discretize embeddings into larger symbolic units. This makes the task closer to that of LLMs, where the model generates discrete elements, but here the “tokens” represent much larger, semantically richer chunks of meaning.
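To make the Base-LCM objective concrete, here is a minimal PyTorch sketch of next-embedding prediction with MSE loss. The architecture and sizes are illustrative assumptions (Meta's models add pre- and post-processing layers and operate on SONAR's 1024-dimensional embeddings at far larger scale):

```python
import torch
import torch.nn as nn

class BaseLCM(nn.Module):
    """Minimal Base-LCM sketch: regress the next concept (sentence) embedding."""

    def __init__(self, dim: int = 1024, heads: int = 8, layers: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, dim)  # hidden state -> next-embedding guess

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (batch, seq_len, dim) -- one vector per sentence.
        # Causal mask so position t only attends to concepts at positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(concepts.size(1))
        return self.head(self.backbone(concepts, mask=mask))

model = BaseLCM()
seq = torch.randn(2, 16, 1024)                   # toy batch of concept sequences
pred = model(seq[:, :-1])                        # predict embeddings 1..15 from 0..14
loss = nn.functional.mse_loss(pred, seq[:, 1:])  # MSE against the true next embeddings
loss.backward()
print(loss.item())
```

The diffusion variants replace this single regression with several denoising steps per generated concept, while the quantized variant predicts discrete units instead of a continuous vector.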
3. SONAR decoding (returning to human-readable text or speech)
The final step of an LCM is the concept decoder, which transforms abstract embeddings back into natural text or speech.
Since concepts are stored in a shared embedding space, they can be decoded into any supported language or modality without rerunning the reasoning process.
This language-agnostic design means an LCM could take input in German, reason in concepts, and output in Japanese. It also enables easy scalability: new encoders or decoders (such as for sign language or speech-to-text systems) can be added without retraining the entire model.
By keeping “thinking” separate from expression, the decoder ensures that LCMs remain both flexible and adaptable for multilingual and multimodal applications.
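Putting the three stages together, an end-to-end pass looks like the sketch below. The function names and toy bodies are hypothetical stand-ins, not the actual API of Meta's released SONAR or LCM code:

```python
import numpy as np

def sonar_encode(sentences: list[str], lang: str) -> np.ndarray:
    """Stage 1 (stand-in): one concept embedding per sentence."""
    rngs = [np.random.default_rng(abs(hash(s)) % 2**32) for s in sentences]
    return np.stack([r.standard_normal(1024) for r in rngs])  # toy 1024-d vectors

def lcm_reason(concepts: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): generate new concepts conditioned on context."""
    return concepts.mean(axis=0, keepdims=True)  # toy: one "summary" concept

def sonar_decode(concepts: np.ndarray, lang: str) -> list[str]:
    """Stage 3 (stand-in): render each concept in the requested language."""
    return [f"<{lang} sentence for concept {i}>" for i in range(len(concepts))]

# German in, Japanese out: reasoning happens once, in concept space;
# only the choice of decoder is language-specific.
doc = ["Das Wetter ist heute schön.", "Wir gehen in den Park."]
summary = sonar_decode(lcm_reason(sonar_encode(doc, lang="deu")), lang="jpn")
print(summary)
```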
Benchmarking LCM architectures
Meta pre-trained LCMs on the FineWeb-Edu dataset (English-only) and evaluated them across four benchmarks:
- ROC-Stories (narrative reasoning)
- C4 (web-scale text)
- Wikipedia-en (encyclopedic knowledge)
- Gutenberg (long-form text)
These datasets were chosen to capture diverse text types, from short narratives to large knowledge bases and extended documents.
Key takeaways:
Diffusion-based LCMs (One-Tower and Two-Tower) are the strongest performers, ahead of Base-LCM and the quantized variants (Quant-LCM-c and Quant-LCM-d). Their iterative denoising process proved more effective at modeling concept continuations, leading to higher semantic accuracy and coherence.
How to interpret the benchmark data:
- ℓ₂, ℓ₂-r: Lower = more accurate, consistent embeddings (see the sketch after this list).
- PAR: A middle ground is best; it shows coherence without collapse.
- CA: Higher = better semantic alignment.
- MI: Higher = more informative outputs.
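As a concrete reading of the ℓ₂-based scores, the sketch below computes both. The round-trip definition of ℓ₂-r (re-encode the decoded text and measure the drift) is our assumption about the metric, so treat it as illustrative:

```python
import numpy as np

def l2(pred: np.ndarray, target: np.ndarray) -> float:
    """Plain l2: Euclidean distance between predicted and reference embeddings."""
    return float(np.linalg.norm(pred - target))

def l2_round_trip(pred: np.ndarray, decode, encode) -> float:
    """Assumed l2-r: distance between the predicted embedding and the
    re-encoding of its decoded text. Penalizes predictions that decode
    into text whose meaning drifts from the original vector."""
    return float(np.linalg.norm(pred - encode(decode(pred))))

# Toy check with an identity round trip (a perfect decoder/encoder pair):
v = np.ones(4)
print(l2(v, np.zeros(4)))                          # 2.0
print(l2_round_trip(v, lambda x: x, lambda x: x))  # 0.0: nothing lost
```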
Benchmarking LCM efficiency
Meta’s experiments showed that LCMs scale well with context length compared to LLMs when handling the same amount of text. This advantage comes from the fact that a concept corresponds to a full sentence, which includes multiple tokens. Since there are fewer concepts than tokens, the model has fewer units to process, and quadratic attention becomes less demanding.
Key takeaways:
It’s worth noting that these efficiency gains depend heavily on how text is segmented into sentences. Paragraphs split into shorter or longer sentences will affect the number of concepts, and thus the computational load.
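A back-of-the-envelope calculation shows the effect; the 20-tokens-per-sentence average is an illustrative assumption:

```python
# Self-attention cost grows with the square of the sequence length.
tokens = 10_000               # document length in tokens
tokens_per_sentence = 20      # assumed average sentence length
concepts = tokens // tokens_per_sentence   # 500 sentence-level concepts

token_cost = tokens ** 2      # pairwise interactions at the token level
concept_cost = concepts ** 2  # pairwise interactions at the concept level

print(f"{token_cost:,} vs {concept_cost:,}")         # 100,000,000 vs 250,000
print(f"{token_cost // concept_cost}x fewer pairs")  # 400x (= 20^2)
```

The advantage scales with the square of the average sentence length, which is exactly why the segmentation caveat above matters.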
Each LCM inference also involves three stages:
- SONAR encoding (text or speech → embeddings)
- Transformer-LCM reasoning (processing embeddings)
- SONAR decoding (embeddings → text or speech)
This pipeline introduces overhead, especially for short inputs:
For short sentences (fewer than ~10 tokens), LLMs can be more efficient than LCMs, since the encoding and decoding steps outweigh the benefits of concept-level processing.
LCM vs traditional LLMs on summarization tasks
Meta also benchmarked a diffusion-based LCM (7B parameters) on news summarization datasets (e.g., CNN/DailyMail, XSum) and compared it with traditional LLMs.
Paradigm descriptions:
- SFT (supervised fine-tuning): specialized training on summarization examples.
- IFT (instruction fine-tuning): broader training on instruction datasets, so the model learns summarization as one of many skills.
Parameter descriptions:
- ROUGE-L: Longest-common-subsequence overlap with reference summaries.
- OVL-3: Input trigram overlap ratio, measuring how much the summary copies from the source text.
- REP-4: Output four-gram repetition ratio, measuring repetition in generated summaries (a minimal sketch of these n-gram metrics follows this list).
- SEAHORSE Q4 and Q5 metrics: Quality and coherence measures.
- CoLA-based classifier: Evaluates the linguistic acceptability of generated sentences.
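OVL-3 and REP-4 are simple n-gram ratios; the minimal sketch below implements one plausible reading of each definition (the exact formulas in Meta's evaluation may differ):

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rep_4(summary: str) -> float:
    """Assumed REP-4: share of four-grams in the output that repeat an
    earlier four-gram. Higher = more repetitive generation."""
    grams = ngrams(summary.split(), 4)
    return 1.0 - len(set(grams)) / len(grams) if grams else 0.0

def ovl_3(source: str, summary: str) -> float:
    """Assumed OVL-3: share of the summary's trigrams that also appear in
    the source. Higher = more copying from the input."""
    src = set(ngrams(source.split(), 3))
    grams = ngrams(summary.split(), 3)
    return sum(g in src for g in grams) / len(grams) if grams else 0.0

print(rep_4("the cat sat on the mat the cat sat on the mat"))           # ~0.33
print(ovl_3("the quick brown fox jumps", "the quick brown fox sleeps"))  # ~0.67
```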
Key takeaways:
Strength:
- The diffusion LCM demonstrates strong coherence and contextual alignment in long-form summarization, especially when processing large contexts.
Caveats & considerations:
- The evaluation mostly targets generative tasks (summarization) rather than broad benchmarks like MMLU.
- How paragraphs are split into sentences (i.e., how “concepts” are defined) strongly impacts performance.
- On linguistic fluency and acceptability, token-based LLMs like LLaMA-3.1-8B and Mistral-7B still hold an edge. While LCMs show promise, they don’t yet deliver clear gains across all metrics, especially in fluency or flexibility.
Reference Links
