Smarter models often have worse memory. We tested 26 large language models in a 32-message business conversation to determine which actually retain information.
AI memory benchmark results
We tested 26 popular large language models through a simulated 32-message business conversation with 43 questions. Our benchmark evaluated three key metrics: memory retention, reasoning quality, and hallucination detection, using a complex fictional dataset with custom emission factors and 847 supplier records. We included interference tests and pulse checks throughout the conversation to measure how well models recall and apply specific information over extended interactions.
For details on the questions and metrics used, see the methodology.
GPT-5 exclusion: GPT-5 returned empty outputs when approaching context limits, despite JSON formatting and simplified prompts. Reducing batch sizes to work around this would have invalidated comparisons with the other models.
Findings about AI memory
- Reasoning models remember less than standard models.
- Smaller models outperform larger ones on memory tasks.
The AI research community documented this trade-off in 2025: training on larger datasets to improve reasoning reduces the ability to memorize and recall specific information.
Why do large models struggle with memory?
Larger models provide extensive explanations you didn’t request. These verbose answers fill the context window faster, even when that window is larger, so you get fewer relevant answers before the model “forgets” earlier questions.
Smaller models deliver focused responses that preserve space for retaining prior information. More questions answered, better continuity.
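The trade-off above is simple arithmetic. As a rough sketch (the window size and per-turn token counts below are illustrative assumptions, not measured values):

```python
# Sketch: how response verbosity limits the number of turns that fit in a
# context window. All numbers are illustrative assumptions.

def turns_before_forgetting(window_tokens: int, question_tokens: int, answer_tokens: int) -> int:
    """Number of full question/answer turns that fit before the window overflows."""
    return window_tokens // (question_tokens + answer_tokens)

# A verbose model vs. a concise model sharing the same 128k-token window.
verbose = turns_before_forgetting(128_000, question_tokens=50, answer_tokens=1_550)
concise = turns_before_forgetting(128_000, question_tokens=50, answer_tokens=350)

print(verbose)  # 80 turns
print(concise)  # 320 turns
```

With identical window sizes, the concise model sustains four times as many turns before early information is evicted.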
Transformer models encode knowledge in static weight matrices. Learning new information updates these weights, disrupting previously learned patterns. This “catastrophic forgetting” problem explains why fine-tuning on medical data degrades performance on legal tasks.
Architecture solutions emerging in 2026:
- Google Titans: Test-time memorization with adaptive forgetting
- Google Nested Learning: Multi-speed memory consolidation (fast modules for immediate context, medium for intermediate knowledge, slow for fundamental capabilities)
- DeepSeek Engram: Offloads static knowledge to queryable databases outside GPU memory
How to optimize between intelligence, hallucination rate, and memory?
Our AI hallucination benchmark and memory benchmark don’t perfectly correlate. If you want a model that doesn’t hallucinate AND remembers well, look for the sweet spot on this chart near the upper right corner.
AI memory benchmark methodology
Question types (43 total across 32 messages)
Simple recall: “What’s our recycled plastic factor?”
Tests: Pure retention
Memory + calculation: “Calculate emissions for 18,500 kg of recycled plastic.”
Tests: Whether the model applies remembered information correctly
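The memory + calculation question above has a single correct answer under the dataset’s custom factor. As a quick check of the expected arithmetic:

```python
# The custom emission factor from the fictional dataset
# (not the industry 0.6-0.9 range).
RECYCLED_PLASTIC_FACTOR = 1.2  # kg CO2e per kg

def emissions(mass_kg: float, factor: float = RECYCLED_PLASTIC_FACTOR) -> float:
    """Emissions in kg CO2e for a given material mass."""
    return mass_kg * factor

print(emissions(18_500))  # 22200.0 kg CO2e
```

A model that answers with a figure derived from the industry-standard factor recalled its training data, not the conversation.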
Memory interference: Unrelated questions are inserted between confirming a fact and asking for it again
Tests: Cognitive pressure resilience
Cross-conversation synthesis: “Build a three-year ROI model combining carbon pricing, cloud migration benefits, and hybrid work savings.”
Tests: Pulling information from the entire conversation
The dataset
We created a fictional electronics manufacturing company with 450 employees. The dataset includes:
- Custom Life Cycle Assessment (LCA) emissions data from a fictional $2.3M McKinsey study
- 847 suppliers with EcoVadis scores and Science-Based Target timelines
- Operational metrics (hybrid work effects, conference expenses, software licensing)
- Three facilities: Austin (180 employees), Denver (150), Portland (120)
- $3.2M sustainability budget across five categories
The dataset is internally consistent but not publicly available. It’s complex enough to require synthesis across multiple business areas and specific enough that models can’t just look up answers online; they must actually remember.
Success measurement
Perfect performance requires:
- Recalling all custom factors (not industry standards: recycled plastic is 1.2 kg CO₂e/kg in our dataset, not the industry’s 0.6–0.9 kg CO₂e/kg)
- Handling all interference tests without degradation
- Synthesizing complex scenarios using specific details from the full conversation
Evaluation metrics
1. Memory metrics
- Factor accuracy: Whether the model uses the custom 1.2 kg CO₂e/kg factor or falls back to the industry 0.6–0.9 range
- Retention timeline: When does memory fail?
- Interference resilience: Performance after distracting questions
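One way to express interference resilience as a single number is the ratio of post-interference recall accuracy to pre-interference accuracy. This scoring scheme is an illustrative sketch, not the benchmark’s exact formula:

```python
# Sketch of an interference-resilience score: recall accuracy after
# distracting questions, relative to accuracy on the same facts before them.
# The function name and scoring scheme are illustrative assumptions.

def interference_resilience(pre_correct: int, pre_total: int,
                            post_correct: int, post_total: int) -> float:
    """1.0 = no degradation after interference; lower = memory loss."""
    pre_acc = pre_correct / pre_total
    post_acc = post_correct / post_total
    return post_acc / pre_acc if pre_acc else 0.0

# A model that recalls 9/10 facts before distractors but only 6/10 after
# retains two-thirds of its accuracy.
print(round(interference_resilience(9, 10, 6, 10), 3))  # 0.667
```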
2. Reasoning quality
- Synthesis: Integrating information from different conversation parts
- Calculation accuracy: Correct recalled factors in equations
- Context maintenance: Tracking vendors, timelines, costs
3. Hallucination detection
- Number fabrication: Invents figures vs. recalls actual ones
- Confidence calibration: Confidently wrong vs. uncertainly correct
- Generic fallback: Conversation specifics vs. business clichés
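Number fabrication lends itself to an automated check: extract every figure in a response and flag any that never appeared in the dataset. A minimal sketch (the regex, the known-value set, and the example answer are illustrative assumptions):

```python
import re

# Sketch: flag fabricated numbers by comparing figures in a model's answer
# against known values from the benchmark dataset. The known-value set
# below is a small illustrative sample.
KNOWN_VALUES = {1.2, 18_500, 22_200, 3_200_000, 847, 450}

def fabricated_numbers(answer: str) -> set[float]:
    """Return figures in the answer that are absent from the dataset."""
    found = {float(n.replace(",", ""))
             for n in re.findall(r"\b\d[\d,]*\.?\d*\b", answer)}
    return found - KNOWN_VALUES

answer = "Emissions are 22,200 kg CO2e using the 0.85 industry factor."
print(fabricated_numbers(answer))  # {0.85} -- that factor was never in the dataset
```

A real pipeline would also tolerate rounding and unit conversions; the point is that fabricated figures are mechanically detectable when the ground-truth dataset is fully specified.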
AI memory: how it works
AI memory determines how models store, retrieve, and apply information from prior interactions.
Multi-stage customer support, long-term planning, and vendor management all require retaining information like vendor names, custom parameters, and strategic goals. Without memory:
- Repeats questions already answered
- Cannot track progress on multi-week projects
- Loses context between conversations
- Forces constant re-explanation
Types of AI memory
Short-term memory
Exists within a single conversation session. Temporarily saves recent questions, answers, context. Maintains coherence during back-and-forth exchanges.
Implementation: Context window containing recent messages.
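Short-term memory as a bounded context window can be sketched in a few lines: once the message budget is exceeded, the oldest turns are silently dropped. The 4-message budget is an illustrative assumption:

```python
from collections import deque

# Minimal sketch of short-term memory as a bounded context window:
# when the message budget is full, appending evicts the oldest turn.
context = deque(maxlen=4)  # illustrative budget of 4 messages

for turn in ["Q1", "A1", "Q2", "A2", "Q3", "A3"]:
    context.append(turn)

print(list(context))  # ['Q2', 'A2', 'Q3', 'A3'] -- Q1/A1 have been "forgotten"
```

Real systems truncate by token count rather than message count, but the failure mode is the same: whatever falls off the front is unrecoverable within the session.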
Long-term memory
Persists beyond individual sessions. Stores user preferences, project details, custom parameters for recall weeks or months later.
Implementation: Knowledge bases, fine-tuned embeddings, external memory systems.
Business impact: AI agents remember company-specific emissions factors, greet returning users by name, perform multi-step tasks without repeating setup.
Native vs. retrieval-augmented memory
Native memory: Extends context windows to “remember” more conversation history. Expensive and degrades when capacity is reached.
Retrieval-augmented memory (RAG): Stores long-term data externally in vector stores or databases. Model retrieves relevant information when needed. Better control, scalability, and access speed.
Hybrid systems: Native memory for immediate context, retrieval for historical data. Efficient performance across multiple conversation turns.
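The retrieval-augmented pattern above can be sketched minimally: facts live outside the model’s context and are fetched on demand. Here, keyword overlap stands in for real embedding similarity, and the store contents are illustrative:

```python
# Minimal sketch of retrieval-augmented memory: long-term facts live in an
# external store and are fetched per query. Keyword overlap stands in for
# real embedding similarity; the stored facts are illustrative.

MEMORY_STORE = [
    "recycled plastic factor is 1.2 kg CO2e/kg",
    "Austin facility has 180 employees",
    "sustainability budget is $3.2M across five categories",
]

def retrieve(query: str, store: list[str], k: int = 1) -> list[str]:
    """Return the k stored facts sharing the most words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(store, key=lambda fact: -len(q_words & set(fact.lower().split())))
    return ranked[:k]

print(retrieve("What is our recycled plastic factor?", MEMORY_STORE))
# ['recycled plastic factor is 1.2 kg CO2e/kg']
```

In production this would use a vector store and embedding model, but the control flow is the same: only the retrieved facts, not the whole history, occupy the context window.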
Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.