AI models can remember earlier parts of a conversation, but their memory capacity varies wildly. Interestingly, smarter models often have worse memory.
We tested 23 popular large language models to see which ones actually remember information during long conversations.
AI memory benchmark results
We ran each model through a simulated 32-message business conversation containing 43 questions. Our benchmark evaluated three key metrics (memory retention, reasoning quality, and hallucination detection) using a complex fictional dataset with custom emission factors and 847 supplier records. We included interference tests and pulse checks throughout the conversation to measure how well models recall and apply specific information over extended interactions.
For details of the questions and metrics used, see the methodology section below.
GPT-5 models couldn’t handle our test. Our benchmark intentionally provides an extensive conversation history with multiple questions per batch to test long-term recall. When we approached GPT-5’s context limits, it returned empty outputs. We tried strict JSON formatting and simplified prompts, but got the same failures. This was a context capacity problem, not an output format issue.
We could have reduced batch sizes or trimmed history, but that would change the test enough to make comparisons meaningless. We excluded GPT-5 to maintain consistency with the methodology.
Findings about AI memory
- Reasoning models remember less than standard models.
- Smaller models outperform larger ones on memory tasks.
The AI research community has noticed this trade-off. When models train on larger datasets to improve reasoning, their ability to memorize and recall specific information decreases.
Why do large models struggle with memory?
Larger models tend to provide information you didn’t ask for. This fills up the context window faster, even though that window may be bigger than a smaller model’s. The result: you get fewer relevant answers before the model “forgets” your earlier questions.
With smaller models, you can ask more questions and get more answers that actually relate to your initial conversation.
How to balance intelligence, hallucination rate, and memory?
Our AI hallucination benchmark and memory benchmark don’t perfectly correlate. If you want a model that doesn’t hallucinate AND remembers well, look for the sweet spot on this chart near the upper right corner.
AI memory benchmark methodology
We simulated a realistic 32-message business conversation about sustainability management. Five emission factors (steel: 2.4 kg CO₂e/kg, aluminum: 3.8, recycled plastic: 1.2, copper: 5.1, rare earth metals: 15.7) were introduced early, and then we tested whether models remembered and applied these numbers throughout the conversation.
We included two “pulse checks” after messages 4 and 14 to catch memory failures early. We also inserted irrelevant questions to see if models got distracted and forgot key details. The conversation ended with complex questions that required synthesizing information from the entire exchange.
The dataset
We created a fictional electronics manufacturing company with 450 employees. The dataset includes:
- Custom Life Cycle Assessment (LCA) emissions data from a fictional $2.3M McKinsey study
- 847 suppliers with EcoVadis scores and Science-Based Target timelines
- Operational metrics (hybrid work effects, conference expenses, software licensing)
- Three facilities: Austin (180 employees), Denver (150), Portland (120)
- $3.2M sustainability budget across five categories
The dataset is internally consistent but not publicly available. It’s complex enough to require synthesis across multiple business areas and specific enough that models can’t just look up answers online; they must actually remember.
Question types used in this benchmark
We asked 43 questions across 32 messages. Most messages contained multiple questions to increase complexity.
- Simple recall: “What’s our recycled plastic factor?” (tests pure retention)
- Memory + calculation: “Calculate emissions for 18,500 kg of recycled plastic.” (tests whether the model can apply remembered information)
- Memory interference: We asked unrelated questions between confirming a fact and asking for it again (simulates cognitive pressure)
- Cross-conversation synthesis: “Build a three-year ROI model combining carbon pricing, cloud migration benefits, and hybrid work savings.” (requires pulling information from the entire conversation)
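The memory + calculation questions reduce to multiplying a recalled factor by a quantity. A minimal sketch of the ground truth we score against (the factor values come from the benchmark dataset; the function name is illustrative, not part of our harness):

```python
# Custom emission factors introduced early in the conversation (kg CO2e/kg)
EMISSION_FACTORS = {
    "steel": 2.4,
    "aluminum": 3.8,
    "recycled_plastic": 1.2,
    "copper": 5.1,
    "rare_earth_metals": 15.7,
}

def expected_emissions(material: str, mass_kg: float) -> float:
    """Ground-truth answer for a memory + calculation question."""
    return EMISSION_FACTORS[material] * mass_kg

# "Calculate emissions for 18,500 kg of recycled plastic."
print(expected_emissions("recycled_plastic", 18_500))  # 22200.0
```

A model passes such a question only if it uses the custom factor from the conversation, not an industry-standard value it saw in training.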
Success measurement
A model performs perfectly when it recalls all custom factors, handles all interference tests, and synthesizes complex scenarios using specific details from the entire conversation.
1. Memory Metrics
- Factor accuracy: Does the model use our custom emission factor (1.2 kg CO₂e/kg for recycled plastic) instead of industry-standard values (0.6-0.9 kg CO₂e/kg)?
- Retention timeline: When does memory start failing?
- Interference resilience: Does performance drop after distracting questions?
2. Reasoning Quality
- Synthesis capability: Can it integrate information from different parts of the conversation?
- Calculation accuracy: Does it use the correct recalled factors in calculations?
- Context maintenance: Does it track specific vendors, timelines, and costs?
3. Hallucination Detection
- Number fabrication: Does it invent figures or recall actual ones?
- Confidence calibration: Is it confidently wrong or uncertainly correct?
- Generic fallback: Does it use conversation specifics or generic business clichés?
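The factor-accuracy and generic-fallback checks above can be sketched as a simple classifier over a model's free-text answer. This is an illustrative simplification, not our actual grading code: it extracts decimal numbers and checks whether the custom factor (1.2) or an industry-standard value (0.6-0.9) was used.

```python
import re

CUSTOM_FACTOR = 1.2          # our fictional recycled-plastic factor (kg CO2e/kg)
INDUSTRY_RANGE = (0.6, 0.9)  # typical published values a model might fall back on

def score_factor_accuracy(answer: str) -> str:
    """Classify which emission factor a model's answer relied on."""
    numbers = [float(n) for n in re.findall(r"\d+\.\d+", answer)]
    if CUSTOM_FACTOR in numbers:
        return "recalled_custom_factor"
    if any(INDUSTRY_RANGE[0] <= n <= INDUSTRY_RANGE[1] for n in numbers):
        return "generic_fallback"
    return "fabricated_or_missing"

print(score_factor_accuracy("Using 1.2 kg CO2e/kg, the total is 22200 kg"))
# -> recalled_custom_factor
```

In practice, confidence calibration also requires judging hedging language, which a number-matching heuristic like this cannot capture on its own.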
What is AI memory and what are its types?
AI memory is how models store, retrieve, and apply information from prior interactions, much like human cognition. The model retains user inputs, external information, and reasoning steps to generate contextually appropriate responses.
Memory is crucial for connecting variables, interactions, client relationships, and values, but it’s often overlooked. You can add memory through fine-tuning and custom training, but that’s not practical for everyone or every situation.
AI uses memory to maintain conversational continuity, recognize patterns, and adapt to changing user needs, much like humans use past experiences to guide current decisions.
In business contexts (multi-stage customer support, long-term planning), the ability to retain prior information (vendor names, custom emissions factors, strategic goals) reduces duplication, prevents mistakes, and improves workflows.
Types of AI memory
There are two types of AI memory: short-term memory and long-term memory.
Short-Term Memory
Stored within a single conversation session. The system temporarily saves recent questions, answers, and context so it can reference information from seconds ago. This maintains coherence during back-and-forth exchanges.
Long-Term Memory
Persists beyond individual sessions. Essential data (user preferences, project details, custom parameters) is stored so models can recall it weeks or months later without reminders.
Long-term memory uses knowledge bases, fine-tuned embeddings, or external memory systems. This significantly improves productivity and customer satisfaction. AI agents can remember a company’s specific emissions factors, greet returning users by name, or perform multi-step tasks without repeating setup.
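The two memory types can be sketched as a bounded session buffer plus a persistent key-value store. The class and method names below are illustrative, not any particular framework's API:

```python
class ConversationMemory:
    """Toy illustration: short-term buffer + long-term store."""

    def __init__(self, short_term_limit: int = 5):
        self.short_term = []   # recent turns; discarded when the session ends
        self.long_term = {}    # persists across sessions (e.g. a database)
        self.limit = short_term_limit

    def add_turn(self, text: str) -> None:
        self.short_term.append(text)
        if len(self.short_term) > self.limit:
            self.short_term.pop(0)  # oldest context falls out of the window

    def remember(self, key: str, value: str) -> None:
        self.long_term[key] = value  # e.g. "recycled_plastic_factor" -> "1.2"

mem = ConversationMemory(short_term_limit=2)
mem.add_turn("What is our recycled plastic factor?")
mem.add_turn("It is 1.2 kg CO2e/kg.")
mem.add_turn("Unrelated question about conference travel.")
mem.remember("recycled_plastic_factor", "1.2")
print(mem.short_term)  # the first turn has already been evicted
print(mem.long_term["recycled_plastic_factor"])
```

The eviction in `add_turn` is the mechanism behind the benchmark failures described earlier: once a fact leaves the short-term window, only an explicit long-term store can bring it back.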
Retrieval augmented & native memory
Native memory: Extends context windows so models “remember” more conversation history. It can be expensive and degrade when capacity is reached.
Retrieval-augmented memory: Stores long-term data externally (vector stores, databases). The model retrieves relevant information when needed (like aluminum’s carbon factor). Better control, scalability, and access speed.
Hybrid systems: Combine both approaches, native memory for immediate context, and retrieval for historical data. This ensures efficient performance across multiple conversation turns.
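Retrieval-augmented memory can be sketched as ranking stored facts against a query. This toy version uses word overlap; production systems use embeddings and a vector database, and the fact store below is invented for illustration:

```python
import re

def retrieve(query: str, store: dict[str, str], top_k: int = 1) -> list[str]:
    """Naive retrieval: rank stored facts by word overlap with the query."""
    q_words = set(re.findall(r"\w+", query.lower()))
    scored = sorted(
        store.items(),
        key=lambda kv: len(q_words & set(re.findall(r"\w+", kv[0].lower()))),
        reverse=True,
    )
    return [fact for _, fact in scored[:top_k]]

facts = {
    "aluminum carbon factor": "aluminum: 3.8 kg CO2e/kg",
    "sustainability budget": "$3.2M across five categories",
}
print(retrieve("what is the carbon factor for aluminum?", facts))
# -> ['aluminum: 3.8 kg CO2e/kg']
```

Because the retrieved fact is injected into the prompt only when needed, the context window stays small regardless of how much history has accumulated, which is exactly the scalability advantage over native memory.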
AI memory concerns & best practices
AI memory has many advantages, such as enabling more organic and context-rich interactions, but it also raises serious questions about data security and ethics.
AI systems are particularly vulnerable to data breaches and misuse, as they store and retrieve user preferences, conversation histories, and potentially sensitive information. Users may be reluctant to disclose private or sensitive business information if it is unclear what is stored and how it is safeguarded.
Additionally, poorly maintained memory can cause over-personalization or unintentional bias, where the AI uses historical data in ways that reveal personal information or reinforce preconceptions.
To overcome these challenges and use AI memory responsibly, follow these best practices:
- Minimize data footprint: Store only essential information required for your application and regularly delete outdated or unused data.
- User control & transparency: Offer clear options for users to view, modify, or erase their data, and be transparent about what data is kept.
- Secure storage & access: Protect data with encryption both at rest and during transfer, enforce strict access controls, and keep detailed logs of data access and changes.
- Bias monitoring: Continuously check outputs for unfair or biased patterns, and update memory policies or retrain models as necessary to reduce bias.
- Layered retrieval: Use a combination of short-term memory for immediate context and long-term storage for preferred settings, minimizing the risk of exposing sensitive information.
- Compliance alignment: Ensure your memory handling complies with laws like GDPR and CCPA by documenting retention policies and obtaining necessary consents.
Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.