AIMultiple Research
AI Agents
Updated on Sep 12, 2025

12 Reasons AI Agents Still Aren't Ready

Cem Dilmegani

For all the bold promises from tech CEOs about AI agents “joining the workforce” and driving “multi-trillion-dollar opportunities,” the reality is far less inspiring.

What we currently have are not autonomous agents but glorified chatbots in fancy packaging, mostly following scripts. Give them the same task twice, and you’ll often get wildly different results. And when they go wrong, tracing the logic behind the failure is nearly impossible.

The limitations are also reflected in the numbers:

  • Best models hit 82% accuracy on enterprise document processing tasks.
  • Accuracy remains below 70% on customer service work.
  • Agents fail to create functional APIs with correct endpoints or method definitions.

Below, I outline the most common challenges facing LLM-based agents:

1. Limited real-world adoption

Despite the hype, many users and even industry insiders struggle to identify consistent, high-value applications for AI agents. 

One of the leading advocates for AI agents recently posed a genuine question:

Does anyone actually use the ChatGPT agent? If so, for what purpose? I struggle to find a use-case that fits its (limited) capabilities.

Source: Kahn, Jeremy1

In our AI agent benchmarks, we observed similar challenges with the ChatGPT Agent, which, despite its success in some tasks, struggled with more complex business process requirements and precise tool use. 

AI agents do work in certain domains, from customer support to lead generation, but for most users, the promised “general-purpose assistant” has yet to materialize.

Tools like AgentGPT demonstrated the appeal of a general-purpose assistant. Yet in practice, these systems required constant supervision. 

2. Shallow memory

A core weakness of AI agents is memory. Each conversation resets, with no continuity or learning from previous conversations. Larger context windows, even hundreds of pages, add bulk but not reliable recall.

Even the best models in our benchmark, GPT-4.1 and Mistral Devstral Medium, achieved only an 88% success rate, failing basic recall roughly once in every eight tries.
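
To make the reset concrete, here is a minimal sketch using the OpenAI Python client (the model name, messages, and setup are placeholders for illustration, and an API key is assumed): a second, separate call knows nothing about the first unless the caller resends the earlier exchange.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model name below is illustrative

# First call: the user states a fact.
first = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "My order number is 84712."}],
)

# Second call: a fresh request with no shared state.
# The model cannot recall the order number; nothing persists between calls.
second = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What is my order number?"}],
)

# Continuity only exists if the orchestration code replays the history itself:
history = [
    {"role": "user", "content": "My order number is 84712."},
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "What is my order number?"},
]
third = client.chat.completions.create(model="gpt-4.1", messages=history)
```

Everything the agent “remembers” is whatever the surrounding code chooses to replay; nothing accumulates on the model side.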

3. Weak integration and reasoning over memory

Even when models retain information, applying it meaningfully remains limited.

Our scatterplot shows this clearly: larger models with hundreds of billions of parameters don’t consistently outperform smaller ones on memory. In fact, some of the largest systems, exceeding 500B parameters, landed memory scores as low as 30–40%, far below those of their smaller peers.

This highlights a troubling trade-off: Models optimized for reasoning often perform worse at recall, while smaller models with stronger memory fail to synthesize complex scenarios.

For example, GPT-4.1 and Devstral Medium performed well on direct recall but showed lower accuracy in cross-conversation synthesis tasks.

4. Agents don’t really learn in any lasting way 

LLM-based agents are stateless, meaning they don’t learn from the actions or outputs of the tools they use. 

They rely on a “context window,” which is like a temporary notepad that only holds the current conversation. Once the space runs out or the session ends, the notes are wiped clean.

To put this in perspective: GPT-5 can handle about 200,000 tokens, roughly the length of an 800-page book. That sounds big, but the entire Lord of the Rings series runs well over 1,200 pages.

When a task exceeds that token limit, the agent simply cannot complete it. For example, we found that certain models could not be used at all because their context windows were too small for the agent to function effectively.
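
As a rough illustration, a pre-flight token check like the one below can catch this before the agent starts. It uses tiktoken’s cl100k_base encoding as an approximation, and the 200,000-token budget is simply the figure cited above, not an official constant:

```python
import tiktoken

# Approximate tokenizer; newer models may ship different encodings.
ENC = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 200_000   # the figure cited above, not an official constant
OUTPUT_RESERVE = 8_000     # leave room for the model's answer

def fits_in_context(system_prompt: str, documents: list[str]) -> bool:
    """Pre-flight check: do the prompt and documents still leave room for output?"""
    used = len(ENC.encode(system_prompt))
    used += sum(len(ENC.encode(doc)) for doc in documents)
    return used + OUTPUT_RESERVE <= CONTEXT_BUDGET

# Usage: refuse (or chunk and summarize) up front instead of letting the agent
# silently lose the earliest material mid-task.
if not fits_in_context("You are a contract-review agent.", ["...long documents..."]):
    raise ValueError("Task exceeds the context budget; split or summarize first.")
```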

However, some agentic systems, such as Devin, frameworks like LangGraph, multi-agent setups, and coding tools such as Cursor, try to bridge this gap by holding project state or coordinating across agents. This gives the impression of continuity, but it remains limited to a session: the “learning” is external, temporary, and wiped clean once the session ends.

Think of it like this: Devin acts like a project manager who keeps good notes. It remembers what was said in a meeting, but it doesn’t actually get better at reasoning from one project to the next. Real learning would mean rewiring how its brain works.

That’s what an architectural change is: Redesigning the model itself so it doesn’t just process inputs but can store, update, and reorganize knowledge over time.

Early work like Memory-R1 is exploring this path, but it’s still early days.2

5. Siloed data pools

80% of businesses cited data integration as a major blocker to AI adoption.3

AI agents can pull from many systems, but those systems rarely speak the same language. A “customer” might be a metadata-rich entity in one platform and a group of users in another, leading to incorrect merges. In the end, agents remain hamstrung by messy, inconsistent data.
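
A toy illustration of why this bites (the records and field names are invented): two systems both expose a “customer”, but the shapes don’t line up, so a naive merge quietly produces a record that belongs to neither system:

```python
# CRM: a "customer" is a single, metadata-rich entity keyed by email.
crm_customer = {
    "id": "CRM-001",
    "email": "ops@acme.io",
    "name": "Acme Inc.",
    "tier": "enterprise",
}

# Billing platform: a "customer" is really a group of users under one account.
billing_customer = {
    "account_id": "ACCT-77",
    "users": [
        {"email": "ops@acme.io", "seat": "admin"},
        {"email": "dev@acme.io", "seat": "member"},
    ],
}

# A naive merge treats both dicts as the same kind of object and quietly
# produces a record that is neither one thing nor the other.
merged = {**crm_customer, **billing_customer}
print(list(merged.keys()))
# ['id', 'email', 'name', 'tier', 'account_id', 'users']
# The agent now has to guess whether downstream actions (refunds, emails,
# seat changes) target the account or one of its users -- exactly the kind
# of ambiguity that leads to incorrect merges.
```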

6. Broken context sharing

Context sharing4

Multi-agent systems are designed to work together seamlessly, but in practice, communication often breaks down. Instructions can shift or lose meaning as they move between agents, creating serious risks when workflows depend on accurate handoffs.

This creates challenges because:

  • There is still no widely adopted agent API: Most agents communicate through natural language, which is inherently ambiguous and prone to misinterpretation.
  • Semantic misalignment: Basic concepts such as “user” may be defined differently across systems. For example, one agent may treat a user as a contact record, while another sees it as a transaction history (see the sketch after this list).
  • Protocol limitations: The Model Context Protocol (MCP), introduced in 2024 to improve interoperability, is a step forward but comes with design constraints. Its one-to-one structure, where each server manages only a single user and provider, makes multi-user or multi-provider workflows cumbersome and inefficient.5
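
The semantic-misalignment problem is easiest to see in code. In the hypothetical handoff below (all types and fields are invented for illustration), the sending agent’s “user” and the receiving agent’s “user” share a name but not a shape, so information is silently dropped at the boundary:

```python
from typing import TypedDict

# Agent A (CRM side): a "user" is a contact record.
class ContactUser(TypedDict):
    email: str
    name: str
    phone: str

# Agent B (billing side): a "user" is a transaction history.
class BillingUser(TypedDict):
    email: str
    transactions: list[dict]

def handoff(user: ContactUser) -> BillingUser:
    """Naive translation between the two agents' vocabularies.

    The name and phone number have nowhere to go, and the transaction history
    has to be fabricated as empty: information is lost in both directions,
    yet the call "succeeds" and the workflow keeps running.
    """
    return {"email": user["email"], "transactions": []}

contact: ContactUser = {"email": "a@b.co", "name": "Ada", "phone": "+1-555-0100"}
print(handoff(contact))  # {'email': 'a@b.co', 'transactions': []}
```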

7. Compounding errors in multi-step workflows

Agentic tasks rarely happen in a single step. In systems as unreliable as today’s LLMs, every additional step introduces another chance for failure. Given enough steps, small mistakes inevitably pile up, and the results can be disastrous.
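
The arithmetic is unforgiving: if each step succeeds independently with probability p, a chain of n steps succeeds with probability p^n, so even a seemingly reliable agent degrades quickly. The numbers below are purely illustrative:

```python
# End-to-end success of a workflow in which every step must succeed.
def chain_success(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

for p in (0.99, 0.95, 0.90):
    for n in (5, 20, 50):
        print(f"per-step {p:.0%}, {n:2d} steps -> {chain_success(p, n):.1%} end-to-end")

# Selected output:
# per-step 99%,  5 steps -> 95.1% end-to-end
# per-step 99%, 20 steps -> 81.8% end-to-end
# per-step 99%, 50 steps -> 60.5% end-to-end
# per-step 95%, 20 steps -> 35.8% end-to-end
# per-step 90%, 50 steps -> 0.5% end-to-end
```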

Penrose.com tested AI on account balance tracking with a year of Stripe data and found that once the model miscalculated an early transaction, every subsequent balance was off.

By the end of the dataset, the cumulative error had grown large, showing how small mistakes at the start of a workflow compound into major distortions over time:

Penrose’s Stripe test is a microcosm of what happens in complex agentic workflows6

8. Vulnerability to cyberattacks

Because current agents only have a shallow grasp of their tasks, they remain highly vulnerable to cyberattacks. 

Researcher Andy Zou found that even the most secure system failed 1.45% of the time, leading to more than 1,500 successful breaches. In enterprise settings, even one breach can be catastrophic, so a failure rate of this magnitude is far from acceptable.

Challenge attack success rate across all user interactions7

9. Limited observability

AI agents often behave like black boxes. It can be nearly impossible to trace how they reached a decision or where they failed. 

Recent agent observability tools like Langfuse have started to fill this gap by providing workflow-level tracing.

But what they don’t provide (yet) is visibility into the internal decision-making process of the model:

  • Why one tool was chosen over another.
  • Why certain context was prioritized.
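
Even without a dedicated tool, a homemade trace of each step’s inputs, outputs, errors, and latency goes a long way toward localizing failures. The sketch below is a minimal stand-in (not the Langfuse API), and it illustrates the same limitation: it records what happened at each step, not why the model made the choice it did.

```python
import functools
import json
import time

def traced(step_name: str):
    """Log inputs, outputs, errors, and latency for one step of an agent workflow."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            record = {"step": step_name, "args": repr(args), "kwargs": repr(kwargs)}
            try:
                result = fn(*args, **kwargs)
                record["output"] = repr(result)[:500]
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                record["seconds"] = round(time.time() - start, 3)
                print(json.dumps(record))  # in practice, ship this to a log store
        return wrapper
    return decorator

@traced("search_kb")
def search_kb(query: str) -> list[str]:
    return ["kb article 42"]

search_kb("refund policy")
```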

10. Tooling for AI agent development is still immature

Developers still lack mature, purpose-built environments for building and managing AI agents. Frameworks like AutoGen do offer save_state() and load_state() methods, but persisting and restoring agents in production is still largely left to the developer.

One developer on r/AutoGenAI described this problem:

“I created an agentic workflow using Autogen… This worked well locally, but now I’m moving to production… facing challenges on how to store these agents.”8

The same issue is also reported on the AutoGen GitHub page.9

Some teams patch the problem with quick fixes, such as storing data in global variables or manually saving and reloading states. But these are not reliable production solutions.
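
In practice, the “manual save and reload” workaround looks something like the sketch below. It assumes an agent object exposing async save_state() and load_state() methods, as AutoGen’s agents do; the helper names and file layout are my own, and a real deployment would need a proper store, locking, and versioning rather than local JSON files:

```python
import json
from pathlib import Path

STATE_DIR = Path("agent_state")  # illustrative: a real deployment would use a database

async def persist(agent, session_id: str) -> None:
    """Snapshot the agent's state to disk so another process can resume it later."""
    STATE_DIR.mkdir(exist_ok=True)
    state = await agent.save_state()  # assumed AutoGen-style async API
    (STATE_DIR / f"{session_id}.json").write_text(json.dumps(state))

async def resume(agent, session_id: str) -> bool:
    """Restore a previously saved state into a freshly constructed agent."""
    path = STATE_DIR / f"{session_id}.json"
    if not path.exists():
        return False
    await agent.load_state(json.loads(path.read_text()))  # assumed async API
    return True
```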

11. Testing and debugging remain costly

Testing and debugging AI agents is slow and resource-intensive. Workarounds like mocking model responses or building fake endpoints exist, but they cannot capture the unpredictable runtime errors in real deployments. 
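
Mocking usually means swapping the real model client for one that returns canned replies, so the surrounding orchestration logic can be tested cheaply and deterministically. The sketch below uses invented names; your agent’s interface will differ, and nothing here exercises the model’s actual behavior:

```python
# A fake model client with canned replies: tests run in milliseconds and cost
# nothing, but they only exercise the plumbing, not the model itself.
class FakeModelClient:
    def __init__(self, scripted_replies):
        self.scripted_replies = iter(scripted_replies)

    def complete(self, prompt: str) -> str:
        return next(self.scripted_replies)

def classify_ticket(client, ticket_text: str) -> str:
    """Toy agent step: route a support ticket using the model."""
    reply = client.complete(f"Classify this ticket as 'billing' or 'bug': {ticket_text}")
    return reply.strip().lower()

def test_classify_ticket_routes_billing():
    client = FakeModelClient(scripted_replies=["Billing"])
    assert classify_ticket(client, "I was charged twice") == "billing"

test_classify_ticket_routes_billing()
```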

As MIT Prof. Armando Solar-Lezama told the WSJ, AI is like “a brand new credit card here that is going to allow us to accumulate technical debt in ways we were never able to do before.”10

This creates a range of pain points:

  • Some developers reported spending $400 in a single day on Claude Code (Opus 4) while debugging.11
  • Models sometimes waste hours by producing template-based code rather than analyzing the actual input, even admitting they “didn’t read the code.”
  • Frequent re-runs mean paying several cents per test iteration, quickly adding up at scale.
  • No AI system reliably replaces unit tests or semantic debugging, leaving humans to catch logical errors manually.

12. Hallucinations limit reliability

AI agents do not “understand” content; they generate fluent language based on statistical patterns.

As a result, outputs can appear credible while still being fabricated. In our benchmarks, even a top-performing model like Claude Sonnet 3.7 hallucinated 17% of the time. This means that agents built on such models inevitably carry the same risk.
