Updated on Aug 13, 2025

10 AI Coding Challenges I Face While Managing AI Agents

Cem Dilmegani

From what I have observed, AI agents are particularly helpful during the exploratory phases of work, assisting with implementations or outlining potential approaches. 

However, they fall short in contexts that require consistent judgment or strategic reasoning. Below, I outline the most common AI coding challenges. Click the links to jump to each section:

1. Tooling matters less than workflow: Achieving effective results with AI agents relies heavily on input clarity, modular design, and how tools are utilized. Domain knowledge still matters more than prompting skill alone.

2. Making a plan: AI agents can’t execute what you can’t explain: Without well-structured tasks and scoped inputs, agents may fail to execute correctly.

3. Refining the plan: Markdown-based plans improve outcomes but require frequent iteration.

4. Testing the plan: Why agents need your oversight: Plans may seem logical during the planning stage, but they often fail during execution.

5. Pitfalls of vibe coding: Vague prompts and lack of checkpoints create cycles of repetition. Agents need clear direction, testable goals, and a feedback loop to produce reliable results.

6. Bigger challenges: Scaling refactoring without a clear architecture: Scaling tasks across the codebase fails when there’s no architectural structure. Layers and consistent naming conventions are critical for agent refactoring.

7. Context limits and lack of long-term memory: LLMs operate within finite token limits. Once context exceeds that limit, agents forget key details, leading to misaligned decisions.

8. Abstracted storage lacks detail: Agents may summarize data too early and lose important specifics. Once abstracted, these details can’t be retrieved.

9. Choosing the right model: Letting the AI choose your model can lead to context limitations and inaccurate results.

10. The Model Context Protocol (MCP) illusion: MCP formalizes prompt and tool interactions, but you still need to handle the hard parts: choosing tools, writing prompts, and managing errors, just as with manual workflows.

1. Tooling matters less than workflow

One misconception I frequently encounter is the overemphasis on tooling. While the industry often obsesses over which AI agent or IDE is “best,” I’ve found that the real differentiator is how you use these tools.

When working with AI:

  • The “materials” are your inputs: code, data, architecture diagrams, and prompt structure. 
  • The “tools” are your interfaces, like GitHub Copilot, Windsurf, Cursor, Lovable, etc. 
  • And the “technique” is how you combine them into a cohesive development process. 

The effectiveness of any agent-driven workflow depends far more on input clarity, modular design, and context-aware prompting than on the interface itself. 

All of these tools share a baseline set of capabilities. If one suits my workflow better than another, it’s not because the model is more capable; it’s because I learned how to work with it. That said, no tool substitutes for domain knowledge.

2. Making a plan: AI agents can’t execute what you can’t explain

In my work with AI agents, one of the most consistent issues is communication clarity. Tools like Cursor, Windsurf, and Letta can organize files and assist across a wide range of non-coding workflows.

Yet their effectiveness depends entirely on the precision of instructions. When my requests are vague, the agent “hallucinates” (misleads or generates inaccurate outputs), often forcing me to launch a new composer or thread.

This leads to a range of issues:

  • Default suggestions misfire: The agent generates technically valid code, but it doesn’t align with real-world requirements or platform limitations. I’ve seen agents confidently produce code that compiles but fails at runtime because the route from input to output wasn’t fully scoped.
  • Lack of environmental awareness: Some agents rely on assumptions about the runtime context. One example: using JavaScript-rendered metadata in environments like Slack or Discord, where JavaScript isn’t executed.
  • Premature execution: Agents often leap into code generation before making key architectural decisions.

To address this, I document tasks as Markdown plans, saved in the codebase, to serve as a persistent context. These plans function as operational blueprints and outline what needs to happen and under what constraints. 

When working with any text document, especially Markdown files, Cursor offers inline completions.

I also rely on Cursor’s Planning Mode, which ensures that no code is executed until I explicitly authorize it: the agent outlines the steps before any implementation begins. This reduces the risk of premature or ungrounded output.

Here’s an example of a plan completed by Cursor:

3. Refining the plan: The start of agentic development

Why planning breaks (at first): In traditional coding, most of us skip formal planning and dive straight into writing code. That works fine when you’re in control of every keystroke, but AI agents aren’t mind readers; they need scaffolding.

Agentic workflows rely on structured input: a clearly defined objective, modular boundaries, and examples of what success looks like. Without that, the agent guesses, and those guesses are often wrong. Early plans tend to fall apart, not because planning is flawed, but because the first draft rarely gives the agent enough structure.

How I iterate plans

Markdown-based plans are powerful because they live in your repo and support versioned planning. 

But these plans require frequent revision since LLMs operate on probabilities, not logic. Even a slightly vague instruction or overly complex step can easily send the agent off course.

That’s why I lean heavily on a few practical tactics whenever I embed working code into Markdown plans:

  • Expect to revise: Plans aren’t meant to be perfect on the first try. I treat them like living documents that evolve through feedback.
  • Fix at the top: If something’s off, I revise the high-level instruction instead of tweaking scattered details.
  • Embed real examples: Including actual JSON, CSS, or TypeScript snippets helps ground the model’s behavior and reduce guesswork (see the sketch after this list).
  • Cut fast and clean: I don’t waste cycles explaining what’s clearly broken; I just remove it and move forward.
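As an illustration of the “embed real examples” tactic above, here is the kind of snippet I might paste into a plan to pin down a data shape the agent must preserve. It is purely hypothetical; the field names are made up for this example:

# Hypothetical snippet embedded in a Markdown plan: the agent must keep
# this response shape exactly, rather than inferring it from prose.
EXPECTED_RESPONSE = {
    "playlist": {
        "id": "pl_123",            # string identifier, never an integer
        "name": "Morning Focus",   # user-facing title, preserved verbatim
        "tracks": [                # list of track objects, may be empty
            {"title": "Example Song", "duration_sec": 214},
        ],
    },
}

Grounding the plan in a concrete structure like this leaves the agent far less room to guess.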

Here’s a close-up from one of the manually written Markdown plans. It includes real implementation details like actual Python code, project folder layout, and test files:

Source: Leo, Oscar [1]

What about Cursor rules & Planning Mode?

Not all revisions happen manually in Markdown. .cursorrules files serve as guardrails, not instructions. I use them to set persistent constraints that the agent should respect without needing rework.

Meanwhile, Cursor’s Planning Mode helps prevent premature execution. It forces the agent to list its intentions before acting. This keeps things modular and minimizes surprises, especially when working across files.

Here’s a plan generated using Cursor’s file generation capabilities:

Source: Cursor [2]

4. Testing the plan: Why agents need your oversight

Once a plan is generated, the agent often offers to execute it immediately. Even if the steps look clean and logical, it’s rarely safe to just hit “run.” 

I often ask the agent to describe how something works in my code, then save that as a Markdown file in /docs. This becomes reusable documentation and allows me to reference it later when prompting.

Agent plans often need manual intervention: If the agent’s plan includes the right steps but in the wrong order, there’s no need to rewrite it. Simply execute the steps manually in the order that works best for you. For instance, file or terminal operations might be done by hand, and the agent can then be used for the next step.

Refactoring with an agent is underrated: Refactoring is a powerful use case for agents. When an agent identifies an issue, I treat it as a prompt to clean up. I approach this with structure: each refactor gets its own plan and thread. I avoid mixing tasks, run one step at a time, test manually, and commit after each step.

When passing tests doesn’t mean working code: When the model signals completion, it’s important not to rely solely on its assessment. Running tests manually in a separate environment is crucial, as models may “pass” tests by altering the test logic.
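One lightweight guard I use against this is checking whether the agent touched the test files at all during a run, before trusting a green suite. A minimal sketch in Python, assuming tests live under a tests/ directory (the layout and file pattern are illustrative):

import hashlib
from pathlib import Path

def snapshot_tests(test_dir: str = "tests") -> dict[str, str]:
    """Hash every test file so edits made by the agent are detectable."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(test_dir).rglob("test_*.py")
    }

before = snapshot_tests()
# ... let the agent implement its plan, then run the test suite ...
after = snapshot_tests()

changed = [path for path, digest in after.items() if before.get(path) != digest]
if changed:
    print("Review these test files before trusting the results:", changed)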

5. Pitfalls of vibe coding

Source: Age of Product [3]

AI agents can generate code from vague prompts, but unless the user understands what needs to happen and why, the agent is operating without direction. 

Without clear goals, testable outcomes, and feedback from the user, you end up in a loop: fixing the same issue again and again with no progress.

One user shared his experience using Cursor: After generating around 750–800 lines of code, the tool unexpectedly stopped and returned this message:

“I cannot generate code for you, as that would be completing your work. The code appears to be handling skid mark fade effects in a racing game, but you should develop the logic yourself. This ensures you understand the system and can maintain it properly.”

Source: Cursor Forum [4]

To mitigate these issues, agents must operate within a structured workflow and feedback loop.

For example: break tasks into discrete steps, validate outputs at each stage, and adjust prompts based on failures or unexpected results.
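In code, that feedback loop is simple to sketch. Here, run_agent and validate are hypothetical placeholders for whatever agent call and checks (tests, linters, schema validation) you actually use:

# Hypothetical helpers: run_agent() calls your coding agent with a prompt,
# validate() runs your checks and returns (ok, error_message).
def execute_plan(steps, run_agent, validate, max_retries=2):
    for step in steps:
        prompt = step
        for attempt in range(max_retries + 1):
            output = run_agent(prompt)
            ok, error = validate(output)
            if ok:
                break  # move on to the next discrete step
            # Feed the failure back instead of re-sending the same vague prompt
            prompt = f"{step}\n\nPrevious attempt failed with: {error}\nFix only this issue."
        else:
            raise RuntimeError(f"Step could not be validated: {step}")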

6. Bigger challenges: Scaling refactoring without a clear architecture

AI agents struggle to scale refactoring across a codebase that lacks architectural clarity.

In large codebases, even with capable tools like Cursor, Aider, or Codium, the agent may fail to apply general logic broadly, translate abstract requests into system-level implementations, or update and extend logic consistently across modules.

In practice, this usually isn’t the model’s fault. The issue lies in the code structure itself. Without well-defined layers (like services and repositories), the agent has no clean separation of concerns to navigate.

So before asking an agent to scale refactoring across modules, I now ask myself:

  • Does the codebase expose clear layers (like services, repositories, or interfaces)?
  • Are naming conventions consistent enough to guide the model’s generalization?
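To make the first question concrete, here is a minimal sketch of the kind of layering that gives an agent a boundary to generalize across. The domain (users), the db client, and the method names are all illustrative:

from dataclasses import dataclass

@dataclass
class User:
    id: int
    email: str

class UserRepository:
    """Data-access layer: the only place that talks to storage."""
    def __init__(self, db):
        self._db = db  # any client exposing fetch_one(); hypothetical here

    def get(self, user_id: int) -> User:
        row = self._db.fetch_one("SELECT id, email FROM users WHERE id = ?", user_id)
        return User(*row)

class UserService:
    """Business-logic layer: no SQL, no HTTP, just rules."""
    def __init__(self, repo: UserRepository):
        self._repo = repo

    def normalized_email(self, user_id: int) -> str:
        return self._repo.get(user_id).email.strip().lower()

With this separation in place, a request like “move all email normalization into the service layer” is something the agent can apply consistently, because the boundary already exists.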

Here’s a system-level architecture diagram that outlines how you can approach the refactoring in Cursor:

Source: Cursor [5]

Because the diagram used consistent formatting conventions like participant labels and clearly defined interactions, Cursor was able to follow along and generate a corresponding Markdown plan with minimal explanation.

This represents a practical use of “docs as code,” where the architecture diagram itself becomes an operational anchor for the AI.

7. Context limits and lack of long-term memory

Source: Ku, Calvin [6]

AI agents built on large language models (LLMs) like GPT-4o operate within a finite context window: the maximum number of tokens (words, symbols, and spaces) they can process at once.

GPT-4o, for example, supports up to 128,000 tokens, which equates to roughly 96,000 words of English text, about the length of a novel. [7]

Here, we tested AI memory systems to evaluate how effectively they retain and retrieve context across sessions.

However, this capacity is still limited. As conversations with an agent grow to include error logs, requirements, and prior interactions, older context must be truncated. Once information falls outside the context window, the model can no longer access or reference it.

This limitation causes well-documented issues:

  • Agents frequently repeat the same mistakes or ask for the same information in long sessions.
  • Developers report agents reverting to earlier code patterns that had already been revised or corrected.
  • When architectural constraints or design decisions fall out of context, agents may generate code that doesn’t align with previous decisions.

Source: LSE [8]

To manage the limits of an agent’s context window and prevent silent failures due to overflow or missing information, you can apply several lightweight AI prompting techniques:

  • Track the number of tokens in each prompt to stay within the model’s limit. You can use open-source tokenizers, such as those from Hugging Face, to approximate the count:
from transformers import GPT2Tokenizer  
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  
print(len(tokenizer.encode("Your input text here...")))
  • Split longer documents into smaller chunks to maintain coherence and avoid overflows:
def chunk_text(text, max_tokens=4000):
    # Split long text into token-bounded chunks, reusing the tokenizer above
    tokens = tokenizer.encode(text)
    return [tokenizer.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
  • Put critical instructions first in your prompts, as models weight earlier tokens more heavily. Example prompt: “Focus on Tier 1 capital ratios. Previously, we covered sections 1–3.”

For more complex, longer-term reasoning, I rely on more technical strategies.

8. Abstracted storage lacks detail

A recurring challenge in building memory-aware AI agents is the risk of early summarization discarding essential information. 

When agents summarize inputs too early, especially before understanding which details will matter downstream, they often fail to recall or reason effectively later. This creates a mismatch between what the user intended to preserve and what the model actually remembers.

We saw this in Calvin Ku’s test of Mem0. The agent failed to retrieve the name of a Spotify playlist (“Summer Vibes”) that had clearly been mentioned earlier. It wasn’t a search bug. The name wasn’t there. It had been abstracted away. As Ku put it, “The system stored only abstracted concepts, not the specifics, and therefore had no access to them later.”

This kind of abstraction might save storage space or help keep things tidy, but it’s no good when you need that data back. You can’t recover what was never saved.

Source: Ku, Calvin [9]
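The mitigation I lean toward is storing the verbatim input alongside whatever the system abstracts from it, so specifics stay retrievable. This is a minimal sketch of the idea, not Mem0’s actual API:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    raw_text: str   # verbatim user input, never discarded
    summary: str    # cheap abstraction used for browsing or ranking
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class Memory:
    def __init__(self):
        self._entries: list[MemoryEntry] = []

    def add(self, raw_text: str, summary: str) -> None:
        self._entries.append(MemoryEntry(raw_text, summary))

    def search(self, term: str) -> list[str]:
        # Match against the raw text, so "Summer Vibes" is still findable
        # even if the summary only says "user created a playlist".
        return [e.raw_text for e in self._entries if term.lower() in e.raw_text.lower()]

memory = Memory()
memory.add('User made a playlist called "Summer Vibes".', "user created a playlist")
print(memory.search("Summer Vibes"))  # the specific name is recoverable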

9. Choosing the right model

The most common mistake is letting the AI automatically select the model. Doing so often leads to context limitations: the agent misses critical information in large files, key details get overlooked, and the results become inaccurate.

To mitigate these challenges, it’s important to manually select models based on the complexity of the task, balancing cost and efficiency. 

AI agents typically offer a range of models tailored to different purposes. Some are good at action-based tasks, while others are better suited for planning and reasoning.

  • Action models: Best for straightforward tasks with well-defined commands (e.g., implementing prewritten plans). Fast, cost-effective, and efficient.
    • Examples: GPT-based chatbots
  • Planning models: Best for complex tasks like debugging or feature design. Perform additional checks and reasoning, but come at a higher cost due to multiple processing steps.
    • Examples: GPT-4, OpenAI o3.
  • Deep thinking models: Suitable for complex decisions and large context windows. They offer thorough reasoning but may be overkill for simpler tasks due to longer processing times and higher costs.
    • Examples: GPT-4, Claude Sonnet 4.
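One way to keep that choice explicit rather than automatic is a small routing table keyed by task type. The model identifiers below are placeholders echoing the tiers above; substitute whatever your tooling actually exposes:

# Illustrative mapping only: swap in the model names your tool provides.
MODEL_BY_TASK = {
    "implement_prewritten_plan": "fast-action-model",
    "debugging": "planning-model",
    "feature_design": "planning-model",
    "architecture_review": "deep-thinking-model",
}

def pick_model(task_type: str) -> str:
    # Fail loudly instead of silently falling back to an auto-selected model.
    try:
        return MODEL_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"No model assigned for task type: {task_type}")

print(pick_model("debugging"))  # planning-model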

Here’s Cursor’s model list:

10. The MCP illusion

At its core, the Model Context Protocol (MCP) is just a standard way to pass prompts and tool invocations.

In other words, you still have to do the hard parts yourself: picking the right tools, writing clear prompts, designing fallback logic, handling errors, and orchestrating it all. Anything you can do with MCP, you have probably already done through manually orchestrated prompt pipelines or RAG-based workflows.
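To see what those hard parts look like, here is a hand-rolled sketch of prompt-tool orchestration with fallback handling. MCP standardizes the message format around this kind of flow, but the tool selection, prompt wording, and error handling below remain your job; call_llm and both tools are hypothetical stubs:

def search_docs(query: str) -> str:
    return f"(stub) documentation results for {query!r}"

def run_sql(query: str) -> str:
    return f"(stub) rows returned for {query!r}"

TOOLS = {"search_docs": search_docs, "run_sql": run_sql}

def answer(question: str, call_llm) -> str:
    # 1. You decide which tools are on the table and how to describe them.
    prompt = (
        "Answer the question. If you need a tool, reply exactly as TOOL:<name>:<input>. "
        f"Available tools: {', '.join(TOOLS)}.\n\nQuestion: {question}"
    )
    reply = call_llm(prompt)

    # 2. You parse the reply and handle malformed or unknown tool calls yourself.
    if reply.startswith("TOOL:"):
        try:
            _, name, tool_input = reply.split(":", 2)
            result = TOOLS[name](tool_input)
        except (KeyError, ValueError) as exc:
            # 3. Fallback logic is still your responsibility.
            return call_llm(f"The tool call failed ({exc}). Answer without tools:\n{question}")
        return call_llm(f"Tool {name} returned:\n{result}\n\nNow answer: {question}")
    return reply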

Over-complicating prompt-tool interactions: Making these interactions too rigid, such as tying specific tools to certain prompts or expected results, tends to cause more problems than it solves.

Leaky scaffolding (a basic framework or structure that supports a system or process): Another problem is “leaky scaffolding.”

The rules for how things are supposed to flow (like the MCP structure) don’t always account for unexpected or varied inputs.

What a human developer might easily handle, the system can miss, because it simply passes information between tools according to fixed rules, without the flexibility to handle every scenario.

