
Code Execution with MCP: A New Approach to AI Agent Efficiency

Sena Sezer
updated on Dec 22, 2025

Anthropic introduced a method in which AI agents interact with Model Context Protocol (MCP) servers by writing executable code rather than making direct calls to tools. The agent treats tools as files on a computer, finds what it needs, and uses them directly with code, so intermediate data doesn’t have to pass through the model’s memory.

Benchmark Results

| Metric | Regular MCP | MCP with Code Execution | Difference |
| --- | --- | --- | --- |
| Success Rate | 100% | 100% | Same |
| Avg Latency | 9.66 s | 10.37 s | +7% |
| Avg Input Tokens | 15,417 | 3,310 | -78.5% |
| Avg Output Tokens | 87 | 192 | +120% |
| Total Input Tokens | 770,852 | 165,496 | -78.5% |
| Total Output Tokens | 4,345 | 9,585 | +120% |
| Total All Tokens | 775,197 | 175,081 | -77.4% |

We compared two approaches for building AI agents that interact with external tools via the Model Context Protocol:

  • Regular MCP: Traditional approach where all tool definitions are loaded into the model’s context window
  • Code Execution MCP: Novel approach where the model writes code that calls tools, keeping intermediate data out of context

Key Findings

Input Token Savings: Code execution uses 78.5% fewer input tokens (165K vs 771K):

  • Regular loads ~15,400 tokens of tool definitions per call
  • Code execution only needs ~3,300 tokens per call

Higher Output Tokens: Code execution approach uses 2.2× more output tokens because the model writes code + explanations

Net Token Savings: 77.4% total token reduction (175K vs 775K)

Cost Implication:

  • Input tokens are typically cheaper than output tokens
  • Even so, the 78.5% input savings far outweighs the 2.2× output increase
  • Estimated ~70% cost reduction with code execution (worked example below)
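
As a rough sanity check on that estimate, here is the arithmetic using GPT-4.1 list prices at the time of writing (roughly $2 per million input tokens and $8 per million output tokens; the prices are an assumption, not part of the benchmark):

```python
# Back-of-envelope cost check with assumed GPT-4.1 list prices
# ($2 / $8 per 1M input / output tokens); token totals are from the table above.
regular_cost = 770_852 / 1e6 * 2 + 4_345 / 1e6 * 8       # ~= $1.58
code_exec_cost = 165_496 / 1e6 * 2 + 9_585 / 1e6 * 8     # ~= $0.41
print(f"{1 - code_exec_cost / regular_cost:.0%} cheaper")  # -> 74% cheaper
```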

Both approaches achieved a 100% success rate on these queries with GPT-4.1.

The code execution approach is inspired by Anthropic’s post on using code execution with MCP to reduce context window usage while maintaining agent capability.1

Methodology

Tasks

We ran each task 50 times for each approach:

  • Go to https://research.aimultiple.com/open-source-embedding-models/, tell me the perfect top-5 performers (i.e., the models with 100% top-5 accuracy).
  • Go to https://research.aimultiple.com/open-source-embedding-models/, tell me which model has the highest latency.

Benchmark Setup

We used Bright Data’s MCP server with pro mode turned on, as it was the MCP server with the highest accuracy in our browser MCP benchmark.

We used GPT-4.1 as the LLM, due to its large context window.

Environment Setup: We cleared any cached data and ensured a fresh MCP server connection per run. Each query was executed as a separate subprocess (sketched below).
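
A minimal sketch of this harness, assuming a hypothetical `run_single.py` entry point that runs one approach against one query and prints JSON metrics:

```python
# Per-run isolation: each (approach, query, run) gets its own subprocess,
# so no MCP connection or cache is reused. run_single.py is hypothetical.
import json
import subprocess

QUERIES = [
    "Go to https://research.aimultiple.com/open-source-embedding-models/, "
    "tell me the perfect top-5 performers",
    "Go to https://research.aimultiple.com/open-source-embedding-models/, "
    "tell me which model has the highest latency.",
]

for approach in ("regular", "code_execution"):
    for query in QUERIES:
        for run in range(50):
            out = subprocess.run(
                ["python", "run_single.py", approach, query],
                capture_output=True, text=True, check=True,
            )
            print(json.loads(out.stdout))  # latency, tokens, success per run
```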

Architecture Comparison

Regular MCP Architecture

In the regular MCP approach, the agent follows a straightforward flow: the user query enters a LangGraph ReAct Agent, which has access to all 63 tool definitions in its context window. The agent selects and calls tools through the MCP Client Session, and tool results flow back through the context window to inform the agent’s next action.
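
A minimal sketch of that setup, assuming the langchain-mcp-adapters and langgraph packages (exact class and parameter names may differ by version), with the Bright Data server wired over stdio:

```python
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

async def main() -> None:
    client = MultiServerMCPClient({
        "brightdata": {
            "command": "npx",
            "args": ["@brightdata/mcp"],
            "transport": "stdio",
        },
    })
    # Every tool schema is loaded up front: all 63 definitions enter the context.
    tools = await client.get_tools()
    agent = create_react_agent("openai:gpt-4.1", tools)
    result = await agent.ainvoke({
        "messages": [("user", "Which model has the highest latency?")]
    })
    print(result["messages"][-1].content)

asyncio.run(main())
```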

Code Execution MCP Architecture

The code execution approach adds an intermediate layer: the user query goes to a Code Execution Agent with a compact context (only tool names, not full schemas). The agent writes Python code that calls tools. This code runs in a sandboxed Code Executor environment, which communicates with the MCP Client Session. Only the final results or summaries return to the agent’s context, not raw intermediate data.

The code execution implementation uses progressive disclosure. Only tool names and truncated descriptions (60 characters) are included in the system prompt. When the model needs to use a tool, it writes Python code that calls an async call_tool() function provided in the execution environment.
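
A hedged sketch of that pattern, using the official mcp Python SDK's `ClientSession.call_tool()`; the prompt building and the bare `exec()` plumbing are illustrative assumptions (a production version would run the generated code in a real sandbox):

```python
import textwrap

def build_system_prompt(tools) -> str:
    # Progressive disclosure: tool names plus descriptions truncated to 60 chars.
    listing = "\n".join(f"- {t.name}: {(t.description or '')[:60]}" for t in tools)
    return (
        "Write Python code that awaits call_tool(name, arguments) to use tools.\n"
        f"Available tools:\n{listing}"
    )

async def run_generated_code(code: str, session):
    """Execute model-written code in a namespace that exposes call_tool()."""
    async def call_tool(name: str, arguments: dict):
        result = await session.call_tool(name, arguments)
        return result.content  # raw tool output stays here, outside the LLM context

    namespace = {"call_tool": call_tool}
    # Wrap the generated code in an async function so it can await tool calls.
    exec("async def __agent_main__():\n" + textwrap.indent(code, "    "), namespace)
    return await namespace["__agent_main__"]()  # only this value re-enters the context
```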

Limitations

  1. Query Diversity: Only two query types were tested; results may vary for other task types.
  2. Single Model: Only tested with GPT-4.1; other models may show different patterns.
  3. Code Quality: Code execution success depends on the model's code generation ability, which may reduce success rates on more complicated tasks.

Why Traditional MCP Wastes Resources

Problem 1: Tool Definitions Consume Excessive Context

Each tool needs instructions in the model’s memory. A basic example:

```
gdrive.getDocument
Gets a file from Google Drive
Needs: document ID
Returns: the file content
```

Example: An agent connected to 50 servers with 20 tools each means 1,000 tool definitions. At roughly 150 tokens per definition, that’s 150,000 tokens consumed before the agent reads your first request.

Problem 2: Data Gets Processed Multiple Times

Task: “Get my meeting notes from Google Drive and add them to Salesforce.”

What happens:

  1. Agent gets the document (50,000 tokens)
  2. The model reads it
  3. Agent sends it to Salesforce (another 50,000 tokens)

The model handles 100,000+ tokens simply to move data from one place to another, like having someone read an entire book aloud just to hand it to someone else.
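
Under code execution, the agent would instead emit something like the snippet below (tool names, argument shapes, and IDs are hypothetical, and the code is assumed to run inside the async wrapper from the earlier sketch), so the document never transits the model:

```python
# Model-generated code: the 50,000-token document flows between the two tools
# inside the executor; only the final one-line summary returns to the model.
notes = await call_tool("gdrive_get_document", {"document_id": "MEETING_NOTES_ID"})
await call_tool("salesforce_update_record", {
    "object": "SalesMeeting",
    "record_id": "RECORD_ID",
    "fields": {"notes": notes},
})
return "Meeting notes copied to Salesforce"
```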

Comparison: UTCP vs MCP vs Code Execution MCP

UTCP (the Universal Tool Calling Protocol) avoids the MCP server layer entirely by describing how to call existing APIs directly, while regular MCP routes every call through a server whose full tool schemas occupy the context window. Code execution with MCP keeps MCP servers but changes how agents interact with them.

Current State and Implementation

Anthropic outlined this approach but didn’t release implementation code. The community needs to build:

  1. Tool file generation from MCP server definitions (sketched after this list)
  2. Filesystem navigation for tool discovery
  3. Execution environment setup with proper sandboxing
  4. Integration with existing MCP clients
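
As an example of item 1, here is a hedged sketch that turns an MCP server's tool list into a directory of Python wrapper files, in the spirit of Anthropic's post; the output layout and the `runtime.call_tool` import are assumptions, not a released implementation:

```python
from pathlib import Path

async def generate_tool_files(session, out_dir: str = "servers/brightdata"):
    """Write one wrapper module per MCP tool so agents can discover tools as files."""
    tools = (await session.list_tools()).tools  # mcp ClientSession API
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    for tool in tools:
        # Assumes tool names are valid Python identifiers; real code would sanitize.
        stub = (
            f'"""{tool.description}"""\n'
            "from runtime import call_tool  # hypothetical shared helper\n\n"
            f"async def {tool.name}(**arguments):\n"
            f"    return await call_tool({tool.name!r}, arguments)\n"
        )
        (root / f"{tool.name}.py").write_text(stub)
```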

Some developers have started exploring this in the MCP community. Cloudflare published similar ideas under the name “Code Mode.”

Key Takeaways

Code execution with MCP addresses two fundamental inefficiencies in traditional MCP implementations:

  1. Tool definitions no longer crowd the context window
  2. Intermediate data stops flowing through the model unnecessarily

The approach works best when:

  • You have many MCP tools connected
  • Your workflows involve multi-step data processing
  • Large documents or datasets move between tools
  • Context window limits affect your agents

The infrastructure requirements mean this isn’t automatically better for all use cases. Small-scale deployments with few tools might not justify the operational complexity.

For organizations already running agents with extensive MCP tool catalogs, the 77.4% token reduction we measured (and the 98%+ reductions Anthropic reports for larger tool catalogs) and the corresponding cost savings make this approach worth investigating.

Sena Sezer is an industry analyst at AIMultiple. She completed her Bachelor's at Bogazici University.

Researched by Şevval Alper, an AIMultiple industry analyst specializing in AI coding tools, AI agents, and quantum technologies.
