
Code Execution with MCP: A New Approach to AI Agent Efficiency

Sena Sezer
updated on Dec 22, 2025

Anthropic introduced a method in which AI agents interact with Model Context Protocol (MCP) servers by writing executable code rather than making direct calls to tools. The agent treats tools as files on a computer, finds what it needs, and uses them directly with code, so intermediate data doesn’t have to pass through the model’s memory.

Benchmark Results

| Metric | Regular MCP | MCP with Code Execution | Difference |
| --- | --- | --- | --- |
| Success Rate | 100% | 100% | Same |
| Avg Latency | 9.66 s | 10.37 s | +7% |
| Avg Input Tokens | 15,417 | 3,310 | -78.5% |
| Avg Output Tokens | 87 | 192 | +120% |
| Total Input Tokens | 770,852 | 165,496 | -78.5% |
| Total Output Tokens | 4,345 | 9,585 | +120% |
| Total All Tokens | 775,197 | 175,081 | -77.4% |

We compared two approaches for building AI agents that interact with external tools via the Model Context Protocol:

  • Regular MCP: Traditional approach where all tool definitions are loaded into the model’s context window
  • Code Execution MCP: Novel approach where the model writes code that calls tools, keeping intermediate data out of context

Key Findings

Input Token Savings: Code execution uses 78.5% fewer input tokens (165K vs 771K):

  • Regular loads ~15,400 tokens of tool definitions per call
  • Code execution only needs ~3,300 tokens per call

Higher Output Tokens: Code execution approach uses 2.2× more output tokens because the model writes code + explanations

Net Token Savings: 77.4% total token reduction (175K vs 775K)

Cost Implication:

  • Input tokens are typically cheaper than output tokens
  • Even so, the 78.5% input savings far outweighs the 2.2× output increase
  • Estimated ~70% cost reduction with code execution (worked example below)
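
As a rough sanity check on that estimate, here is the arithmetic using GPT-4.1 list prices at the time of writing (roughly $2 per million input tokens and $8 per million output tokens; the prices are an assumption, not part of the benchmark):

```python
# Back-of-envelope cost check with assumed GPT-4.1 list prices
# ($2 / $8 per 1M input / output tokens); token totals are from the table above.
regular_cost = 770_852 / 1e6 * 2 + 4_345 / 1e6 * 8       # ~= $1.58
code_exec_cost = 165_496 / 1e6 * 2 + 9_585 / 1e6 * 8     # ~= $0.41
print(f"{1 - code_exec_cost / regular_cost:.0%} cheaper")  # -> 74% cheaper
```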

Both approaches achieved a 100% success rate on these queries with GPT-4.1.

The code execution approach is inspired by Anthropic’s post on using code execution with MCP to reduce context window usage while maintaining agent capability.1

Methodology

Tasks

We ran each task 50 times for each approach:

  • Go to https://research.aimultiple.com/open-source-embedding-models/, tell me the perfect top-5 performers (i.e., the models with 100% top-5 accuracy).
  • Go to https://research.aimultiple.com/open-source-embedding-models/, tell me which model has the highest latency.

Benchmark Setup

We used Bright Data’s MCP server with pro mode turned on, as it was the MCP server with the highest accuracy in our browser MCP benchmark.

We used GPT-4.1 as the LLM, due to its large context window.

Environment Setup: We cleared any cached data and ensured a fresh MCP server connection per run. Each query was executed as a separate subprocess (sketched below).
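
A minimal sketch of this harness, assuming a hypothetical `run_single.py` entry point that runs one approach against one query and prints JSON metrics:

```python
# Per-run isolation: each (approach, query, run) gets its own subprocess,
# so no MCP connection or cache is reused. run_single.py is hypothetical.
import json
import subprocess

QUERIES = [
    "Go to https://research.aimultiple.com/open-source-embedding-models/, "
    "tell me the perfect top-5 performers",
    "Go to https://research.aimultiple.com/open-source-embedding-models/, "
    "tell me which model has the highest latency.",
]

for approach in ("regular", "code_execution"):
    for query in QUERIES:
        for run in range(50):
            out = subprocess.run(
                ["python", "run_single.py", approach, query],
                capture_output=True, text=True, check=True,
            )
            print(json.loads(out.stdout))  # latency, tokens, success per run
```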

Architecture Comparison

Regular MCP Architecture

In the regular MCP approach, the agent follows a straightforward flow: the user query enters a LangGraph ReAct Agent, which has access to all 63 tool definitions in its context window. The agent selects and calls tools through the MCP Client Session, and tool results flow back through the context window to inform the agent’s next action.
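
A minimal sketch of that setup, assuming the langchain-mcp-adapters and langgraph packages (exact class and parameter names may differ by version), with the Bright Data server wired over stdio:

```python
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

async def main() -> None:
    client = MultiServerMCPClient({
        "brightdata": {
            "command": "npx",
            "args": ["@brightdata/mcp"],
            "transport": "stdio",
        },
    })
    # Every tool schema is loaded up front: all 63 definitions enter the context.
    tools = await client.get_tools()
    agent = create_react_agent("openai:gpt-4.1", tools)
    result = await agent.ainvoke({
        "messages": [("user", "Which model has the highest latency?")]
    })
    print(result["messages"][-1].content)

asyncio.run(main())
```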

Code Execution MCP Architecture

The code execution approach adds an intermediate layer: the user query goes to a Code Execution Agent with a compact context (only tool names, not full schemas). The agent writes Python code that calls tools. This code runs in a sandboxed Code Executor environment, which communicates with the MCP Client Session. Only the final results or summaries return to the agent’s context, not raw intermediate data.

The code execution implementation uses progressive disclosure. Only tool names and truncated descriptions (60 characters) are included in the system prompt. When the model needs to use a tool, it writes Python code that calls an async call_tool() function provided in the execution environment.
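
A hedged sketch of that pattern, using the official mcp Python SDK's `ClientSession.call_tool()`; the prompt building and the bare `exec()` plumbing are illustrative assumptions (a production version would run the generated code in a real sandbox):

```python
import textwrap

def build_system_prompt(tools) -> str:
    # Progressive disclosure: tool names plus descriptions truncated to 60 chars.
    listing = "\n".join(f"- {t.name}: {(t.description or '')[:60]}" for t in tools)
    return (
        "Write Python code that awaits call_tool(name, arguments) to use tools.\n"
        f"Available tools:\n{listing}"
    )

async def run_generated_code(code: str, session):
    """Execute model-written code in a namespace that exposes call_tool()."""
    async def call_tool(name: str, arguments: dict):
        result = await session.call_tool(name, arguments)
        return result.content  # raw tool output stays here, outside the LLM context

    namespace = {"call_tool": call_tool}
    # Wrap the generated code in an async function so it can await tool calls.
    exec("async def __agent_main__():\n" + textwrap.indent(code, "    "), namespace)
    return await namespace["__agent_main__"]()  # only this value re-enters the context
```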

Limitations

  1. Query Diversity: Only two query types were tested; results may vary for other task types.
  2. Single Model: Only tested with GPT-4.1; other models may show different patterns.
  3. Code Quality: Code execution success depends on the model's code generation ability, which may reduce success rates on more complicated tasks.

Why Traditional MCP Wastes Resources

Problem 1: Tool Definitions Consume Excessive Context

Each tool needs instructions in the model’s memory. A basic example:

```
gdrive.getDocument
Gets a file from Google Drive
Needs: document ID
Returns: the file content
```

Example: An agent connected to 50 servers with 20 tools each means 1,000 tool definitions. At roughly 150 tokens per definition, that’s 150,000 tokens consumed before the agent reads your first request.

Problem 2: Data Gets Processed Multiple Times

Task: “Get my meeting notes from Google Drive and add them to Salesforce.”

What happens:

  1. Agent gets the document (50,000 tokens)
  2. The model reads it
  3. Agent sends it to Salesforce (another 50,000 tokens)

The model handles 100,000+ tokens simply to move data from one place to another, like having someone read an entire book aloud just to hand it to someone else.
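
Under code execution, the agent would instead emit something like the snippet below (tool names, argument shapes, and IDs are hypothetical, and the code is assumed to run inside the async wrapper from the earlier sketch), so the document never transits the model:

```python
# Model-generated code: the 50,000-token document flows between the two tools
# inside the executor; only the final one-line summary returns to the model.
notes = await call_tool("gdrive_get_document", {"document_id": "MEETING_NOTES_ID"})
await call_tool("salesforce_update_record", {
    "object": "SalesMeeting",
    "record_id": "RECORD_ID",
    "fields": {"notes": notes},
})
return "Meeting notes copied to Salesforce"
```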

Comparison: UTCP vs MCP vs Code Execution MCP

UTCP (the Universal Tool Calling Protocol) avoids the MCP server layer entirely by describing how to call existing APIs directly, while regular MCP routes every call through a server whose full tool schemas occupy the context window. Code execution with MCP keeps MCP servers but changes how agents interact with them.

Current State and Implementation

Anthropic outlined this approach but didn’t release implementation code. The community needs to build:

  1. Tool file generation from MCP server definitions (sketched after this list)
  2. Filesystem navigation for tool discovery
  3. Execution environment setup with proper sandboxing
  4. Integration with existing MCP clients
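
As an example of item 1, here is a hedged sketch that turns an MCP server's tool list into a directory of Python wrapper files, in the spirit of Anthropic's post; the output layout and the `runtime.call_tool` import are assumptions, not a released implementation:

```python
from pathlib import Path

async def generate_tool_files(session, out_dir: str = "servers/brightdata"):
    """Write one wrapper module per MCP tool so agents can discover tools as files."""
    tools = (await session.list_tools()).tools  # mcp ClientSession API
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    for tool in tools:
        # Assumes tool names are valid Python identifiers; real code would sanitize.
        stub = (
            f'"""{tool.description}"""\n'
            "from runtime import call_tool  # hypothetical shared helper\n\n"
            f"async def {tool.name}(**arguments):\n"
            f"    return await call_tool({tool.name!r}, arguments)\n"
        )
        (root / f"{tool.name}.py").write_text(stub)
```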

Some developers have started exploring this in the MCP community. Cloudflare published similar ideas under the name “Code Mode.”

Key Takeaways

Code execution with MCP addresses two fundamental inefficiencies in traditional MCP implementations:

  1. Tool definitions no longer crowd the context window
  2. Intermediate data stops flowing through the model unnecessarily

The approach works best when:

  • You have many MCP tools connected
  • Your workflows involve multi-step data processing
  • Large documents or datasets move between tools
  • Context window limits affect your agents

The infrastructure requirements mean this isn’t automatically better for all use cases. Small-scale deployments with few tools might not justify the operational complexity.

For organizations already running agents with extensive MCP tool catalogs, the 77.4% token reduction we measured (and the 98%+ reductions Anthropic reports for larger tool catalogs) and the corresponding cost savings make this approach worth investigating.

Sena Sezer is an industry analyst at AIMultiple. She completed her Bachelor's at Bogazici University.

Researched by Şevval Alper, an AIMultiple industry analyst specializing in AI coding tools, AI agents, and quantum technologies.
