
Agentic LLM Benchmark: Top 13 LLMs Compared

Nazlı Şipi
updated on Feb 24, 2026

We benchmarked 13 LLMs across 10 software development tasks, using Aider as an agentic CLI tool through OpenRouter. Each task requires generating a full-stack web application (FastAPI backend + frontend) and passing automated API and UI tests.

Agentic LLM benchmark results

Claude 4.5 Sonnet and GPT-5.2 had the highest overall scores with the most consistent results across both API logic and UI integration. GPT-5.2 Codex and Gemini 3 Pro followed, with functional backend logic but weaker frontend output.

Success comparison

See the benchmark methodology section below for our testing procedures.

Claude Sonnet 4.5

Achieving the highest UI rate among all tested models, Claude Sonnet 4.5 consistently produced working frontends with functional backend logic. It successfully implemented CRUD operations, input validation, resource collections, multi-step workflows, and multi-stage status lifecycles. However, some tasks had authentication set up correctly but lacked resource creation, constraint enforcement, or role-based access control in domain-specific endpoints.

Gemini 3.1 Pro Preview

Technically precise backend code but fragile infrastructure setup. Passed baseline auth and listing steps in some tasks but generally failed on:

  • Frontend initialization 
  • Strict schema validation 
  • Time-based validation constraints
  • Complex state transitions
  • Cascading resource creation

GPT-5.2

Functional backends and working frontends characterized most tasks handled by GPT-5.2, with strong performance on CRUD, input validation, role-based access control, and multi-step workflows. Where it fell short:

  • State machine logic: built auth and resource listing but skipped admin status transitions and irreversible state enforcement
  • Role enforcement or constrained resource creation in some tasks

GPT-5.2 Codex

Basic flows like registration, resource listing, and collection management were handled well by GPT-5.2 Codex. Its main weaknesses:

  • Missing detail retrieval endpoints
  • No admin state transitions
  • Half of its frontends crashed with runtime errors (5 out of 10)

Compared to GPT-5.2, Codex generated more reliable backends but significantly less stable frontends.

Gemini 3 Pro

On simpler single-role tasks, Gemini 3 Pro correctly implemented CRUD, search, role-based access, and data retrieval. Multi-role applications were its weakness:

  • Passed health check and auth but failed on resource creation, association management, role enforcement, and admin workflows
  • Failed 13 out of 16 steps on two multi-role tasks
  • Frontends failed to render in 4 tasks

Claude Sonnet 4.6

With two total backend failures and low API scores across most tasks, Claude Sonnet 4.6 showed inconsistent performance. One exception: scored 0.92 API on a single task with nearly complete CRUD, authentication, role enforcement, and resource management (failing only on deletion). Across other tasks, it generated project scaffolding and working auth layers but left domain-specific business logic unfinished. Missing implementations:

  • Resource creation, listing, and detail retrieval
  • State transitions, role enforcement, input validation
  • Domain workflows: cart/checkout, ticket management, appointments, polls, event RSVP, transaction tracking

Claude Opus 4.6

Nearly complete frontends emerged from Claude Opus 4.6, but with minimal backend logic. It passed health check, register, and login steps but generally failed on:

  • Resource creation
  • State transitions
  • Role-based access
  • Input validation
  • Admin workflows

Kimi K2.5

Complete implementations for some task types contrasted with failed backends for others, suggesting Kimi K2.5 handles simpler CRUD tasks but struggles with complex multi-role or multi-step applications.

GLM 4.7

Limited results characterized GLM 4.7’s overall performance. Its highest-scoring tasks had partially loaded frontends, but authentication endpoints returned incorrect status codes. Most tasks had broken backend or frontend code.

Grok 4

Minimal backend code emerged from Grok 4, typically implementing only health check and authentication endpoints. It completed one task fully but otherwise failed on:

  • Service listings
  • Resource creation
  • Admin operations
  • State transitions

Devstral 2 2512

Partial backend logic was generated by Devstral, but no valid frontend code appeared in any task due to missing files or broken module references.

Qwen3 Coder Next

Backend code that couldn’t run characterized most tasks attempted by Qwen3 Coder Next. Where backends started, frontends failed due to missing entry points or broken components.

Trinity Large Preview

Producing the lowest scores overall, Trinity Large Preview generated project structures with errors that prevented applications from running. Most backends lacked functional route implementations and frontends had missing or broken components.

Log examples

Below are excerpts from the automated smoke test reports showing the actual test results for Claude Opus 4.6 and GPT-5.2 Codex on the same task. Each test was run 3 times with different random seeds, and both models produced identical results across all 3 runs, confirming the failures are deterministic.

For example, in task 2, Claude Opus 4.6 passed 4 out of 17 steps (auth-only), failing 12 steps consistently.

GPT-5.2 Codex passed 10 out of 17 steps (basic flows worked), failing 6 steps on detail retrieval and admin state transitions.

Cost & success comparison

Claude Opus 4.6 was the most expensive model per run but placed in the middle of the rankings, while Devstral had a similar cost to Claude 4.5 Sonnet but scored significantly lower. GPT-5.2 and GPT-5.2 Codex achieved high scores at a relatively low cost.

Completion tokens & latency

Devstral consumed the most tokens of all models but produced no working frontend, meaning a large portion of its output was non-functional or redundant code.

Kimi K2.5 and GLM 4.7 had the highest latencies, spending significantly more time per task without a corresponding improvement in results.

Grok 4 was similarly slow despite generating relatively few tokens, indicating long pauses between generations rather than large outputs. On the faster end, Gemini 3 Pro Preview and GPT-5.2 Codex completed tasks quickly with moderate token usage, and both placed in the upper half of overall scores.

LLM performance on a single successful task

After running the benchmark on 10 tasks, we found that no task was completed correctly by every LLM, and there were many steps on which models failed. We therefore wanted to measure token usage and latency on a task that all of them could complete successfully.

To this end, we designed a minimal baseline task: a simple in-memory Notes API with four CRUD endpoints, basic validation, and no authentication or database. Every LLM completed this task with a 100% pass rate, confirming that all models can handle straightforward API generation when complexity is removed.

This allowed us to compare their token usage, cost, and latency on a single successful task.
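The benchmark's task code is not published, but the baseline's core logic can be sketched as a plain in-memory store with basic validation. Function and field names below are illustrative assumptions, not the benchmark's actual code:

```python
# Hypothetical sketch of the baseline Notes task: an in-memory store
# with four CRUD operations and basic validation, no auth or database.

notes: dict[int, dict] = {}
_next_id = 1

def create_note(title: str, body: str = "") -> dict:
    """POST /notes -- rejects empty titles (basic validation)."""
    global _next_id
    if not title.strip():
        raise ValueError("title must be non-empty")
    note = {"id": _next_id, "title": title, "body": body}
    notes[_next_id] = note
    _next_id += 1
    return note

def list_notes() -> list[dict]:
    """GET /notes -- returns every stored note."""
    return list(notes.values())

def get_note(note_id: int) -> dict:
    """GET /notes/{id} -- a KeyError here would map to a 404."""
    return notes[note_id]

def delete_note(note_id: int) -> None:
    """DELETE /notes/{id}"""
    del notes[note_id]
```

In the real task this logic would sit behind FastAPI route handlers; since the baseline has no authentication or database, the store can be a module-level dict.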

Cost & lines of code comparison

In the full benchmark, Claude 4.5 Sonnet was the top-scoring model at an average cost of $0.29 per task; here it completed the baseline for just $0.012, matching the cheapest models. 

Qwen3 Coder ($0.012) and Trinity (free), which ranked last and second-to-last in the full task benchmark, offered competitive pricing compared to the top-scoring Sonnet models. This means that on a task they can all complete, the cost gap between the best and worst performers largely disappears, except for Opus which remains expensive regardless of task difficulty. 

Gemini 3.1 Pro Preview at $0.016 demonstrated efficient pricing on this baseline task, though its cost was slightly higher than the cheapest models. This positioned it competitively among mid-range performers, showing reasonable cost efficiency when task complexity is reduced.

Devstral 2 2512 showed the most dramatic cost reduction, dropping from $0.31 per task to $0.021. Since it scored only 0.07 in the full benchmark, this reveals an important aspect of LLM pricing: high costs don’t always reflect expensive per-token rates, they can result from repeated failed retries rather than the model’s base pricing structure.

Claude Opus 4.6 remained the most expensive at $0.086, consistent with its $1.17 average in the full benchmark, confirming that its per-token pricing makes it costly regardless of task difficulty.

Grok 4 produced the fewest lines of code, consistent with its low token usage in the full benchmark. GPT-5.2 Codex and GPT-5.2 had similar costs, but GPT-5.2 was faster and more efficient. This mirrors the full benchmark, where GPT-5.2 scored higher at the same cost, showing it reaches solutions more directly.

Completion tokens & latency comparison

Kimi K2.5 took 135 seconds for a task most models finished in under 30 seconds, confirming that the high latency observed in the full benchmark is a model-level constraint, not driven by task complexity.

GLM 4.7, the slowest model in the full benchmark, completed this task in 24 seconds, a 25x reduction, suggesting its latency scales with difficulty.

Qwen3 Coder was the fastest at 10 seconds despite ranking last in the full benchmark. GPT-5.2 used fewer tokens than GPT-5.2 Codex and finished faster, consistent with the full benchmark where GPT-5.2 scored higher while being more concise.

What are agentic LLM systems?

Building software is iterative: write code, run it, read errors, fix them, repeat. Agentic AI systems enable LLMs to follow this same cycle. The model operates inside a development environment where it can write files, execute commands, read outputs, and make changes based on what it sees, continuing until the task is complete.

This matters because real applications aren’t single files. They have backends with routes and database models, frontends with components and API calls, configuration files, dependencies, and tests. Making these work together requires iterative testing and refinement, which is exactly what agentic architecture enables.

How it works

The model sits inside a harness with access to a shell, file system, and execution output. When asked to build an application, it writes files incrementally. After each step, the harness shows the model what happened: did the server start, did tests pass, did the linter flag errors. Based on that feedback, the model decides what to write or fix next.

This differs fundamentally from single-shot generation. In one-shot setups, the model generates an entire codebase blind, with no way to verify if it works. In agentic LLM systems, the model sees the consequences of each action and course-corrects. However, this capability alone isn’t sufficient. The model still needs strong reasoning to implement business logic correctly, which is where performance differences really emerge.
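The write-run-read-fix cycle described above can be sketched as a toy harness loop. Here `propose_fix` is a hypothetical stand-in for the LLM call; real harnesses such as Aider also manage repo context, diffs, and tool permissions:

```python
import os
import subprocess
import sys
import tempfile

def agentic_loop(propose_fix, initial_code, max_iters=5):
    """Toy write-run-read-fix cycle. `propose_fix(code, stderr)` stands
    in for an LLM call that returns a revised version of the code."""
    code = initial_code
    for _ in range(max_iters):
        # Write the current attempt to disk, as the agent would.
        with tempfile.NamedTemporaryFile("w", suffix=".py",
                                         delete=False) as f:
            f.write(code)
            path = f.name
        # Run it and capture the output -- the "read errors" step.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        os.unlink(path)
        if result.returncode == 0:
            return code, True  # the program ran cleanly; stop iterating
        # Feed the traceback back so the model can course-correct.
        code = propose_fix(code, result.stderr)
    return code, False
```

A single-shot setup corresponds to `max_iters=1` with the feedback branch removed: the model never sees whether its code ran.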

Benchmark methodology

We used Aider for all agents and connected to the models through OpenRouter. We evaluated their ability to work autonomously on 10 software development tasks (T-1 to T-10) ranging from simple reservation systems to complex interactive dashboards. These tasks require agents to manage multi-file projects and deliver functional products.

Execution and orchestration

Every agent and task begins in a clean environment. The instructions are provided as a TASK.md file, and we use a 20-minute heartbeat watchdog for the launch scripts. During this phase, we record exit codes, execution time, and whether the backend and frontend files were created. We also track real-time token usage across input, output, and cached categories.

Backend validation: We deploy the generated projects in isolated environments to test them against a canonical YAML contract. The validation covers happy path scenarios, error handling (400/403/409), and data consistency. 

We test the results in two modes:

  • Adaptive mode validates functionality even when route names differ from the contract.
  • Strict mode requires exact adherence to the contract.

The backend overall score is calculated as: backend_overall = (ready_tasks / total_tasks) × average(adaptive success rate, strict success rate)
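Expressed as code, this score is a direct transcription of the formula, with both success rates as fractions in [0, 1]:

```python
def backend_overall(ready_tasks: int, total_tasks: int,
                    adaptive_rate: float, strict_rate: float) -> float:
    """backend_overall = (ready_tasks / total_tasks)
                         x mean(adaptive, strict success rates)."""
    readiness = ready_tasks / total_tasks
    return readiness * (adaptive_rate + strict_rate) / 2
```

A model whose backend never starts scores zero regardless of how well any running deployment would have validated.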

UI and user scenario testing 

We use browser automation to simulate real user flows, including preflights, rendering, and authentication. We verify functional steps like login submission and post-login behavior to ensure the application runs without crashes. 

The UI performance is measured by the step pass rate: step_pass_rate = passed / (passed + failed + blocked)
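As code, the pass rate above counts blocked steps (e.g. a page that never rendered, preventing later steps from running) against the model just like outright failures:

```python
def step_pass_rate(passed: int, failed: int, blocked: int) -> float:
    """UI score: step_pass_rate = passed / (passed + failed + blocked)."""
    total = passed + failed + blocked
    return passed / total if total else 0.0
```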

Tokens calculation

Token counts are extracted from the LLM API response. We subtract cached input tokens from total input tokens to get the effective input, which reflects only newly processed tokens. Output tokens are never cached, so they remain unchanged.
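A minimal sketch of this normalization, assuming illustrative field names (actual usage records vary by provider):

```python
def effective_usage(usage: dict) -> dict:
    """Subtract cached input tokens from total input so only newly
    processed tokens are counted. Output tokens are never cached.
    Key names here are assumptions, not any specific API's schema."""
    return {
        "input": usage["input_tokens"] - usage.get("cached_tokens", 0),
        "output": usage["output_tokens"],
    }
```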

Final aggregation 

The final benchmark score is calculated by combining the results from the previous phases:

Final Score = (0.7 × backend_overall) + (0.3 × ui_overall)

We assign a higher weight to the backend because logic failures at the API level often invalidate any success in the frontend.
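The weighting is a direct transcription of the stated formula:

```python
def final_score(backend_overall: float, ui_overall: float) -> float:
    """Final Score = 0.7 * backend + 0.3 * UI. The backend is weighted
    higher because API-level logic failures usually invalidate any
    frontend success."""
    return 0.7 * backend_overall + 0.3 * ui_overall
```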

Written by Nazlı Şipi, AI Researcher. Technically reviewed by Cem Dilmegani, Principal Analyst at AIMultiple.
