
Benchmarking Agentic AI Frameworks in Analytics Workflows

Cem Dilmegani
updated on Oct 16, 2025

Frameworks for building agentic workflows differ substantially in how they handle decisions and errors, yet their performance on imperfect real-world data remains largely untested.

To evaluate their performance on real-world analytical workflows, we spent 3 days benchmarking LangGraph, LangChain, CrewAI, and OpenAI Swarm using a 100-record e-commerce dataset with controlled data inconsistencies such as missing IDs, null values and inconsistent date formats.

Each framework was assessed for decision accuracy and efficiency, tool integration performance, and execution performance (time and token usage).

Decision accuracy and efficiency

  • Decision accuracy measures how effectively each framework resolved data-related issues, including null values, default assignments, field mappings, and failure recovery.
  • Decision efficiency represents the proportion of critical issues resolved relative to total decisions. A score of 100% denotes optimal one-step resolution, while lower values indicate additional retries or redundant decision cycles that increase computational overhead. You can see the benchmark methodology here.

Swarm

High Efficiency, High Accuracy (60%, 90%)

Swarm achieved high accuracy while maintaining efficient execution across analytical workflows.

Performance metrics showed consistently low decision counts and minimal retries. This outcome reflects Swarm’s modular, task-specific architecture, in which individual agents manage defined analytical functions such as KPI analysis or competitor research.

Swarm therefore combines strong coordination with efficient task distribution, making it a good fit for multi-agent analytical environments requiring both speed and precision.
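
For illustration, a minimal sketch of the task-specific agent handoff pattern Swarm supports is shown below. The agent names, instructions, and the kpi_analysis tool are hypothetical stand-ins, not the agents used in the benchmark:

```python
from swarm import Swarm, Agent

def kpi_analysis(question: str) -> str:
    """Hypothetical tool standing in for the benchmark's KPI analysis step."""
    return "Total revenue and average discount computed over the cleaned records."

kpi_agent = Agent(
    name="KPI Agent",
    instructions="Compute the requested KPIs and report the results.",
    functions=[kpi_analysis],
)

def transfer_to_kpi_agent():
    """Returning an Agent from a tool hands the conversation off to it."""
    return kpi_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route each analytical task to the matching specialist agent.",
    functions=[transfer_to_kpi_agent],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "Compute the KPIs for the cleaned dataset."}],
)
print(response.messages[-1]["content"])
```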

LangGraph

High Efficiency, High Accuracy (60%, 100%)

LangGraph achieved both high accuracy and efficient execution, completing analytical workflows with fewer decision events.

Metrics from repeated test runs showed consistently direct execution paths and minimal retries. This pattern reflects LangGraph’s graph-based architecture, which predefines execution dependencies and reduces redundant operations.

LangGraph thus delivers precise, consistent, and efficient performance, making it a good fit for structured analytical workflows.
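
The graph-based pattern described above can be sketched as a minimal two-node pipeline. This is illustrative only; the node functions, state fields, and integrity rule are assumptions, not the benchmark's actual graph:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class PipelineState(TypedDict):
    records: list
    kpis: dict

def clean_data(state: PipelineState) -> dict:
    # Drop records that fail a basic integrity check (illustrative rule).
    valid = [r for r in state["records"] if r.get("Final_Price") is not None]
    return {"records": valid}

def compute_kpis(state: PipelineState) -> dict:
    total = sum(r["Final_Price"] for r in state["records"])
    return {"kpis": {"total_revenue": total}}

graph = StateGraph(PipelineState)
graph.add_node("clean", clean_data)
graph.add_node("kpis", compute_kpis)
graph.set_entry_point("clean")   # execution order is fixed before the run
graph.add_edge("clean", "kpis")
graph.add_edge("kpis", END)

app = graph.compile()
result = app.invoke({"records": [{"Final_Price": 900.0}, {"Final_Price": None}], "kpis": {}})
print(result["kpis"])  # {'total_revenue': 900.0}
```

Because the edges are declared before execution, the path through the workflow is fixed, which is consistent with the low decision counts and minimal retries reported above.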

CrewAI

Low Efficiency, High Accuracy (21%, 87%)

CrewAI achieved high accuracy but required a substantially greater number of decisions to complete each workflow.

Data recorded by the DecisionTracker and AccuracyLatencyTracker showed multiple additional decision events occurring after tool failures. This pattern indicates strong fault tolerance that ensured reliable final outputs but increased computational overhead and runtime.

CrewAI therefore prioritizes result completeness and reliability over execution efficiency.
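
For comparison, CrewAI's role-based delegation looks roughly like the sketch below; the role, goal, and task descriptions are illustrative, not the benchmark's prompts:

```python
from crewai import Agent, Task, Crew

analyst = Agent(
    role="E-commerce Data Analyst",
    goal="Clean the transaction records and compute the requested KPIs",
    backstory="Specializes in messy transactional data and recovering from tool failures.",
)

kpi_task = Task(
    description="Clean the dataset, then compute total revenue and average discount.",
    expected_output="A short KPI summary with the computed figures.",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[kpi_task])
print(crew.kickoff())
```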

LangChain

Medium Efficiency, Low Accuracy (42%, 78%)

LangChain demonstrated moderate efficiency but lower accuracy compared to other frameworks.

Recorded metrics showed repeated decision iterations following tool failures, as the framework retried identical operations instead of adapting to alternative strategies. This sequential execution pattern limited recovery effectiveness and resulted in partial task completion.

LangChain therefore offers reasonable throughput but weak fault tolerance, making it better suited to simpler, low-risk analytical workflows.

Tool integration performance

Swarm

(100% tool coordination success rate)

Swarm maintained a 100 % tool success rate through its specialized agent architecture. Distinct agents managed analytical tasks such as KPI analysis, competitor comparison, and currency conversion, enabling seamless task handoffs and efficient tool utilization.

LangGraph

(100% tool coordination success rate)

LangGraph achieved a 100 % tool execution success rate. Its graph-based orchestration effectively mapped tool dependencies and execution order, preventing redundant or conflicting calls. The framework demonstrated high reliability and consistent coordination across all modules.

CrewAI

(37% tool coordination success rate)

CrewAI showed a low rate of successful tool executions, particularly in KPI and validation modules. Despite this, all tasks were completed through additional reasoning and recovery cycles, indicating strong fault tolerance with higher computational overhead.

LangChain

(51% tool coordination success rate)

LangChain achieved moderate tool execution success but lacked adaptive recovery. When tool calls failed, it repeated the same operation sequence, resulting in redundant processing and incomplete outputs.

Execution time and completion tokens


Swarm

Fastest and most efficient

Swarm completed all workflows in approximately 20 seconds using about 1K tokens, the lowest among all frameworks. Its consistent completion times and minimal token consumption indicate stable and efficient execution across runs.

LangGraph

Balanced performance

LangGraph completed workflows with moderate execution times and token usage relative to the other frameworks. Its direct execution paths and minimal retries kept completion times and token consumption consistent across runs.

CrewAI

Resource-intensive but reliable

CrewAI required about 32 seconds and 4.5K tokens per run, the highest token consumption in the benchmark. Extended reasoning and validation cycles resulted in longer runtimes but consistent task completion, indicating high reliability with increased cost.

LangChain

Slowest and least efficient

LangChain completed runs in approximately 48 seconds, consuming around 2.1K tokens. Repeated retries after failed tool executions contributed to longer runtimes and inefficient resource utilization.

Error handling approaches

To assess native error management, each framework was evaluated using its own data processing logic instead of a shared preprocessing pipeline. This comparison highlighted key differences between frameworks prioritizing data integrity and those emphasizing processing completeness.

LangGraph and Swarm prioritized accuracy and data integrity through validation and exclusion, while CrewAI and LangChain favored completeness, either by retaining incomplete data or imputing missing values, leading to greater variability in analytical precision.

Here is a detailed breakdown:

Swarm

Swarm applied precise skip logic, excluding invalid or incomplete records while maintaining overall workflow continuity. After resolving minor API compatibility issues, the framework consistently processed verified records without affecting execution flow.

LangGraph

LangGraph enforced strict data validation, omitting entries with null or incomplete values. This conservative approach ensured analytical accuracy by processing only records that passed integrity checks, maintaining consistent results across test runs.

CrewAI

CrewAI operated under a “zero data loss” principle, retaining all records, including those with missing or invalid fields. While this approach preserved dataset completeness, it reduced calculation accuracy due to the inclusion of unverified data points.

LangChain

LangChain used data imputation techniques to infer missing values from existing fields. For example, when Final_Price was null, it computed replacements from Price and Discount fields. Although adaptive, this introduced deviations from expected outcomes, impacting result precision.
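
The Final_Price imputation can be expressed in a few lines. The sketch below is illustrative; the exact field names and rounding are assumptions based on the dataset description, not LangChain's internal logic:

```python
def impute_final_price(record: dict) -> dict:
    """If Final_Price is missing, derive it from Price and Discount."""
    if record.get("Final_Price") in (None, ""):
        price = float(record["Price"])
        discount = float(record["Discount"])
        record["Final_Price"] = round(price * (1 - discount / 100), 2)
    return record

print(impute_final_price({"Price": 1000.0, "Discount": 10, "Final_Price": None}))
# {'Price': 1000.0, 'Discount': 10, 'Final_Price': 900.0}
```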

When to use each framework?

  • CrewAI: When unexpected problems are likely and autonomous problem-solving is required.
  • LangGraph: For balanced reasoning and structure. Best for general-purpose use cases.
  • Swarm: In production environments where speed and reliability are critical. Fastest and most consistent.
  • LangChain: When detailed traceability and transparency are needed. Logs every step but slower than alternatives.

Developer experience

Framework–LLM integration performance: Different frameworks demonstrate varying levels of compatibility and performance with specific LLM providers. For instance, LangChain exhibits superior integration and accuracy when paired with OpenAI’s ChatGPT models, delivering more precise results through optimized prompt handling.

Architecture-driven behavior consistency: Although frameworks can utilize different LLMs with varying efficiency, their core behavioral characteristics remained largely consistent across models. The characteristic behaviors we observed – such as decision-making patterns, recovery handling and alternative reasoning capabilities – are primarily dependent on their underlying architectural design rather than the specific LLM employed.

This suggests that framework-LLM combinations can impact performance metrics, but the core behavioral patterns like CrewAI’s “whatever it takes” approach or Swarm’s specialized agent coordination remain consistent regardless of the language model used.

Integration challenges: We encountered notable integration challenges when attempting to connect CrewAI with Anthropic’s Claude models. Despite multiple configuration attempts, persistent environment setup errors prevented successful deployment.

Our research indicates this is not an isolated issue – numerous developers in the community have reported similar integration difficulties between CrewAI and Anthropic services, suggesting potential architectural incompatibilities or API handling limitations.

Recommendations for framework–LLM pairing: Based on these findings, we recommend evaluating different framework-LLM combinations when selecting frameworks for your specific use case.

Experimental setup

Objective

We aimed to objectively compare four AI agent frameworks (LangGraph, LangChain, CrewAI, and OpenAI Swarm) using identical datasets, tools, and measurement systems.

The evaluation focused on decision-making accuracy, resource efficiency, and tool integration capabilities under controlled error conditions representative of real-world data challenges.

Dataset description

A 100-record e-commerce dataset was used to test analytical decision-making capacity. Each record represented a single purchase transaction and included the following fields:

  • User_ID: Customer identifier
  • Product_ID: Product identifier
  • Category: Product category
  • Price (Rs.): Original price
  • Discount (%): Discount applied
  • Final_Price (Rs.): Price after discount
  • Payment_Method: Method of payment
  • Purchase_Date: Transaction date

Data perturbations

To replicate real-world imperfections, the dataset contained controlled inconsistencies designed to test recovery and reasoning capabilities; an illustrative perturbed record is sketched after the list:

  • Null values in key fields (e.g., Final_Price)
  • Empty fields: "Product_ID": "", "User_ID": "", "Category": ""
  • Mixed field names: cost → Price, revenue → Discount
  • Inconsistent date formats: "07/01/2024" vs "dd-mm-yyyy"
  • Zero or negative numeric values
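
A record combining several of these perturbations might look like the following (fabricated values, not a row from the benchmark dataset):

```python
perturbed_record = {
    "User_ID": "",                  # empty identifier
    "Product_ID": "",               # empty identifier
    "Category": "Electronics",
    "Price": 1499.0,
    "Discount": 20,
    "Final_Price": None,            # null value the framework must resolve
    "Payment_Method": "Credit Card",
    "Purchase_Date": "07/01/2024",  # mixed with dd-mm-yyyy dates elsewhere
}
```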

Task definitions

Each framework was assigned five identical analytical tasks designed to test multi-step reasoning and data handling:

  1. Data processing: Perform enhanced cleaning and transformation with framework-specific execution logic.
  2. KPI calculation: Compute identical performance indicators using the enhanced_kpi_calculator tool.
  3. Competitor analysis: Conduct analysis on the top three products using the CompetitorAPI.
  4. Currency conversion: Convert total revenue to USD via the CurrencyAPI.
  5. Error handling: Apply native error management and recovery strategies for data inconsistencies.

Execution consistency

All frameworks operated under identical testing conditions; a sketch of the mock API stubs follows the list:

  • Same JSON dataset, ground truth KPIs, mock APIs, and timing delays (CompetitorAPI = 0.05 s, CurrencyAPI = 0.1 s).
  • Tracking systems (decision_tracker, perf_tracker) were reset before each test.
  • Each framework executed pipelines 10 times, and median values were recorded for all metrics.
  • The same tool functions were used across frameworks, with naming conventions adapted (e.g., _swarm_tool, _crewAI_tool).
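
The mock APIs can be approximated as simple stubs with fixed delays. The return values and exchange rate below are placeholders; only the 0.05 s and 0.1 s delays come from the setup above:

```python
import time

def mock_competitor_api(product_id: str) -> dict:
    """Stand-in for the CompetitorAPI with its fixed 0.05 s delay."""
    time.sleep(0.05)
    return {"product_id": product_id, "competitor_price": 1299.0}  # placeholder response

def mock_currency_api(amount: float, target: str = "USD") -> float:
    """Stand-in for the CurrencyAPI with its fixed 0.1 s delay."""
    time.sleep(0.1)
    return round(amount * 0.012, 2)  # illustrative Rs.-to-USD rate
```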

Measurement infrastructure

A standardized benchmarking infrastructure was implemented across all frameworks; a minimal sketch of the tracker interfaces follows the list:

  • AccuracyLatencyTracker: Measured timing and latency (start_timer, end_timer).
  • DecisionTracker: Logged and categorized each decision event.
  • EnhancedAnalyticsDataProcessor: Provided consistent data cleaning logic.
  • Mock APIs: Simulated third-party tool behavior with fixed delays and controlled responses.
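
Only the class names and the start_timer/end_timer interface are taken from the article; the rest of this sketch is an assumed implementation:

```python
import time

class AccuracyLatencyTracker:
    """Records how long each pipeline stage takes."""
    def __init__(self):
        self.timings = {}
        self._starts = {}

    def start_timer(self, label: str) -> None:
        self._starts[label] = time.perf_counter()

    def end_timer(self, label: str) -> None:
        self.timings[label] = time.perf_counter() - self._starts.pop(label)

class DecisionTracker:
    """Logs each decision event with a category and a correctness flag."""
    def __init__(self):
        self.decisions = []

    def log_decision(self, category: str, correct: bool) -> None:
        self.decisions.append({"category": category, "correct": correct})
```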

Methodology

Standardized workflow

All frameworks followed a unified experimental workflow:
tracker reset → dataset load → enhanced data processing → framework-specific execution → result extraction → metrics calculation.

This ensured that each run followed identical procedural steps under controlled conditions.
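
Put together, the benchmark loop for one framework can be sketched as follows; run_pipeline stands in for each framework's execution entry point, and the 10-run median matches the procedure described above:

```python
import statistics
import time

def benchmark(run_pipeline, dataset, runs=10):
    """Repeat the standardized workflow and report the median runtime."""
    durations = []
    for _ in range(runs):
        # tracker reset -> dataset load -> processing -> execution -> extraction
        start = time.perf_counter()
        run_pipeline(dataset)
        durations.append(time.perf_counter() - start)
    return {"median_runtime_s": statistics.median(durations)}
```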

Validation and fairness controls

We maintained validation through controlled variables and consistent conditions across all frameworks:

  • Identical ground truth datasets
  • Uniform KPI definitions
  • Standardized measurement instruments

Error handling evaluation

To evaluate native recovery mechanisms, we removed the shared preprocessing infrastructure, allowing each framework to handle data inconsistencies independently.

The test dataset included strategically embedded errors such as:

  • Null Final_Price values
  • Empty Product_ID fields
  • Missing User_ID entries across multiple records

Evaluation metrics

Decision accuracy measures how reliably a framework resolves critical data issues and is computed as:

Decision accuracy (%) = (correct decisions / total decisions evaluated) × 100

Accuracy was determined by comparing each framework’s decisions against predefined business logic criteria.

Each decision was evaluated in a binary manner (correct / incorrect) based on:

  • Tool failure recovery: whether failed operations were successfully resolved using alternative reasoning
  • Null handling: whether invalid records were correctly skipped
  • Empty field defaults: whether missing values were properly replaced (e.g., “UNKNOWN”)

Decision efficiency evaluates how effectively a framework addresses critical data issues and is computed as:

Decision efficiency (%) = (critical decision points / total decisions made) × 100

Critical points were defined as the minimum required decision steps (e.g., null handling, empty field defaults, field mapping). A score of 100% indicates one decision per critical point, while additional decisions signal inefficiency or over-processing.

Tool performance was measured using the primary success rate, representing the proportion of direct tool calls completed successfully:

Primary success rate (%) = (successful direct tool calls / total tool calls) × 100

Recovery capability measures a framework’s ability to successfully recover from failed tool calls and is computed as:

Recovery rate (%) = (successfully recovered failures / failed tool calls) × 100
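
All four metrics reduce to simple ratios; the sketch below uses illustrative counts, not numbers from the benchmark:

```python
def decision_accuracy(correct, total):
    return 100 * correct / total

def decision_efficiency(critical_points, total_decisions):
    # One decision per critical point scores 100%; extra decisions lower it.
    return 100 * critical_points / total_decisions

def primary_success_rate(successful_calls, total_calls):
    return 100 * successful_calls / total_calls

def recovery_rate(recovered, failed_calls):
    return 100 * recovered / failed_calls if failed_calls else 100.0

print(round(decision_efficiency(critical_points=5, total_decisions=12), 1))  # 41.7
```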

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Nazlı Şipi
AI Researcher
Nazlı is a data analyst at AIMultiple. She has prior experience in data analysis across various industries, where she worked on transforming complex datasets into actionable insights.
