
Benchmarking Agentic AI Frameworks in Analytics Workflows

Cem Dilmegani
updated on Oct 16, 2025

Frameworks for building agentic workflows differ substantially in how they handle decisions and errors, yet their performance on imperfect real-world data remains largely untested.

To evaluate their performance on real-world analytical workflows, we spent 3 days benchmarking LangGraph, LangChain, CrewAI, and OpenAI Swarm using a 100-record e-commerce dataset with controlled data inconsistencies such as missing IDs, null values and inconsistent date formats.

Each framework was assessed for decision accuracy and efficiency, tool integration performance, and execution performance (time and token usage).

Decision accuracy and efficiency

  • Decision accuracy measures how effectively each framework resolved data-related issues, including null values, default assignments, field mappings, and failure recovery.
  • Decision efficiency represents the proportion of critical issues resolved relative to total decisions. A score of 100% denotes optimal one-step resolution, while lower values indicate additional retries or redundant decision cycles that increase computational overhead. You can see the benchmark methodology here.

Swarm

High Efficiency, High Accuracy (60%, 90%)

Swarm achieved high accuracy while maintaining efficient execution across analytical workflows.

Performance metrics showed consistently low decision counts and minimal retries. This outcome reflects Swarm’s modular, task-specific architecture, in which individual agents manage defined analytical functions such as KPI analysis or competitor research.

Swarm therefore combines strong coordination with efficient task distribution, making it a good fit for multi-agent analytical environments requiring both speed and precision.
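
For illustration, a minimal sketch of the task-specific agent handoff pattern Swarm supports is shown below. The agent names, instructions, and the kpi_analysis tool are hypothetical stand-ins, not the agents used in the benchmark:

```python
from swarm import Swarm, Agent

def kpi_analysis(question: str) -> str:
    """Hypothetical tool standing in for the benchmark's KPI analysis step."""
    return "Total revenue and average discount computed over the cleaned records."

kpi_agent = Agent(
    name="KPI Agent",
    instructions="Compute the requested KPIs and report the results.",
    functions=[kpi_analysis],
)

def transfer_to_kpi_agent():
    """Returning an Agent from a tool hands the conversation off to it."""
    return kpi_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route each analytical task to the matching specialist agent.",
    functions=[transfer_to_kpi_agent],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "Compute the KPIs for the cleaned dataset."}],
)
print(response.messages[-1]["content"])
```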

LangGraph

High Efficiency, High Accuracy (60%, 100%)

LangGraph achieved both high accuracy and efficient execution, completing analytical workflows with fewer decision events.

Metrics from repeated test runs showed consistently direct execution paths and minimal retries. This pattern reflects LangGraph’s graph-based architecture, which predefines execution dependencies and reduces redundant operations.

LangGraph thus delivers precise, consistent, and efficient performance, making it a good fit for structured analytical workflows.
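
The graph-based pattern described above can be sketched as a minimal two-node pipeline. This is illustrative only; the node functions, state fields, and integrity rule are assumptions, not the benchmark's actual graph:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class PipelineState(TypedDict):
    records: list
    kpis: dict

def clean_data(state: PipelineState) -> dict:
    # Drop records that fail a basic integrity check (illustrative rule).
    valid = [r for r in state["records"] if r.get("Final_Price") is not None]
    return {"records": valid}

def compute_kpis(state: PipelineState) -> dict:
    total = sum(r["Final_Price"] for r in state["records"])
    return {"kpis": {"total_revenue": total}}

graph = StateGraph(PipelineState)
graph.add_node("clean", clean_data)
graph.add_node("kpis", compute_kpis)
graph.set_entry_point("clean")   # execution order is fixed before the run
graph.add_edge("clean", "kpis")
graph.add_edge("kpis", END)

app = graph.compile()
result = app.invoke({"records": [{"Final_Price": 900.0}, {"Final_Price": None}], "kpis": {}})
print(result["kpis"])  # {'total_revenue': 900.0}
```

Because the edges are declared before execution, the path through the workflow is fixed, which is consistent with the low decision counts and minimal retries reported above.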

CrewAI

Low Efficiency, High Accuracy (21%, 87%)

CrewAI achieved high accuracy but required a substantially greater number of decisions to complete each workflow.

Data recorded by the DecisionTracker and AccuracyLatencyTracker showed multiple additional decision events occurring after tool failures. This pattern indicates strong fault tolerance that ensured reliable final outputs but increased computational overhead and runtime.

CrewAI therefore prioritizes result completeness and reliability over execution efficiency.
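
For comparison, CrewAI's role-based delegation looks roughly like the sketch below; the role, goal, and task descriptions are illustrative, not the benchmark's prompts:

```python
from crewai import Agent, Task, Crew

analyst = Agent(
    role="E-commerce Data Analyst",
    goal="Clean the transaction records and compute the requested KPIs",
    backstory="Specializes in messy transactional data and recovering from tool failures.",
)

kpi_task = Task(
    description="Clean the dataset, then compute total revenue and average discount.",
    expected_output="A short KPI summary with the computed figures.",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[kpi_task])
print(crew.kickoff())
```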

LangChain

Medium Efficiency, Low Accuracy (42%, 78%)

LangChain demonstrated moderate efficiency but lower accuracy compared to other frameworks.

Recorded metrics showed repeated decision iterations following tool failures, as the framework retried identical operations instead of adapting to alternative strategies. This sequential execution pattern limited recovery effectiveness and resulted in partial task completion.

LangChain therefore offers reasonable throughput but weak fault tolerance, making it better suited to simpler, low-risk analytical workflows.

Tool integration performance

Swarm

(100% tool coordination success rate)

Swarm maintained a 100 % tool success rate through its specialized agent architecture. Distinct agents managed analytical tasks such as KPI analysis, competitor comparison, and currency conversion, enabling seamless task handoffs and efficient tool utilization.

LangGraph

(100% tool coordination success rate)

LangGraph achieved a 100 % tool execution success rate. Its graph-based orchestration effectively mapped tool dependencies and execution order, preventing redundant or conflicting calls. The framework demonstrated high reliability and consistent coordination across all modules.

CrewAI

(37% tool coordination success rate)

CrewAI showed a low rate of successful tool executions, particularly in KPI and validation modules. Despite this, all tasks were completed through additional reasoning and recovery cycles, indicating strong fault tolerance with higher computational overhead.

LangChain

(51% tool coordination success rate)

LangChain achieved moderate tool execution success but lacked adaptive recovery. When tool calls failed, it repeated the same operation sequence, resulting in redundant processing and incomplete outputs.

Execution time and completion tokens


Swarm

Fastest and most efficient

Swarm completed all workflows in approximately 20 seconds using about 1K tokens, the lowest among all frameworks. Its consistent completion times and minimal token consumption indicate stable and efficient execution across runs.

LangGraph

Balanced performance

LangGraph completed workflows with moderate execution times and token usage relative to the other frameworks. Its direct execution paths and minimal retries kept completion times and token consumption consistent across runs.

CrewAI

Resource-intensive but reliable

CrewAI required about 32 seconds and 4.5K tokens per run, the highest token consumption in the benchmark. Extended reasoning and validation cycles resulted in longer runtimes but consistent task completion, indicating high reliability with increased cost.

LangChain

Slowest and least efficient

LangChain completed runs in approximately 48 seconds, consuming around 2.1K tokens. Repeated retries after failed tool executions contributed to longer runtimes and inefficient resource utilization.

Error handling approaches

To assess native error management, each framework was evaluated using its own data processing logic instead of a shared preprocessing pipeline. This comparison highlighted key differences between frameworks prioritizing data integrity and those emphasizing processing completeness.

LangGraph and Swarm prioritized accuracy and data integrity through validation and exclusion, while CrewAI and LangChain favored completeness, either by retaining incomplete data or imputing missing values, leading to greater variability in analytical precision.

Here is a detailed breakdown:

Swarm

Swarm applied precise skip logic, excluding invalid or incomplete records while maintaining overall workflow continuity. After resolving minor API compatibility issues, the framework consistently processed verified records without affecting execution flow.

LangGraph

LangGraph enforced strict data validation, omitting entries with null or incomplete values. This conservative approach ensured analytical accuracy by processing only records that passed integrity checks, maintaining consistent results across test runs.

CrewAI

CrewAI operated under a “zero data loss” principle, retaining all records, including those with missing or invalid fields. While this approach preserved dataset completeness, it reduced calculation accuracy due to the inclusion of unverified data points.

LangChain

LangChain used data imputation techniques to infer missing values from existing fields. For example, when Final_Price was null, it computed replacements from Price and Discount fields. Although adaptive, this introduced deviations from expected outcomes, impacting result precision.
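
The Final_Price imputation can be expressed in a few lines. The sketch below is illustrative; the exact field names and rounding are assumptions based on the dataset description, not LangChain's internal logic:

```python
def impute_final_price(record: dict) -> dict:
    """If Final_Price is missing, derive it from Price and Discount."""
    if record.get("Final_Price") in (None, ""):
        price = float(record["Price"])
        discount = float(record["Discount"])
        record["Final_Price"] = round(price * (1 - discount / 100), 2)
    return record

print(impute_final_price({"Price": 1000.0, "Discount": 10, "Final_Price": None}))
# {'Price': 1000.0, 'Discount': 10, 'Final_Price': 900.0}
```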

When to use each framework?

  • CrewAI: When unexpected problems are likely and autonomous problem-solving is required.
  • LangGraph: For balanced reasoning and structure. Best for general-purpose use cases.
  • Swarm: In production environments where speed and reliability are critical. Fastest and most consistent.
  • LangChain: When detailed traceability and transparency are needed. Logs every step but slower than alternatives.

Developer experience

Framework–LLM integration performance: Different frameworks demonstrate varying levels of compatibility and performance with specific LLM providers. For instance, LangChain exhibits superior integration and accuracy when paired with OpenAI’s ChatGPT models, delivering more precise results through optimized prompt handling.

Architecture-driven behavior consistency: Although frameworks can utilize different LLMs with varying efficiency, their core behavioral characteristics remained largely consistent across models. The characteristic behaviors we observed – such as decision-making patterns, recovery handling and alternative reasoning capabilities – are primarily dependent on their underlying architectural design rather than the specific LLM employed.

This suggests that framework-LLM combinations can impact performance metrics, but the core behavioral patterns like CrewAI’s “whatever it takes” approach or Swarm’s specialized agent coordination remain consistent regardless of the language model used.

Integration challenges: We encountered notable integration challenges when attempting to connect CrewAI with Anthropic’s Claude models. Despite multiple configuration attempts, persistent environment setup errors prevented successful deployment.

Our research indicates this is not an isolated issue – numerous developers in the community have reported similar integration difficulties between CrewAI and Anthropic services, suggesting potential architectural incompatibilities or API handling limitations.

Recommendations for framework–LLM pairing: Based on these findings, we recommend evaluating different framework-LLM combinations when selecting frameworks for your specific use case.

Experimental setup

Objective

We aimed to objectively compare four AI agent frameworks (LangGraph, LangChain, CrewAI, and OpenAI Swarm) using identical datasets, tools, and measurement systems.

The evaluation focused on decision-making accuracy, resource efficiency, and tool integration capabilities under controlled error conditions representative of real-world data challenges.

Dataset description

A 100-record e-commerce dataset was used to test analytical decision-making capacity. Each record represented a single purchase transaction and included the following fields:

  • User_ID: Customer identifier
  • Product_ID: Product identifier
  • Category: Product category
  • Price (Rs.): Original price
  • Discount (%): Discount applied
  • Final_Price (Rs.): Price after discount
  • Payment_Method: Method of payment
  • Purchase_Date: Transaction date

Data perturbations

To replicate real-world imperfections, the dataset contained controlled inconsistencies designed to test recovery and reasoning capabilities; an illustrative perturbed record is sketched after the list:

  • Null values in key fields (e.g., Final_Price)
  • Empty fields: "Product_ID": "", "User_ID": "", "Category": ""
  • Mixed field names: cost → Price, revenue → Discount
  • Inconsistent date formats: "07/01/2024" vs "dd-mm-yyyy"
  • Zero or negative numeric values
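
A record combining several of these perturbations might look like the following (fabricated values, not a row from the benchmark dataset):

```python
perturbed_record = {
    "User_ID": "",                  # empty identifier
    "Product_ID": "",               # empty identifier
    "Category": "Electronics",
    "Price": 1499.0,
    "Discount": 20,
    "Final_Price": None,            # null value the framework must resolve
    "Payment_Method": "Credit Card",
    "Purchase_Date": "07/01/2024",  # mixed with dd-mm-yyyy dates elsewhere
}
```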

Task definitions

Each framework was assigned five identical analytical tasks designed to test multi-step reasoning and data handling:

  1. Data processing: Perform enhanced cleaning and transformation with framework-specific execution logic.
  2. KPI calculation: Compute identical performance indicators using the enhanced_kpi_calculator tool.
  3. Competitor analysis: Conduct analysis on the top three products using the CompetitorAPI.
  4. Currency conversion: Convert total revenue to USD via the CurrencyAPI.
  5. Error handling: Apply native error management and recovery strategies for data inconsistencies.

Execution consistency

All frameworks operated under identical testing conditions; a sketch of the mock API stubs follows the list:

  • Same JSON dataset, ground truth KPIs, mock APIs, and timing delays (CompetitorAPI = 0.05 s, CurrencyAPI = 0.1 s).
  • Tracking systems (decision_tracker, perf_tracker) were reset before each test.
  • Each framework executed pipelines 10 times, and median values were recorded for all metrics.
  • The same tool functions were used across frameworks, with naming conventions adapted (e.g., _swarm_tool, _crewAI_tool).
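
The mock APIs can be approximated as simple stubs with fixed delays. The return values and exchange rate below are placeholders; only the 0.05 s and 0.1 s delays come from the setup above:

```python
import time

def mock_competitor_api(product_id: str) -> dict:
    """Stand-in for the CompetitorAPI with its fixed 0.05 s delay."""
    time.sleep(0.05)
    return {"product_id": product_id, "competitor_price": 1299.0}  # placeholder response

def mock_currency_api(amount: float, target: str = "USD") -> float:
    """Stand-in for the CurrencyAPI with its fixed 0.1 s delay."""
    time.sleep(0.1)
    return round(amount * 0.012, 2)  # illustrative Rs.-to-USD rate
```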

Measurement infrastructure

A standardized benchmarking infrastructure was implemented across all frameworks; a minimal sketch of the tracker interfaces follows the list:

  • AccuracyLatencyTracker: Measured timing and latency (start_timer, end_timer).
  • DecisionTracker: Logged and categorized each decision event.
  • EnhancedAnalyticsDataProcessor: Provided consistent data cleaning logic.
  • Mock APIs: Simulated third-party tool behavior with fixed delays and controlled responses.
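
Only the class names and the start_timer/end_timer interface are taken from the article; the rest of this sketch is an assumed implementation:

```python
import time

class AccuracyLatencyTracker:
    """Records how long each pipeline stage takes."""
    def __init__(self):
        self.timings = {}
        self._starts = {}

    def start_timer(self, label: str) -> None:
        self._starts[label] = time.perf_counter()

    def end_timer(self, label: str) -> None:
        self.timings[label] = time.perf_counter() - self._starts.pop(label)

class DecisionTracker:
    """Logs each decision event with a category and a correctness flag."""
    def __init__(self):
        self.decisions = []

    def log_decision(self, category: str, correct: bool) -> None:
        self.decisions.append({"category": category, "correct": correct})
```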

Methodology

Standardized workflow

All frameworks followed a unified experimental workflow:
tracker reset → dataset load → enhanced data processing → framework-specific execution → result extraction → metrics calculation.

This ensured that each run followed identical procedural steps under controlled conditions.
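
Put together, the benchmark loop for one framework can be sketched as follows; run_pipeline stands in for each framework's execution entry point, and the 10-run median matches the procedure described above:

```python
import statistics
import time

def benchmark(run_pipeline, dataset, runs=10):
    """Repeat the standardized workflow and report the median runtime."""
    durations = []
    for _ in range(runs):
        # tracker reset -> dataset load -> processing -> execution -> extraction
        start = time.perf_counter()
        run_pipeline(dataset)
        durations.append(time.perf_counter() - start)
    return {"median_runtime_s": statistics.median(durations)}
```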

Validation and fairness controls

We maintained validation through controlled variables and consistent conditions across all frameworks:

  • Identical ground truth datasets
  • Uniform KPI definitions
  • Standardized measurement instruments

Error handling evaluation

To evaluate native recovery mechanisms, we removed the shared preprocessing infrastructure, allowing each framework to handle data inconsistencies independently.

The test dataset included strategically embedded errors such as:

  • Null Final_Price values
  • Empty Product_ID fields
  • Missing User_ID entries across multiple records

Evaluation metrics

Decision accuracy measures how reliably a framework resolves critical data issues and is computed as:

Decision accuracy (%) = (correct decisions / total decisions evaluated) × 100

Accuracy was determined by comparing each framework’s decisions against predefined business logic criteria.

Each decision was evaluated in a binary manner (correct / incorrect) based on:

  • Tool failure recovery: whether failed operations were successfully resolved using alternative reasoning
  • Null handling: whether invalid records were correctly skipped
  • Empty field defaults: whether missing values were properly replaced (e.g., “UNKNOWN”)

Decision efficiency evaluates how effectively a framework addresses critical data issues and is computed as:

Decision efficiency (%) = (critical decision points / total decisions made) × 100

Critical points were defined as the minimum required decision steps (e.g., null handling, empty field defaults, field mapping). A score of 100% indicates one decision per critical point, while additional decisions signal inefficiency or over-processing.

Tool performance was measured using the primary success rate, representing the proportion of direct tool calls completed successfully:

Primary success rate (%) = (successful direct tool calls / total tool calls) × 100

Recovery capability measures a framework’s ability to successfully recover from failed tool calls and is computed as:

Recovery rate (%) = (successfully recovered failures / failed tool calls) × 100
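
All four metrics reduce to simple ratios; the sketch below uses illustrative counts, not numbers from the benchmark:

```python
def decision_accuracy(correct, total):
    return 100 * correct / total

def decision_efficiency(critical_points, total_decisions):
    # One decision per critical point scores 100%; extra decisions lower it.
    return 100 * critical_points / total_decisions

def primary_success_rate(successful_calls, total_calls):
    return 100 * successful_calls / total_calls

def recovery_rate(recovered, failed_calls):
    return 100 * recovered / failed_calls if failed_calls else 100.0

print(round(decision_efficiency(critical_points=5, total_decisions=12), 1))  # 41.7
```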

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Nazlı Şipi
AI Researcher
Nazlı is a data analyst at AIMultiple. She has prior experience in data analysis across various industries, where she worked on transforming complex datasets into actionable insights.
