We evaluated the practical performance of popular open-source AI agent frameworks on four data analysis tasks (e.g., clustering). Each task was executed 100 times per framework to measure consistency, performance, and usability under realistic workloads.
We also examined each framework's agent and function definitions, memory management, and human-in-the-loop features.
Agentic frameworks benchmark
Overview
We benchmarked LangGraph, CrewAI, OpenAI Swarm, and LangChain, comparing the latency and completion token usage of each framework across different data analysis tasks. See the benchmark methodology section below for details.
Overall ranking (based on latency and token efficiency)
- LangGraph: Achieved the lowest latency and token usage across all benchmarks, demonstrating the most efficient execution pattern.
- OpenAI Swarm: Delivered near-LangGraph efficiency, with slightly higher latency in certain tasks and consistently low token consumption.
- CrewAI: Delivered balanced results with moderate latency and token usage.
- LangChain: Incurred the highest latency and token consumption, performing least efficiently among the evaluated frameworks.
Disclaimer: Framework performance varies based on architecture, use case, and deployment environment. Results may differ depending on implementation details and developer design choices.
Potential reasons for performance differences
Architectural, tooling & LLM-involvement factors
Frameworks that limit LLM involvement and rely on predefined or direct execution flows, such as LangGraph and OpenAI Swarm, tend to operate more efficiently than those that depend on frequent, dynamic LLM reasoning, like LangChain:
Understanding the architectural DNA:
- Graph flow: “Following a ready recipe”: Each step is predefined within a deterministic graph. The LLM is used only for decision points.
- Functional: “Direct tool usage”: Tasks run through direct Python calls without extra interpretation.
- Team approach: “Team leader coordinating”: Agents have defined roles and dedicated tools. The LLM functions as a coordinator that manages task delegation/communication.
- Chain logic: “Using a translator constantly”: Every step is interpreted by the LLM, allowing for adaptive and context-aware behavior.
LangGraph
LangGraph organizes tasks into a directed acyclic graph (DAG), which is a structured sequence of steps where each task connects to the next without looping back.
Explicit multi-agent coordination: You can model multiple agents as individual nodes or groups, each with its own logic, memory, and role in the system. The LLM (large language model) is used only when the system needs to make a decision, such as choosing between different possible next steps.
This design reduces unnecessary LLM use, improves execution efficiency, and makes the workflow easier to debug.
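To make the DAG pattern concrete, here is a minimal sketch of a LangGraph workflow. The node logic and state fields are hypothetical placeholders; in a real pipeline the routing function could consult an LLM at the decision point.

```python
# Minimal LangGraph DAG sketch (illustrative; node logic is hypothetical).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    data: str
    needs_cleaning: bool

def load(state: State) -> State:
    # Deterministic step: no LLM call needed.
    return {"data": "raw rows", "needs_cleaning": True}

def clean(state: State) -> State:
    return {"data": state["data"] + " (cleaned)", "needs_cleaning": False}

def analyze(state: State) -> State:
    return {"data": state["data"] + " -> stats"}

def route(state: State) -> str:
    # An LLM could be consulted here; this sketch uses a plain flag.
    return "clean" if state["needs_cleaning"] else "analyze"

builder = StateGraph(State)
builder.add_node("load", load)
builder.add_node("clean", clean)
builder.add_node("analyze", analyze)
builder.set_entry_point("load")
builder.add_conditional_edges("load", route)  # the only decision point
builder.add_edge("clean", "analyze")
builder.add_edge("analyze", END)

graph = builder.compile()
print(graph.invoke({"data": "", "needs_cleaning": True}))
```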
AutoGen
Free-form agent collaboration: AutoGen allows multiple agents to communicate by passing messages in a loop. Each agent can respond, reflect, or call tools based on its internal logic.
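A minimal sketch of this message-passing loop, assuming the autogen package; the model name and API key are placeholders:

```python
# AutoGen message-passing sketch (model/config values are placeholders).
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]}

assistant = AssistantAgent("analyst", llm_config=llm_config)

# The proxy relays messages back to the assistant; no human input here.
user_proxy = UserProxyAgent(
    "driver",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Agents exchange messages in a loop until a termination condition is met.
user_proxy.initiate_chat(assistant, message="Summarize churn drivers in the Telco dataset.")
```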
CrewAI
CrewAI uses a multi-agent, role-based architecture where agents work together under a central structure called a Crew. This setup manages task delegation, communication between agents, and state tracking (keeping shared information consistent). However, CrewAI’s multi-agent orchestration is limited:
- There’s no built-in execution graph or flow control. Agents self-organize based on responses.
- Multi-agent flows are linear or loop-based, not hierarchical or DAG-based.
Each agent is directly connected to its own tools, allowing smooth data flow and less communication overhead. This structure makes CrewAI efficient in coordinating tasks, leading to lower latency and moderate token usage.
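A minimal sketch of the role-based setup, mirroring the two roles used in the benchmark; the goals, backstories, and task text are illustrative:

```python
# Role-based CrewAI sketch (roles mirror the benchmark; text is illustrative).
from crewai import Agent, Task, Crew

data_scientist = Agent(
    role="Data Scientist",
    goal="Acquire and preprocess the dataset",
    backstory="Handles data loading and cleaning.",
)

ml_engineer = Agent(
    role="Machine Learning Engineer",
    goal="Train and evaluate models",
    backstory="Owns model training and metrics.",
)

prep = Task(
    description="Download and clean the Telco Churn dataset.",
    expected_output="A cleaned dataframe summary.",
    agent=data_scientist,
)
train = Task(
    description="Train a clustering model and report metrics.",
    expected_output="Cluster quality metrics.",
    agent=ml_engineer,
)

# The Crew handles delegation, communication, and shared state.
crew = Crew(agents=[data_scientist, ml_engineer], tasks=[prep, train])
result = crew.kickoff()
```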
OpenAI Swarm
OpenAI Swarm runs lightweight, specialized agents, each connected to its own set of Python-based tools. Tools are called as regular Python functions, and the LLM is used only when needed for coordination or decision-making.
Although OpenAI describes Swarm as a multi-agent framework, it currently operates via a single-agent control loop, with:
- Natural language routines in the system prompt
- Tool usage via docstring parsing
- An agent iteratively planning and executing tasks
Thus, it has no agent-to-agent communication (single-agent execution). Unlike frameworks like AutoGen (which supports message passing between agents) or CrewAI (which uses role-based team setups), Swarm has no built-in mechanism for agents to interact or collaborate directly.
This keeps token usage low and enables fast execution.
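A minimal Swarm sketch showing tools as plain Python functions, where the docstring is what the LLM reads to decide when to call the tool; the tool body is a hypothetical placeholder:

```python
# OpenAI Swarm sketch: tools are plain Python functions; the docstring
# tells the LLM what each one does. (Tool body is hypothetical.)
from swarm import Swarm, Agent

def run_clustering(k: int) -> str:
    """Cluster the loaded dataset into k groups and return a summary."""
    return f"Clustered data into {k} groups."  # placeholder logic

agent = Agent(
    name="Analyst",
    instructions="You are a data analyst. Use tools to answer.",
    functions=[run_clustering],
)

client = Swarm()
response = client.run(
    agent=agent,
    messages=[{"role": "user", "content": "Cluster into 3 groups."}],
)
print(response.messages[-1]["content"])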
LangChain
Single-agent document processing: LangChain handles the user-to-answer pipeline through one coordinating agent that manages the RAG workflow. Multi-agent capabilities were added later and are not part of its original structure.
Unlike AutoGen’s message-passing system or CrewAI’s role-based teams, LangChain’s base architecture routes everything through a central orchestrator rather than enabling direct agent collaboration.
Tool selection relies on the LLM’s natural language reasoning instead of direct function calls. At each step, the LLM reads the task description, decides which tool to use, and interprets the output.
This repeated reasoning increases both token usage and execution time.
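As a rough illustration of this pattern, here is a sketch using LangChain's older-style agent API (initialize_agent), where the LLM reads each tool's description and reasons about which to call at every step; the tool body and model name are placeholders:

```python
# Classic LangChain ReAct-style agent sketch: the LLM interprets tool
# descriptions and plans each step in natural language.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_openai import ChatOpenAI

def describe_data(query: str) -> str:
    """Return descriptive statistics for the dataset (placeholder)."""
    return "mean tenure: 32 months"

tools = [Tool(
    name="describe_data",
    func=describe_data,
    description="Compute descriptive statistics for the dataset.",
)]

llm = ChatOpenAI(model="gpt-4o-mini")
# Each reasoning step costs an extra LLM call, driving up tokens and latency.
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("What is the average tenure?")
```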
Agent and function definition
LangGraph
LangGraph’s graph-based approach represents each agent as a node that maintains its own state. These nodes are connected through a directed graph, enabling conditional logic, multi-team coordination, and hierarchical control. This lets you build and visualize multi-agent graphs with supervisor nodes for scalable orchestration.
LangGraph uses annotated, structured functions that attach tools to agents. You can build out nodes, connect them to various supervisors, and visualize how different teams interact. Think of it like giving each team member a detailed job description. This makes it easier to build and test agents that work together.
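A small sketch of an annotated tool attached to an agent, assuming the langchain_core @tool decorator and LangGraph's prebuilt ReAct helper; the tool body and model name are placeholders:

```python
# Annotated tool attached to a LangGraph agent node (sketch).
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def train_model(algorithm: str) -> str:
    """Train the named algorithm on the shared dataset and return metrics."""
    return f"{algorithm}: accuracy=0.81"  # placeholder result

# The type hints and docstring act as the tool's "job description".
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), [train_model])
```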
AutoGen
AutoGen defines agents as adaptive units capable of flexible routing and asynchronous communication. Agents interact with each other (and optionally with humans) by exchanging messages, allowing for collaborative problem-solving. Like LangGraph, AutoGen uses annotated, structured functions.
CrewAI
CrewAI takes a role-based design approach. Each agent is assigned a role (e.g., Researcher, Developer) and a set of skills, functions, or tools it can access. Functions are defined through structured annotations.
OpenAI Swarm
OpenAI Swarm uses a routine-based model where agents are defined through prompts and function docstrings. It doesn’t have formal orchestration or state models, relying instead on manually structured workflows. Function behavior is inferred by the LLM from docstrings (Swarm identifies what a function does by reading its description), making this setup flexible but less precise.
LangChain
LangChain uses a chain-based architecture where a single orchestrator agent manages calls to language models and various tools. It defines functions through explicit interfaces like toolkits and prompt templates.
While primarily focused on centralized workflows, LangChain supports extensions for multi-agent setups but lacks built-in agent-to-agent communication.
Memory
Memory capabilities:
- Stateful: Whether the framework supports persistent memory across executions.
- Contextual: Whether it supports short-term memory via message history or context passing.
Memory is a key part of building agentic systems that remember context and adapt over time:
- Short-term memory: Keeps track of recent interactions, enabling agents to handle multi-turn conversations or step-by-step workflows.
- Long-term memory: Stores persistent information across sessions, such as user preferences or task history.
- Entity memory: Tracks and updates knowledge about specific objects, people, or concepts mentioned during interactions (e.g., remembering a company name or project ID mentioned earlier).
LangGraph
LangGraph uses two types of memory: in-thread memory, which stores information during a single task or conversation, and cross-thread memory, which saves data across sessions. You can use MemorySaver to save the flow of a task and link it to a specific thread_id. For long-term storage, LangGraph supports tools like InMemoryStore or other databases. This provides flexible control over how memory is scoped and retained across executions.
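A minimal sketch of thread-scoped memory, assuming a StateGraph builder like the one sketched earlier; the thread id is arbitrary:

```python
# Scoping LangGraph memory to a thread (sketch; `builder` as defined earlier).
from langgraph.checkpoint.memory import MemorySaver

graph = builder.compile(checkpointer=MemorySaver())

# State is checkpointed per thread_id, so the same id resumes the same run.
config = {"configurable": {"thread_id": "analysis-1"}}
graph.invoke({"data": "", "needs_cleaning": True}, config=config)
```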
AutoGen
AutoGen uses a contextual memory model. Each agent maintains short-term context through a context_variables object, which stores interaction history. It doesn’t have built-in persistent memory.
CrewAI
CrewAI provides layered memory out of the box. It stores short-term memory in a ChromaDB vector store, recent task results in SQLite, and long-term memory in a separate SQLite table (based on task descriptions). Additionally, it supports entity memory using vector embeddings. This memory setup is automatically configured when memory=True is enabled.
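In practice this is a single flag on the Crew; a sketch reusing the agents and tasks from the earlier CrewAI example:

```python
from crewai import Crew

# memory=True turns on the short-term, long-term, and entity stores.
crew = Crew(
    agents=[data_scientist, ml_engineer],  # as defined in the earlier sketch
    tasks=[prep, train],
    memory=True,
)
```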
OpenAI Swarm
Swarm is stateless and does not manage memory natively. You can pass short-term memory through context_variables manually, and optionally integrate external tools or third-party memory layers (e.g., mem0) to store longer-term context.
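A sketch of passing context manually, reusing the client and agent from the earlier Swarm example; the keys are illustrative:

```python
# Manual short-term context in Swarm (sketch; keys are illustrative).
response = client.run(
    agent=agent,
    messages=[{"role": "user", "content": "Continue the analysis."}],
    context_variables={"dataset": "telco_churn", "last_step": "clustering"},
)
```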
LangChain
LangChain supports both short-term and long-term memory through flexible components. Short-term memory is typically managed via in-memory buffers that track conversation history within a session. For long-term memory, LangChain integrates with external vector stores or databases to persist embeddings and retrieval data.
You can customize memory scopes and strategies using built-in memory classes, enabling efficient management of contextual and entity-specific memory across interactions.
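A short-term memory sketch using LangChain's classic memory classes, reusing the tools and llm from the earlier LangChain example:

```python
# Conversation-scoped memory in LangChain (sketch; classic API).
from langchain.agents import AgentType, initialize_agent
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
agent = initialize_agent(
    tools, llm,  # as defined in the earlier LangChain sketch
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,  # message history is replayed into each prompt
)
```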
Human-in-the-loop
LangGraph
LangGraph supports custom breakpoints (interrupt_before) to pause the graph and wait for user input mid-execution.
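A sketch of this pattern, assuming the builder and node names from the earlier LangGraph example:

```python
# Pausing a LangGraph run before a node for human review (sketch).
from langgraph.checkpoint.memory import MemorySaver

graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["analyze"])

config = {"configurable": {"thread_id": "review-1"}}
graph.invoke({"data": "", "needs_cleaning": True}, config=config)  # halts before "analyze"

# After a human approves, resume from the saved checkpoint:
graph.invoke(None, config=config)
```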
AutoGen
AutoGen natively supports human agents via UserProxyAgent, allowing humans to review, approve, or modify steps during agent collaboration.
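A sketch reusing the assistant from the earlier AutoGen example; setting human_input_mode to "ALWAYS" prompts a person before each reply:

```python
from autogen import UserProxyAgent

# "ALWAYS" asks the human for input at every turn.
human = UserProxyAgent("reviewer", human_input_mode="ALWAYS", code_execution_config=False)
human.initiate_chat(assistant, message="Propose a model; wait for my approval.")
```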
CrewAI
CrewAI enables feedback after each task by setting human_input=True; the agent pauses to collect natural language input from the user.
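A sketch, reusing the ml_engineer agent from the earlier CrewAI example:

```python
from crewai import Task

train = Task(
    description="Train a clustering model and report metrics.",
    expected_output="Cluster quality metrics.",
    agent=ml_engineer,  # as defined in the earlier sketch
    human_input=True,   # pause for natural-language feedback before finalizing
)
```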
OpenAI Swarm
OpenAI Swarm offers no built-in human-in-the-loop (HITL) support.
LangChain
LangChain allows inserting custom breakpoints within chains or agents to pause execution and request human input. This supports review, feedback, or manual intervention at defined points in the workflow.
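One simple way to implement such a checkpoint, sketched here as a hypothetical "ask_human" tool the agent can call when it needs approval:

```python
# A human-checkpoint tool for a LangChain agent (sketch; name is illustrative).
from langchain.agents import Tool

ask_human = Tool(
    name="ask_human",
    func=lambda question: input(f"{question}\n> "),  # blocks for operator input
    description="Ask the human operator for approval or clarification.",
)
```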
Which agent framework to select?
- LangGraph: Complex agent workflows requiring fine-grained orchestration
- AutoGen: Research and prototyping where agent behavior needs flexibility and refinement
- CrewAI: Production-grade agent systems with structured roles and task delegation
- OpenAI Swarm: Lightweight experiments and open-ended task execution in LLM-driven pipelines
- LangChain: General-purpose LLM application development with modular components for chains, tools, memory, and retrieval-augmented generation (RAG)
Production readiness comparison
Of note, LangGraph's managed platform is proprietary, but the core LangGraph library for agent development is open source.
What do agentic frameworks actually do?
Agentic frameworks assist with prompt engineering and managing how data flows to and from LLMs. At a basic level, they help structure prompts so the LLM responds in a predictable format and route responses to the right tool, API, or document.
If building from scratch, you would manually define the prompt, extract the tool the LLM wants to use, and trigger the corresponding API call (see the sketch after this list). Frameworks streamline this by providing:
- Prompt orchestration: Building, managing, and routing complex prompts to LLMs
- Tool integration: Letting agents call external APIs, databases, code functions, etc.
- Memory: Maintaining state across turns or sessions (short- and long-term)
- RAG integration: Enabling knowledge retrieval from external sources
- Multi-agent coordination: Structuring how agents collaborate or delegate tasks
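To make the from-scratch baseline concrete, here is a sketch of a hand-rolled tool-call loop using the OpenAI Python SDK; the run_clustering tool and its schema are hypothetical:

```python
# Hand-rolled tool-call loop: what a framework automates for you (sketch).
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "run_clustering",
        "description": "Cluster the dataset into k groups.",
        "parameters": {
            "type": "object",
            "properties": {"k": {"type": "integer"}},
            "required": ["k"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Cluster the data into 3 groups."}],
    tools=tools,
)

# Manually extract the tool the LLM wants and trigger the matching call.
message = resp.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    if call.function.name == "run_clustering":
        args = json.loads(call.function.arguments)
        print(f"Would cluster with k={args['k']}")
```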
Benchmark methodology
Benchmark design
We designed fully equivalent pipelines to ensure a fair and direct comparison among the four frameworks: LangGraph, CrewAI, OpenAI Swarm, and LangChain.
Each system executed four fundamental data analysis tasks: Random Forest, Clustering, Descriptive Statistics, and Logistic Regression, using:
- The same dataset,
- The same tools,
- Identical task definitions, and
- Identical prompts and configuration parameters.
Data and workflow
All experiments used the Telco Churn dataset, downloaded through the DownloadDatasetTool from the same GitHub source across all frameworks.
To enable secure and consistent data sharing between agents and tasks, a thread-safe global state management system was implemented using Python’s threading.Lock mechanism.
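The benchmark does not publish this implementation, but a minimal sketch of such a thread-safe shared state store might look like this (class and key names are illustrative):

```python
# Thread-safe global state along the lines described (sketch).
import threading

class SharedState:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        # The lock ensures only one agent/thread mutates state at a time.
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

STATE = SharedState()
STATE.set("dataset", "telco_churn.csv")
```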
All experiments were conducted with:
- The same OpenAI API key,
- The same LLM model, and
- Identical configuration parameters, including timeout, maximum iterations, and related settings.
Tool alignment
Tooling was standardized to ensure a fair comparison. The DownloadDatasetTool, LoadDataTool, TrainModelTool, and EvaluateModelTool were made functionally identical across all frameworks.
Error handling, input/output formats, and shared state management were implemented consistently, and fail-safe mechanisms were added to allow smooth termination in case of errors.
Task definitions and execution flow
Task definitions and execution order were identical across all frameworks. Each system employed two agent roles:
- Data Scientist: handled data acquisition and preprocessing.
- Machine Learning Engineer: handled model training and evaluation.
Task descriptions, role definitions, and expected outputs were kept word-for-word identical to eliminate design-level variation.