We benchmarked five RAG frameworks (LangChain, LangGraph, LlamaIndex, Haystack, and DSPy) by building the same agentic RAG workflow with standardized components: identical models (GPT-4.1-mini), embeddings (BGE-small), retriever (Qdrant), and tools (Tavily web search). This isolates each framework’s true overhead and token efficiency.
RAG frameworks benchmark results
The benchmark consisted of 100 queries, with each framework running the full set 100 times to provide stable averages.
- Avg. Tokens: Total tokens consumed across all LLM calls (router, document grader, answer grader, and generator), including both prompts (with retrieved context) and completions. Lower = less API cost.
- Framework Overhead: Pure orchestration time (ms), the framework’s internal processing (routing logic, state management, etc.), excluding LLM API and tool calls. Lower = leaner framework.
All implementations achieved 100% accuracy on the test set and used the same models, temperatures, retrieval provider, web search tool, and a shared context token cap.
Key Findings
- We focus on controlling what’s controllable: Same model family and temperatures, node-level max_tokens, retriever (Qdrant + BGE-small, k=5, normalization on), web provider (Tavily-only), router policy (heuristic + model), calculator early-return, shared context token cap, identical grading rubric, unified instrumentation. This substantially reduces major confounders in our measurements.
- Framework overhead is measurable but small: We observed ~3–14 ms per query from orchestration logic. These differences are real, but not the main source of the >1 s latency gaps; most time is spent on I/O with external models/tools.
- Performance tracks tokens (under these constraints): DSPy shows the lowest framework overhead (~3.53 ms). Haystack (~5.9 ms) and LlamaIndex (~6 ms) follow, while LangChain (~10 ms) and LangGraph (~14 ms) are higher. Token usage is lowest for Haystack (~1.57k), then LlamaIndex (~1.60k); DSPy and LangGraph are ~2.03k, and LangChain ~2.40k.
- Routing/tool-path matters: Slight shifts in initial routing (retriever vs. web vs. calculator) and fallback behavior affect both tokens and time, even when prompts and budgets are aligned.
Why do differences persist? The “Framework DNA”
Despite standardization, small variances in token counts and latency remain. These are attributable to the inherent, low-level behaviors of each framework, their “DNA.”
- Prompt & message serialization: Each framework wraps the same logical content with slightly different formatting before sending it to the LLM, creating small but consistent token deltas.
- Context assembly: The precise ordering and inclusion of metadata within the concatenated context can differ slightly by framework, affecting the final token count.
- Routing tie-breaks: In borderline cases, subtle differences in how a framework parses the router’s JSON output can lead to a different initial tool choice.
In this setup, the token footprint appears to be the primary driver, more than framework execution time.
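For intuition on the serialization point above, the snippet below encodes two logically identical prompt framings with tiktoken’s cl100k_base encoding (the same tokenizer used for the benchmark’s token metrics). The example strings are hypothetical, not taken from any framework.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Two hypothetical serializations of the same logical router instruction.
variant_a = "System: You are a query router.\nUser: Who won the 2023 Monaco Grand Prix?"
variant_b = "[SYSTEM] You are a query router. [USER] Who won the 2023 Monaco Grand Prix?"

# Identical content, slightly different wrapping: the counts differ by a few
# tokens per call, which compounds across router, graders, and generator.
print(len(enc.encode(variant_a)), len(enc.encode(variant_b)))
```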
The shared agentic RAG architecture
To achieve a fair comparison, all five implementations were built on the same control flow (a condensed code sketch follows the list):
- Router: A hybrid model-and-heuristic node that chooses retriever, web_search, or calculator.
- Retrieve Documents: Fetches the top 5 documents from Qdrant using normalized BGE-small embeddings.
- Grade Documents: An LLM judge assesses document relevance. If irrelevant, it triggers a web search fallback.
- Generate Answer: Uses a temperature=0.0 LLM with a shared context token cap to generate a draft answer.
- Grade Answer: A second LLM judge evaluates the draft for groundedness, contradictions (hallucinations), and completeness.
- Fallback & Early Return: A web search is triggered if the answer grade is insufficient. Calculator results, however, are returned directly, skipping the generation and grading steps.
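The same flow can be summarized in a few lines of Python. This is a simplified sketch, not the benchmark code: each helper stands for the corresponding node that every framework implements in its own style.

```python
def run_query(question: str) -> str:
    """Simplified sketch of the shared control flow (helper names are illustrative)."""
    route = route_question(question)              # hybrid heuristic + LLM router

    if route == "calculator":
        return run_calculator(question)           # early return: skips generation and grading

    if route == "web_search":
        context = web_search(question)            # Tavily, max_results=3
    else:
        docs = retrieve(question, k=5)            # Qdrant + normalized BGE-small embeddings
        context = docs if grade_documents(question, docs) else web_search(question)

    context = truncate_to_token_budget(context, budget=2000)
    answer = generate_answer(question, context)   # temperature=0.0, max_tokens=512

    if not grade_answer(question, context, answer):   # groundedness / completeness judge
        context = truncate_to_token_budget(web_search(question), budget=2000)
        answer = generate_answer(question, context)   # single web-search fallback
    return answer
```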
Workflow Examples
Scenario A — Direct hit from the database:
Scenario B — Recent event triggers web tool:
Scenario C — Calculator provides an early return:
Scenario D — Vector DB insufficient, falls back to web search:
RAG frameworks methodology
All five implementations achieved 100% accuracy on our 100-query test set, matching ground truth answers. This was the foundational requirement, ensuring each framework could successfully execute the same agentic RAG workflow before measuring performance differences.
1. Core components & configuration
The foundational tools were standardized to eliminate performance variables at the source; a consolidated configuration sketch follows the list.
- LLMs:
- Model: All nodes (router, generator, grader) used the openai/gpt-4.1-mini model via the OpenRouter API.
- Determinism: temperature was set to 0.0 for all LLM calls to ensure maximum consistency in routing, generation, and grading.
- Token limits: Strict max_tokens limits were enforced: 256 for the router and graders, and 512 for the generator. This prevents latency differences caused by one framework generating excessively long responses.
- Embedding model & retrieval:
- Model: All frameworks used BAAI/bge-small-en-v1.5 from HuggingFace.
- Normalization: A critical step for performance, normalize_embeddings was set to True in all five frameworks. (LangChain/LangGraph via encode_kwargs; LlamaIndex via normalize=True; Haystack via normalize_embeddings; the DSPy retriever applies normalization directly.)
- Retrieval: The Qdrant vector store was queried with k=5 (top 5 documents) in all implementations.
- Tooling:
- Web search: The benchmark was restricted to Tavily-only (max_results=3).
- Calculator: All five implementations used the sympy library for mathematical expression parsing and evaluation, ensuring identical capabilities.
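For reference, the standardized settings can be collected in one place. The structure below is an illustrative summary, not the benchmark’s actual configuration module.

```python
# Illustrative summary of the standardized configuration (names are hypothetical).
BENCHMARK_CONFIG = {
    "llm": {
        "model": "openai/gpt-4.1-mini",    # served via OpenRouter for every node
        "temperature": 0.0,                # deterministic routing, generation, grading
        "max_tokens": {"router": 256, "graders": 256, "generator": 512},
    },
    "embeddings": {
        "model": "BAAI/bge-small-en-v1.5",
        "normalize_embeddings": True,      # enabled in all five frameworks
    },
    "retrieval": {"vector_store": "Qdrant", "k": 5},
    "tools": {
        "web_search": {"provider": "tavily", "max_results": 3},
        "calculator": "sympy",             # same expression parser everywhere
    },
    "context_token_budget": 2000,          # shared cap applied before generation
}
```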
2. RAG Control Flow & Policy
The agent’s “decision-making” process was explicitly mirrored across the board.
- Routing logic: A hybrid routing strategy was implemented in all five scripts to balance model intelligence with deterministic rules (see the policy sketch after this list):
- A regex-based heuristic_route first checks for obvious calculator or web search patterns (e.g., math symbols, years like “2024”).
- An LLM router_node then makes its own decision.
- The final decision prioritizes the heuristic for calculators, otherwise deferring to the LLM’s choice.
- Context budgeting: This is one of the most critical standardizations. Before the generate_answer node is called, all retrieved document context and web search results are concatenated and then truncated to a shared 2000-token cap using a common truncate_to_token_budget utility. This ensures the generator LLM in each framework receives an input of the exact same size, preventing any single framework from being advantaged or disadvantaged by the verbosity of its retrieved context.
- Answer grading policy:
- Lenient rubric: The grade_answer node uses an identical, lenient prompt across all frameworks, instructing the LLM judge to accept semantically similar and reasonably complete answers.
- Failure handling: The logic for handling a failed JSON parse from the grader was standardized. If the grader’s output is not valid JSON, the system defaults to a permissive grade (grounded=True, complete=True), mimicking a real-world scenario where you wouldn’t want a brittle parser to fail an otherwise good answer. DSPy returns structured fields (no JSON parsing needed); this is logged as a robustness difference, not a performance advantage.
- Calculator early-return: As seen in the code, a successful call to the calculator_node directly sets the final_answer and terminates the workflow early. This is a significant optimization that is consistently applied, preventing the calculator path from unnecessarily invoking the generate and grade_answer LLMs.
- DSPy alignment: To keep fairness with non-CoT baselines, DSPy uses dspy.Predict (no chain-of-thought) for the Router and AnswerGenerator. Signatures mirror the other frameworks’ node contracts; where available, token counts use model-reported usage, with a tiktoken fallback otherwise.
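Below is a condensed sketch of the shared policy utilities described above: the regex heuristic, the context-budget truncation, and the permissive grade fallback. The exact implementations vary slightly per repository; this version assumes only tiktoken and the standard library, and the regex patterns are illustrative.

```python
import json
import re

import tiktoken

_ENC = tiktoken.get_encoding("cl100k_base")


def heuristic_route(question: str) -> str | None:
    """Cheap pattern checks that can pre-empt the LLM router (illustrative patterns)."""
    if re.search(r"\d\s*[-+*/^%]\s*\d", question):                        # obvious math expression
        return "calculator"
    if re.search(r"\b(20\d{2}|latest|today|current)\b", question, re.I):  # recency cues
        return "web_search"
    return None


def decide_route(question: str, llm_route: str) -> str:
    """The heuristic wins for calculator queries; otherwise defer to the LLM router."""
    return "calculator" if heuristic_route(question) == "calculator" else llm_route


def truncate_to_token_budget(text: str, budget: int = 2000) -> str:
    """Cap the concatenated retrieval/web context at the shared token budget."""
    tokens = _ENC.encode(text)
    return _ENC.decode(tokens[:budget]) if len(tokens) > budget else text


def parse_answer_grade(raw: str) -> dict:
    """Permissive fallback: a malformed grader response never fails a good answer."""
    try:
        grade = json.loads(raw)
        return {"grounded": bool(grade.get("grounded", True)),
                "complete": bool(grade.get("complete", True))}
    except (json.JSONDecodeError, TypeError, AttributeError):
        return {"grounded": True, "complete": True}
```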
3. Instrumentation and metrics
The measurement process was identical, using shared utilities and principles; a minimal timing sketch follows the list.
- Latency: High-precision time.perf_counter() was used for all timing. The Framework Overhead is consistently calculated as Total Latency – External Calls Latency.
- Tokenization: All token counts for prompts and completions were calculated using tiktoken with the cl100k_base encoding, ensuring a single source of truth for token metrics. The “Avg. Tokens” metric reported in the results represents the cumulative sum of all input (prompt) and output (completion) tokens for every LLM call (e.g., router, graders, generator) within a single query workflow.
- State management: While the implementation syntax varies (LangGraph’s TypedDict, LlamaIndex’s class, LangChain’s dictionary), the state structure is functionally identical. Each framework passes the same set of keys (question, documents, web_results, etc.) between nodes, ensuring the control flow logic operates on the same information.
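A minimal sketch of the timing approach, assuming a per-query accumulator; the class and variable names are illustrative, not the benchmark’s instrumentation module.

```python
import time
from contextlib import contextmanager


class QueryMetrics:
    """Per-query accumulator applied identically in every framework (illustrative)."""

    def __init__(self) -> None:
        self.external_s = 0.0   # time spent inside LLM and tool calls

    @contextmanager
    def external_call(self):
        """Wrap each LLM/tool call so its latency is excluded from framework overhead."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.external_s += time.perf_counter() - start


metrics = QueryMetrics()
t0 = time.perf_counter()
with metrics.external_call():
    time.sleep(0.05)            # stand-in for an LLM or Tavily call
total_s = time.perf_counter() - t0

# Framework Overhead = Total Latency - External Calls Latency (reported in ms).
framework_overhead_ms = (total_s - metrics.external_s) * 1000
print(f"framework overhead: {framework_overhead_ms:.2f} ms")
```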
By enforcing these stringent, code-level standardizations, this benchmark aims to move beyond superficial comparisons and offer a replicable analysis of framework performance under a fixed RAG policy.
Interpreting the results:
- You can conclude: In this specific, highly controlled setup, framework overhead is minor, and performance differences were driven mainly by token counts and tool-path variations.
- You cannot generalize: Results are specific to this architecture, models, prompts, retriever, and web provider; changing these can alter rankings.
Developer experience: A qualitative comparison
Performance is not the only factor; how a framework feels to build with is equally important.
- LangGraph: The declarative graph
Uses a graph-first paradigm. You define nodes and wire them with edges (including add_conditional_edges), so control flow is part of the architecture. State is typed via a TypedDict with reducer-style updates (Annotated[…, add]).
- Choose LangGraph for: complex workflows with multiple branches, retries, and cycles; its structure scales in robustness and maintainability as agents grow.
- LlamaIndex: Imperative orchestration
A procedural script where control flow is standard Python if/else; the “graph” lives in your code. State is a dedicated PipelineState class, and the framework provides clean retrieval primitives (VectorStoreIndex → .as_retriever(k=5)).
- Choose LlamaIndex for: readable, single-file workflows where you value clear procedural logic and easy debugging.
- LangChain: Imperative with declarative components
Orchestration remains a Python script, but individual tasks are small, composable chains using the | operator (e.g., prompt | llm | parser). State is a flexible, untyped Python dict.
- Choose LangChain for: rapid prototyping or teams already in the LangChain ecosystem that prefer composing small declarative units within a larger imperative driver.
- Haystack: Component-based, manual orchestration
Typed, reusable components (@component) with explicit I/O, while control flow stays plain Python (if/else). Easy to swap LLM/retriever/web backends, plus first-class per-step instrumentation (external vs. framework time).
- Choose Haystack for: production-ready, testable pipelines with clear contracts and fine-grained control.
- DSPy: Signature-first programs (fewer lines of code)
Define a task via a signature (inputs/outputs + intent), then implement it with Modules that encapsulate prompting and LLM calls. Centralizes prompt/usage handling and removes glue code; swapping internals (e.g., Predict ↔ CoT) doesn’t change the contract.
- Choose DSPy for: minimal boilerplate, readable single-file flows, contract-driven development (with optional optimizers).
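To make the declarative style concrete, here is a minimal, self-contained LangGraph wiring sketch in the spirit of the description above; the node bodies are placeholders, not the benchmark implementation.

```python
from operator import add
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph


class RAGState(TypedDict):
    question: str
    route: str
    documents: Annotated[list, add]   # reducer-style accumulation across nodes
    final_answer: str


def router_node(state: RAGState) -> dict:
    # The hybrid heuristic + LLM decision would go here; hard-coded for the sketch.
    return {"route": "retriever"}


def retrieve_node(state: RAGState) -> dict:
    return {"documents": ["<top-5 Qdrant chunks>"]}


def generate_node(state: RAGState) -> dict:
    return {"final_answer": f"Answer to: {state['question']}"}


graph = StateGraph(RAGState)
graph.add_node("router", router_node)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", lambda s: s["route"],
                            {"retriever": "retrieve", "calculator": END})
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)

app = graph.compile()
result = app.invoke({"question": "What is agentic RAG?", "route": "", "documents": []})
print(result["final_answer"])
```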
Trading optimal performance for comparability
- LangGraph might excel with its native graph optimizations when allowed to use parallel execution, state caching, and its conditional edge system for complex branching logic.
- DSPy could show dramatically different results when using its signature optimizers (like MIPROv2) and Chain-of-Thought prompting, which can significantly improve answer quality.
- Haystack might leverage its production-ready caching, batching features, and component-level optimizations that we disabled for fairness.
- LlamaIndex could benefit from its advanced indexing strategies, query engines, and multi-modal capabilities that weren’t exercised in this benchmark.
- LangChain might shine with its extensive tool ecosystem and LCEL (LangChain Expression Language) optimizations when not constrained to our standardized toolset.
The “best” framework depends on whether you optimize for: development speed, maintainability, performance, or specific architectural patterns.
💡Conclusion
In a tightly matched agentic RAG pipeline, orchestration overhead is usually a small slice. What moves the needle is how many tokens you process and which tools you invoke, both shaped by prompts, retrieval, and routing. The “right” framework ultimately depends on your team’s preferred orchestration style: declarative graphs (LangGraph), imperative scripts (LlamaIndex), composable chains (LangChain), modular components (Haystack), or signature-first programs (DSPy) that minimize boilerplate.
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Agentic RAG benchmark: multi-database routing and query generation
- Hybrid RAG: Boosting RAG Accuracy
