
RAG Frameworks: LangChain vs LangGraph vs LlamaIndex

Cem Dilmegani
updated on Oct 28, 2025

Comparing Retrieval-Augmented Generation (RAG) frameworks is challenging. Default settings for prompts, routing, and tools can subtly alter behavior, making it difficult to isolate the framework’s impact.

To create a controlled comparison, we replicated the same agentic RAG workflow across LangChain, LangGraph, and LlamaIndex, standardizing components wherever possible. We then measured token throughput and framework overhead under identical instrumentation and constraints.

RAG frameworks benchmark results

The benchmark consisted of 100 queries, with each framework running the full set 100 times to provide stable averages.


All implementations used the same models, temperatures, retrieval provider, web search tool, and a shared context token cap.

Key Findings

  1. We focus on controlling what’s controllable: Same model family and temperatures, node-level max_tokens, retriever (Qdrant + BGE-small, k=5, normalization on), web provider (Tavily-only), router policy (heuristic + model), calculator early-return, shared context token cap, identical grading rubric, unified instrumentation. This substantially reduces major confounders in our measurements.
  2. Framework overhead is measurable but small: We observed ~6–14 ms per query from orchestration logic. These differences are real, but not the main source of the >1 s latency gaps; most time is spent on I/O with external models/tools.
  3. Performance tracks tokens (under these constraints): In our runs, LlamaIndex was fastest (avg ~2.37 s) and more token-efficient (~1.6k). LangGraph and LangChain followed; higher latencies correlated with higher average tokens and tool-usage patterns.

Why do differences persist? The “Framework DNA”

Despite rigorous standardization, small variances in token counts and latency remain. These are attributable to the inherent, low-level behaviors of each framework, their “DNA.”

  • Prompt & message serialization: Each framework wraps the same logical content with slightly different formatting before sending it to the LLM, creating small but consistent token deltas.
  • Context assembly: The precise ordering and inclusion of metadata within the concatenated context can differ slightly by framework, affecting the final token count.
  • Routing tie-breaks: In borderline cases, subtle differences in how a framework parses the router’s JSON output can lead to a different initial tool choice.

In this setup, the token footprint appears to be the primary driver, more than framework execution time.
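
The serialization effect is easy to demonstrate with tiktoken, the same tokenizer used in this benchmark’s instrumentation. The wrapper strings below are hypothetical and not taken from any of the three frameworks; they only show how identical logical content picks up slightly different token counts once framework-specific scaffolding is added.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

question = "Summarize the retrieved documents about vector databases."

# Hypothetical message scaffolding: the logical content is identical,
# only the framework-added wrapper text differs.
framework_a = f"System: You are a helpful assistant.\nHuman: {question}"
framework_b = f"<|system|>You are a helpful assistant.<|user|>{question}"

# The two counts differ by a handful of tokens: a small but consistent
# delta that accumulates over every LLM call in the workflow.
print(len(enc.encode(framework_a)))
print(len(enc.encode(framework_b)))
```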

The shared agentic RAG architecture

Figure 1: Agentic RAG architecture

To achieve a fair comparison, all three implementations were built on the same control flow (a framework-agnostic sketch follows the list):

  • Router: A hybrid model-and-heuristic node that chooses retriever, web_search, or calculator.
  • Retrieve Documents: Fetches the top 5 documents from Qdrant using normalized BGE-small embeddings.
  • Grade Documents: An LLM judge assesses document relevance. If irrelevant, it triggers a web search fallback.
  • Generate Answer: Uses a temperature=0.0 LLM with a shared context token cap to generate a draft answer.
  • Grade Answer: A second LLM judge evaluates the draft for groundedness, contradictions (hallucinations), and completeness.
  • Fallback & Early Return: A web search is triggered if the answer grade is insufficient. Calculator results, however, are returned directly, skipping the generation and grading steps.

Workflow Examples

Scenario A — Direct hit from the database:

Scenario B — Recent event triggers web tool:

Scenario C — Calculator provides an early return:

Scenario D — Vector DB insufficient, falls back to web search:

RAG frameworks methodology

The primary goal of this benchmark was to isolate framework effects by holding the RAG policy and components as constant as possible. Every detail, from prompt wording to failure handling, was meticulously aligned across the three implementations.

1. Core components & configuration

The foundational tools were standardized to eliminate performance variables at the source; a configuration sketch follows the list.

  • LLMs:
    • Model: All nodes (router, generator, grader) used the openai/gpt-4.1-mini model via the OpenRouter API.
    • Determinism: temperature was set to 0.0 for all LLM calls to ensure maximum consistency in routing, generation, and grading.
    • Token limits: Strict max_tokens limits were enforced: 256 for the router and graders, and 512 for the generator. This prevents latency differences caused by one framework generating excessively long responses.
  • Embedding model & retrieval:
    • Model: All frameworks used BAAI/bge-small-en-v1.5 from HuggingFace.
    • Normalization: A critical step for performance, normalize_embeddings was set to True in all three frameworks. This was configured via encode_kwargs in LangChain/LangGraph and the normalize=True parameter in LlamaIndex.
    • Retrieval: The Qdrant vector store was queried with k=5 (top 5 documents) in all implementations.
  • Tooling:
    • Web search: The benchmark was restricted to Tavily-only (max_results=3).
    • Calculator: All three implementations used the sympy library for mathematical expression parsing and evaluation, ensuring identical capabilities.
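
A hedged configuration sketch of these components. Package and parameter names follow current LangChain, LlamaIndex, and sympy documentation rather than the benchmark’s actual code, so treat them as assumptions that may shift across library versions; credential handling via an environment variable is also an assumption.

```python
import os

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from sympy import sympify

# LangChain / LangGraph: normalized BGE-small embeddings via encode_kwargs.
lc_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},  # normalization on, as in the benchmark
)

# LlamaIndex: the same model, enabled through its normalize flag.
li_embeddings = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    normalize=True,
)

# Shared LLM settings: gpt-4.1-mini through OpenRouter, deterministic and capped.
generator_llm = ChatOpenAI(
    model="openai/gpt-4.1-mini",
    temperature=0.0,
    max_tokens=512,                                # 256 for router/graders, 512 for the generator
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),  # assumed credential handling
)

# Calculator tool: sympy-based expression parsing and evaluation.
def calculate(expression: str) -> str:
    """Evaluate a math expression deterministically with sympy."""
    return str(sympify(expression))
```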

2. RAG Control Flow & Policy

The agent’s “decision-making” process was explicitly mirrored across the board; a code sketch of these policies follows the list.

  • Routing logic: A hybrid routing strategy was implemented in all three scripts to balance model intelligence with deterministic rules:
    1. A regex-based heuristic_route first checks for obvious calculator or web search patterns (e.g., math symbols, years like “2024”).
    2. An LLM router_node then makes its own decision.
    3. The final decision prioritizes the heuristic for calculators, otherwise deferring to the LLM’s choice.
  • Context budgeting: This is one of the most critical standardizations. Before the generate_answer node is called, all retrieved document context and web search results are concatenated and then truncated to a shared 2000-token cap using a common truncate_to_token_budget utility. This ensures the generator LLM in each framework receives an input of the exact same size, preventing any single framework from being advantaged or disadvantaged by the verbosity of its retrieved context.
  • Answer grading policy:
    • Lenient rubric: The grade_answer node uses an identical, lenient prompt across all frameworks, instructing the LLM judge to accept semantically similar and reasonably complete answers.
    • Failure handling: The logic for handling a failed JSON parse from the grader was standardized. If the grader’s output is not valid JSON, the system defaults to a permissive grade (grounded=True, complete=True), mimicking a real-world scenario where you wouldn’t want a brittle parser to fail an otherwise good answer.
  • Calculator early-return: As seen in the code, a successful call to the calculator_node directly sets the final_answer and terminates the workflow early. This is a significant optimization that is consistently applied, preventing the calculator path from unnecessarily invoking the generate and grade_answer LLMs.
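
A sketch of these policy pieces: the regex heuristic, the precedence rule that lets the heuristic win only for calculator routes, and the permissive default when the grader’s JSON cannot be parsed. The regex patterns and function names are illustrative, not the benchmark’s actual identifiers.

```python
import json
import re

# Illustrative patterns: a crude math detector and a recency cue (years, "latest", etc.).
CALC_PATTERN = re.compile(r"[\d\s]+[\+\-\*/\^%][\d\s\(\)]+")
RECENCY_PATTERN = re.compile(r"\b(20\d{2}|latest|today|news)\b", re.IGNORECASE)

def heuristic_route(question: str) -> str | None:
    """Regex pre-check for obvious calculator or web-search queries."""
    if CALC_PATTERN.search(question):
        return "calculator"
    if RECENCY_PATTERN.search(question):
        return "web_search"
    return None

def decide_route(question: str, llm_choice: str) -> str:
    """Final policy: the heuristic wins only for calculator; otherwise defer to the LLM router."""
    if heuristic_route(question) == "calculator":
        return "calculator"
    return llm_choice

def parse_answer_grade(raw: str) -> dict:
    """Permissive failure handling: an unparseable grader response defaults to a passing grade."""
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"grounded": True, "complete": True}
```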

3. Instrumentation and metrics

The measurement process was identical, using shared utilities and principles; a sketch of the instrumentation follows the list.

  • Latency: High-precision time.perf_counter() was used for all timing. The Framework Overhead is consistently calculated as Total Latency – External Calls Latency.
  • Tokenization: All token counts for prompts and completions were calculated using tiktoken with the cl100k_base encoding, ensuring a single source of truth for token metrics. The “Avg. Tokens” metric reported in the results represents the cumulative sum of all input (prompt) and output (completion) tokens for every LLM call (e.g., router, graders, generator) within a single query workflow.
  • State management: While the implementation syntax varies (LangGraph’s TypedDict, LlamaIndex’s class, LangChain’s dictionary), the state structure is functionally identical. Each framework passes the same set of keys (question, documents, web_results, etc.) between nodes, ensuring the control flow logic operates on the same information.
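
A sketch of the shared instrumentation and the truncate_to_token_budget utility referenced in the context-budgeting policy, assuming only tiktoken and the standard library; apart from truncate_to_token_budget, the helper names are illustrative.

```python
import time

import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")  # single source of truth for token counts

def count_tokens(text: str) -> int:
    return len(ENCODING.encode(text))

def truncate_to_token_budget(text: str, budget: int = 2000) -> str:
    """Clip concatenated context to the shared 2,000-token cap before generation."""
    tokens = ENCODING.encode(text)
    return ENCODING.decode(tokens[:budget]) if len(tokens) > budget else text

class Timer:
    """perf_counter-based timing; framework overhead = total latency - external call latency."""

    def __init__(self):
        self.total_start = time.perf_counter()
        self.external_seconds = 0.0

    def track_external(self, fn, *args, **kwargs):
        """Wrap an LLM/tool call so its latency is attributed to external I/O."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            self.external_seconds += time.perf_counter() - start

    def framework_overhead_ms(self) -> float:
        total = time.perf_counter() - self.total_start
        return (total - self.external_seconds) * 1000.0
```

Under this accounting, time spent waiting on OpenRouter, Qdrant, or Tavily counts as external I/O, so the residual framework overhead reflects only orchestration code (the ~6–14 ms per query reported above).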

By enforcing these stringent, code-level standardizations, this benchmark aims to move beyond superficial comparisons and offer a replicable analysis of framework performance under a fixed RAG policy.

Interpreting the results:

  • You can conclude: In this specific, highly controlled setup, framework overhead is minor, and performance differences are driven mainly by token counts and tool-path variations.
  • You cannot generalize: Results are specific to this architecture, models, prompts, retriever, and web provider; changing these can alter rankings.

Developer experience: A qualitative comparison

Performance is not the only factor; how a framework feels to build with is equally important. A minimal LangGraph wiring sketch follows the comparison.

  • LangGraph: The declarative graph
    Uses a graph-first paradigm. You define nodes and wire them with edges (including add_conditional_edges), so control flow is part of the architecture. State is typed via a TypedDict with reducer-style updates (Annotated[…, add]).
    Choose LangGraph for: complex workflows with multiple branches, retries, and cycles; its structure scales in robustness and maintainability as agents grow.
  • LlamaIndex: Imperative orchestration
    A procedural script where control flow is standard Python if/else; the “graph” lives in your code. State is a dedicated PipelineState class, and the framework provides clean retrieval primitives (VectorStoreIndex → .as_retriever(k=5)).
    Choose LlamaIndex for: readable, single-file workflows where you value clear procedural logic and easy debugging.
  • LangChain: Imperative with declarative components
    Orchestration remains a Python script, but individual tasks are small, composable chains using the | operator (e.g., prompt | llm | parser). State is a flexible, untyped Python dict.
    Choose LangChain for: Rapid prototyping or teams already in the LangChain ecosystem that prefer composing small declarative units within a larger imperative driver.
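
To make the paradigm difference concrete, here is a minimal LangGraph-style wiring sketch; node bodies are trivial stubs, the class, node, and key names are illustrative, and the state keys loosely mirror the shared schema. The LangChain and LlamaIndex versions would express the same branches as ordinary Python if/else.

```python
from operator import add
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph

class PipelineState(TypedDict):
    question: str
    route: str
    documents: Annotated[list, add]  # reducer-style accumulation
    final_answer: str

def router_node(state: PipelineState) -> dict:
    return {"route": "retriever"}      # stub: the real node combines heuristic + LLM routing

def retrieve_node(state: PipelineState) -> dict:
    return {"documents": ["doc"]}      # stub: the real node queries Qdrant (k=5)

def generate_node(state: PipelineState) -> dict:
    return {"final_answer": "draft"}   # stub: the real node calls the capped generator LLM

def calculator_node(state: PipelineState) -> dict:
    return {"final_answer": "42"}      # stub: early-return path

builder = StateGraph(PipelineState)
builder.add_node("router", router_node)
builder.add_node("retrieve", retrieve_node)
builder.add_node("generate", generate_node)
builder.add_node("calculator", calculator_node)

builder.set_entry_point("router")
builder.add_conditional_edges(
    "router",
    lambda state: state["route"],
    {"retriever": "retrieve", "calculator": "calculator"},
)
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
builder.add_edge("calculator", END)    # calculator skips generation entirely

graph = builder.compile()
result = graph.invoke({"question": "What is RAG?", "route": "", "documents": [], "final_answer": ""})
```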

💡Conclusion

In a tightly matched agentic RAG pipeline, orchestration overhead is usually a small slice. What moves the needle is how many tokens you process and which tools you invoke, both shaped by prompts, retrieval, and routing. In our 100-query, 100× runs, LlamaIndex showed the lowest average latency (~2.37 s) and token usage (~1.6k), while LangGraph and LangChain trailed mainly due to higher tokens and routing mix, not inherent framework cost.

Further reading

Explore other RAG benchmarks.

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.
