
How We Moved from LLM Scorers to Agentic Evals

Cem Dilmegani
updated on Sep 25, 2025

Evaluating LLM applications primarily focuses on testing an application end-to-end to ensure it performs consistently and reliably.

We previously covered traditional text-based LLM evaluation methods like BLEU or ROUGE. Those classical reference-based NLP metrics are useful for tasks such as translation or summarization, where the goal is simply to match a reference output.

Agentic evals, however, take a different perspective. Instead of only judging the fluency or similarity of generated text, agentic evals measure whether an application functions as intended, completing tasks, using tools correctly, retaining context across turns, and staying aligned with its assigned role.

For this article, I researched common metrics used in practice for evaluating:

  1. multi-turn conversations
  2. RAG pipelines
  3. task-completion agents

I also provided implementation examples using frameworks like DeepEval, RAGAS, and OpenAI’s Evals library to show how these evaluations can be applied in real-world workflows.

Key takeaways:

  • Common mistakes while evaluating tool-use agents include over-relying on classical NLP metrics like BLEU/ROUGE, which fail to capture semantic nuance in LLM outputs.
  • LLM-as-a-judge is a reliable evaluation method. It relies on a separate LLM to grade multi-turn conversations or open-ended tasks, offering nuanced insights but often requiring techniques like G-Eval for consistent scoring.
  • The choice of metrics depends on the type of system being evaluated (e.g., multi-turn conversations, RAG, task-completion agents). Each comes with its own set of metrics.
  • RAG systems require splitting evaluation into retrieval metrics (context recall, precision, relevancy) and generation metrics (faithfulness, answer relevancy). End-to-end checks (the RAG triad) like semantic similarity and correctness provide holistic insights but blur retriever vs. generator issues.
  • Naming inconsistencies remain a challenge. For example, one framework may use the term “faithfulness” while another uses “groundedness,” even though they describe the same concept.
  • Evaluation frameworks for LLM applications vary in scope and purpose:
    • LlamaIndex provides framework-specific evaluations, a good fit for prototyping.
    • OpenAI Evals and MLflow Evals are lightweight tools, with OpenAI Evals allowing users to define their own metrics and MLflow focused on traditional ML pipelines with limited LLM-specific metrics.
    • RAGAS, originally a metric library for RAG pipelines, now offers broader evaluation options and integrates with LangChain.
    • DeepEval offers a set of 40+ metrics, including those from RAGAS, along with G-Eval, which enables the creation of custom metrics.
| Framework | Pytest / CLI Runner | Ready-made Metrics | Synthetic Data Gen | Offline Judge | Model-Agnostic |
| --- | --- | --- | --- | --- | --- |
| DeepEval | ✅ (deepeval test) | 40+ | ✅ (deepeval create-dataset) | | |
| RAGAS | ❌ (script asserts) | 6 core RAG + additional metrics | ✅ (KG-based Q-gen) | | |
| MLflow Evaluate | ✅ (mlflow.evaluate) | 3–4 | ❌ (BYO) | | |
| OpenAI Evals | ✅ (CLI orchestrator) | ~10 | 🟧 (helper script) | | |

Evaluation frameworks: core differences.1 See the explanation of the table below.

  • Despite differences, most frameworks share a common foundation. They typically support:
    • Running evals on custom datasets
    • Metrics for multi-turn conversations, RAG, and agents
    • LLM-as-a-judge integration
    • Custom metric definitions
    • CI/CD compatibility

How we used to evaluate NLP models

If you already know the basics of NLP evaluation metrics and benchmarks, you can skip ahead to agentic evaluations (below). For anyone new to this area, it’s worth taking a moment to examine what earlier metrics, such as accuracy and BLEU, were intended to capture and how they are applied in practice. 

The timeline below shows how NLP evaluation has evolved, starting with human-likeness tests, moving to n-gram metrics like BLEU and ROUGE, then to semantic methods such as BERTScore, and finally to agentic evaluations for multi-turn dialogue and LLM-based agents.

How NLP model evaluation has evolved over the decades2

This diagram highlights three high-level categories of NLP evaluation metrics. In the following sections, we will explore these categories briefly before moving on to agentic evaluation metrics.

Evaluation metrics for LLM content. Adapted from3

Statistical scorers (Reference-based metrics)

How we evaluated simple outputs: Traditional NLP evaluation relied on statistical scorers such as accuracy, precision, recall, F1, BLEU, and ROUGE.

These were most effective when the model produced a single “right” answer. These methods worked for structured NLP tasks but fell short once models started generating open-ended, free-form responses.

  • Accuracy: still widely used in classification. For instance, a spam classifier that labels 910 of 1,000 emails correctly achieves an accuracy of 0.91.
  • F1, precision, recall: common in information retrieval or QA, where balancing false positives and negatives is crucial.
  • BLEU: designed for machine translation, checking how much of the generated translation overlaps with a human reference.
  • ROUGE: widely applied in summarization, measuring recall by counting how many words or phrases from a human-written summary appear in the generated one.
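
To make these concrete, here is a minimal sketch (my example, not the article’s) that computes these classical scores with scikit-learn, sacrebleu, and the rouge-score package; the toy labels and sentences are made up for illustration.

```python
# Toy data for illustration; requires: pip install scikit-learn sacrebleu rouge-score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import sacrebleu
from rouge_score import rouge_scorer

# Accuracy / precision / recall / F1 on a toy spam-classification run
y_true = [1, 0, 1, 1, 0, 1]   # 1 = spam, 0 = not spam
y_pred = [1, 0, 1, 0, 0, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))

# BLEU: n-gram overlap between a candidate translation and a reference
candidate = "The cat sits on the mat."
reference = "The cat is sitting on the mat."
print("BLEU:", sacrebleu.sentence_bleu(candidate, [reference]).score)

# ROUGE: recall-oriented overlap, commonly used for summarization
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("ROUGE:", scorer.score(reference, candidate))
```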

Model-based scorers (Reference-free metrics)

How we tried to move beyond strict overlap: To move beyond reference dependence, metrics were introduced that judge quality against the source text or through logical checks. They are more adaptable but still limited to analyzing text outputs.

  • Supert / BLANC / ROUGE-C: quality-based methods that measure coverage and informativeness without ground-truth references.
  • SummaC / FactCC / DAE: entailment-based metrics that test whether generated content is logically consistent with the source.
  • SRLScore / QAFactEval / QuestEval: QA-based methods that reframe outputs as questions and test factual consistency.

LLM-based scorers: The shift towards agentic evals

What changed recently is the rise of LLM-as-a-judge. Instead of relying only on overlap scores, another LLM (often GPT-4) is used to evaluate responses. These scorers represent the transition toward agentic evaluations: they are more adaptable than overlap-based metrics.

  • MT-Bench: pairs GPT-4 with multi-turn chatbot outputs and asks it to decide which response is better.
  • Chatbot Arena: began with human raters but now increasingly uses LLMs as judges to scale evaluation.
  • Semantic scorers like BERTScore provide an intermediate option, comparing embeddings to measure meaning rather than exact word overlap.
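
As an illustration of semantic scoring, here is a minimal sketch using the bert-score package (my example, not the article’s); it compares candidate and reference sentences in embedding space rather than by exact word overlap.

```python
# Requires: pip install bert-score (downloads a pretrained model on first run)
from bert_score import score

candidates = ["The Eiffel Tower is located in Paris."]
references = ["The Eiffel Tower stands in Paris, France."]

# P, R, F1 are tensors with one entry per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print("BERTScore F1:", F1[0].item())
```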

Agentic evals: How we evaluate LLM applications

The focus of evaluation has shifted beyond the LLM’s text output to the entire system pipeline, including preprocessing, retrieval, tool use, and multi-turn context handling. 

Agentic evaluations extend this further by testing whether the system reliably performs its intended functions.

Key dimensions include:

  • verifying task completion
  • ensuring tools or APIs are invoked correctly
  • checking that context is preserved across multiple turns
  • confirming that the system remains aligned with its assigned role

What are agentic evaluation metrics?

Agentic evaluations, especially for LLM-based systems, involve testing and assessing agentic applications: those that require models to perform tasks, manage workflows, interact with users, or operate autonomously, such as:

  • multi-turn conversations
  • retrieval-augmented generation (RAG)
  • task-completion agents

The evaluation methods for such systems typically involve assessing not just the model’s outputs but also its ability to complete tasks effectively and follow instructions.

Each comes with its own set of metrics. In the sections below, we will discuss common agentic evaluation categories in detail:

1. Evaluating multi-turn conversations

The first area to consider is multi-turn conversations, which are typical of chatbots.

Relevance and completeness

  • Relevancy measures whether the model responds appropriately to the user’s request and stays on topic.
  • Completeness reflects whether the overall outcome addresses the user’s goal.

For example, imagine a customer support chatbot. If a user asks, “Can I get a refund?”, and the bot replies, “Yes, refunds are available”, that’s relevant, but incomplete. A complete answer would also include the steps to request a refund, the time window allowed, and any conditions (e.g., unused items only).

By tracking both relevance and completeness across entire conversations, you can see whether your chatbot is actually resolving customer issues or just giving partial responses.
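
As a minimal sketch (mine, not the article’s), DeepEval ships conversation-level metrics for exactly this check. The snippet below scores the refund exchange with ConversationRelevancyMetric and ConversationCompletenessMetric, assuming a DeepEval release where ConversationalTestCase is built from LLMTestCase turns and an OpenAI key is configured for the judge; API details vary across releases.

```python
from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import ConversationRelevancyMetric, ConversationCompletenessMetric

# Turn contents mirror the refund example above
convo = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="Can I get a refund?",
            actual_output="Yes, refunds are available.",  # relevant, but incomplete
        )
    ]
)

for metric in (ConversationRelevancyMetric(threshold=0.7),
               ConversationCompletenessMetric(threshold=0.7)):
    metric.measure(convo)  # uses an LLM judge under the hood
    print(type(metric).__name__, metric.score, metric.reason)
```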

Real-world example: LangWatch simulation-based evaluation:

LangWatch defines success criteria for multi-turn conversations and runs simulations to check whether those criteria are satisfied.4

Knowledge retention and reliability

  • Knowledge retention measures whether the system remembers important details across turns in a conversation. 
  • Reliability checks if it can use that memory consistently and correct itself when mistakes are made.

A typical failure in coding agents is repeating the same mistake after being corrected. For instance, if the assistant misuses a library function and the user fixes it, a reliable system should adapt. When it forgets the correction and makes the same error again, that’s a clear sign of poor reliability.

By evaluating both retention and reliability, we can assess whether an LLM-driven system is capable of holding context over time and adapting when errors occur, which is critical for complex, multi-step tasks.
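
A hedged sketch of this check, using DeepEval’s KnowledgeRetentionMetric (my example, with the same version caveat as above): the second assistant turn forgets the order number it was already given, which is exactly the failure the metric is designed to penalize.

```python
from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import KnowledgeRetentionMetric

convo = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="My order number is 88231 and the shoes don't fit.",
            actual_output="Thanks, I found order 88231. Would you like a refund or an exchange?",
        ),
        LLMTestCase(
            input="A refund, please.",
            actual_output="Sure. Could you give me your order number?",  # forgets a detail it already has
        ),
    ]
)

metric = KnowledgeRetentionMetric(threshold=0.7)
metric.measure(convo)
print(metric.score, metric.reason)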

Role adherence and prompt alignment

  • Role adherence measures whether the model consistently stays in its assigned role. 
  • Prompt alignment checks if it follows the instructions in the system prompt. 

Together, these metrics ensure the chatbot doesn’t drift into areas it shouldn’t.

Real-world example: DeepEval evaluating a medical chatbot:


This walkthrough shows how to evaluate a healthcare assistant for role adherence and prompt alignment in multi-turn settings.5
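
A minimal sketch of this kind of check, using DeepEval’s RoleAdherenceMetric with an assumed chatbot_role and made-up turns (details vary across DeepEval releases):

```python
from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import RoleAdherenceMetric

convo = ConversationalTestCase(
    chatbot_role="a cautious medical assistant that only answers health questions",
    turns=[
        LLMTestCase(
            input="I have a mild headache. What can I do?",
            actual_output="Rest, hydration, and an over-the-counter pain reliever usually help; see a doctor if it persists.",
        ),
        LLMTestCase(
            input="Also, which stocks should I buy this week?",
            actual_output="I can only help with health-related questions, so I can't give investment advice.",  # stays in role
        ),
    ]
)

metric = RoleAdherenceMetric(threshold=0.7)
metric.measure(convo)
print(metric.score, metric.reason)
```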

2. Evaluating retrieval-augmented generation (RAG) pipelines

For those unfamiliar with RAG (Retrieval Augmented Generation), here’s a quick primer. 

A RAG system consists of two main components:

  • Retriever: Finds relevant documents from a knowledge base, usually via a vector search in a vector database.
  • Generator: Uses both the user input and the retrieved context to produce the final output.

High-quality RAG outputs depend equally on both retriever and generator performance. That’s why evaluation metrics are split into two categories: 

  1. retrieval-focused 
  2. generation-focused

In the implementation examples below, I simulated the user inputs, responses, and retrieved contexts to illustrate how the metrics work in practice.

Evaluating the retriever

The first step in RAG evaluation is to check whether the retriever surfaces the right documents. 

Frameworks like RAGAS and DeepEval provide reference-free, LLM-judge style metrics such as context recall and context precision, which use an LLM to score the semantic relevance of retrieved chunks against the query. 

Unlike classical metrics (e.g., Precision@k), these don’t require pre-labeled datasets, making them practical for live production environments.

Contextual recall

Contextual recall evaluates whether the retriever captured enough of the relevant information to support the expected output.

How it’s computed: the reference (expected output) is broken into individual claims, and the score is the fraction of those claims that can be attributed to the retrieved context.

Implementation example: 

This snippet shows how RAGAS applies a retrieval metric (in this case, Context Recall) to a SingleTurnSample.
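
A minimal re-creation of such a snippet, assuming ragas >= 0.2 with an OpenAI model wrapped as the judge (the model choice and wrapper are my assumptions, and scores from an LLM judge can vary between runs):

```python
# Requires: pip install ragas langchain-openai, plus OPENAI_API_KEY in the environment
import asyncio
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["Paris is the capital of France."],
)

context_recall = LLMContextRecall(llm=evaluator_llm)
score = asyncio.run(context_recall.single_turn_ascore(sample))
print(score)
```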

Output:

1.0

Breaking it down:

  • User Input: “Where is the Eiffel Tower located?”
  • Response (actual_output): “The Eiffel Tower is located in Paris.”
  • Reference (expected_output): “The Eiffel Tower is located in Paris.”
  • Retrieved Context: “Paris is the capital of France.”

Here, the retriever surfaced a fact about Paris (“Paris is the capital of France”), but not one that explicitly answers the question about the Eiffel Tower. The model still produced the correct response, but only because it relied on its prior knowledge rather than the retrieved context.

Contextual precision

Contextual precision measures whether the retriever ranks relevant chunks higher than irrelevant ones. This matters because LLMs weigh early-ranked chunks more heavily when generating answers.

How it’s calculated:

  • Compare the retrieved chunks with the expected or “best” answer.
  • Score how many of the top-ranked chunks are actually relevant.
  • Higher precision means fewer irrelevant distractions in the top results.

The formula:

Contextual Precision = (1 / Number of Relevant Nodes) × Σ (k = 1 to n) [ (Number of Relevant Nodes up to Position k / k) × rₖ ]

  • k – the (i+1)th node in the retrieval context (i.e., its 1-based position)
  • n – the number of nodes in the retrieval context
  • rₖ – the binary relevance of the kth node: 1 if relevant, 0 otherwise7

Implementation example:
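
A minimal sketch of this check with DeepEval’s ContextualPrecisionMetric. The judge model is my assumption, and the expected_output value is assumed as well; DeepEval requires it for this metric even though the breakdown below does not list it.

```python
# Requires: pip install deepeval, plus OPENAI_API_KEY for the judge model
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located in Paris.",
    expected_output="The Eiffel Tower is located in Paris.",  # assumed; required by this metric
    retrieval_context=["The Eiffel Tower is located in Paris."],
)

metric = ContextualPrecisionMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)
```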

Output:

0.9999999999

Breaking it down:

  • User Input: “Where is the Eiffel Tower located?”
  • Response (actual_output): “The Eiffel Tower is located in Paris.”
  • Retrieved context: “The Eiffel Tower is located in Paris.”

Here, the retrieved context directly answers the query. Every context statement is on-topic and contributes to the final response. There’s no extra or unrelated information included.

Because the retriever provided only highly relevant context, the precision score is nearly perfect. In this case, the evaluation returned 0.9999999999.

Contextual relevancy

Contextual relevancy is a simpler metric: it measures what portion of the retrieved context is useful for answering the query.

Why it matters: High relevancy means the retriever is consistently pulling in meaningful chunks (small sections of text, like short paragraphs or document snippets) instead of irrelevant noise.

Implementation example:
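
A minimal sketch with DeepEval’s ContextualRelevancyMetric (the judge model choice is my assumption):

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located in Paris.",
    retrieval_context=["Paris is the capital of France."],  # topically related, not directly useful
)

metric = ContextualRelevancyMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
metric.measure(test_case)
print("score:", metric.score)
print("reason:", metric.reason)
print("is_successful:", metric.is_successful())
```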

Output:

score: 0.0

reason: Retrieved context is topically related (mentions Paris) but not directly useful in answering the query about the Eiffel Tower.

is_successful: False

Breaking it down:

  • User Input: asks about the location of the Eiffel Tower.
  • Response (actual_output): correctly states it is in Paris.
  • Expected Output: matches the correct answer.
  • Retrieved Context: says “Paris is the capital of France” but doesn’t mention the Eiffel Tower.

Here, the context is only loosely related: it provides background about Paris but not the specific fact connecting the Eiffel Tower to Paris. Because of this gap, the contextual relevancy score will be low.

Evaluating the generator

Once retrieval quality is confirmed, the next question is whether the generator uses the retrieved information effectively.

Common metrics here include faithfulness and answer relevancy, which we cover below.

  • Faithfulness: Are all claims in the answer supported by the retrieved documents?
  • Answer Relevancy: Does the response directly address the user’s query?

Faithfulness

Faithfulness checks whether the generator’s output is consistent with the retrieved context. Every claim in the answer should be backed by evidence in the documents.

How it’s calculated:

  • Extract claims: Break down the generated answer into individual factual claims.
  • Check against context: For each claim, compare it with the retrieved documents using a yes/no/idk framing.
    • yes → claim is supported by context
    • no → claim contradicts context
    • not enough info → context doesn’t contain enough information to decide
  • Score: Count all truthful claims (yes + not enough info) and divide by the total number of claims.

Implementation example:
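
A minimal sketch with DeepEval’s FaithfulnessMetric (judge model is my assumption):

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located in Paris.",
    retrieval_context=["The Eiffel Tower is located in Paris."],  # directly supports the claim
)

metric = FaithfulnessMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
metric.measure(test_case)
print("score:", metric.score)
print("reason:", metric.reason)
print("is_successful:", metric.is_successful())
```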

Output:

score: 1.0
reason: All claims in the output are consistent with the retrieved context.
is_successful: True

Breaking it down:

  • User Input: asks where the Eiffel Tower is located.
  • Response (actual_output): “The Eiffel Tower is located in Paris.”
  • Retrieved Context: explicitly contains the fact “The Eiffel Tower is located in Paris.”

Here, the claim in the answer is fully supported by the retrieved context.

Answer relevancy

Answer relevancy measures whether the answer directly addresses the user’s query. Irrelevant or tangential sentences lower the score.

Implementation example:
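
A minimal sketch with DeepEval’s AnswerRelevancyMetric; the extra construction detail in the output is my filler to mirror the breakdown below, and the judge model is an assumption.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Where is the Eiffel Tower located?",
    actual_output=(
        "The Eiffel Tower is located in Paris. "
        "It was completed in 1889 and took about two years to build."  # extra detail, not asked for
    ),
    retrieval_context=["The Eiffel Tower is located in Paris."],
)

metric = AnswerRelevancyMetric(threshold=0.5, model="gpt-4o-mini", include_reason=True)
metric.measure(test_case)
print("score:", metric.score)
print("reason:", metric.reason)
print("is_successful:", metric.is_successful())
```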

Output:

score: 0.5

reason: Half of the sentences directly answer the query; the rest provide additional but irrelevant details.

is_successful: True

Breaking it down:

  • User Input: asks specifically where the Eiffel Tower is located.
  • Response (actual_output): gives the correct location (relevant) but also adds construction details (not directly relevant to the query).
  • Retrieved Context: contains the location fact only.

So, one part of the answer is relevant, while the other part is extra information. The relevancy score will reflect this mix.

End-to-end metrics: The RAG triad

In some cases, rather than evaluating the retriever and generator separately, the RAG pipeline is assessed as a whole. These “end-to-end” metrics include:

  • Answer semantic similarity: measures whether the final output is semantically close to a reference answer, often using embedding-based methods.
  • Answer correctness:  evaluates whether the generated output is factually accurate compared to a ground-truth reference.

Both map closely to how RAG systems have traditionally been benchmarked against reference answers.
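
As an illustration of an end-to-end check, here is a hedged sketch of answer semantic similarity, assuming ragas >= 0.2 with OpenAI embeddings (the embedding model and sample texts are my assumptions):

```python
import asyncio
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SemanticSimilarity

# Embedding-based comparison of the generated response against a reference answer
scorer = SemanticSimilarity(embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()))

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower stands in Paris, France.",
)

print(asyncio.run(scorer.single_turn_ascore(sample)))
```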

3. Evaluating task-completion agents

If your system involves agent workflows, evaluation goes beyond text quality: you need to measure whether the agent itself is behaving correctly. Below are two of the most common metrics.

Tool correctness

Tool correctness measures whether the agent used the right tools for the task. Unlike most metrics, this one doesn’t rely on an LLM-as-a-judge. Instead, it’s based on exact matching: the tools the agent actually called are compared against the tools it was expected to use.

Implementation example:
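
A minimal sketch with DeepEval’s ToolCorrectnessMetric, assuming a recent DeepEval release where tools are passed as ToolCall objects (older releases accepted plain strings); the actual_output text is my filler.

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You can return them within 30 days for a full refund.",
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],  # what the agent did
    expected_tools=[ToolCall(name="WebSearch")],                            # what it should have done
)

# Exact-match comparison of called vs. expected tools; no LLM judge involved
metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)
```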

Breaking it down:
  • User Input: “What if these shoes don’t fit?”
  • Expected Tools: Only WebSearch should have been called.
  • Actual Tools: The agent called WebSearch and ToolQuery.

Here, the agent invoked an extra, unnecessary tool (ToolQuery). This signals tool misuse and highlights that the workflow didn’t strictly follow the expected path.

Task completion

Task completion measures whether the agent actually accomplished the user’s stated goal. It is typically scored by an LLM judge that compares the final output, together with the tools called along the way, against the original request.

Implementation example:
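
A hedged sketch using DeepEval’s TaskCompletionMetric; the tool names, the itinerary text, and the judge model are my assumptions, and the metric’s exact requirements vary by DeepEval release.

```python
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.",
    actual_output=(
        "Day 1: Eiffel Tower, dinner at Le Jules Verne. "
        "Day 2: Louvre Museum, hot chocolate at Angelina Paris. "
        "Day 3: Montmartre walk, evening at a local wine bar."
    ),
    tools_called=[
        ToolCall(name="ItineraryGenerator"),  # hypothetical tool names
        ToolCall(name="RestaurantFinder"),
    ],
)

# An LLM judge decides whether the stated task was actually accomplished
metric = TaskCompletionMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)
```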

Breaking it down:

  • User Input: “Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.”
  • Tools Called: The agent used both an itinerary generator and a restaurant finder.
  • Generated Output: A 3-day plan covering major landmarks and food stops.

Here, the agent’s output matches the request to plan a 3-day Paris itinerary with landmarks and cuisine.

It covers the Eiffel Tower, Louvre, and Montmartre for cultural sites, and suggests Le Jules Verne, Angelina Paris, and a wine bar for dining. 

Explanation of the table:

  • Pytest / CLI Runner: Whether the framework supports automated testing using Pytest or CLI tools. Pytest allows for running tests on models, while CLI runners enable executing tasks like model evaluations directly from the command line.
  • Ready-made Metrics: the number of predefined metrics available for model evaluation.
  • Synthetic Data Gen: Whether the framework has the ability to generate synthetic data. Synthetic data is artificial data that mimics real-world data and is useful when real data is unavailable, expensive, or sensitive.
  • Offline Judge: Whether the framework supports offline evaluation, meaning it can evaluate a model using pre-existing datasets without needing a live data connection.
  • Model-Agnostic: Whether the framework is model-agnostic, meaning it works with any machine learning model.
Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Mert Palazoğlu
Industry Analyst
Mert Palazoglu is an industry analyst at AIMultiple focused on customer service and network security with a few years of experience. He holds a bachelor's degree in management.
