LLMs often struggle with raw, unstructured data such as email threads or technical documents, leading to factual errors and weak reasoning. We benchmarked systematic context engineering and achieved up to +13.0% improvement in task accuracy, confirming that structured context is key to enhancing performance in complex tasks.
Benchmark Comparison: No Context vs Structured Context
Three approaches tested:
- NoContext: Raw text (baseline)
- CleanFormat: Noise-removed text
- GroundedCtx: Text enriched with structured context summary
Scope: Complex, multi-step tasks requiring high factual fidelity and precise task completion.
Evaluation metrics:
- Task accuracy score
- Grounding score (fidelity)
- Efficiency score (conciseness)
The results show that the GroundedCtx approach delivers the strongest reliability and performance. The sections below explain how each metric is computed; see the methodology for the test setup.
Task accuracy score
Definition:
This metric measures how correctly, completely, and meaningfully the model’s answer matches the expected response.
In other words, it evaluates whether the output fulfills the task both technically and semantically.
Computation steps
1. Exact match
If the model’s answer and the reference text are identical, the score is directly assigned as 5.0.
2. Fuzzy match ratio
Using difflib.SequenceMatcher, the character-level similarity between two texts is calculated as fuzzy_ratio.
This ratio ranges from 0 to 1.
3. Semantic similarity (Cosine similarity)
Both texts are converted into vector representations using SentenceTransformer('all-MiniLM-L6-v2').
Then, cosine similarity is calculated with cosine_similarity, resulting in cosine_sim (ranging between 0 and 1).
This step measures the semantic accuracy of the response.
4. Word overlap (Jaccard similarity)
The ratio of overlapping words between the response and reference is calculated as:
Jaccard = |common_words| / |union_of_words|
This value also ranges from 0 to 1.
5. Key information consistency
Dates, numbers, and named entities are extracted and compared across both texts.
For each category (e.g., date matches), a separate ratio is computed, and the mean value is taken:
key_info_score = (date_match + number_match + name_match) / 3
Final formula
TaskAccuracy = (0.8C + 0.1K + 0.08J + 0.02F) × 5
Where:
C: Cosine similarity (semantic accuracy)
K: Key information consistency
J: Word overlap (Jaccard similarity)
F: Fuzzy similarity ratio
The final result is normalized between 0 and 5.
5 points: The answer is completely correct and semantically aligned.
0 points: The answer fails to address the task or is incorrect.
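To make the steps above concrete, here is a minimal Python sketch of the task accuracy computation. The weights follow the formula above; the regex patterns and the extract_key_info helper are illustrative assumptions rather than the exact code used in the benchmark evaluator.

```python
# Minimal sketch of the task accuracy metric. Regexes and helper names are
# illustrative assumptions; only the weights and steps follow the description above.
import re
from difflib import SequenceMatcher

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def extract_key_info(text: str) -> dict:
    """Pull dates, numbers, and simple capitalized names out of the text."""
    return {
        "dates": set(re.findall(r"\d{4}-\d{2}-\d{2}", text)),
        "numbers": set(re.findall(r"\d+(?:\.\d+)?%?", text)),
        "names": set(re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)),
    }

def task_accuracy(answer: str, reference: str) -> float:
    # 1. Exact match short-circuits to the maximum score.
    if answer.strip() == reference.strip():
        return 5.0

    # 2. Character-level fuzzy similarity (0..1).
    fuzzy_ratio = SequenceMatcher(None, answer, reference).ratio()

    # 3. Semantic similarity via sentence embeddings (0..1).
    embeddings = model.encode([answer, reference])
    cosine_sim = float(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])

    # 4. Word overlap (Jaccard similarity, 0..1).
    answer_words, ref_words = set(answer.lower().split()), set(reference.lower().split())
    jaccard = len(answer_words & ref_words) / max(len(answer_words | ref_words), 1)

    # 5. Key information consistency: mean match ratio over dates, numbers, names.
    ans_info, ref_info = extract_key_info(answer), extract_key_info(reference)
    ratios = []
    for key in ("dates", "numbers", "names"):
        if ref_info[key]:
            ratios.append(len(ans_info[key] & ref_info[key]) / len(ref_info[key]))
        else:
            ratios.append(1.0)  # nothing to match against in this category
    key_info_score = sum(ratios) / len(ratios)

    # Final weighted combination, scaled to the 0-5 range.
    return (0.8 * cosine_sim + 0.1 * key_info_score + 0.08 * jaccard + 0.02 * fuzzy_ratio) * 5
```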
Grounding score (Fidelity)
Definition:
This metric measures how consistent the model’s response is with the provided context or source text.
The goal is to determine whether the model avoids hallucination and remains grounded in verifiable information.
Computation steps
1. Semantic similarity (BERTScore)
Using bert_score.score(), an F1-based semantic similarity value (semantic_score) is computed between the model’s response and the source text.
The score ranges from 0 to 1, representing how semantically close the response is to the given context.
2. Entity consistency
The following entities are extracted from both texts:
– Dates → e.g., “2023-10-01”
– Numbers → e.g., “25%”, “3.14”
– Names → e.g., “John Smith”
– Email addresses
The intersection of these entities is used to calculate:
entity_score = |common_entities| / |entities_in_answer|
If no entities are present in the response, entity_score = 1.0 by default.
Weighted combination
Grounding = (0.7 × semantic_score + 0.3 × entity_score) × 5
Interpretation:
5 points: Fully consistent with the context; no hallucinated content.
3 points: Generally consistent but with minor factual deviations.
0 points: Contextually irrelevant or hallucinated response.
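The grounding computation can be sketched the same way. The entity-extraction regexes below are simplified assumptions; only the BERTScore call, the entity ratio, and the 0.7/0.3 weighting come from the description above.

```python
# Hedged sketch of the grounding (fidelity) metric. The entity regexes are
# deliberately simple; a production extractor would use richer patterns.
import re
from bert_score import score as bert_score

def extract_entities(text: str) -> set:
    """Collect dates, numbers, simple names, and email addresses."""
    entities = set()
    entities.update(re.findall(r"\d{4}-\d{2}-\d{2}", text))          # dates
    entities.update(re.findall(r"\d+(?:\.\d+)?%?", text))            # numbers / percentages
    entities.update(re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text))    # simple person names
    entities.update(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text))    # email addresses
    return entities

def grounding_score(answer: str, context: str) -> float:
    # 1. Semantic similarity: BERTScore F1 between the answer and the source context.
    _, _, f1 = bert_score([answer], [context], lang="en")
    semantic_score = float(f1[0])

    # 2. Entity consistency: share of answer entities that also appear in the context.
    answer_entities = extract_entities(answer)
    if not answer_entities:
        entity_score = 1.0  # nothing checkable in the answer, so no penalty by default
    else:
        common = answer_entities & extract_entities(context)
        entity_score = len(common) / len(answer_entities)

    # Weighted combination, scaled to the 0-5 range.
    return (0.7 * semantic_score + 0.3 * entity_score) * 5
```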
Efficiency score (Conciseness)
This metric evaluates whether the model’s answer is short, clear, and free of unnecessary repetition or filler words.
The goal is to ensure the response is both readable and information-dense.
Computation steps
1. Length Score (L)
The number of words (word_count) is measured.
10–30 words → ideal (1.0)
5–50 words (outside the ideal band) → good (0.8)
<5 words → too short (0.3)
>50 words → too long (0.1–0.5)
2. Information density (D)
density = unique_word_count / total_word_count
Values close to 1 indicate low redundancy.
3. Sentence structure score (S)
The average sentence length should range between 8 and 20 words.
Scores decrease as the average moves outside this range.
4. Redundancy score (R)
Repeated sentences or phrases are detected.
A value of 1.0 means there is no repetition.
5. Filler word penalty (P)
Filler expressions such as “actually,” “basically,” “in fact” are detected.
Each filler applies a 0.15 penalty, with a maximum of 0.5.
Final formula
Efficiency = (0.3L + 0.25D + 0.25S + 0.2R) × 5 – (P × 5)
Where:
L: Length score
D: Density score
S: Sentence structure score
R: Redundancy score
P: Filler penalty
The final score is normalized between 0 and 5.
5 points: The response is concise, clear, and fully efficient.
0 points: The response is overly long, repetitive, or filled with unnecessary words.
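A rough sketch of the efficiency scoring logic follows. The weights, length thresholds, and filler penalty match the steps above, while the exact taper applied to over-long answers and out-of-range sentence lengths is an assumption made for illustration.

```python
# Minimal sketch of the efficiency (conciseness) metric. The filler list is an
# illustrative subset, and the taper functions are assumptions for illustration.
import re

FILLERS = ["actually", "basically", "in fact"]  # illustrative subset

def efficiency_score(answer: str) -> float:
    words = answer.split()
    word_count = len(words)

    # 1. Length score (L): 10-30 words is ideal.
    if 10 <= word_count <= 30:
        length = 1.0
    elif 5 <= word_count <= 50:
        length = 0.8
    elif word_count < 5:
        length = 0.3
    else:
        length = max(0.1, 0.5 - (word_count - 50) * 0.01)  # longer answers taper toward 0.1

    # 2. Information density (D): unique words over total words.
    density = len(set(w.lower() for w in words)) / max(word_count, 1)

    # 3. Sentence structure score (S): average sentence length near 8-20 words.
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    avg_len = word_count / max(len(sentences), 1)
    structure = 1.0 if 8 <= avg_len <= 20 else max(0.3, 1.0 - abs(avg_len - 14) * 0.05)

    # 4. Redundancy score (R): 1.0 when no sentence is repeated.
    normalized = [s.strip().lower() for s in sentences]
    redundancy = len(set(normalized)) / max(len(normalized), 1)

    # 5. Filler penalty (P): 0.15 per filler expression, capped at 0.5.
    filler_hits = sum(answer.lower().count(f) for f in FILLERS)
    penalty = min(0.5, 0.15 * filler_hits)

    raw = (0.3 * length + 0.25 * density + 0.25 * structure + 0.2 * redundancy) * 5 - penalty * 5
    return max(0.0, min(5.0, raw))  # clamp to the 0-5 range
```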
Analysis of key performance gains
The comparison between the two models reveals crucial insights into the impact of context versus a model’s intrinsic intelligence.
- Structured context boosts all models: Both Gemini and Sonnet benefited from the GroundedCtx approach, showing significant gains in accuracy and fidelity. Gemini 2.5 Flash saw a strong +13.0% increase in Task Accuracy, validating that structured context enhances an LLM’s capacity for complex reasoning.
- Higher baseline and diminished returns: Claude Sonnet 4 started with a higher baseline accuracy (2.95 vs. Gemini’s 2.52) on raw text, suggesting stronger intrinsic reasoning capabilities. Consequently, it experienced more modest gains from structured context (+5.4%), as it was already better at handling unstructured data.
- Efficiency and grounding improvements: Both models showed measurable improvements in grounding fidelity (Gemini +3.3%, Sonnet +1.7%). Gemini also demonstrated a strong efficiency gain (+11.3%), while Sonnet remained stable with a marginal +0.2%.
- Overall superiority with context: With structured context, Claude Sonnet 4 achieved the highest overall task accuracy (3.11), while Gemini demonstrated larger relative improvements from its baseline. This confirms that optimizing the context is a powerful tool that can both elevate weaker baselines and consolidate the strengths of more capable models.
What is context engineering?
Context engineering involves building dynamic systems that provide an LLM with all the context it needs to accomplish a task plausibly. Unlike prompt engineering, which focuses on short task descriptions, context engineering emphasizes assembling a full context for complex tasks.
Most context failures happen when the model sees incomplete, irrelevant, or poorly structured input. Context engineering addresses this by ensuring the context window contains just the right information in the right format. This may include user input, detailed instructions, structured output, and access to available tools.
Prompt engineering remains a subset of context engineering. As AI applications evolve, the term context engineering reflects the shift to dynamic systems that improve tool selection accuracy, reduce irrelevant information, and enable effective AI agents in both day-to-day and long-running tasks.
Why context > model
Advances in models alone do not solve the central issue in AI systems. Context effectiveness matters more than raw model power. Two agents using the same model can produce different outcomes depending on whether they receive full context or limited context.
- Limited context: Minimal or poorly assembled information leads to generic outputs. This is why simple prompts were often associated with weak results in early AI applications.
- Rich context: Adding user preferences, calendar data, or external information creates results that are practical and aligned with the task.
From first principles, a model cannot plausibly accomplish a task if it does not see the right context. Applying context engineering ensures that the focus is on the quality of information, rather than simply adding more context.
The anatomy of context
Context refers to all the information the model perceives before generating a response. In context engineering, this is not a single element but a combination of different aspects that together form the full context. Each part contributes to context effectiveness, and a complete system must balance more context with careful control of token use and the exclusion of irrelevant information.
- System instructions: These define how the agent behaves. They set role, tone, and rules, often including detailed instructions and examples. Clear instructions provide the right context for handling complex tasks and help avoid context failures.
- User input: This is the direct request that triggers the task. While prompt engineering often focuses only on short task descriptions, in practice, user input is one component of a broader context window.
- Short-term memory: This includes the most recent exchanges in a conversation. Because models cannot store unlimited history, context summarization is often used to compress dialogue while preserving meaning. This allows effective AI agents to maintain continuity without exceeding token limits.
- Long-term memory: This stores user preferences, facts, and summaries of past interactions, enabling AI assistants to support day-to-day and long-running tasks.
- Retrieved information: Retrieval-Augmented Generation (RAG) systems and vector store retrieval incorporate external information when the model alone lacks sufficient knowledge. This provides additional context for question answering, writing, or other related tasks that require external data.
- Available tools and tool descriptions: Many agentic systems depend on external tools. Clear tool descriptions and accurate tool calls are essential for accurate tool selection. Providing relevant tools in the correct format helps the model accomplish the task plausibly.
- Structured output: Responses often need to follow defined schemas such as JSON. Structured output supports related tasks by ensuring the information can be used directly by downstream systems or other tools.
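As a rough illustration of how these components fit together, the sketch below bundles them into a single context payload. The field names and the assemble_context helper are hypothetical; they simply show one way to combine the parts before the model is called.

```python
# Hypothetical sketch: assembling the different context components into one
# prompt payload. None of these field names are prescribed by a specific framework.
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    system_instructions: str                                       # role, tone, rules
    user_input: str                                                # the direct request
    short_term_memory: list[str] = field(default_factory=list)     # recent turns, possibly summarized
    long_term_memory: list[str] = field(default_factory=list)      # user preferences, past decisions
    retrieved_chunks: list[str] = field(default_factory=list)      # RAG / vector store results
    tool_descriptions: list[str] = field(default_factory=list)     # available tools, clearly described
    output_schema: str = ""                                        # e.g., a JSON schema for structured output

def assemble_context(bundle: ContextBundle) -> str:
    """Concatenate the components into one structured context block for the model."""
    sections = [
        ("SYSTEM INSTRUCTIONS", bundle.system_instructions),
        ("CONVERSATION SO FAR", "\n".join(bundle.short_term_memory)),
        ("KNOWN USER PREFERENCES", "\n".join(bundle.long_term_memory)),
        ("RETRIEVED INFORMATION", "\n".join(bundle.retrieved_chunks)),
        ("AVAILABLE TOOLS", "\n".join(bundle.tool_descriptions)),
        ("REQUIRED OUTPUT FORMAT", bundle.output_schema),
        ("USER REQUEST", bundle.user_input),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections if body)
```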
Context engineering architecture: how structure drives reliability
Structured context represents a shift from sending raw data to preparing curated information snapshots. Traditional models struggle because they must simultaneously interpret unstructured input and generate answers. The GroundedCtx method reduces this burden by handling interpretation before the prompt reaches the model.
- Noise removal (CleanFormat): Non-essential elements, such as headers or signatures, are removed, thereby reducing the model’s workload.
- Structural injection (GroundedCtx Extractor): Rules create a structured snapshot with sections such as:
- Participant summary: Key parties and roles.
- Event timeline: Chronological record of events.
- Decisions and action items: Outcomes and next steps.
This architecture ensures that the model sees structured and relevant context, improving both accuracy and reasoning.
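A simplified, rule-based sketch of this pipeline is shown below. The regex rules are illustrative assumptions; they mirror the snapshot sections described above but are not the extractor used in the benchmark.

```python
# Simplified sketch of a CleanFormat + GroundedCtx-style pipeline for an email thread.
# The patterns are illustrative; a production extractor would use more robust rules.
import re

def clean_format(thread: str) -> str:
    """CleanFormat step: strip common noise such as quoted headers and signature separators."""
    lines = []
    for line in thread.splitlines():
        if re.match(r"^(>+|From:|Sent:|To:|Subject:|--\s*$)", line.strip()):
            continue  # drop quoted headers and signature markers
        lines.append(line)
    return "\n".join(lines)

def grounded_ctx_snapshot(thread: str) -> str:
    """GroundedCtx step: build a structured snapshot to inject ahead of the cleaned text."""
    text = clean_format(thread)
    participants = sorted(set(re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)))
    dated_events = re.findall(r"(\d{4}-\d{2}-\d{2}[^.\n]*)", text)  # lines anchored on a date
    decisions = [s.strip() for s in re.split(r"[.\n]", text)
                 if re.search(r"\b(decided|agreed|will|must|action item)\b", s, re.I)]

    return (
        "PARTICIPANT SUMMARY:\n- " + "\n- ".join(participants or ["(none found)"]) + "\n\n"
        "EVENT TIMELINE:\n- " + "\n- ".join(dated_events or ["(none found)"]) + "\n\n"
        "DECISIONS AND ACTION ITEMS:\n- " + "\n- ".join(decisions or ["(none found)"])
    )
```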
Importance of context engineering
Even advanced models cannot perform effectively when the context is incomplete, irrelevant, or poorly structured. Applying context engineering ensures the context window is filled with just the correct information, presented in the right format, so that the model can plausibly accomplish the task. This shift reflects the next phase of AI applications, where building dynamic systems is more critical than relying solely on stronger models.
Accuracy
Unstructured data, such as long email threads or technical documents, often leads to hallucinations or errors because the model sees too much irrelevant information. Techniques such as noise removal, context summarization, and structured context snapshots help mitigate this problem. By filtering and reorganizing data into participant summaries, timelines, or decision lists, the model receives relevant context that improves reasoning and factual grounding. In benchmarks, these methods have produced measurable performance gains in domains such as code summarization and debugging.
Efficiency
Context engineering enhances the utilization of the context window. Structured output and concise summaries reduce token use while retaining necessary detail, which is critical for industrial-strength LLM apps and long-running tasks.
Methods such as token-level selection, attention steering for long contexts, and responsibility tuning further increase information density and lower computational cost. This efficiency enables AI agents to operate effectively in scenarios with limited data or strict resource constraints.
Reliability
A reliable system provides full context together with relevant tools. This combination increases the likelihood of consistent results across complex tasks such as question answering, planning, or document analysis.
Reliability also comes from adaptability. Agentic systems built with dynamic context pipelines can integrate external tools, adjust to updated user input, and retrieve external information when needed. This reduces context failures and supports effective AI agents that perform reliably across different aspects of use.
User experience
Context engineering improves how AI assistants interact with users in day-to-day tasks. Managing short-term memory enables the system to maintain a smooth flow in conversations, while long-term memory supports continuity across sessions by remembering user preferences and past decisions.
Structured output ensures that results connect smoothly with related tasks, and clear tool descriptions improve the accuracy of tool selection. Together, these practices make interactions more practical and aligned with user expectations.
Principles of effective context engineering
Several principles guide how to apply context engineering in AI systems:
- System not string: Context is built by dynamic systems that gather and format relevant context before the model is called.
- Dynamic and task-specific: Short-term memory, long-term memory, and retrieved information must be adapted to the task. Different aspects of context are needed for question answering, writing, or external tools.
- Right information and tools: Effective AI agents depend on relevant context and relevant tools. Irrelevant information lowers context effectiveness.
- Format matters: Summaries, structured output, and clear tool descriptions are more useful than raw data.
- Plausibility check: Before execution, the system should verify that it has the full context to accomplish the task plausibly.
- Feedback loops: Observability tools can track how context is assembled, enabling refinement of tool calls, tool selection, and context effectiveness.
Case study: Sequence analysis and computational reasoning (DWR Gas Deals)
The performance leap achieved by the GroundedCtx system is best illustrated by a challenging query that requires complex inferential reasoning over noisy text.
Challenging query: Based on the discussions in the email thread regarding DWR’s gas deals, infer the timeline of events leading to the proposed meeting with Baldwin. What key decisions and actions must occur before this meeting can take place, and how do the roles of Jeff and Baldwin influence this sequence?
In the NoContext scenario, the model struggles with the scattered information. However, when the GroundedCtx Extractor processes the thread, it provides a clear, pre-sequenced snapshot. This structured context enables the LLM to successfully execute the required reasoning, resulting in a demonstrable improvement in fidelity: Gemini 2.5 Flash achieved a +3.3% gain and Claude Sonnet 4 a +1.7% gain in the Grounding Score.
Methodology
Test setup and data source
The dataset was derived from the widely used Enron Email Dataset, known for capturing complex, real-world corporate data. For each email thread, human experts manually created a “Gold Task Set” to rigorously test information extraction, summarization, and complex reasoning skills.
Experimental conditions
- NoContext (raw text): The baseline, where the raw, noisy email thread is passed directly to the model.
- CleanFormat (noise removed): The text is passed through a rule-based preprocessor to remove elements such as footers and headers.
- GroundedCtx (structured context): The text is first cleaned, and then a dedicated extractor generates a concise, structured summary of key facts, which is injected into the prompt.
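For illustration, the three conditions can be expressed as a single prompt-construction step. The build_prompt function and the "TASK:" framing are assumptions about the setup, not the exact harness used in the benchmark.

```python
# Hedged sketch: assembling the prompt for each experimental condition.
# The helper signature and prompt framing are illustrative assumptions.
def build_prompt(condition: str, raw_thread: str, cleaned_thread: str,
                 snapshot: str, task: str) -> str:
    if condition == "NoContext":
        body = raw_thread                              # raw, noisy email thread
    elif condition == "CleanFormat":
        body = cleaned_thread                          # after rule-based noise removal
    elif condition == "GroundedCtx":
        body = snapshot + "\n\n" + cleaned_thread      # structured summary injected first
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"{body}\n\nTASK: {task}"
```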
Tech stack
- Large Language Models (LLMs): Core evaluation focused on Google Gemini 2.5 Flash and Anthropic Claude Sonnet 4, representing state-of-the-art frontier models.
- API and model access: Models were accessed through the OpenRouter API platform.
- Data handling and processing: Pandas, NumPy, and the re (regex) library were used for data structuring, cleaning, and entity extraction.
- Custom Evaluation Metrics: A custom LightweightEvaluator was built using libraries such as sentence-transformers, scikit-learn, bert_score, and difflib for semantic and lexical checks.

