ProofAgent is the accountability platform for production AI agents. It turns agent risk into deployment evidence through adversarial multi-juror scoring, production log audits, artifact reviews, signed readiness reports, and human review. The platform is built around the open-source ProofAgent Harness.

How do I test my AI agent with ProofAgent?

Install the open-source harness with 'pip install proofagent-harness', wrap your agent in a function returning AgentResponse, then call Harness().evaluate(my_agent, role, goal, knowledge, context). The harness runs adversarial multi-turn sessions and returns a /10 readiness score with traceable findings and fix recommendations.

What is adversarial multi-juror scoring?

Adversarial multi-juror scoring is ProofAgent's evaluation approach: a planner picks domain traps, a conductor applies sustained pressure across 25+ turns, and three independent juror agents score every behavior change. No single LLM call ever decides the verdict — the jury agents reach consensus or debate to a final score.

Is ProofAgent SOC 2 / HIPAA / GDPR compliant?

ProofAgent is SOC 2 Type II aligned, HIPAA-ready (BAAs available for enterprise customers), and follows GDPR best practices. Enterprise customers can deploy on-premises or in a private cloud with SSO/SAML, RBAC, tamper-evident audit logs, TLS 1.2+ in transit, and AES-256 at rest.

Can I use my own LLM with ProofAgent?

Yes. ProofAgent is BYO Harness LLM — the harness internals can run on any LLM provider (OpenAI, Anthropic, Google, local models). You bring your own model and API key; the harness orchestrates the multi-juror evaluation around it.

What metrics does ProofAgent measure?

11+ production metrics including Task Success, Hallucination Control, Safety, Policy Compliance, Memory Stability, Tone and Empathy, Manipulation Resistance, Tool Picking, Reasoning Quality, Relevance, and Drift Detection. Every metric is anchored to per-turn transcript evidence.

What is the difference between ProofAgent Platform and ProofAgent Harness OSS?

ProofAgent Harness OSS is the open-source multi-turn adversarial testing engine — Tier 1 of the platform, available standalone for developers and CI under Apache 2.0. ProofAgent Platform is the enterprise product that adds the other four tiers (production log audit, artifact review, multi-agent orchestration scoring, expert human review), a hosted dashboard, REST API, governance features, signed readiness reports, and dedicated support.

← All posts

Case Studies

Context Engineering: The Highest-Leverage Skill You're Not Measuring

Name: ProofAgent Platform
Brand: ProofAgent
Availability: InStock

ProofAgent Team · Jun 30, 2026 · 9 min read

Diagram illustrating context engineering for AI agents, showing system prompts, tool schemas, and conversation history as key

Context Engineering: What It Is, Why Agents Fail Without It, and How to Measure It ?

When an agent misbehaves, the instinct is to reach for a bigger model or a cleverer prompt. The real lever is usually neither. It is the context: the system prompt, the tool schemas, the retrieved knowledge, and the running history that you feed the model on every call. A language model is a function of its context, so the quality of what goes in sets a hard ceiling on what comes out. Context engineering is the discipline of designing and managing that input deliberately, and it has quietly become the highest-leverage skill in building reliable agents.

TL;DR. Prompt engineering crafts a single clever instruction. Context engineering designs the whole information environment the model reasons inside: instructions, tools, grounding, memory, and history. For agents, which run many steps and accumulate context as they go, getting this right is the difference between a system that ships and one that leaks data, invents policy, or quietly burns tokens. You cannot improve what you do not measure, so this post ends with how to score your agent's context across seven criteria using the open-source ProofAgent-Harness.

What Is Context Engineering?

Context engineering is the practice of assembling everything inside a model's context window so it has exactly what it needs to succeed, and nothing that gets in the way. That includes the system prompt and role, the task instructions, the tool definitions and their schemas, retrieved documents and knowledge, few-shot examples, memory of past interactions, and the conversation history so far.

The term gained traction in 2025 as practitioners noticed that the hard part of building with LLMs had shifted. The phrasing people landed on is simple: prompt engineering is what you type, context engineering is what the model sees. Andrej Karpathy described it as the delicate art of filling the context window with just the right information for the next step, and Shopify's Tobi Lütke framed it as the core skill of the era. The point behind the slogans is that a modern agent rarely fails because the prompt was not eloquent enough. It fails because the context was incomplete, contradictory, bloated, or unguarded.

Why It Matters for the LLM Flow

A large language model is stateless. It does not remember your last call, and it knows nothing beyond what is in the context window for this request. Everything it reasons over, every fact it can cite, every rule it is meant to follow, has to be present in the input. This makes the classic principle of computing brutally literal here: garbage in, garbage out. If the grounding is missing, the model fills the gap with a plausible guess, which we call a hallucination. If the instructions conflict, it picks one more or less at random.

Bigger context windows did not make this go away. Models do not attend to a long context evenly. The widely cited Lost in the Middle study showed that models retrieve information placed at the start or end of a long context far more reliably than information buried in the middle, so simply stuffing everything in degrades quality rather than improving it. More recent work on context rot shows that accuracy can fall as the input grows, even on tasks the model handles easily at short lengths. The lesson is that context is a budget to spend wisely, not a bucket to fill.

Why It Matters Even More for Agents

A single LLM call is a static problem: you assemble one context, you get one answer. An agent is a dynamic one. It runs in a loop, calling tools, reading their outputs, and appending each step to a growing context that it carries into the next turn. Two things follow from that loop, and both make context engineering decisive.

First, the system prompt and the tool schemas are re-sent on every turn. They are fixed overhead that you pay again and again, multiplied by every step, every user, and every run. A thousand wasted tokens in the preamble is not a one-time cost. It is a recurring tax that compounds with scale.

Second, the accumulated context degrades in characteristic ways. Practitioners have catalogued several failure modes of long agent contexts: poisoning, when a hallucination or bad tool output gets referenced again and again; distraction, when the history grows so large the model leans on it instead of reasoning; confusion, when irrelevant tools or content sway the output; and clash, when later context contradicts earlier context. Left unmanaged, an agent that started sharp drifts, loops, or contradicts itself several steps in. Good context engineering is what keeps the working set small, relevant, and consistent across the whole trajectory.

The Impact

Dimension	What weak context engineering does to you
Cost	Redundant boilerplate and dead context are billed on every single call, then multiplied by every turn and every user.
Energy and carbon	Every wasted token is compute that draws power and emits CO₂. Trimming dead context lowers energy per request at zero quality loss.
Reliability	Most agent failures trace to the setup rather than the model. A vague role or a missing rule produces wrong behaviour no model upgrade will fix.
Security	Untrusted input mixed in with instructions is the doorway for prompt injection and data exfiltration.
Latency	Fewer input tokens mean faster responses and lower queueing under load. Lean context is a free speed-up.

Why Agents Fail: It Is Usually the Context, Not the Model

Almost every production agent failure you can name has a root cause that lives in the context, and a corresponding part of the setup that, had it been right, would have prevented it. The pattern is consistent enough to put in a table.

Failure you see in production	The real cause in the context	What was missing
Agent invents a policy or a fact	No grounding corpus supplied, or weak retrieval	Grounding sufficiency
Agent approves an out-of-policy action under pressure	No explicit refusals or escalation rules in the prompt	Guardrail coverage
Agent obeys an instruction hidden inside a document	Untrusted data concatenated with instructions, no separation	Injection hardening
Agent calls the wrong tool, or with bad arguments	Vague tool names, descriptions, or parameters	Tool schema quality
Agent does one thing, then contradicts it	Competing directives across prompt sections	Instruction consistency
Agent drifts off its job over a long session	Role, scope, and success criteria left implicit	Role clarity
Bill is double what it should be, responses are slow	Bloated boilerplate re-sent on every call	Token efficiency

The uncomfortable part is that none of these show up in a normal behaviour score on a good day. They surface under adversarial pressure, on the unusual input, in the long session. If you never measure the context itself, you ship every one of these blind and find out in production.

The Challenges

The context window is a finite budget. Everything competes for the same space: instructions, tools, knowledge, memory, history. Completeness and token efficiency pull against each other constantly.
Attention is uneven. More context is not strictly better. Important content buried in the middle of a long input gets under-weighted, so placement and pruning matter as much as inclusion.
Context decays over a trajectory. Across a long agent run, history accumulates and quality rots. Managing what to keep, summarise, or drop is an active job, not a one-time setup.
Trust boundaries blur. Retrieved documents and tool outputs are untrusted, yet they land in the same window as your instructions. Without clear separation, an attacker writes instructions into a web page and your agent follows them.
It is invisible. A bloated or unguarded context produces no error and no stack trace. It just quietly costs more and fails on the bad day, which is exactly why it goes unmeasured.

Why Measure It?

Context is the one part of the stack you fully control. You cannot retrain the model, and you may not be able to change the framework, but you can fix the instructions today. That makes it the highest-leverage and lowest-cost place to improve an agent. The problem is that "is my context any good?" is usually answered by gut feel, if at all. Measuring turns an invisible, recurring tax into a number you can track, compare across versions, and drive down, the same way you already track latency or a test pass rate.

Measure It with ProofAgent-Harness

ProofAgent-Harness is an open-source evaluation harness for AI agents. As of 0.7.1 it can grade the quality of your agent's context as a separate, additive sub-score, alongside the behaviour metrics it already runs. You hand it the system prompt and tool schemas your agent uses, set assess_context=True, and the harness LLM scores the context across seven criteria. The result never touches the behaviour scores or the release gate, because it grades the setup, not the run.

from proofagent_harness import AgentContext, Harness

report = Harness(llm="claude-sonnet-4-6").evaluate(
    my_agent,
    role="airline customer support",
    context=AgentContext(system_prompt=open("system.md").read(), tools=tool_schemas),
    assess_context=True,          # opt-in: additive sub-score, never gates
)
ce = report.context_engineering
print(ce["score"], ce["grade"])   # → 7.6 adequate

# CLI:  proof run my_agent.py --assess-context   ·   proof artifact ./brd.md --assess-context

A Worked Example

Here is the criteria breakdown for a real refund agent. It is strong where most teams already invest (role, tools, grounding) and weak exactly where the production incidents come from.

Criterion	Score	What this score is telling you
Role clarity	9.0	The agent knows its job, its scope, and what success looks like.
Guardrail coverage	5.0	The weak link. Refusals, prohibited actions, and escalation paths are thin, so the agent has no firm line to hold under pressure.
Instruction consistency	8.0	Mostly coherent, with a couple of directives that could compete.
Tool schema quality	9.0	Tools are well named and typed, so the model calls them correctly.
Grounding sufficiency	9.0	Enough knowledge is supplied for the agent to answer without inventing facts.
Injection hardening	7.0	Reasonable, but untrusted input is not fully separated from instructions.
Token efficiency	6.0	Recoverable bloat. Boilerplate and dead context are billed on every call.

The seven criteria average to 7.6 out of 10, a grade of adequate. More useful than the headline number is what it points at: guardrail coverage at 5.0 is the single highest-value fix. This is the same agent that, in its behaviour run, approved an out-of-policy refund under social-engineering pressure. The context score explained why before the incident did. The harness prints the breakdown next to the scorecard, with each finding tagged by its token impact so you also see where to cut spend:

Context Engineering  7.6/10  (adequate)
  Strong role, tools, and grounding; guardrails are the weak link.
  Role Clarity               9.0/10
  Guardrail Coverage         5.0/10
  Instruction Consistency    8.0/10
  Tool Schema Quality        9.0/10
  Grounding Sufficiency      9.0/10
  Injection Hardening        7.0/10
  Token Efficiency           6.0/10
  [↑] Missing refusal rules: add explicit refusals for out-of-policy refunds, payments, and PII.
  [↓↓] Dead context: drop the deprecated tool catalog and unused persona backstory.
  (Separate sub-score, never affects the metric scores or the gate.)

Each finding carries a verdict on its token impact (↓↓ big_cut, ↓ cut, → neutral, ↑ adds) and the run reports an estimate of the tokens you can reclaim, so the same panel answers what is wrong, how to fix it, and where the savings are.

Start Measuring

pip install -U proofagent-harness

# Grade your agent's context as part of a normal run
proof run my_agent.py --assess-context

# Score a finished deliverable and its producing agent's context together
proof artifact ./brd.md --assess-context

# Push the result, including the context engineering sub-score, to the dashboard
proof run my_agent.py --assess-context --upload --agent acme-support --fail-on block

When you upload, the context engineering sub-score travels with the report and the dashboard renders it as its own panel, version over version, next to the behaviour scorecard and the compliance posture. Behaviour tells you whether the agent passed. Context engineering tells you whether it was ever set up to.

References

Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023).
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020).
Anthropic, Building Effective Agents (2024), and Anthropic Engineering on effective context engineering for agents (2025).
Drew Breunig, How Long Contexts Fail and How to Fix Your Context (2025).
Philipp Schmid, The New Skill in AI Is Context Engineering (2025).
Simon Willison, writing on prompt injection.
ProofAgent-Harness: proofagent.ai/harness/docs#context-engineering · PyPI.

#context-engineering#ai-agents#prompt-engineering#ai-engineer#ml-practitioner#llm-developer#gpt-5-5#shopify#proofagent-harness#retrieval-augmented-generation#memory-management#llm-evaluation#ai-safety#agent-design#ai-trends

See all posts →