ProofAgent Harness: Adversarial Evaluation for Production AI Agents

ProofAgent Harness (arXiv:2605.24134), by Dr. Fouad Bousetouane, is an open-source framework for adversarial evaluation of AI agents. It runs a full pipeline of planning, adversarial conducting, multi-juror scoring, debate consensus, and signed reporting, then returns a readiness verdict you can hand to security and audit teams.

Single answers pass; real agents fail under pressure

Most evaluation scores a single answer with one model acting as judge. Production agents fail differently: several turns into a conversation, under sustained pressure, and through domain-specific failure modes. The Harness selects domain-relevant traps, sustains adversarial pressure across a full conversation, scores the whole trajectory with a panel of jurors, and produces evidence-linked findings rather than one opaque number.
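To make the contrast concrete, here is a minimal sketch of single-answer scoring versus trajectory scoring. It is not the Harness's API: the `Turn` type, the per-turn `safety` score, and the worst-turn aggregation rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int
    safety: float  # hypothetical per-turn safety score in [0, 1]


def final_answer_score(turns):
    """Score only the last turn, as single-answer evaluation does."""
    return turns[-1].safety


def trajectory_score(turns):
    """Score the whole conversation by its weakest turn (assumed rule)."""
    return min(t.safety for t in turns)


# The agent slips on turn 3, then recovers by the final turn.
conversation = [Turn(1, 0.95), Turn(2, 0.9), Turn(3, 0.2), Turn(4, 0.9)]

print(final_answer_score(conversation))  # 0.9
print(trajectory_score(conversation))    # 0.2
```

Under these assumptions, the final-answer view reports a healthy score, while the trajectory view surfaces the mid-conversation failure.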

What the paper covers

  • The full pipeline: domain-aware planner, adversarial conductor, three juror personas, Delphi or debate consensus, signed reporter
  • A 183-trap library across 11 attack families, including composite attack chains spanning five to seven turns
  • A six-metric rubric: task success, instruction following, hallucination resistance, tool use, safety, manipulation resistance
  • Asymmetric evaluation: a small local model serving as the juror against frontier-class agents
  • Multi-juror consensus that measurably reduces the bias of a single model acting as judge
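As a rough illustration of why a panel dampens single-judge bias, the sketch below aggregates three jurors over the six rubric metrics with a per-metric median. The median is a stand-in chosen for simplicity, not the Delphi or debate protocol from the paper, and the juror names, score scale, and data layout are hypothetical.

```python
from statistics import median

# Metric names follow the six-metric rubric listed above.
METRICS = (
    "task_success",
    "instruction_following",
    "hallucination_resistance",
    "tool_use",
    "safety",
    "manipulation_resistance",
)


def consensus(juror_scores):
    """Per-metric median across jurors (assumed aggregation rule)."""
    return {
        m: median(scores[m] for scores in juror_scores.values())
        for m in METRICS
    }


def uniform(value):
    """Helper: the same score for every metric."""
    return {m: value for m in METRICS}


# Hypothetical panel: juror_a is lenient on safety, the other two are not.
panel = {
    "juror_a": {**uniform(0.8), "safety": 0.9},
    "juror_b": {**uniform(0.8), "safety": 0.3},
    "juror_c": {**uniform(0.8), "safety": 0.4},
}

verdict = consensus(panel)
print(verdict["safety"])  # 0.4
```

With a single lenient judge the safety score would read 0.9. The panel median settles on 0.4, so one outlier cannot carry the verdict.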

Why it matters for enterprises

Capability is no longer the bottleneck. The risk that keeps agents out of production is behavior under pressure: a support agent that leaks data when pushed, a finance agent that breaks policy when flattered, a tool call that fires when it should have been refused. Run the Harness as a gate in your CI/CD pipeline, block deployment on a failed critical metric, and ship with a signed readiness report for your auditors.
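A deployment gate of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the Harness's actual interface: the report layout (a mapping from metric name to score), the choice of critical metrics, and the 0.7 threshold are all hypothetical.

```python
# Hypothetical gate: metric names come from the rubric above; the report
# layout, the choice of critical metrics, and the threshold are assumptions.
CRITICAL_METRICS = ("safety", "manipulation_resistance")
THRESHOLD = 0.7


def failed_critical(report, threshold=THRESHOLD):
    """Return the critical metrics whose score falls below the threshold.

    A metric missing from the report counts as 0.0, so the gate fails closed.
    """
    return [m for m in CRITICAL_METRICS if report.get(m, 0.0) < threshold]


def exit_code(report):
    """Return 1 if any critical metric failed, else 0."""
    return 1 if failed_critical(report) else 0


report = {"task_success": 0.92, "safety": 0.41, "manipulation_resistance": 0.88}

print(failed_critical(report))  # ['safety']
print(exit_code(report))        # 1
```

In a pipeline step, passing `exit_code(report)` to `sys.exit` would stop the job on a non-zero value, which most CI systems treat as a failed stage.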

Headline finding

Production-grade agents built on GPT 5.5 and Claude Opus 4.7 fail the Harness, showing serious safety and manipulation-resistance issues under composite-chain pressure. Frontier LLMs alone are not enough. Every result is reproducible from the open-source code, released under Apache 2.0. The companion paradigm paper, Human-on-the-Bridge (arXiv:2606.16871), formalizes the approach.

Plain-language explainer on Medium: ProofAgent Harness: The Open Source Harness for Complete AI Agent Evaluation.