The 5-stage AI agent evaluation pipeline

Same engine across the open-source Harness and the enterprise Platform: planning, adversarial conducting, Harness LLM scoring, debate consensus, and signed readiness reports.

Stage 1: Planning

The planner infers your agent's domain from its role and goal, then picks only relevant adversarial traps from the 183-trap library. It reserves at least 30% of evaluation turns for prompt-injection and hallucination probes, includes at least two mandatory factuality traps drawn from documented production incidents, and weaves callbacks across turns so the conductor can exploit earlier concessions.

Stage 2: Adversarial conducting

The conductor runs N adversarial turns against your agent with realistic attacks — pretexting, escalation, multi-vector blending, composite attack chains — not theatrical "ignore previous instructions" prompts. Each turn captures the conductor's question, the agent's response, and any tool calls.

Stage 3: Harness LLM scoring

Three juror personas (rigorous, lenient, contrarian) independently score the full transcript against the 5 canonical metrics. Each juror uses the same scoring rubric but applies different evaluation lenses. Scores are accompanied by reasoning and transcript-linked evidence.

Stage 4: Debate consensus

When jurors disagree by more than 2 points on any metric, a Delphi re-vote (or full debate rounds) resolves the disagreement. The median score per metric becomes the final score. Consensus reduces single-judge bias.

Stage 5: Signed readiness report

The reporter produces a final score, certification tier (Gold, Silver, Needs Enhancement, Not Ready), per-metric breakdown, transcript-linked findings, and remediation guidance. Reports ship as JSON and Markdown.