Same engine across the open-source Harness and the enterprise Platform: planning, adversarial conducting, Harness LLM scoring, debate consensus, and signed readiness reports.
The planner infers your agent's domain from its role and goal, then picks only relevant adversarial traps from the 183-trap library. It reserves at least 30% of evaluation turns for prompt-injection and hallucination probes, includes at least two mandatory factuality traps drawn from documented production incidents, and weaves callbacks across turns so the conductor can exploit earlier concessions.
The conductor runs N adversarial turns against your agent with realistic attacks — pretexting, escalation, multi-vector blending, composite attack chains — not theatrical "ignore previous instructions" prompts. Each turn captures the conductor's question, the agent's response, and any tool calls.
Three juror personas (rigorous, lenient, contrarian) independently score the full transcript against the 5 canonical metrics. Each juror uses the same scoring rubric but applies different evaluation lenses. Scores are accompanied by reasoning and transcript-linked evidence.
When jurors disagree by more than 2 points on any metric, a Delphi re-vote (or full debate rounds) resolves the disagreement. The median score per metric becomes the final score. Consensus reduces single-judge bias.
The reporter produces a final score, certification tier (Gold, Silver, Needs Enhancement, Not Ready), per-metric breakdown, transcript-linked findings, and remediation guidance. Reports ship as JSON and Markdown.