Sample AI agent evaluation report

See what a real AI agent evaluation report looks like: per-metric readiness scores, transcript-linked evidence, juror reasoning, and a signed verdict you can ship to auditors and security teams.

What's in the report

Every ProofAgent evaluation produces a structured report containing:

  • the final readiness score (0-10)
  • the certification tier (Gold, Silver, Needs Enhancement, or Not Ready)
  • a per-metric breakdown across the 5 canonical dimensions
  • transcript-linked findings with severity tags
  • juror reasoning per turn
  • concrete remediation guidance for each failure mode found
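To make the shape of such a report concrete, here is a minimal Python sketch of a data model that covers the elements listed above. It is not ProofAgent's actual schema: the class names, field names, severity label, and metric names are assumptions for illustration, and only the 0-10 score range and the four tier names come from the description. Juror reasoning is simplified to one string per finding.

```python
import json
from dataclasses import asdict, dataclass, field

# Tier names come from the report description; everything else is assumed.
TIERS = ("Gold", "Silver", "Needs Enhancement", "Not Ready")


@dataclass
class Finding:
    metric: str            # which dimension the finding belongs to
    severity: str          # severity tag; the label vocabulary is assumed
    turns: list[int]       # transcript turn numbers that produced the finding
    juror_reasoning: str   # simplified: one reasoning string per finding
    remediation: str       # suggested fix for this failure mode


@dataclass
class Report:
    readiness_score: float             # final score on the 0-10 scale
    certification_tier: str            # one of TIERS
    metric_scores: dict[str, float]    # per-metric breakdown
    findings: list[Finding] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the documented score range and tier set.
        if not 0 <= self.readiness_score <= 10:
            raise ValueError("readiness_score must be between 0 and 10")
        if self.certification_tier not in TIERS:
            raise ValueError(f"unknown tier: {self.certification_tier!r}")


# Placeholder values for illustration only, not real evaluation output.
example = Report(
    readiness_score=7.5,
    certification_tier="Silver",
    metric_scores={"metric_a": 8.0, "metric_b": 7.0},
    findings=[
        Finding("metric_b", "high", [4, 5], "placeholder reasoning", "placeholder fix")
    ],
)
report_json = json.dumps(asdict(example), indent=2)
```

Validating the score range and tier at construction time means a malformed report fails early instead of surfacing later in a dashboard or audit packet.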

Transcript-linked evidence

Every finding points to the specific turn or turns in the adversarial transcript that produced it. That makes each finding actionable: your engineering team can replay the failure, debug the root cause, and ship a fix. No hand-waving, no opaque scores.
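As a sketch of how a team might use that link, the Python helper below pulls the transcript turns a finding refers to so they can be replayed. The dictionary keys (`turn`, `role`, `text`, `turns`) are hypothetical; the real report and transcript layout may differ.

```python
def turns_for_finding(transcript: list[dict], finding: dict) -> list[dict]:
    """Return the transcript entries referenced by a finding, in order.

    Assumes each transcript entry carries a unique integer "turn" key and
    each finding lists its turn numbers under "turns" (both hypothetical).
    """
    by_turn = {entry["turn"]: entry for entry in transcript}
    missing = [n for n in finding["turns"] if n not in by_turn]
    if missing:
        # A finding that cites a turn absent from the transcript is an error.
        raise KeyError(f"finding references unknown turns: {missing}")
    return [by_turn[n] for n in finding["turns"]]


# Placeholder transcript and finding, for illustration only.
transcript = [
    {"turn": 1, "role": "adversary", "text": "placeholder prompt"},
    {"turn": 2, "role": "agent", "text": "placeholder reply"},
    {"turn": 3, "role": "adversary", "text": "placeholder follow-up"},
]
finding = {"severity": "high", "turns": [2, 3]}
evidence = turns_for_finding(transcript, finding)
```

Raising on an unknown turn number, rather than skipping it, keeps a broken evidence link from going unnoticed.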

Three readiness verdicts

  • READY — agent passes all critical metric floors and is approved for the targeted deployment scope
  • NEEDS REVIEW — agent meets most criteria but has findings requiring human judgment before release
  • NOT READY — agent fails one or more critical metrics; remediation required before re-evaluation
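One way to read the three verdicts is as a small decision rule. The Python sketch below is an assumption, not ProofAgent's published logic: the function name, the idea of per-metric numeric floors, and the boolean flag for findings that need human judgment are all hypothetical.

```python
def readiness_verdict(
    metric_scores: dict[str, float],
    critical_floors: dict[str, float],
    needs_human_judgment: bool,
) -> str:
    """Map scores to READY, NEEDS REVIEW, or NOT READY (assumed rule)."""
    # A critical metric fails if its score is below the floor or is missing.
    failed = [
        name
        for name, floor in critical_floors.items()
        if metric_scores.get(name, float("-inf")) < floor
    ]
    if failed:
        return "NOT READY"
    if needs_human_judgment:
        return "NEEDS REVIEW"
    return "READY"
```

Checking the critical floors first means a failed critical metric always yields NOT READY, regardless of any findings awaiting review.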

Output formats

Reports ship as both JSON (for programmatic consumption, dashboards, regression tracking) and Markdown (for code review threads, audit packets, executive summaries). On the enterprise Platform, reports are also signed so that tampering is evident.
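To illustrate the regression-tracking use of the JSON output, here is a minimal Python sketch that compares per-metric scores between two reports and renders any drops as a Markdown table. The `metric_scores` key and the function names are assumptions; the actual JSON layout may differ.

```python
import json


def metric_regressions(
    baseline_json: str, current_json: str, tolerance: float = 0.0
) -> dict[str, tuple[float, float]]:
    """Return {metric: (baseline, current)} for metrics whose score dropped.

    Assumes each report stores scores under a "metric_scores" object
    (hypothetical key). Metrics absent from the current report are skipped.
    """
    baseline = json.loads(baseline_json)["metric_scores"]
    current = json.loads(current_json)["metric_scores"]
    drops = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and new < old - tolerance:
            drops[name] = (old, new)
    return drops


def regressions_to_markdown(drops: dict[str, tuple[float, float]]) -> str:
    """Render the regressions as a Markdown table for a review thread."""
    if not drops:
        return "No metric regressions."
    lines = ["| Metric | Baseline | Current |", "| --- | --- | --- |"]
    for name, (old, new) in sorted(drops.items()):
        lines.append(f"| {name} | {old} | {new} |")
    return "\n".join(lines)
```

The `tolerance` parameter lets a pipeline ignore small score fluctuations between runs; whether any tolerance is appropriate depends on how stable the scores are in practice.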