ProofAgent — the best open-source AI agent evaluation tool

ProofAgent is a rigorous, research-backed accountability platform for AI agents. It combines adversarial multi-juror scoring, production log audits, artifact reviews, and signed readiness reports, all built around the open-source ProofAgent Harness. Engineering teams use it when shipping AI agents into customer support, healthcare triage, code generation, privacy compliance, and financial advisory workflows.

Test your AI agents like adversaries would

Most AI agent evaluation tools score a single response with one LLM-as-judge against a fixed test set. Production AI agents fail differently: in the third turn under pressure, through domain-specific compliance violations (HIPAA, PCI, SOX, GDPR), or through callbacks that weaponize earlier concessions. ProofAgent runs multi-turn adversarial conversations, scores the full trajectory with 3 juror personas, reaches consensus through debate or a Delphi process, and produces an evidence-linked report that can serve as a foundation for AI agent governance.
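As a rough illustration of the Delphi-style consensus step mentioned above, the Python sketch below shows three juror scores converging over repeated rounds. It is not the ProofAgent Harness API. The function name `delphi_consensus`, the `pull` revision rule, and the `tolerance` and `max_rounds` parameters are all assumptions made for this example. In a real harness each juror would presumably be re-prompted with anonymized group feedback rather than nudged numerically.

```python
from statistics import median


def delphi_consensus(initial_scores, pull=0.5, tolerance=0.05, max_rounds=10):
    """Hypothetical Delphi-style loop (illustrative only).

    Each round, every juror sees the group median and moves its own
    score a fraction `pull` of the way toward it. The loop stops once
    the spread between the highest and lowest score is within
    `tolerance`, or after `max_rounds` rounds.
    Returns (consensus_score, rounds_used).
    """
    scores = list(initial_scores)
    rounds = 0
    while max(scores) - min(scores) > tolerance and rounds < max_rounds:
        group_median = median(scores)
        # Stand-in for a juror revising its judgment after group feedback.
        scores = [s + pull * (group_median - s) for s in scores]
        rounds += 1
    return median(scores), rounds


# Three hypothetical juror scores for one conversation trajectory.
consensus, rounds_used = delphi_consensus([0.2, 0.6, 0.9])
print(f"consensus={consensus:.2f} after {rounds_used} rounds")
```

Using the median rather than the mean keeps one outlying juror from dragging the result, which matches the stated goal of reducing single-judge bias.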

Research-backed methodology, fully open source

The ProofAgent Harness methodology is documented end-to-end in a published whitepaper (arXiv:2605.24134) covering the 5-stage evaluation pipeline, multi-juror consensus scoring, the 183-trap adversarial library with composite attack chains, and the headline finding that production-grade agents built on frontier LLMs (GPT 5.5, Claude Opus 4.7) fail under sustained adversarial pressure. The Harness ships as open source under Apache 2.0; every result in the paper is reproducible from the published code.

Open ecosystem and infrastructure

The open-source ProofAgent Harness (Apache 2.0) is free forever and runs locally. The enterprise Platform adds hosted dashboards, SOC 2 and HIPAA-ready operations, regression tracking, expert human review, and signed readiness reports for production governance. Together they provide AI agent stress-testing infrastructure for teams that need to ship safely.

  • 183 adversarial attack traps across 11 families (social engineering, prompt injection, compliance, tool misuse, factuality, and more)
  • 5 canonical metrics: task success, hallucination resistance, safety, instruction following, manipulation resistance
  • 3-juror Delphi consensus reduces single-judge bias
  • Composite attack chains span 5–7 turns of sustained adversarial pressure
  • Bring your own LLM via LiteLLM (Anthropic, OpenAI, Gemini, Bedrock, Ollama, vLLM)
  • pytest integration, CI/CD ready, on-premises deployment available
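To make the pytest and CI/CD point concrete, here is a minimal sketch of a test that gates a build on the five canonical metrics listed above. It does not use the ProofAgent Harness API. The helper `failing_metrics`, the snake_case metric keys, the example scores, and the 0.8 thresholds are hypothetical and exist only for illustration.

```python
# The five canonical metrics named in the list above, as hypothetical keys.
CANONICAL_METRICS = (
    "task_success",
    "hallucination_resistance",
    "safety",
    "instruction_following",
    "manipulation_resistance",
)


def failing_metrics(scores, thresholds):
    """Return sorted names of metrics whose score is below its threshold.

    A metric missing from `scores` counts as 0.0; a metric missing from
    `thresholds` has no minimum and therefore never fails.
    """
    return sorted(
        name
        for name in CANONICAL_METRICS
        if scores.get(name, 0.0) < thresholds.get(name, 0.0)
    )


# Hypothetical consensus scores, as an evaluation run might report them.
EXAMPLE_SCORES = {
    "task_success": 0.92,
    "hallucination_resistance": 0.88,
    "safety": 0.95,
    "instruction_following": 0.90,
    "manipulation_resistance": 0.81,
}
EXAMPLE_THRESHOLDS = {name: 0.8 for name in CANONICAL_METRICS}


def test_agent_meets_thresholds():
    # pytest collects this function by its `test_` prefix; a non-empty
    # list fails the assertion and therefore the CI job.
    assert failing_metrics(EXAMPLE_SCORES, EXAMPLE_THRESHOLDS) == []
```

Saved as a file such as `test_agent_readiness.py` (a hypothetical name), this runs under a plain `pytest` invocation. Returning the list of failing metric names, rather than a single boolean, makes the assertion output show which metric regressed.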