How do I test my AI agent with ProofAgent?

Install the open-source harness with 'pip install proofagent-harness', wrap your agent in a function that returns an AgentResponse, then call Harness().evaluate(my_agent, role, goal, knowledge, context). The harness runs adversarial multi-turn sessions and returns a readiness score out of 10, with traceable findings and fix recommendations.
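As a rough illustration of the wrapping step, the sketch below shows what an agent function might look like. The AgentResponse class here is a local stand-in with an assumed 'text' field, and the agent's single-string input is also an assumption; the real type, its fields, and the import path come from the installed package and are not described on this page.

```python
from dataclasses import dataclass


# Stand-in for the harness's AgentResponse type. The real class and its
# fields are not shown in this document, so `text` is an assumed field.
@dataclass
class AgentResponse:
    text: str


def my_agent(message: str) -> AgentResponse:
    """Hypothetical wrapper: take one user message, return an AgentResponse."""
    reply = f"Echo: {message}"  # replace with a call into your real agent
    return AgentResponse(text=reply)


# With the real package installed, the call described above would look roughly like:
#   from proofagent_harness import Harness, AgentResponse  # import path assumed
#   report = Harness().evaluate(my_agent, role, goal, knowledge, context)
```

Keeping the wrapper this thin means the same function can be reused in local runs and in CI without touching the agent itself.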

What is adversarial multi-juror scoring?

Adversarial multi-juror scoring is ProofAgent's evaluation approach: a planner picks domain traps, a conductor applies sustained pressure across 25+ turns, and three independent juror agents score every behavior change. No single LLM call decides the verdict: the jurors either reach consensus or debate until they settle on a final score.
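To make the consensus-or-debate idea concrete, here is a minimal sketch of how three juror scores could be combined. It is an illustration only, not ProofAgent's actual algorithm: the function name, the use of the median, and the tolerance value are all assumptions.

```python
from statistics import median


def aggregate_juror_scores(scores, tolerance=1.0):
    """Combine three juror scores (illustrative, not ProofAgent's method).

    Returns (provisional_score, needs_debate). The provisional score is the
    median; needs_debate is True when the jurors disagree by more than
    `tolerance` points, signalling that a debate round should follow.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three juror scores")
    spread = max(scores) - min(scores)
    return median(scores), spread > tolerance


# Jurors roughly agree: the median stands and no debate is needed.
print(aggregate_juror_scores([7.0, 7.5, 8.0]))
# One juror is far off: the median is provisional and a debate is flagged.
print(aggregate_juror_scores([3.0, 7.0, 8.0]))
```

The median is used here because a single outlying juror cannot drag it, which matches the stated goal of not letting one call decide the verdict.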

Is ProofAgent SOC 2 / HIPAA / GDPR compliant?

ProofAgent is SOC 2 Type II aligned, HIPAA-ready (BAAs available for enterprise customers), and follows GDPR best practices. Enterprise customers can deploy on-premises or in a private cloud with SSO/SAML, RBAC, tamper-evident audit logs, TLS 1.2+ in transit, and AES-256 at rest.

Can I use my own LLM with ProofAgent?

Yes. ProofAgent supports bring-your-own LLM for the harness: the harness internals can run on any LLM provider (OpenAI, Anthropic, Google, or local models). You supply the model and API key, and the harness orchestrates the multi-juror evaluation around it.

What metrics does ProofAgent measure?

ProofAgent measures 11+ production metrics, including Task Success, Hallucination Control, Safety, Policy Compliance, Memory Stability, Tone and Empathy, Manipulation Resistance, Tool Picking, Reasoning Quality, Relevance, and Drift Detection. Every metric is anchored to per-turn transcript evidence.

What is the difference between ProofAgent Platform and ProofAgent Harness OSS?

ProofAgent Harness OSS is the open-source multi-turn adversarial testing engine — Tier 1 of the platform, available standalone for developers and CI under Apache 2.0. ProofAgent Platform is the enterprise product that adds the other four tiers (production log audit, artifact review, multi-agent orchestration scoring, expert human review), a hosted dashboard, REST API, governance features, signed readiness reports, and dedicated support.

ProofAgent — the best open-source AI agent evaluation tool

The most rigorous, research-backed accountability platform for AI agents. Adversarial multi-juror scoring, production log audits, artifact reviews, and signed readiness reports — built around the open-source ProofAgent Harness. Trusted by engineering teams shipping AI agents into customer support, healthcare triage, code generation, privacy compliance, and financial advisory workflows.

Test your AI agents like adversaries would

Most AI agent evaluation tools score a single response with one LLM-as-judge against a fixed test set. Production AI agents fail differently: in the third turn under pressure, via domain-specific failure modes (HIPAA, PCI, SOX, GDPR), or through callbacks that weaponize earlier concessions. ProofAgent runs multi-turn adversarial conversations, scores the full trajectory with 3 juror personas, reaches consensus through debate or Delphi, and produces an evidence-linked report — the credible foundation for AI agent governance.
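The Delphi option mentioned above can be pictured with a toy simulation: in each round every juror sees the group median and revises their score partway toward it, until the spread is small enough. This is a simplified sketch under stated assumptions, not the harness's implementation; the function name, the fixed 'pull' factor, and the thresholds are hypothetical, and real jurors would revise based on arguments rather than a formula.

```python
from statistics import median


def delphi_consensus(scores, pull=0.5, tolerance=0.5, max_rounds=5):
    """Toy Delphi loop (illustrative only).

    Each round, every score moves a fraction `pull` of the way toward the
    current group median. Stops when max - min <= tolerance or after
    `max_rounds` rounds. Returns (final_median, rounds_used).
    """
    current = list(scores)
    rounds = 0
    while max(current) - min(current) > tolerance and rounds < max_rounds:
        centre = median(current)
        current = [s + pull * (centre - s) for s in current]
        rounds += 1
    return median(current), rounds


final_score, rounds_used = delphi_consensus([4.0, 7.0, 8.0])
print(final_score, rounds_used)
```

Capping the number of rounds keeps the loop bounded even when the jurors never converge, in which case the last median is returned as-is.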

Research-backed methodology, fully open source

The ProofAgent Harness methodology is documented end-to-end in a published whitepaper (arXiv:2605.24134) covering the 5-stage evaluation pipeline, multi-juror consensus scoring, the 183-trap adversarial library with composite attack chains, and the headline finding that production-grade agents on top of frontier LLMs (GPT 5.5, Claude Opus 4.7) fail under sustained adversarial pressure. The Harness ships open-source under Apache 2.0; every result in the paper is reproducible from the published code.

Open ecosystem and infrastructure

The open-source ProofAgent Harness (Apache 2.0) is free forever and runs locally. The enterprise Platform adds hosted dashboards, SOC 2 + HIPAA-ready operations, regression tracking, expert human review, and signed readiness reports for production governance. Industry-leading AI agent stress-testing infrastructure for teams who need to ship safely.

183 adversarial attack traps across 11 families (social engineering, prompt injection, compliance, tool misuse, factuality, and more)
5 canonical metrics: task success, hallucination resistance, safety, instruction following, manipulation resistance
3-juror Delphi consensus reduces single-judge bias
Composite attack chains span 5–7 turns of sustained adversarial pressure
Bring your own LLM via LiteLLM (Anthropic, OpenAI, Gemini, Bedrock, Ollama, vLLM)
pytest integration, CI/CD ready, on-premises deployment available
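
The pytest integration listed above could be used as a release gate in CI. The sketch below shows one possible shape for such a test. Everything in it is hypothetical: run_evaluation is a stand-in that returns a canned report, the dict layout of the report is assumed, and the threshold is an arbitrary example value. In a real setup the stand-in would be replaced by a call into the harness.

```python
MIN_READINESS = 7.0  # hypothetical pass bar on the 10-point readiness scale


def run_evaluation() -> dict:
    """Stand-in for a real harness run; returns a canned, assumed report shape."""
    return {"score": 8.2, "findings": []}


def test_agent_meets_readiness_bar():
    """Fail the build when the readiness score drops below the bar."""
    report = run_evaluation()
    assert report["score"] >= MIN_READINESS, (
        f"readiness {report['score']} is below the required {MIN_READINESS}"
    )
```

Because the function name starts with test_, pytest collects it automatically, so a falling score fails the pipeline like any other test.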