ProofAgent is the accountability platform for production AI agents. It turns agent risk into deployment evidence through adversarial multi-juror scoring, production log audits, artifact reviews, signed readiness reports, and human review. The platform is built around the open-source ProofAgent Harness.

How do I test my AI agent with ProofAgent?

Install the open-source harness with 'pip install proofagent-harness', wrap your agent in a function returning AgentResponse, then call Harness().evaluate(my_agent, role, goal, knowledge, context). The harness runs adversarial multi-turn sessions and returns a /10 readiness score with traceable findings and fix recommendations.

What is adversarial multi-juror scoring?

Adversarial multi-juror scoring is ProofAgent's evaluation approach: a planner picks domain traps, a conductor applies sustained pressure across 25+ turns, and three independent juror agents score every behavior change. No single LLM call ever decides the verdict — the jury agents reach consensus or debate to a final score.

Is ProofAgent SOC 2 / HIPAA / GDPR compliant?

ProofAgent is SOC 2 Type II aligned, HIPAA-ready (BAAs available for enterprise customers), and follows GDPR best practices. Enterprise customers can deploy on-premises or in a private cloud with SSO/SAML, RBAC, tamper-evident audit logs, TLS 1.2+ in transit, and AES-256 at rest.

Can I use my own LLM with ProofAgent?

Yes. ProofAgent is BYO Harness LLM — the harness internals can run on any LLM provider (OpenAI, Anthropic, Google, local models). You bring your own model and API key; the harness orchestrates the multi-juror evaluation around it.

What metrics does ProofAgent measure?

11+ production metrics including Task Success, Hallucination Control, Safety, Policy Compliance, Memory Stability, Tone and Empathy, Manipulation Resistance, Tool Picking, Reasoning Quality, Relevance, and Drift Detection. Every metric is anchored to per-turn transcript evidence.

What is the difference between ProofAgent Platform and ProofAgent Harness OSS?

ProofAgent Harness OSS is the open-source multi-turn adversarial testing engine — Tier 1 of the platform, available standalone for developers and CI under Apache 2.0. ProofAgent Platform is the enterprise product that adds the other four tiers (production log audit, artifact review, multi-agent orchestration scoring, expert human review), a hosted dashboard, REST API, governance features, signed readiness reports, and dedicated support.

ProofAgent Community Blog

Findings, research, and field notes
from adversarial agent evaluation.

Name: ProofAgent Platform
Brand: ProofAgent
Availability: InStock

Case studies, methodology deep-dives, release notes, and community contributions on evaluating production AI agents.

Tutorials

Step by Step: Stress Test Your AI Agent in 10 Lines with ProofAgent Harness

Learn how to stress test any AI agent in just 10 lines using a configurable harness for adversarial, multi-turn evaluation and evidence-linked reports. No agent rebuild required.

ProofAgent Team May 27, 2026 5 min read

Ecosystem

ProofAgent Harness vs LangSmith, Phoenix, DeepEval, and Langfuse: The AI Agent Evaluation Stack in 2026

AI agent evaluation is a multi-layered lifecycle involving pre deployment testing, debugging, regression checks, and production monitoring. This article compares top tools addressing these needs.

ProofAgent Team May 24, 2026 7 min read

Case Studies

When the Refusal Itself Crashes: A GPT 5.5 Privacy and Security Agent Case Study

A privacy and security agent powered by GPT 5.5 resisted 25 turns of adversarial probing without leaking data. Yet, upstream content filters caused refusal delivery failures.

ProofAgent Team May 23, 2026 5 min read

Case Studies

Claude Opus 4.7 Failed Tool Trace Test in Healthcare Triage

Claude Opus 4.7 failed a safety-critical tool-call test in healthcare triage. ProofAgent Harness surfaced the gap, showing why adversarial evaluation is essential for deployment.

ProofAgent Team May 23, 2026 9 min read

Research

2026 Trends in Adversarial AI Agent Evaluation: The Field Guide

Why adversarial multi-turn evaluation replaced static benchmarks for production AI agents in 2026. Red teaming, jailbreak patterns, real tool comparison, evidence-based. (158 chars)

Dr. Fouad Bousetouane May 22, 2026 8 min read