<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>ProofAgent Community Blog</title>
    <link>https://www.proofagent.ai/community/blog</link>
    <atom:link href="https://www.proofagent.ai/community/blog/rss.xml" rel="self" type="application/rss+xml" />
    <description>Research, case studies, findings, and tutorials on adversarial AI agent evaluation.</description>
    <language>en-us</language>
    <lastBuildDate>Fri, 19 Jun 2026 10:16:50 GMT</lastBuildDate>
    
    <item>
      <title>How to Evaluate a LangGraph Agent (Step by Step)</title>
      <link>https://www.proofagent.ai/community/blog/how-to-evaluate-a-langgraph-agent-step-by-step</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/how-to-evaluate-a-langgraph-agent-step-by-step</guid>
      <pubDate>Thu, 18 Jun 2026 22:30:56 GMT</pubDate>
      <description>A step-by-step guide to evaluating a LangGraph agent: build the LLM, tools, skills, and policy, then run adversarial evaluation with ProofAgent Harness and read the report.</description>
      
      
    </item>
    <item>
      <title>A Free 4B Model on a Laptop Audited a Claude Opus 4.8 Agent and Caught It Faking a Tool Call</title>
      <link>https://www.proofagent.ai/community/blog/free-4b-model-audits-claude-opus-4-8-agent</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/free-4b-model-audits-claude-opus-4-8-agent</guid>
      <pubDate>Thu, 18 Jun 2026 03:18:42 GMT</pubDate>
      <description>A free Gemma 3 4B model running locally on a laptop adversarially audited a production-grade Claude Opus 4.8 agent across 100 turns — matching a cloud evaluator within 0.13 and catching a faked tool call.</description>
      
      
    </item>
    <item>
      <title>ProofAgent-Harness v 0.5.1 — Release Notes</title>
      <link>https://www.proofagent.ai/community/blog/proofagent-harness-0-5-1-release-notes</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/proofagent-harness-0-5-1-release-notes</guid>
      <pubDate>Tue, 16 Jun 2026 01:09:23 GMT</pubDate>
      <description>Discover the latest features in the open-source, domain-aware test harness for AI agents. Evaluate agents via adversarial conversations or artifact review with robust scoring.</description>
      
      <category>ai-agent-evaluation</category><category>test-harness</category><category>llm-evaluation</category><category>ai-engineer</category><category>ml-researcher</category><category>devops</category><category>python-3-10</category><category>claude-sonnet</category><category>gpt-4-1-mini</category><category>lite-llm</category><category>adversarial-testing</category><category>artifact-review</category><category>zero-tolerance</category><category>jury-consensus</category><category>ai-safety</category>
    </item>
    <item>
      <title>How to Evaluate AI-Generated Artifacts (Business Plans, Specs, Code) with ProofAgent-Harness</title>
      <link>https://www.proofagent.ai/community/blog/how-to-evaluate-ai-generated-artifacts-business-plans-specs-code-with-proofagent</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/how-to-evaluate-ai-generated-artifacts-business-plans-specs-code-with-proofagent</guid>
      <pubDate>Tue, 16 Jun 2026 01:03:21 GMT</pubDate>
      <description>Learn how to evaluate AI-generated business plans, specs, and code using artifact-based grading. ProofAgent-Harness automates strict, claim-by-claim reviews.</description>
      
      <category>ai-artifact-evaluation</category><category>business-plan-review</category><category>specification-grading</category><category>code-quality-assessment</category><category>ai-engineer</category><category>product-manager</category><category>software-architect</category><category>proofagent-harness</category><category>artifact-mode</category><category>rubric-packs</category><category>document-validation</category><category>ground-truth-corpus</category><category>ai-deliverables</category><category>llm-evaluation</category><category>ai-safety</category>
    </item>
    <item>
      <title>What Claude Opus 4.8 Missed Under 300+ Adversarial Turns</title>
      <link>https://www.proofagent.ai/community/blog/what-claude-opus-4-8-missed-under-300-adversarial-turns-</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/what-claude-opus-4-8-missed-under-300-adversarial-turns-</guid>
      <pubDate>Wed, 03 Jun 2026 21:31:24 GMT</pubDate>
      <description>A 300+ turn adversarial evaluation revealed how Claude Opus 4.8 agents handle operational reliability, safety, and tool use under real-world pressure. Key gaps emerged.</description>
      
      <category>llm-evaluation</category><category>adversarial-testing</category><category>ai-agent</category><category>ai-engineer</category><category>compliance-team</category><category>healthcare-cto</category><category>claude</category><category>code-generation</category><category>financial-advisory</category><category>medical-triage</category><category>tool-use</category><category>operational-safety</category><category>ai-safety</category><category>opus4.8</category><category>claudeopus48</category>
    </item>
    <item>
      <title>Step by Step: Stress Test Your AI Agent in 10 Lines with ProofAgent Harness</title>
      <link>https://www.proofagent.ai/community/blog/evaluate-your-ai-agent-in-10-lines</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/evaluate-your-ai-agent-in-10-lines</guid>
      <pubDate>Wed, 27 May 2026 02:36:30 GMT</pubDate>
      <description>Learn how to stress test any AI agent in just 10 lines using a configurable harness for adversarial, multi-turn evaluation and evidence-linked reports. No agent rebuild required.</description>
      
      <category>ai-agent-testing</category><category>llm-evaluation</category><category>adversarial-testing</category><category>ai-engineer</category><category>mlops</category><category>security-team</category><category>claude</category><category>gpt-4</category><category>python</category><category>multi-turn-evaluation</category><category>agent-context</category><category>chatbot-evaluation</category><category>customer-support-bot</category><category>ai-safety</category><category>ai-evaluation</category>
    </item>
    <item>
      <title>ProofAgent Harness vs LangSmith, Phoenix, DeepEval, and Langfuse: The AI Agent Evaluation Stack in 2026</title>
      <link>https://www.proofagent.ai/community/blog/proofagent-harness-vs-langsmith-phoenix-deepeval-langfuse</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/proofagent-harness-vs-langsmith-phoenix-deepeval-langfuse</guid>
      <pubDate>Sun, 24 May 2026 18:04:46 GMT</pubDate>
      <description>AI agent evaluation is a multi-layered lifecycle involving pre-deployment testing, debugging, regression checks, and production monitoring. This article compares the leading tools that address these needs.</description>
      
      <category>ai-agent-evaluation</category><category>pre-deployment-testing</category><category>regression-testing</category><category>ai-engineer</category><category>product-manager</category><category>claude-opus-4-7</category><category>gpt-5-5</category><category>langchain</category><category>agent-debugging</category><category>production-observability</category>
    </item>
    <item>
      <title>When the Refusal Itself Crashes: A GPT 5.5 Privacy and Security Agent Case Study</title>
      <link>https://www.proofagent.ai/community/blog/when-the-refusal-itself-crashes-a-gpt-5-5-privacy-and-security-agent-case-study</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/when-the-refusal-itself-crashes-a-gpt-5-5-privacy-and-security-agent-case-study</guid>
      <pubDate>Sat, 23 May 2026 14:49:13 GMT</pubDate>
      <description>A privacy and security agent powered by GPT 5.5 resisted 25 turns of adversarial probing without leaking data. Yet upstream content filters caused refusal delivery to fail.</description>
      
      <category>privacy-security</category><category>ai-refusal</category><category>adversarial-evaluation</category><category>ai-engineer</category><category>security-team</category><category>privacy-ops</category><category>gpt-5-5</category><category>gemma-4b</category><category>harness-llm</category><category>content-filtering</category><category>prompt-injection</category><category>gdpr-compliance</category><category>audit-trail</category><category>llm-safety</category><category>ai-evaluation</category>
    </item>
    <item>
      <title>Claude Opus 4.7 Failed Tool Trace Test in Healthcare Triage</title>
      <link>https://www.proofagent.ai/community/blog/claude-opus-4-7-failed-tool-trace-test-healthcare-triage</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/claude-opus-4-7-failed-tool-trace-test-healthcare-triage</guid>
      <pubDate>Sat, 23 May 2026 03:30:47 GMT</pubDate>
      <description>Claude Opus 4.7 failed a safety-critical tool-call test in healthcare triage. ProofAgent Harness surfaced the gap, showing why adversarial evaluation is essential for deployment.</description>
      
      <category>agent-evaluation</category><category>adversarial-testing</category><category>tool-trace</category><category>healthcare-ai</category><category>#claudeopus47</category><category>#anthropic</category><category>#aiupdate</category><category>#claudeai</category>
    </item>
    <item>
      <title>2026 Trends in Adversarial AI Agent Evaluation: The Field Guide</title>
      <link>https://www.proofagent.ai/community/blog/2026-trends-in-adversarial-ai-agent-evaluation-the-field-guide</link>
      <guid isPermaLink="true">https://www.proofagent.ai/community/blog/2026-trends-in-adversarial-ai-agent-evaluation-the-field-guide</guid>
      <pubDate>Fri, 22 May 2026 23:07:17 GMT</pubDate>
      <description>Why adversarial multi-turn evaluation replaced static benchmarks for production AI agents in 2026. Covers red teaming, jailbreak patterns, and an evidence-based comparison of real tools.</description>
      <author>noreply@proofagent.ai (Dr. Fouad Bousetouane)</author>
      
    </item>
  </channel>
</rss>