Human-on-the-Bridge (arXiv:2606.16871) by Dr. Fouad Bousetouane is the research paradigm behind the open-source ProofAgent Harness. Instead of putting a human reviewer in the loop on every agent run, it curates human expertise once — upstream — into reusable traps, juror personas, and scoring rubrics, then lets small evaluator models stress-test frontier-grade AI agents at scale.
A captain on the bridge sets the course, the rules, and the instruments rather than steering every wave. Human-on-the-Bridge applies the same separation to AI agent evaluation: experts invest judgment once into the evaluation machinery, and that machinery then runs automatically on every agent and every release — rigorous because it carries real expertise, scalable because it runs locally and cheaply without a reviewer babysitting each transcript.
Production-grade agents built on frontier LLMs (GPT 5.5, Claude Opus 4.7) fail under sustained, composite adversarial pressure — and a small, local evaluator model armed with pre-curated expert judgment can reliably catch it. Frontier models alone are not enough; the agent layer needs its own evaluation infrastructure, and that infrastructure does not have to be as large as the agent it tests. The companion core whitepaper (arXiv:2605.24134) documents the full pipeline and experimental results.
Plain language explainer on Medium: Human-on-the-Bridge (HoB): The New Paradigm for Scaling AI Agent Evaluation.