Testing

Building evals, metrics, and red-team suites to measure agent reliability

Summary

Agent testing measures statistical performance under non-determinism, not deterministic behavior. This section covers the full lifecycle: evaluation frameworks and metrics (pass@k, tool-routing accuracy), LLM-as-judge calibration, specific platforms (Braintrust, Promptfoo, Vitest), red-teaming for injection and jailbreak attacks, observability, and CI/CD integration. Start with the evaluation framework, then pick a platform that matches your workflow.

  • Metrics: pass@k, pass^k, MTTR, tool-routing F1, schema conformance
  • LLM-as-judge: bias calibration, when to use
  • Platforms: Braintrust (managed), Promptfoo (red-team), Vitest (in-process)
  • Red-teaming: injection, jailbreak, OWASP LLM Top 10
  • Observability: OpenTelemetry GenAI conventions
  • CI integration: PR-gated evals, budget guardrails

Agent testing is fundamentally different from traditional software testing. Traditional testing verifies deterministic behavior — given input X, does output Y occur every time? Agent testing measures statistical performance under non-determinism. An agent may take different paths, call tools in different orders, or succeed on the fifth attempt when it failed on the first. A single success or failure tells you nothing.
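This is why metrics like pass@k exist: given n recorded attempts of which c succeeded, pass@k estimates the probability that at least one of k sampled attempts succeeds, while pass^k estimates the probability that all k succeed. A minimal TypeScript sketch using the standard unbiased pass@k estimator (function names are illustrative):

```typescript
// Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), computed as a
// running product to avoid large binomial coefficients.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer than k failures: every k-draw contains a success
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= (i - k) / i; // accumulates C(n-c, k) / C(n, k)
  }
  return 1.0 - prod;
}

// Naive pass^k estimate: probability all k independent attempts succeed.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}
```

With 5 successes out of 10 attempts, pass@1 is 0.5 but pass^2 is only 0.25, which is why pass^k is the stricter reliability bar for agents that must succeed consistently.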

This section covers the full testing lifecycle: from evaluation frameworks and metrics (pass@k, tool-routing accuracy) through LLM-as-judge techniques, specific platforms (Braintrust, Promptfoo, Vitest), red-teaming, observability instrumentation, and CI integration.

Quick reference

  • Metrics (/docs/testing/metrics) — pass@k, pass^k, MTTR, tool-routing F1, schema conformance, hallucination rates
  • LLM-as-judge (/docs/testing/llm-as-judge) — when to use judge models, their biases, and calibration techniques
  • Evaluation framework (/docs/testing/evaluation-framework) — dataset, task, scorer, aggregation loop
  • Braintrust (/docs/testing/braintrust) — end-to-end eval platform with dataset versioning and trace integration
  • Promptfoo (/docs/testing/promptfoo) — YAML-first evals with strong red-teaming support
  • Vitest harness (/docs/testing/vitest-harness) — in-process evals, useful for fast iteration and local debugging
  • Red-teaming (/docs/testing/red-teaming) — prompt injection, jailbreak, and data exfiltration attack templates; OWASP LLM Top 10 mapping
  • Observability (/docs/testing/observability) — OpenTelemetry GenAI semantic conventions for trace-driven debugging
  • CI integration (/docs/testing/ci-integration) — PR-gated evals, regression tests, statistical significance, budget guardrails
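To make one of these concrete: a schema-conformance scorer can be as simple as checking that the agent's raw output parses as JSON and carries the required fields. A hedged sketch (the function and its field-spec shape are illustrative, not an API of any platform listed above):

```typescript
// Hypothetical schema-conformance scorer: returns the fraction of required
// fields that are present with the expected `typeof`, or 0 for invalid JSON.
function schemaConformance(raw: string, required: Record<string, string>): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return 0; // output is not even valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return 0;
  const obj = parsed as Record<string, unknown>;
  const keys = Object.keys(required);
  const ok = keys.filter((k) => typeof obj[k] === required[k]).length;
  return ok / keys.length; // partial credit for partially conforming output
}
```

Returning a fraction rather than a boolean lets the same scorer feed both hard CI gates (require 1.0) and trend dashboards (watch the mean drift).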

Start with the evaluation framework, then pick a platform that fits your workflow. If you're running Vitest-based agents, use the Vitest harness. If you want managed experiment tracking, use Braintrust. For quick red-teaming, Promptfoo is the fastest path.
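Whichever platform you choose, the evaluation framework's dataset → task → scorer → aggregation loop is the common core. A minimal sketch (synchronous and with illustrative names for brevity; real agent calls are async):

```typescript
// Minimal eval loop: run each case several times and average the scores,
// since a single run of a non-deterministic agent proves little.
interface EvalCase { input: string; expected: string }
type Task = (input: string) => string;                      // the agent under test
type Scorer = (output: string, expected: string) => number; // score in [0, 1]

function runEval(dataset: EvalCase[], task: Task, scorer: Scorer, trials = 5): number {
  let total = 0;
  for (const c of dataset) {
    let sum = 0;
    for (let t = 0; t < trials; t++) sum += scorer(task(c.input), c.expected);
    total += sum / trials; // per-case mean over repeated trials
  }
  return total / dataset.length; // dataset-level mean score
}
```

Every platform in the quick reference is some elaboration of this loop: Braintrust adds experiment tracking around it, Promptfoo expresses it in YAML, and the Vitest harness runs it in-process.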
