Testing

Building evals, metrics, and red-team suites to measure agent reliability

Summary

Agent testing measures statistical performance under non-determinism, not deterministic behavior. This section covers the full lifecycle: evaluation frameworks and metrics (pass@k, tool-routing accuracy), LLM-as-judge calibration, specific platforms (Braintrust, Promptfoo, Vitest), red-teaming for injection and jailbreak attacks, observability, and CI/CD integration. Start with the evaluation framework, then pick a platform that matches your workflow.

  • Metrics: pass@k, pass^k, MTTR, tool-routing F1, schema conformance
  • LLM-as-judge: bias calibration, when to use
  • Platforms: Braintrust (managed), Promptfoo (red-team), Vitest (in-process)
  • Red-teaming: injection, jailbreak, OWASP LLM Top 10
  • Observability: OpenTelemetry GenAI conventions
  • CI integration: PR-gated evals, budget guardrails

Agent testing is fundamentally different from traditional software testing. Traditional testing verifies deterministic behavior — given input X, does output Y occur every time? Agent testing measures statistical performance under non-determinism. An agent may take different paths, call tools in different orders, or succeed on the fifth attempt when it failed on the first. A single success or failure tells you nothing.
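This is why metrics like pass@k exist: given n recorded attempts of which c succeeded, pass@k estimates the probability that at least one of k sampled attempts succeeds, while pass^k estimates the probability that all k succeed. A minimal TypeScript sketch using the standard unbiased pass@k estimator (function names are illustrative):

```typescript
// Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), computed as a
// running product to avoid large binomial coefficients.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer than k failures: every k-draw contains a success
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= (i - k) / i; // accumulates C(n-c, k) / C(n, k)
  }
  return 1.0 - prod;
}

// Naive pass^k estimate: probability all k independent attempts succeed.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}
```

With 5 successes out of 10 attempts, pass@1 is 0.5 but pass^2 is only 0.25, which is why pass^k is the stricter reliability bar for agents that must succeed consistently.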

This section covers the full testing lifecycle: from evaluation frameworks and metrics (pass@k, tool-routing accuracy) through LLM-as-judge techniques, specific platforms (Braintrust, Promptfoo, Vitest), red-teaming, observability instrumentation, and CI integration.

Quick reference

  • Metrics (/docs/testing/metrics) — pass@k, pass^k, MTTR, tool-routing F1, schema conformance, hallucination rates
  • LLM-as-judge (/docs/testing/llm-as-judge) — when to use judge models, their biases, and calibration techniques
  • Evaluation framework (/docs/testing/evaluation-framework) — dataset, task, scorer, aggregation loop
  • Braintrust (/docs/testing/braintrust) — end-to-end eval platform with dataset versioning and trace integration
  • Promptfoo (/docs/testing/promptfoo) — YAML-first evals with strong red-teaming support
  • Vitest harness (/docs/testing/vitest-harness) — in-process evals, useful for fast iteration and local debugging
  • Red-teaming (/docs/testing/red-teaming) — prompt injection, jailbreak, and data exfiltration attack templates; OWASP LLM Top 10 mapping
  • Observability (/docs/testing/observability) — OpenTelemetry GenAI semantic conventions for trace-driven debugging
  • CI integration (/docs/testing/ci-integration) — PR-gated evals, regression tests, statistical significance, budget guardrails
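To make one of these concrete: a schema-conformance scorer can be as simple as checking that the agent's raw output parses as JSON and carries the required fields. A hedged sketch (the function and its field-spec shape are illustrative, not an API of any platform listed above):

```typescript
// Hypothetical schema-conformance scorer: returns the fraction of required
// fields that are present with the expected `typeof`, or 0 for invalid JSON.
function schemaConformance(raw: string, required: Record<string, string>): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return 0; // output is not even valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return 0;
  const obj = parsed as Record<string, unknown>;
  const keys = Object.keys(required);
  const ok = keys.filter((k) => typeof obj[k] === required[k]).length;
  return ok / keys.length; // partial credit for partially conforming output
}
```

Returning a fraction rather than a boolean lets the same scorer feed both hard CI gates (require 1.0) and trend dashboards (watch the mean drift).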

Start with the evaluation framework, then pick a platform that fits your workflow. If you're running Vitest-based agents, use the Vitest harness. If you want managed experiment tracking, use Braintrust. For quick red-teaming, Promptfoo is the fastest path.
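Whichever platform you choose, the evaluation framework's dataset → task → scorer → aggregation loop is the common core. A minimal sketch (synchronous and with illustrative names for brevity; real agent calls are async):

```typescript
// Minimal eval loop: run each case several times and average the scores,
// since a single run of a non-deterministic agent proves little.
interface EvalCase { input: string; expected: string }
type Task = (input: string) => string;                      // the agent under test
type Scorer = (output: string, expected: string) => number; // score in [0, 1]

function runEval(dataset: EvalCase[], task: Task, scorer: Scorer, trials = 5): number {
  let total = 0;
  for (const c of dataset) {
    let sum = 0;
    for (let t = 0; t < trials; t++) sum += scorer(task(c.input), c.expected);
    total += sum / trials; // per-case mean over repeated trials
  }
  return total / dataset.length; // dataset-level mean score
}
```

Every platform in the quick reference is some elaboration of this loop: Braintrust adds experiment tracking around it, Promptfoo expresses it in YAML, and the Vitest harness runs it in-process.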
