Metrics for Agent Evaluation
Defining pass@k, tool-routing accuracy, schema conformance, and other quantitative measures
Summary
Quantitative measures that bridge the gap between "the agent works" and "the agent reliably solves production tasks." Covers pass@k (at least one of k trials succeeds) and pass^k (all k trials succeed) for handling non-determinism, tool-routing accuracy (F1 score), schema conformance (does the output match the spec), trajectory correctness (did the agent take the right steps), and latency metrics. Core insight: agent metrics capture multi-step execution and failure modes that are invisible to unit tests.
- pass@k: At least one of k attempts succeeds (capability)
- pass^k: All k attempts succeed (consistency/reliability)
- Tool-routing F1: Precision and recall of correct tool selection
- Schema conformance: Output conforms to declared type/structure
- Trajectory correctness: Agent steps in right order, correct tool params
- Latency & token: Cost efficiency, speed benchmarks
Agent metrics bridge the gap between "the agent works" and "the agent reliably solves production tasks." Unlike LLM benchmarks that measure a single forward pass, agent metrics capture multi-step execution, non-determinism, and failure modes that are invisible to unit tests.
Core metrics
pass@k and pass^k
These metrics measure agent capability and consistency under non-determinism.
pass@k is the probability that at least one of k attempts succeeds. Use this when a single success demonstrates the agent is capable of the task.
// Empirical pass@k for one task: 1 if any of the first k trials succeeded, else 0.
// Averaging this indicator over a dataset gives the pass@k rate.
function calculatePassAtK(results: boolean[], k: number): number {
if (results.length < k) {
throw new Error(`Need at least ${k} trials, got ${results.length}`);
}
const successCount = results.slice(0, k).filter(r => r).length;
return successCount > 0 ? 1 : 0; // 1 if any pass, 0 if all fail
}
// Example: k=1..10
const trials = [true, false, true, false, true, false, false, false, false, false];
for (let k = 1; k <= 10; k++) {
console.log(`pass@${k}: ${calculatePassAtK(trials, k)}`);
}
// pass@1: 1 (first trial succeeded)
// pass@2: 1 (at least one of first two succeeded)
// pass@3: 1
// ...
// pass@10: 1 (at least one of all 10 succeeded)

pass^k (pass-caret-k) is the probability that all k attempts succeed. Use this when consistency is required — every execution must work.
function calculatePassCaretK(results: boolean[], k: number): number {
if (results.length < k) {
throw new Error(`Need at least ${k} trials, got ${results.length}`);
}
return results.slice(0, k).every(r => r) ? 1 : 0; // 1 only if all pass
}
// Same trials as above
for (let k = 1; k <= 10; k++) {
console.log(`pass^${k}: ${calculatePassCaretK(trials, k)}`);
}
// pass^1: 1 (first trial succeeded)
// pass^2: 0 (second trial failed, so not all succeeded)
// pass^3: 0
// ...
// pass^10: 0 (many later trials failed)

For production agents, aim for pass@1 close to pass@k. A gap between these means the agent is "lucky" on some runs — reliable systems have consistent success, not just possible success.
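The indicator above uses only the first k of the recorded trials. When more than k trials are available, the unbiased estimator from the HumanEval paper, pass@k = 1 - C(n - c, k) / C(n, k) for n trials and c successes, uses all of them. A minimal sketch (the function name is ours, not from any library):

```typescript
// Unbiased pass@k estimator (HumanEval): pass@k = 1 - C(n - c, k) / C(n, k),
// where n = total trials and c = successful trials.
// Computed as 1 - prod_{i = n-c+1}^{n} (1 - k / i) to avoid huge binomials.
function passAtKUnbiased(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // Too few failures to fill k slots without a success
  let prod = 1;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= 1 - k / i;
  }
  return 1 - prod;
}

// 3 successes in 10 trials
console.log(passAtKUnbiased(10, 3, 1).toFixed(2)); // "0.30" (equals c / n when k = 1)
console.log(passAtKUnbiased(10, 3, 5).toFixed(2)); // "0.92"
```

Unlike the first-k indicator, this estimate does not depend on the (arbitrary) order in which trials happened to run.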
Tool-routing accuracy
The fraction of tool calls where the agent selected the correct tool for its intent. Measured as precision, recall, and their harmonic mean (F1).
interface ToolCall {
name: string;
arguments: Record<string, unknown>;
}
function calculateToolF1(
expected: string[],
actual: string[]
): { precision: number; recall: number; f1: number } {
const expectedSet = new Set(expected);
const actualSet = new Set(actual);
const truePositives = Array.from(actualSet).filter(t => expectedSet.has(t)).length;
const falsePositives = Array.from(actualSet).filter(t => !expectedSet.has(t)).length;
const falseNegatives = Array.from(expectedSet).filter(t => !actualSet.has(t)).length;
const precision = (truePositives + falsePositives === 0)
? 1
: truePositives / (truePositives + falsePositives);
const recall = (truePositives + falseNegatives === 0)
? 1
: truePositives / (truePositives + falseNegatives);
const f1 = (precision + recall === 0)
? 0
: 2 * (precision * recall) / (precision + recall);
return { precision, recall, f1 };
}
// Example
const expected = ["search_docs", "read_file", "summarize"];
const actual = ["search_docs", "search_docs", "read_file", "summarize"];
const { f1 } = calculateToolF1(expected, actual);
console.log(`Tool F1: ${f1.toFixed(2)}`); // 1.00 (the sets deduplicate calls, so repetition is not penalized)

Schema conformance rate
Fraction of tool calls where all required parameters are present and correctly typed. Measured by validating against the tool's JSON schema.
interface ToolSchema {
properties: Record<string, { type: string }>;
required: string[];
}
function conformanceRate(
calls: ToolCall[],
schemas: Record<string, ToolSchema>
): number {
let conformant = 0;
for (const call of calls) {
const schema = schemas[call.name];
if (!schema) continue;
// Check required fields
const hasAllRequired = schema.required.every(
field => call.arguments[field] !== undefined
);
// Check types
const typesCorrect = Object.entries(call.arguments).every(
([key, value]) => {
const spec = schema.properties[key];
if (!spec) return false; // Unknown field
const actualType = typeof value;
return (spec.type === actualType) ||
(spec.type === "integer" && actualType === "number"); // JSON Schema "integer" maps to a JS number
}
);
if (hasAllRequired && typesCorrect) {
conformant++;
}
}
return calls.length > 0 ? conformant / calls.length : 1;
}

Mean time to recovery (MTTR)
For agents that retry on failure, MTTR measures how many attempts it takes to recover after a transient error (rate limit, timeout, service unavailable). Despite the name, it is measured here in retry steps rather than wall-clock time.
interface Attempt {
timestamp: number;
success: boolean;
error?: string;
}
function mttr(attempts: Attempt[]): number {
let totalRecoverySteps = 0;
let recoveryCount = 0;
let failureStart: number | null = null;
for (let i = 0; i < attempts.length; i++) {
if (!attempts[i].success) {
if (failureStart === null) {
failureStart = i;
}
} else if (failureStart !== null) {
totalRecoverySteps += i - failureStart;
recoveryCount++;
failureStart = null;
}
}
return recoveryCount > 0 ? totalRecoverySteps / recoveryCount : 0;
}
// Example: agent failed, retried, succeeded after 2 more attempts
const attempts = [
{ timestamp: 1, success: false },
{ timestamp: 2, success: false },
{ timestamp: 3, success: true },
];
console.log(`MTTR: ${mttr(attempts)} steps`); // 2

Hallucination rate
Fraction of tool calls where the agent invoked a tool that does not exist or passed arguments that violate the schema. Hallucinations indicate the agent's reasoning has diverged from the actual tool definitions.
function hallucinationRate(
calls: ToolCall[],
availableTools: Set<string>,
schemas: Record<string, ToolSchema>
): number {
let hallucinations = 0;
for (const call of calls) {
// Non-existent tool
if (!availableTools.has(call.name)) {
hallucinations++;
continue;
}
// Invalid schema
const schema = schemas[call.name];
const missingRequired = schema.required.some(
f => call.arguments[f] === undefined
);
if (missingRequired) {
hallucinations++;
}
}
return calls.length > 0 ? hallucinations / calls.length : 0;
}
const tools = new Set(["search", "read_file", "write_file"]);
// Minimal schemas for the example (read_file requires 'path')
const schemas: Record<string, ToolSchema> = {
  search: { properties: { q: { type: "string" } }, required: ["q"] },
  read_file: { properties: { path: { type: "string" } }, required: ["path"] },
  write_file: { properties: { path: { type: "string" }, content: { type: "string" } }, required: ["path", "content"] },
};
const badCalls = [
  { name: "search", arguments: { q: "test" } },
  { name: "delete_file", arguments: {} }, // Doesn't exist!
  { name: "read_file", arguments: {} }, // Missing required 'path'
];
console.log(`Hallucination rate: ${hallucinationRate(badCalls, tools, schemas).toFixed(2)}`);
// 0.67 (2 of 3 calls are hallucinations)

Composite metrics
Cost per task
Total API cost divided by successful tasks completed. Accounts for retries, LLM-as-judge costs, and model selection.
interface TaskExecution {
success: boolean;
tokenCost: number; // Sum of input + output token costs
judgeCost?: number; // LLM-as-judge cost if applicable
}
function costPerTask(executions: TaskExecution[]): number {
const successful = executions.filter(e => e.success).length;
const totalCost = executions.reduce((sum, e) =>
sum + e.tokenCost + (e.judgeCost ?? 0), 0
);
return successful > 0 ? totalCost / successful : Infinity;
}

Latency percentiles
End-to-end latency for task completion, reported as p50, p95, p99.
function percentile(values: number[], p: number): number {
const sorted = [...values].sort((a, b) => a - b);
const index = Math.ceil((p / 100) * sorted.length) - 1;
return sorted[Math.max(0, index)];
}
const latencies = [100, 200, 250, 300, 500, 1000, 1500, 2000];
console.log(`p50: ${percentile(latencies, 50)}ms`);
console.log(`p95: ${percentile(latencies, 95)}ms`);
console.log(`p99: ${percentile(latencies, 99)}ms`); // 2000ms (p50 is 300ms, p95 is 2000ms)

When to measure what
- Capability evals — use pass@k, F1, schema conformance to benchmark agent ability on new features
- Regression evals — use pass@1 (or pass^5 for critical systems) to detect breakage in CI
- Production monitoring — track pass@1, MTTR, cost, hallucination rate in real time
- Debugging — capture full traces with tool-routing accuracy and parameter details to understand failure modes
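As a sketch of the regression case, a CI gate can enforce pass^k per critical task. `regressionGate` and `runTask` below are assumed harness hooks for illustration, not part of any real framework:

```typescript
// Hypothetical CI regression gate enforcing pass^k on a list of critical tasks.
// `runTask` is an assumed harness function (task id -> success/failure).
function regressionGate(
  tasks: string[],
  runTask: (task: string) => boolean,
  k: number = 5
): boolean {
  for (const task of tasks) {
    for (let trial = 1; trial <= k; trial++) {
      if (!runTask(task)) {
        // pass^k requires every trial to succeed, so one failure fails the gate
        console.error(`FAIL: ${task} on trial ${trial} of ${k}`);
        return false;
      }
    }
  }
  return true;
}

// Example with a stub runner: a flaky task fails the gate
let calls = 0;
const flaky = (task: string) => task !== "refund" || ++calls % 3 !== 0;
console.log(regressionGate(["search", "refund"], flaky)); // false
```

Running every task k times is k times the cost of a pass@1 suite, which is why pass^k gates are usually reserved for a small set of critical tasks.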
See OpenTelemetry GenAI semantic conventions for how to instrument code to collect these metrics.
References
- Anthropic's demystifying-evals article for the pass@k / pass^k definitions
- OpenAI's HumanEval paper, which introduced pass@k as the standard metric for code generation
See also
- /docs/testing/evaluation-framework — dataset and grader design
- /docs/testing/llm-as-judge — calibrating judge models for hallucination detection
- /docs/testing/observability — instrument agents to collect metrics