Scoring Framework
Measure and track your codebase's agent-readiness across 11 dimensions
Summary
Quantifies agent-readiness on a 0–30 scale across 11 concrete dimensions: API, CLI, MCP, Discovery, Auth, Error Handling, Tool Design, Context Files, Multi-Agent, Testing, and Data Retrievability. Each dimension scores 0–3 (Not Implemented → Excellent); the total across applicable dimensions is then scaled to 30. Provides a baseline score, identifies gaps, and tracks progress as you modernize. Scale: 0–7 (Human-only) to 23–30 (Agent-first).
- 0 = Not implemented (blocker)
- 1 = Basic (weak, human-oriented)
- 2 = Good (agent-ready, most use cases work)
- 3 = Excellent (production, resilient, cross-framework)
- N/A = Genuinely inapplicable (rare)
Agent-readiness is not binary. A codebase's ability to be consumed by AI agents exists on a spectrum. The scoring framework quantifies that spectrum across 11 concrete dimensions, giving you a measurement today, a clear view of gaps, and a way to track progress.
Why Score?
Ambiguity that humans resolve through intuition becomes a failure mode at scale with agents. When an agent cannot ask for clarification, every implicit assumption, missing description, or hidden error path becomes a blocker. A 0–3 score per dimension tells you exactly where to invest: optimize the few surfaces agents actually touch, deprioritize human-only features.
The 0–3 Scale
Each dimension scores from 0 to 3:
- 0 (Not implemented) — Feature missing or unsuitable for agents. This dimension is a blocker.
- 1 (Basic) — Feature exists but weak. Human-oriented; terse; missing critical context. Agents can use it with heavy prompting.
- 2 (Good) — Solid implementation. Agent-oriented descriptions, structured data, clear examples. Most agent use cases work.
- 3 (Excellent) — Production-ready for agents. Optimized for token efficiency, cross-framework, resilient error handling, advanced patterns (OAuth, MCP Tasks, delegation). Agents perform reliably with minimal prompt engineering.
The 11 Dimensions
1. API Surface — How well your HTTP API is described for agent tool generation. OpenAPI quality, operationIds, agent-oriented descriptions, Arazzo workflow support.
2. CLI Design — How well your CLI tool works with agent automation. JSON output, exit codes, schema introspection, input hardening (path traversal defense), SKILL.md packaging.
3. MCP Server — Whether and how well you expose an MCP server. Tool count, annotations, resources, OAuth support, pagination, testing with InMemoryTransport.
4. Discovery & AEO — How discoverable, readable, governable, and callable your project is by agents. llms.txt, AGENTS.md, JSON-LD schema, Markdown content negotiation, robots.txt, Content Signals, API Catalog, MCP Server Cards, Agent Skills indexes, OAuth metadata, Web Bot Auth, and optional commerce protocols.
5. Authentication — Whether agents can authenticate without human browser interaction. API keys, OAuth 2.1 Client Credentials, Token Exchange (RFC 8693), agent identity as first-class principal.
6. Error Handling — Whether errors give agents enough information to recover. RFC 9457 Problem Details, is_retriable, suggestions array, trace_id, doc_uri, semantic exit codes.
7. Tool Design — Quality of tool definitions across any framework (MCP, AI SDK, LangChain). Descriptions as prompts ("when to use" and "do not use for"), typed schemas, toModelOutput, cross-framework portability.
8. Context Files — Quality of agent context files for AI coding assistants. AGENTS.md + multi-tool overrides (CLAUDE.md, .cursor/rules). Hand-curated, commands first, permission boundaries, progressive disclosure.
9. Multi-Agent Support — How well your project supports multi-agent orchestration. Sub-agent delegation, state management, A2A agent cards, memory patterns, dynamic agent selection.
10. Testing & Evaluation — Whether agent interactions are tested and evaluated. Tool routing accuracy, error recovery testing, multi-step flow tests, pass@k metrics, eval-driven development.
11. Data Retrievability — Whether project knowledge, documents, and user data are searchable and retrievable by agents. Semantic search, hybrid retrieval, reranking, metadata filters, RAG evals, and knowledge graphs.
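As one concrete illustration of dimension 6, an agent-recoverable error might serialize as an RFC 9457 Problem Details object carrying the extension members listed above (is_retriable, suggestions, trace_id, doc_uri). The URIs and values here are hypothetical:

```python
import json

# Sketch of an RFC 9457 Problem Details payload extended with the
# agent-oriented members from the Error Handling dimension.
# All concrete values are illustrative, not a real API's output.
problem = {
    "type": "https://example.com/errors/rate-limited",   # hypothetical URI
    "title": "Rate limit exceeded",
    "status": 429,
    "detail": "Limit of 100 requests/minute reached for this API key.",
    "is_retriable": True,
    "suggestions": [
        "Wait for the interval given in Retry-After before retrying",
        "Batch requests to stay under 100/minute",
    ],
    "trace_id": "req-0001",                              # hypothetical ID
    "doc_uri": "https://example.com/docs/rate-limits",   # hypothetical URL
}

body = json.dumps(problem, indent=2)
print(body)
```

A response like this would normally be served with the `application/problem+json` media type; the point for scoring is that an agent can branch on `is_retriable` and follow `doc_uri` without a human interpreting the error.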
Confidence Levels
Every score includes a confidence marker:
| Level | Meaning | When Appropriate |
|---|---|---|
| High | Examined >80% of relevant code | Small projects, focused dimensions |
| Medium | Examined key files representatively | Large projects, sampled intelligently |
| Low | Examined <30% or inferred from structure | Very large projects, time-constrained |
Low-confidence scores should be flagged for manual re-verification.
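The thresholds in the table above can be encoded directly. This is a sketch; estimating the coverage fraction itself is still the rater's job:

```python
def confidence(coverage: float) -> str:
    """Map the estimated fraction of relevant code examined to a
    confidence level, using the thresholds from the table:
    >80% is High, <30% is Low, anything in between is Medium."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    if coverage > 0.8:
        return "High"
    if coverage < 0.3:
        return "Low"
    return "Medium"

print(confidence(0.9))   # High
print(confidence(0.5))   # Medium
print(confidence(0.1))   # Low
```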
The N/A Trap
Not all dimensions apply to every project. Mark a dimension N/A only when it is genuinely inapplicable:
- CLI Design = N/A when your project is a web app with no CLI tool (but if you have any CLI at all, score it)
- Multi-Agent = N/A when your project does not orchestrate or coordinate agents
- Discovery & AEO = N/A when your project is a pure library with no web presence
"We haven't built an MCP server yet" is a 0, not N/A. "This project has no reason to expose an MCP server" might be N/A—but only if you are certain agents will never consume it.
Evidence Requirements
Every score must cite specific evidence:
- File paths with line numbers: `openapi.yaml:45` — description is 4 words
- Grep findings: no files match `/isError|is_retriable/`
- Glob results: 0 files found matching `**/*.mcp.json`
- Command output: `mytool --help --json` returns exit code 2 with no structured output
Scores without evidence are guesses. Guesses drift high due to optimism bias. Always ground scores in concrete findings from the codebase.
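One way to enforce this is to make evidence a required field in whatever structure records scores. A minimal sketch (the field names are hypothetical, not the scorecard schema this skill emits):

```python
from dataclasses import dataclass, field

@dataclass
class DimensionScore:
    """A score that cannot be recorded without at least one piece
    of evidence. Field names are illustrative only."""
    dimension: str
    score: int                     # 0-3 per the rubric
    confidence: str                # "High" | "Medium" | "Low"
    evidence: list[str] = field(default_factory=list)

    def __post_init__(self):
        if not 0 <= self.score <= 3:
            raise ValueError("score must be 0-3")
        if not self.evidence:
            # A score without evidence is a guess; refuse to record it.
            raise ValueError(f"{self.dimension}: no evidence cited")

s = DimensionScore(
    dimension="Error Handling",
    score=1,
    confidence="Medium",
    evidence=["grep: no files match /isError|is_retriable/"],
)
print(s.score)
```

Failing loudly at record time is a structural guard against optimism bias: the rater must go find a file path, grep result, or command output before a number exists at all.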
Overall Rating
Once all applicable dimensions are scored, the total is scaled to 0–30 and mapped to a human-readable rating:
| Total | Rating | Meaning |
|---|---|---|
| 0–7 | Human-only | Built for humans. Agents will struggle. |
| 8–14 | Agent-tolerant | Usable with heavy prompt engineering and injected context. |
| 15–22 | Agent-ready | Solid agent support. Most use cases work reliably. Few gaps. |
| 23–30 | Agent-first | Purpose-built for agents. Best in class. Minimal friction. |
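The band lookup is a straight threshold check over the scaled total; a minimal sketch:

```python
def rating(total: int) -> str:
    """Map a 0-30 scaled total to its rating band
    (thresholds taken from the table above)."""
    if not 0 <= total <= 30:
        raise ValueError("total must be in 0-30")
    if total <= 7:
        return "Human-only"
    if total <= 14:
        return "Agent-tolerant"
    if total <= 22:
        return "Agent-ready"
    return "Agent-first"

print(rating(16))  # Agent-ready
```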
What's Next
- Rubric — Full 0/1/2/3 criteria for each dimension
- Evidence — What counts as evidence and how to avoid optimism bias
- Scorecard Format — JSON schema the skill emits
- Delta Scoring — Computing improvements before/after transformation
- Clustering — Organizing findings into actionable clusters
- Calibration — Maintaining rubric consistency across raters and time