Scoring Framework
Measure and track your codebase's agent-readiness across 11 dimensions
Summary
Quantifies agent-readiness on a 0–30 scale across 11 concrete dimensions: API, CLI, MCP, Discovery, Auth, Error Handling, Tool Design, Context Files, Multi-Agent, Testing, and Data Retrievability. Each dimension scores 0–3 (Not Implemented → Excellent); the total across applicable dimensions is then scaled to 30. Provides a baseline score, identifies gaps, and tracks progress as you modernize. Scale: 0–7 (Human-only) to 23–30 (Agent-first).
- 0 = Not implemented (blocker)
- 1 = Basic (weak, human-oriented)
- 2 = Good (agent-ready, most use cases work)
- 3 = Excellent (production, resilient, cross-framework)
- N/A = Genuinely inapplicable (rare)
Agent-readiness is not binary. A codebase's ability to be consumed by AI agents exists on a spectrum. The scoring framework quantifies that spectrum across 11 concrete dimensions, giving you a measurement today, a clear view of gaps, and a way to track progress.
Why Score?
Ambiguity that humans resolve through intuition becomes a failure mode at scale with agents. When an agent cannot ask for clarification, every implicit assumption, missing description, or hidden error path becomes a blocker. A 0–3 score per dimension tells you exactly where to invest: optimize the few surfaces agents actually touch, deprioritize human-only features.
The 0–3 Scale
Each dimension scores from 0 to 3:
- 0 (Not implemented) — Feature missing or unsuitable for agents. This dimension is a blocker.
- 1 (Basic) — Feature exists but weak. Human-oriented; terse; missing critical context. Agents can use it with heavy prompting.
- 2 (Good) — Solid implementation. Agent-oriented descriptions, structured data, clear examples. Most agent use cases work.
- 3 (Excellent) — Production-ready for agents. Optimized for token efficiency, cross-framework, resilient error handling, advanced patterns (OAuth, MCP Tasks, delegation). Agents perform reliably with minimal prompt engineering.
The 11 Dimensions
1. API Surface — How well your HTTP API is described for agent tool generation. OpenAPI quality, operationIds, agent-oriented descriptions, Arazzo workflow support.
2. CLI Design — How well your CLI tool works with agent automation. JSON output, exit codes, schema introspection, input hardening (path traversal defense), SKILL.md packaging.
3. MCP Server — Whether and how well you expose an MCP server. Tool count, annotations, resources, OAuth support, pagination, testing with InMemoryTransport.
4. Discovery & AEO — How discoverable, readable, governable, and callable your project is by agents. llms.txt, AGENTS.md, JSON-LD schema, Markdown content negotiation, robots.txt, Content Signals, API Catalog, MCP Server Cards, Agent Skills indexes, OAuth metadata, Web Bot Auth, and optional commerce protocols.
5. Authentication — Whether agents can authenticate without human browser interaction. API keys, OAuth 2.1 Client Credentials, Token Exchange (RFC 8693), agent identity as first-class principal.
6. Error Handling — Whether errors give agents enough information to recover. RFC 9457 Problem Details, is_retriable, suggestions array, trace_id, doc_uri, semantic exit codes.
7. Tool Design — Quality of tool definitions across any framework (MCP, AI SDK, LangChain). Descriptions as prompts ("when to use" and "do not use for"), typed schemas, toModelOutput, cross-framework portability.
8. Context Files — Quality of agent context files for AI coding assistants. AGENTS.md + multi-tool overrides (CLAUDE.md, .cursor/rules). Hand-curated, commands first, permission boundaries, progressive disclosure.
9. Multi-Agent Support — How well your project supports multi-agent orchestration. Sub-agent delegation, state management, A2A agent cards, memory patterns, dynamic agent selection.
10. Testing & Evaluation — Whether agent interactions are tested and evaluated. Tool routing accuracy, error recovery testing, multi-step flow tests, pass@k metrics, eval-driven development.
11. Data Retrievability — Whether project knowledge, documents, and user data are searchable and retrievable by agents. Semantic search, hybrid retrieval, reranking, metadata filters, RAG evals, and knowledge graphs.
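As one concrete illustration of dimension 6, an agent-recoverable error might serialize as an RFC 9457 Problem Details object carrying the extension members listed above (is_retriable, suggestions, trace_id, doc_uri). The URIs and values here are hypothetical:

```python
import json

# Sketch of an RFC 9457 Problem Details payload extended with the
# agent-oriented members from the Error Handling dimension.
# All concrete values are illustrative, not a real API's output.
problem = {
    "type": "https://example.com/errors/rate-limited",   # hypothetical URI
    "title": "Rate limit exceeded",
    "status": 429,
    "detail": "Limit of 100 requests/minute reached for this API key.",
    "is_retriable": True,
    "suggestions": [
        "Wait for the interval given in Retry-After before retrying",
        "Batch requests to stay under 100/minute",
    ],
    "trace_id": "req-0001",                              # hypothetical ID
    "doc_uri": "https://example.com/docs/rate-limits",   # hypothetical URL
}

body = json.dumps(problem, indent=2)
print(body)
```

A response like this would normally be served with the `application/problem+json` media type; the point for scoring is that an agent can branch on `is_retriable` and follow `doc_uri` without a human interpreting the error.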
Confidence Levels
Every score includes a confidence marker:
| Level | Meaning | When Appropriate |
|---|---|---|
| High | Examined >80% of relevant code | Small projects, focused dimensions |
| Medium | Examined key files representatively | Large projects, sampled intelligently |
| Low | Examined <30% or inferred from structure | Very large projects, time-constrained |
Low-confidence scores should be flagged for manual re-verification.
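The thresholds in the table above can be encoded directly. This is a sketch; estimating the coverage fraction itself is still the rater's job:

```python
def confidence(coverage: float) -> str:
    """Map the estimated fraction of relevant code examined to a
    confidence level, using the thresholds from the table:
    >80% is High, <30% is Low, anything in between is Medium."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    if coverage > 0.8:
        return "High"
    if coverage < 0.3:
        return "Low"
    return "Medium"

print(confidence(0.9))   # High
print(confidence(0.5))   # Medium
print(confidence(0.1))   # Low
```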
The N/A Trap
Not all dimensions apply to every project. Mark a dimension N/A only when it is genuinely inapplicable:
- CLI Design = N/A when your project is a web app with no CLI tool (but if you have any CLI at all, score it)
- Multi-Agent = N/A when your project does not orchestrate or coordinate agents
- Discovery & AEO = N/A when your project is a pure library with no web presence
"We haven't built an MCP server yet" is a 0, not N/A. "This project has no reason to expose an MCP server" might be N/A—but only if you are certain agents will never consume it.
Evidence Requirements
Every score must cite specific evidence:
- File paths with line numbers: `openapi.yaml:45` — description is 4 words
- Grep findings: no files match `/isError|is_retriable/`
- Glob results: 0 files found matching `**/*.mcp.json`
- Command output: `mytool --help --json` returns exit code 2 with no structured output
Scores without evidence are guesses. Guesses drift high due to optimism bias. Always ground scores in concrete findings from the codebase.
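One way to enforce this is to make evidence a required field in whatever structure records scores. A minimal sketch (the field names are hypothetical, not the scorecard schema this skill emits):

```python
from dataclasses import dataclass, field

@dataclass
class DimensionScore:
    """A score that cannot be recorded without at least one piece
    of evidence. Field names are illustrative only."""
    dimension: str
    score: int                     # 0-3 per the rubric
    confidence: str                # "High" | "Medium" | "Low"
    evidence: list[str] = field(default_factory=list)

    def __post_init__(self):
        if not 0 <= self.score <= 3:
            raise ValueError("score must be 0-3")
        if not self.evidence:
            # A score without evidence is a guess; refuse to record it.
            raise ValueError(f"{self.dimension}: no evidence cited")

s = DimensionScore(
    dimension="Error Handling",
    score=1,
    confidence="Medium",
    evidence=["grep: no files match /isError|is_retriable/"],
)
print(s.score)
```

Failing loudly at record time is a structural guard against optimism bias: the rater must go find a file path, grep result, or command output before a number exists at all.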
Overall Rating
Once all applicable dimensions are scored, the total is scaled to 0–30 and mapped to a human-readable rating:
| Total | Rating | Meaning |
|---|---|---|
| 0–7 | Human-only | Built for humans. Agents will struggle. |
| 8–14 | Agent-tolerant | Usable with heavy prompt engineering and injected context. |
| 15–22 | Agent-ready | Solid agent support. Most use cases work reliably. Few gaps. |
| 23–30 | Agent-first | Purpose-built for agents. Best in class. Minimal friction. |
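The band lookup is a straight threshold check over the scaled total; a minimal sketch:

```python
def rating(total: int) -> str:
    """Map a 0-30 scaled total to its rating band
    (thresholds taken from the table above)."""
    if not 0 <= total <= 30:
        raise ValueError("total must be in 0-30")
    if total <= 7:
        return "Human-only"
    if total <= 14:
        return "Agent-tolerant"
    if total <= 22:
        return "Agent-ready"
    return "Agent-first"

print(rating(16))  # Agent-ready
```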
What's Next
- Rubric — Full 0/1/2/3 criteria for each dimension
- Evidence — What counts as evidence and how to avoid optimism bias
- Scorecard Format — JSON schema the skill emits
- Delta Scoring — Computing improvements before/after transformation
- Clustering — Organizing findings into actionable clusters
- Calibration — Maintaining rubric consistency across raters and time