Agent Surface

Semantic Tool Selection at Scale

At >20 tools, use embedding-based selection. Pick ~12 per turn with cosine similarity + prepareStep hook.

Summary

Passing 50+ tool definitions to the model per request wastes tokens and dilutes focus. Semantic tool selection uses embeddings to pick ~12 relevant tools per turn based on the user's message. The "toolpick" pattern: embed tool descriptions once at startup, cache them, then at each step use cosine similarity to rank tools and pass only the top candidates.

  • 20-tool threshold: Beyond ~20 tools, the token cost of listing every tool plus the resulting context dilution outweighs the benefit of having them all pre-loaded.
  • ~12 tools per turn: Empirically optimal; balances discovery vs. context window.
  • "Always active" set: Small, critical tools (web_search, search_tools, meta-tools) always available.
  • Embedding cache: Compute embeddings once at startup; reuse across requests (Redis or file cache).
  • prepareStep hook: Vercel AI SDK's integration point; called before model receives tools.
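The ranking step in the summary can be sketched directly. This is an illustrative implementation, not toolpick's internals; the `Embedded` type and function names are hypothetical:

```typescript
// One entry per tool: its name and a precomputed embedding vector.
type Embedded = { name: string; vector: number[] };

// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank all tools against the query embedding; keep the top N names.
function selectTopTools(query: number[], tools: Embedded[], maxTools: number): string[] {
  return tools
    .map((t) => ({ name: t.name, score: cosineSimilarity(query, t.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxTools)
    .map((t) => t.name);
}
```

The expensive part (embedding tool descriptions) happens once; per request, only the user's message is embedded and the ranking above is pure arithmetic.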

The Problem

// ❌ Don't do this at scale
const agent = new ToolLoopAgent({
  model: openai("gpt-4o-mini"),
  instructions: systemPrompt,
  tools: {
    // 50+ tools, all passed to every request
    customers_list: {...},
    customers_get: {...},
    customers_create: {...},
    orders_list: {...},
    orders_get: {...},
    orders_update: {...},
    invoices_list: {...},
    // ... 40 more ...
  },
});

Cost: ~500 tokens spent on tool definitions alone, before any conversation content. The model also wastes attention parsing irrelevant tools (e.g., order tools when the user is asking about customers).
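As a rough sanity check on cost figures like this, you can estimate the token footprint of a serialized tool list with the common chars/4 heuristic. This is illustrative only; a real tokenizer will give different numbers:

```typescript
// Minimal stand-in for a tool definition as it would be serialized for the model.
type ToolDef = { name: string; description: string; parameters: object };

// Rough token estimate: ~4 characters per token on average English/JSON text.
function estimateToolListTokens(tools: ToolDef[]): number {
  const serialized = JSON.stringify(tools);
  return Math.ceil(serialized.length / 4);
}
```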


The Pattern: Embedding-Based Selection

The production app uses the toolpick library with OpenAI embeddings:

// chat/tools.ts
import { createToolIndex, fileCache, type ToolIndex } from "toolpick";
import { openai } from "@ai-sdk/openai";
import type { PrepareStepFunction } from "ai";

let cachedIndex: ToolIndex | null = null;

export async function ensureToolIndex(ctx: McpContext) {
  if (cachedIndex) return cachedIndex;

  // Step 1: Get all tool definitions from MCP
  const toolDefinitions = await getMcpToolDefinitions();

  // Step 2: Embed tool descriptions (one-time cost)
  const index = await createToolIndex(toolDefinitions, {
    embeddingModel: openai.embeddingModel("text-embedding-3-small"),
    // Cache embeddings to disk; reuse across restarts
    embeddingCache: fileCache(".toolpick-cache.json"),
    // Cross-domain dependency graph: when tool A is selected,
    // its related tools are pre-loaded even if embeddings wouldn't
    // have picked them. Prevents mid-workflow discovery gaps.
    relatedTools: {
      invoices_create: ["customers_list"],          // Need customer to invoice
      invoices_create_from_tracker: ["customers_list"],
      invoices_recurring_create: ["customers_list"],
      tracker_timer_start: ["tracker_projects_list"], // Need project to track time
      tracker_entries_create: ["tracker_projects_list"],
      tracker_entries_list: ["tracker_projects_list"],
      tracker_projects_list: ["tracker_entries_list"],
      transactions_update: ["categories_list"],     // Need category to categorize
    },
  });

  // Step 3: Warm up (fetch embeddings)
  await index.warmUp();

  cachedIndex = index;
  return index;
}

export function buildPrepareStep(options: {
  maxTools: number;
  alwaysActive?: string[];
}): PrepareStepFunction {
  if (!cachedIndex) {
    throw new Error("Tool index not initialized");
  }

  const base = cachedIndex.prepareStep({ maxTools: options.maxTools });
  const always = options.alwaysActive ?? [];

  return async (stepOptions: any) => {
    // Let toolpick select top N tools by cosine similarity
    const step = await base(stepOptions);

    // Append always-active tools (they don't get filtered by embeddings)
    if (step?.activeTools) {
      for (const name of always) {
        if (!step.activeTools.includes(name)) {
          step.activeTools.push(name);
        }
      }
    }

    return step;
  };
}

How it works:

  1. At startup, createToolIndex embeds all tool descriptions using OpenAI embeddings.
  2. Embeddings are cached (.toolpick-cache.json); subsequent restarts use cached values.
  3. When a user message arrives, prepareStep is called.
  4. toolpick computes the similarity between the user's message and all cached embeddings.
  5. Top-N tools (e.g., 12) are selected and passed to the model.
  6. Always-active tools (web_search, search_tools, meta-tools) are always appended.
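Step 6 reduces to a small order-preserving union; `mergeActiveTools` is a hypothetical helper mirroring what the loop inside `buildPrepareStep` does:

```typescript
// Union the embedding-selected tools with the always-active set,
// preserving selection order and deduplicating.
function mergeActiveTools(selected: string[], alwaysActive: string[]): string[] {
  const merged = [...selected];
  for (const name of alwaysActive) {
    if (!merged.includes(name)) merged.push(name);
  }
  return merged;
}
```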

Integration with ToolLoopAgent

// chat/assistant-runtime.ts
import { ToolLoopAgent, stepCountIs, smoothStream, type ModelMessage, type Tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { ensureToolIndex, buildPrepareStep } from "./tools";

export async function streamAssistant(params: {
  systemPrompt: string;
  messages: ModelMessage[];
  tools: Record<string, Tool>;
  ctx: McpContext;
}) {
  // Ensure tool index is warm
  await ensureToolIndex(params.ctx);

  // Get all tools (needed for execution)
  const allTools = params.tools;

  // Build the prepareStep hook
  const prepareStep = buildPrepareStep({
    maxTools: 12,
    // Always expose critical discovery tools
    alwaysActive: ["web_search", "search_tools", "composio_search_tools", "composio_multi_execute"],
  });

  const agent = new ToolLoopAgent({
    model: openai("gpt-4o-mini"),
    instructions: params.systemPrompt,
    tools: allTools,
    prepareStep, // ← Filter tools per turn
    stopWhen: stepCountIs(10),
  });

  return agent.stream({
    messages: params.messages,
    experimental_transform: smoothStream(),
  });
}

What happens:

  1. User sends a message.
  2. Model receives system prompt + message + the ~12 most relevant tools (selected by prepareStep).
  3. Model reads tool descriptions and decides which (if any) to call.
  4. If model calls a tool, framework executes it.
  5. Result appended to conversation; loop continues.
  6. At the next turn, prepareStep re-runs with the updated conversation; different tools may be selected.

Why relatedTools Matters

Without relatedTools, the agent hits a common failure mode: it starts creating an invoice, then discovers mid-workflow that it needs a customer ID but customers_list wasn't in the top-12 selection. It either halts or wastes a step calling search_tools.

The relatedTools map acts as dependency pre-loading for multi-step workflows. When invoices_create is selected by embeddings, customers_list is automatically pre-loaded — even if the user's message ("create an invoice for $500") has zero semantic similarity to "list customers."

Guidelines for building the map:

  • Map write tools to the read tools they depend on (create invoice → list customers)
  • Map bidirectional relationships where either side needs the other (tracker_projects_list ↔ tracker_entries_list above — listing projects often leads to listing entries, and vice versa)
  • Keep it minimal — only add dependencies you've observed agents needing in practice
  • Don't add transitive dependencies (if A→B and B→C, don't add A→C unless agents actually need C when calling A)
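The one-hop, non-transitive expansion these guidelines describe can be sketched as follows (a simplified stand-in for toolpick's behavior; `expandWithRelated` is a hypothetical name):

```typescript
// Expand the embedding-selected set with declared dependencies.
// One hop only: dependencies of the originally selected tools are added,
// but dependencies-of-dependencies are not (no transitive closure).
function expandWithRelated(
  selected: string[],
  relatedTools: Record<string, string[]>,
): string[] {
  const result = new Set(selected);
  for (const name of selected) {
    for (const dep of relatedTools[name] ?? []) result.add(dep);
  }
  return [...result];
}
```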

Trade-offs

Pros

  • Token savings: Tool list shrinks from ~500 tokens (50 tools) to ~120 tokens (12 tools) per request.
  • Model clarity: Model focuses on relevant tools; less distraction.
  • Discovery: Related tools are suggested via relatedTools config.

Cons

  • Embedding latency: Embedding the user's message adds ~50–100ms per request.
  • Cold start: First request after server restart pays embedding cost (~2–5s).
  • Coverage: If a tool isn't in the top 12, the model can't use it directly. Mitigation: search_tools meta-tool lets model search for tools dynamically.

Using search_tools for Discovery

If the top-12 selection misses a tool, the model can call search_tools to find it:

server.registerTool(
  "search_tools",
  {
    title: "Search Available Tools",
    description: "Search for a tool by name or capability. Use this when you can't find the tool you need.",
    // MCP's registerTool takes raw Zod shapes, not z.object(...)
    inputSchema: {
      query: z.string().describe("What do you want to do? (e.g., 'list reports', 'send email', 'delete invoice')"),
    },
    outputSchema: {
      tools: z.array(z.object({
        name: z.string(),
        description: z.string(),
      })),
    },
  },
  async (params) => {
    if (!cachedIndex) throw new Error("Tool index not initialized");
    // Search tool index by keyword + similarity
    const results = await cachedIndex.search(params.query, { maxResults: 5 });
    return {
      content: [{
        type: "text",
        text: JSON.stringify(results),
      }],
      structuredContent: { tools: results },
    };
  }
);

Pattern:

  1. User asks for something the model doesn't immediately recognize.
  2. Model (before calling a tool) calls search_tools with the user's intent.
  3. search_tools returns matching tools.
  4. Model picks the best match and uses it.

Example:

  • User: "Send an email to alice@example.com"
  • Model (not in top-12, doesn't see email tool) calls search_tools with "send email"
  • search_tools returns [composio_gmail_send_message, composio_sendgrid_send_email]
  • Model calls one of them

Warm-Up and Lifecycle

// apps/api/src/index.ts (server startup)
import { ensureToolIndex } from "@api/chat/tools";

// On server startup, pre-warm the tool index
ensureToolIndex(createStubMcpContext()).catch((err) => {
  logger.warn("Tool index warm-up failed (will retry on first request)", { error: err.message });
});

app.listen(3000, () => {
  logger.info("Server started");
});

What the warm-up does:

  1. Creates a stub MCP context.
  2. Calls ensureToolIndex to trigger embedding computation.
  3. If embeddings are cached, returns immediately (~10ms).
  4. If not cached, computes and saves to disk (~3–5s).

Result: First real user request gets instant semantic selection; no 3s latency on cold start.


Tuning maxTools

Different use cases have different optimal values:

// Simple assistant (few domains)
buildPrepareStep({ maxTools: 8, alwaysActive: ["web_search"] })

// Moderate (internal + external tools)
buildPrepareStep({ maxTools: 12, alwaysActive: ["web_search", "search_tools"] })

// Complex (internal + Composio + external + search)
buildPrepareStep({ 
  maxTools: 15, 
  alwaysActive: ["web_search", "search_tools", "composio_search_tools", "composio_multi_execute"] 
})

Guidelines:

  • Too low (5–8): Model misses relevant tools; forced to use search_tools repeatedly.
  • Optimal (10–15): Balance between context efficiency and coverage.
  • Too high (>20): Defeats the purpose; tokens saved are negligible.

Start at 12; measure token usage and model accuracy; adjust ±3 based on results.


Caching Strategy

The embedding cache is critical for performance:

// .toolpick-cache.json (auto-managed by fileCache)
{
  "customers_list": [0.123, -0.456, ..., 0.789],
  "customers_get": [0.234, -0.567, ..., 0.890],
  // ... one vector per tool
}

Key points:

  • Cache is built once; reused across requests and server restarts.
  • If you add/remove tools, toolpick automatically recomputes affected embeddings.
  • Cache is keyed by tool name; renaming a tool invalidates its embedding.
  • For production, consider caching in Redis instead of disk:
import { createClient } from "redis";
import { redisCache } from "toolpick";

const redis = createClient();
await redis.connect();

const embeddingCache = redisCache(redis, "toolpick:embeddings");

const index = await createToolIndex(toolDefinitions, {
  embeddingModel: openai.embeddingModel("text-embedding-3-small"),
  embeddingCache,
});
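The reconciliation described above (recompute only affected embeddings when tools are added, removed, or renamed) amounts to a set diff between cached and currently registered names. A minimal sketch — toolpick handles this internally; the function name is hypothetical:

```typescript
// Compare cached embedding keys against currently registered tool names.
// Missing tools need embedding; stale cache entries can be evicted.
function diffEmbeddingCache(
  cachedNames: string[],
  currentNames: string[],
): { toEmbed: string[]; toEvict: string[] } {
  const cached = new Set(cachedNames);
  const current = new Set(currentNames);
  return {
    toEmbed: currentNames.filter((n) => !cached.has(n)),
    toEvict: cachedNames.filter((n) => !current.has(n)),
  };
}
```

A renamed tool shows up on both sides of the diff (old name evicted, new name embedded), which is why renaming invalidates its embedding.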

Monitoring and Observability

Add logging inside buildPrepareStep (from the earlier code) to debug selection misses:

// Inside the returned function from buildPrepareStep, after `const step = await base(stepOptions)`:
logger.debug("[toolpick] Selected tools", {
  userMessage: String(stepOptions.messages.at(-1)?.content ?? "").slice(0, 100),
  selectedTools: step?.activeTools,
  count: step?.activeTools?.length ?? 0,
});

Metrics to track:

  • Average tools selected per request.
  • Frequency of search_tools calls (high frequency = selection misses).
  • Token usage before/after (should drop significantly).
  • Model accuracy (did the model pick the right tool?).
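The search_tools frequency metric can be computed straight from tool-call logs. A minimal sketch, with an assumed event shape:

```typescript
// One record per tool call, tagged with the request it belonged to.
type ToolCallEvent = { requestId: string; toolName: string };

// Fraction of requests that needed a search_tools fallback —
// a proxy for how often the top-N selection missed a needed tool.
function searchToolsMissRate(events: ToolCallEvent[]): number {
  const requests = new Set(events.map((e) => e.requestId));
  const missed = new Set(
    events.filter((e) => e.toolName === "search_tools").map((e) => e.requestId),
  );
  return requests.size === 0 ? 0 : missed.size / requests.size;
}
```

If this rate climbs, raise maxTools or extend the relatedTools map before reaching for a bigger model.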

Checklist

  • Add toolpick library: npm install toolpick.
  • Create ensureToolIndex with embedding cache (file or Redis).
  • Build buildPrepareStep hook; append always-active tools.
  • Integrate with ToolLoopAgent via prepareStep option.
  • Add search_tools meta-tool for dynamic discovery.
  • Warm up tool index on server startup.
  • Measure token usage; confirm significant savings.
  • Monitor search_tools call frequency; adjust maxTools if too high.
  • Cache embeddings in Redis for production.
  • Test tool selection with diverse user queries.
