
Dense Text Embeddings

Model selection, Matryoshka embeddings, cost/quality/latency trade-offs

Embeddings are dense vectors (1,024–3,072 dimensions) that represent text semantically. They power vector similarity search: the model converts queries and documents into vectors, and nearest-neighbor search returns semantically similar results.
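
Nearest-neighbor search ranks candidates by a similarity function, most commonly cosine similarity; a minimal sketch in plain TypeScript (no library assumed):

```typescript
// Cosine similarity: dot product divided by the product of the vector
// magnitudes. Ranges from -1 (opposite) to 1 (identical direction).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vector databases apply the same scoring over approximate indexes to stay fast at scale; the function itself does not change.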

Summary

Dense embeddings transform text into fixed-length vectors suitable for nearest-neighbor search. As of April 2026, open-source models match or exceed commercial APIs on the MTEB leaderboard. Choose based on cost sensitivity, latency SLA (API calls are slower than self-hosted inference), and domain (specialized legal, code, and medical models exist). Matryoshka embeddings (supported by OpenAI 3-large and Cohere v4) let you embed once at full dimensions and truncate to 256–512 for indexing and search, cutting cost without re-embedding.

Key takeaways:

  • Commercial APIs (OpenAI, Voyage, Cohere) are integrated but expensive ($0.01–0.12 per 1M tokens).
  • Open-source (BGE-M3, E5-Mistral, Nomic) compete on MTEB; zero API costs, variable latency.
  • Matryoshka: embed once at full dims, index and query at 256/512 for a 4–36x speedup and cost drop.
  • Specialized models beat generalists on domain-specific benchmarks (legal, medical, code).

Model selection matrix

Model | Dims | Cost | MTEB score | Latency | Notes
--- | --- | --- | --- | --- | ---
OpenAI text-embedding-3-large | 3,072 (Matryoshka) | $0.02/1M | 62.5 | 100–300ms (API) | De facto standard; integrated file_search
OpenAI text-embedding-3-small | 1,536 (Matryoshka) | $0.02/1M | 62.0 | 100–300ms (API) | Half the dims of 3-large; nearly same quality
Voyage 3 | 1,536 | $0.01/1M | 64.8 | 50–150ms (API) | Rerank-2.5 companion; MongoDB-owned
Voyage 4 | 1,536 | $0.01/1M | 65.5 | 50–150ms (API) | Latest Voyage; slightly better than v3
Cohere Embed v4 | 1,536 (Matryoshka) | $0.12/1M | 64.2 | 100–300ms (API) | Multimodal (text + images); enterprise rerank
BGE-M3 | 1,024 | Free | 65.1 | 10–20ms (GPU) | Open-source; multilingual; beat commercial APIs March 2026
E5-Mistral-7B | 768 | Free | 63.8 | 20–40ms (GPU) | Compact; semantic-focused
Nomic Embed 1.5 | 384 | Free | 62.1 | 5–10ms (CPU) | Smallest; Matryoshka-native
Jina V3 | 1,024 | Free | 64.5 | 15–30ms (GPU) | Long-context (8,192 tokens)
Google Gemini Embedding 2 | 768 | ~$0.02/1M | 68.3 | 100–300ms (API) | Tops MTEB leaderboard (March 2026); all-modality (text, image, video, audio, code)

Matryoshka embeddings: truncation without recomputing

Matryoshka-trained models concentrate semantic information in the leading components of the vector, so a full embedding (e.g. 3,072 dims) can be truncated to 256 or 512 dims and renormalized without re-embedding. Note that a query vector's dimension must match the index dimension, so stored and query vectors are truncated the same way:

import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const openai = new OpenAI();
const pinecone = new Pinecone();

// Truncate a Matryoshka vector and L2-renormalize; without
// renormalization, cosine similarity scores are distorted.
function truncate(vector: number[], dims: number): number[] {
  const cut = vector.slice(0, dims);
  const norm = Math.sqrt(cut.reduce((sum, x) => sum + x * x, 0));
  return cut.map((x) => x / norm);
}

// Embed once at the full 3,072 dims and keep this vector as the source
// of truth; smaller variants never require re-embedding.
const response = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: 'The quick brown fox jumps over the lazy dog',
  dimensions: 3072,
});
const fullVector = response.data[0].embedding;

// Upsert the truncated vector to a 512-dim index for fast, cheap search.
await pinecone.index('main-512').upsert([{
  id: 'doc-1',
  values: truncate(fullVector, 512),
  metadata: { text: '...' },
}]);

// At query time, request the smaller dimension directly (the API
// truncates and renormalizes server-side) and search the 512-dim index.
const queryResponse = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: 'fox jumps',
  dimensions: 512,
});

const results = await pinecone.index('main-512').query({
  vector: queryResponse.data[0].embedding,
  topK: 10,
});

Benefits:

  • 4–36x faster queries (smaller vectors fit more in CPU cache)
  • 50–80% cost reduction (fewer floating-point ops)
  • No re-embedding needed (truncate and renormalize the stored full vectors)
  • Trade-off: ~2–5% accuracy loss at 256 dims vs. full
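
Because the informative components come first, truncation is a purely local operation on the stored vector; a sketch of the truncate-and-renormalize step (the renormalization keeps cosine similarity meaningful after cutting dimensions):

```typescript
// Truncate a Matryoshka embedding to its first `dims` components and
// L2-renormalize so the result is a unit vector again.
function truncateEmbedding(vector: number[], dims: number): number[] {
  const truncated = vector.slice(0, dims);
  const norm = Math.sqrt(truncated.reduce((sum, x) => sum + x * x, 0));
  return truncated.map((x) => x / norm);
}
```

Run this over vectors already in your store to build a smaller index; nothing goes back through the embedding model.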

Specialized embedding models

Legal and financial documents: domain-specific models such as Legal-BERT (legal) and FinBERT (financial).
Medical/biomedical: SciBERT, BioBERT, BioGPT embeddings.
Code: CodeBERT, GraphCodeBERT.
Multilingual: BGE-M3 (100+ languages), LaBSE, mBERT embeddings.

These outperform generalists by 10–20% on domain-specific retrieval benchmarks.

API vs. self-hosted trade-offs

API (OpenAI, Voyage, Cohere)

Pros:

  • Integrated with popular frameworks (LangChain, LlamaIndex, Vercel AI SDK)
  • Quality guarantee (vendor maintains model)
  • No infrastructure overhead

Cons:

  • $0.01–0.12 per 1M tokens (adds up at scale)
  • 100–300ms latency (network + API)
  • Privacy: text sent to third parties
  • Rate limits

Self-hosted (open-source)

Pros:

  • Zero token cost (after initial download)
  • 5–40ms latency (local GPU/CPU)
  • Privacy: data never leaves your infrastructure
  • No rate limits

Cons:

  • Infrastructure cost (GPU rental or owned)
  • Model updates are manual
  • Quality variance (open-source models less mature)
  • Latency unpredictable on CPU-only (Nomic Embed 1.5 is exception)

Decision: API if latency < 500ms acceptable and scale < 1M docs/month. Self-hosted if embedding 1M+ docs/month or privacy-critical (legal, medical, finance).
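
The decision rule above can be captured as a small predicate; a sketch of the heuristic (the thresholds are the ones stated here, tune them for your workload):

```typescript
// Heuristic: self-host when embedding volume is high or the data is
// privacy-critical (legal, medical, finance); otherwise use an API.
function preferSelfHosted(
  docsPerMonth: number,
  privacyCritical: boolean
): boolean {
  return privacyCritical || docsPerMonth >= 1_000_000;
}
```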

Batch embedding with retry logic

For production embedding pipelines:

import { chunk } from 'lodash'; // lodash provides chunk, not batch
import pRetry from 'p-retry';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function embedBatch(texts: string[], batchSize: number = 100) {
  const batches = chunk(texts, batchSize);
  const results: number[][] = [];

  for (const batchTexts of batches) {
    const embedded = await pRetry(
      () =>
        openai.embeddings.create({
          model: 'text-embedding-3-large',
          input: batchTexts,
          dimensions: 1536, // Use Matryoshka for cost
        }),
      {
        retries: 3,
        onFailedAttempt: (error) => {
          console.warn(
            `Embedding batch attempt ${error.attemptNumber} failed.`,
            error.message
          );
        },
      }
    );

    const sortedResults = embedded.data.sort((a, b) => a.index - b.index);
    results.push(...sortedResults.map((r) => r.embedding));

    // Rate limit: OpenAI allows 500K tokens/min
    await new Promise((resolve) => setTimeout(resolve, 100));
  }

  return results;
}

Evaluation: MTEB leaderboard

MTEB (Massive Text Embedding Benchmark) covers 56+ tasks:

  • Retrieval (BEIR-style)
  • Classification
  • Clustering
  • Semantic similarity
  • Paraphrase detection
  • Reranking

Check MTEB Leaderboard for latest scores. Caveats:

  • Text-only (no multimodal tasks)
  • No cross-lingual retrieval (e.g., Chinese query → English doc)
  • Limited coverage of long documents (> 10K tokens)
  • No MRL (Matryoshka) truncation tests

For production, combine MTEB scores with domain-specific evaluation (RAGAS faithfulness, custom recall@k on your data).
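
A custom recall@k check needs only a few lines; a sketch assuming you have, per query, the ranked retrieved IDs and a ground-truth set of relevant IDs (both inputs are yours to supply):

```typescript
// recall@k: fraction of relevant documents that appear in the top-k
// retrieved results, averaged over all queries.
function recallAtK(
  retrieved: string[][], // per-query ranked result IDs
  relevant: string[][], // per-query ground-truth relevant IDs
  k: number
): number {
  let total = 0;
  for (let i = 0; i < retrieved.length; i++) {
    const topK = new Set(retrieved[i].slice(0, k));
    const hits = relevant[i].filter((id) => topK.has(id)).length;
    total += relevant[i].length > 0 ? hits / relevant[i].length : 0;
  }
  return total / retrieved.length;
}
```

Compare recall@k across candidate models (and across Matryoshka dimensions) on your own corpus before committing.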

Cost estimation

For embedding 1M documents with text-embedding-3-large at 500 tokens/doc average:

Cost = (1,000,000 docs × 500 tokens/doc) × ($0.02 / 1M tokens) = $10
Matryoshka cost at query time:
- 10,000 queries × 200 tokens = 2M tokens × $0.02 / 1M = $0.04/month

At 100K docs with self-hosted BGE-M3 on GPU ($0.25/hour):

Embedding time: 100K docs at ~2,000 docs/sec ≈ 50 seconds ≈ $0.003 of GPU time
Subsequent queries: ~free (GPU already rented)
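
The arithmetic above generalizes to a one-line estimator; a sketch (pass in the current per-million-token price, since rates change):

```typescript
// Estimated API embedding cost in dollars:
// (docs x tokens/doc) / 1M x price-per-1M-tokens.
function embeddingCost(
  docs: number,
  avgTokensPerDoc: number,
  pricePerMillionTokens: number
): number {
  return (docs * avgTokensPerDoc * pricePerMillionTokens) / 1_000_000;
}
```

For example, 1M docs at 500 tokens each and $0.02/1M tokens comes out to $10, matching the worked estimate above.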

Decision: API for under 100K docs/month or unknown scale. Self-hosted for over 1M docs/month or privacy-sensitive data.

Code example: async batch embed with progress

import * as fs from 'fs';
import OpenAI from 'openai';
import { chunk } from 'lodash'; // lodash provides chunk, not batch

async function embedLargeDataset(
  inputFile: string,
  outputFile: string,
  batchSize: number = 100
) {
  const lines = fs
    .readFileSync(inputFile, 'utf-8')
    .split('\n')
    .filter((l) => l.trim());

  const batches = chunk(lines, batchSize);
  const client = new OpenAI();
  const results: Array<{ text: string; embedding: number[] }> = [];

  for (let i = 0; i < batches.length; i++) {
    const batchTexts = batches[i];
    const response = await client.embeddings.create({
      model: 'text-embedding-3-large',
      input: batchTexts,
      dimensions: 512, // Matryoshka: small dims for cost
    });

    const sorted = response.data.sort((a, b) => a.index - b.index);
    sorted.forEach((item, idx) => {
      results.push({
        text: batchTexts[idx],
        embedding: item.embedding,
      });
    });

    console.log(`Progress: ${(((i + 1) / batches.length) * 100).toFixed(1)}%`);
  }

  fs.writeFileSync(outputFile, JSON.stringify(results, null, 2));
  console.log(`Embedded ${results.length} texts to ${outputFile}`);
}

// Usage
await embedLargeDataset('documents.txt', 'embeddings.json', 100);
