
Dense Text Embeddings

Model selection, Matryoshka embeddings, cost/quality/latency trade-offs

Embeddings are dense vectors (1,024–3,072 dimensions) that represent text semantically. They power vector similarity search: the model converts queries and documents into vectors, and nearest-neighbor search returns semantically similar results.
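
Nearest-neighbor search ranks candidates by a similarity function, most commonly cosine similarity; a minimal sketch in plain TypeScript (no library assumed):

```typescript
// Cosine similarity: dot product divided by the product of the vector
// magnitudes. Ranges from -1 (opposite) to 1 (identical direction).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vector databases apply the same scoring over approximate indexes to stay fast at scale; the function itself does not change.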

Summary

Dense embeddings transform text into fixed-length vectors suitable for nearest-neighbor search. As of April 2026, open-source models match or exceed commercial APIs on the MTEB leaderboard. Choose based on cost sensitivity, latency SLA (API calls are slower than self-hosted inference), and domain (specialized legal, code, and medical models exist). Matryoshka embeddings (supported by OpenAI 3-large and Cohere v4) let you embed once at full dimensions and truncate to 256–512 for indexing and search, cutting cost without re-embedding.

Key takeaways:

  • Commercial APIs (OpenAI, Voyage, Cohere) are integrated but expensive ($0.01–0.12 per 1M tokens).
  • Open-source (BGE-M3, E5-Mistral, Nomic) compete on MTEB; zero API costs, variable latency.
  • Matryoshka: embed once at full dims, index and query at 256/512 for a 4–36x speedup and cost drop.
  • Specialized models beat generalists on domain-specific benchmarks (legal, medical, code).

Model selection matrix

Model | Dims | Cost | MTEB score | Latency | Notes
--- | --- | --- | --- | --- | ---
OpenAI text-embedding-3-large | 3,072 (Matryoshka) | $0.02/1M | 62.5 | 100–300ms (API) | De facto standard; integrated file_search
OpenAI text-embedding-3-small | 1,536 (Matryoshka) | $0.02/1M | 62.0 | 100–300ms (API) | Half the dims of 3-large; nearly same quality
Voyage 3 | 1,536 | $0.01/1M | 64.8 | 50–150ms (API) | Rerank-2.5 companion; MongoDB-owned
Voyage 4 | 1,536 | $0.01/1M | 65.5 | 50–150ms (API) | Latest Voyage; slightly better than v3
Cohere Embed v4 | 1,536 (Matryoshka) | $0.12/1M | 64.2 | 100–300ms (API) | Multimodal (text + images); enterprise rerank
BGE-M3 | 1,024 | Free | 65.1 | 10–20ms (GPU) | Open-source; multilingual; beat commercial APIs March 2026
E5-Mistral-7B | 768 | Free | 63.8 | 20–40ms (GPU) | Compact; semantic-focused
Nomic Embed 1.5 | 384 | Free | 62.1 | 5–10ms (CPU) | Smallest; Matryoshka-native
Jina V3 | 1,024 | Free | 64.5 | 15–30ms (GPU) | Long-context (8,192 tokens)
Google Gemini Embedding 2 | 768 | ~$0.02/1M | 68.3 | 100–300ms (API) | Tops MTEB leaderboard (March 2026); all-modality (text, image, video, audio, code)

Matryoshka embeddings: truncation without recomputing

Matryoshka-trained models concentrate semantic information in the leading components of the vector, so a full embedding (e.g. 3,072 dims) can be truncated to 256 or 512 dims and renormalized without re-embedding. Note that a query vector's dimension must match the index dimension, so stored and query vectors are truncated the same way:

import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const openai = new OpenAI();
const pinecone = new Pinecone();

// Truncate a Matryoshka vector and L2-renormalize; without
// renormalization, cosine similarity scores are distorted.
function truncate(vector: number[], dims: number): number[] {
  const cut = vector.slice(0, dims);
  const norm = Math.sqrt(cut.reduce((sum, x) => sum + x * x, 0));
  return cut.map((x) => x / norm);
}

// Embed once at the full 3,072 dims and keep this vector as the source
// of truth; smaller variants never require re-embedding.
const response = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: 'The quick brown fox jumps over the lazy dog',
  dimensions: 3072,
});
const fullVector = response.data[0].embedding;

// Upsert the truncated vector to a 512-dim index for fast, cheap search.
await pinecone.index('main-512').upsert([{
  id: 'doc-1',
  values: truncate(fullVector, 512),
  metadata: { text: '...' },
}]);

// At query time, request the smaller dimension directly (the API
// truncates and renormalizes server-side) and search the 512-dim index.
const queryResponse = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: 'fox jumps',
  dimensions: 512,
});

const results = await pinecone.index('main-512').query({
  vector: queryResponse.data[0].embedding,
  topK: 10,
});

Benefits:

  • 4–36x faster queries (smaller vectors fit more in CPU cache)
  • 50–80% cost reduction (fewer floating-point ops)
  • No re-embedding needed (truncate and renormalize the stored full vectors)
  • Trade-off: ~2–5% accuracy loss at 256 dims vs. full
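
Because the informative components come first, truncation is a purely local operation on the stored vector; a sketch of the truncate-and-renormalize step (the renormalization keeps cosine similarity meaningful after cutting dimensions):

```typescript
// Truncate a Matryoshka embedding to its first `dims` components and
// L2-renormalize so the result is a unit vector again.
function truncateEmbedding(vector: number[], dims: number): number[] {
  const truncated = vector.slice(0, dims);
  const norm = Math.sqrt(truncated.reduce((sum, x) => sum + x * x, 0));
  return truncated.map((x) => x / norm);
}
```

Run this over vectors already in your store to build a smaller index; nothing goes back through the embedding model.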

Specialized embedding models

Legal and financial documents: domain-specific models such as Legal-BERT (legal) and FinBERT (financial).
Medical/biomedical: SciBERT, BioBERT, BioGPT embeddings.
Code: CodeBERT, GraphCodeBERT.
Multilingual: BGE-M3 (100+ languages), LaBSE, mBERT embeddings.

These outperform generalists by 10–20% on domain-specific retrieval benchmarks.

API vs. self-hosted trade-offs

API (OpenAI, Voyage, Cohere)

Pros:

  • Integrated with popular frameworks (LangChain, LlamaIndex, Vercel AI SDK)
  • Quality guarantee (vendor maintains model)
  • No infrastructure overhead

Cons:

  • $0.01–0.12 per 1M tokens (adds up at scale)
  • 100–300ms latency (network + API)
  • Privacy: text sent to third parties
  • Rate limits

Self-hosted (open-source)

Pros:

  • Zero token cost (after initial download)
  • 5–40ms latency (local GPU/CPU)
  • Privacy: data never leaves your infrastructure
  • No rate limits

Cons:

  • Infrastructure cost (GPU rental or owned)
  • Model updates are manual
  • Quality variance (open-source models less mature)
  • Latency unpredictable on CPU-only (Nomic Embed 1.5 is exception)

Decision: API if latency < 500ms acceptable and scale < 1M docs/month. Self-hosted if embedding 1M+ docs/month or privacy-critical (legal, medical, finance).
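
The decision rule above can be captured as a small predicate; a sketch of the heuristic (the thresholds are the ones stated here, tune them for your workload):

```typescript
// Heuristic: self-host when embedding volume is high or the data is
// privacy-critical (legal, medical, finance); otherwise use an API.
function preferSelfHosted(
  docsPerMonth: number,
  privacyCritical: boolean
): boolean {
  return privacyCritical || docsPerMonth >= 1_000_000;
}
```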

Batch embedding with retry logic

For production embedding pipelines:

import { chunk } from 'lodash'; // lodash provides chunk, not batch
import pRetry from 'p-retry';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function embedBatch(texts: string[], batchSize: number = 100) {
  const batches = chunk(texts, batchSize);
  const results: number[][] = [];

  for (const batchTexts of batches) {
    const embedded = await pRetry(
      () =>
        openai.embeddings.create({
          model: 'text-embedding-3-large',
          input: batchTexts,
          dimensions: 1536, // Use Matryoshka for cost
        }),
      {
        retries: 3,
        onFailedAttempt: (error) => {
          console.warn(
            `Embedding batch attempt ${error.attemptNumber} failed.`,
            error.message
          );
        },
      }
    );

    const sortedResults = embedded.data.sort((a, b) => a.index - b.index);
    results.push(...sortedResults.map((r) => r.embedding));

    // Rate limit: OpenAI allows 500K tokens/min
    await new Promise((resolve) => setTimeout(resolve, 100));
  }

  return results;
}

Evaluation: MTEB leaderboard

MTEB (Massive Text Embedding Benchmark) covers 56+ tasks:

  • Retrieval (BEIR-style)
  • Classification
  • Clustering
  • Semantic similarity
  • Paraphrase detection
  • Reranking

Check MTEB Leaderboard for latest scores. Caveats:

  • Text-only (no multimodal tasks)
  • No cross-lingual retrieval (e.g., Chinese query → English doc)
  • Limited coverage of long documents (> 10K tokens)
  • No MRL (Matryoshka) truncation tests

For production, combine MTEB scores with domain-specific evaluation (RAGAS faithfulness, custom recall@k on your data).
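
A custom recall@k check needs only a few lines; a sketch assuming you have, per query, the ranked retrieved IDs and a ground-truth set of relevant IDs (both inputs are yours to supply):

```typescript
// recall@k: fraction of relevant documents that appear in the top-k
// retrieved results, averaged over all queries.
function recallAtK(
  retrieved: string[][], // per-query ranked result IDs
  relevant: string[][], // per-query ground-truth relevant IDs
  k: number
): number {
  let total = 0;
  for (let i = 0; i < retrieved.length; i++) {
    const topK = new Set(retrieved[i].slice(0, k));
    const hits = relevant[i].filter((id) => topK.has(id)).length;
    total += relevant[i].length > 0 ? hits / relevant[i].length : 0;
  }
  return total / retrieved.length;
}
```

Compare recall@k across candidate models (and across Matryoshka dimensions) on your own corpus before committing.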

Cost estimation

For embedding 1M documents with text-embedding-3-large at 500 tokens/doc average:

Cost = (1,000,000 docs × 500 tokens/doc) × ($0.02 / 1M tokens) = $10
Matryoshka cost at query time:
- 10,000 queries × 200 tokens = 2M tokens × $0.02 / 1M = $0.04/month

At 100K docs with self-hosted BGE-M3 on GPU ($0.25/hour):

Embedding time: 100K docs at ~2,000 docs/sec ≈ 50 seconds ≈ $0.003 of GPU time
Subsequent queries: ~free (GPU already rented)
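
The arithmetic above generalizes to a one-line estimator; a sketch (pass in the current per-million-token price, since rates change):

```typescript
// Estimated API embedding cost in dollars:
// (docs x tokens/doc) / 1M x price-per-1M-tokens.
function embeddingCost(
  docs: number,
  avgTokensPerDoc: number,
  pricePerMillionTokens: number
): number {
  return (docs * avgTokensPerDoc * pricePerMillionTokens) / 1_000_000;
}
```

For example, 1M docs at 500 tokens each and $0.02/1M tokens comes out to $10, matching the worked estimate above.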

Decision: API for under 100K docs/month or unknown scale. Self-hosted for over 1M docs/month or privacy-sensitive data.

Code example: async batch embed with progress

import * as fs from 'fs';
import OpenAI from 'openai';
import { chunk } from 'lodash'; // lodash provides chunk, not batch

async function embedLargeDataset(
  inputFile: string,
  outputFile: string,
  batchSize: number = 100
) {
  const lines = fs
    .readFileSync(inputFile, 'utf-8')
    .split('\n')
    .filter((l) => l.trim());

  const batches = chunk(lines, batchSize);
  const client = new OpenAI();
  const results: Array<{ text: string; embedding: number[] }> = [];

  for (let i = 0; i < batches.length; i++) {
    const batchTexts = batches[i];
    const response = await client.embeddings.create({
      model: 'text-embedding-3-large',
      input: batchTexts,
      dimensions: 512, // Matryoshka: small dims for cost
    });

    const sorted = response.data.sort((a, b) => a.index - b.index);
    sorted.forEach((item, idx) => {
      results.push({
        text: batchTexts[idx],
        embedding: item.embedding,
      });
    });

    console.log(`Progress: ${(((i + 1) / batches.length) * 100).toFixed(1)}%`);
  }

  fs.writeFileSync(outputFile, JSON.stringify(results, null, 2));
  console.log(`Embedded ${results.length} texts to ${outputFile}`);
}

// Usage
await embedLargeDataset('documents.txt', 'embeddings.json', 100);
