When I started SynaBun I had to pick an embedding model. The default move would have been text-embedding-3-small or ada-002. OpenAI hosts them. They are good. They are easy.
I picked a 22MB model called all-MiniLM-L6-v2 instead. It runs on the user's CPU. It has no API key. It works on a plane.
This post is the long version of why, what the tradeoffs are, and how it actually performs.
What all-MiniLM-L6-v2 is
all-MiniLM-L6-v2 is a sentence-transformers model, originally trained by the UKP Lab at TU Darmstadt. The relevant numbers:
- Architecture: 6-layer MiniLM (a distilled BERT variant)
- Embedding dimensions: 384
- Max sequence length: 256 tokens (longer inputs are truncated)
- Model size: ~22MB on disk, ~80MB in RAM with the tokenizer + ONNX runtime
- License: Apache 2.0
- Inference speed (CPU): ~12ms per 256-token chunk on an M1 MacBook Air
The training corpus is over 1 billion sentence pairs from across the web (Reddit, Quora, S2ORC, NLI, MS MARCO, Yahoo Answers, and more). It is a general-purpose encoder, not a code-specific one.
Why local matters for MCP memory
The Model Context Protocol (MCP) lets AI coding assistants like Claude Code, Codex, OpenCode, and Gemini call out to external tools. Memory is one of the most important tools because every "remember this" or "recall what I said about X" call has to round-trip somewhere.
When that somewhere is a cloud API, three things happen:
- Latency tax. Every recall costs 200-400ms before the model even sees the result.
- Cost tax. Embeddings are cheap individually but expensive in aggregate. A heavy day of coding can hit 1000+ recall ops.
- Privacy tax. Your code, project notes, debugging traces, and design decisions get sent to a third party.
For a developer using Claude Code 8 hours a day, those numbers compound fast. Latency alone is the killer — you cannot stay in flow if every memory recall stalls the session.
Local embeddings remove all three. The cost is on-device compute, which on a modern laptop is essentially free.
The OpenAI alternative — and why it loses
Here is the head-to-head:
| | all-MiniLM-L6-v2 | text-embedding-3-small | ada-002 |
|---|---|---|---|
| Dimensions | 384 | 1536 (configurable down to 512) | 1536 |
| Cost | $0 | $0.02 per 1M tokens | $0.10 per 1M tokens |
| Latency | ~12ms (CPU) | 200-400ms (network) | 200-400ms (network) |
| Offline | Yes | No | No |
| API key | None | Required | Required |
| Model size | 22MB | N/A (cloud) | N/A (cloud) |
| License | Apache 2.0 | Proprietary | Proprietary |
For developer memory, the only real argument for OpenAI is recall quality. So I benchmarked.
Benchmark setup
I built a corpus of 12,000 SynaBun memories from my own development sessions over 8 months. Each memory averages 280 characters and includes things like:
- Bug fix descriptions
- Architecture decisions
- API quirks
- Code patterns
- Session summaries
- File-level notes
I then wrote 200 recall queries against this corpus. Each query has a known-good answer in the corpus (manually labeled). I measured recall@5 — does the correct memory show up in the top 5 results?
from sklearn.metrics.pairwise import cosine_similarity

def evaluate(model, corpus, queries):
    # Embed the whole corpus once: encode() returns an (N, 384) matrix.
    embeddings = model.encode([m.content for m in corpus])
    correct = 0
    for query, expected_id in queries:
        q_emb = model.encode([query])                      # shape (1, 384)
        scores = cosine_similarity(q_emb, embeddings)[0]   # shape (N,)
        top5 = scores.argsort()[-5:][::-1]                 # indices of the 5 best scores
        if expected_id in [corpus[i].id for i in top5]:
            correct += 1
    return correct / len(queries)
Results
| Model | Recall@5 | Recall@1 | p50 latency | Cost per 1k queries |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (CPU) | 78% | 54% | 12ms | $0.00 |
| text-embedding-3-small | 83% | 61% | 240ms | $0.04 |
| ada-002 | 75% | 51% | 280ms | $0.20 |
| bge-small-en-v1.5 (CPU) | 79% | 56% | 14ms | $0.00 |
| gte-small (CPU) | 80% | 57% | 13ms | $0.00 |
Three takeaways:
- The OpenAI v3 model is genuinely better — 5 points on recall@5, 7 points on recall@1.
- The 5-point gap costs you 240ms of latency, 4 cents per 1000 queries, and an internet connection.
- bge-small-en-v1.5 and gte-small are slightly better than all-MiniLM-L6-v2 but require 130MB and 70MB respectively. For a 5MB total ship in the Node MCP server, MiniLM wins.
Why 384 dimensions is enough
The conventional wisdom is "more dimensions = better recall". The benchmark above shows the gain is real but small, and every extra dimension trades against memory, storage, and compute.
For a typical SynaBun user with under 100,000 memories:
- 384 dims × 4 bytes × 100k memories = 150MB of vectors
- 1536 dims × 4 bytes × 100k memories = 600MB of vectors
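The back-of-the-envelope math, written out (plain float32, no quantization):

const vectorBytes = (dims, count) => dims * 4 * count; // 4 bytes per float32 component

vectorBytes(384, 100_000);  // 153,600,000 bytes ≈ 150MB
vectorBytes(1536, 100_000); // 614,400,000 bytes ≈ 600MB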
The 4x storage difference matters when SQLite is doing full-table scans for cosine similarity (more on that below). For corpora under 1M items, a brute-force cosine scan over 384-dim vectors is faster end-to-end than 1536-dim vectors behind a hosted HNSW index.
Storage — sqlite-vec instead of a vector DB
SynaBun uses SQLite + the sqlite-vec extension. No Pinecone. No Weaviate. No Qdrant. Just a single SQLite file at ~/.synabun/memory.db.
Why:
- Zero infra. Users do not run a service.
- Backupable. Copy the file. Done.
- Diffable. SQLite is a real database. You can .dump it.
- Embeddable. The Node MCP server links it natively.
The schema looks like this:
CREATE TABLE memories (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
embedding BLOB NOT NULL, -- 384 floats packed as 1536 bytes
category TEXT,
project TEXT,
importance INTEGER,
tags TEXT, -- JSON array
created_at INTEGER,
accessed_at INTEGER,
access_count INTEGER DEFAULT 0
);
CREATE VIRTUAL TABLE memory_vec USING vec0(
  id TEXT PRIMARY KEY,
  embedding float[384]
);
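For a concrete picture of the write path, here is a minimal sketch in Node using better-sqlite3 and the sqlite-vec npm package. The embed() helper is the transformers.js function shown later in this post; the rest of the names are illustrative, not SynaBun's actual code.

import Database from 'better-sqlite3';
import * as sqliteVec from 'sqlite-vec';
import { randomUUID } from 'node:crypto';

const db = new Database(`${process.env.HOME}/.synabun/memory.db`);
sqliteVec.load(db); // registers the vec0 virtual table module

async function saveMemory(content, category, project) {
  const id = randomUUID();
  // 384 floats packed into a 1536-byte BLOB, matching the schema above.
  const vec = Buffer.from(new Float32Array(await embed(content)).buffer);

  db.prepare(
    `INSERT INTO memories (id, content, embedding, category, project, created_at)
     VALUES (?, ?, ?, ?, ?, ?)`
  ).run(id, content, vec, category, project, Math.floor(Date.now() / 1000));

  db.prepare('INSERT INTO memory_vec (id, embedding) VALUES (?, ?)').run(id, vec);
}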
Recall is a single SQL query:
SELECT m.id, m.content, m.category, vec.distance
FROM memory_vec vec
JOIN memories m ON m.id = vec.id
WHERE vec.embedding MATCH ?
ORDER BY vec.distance
LIMIT 10;
For a 50k corpus this completes in under 5ms on commodity hardware. The bottleneck is the embedding step, not the vector search.
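Calling it from Node is just as small. Again a sketch, reusing the db handle and embed() helper from the snippets in this post:

const recallStmt = db.prepare(`
  SELECT m.id, m.content, m.category, vec.distance
    FROM memory_vec vec
    JOIN memories m ON m.id = vec.id
   WHERE vec.embedding MATCH ?
   ORDER BY vec.distance
   LIMIT 10
`);

async function recall(query) {
  // The query embedding goes in as the same float32 BLOB format used at write time.
  const qvec = Buffer.from(new Float32Array(await embed(query)).buffer);
  return recallStmt.all(qvec);
}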
The recency boost
A pure cosine similarity ranking has one big failure mode for developer memory: yesterday's decision and a 6-month-old decision look equally relevant. SynaBun adds an optional recency weight with a 14-day half-life:
def score(memory, similarity, now):
age_days = (now - memory.created_at) / 86400
recency = 0.5 ** (age_days / 14) # half-life of 14 days
return 0.55 * recency + 0.45 * similarity
Toggleable per query. Defaults off for general recall, on for "session boot" recall. The 55% recency / 45% similarity blend was tuned against the same 200-query benchmark.
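To get a feel for how the blend behaves, plug a few numbers into the same formula:

const score = (ageDays, similarity) =>
  0.55 * 0.5 ** (ageDays / 14) + 0.45 * similarity;

score(0, 0.60);  // 0.82  (fresh memory, decent match)
score(14, 0.60); // 0.545 (same match, one half-life old)
score(90, 0.95); // ~0.43 (near-perfect match that has gone stale)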
Limitations of all-MiniLM-L6-v2
It is not magic. Three things to know:
- 256-token max. Anything longer gets truncated. SynaBun chunks on save (250-token windows, 50-token overlap) for memories that exceed this; a sketch of the chunking follows this list.
- English-heavy. Recall on Portuguese, Spanish, or Mandarin is noticeably worse. For multilingual users, paraphrase-multilingual-MiniLM-L12-v2 is the upgrade (118MB).
- No code-specific training. For pure code search (e.g., "find the function that does X"), a code-specific encoder like unixcoder-base outperforms it. But SynaBun memories are usually English notes about code, not code itself, so the general-purpose encoder works better in practice.
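The chunking itself is nothing exotic. Here is a rough sketch using the transformers.js tokenizer; the window and overlap sizes match the numbers above, but the helper is illustrative rather than SynaBun's exact implementation.

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/all-MiniLM-L6-v2');

// Split long text into ~250-token windows with a 50-token overlap,
// then decode each window back to text before embedding it.
function chunkText(text, windowSize = 250, overlap = 50) {
  const ids = tokenizer.encode(text);
  if (ids.length <= windowSize) return [text];

  const chunks = [];
  for (let start = 0; start < ids.length; start += windowSize - overlap) {
    const window = ids.slice(start, start + windowSize);
    chunks.push(tokenizer.decode(window, { skip_special_tokens: true }));
    if (start + windowSize >= ids.length) break;
  }
  return chunks;
}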
When you should use OpenAI instead
I am not religious about this. Use cloud embeddings if:
- Your corpus is over 1 million items and recall quality is your bottleneck.
- You are running on infrastructure that cannot host a 22MB model (rare).
- You need cross-lingual recall and do not want to ship the larger multilingual model.
For 95% of MCP memory use cases — a single developer with a few thousand session notes — local is the right answer.
How to use it in your own MCP server
If you are building your own MCP server and want local embeddings, the path is short:
import { pipeline } from '@xenova/transformers';

// Loads the quantized ONNX build (the ~22MB artifact); set quantized: false
// for the full-precision weights (~90MB) if you want to trade size for a bit of accuracy.
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { quantized: true }
);

async function embed(text) {
  const output = await embedder(text, {
    pooling: 'mean',   // average the token embeddings into one vector
    normalize: true    // unit length, so dot product == cosine similarity
  });
  return Array.from(output.data); // plain 384-element array
}
That is the whole integration. The model downloads on first use (a one-time fetch, cached locally by transformers.js), and from then on every embed call is local and fast.
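Because embed() returns normalized vectors, comparing two memories needs nothing more than a dot product:

function cosine(a, b) {
  // The vectors are already unit length, so the dot product is the cosine similarity.
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

const a = await embed('switched the session cache from Redis to SQLite');
const b = await embed('why did we drop Redis for caching?');
console.log(cosine(a, b)); // should score well above an unrelated pair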
Closing
The cloud-first default for AI tooling is leaving real value on the table. A 22MB model that runs on every laptop made in the last 5 years gets you 95% of the recall quality with 0% of the latency tax and 0% of the cost. For developer memory specifically — where the corpus is small, the queries are frequent, and the privacy stakes matter — local wins.
If you want to see this in action, SynaBun is open source under Apache 2.0. The full memory pipeline lives in mcp-server/src/memory/. Pull it apart, fork it, ship your own.
Related reading: