When I started SynaBun I had to pick an embedding model. The default move would have been text-embedding-3-small or ada-002. OpenAI hosts them. They are good. They are easy.
I picked a 22MB model called all-MiniLM-L6-v2 instead. It runs on the user's CPU. It has no API key. It works on a plane.
This post is the long version of why, what the tradeoffs are, and how it actually performs.
What all-MiniLM-L6-v2 is
all-MiniLM-L6-v2 is a sentence-transformers model, originally trained by the UKP Lab at TU Darmstadt. The relevant numbers:
- Architecture: 6-layer MiniLM (a distilled BERT variant)
- Embedding dimensions: 384
- Max sequence length: 256 tokens (longer inputs are truncated)
- Model size: ~22MB on disk, ~80MB in RAM with the tokenizer + ONNX runtime
- License: Apache 2.0
- Inference speed (CPU): ~12ms per 256-token chunk on an M1 MacBook Air
The training corpus is over 1 billion sentence pairs from across the web (Reddit, Quora, S2ORC, NLI, MS MARCO, Yahoo Answers, and more). It is a general-purpose encoder, not a code-specific one.
Why local matters for MCP memory
The Model Context Protocol (MCP) lets AI coding assistants like Claude Code, Codex, OpenCode, and Gemini call out to external tools. Memory is one of the most important tools because every "remember this" or "recall what I said about X" call has to round-trip somewhere.
When that somewhere is a cloud API, three things happen:
- Latency tax. Every recall costs 200-400ms before the model even sees the result.
- Cost tax. Embeddings are cheap individually but expensive in aggregate. A heavy day of coding can hit 1000+ recall ops.
- Privacy tax. Your code, project notes, debugging traces, and design decisions get sent to a third party.
For a developer using Claude Code 8 hours a day, those numbers compound fast. Latency alone is the killer — you cannot stay in flow if every memory recall stalls the session.
Local embeddings remove all three. The cost is on-device compute, which on a modern laptop is essentially free.
The OpenAI alternative — and why it loses
Here is the head-to-head:
| | all-MiniLM-L6-v2 | text-embedding-3-small | ada-002 |
|---|---|---|---|
| Dimensions | 384 | 1536 (configurable down to 512) | 1536 |
| Cost | $0 | $0.02 per 1M tokens | $0.10 per 1M tokens |
| Latency | ~12ms (CPU) | 200-400ms (network) | 200-400ms (network) |
| Offline | Yes | No | No |
| API key | None | Required | Required |
| Model size | 22MB | N/A (cloud) | N/A (cloud) |
| License | Apache 2.0 | Proprietary | Proprietary |
For developer memory, the only real argument for OpenAI is recall quality. So I benchmarked.
Benchmark setup
I built a corpus of 12,000 SynaBun memories from my own development sessions over 8 months. Each memory averages 280 characters and includes things like:
- Bug fix descriptions
- Architecture decisions
- API quirks
- Code patterns
- Session summaries
- File-level notes
I then wrote 200 recall queries against this corpus. Each query has a known-good answer in the corpus (manually labeled). I measured recall@5 — does the correct memory show up in the top 5 results?
from sklearn.metrics.pairwise import cosine_similarity

def evaluate(model, corpus, queries):
    # Embed the whole corpus once: encode() returns an (N, 384) matrix.
    embeddings = model.encode([m.content for m in corpus])
    correct = 0
    for query, expected_id in queries:
        q_emb = model.encode([query])                      # shape (1, 384)
        scores = cosine_similarity(q_emb, embeddings)[0]   # shape (N,)
        top5 = scores.argsort()[-5:][::-1]                 # indices of the 5 best scores
        if expected_id in [corpus[i].id for i in top5]:
            correct += 1
    return correct / len(queries)
Results
| Model | Recall@5 | Recall@1 | p50 latency | Cost per 1k queries |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (CPU) | 78% | 54% | 12ms | $0.00 |
| text-embedding-3-small | 83% | 61% | 240ms | $0.04 |
| ada-002 | 75% | 51% | 280ms | $0.20 |
| bge-small-en-v1.5 (CPU) | 79% | 56% | 14ms | $0.00 |
| gte-small (CPU) | 80% | 57% | 13ms | $0.00 |
Three takeaways:
- The OpenAI v3 model is genuinely better — 5 points on recall@5, 7 points on recall@1.
- The 5-point gap costs you 240ms of latency, 4 cents per 1000 queries, and an internet connection.
- bge-small-en-v1.5 and gte-small are slightly better than all-MiniLM-L6-v2 but require 130MB and 70MB respectively. For a 5MB total ship in the Node MCP server, MiniLM wins.
Why 384 dimensions is enough
The conventional wisdom is "more dimensions = better recall". The benchmark above shows the gain is real but small, and every extra dimension trades against memory, storage, and compute.
For a typical SynaBun user with under 100,000 memories:
- 384 dims × 4 bytes × 100k memories = 150MB of vectors
- 1536 dims × 4 bytes × 100k memories = 600MB of vectors
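The back-of-the-envelope math, written out (plain float32, no quantization):

const vectorBytes = (dims, count) => dims * 4 * count; // 4 bytes per float32 component

vectorBytes(384, 100_000);  // 153,600,000 bytes ≈ 150MB
vectorBytes(1536, 100_000); // 614,400,000 bytes ≈ 600MB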
The 4x storage difference matters when SQLite is doing full-table scans for cosine similarity (more on that below). For corpora under 1M items, a brute-force cosine scan over 384-dim vectors is faster end-to-end than 1536-dim vectors behind a hosted HNSW index.
Storage — sqlite-vec instead of a vector DB
SynaBun uses SQLite + the sqlite-vec extension. No Pinecone. No Weaviate. No Qdrant. Just a single SQLite file at ~/.synabun/memory.db.
Why:
- Zero infra. Users do not run a service.
- Backupable. Copy the file. Done.
- Diffable. SQLite is a real database. You can .dump it.
- Embeddable. The Node MCP server links it natively.
The schema looks like this:
CREATE TABLE memories (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
embedding BLOB NOT NULL, -- 384 floats packed as 1536 bytes
category TEXT,
project TEXT,
importance INTEGER,
tags TEXT, -- JSON array
created_at INTEGER,
accessed_at INTEGER,
access_count INTEGER DEFAULT 0
);
CREATE VIRTUAL TABLE memory_vec USING vec0(
  id TEXT PRIMARY KEY,
  embedding float[384]
);
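For a concrete picture of the write path, here is a minimal sketch in Node using better-sqlite3 and the sqlite-vec npm package. The embed() helper is the transformers.js function shown later in this post; the rest of the names are illustrative, not SynaBun's actual code.

import Database from 'better-sqlite3';
import * as sqliteVec from 'sqlite-vec';
import { randomUUID } from 'node:crypto';

const db = new Database(`${process.env.HOME}/.synabun/memory.db`);
sqliteVec.load(db); // registers the vec0 virtual table module

async function saveMemory(content, category, project) {
  const id = randomUUID();
  // 384 floats packed into a 1536-byte BLOB, matching the schema above.
  const vec = Buffer.from(new Float32Array(await embed(content)).buffer);

  db.prepare(
    `INSERT INTO memories (id, content, embedding, category, project, created_at)
     VALUES (?, ?, ?, ?, ?, ?)`
  ).run(id, content, vec, category, project, Math.floor(Date.now() / 1000));

  db.prepare('INSERT INTO memory_vec (id, embedding) VALUES (?, ?)').run(id, vec);
}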
Recall is a single SQL query:
SELECT m.id, m.content, m.category, vec.distance
FROM memory_vec vec
JOIN memories m ON m.id = vec.id
WHERE vec.embedding MATCH ?
ORDER BY vec.distance
LIMIT 10;
For a 50k corpus this completes in under 5ms on commodity hardware. The bottleneck is the embedding step, not the vector search.
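Calling it from Node is just as small. Again a sketch, reusing the db handle and embed() helper from the snippets in this post:

const recallStmt = db.prepare(`
  SELECT m.id, m.content, m.category, vec.distance
    FROM memory_vec vec
    JOIN memories m ON m.id = vec.id
   WHERE vec.embedding MATCH ?
   ORDER BY vec.distance
   LIMIT 10
`);

async function recall(query) {
  // The query embedding goes in as the same float32 BLOB format used at write time.
  const qvec = Buffer.from(new Float32Array(await embed(query)).buffer);
  return recallStmt.all(qvec);
}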
The recency boost
A pure cosine similarity ranking has one big failure mode for developer memory: yesterday's decision and a 6-month-old decision look equally relevant. SynaBun adds an optional recency weight with a 14-day half-life:
def score(memory, similarity, now):
age_days = (now - memory.created_at) / 86400
recency = 0.5 ** (age_days / 14) # half-life of 14 days
return 0.55 * recency + 0.45 * similarity
Toggleable per query. Defaults off for general recall, on for "session boot" recall. The 55% recency / 45% similarity blend was tuned against the same 200-query benchmark.
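To get a feel for how the blend behaves, plug a few numbers into the same formula:

const score = (ageDays, similarity) =>
  0.55 * 0.5 ** (ageDays / 14) + 0.45 * similarity;

score(0, 0.60);  // 0.82  (fresh memory, decent match)
score(14, 0.60); // 0.545 (same match, one half-life old)
score(90, 0.95); // ~0.43 (near-perfect match that has gone stale)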
Limitations of all-MiniLM-L6-v2
It is not magic. Three things to know:
- 256-token max. Anything longer gets truncated. SynaBun chunks on save (250-token windows, 50-token overlap) for memories that exceed this; a sketch of the chunking follows this list.
- English-heavy. Recall on Portuguese, Spanish, or Mandarin is noticeably worse. For multilingual users, paraphrase-multilingual-MiniLM-L12-v2 is the upgrade (118MB).
- No code-specific training. For pure code search (e.g., "find the function that does X"), a code-specific encoder like unixcoder-base outperforms it. But SynaBun memories are usually English notes about code, not code itself, so the general-purpose encoder works better in practice.
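The chunking itself is nothing exotic. Here is a rough sketch using the transformers.js tokenizer; the window and overlap sizes match the numbers above, but the helper is illustrative rather than SynaBun's exact implementation.

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/all-MiniLM-L6-v2');

// Split long text into ~250-token windows with a 50-token overlap,
// then decode each window back to text before embedding it.
function chunkText(text, windowSize = 250, overlap = 50) {
  const ids = tokenizer.encode(text);
  if (ids.length <= windowSize) return [text];

  const chunks = [];
  for (let start = 0; start < ids.length; start += windowSize - overlap) {
    const window = ids.slice(start, start + windowSize);
    chunks.push(tokenizer.decode(window, { skip_special_tokens: true }));
    if (start + windowSize >= ids.length) break;
  }
  return chunks;
}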
When you should use OpenAI instead
I am not religious about this. Use cloud embeddings if:
- Your corpus is over 1 million items and recall quality is your bottleneck.
- You are running on infrastructure that cannot host a 22MB model (rare).
- You need cross-lingual recall and do not want to ship the larger multilingual model.
For 95% of MCP memory use cases — a single developer with a few thousand session notes — local is the right answer.
How to use it in your own MCP server
If you are building your own MCP server and want local embeddings, the path is short:
import { pipeline } from '@xenova/transformers';

// Loads the quantized ONNX build (the ~22MB artifact); set quantized: false
// for the full-precision weights (~90MB) if you want to trade size for a bit of accuracy.
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { quantized: true }
);

async function embed(text) {
  const output = await embedder(text, {
    pooling: 'mean',   // average the token embeddings into one vector
    normalize: true    // unit length, so dot product == cosine similarity
  });
  return Array.from(output.data); // plain 384-element array
}
That is the whole integration. The model downloads on first use (a one-time fetch, cached locally by transformers.js), and from then on every embed call is local and fast.
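Because embed() returns normalized vectors, comparing two memories needs nothing more than a dot product:

function cosine(a, b) {
  // The vectors are already unit length, so the dot product is the cosine similarity.
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

const a = await embed('switched the session cache from Redis to SQLite');
const b = await embed('why did we drop Redis for caching?');
console.log(cosine(a, b)); // should score well above an unrelated pair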
Closing
The cloud-first default for AI tooling is leaving real value on the table. A 22MB model that runs on every laptop made in the last 5 years gets you 95% of the recall quality with 0% of the latency tax and 0% of the cost. For developer memory specifically — where the corpus is small, the queries are frequent, and the privacy stakes matter — local wins.
If you want to see this in action, SynaBun is open source under Apache 2.0. The full memory pipeline lives in mcp-server/src/memory/. Pull it apart, fork it, ship your own.
Related reading: