The default for AI tooling is cloud-first. APIs, hosted services, "we manage the infrastructure for you". For most things this is the right default — model inference is genuinely hard to run locally at the quality cloud providers offer.
AI memory is the exception.
For AI coding agents specifically — Claude Code, Codex, Cursor, OpenCode, Gemini in agent mode — local memory wins decisively. Not because cloud is bad, but because the workload pattern of dev memory is uniquely well-suited to local inference and uniquely poorly-suited to cloud APIs.
This post is the case for that, with numbers.
Three taxes that cloud memory imposes
Every cloud memory API charges you three taxes you cannot opt out of:
1. The latency tax
A cloud memory call is two network round trips: one to embed your query, one to retrieve the matching memories. On broadband:
- Embedding API call: 200-400ms
- Vector search call: 100-200ms
- Total: 300-600ms per recall
A local memory call is two CPU operations:
- Local embedding: 12ms
- SQLite vector search: 5ms
- Total: 17ms per recall
The difference is roughly 18-35x. On a single recall it does not matter. In a coding session where you call recall 50-100 times an hour, that is 15-60 seconds of pure latency overhead per hour, which adds up to several minutes over a working day.
It is also the difference between memory that feels invisible and memory that feels like it has a loading state. Interactive flow dies fast when every memory check stalls.
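For the curious, here is a minimal sketch of what the local path can look like, assuming transformers.js for the embedding and the sqlite-vec extension loaded into better-sqlite3. The table name vec_memories and the surrounding shape are illustrative, not SynaBun's actual internals:

```ts
// local-recall.ts: a fully local memory recall, no network involved.
// Assumes better-sqlite3, sqlite-vec, and @xenova/transformers are installed,
// and that a vec0 table `vec_memories` already holds 384-dim embeddings.
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import { pipeline } from "@xenova/transformers";

const db = new Database("memory.db");
sqliteVec.load(db); // registers the vec0 virtual table and vector functions

// The ~22MB all-MiniLM-L6-v2 model downloads once, then loads from disk cache.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

export async function recall(query: string, k = 5) {
  // Step 1: embed locally on CPU (~12ms once the model is warm).
  const out = await embed(query, { pooling: "mean", normalize: true });
  const blob = Buffer.from(new Float32Array(out.data).buffer);

  // Step 2: KNN search inside SQLite (~5ms on a few thousand memories).
  return db
    .prepare(
      `SELECT rowid, distance
       FROM vec_memories
       WHERE embedding MATCH ?
       ORDER BY distance
       LIMIT ?`,
    )
    .all(blob, k);
}
```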
2. The cost tax
Cloud embeddings are cheap but not free:
- OpenAI text-embedding-3-small: $0.02 per 1M tokens
- Cohere embed-english-v3.0: $0.10 per 1M tokens
- Voyage embeddings: $0.12 per 1M tokens
Heavy coding generates roughly 100,000 tokens of memory-recall traffic per day (individual queries are short, but volumes are high). At 250 working days per year:
- 25M tokens/year × $0.02/1M tokens = $0.50/year/dev with text-embedding-3-small
- Plus the storage tier of the hosted memory service: typically $20-100/month per workspace
For a single developer it is rounding error. For a team of 50 it is real money — and the value you get over local memory is essentially zero on the dev memory workload.
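If you want to sanity-check those figures for your own team, the arithmetic fits in a dozen lines. Every constant below is an assumption carried over from this section, including one paid workspace per developer, which you should adjust to your actual plan:

```ts
// cloud-cost.ts: back-of-envelope model using the figures from this section.
// ASSUMPTIONS: 100k recall tokens/dev/day, 250 working days, $0.02/1M tokens
// (text-embedding-3-small), and one $20/month workspace per developer.
const TOKENS_PER_DEV_PER_DAY = 100_000;
const WORKING_DAYS = 250;
const PRICE_PER_MILLION = 0.02;
const WORKSPACE_PER_MONTH = 20; // low end of hosted memory tiers

const team = 50;
const embeddingPerYear =
  (TOKENS_PER_DEV_PER_DAY * WORKING_DAYS * team * PRICE_PER_MILLION) / 1_000_000;
const storagePerYear = WORKSPACE_PER_MONTH * 12 * team;

console.log(`embeddings: $${embeddingPerYear}/yr`); // $25/yr for the whole team
console.log(`storage:    $${storagePerYear}/yr`);   // $12000/yr: the real cost
```

The embedding spend is noise; the hosted storage tier is where the money goes.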
3. The privacy tax
Cloud memory means your code, project notes, debugging traces, design decisions, and (often) full conversation history get sent to a third party. Even if the provider is reputable:
- It is one more vendor in your data flow
- Compliance reviews get harder
- You depend on the provider's retention policy
- You depend on the provider's existence
For consumer apps this is fine. For dev work at any company that does compliance reviews, it is a meaningful headache.
Local memory makes all three taxes go to zero.
Why dev memory is different from production memory
Cloud-first AI memory is a great fit for production agents — customer support bots, sales copilots, in-app AI assistants. Those workloads have:
- Centralized state (memory belongs to the product, not the user)
- Heavy concurrency (1000s of users querying memory simultaneously)
- Cross-instance consistency requirements
- Compliance + audit needs that benefit from a managed provider
Dev memory is the opposite. Single user. Single device. No concurrency. No multi-instance state. The "memory" is conversational notes about your own codebase, not user data you have a duty of care for.
For that workload, the architecture that wins is also the simplest: a single SQLite file on disk + a 22MB embedding model that runs on CPU. SynaBun ships with exactly that. Mem0 self-hosted gets close. OpenMemory uses a similar pattern. The cloud-first products (Mem0 hosted, Letta Cloud, Zep Cloud) are all paying for infrastructure they do not need for this use case.
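Concretely, the entire storage layer can be two tables in one file. A sketch of what that schema might look like, assuming better-sqlite3 plus sqlite-vec; the column set is illustrative, not any project's actual schema:

```ts
// schema.ts: the full storage layer for a local dev-memory store is one file.
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";

const db = new Database("memory.db");
sqliteVec.load(db);

db.exec(`
  -- One row per memory: the text plus the metadata used for navigation.
  CREATE TABLE IF NOT EXISTS memories (
    id         INTEGER PRIMARY KEY,
    content    TEXT NOT NULL,
    category   TEXT,               -- e.g. 'decision', 'bug-fix', 'preference'
    project    TEXT,
    importance INTEGER DEFAULT 1,
    created_at TEXT DEFAULT (datetime('now'))
  );

  -- Parallel vector table: 384 dims matches all-MiniLM-L6-v2 output.
  CREATE VIRTUAL TABLE IF NOT EXISTS vec_memories USING vec0(
    embedding float[384]
  );
`);
```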
Benchmark: local vs cloud, 200-query test
Same M1 MacBook Air, same 12,000-memory corpus, same 200 evaluation queries. Five configurations:
| Config | Embed model | Vector store | Mean latency | Recall@5 |
|---|---|---|---|---|
| Local SQLite | all-MiniLM-L6-v2 | sqlite-vec | 17ms | 78% |
| Mem0 self-hosted | OpenAI v3-small | Qdrant + Postgres | 95ms | 83% |
| Mem0 cloud | OpenAI v3-small | Mem0 SaaS | 280ms | 83% |
| Zep cloud | OpenAI v3-small | Zep SaaS | 240ms | 81% |
| Letta self-hosted | OpenAI v3-small | pgvector | 110ms | 80% |
The recall quality gap (78% vs 83%) is 5 percentage points. The latency gap is 6-16x.
For dev memory, that is not a close call. The 5-point recall delta translates to "occasionally I have to phrase the query slightly differently". The 16x latency delta translates to "memory feels broken vs memory feels native".
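The measurement itself is easy to reproduce. A sketch of the loop, assuming a recall function like the one sketched earlier and a hand-labeled set of query/relevant-id pairs; the evaluation data is the part you supply yourself:

```ts
// bench.ts: mean latency and Recall@5 for any memory backend.
type EvalCase = { query: string; relevantId: number };

async function benchmark(
  recall: (q: string, k: number) => Promise<{ rowid: number }[]>,
  cases: EvalCase[],
) {
  let hits = 0;
  let totalMs = 0;

  for (const { query, relevantId } of cases) {
    const start = performance.now();
    const results = await recall(query, 5);
    totalMs += performance.now() - start;

    // Recall@5 with one labeled answer per query: a hit if the known-relevant
    // memory shows up anywhere in the top five results.
    if (results.some((r) => r.rowid === relevantId)) hits++;
  }

  return {
    meanLatencyMs: totalMs / cases.length,
    recallAt5: hits / cases.length,
  };
}
```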
What local-first does NOT mean
Three things people get wrong about "local-first AI memory":
- It does not mean offline-only. SynaBun, Mem0 self-hosted, and Letta self-hosted all work fine offline, but they can also sync to git remotes, S3, or whatever you point them at if you want backups (see the sketch after this list).
- It does not mean you give up search quality. Modern small models (all-MiniLM-L6-v2, bge-small, gte-small) are within 5-7% of OpenAI v3 on most benchmarks, and the gap narrows every six months.
- It does not mean you give up the UI. SynaBun ships a 3D Neural Interface for browsing memory locally. Mem0 has a self-hosted dashboard. A cloud-only UI experience is a choice, not a constraint.
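On the backup point specifically: because the whole store is one SQLite file, "sync" reduces to snapshotting a file and pushing it to whatever remote you already trust. A sketch using better-sqlite3's online backup API; the paths and the git remote are illustrative:

```ts
// backup.ts: snapshot the live store, then sync the file like any other file.
import Database from "better-sqlite3";
import { mkdirSync } from "node:fs";
import { execSync } from "node:child_process";

const db = new Database("memory.db");
mkdirSync("backups", { recursive: true });

// Online backup: safe to run while the memory server has the file open.
await db.backup("backups/memory-snapshot.db");

// From here it is plain file sync. Git shown; S3 or rsync work the same way.
execSync("git -C backups add memory-snapshot.db");
execSync('git -C backups commit -m "memory snapshot"');
execSync("git -C backups push");
```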
When cloud is the right answer
I am not religious about this. Use cloud memory if:
- You are building a multi-tenant product where memory belongs to the product, not the developer
- Your team has a hard "no SQLite in prod" rule (more common than you would expect)
- You need cross-region replication out of the box
- You want a managed service to handle backups + scale
- You are running on infrastructure that cannot host a small model (rare in 2026)
For everything else, local is the simpler answer.
The setup tax is also lower
People assume "self-hosted" means complex setup. For dev memory specifically, it is the opposite.
SynaBun setup:
```bash
npm install -g synabun
synabun start
```
That is it. SQLite is created on first use. The embedding model downloads once and caches. No Postgres. No Qdrant. No Docker compose.
Mem0 self-hosted setup:
```bash
docker compose up -d   # starts Qdrant + Postgres + Mem0 API
# configure connection strings in .env
# wire MCP server
```
Functional but heavier.
Letta self-hosted setup:
```bash
docker compose up -d   # Postgres + Letta API + agent runtime
# configure model provider
# wire MCP wrapper
```
The complexity does not come from "local vs cloud"; it comes from the storage and service architecture each project chose. SynaBun's choice of SQLite is the actual reason its setup is one command.
What I would build if I were starting today
If I were building a new MCP memory server in 2026, I would copy these decisions verbatim:
- SQLite + sqlite-vec. Zero ops. Backupable. Embeddable. Diffable.
- all-MiniLM-L6-v2 or bge-small-en-v1.5 by default. Free, local, fast. Provide an OpenAI/Voyage adapter for users who want it.
- MCP-native, not a wrapper. Be a real MCP server. Implement the tools fully.
- Categorical memory. Vectors alone are too flat. Categories + projects + tags + importance let users actually navigate memory.
- Local UI. Browser-based, served from the same Node process. No separate Electron app, no required cloud login.
That is roughly the SynaBun architecture. It would also describe a cleaner version of OpenMemory or a stripped-down Mem0 self-hosted.
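To make the "MCP-native, not a wrapper" point concrete: with the official TypeScript SDK, a real memory tool is a few dozen lines. A sketch, assuming @modelcontextprotocol/sdk and the local recall function sketched earlier; the tool name and response shape are illustrative:

```ts
// server.ts: a native MCP memory server, not a proxy around a cloud API.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { recall } from "./local-recall.js"; // the local SQLite path from above

const server = new McpServer({ name: "local-memory", version: "0.1.0" });

server.tool(
  "recall_memories",
  { query: z.string(), limit: z.number().default(5) },
  async ({ query, limit }) => {
    const results = await recall(query, limit);
    return { content: [{ type: "text", text: JSON.stringify(results) }] };
  },
);

// stdio transport: the coding agent spawns this process and talks JSON-RPC.
await server.connect(new StdioServerTransport());
```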
The category is moving local
Watch the next 12 months. The trend lines are clear:
- Small embedding models keep getting better. bge-large-en-v1.5 is already within 1-2% of OpenAI v3 on most benchmarks at 320MB.
- SQLite vector extensions (sqlite-vec, usearch-sqlite) keep getting faster.
- MCP keeps standardizing. Wrappers around cloud services will lose ground to native MCP servers that treat memory as a first-class concept.
- Privacy-conscious teams are explicitly asking for local-first AI tooling now. This was a fringe ask in 2024. It is mainstream in 2026.
The cloud-first AI memory products are not going away — they will own the production-agent and multi-tenant-app categories. But for the dev workload, local has already won. Most developers just have not switched yet because the default is sticky.
Closing
If you are using a cloud memory API for your AI coding workflow, try a local one for a week. The latency difference is the kind of thing you cannot un-feel. Once memory becomes invisible, going back to a 300ms loading spinner on every recall feels worse than it sounds.
SynaBun is the local-first option I built. OpenMemory is a credible alternative. Letta and Zep are great if you need their specific features, but you are paying a latency tax for them.
Pick local. Your flow will thank you.