2026-05-17

Lightweight RAG: embeddings, SQLite, and when to skip a vector DB

Operational tradeoffs for RAG at small scale: vector size, storage, and thresholds before adopting dedicated vector infra.

A dedicated vector database (Pinecone, Weaviate, hosted Qdrant, …) pays off when you need millions of vectors, millisecond metadata filters, or an ops boundary your team refuses to own. For many MVPs with a few thousand chunks, in-process embeddings + on-disk or RAM similarity is enough and shrinks moving parts.

Chunk count ranges and latency

Order-of-magnitude expectations (not certified benchmarks—profile your own hardware):

N < 5 000 chunks, 1536-dim float32 vectors ≈ 30 MB of floats; naive dot/cosine in RAM is often < 10 ms on a modern CPU without ANN.
N between 10k and 100k: linear scans often push p95 past ~100 ms; you need HNSW/IVF (vector DB feature) or something like hnswlib on the server.
N > 100k for interactive queries: seriously consider a vector engine or at least a serialized on-disk index.

Latency also depends on embedding batching vs on-the-fly calls—cache document embeddings when the corpus is static.

Embedding dimension choice

Always verify the current vendor sheet:

OpenAI text-embedding-3-small supports configurable dimensions (e.g. 512 vs 1536) with documented accuracy tradeoffs in the Embeddings guide.
Open-weight families (e.g. bge, e5) fix dimension per checkpoint—read the Hugging Face model card.

Rule of thumb: for small N, higher dimension often wins recall; for large N, compression / PQ pays off before you scale infra.

Storage patterns: files, SQLite, no SaaS vector

Three MVP patterns:

NumPy / pickle / .npy + JSON manifest mapping chunk_id → offset. Fine for demos; awkward under concurrent writers.
SQL table with chunk_id, embedding BLOB (packed float32), content TEXT, metadata JSON. Query strategy: narrow with metadata filters in SQL, then run similarity only on that subset—avoid global scans when you can filter by source_id.
SQLite + vector extension (e.g. sqlite-vec): ACID + ANN if your deploy allows loading extensions (Nitro/Node policies vary).

Skip the dedicated vector DB when the team is tiny, N stays below your profiled threshold, you do not need multi-region vector replicas, and ops budget is tight.

When to migrate

Objective signals:

p95 retrieval exceeds the product budget (e.g. 200 ms) even after batching + caching.
Corpus grows weekly past the threshold you measured.
You need managed hybrid search (BM25 + vector).
Multiple apps must share one index with different ACLs.

Antipattern: RAG without evaluation

If you never measure hit rate or MRR on a fixed question set, you do not know whether “lightweight RAG” works. Keep 20–50 golden queries with expected chunk ids and rerun the script after every embedding or splitter change.

Top-k pseudocode (cosine, RAM)

def top_k(query_vec, matrix, k):
    # matrix: float32 [n, d], rows L2-normalized
    scores = matrix @ query_vec  # [n]
    return argpartition(scores, -k)[-k:]

Normalize query and rows when replacing explicit cosine with dot product.

sqlite-vec: vectors inside SQLite without a hosted Pinecone

sqlite-vec (Alex Garcia) is a SQLite C extension that adds vector types (vec0 virtual tables), distance helpers (vec_distance_cosine, vec_distance_L2, …), and KNN-style filters like … match :query and k = 10. It shines when you want one .db file with relational metadata plus ANN without another service. The repo flags pre-v1 status—pin the extension artifact like any native dependency and expect breaking changes.

Other operational models (e.g. remote Turso / libSQL) trade local files for HTTP—different constraints.

Pure SQL brute-force ORDER BY distance works until ~10k–50k rows for interactive UX; beyond that you need ANN (sqlite-vec) or an external engine.

Summary

For a few thousand chunks, cached embeddings + linear similarity or SQLite is often sufficient. You reach for a vector DB when N, latency, or features (filters, hybrid) exceed what you want to maintain in application code—not because every tutorial defaults to Pinecone.

References: your embedding model docs (dimension, max tokens, languages) and the chunking guidance of whichever stack you ship.

Want to ship ideas like these into your product?