This is the implementation half of a longer argument. The pricing-and-latency comparison against Snowflake Cortex Search lives in Cortex Search vs Hybrid SQLite RAG; the strategic context is in Why Snowflake's Bet on Streamlit Just Works. Here we just build the thing.
The goal: a retrieval system that combines keyword (BM25) and semantic (vector) search, fuses the rankings with Reciprocal Rank Fusion, and runs entirely inside a single SQLite file. No servers, no API keys (after embedding), no monthly bill. Roughly 200 lines of Python end to end.
The pieces
Four components, all free:
- SQLite — the database engine, built into Python's standard library.
- FTS5 — SQLite's full-text search module, bundled with most modern SQLite builds. We'll use the trigram tokenizer for language-agnostic matching with reasonable BM25 scoring.
- sqlite-vec — a SQLite extension by Alex Garcia that adds vector similarity search as a virtual table. Pip-installable.
- An embedding model — anything that produces a fixed-dimension vector. For this walkthrough we use OpenAI's text-embedding-3-small (1536 dims), but a local model works identically.
pip install sqlite-vec openai
That is the entire dependency tree.
Schema
Three tables. The base table holds the raw documents. The FTS5 virtual table holds the searchable text. The sqlite-vec virtual table holds the embeddings. They are linked by rowid.
import sqlite3
import sqlite_vec
def init_db(path: str) -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
conn.executescript("""
CREATE TABLE IF NOT EXISTS docs (
id INTEGER PRIMARY KEY,
title TEXT,
content TEXT,
metadata TEXT
);
CREATE VIRTUAL TABLE IF NOT EXISTS docs_fts
USING fts5(
content,
content='docs',
content_rowid='id',
tokenize='trigram'
);
CREATE VIRTUAL TABLE IF NOT EXISTS docs_vec
USING vec0(
embedding float[1536]
);
CREATE TRIGGER IF NOT EXISTS docs_ai AFTER INSERT ON docs BEGIN
INSERT INTO docs_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER IF NOT EXISTS docs_ad AFTER DELETE ON docs BEGIN
INSERT INTO docs_fts(docs_fts, rowid, content)
VALUES('delete', old.id, old.content);
END;
CREATE TRIGGER IF NOT EXISTS docs_au AFTER UPDATE ON docs BEGIN
INSERT INTO docs_fts(docs_fts, rowid, content)
VALUES('delete', old.id, old.content);
INSERT INTO docs_fts(rowid, content) VALUES (new.id, new.content);
END;
""")
return conn
A few notes on the choices.
Trigram tokenizer. FTS5's default tokenizer is Unicode-aware but assumes whitespace-delimited words. The trigram tokenizer slices text into overlapping three-character windows, which works for Japanese, Chinese, code identifiers, and partial-word matches without any language-specific setup. The trade is index size — trigrams are roughly 2–3x larger than word indexes — and the loss of some linguistic niceties like stemming. For most internal-search workloads, trigram is the right default.
Triggers for FTS sync. FTS5 is a "contentless" or "content-linked" virtual table; it doesn't store its own copy of the text by default if you point it at a content table. The triggers keep the index in sync on insert, update, and delete. Cortex Search does this for you with TARGET_LAG; here we do it ourselves and it happens synchronously.
vec0 virtual table. sqlite-vec provides several index types. vec0 is the simplest — a flat in-memory index that performs an exact nearest-neighbor scan. For corpora up to a few hundred thousand vectors this is plenty fast (single-digit milliseconds). For larger sets, sqlite-vec also supports quantized variants and partitioned indexes.
Indexing
Insert documents, generate embeddings, store both. The embedding call is the only step that touches an external service (or a local model server).
import json
from openai import OpenAI
client = OpenAI()
def embed(text: str) -> list[float]:
resp = client.embeddings.create(
model="text-embedding-3-small",
input=text,
)
return resp.data[0].embedding
def index_documents(conn: sqlite3.Connection, documents: list[dict]) -> None:
cur = conn.cursor()
for doc in documents:
cur.execute(
"INSERT INTO docs (title, content, metadata) VALUES (?, ?, ?)",
(doc["title"], doc["content"], json.dumps(doc.get("metadata", {})))
)
doc_id = cur.lastrowid
vec = embed(doc["content"])
cur.execute(
"INSERT INTO docs_vec (rowid, embedding) VALUES (?, ?)",
(doc_id, json.dumps(vec))
)
conn.commit()
In production you'd batch the embedding calls and write to the database in transactions. For a walkthrough this is clearer.
Hybrid retrieval with RRF
The interesting part. Two queries run in parallel — one against FTS5, one against the vector index — and we fuse their rankings using Reciprocal Rank Fusion.
The RRF formula is almost embarrassingly simple:
score(d) = Σ 1 / (k + rank_i(d))
For each document d, sum across all retrieval methods i the reciprocal of (k + rank), where rank is the position of d in method i's result list. The constant k is typically 60 (per the original RRF paper) and it smooths out the impact of being ranked first versus second.
In SQL, that becomes:
def hybrid_search(
conn: sqlite3.Connection,
query: str,
k: int = 10,
rrf_k: int = 60,
) -> list[dict]:
query_embedding = json.dumps(embed(query))
sql = """
WITH fts_results AS (
SELECT rowid AS id, rank, ROW_NUMBER() OVER (ORDER BY rank) AS rrf_rank
FROM docs_fts
WHERE docs_fts MATCH ?
ORDER BY rank
LIMIT 50
),
vec_results AS (
SELECT rowid AS id, distance, ROW_NUMBER() OVER (ORDER BY distance) AS rrf_rank
FROM docs_vec
WHERE embedding MATCH ?
AND k = 50
),
combined AS (
SELECT id, 1.0 / (? + rrf_rank) AS score FROM fts_results
UNION ALL
SELECT id, 1.0 / (? + rrf_rank) AS score FROM vec_results
),
scored AS (
SELECT id, SUM(score) AS rrf_score
FROM combined
GROUP BY id
)
SELECT
d.id,
d.title,
d.content,
d.metadata,
s.rrf_score
FROM scored s
JOIN docs d ON d.id = s.id
ORDER BY s.rrf_score DESC
LIMIT ?;
"""
cur = conn.execute(sql, (query, query_embedding, rrf_k, rrf_k, k))
return [
{
"id": row[0],
"title": row[1],
"content": row[2],
"metadata": json.loads(row[3]) if row[3] else {},
"score": row[4],
}
for row in cur.fetchall()
]
Both retrievers pull their top 50, the CTEs assign per-method ranks, and the union-plus-group-by computes the fused score. One SQL statement. The whole thing runs in single-digit milliseconds on a million-row index.
Why this works as well as it does
People expect a complicated reranker to beat a simple fusion. Often it doesn't.
The reason is that BM25 and dense embeddings fail in different ways. BM25 misses on synonymy and paraphrase — "automobile" won't match "car." Dense embeddings miss on rare terms, numerical IDs, and exact phrase matches — they will happily return a semantically-similar document that uses none of the actual keywords. When both methods independently agree a document is relevant, that document is almost always actually relevant. When they disagree, the one that's right depends on the query.
RRF captures this without any training. A document ranked highly by both retrievers gets a bonus from both terms in the sum. A document ranked highly by only one gets a smaller score. The constant k=60 softens the contribution of low ranks, so the top of each list dominates without completely dismissing the tail.
For a learned cross-encoder reranker to beat this, it has to learn something specific about your domain. On general-purpose retrieval, RRF is within a few percent of state-of-the-art and costs nothing to run.
Filtering and metadata
The query above ignores metadata. In practice you almost always want to filter by category, date, owner, or some other column. Two ways to handle this in this stack.
Pre-filter in the CTEs. Add a JOIN docs ON docs_fts.rowid = docs.id WHERE docs.category = ? inside fts_results. For the vector side, sqlite-vec supports auxiliary columns via the vec0 partition syntax, or you can do the join after the k=50 retrieval and accept that some of the 50 will be filtered out.
Post-filter after fusion. Run hybrid search, then filter the result list in Python. Simpler, but you might end up with fewer than k results if the filter is aggressive. Bump the inner limits to compensate.
For most internal applications, post-filtering is fine. For high-cardinality filters (per-user, per-tenant), pre-filtering matters.
What 200 lines actually buys you
The full file — schema, indexing, hybrid query, a small CLI — comes in just under 200 lines. That includes blank lines and docstrings. On commodity hardware it handles a million chunks comfortably, returns hybrid results in well under 50 milliseconds, and has the same disaster-recovery story as any SQLite database: copy the file.
The variable cost is the embedding API. If you self-host (a small CPU-friendly model like intfloat/multilingual-e5-base works fine for many domains), that goes to zero too.
What you don't get, compared to a managed service like Cortex Search: a learned reranker, automated chunking strategies, governance/RBAC out of the box, and someone else's pager rotating when something breaks. Whether those are worth a monthly bill depends on what you're building. For most personal projects and a surprising number of internal-tool deployments, the answer is no.
The stack scales further than people expect. SQLite handles databases into the hundreds of gigabytes. sqlite-vec is being actively developed and benchmarks competitively with dedicated vector databases at the scales most applications actually need. When you eventually outgrow it, you migrate — but "eventually" is usually further away than the first vendor pitch suggests.
---
For the strategic and cost framing around when not to roll your own, see Cortex Search vs Hybrid SQLite RAG and the hub piece Why Snowflake's Bet on Streamlit Just Works.