Maybe SQLite Is Still Better Than DuckDB for My Workloads

The Conventional Wisdom

Open any data-engineering thread in 2026 and you'll see the same chorus: "DuckDB is the SQLite of analytics, just use it." Migration guides flood the internet. People are converting their SQLite warehouses to DuckDB and posting benchmarks where aggregations run 100x faster.

Here's what often gets glossed over: DuckDB is not a strict upgrade over SQLite. It's a different tool with a different design center. And for a surprising number of real-world projects, defaulting to DuckDB will hurt you.

This article argues for a less fashionable position: SQLite is still the right default for most embedded workloads, especially anything with incremental writes, concurrent scripts, or transactional state. DuckDB is a fantastic addition to your toolkit, but as an analytical layer, not a replacement.

I'll walk through this using a concrete project I've been building: a Japanese music festival database that aggregates Wikipedia, setlist.fm, MusicBrainz, and the SoundCharts API into a Streamlit dashboard. It's a 70-festival, 8,684-artist, 200K-setlist project running on four SQLite files totaling ~563MB. It would be tempting to consolidate it all into one DuckDB file, but the tradeoffs aren't obvious.

What DuckDB Actually Wins At

Before defending SQLite, credit where credit is due. DuckDB is genuinely excellent for:

- Bulk aggregations over millions of rows (10-100x faster than SQLite)
- Direct query of CSV/Parquet files without an import step
- Columnar compression (typically halving disk size for text-heavy data)
- Pandas / Polars integration via .df() methods
- Append-only analytical workloads (think: financial reports, log analysis, research datasets)

If your project has a static dataset and your queries are "GROUP BY year, SUM", DuckDB will outperform SQLite. There's no contest.
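
As a quick illustration, a minimal sketch of that zero-import workflow; the Parquet file and its columns (sales.parquet, year, amount) are made up for the example:

import duckdb

# Query a Parquet file in place -- no CREATE TABLE, no load step.
con = duckdb.connect(":memory:")
df = con.execute("""
    SELECT year, SUM(amount) AS total
    FROM 'sales.parquet'
    GROUP BY year
    ORDER BY year
""").df()   # result lands straight in a pandas DataFrame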

Where DuckDB Hurts: A Real Example

My festival project has a daily cron job that scrapes 1,400 setlist.fm artists, inserting one setlist at a time as the API responds. With SQLite in WAL mode:

- Each INSERT: ~1ms
- Total cron runtime: ~50 minutes (mostly API rate limiting, not DB)
- Streamlit dashboard reads the same DB live, no blocking

Switch this to DuckDB and:

- Each INSERT: 10-50ms (DuckDB's transaction overhead per row)
- Total cron runtime: 4-5 hours
- Streamlit reads block waiting for the writer lock

DuckDB is built around bulk operations. Single-row inserts are an anti-pattern in its design. You can work around it by buffering inserts and using COPY in batches, but now your scraper is more complex than it needs to be.

This is a fundamental mismatch. The setlist.fm API returns one record per HTTP call; the natural code is "fetch, insert, repeat". SQLite does this gracefully. DuckDB's design works against this pattern.
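
For reference, the shape of that loop with SQLite (a sketch; fetch_setlist and the setlists columns are hypothetical stand-ins, and WAL mode is assumed to already be enabled on the file):

import sqlite3

def fetch_setlist(artist_id):
    # Hypothetical stand-in for the setlist.fm call: one HTTP request, one record.
    return {"venue": "...", "event_date": "..."}

tracked_artist_ids = ["..."]              # the ~1,400 artists the cron walks through
conn = sqlite3.connect("music_fes.db")
for artist_id in tracked_artist_ids:      # fetch, insert, repeat
    row = fetch_setlist(artist_id)
    conn.execute(
        "INSERT INTO setlists (artist_id, venue, event_date) VALUES (?, ?, ?)",
        (artist_id, row["venue"], row["event_date"]),
    )
    conn.commit()                         # ~1ms per commit under WAL
conn.close()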

The Concurrency Story Nobody Tells You

During the project's enrichment phase, I had several Python scripts running simultaneously:

- Phase A: SoundCharts festival lookup
- Phase C: Spotify monthly listeners by city (8,684 calls, ~2 hours)
- Phase F: SoundCharts scores + popularity (17,000 calls, ~4 hours)
- Streamlit serving the dashboard

SQLite in WAL mode handled this beautifully:

- Multiple readers + 1 writer concurrent
- Write transactions don't block reads
- Each script gets its own connection, fights only over write commits
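
The setup that makes this work is tiny; each script does something like this on its own connection (a minimal sketch, the timeout value is illustrative):

import sqlite3

conn = sqlite3.connect("sc.db")
conn.execute("PRAGMA journal_mode=WAL")    # per-file and persistent: the writer never blocks readers
conn.execute("PRAGMA busy_timeout=5000")   # if another script is committing, wait up to 5s instead of failing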

DuckDB's concurrency model is global single-writer. In the same scenario:

- Scripts would serialize, not parallelize
- One long-running write operation can block everything
- The total wall-clock time would roughly double

I actually hit this when Phase C was holding a long write transaction on sc.db and Phase F couldn't get a DDL lock for CREATE TABLE. The fix was creating sc_metrics.db as a separate file. The same problem with DuckDB would have been worse, since DuckDB doesn't have WAL's "writer doesn't block readers" property.

Recovery from Crashes

The cron job I mentioned dies regularly:

- HTTP 429 rate limits exceed retry budget
- Network blips (ConnectionError, ReadTimeout)
- Occasionally I kill it with Ctrl-C during testing

SQLite WAL recovers automatically on next connection. The committed transactions are intact, the in-flight transaction rolls back, and life continues. I've never lost data, never had a corrupted file.

DuckDB's storage format is younger and less battle-tested than SQLite's, which has roughly 25 years of production hardening behind it. For an unattended production cron, that maturity gap matters.

Operational Sugar: The CLI Difference

Throughout this project, I've run hundreds of one-liners:

sqlite3 music_fes.db "SELECT COUNT(*) FROM setlists"
sqlite3 sc.db ".schema sc_artist_audience"
sqlite3 setlistfm.db "SELECT * FROM artists LIMIT 10"

DuckDB's CLI is similar but has subtly different ergonomics:

- .schema works but format differs
- .dump syntax is slightly different
- No PRAGMA equivalent for some SQLite-specific introspection

In a debugging session at 2 AM, muscle memory matters. SQLite's CLI is part of my reflexes. Switching to DuckDB just for that imposes friction.

The Surprising Anti-VectorDB Take

Adjacent to the DuckDB question is: "should I use a vector database for semantic search?"

I tested LanceDB earlier this year. The technology is impressive. But the fundamental issue is verifiability: when a vector similarity search returns "X is close to Y", I cannot easily explain why. Embedding models are black boxes; their decisions are not auditable in the way that explicit code is.

For the music festival project, my approach is:

1. Build a synonym table (e.g., "kEYTALK" -> "KEYTALK", "Bz" -> "B'z"). You don't have to write these by hand — Claude Code, Gemini CLI, or a local Nemotron via curl can generate and maintain the entries from your dataset. The artifact stays a plain SQL table: deterministic, auditable, diff-able.
2. Use SQLite FTS5 with trigram tokenization for fuzzy matching (a sketch follows this list)
3. Apply BM25 ranking, which is fully deterministic
4. For Japanese specifically, route through English: take the query, strip particles, normalize verb conjugations (走れば → 走る), translate to English, split on whitespace, translate back. Painful but fully auditable.
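
A minimal sketch of steps 1-3 (the table and column names, artist_synonyms and artists_fts, are hypothetical; the trigram tokenizer and bm25() are standard FTS5 features):

import sqlite3

conn = sqlite3.connect("music_fes.db")
conn.executescript("""
    -- Step 1: explicit, auditable alias mapping
    CREATE TABLE IF NOT EXISTS artist_synonyms (alias TEXT PRIMARY KEY, canonical TEXT);
    -- Step 2: trigram-tokenized FTS index for fuzzy substring matching
    CREATE VIRTUAL TABLE IF NOT EXISTS artists_fts USING fts5(name, tokenize='trigram');
""")

def search_artist(query):
    hit = conn.execute(
        "SELECT canonical FROM artist_synonyms WHERE alias = ?", (query,)
    ).fetchone()
    if hit:                                  # synonym table wins: fully explainable
        return [hit[0]]
    rows = conn.execute(                     # Step 3: BM25-ranked trigram match (lower score = better)
        "SELECT name FROM artists_fts WHERE artists_fts MATCH ? "
        "ORDER BY bm25(artists_fts) LIMIT 5",
        ('"' + query.replace('"', '""') + '"',),  # quote so punctuation like B'z can't break FTS query syntax
    ).fetchall()
    return [r[0] for r in rows]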

This pipeline produces results I can explain to a non-technical user. "The system found a match because 'X' is in our synonym table mapped to 'Y'." That is something a vector search cannot give you.

For dynamically generated data with high cardinality (e.g., "find similar companies based on news descriptions"), vector search may genuinely be the only practical path. But for a static, curated dataset like a festival database, the cost of explainability loss outweighs the convenience gain.

The Hybrid Pattern: Best of Both Worlds

If you do need DuckDB's analytical power without losing SQLite's incremental write strengths, the answer is a hybrid:

- Source data layer: CSV / TSV / Parquet (canonical, version-controlled)
- Live transactional: SQLite + WAL (cron writes, user inputs, daily ingest)
- Analytical view: DuckDB read-only attached to SQLite, or nightly ETL snapshot
- Application: FastAPI / Streamlit reads from whichever layer fits

DuckDB can ATTACH SQLite databases natively in read-only mode:

import duckdb

con = duckdb.connect(":memory:")
con.execute("INSTALL sqlite; LOAD sqlite;")
con.execute("ATTACH 'music_fes.db' AS mf (TYPE sqlite, READ_ONLY);")
con.execute("ATTACH 'sc.db'         AS sc (TYPE sqlite, READ_ONLY);")

Now your heavy aggregation queries get DuckDB's planner over your SQLite data. Zero migration cost, no concurrency conflict, the cron keeps writing to SQLite as before.
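
For example, continuing from the session above (the column name is hypothetical; only the table name comes from the CLI examples earlier):

# The heavy GROUP BY runs on DuckDB's engine; SQLite just serves the pages read-only.
df = con.execute("""
    SELECT artist_name, COUNT(*) AS setlist_count
    FROM mf.setlists
    GROUP BY artist_name
    ORDER BY setlist_count DESC
    LIMIT 20
""").df()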

For really heavy analytical workloads, build a derived DuckDB file via nightly ETL:

duckdb consolidated.duckdb << EOF
INSTALL sqlite; LOAD sqlite;
ATTACH 'sc.db' AS s (TYPE sqlite, READ_ONLY);
COPY FROM DATABASE s TO consolidated;
EOF

Streamlit reads from consolidated.duckdb. Cron writes to SQLite. Refresh nightly. The dashboard becomes 5-10x faster for complex queries, and you've not changed your write path at all.
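
On the Streamlit side the read path stays tiny; a sketch (sc_artist_audience is one of the tables copied over in the snapshot, the caching decorator is standard Streamlit):

import duckdb
import streamlit as st

@st.cache_resource
def analytics_con():
    # Read-only handle to the nightly snapshot; the live SQLite files are never touched here.
    return duckdb.connect("consolidated.duckdb", read_only=True)

df = analytics_con().execute("SELECT COUNT(*) AS n FROM sc_artist_audience").df()
st.dataframe(df)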

A Decision Framework

Here's the actual flowchart I use for new projects:

1. Data source: CSV/Parquet/JSON files (no API, no user input)?
- YES → DuckDB default
- NO → continue
2. Are writes incremental (cron, API ingestion, user-driven)?
- YES → SQLite default (with WAL mode)
- NO → DuckDB default
3. Multiple scripts running concurrently?
- YES → SQLite (better concurrency for mixed read/write)
- NO → either is fine
4. Total dataset size:
- <10GB → SQLite is plenty
- 10-100GB → SQLite if scope-limited queries dominate; DuckDB if not
- >100GB → Consider DuckDB for analytics; PostgreSQL for transactions
5. Need explainability/auditability of results?
- YES → Avoid VectorDBs. Use FTS5 + curated synonyms instead.
6. Long-term operational concerns:
- 25-year stability matters? → SQLite
- Mobile / embedded / IoT? → SQLite (DuckDB doesn't target those environments)
- Pure analytical batch? → DuckDB shines

The 50GB Sharding Counter-Argument

Some critics will say: "But once your data hits 50GB, you must move to DuckDB or PostgreSQL."

Not necessarily. A trick that scales SQLite remarkably far is logical sharding. For a 500GB patent database, splitting by decade is natural:

patents_2015_2024.sqlite  (70GB)
patents_2005_2014.sqlite  (60GB)
patents_1995_2004.sqlite  (40GB)
patents_pre1995.sqlite    (20GB)

The key insight: humans rarely query against the full 500GB. They specify a domain at input. ("Show me semiconductor patents from 2015-2020.") That maps to a single shard or two, each ~70GB.

With proper indexes (date + assignee + classification), each shard answers these scoped queries in under a second. Aggregations across the full corpus are slow, but that's not the typical interactive query pattern.
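
The routing layer is a few lines, not a framework; a sketch using the shard boundaries from above (the query itself is left as a placeholder):

import sqlite3

SHARDS = [                                   # (first_year, last_year, file)
    (2015, 2024, "patents_2015_2024.sqlite"),
    (2005, 2014, "patents_2005_2014.sqlite"),
    (1995, 2004, "patents_1995_2004.sqlite"),
    (0,    1994, "patents_pre1995.sqlite"),
]

def shards_for(year_from, year_to):
    """Return the shard files a query scoped to [year_from, year_to] must touch."""
    return [path for lo, hi, path in SHARDS if year_from <= hi and year_to >= lo]

# "Semiconductor patents from 2015-2020" maps to exactly one ~70GB shard.
for path in shards_for(2015, 2020):
    with sqlite3.connect(path) as conn:
        pass  # run the same parameterized query against each matching shard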

This is the same pattern Stripe and Airbnb use with PostgreSQL. They didn't move to a different database; they sharded.

If your interactive query truly needs the full 500GB scanned (research aggregation, year-over-year trends), sure, build a DuckDB analytical layer for those specific queries. Don't migrate the whole production system for them.

Concrete Take-Aways

For new projects:

- Default to SQLite + WAL mode. You'll rarely outgrow it.
- When you do hit a wall, profile to see if it's a read-side or write-side problem. Read-side → add DuckDB analytical view. Write-side → consider PostgreSQL or sharding.
- Avoid VectorDBs unless you've concluded their semantic similarity is irreplaceable. For most "search" needs, FTS5 + BM25 + a curated synonym table works and is auditable.

For existing SQLite projects:

- Profile your actual slow queries before migrating — hype isn't a reason.
- If a specific dashboard page is slow due to aggregation, ATTACH the SQLite file to a DuckDB session for that query alone. Zero migration.
- Keep your CSV/TSV/Parquet sources as the canonical data. The DB is a cache; the source files are truth.

For the AI / LLM augmented workflow:

- Use a local LLM (e.g., Nemotron) to enrich metadata: synonym tables, named entity tags, classification labels. Run it in batch, persist results into SQLite (see the sketch after this list). Now your queries combine the LLM's intelligence with deterministic SQL search.
- Don't put the LLM in the hot path of search. Put it in the offline enrichment path. Search itself stays auditable.
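
A sketch of what that offline pass can look like; the endpoint, response shape, and table/column names are all placeholders for whatever local model and schema you actually run:

import json
import sqlite3
import requests

conn = sqlite3.connect("music_fes.db")
conn.execute("CREATE TABLE IF NOT EXISTS artist_synonyms (alias TEXT PRIMARY KEY, canonical TEXT)")

names = [r[0] for r in conn.execute("SELECT name FROM artists").fetchall()]
for name in names:
    # Batch call to a locally hosted model (placeholder endpoint and payload shape).
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={"prompt": f"List common alias spellings of the artist '{name}' as a JSON array."},
        timeout=60,
    )
    for alias in json.loads(resp.json()["choices"][0]["text"]):
        conn.execute(
            "INSERT OR IGNORE INTO artist_synonyms (alias, canonical) VALUES (?, ?)",
            (alias, name),
        )
    conn.commit()   # enrichment persists in SQLite; the search path never calls the model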

The Stack That Works

For my festival project:

- Storage: SQLite (4 files, total ~563MB, all in WAL mode)
- Backend: Python (sqlite3 stdlib + pandas)
- Frontend: Streamlit (because rapid iteration > pixel-perfect UI)
- Cron: Daily setlist.fm scrape + nightly link computations
- AI: Local LLM (Nemotron) for batch metadata enrichment, never in real-time path
- Vector: None. Curated synonym tables instead.

This combination is unfashionable. There's no Kubernetes, no microservices, no vector database, no streaming pipeline. It's a Python script and a SQLite file and a web UI.

It also runs on a single laptop, costs ~$250/month for the SoundCharts API, and serves typical dashboard queries in under a second. The whole stack fits in one head.

The lesson: pick tools by the shape of your problem, not the shape of the hype curve. SQLite is unfashionable but right for incremental workloads. DuckDB is fashionable and right for analytical batches. VectorDBs trade explainability for convenience — often a bad bargain for curated data.

Build something that works. Verify it. Ship it. Iterate.