Full client engagement pipeline demonstrated on fictional Series A SaaS data: 90,950 CRM rows, 1,025,447 Stripe
transactions, 115,500 QuickBooks invoices, 200 enterprise contracts — the exact profile of a typical new
engagement.
Five stages on one machine. No cluster. No cloud database fees.
Stage 1: Ingest dirty data from CRM, Stripe, and QuickBooks exports.
Stage 2: Audit — quality scores, grime map, null analysis.
Stage 3: Clean — DuckDB SQL transforms, Parquet out.
Stage 4: Analyse — 5 business queries, anomaly detection.
Stage 5: RAG — embed, index, query across documents.
Every stage runs inside DuckDB on a single workstation. This is the pipeline your data goes through on every
engagement.
0
23
Interactive Streamlit dashboard backed by DuckDB — 167,858,646 NYC Yellow Cab records across 4 years (2022–2025),
48 Parquet files, queried live on a single workstation.
Features: KPI cards (total trips, fare revenue, YoY change), monthly trip volume line chart by year, payment type
shift analysis 2022→2025, interactive year/month filters. All queries run in DuckDB — no database server, no
cluster, no Spark.
This is the same data stack delivered to clients. Raw Parquet files in, live dashboard out — pipeline built once,
runs automatically.
Built with: DuckDB 1.4 · Python · Streamlit · Plotly · Pandas
Migrate off Databricks in 4 stages.
172M rows/sec. 80% lower cost.
I replace overpriced Databricks and Snowflake stacks with DuckDB — faster, cheaper, and completely under your control.
Audit → Migrate → Automate → Secure. Fixed-fee engagements. Results in days.