NYC Taxi Data Processing and Analysis Pipeline

Dmitrii Krotov

Data Preparation Pipeline for Analytics & ML (Spark)

End-to-end ETL workflow built with PySpark to process and analyze New York City Yellow Taxi trip data. The pipeline reads public Parquet data, performs cleaning and feature extraction, joins external lookup tables (pickup/drop-off zones), computes daily revenue and tip aggregations, runs basic data-quality checks, and produces a visualization of daily earnings.
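To make the cleaning, zone join, and daily aggregation steps concrete, here is a minimal sketch on a tiny in-memory sample. It uses pandas (from the stack) rather than Spark so that it runs without a Spark installation, so it illustrates the logic only and is not the project's code. The trip and lookup column names follow the public TLC Yellow Taxi schema; the function names, the cleaning thresholds, and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Drop implausible rows and add a pickup_date feature."""
    # Assumed cleaning rule: positive distance, non-negative amounts.
    keep = (
        (trips["trip_distance"] > 0)
        & (trips["total_amount"] >= 0)
        & (trips["tip_amount"] >= 0)
    )
    out = trips[keep].copy()
    # Feature extraction: calendar date of the pickup timestamp.
    out["pickup_date"] = out["tpep_pickup_datetime"].dt.date
    return out


def with_pickup_zone(trips: pd.DataFrame, zones: pd.DataFrame) -> pd.DataFrame:
    """Attach pickup zone names; a left join keeps trips with unknown zone ids."""
    return trips.merge(zones, how="left", left_on="PULocationID", right_on="LocationID")


def daily_revenue(trips: pd.DataFrame) -> pd.DataFrame:
    """Aggregate trip count, revenue, and tips per pickup date."""
    return trips.groupby("pickup_date", as_index=False).agg(
        trips=("total_amount", "count"),
        revenue=("total_amount", "sum"),
        tips=("tip_amount", "sum"),
    )


# Illustrative sample data (not real trip records).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 21:30",
        "2024-01-02 09:15", "2024-01-02 10:00",
    ]),
    "PULocationID": [1, 2, 1, 2],
    "trip_distance": [2.5, 1.0, 3.0, 0.0],  # last row fails the distance rule
    "total_amount": [20.0, 10.0, 30.0, 5.0],
    "tip_amount": [3.0, 0.0, 5.0, 1.0],
})
zones = pd.DataFrame({
    "LocationID": [1, 2],
    "Borough": ["EWR", "Queens"],
    "Zone": ["Newark Airport", "Jamaica Bay"],
})

enriched = with_pickup_zone(clean_trips(trips), zones)
daily = daily_revenue(enriched)
print(daily)
```

In Spark the same shape would typically be a filter, a date column derived from the pickup timestamp, a join against the lookup table, and a grouped aggregation written to Parquet partitioned by date.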

🧰 Stack

PySpark · Pandas · Matplotlib · Parquet

⚙️ Features

Modular ETL scripts for ingestion, transformation, and enrichment
Partitioned Parquet outputs for efficient querying
Automated workflow via Makefile (ETL → Aggregation → Zone Join → DQ → Plot)
Lightweight data-quality validator (scripts/dq_checks.py)
Reproducible single-command run: make all

🎯 Outcome

A clean, reproducible Spark pipeline demonstrating practical big-data handling, data-quality validation, and simple analytics. It is designed to be extended to larger setups (e.g., Delta Lake storage or multi-month batches).

🧾 Data Quality & Validation

A small Spark-based checker validates pipeline outputs:
python scripts/dq_checks.py
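
The source of scripts/dq_checks.py is not shown here, so the following is only a sketch of the kind of rules a lightweight validator might apply to the daily aggregate. It is written with pandas rather than Spark so it runs on a small sample; the column names (pickup_date, revenue, tips), the check names, and the function name are hypothetical.

```python
import pandas as pd


def run_dq_checks(daily: pd.DataFrame) -> dict:
    """Return a mapping of check name -> True if the check passed."""
    return {
        "not_empty": len(daily) > 0,
        "no_null_dates": bool(daily["pickup_date"].notna().all()),
        "unique_dates": bool(daily["pickup_date"].is_unique),
        "non_negative_revenue": bool((daily["revenue"] >= 0).all()),
        "tips_not_above_revenue": bool((daily["tips"] <= daily["revenue"]).all()),
    }


# Illustrative aggregates (not real pipeline output).
good = pd.DataFrame({
    "pickup_date": ["2024-01-01", "2024-01-02"],
    "revenue": [30.0, 30.0],
    "tips": [3.0, 5.0],
})
bad = good.assign(revenue=[30.0, -1.0])  # second day has negative revenue

good_results = run_dq_checks(good)
bad_results = run_dq_checks(bad)
failed = [name for name, passed in bad_results.items() if not passed]
print("failed checks:", failed)
```

Returning a dictionary of named results, instead of raising on the first failure, lets a caller report every failed check in one run and choose its own exit code.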

Posted Feb 16, 2026

Developed a data pipeline for NYC Taxi data using PySpark, Pandas, and Matplotlib.