ETL and Data Analysis of MovieLens Dataset by Ciro ETL and Data Analysis of MovieLens Dataset by Ciro

ETL and Data Analysis of MovieLens Dataset

Ciro

Ciro

Case Study: ETL and Data Analysis of Movies (MovieLens)

Overview

Designed and implemented a complete ETL pipeline to process over 100,000 ratings and 3,600 movies from the MovieLens dataset. Normalized unstructured data into a relational PostgreSQL schema and built Power BI dashboards to analyze rating trends, genre distribution, and user activity.

Challenge

Raw datasets often contain concatenated fields and inconsistent formats, making them unsuitable for direct analysis. The challenge was to normalize the MovieLens dataset, ensure efficient database loading, and deliver actionable insights through BI dashboards.

Approach

Extraction & Transformation (Python):
Loaded multiple CSV files (movies.csv, ratings.csv, tags.csv).
Converted timestamps to datetime.
Normalized genres into a many‑to‑many relational structure.
Loading (PostgreSQL):
Designed schema with tables for Movies, Genres, Ratings, and Tags.
Used psycopg2 with bulk loading (copy_from) for efficiency.
Analysis (Power BI):
Connected directly to PostgreSQL.
Built dashboards for top movies, genre distribution, user activity, and rating evolution.

Solution

Delivered a scalable ETL + BI workflow that transformed raw CSVs into a normalized relational database. The Power BI dashboards provided clear insights into movie popularity, user behavior, and historical rating trends.

Impact

Data Quality: Normalized genres into 3NF relational schema for optimized queries.
Performance: Efficient bulk loading enabled handling of large datasets.
Business Value: Dashboards revealed top contributors, genre preferences (Drama & Comedy), and rating evolution over time.
Like this project

Posted Jan 2, 2026

Developed an ETL pipeline and Power BI dashboards for MovieLens data analysis.