SHEIN Product Data Cleaning & E-commerce Analysis

Kyrylo P

This project cleaned and structured a large-scale scraped SHEIN e-commerce dataset (80,000+ product records across 21 CSV files) to prepare it for analysis.
The raw dataset contained inconsistent formatting, duplicate entries, missing values, and noisy text fields that made it unsuitable for analysis.
Key work included:
Merging and standardising 21 raw CSV files into a single structured dataset
Removing 11,000+ duplicate products using title-based deduplication logic
Handling missing discount values using controlled null retention (no artificial imputation)
Filtering out statistical outliers without clipping or distortion
Engineering analytical features such as:
units_sold
log-transformed sales metric
price category segmentation (fixed bins using pd.cut)
discount presence flag
value efficiency score (sales-to-price ratio)
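
The cleaning steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the column names (`title`, `price`, `discount`), the title normalisation rules, and the IQR-based outlier rule are all hypothetical choices, since the write-up does not specify them. Two tiny in-memory CSVs stand in for the 21 raw files.

```python
import io
import pandas as pd

def merge_frames(frames):
    """Concatenate raw frames after normalising column names."""
    cleaned = []
    for df in frames:
        df = df.copy()
        # Assumed convention: lower-case, trimmed, underscore-separated names.
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        cleaned.append(df)
    return pd.concat(cleaned, ignore_index=True)

def dedupe_by_title(df, title_col="title"):
    """Keep the first row per normalised title; rows without a title are kept."""
    key = (
        df[title_col].astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )
    keep = ~key.duplicated(keep="first") | key.isna()
    return df.loc[keep].reset_index(drop=True)

def drop_iqr_outliers(df, col="price", k=1.5):
    """Remove rows outside the IQR fences (assumed rule); nothing is clipped.

    Rows where `col` is missing are also dropped by this mask.
    """
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df.loc[mask].reset_index(drop=True)

# Hypothetical sample data standing in for the raw CSV files.
raw_a = "Title,Price,Discount\nFloral Dress,12.5,-20%\nBasic Tee,5.0,\n"
raw_b = (
    "title , price, discount\n"
    " floral  dress ,12.5,-20%\n"
    "Denim Jacket,30.0,-10%\n"
    "Gold Ring,9999.0,\n"
)

frames = [pd.read_csv(io.StringIO(text)) for text in (raw_a, raw_b)]
merged = merge_frames(frames)
deduped = dedupe_by_title(merged)
clean = drop_iqr_outliers(deduped, col="price")

# Missing discounts stay as nulls; no value is imputed.
print(clean)
```

In this sketch the duplicate "floral dress" row is dropped by the normalised-title key, the extreme price is filtered out rather than capped, and the missing discount on the remaining row is left as a null.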
Final output: 70,292 clean, analysis-ready product records.
This project demonstrates real-world e-commerce data wrangling, feature engineering, and dataset preparation for downstream analytics and dashboarding.
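
As a rough illustration of the engineered features named above, the sketch below derives them with pandas and NumPy. It assumes a numeric `units_sold` column is already available, and the bin edges, category labels, and output column names (`log_units_sold`, `price_category`, `has_discount`, `value_efficiency`) are hypothetical; only the use of `pd.cut` with fixed bins comes from the project description.

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Add the analytical features to a copy of the cleaned frame."""
    out = df.copy()
    # log1p keeps products with zero sales defined (log1p(0) == 0).
    out["log_units_sold"] = np.log1p(out["units_sold"])
    # Fixed, right-closed bins; the edges and labels here are assumptions.
    out["price_category"] = pd.cut(
        out["price"],
        bins=[0, 10, 25, 50, np.inf],
        labels=["budget", "low", "mid", "premium"],
    )
    # Flag whether a discount was recorded; nulls are not imputed.
    out["has_discount"] = out["discount"].notna()
    # Sales-to-price ratio as a simple value efficiency score.
    out["value_efficiency"] = out["units_sold"] / out["price"]
    return out

# Hypothetical sample rows for demonstration only.
sample = pd.DataFrame(
    {
        "title": ["Basic Tee", "Floral Dress", "Denim Jacket"],
        "price": [5.0, 12.5, 30.0],
        "discount": [np.nan, "-20%", "-10%"],
        "units_sold": [100, 0, 60],
    }
)

feats = add_features(sample)
print(feats)
```

Fixed bins keep the price categories comparable across reruns, whereas quantile-based bins would shift whenever the data changes. A price of exactly 0 would fall outside the first bin in this sketch and receive a missing category.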

Posted May 19, 2026
