Clinical Dataset Cleaning: From Raw Data to Insights (1,000 Rows)
I recently completed a data cleaning project on a 1,000-row clinical dataset. My goal was to turn a disorganized spreadsheet into a structured, analysis-ready table 😊. After the cleanup, the data is consistent, efficient to query, and ready for further analysis.
- Cleaned and standardised all columns and entries so that names, formats, and values are consistent throughout the dataset.
- Used the gender-guesser package in Python to fill in missing gender values based on the patients' names.
- Decoupled currency symbols from billing amounts into distinct columns, streamlining the data for future financial modeling.
- Calculated a new "days_between_dates" metric to analyse patient wait times and operational efficiency.
- Corrected data types for dates (booking_date, appointment_date) and numeric values (billing_amount) to ensure smooth querying.
- Systematically handled remaining missing data by assigning explicit NULL values to improve compatibility with SQL analysis.
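For anyone curious what a few of these steps can look like in code, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the exact code from the project: the column name billing_currency, the helper names, and the two date formats are hypothetical, and only booking_date, appointment_date, billing_amount and days_between_dates come from the list above. The name-based gender step is left out because it depends on the third-party gender-guesser package.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed input date formats; a real dataset may need more (or different) ones.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

# Optional leading symbol (anything that is not a digit, space, or separator),
# followed by a number that may contain thousands separators and decimals.
AMOUNT_RE = re.compile(r"^\s*([^\d\s.,-]*)\s*(-?[\d,]*\.?\d+)\s*$")


def parse_date(raw):
    """Return a date for a recognised format, or None if missing or unparseable."""
    if raw is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def split_currency(raw):
    """Split a value such as '$1,200.50' into (symbol, Decimal amount)."""
    if raw is None:
        return None, None
    match = AMOUNT_RE.match(raw)
    if match is None:
        return None, None
    symbol = match.group(1) or None
    amount = Decimal(match.group(2).replace(",", ""))
    return symbol, amount


def clean_row(row):
    """Build a typed row; anything missing or unparseable becomes None."""
    booking = parse_date(row.get("booking_date"))
    appointment = parse_date(row.get("appointment_date"))
    symbol, amount = split_currency(row.get("billing_amount"))
    # The wait time is only defined when both dates are known.
    days = (appointment - booking).days if booking and appointment else None
    return {
        "booking_date": booking,
        "appointment_date": appointment,
        "days_between_dates": days,
        "billing_currency": symbol,  # hypothetical column name
        "billing_amount": amount,
    }


# Made-up example row, purely for illustration.
raw = {
    "booking_date": "2024-01-05",
    "appointment_date": "19/01/2024",
    "billing_amount": "$1,200.50",
}
cleaned = clean_row(raw)
print(cleaned)
```

Using None for missing values is a deliberate choice here: Python database drivers that follow the DB-API convention send None to the database as SQL NULL, so no placeholder strings such as "N/A" end up in typed columns.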
Postscript: MySQL and I had a few "creative differences" toward the end, but I’m happy to report that I won the battle! 😋