This study examined how demographics and air quality influenced COVID-19 infection and fatality rates across New York State counties during the pandemic's first wave. Infection and death rates were highest near NYC, while fatality (deaths per infection) was, paradoxically, higher in rural areas.
My Technical Contributions
Stepwise Regression Modeling
Applied forward selection and backward elimination techniques to identify statistically significant predictors (demographic & environmental) for COVID-19 infection and fatality.
Helped determine which features most improved model accuracy (e.g., PM2.5, population age, distance to epicenter).
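As a rough illustration of the forward-selection half of this step, here is a minimal sketch on synthetic data. It assumes ordinary least squares with AIC as the selection criterion; the study's actual criterion (for example p-value thresholds) is not stated here, and the names `ols_aic`, `forward_select`, and the feature labels are hypothetical. Backward elimination is the mirror image: start with every feature and drop the one whose removal lowers AIC most.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of an OLS fit with intercept, assuming Gaussian errors."""
    n = len(y)
    design = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_select(X, y, names):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = ols_aic(X[:, selected], y)  # intercept-only baseline
    while remaining:
        aic, j = min((ols_aic(X[:, selected + [j]], y), j) for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected]

# Synthetic stand-in for county-level data (62 rows); not the study's data.
rng = np.random.default_rng(0)
names = ["pm25", "pct_over_65", "dist_to_nyc", "noise"]
X = rng.normal(size=(62, 4))
# By construction, only pm25 and dist_to_nyc drive the outcome.
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=62)

selected = forward_select(X, y, names)
print(selected)
```

Because AIC only penalizes model size mildly, a pure-noise column can occasionally slip in after the real predictors, which is one reason to cross-check a forward pass against backward elimination.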
Data Wrangling & Cleaning
Merged multiple datasets (census, pollution, and epidemiological data) at the county level.
Preprocessed variables for model readiness: normalization, missing value handling, and encoding.
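The merge and preprocessing steps could look roughly like the sketch below. It assumes pandas and uses tiny made-up tables; the county keys, column names, and values are hypothetical placeholders, and median imputation, z-scoring, and one-hot encoding are just one reasonable choice for each step.

```python
import pandas as pd

# Hypothetical toy tables keyed by county; all values are made up.
census = pd.DataFrame({
    "county": ["county_a", "county_b", "county_c"],
    "population": [300000, 1400000, 2600000],
    "region": ["rural", "urban", "urban"],
})
pollution = pd.DataFrame({
    "county": ["county_a", "county_b", "county_c"],
    "pm25": [7.1, None, 9.4],  # one missing reading
})
cases = pd.DataFrame({
    "county": ["county_a", "county_b", "county_c"],
    "infections": [1500, 45000, 60000],
    "deaths": [90, 3500, 5000],
})

# Merge the three sources at the county level.
df = (census.merge(pollution, on="county", how="inner")
            .merge(cases, on="county", how="inner"))

# Missing-value handling: fill gaps with the column median.
df["pm25"] = df["pm25"].fillna(df["pm25"].median())

# Derived outcomes: infections per 1,000 residents and deaths per infection.
df["infection_rate"] = 1000 * df["infections"] / df["population"]
df["fatality"] = df["deaths"] / df["infections"]

# Normalization: z-score the numeric predictor.
df["pm25_z"] = (df["pm25"] - df["pm25"].mean()) / df["pm25"].std()

# Encoding: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["region"], dtype=int)
print(df)
```

An inner join silently drops counties missing from any source, so comparing row counts before and after the merge is a cheap sanity check.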
Feature Impact Interpretation
Analyzed how variable inclusion/exclusion altered regression accuracy and output.
Supported result validation to ensure models aligned with observed cluster behaviors.
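One simple way to quantify how including or excluding a variable changes the fit is a drop-one comparison: refit the model without each feature and record the loss in R². The sketch below assumes plain OLS on synthetic data; `r_squared`, `drop_one_impact`, and the feature labels are hypothetical names, not the study's code.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

def drop_one_impact(X, y, names):
    """Loss in R^2 when each feature is removed from the full model."""
    full = r_squared(X, y)
    return {
        name: full - r_squared(np.delete(X, j, axis=1), y)
        for j, name in enumerate(names)
    }

# Synthetic stand-in for a fatality model; not the study's data.
rng = np.random.default_rng(1)
names = ["pct_over_65", "pm25", "noise"]
X = rng.normal(size=(62, 3))
# By construction, pct_over_65 matters most and "noise" not at all.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=62)

impact = drop_one_impact(X, y, names)
for name, delta in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(f"{name}: R^2 drops by {delta:.3f} when removed")
```

Drop-one deltas understate the importance of correlated predictors, since a correlated neighbor absorbs part of the removed feature's signal.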
Techniques Used
Key Insights from the Study
PM2.5 and distance to NYC were major predictors of infection spread
Fatality was more strongly associated with the share of elderly residents and long-term pollution exposure
Spatial and demographic segmentation is crucial for targeted public health response
Model interpretability helped explain why certain rural areas had high fatality despite low infection
Relevance to Freelance Work
This project shows my ability to:
Select and justify data modeling techniques
Build and interpret multivariate regression models
Understand feature importance & business impact
Handle real-world public datasets at scale
Applicable to:
Health analytics, churn modeling, marketing attribution, or KPI drivers