Classification Models Comparison: Diabetes Prediction by Kaedyn Cruickshank

Classification Models Comparison: Diabetes Prediction

Kaedyn Cruickshank

AI Model Developer

Data Scientist

Jupyter

Python

scikit-learn

Healthcare

Logistic Regression, Random Forest, Gradient Boosting

Purpose

Predict diabetes risk from recorded health metrics such as glucose, BMI, and insulin, and compare Logistic Regression against two tree-based models (Random Forest and Gradient Boosting).

Dataset

Pima Indians Diabetes dataset (768 rows, 9 columns). Target: Outcome (1 = diabetes, 0 = no diabetes).

Public, well-known educational dataset.

Process

Cleaning

Treated physiologically impossible zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI as missing values and replaced them with the column median.
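The cleaning step could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the helper name impute_zeros_with_median and the toy rows are made up for the example, and only the five column names come from the write-up.

```python
import numpy as np
import pandas as pd

# Columns where a zero is physiologically impossible (names from the write-up).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_with_median(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Return a copy of df with zeros in `columns` replaced by the column median.

    Hypothetical helper for illustration; the original notebook may differ.
    """
    out = df.copy()
    for col in columns:
        # Mark zeros as missing first so they do not pull the median down.
        as_missing = out[col].replace(0, np.nan)
        out[col] = as_missing.fillna(as_missing.median())
    return out


# Made-up rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})
cleaned = impute_zeros_with_median(toy)
```

The median here is computed over whatever frame is passed in. In a train/test workflow it is usually safer to compute it on the training split only and reuse it for the test split; the write-up does not say which approach was used.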

Models

Logistic Regression (with StandardScaler in pipeline)

Random Forest (300 trees)

Gradient Boosting
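A minimal sketch of how these three models might be declared in scikit-learn. Only the StandardScaler pipeline and the 300 trees are stated in the write-up; max_iter, random_state, and the default Gradient Boosting settings are assumptions made for the example.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

models = {
    # Scaling sits inside the pipeline so it is fitted on training data only.
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),  # max_iter is an assumption
    ]),
    # 300 trees as stated in the write-up; random_state is an assumption.
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    # Library defaults assumed; the write-up gives no hyperparameters.
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
```

The tree-based models are left unscaled because their splits depend only on the ordering of feature values, not on their scale.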

Model Evaluation

Metrics: Accuracy, Precision, Recall, F1, ROC-AUC

Visuals: Confusion Matrix per model, ROC Curves

Interpretability: Feature importance
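The metrics listed above can be computed with scikit-learn along these lines. This is a sketch under assumptions: the evaluate helper is hypothetical, the data is synthetic (generated with make_classification rather than loaded from the real dataset), and the 80/20 stratified split is a guess, since the write-up does not state the split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Hypothetical helper: score a fitted binary classifier on a test set."""
    y_pred = model.predict(X_test)               # hard labels at the 0.5 threshold
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "ROC_AUC": roc_auc_score(y_test, y_score),  # uses scores, not labels
        "ConfusionMatrix": confusion_matrix(y_test, y_pred),
    }


# Synthetic stand-in data with the same number of rows and features as the dataset.
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = evaluate(model, X_test, y_test)
```

ROC-AUC is computed from predicted probabilities, while the other four metrics depend on the 0.5 decision threshold, which is why a model can rank first on ROC-AUC without ranking first on accuracy or F1.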

Results — Test Set

Logistic Regression: Accuracy=0.760, Precision=0.698, Recall=0.556, F1Score=0.619, ROC_AUC=0.813

Random Forest: Accuracy=0.747, Precision=0.653, Recall=0.593, F1Score=0.621, ROC_AUC=0.839

Gradient Boosting: Accuracy=0.753, Precision=0.660, Recall=0.611, F1Score=0.635, ROC_AUC=0.829

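Best by ROC-AUC: Random Forest (0.839). Gradient Boosting has the highest Recall and F1; Logistic Regression has the highest Accuracy and Precision.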
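As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported values. The sketch below uses only the numbers listed above; the helper name f1_from is made up for the example. Small differences are expected because the inputs are already rounded to three decimals.

```python
# Precision, recall, and F1 exactly as reported in the results above.
reported = {
    "Logistic Regression": {"precision": 0.698, "recall": 0.556, "f1": 0.619},
    "Random Forest": {"precision": 0.653, "recall": 0.593, "f1": 0.621},
    "Gradient Boosting": {"precision": 0.660, "recall": 0.611, "f1": 0.635},
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Absolute gap between the recomputed and the reported F1 for each model.
gaps = {
    name: abs(f1_from(m["precision"], m["recall"]) - m["f1"])
    for name, m in reported.items()
}
```

All three gaps stay below 0.001, which is consistent with rounding of the inputs.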

Feature Importance (For Tree Models)

Random Forest and Gradient Boosting (the tree-based models) provide feature importance scores.

In both models, Glucose, BMI, and Age carry the highest importance scores. The models differ in how concentrated that importance is: Gradient Boosting assigns Glucose a score of approximately 0.4, compared with about 0.25 for Random Forest.
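A rough sketch of how such a side-by-side comparison can be produced. The data here is synthetic, so the resulting scores say nothing about the real dataset. The feature names follow the commonly distributed version of the dataset; Pregnancies and DiabetesPedigreeFunction are assumed, since the write-up does not list them. Hyperparameters other than the 300 trees are also assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Assumed column names (standard for this dataset, not all listed in the write-up).
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in data; real importances require the real dataset.
X, y = make_classification(n_samples=300, n_features=len(FEATURES),
                           random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)

# One row per feature, one column per model, sorted by Random Forest score.
importances = pd.DataFrame(
    {"RandomForest": rf.feature_importances_,
     "GradientBoosting": gb.feature_importances_},
    index=FEATURES,
).sort_values("RandomForest", ascending=False)
```

scikit-learn normalizes these impurity-based importances to sum to 1 within each model, so a score of 0.4 versus 0.25 can be read as the share of total importance each model assigns to a feature.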

Posted Aug 28, 2025

Compared three classification models for diabetes prediction using the Pima Indians Diabetes dataset.
