Cardiovascular Disease Risk Prediction with Machine Learning

Tserundede Ejueyitchie

Completed work

Data Scientist

ML Engineer

ChatGPT

Jupyter

Python

Healthcare

Predicting Cardiovascular Disease Risk using Machine Learning

Princedede

Apr 11, 2024

Cardiovascular diseases (CVDs) constitute a global health crisis, claiming millions of lives annually. Their burden extends beyond mortality, impacting individuals, families, and healthcare systems through associated morbidity and economic costs.

Early detection and intervention are critical in managing CVDs effectively. By identifying individuals at risk before the onset of symptoms, preventive measures can be implemented, potentially slowing disease progression, improving patient outcomes, and reducing healthcare costs.

Machine learning presents a powerful tool for tackling this challenge. These algorithms can analyze vast amounts of health data, including individual characteristics, medical history, and lifestyle factors, to identify patterns and relationships associated with an increased risk of developing CVDs.

This project covers the development and evaluation of a machine learning model designed to predict the likelihood that an individual develops cardiovascular disease. It uses a dataset containing diverse health parameters, such as age, sex, cholesterol levels, blood pressure, and other relevant factors. We will walk through the model development process: data preprocessing, feature engineering, model selection, training, and validation.

Through this exploration, we aim to contribute to the ongoing effort to harness machine learning for early CVD risk assessment and, ultimately, to support better patient care and improved public health outcomes.

Dataset Description

The dataset used in this project contains various health parameters collected from individuals, along with a target variable indicating the presence or absence of cardiovascular disease. The features included in the dataset are:

age: Age of the individual

sex: Sex of the individual (0: female, 1: male)

cp: Chest pain type (0-3)

trestbps: Resting blood pressure (mm Hg)

chol: Serum cholesterol (mg/dl)

fbs: Fasting blood sugar > 120 mg/dl (0: false, 1: true)

restecg: Resting electrocardiographic results (0-2)

thalach: Maximum heart rate achieved

exang: Exercise-induced angina (0: no, 1: yes)

oldpeak: ST depression induced by exercise relative to rest

slope: Slope of the peak exercise ST segment

ca: Number of major vessels colored by fluoroscopy

thal: Thallium stress test result (0-3)

The target variable (’target’) indicates the presence (1) or absence (0) of cardiovascular disease.
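To make the layout concrete, here is a minimal sketch of a table with these columns, split into a feature matrix and a target vector. The rows are randomly generated stand-ins, not the project's data, and the value ranges (including those for slope and ca, which the description above does not specify) are assumptions chosen only for illustration. With the real file you would load it with pandas instead, for example from a CSV whose name is not given here.

```python
import numpy as np
import pandas as pd

# Column names exactly as listed in the dataset description.
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
TARGET = "target"


def make_synthetic_heart_data(n_rows=200, seed=0):
    """Return a random stand-in table with the same columns (NOT real patient data).

    All value ranges below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.integers(29, 78, n_rows),
        "sex": rng.integers(0, 2, n_rows),          # 0: female, 1: male
        "cp": rng.integers(0, 4, n_rows),           # chest pain type 0-3
        "trestbps": rng.integers(94, 201, n_rows),  # mm Hg
        "chol": rng.integers(126, 565, n_rows),     # mg/dl
        "fbs": rng.integers(0, 2, n_rows),
        "restecg": rng.integers(0, 3, n_rows),      # 0-2
        "thalach": rng.integers(71, 203, n_rows),
        "exang": rng.integers(0, 2, n_rows),
        "oldpeak": rng.uniform(0.0, 6.2, n_rows).round(1),
        "slope": rng.integers(0, 3, n_rows),        # assumed range
        "ca": rng.integers(0, 5, n_rows),           # assumed range
        "thal": rng.integers(0, 4, n_rows),         # 0-3
        "target": rng.integers(0, 2, n_rows),       # 1: disease, 0: no disease
    })


df = make_synthetic_heart_data()
X = df[FEATURES]   # 13 input features
y = df[TARGET]     # binary label
print(X.shape, y.value_counts().to_dict())
```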

Model Development

We'll develop a machine learning model to predict the likelihood of cardiovascular disease using the RandomForestClassifier algorithm. Before training the model, we'll preprocess the dataset to handle missing values using SimpleImputer and split it into training and testing sets. Next, we'll perform hyperparameter tuning using GridSearchCV to identify the optimal hyperparameters for the RandomForestClassifier model.
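The steps above can be sketched roughly as follows. This is not the project's actual code: the data is synthetic (generated with make_classification, with a few values blanked out), the mean imputation strategy is an assumption, and the grid values are hypothetical candidates chosen so that the search stays small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 13-feature dataset (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=42)

# Blank out about 2% of the values so the imputer has something to do.
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.02] = np.nan

# Hold out 20% of the rows for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Imputation lives inside the pipeline, so it is re-fit on each training fold.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # strategy is an assumption
    ("model", RandomForestClassifier(random_state=42)),
])

# Hypothetical candidate values; the "model__" prefix targets the pipeline step.
param_grid = {
    "model__n_estimators": [50, 150],
    "model__max_depth": [10, 20],
    "model__min_samples_split": [2, 10],
    "model__min_samples_leaf": [1, 4],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```

Wrapping the imputer and the classifier in one Pipeline means the imputation statistics are computed only from the training portion of each cross-validation fold, which avoids leaking information from the validation rows into the search.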

Model Evaluation

Once the model is trained, we’ll evaluate its performance on the testing set using various evaluation metrics, including accuracy, precision, recall, F1-score, and the confusion matrix. These metrics will provide insights into how well the model generalizes to unseen data and its ability to correctly classify individuals with and without cardiovascular disease.
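As a small illustration of how these metrics relate to one another, the sketch below computes them with scikit-learn on ten hand-made labels. The labels are invented for the example and are not the project's predictions.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

# Made-up labels for illustration: 1 = disease present, 0 = absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(classification_report(y_true, y_pred))
```

In a screening setting, a false negative (a patient with disease predicted as healthy) is usually the costlier error, so recall for the positive class deserves attention alongside overall accuracy.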

You can access the complete code here.

Results

The trained RandomForestClassifier model achieves an accuracy of approximately 86.9% on the testing set. The best hyperparameters identified through hyperparameter tuning are {'n_estimators': 150, 'max_depth': 20, 'min_samples_split': 10, 'min_samples_leaf': 4}. The classification report provides detailed metrics for each class, including precision, recall, and F1-score. The confusion matrix further illustrates the model's performance in predicting true positives, true negatives, false positives, and false negatives.

Comparison with Other Classification Algorithms

In addition to RandomForestClassifier, there are several other classification algorithms that can be used for predicting cardiovascular disease risk. Some of the commonly used algorithms include:

1. Logistic Regression

2. Support Vector Machines (SVM)

3. K-Nearest Neighbors (KNN)

4. Gradient Boosting

5. Decision Trees

Each of these algorithms, RandomForestClassifier included, has its own advantages and disadvantages. Let's start with the advantages of RandomForestClassifier:

Advantages of RandomForestClassifier

1. High Accuracy: RandomForestClassifier often produces accurate models on tabular data with relatively little tuning, especially when trained on large and diverse datasets.

2. Reduced Overfitting: By aggregating multiple decision trees, RandomForestClassifier reduces the risk of overfitting compared to individual decision trees. It achieves this by averaging the predictions of multiple trees, thereby providing a more generalized model.

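3. Handling Missing Values: Whether RandomForestClassifier accepts missing values directly depends on the scikit-learn version: recent releases can train on data containing NaN, while older releases raise an error. This project fills missing values with SimpleImputer before training, which works regardless of version.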

4. Feature Importance: RandomForestClassifier provides a feature importance score, which helps in identifying the most relevant features for prediction. This can aid in feature selection and understanding the underlying patterns in the data.

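5. Efficient for Large Datasets: The trees in a random forest are built independently of one another, so training can run in parallel across CPU cores (the n_jobs parameter in scikit-learn), and the algorithm copes well with datasets that have many rows and many features.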
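The feature importance point can be illustrated with a short sketch. It fits a forest on synthetic data and reads the feature_importances_ attribute. The column names are attached to random synthetic columns purely for readability, so the resulting ranking says nothing about which clinical variables actually matter.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Synthetic stand-in data (assumption): the names are labels only.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           n_redundant=2, random_state=0)

# n_jobs=-1 builds the trees in parallel on all available cores.
model = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = (pd.Series(model.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances.head(5))
```

These scores are impurity-based and can favor features with many distinct values, so it is worth cross-checking them (for example with sklearn.inspection.permutation_importance) before drawing conclusions.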

Pros and Cons of Other Algorithms:

1. Logistic Regression:

Pros:
- Simple and easy to interpret.
- Efficient for binary classification tasks.

Cons:
- Limited to linear decision boundaries.
- May not perform well with complex relationships in the data.

2. Support Vector Machines (SVM):

Pros:
- Effective in high-dimensional spaces.
- Versatile due to the different kernel options (linear, polynomial, radial basis function).

Cons:
- Computationally intensive, especially with large datasets.
- Sensitive to the choice of kernel parameters.

3. K-Nearest Neighbors (KNN):

Pros:
- Simple and intuitive concept.
- Non-parametric approach, suitable for complex decision boundaries.

Cons:
- Computationally expensive during prediction, especially with large datasets.
- Sensitive to the choice of k (number of neighbors) and distance metric.

4. Gradient Boosting:

Pros:
- Can capture complex patterns in the data.
- Typically achieves high accuracy.

Cons:
- Prone to overfitting when too many boosting rounds or a high learning rate are used.
- More complex to tune compared to RandomForestClassifier.

5. Decision Trees:

Pros:
- Easy to interpret and visualize.
- Can handle both numerical and categorical data.

Cons:
- Prone to overfitting, especially with deep trees.
- Less robust to noise and outliers compared to ensemble methods like RandomForestClassifier.

In summary, while RandomForestClassifier has its advantages, it's essential to consider the specific characteristics of the dataset and problem at hand when choosing the most appropriate algorithm. Experimenting with different algorithms and tuning their parameters can help identify the best-performing model for a given task.
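One way to run such an experiment is to score each candidate with the same cross-validation splits. The sketch below does this on synthetic data with mostly default settings. Both are assumptions made for illustration, so the printed scores are not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 13-feature dataset (assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Scale-sensitive models get a StandardScaler; tree-based models do not need it.
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "KNN (k=5)": make_pipeline(StandardScaler(),
                               KNeighborsClassifier(n_neighbors=5)),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=150, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {score:.3f}")
```

A fair comparison would also tune each candidate, since default settings can understate what a model is capable of on a particular dataset.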

Conclusion

In this project, we developed a machine learning model to predict the likelihood of cardiovascular disease based on various health parameters. We used the RandomForestClassifier algorithm, tuned its hyperparameters with GridSearchCV, and evaluated the resulting model on a held-out testing set.

After hyperparameter tuning using GridSearchCV, the RandomForestClassifier achieved an accuracy of approximately 86.9% on the testing set. The best hyperparameters identified were {'n_estimators': 150, 'max_depth': 20, 'min_samples_split': 10, 'min_samples_leaf': 4}. Capping tree depth and requiring a minimum number of samples per split and per leaf limits how closely each tree can fit the training data.

The model's performance was further evaluated using various metrics, including precision, recall, and F1-score. The classification report revealed high precision and recall values for both classes, indicating the model's ability to effectively identify individuals with and without cardiovascular disease. Additionally, the confusion matrix provided insights into the model's predictive performance, illustrating the distribution of true positives, true negatives, false positives, and false negatives.

Overall, the developed RandomForestClassifier model demonstrates significant potential for predicting cardiovascular disease risk accurately. By leveraging advanced machine learning techniques and optimizing hyperparameters, we can create robust models that assist healthcare professionals in early detection and intervention, ultimately improving patient outcomes and reducing the burden of cardiovascular diseases worldwide.

Posted Jun 16, 2025

Developed a machine learning model to predict cardiovascular disease risk using RandomForestClassifier.

Timeline

Apr 5, 2024 - Apr 11, 2024