Md. Akmol Masud

Comprehensive Data Analysis and Machine Learning on Heart Disease Dataset

Overview

This project combines exploratory data analysis, visualization, and machine learning on a heart disease dataset. It includes tools for visualizing data distributions and outliers, and it trains and evaluates 25 classifiers on the dataset.

Data Visualization

Distribution and Outlier Visualization

Combination of violin plots, box plots, and jittered points
Interactive Plotly-based display
High-resolution image export capability

Features

Multi-variable Visualization: Automatically creates subplots for all columns in the input DataFrame.
Combination Plot: Each variable is represented by a violin plot, a box plot, and jittered points.
Interactive Display: Utilizes Plotly's interactive features for dynamic data exploration.
Customizable Layout: Flexible options for adjusting plot size, colors, and styling.
Export Functionality: Capability to save the plot as a high-resolution PNG image.
Scalability: Designed to handle datasets with varying numbers of variables.
Informative Labeling: Clear subplot titles and axis labels for easy interpretation.

Dependencies

Python 3.x
Plotly (for interactive plotting)
NumPy (for numerical operations)
Pandas (for data manipulation, implied in the code)
Kaleido (for high-quality image export)

Machine Learning Models

Classifiers Implemented

A total of 25 different classifiers from various categories were implemented:
Linear Models
Tree-based Models
Ensemble Methods
Nearest Neighbors
Naive Bayes Models
Support Vector Machines
Neural Networks
Discriminant Analysis
Other

Evaluation Metrics

The models are evaluated using a comprehensive set of metrics:
Accuracy: Overall correctness of the model
Precision: Ratio of true positives to total predicted positives
Recall: Ratio of true positives to total actual positives
F1 Score: Harmonic mean of precision and recall
AUC-ROC: Area Under the Receiver Operating Characteristic curve
Cohen's Kappa: Agreement between predicted and actual classes, accounting for chance
Log Loss: Logarithmic loss between predicted probabilities and actual class
Average Precision: Area under the precision-recall curve
Jaccard Score: Intersection over Union of the predicted and actual positive labels
Balanced Accuracy: Average of recall obtained on each class
Specificity: True negative rate
Geometric Mean: Geometric mean of sensitivity and specificity
Index of Balanced Accuracy (IBA): Squared geometric mean weighted by the dominance (sensitivity minus specificity)
Confusion Matrix Components: True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
Cross-Validation Scores: Mean and standard deviation of 10-fold stratified cross-validation
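
Most of the threshold-based metrics above follow directly from the four confusion-matrix counts. The sketch below shows those relationships for a binary problem. The function name `binary_metrics` and the weighting factor `alpha=0.1` for IBA are assumptions made for illustration, not taken from the project code, and the function does not guard against empty classes (zero denominators).

```python
import math


def binary_metrics(tp, tn, fp, fn, alpha=0.1):
    """Derive threshold-based metrics from confusion-matrix counts.

    Hypothetical helper; alpha is an assumed IBA weighting factor.
    """
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / n
    # Chance agreement for Cohen's kappa, from the row and column marginals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
        "jaccard": tp / (tp + fp + fn),
        "cohen_kappa": (accuracy - p_chance) / (1 - p_chance),
        "g_mean": g_mean,
        # IBA: squared G-mean weighted by dominance (recall - specificity).
        "iba": (1 + alpha * (recall - specificity)) * g_mean**2,
    }
```

AUC-ROC, Log Loss, and Average Precision are not covered here because they need predicted probabilities rather than hard class labels.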

Model Evaluation Process

Each model is trained on the training dataset.
Predictions are made on the test dataset.
Comprehensive metrics are calculated for each model.
10-fold stratified cross-validation is performed to assess model stability.
Results are compiled into a DataFrame for easy comparison and analysis.
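
To make the cross-validation step concrete, here is a minimal standard-library sketch of how stratified folds can be built and how per-fold scores are summarized. It is an illustration of the idea only: in practice scikit-learn's `StratifiedKFold` provides this, and the source does not state which implementation the project uses. The names `stratified_folds` and `summarize` are hypothetical, as is the choice of the population standard deviation.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin, continuing where the previous
        # class stopped so that fold sizes stay even.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def summarize(scores):
    """Mean and population standard deviation of per-fold scores."""
    # Assumption: population std (ddof=0), as NumPy computes by default.
    return statistics.mean(scores), statistics.pstdev(scores)
```

Each fold serves once as the validation set while the remaining folds are used for training; a small standard deviation across folds suggests the model's performance is stable.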

Results Storage

The evaluation results for all models are stored in a structured DataFrame.
This allows for easy comparison, visualization, and further analysis of model performances.

Installation

Ensure you have the required packages installed:
pip install -r requirements.txt
For virtual environment users:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install plotly numpy pandas kaleido

Usage

Prepare your dataset as a pandas DataFrame named df.
Import the necessary libraries and paste the provided code into your Python script or Jupyter notebook.
Run the script to generate the visualization.
The plot will be displayed interactively in your default browser or notebook.
A PNG file of the plot will be saved in your working directory.
Example:
import pandas as pd
# Your existing code here
# ...
fig.show()
save_figure(fig)

Results

Location: Results are saved in the results directory.

Code Structure

Import Statements: Required libraries are imported.
Save Function Definition: save_figure function is defined for exporting the plot.
Data Preparation: The input DataFrame is processed to determine subplot layout.
Subplot Creation: A Plotly subplot figure is initialized.
Plot Generation Loop: Iterates through DataFrame columns, creating violin and box plots for each.
Layout Customization: Adjusts the overall appearance of the plot.
Display and Save: Shows the interactive plot and saves it as an image.
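
The structure above can be sketched roughly as follows. This is not the project's actual code: the helper names (`grid_shape`, `cell_for`, `build_figure`), the default of three subplots per row, and all styling values are assumptions chosen for illustration. The Plotly calls used (`make_subplots`, `go.Violin`, `go.Box`, `update_layout`) are standard, and `df` is expected to be a pandas DataFrame with numeric columns.

```python
import math


def grid_shape(n_plots, n_cols=3):
    """Return (rows, cols) for a grid that fits n_plots subplots."""
    if n_plots < 1:
        raise ValueError("need at least one column to plot")
    cols = min(n_cols, n_plots)
    rows = math.ceil(n_plots / cols)
    return rows, cols


def cell_for(index, cols):
    """Map a 0-based plot index to Plotly's 1-based (row, col)."""
    return index // cols + 1, index % cols + 1


def build_figure(df, n_cols=3):
    """Draw a violin, a box and jittered points for every column of df."""
    # Imported here so the layout helpers work without Plotly installed.
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    names = [str(name) for name in df.columns]
    rows, cols = grid_shape(len(names), n_cols)
    fig = make_subplots(rows=rows, cols=cols, subplot_titles=names)
    for i, column in enumerate(df.columns):
        row, col = cell_for(i, cols)
        # Violin with all points shown; jitter spreads them horizontally.
        fig.add_trace(
            go.Violin(y=df[column], name=names[i], points="all",
                      jitter=0.3, pointpos=0, opacity=0.6,
                      showlegend=False),
            row=row, col=col,
        )
        # Narrow box on top of the violin for the quartiles and median.
        fig.add_trace(
            go.Box(y=df[column], name=names[i], boxpoints=False,
                   width=0.15, showlegend=False),
            row=row, col=col,
        )
    # Assumed sizing: scale the canvas with the grid dimensions.
    fig.update_layout(height=350 * rows, width=400 * cols)
    return fig
```

With a DataFrame loaded, `fig = build_figure(df)` followed by `fig.show()` would display the grid; exporting is then left to the project's `save_figure` function.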

Customization

Color Scheme: Modify the colors variable to change the plot color palette.
Plot Dimensions: Adjust height and width in the update_layout function.
Subplot Titles: Customize subplot titles in the make_subplots function call.
Marker Properties: Alter size, opacity, and other properties of plot elements.
Export Settings: Modify the save_figure function parameters for custom image export.

Output

An interactive HTML plot displayed in the default web browser or Jupyter notebook.
A high-resolution PNG file (default name: "Outlier_Graph.png") saved in the working directory.

Detailed Component Explanation

Violin Plot

Represents the probability density of the data at different values.
Wider sections indicate a higher probability of data points in that range.

Box Plot

Shows the quartiles of the dataset.
The box represents the interquartile range (IQR).
The line in the box is the median.
Whiskers extend to the most extreme data points that lie within 1.5 × IQR of the box edges (the common convention, and Plotly's default).
Points beyond the whiskers are potential outliers.
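
The quantities a box plot displays can be computed by hand, which helps when checking which points a plot flags as outliers. The sketch below uses the common 1.5 × IQR rule with Python's standard library; the function name `box_stats` is hypothetical, and quartile conventions differ slightly between libraries, so the values may not match Plotly's exactly.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Quartiles, IQR, whisker ends and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }
```

For the values 1 through 9 plus a single 100, only the 100 falls outside the fences and is reported as a potential outlier.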

Jittered Points

Individual data points are plotted with a slight random offset.
Provides a view of the raw data distribution, especially useful for smaller datasets.
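
Plotly applies jitter internally when points are shown on a violin or box trace, so no manual work is needed in practice. As an illustration of the idea only, the sketch below generates horizontal offsets by hand; the function name `jittered_positions`, the uniform distribution, and the default width are assumptions.

```python
import random


def jittered_positions(n_points, center=0.0, width=0.3, seed=42):
    """Return n_points x-positions spread uniformly around center."""
    rng = random.Random(seed)  # seeded so the layout is reproducible
    half = width / 2
    return [center + rng.uniform(-half, half) for _ in range(n_points)]
```

Only the horizontal position is randomized; the vertical position still carries the actual data value, so overlapping observations become visible without distorting them.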

Data Preprocessing

Ensure your DataFrame (df) is clean and properly formatted.
Handle missing values appropriately before visualization.
Consider normalizing or scaling data if variables are on vastly different scales.
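
As one possible way to handle the last two points, the sketch below drops missing entries from a single column and rescales it to zero mean and unit variance (z-scores). Dropping is only one strategy for missing values, imputation being another, and the function name `standardize` is hypothetical. With pandas the same steps would normally be applied to the DataFrame directly.

```python
import math
import statistics


def standardize(values):
    """Drop missing entries, then rescale to zero mean and unit variance."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    mean = statistics.mean(clean)
    spread = statistics.pstdev(clean)
    if spread == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in clean]
    return [(v - mean) / spread for v in clean]
```

After standardizing, variables measured on very different scales share a common axis range, which keeps one subplot from visually dominating the others.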

Performance Considerations

Large datasets may require significant processing time and memory.
For very large datasets, consider sampling or using a more performant plotting library.

Troubleshooting

ImportError: Ensure all required libraries are installed.
MemoryError: For large datasets, try reducing the number of plotted points or use data sampling.
Kaleido Issues: Make sure Kaleido is properly installed for image export functionality.

FAQs

Q: Can I use this with a CSV file?
A: Yes, load your CSV into a pandas DataFrame first: df = pd.read_csv('your_file.csv')
Q: How can I change the output image format?
A: Modify the fig.write_image() call in the save_figure function, changing the file extension.

Version History

v1.1.0: Added 25 machine learning models and additional visualizations
v1.0.0: Initial release with basic visualization functionality

Future Improvements

Implement automated hyperparameter tuning for ML models
Add more advanced visualization techniques (e.g., t-SNE, UMAP)
Develop a web interface for easy model selection and result visualization

Posted Dec 2, 2024

Comprehensive EDA and 25 ML classifier on Heart Disease dataset - masud1901/EDA-and-ML-model-prediction-on-Heart-Dataset