Md. Akmol Masud

Data Visualizer

ML Engineer

Data Analyst

scikit-learn

seaborn

TensorFlow

Comprehensive Data Analysis and Machine Learning on Heart Disease Dataset

Overview

This project provides tools for data analysis, visualization, and machine learning on a heart disease dataset. It visualizes data distributions and outliers, and it trains and evaluates multiple machine learning classifiers on the dataset.

Data Visualization

Distribution and Outlier Visualization

Combination of violin plots, box plots, and jittered points

Interactive Plotly-based display

High-resolution image export capability

Features

Multi-variable Visualization: Automatically creates subplots for all columns in the input DataFrame.

Combination Plot: Each variable is represented by a violin plot, a box plot, and jittered individual points.

Interactive Display: Utilizes Plotly's interactive features for dynamic data exploration.

Customizable Layout: Flexible options for adjusting plot size, colors, and styling.

Export Functionality: Capability to save the plot as a high-resolution PNG image.

Scalability: Designed to handle datasets with varying numbers of variables.

Informative Labeling: Clear subplot titles and axis labels for easy interpretation.

Dependencies

Python 3.x

Plotly (for interactive plotting)

NumPy (for numerical operations)

Pandas (for data manipulation, implied in the code)

Kaleido (for high-quality image export)

Machine Learning Models

Classifiers Implemented

A total of 25 different classifiers from various categories were implemented:

Linear Models

Tree-based Models

Ensemble Methods

Nearest Neighbors

Naive Bayes Models

Support Vector Machines

Neural Networks

Discriminant Analysis

Other

Evaluation Metrics

The models are evaluated using a comprehensive set of metrics:

Accuracy: Overall correctness of the model

Precision: Ratio of true positives to total predicted positives

Recall: Ratio of true positives to total actual positives

F1 Score: Harmonic mean of precision and recall

AUC-ROC: Area Under the Receiver Operating Characteristic curve

Cohen's Kappa: Agreement between predicted and actual classes, accounting for chance

Log Loss: Logarithmic loss between predicted probabilities and actual class

Average Precision: Area under the precision-recall curve

Jaccard Score: Intersection over Union of the predicted and actual positive labels

Balanced Accuracy: Average of recall obtained on each class

Specificity: True negative rate

Geometric Mean: Geometric mean of sensitivity and specificity

Index of Balanced Accuracy (IBA): Squared geometric mean, weighted by the difference between sensitivity and specificity

Confusion Matrix Components: True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)

Cross-Validation Scores: Mean and standard deviation of 10-fold stratified cross-validation
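
Several of the metrics above follow directly from the four confusion matrix counts. The sketch below shows one way to derive them in plain Python. The function name confusion_metrics and the alpha value of 0.1 are assumptions for illustration, the counts in the usage line are made up, and the code assumes every denominator is nonzero.

```python
import math


def confusion_metrics(tp, tn, fp, fn, alpha=0.1):
    """Derive binary-classification metrics from confusion matrix counts.

    alpha is the IBA weighting parameter; 0.1 is an assumed value.
    """
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "jaccard": tp / (tp + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": g_mean,
        # IBA: squared G-mean scaled by the sensitivity/specificity imbalance
        "iba": (1 + alpha * (sensitivity - specificity)) * g_mean ** 2,
    }


# Hypothetical counts, not results from the project
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these made-up counts, sensitivity is 0.8 and specificity is 0.9, so balanced accuracy is 0.85 and IBA is 0.99 × 0.72 = 0.7128.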

Model Evaluation Process

Each model is trained on the training dataset.

Predictions are made on the test dataset.

Comprehensive metrics are calculated for each model.

10-fold stratified cross-validation is performed to assess model stability.

Results are compiled into a DataFrame for easy comparison and analysis.
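
The process above might look something like the following sketch with scikit-learn. Everything specific here is an assumption: synthetic data from make_classification stands in for the heart disease dataset, two classifiers stand in for the 25, only a few metrics are computed, and cross-validation is run on the training split (the README does not say which split is used).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the real features and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Two illustrative models; the project evaluates 25.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)                      # 1. train
    y_pred = model.predict(X_test)                   # 2. predict
    y_prob = model.predict_proba(X_test)[:, 1]
    cv_scores = cross_val_score(                     # 4. 10-fold stratified CV
        model, X_train, y_train, cv=cv, scoring="accuracy"
    )
    rows.append({                                    # 3. metrics
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
    })

results = pd.DataFrame(rows).set_index("model")      # 5. compile
```

Sorting results by a column, for example results.sort_values("roc_auc", ascending=False), gives a quick ranking of the models.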

Results Storage

The evaluation results for all models are stored in a structured DataFrame.

This allows for easy comparison, visualization, and further analysis of model performances.

Dependencies

Python 3.x

Plotly (for interactive plotting)

NumPy (for numerical operations)

Pandas (for data manipulation, implied in the code)

Kaleido (for high-quality image export)

Installation

Ensure you have the required packages installed:

pip install -r requirements.txt

For virtual environment users:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install plotly numpy pandas kaleido

Usage

Prepare your dataset as a pandas DataFrame named df.

Import the necessary libraries and paste the provided code into your Python script or Jupyter notebook.

Run the script to generate the visualization.

The plot will be displayed interactively in your default browser or notebook.

A PNG file of the plot will be saved in your working directory.

Example:

import pandas as pd
# Your existing code here
# ...
fig.show()
save_figure(fig)

Results

Location: Results are saved in the results directory ``

Code Structure

Import Statements: Required libraries are imported.

Save Function Definition: save_figure function is defined for exporting the plot.

Data Preparation: The input DataFrame is processed to determine subplot layout.

Subplot Creation: A Plotly subplot figure is initialized.

Plot Generation Loop: Iterates through DataFrame columns, creating violin and box plots for each.

Layout Customization: Adjusts the overall appearance of the plot.

Display and Save: Shows the interactive plot and saves it as an image.
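
The data preparation and loop steps depend on mapping each column to a subplot cell. One possible way to do that is sketched below. The helper names subplot_grid and subplot_position and the three-column default are assumptions, not names from the project.

```python
import math


def subplot_grid(n_vars, max_cols=3):
    """Return (rows, cols) for a grid that fits n_vars subplots."""
    if n_vars < 1:
        raise ValueError("need at least one variable to plot")
    n_cols = min(max_cols, n_vars)
    n_rows = math.ceil(n_vars / n_cols)
    return n_rows, n_cols


def subplot_position(index, n_cols):
    """Map a 0-based column index to Plotly's 1-based (row, col) position."""
    row, col = divmod(index, n_cols)
    return row + 1, col + 1
```

For a DataFrame with 14 columns, subplot_grid(14) yields a 5 by 3 grid, and the fifth column (index 4) lands at row 2, column 2.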

Customization

Color Scheme: Modify the colors variable to change the plot color palette.

Plot Dimensions: Adjust height and width in the update_layout function.

Subplot Titles: Customize subplot titles in the make_subplots function call.

Marker Properties: Alter size, opacity, and other properties of plot elements.

Export Settings: Modify the save_figure function parameters for custom image export.

Output

An interactive HTML plot displayed in the default web browser or Jupyter notebook.

A high-resolution PNG file (default name: "Outlier_Graph.png") saved in the working directory.
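
For reference, a save_figure helper along these lines would produce the PNG described above. This is a guess at its shape, not the project's implementation: the width, height, and scale defaults are assumptions. Plotly's fig.write_image needs Kaleido installed to render static images.

```python
def save_figure(fig, filename="Outlier_Graph.png", width=1600, height=1200, scale=2):
    """Export a Plotly figure as a static image (sketch; defaults are assumed).

    scale multiplies the pixel dimensions, so scale=2 doubles the resolution.
    """
    fig.write_image(filename, width=width, height=height, scale=scale)
    return filename
```

Because write_image infers the format from the file extension, passing filename="Outlier_Graph.svg" or a .pdf name would switch the output format.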

Detailed Component Explanation

Violin Plot

Represents the probability density of the data at different values.

Wider sections indicate a higher probability of data points in that range.
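
The width of a violin at each value comes from a kernel density estimate. The sketch below shows a basic Gaussian version with NumPy to make the idea concrete. The function name and the rule-of-thumb bandwidth are assumptions for illustration, and Plotly's own bandwidth choice may differ.

```python
import numpy as np


def gaussian_kde_1d(values, grid):
    """Estimate a density on `grid` by averaging Gaussian bumps centered on the data."""
    x = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = x.size
    # Rule-of-thumb bandwidth (an assumption for this sketch)
    bandwidth = 1.06 * x.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)


# Synthetic sample, not project data
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(sample, grid)
```

Plotting density against grid, mirrored around a vertical axis, gives the violin outline.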

Box Plot

Shows the quartiles of the dataset.

The box represents the interquartile range (IQR).

The line in the box is the median.

Whiskers extend to show the rest of the distribution.

Points beyond the whiskers are potential outliers.
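
The box plot quantities above can be computed by hand, as in the sketch below. It assumes the common convention that whiskers reach the most extreme points within 1.5 × IQR of the box; the function name box_stats is hypothetical, and NumPy's default percentile interpolation may differ slightly from the quartile method Plotly uses.

```python
import numpy as np


def box_stats(values, whisker=1.5):
    """Compute quartiles, whisker ends, and potential outliers for one variable."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    outliers = x[(x < lower_fence) | (x > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": outliers.tolist(),
    }


# Made-up values with one obvious outlier
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Here the IQR is 4.5, so the upper fence sits at 14.5 and the value 100 is flagged as a potential outlier.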

Jittered Points

Individual data points are plotted with a slight random offset.

Provides a view of the raw data distribution, especially useful for smaller datasets.
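
Jitter amounts to adding a small random horizontal offset so overlapping points become visible. A minimal sketch of that idea, with a hypothetical function name and an assumed default width:

```python
import numpy as np


def jitter_positions(n_points, center=0.0, width=0.3, seed=0):
    """Return x positions spread uniformly within +/- width/2 of center."""
    rng = np.random.default_rng(seed)
    return center + rng.uniform(-width / 2, width / 2, size=n_points)


x_positions = jitter_positions(50)
```

Plotly handles this internally through the jitter argument of its violin and box traces, so a helper like this is only needed when drawing the points manually.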

Data Preprocessing

Ensure your DataFrame (df) is clean and properly formatted.

Handle missing values appropriately before visualization.

Consider normalizing or scaling data if variables are on vastly different scales.
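
One simple way to carry out these steps with pandas is sketched below. Median imputation and z-score scaling are assumed choices, not necessarily what the project uses, and the function name and the column names in the example are hypothetical.

```python
import numpy as np
import pandas as pd


def prepare_for_plotting(df):
    """Fill missing numeric values with the column median, then z-score each column."""
    numeric = df.select_dtypes(include="number")
    filled = numeric.fillna(numeric.median())
    std = filled.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero for constant columns
    return (filled - filled.mean()) / std


# Hypothetical columns and values
raw = pd.DataFrame({
    "age": [29, 54, np.nan, 61],
    "chol": [200.0, 250.0, 300.0, np.nan],
})
scaled = prepare_for_plotting(raw)
```

After scaling, every column has mean 0 and unit standard deviation, so variables on very different scales can share comparable axes.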

Performance Considerations

Large datasets may require significant processing time and memory.

For very large datasets, consider sampling or using a more performant plotting library.
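
Sampling before plotting can be as simple as the sketch below. The function name and the 5,000-row cap are assumptions chosen for illustration.

```python
import pandas as pd


def sample_for_plotting(df, max_rows=5000, seed=0):
    """Return df unchanged if small, otherwise a reproducible random sample of rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)
```

A fixed seed keeps the plotted subset stable between runs, which makes figures easier to compare.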

Troubleshooting

ImportError: Ensure all required libraries are installed.

MemoryError: For large datasets, reduce the number of plotted points or sample the data before plotting.

Kaleido Issues: Make sure Kaleido is properly installed for image export functionality.

FAQs

Q: Can I use this with a CSV file? A: Yes, load your CSV into a pandas DataFrame first: df = pd.read_csv('your_file.csv')

Q: How can I change the output image format? A: Modify the fig.write_image() call in the save_figure function, changing the file extension.

Version History

v1.1.0: Added 25 machine learning models and additional visualizations

v1.0.0: Initial release with basic visualization functionality

Future Improvements

Implement automated hyperparameter tuning for ML models

Add more advanced visualization techniques (e.g., t-SNE, UMAP)

Develop a web interface for easy model selection and result visualization

Posted Dec 2, 2024

Comprehensive EDA and 25 ML classifier on Heart Disease dataset - masud1901/EDA-and-ML-model-prediction-on-Heart-Dataset
