IRIS FLOWER CLASSIFICATION

Ali Hassan

Data Modelling Analyst

Data Analyst

AI Model Developer

Microsoft Excel

Python

Data Analysis and Modeling Documentation

Importing Libraries

In this section, we import essential Python libraries for data analysis and modeling. These libraries include numpy for numerical operations, pandas for data manipulation, matplotlib.pyplot for data visualization, seaborn for enhanced data visualization, and modules from scikit-learn for machine learning.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix, classification_report

from sklearn.metrics import roc_curve, auc

from numpy import interp  # scipy.interp is deprecated in favor of numpy.interp

Loading the Iris Dataset

We load the Iris dataset from a CSV file located at 'D:\Data Analyst\CodSoft\Task 2\IRIS.csv' into a Pandas DataFrame named iris_data.

# Load the Iris dataset from a CSV file

iris_data = pd.read_csv('D:\\Data Analyst\\CodSoft\\Task 2\\IRIS.csv')
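
If the CSV file at that path is not available, a copy of the Iris data also ships with scikit-learn. The sketch below is an alternative under stated assumptions, not the loading step used in this document: iris_demo is a hypothetical name, and the bundled feature columns are named differently from those in IRIS.csv. The labels are stored under 'Species' only to match the column name assumed in the rest of this walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of Iris bundled with scikit-learn as a DataFrame.
bunch = load_iris(as_frame=True)
iris_demo = bunch.frame.copy()

# The bundled frame stores labels as integer codes in a 'target' column.
# Map the codes to species names and keep them under 'Species'.
code_to_name = dict(enumerate(bunch.target_names))
iris_demo['Species'] = iris_demo['target'].map(code_to_name)
iris_demo = iris_demo.drop(columns='target')

# Inspect the size of the data and the number of rows per class.
print(iris_demo.shape)
print(iris_demo['Species'].value_counts())
```

Printing the shape and class counts right after loading is a cheap way to confirm that the file or dataset contains what the later steps expect.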

Data Preparation

Splitting Data into Features and Labels

We split the dataset into features (X) and target labels (y):

We create the feature matrix X by dropping the 'Species' column from iris_data.

We create the target vector y containing the 'Species' column.

# Split the data into features (X) and target labels (y)

X = iris_data.drop('Species', axis=1)

y = iris_data['Species']

Splitting Data into Training and Testing Sets

We split the data into training and testing sets for model evaluation:

We use train_test_split() from scikit-learn to split X and y into X_train, X_test, y_train, and y_test.

The test size is set to 30% of the data, and a random seed (random_state) is set for reproducibility.

# Split the data into training and testing sets (70% training, 30% testing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
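
A plain random split can leave the classes unevenly represented in the test set. As a hedged variation on the split above (not the split used in this document), the sketch below adds the stratify argument so that each class keeps its share of the data. It uses the Iris copy bundled with scikit-learn, and the names X_demo, y_demo, X_tr, X_te, y_tr, and y_te are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same 70/30 ratio and seed as above, plus stratify=y_demo so that the
# class proportions are preserved in both the training and testing sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # rows and columns in each split
print(np.bincount(y_te))       # number of test rows per class
```

With 150 rows and three balanced classes, stratification gives every class the same number of test rows, which makes per-class metrics easier to compare.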

Support Vector Machine (SVM) Modeling

Creating and Training the SVM Model

We create and train a Support Vector Machine (SVM) model with a linear kernel:

We instantiate an SVM model with a linear kernel and probability estimates using SVC().

We fit the SVM model to the training data using fit() and predict labels for the test set using predict().

# Train a Support Vector Machine (SVM) model

svm_model = SVC(kernel='linear', C=1, probability=True)

svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)
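
The imports at the top include classification_report, which summarizes precision, recall, and F1-score per class. The sketch below shows one way it could be applied to SVM predictions. It is an illustration on the Iris copy bundled with scikit-learn, not output from the model above, and the names svm_demo, y_pred_demo, accuracy, and report are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

bunch = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    bunch.data, bunch.target, test_size=0.3, random_state=123
)

# Linear-kernel SVM with the same C as in the walkthrough.
svm_demo = SVC(kernel='linear', C=1)
svm_demo.fit(X_tr, y_tr)
y_pred_demo = svm_demo.predict(X_te)

# Overall accuracy plus per-class precision, recall, and F1-score.
accuracy = accuracy_score(y_te, y_pred_demo)
report = classification_report(
    y_te, y_pred_demo, labels=[0, 1, 2], target_names=bunch.target_names
)
print(f"accuracy = {accuracy:.3f}")
print(report)
```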

Confusion Matrix for SVM

We calculate the confusion matrix for the SVM model to evaluate its performance:

The confusion matrix is computed using confusion_matrix() from scikit-learn.

# Calculate the confusion matrix for SVM

confusion_matrix_svm = confusion_matrix(y_test, y_pred_svm)
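
To make the layout of a confusion matrix concrete, the sketch below runs confusion_matrix() on a handful of toy labels. The labels are invented for illustration and are not results from the SVM model. Rows correspond to true classes and columns to predicted classes; the normalize='true' option divides each row by its total.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not model output).
y_true_toy = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred_toy = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']
labels = ['setosa', 'versicolor', 'virginica']

# Raw counts: entry [i, j] counts samples of true class i predicted as class j.
counts = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Row-normalized rates: each row sums to 1.
rates = confusion_matrix(y_true_toy, y_pred_toy, labels=labels, normalize='true')

print(counts)
print(rates)
```

Here one of the two versicolor samples is predicted as virginica, so the versicolor row reads 0, 1, 1 in counts and 0, 0.5, 0.5 in rates. The diagonal of the row-normalized matrix is the per-class recall.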

Plotting a Confusion Matrix Heatmap for SVM

We create a heatmap visualization of the confusion matrix for the SVM model:

We use sns.heatmap() to create the heatmap with annotations.

The heatmap is labeled and displayed to visualize the model's classification results.

# Plot a confusion matrix heatmap for SVM

plt.figure()

sns.heatmap(confusion_matrix_svm, annot=True, fmt="d", cmap="Blues", xticklabels=np.unique(y), yticklabels=np.unique(y))

plt.xlabel('Predicted')

plt.ylabel('True')

plt.title('SVM Confusion Matrix')

plt.show()

Random Forest Modeling

Creating and Training the Random Forest Model

We create and train a Random Forest classification model with 100 estimators:

We instantiate a Random Forest model with 100 estimators and a random seed using RandomForestClassifier().

We fit the Random Forest model to the training data using fit() and predict labels for the test set using predict().

# Train a Random Forest model

rf_model = RandomForestClassifier(n_estimators=100, random_state=123)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
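
A fitted Random Forest also exposes feature_importances_, which indicates how much each feature contributed to the splits across the trees. The sketch below is an optional illustration on the Iris copy bundled with scikit-learn; rf_demo and importances are hypothetical names, and the feature names come from the bundled data rather than from IRIS.csv.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

bunch = load_iris(as_frame=True)

# Same hyperparameters as the model in the walkthrough.
rf_demo = RandomForestClassifier(n_estimators=100, random_state=123)
rf_demo.fit(bunch.data, bunch.target)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(
    rf_demo.feature_importances_, index=bunch.data.columns
).sort_values(ascending=False)
print(importances)
```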

Confusion Matrix for Random Forest

We calculate the confusion matrix for the Random Forest model to evaluate its performance:

The confusion matrix is computed using confusion_matrix() from scikit-learn.

# Calculate the confusion matrix for Random Forest

confusion_matrix_rf = confusion_matrix(y_test, y_pred_rf)
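
A single train/test split gives one estimate of performance per model. As a complementary check that is not part of the original workflow, the sketch below compares the two model types with 5-fold cross-validation on the Iris copy bundled with scikit-learn. The names models and cv_scores are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# The two model configurations used in this walkthrough.
models = {
    'SVM': SVC(kernel='linear', C=1),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=123),
}

# cross_val_score returns one accuracy value per fold.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```

Averaging over several folds reduces the dependence of the comparison on one particular random split.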

Plotting a Confusion Matrix Heatmap for Random Forest

We create a heatmap visualization of the confusion matrix for the Random Forest model:

We use sns.heatmap() to create the heatmap with annotations.

The heatmap is labeled and displayed to visualize the model's classification results.

# Plot a confusion matrix heatmap for Random Forest

plt.figure()

sns.heatmap(confusion_matrix_rf, annot=True, fmt="d", cmap="Greens", xticklabels=np.unique(y), yticklabels=np.unique(y))

plt.xlabel('Predicted')

plt.ylabel('True')

plt.title('Random Forest Confusion Matrix')

plt.show()

ROC Curve Analysis

ROC Curve for SVM

We create Receiver Operating Characteristic (ROC) curves for the SVM model:

We calculate the decision function scores (y_score_svm) for the test data using decision_function().

For each class in the target variable, we calculate False Positive Rate (FPR) and True Positive Rate (TPR) and compute the area under the ROC curve (AUC) using roc_curve() and auc().

ROC curves are plotted for each class, and the AUC is displayed to assess the SVM model's ability to distinguish between classes.

# Create ROC curve for SVM

y_score_svm = svm_model.decision_function(X_test)

fpr_svm = dict()

tpr_svm = dict()

roc_auc_svm = dict()

for i, class_name in enumerate(np.unique(y)):

    fpr_svm[i], tpr_svm[i], _ = roc_curve(y_test == class_name, y_score_svm[:, i])

    roc_auc_svm[i] = auc(fpr_svm[i], tpr_svm[i])

# Plot ROC curve for SVM

plt.figure()

colors = ['blue', 'red', 'green']

for i, color in zip(range(len(np.unique(y))), colors):

    plt.plot(fpr_svm[i], tpr_svm[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(np.unique(y)[i], roc_auc_svm[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve for SVM')

plt.legend(loc="lower right")

plt.show()
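
The one-vs-rest calculation in the loop above can be traced by hand on a tiny example. The sketch below uses four invented scores for a single class (illustration only, not model output) and passes a boolean "is this class" vector to roc_curve(), in the same way as y_test == class_name in the loop.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# True where the sample belongs to the class of interest (toy values).
is_class = np.array([False, False, True, True])
# Classifier scores for that class; higher means "more likely this class".
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps a threshold over the scores and records FPR and TPR.
fpr, tpr, thresholds = roc_curve(is_class, scores)
area = auc(fpr, tpr)

print(fpr)
print(tpr)
print(area)
```

AUC equals the fraction of positive/negative pairs in which the positive sample receives the higher score. Here three of the four pairs are ranked correctly (0.35 is below 0.4), so the area is 0.75.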

ROC Curve for Random Forest

We create ROC curves for the Random Forest model:

We calculate class probabilities (y_score_rf) for the test data using predict_proba().

For each class, we calculate FPR, TPR, and AUC as in the SVM section.

ROC curves are plotted for each class, and the AUC is displayed to assess the Random Forest model's classification performance.

# Create ROC curve for Random Forest

y_score_rf = rf_model.predict_proba(X_test)

fpr_rf = dict()

tpr_rf = dict()

roc_auc_rf = dict()

for i, class_name in enumerate(np.unique(y)):

    fpr_rf[i], tpr_rf[i], _ = roc_curve(y_test == class_name, y_score_rf[:, i])

    roc_auc_rf[i] = auc(fpr_rf[i], tpr_rf[i])

# Plot ROC curve for Random Forest

plt.figure()

for i, color in zip(range(len(np.unique(y))), colors):

    plt.plot(fpr_rf[i], tpr_rf[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(np.unique(y)[i], roc_auc_rf[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve for Random Forest')

plt.legend(loc="lower right")

plt.show()

Overall, this code demonstrates the process of loading a dataset, splitting it for model training and testing, building two different classification models (SVM and Random Forest), evaluating their performance using confusion matrices, and assessing their ability to classify data points using ROC curves and AUC values.