Loan Eligibility Prediction using Gradient Boosting Classifier

Nitin Arora

Business Analyst
Data Analyst

Overview 🔎

SYL Bank is one of Australia's largest banks. Currently, the loan applications that come in to its various branches are processed manually. The decision whether to grant a loan is subjective, and because of the large volume of applications, deciding the loan grant status is getting harder. The bank therefore wants to build an automated machine learning solution that looks at different factors and decides whether to grant a loan to each applicant.
In this ML problem, we will build a classification model, since we have to predict whether an applicant should get a loan. We will look at various applicant factors, such as credit score and past history, and use them to predict the loan granting status. We will also cleanse the data and fill in the missing values so that the ML model performs as expected. The model will output a probability score along with a Loan Granted or Loan Refused decision.
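The "probability score plus decision" output described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (not the SYL loan data), the positive class is treated as "Loan Granted", and the 0.5 cut-off is hypothetical, since the source does not state a threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the loan data (assumption: the real features differ).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Probability of the positive class, treated here as "Loan Granted".
proba = model.predict_proba(X[:5])[:, 1]

# Hypothetical 0.5 cut-off; the source does not specify one.
decisions = np.where(proba >= 0.5, "Loan Granted", "Loan Refused")

for p, d in zip(proba, decisions):
    print(f"{p:.3f} -> {d}")
```

In practice the cut-off would likely be tuned to the bank's tolerance for wrongly granted versus wrongly refused loans.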
Understanding the problem statement and business intent
Understanding different libraries and their respective uses
In-depth exploratory data analysis of each feature and feature engineering
Data cleansing and preparation
Creating custom functions for Machine Learning Models
In-depth explanation of data imputation and filling missing values
Defining an approach to solve ML Classification problems
Data preparation for the Machine Learning models
Training and testing the model using Cross Validation
Building statistical models like Gradient Boosting, XGBoost etc
Selecting the best model based on different metrics
Understanding metrics like ROC Curve, MCC scorer etc
Creating pickle files for model reusability
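As an illustration of the last item, a fitted model can be written to disk and reloaded with joblib. This is a minimal sketch, assuming synthetic data and a hypothetical file name (`loan_model.pkl`); neither comes from the project itself.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data standing in for the prepared loan features (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "loan_model.pkl")
joblib.dump(model, path)

# Reload the model and confirm it predicts the same labels as the original.
reloaded = joblib.load(path)
same = bool((reloaded.predict(X) == model.predict(X)).all())
print("Reloaded model matches original:", same)
```

A file saved this way should only be loaded in an environment with compatible library versions, since pickled estimators are not guaranteed to be portable across scikit-learn releases.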
import pandas as pd

import numpy as np
import os
import operator
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.preprocessing import LabelBinarizer,StandardScaler,OrdinalEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from scipy.stats import boxcox
from sklearn.linear_model import LogisticRegression,RidgeClassifier, PassiveAggressiveClassifier
from sklearn import metrics
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from fancyimpute import KNN,SoftImpute
from xgboost import plot_importance
from matplotlib import pyplot
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
import joblib

%matplotlib inline

def classify(est, x, y, X_test, y_test):
    # Pass the model and the training data to fit the model
    est.fit(x, y)
    # Predict the class probabilities and labels for the test data
    y2 = est.predict_proba(X_test)
    y1 = est.predict(X_test)

    '''print("Accuracy: ", metrics.accuracy_score(y_test, y1))
    print("Area under the ROC curve: ", metrics.roc_auc_score(y_test, y2[:, 1]))
    #Calculate different metrics
    print("F-metric: ", metrics.f1_score(y_test, y1))
    print(" ")
    print("Classification report:")
    print(metrics.classification_report(y_test, y1))
    print(" ")
    print("Evaluation by cross-validation:")
    print(cross_val_score(est, x, y))'''

    # Return the fitted model, the predicted labels, and the positive-class probabilities
    return est, y1, y2[:, 1]
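The commented-out report in `classify` ends with a `cross_val_score` call. A minimal sketch of that kind of evaluation might look like the following; the synthetic data, `cv=5`, and the `roc_auc` scoring choice are assumptions made for illustration and are not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared loan features (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

est = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Five-fold cross-validation, scored by area under the ROC curve.
scores = cross_val_score(est, X, y, cv=5, scoring="roc_auc")

print("Per-fold ROC AUC:", scores)
print("Mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```

Looking at the spread across folds, and not only the mean, gives a rough sense of how stable the model is on different slices of the data.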

# Function to rank the features by the importance the fitted model assigns to them
def feat_importance(estimator):
    feature_importance = {}
    # df_LC is expected to be the DataFrame of training features (not defined in this section)
    for index, name in enumerate(df_LC.columns):
        feature_importance[name] = estimator.feature_importances_[index]

    # Sort the (feature, importance) pairs from most to least important
    sorted_x = sorted(feature_importance.items(), key=operator.itemgetter(1), reverse=True)

    return sorted_x
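`feat_importance` reads the column names from a global `df_LC`. A sketch of the same ranking with the column names passed in explicitly is shown below; the helper name `ranked_importances`, the synthetic data, and the column names `f0` to `f3` are all hypothetical.

```python
import operator

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic features with hypothetical column names (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

est = GradientBoostingClassifier(random_state=0).fit(features, y)

def ranked_importances(estimator, columns):
    """Pair each column with its importance, sorted from most to least important."""
    pairs = zip(columns, estimator.feature_importances_)
    return sorted(pairs, key=operator.itemgetter(1), reverse=True)

ranking = ranked_importances(est, features.columns)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Passing the columns as an argument avoids relying on a global variable that must match the training data.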

# Function to fit several models, report their metrics, and plot their ROC curves to find the best one
def run_models(X_train, y_train, X_test, y_test, model_type = 'Non-balanced'):

    clfs = {'GradientBoosting': GradientBoostingClassifier(max_depth= 6, n_estimators=100, max_features = 0.3),
            'LogisticRegression' : LogisticRegression(),
            #'GaussianNB': GaussianNB(),
            'RandomForestClassifier': RandomForestClassifier(n_estimators=10),
            'XGBClassifier': XGBClassifier()
            }
    cols = ['model_type', 'model', 'matthews_corrcoef', 'roc_auc_score', 'precision_score', 'recall_score', 'f1_score']

    rows = []
    conf_matrix = dict()

    for clf_name, clf in clfs.items():
        clf.fit(X_train, y_train)

        y_pred = clf.predict(X_test)
        y_score = clf.predict_proba(X_test)[:,1]

        print('computing {} - {} '.format(clf_name, model_type))

        # Collect one row of metrics per model (DataFrame.append was removed in pandas 2.0)
        rows.append({'model_type': model_type,
                     'model': clf_name,
                     'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),
                     'matthews_corrcoef': metrics.matthews_corrcoef(y_test, y_pred),
                     'precision_score': metrics.precision_score(y_test, y_pred),
                     'recall_score': metrics.recall_score(y_test, y_pred),
                     'f1_score': metrics.f1_score(y_test, y_pred)})

        conf_matrix[clf_name] = metrics.confusion_matrix(y_test, y_pred)
        fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score, drop_intermediate = False, pos_label = 1)

        # Draw every model's ROC curve on the same figure
        plt.figure(1, figsize=(6,6))
        plt.xlabel('false positive rate')
        plt.ylabel('true positive rate')
        plt.title('ROC curve - {}'.format(model_type))
        plt.plot(fpr, tpr, label = clf_name )
        plt.legend(loc=2, prop={'size':11})

    # Diagonal reference line: the ROC curve of a random classifier
    plt.plot([0,1],[0,1], color = 'black')
    models_report = pd.DataFrame(rows, columns = cols)

    return models_report, conf_matrix, plt

#############Reading the dataset############################

###############EDA Starts here####################################

##Drop the duplicates with respect to LOAN ID
data.drop_duplicates(subset="Loan ID",keep='first',inplace=True)
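To make the effect of `keep='first'` concrete, here is a small sketch on toy rows. Only the `Loan ID` column name comes from the code above; the `Credit Score` column and all values are made up for illustration.

```python
import pandas as pd

# Toy rows: loan "A1" appears twice with different credit scores (made-up values).
data = pd.DataFrame({
    "Loan ID": ["A1", "A1", "B2"],
    "Credit Score": [700, 710, 650],
})

# Keep the first row seen for each Loan ID and drop later repeats.
deduped = data.drop_duplicates(subset="Loan ID", keep="first")

print(deduped)
```

Only the first `A1` row survives, so which record is kept depends on the row order of the input.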
To achieve a passing grade, the accuracy of the model has to be at least 70 percent.
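A check against the 70 percent bar could be sketched as below, using hand-written labels instead of real model output. The Matthews correlation coefficient that `run_models` also reports is shown alongside, since accuracy alone can look good on imbalanced data. The label arrays and the variable names are hypothetical.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hand-written labels for illustration: 8 positives and 2 negatives.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)   # 9 of 10 correct
mcc = matthews_corrcoef(y_true, y_pred)     # (8*1 - 1*0) / sqrt(9*8*2*1) = 8/12

PASSING_ACCURACY = 0.70  # the 70 percent requirement stated above
passes = accuracy >= PASSING_ACCURACY

print(f"accuracy={accuracy:.2f}, mcc={mcc:.3f}, passes={passes}")
```

Here the accuracy is 0.90 while the MCC is about 0.667, which reflects that half of the negative cases were misclassified.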


