Building Your End-to-End Machine Learning Model

Cornellius Yudha Wijaya


Hi everyone! I am sure you are reading this article because you are interested in machine learning and want to build a model of your own.
You may have tried to develop machine learning models before, or you may be entirely new to the concept. Whatever your experience, this article will guide you through the best practices for developing machine learning models.
In this article, we will develop a Customer Churn prediction classification model following the steps below:
1. Business Understanding
2. Data Collection and Preparation
Collecting Data
Exploratory Data Analysis (EDA) and Data Cleaning
Feature Selection
3. Building the Machine Learning Model
Choosing the Right Model
Splitting the Data
Training the Model
Model Evaluation
4. Model Optimization
5. Deploying the Model
If you are excited about building your first machine learning model, let's get into it.

Understanding the Basics

Before we get into the machine learning model development, let’s briefly explain machine learning, the types of machine learning, and a few terminologies we will use in this article.
First, let’s discuss the types of machine learning models we can develop. The four main types of machine learning are:
Supervised Machine Learning is an algorithm that learns from labeled datasets. Using the known correct outputs, the model learns the underlying patterns and tries to predict them for new data. There are two categories in Supervised Machine Learning: Classification (Category prediction) and Regression (Numerical prediction).
Unsupervised Machine Learning is an algorithm that tries to find patterns in data without direction. Unlike supervised machine learning, the model is not guided by labeled data. This type has two common categories: Clustering (Data Segmentation) and Dimensionality Reduction (Feature Reduction).
Semi-supervised machine learning combines labeled and unlabeled datasets, where the labeled data guides the model in identifying patterns in the unlabeled data. The simplest example is a self-training model that labels the unlabeled data based on patterns learned from the labeled data.
Reinforcement Learning is a machine learning approach in which an agent interacts with an environment and receives feedback on its actions (a reward or a penalty). It learns to maximize rewards and avoid penalties. An example application of this type of model is the self-driving car.
You also need to know a few terminologies to develop a machine-learning model:
Features: Input variables used to make predictions in a machine learning model.
Labels: Output variables that the model is trying to predict.
Data Splitting: The process of data separation into different sets.
Training Set: Data used to train the machine learning model.
Test Set: Data used to evaluate the performance of the trained model.
Validation Set: Data used during the training process to tune hyperparameters.
Exploratory Data Analysis (EDA): The process of analyzing and visualizing datasets to summarize their information and discover patterns.
Models: The outcome of the Machine Learning process. They are the mathematical representation of the patterns and relationships within the data.
Overfitting: Occurs when the model fits the training data too closely and learns its noise, failing to generalize. The model predicts well on the training set but not on the test set.
Underfitting: Occurs when a model is too simple to capture the underlying patterns in the data. The model performs poorly on both the training and test sets.
Hyperparameters: Configuration settings used to tune the model, set before training begins.
Cross-validation: A technique for evaluating the model by partitioning the original sample into training and validation sets multiple times.
Feature Engineering: Using domain knowledge to get new features from raw data.
Model Training: The process of learning the parameters of a model using the training data.
Model Evaluation: Assessing the performance of a trained model using machine learning metrics like accuracy, precision, and recall.
Model Deployment: Making a trained model available in a production environment.
With all this basic knowledge, let’s learn to develop our first machine-learning model.

1. Business Understanding

Before developing any machine learning model, we must understand why we are building it in the first place. That’s why understanding what the business wants is necessary to ensure the model is valid. Business understanding usually requires a proper discussion with the related stakeholders. Still, since this tutorial does not have business users for the machine learning model, we will assume the business needs ourselves.
As stated previously, we will develop a Customer Churn prediction model. In this case, the business wants to avoid further churn and to take action on customers with a high probability of churning. With the above business requirements, we need specific metrics to measure whether the model performs well. There are many measurements, but I propose using the Recall metric.
In monetary terms, it might be more beneficial to use Recall, as it minimizes False Negatives, that is, customers who are predicted not to churn but actually do churn. Of course, we could also aim for balance by using the F1 metric.
With that in mind, let's get into the first part of our tutorial.

2. Data Collection and Preparation

Data Collection

Data is the heart of any machine learning project. Without it, we can’t have a machine learning model to train. That’s why we need quality data with proper preparation before we input them into the machine learning algorithm.
In a real-world case, clean data does not come easily. Often, we need to collect it through applications, surveys, and many other sources before storing it in data storage. However, this tutorial only covers collecting the dataset, as we will use an existing clean dataset.
In our case, we will use the Telco Customer Churn data from Kaggle. It is an open-source classification dataset about customer history in the telco industry, labeled with whether the customer churned.

Exploratory Data Analysis (EDA) and Data Cleaning

Let’s start by reviewing our dataset. I assume the reader already has basic Python knowledge and can use Python packages in their notebook. I also based this tutorial on the Anaconda distribution to make things easier.
To understand the data we have, we need to load it into a Python package for data manipulation. The most famous one is the Pandas Python package, which we will use. We can use the following code to load and review the CSV data.
import pandas as pd

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

Next, we would explore the data to understand our dataset. Here are a few actions that we would perform for the EDA process.
1. Examine the features and the summary statistics.
2. Check for missing values in the features.
3. Analyze the distribution of the label (Churn).
4. Plot histograms for numerical features and bar plots for categorical features.
5. Plot a correlation heatmap for numerical features.
6. Use box plots to identify distributions and potential outliers.
First, we would check the features and summary statistics. With Pandas, we can see our dataset features using the following code.
# Get the basic information about the dataset
df.info()

Output>>

RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

We would also get the dataset summary statistics with the following code.
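For example, a minimal sketch using Pandas could look like this:

# Summary statistics for the numerical columns
df.describe()
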
From the information above, we understand that we have 19 features with one target feature (Churn). The dataset contains 7043 rows, and most of the features are categorical.
Let’s check for the missing data.
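A quick check with Pandas might look like this:

# Count the missing values in each column
df.isnull().sum()
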
Our dataset does not contain missing data, so we don’t need any missing data treatment. Next, we will check the target variable to see whether we have an imbalanced case.
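For example, we could inspect the class proportions like this:

# Proportion of churn vs non-churn customers
df['Churn'].value_counts(normalize=True)
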
There is a slight imbalance, as only about 25% of customers churn compared to those who do not.
Let’s also look at the distribution of the other features, starting with the numerical ones. We will first convert the TotalCharges feature to a numerical column, as it should be numerical rather than categorical. Conversely, the SeniorCitizen feature should be categorical, so we will convert it to strings. We will also derive a new numerical column from the categorical Churn feature.
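A sketch of these transformations and the histogram plots might look like the following (the ChurnTarget column name is just an illustrative choice):

import matplotlib.pyplot as plt

# TotalCharges should be numerical; invalid entries become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# SeniorCitizen is better treated as a categorical feature
df['SeniorCitizen'] = df['SeniorCitizen'].astype(str)

# Numerical representation of the Churn label (the column name is an example)
df['ChurnTarget'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# Histograms for the numerical features
num_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[num_features].hist(bins=30, figsize=(12, 4))
plt.show()
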
We will also plot the categorical features, except for customerID, as it is an identifier with unique values.
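A possible sketch for the categorical plots:

# Bar plots for the categorical features (customerID and the label excluded)
cat_features = df.drop(columns=['customerID', 'Churn']).select_dtypes(include='object').columns
for col in cat_features:
    df[col].value_counts().plot(kind='bar', title=col)
    plt.show()
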
We then would see the correlation between numerical features with the following code.
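A minimal sketch, assuming the seaborn package and the ChurnTarget column created earlier:

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation between the numerical features and the numerical churn target
corr = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'ChurnTarget']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
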
The correlation above is based on the Pearson Correlation, a linear correlation between one feature and another. We can also perform a correlation analysis on the categorical features with Cramer’s V. To make the analysis easier, we will install the Dython Python package.
Once the package is installed, we will perform the correlation analysis with the following code.
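A sketch of the categorical correlation analysis with Dython could look like this:

# pip install dython
from dython.nominal import associations

# Association matrix (Cramer's V for categorical pairs), excluding the identifier
assoc = associations(df.drop(columns=['customerID']))
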
Lastly, we will check the numerical features for outliers with box plots based on the Interquartile Range (IQR).
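For example:

# Box plots to spot potential outliers in the numerical features
df[['tenure', 'MonthlyCharges', 'TotalCharges']].plot(kind='box', subplots=True, figsize=(12, 4))
plt.show()
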
From the analysis above, we can see that there are no missing data or extreme outliers we need to address. The next step is to perform feature selection for our machine learning model, as we only want the features that impact the prediction and are viable for the business.

Feature Selection

There are many ways to perform feature selection, usually done by combining business knowledge and technical application. However, this tutorial will only use the correlation analysis we have done previously to make the feature selection.
First, let’s select the numerical features based on the correlation analysis.
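A minimal sketch, keeping the numerical features whose absolute correlation with the churn target exceeds a threshold (the 0.3 threshold and the ChurnTarget column name are illustrative choices):

# Correlation of each numerical feature with the churn target
target_corr = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'ChurnTarget']].corr()['ChurnTarget'].drop('ChurnTarget')

# Keep features above the chosen threshold
selected_num_features = target_corr[target_corr.abs() > 0.3].index.tolist()
print(selected_num_features)
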
You can play around with the threshold later to see whether the feature selection affects the model's performance. We will also perform feature selection for the categorical features.
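A sketch of the categorical selection with Cramer's V (again, the 0.3 threshold is only an example):

from dython.nominal import cramers_v

cat_features = df.drop(columns=['customerID', 'Churn', 'ChurnTarget']).select_dtypes(include='object').columns

# Keep categorical features that are sufficiently associated with the Churn label
selected_cat_features = [col for col in cat_features if cramers_v(df[col], df['Churn']) > 0.3]
print(selected_cat_features)
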
Then, we would combine all the selected features with the following code.
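For example:

# Combine the selected numerical and categorical features with the label
selected_features = selected_num_features + selected_cat_features
df_selected = df[selected_features + ['Churn']]
df_selected.head()
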
In the end, we have six features that would be used to develop the customer churn machine learning model.

3. Building the Machine Learning Model

Choosing the Right Model

There are many considerations to choosing a suitable model for machine learning development, but it always depends on the business needs. A few points to remember:
The use case problem. Is it supervised or unsupervised, or is it classification or regression? Is it Multiclass or Multilabel? The case problem would dictate which model can be used.
The data characteristics. Is it tabular data, text, or image? Is the dataset size big or small? Did the dataset contain missing values? Depending on the dataset, the model we choose could be different.
How easy is the model to interpret? Balancing interpretability and performance is essential for the business.
As a rule of thumb, it is often best to start with a simpler model as a benchmark before proceeding to a more complex one. You can read my previous article about simple models to understand what constitutes one.
For this tutorial, let's start with a linear model, Logistic Regression, for the model development.

Splitting the Data

The next activity is to split the data into training, test, and validation sets. The purpose of data splitting during machine learning model training is to have a dataset that acts as unseen (real-world) data, so we can evaluate the model without bias or data leakage.
To split the data, we will use the following code:
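A sketch using Scikit-Learn's train_test_split (the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

X = df_selected.drop(columns=['Churn'])
y = df_selected['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# First split off 60% for training, then split the rest evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
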
In the above code, we split the data into 60% for the training set and 20% each for the test and validation sets. Once we have the datasets, we will train the model.

Training the Model

As mentioned, we would train a Logistic Regression model with our training data. However, the model can only accept numerical data, so we must preprocess the dataset. This means we need to transform the categorical data into numerical data.
For best practice, we also use the Scikit-Learn pipeline to contain all the preprocessing and modeling steps. The following code allows you to do that.
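A sketch of such a pipeline, assuming the selected feature lists from the previous step:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the numerical features and one-hot encode the categorical ones
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), selected_num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), selected_cat_features),
])

# Chain the preprocessing steps and the Logistic Regression model
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])

model_pipeline.fit(X_train, y_train)
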
The model pipeline would look like the image below.
The Scikit-Learn pipeline would accept the unseen data and go through all the preprocessing steps before entering the model. After the model is finished training, let’s evaluate our model result.

Model Evaluation

As mentioned, we will evaluate the model by focusing on the Recall metric. However, the following code shows all the basic classification metrics.
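A sketch using Scikit-Learn's classification report:

from sklearn.metrics import classification_report

# Precision, recall, and F1 for the validation and test sets
print('Validation set:')
print(classification_report(y_val, model_pipeline.predict(X_val)))
print('Test set:')
print(classification_report(y_test, model_pipeline.predict(X_test)))
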
As we can see from the Validation and Test data, the Recall for churn (1) is not the best. That’s why we can optimize the model to get the best result.

4. Model Optimization

We always need to focus on the data to get the best result. However, optimizing the model could also lead to better results. This is why we can optimize our model. One way to do so is via hyperparameter optimization, which tests combinations of the model hyperparameters to find the best one based on the chosen metric.
Every model has a set of hyperparameters we can set before training it. Hyperparameter optimization is the experiment to see which combination works best. To do that, we can use the following code.
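A sketch with Scikit-Learn's GridSearchCV, optimizing for Recall (the parameter grid below is only an example):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l2'],
}

# Search the grid with 5-fold cross-validation, scored on recall
grid_search = GridSearchCV(model_pipeline, param_grid, scoring='recall', cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print('Best cross-validated recall:', grid_search.best_score_)
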
The results still do not show the best recall score, but this is expected, as this is only the baseline model. Let's experiment with several models to see if the Recall performance improves. You can always tweak the hyperparameters below.
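For instance, we could compare a few common classifiers on the validation set (the candidate models below are illustrative):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import recall_score

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
}

# Fit each candidate inside the same preprocessing pipeline and compare validation recall
for name, clf in candidates.items():
    pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', clf)])
    pipe.fit(X_train, y_train)
    print(name, 'validation recall:', recall_score(y_val, pipe.predict(X_val)))
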
The recall result has not changed much; even the baseline Logistic Regression still seems to be the best. If we want a better result, we should go back and improve the feature selection.
However, let's move forward with the current Logistic Regression model and try to deploy it.

5. Deploying the Model

We have built our machine learning model. Now that we have the model, the next step is to deploy it to production. Let's simulate it using a simple API.
First, let’s develop our model again and save it as a joblib object.
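A sketch (the churn_model.joblib file name is just an example):

import joblib

# Refit the pipeline on the training data and persist it to disk
model_pipeline.fit(X_train, y_train)
joblib.dump(model_pipeline, 'churn_model.joblib')
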
Once the model object is ready, we will move into a Python script to create the API. But first, we need to install a few packages used for deployment.
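Assuming we serve the model with FastAPI and Uvicorn (other frameworks would also work), the installation could be:

pip install fastapi uvicorn
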
We would not do it in the notebook but in an IDE such as Visual Studio Code. In your preferred IDE, create a Python script called app.py and put the code below into the script.
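A sketch of such a script, assuming the FastAPI setup above (adapt the input fields to the features your model was trained on):

# app.py
import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

# Load the pipeline saved earlier (the file name is an example)
model = joblib.load('churn_model.joblib')

@app.post('/predict')
def predict(data: dict):
    # Wrap the incoming record in a DataFrame so the pipeline can preprocess it
    df_input = pd.DataFrame([data])
    prediction = model.predict(df_input)
    return {'prediction': int(prediction[0])}
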
In your command prompt or terminal, run the following command.
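Assuming the FastAPI script above:

uvicorn app:app --reload
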
With the code above, we already have an API to accept data and create predictions. Let’s try it out with the following code in the new terminal.
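For example, with the requests package (the payload keys must match the features the model expects):

import requests

sample = {
    'tenure': 1,
    'MonthlyCharges': 29.85,
    # add the remaining selected features here
}

response = requests.post('http://127.0.0.1:8000/predict', json=sample)
print(response.json())
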
As you can see, the API result is a dictionary with prediction 0 (Not-Churn). You can tweak the code even further to get the desired result.
Congratulations! You have developed your machine learning model and successfully deployed it as an API.

Conclusion

We have learned how to develop a machine learning model from the beginning to deployment. Experiment with other datasets and use cases to get an even better feel for the process. All the code this article uses will be available on my GitHub repository.