Customer Churn Prediction with Machine Learning by Kristóf Németh

Data Analyst

Data Scientist

Python

scikit-learn

XGBoost

Data

customer-churn-analysis

Overview

Customer churn occurs when customers end their relationship or subscription with a company or service provider. The churn rate is the share of customers who stop using a company's products or services within a specific period.

Churn is an important metric because it directly affects revenue and growth. Analyzing churn behavior and the features associated with it helps companies identify the patterns and indicators behind customer attrition, develop strategies to retain existing customers, and improve customer satisfaction.

This project uses machine learning to analyze and predict customer churn, based on the Customer Churn Dataset from Kaggle.

Data Preprocessing

Data Cleaning

Removed the CustomerID column as it is not a predictive feature.

Handled missing values by dropping rows with null values.
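The two cleaning steps above can be sketched with pandas. The small frame below is made-up illustration data, not the Kaggle dataset; only the `CustomerID` column name comes from the description.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from the Kaggle CSV.
raw = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Age": [30.0, 45.0, None, 52.0],
    "Tenure": [12.0, 3.0, 7.0, 24.0],
    "Churn": [0.0, 1.0, 1.0, 0.0],
})

# Drop the identifier column, then drop every row that still contains a null.
clean = raw.drop(columns=["CustomerID"]).dropna().reset_index(drop=True)
print(clean.shape)
```

Dropping rows is a reasonable choice when only a small share of rows is incomplete; with many missing values, imputation would be worth considering instead.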

Feature Engineering

Created new features:

Spend_per_Tenure = Total Spend / (Tenure + 1)

SupportCalls_per_Tenure = Support Calls / (Tenure + 1)

Usage_per_Tenure = Usage Frequency / (Tenure + 1)
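A minimal pandas sketch of the three ratio features defined above. The helper name `add_tenure_ratios` and the example rows are hypothetical; the column names follow the ones listed in this write-up. The `+ 1` in the denominator keeps each ratio finite should `Tenure` ever be 0.

```python
import pandas as pd

def add_tenure_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three per-tenure ratio features added."""
    out = df.copy()
    denom = out["Tenure"] + 1  # +1 keeps the denominator nonzero when Tenure is 0
    out["Spend_per_Tenure"] = out["Total Spend"] / denom
    out["SupportCalls_per_Tenure"] = out["Support Calls"] / denom
    out["Usage_per_Tenure"] = out["Usage Frequency"] / denom
    return out

# Made-up example rows, including a Tenure of 0 to exercise the +1 guard.
example = pd.DataFrame({
    "Tenure": [0, 9],
    "Total Spend": [100.0, 500.0],
    "Support Calls": [2, 5],
    "Usage Frequency": [4, 20],
})
features = add_tenure_ratios(example)
print(features[["Spend_per_Tenure", "SupportCalls_per_Tenure", "Usage_per_Tenure"]])
```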

Data Transformation

Converted data types:

Numerical columns (Age, Tenure, Usage Frequency, Support Calls, Payment Delay, Total Spend, Last Interaction, Churn) were converted to int64.

Categorical Encoding:

Gender, Subscription Type, and Contract Length were converted into categorical variables for better compatibility with machine learning models.
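The type conversions above might look as follows in pandas. This is a sketch on made-up rows; the column lists mirror the ones named in this section, and the category values are placeholders. Note that casting to `int64` discards any fractional part and fails on missing values, so it has to run after the null rows are dropped.

```python
import pandas as pd

NUMERIC_COLS = ["Age", "Tenure", "Usage Frequency", "Support Calls",
                "Payment Delay", "Total Spend", "Last Interaction", "Churn"]
CATEGORICAL_COLS = ["Gender", "Subscription Type", "Contract Length"]

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "Age": [30.0, 45.0], "Tenure": [12.0, 3.0], "Usage Frequency": [14.0, 2.0],
    "Support Calls": [1.0, 8.0], "Payment Delay": [0.0, 20.0],
    "Total Spend": [932.0, 221.0], "Last Interaction": [5.0, 28.0],
    "Churn": [0.0, 1.0],
    "Gender": ["Female", "Male"],
    "Subscription Type": ["Premium", "Basic"],
    "Contract Length": ["Annual", "Monthly"],
})

# One astype call with a per-column mapping converts both groups at once.
typed = raw.astype({
    **{col: "int64" for col in NUMERIC_COLS},
    **{col: "category" for col in CATEGORICAL_COLS},
})
print(typed.dtypes)
```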

Data Splitting & Scaling

Splitting: 80% training, 20% testing.

Feature Scaling: Min-Max normalization applied to rescale feature values to a common range.
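A sketch of the split and scaling step with scikit-learn, run on synthetic stand-in data. The `stratify` and `random_state` arguments are assumptions, as is fitting the scaler on the training rows only (the usual way to keep test information out of training); the write-up does not state these details.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # synthetic stand-in features
y = rng.integers(0, 2, size=1000)   # synthetic stand-in churn labels

# 80% training, 20% testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn per-column min/max from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them; test values may fall outside [0, 1]
print(X_train_scaled.shape, X_test_scaled.shape)
```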

Machine Learning Models

The project compares five models for predicting customer churn: three artificial neural networks, a tuned XGBoost classifier, and a logistic regression baseline.

1. Artificial Neural Network (ANN) - Model 1

Architecture:

Input layer: (X_train_scaled.shape[1],)

Hidden layers: 64-32-16 neurons with ReLU activation

Regularization: Dropout(30%)

Output layer: Sigmoid activation

Training:

Optimizer: Adam

Loss function: binary_crossentropy

Epochs: 8
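A minimal Keras sketch of the architecture and training settings listed above. The write-up does not say which framework was used or where the Dropout layers sit, so the use of Keras, the Dropout placement, the extra AUC metric, and the placeholder `n_features=13` are all assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model_1(n_features: int) -> keras.Model:
    """64-32-16 ReLU network with 30% dropout and a sigmoid output."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),   # assumed placement: after the first two hidden layers
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # outputs a churn probability
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )
    return model

model = build_model_1(n_features=13)  # 13 is a placeholder feature count
# Training would then look like:
# model.fit(X_train_scaled, y_train, epochs=8, validation_split=0.1)
```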

Performance:

Accuracy: 99.26%

AUC-ROC Score: 0.9997

Confusion Matrix: [[38039, 24], [625, 49479]]
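The reported accuracy can be checked directly from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, class 1 is churn); under that reading Model 1 produced 24 false positives and 625 false negatives. The helper name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Accuracy, precision, and recall from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

m1 = metrics_from_confusion([[38039, 24], [625, 49479]])
print({name: round(value, 4) for name, value in m1.items()})
```

The accuracy comes out at 0.9926, which matches the 99.26% reported above. The same helper applies to the confusion matrices of the other models.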

2. Artificial Neural Network (ANN) - Model 2 (Weighted Binary Cross-Entropy)

Improvements over Model 1:

Introduced Weighted Binary Cross-Entropy loss

Added Batch Normalization

Increased the neuron count and the Dropout rates (40% and 30%)

Performance:
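Weighted binary cross-entropy scales the loss contribution of each class separately, so errors on one class cost more than errors on the other. A NumPy sketch of the idea follows; the weights actually used in Model 2 are not given, so `w_pos` and `w_neg` here are hypothetical parameters.

```python
import numpy as np

def weighted_bce(y_true, y_prob, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Mean of -(w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p))."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    per_sample = -(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))
    return float(per_sample.mean())

y_true = [1, 0]
y_prob = [0.9, 0.2]
plain = weighted_bce(y_true, y_prob)                # equal weights: standard BCE
heavier = weighted_bce(y_true, y_prob, w_neg=2.0)   # errors on class 0 count double
print(round(plain, 4), round(heavier, 4))
```

With equal weights the function reduces to ordinary binary cross-entropy, which makes it easy to compare against a framework's built-in loss.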

Accuracy: 99.58%

AUC-ROC Score: 0.9997

Confusion Matrix: [[38063, 0], [369, 49735]]

Final Decision: This model was selected as the best-performing ANN model.

3. Artificial Neural Network (ANN) - Model 3

Additional Enhancements:

More neurons per hidden layer (256-128-64-32)

LeakyReLU activations instead of ReLU

L2 Regularization (0.01) added to dense layers
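For reference, the two changes in Model 3 can be written out in a few lines of NumPy. LeakyReLU keeps a small slope for negative inputs instead of zeroing them, and L2 regularization adds a penalty proportional to the sum of squared weights to the loss. The slope `alpha=0.1` is a hypothetical value, since the write-up does not state it; the factor 0.01 is the one listed above.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Identity for positive inputs, slope alpha for negative inputs."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

def l2_penalty(weights, factor=0.01):
    """Penalty added to the loss: factor * sum of squared weights."""
    return factor * float(np.sum(np.square(weights)))

activations = leaky_relu([-2.0, 0.0, 3.0])
penalty = l2_penalty(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(activations, penalty)
```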

Performance:

Accuracy: 98.82% (slightly lower than Model 2)

AUC-ROC Score: 0.9984

Confusion Matrix: [[38059, 4], [1038, 49066]]

Decision: Model 3 did not outperform Model 2, so Model 2 remains the best ANN model.

4. XGBoost Model (Best Performing Model)

Hyperparameter tuning: GridSearchCV searched over the following grid:

n_estimators: [100, 300, 500]

max_depth: [4, 6, 8]

learning_rate: [0.01, 0.05, 0.1]

min_child_weight: [1, 3, 5]

gamma: [0, 1, 5]

subsample: [0.8, 1.0]

colsample_bytree: [0.8, 1.0]
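The size of this search is worth spelling out, because a grid search fits every combination once per cross-validation fold. The sketch below counts the candidates in the grid above; the fold count of 5 is an assumption (it is the GridSearchCV default), since the write-up does not state it.

```python
from math import prod

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_child_weight": [1, 3, 5],
    "gamma": [0, 1, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in param_grid.values())
cv_folds = 5  # assumption: GridSearchCV's default number of folds
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 972 4860
```

The same `param_grid` dictionary is what would be passed to GridSearchCV together with an XGBoost classifier. At this scale, a randomized search over the same ranges is a common way to cut the training cost.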

Performance (Best model):

Accuracy: 99.92%

Precision, Recall, and F1-score: ~1.00

AUC-ROC Score: 1.0000

Confusion Matrix: [[38063, 0], [70, 50034]]

Final Model Selection: XGBoost was chosen as the final model due to its superior performance.

5. Logistic Regression Model (Baseline)

Performance:

Accuracy: 89.64%

AUC-ROC Score: 0.9596

Confusion Matrix: [[34563, 3500], [5629, 44475]]

This model underperformed compared to the ANN and XGBoost models.
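A baseline of this kind takes only a few lines with scikit-learn. The sketch below runs on synthetic data from `make_classification`, not the churn dataset, and `max_iter=1000` is an assumed setting; it shows how the three reported metrics (accuracy, AUC-ROC, confusion matrix) can be computed for any of the models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler().fit(X_train)
baseline = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

X_test_scaled = scaler.transform(X_test)
y_pred = baseline.predict(X_test_scaled)               # hard 0/1 labels
y_prob = baseline.predict_proba(X_test_scaled)[:, 1]   # probability of class 1

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)   # AUC needs scores, not hard labels
cm = confusion_matrix(y_test, y_pred)
print(f"accuracy={acc:.4f} auc={auc:.4f}")
print(cm)
```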

Results

The best-performing model was XGBoost with hyperparameter tuning, achieving:

Accuracy: 99.92%

AUC-ROC Score: 1.0000

Final Model: xgb_tuned.joblib
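A saved `.joblib` file is written with `joblib.dump` and read back with `joblib.load`. The round trip below uses a trivial scikit-learn stand-in model and a temporary directory, both assumptions made so the sketch runs without the real artifact; only the file name comes from this write-up.

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Stand-in for the tuned XGBoost model: always predicts the most frequent class.
stand_in = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "xgb_tuned.joblib")
    joblib.dump(stand_in, path)      # serialize the fitted model to disk
    restored = joblib.load(path)     # load it back as a ready-to-use estimator

pred = restored.predict([[5]])
print(pred)
```

Loading a joblib file executes pickled code, so it should only be done with files from a trusted source, and ideally with the same library versions that were used for saving.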

Project Organization

├── LICENSE
├── Makefile
├── README.md          <- Project documentation
├── data               <- Dataset and processed files
├── docs               <- Documentation
├── models             <- Trained models and predictions
├── notebooks          <- Jupyter notebooks
├── pyproject.toml     <- Project metadata and configuration
├── references         <- Data dictionaries, manuals, etc.
├── reports            <- Generated analysis and figures
├── requirements.txt   <- Dependencies
├── setup.cfg          <- Code formatting rules
└── customer-churn-analysis
    ├── __init__.py
    ├── config.py
    ├── dataset.py
    ├── features.py
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py
    │   └── train.py
    └── plots.py

Conclusion

This project analyzes and predicts customer churn with several machine learning models: a logistic regression baseline, three artificial neural networks (ANN), and XGBoost. The best-performing model was the tuned XGBoost model, with an accuracy of 99.92% and an AUC-ROC score of 1.0000, which makes it the preferred model for customer churn prediction.

By understanding factors such as tenure, usage frequency, support calls, payment delay, subscription type, and total spend, businesses can implement data-driven strategies to retain customers and enhance customer satisfaction.

Posted May 7, 2025

Analyzed and predicted customer churn using machine learning models.
