Credit-Risk Probability Model with Alternative Data by Natnael Yilma

Credit-Risk-Probability-Model-for-Alternative-Data

📘 Credit Scoring Project

A fully documented credit-scoring model aligned with Basel II regulatory standards. This project covers end-to-end development: data preparation, proxy target creation, modeling, validation, governance, and documentation.

Project Overview

The objective of this project is to build a credit scoring model that estimates a borrower's probability of default (PD). Because financial institutions operate under strict regulatory frameworks (Basel II/III), the model must be:
Transparent
Interpretable
Fair and non-discriminatory
Well-documented
Validated and monitored
This repository includes all components needed to demonstrate a regulatory-grade credit scoring pipeline.

Credit Scoring Business Understanding

1. Influence of Basel II on Model Interpretability and Documentation

The Basel II Accord establishes standards for credit risk measurement under Internal Ratings-Based (IRB) approaches. Its emphasis on risk governance directly shapes how credit scoring models must be built and documented.

🧩 Key Basel II Requirements Influencing Modeling

✔ Transparency and Explainability

Basel II requires that:
Each variable must have a defensible relationship to default risk.
Model behavior must be interpretable for regulators, auditors, and credit officers.
No “black box” decision-making can be used for regulatory capital calculations.

✔ Documentation and Auditability

Institutions must maintain:
Full data lineage documentation
Justification for feature engineering (e.g., WoE binning, monotonicity)
Detailed modeling assumptions
Validation reports (KS, ROC, Gini, PSI, calibration)
Stress testing results
Model monitoring framework
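
Several of the validation statistics listed above can be computed in a few lines. The following is a minimal sketch rather than the project's validation code: it assumes binary labels (1 = default) and model scores held in NumPy arrays, and the function names gini_coefficient, ks_statistic, and population_stability_index are hypothetical. The PSI binning used here (deciles of the reference sample) is one common convention among several.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def gini_coefficient(y_true, y_score):
    """Gini coefficient derived from ROC-AUC: Gini = 2 * AUC - 1."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


def ks_statistic(y_true, y_score):
    """Largest gap between the score CDFs of defaulters and non-defaulters."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    return float(ks_2samp(y_score[y_true == 1], y_score[y_true == 0]).statistic)


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference (expected) and a recent (actual) score sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the reference sample.
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    exp_counts = np.bincount(np.digitize(expected, inner_edges), minlength=n_bins)
    act_counts = np.bincount(np.digitize(actual, inner_edges), minlength=n_bins)
    # Clip the bin shares so empty bins do not produce log(0).
    exp_pct = np.clip(exp_counts / len(expected), eps, None)
    act_pct = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


# Synthetic labels and scores for illustration only (not project data).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)
y_score = rng.normal(loc=0.3 + 0.3 * y_true, scale=0.15)
recent_score = y_score + 0.10  # simulated score drift

print("Gini:", round(gini_coefficient(y_true, y_score), 3))
print("KS:  ", round(ks_statistic(y_true, y_score), 3))
print("PSI: ", round(population_stability_index(y_score, recent_score), 3))
```

Calibration checks and stress tests are not covered by this sketch.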

✔ Ethical and Fair-Lending Requirements

Models must:
Avoid hidden bias
Produce consistent decisions across customer groups
Be explainable and defensible
Conclusion: Basel II strongly favors logistic regression + WoE or similarly interpretable approaches.
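
To make the WoE idea concrete, here is a small sketch of Weight of Evidence and Information Value for one numeric feature. It is an illustration under stated assumptions, not the project's binning code: the helper woe_table, the column names utilization and default_flag, the quantile binning, and the additive smoothing constant eps are all hypothetical choices. It uses the convention WoE = ln(share of goods / share of bads), so riskier bins get lower WoE.

```python
import numpy as np
import pandas as pd


def woe_table(df, feature, target, n_bins=5, eps=0.5):
    """Weight of Evidence and Information Value per quantile bin of `feature`.

    `target` is a binary column where 1 marks a "bad" (default) record.
    """
    binned = pd.qcut(df[feature], q=n_bins, duplicates="drop")
    table = df.groupby(binned, observed=True)[target].agg(bad="sum", total="count")
    table["good"] = table["total"] - table["bad"]
    # Additive smoothing (eps) keeps WoE finite when a bin has no goods or no bads.
    n = len(table)
    dist_good = (table["good"] + eps) / (table["good"].sum() + eps * n)
    dist_bad = (table["bad"] + eps) / (table["bad"].sum() + eps * n)
    table["woe"] = np.log(dist_good / dist_bad)
    table["iv"] = (dist_good - dist_bad) * table["woe"]
    return table


# Toy data: the top 20% of `utilization` values are all defaults.
demo = pd.DataFrame({
    "utilization": np.arange(100),
    "default_flag": (np.arange(100) >= 80).astype(int),
})
table = woe_table(demo, "utilization", "default_flag")
print(table[["good", "bad", "woe"]])
print("Information Value:", round(table["iv"].sum(), 3))
```

A monotonic WoE pattern across bins is what makes the resulting logistic regression coefficients easy to defend to a reviewer.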

2. Need for a Proxy Variable When No Direct Default Label Exists

The dataset does not include a direct “default” column (e.g., default_flag), but supervised models require a target variable.

🔍 Why We Must Create a Proxy

To train a PD model, we need to define what “default” means. Possible proxy definitions include:
90+ days past due
3+ consecutive missed payments
Account written off
Assigned to collections
Without a proxy:
We cannot train or validate the model
No risk segmentation is possible
The PD model cannot be operationalized

⚠ Business Risks of Using Proxy Defaults

1️⃣ Misalignment With True Default Behavior

If the proxy does not reflect actual defaults:
Non-risky customers may be rejected
Risky customers may be approved (leading to financial loss)

2️⃣ Bias Introduction

Proxies may unintentionally reflect:
Operational issues
Customer behavior not linked to credit risk
Socioeconomic artifacts
This can introduce fairness and compliance risks.

3️⃣ Regulatory Defensibility Issues

Regulators can challenge:
Why the proxy definition was chosen
Whether it reflects industry standards
Its statistical robustness

4️⃣ Impact on Portfolio Strategies

A poor proxy can distort:
PD estimation
Risk-based pricing
Capital requirements (RWA)
Write-off policies

4. Proxy Target Variable Engineering Implementation

This project implements an RFM-based (Recency, Frequency, Monetary) proxy target variable to identify high-risk customers from transactional behavior patterns.

📊 Methodology

The proxy target variable (is_high_risk) is created using the following approach:
RFM Metrics Calculation
Recency (R): Number of days since the customer's most recent transaction
Frequency (F): Total number of transactions per customer
Monetary (M): Total transaction amount per customer
Customer Segmentation
Apply K-Means clustering (n_clusters=3) on scaled RFM features
Use StandardScaler for feature normalization
Set random_state=42 for reproducibility
High-Risk Identification
Analyze cluster centroids to identify the least engaged segment
High-risk cluster characterized by:
High Recency (long time since last transaction)
Low Frequency
Low Monetary value
Create binary target: is_high_risk = 1 for least engaged cluster, 0 otherwise
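
The steps above can be sketched end to end as follows. This is an illustrative sketch, not the code in src/data_processing.py: the function name build_rfm_proxy, the input column names TransactionStartTime and Amount, and the rule used to rank centroids (Frequency plus Monetary minus Recency in scaled space) are assumptions made for the example. Only CustomerId, the RFM column names, Cluster, and is_high_risk come from the description in this README.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

RFM_COLUMNS = ["Recency", "Frequency", "Monetary"]


def build_rfm_proxy(transactions, snapshot_date=None, n_clusters=3, random_state=42):
    """Return customer-level RFM metrics with Cluster and is_high_risk columns."""
    tx = transactions.copy()
    tx["TransactionStartTime"] = pd.to_datetime(tx["TransactionStartTime"])
    if snapshot_date is None:
        # Assumption: measure recency from the day after the last transaction.
        snapshot_date = tx["TransactionStartTime"].max() + pd.Timedelta(days=1)

    rfm = tx.groupby("CustomerId").agg(
        Recency=("TransactionStartTime", lambda s: (snapshot_date - s.max()).days),
        Frequency=("TransactionStartTime", "count"),
        Monetary=("Amount", "sum"),
    )

    scaled = StandardScaler().fit_transform(rfm[RFM_COLUMNS])
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    rfm["Cluster"] = kmeans.fit_predict(scaled)

    # Least engaged cluster: high recency, low frequency, low monetary value.
    centroids = pd.DataFrame(kmeans.cluster_centers_, columns=RFM_COLUMNS)
    engagement = centroids["Frequency"] + centroids["Monetary"] - centroids["Recency"]
    high_risk_cluster = int(engagement.idxmin())

    rfm["is_high_risk"] = (rfm["Cluster"] == high_risk_cluster).astype(int)
    return rfm.reset_index(), high_risk_cluster


# Toy transactions: two active, two moderately active, two dormant customers.
profiles = {
    "C1": (10, "2024-03-30", 500.0),
    "C2": (9, "2024-03-29", 450.0),
    "C3": (5, "2024-02-15", 200.0),
    "C4": (4, "2024-02-10", 180.0),
    "C5": (1, "2023-11-01", 20.0),
    "C6": (1, "2023-10-20", 15.0),
}
rows = [
    {
        "CustomerId": cid,
        "TransactionStartTime": pd.Timestamp(last_day) - pd.Timedelta(days=k),
        "Amount": amount,
    }
    for cid, (n_tx, last_day, amount) in profiles.items()
    for k in range(n_tx)
]
rfm, high_risk_cluster = build_rfm_proxy(pd.DataFrame(rows))
print(rfm)
```

Because cluster labels from K-Means are arbitrary, the high-risk cluster is chosen from the centroids on every run instead of being hard-coded.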

🚀 Usage

Command Line Interface


Python API


📁 Output Files

The script generates three output files:
data/processed/data_with_target.csv: Original transactional data with the added is_high_risk column
data/processed/rfm_summary.csv: Customer-level RFM metrics and cluster assignments (columns: CustomerId, Recency, Frequency, Monetary, Cluster, is_high_risk)
data/processed/target_metadata.json: Metadata including the high-risk cluster ID, target variable distribution, RFM statistics, and cluster summary

📚 Examples

See examples/proxy_target_engineering_example.py for comprehensive examples demonstrating:
Basic RFM calculation
Customer segmentation
Complete pipeline usage
Handling different column name formats

⚙️ Implementation Details

The implementation is located in:
src/data_processing.py: Core functions for RFM calculation and clustering
create_proxy_target.py: Command-line interface script
examples/proxy_target_engineering_example.py: Usage examples
Key functions:
calculate_rfm_metrics(): Compute RFM metrics per customer
segment_customers_with_kmeans(): Apply K-Means clustering
identify_high_risk_cluster(): Identify least engaged cluster
create_proxy_target_variable(): Complete end-to-end pipeline

3. Trade-offs Between Interpretable and Complex Models in a Regulated Environment

🔵 Interpretable Models (Logistic Regression + WoE)

Advantages
Highly explainable (regulator-friendly)
Clear monotonic relationships
Easy to calibrate and validate
Stable performance over time
Low governance burden
Limitations
May underperform on nonlinear data
Requires manual engineering

🔴 Complex Models (Gradient Boosting, XGBoost, Random Forests)

Advantages
Higher predictive power
Automatically capture interactions and nonlinearities
Useful for internal analytics and risk segmentation
Limitations
Low interpretability
Requires SHAP/LIME explanation layers
Harder to monitor
More difficult for regulators to approve
Higher risk of overfitting

⚖ The Real-World Compromise

Banks typically use:
Interpretable models for production decisions, and
Complex models internally for portfolio insights
This split supports compliance without giving up analytical power.

Diagrams

1. PD Modeling Lifecycle

🤖 ML Training Pipeline with MLflow Tracking

🎯 Overview

This project includes a complete machine learning training pipeline with:
Multiple Algorithms: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting
Hyperparameter Tuning: Grid Search optimization for each model
MLflow Integration: Complete experiment tracking with metrics, parameters, and model artifacts
Model Registry: Automatic registration of best-performing models
Unit Testing: Comprehensive pytest test suite
Reproducibility: Fixed random states and structured data processing

🚀 Quick Start

1. Install Dependencies


2. Run the Complete Training Pipeline


This will:
Create a sample dataset (1000 samples)
Train 4 different models with hyperparameter tuning
Track all experiments in MLflow
Compare model performance
Register the best model

3. Start MLflow UI


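Start the MLflow UI (for example, with the mlflow ui command, which listens on port 5000 by default), then open http://localhost:5000 to explore results.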

📊 Model Training Details

Supported Models

Logistic Regression
Hyperparameters: C (regularization), penalty (L1/L2)
Best for: Interpretability and regulatory compliance
Decision Tree
Hyperparameters: max_depth, min_samples_split, min_samples_leaf
Best for: Feature importance and rule extraction
Random Forest
Hyperparameters: n_estimators, max_depth, min_samples_split
Best for: Robust performance and feature importance
Gradient Boosting
Hyperparameters: n_estimators, learning_rate, max_depth
Best for: High predictive performance
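
A grid search over the four model families and the hyperparameters listed above might look like the sketch below. It is an assumption-laden illustration, not the project's MLTrainingPipeline: the names MODEL_CONFIGS and tune_models, the specific grid values, and the synthetic dataset are all hypothetical. Model selection uses cross-validated ROC-AUC, matching the primary metric named in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Hypothetical search spaces; the real grids live in the project's pipeline code.
MODEL_CONFIGS = {
    "logistic_regression": (
        # liblinear supports both L1 and L2 penalties.
        LogisticRegression(solver="liblinear", max_iter=1000, random_state=RANDOM_STATE),
        {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]},
    ),
    "decision_tree": (
        DecisionTreeClassifier(random_state=RANDOM_STATE),
        {"max_depth": [3, 5], "min_samples_split": [2, 10], "min_samples_leaf": [1, 5]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=RANDOM_STATE),
        {"n_estimators": [50, 100], "max_depth": [3, None], "min_samples_split": [2, 10]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=RANDOM_STATE),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    ),
}


def tune_models(X_train, y_train, configs=MODEL_CONFIGS, cv=3):
    """Run a grid search per model and return the fitted GridSearchCV objects."""
    searches = {}
    for name, (estimator, grid) in configs.items():
        search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
        search.fit(X_train, y_train)
        searches[name] = search
    return searches


# Synthetic, imbalanced data for illustration only.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=RANDOM_STATE,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
)

searches = tune_models(X_train, y_train)
best_name = max(searches, key=lambda name: searches[name].best_score_)
for name, search in searches.items():
    print(f"{name}: CV ROC-AUC={search.best_score_:.3f}, params={search.best_params_}")
print("Best model:", best_name)
```

Depending on the installed scikit-learn version, the penalty argument of LogisticRegression may emit a deprecation warning; the grid is kept here because it mirrors the hyperparameters named above.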

Evaluation Metrics

All models are evaluated using:
Accuracy: Overall classification accuracy
Precision: True positives / (True positives + False positives)
Recall: True positives / (True positives + False negatives)
F1 Score: Harmonic mean of precision and recall
ROC-AUC: Area under the ROC curve (primary metric for model selection)
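
As a worked example of these definitions, the sketch below evaluates ten hand-made predictions with scikit-learn. The helper name evaluate_classifier and the 0.5 decision threshold are assumptions for illustration; the labels and scores are toy values, not project results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_classifier(y_true, y_score, threshold=0.5):
    """Compute the five metrics from true labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Threshold-based metrics use hard predictions; ROC-AUC uses the raw scores.
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Toy data: at threshold 0.5 this gives TP=3, FP=1, FN=1, TN=5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

metrics = evaluate_classifier(y_true, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here accuracy is 8/10, precision and recall are both 3/4, and ROC-AUC is 23/24 because one of the 24 positive-negative pairs is ranked in the wrong order.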

MLflow Tracking

Each training run logs:
Parameters: All hyperparameters used
Metrics: All evaluation metrics
Artifacts: Trained model objects
Feature Importance: When available (tree-based models)

🧪 Testing

Run All Tests


Test Coverage

The test suite includes:
Data Processing Tests (tests/test_data_processing.py)
RFM metrics calculation
Proxy target variable creation
Edge case handling
Reproducibility verification
ML Training Tests (tests/test_ml_training.py)
Pipeline initialization
Data preparation and scaling
Model configuration
Evaluation metrics
Sample dataset generation

Example Test Output


📈 Usage Examples

Basic Training Pipeline


Custom Training Configuration


🔧 Advanced Configuration

Custom Hyperparameter Grids

Modify the get_model_configs() method in MLTrainingPipeline to customize the hyperparameter search spaces.

MLflow Configuration

Set the MLflow tracking URI and experiment location, for example via mlflow.set_tracking_uri() and mlflow.set_experiment().

📁 Project Structure


🎯 Key Features

Reproducibility

Fixed random states throughout the pipeline
Deterministic data splitting with stratification
Consistent feature scaling
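
These three points can be illustrated with a short sketch. It is not the project's data preparation code: the function name split_and_scale and the synthetic dataset are assumptions. The scaler is fitted on the training split only and then applied to both splits, which is one common way to keep scaling consistent without leaking test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42


def split_and_scale(X, y, test_size=0.2, random_state=RANDOM_STATE):
    """Stratified, seeded split followed by scaling fitted on the training data."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    scaler = StandardScaler().fit(X_train)  # statistics come from training data only
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic, imbalanced data for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=RANDOM_STATE
)

first = split_and_scale(X, y)
second = split_and_scale(X, y)

# Same seed, same inputs: the two runs produce identical splits.
print("Identical splits:", all(np.array_equal(a, b) for a, b in zip(first, second)))
# Stratification keeps the positive rate nearly equal across train and test.
print("Train positive rate:", round(first[2].mean(), 3))
print("Test positive rate: ", round(first[3].mean(), 3))
```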

Experiment Tracking

Complete MLflow integration
Automatic parameter and metric logging
Model artifact storage
Model registry integration

Model Comparison

Standardized evaluation metrics
Automated best model selection
Performance comparison tables
Feature importance tracking

Testing

Comprehensive unit test coverage
Edge case handling
Reproducibility verification
Mock testing for MLflow integration

🚀 Next Steps

Explore Results: Start MLflow UI and compare model performance
Customize Models: Modify hyperparameter grids for your use case
Add Features: Extend the feature engineering pipeline
Deploy Models: Use MLflow Model Registry for model deployment
Monitor Performance: Set up model monitoring and drift detection

📚 Additional Resources

Posted May 11, 2026

Developed a credit-scoring model using Basel II standards.