Machine Learning Fraud Detection System by Natnael Yilma

Fraud Detection for E-commerce and Bank Transactions

Advanced fraud detection system for e-commerce transactions and credit card payments using machine learning

🎯 Business Objective

Company: Adey Innovations Inc.
This project aims to build a robust, production-ready fraud detection system capable of identifying fraudulent transactions across two domains:
E-commerce Transactions: Detect suspicious purchases based on user behavior, geolocation, and transaction patterns
Credit Card Transactions: Identify fraudulent bank transactions using advanced feature engineering and anomaly detection

Goals

✅ Achieve high precision and recall in fraud detection
✅ Minimize false positives to avoid customer friction
✅ Handle severe class imbalance effectively
✅ Provide interpretable results for business stakeholders
✅ Build scalable, production-grade code

📊 Datasets

1. Fraud_Data.csv (E-commerce Transactions)

Size: ~151,000 transactions
Features:
User information (ID, age, sex)
Transaction details (purchase value, time, signup time)
Device & browser information
Traffic source (SEO, Ads, Direct)
IP address for geolocation
Target: class (0 = Legitimate, 1 = Fraud)
Class Distribution: ~9% fraudulent transactions

2. creditcard.csv (Bank Credit Card Transactions)

Size: ~284,000 transactions
Features:
28 anonymized features (V1-V28) from PCA transformation
Transaction amount
Time elapsed since the first transaction in the dataset, in seconds
Target: Class (0 = Legitimate, 1 = Fraud)
Class Distribution: Highly imbalanced (~0.17% fraud)

3. IpAddress_to_Country.csv (Geolocation Mapping)

Purpose: Map IP addresses to countries for geographical fraud analysis
Features: IP ranges and country codes
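
A range lookup like this can be sketched with pandas `merge_asof`. The column names (`ip_address`, `lower_bound_ip_address`, `upper_bound_ip_address`, `country`) and the toy values below are assumptions for illustration, not taken from the project's code; adjust them to the actual CSV headers.

```python
import pandas as pd

def map_ip_to_country(transactions: pd.DataFrame, ip_ranges: pd.DataFrame) -> pd.DataFrame:
    """Attach a country to each transaction via an IP range lookup (column names assumed)."""
    # merge_asof needs both frames sorted on their keys, and the keys must share a dtype.
    tx = transactions.assign(
        ip_address=transactions["ip_address"].astype("float64")
    ).sort_values("ip_address")
    ranges = ip_ranges.astype(
        {"lower_bound_ip_address": "float64", "upper_bound_ip_address": "float64"}
    ).sort_values("lower_bound_ip_address")

    # For each IP, take the range with the largest lower bound that is <= the IP.
    merged = pd.merge_asof(
        tx,
        ranges,
        left_on="ip_address",
        right_on="lower_bound_ip_address",
        direction="backward",
    )

    # The backward match only guarantees lower_bound <= ip, so check the upper bound too.
    no_match = merged["upper_bound_ip_address"].isna()
    outside = merged["ip_address"] > merged["upper_bound_ip_address"]
    merged.loc[no_match | outside, "country"] = "Unknown"
    return merged.drop(columns=["lower_bound_ip_address", "upper_bound_ip_address"])

# Toy data (hypothetical values, not real IP ranges).
transactions = pd.DataFrame({"user_id": [1, 2, 3], "ip_address": [150.0, 50.0, 250.0]})
ip_ranges = pd.DataFrame({
    "lower_bound_ip_address": [100.0, 200.0],
    "upper_bound_ip_address": [199.0, 220.0],
    "country": ["A", "B"],
})
located = map_ip_to_country(transactions, ip_ranges)
print(located)
```

Sorting once and using an as-of join avoids comparing every transaction against every range, which matters at the dataset sizes listed above.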

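📁 Project Structure

Key paths used throughout this README:
data/raw/ - Input CSV files (creditcard.csv, Fraud_Data.csv)
data/processed/ - Cleaned and preprocessed outputs
notebooks/eda-fraud-data.ipynb - EDA, feature engineering, and preprocessing notebook
scripts/train_fraud_models.py - Model training and evaluation
scripts/shap_model_explainability.py - SHAP analysis
src/ - Reusable modules called by the notebooks
models/ - Saved trained models
temp_charts/ - Generated visualizations
SHAP_ANALYSIS_GUIDE.md - Detailed SHAP documentation
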
🚀 Setup Instructions

Prerequisites

Python 3.11 or higher
pip (Python package manager)
Git (for version control)

1. Clone the Repository


2. Create Virtual Environment (Recommended)


3. Install Dependencies


Required packages:
pandas - Data manipulation
numpy - Numerical computing
matplotlib - Visualization
seaborn - Statistical visualization
scikit-learn - Machine learning algorithms
imbalanced-learn - SMOTE and class balancing
xgboost - Gradient boosting ensemble models
shap - Model explainability and interpretability
joblib - Model serialization
jupyter - Interactive notebooks

4. Verify Installation
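
One way to check the environment is a short script that looks up each required package with the standard library. This is a minimal sketch, not the project's own verification command; the package list is copied from the "Required packages" list above.

```python
from importlib import metadata

# Distribution names as listed under "Required packages".
REQUIRED = [
    "pandas", "numpy", "matplotlib", "seaborn", "scikit-learn",
    "imbalanced-learn", "xgboost", "shap", "joblib", "jupyter",
]

def installed_versions(packages):
    """Return {distribution name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:18s} {version or 'MISSING'}")
```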


🏃 How to Run

Option 1: Run EDA Notebook (Recommended for Exploration)

Start Jupyter Notebook

Navigate to the EDA notebook, notebooks/eda-fraud-data.ipynb

Run All Cells
Click Kernel → Restart & Run All
Or run cells sequentially with Shift + Enter

Option 2: Train Fraud Detection Models

Train models on both datasets by running python scripts/train_fraud_models.py.

This script will:
Load both creditcard.csv and Fraud_Data.csv
Perform stratified train-test split
Train Logistic Regression (baseline) and XGBoost (ensemble) models
Perform 5-fold stratified cross-validation
Compare models and select the best one
Generate visualizations and save trained models
Output locations:
Trained models: models/
Visualizations: temp_charts/
Console: Detailed metrics and recommendations
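
The steps listed above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the contents of scripts/train_fraud_models.py: it uses synthetic data from `make_classification` instead of the CSV files, covers only the Logistic Regression baseline, and uses average precision as the PR-AUC estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an imbalanced fraud dataset (about 3% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=42
)

# Stratified 80-20 split keeps the fraud rate the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits the scaler on its own training part.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# 5-fold stratified cross-validation, scored with average precision.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="average_precision")
print(f"CV PR-AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Final fit on the full training set, evaluated once on the untouched test set.
baseline.fit(X_train, y_train)
test_ap = average_precision_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Test PR-AUC: {test_ap:.3f}")
```

An XGBoost classifier could be dropped into the same cross-validation loop in place of the pipeline, which keeps the two models comparable on identical folds.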

Option 3: SHAP Model Explainability Analysis

Interpret model predictions with SHAP by running python scripts/shap_model_explainability.py.

This script provides comprehensive model interpretability:
Extracts and visualizes built-in feature importance
Performs global SHAP analysis (overall feature impact)
Generates local explanations for individual predictions (TP, FP, FN cases)
Compares SHAP importance with built-in importance
Generates actionable business recommendations
Output locations:
Feature importance plots: temp_charts/builtin_feature_importance_*.png
SHAP summary plots: temp_charts/shap_summary_plot_*.png
Interactive force plots: temp_charts/shap_force_plot_*.html
Comparison visualizations: temp_charts/importance_comparison_*.png
Console: Business recommendations and insights
See SHAP_ANALYSIS_GUIDE.md for detailed documentation.

Option 4: Use Individual Components

Load and explore data:

Preprocess data:

Train a custom model:

💻 Usage Examples

Example 1: Complete Model Training Pipeline

Train models on both datasets with full evaluation by running python scripts/train_fraud_models.py.

What happens:
Loads data/raw/creditcard.csv and data/raw/Fraud_Data.csv
Performs stratified train-test split (80-20)
Trains Logistic Regression (baseline) and XGBoost (ensemble) models
Runs 5-fold stratified cross-validation
Generates confusion matrices, ROC curves, and PR curves
Compares models and saves the best one to models/

Example 2: Programmatic Model Training

Use the training class in your own scripts:

Example 3: Data Preprocessing Workflow

Use individual preprocessing components:

Example 4: Load and Use a Trained Model
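
A minimal sketch of the save and load round trip with joblib, which the project lists for model serialization. The file name `fraud_model.joblib`, the toy model, and the 0.5 threshold are assumptions for illustration; the actual file names under models/ are not shown in this README.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model on synthetic data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fraud_model.joblib")  # hypothetical file name
    joblib.dump(model, path)      # persist the fitted estimator
    loaded = joblib.load(path)    # restore it, e.g. in a serving process

# Score new transactions: column 1 of predict_proba is the fraud-class probability.
fraud_probability = loaded.predict_proba(X[:5])[:, 1]
flagged = fraud_probability >= 0.5  # assumed decision threshold
print(np.round(fraud_probability, 3), flagged)
```

Any scaler or encoder fitted during training has to be saved and reloaded alongside the model, since new transactions must pass through the same transformations.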


Example 5: SHAP Model Explainability

Understand why your model makes specific predictions:

Example 6: Quick Data Exploration


🔄 Workflow

Phase 1: Data Exploration & Cleaning ✅

Location: notebooks/eda-fraud-data.ipynb
Load datasets using reusable functions
Inspect data quality: check missing values, identify duplicates, validate data types
Clean data: remove duplicates, handle missing values, filter invalid entries (age, amounts)
Class distribution analysis: calculate imbalance ratios, visualize fraud vs legitimate transactions
Output:
data/processed/fraud_data_cleaned.csv
data/processed/creditcard_cleaned.csv

Phase 2: Feature Engineering ✅

Location: notebooks/eda-fraud-data.ipynb (Sections 9-10)
Time-based features (Fraud_Data): hour of day, day of week, weekend/night flags, time since signup
Behavioral features: quick purchase detection, rapid transaction velocity
Statistical features (CreditCard): mean, std, min, max of V features; log-transformed amounts
Geolocation integration: IP to country mapping, fraud rate by country
Output: Engineered features added to cleaned datasets
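
The time-based and quick-purchase features can be sketched in pandas as follows. The column names `signup_time` and `purchase_time`, the night window (22:00 to 06:00), and the output column names are assumptions for illustration; the 5-minute quick-purchase cutoff follows the pattern noted under Key Findings.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive time-based and quick-purchase features (column names assumed)."""
    out = df.copy()
    out["signup_time"] = pd.to_datetime(out["signup_time"])
    out["purchase_time"] = pd.to_datetime(out["purchase_time"])

    out["hour_of_day"] = out["purchase_time"].dt.hour
    out["day_of_week"] = out["purchase_time"].dt.dayofweek  # Monday = 0
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Assumed definition of "night": 22:00 up to but not including 06:00.
    out["is_night"] = ((out["hour_of_day"] >= 22) | (out["hour_of_day"] < 6)).astype(int)

    delta = out["purchase_time"] - out["signup_time"]
    out["time_since_signup_s"] = delta.dt.total_seconds()
    out["quick_purchase"] = (out["time_since_signup_s"] < 300).astype(int)  # under 5 minutes
    return out

# Two toy rows: a purchase 90 seconds after signup, and one about a month later.
raw = pd.DataFrame({
    "signup_time": ["2015-01-03 23:58:00", "2015-02-02 09:00:00"],
    "purchase_time": ["2015-01-03 23:59:30", "2015-03-04 14:30:00"],
})
features = add_time_features(raw)
print(features[["hour_of_day", "day_of_week", "is_weekend", "is_night", "quick_purchase"]])
```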

Phase 3: Data Preprocessing ✅

Location: notebooks/eda-fraud-data.ipynb (Sections 10-11)
Train-test split (80-20 with stratification)
Feature scaling (StandardScaler on numeric features)
Categorical encoding (One-hot encoding)
Class imbalance handling: SMOTE (Synthetic Minority Over-sampling Technique), with random under-sampling as an alternative
Output:
data/processed/fraud_X_train_smote.csv
data/processed/fraud_y_train_smote.csv
data/processed/fraud_X_test.csv
data/processed/fraud_y_test.csv
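
To show what SMOTE does to the training set, here is a simplified NumPy sketch of its core idea: each synthetic row is an interpolation between a minority sample and one of its nearest minority neighbours. It is a stand-in for illustration only, not imbalanced-learn's `SMOTE` implementation, and the function name and toy data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise squared distances among minority samples only.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)            # sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per start
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy training set: 200 legitimate rows, 20 fraud rows.
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(200, 2))
X_minority = rng.normal(3.0, 0.5, size=(20, 2))

synthetic = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced_minority = np.vstack([X_minority, synthetic])
print(X_balanced_minority.shape)
```

Because every synthetic row is built from training rows, resampling must happen after the train-test split and only on the training portion; the test files keep the real class distribution.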

Phase 4: Model Training ✅

Location: scripts/train_fraud_models.py
Baseline Model: Logistic Regression with class_weight='balanced'
Ensemble Model: XGBoost with hyperparameter tuning
Stratified K-Fold Cross-Validation (k=5)
Model Comparison: Side-by-side performance evaluation
Model Selection: Best model based on PR-AUC and interpretability
Run: python scripts/train_fraud_models.py

Output:
Trained models saved in models/
Evaluation metrics and visualizations in temp_charts/
Comprehensive performance reports in console

Phase 5: Model Evaluation ✅

Integrated in: scripts/train_fraud_models.py
Confusion Matrix
Precision, Recall, F1-Score
ROC curve and ROC-AUC
PR-AUC (Area Under the Precision-Recall Curve): primary metric for imbalanced data
Cross-validation results with mean ± std
Model comparison and recommendations
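
A small worked example shows why accuracy is not used as the headline metric here. The function and the toy numbers (1,000 transactions, 10 of them fraud) are illustrative assumptions, not results from this project.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the metrics derived from them."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# 10 fraud cases followed by 990 legitimate ones.
y_true = [1] * 10 + [0] * 990

# Model A never flags anything.
always_legit = confusion_metrics(y_true, [0] * 1000)

# Model B catches 8 of the 10 frauds and raises 12 false alarms.
model_b = confusion_metrics(y_true, [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978)

print(always_legit["accuracy"], always_legit["recall"])
print(model_b["accuracy"], model_b["precision"], model_b["recall"])
```

Model A scores higher on accuracy while catching no fraud at all, which is why precision, recall, and PR-AUC carry the comparison on imbalanced data.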

Phase 6: Model Explainability & Interpretability ✅

Location: scripts/shap_model_explainability.py
Feature Importance Baseline: extract the model's built-in feature importance, visualize the top 10 features, and note the limitations of built-in importance
Global SHAP Analysis: SHAP summary plots showing overall feature impact, features that increase vs decrease fraud probability, and mean absolute SHAP values for feature ranking
Local SHAP Analysis: force plots explaining individual feature contributions for three critical prediction cases:
✅ True Positive (correctly detected fraud)
⚠️ False Positive (legitimate transaction flagged)
❌ False Negative (fraud missed)
Interpretation & Insights: compare SHAP importance with built-in importance, identify the top 5 fraud drivers, and explain unexpected or counterintuitive patterns
Business Recommendations: generate 3-5 actionable recommendations, each linked to a specific SHAP insight and ready for the fraud prevention team to implement
Run: python scripts/shap_model_explainability.py

Output:
Feature importance visualizations: temp_charts/builtin_feature_importance_*.png
SHAP summary plots: temp_charts/shap_summary_plot_*.png
Interactive force plots: temp_charts/shap_force_plot_*.html
Comparison visualizations: temp_charts/importance_comparison_*.png
Business recommendations printed to console
See SHAP_ANALYSIS_GUIDE.md for comprehensive documentation.
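
Local explanations need one concrete test row per case type. A minimal sketch of how such rows could be picked from test-set predictions is shown below; the helper name and the 0.5 threshold are assumptions, and the script may select its cases differently.

```python
import numpy as np

def pick_case_indices(y_true, y_prob, threshold=0.5):
    """Return the first test-set index of a TP, FP, and FN case (None if absent)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    masks = {
        "TP": (y_true == 1) & (y_pred == 1),  # fraud, flagged
        "FP": (y_true == 0) & (y_pred == 1),  # legitimate, flagged
        "FN": (y_true == 1) & (y_pred == 0),  # fraud, missed
    }
    return {
        name: int(np.flatnonzero(mask)[0]) if mask.any() else None
        for name, mask in masks.items()
    }

# Toy labels and predicted fraud probabilities.
y_true = [0, 1, 0, 1, 0]
y_prob = [0.1, 0.9, 0.7, 0.2, 0.3]
cases = pick_case_indices(y_true, y_prob)
print(cases)
```

Each returned index can then be passed to the explainer to produce the force plot for that row.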

📈 Results Organization

Processed Data Files

All preprocessed data is saved in data/processed/ with clear naming conventions:
File - Description
fraud_data_cleaned.csv - Cleaned e-commerce transaction data
creditcard_cleaned.csv - Cleaned credit card transaction data
fraud_X_train_smote.csv - SMOTE-balanced training features
fraud_y_train_smote.csv - SMOTE-balanced training labels
fraud_X_test.csv - Test features (imbalanced, real distribution)
fraud_y_test.csv - Test labels

Key Findings (from EDA)

Fraud_Data (E-commerce)

Imbalance Ratio: ~10:1 (legitimate vs fraud)
High-risk patterns:
Very quick purchases after signup (<5 min)
Late night transactions
Multiple rapid transactions from same user
Certain countries show higher fraud rates

CreditCard (Bank)

Imbalance Ratio: ~580:1 (extreme imbalance!)
High-risk patterns:
Unusual transaction amounts
Atypical V-feature values (PCA components)
Transaction timing patterns
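
The imbalance ratios above follow directly from the fraud rates given in the Datasets section. The sketch below shows the arithmetic; the function name is hypothetical.

```python
def imbalance_ratio(fraud_rate: float) -> float:
    """Legitimate-to-fraud ratio implied by a fraud rate strictly between 0 and 1."""
    if not 0.0 < fraud_rate < 1.0:
        raise ValueError("fraud_rate must be strictly between 0 and 1")
    return (1.0 - fraud_rate) / fraud_rate

ecommerce_ratio = imbalance_ratio(0.09)     # ~9% fraud
creditcard_ratio = imbalance_ratio(0.0017)  # ~0.17% fraud

print(f"E-commerce:  {ecommerce_ratio:.1f}:1")
print(f"Credit card: {creditcard_ratio:.0f}:1")
```

The credit card figure lands slightly above the ~580:1 quoted here because 0.17% is itself a rounded rate. The same negative-to-positive ratio is the value XGBoost's documentation suggests as a starting point for `scale_pos_weight`; whether this project sets that parameter is not stated in this README.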

✨ Key Features

Production-Grade Code

Reusable functions with docstrings and error handling
Modular design for easy maintenance and testing
Comprehensive logging at every step
Data validation to catch errors early

Data Leakage Prevention

Proper train-test split before any transformation
Scaling fitted on training data only
SMOTE applied to training set only
Test set maintains real-world distribution

Class Imbalance Solutions

SMOTE for synthetic oversampling
Random Under-sampling as alternative
Visualizations showing before/after balancing
Multiple strategies for comparison

Comprehensive EDA

Parallel analysis of both datasets
Univariate and bivariate visualizations
Geolocation analysis (fraud by country)
Clear documentation of findings

Model Interpretability

SHAP explainability for understanding model decisions
Global and local explanations for comprehensive insights
Feature importance analysis comparing multiple methods
Actionable business recommendations based on SHAP insights
Interactive visualizations for stakeholder presentations

🛠 Technologies Used

Core Libraries

pandas 1.5+ - Data manipulation and analysis
numpy 1.23+ - Numerical computing
scikit-learn 1.2+ - Machine learning algorithms and preprocessing
imbalanced-learn 0.10+ - Handling class imbalance
xgboost 2.0+ - Gradient boosting for ensemble models
shap 0.42+ - Model explainability and SHAP value computation
joblib 1.3+ - Model serialization and persistence

Visualization

matplotlib 3.7+ - Static plots and charts
seaborn 0.12+ - Statistical data visualization
SHAP plots - Interactive force plots and summary visualizations

Development Tools

Jupyter Notebook - Interactive exploration
Git - Version control
pytest - Unit testing framework

🔍 Model Interpretability with SHAP

Understanding why a model makes specific predictions is crucial for fraud detection systems. This project includes comprehensive SHAP (SHapley Additive exPlanations) analysis to provide both global and local explanations.

What is SHAP?

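SHAP is a unified framework for explaining model predictions: for a given prediction, it assigns each feature an importance value. It is based on Shapley values from cooperative game theory, which gives the attributions a theoretical justification: the per-feature values sum to the difference between the prediction and the model's average output.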
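
To make the game-theory idea concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature scoring function. It is an illustration only: the scoring function and feature meanings are made up, "absent" features are replaced by a single baseline row rather than averaged over background data, and the enumeration is exponential in the number of features, which is why the shap library relies on faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, with absent features set to baseline."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their real value, the rest take the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_score(f):
    # Hypothetical risk score: one additive term plus an interaction between f[1] and f[2].
    return 0.5 * f[0] + 2.0 * f[1] * f[2]

x = [1, 1, 1]         # transaction being explained
baseline = [0, 0, 0]  # reference transaction
phi = shapley_values(toy_score, x, baseline)
print(phi, sum(phi), toy_score(x) - toy_score(baseline))
```

The additive feature receives its full effect, the interaction effect is split evenly between the two features involved, and the attributions sum to the gap between the explained score and the baseline score.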

Key Capabilities

1. Global Explainability

SHAP Summary Plots: Visualize how each feature impacts fraud predictions overall
Feature Impact Analysis: Identify which features increase vs decrease fraud probability
Top Fraud Drivers: Rank features by their importance in fraud detection

2. Local Explainability (Individual Predictions)

Force Plots: Interactive visualizations showing how specific feature values push predictions toward fraud or legitimate
Case Analysis: Detailed explanations for:
✅ True Positives: Why fraud was correctly detected
⚠️ False Positives: Why legitimate transactions were flagged (helps reduce false alarms)
❌ False Negatives: Why fraud was missed (critical for improving detection)

3. Business Insights

Actionable Recommendations: 3-5 specific recommendations based on SHAP insights
Feature Comparison: Compare SHAP importance with built-in model importance
Pattern Discovery: Identify unexpected relationships and counterintuitive patterns

Quick Start


Output Files

All SHAP visualizations are saved in temp_charts/:
File Type - Description
builtin_feature_importance_*.png - Model's built-in feature importance
shap_summary_plot_*.png - Global SHAP summary showing feature impact
shap_importance_*.png - Mean absolute SHAP values ranking
importance_comparison_*.png - Comparison of built-in vs SHAP importance
shap_force_plot_TP_*.html - Interactive force plot for True Positive case
shap_force_plot_FP_*.html - Interactive force plot for False Positive case
shap_force_plot_FN_*.html - Interactive force plot for False Negative case
Note: Open HTML files in a web browser for interactive exploration.

Example Business Recommendations

Based on SHAP analysis, you might receive recommendations like:
Transaction Velocity Monitoring
Recommendation: Transactions occurring within X hours of account creation should trigger additional verification.
Justification: SHAP analysis shows that account_age < X strongly increases fraud probability.
Geographic Risk Scoring
Recommendation: Implement country-based risk thresholds for high-risk regions.
Justification: SHAP values reveal strong geographic patterns in fraud probability.
Amount-Based Verification
Recommendation: Implement tiered verification based on transaction amount thresholds.
Justification: SHAP analysis identifies transaction amount as a key fraud driver with non-linear impact.

Understanding SHAP Values

Positive SHAP Value: The feature value pushes this prediction toward fraud, relative to the model's average output
Negative SHAP Value: The feature value pushes this prediction toward legitimate
Absolute SHAP Value: Magnitude of the feature's impact on this prediction
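
SHAP values are additive: the explainer's base value plus the SHAP values for one row reproduces the model output for that row. For tree models such as XGBoost binary classifiers, that output is typically in log-odds, so a sigmoid converts it to a probability. The sketch below assumes log-odds output; the base value, feature names, and contributions are hypothetical numbers, and the explainer settings used by the project are not shown in this README.

```python
import math

def probability_from_shap(base_value, shap_values):
    """Convert base value + SHAP contributions (log-odds) into a fraud probability."""
    log_odds = base_value + sum(shap_values)
    return 1.0 / (1.0 + math.exp(-log_odds))

base_value = -4.0  # hypothetical average model output in log-odds
contributions = {  # hypothetical SHAP values for one transaction
    "time_since_signup": 3.1,   # pushes toward fraud
    "purchase_value": 0.6,      # pushes toward fraud
    "age": -0.2,                # pushes toward legitimate
}

baseline_probability = probability_from_shap(base_value, [])
fraud_probability = probability_from_shap(base_value, contributions.values())
print(f"baseline: {baseline_probability:.4f}, this transaction: {fraud_probability:.4f}")
```

Because the sigmoid is non-linear, the same SHAP value shifts the probability by different amounts depending on where the row starts, so contributions are best compared in log-odds rather than in probability points.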

Customization

Adjust analysis parameters in the script:
sample_size: Background samples for explainer (default: 1000)
max_samples: Test samples to explain (default: 500)
For detailed documentation, see SHAP_ANALYSIS_GUIDE.md.

📝 Best Practices Followed

Code Best Practices

Modular Design
✅ Reusable functions in src/ modules (not inline in notebooks)
✅ Functions with comprehensive docstrings and error handling
✅ Notebooks call reusable functions for consistency
✅ Easy to test and maintain
Error Handling
✅ Try/except blocks with informative error messages
✅ Validation of data before processing
✅ Graceful handling of missing files or invalid data
Data Handling
✅ Raw data never modified (read-only)
✅ All transformations documented
✅ Reproducible preprocessing pipeline
✅ Proper train-test split to prevent data leakage
Code Quality
✅ Functions over repeated code
✅ Clear variable naming
✅ Comprehensive comments and docstrings
✅ Consistent code style
Version Control
✅ .gitignore excludes large files and sensitive data
✅ Meaningful commit messages
✅ Separate branches for features
Documentation
✅ README with complete setup instructions
✅ Inline code documentation
✅ Notebook markdown explanations
✅ Function docstrings with parameters and returns

🤝 Contributing

How to Contribute

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Development Workflow

Create issue describing the problem/feature
Assign yourself and link to project board
Write tests for new functionality
Ensure all tests pass
Update documentation
Submit PR for review

📞 Contact & Support

Project Maintainer: Adey Innovations Inc.
Repository: GitHub Link
For questions or issues:
Check existing Issues
Create a new issue with detailed description
Tag with appropriate labels (bug, enhancement, question)

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

10 Academy - For the training program and project guidance
Adey Innovations Inc. - For the business case and datasets
Kaggle - For the credit card fraud dataset
Open Source Community - For the amazing libraries used in this project

🚀 Quick Start Summary


Typical Workflow:
Explore Data: Run notebooks/eda-fraud-data.ipynb to understand the datasets
Train Models: Run python scripts/train_fraud_models.py to build and evaluate models
Interpret Models: Run python scripts/shap_model_explainability.py to understand model decisions
Review Results: Check temp_charts/ for visualizations and console output for metrics
Implement Recommendations: Use SHAP insights to develop fraud prevention rules
Deploy: Use saved models from models/ for production predictions
Happy Fraud Hunting! 🕵️‍♂️💳
Posted May 11, 2026

Built a machine learning system for detecting fraudulent e-commerce and credit card transactions.