Vesta E-Commerce Fraud Detection

Data Modelling Analyst

Data Scientist

Data Analyst

Python

scikit-learn

Project Summary

This project aims to improve fraud detection in e-commerce by analyzing a large dataset of real-world transactions made available through IEEE-CIS and Vesta Corporation.

Dataset

Size: 590,540 transactions, 434 features.

Key Features: Transaction amount, transaction time, card details, address, distance, email domains, and various anonymized features.

Exploratory Data Analysis

Transaction Amounts: A log transformation of the transaction amounts revealed clear differences between fraudulent and non-fraudulent transactions.
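A minimal sketch of such a log transformation, assuming the amounts are non-negative. The helper name and the sample values are illustrative and not taken from the project; log(1 + x) is used here so that zero amounts stay defined.

```python
import numpy as np


def log_transform_amounts(amounts):
    """Return log(1 + amount) for an array of non-negative amounts."""
    amounts = np.asarray(amounts, dtype=float)
    return np.log1p(amounts)


# Illustrative values only, not taken from the dataset.
sample_amounts = [0.0, 9.99, 100.0, 2500.0]
log_amounts = log_transform_amounts(sample_amounts)
```

The transform compresses the long right tail of the amounts, which tends to make the two classes easier to compare in a histogram.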

Timing Patterns: Fraudulent transactions often occurred at irregular hours.
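One way to look for such timing patterns is to derive an hour-of-day value and compute the fraud rate per hour. The sketch below assumes that 'TransactionDT' is an offset in seconds from an undisclosed reference time, so the derived hour is relative rather than a true clock hour. The function names and sample values are hypothetical.

```python
import numpy as np

SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def hour_of_day(transaction_dt):
    """Map a seconds offset to a relative hour in the range 0..23."""
    dt = np.asarray(transaction_dt, dtype=np.int64)
    return (dt // SECONDS_PER_HOUR) % HOURS_PER_DAY


def fraud_rate_by_hour(transaction_dt, is_fraud):
    """Return the fraud rate for each hour, NaN where no transactions exist."""
    hours = hour_of_day(transaction_dt)
    is_fraud = np.asarray(is_fraud, dtype=float)
    rates = np.full(HOURS_PER_DAY, np.nan)
    for h in range(HOURS_PER_DAY):
        mask = hours == h
        if mask.any():
            rates[h] = is_fraud[mask].mean()
    return rates


# Illustrative values only: 90000 s is 25 h, which wraps around to hour 1.
sample_dt = [0, 3600, 3700, 90000]
sample_fraud = [0, 1, 0, 0]
rates = fraud_rate_by_hour(sample_dt, sample_fraud)
```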

Feature Importance: PCA was used to reduce multicollinearity among the features, and the analysis highlighted 'TransactionAmt' and 'TransactionDT' as key features.
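To illustrate how PCA removes multicollinearity, the sketch below builds a small synthetic matrix with two strongly correlated columns and projects it onto principal components. The data, the 95% variance target, and the variable names are assumptions made for the example, not settings from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Column 1 is almost a multiple of column 0; column 2 is independent noise.
X = np.hstack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# PCA is scale sensitive, so standardize before fitting.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# The resulting components are uncorrelated with each other.
component_corr = np.corrcoef(X_reduced, rowvar=False)
```

Note that the components are linear combinations of the original columns, so reading importance for a single original feature requires inspecting `pca.components_` rather than the transformed data.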

Challenges

Imbalanced Data: The dataset was heavily imbalanced between fraudulent and non-fraudulent transactions, which made effective model training difficult. Because many features are anonymized or encoded, oversampling techniques like SMOTE were not applied: SMOTE creates synthetic samples by interpolating between existing ones, and interpolated values of encoded features may not correspond to valid transactions. Consequently, we relied heavily on feature selection and on models capable of handling imbalanced data.
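A common way for models to handle imbalanced data without resampling is class weighting. The sketch below is an assumption about how that could look, not the project's confirmed configuration. It uses a toy label vector with 5% positives, scikit-learn's `class_weight` option, and the negative-to-positive ratio that is often passed to XGBoost's `scale_pos_weight` parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 95 legitimate transactions (0) and 5 fraudulent ones (1).
y = np.array([0] * 95 + [1] * 5)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# scikit-learn estimators can apply the same weighting internally.
forest = RandomForestClassifier(class_weight="balanced", random_state=0)

# Ratio of negatives to positives, a usual starting value for
# XGBoost's scale_pos_weight.
n_neg = int((y == 0).sum())
n_pos = int((y == 1).sum())
scale_pos_weight = n_neg / n_pos
```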

Feature Selection: Choosing the most relevant of the 434 features required extensive visual analysis, dimensionality reduction with PCA, and Random Forest feature importance rankings.
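Ranking features by Random Forest importance could be sketched as follows. The synthetic data, the feature names, and the choice of keeping the top three features are all assumptions for illustration. With `shuffle=False`, the informative columns generated by `make_classification` come first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the first 3 of 10 columns carry signal, the rest are noise.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Sort features from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [(feature_names[i], float(forest.feature_importances_[i])) for i in order]

# Keep the top-k features (k is a hypothetical cut-off).
top_k = 3
selected = [name for name, _ in ranked[:top_k]]
```

Impurity-based importances can favor high-cardinality features, so a ranking like this is usually cross-checked against other evidence such as the visual analysis mentioned above.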

Model Selection

Objective: Balance recall and precision to minimize false positives and maximize fraud detection.

Hyperparameter Tuning: Grid search for Random Forest, Logistic Regression, and XGBoost.

Random Forest: ROC-AUC 0.89

Logistic Regression: ROC-AUC 0.51

XGBoost: ROC-AUC 0.88
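A grid search scored by ROC-AUC, as used for the comparison above, might look like the following sketch. The parameter grid, the synthetic data, and the three-fold split are assumptions, since the actual search space is not stated. Only the Random Forest case is shown; the same pattern applies to the other estimators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the real transactions.
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=4,
    weights=[0.9, 0.1],
    random_state=0,
)

# Hypothetical search space, kept small so the example runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Stratified folds keep the fraud ratio similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_  # mean cross-validated ROC-AUC of the best setting
```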

Precision-Recall Analysis

Threshold Experimentation: XGBoost maintained more stable performance across different classification thresholds and was therefore chosen as the final model.
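Threshold experimentation of this kind can be sketched with scikit-learn's `precision_recall_curve`. The helper below, a hypothetical function, picks the threshold with the highest F1-score. The labels and scores are toy values, and F1 is only one possible selection criterion.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) for the threshold that maximizes F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The last precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.zeros_like(denom)
    nonzero = denom > 0
    f1[nonzero] = 2 * precision[nonzero] * recall[nonzero] / denom[nonzero]
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Toy labels and predicted fraud probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
best_threshold, best_f1 = best_f1_threshold(y_true, y_score)
```

Plotting precision and recall against the thresholds for each candidate model gives a direct view of how stable each one is as the cut-off moves.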

Final Model Choice

The XGBoost model was chosen for its robust performance and balanced precision-recall metrics, making it effective for fraud detection. The model demonstrated a strong F1-score of 0.76, with nearly equal precision and recall scores for both classes and an overall accuracy of 81%.
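For reference, per-class precision, recall, F1-score, and overall accuracy can be computed as in the sketch below. The labels are toy values chosen for illustration and do not reproduce the figures reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground truth and predictions (0 = legitimate, 1 = fraud).
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# One value per class, in the order given by `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)
accuracy = accuracy_score(y_true, y_pred)
```

On a heavily imbalanced test set, accuracy alone can look strong even for a weak model, which is why the per-class scores are reported alongside it.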