This project aims to improve fraud detection systems in e-commerce, partnering with IEEE-CIS and Vesta Corporation to analyze a large dataset of real-world transactions.
Dataset
Size: 590,540 transactions, 434 features.
Key Features: Transaction amount, transaction time, card details, address, distance, email domains, and various anonymized features.
Exploratory Data Analysis
Transaction Amounts: Log transformation revealed clear distinctions between fraudulent and non-fraudulent transactions.
Timing Patterns: Fraudulent transactions often occurred at irregular hours.
Feature Importance: PCA reduced multicollinearity among the features, and feature-importance analysis highlighted key features such as 'TransactionAmt' and 'TransactionDT'.
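The amount and timing observations above can be illustrated with a small sketch. It assumes that 'TransactionDT' is a count of seconds from some unspecified reference point, so the derived hour is relative to that reference rather than a confirmed local clock time. The sample rows and helper names are hypothetical and are not taken from the dataset.

```python
import math

SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def log_amount(transaction_amt):
    """Compress the long right tail of amounts; log1p stays finite at zero."""
    return math.log1p(transaction_amt)


def hour_of_day(transaction_dt):
    """Hour bucket (0-23), assuming transaction_dt is elapsed seconds."""
    return (transaction_dt // SECONDS_PER_HOUR) % HOURS_PER_DAY


# Hypothetical rows: (amount, elapsed seconds, fraud label).
rows = [
    (25.0, 86400, 0),
    (1200.0, 86400 + 3 * 3600, 1),
    (60.0, 2 * 86400 + 3 * 3600 + 59, 0),
    (15.5, 86400 + 14 * 3600, 0),
]

# One (log amount, hour, label) tuple per transaction.
features = [(log_amount(amt), hour_of_day(dt), fraud) for amt, dt, fraud in rows]
```

Grouping the fraud label by the derived hour is one way to check whether fraud concentrates at particular times.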
Challenges
Imbalanced Data: The dataset was heavily imbalanced between fraudulent and non-fraudulent transactions, which made effective model training difficult. We judged oversampling techniques like SMOTE to be unsuitable here: SMOTE interpolates between neighboring minority-class samples, which is only meaningful for features known to be continuous, and many features in this dataset are anonymized or encoded, so synthetic samples might not correspond to valid transactions. Consequently, we relied heavily on comprehensive feature selection techniques and on specific models capable of handling imbalanced data.
Feature Selection: Choosing the most relevant features from a set of 434 required extensive visual analysis, dimensionality reduction with PCA, and Random Forest feature importance.
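One common way for a model to handle imbalanced data without resampling is to weight the minority class more heavily. The sketch below computes the ratio of negative to positive examples, which is the value often passed to XGBoost's 'scale_pos_weight' parameter. The source does not say which mechanism this project used, so treat this as an illustration only; the label counts are hypothetical.

```python
def positive_class_weight(labels):
    """Return n_negative / n_positive for binary labels (1 = fraud)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return n_neg / n_pos


# Hypothetical label vector: 965 legitimate and 35 fraudulent transactions.
labels = [0] * 965 + [1] * 35
weight = positive_class_weight(labels)
```

With this weighting, each fraudulent example contributes roughly as much to the training loss as about 27.6 legitimate ones in the hypothetical sample.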
Model Selection
Objective: Balance recall and precision to minimize false positives and maximize fraud detection.
Hyperparameter Tuning: Grid search for Random Forest, Logistic Regression, and XGBoost.
Random Forest: ROC-AUC 0.89
Logistic Regression: ROC-AUC 0.51
XGBoost: ROC-AUC 0.88
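To help read the scores above: ROC-AUC equals the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, so 0.5 corresponds to random ranking and the Logistic Regression result of 0.51 is close to chance. The sketch below computes it directly from that definition. The labels and scores are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fraud) and model scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)

# A model that gives every transaction the same score cannot rank at all.
chance_auc = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```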
Precision-Recall Analysis
Threshold Experimentation: XGBoost maintained more stable performance across different decision thresholds and was therefore chosen as the final model.
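A threshold experiment of this kind can be sketched as follows: sweep the decision threshold and record precision and recall at each value. The labels, scores, and thresholds here are hypothetical and do not reproduce the project's results.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.30, 0.45, 0.55, 0.60, 0.80, 0.20, 0.95]

# Map each candidate threshold to its (precision, recall) pair.
sweep = {t: precision_recall_at(labels, scores, t) for t in (0.25, 0.5, 0.75)}
```

In this toy data, raising the threshold trades recall for precision. A model whose two metrics change gradually across thresholds is easier to operate than one that swings sharply.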
Final Model Choice
The XGBoost model was chosen for its robust performance and balanced precision-recall metrics, making it effective for fraud detection. The model demonstrated a strong F1-score of 0.76, with nearly equal precision and recall scores for both classes and an overall accuracy of 81%.
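For reference, the F1-score is the harmonic mean of precision and recall, so it is high only when both are high. The short sketch below shows that equal precision and recall of 0.76 yield an F1 of 0.76, while a lopsided pair scores lower. The second pair of values is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.76, 0.76)  # equal precision and recall
lopsided = f1_score(1.0, 0.5)    # hypothetical: perfect precision, half the fraud missed
```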
Further Work
Model Integration: Explore ensembling Random Forest and XGBoost.
Real-Time Deployment: Evaluate the model in a live environment for real-time fraud detection.
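The ensembling idea listed above could start from simple soft voting, that is, a weighted average of the fraud probabilities predicted by the two models. This is only a sketch of one possible approach, not something the project has implemented; the function name, weights, and probabilities are hypothetical.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Blend two models' fraud probabilities with a weighted average."""
    if len(probs_a) != len(probs_b):
        raise ValueError("probability lists must have the same length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


# Hypothetical per-transaction fraud probabilities from each model.
rf_probs = [0.10, 0.80, 0.40]
xgb_probs = [0.20, 0.60, 0.70]
blended = soft_vote(rf_probs, xgb_probs)
```

The blending weight would itself need to be tuned on held-out data, and the blended scores re-evaluated with the same precision-recall analysis used for the individual models.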
Overall, this project aims to improve fraud detection accuracy while minimizing disruptions for genuine users.