Fraud Transaction Detection System

Mohammad Umar

💳 Fraud Transaction Detection - Machine Learning Project

🔹 Author: Mohammad Umar
🔹 Contact: umar.test.49@gmail.com

📌 Section 1: Introduction and Objective

Fraudulent transactions have become a significant concern in the financial industry, especially with the increasing volume of digital payments. This project was developed as part of a machine learning internship to build a system capable of flagging potentially fraudulent transactions.

Client (Assumed):

A financial services firm or payment gateway company that needs real-time fraud detection to reduce chargebacks.

🎯 Problem Statement:

Identify whether a transaction is fraudulent (1) or legitimate (0) based on transaction metadata.

❗ Importance:

Prevents financial loss for businesses and customers
Automates fraud investigation workflows
Maintains customer trust and platform integrity

✅ Final Objective:

Build an ML model that:
Accurately classifies transactions
Can be deployed as a user-friendly Streamlit app
Delivers predictions based on user inputs (amount, time, terminal ID, etc.)

📊 Section 2: Dataset

Dataset Overview:

Source: Simulated data based on real fraud patterns
Files: 183 .pkl files (daily transaction logs)
Total Rows: ~1.75 million transactions
Columns: 9 main features + engineered ones

🧾 Important Features:

| Feature | Description |
| --- | --- |
| TRANSACTION_ID | Unique transaction ID |
| TX_AMOUNT | Transaction amount |
| TX_TIME_SECONDS | Seconds since midnight |
| TX_FRAUD | Target (0: Legit, 1: Fraud) |
| TX_FRAUD_SCENARIO | Fraud type ID (if fraud) |

⚙ Preprocessing:

Combined all .pkl files into one DataFrame
Converted datetime fields
Removed duplicates and missing values
Engineered temporal features (TX_HOUR, TX_DAY_OF_WEEK)

🔍 Key Observations:

Class imbalance: <1% fraudulent transactions
Certain fraud scenarios dominate the minority class
High fraud density during specific hours/weekdays

⚙️ Section 3: Design / Workflow

graph LR
A[Data Loading] --> B[Cleaning]
B --> C[EDA]
C --> D[Feature Engineering]
D --> E[Train-Test Split]
E --> F[Model Training]
F --> G[Evaluation]
G --> H[Deployment]

🔄 Steps Breakdown:

Data Loading & Cleaning:

Merged 183 transaction files
Handled null/missing values
Converted date fields to datetime format
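
The loading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `load_transactions` and the timestamp column name `TX_DATETIME` are assumptions, since the write-up does not name them.

```python
from pathlib import Path

import pandas as pd


def load_transactions(data_dir: str) -> pd.DataFrame:
    """Concatenate every daily .pkl file in data_dir into one cleaned DataFrame."""
    files = sorted(Path(data_dir).glob("*.pkl"))
    if not files:
        raise FileNotFoundError(f"no .pkl files found in {data_dir}")

    # Each file holds one day of transactions; stack them row-wise.
    df = pd.concat([pd.read_pickle(f) for f in files], ignore_index=True)

    # Assumed column name: TX_DATETIME holds the transaction timestamp.
    df["TX_DATETIME"] = pd.to_datetime(df["TX_DATETIME"])

    # Drop exact duplicate rows, then rows with any missing value.
    return df.drop_duplicates().dropna().reset_index(drop=True)
```

Sorting the file list keeps the merged rows in chronological order when the files are named by date, which makes later temporal checks easier to reason about.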

Exploratory Data Analysis (EDA):

Visualized fraud vs non-fraud distribution
Analyzed transaction amount patterns
Identified peak fraud times (temporal analysis)
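
One way to carry out the temporal part of this analysis is a grouped fraud rate. The sketch below assumes a `TX_HOUR` column already exists, uses a hypothetical helper name `fraud_rate_by`, and runs on toy values that are not taken from the project dataset.

```python
import pandas as pd


def fraud_rate_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return transaction count, fraud count and fraud rate for each value of `key`."""
    summary = df.groupby(key)["TX_FRAUD"].agg(transactions="count", frauds="sum")
    summary["fraud_rate"] = summary["frauds"] / summary["transactions"]
    # Highest-risk groups first.
    return summary.sort_values("fraud_rate", ascending=False)


# Toy data for illustration only; these are not values from the project dataset.
toy = pd.DataFrame(
    {
        "TX_HOUR": [1, 1, 1, 2, 14, 14, 14, 14],
        "TX_FRAUD": [1, 0, 1, 0, 0, 0, 1, 0],
    }
)
print(fraud_rate_by(toy, "TX_HOUR"))
```

Reporting the transaction count next to the rate helps avoid over-reading hours that have very few transactions.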

Feature Engineering:

Extracted time-based features (hour, day of week)
Removed raw timestamp after feature extraction
Normalized numerical features
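
A minimal sketch of the time-based extraction described above, assuming the raw timestamp lives in a column named `TX_DATETIME` (the write-up does not name it) and using a hypothetical helper `add_time_features`. Normalization is left out here because it can be handled by the scaler at training time.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, ts_col: str = "TX_DATETIME") -> pd.DataFrame:
    """Add TX_HOUR and TX_DAY_OF_WEEK, then drop the raw timestamp column."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["TX_HOUR"] = ts.dt.hour              # 0-23
    out["TX_DAY_OF_WEEK"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    # The raw timestamp is removed once the derived features exist.
    return out.drop(columns=[ts_col])
```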

Model Training:

Used Logistic Regression with class_weight='balanced'
Stratified 80/20 train-test split (maintained class ratio)
Scaled features using StandardScaler
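
The training setup above might look roughly like this in scikit-learn. The data here is synthetic stand-in data with about 1% positives, and the pipeline layout, `random_state` values, and `max_iter` are assumptions for illustration rather than the project's actual settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the project dataset): about 1% positives,
# with frauds shifted toward larger values of the first feature.
rng = np.random.default_rng(0)
n = 20_000
y = (rng.random(n) < 0.01).astype(int)
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y

# Stratified 80/20 split keeps the fraud ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Keeping the scaler inside the pipeline means it is fitted on the training
# split only; class_weight="balanced" reweights classes inversely to frequency.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Probability of the positive (fraud) class for each test row.
fraud_probability = model.predict_proba(X_test)[:, 1]
```

Fitting the scaler after the split, rather than on the full dataset, avoids leaking test-set statistics into training.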

Evaluation:

Calculated metrics: ✓ Accuracy ✓ Precision ✓ Recall ✓ F1-Score
Generated and analyzed confusion matrix
Plotted ROC curve and precision-recall curve
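
The curve computations mentioned above can be sketched with scikit-learn's metric helpers. The labels and scores below are toy values for illustration, not outputs of the project's model.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

# Toy labels and scores for illustration only (not project outputs).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90])

# Points of the ROC curve and of the precision-recall curve.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

# Single-number summaries of each curve.
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)
print(f"ROC AUC: {roc_auc:.3f}  average precision: {avg_precision:.3f}")
```

The resulting arrays can be passed to any plotting library. With fewer than 1% positives, the precision-recall curve tends to be the more informative of the two, because ROC AUC can look strong even when most flagged transactions are false positives.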

Deployment:

Built interactive Streamlit web application
Designed intuitive input widgets: dropdowns for categorical features, sliders for numerical values, and date/time pickers
Automated feature transformation pipeline
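
A hedged sketch of how the app's inputs could be turned into model features. The names `build_feature_row` and `render_app`, and the choice of exactly three features, are assumptions made for illustration. The real app may use more inputs (the write-up mentions terminal ID, for example).

```python
from datetime import date, datetime, time


def build_feature_row(amount: float, tx_date: date, tx_time: time) -> dict:
    """Turn raw form inputs into the engineered features used at training time."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    ts = datetime.combine(tx_date, tx_time)
    return {
        "TX_AMOUNT": float(amount),
        "TX_HOUR": ts.hour,
        "TX_DAY_OF_WEEK": ts.weekday(),  # Monday=0 ... Sunday=6
    }


def render_app(model) -> None:
    """Hypothetical Streamlit page; `model` is any fitted estimator with predict_proba."""
    # Imported here so build_feature_row stays usable without Streamlit installed.
    import pandas as pd
    import streamlit as st

    st.title("Fraud Transaction Detection")
    amount = st.number_input("Transaction amount", min_value=0.0, value=50.0)
    tx_date = st.date_input("Transaction date")
    tx_time = st.time_input("Transaction time")

    if st.button("Predict"):
        row = pd.DataFrame([build_feature_row(amount, tx_date, tx_time)])
        probability = model.predict_proba(row)[0, 1]
        st.write(f"Estimated fraud probability: {probability:.2%}")
```

Saved as a script that ends with a call such as `render_app(model)`, the page would be started with `streamlit run app.py` (the file name is illustrative). Keeping the transformation in a plain function lets it be unit-tested separately from the UI.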

📈 Section 4: Results

✅ Final Model: Logistic Regression

| Metric | Score |
| --- | --- |
| Accuracy | 0.7501 |
| Precision | 0.0164 |
| Recall | 0.4881 |
| F1-Score | 0.0317 |

🧪 Confusion Matrix:

[[261709  86178]
 [  1503   1433]]
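
As a cross-check, the four reported scores can be recomputed from the confusion matrix. The sketch assumes the matrix uses scikit-learn's layout (rows are true classes, columns are predicted classes), which the write-up does not state explicitly. Under that assumption the rounded results match the reported values.

```python
# Confusion matrix in scikit-learn layout: rows are true classes, columns are predictions.
tn, fp = 261709, 86178
fn, tp = 1503, 1433

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)      # share of true frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
fraud_share = (tp + fn) / total  # fraction of test rows that are truly fraud

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"fraud share of test set={fraud_share:.2%}")
```

The accuracy figure is driven almost entirely by the legitimate class, which is why precision, recall, and F1 are the more telling numbers here.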

📊 Key Insights

Model Performance

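Recall (49%): Catches nearly half of all fraud cases (1,433 of 2,936 in the test set)
Low Precision (1.6%): Only 1,433 of the 87,611 flagged transactions were actually fraud. The many false positives stem from the extreme class imbalance (roughly 1:99) and an intentional recall-precision tradeoff for security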
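
The recall-precision tradeoff can be made concrete by moving the decision threshold applied to the model's scores. The helper name `precision_recall_at` and the scores below are illustrative assumptions, not project outputs.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    flagged = [s >= threshold for s in y_score]
    tp = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, y_true) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, y_true) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy scores for illustration only (not project outputs).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, y_score, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the lowest threshold catches every fraud but flags more legitimate transactions, while the highest threshold flags only frauds and misses one. A deployed system would pick the operating point from the cost of a missed fraud versus the cost of reviewing a false alarm.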

Temporal Patterns

🕒 Peak Fraud Times:
12 AM to 4 AM (overnight window)
Weekends (higher frequency)

Common Fraud Scenarios

Scenario 2: Card-not-present transactions
Scenario 3: Small-amount testing transactions

✅ Section 5: Conclusion

🎯 Achievements

✔ Built complete fraud detection system
✔ 75% accuracy with 49% fraud recall
✔ Production-ready Streamlit interface
✔ Automated feature engineering pipeline

⚠️ Challenges

Class Imbalance: Fewer than 1% of transactions are fraudulent, which required careful metric selection
Technical Constraints: Dataset size limited model complexity, so the efficient Logistic Regression was chosen
Tradeoffs: Prioritized recall over precision, accepting false positives for security

🚀 Future Work

Advanced Techniques: Implement SMOTE for class balancing; test Isolation Forest for anomaly detection
Deployment: Cloud deployment with auto-scaling; database integration for history
Improvements: Prediction logging system; user feedback mechanism
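
Of the future-work items above, the Isolation Forest idea could be tried along these lines. This is only a sketch on synthetic stand-in data. The `contamination` value and other parameters are assumptions, and nothing here reflects results on the project dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (not the project dataset): 1,000 ordinary points
# plus 10 points placed far from the bulk to play the role of anomalies.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X = np.vstack([normal, outliers])

# contamination is the assumed share of anomalies; labels are not used for fitting.
detector = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = detector.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = detector.decision_function(X)  # lower = more anomalous

print("flagged:", int((labels == -1).sum()))
```

Because the detector is unsupervised, it does not depend on the scarce fraud labels, so its scores could also be fed to the supervised model as an extra feature.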

👨‍💻 Learnings

Handling imbalanced datasets in practice
Balancing theory vs real-world constraints
Importance of model interpretability
End-to-end ML system development
⭐ Star the repository if you found this useful!

Posted Aug 30, 2025

Developed a fraud detection system using machine learning for financial transactions.