Intelligent Phishing Detection System 🔒

An end-to-end, ML-powered solution for securing networks against phishing attacks. The project detects phishing websites using supervised learning and features a clean, modular pipeline architecture, experiment tracking with MLflow & DagsHub, containerized packaging, and cloud-ready deployment.

🧭 Problem Statement

Phishing websites impersonate trusted brands to steal credentials and financial data. Rule-based filters struggle against fast-evolving attacks and novel domains.
Why AI solves this:
Learns patterns from rich URL- and page-level feature sets (not just static rules).
Generalizes to previously unseen phishing sites.
Continuously improves with tracked experiments and retraining.

📦 Dataset

Name: Phishing Website Detection Dataset
Format: CSV
Features: 30+ web attributes (e.g., URL length, SSL state, domain age)
Target Variable: Result
-1: Phishing
1: Legitimate
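Loading the CSV and recoding the target might look like the sketch below. The feature names and the recode of `Result` into a 0/1 phishing flag are illustrative assumptions, shown on an inline sample rather than the real file.

```python
import pandas as pd

# Tiny illustrative sample mimicking the dataset's schema
# (feature names are assumptions; the real CSV has 30+ attributes).
df = pd.DataFrame({
    "having_IP_Address": [1, -1, 1],
    "URL_Length": [-1, 1, 1],
    "SSLfinal_State": [1, -1, 1],
    "Result": [1, -1, 1],  # -1 = phishing, 1 = legitimate
})

X = df.drop(columns=["Result"])
y = df["Result"].map({-1: 1, 1: 0})  # recode so 1 = phishing (positive class)
print(y.tolist())  # [0, 1, 0]
```

Treating phishing as the positive class keeps the recall metric discussed below aligned with the cost of missing an attack.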

🗂 Project Structure (Main Skeleton)

This is the main structure; the production codebase contains additional modules/configs beyond this outline.
networksecurity/
├── data/                 # Raw + processed data
├── notebooks/            # Jupyter notebooks for exploration
├── networksecurity/      # Core pipeline modules
│   ├── components/       # Data ingestion, transformation, training
│   ├── pipeline/         # Training & evaluation pipelines
│   ├── utils/            # Helper functions
│   ├── logger/           # Custom logging
│   ├── exception/        # Error handling
│   └── app.py            # (Optional) API layer for inference
├── config/               # Config files (YAML)
├── final_models/         # Trained models
├── mlruns/               # MLflow experiments
├── requirements.txt
├── main.py               # Pipeline runner
└── README.md

🧱 Features & Stack

| Feature | Technology Used |
|---|---|
| Data Handling | pandas, numpy, MongoDB |
| Modeling | scikit-learn (LogReg, RF), xgboost |
| Tracking & Versioning | MLflow (local/remote), DagsHub |
| Pipeline Architecture | Modular, OOP-based |
| Deployment (API Service) | app.py service, Dockerized, cloud-ready |
| Model Evaluation | Accuracy, Precision, Recall, F1-Score, ROC-AUC |

๐Ÿ—๏ธ Infrastructure (Live Diagram)

Workflow Overview
Storage: AWS S3 (data, artifacts, models)
Tracking: MLflow (experiments, metrics) + DagsHub (remote tracking)
Deployment: Docker image → AWS ECR → AWS EC2 (serving)
Automation: GitHub Actions for CI/CD (build, test, tag, push)
Observability: Custom logger/ + exception/ modules

🚀 Live Pipeline Diagram

%%{init: {'theme':'default', 'themeVariables': { 'fontSize': '18px'}, 'logLevel': 'debug'}}%%
flowchart LR
subgraph Dev[Developer Workflow]
A[Code & Notebooks] --> B[Git Commit/Push]
end

B --> C[GitHub Actions CI/CD]
C -->|Build & Test| D[Docker Image]
D -->|Push| E[AWS ECR]
E -->|Pull & Run| F[AWS EC2 Service]

subgraph Data & Tracking
G[(AWS S3: Data & Artifacts)]
J[(MongoDB: Features & Logs)]
H[MLflow Tracking Server]
I[DagsHub Remote Tracking]
end

A --> G
A --> J
C --> G
F --> G
F --> J
H <-->|Log Params/Metrics/Artifacts| C
H --> I

style G fill:#f6f8fa,stroke:#888
style J fill:#e8f5e9,stroke:#47A248
style H fill:#e3f2fd,stroke:#1e88e5
style I fill:#fff3e0,stroke:#fb8c00

๐Ÿ” ML Pipeline Stages

Data Ingestion
Data Validation (schema, nulls, ranges)
Data Transformation (feature engineering, scaling/encoding)
Model Training (Logistic Regression, XGBoost, Random Forest)
Model Evaluation (metrics + artifacts logged to MLflow)
Model Pushing (saved locally and/or to AWS S3)
%%{init: {'theme':'default', 'themeVariables': { 'fontSize': '18px'}, 'logLevel': 'debug'}}%%
flowchart LR
A[Data Ingestion: CSV, Database, API] --> B[Data Validation: Schema, Nulls, Ranges]
B --> C[Data Transformation: Encoding, Scaling, Feature Engg.]
C --> D[Model Training: ML Classifier]
D --> E[Model Evaluation: Metrics, Validation]
E --> F[Model Registry: MLflow Tracking]
F --> G[Deployment: FastAPI + Docker]
G --> H[Monitoring & Logging: Performance + Errors]
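The stage chaining above can be sketched as a minimal OOP pipeline runner. Class and method names here are illustrative stand-ins, not the repo's actual `components/` API.

```python
# Minimal sketch of the modular, OOP-style pipeline runner
# (class and method names are illustrative, not the repo's actual API).
class Stage:
    name = "stage"

    def run(self, artifact):
        raise NotImplementedError


class DataIngestion(Stage):
    name = "ingestion"

    def run(self, artifact):
        artifact["rows"] = 100          # pretend we pulled 100 records
        return artifact


class DataValidation(Stage):
    name = "validation"

    def run(self, artifact):
        assert artifact["rows"] > 0     # schema / row-count checks go here
        return artifact


class ModelTrainer(Stage):
    name = "training"

    def run(self, artifact):
        artifact["model"] = "random_forest.pkl"
        return artifact


def run_pipeline(stages):
    """Pass one artifact dict through every stage, recording the order."""
    artifact, history = {}, []
    for stage in stages:
        artifact = stage.run(artifact)
        history.append(stage.name)
    return artifact, history


artifact, history = run_pipeline([DataIngestion(), DataValidation(), ModelTrainer()])
print(history)  # ['ingestion', 'validation', 'training']
```

Each stage receives the previous stage's artifact, which is the same handoff pattern the diagram shows between ingestion, validation, transformation, and training.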



📒 MLflow Integration

Each pipeline step logs parameters, metrics, and artifacts to MLflow.
Runs can be tracked:
Locally โ†’ mlruns/ folder
Remotely โ†’ DagsHub (via MLflow env vars)
Facilitates model comparison, reproducibility, and rollbacks.

🧪 Model Performance (Illustrative)

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 91% | 90% | 89% | 89% | 0.93 |
| XGBoost | 94% | 93% | 92% | 92% | 0.96 |
| Random Forest | 96% | 95% | 96% | 95% | 0.98 |
Why Recall matters: ⚠️ Missing a phishing site (false negative) is riskier than a false positive. 👉 Random Forest was chosen for the best Recall/F1 balance.
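The recall-versus-precision trade-off is easy to see on toy labels (the values below are made up for illustration, not taken from the real evaluation run):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 1 = phishing, 0 = legitimate (toy values, not the real run)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one missed phishing site, one false alarm

print(recall_score(y_true, y_pred))     # 0.75 -> one phishing site slipped through
print(precision_score(y_true, y_pred))  # 0.75 -> one legitimate site flagged
```

With equal scores, the recall miss is the costlier error here: a user visiting the undetected phishing site can lose credentials, while the false alarm only costs an analyst review.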

🧰 How to Run

# Clone repository
git clone https://github.com/yourusername/networksecurity.git
cd networksecurity

# Create & activate environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the end-to-end pipeline
python main.py

# Inspect MLflow experiments locally
mlflow ui # open http://127.0.0.1:5000

🔗 Remote Tracking (DagsHub)

Set MLflow environment variables for remote tracking
Skip mlflow ui when using DagsHub
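Setting the variables from Python might look like this; `MLFLOW_TRACKING_URI`, `MLFLOW_TRACKING_USERNAME`, and `MLFLOW_TRACKING_PASSWORD` are the standard MLflow environment variables, while the URI and credential values are placeholders you replace with your own DagsHub details.

```python
import os

# Hedged sketch: point MLflow at a DagsHub remote (values are placeholders).
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<username>/networksecurity.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<username>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<dagshub-token>"
```

Any `mlflow.log_*` call made after these are set goes to the remote server, so no local `mlflow ui` is needed.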

โ˜๏ธ Cloud & Deployment (Optional API)

Artifacts & Models: stored in AWS S3
Container Image: built with Docker, pushed to AWS ECR
Serving: pull & run on AWS EC2
CI/CD: automated with GitHub Actions
API: networksecurity/app.py exposes inference endpoints

🔮 Future Improvements

✅ Explainability (SHAP) → feature contribution & analyst trust
✅ Streamlit Dashboard → live insights & SOC analyst workflows
✅ Threat Intel Feeds → enrich predictions (OpenPhish / PhishTank)
✅ LLM-assisted Triage → natural language rationale for SOC teams
✅ Unit & Integration Tests → higher coverage & reliability
✅ Infra as Code → Terraform + AWS Secrets Manager

👤 Author

Adebayo Gabriel – ML Engineer (AI × Cybersecurity)

🔗 Links

⚡ This is more than a model: it's a production-minded AI system for real-world network security.

Posted Sep 1, 2025

Created an ML system to detect phishing websites using supervised learning.