Upon seeing the data, I identified that it was not enough information for a classification model to make reasonable predictions. I added several new columns based on domain knowledge and statistics.
took a list of special characters and got the count of each in a url and created columns of the results
EDA
identified distribution and correlation between columns and the classification value, Used groubyby, scatterplots, and heat maps for visualizations. Highlights Below
Identified that the benign url were almost all over the place, but you can see that of the malware, phishing, and defacement url were more tightly clustered.
Malware url, had a significantly higher chance to have an ip address present.
The dataset is very imbalanced , with benign urls having over 50% of the values in the dataset, depending on the results of the models, I would have considered merging the data between the other columns and creating a binary classification vs. a multiclass
Model Building
For Model Building I knew a couple models that I wanted to try ahead of time that I usually get good results from, First I split the data in a 70/30 train test split. I then dropped the "type" column that we are trying to predict from my X data and then dropped the "url" column , since I already have the important features in the new engineered columns. I then used Sklearn's StandardScaler to scale all of the X data. Created a function that takes in all of the takes and trained them all and compared results based on accuracy, recall, precision and f1 scores. Depending on the results of the training I would have used different hyperparameters for each model via a gridsearch.
Model performance
Both Tree Based Models performed the best, but the RandomForestClassifier was the best overall
DecisionTreeClassifier
Accuracy: 93.89999999999999%
Precision: 91.8%
Recall: 91.3%
F1-Score: 91.5%
LogisticRegression
Accuracy: 78.0%
Precision: 62.1%
Recall: 56.699999999999996%
F1-Score: 58.199999999999996%
RandomForestClassifier
Accuracy: 94.5%
Precision: 92.9%
Recall: 92.0%
F1-Score: 92.4%
KNeighborsClassifier
Accuracy: 91.8%
Precision: 89.5%
Recall: 87.7%
F1-Score: 88.5%
Confusion Matrix
Confusion matrix of the True Labels and Predicted Label,
Model Deployment
I then pickled the model and created a flask API endpoint hosted on a local server that takes in a single url and identifies the class
Like this project
0
Posted May 31, 2024
Contribute to crashlattice57/ML_Malicous_URL_Classification development by creating an account on GitHub.