Car Selling Price Prediction Model

Abhijeet K.C.

This project focuses on building machine learning models to accurately predict the selling price of used cars. It explores data preprocessing, feature engineering, regularization techniques, model comparison, and hyperparameter tuning using cross-validation.

Project Workflow

1. Exploratory Data Analysis (EDA)

Removed missing and duplicate rows
Plotted distributions of numeric and categorical features
Checked correlations and multicollinearity (a minimal sketch of these steps follows below)
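A minimal sketch of these EDA steps, assuming the raw data lives in car.csv (the file name comes from the Credits section) and that seaborn/matplotlib are available:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("car.csv")
df = df.dropna().drop_duplicates()  # remove missing and duplicate rows

# Distribution of the target (right-skewed in most used-car datasets)
df["selling_price"].hist(bins=50)
plt.show()

# Correlation heatmap over numeric columns to spot multicollinearity
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()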

2. Data Preprocessing

Log-transformed skewed features: selling_price, km_driven, engine
Handled outliers using IQR and percentile methods
One-hot encoded categorical features
Standardized numerical values using StandardScaler
Built preprocessing pipelines using ColumnTransformer (see the sketch below)
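A sketch of the preprocessing pipeline; the column lists num_cols and cat_cols are illustrative assumptions, not confirmed by the write-up:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

num_cols = ["km_driven", "engine", "car_age"]  # assumed numeric columns
cat_cols = ["fuel", "transmission", "owner"]   # assumed categorical columns

preprocess = ColumnTransformer([
    # log1p tames the right skew noted above, then StandardScaler centers/scales
    ("num", Pipeline([("log", FunctionTransformer(np.log1p)),
                      ("scale", StandardScaler())]), num_cols),
    # handle_unknown="ignore" keeps inference from failing on unseen categories
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])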

3. Feature Engineering

Derived new features (computed as in the sketch below):
car_age = current_year - year
price_per_km = selling_price / km_driven
engine_per_seat = engine / seats
In some experiments, source columns such as year, km_driven, and engine were dropped to observe the impact
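A sketch of these derivations, assuming a pandas DataFrame df with the original columns; CURRENT_YEAR = 2025 is an assumed reference year:

import pandas as pd

df = pd.read_csv("car.csv")
CURRENT_YEAR = 2025  # assumed reference year

df["car_age"] = CURRENT_YEAR - df["year"]
df["price_per_km"] = df["selling_price"] / df["km_driven"]
df["engine_per_seat"] = df["engine"] / df["seats"]

# Variant used in some experiments: drop the source columns
df_derived_only = df.drop(columns=["year", "km_driven", "engine"])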

4. Model Training & Evaluation

All models were trained in pipelines and evaluated with cross-validation.
Linear Regression (with and without L1/L2 regularization)
Random Forest Regressor
XGBoost Regressor
LightGBM Regressor
Regularization:
Lasso (L1) used for automatic feature selection
Ridge (L2) used to reduce coefficient variance
ElasticNet (L1 + L2) tuned via grid search; best parameters (a sketch of the search follows below):
    {'model__alpha': 0.01, 'model__l1_ratio': 0.2}
Best R²: 0.8132
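A sketch of how this search could look with a pipeline and GridSearchCV; preprocess refers to the ColumnTransformer sketch above, and X_train/y_train come from a hypothetical train/test split:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("preprocess", preprocess),
                 ("model", ElasticNet(max_iter=10_000))])

# The "model__" prefix routes each parameter to the pipeline's ElasticNet step
param_grid = {"model__alpha": [0.001, 0.01, 0.1, 1.0],
              "model__l1_ratio": [0.2, 0.5, 0.8]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)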

5. Outlier Impact

With Outliers:
Linear Regression: MAE = 101,229.81, R² = 0.714
Random Forest: MAE = 90,691.03, R² = 0.825
XGBoost: MAE = 89,614.51, R² = 0.846
LightGBM: MAE = 90,948.43, R² = 0.846
Without Outliers:
Linear Regression: MAE = 80,589.06, R² = 0.849
Random Forest: MAE = 71,006.79, R² = 0.860
XGBoost: MAE = 72,881.67, R² = 0.866
LightGBM: MAE = 73,807.30, R² = 0.845
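The "without outliers" results above rely on the filtering described in Section 2; a minimal sketch of an IQR filter, assuming the conventional 1.5×IQR fence (the exact threshold is not confirmed by the write-up):

import pandas as pd

def iqr_filter(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    # Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

df_clean = iqr_filter(df, "selling_price")  # df from the earlier sketches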

Tuned XGBoost Model

Best parameters via GridSearchCV (a sketch of the search follows below):
{'model__learning_rate': 0.2, 'model__max_depth': 5, 'model__n_estimators': 200}
Best CV R²: 0.9930
Test set: R² = 0.9933, MAE = 11,480.94
Saved as xgboost_pipeline_tuned.pkl
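A sketch of the tuning and serialization steps; the grid values beyond the reported best parameters are assumptions:

import joblib
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

xgb_pipe = Pipeline([("preprocess", preprocess),
                     ("model", XGBRegressor(random_state=42))])

param_grid = {"model__learning_rate": [0.05, 0.1, 0.2],
              "model__max_depth": [3, 5, 7],
              "model__n_estimators": [100, 200]}

search = GridSearchCV(xgb_pipe, param_grid, cv=5, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)

# Persist the full pipeline (preprocessing + model) for reuse
joblib.dump(search.best_estimator_, "models/xgboost_pipeline_tuned.pkl")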

Final Evaluation (Best Model)

Metric    Value
MAE       11,480.94
MSE       284,920,425.62
RMSE      16,879.59
R²        0.9933
MAPE      3.15%
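The metrics above can be reproduced with scikit-learn; a sketch, assuming the tuned pipeline from the previous sketch and a hypothetical held-out split X_test/y_test:

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_pred = search.best_estimator_.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred) * 100  # in percent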

Experiments with Feature Engineering

When only the derived features were used and the original columns were dropped:
Test R²: 0.8577
MAPE: 15.24%
When both the original and derived features were used:
Test R²: ~0.9933
MAPE: 3.15%
Conclusion: Keeping both raw and derived features provided the highest predictive power. Removing source columns (like year, engine) decreased accuracy.

How to Run

pip install -r requirements.txt

import joblib

# Load the saved pipeline (preprocessing + tuned XGBoost)
model = joblib.load("models/xgboost_pipeline_tuned.pkl")
# X_test: a DataFrame with the same raw columns used in training
predictions = model.predict(X_test)

Future Enhancements

Use Optuna or Bayesian optimization for smarter hyperparameter tuning
Deploy the model with Flask or Streamlit
Add SHAP or LIME visualizations for model explainability
Try stacking or ensembling models

Credits

Developed by Abhijeet K.C. during the ML Project (Jul 1–6, 2025). Client: Aakash Group. Dataset: car.csv
Happy Coding!