Predicting NO₂ Levels Using Machine Learning

Kristóf Németh

air-quality

This project develops a predictive model that estimates nitrogen dioxide (NO₂) levels from meteorological and air pollutant measurements. NO₂ is a key air pollutant that affects human health and environmental quality.

Dataset

We use the UCI Air Quality Dataset, which contains hourly readings of air pollutants and meteorological factors.

Key Goals:

Data Preprocessing: Handling missing values and feature engineering.
Exploratory Data Analysis (EDA): Understanding patterns and correlations.
Model Training: Implementing XGBoost, LightGBM, and CatBoost.
Hyperparameter Tuning: Optimizing models using Hyperopt and Optuna.
Feature Importance Analysis: Using SHAP values to interpret model predictions.
Deployment: Deploying the best-performing model as an API.

ToDo:

Deployment
Report

Methodology

1. Data Acquisition & Preprocessing

Load the dataset from an Excel file.
Handle missing values: the dataset marks missing readings with -200, which is replaced with NaN.
Create new time-based features (weekday, month, hour).
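
The preprocessing steps above can be sketched as follows. This is a minimal sketch, not the project's actual code: the `preprocess` helper, the `timestamp` column name, and the sample values are assumptions made for illustration, and the small in-memory frame stands in for the result of loading the Excel file (for example with `pd.read_excel`).

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # value the dataset uses to mark a missing reading


def preprocess(df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Replace the -200 sentinel with NaN and add weekday, month and hour columns.

    `time_col` is a hypothetical column name holding one timestamp per row.
    """
    out = df.copy()
    # Only touch numeric columns so the timestamp column is left alone.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].replace(MISSING_SENTINEL, np.nan)

    ts = pd.to_datetime(out[time_col])
    out["weekday"] = ts.dt.weekday  # Monday = 0
    out["month"] = ts.dt.month
    out["hour"] = ts.dt.hour
    return out


# Made-up rows standing in for the loaded Excel sheet.
sample = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2004-03-10 18:00", "2004-03-10 19:00"]),
        "NO2(GT)": [113, -200],
        "T": [13.6, 13.3],
    }
)
clean = preprocess(sample)
```

Working on a copy keeps the raw frame untouched, which matches the idea of an immutable `data/raw` folder.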

2. Exploratory Data Analysis (EDA)

Visualize data distributions.
Analyze correlations between features and NO₂.
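
One way to run the correlation step is to rank every numeric feature by the strength of its Pearson correlation with the target. The sketch below assumes a hypothetical helper `correlations_with_target` and uses randomly generated data, so the numbers it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with `target`, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic data for illustration only: the target is built from one feature plus noise.
rng = np.random.default_rng(0)
nox = rng.uniform(50, 400, size=200)
frame = pd.DataFrame(
    {
        "NOx(GT)": nox,
        "T": rng.uniform(0, 30, size=200),
        "NO2(GT)": 0.3 * nox + rng.normal(0, 5, size=200),
    }
)
ranking = correlations_with_target(frame, "NO2(GT)")
print(ranking)
```

pandas computes each pairwise correlation on the rows where both columns are present, so NaN values left over from preprocessing do not need to be dropped first.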

3. Model Training & Evaluation

Implemented models:
Baseline XGBoost Model
Hyperparameter-tuned XGBoost
LightGBM Model (Tuned with Optuna)
CatBoost Model (Tuned with Optuna)
Metrics used for evaluation:
Root Mean Squared Error (RMSE)
R² Score
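
For reference, the two metrics can be computed from their standard definitions as shown below. This is an illustrative sketch in plain Python with made-up numbers. The project may well have used library implementations instead, and the function names here are assumptions.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observations and predictions, chosen so the arithmetic is easy to follow.
y_true = [100.0, 120.0, 80.0, 100.0]
y_pred = [110.0, 110.0, 90.0, 90.0]

print(rmse(y_true, y_pred))      # 10.0 (every residual is +/-10)
print(r2_score(y_true, y_pred))  # 0.5  (SS_res = 400, SS_tot = 800)
```

RMSE is expressed in the same units as the target, while R² is unitless and measures the share of variance explained, which is why the two are reported together.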

4. Model Performance Comparison

Model               RMSE    R²
Generic XGBoost     11.93   0.93
Tuned XGBoost       11.15   0.94
Generic LightGBM    12.34   0.93
Tuned LightGBM      11.24   0.94
Generic CatBoost    13.55   0.91
Tuned CatBoost      10.23   0.95

💡 Tuned CatBoost achieved the best performance, with the lowest RMSE (10.23) and the highest R² (0.95).

5. Feature Importance Analysis (SHAP)

NOx(GT) was the most influential factor in predicting NO₂ levels.
Meteorological factors such as absolute humidity (AH) and relative humidity (RH) also played a significant role.
Month and hour showed strong seasonal and time-of-day effects on air pollution levels.

6. Deployment

The best-performing model (CatBoost) was saved using joblib and deployed via:
Flask API for real-time predictions.
Streamlit Dashboard for interactive visualizations.
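
A prediction endpoint of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the project's API: the `/predict` route, the `create_app` factory, the JSON field names, and the `MeanModel` stand-in are all hypothetical. In a real service the stand-in would be replaced by the model loaded with `joblib.load` from wherever it was saved.

```python
from flask import Flask, jsonify, request


class MeanModel:
    """Stand-in with a scikit-learn style predict(); not a real NO2 model."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def create_app(model, feature_names):
    """Build a Flask app that serves predictions from `model`.

    `feature_names` fixes the order in which JSON fields are passed to the model.
    """
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in feature_names if name not in payload]
        if missing:
            return jsonify(error=f"missing features: {missing}"), 400
        row = [float(payload[name]) for name in feature_names]
        prediction = model.predict([row])[0]
        return jsonify(no2_prediction=float(prediction))

    return app


# Hypothetical wiring: three of the features named in the SHAP section.
app = create_app(MeanModel(), ["NOx(GT)", "RH", "AH"])

if __name__ == "__main__":
    app.run(port=5000)
```

Passing the model into a factory function keeps the route logic independent of how the model is loaded, so the endpoint can be exercised with Flask's test client without a saved model on disk.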

Project Organization

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         air-quality and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── air-quality        <- Source code for use in this project.

Posted May 7, 2025

Models to predict NO₂ levels using machine learning.