Predicting NO₂ Levels Using Machine Learning by Kristóf Németh

Completed work

Predicting NO₂ Levels Using Machine Learning

This project develops a predictive model that estimates NO₂ (nitrogen dioxide) levels from meteorological and air pollutant measurements. NO₂ is a key air pollutant affecting human health and environmental quality.

Dataset

We use the UCI Air Quality Dataset, which contains hourly readings of air pollutants and meteorological factors.

Key Goals:

Data Preprocessing: Handling missing values and feature engineering.

Exploratory Data Analysis (EDA): Understanding patterns and correlations.

Model Training: Implementing XGBoost, LightGBM, and CatBoost.

Hyperparameter Tuning: Optimizing models using Hyperopt and Optuna.

Feature Importance Analysis: Using SHAP values to interpret model predictions.

Deployment: Deploying the best-performing model as an API.

ToDo:

Deployment

Report

Methodology

1. Data Acquisition & Preprocessing

Load the dataset from an Excel file.

Handle missing values: the dataset marks missing readings with -200, and these are replaced with NaN.

Create new time-based features (weekday, month, hour).
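
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `timestamp` column name, the `preprocess` helper, and the sample values are assumptions, and the sketch presumes the date and time fields have already been combined into one column.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # the dataset marks missing readings with -200


def preprocess(df: pd.DataFrame, timestamp_col: str = "timestamp") -> pd.DataFrame:
    """Replace the -200 sentinel with NaN and derive weekday, month, and hour.

    `timestamp_col` is a hypothetical column holding the combined date and time.
    """
    out = df.replace(MISSING_SENTINEL, np.nan)
    ts = pd.to_datetime(out[timestamp_col])
    out["weekday"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["month"] = ts.dt.month
    out["hour"] = ts.dt.hour
    return out


# Tiny synthetic frame standing in for the real data (values are made up).
sample = pd.DataFrame(
    {
        "timestamp": ["2004-03-10 18:00:00", "2004-03-10 19:00:00"],
        "NO2(GT)": [113, -200],
    }
)
clean = preprocess(sample)
```

With the real data, the frame would presumably come from `pd.read_excel(...)` on the dataset's Excel file before being passed to a helper like this.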

2. Exploratory Data Analysis (EDA)

Visualize data distributions.

Analyze correlations between features and NO₂.
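
One way to run the correlation step is to rank features by the absolute value of their Pearson correlation with the target. The sketch below assumes the target column is named `NO2(GT)`; the helper name and the toy values are illustrative only.

```python
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str = "NO2(GT)") -> pd.Series:
    """Pearson correlation of every numeric feature with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)


# Synthetic example (made-up values): NOx tracks NO2 exactly, T is uncorrelated.
toy = pd.DataFrame(
    {
        "NO2(GT)": [10, 20, 30, 40],
        "NOx(GT)": [1, 2, 3, 4],
        "T": [3, 1, 4, 2],
    }
)
ranked_corr = correlations_with_target(toy)
```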

3. Model Training & Evaluation

Implemented models:

Baseline XGBoost Model

Hyperparameter-tuned XGBoost

LightGBM Model (Tuned with Optuna)

CatBoost Model (Tuned with Optuna)
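
As a rough sketch of what Optuna tuning for the CatBoost model could look like: the search ranges, the function names, and the train/validation split below are assumptions, not the settings used in this project.

```python
def suggest_catboost_params(trial):
    """Sample one candidate configuration; the ranges are illustrative guesses."""
    return {
        "iterations": trial.suggest_int("iterations", 200, 1000),
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
    }


def tune_catboost(X_train, y_train, X_valid, y_valid, n_trials=50):
    """Minimize validation RMSE over the search space defined above."""
    # Third-party imports are local so the search-space helper works without them.
    import optuna
    from catboost import CatBoostRegressor
    from sklearn.metrics import mean_squared_error

    def objective(trial):
        model = CatBoostRegressor(
            **suggest_catboost_params(trial), verbose=0, random_seed=42
        )
        model.fit(
            X_train, y_train, eval_set=(X_valid, y_valid), early_stopping_rounds=50
        )
        preds = model.predict(X_valid)
        return mean_squared_error(y_valid, preds) ** 0.5  # RMSE

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

The same pattern would apply to LightGBM by swapping the regressor and the parameter names in the search space.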

Metrics used for evaluation:

Root Mean Squared Error (RMSE)

R² Score
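
For reference, both metrics can be computed in a few lines of plain Python. The project may well have used library equivalents such as scikit-learn's `mean_squared_error` and `r2_score`; the version below is only a sketch with made-up numbers to show the definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only: every prediction is off by exactly 10.
y_true = [100.0, 120.0, 80.0, 100.0]
y_pred = [110.0, 110.0, 90.0, 90.0]
example_rmse = rmse(y_true, y_pred)    # 10.0
example_r2 = r2_score(y_true, y_pred)  # 0.5
```

RMSE is in the same units as the NO₂ readings, while R² is unitless, so the two are reported together.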

4. Model Performance Comparison

Model             RMSE    R²
Generic XGBoost   11.93   0.93
Tuned XGBoost     11.15   0.94
Generic LightGBM  12.34   0.93
Tuned LightGBM    11.24   0.94
Generic CatBoost  13.55   0.91
Tuned CatBoost    10.23   0.95

💡 Tuned CatBoost achieved the best performance, with the lowest RMSE (10.23) and the highest R² (0.95).
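
To put the tuning gains in perspective, the RMSE values reported above can be turned into relative reductions. The snippet is a small worked calculation on those reported numbers; the helper name is illustrative.

```python
# RMSE values copied from the comparison above.
results = {
    "XGBoost": {"generic": 11.93, "tuned": 11.15},
    "LightGBM": {"generic": 12.34, "tuned": 11.24},
    "CatBoost": {"generic": 13.55, "tuned": 10.23},
}


def rmse_reduction_pct(generic, tuned):
    """Relative RMSE reduction from tuning, in percent."""
    return 100.0 * (generic - tuned) / generic


reductions = {
    name: round(rmse_reduction_pct(**scores), 1) for name, scores in results.items()
}
best_tuned = min(results, key=lambda name: results[name]["tuned"])
```

Tuning lowers RMSE by about 6.5% for XGBoost, 8.9% for LightGBM, and 24.5% for CatBoost, so CatBoost moves from the weakest generic model to the strongest tuned one.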

5. Feature Importance Analysis (SHAP)

NOx(GT) was the most influential factor in predicting NO₂ levels.

Meteorological factors such as absolute humidity (AH) and relative humidity (RH) also played a significant role.

Month and hour were also influential, reflecting seasonal and time-of-day patterns in air pollution levels.
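
A common way to arrive at a ranking like the one above is to average the absolute SHAP value of each feature across all rows. The sketch below assumes a fitted tree model and a pandas feature frame; the function names and the toy SHAP values are hypothetical, not taken from the project.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value (rows = samples, columns = features)."""
    totals = [0.0] * len(feature_names)
    n_rows = 0
    for row in shap_values:
        n_rows += 1
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = [total / n_rows for total in totals]
    return sorted(zip(feature_names, means), key=lambda item: item[1], reverse=True)


def explain_tree_model(model, X):
    """Compute SHAP values for a fitted tree model and rank the features."""
    import shap  # local import: only needed when a real model is explained

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    return rank_by_mean_abs_shap(shap_values, list(X.columns))


# Made-up SHAP values for two samples and two features, for illustration only.
toy_ranking = rank_by_mean_abs_shap([[2.0, -0.5], [-4.0, 0.5]], ["NOx(GT)", "RH"])
```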

6. Deployment

The best-performing model (CatBoost) was saved using joblib and deployed via:

Flask API for real-time predictions.

Streamlit Dashboard for interactive visualizations.
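
A minimal sketch of what such a Flask prediction endpoint could look like is shown below. The feature list, the model path, the `/predict` route, and the response shape are all assumptions made for illustration; the project's actual API may differ.

```python
# Hypothetical feature order; it must match the columns the model was trained on.
FEATURES = ["NOx(GT)", "T", "RH", "AH", "weekday", "month", "hour"]


def payload_to_row(payload, feature_names=FEATURES):
    """Turn a JSON payload into an ordered numeric feature row."""
    missing = [name for name in feature_names if name not in payload]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(payload[name]) for name in feature_names]


def create_app(model_path="models/catboost_no2.joblib"):  # hypothetical path
    """Build a Flask app that serves predictions from a joblib-saved model."""
    import joblib
    from flask import Flask, jsonify, request

    model = joblib.load(model_path)
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        try:
            row = payload_to_row(request.get_json(force=True))
        except (TypeError, ValueError) as exc:
            return jsonify({"error": str(exc)}), 400
        prediction = model.predict([row])[0]
        return jsonify({"no2": float(prediction)})

    return app
```

Keeping the payload validation in its own function makes it easy to test without starting a server or loading the model.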

Project Organization

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for 
│                         air-quality and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── air-quality        <- Source code for use in this project.

Posted May 7, 2025

Models to predict NO₂ levels using machine learning.