USDA Health Score Prediction Using Deep Learning

Hariharasudhan T

πŸ₯— USDA Health Score Prediction

This project focuses on predicting USDA Health Scores for recipes using deep learning models, alongside comprehensive data analytics and visualizations. Built using PySpark for scalable data preprocessing and TensorFlow/Keras for model training, this pipeline explores the impact of data quality on prediction accuracy.

πŸ“Œ Project Overview

πŸ” Goal: Predict USDA Health Scores for recipes based on their ingredients and cuisine types.
πŸ“Š Approach: Analyze and clean the recipe dataset using PySpark, extract features, and train a Neural Network using TensorFlow/Keras.
πŸ“‰ Challenge: The dataset (sourced from Harvard Dataverse) was significantly incomplete, especially in the ingredient listings, leading to reduced model accuracy.
🧼 Solution: Custom data cleaning pipelines were developed to handle missing and inconsistent data. Results showed improved prediction accuracy after rigorous preprocessing.

🧰 Tech Stack

Data Processing: Apache Spark (PySpark)
Machine Learning: TensorFlow + Keras
Visualization: Matplotlib, Seaborn
Notebook Environment: Jupyter Notebooks
Dataset Source: Harvard Dataverse

πŸ“ˆ Key Features

βš™οΈ Feature engineering from raw recipe data
πŸ§ͺ Data cleaning and deduplication using Spark
🍽️ Ingredient-level and cuisine-level distribution plots
🧠 Deep Neural Network model to predict USDA scores
πŸ“‰ Accuracy degradation analysis due to incomplete data
πŸ”„ Pipeline to improve dataset consistency and model performance

πŸ“ Getting Started

Requirements

Python 3.8+
PySpark
TensorFlow / Keras
NumPy
Matplotlib
Like this project

Posted Apr 23, 2025

Predicted USDA Health Scores using deep learning and data analytics.

Swift Soft Jobs Project
Swift Soft Jobs Project
Hariharasudhan's Personal Portfolio Website
Hariharasudhan's Personal Portfolio Website