WalidBenzineb/Airbnb-NYC-Price-Prediction-July-2024-Data-

Walid Benzineb


Project Overview

This project aims to predict Airbnb listing prices in New York City using data from July 25, 2024. We'll walk through the entire data science process, from collecting and analyzing data to building a predictive model and creating a user-friendly web application.

Technologies and Skills Showcase

Programming Languages: Python
Data Manipulation: Pandas, NumPy
Data Visualization: Matplotlib, Seaborn
Machine Learning: Scikit-learn, XGBoost
Web Application: Streamlit
Version Control: Git, GitHub
Data Analysis: Exploratory Data Analysis (EDA), Statistical Analysis
Feature Engineering: Creating derived features, handling categorical variables
Model Evaluation: Cross-validation, hyperparameter tuning
Big Data Handling: Processing and analyzing large datasets
Data Cleaning: Handling missing values, outlier detection and treatment

Project Structure

Our project follows these key steps:
Data Cleaning (data_cleaning.py): Preprocesses the raw data, handling missing values and outliers.
Initial Exploratory Data Analysis (exploratory_data_analysis.py): Performs initial data visualization and statistical analysis to understand the dataset's characteristics.
Feature Engineering (feature_engineering.py): Creates new features and transforms existing ones to improve model performance.
Feature-Focused EDA (feature_exploratory_data_analysis.py): Analyzes the engineered features, providing insights into their relationships and potential impact on the target variable.
Model Training (refined_model.py): Trains the XGBoost model using the preprocessed and engineered features.
Streamlit App (streamlit_app.py): Provides a user-friendly interface for interacting with the trained model and visualizing results.
Each script can be run independently, but they should be executed in the order listed above for the full data science pipeline.

Data

This project uses the Airbnb NYC dataset from July 05, 2024. Because the files are large, they are not included directly in this repository. Instead, you can download them from the following links:
After downloading, place these files in the data/ directory of the project before running the scripts.

Project Steps and Key Decisions

1. Data Collection and Initial Exploration

We started with three main data files:
calendar.csv: Contains availability and pricing information
listings.csv: Detailed information about each Airbnb listing
reviews.csv: User reviews for the listings
Key Decision: We focused primarily on the listings.csv file as it contained the most relevant information for price prediction.
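As a rough sketch of this first look at the data, the snippet below loads a listings file with pandas and counts missing values per column. The column names and the small in-memory CSV are assumptions for illustration only; the real listings.csv has many more fields, and `load_listings` is a hypothetical helper, not a function from the repository.

```python
import io
import pandas as pd

# Stand-in for listings.csv (hypothetical sample rows, not real data)
SAMPLE_CSV = io.StringIO(
    "id,room_type,price,minimum_nights,last_review\n"
    "1,Entire home/apt,$150.00,2,2024-06-30\n"
    "2,Private room,$80.00,1,\n"
    "3,Shared room,,30,2023-12-01\n"
)

def load_listings(source, columns=None):
    """Read a listings CSV and optionally keep only the given columns."""
    df = pd.read_csv(source, parse_dates=["last_review"])
    if columns is not None:
        df = df[columns]
    return df

listings = load_listings(
    SAMPLE_CSV, columns=["id", "room_type", "price", "last_review"]
)
missing_per_column = listings.isna().sum()
print(listings.dtypes)
print(missing_per_column)
```

In practice `source` would be the path to the downloaded file; passing a column list keeps memory use down when only a few fields are needed.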

2. Data Cleaning and Preparation

Handled missing values
Converted data types (e.g., dates, prices) to appropriate formats
Removed extreme outliers to improve data quality
Key Decision: We chose to remove extreme price outliers (above the 99th percentile) to prevent them from skewing our model.
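The type conversion and outlier removal described above might look roughly like the following. This is a minimal sketch: it assumes prices arrive as strings such as "$1,250.00", and the choice to drop rows without a usable price is an assumption rather than something the project states. `clean_prices` is a hypothetical helper name.

```python
import pandas as pd

def clean_prices(df, price_col="price", upper_quantile=0.99):
    """Convert price strings to floats and drop rows above the given quantile."""
    out = df.copy()
    # Strip currency symbols and thousands separators, then convert to float
    out[price_col] = (
        out[price_col]
        .astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .pipe(pd.to_numeric, errors="coerce")
    )
    # Assumption: rows without a usable price cannot serve as training targets
    out = out.dropna(subset=[price_col])
    cutoff = out[price_col].quantile(upper_quantile)
    return out[out[price_col] <= cutoff].reset_index(drop=True)

# Toy data; a lower quantile is used so the cutoff is visible on five rows
raw = pd.DataFrame({"price": ["$100.00", "$1,250.00", "$80.00", None, "$95.00"]})
cleaned = clean_prices(raw, upper_quantile=0.75)
print(cleaned)
```

With the default `upper_quantile=0.99`, the same function applies the 99th-percentile cutoff mentioned in the key decision.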

3. Exploratory Data Analysis (EDA)

Price Distribution

This histogram shows the distribution of Airbnb prices in NYC. We observed that:
Prices are heavily right-skewed
Most listings are concentrated in the lower price range
There are some very high-priced outliers
Key Decision: Given the skewed nature of prices, we decided to use a log transformation on the price variable to make it more normally distributed for our model.
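To illustrate why a log transformation helps with right-skewed prices, here is a small sketch on made-up values. It assumes the `log1p`/`expm1` pair (the Streamlit snippet later in this README uses `np.expm1` to undo the transform); the prices and the `skewness` helper are illustrative only.

```python
import numpy as np

# Made-up prices with one very expensive listing
prices = np.array([45.0, 80.0, 120.0, 150.0, 300.0, 2500.0])

# log1p compresses the long right tail; expm1 is its inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

def skewness(x):
    """Sample skewness (third standardized moment)."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

print(f"skew before: {skewness(prices):.2f}, after: {skewness(log_prices):.2f}")
```

The transformed values are less skewed, and `expm1` recovers the original prices, which is what allows predictions to be reported in dollars.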

Price by Room Type

This box plot displays how prices vary across different room types. We found that:
Entire homes/apartments are generally more expensive
Shared rooms are the least expensive option
There's significant price overlap between private rooms and entire homes/apartments
Key Decision: Room type is clearly an important factor in determining price, so we made sure to include it as a key feature in our model.
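The numbers behind a box plot like this can be obtained with a grouped summary. The sketch below uses invented prices purely to show the mechanics; it does not reproduce the project's data or results.

```python
import pandas as pd

# Invented example rows, not taken from the dataset
df = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Entire home/apt",
        "Private room", "Private room", "Private room",
        "Shared room", "Shared room",
    ],
    "price": [150.0, 200.0, 250.0, 70.0, 90.0, 110.0, 40.0, 50.0],
})

# Quartiles per room type: the same statistics a box plot draws
summary = df.groupby("room_type")["price"].describe()[["25%", "50%", "75%"]]
print(summary)
```

Overlapping interquartile ranges between two room types would correspond to the price overlap noted above.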

Correlation Matrix

This heatmap shows the correlations between different numerical features. Notable observations:
'Number of reviews' and 'reviews per month' are highly correlated (as expected)
'Availability 365' (number of days available in a year) has a moderate negative correlation with price
Key Decision: Based on these correlations, we decided to engineer new features that could capture more complex relationships in the data.
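A correlation matrix like the one described can be computed with `DataFrame.corr()`. The following sketch uses synthetic data in which `reviews_per_month` is deliberately built from `number_of_reviews`, to mimic the redundancy noted above; the 0.8 threshold and the `highly_correlated_pairs` helper are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
number_of_reviews = rng.poisson(30, size=n)
df = pd.DataFrame({
    "number_of_reviews": number_of_reviews,
    # Built to track number_of_reviews closely (synthetic, for illustration)
    "reviews_per_month": number_of_reviews / 12 + rng.normal(0, 0.2, size=n),
    "availability_365": rng.integers(0, 366, size=n),
})

corr = df.corr()

def highly_correlated_pairs(corr, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| at or above threshold."""
    pairs = []
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

pairs = highly_correlated_pairs(corr)
print(pairs)
```

Passing `corr` to `seaborn.heatmap` would produce the kind of plot discussed in this section.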

4. Feature Engineering

We created several new features to capture more information:
availability_rate: Percentage of days a listing is available
avg_price: Average price for each listing
price_category: Categorized prices into low, medium, high
days_since_last_review: To capture the recency of reviews
is_licensed: Whether the listing is licensed
Key Decision: We created the availability_rate feature because we noticed that availability had a relationship with price, but it wasn't perfectly linear. This new feature allowed our model to capture more nuanced patterns.
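Three of the features listed above could be derived along these lines. This is a sketch under stated assumptions: the reference date is arbitrary, any non-empty license string is treated as licensed, and price categories are equal-sized terciles. The exact rules used in feature_engineering.py are not shown in this README, and the license value in the toy data is hypothetical.

```python
import pandas as pd

def add_review_and_license_features(df, reference_date):
    """Add days_since_last_review, is_licensed and price_category columns."""
    out = df.copy()
    last_review = pd.to_datetime(out["last_review"], errors="coerce")
    # Listings that were never reviewed get NaN here and can be imputed later
    out["days_since_last_review"] = (reference_date - last_review).dt.days
    # Assumption: any non-empty license string counts as licensed
    out["is_licensed"] = out["license"].fillna("").str.strip().ne("").astype(int)
    # Assumption: three equal-sized price bands
    out["price_category"] = pd.qcut(
        out["price"], q=3, labels=["low", "medium", "high"]
    )
    return out

toy = pd.DataFrame({
    "last_review": ["2024-07-22", None, "2024-01-01"],
    "license": ["OSE-STRREG-0000001", None, " "],  # hypothetical values
    "price": [90.0, 150.0, 400.0],
})
features = add_review_and_license_features(toy, pd.Timestamp("2024-08-01"))
print(features[["days_since_last_review", "is_licensed", "price_category"]])
```

Note that price_category is derived from the target itself, so it would need to be kept out of the model inputs to avoid leakage.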

5. Model Development

We used XGBoost for our final model due to its strong performance on tabular data. Here's how we approached model development:
Split the data into training and testing sets
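Created a pipeline that included preprocessing (median imputation and robust scaling for numeric features; constant imputation and one-hot encoding for categorical features), feature selection with SelectKBest, and the XGBoost regressor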
Used RandomizedSearchCV to find the best hyperparameters
Final model performance:
R² Score: 0.8064 (The model explains about 80.64% of the variance in listing prices)
Mean Squared Error (MSE): 0.0726
Root Mean Squared Error (RMSE): 0.2694
Mean Absolute Percentage Error (MAPE): 15.35% (On average, our predictions are off by about 15.35%)
Key Decision: We chose XGBoost and fine-tuned its parameters because it consistently outperformed other algorithms we tried, including linear regression and random forests.
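The four metrics reported above can be computed as in the sketch below. The README does not say on which scale each metric was calculated, so this version makes an assumption: R², MSE and RMSE are computed on the log-transformed target, and MAPE on prices converted back with `expm1`. `regression_report` is a hypothetical helper and the example values are invented.

```python
import numpy as np

def regression_report(y_true_log, y_pred_log):
    """R², MSE and RMSE on the log scale; MAPE on the price scale (assumed)."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    residuals = y_true_log - y_pred_log
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true_log - y_true_log.mean()) ** 2))
    # Undo the log1p transform before computing a percentage error in dollars
    price_true = np.expm1(y_true_log)
    price_pred = np.expm1(y_pred_log)
    mape = float(np.mean(np.abs((price_true - price_pred) / price_true)) * 100)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": mse,
        "rmse": mse ** 0.5,
        "mape_pct": mape,
    }

# Invented actual and predicted prices, expressed on the log1p scale
y_true = np.log1p([100.0, 150.0, 200.0, 250.0])
y_pred = np.log1p([110.0, 140.0, 210.0, 240.0])
report = regression_report(y_true, y_pred)
print(report)
```

RMSE is always the square root of MSE, which matches the reported pair (0.2694² ≈ 0.0726).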

6. Interactive Web Application

We created a Streamlit web app that allows users to:
Input details about a potential Airbnb listing
View interactive data visualizations
Get a real-time price prediction based on the input
Key Decision: We chose Streamlit for its simplicity and ease of deployment, making our model accessible to non-technical users.

Challenges Faced

Data Quality Issues: The raw dataset contained missing values and outliers. We addressed this by implementing robust data cleaning procedures and carefully considering which data points to exclude to maintain data integrity without losing valuable information.
# Example of handling missing values
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
# Removing extreme price outliers
df = df[df['price'] <= df['price'].quantile(0.99)]
Feature Engineering: Creating meaningful features that capture the complexities of Airbnb pricing was challenging. We overcame this by combining domain knowledge with data-driven insights.
# Creating availability rate feature
df['availability_rate'] = df['availability_365'] / 365
# Creating price per night feature
df['price_per_night'] = df['price'] / df['minimum_nights'].clip(lower=1)
Model Optimization: Balancing model complexity with performance was tricky. We used RandomizedSearchCV to efficiently search the hyperparameter space and find the optimal model configuration.
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter tuning with RandomizedSearchCV
# The 'model__' prefix targets the 'model' step of the pipeline
param_distributions = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [3, 4, 5, 6],
    'model__learning_rate': [0.01, 0.1, 0.3]
}
random_search = RandomizedSearchCV(pipeline, param_distributions, n_iter=20, cv=5, random_state=42)
Interpreting Complex Models: XGBoost models can be challenging to interpret. We addressed this by using feature importance plots and SHAP (SHapley Additive exPlanations) values to understand the model's decision-making process.
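As a lightweight illustration of model interpretation, the sketch below ranks features by permutation importance. This is a different technique from the SHAP values and XGBoost importance plots mentioned above, and it uses scikit-learn's GradientBoostingRegressor as a stand-in model on synthetic data so that it runs without extra dependencies. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 300
X = rng.normal(size=(n, 3))
# Synthetic target: strong dependence on column 0, weak on column 1, none on column 2
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
feature_names = ["strong_signal", "weak_signal", "noise"]

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same call works with any fitted scikit-learn compatible estimator, including a pipeline ending in an XGBoost regressor.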

Code Snippets

Here are some key code snippets that showcase important parts of our project:
Data Preprocessing:
def prepare_features(df):
    df_selected = df[important_features + ['price']].copy()
    df_selected['reviews_per_month'] = df_selected['reviews_per_month'].fillna(0)
    df_selected['price_per_night'] = df_selected['avg_price'] / df_selected['minimum_nights'].clip(lower=1)
    df_selected['is_superhost'] = (df_selected['calculated_host_listings_count'] > 1).astype(int)
    df_selected['high_availability'] = (df_selected['availability_365'] > 180).astype(int)
    return df_selected
Model Pipeline Creation:
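import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

def create_model_pipeline(numeric_features, categorical_features):
    # Numeric columns: fill gaps with the median, then scale robustly to outliers
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', RobustScaler())
    ])
    # Categorical columns: fill gaps with 'Unknown', then one-hot encode
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    model = xgb.XGBRegressor(random_state=42, n_jobs=-1)
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('feature_selection', SelectKBest(f_regression, k='all')),
        ('model', model)
    ])
    return pipeline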
Streamlit App (Price Prediction Section):
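if st.button('Predict Price'):
    try:
        # Reorder the input columns to match the order used during training
        user_input = user_input[model.feature_names_in_]
        # The model output is on the log scale; expm1 converts it back to dollars
        prediction = np.expm1(model.predict(user_input))
        st.success(f'The predicted price is ${prediction[0]:.2f} per night')
    except Exception as e:
        st.error(f"An error occurred during prediction: {str(e)}")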

Key Findings

Location (neighborhood) and room type are the strongest predictors of Airbnb prices in NYC
The number of reviews and review frequency have a notable impact on pricing
Availability throughout the year moderately affects pricing
There's a non-linear relationship between price and minimum nights stay

Future Improvements

Incorporate external data like proximity to attractions or public transport
Implement time-series analysis to capture seasonal price variations
Experiment with more advanced machine learning techniques, such as deep learning models
Create an automated pipeline to regularly update the model with new data

Running the Project

Clone the repository
Download the data files and place them in the data/raw/ directory (see Data section)
Install dependencies: pip install -r requirements.txt
Run data preprocessing: python src/data_cleaning.py
Perform feature engineering: python src/feature_engineering.py
Train the model: python src/refined_model.py
Launch the Streamlit app: streamlit run streamlit_app.py

Glossary

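R² Score: A statistical measure that represents the proportion of the variance in the dependent variable (price) that is predictable from the independent variables. A value of 1 indicates perfect prediction, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.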
MSE (Mean Squared Error): The average of the squared differences between predicted and actual values. Lower values indicate better model performance.
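RMSE (Root Mean Squared Error): The square root of MSE, which measures the typical deviation of predictions from actual values in the same unit as the target variable. Because the model here is trained on log-transformed prices, the MSE and RMSE reported above are presumably in log units rather than dollars.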
MAPE (Mean Absolute Percentage Error): The average percentage difference between predicted and actual values. It's often used because it's easy to interpret.
XGBoost: An optimized distributed gradient boosting library, designed to be highly efficient, flexible and portable.
Feature Engineering: The process of using domain knowledge to create new variables that make machine learning algorithms work better.
Hyperparameter Tuning: The process of finding the optimal set of hyperparameters for a machine learning model.

Contact Information

Walid Benzineb - benzinebwal@gmail.com