Airbnb property ranking prediction: multiclass classification

Brian Lin

Data Modelling Analyst
Data Scraper
Data Analyst
Python
Selenium

Project Overview

About the project

As we enter the post-COVID phase, people are less fearful of the virus and travel is picking up, which is increasing bookings for holiday accommodation at both hotels and Airbnb. As a result, numerous property owners are listing their properties on Airbnb.
But they don't know whether their property will be shown on the first page when people search for it...

The Key Question

Given a number of factors (reviews, price, location, etc.), can we predict the ranking of a property?

Approach

Data scraping
Model creation - multiclass classification
User interface creation

Data collecting

First, we use Selenium to collect data from the Airbnb website.
We collect properties from locations popular among travelers, such as the Bay Area (San Francisco), New York, Seattle and Boston.
Features collected include:
Utilities included in the property.
Data about the property's owner.
Customer review scores on different aspects: cleanliness, accessibility, etc.
Price of the property.
In the end, our data has 99 columns and 4350 rows.

Data cleaning

After collecting the data by web scraping, we found that 699 rows contained missing values, so we cleaned the data using the following techniques:
Column elimination: 87 inessential columns removed.
Null replacement: nulls replaced with the average of their column.
After data cleaning, our final dataset contains 4350 rows and 12 features.
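
The two cleaning steps can be illustrated with a minimal sketch. It uses only the Python standard library so it runs anywhere; the function name `clean_rows`, the column names and the sample rows are hypothetical and are not taken from the project's actual code or dataset. With pandas, the same steps correspond roughly to `DataFrame.drop` and `DataFrame.fillna`.

```python
from statistics import mean


def clean_rows(rows, keep_columns):
    """Keep only `keep_columns` and replace None with that column's mean."""
    # Column elimination: drop every field not listed in keep_columns.
    trimmed = [{col: row.get(col) for col in keep_columns} for row in rows]

    # Compute each column's mean over the values that are present.
    col_means = {}
    for col in keep_columns:
        present = [r[col] for r in trimmed if r[col] is not None]
        col_means[col] = mean(present) if present else None

    # Null replacement: fill each missing value with its column mean.
    for r in trimmed:
        for col in keep_columns:
            if r[col] is None:
                r[col] = col_means[col]
    return trimmed


# Hypothetical sample rows (the real dataset has 99 columns before cleaning).
listings = [
    {"price": 120.0, "cleanliness": 4.8, "host_name": "A"},
    {"price": None, "cleanliness": 4.2, "host_name": "B"},
    {"price": 180.0, "cleanliness": None, "host_name": "C"},
]
cleaned = clean_rows(listings, keep_columns=["price", "cleanliness"])
```

Because missing values are filled rather than dropped, the number of rows is unchanged, which matches the 4350 rows reported before and after cleaning.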

Data modeling

Initial intentions:

First, we used linear regression to try to predict the exact ranking number of each property, but the accuracy turned out to be only 12%. We also tried a decision tree, but the accuracy was only 2%. We think this is due to a lack of observations: since we wanted to predict the EXACT NUMBER of the ranking, there were only 4 observations for each ranking because we only scraped four locations.

Multiclass classification

We grouped our properties into five classes based on their ranking (top 20%, 20-40%, 40-60%, 60-80% and 80-100%) by creating dummy variables, so that we have enough observations for each class. The result shows that this approach works: the accuracy rose to 80%.
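
The grouping step can be sketched as follows. This is only an illustration under assumptions: it treats a ranking as a 1-based position among a known number of listings, and the names `rank_to_class` and `CLASS_LABELS` are hypothetical rather than the project's own. It also counts the observations per class, to show why five classes give the model more examples per target than exact rank numbers do.

```python
from collections import Counter

CLASS_LABELS = ["top 20%", "20-40%", "40-60%", "60-80%", "80-100%"]


def rank_to_class(rank, total):
    """Map a 1-based rank among `total` listings to one of five 20% bands."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    # Integer arithmetic: (rank - 1) * 5 // total yields a band index 0..4.
    band = (rank - 1) * len(CLASS_LABELS) // total
    return CLASS_LABELS[band]


# Hypothetical example: 10 listings ranked 1..10 in one location.
total_listings = 10
labels = [rank_to_class(r, total_listings) for r in range(1, total_listings + 1)]

# Each class now has several observations instead of one per exact rank.
counts = Counter(labels)
```

With exact ranks as the target, each rank value appears only once per scraped location; after banding, every class collects a fifth of all listings.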

User Interface Development

After we created the model, we were thinking: Why not make a simple GUI so that the user can just insert the features of their property and get the ranking result?
Thus, we used a Python script to develop a simple user interface where users key in their information; after they click submit, the data is passed to our multiclass classification model to predict the ranking of the property.

Conclusion

Although the accuracy of the model is high, we can still improve by:
Improving our web scraping to reduce the number of nulls in our data.
Trying more models, since more points of comparison are always better.
Improving the UI's appearance with a better visual design.
Giving users suggestions about which aspects they can improve to raise their ranking.