This repository contains Jupyter notebooks that predict apartment prices in Mexico City from size, location (latitude and longitude), and neighborhood. The notebooks use linear regression and ridge regression models.
Predicting Apartment Price with Size
The predict_price_with_size notebook first explores the relationship between apartment prices and size using descriptive statistics and visualizations. It then splits the data into a training set and a test set. The linear regression model is trained on the training set and evaluated on the test set.
The results show that the linear regression model predicts apartment prices with a mean absolute error of 1,100 USD. The fitted model also indicates that larger apartments tend to have higher prices.
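The split, fit, and evaluate workflow described above can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the data is synthetic, and the variable names (`size_m2`, `price_usd`) and the 80/20 split are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the repository's dataset):
# price grows linearly with size, plus random noise.
rng = np.random.default_rng(42)
size_m2 = rng.uniform(30, 150, size=200)
price_usd = 10_000 + 1_500 * size_m2 + rng.normal(0, 5_000, size=200)

X = size_m2.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = price_usd

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only, then score on the unseen test set.
model = LinearRegression()
model.fit(X_train, y_train)

mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(f"Test MAE: {mae_test:,.2f} USD")
```

A positive `model.coef_[0]` corresponds to the finding that larger apartments tend to cost more.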
The notebook also includes a section on communicating the results of the analysis. This section includes the model equation, the model intercept and coefficient, and a visualization of the model.
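As a rough sketch of how the intercept and coefficient can be turned into a readable equation, the example below fits a model on a tiny made-up dataset and formats the result. The data values and the names `apt_price` and `surface_covered` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset that follows price = 1000 + 2000 * size exactly,
# so the fitted parameters are easy to check by eye.
X_train = np.array([[40.0], [60.0], [80.0], [100.0]])
y_train = 1_000 + 2_000 * X_train[:, 0]

model = LinearRegression().fit(X_train, y_train)

# A fitted LinearRegression exposes its parameters as attributes.
intercept = round(float(model.intercept_), 2)
coefficient = round(float(model.coef_[0]), 2)

# Hypothetical variable names used only for display.
equation = f"apt_price = {intercept} + {coefficient} * surface_covered"
print(equation)
```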
Predicting Apartment Price with Location (Longitude and Latitude)
The predict_price_with_location notebook contains code and analysis for predicting apartment prices in Mexico City using location data.
Data
The notebook uses the following fields:
Price of apartment in USD
Location: latitude and longitude coordinates
Analysis
The following analyses are included in the notebook:
Exploratory data analysis through data visualization
Training a baseline mean price model
Building a regression pipeline that combines imputation with a linear regression model
Evaluating model performance on training and test sets
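The baseline step in the list above can be illustrated without any modeling library. The sketch below uses made-up prices, not values from the dataset, to show what a "mean price" baseline is and how its mean absolute error is computed.

```python
from statistics import mean

# Illustrative training prices in USD (made-up values, not from the dataset).
y_train = [50_000, 70_000, 90_000, 110_000, 130_000]

# The baseline "model" ignores all features and always predicts the mean price.
y_mean = mean(y_train)
y_pred_baseline = [y_mean] * len(y_train)

# Mean absolute error: the average size of the prediction errors.
mae_baseline = mean(
    abs(actual - predicted)
    for actual, predicted in zip(y_train, y_pred_baseline)
)

print(f"Mean price: {y_mean:,.0f} USD")
print(f"Baseline MAE: {mae_baseline:,.0f} USD")
```

Any trained model should beat this number to be considered useful; a model whose MAE is close to the baseline has learned little from its features.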
Model
A linear regression pipeline is implemented with the following steps:
Impute missing values using mean imputation
Fit a linear regression model to predict price based on latitude and longitude
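The two pipeline steps above could look roughly like this in scikit-learn. This is a sketch under assumptions: the column names `lon` and `lat` and all data values are invented for the example, and the notebook's own code may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up rows for illustration; note the missing coordinates (NaN).
X_train = pd.DataFrame(
    {
        "lon": [-99.17, -99.13, np.nan, -99.20, -99.15],
        "lat": [19.40, 19.43, 19.36, np.nan, 19.39],
    }
)
y_train = pd.Series([95_000, 120_000, 80_000, 150_000, 105_000])

# Step 1 fills each missing value with its column mean;
# step 2 fits a linear regression on the completed features.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    LinearRegression(),
)
model.fit(X_train, y_train)

# The fitted pipeline also handles missing values at prediction time.
X_new = pd.DataFrame({"lon": [-99.16], "lat": [np.nan]})
pred = model.predict(X_new)
print(pred)
```

Wrapping both steps in one pipeline means the imputation means are learned from the training data only and reused unchanged on new data.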
Results
Latitude and longitude alone turn out to be weak predictors of apartment price. The model performs about as well as a baseline that always predicts the mean price.
Predicting Apartment Prices with Neighborhood
The predict_price_with_neighborhood notebook builds a model to predict apartment prices in Mexico City based on the neighborhood (borough).
Data
The data comes from a CSV file with the following features:
borough - the neighborhood or borough in Mexico City
price_aprox_usd - the apartment price in USD
Approach
The steps taken in the notebook are:
Import and explore the data
Split the data into the feature and the target
Create a baseline prediction using the mean price
One-hot encode the categorical borough feature
Build a linear regression model pipeline
Evaluate on the training data
Predict on the test data
Extract model coefficients and feature importances
Switch to Ridge regression to reduce overfitting
Extract Ridge model coefficients and feature importances
Visualize Ridge feature importances
Key Findings
The linear model reduces the training MAE to around
Ridge regularization further reduces overfitting
The most important features are boroughs such as San Ángel, Del Valle Centro, and Escandón.
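To illustrate why ridge regularization tends to curb overfitting, the sketch below fits ordinary least squares and ridge regression on the same synthetic data and compares the size of their coefficient vectors. The data, the feature count, and `alpha=10.0` are arbitrary choices for the example and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with few rows and many features, a setting prone to overfitting.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
# Only the first feature truly matters; the rest is noise.
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength

# Ridge penalizes large coefficients, so its coefficient vector is shorter.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
```

Smaller coefficients make the model less sensitive to quirks of the training sample, which is the mechanism behind the reduced overfitting.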
Future Work
Some ways the model could be improved:
Add more features, such as size, number of bedrooms, and other amenities
Try different regularization techniques
Try ensemble methods such as random forests