
Data Mining Project -
Airbnb Price Prediction

project
goal

Ask the right questions to identify the business problem and define the project goal

data processing

Data cleaning, feature engineering, and exploratory data analysis are performed so that the data performs better when used in the project

Machine learning

Building, stacking, and validating models for improved predictions

making
Data-driven decisions

Make better decisions that help hosts become the best hosts

The project aims to develop a predictive model for the daily prices of Airbnb rentals using state-of-the-art techniques from statistical learning.

 

The predictions can help hosts make better decisions, become the best hosts, and increase their profit.

PROJECT GOAL

- FEATURE ENGINEERING PLAN

Step 1. Extracting and dropping features

Features that are meaningless, dominated by a single value, high in missing values, or highly collinear are dropped.
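The dropping rules in Step 1 can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function name, the three thresholds, and the toy column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_uninformative(df, max_missing=0.5, max_dominant=0.95, max_corr=0.9):
    """Drop columns that are mostly missing, near-constant, or highly collinear.

    All three thresholds are illustrative assumptions, not the project's values.
    """
    # 1. Columns whose share of missing values exceeds max_missing.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= max_missing]

    # 2. Columns dominated by a single value (near-constant).
    dominant_share = df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )
    df = df.loc[:, dominant_share <= max_dominant]

    # 3. For each highly correlated numeric pair, drop the later column.
    corr = df.select_dtypes(include="number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    collinear = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return df.drop(columns=collinear)


# Toy data with made-up column names, for illustration only.
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 2, 1],
    "beds": [1, 2, 3, 4, 2, 1],                        # duplicates "bedrooms"
    "country": ["AU"] * 6,                             # single value
    "license": [None, None, None, None, None, "x"],    # mostly missing
    "bathrooms": [1, 1, 2, 1, 2, 1],
})
kept = drop_uninformative(toy)
print(list(kept.columns))
```

On the toy data, only `bedrooms` and `bathrooms` survive. Deciding which features are "meaningless" still needs domain judgment and is not covered by this sketch.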

Step 2. Handling Missing Values

Missing values in dummy variables are treated as '0' to indicate the absence of a categorical effect that might be expected to shift the outcome. For predictors where a value of '0' is rare and does not fit domain knowledge, missing values are filled with the mean.
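A minimal pandas sketch of the two imputation rules is shown below. The column names `has_wifi` and `bedrooms` are hypothetical and chosen only to illustrate a dummy variable and a numeric predictor.

```python
import numpy as np
import pandas as pd

# Toy listings with hypothetical column names.
listings = pd.DataFrame({
    "has_wifi": [1.0, np.nan, 1.0, np.nan],   # dummy variable
    "bedrooms": [2.0, np.nan, 1.0, 3.0],      # numeric predictor
})

dummy_cols = ["has_wifi"]
numeric_cols = ["bedrooms"]

# Missing dummies become 0: the categorical effect is treated as absent.
listings[dummy_cols] = listings[dummy_cols].fillna(0)

# Numeric predictors where 0 would be implausible are filled with the column mean.
listings[numeric_cols] = listings[numeric_cols].fillna(listings[numeric_cols].mean())

print(listings)
```

When a separate validation set is used, the means would normally be computed on the training data and reused for the validation data, so that no information leaks between the two.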

Step 3. Data Encoding

 

- EXPLORATORY DATA ANALYSIS

The project's data are split into two files, a training dataset and a second dataset for validation and evaluation.

The training dataset includes 83 variables, such as review score, amenities, location, and room, and 10636 rows, each corresponding to a separate Airbnb listing in Sydney.

DATA PROCESSING

 

Ten models were built to predict Airbnb listing prices, including OLS, Ridge Regression, Lasso Regression, Elastic Net, Decision Tree, Bagging, Random Forest, Gradient Boosting, and XGBoost models.
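As an illustration of how base models can be stacked, the sketch below combines a Lasso model and a gradient boosting model with scikit-learn's `StackingRegressor`. The choice of base models, the linear meta-learner, the hyperparameters, and the synthetic data are assumptions for the example and are not taken from the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the Airbnb features and prices.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models produce out-of-fold predictions (5 folds); a linear
# meta-learner then combines those predictions into the final estimate.
stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=1.0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
print(predictions[:5])
```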

The root mean squared error (RMSE) is used to measure the performance of the models. Specifically, the Lasso model performs best among the linear models, and Gradient Boosting performs best among the non-parametric and stacking models.
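For reference, RMSE is the square root of the mean squared difference between actual and predicted prices. A small NumPy sketch, with made-up prices, looks like this:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up nightly prices, for illustration only.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 200.0]
print(round(rmse(actual, predicted), 3))  # errors of -10, 10, 0 give about 8.165
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and a lower value indicates a better fit.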


We selected 78 variables to fit the lasso model. The lasso model determined 9 of them to be ineffective, so only 69 features remain. The figure illustrates the 20 features with the largest absolute coefficient values. The number of cross-validation folds was set to 5 to guard against overfitting.
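The lasso step can be sketched with scikit-learn's `LassoCV`, which picks the regularisation strength by cross-validation. The data here is synthetic (78 columns to mirror the text), so the number of features kept will differ from the project's 69; the sample size, noise level, and random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 78 candidate features.
X, y = make_regression(
    n_samples=300, n_features=78, n_informative=15, noise=10.0, random_state=0
)
# Standardise so that coefficient magnitudes are comparable across features.
X = StandardScaler().fit_transform(X)

# 5-fold cross-validation chooses the regularisation strength alpha.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Features with a non-zero coefficient are the ones the lasso keeps.
kept = np.flatnonzero(lasso.coef_)
# Indices of the 20 coefficients with the largest absolute value.
top20 = np.argsort(np.abs(lasso.coef_))[::-1][:20]

print(f"{kept.size} of {X.shape[1]} features kept")
```

Features whose coefficient is shrunk exactly to zero are the ones the model treats as ineffective.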

The figure represents the features that have significant impacts on training the gradient boosting model. In other words, these features reduced error the most during the training process.
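A ranking of this kind can be obtained from scikit-learn's `GradientBoostingRegressor` through its `feature_importances_` attribute, which measures how much each feature reduced the training error across all splits. The sketch below uses synthetic data and assumed hyperparameters, and it is not confirmed to be the library or settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: 10 features, of which 3 actually drive the target.
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalised to sum to 1.
importances = gbr.feature_importances_
# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]

for idx in ranking[:3]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```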

MACHINE LEARNING

Our study of the Sydney Airbnb dataset found that the number of bedrooms, bathrooms, and guests accommodated positively influences the listing price: higher housing capacity raises the listing price. In addition, a fully equipped kitchen, wifi, air conditioning, free parking, and a pool are the most important amenities affecting the price of an Airbnb listing.

Airbnb listings in Manly, Pittwater, Mosman, and the area near the Gold Coast are especially expensive. These relatively expensive listings have features in common.

 

Those places are where tourists and other visitors are present year-round. Such areas have a consistent and strong demand for Airbnb rental properties and are ideal for investment.

Online customer reviews are of key importance to the Airbnb economy. The number and score of reviews have a strong impact on hosts' reputations. In particular, charging cleaning fees has a severe impact on Airbnb listings. To provide the best service, we suggest that hosts respond to guests as soon as possible, choose a flexible cancellation and charging policy, and boost their reviews.

DATA-DRIVEN DECISION