Data Mining Project -
Airbnb Price Prediction
The project aims to develop a predictive model for the daily prices of Airbnb rentals using state-of-the-art statistical learning techniques.
The predictions can help hosts make better decisions, become better hosts, and increase their profit.

- FEATURE ENGINEERING PLAN
Step 1. Extracting and Dropping Features
Features are dropped if they carry no useful information, are dominated by a single value, have a high share of missing values, or are highly collinear with other features.
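The three measurable filters in Step 1 can be sketched as follows. This is a minimal illustration, not the project's actual code: the thresholds (0.5, 0.95, 0.9), the function names, and the toy column names are all assumptions. Features that are simply meaningless for pricing (for example IDs or URLs) would still be dropped by hand.

```python
import math

def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def dominant_share(values):
    """Fraction of non-missing entries taken by the most common value."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    counts = {}
    for v in present:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(present)

def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_features(columns, max_missing=0.5, max_dominant=0.95, max_corr=0.9):
    """Return the names of the columns that survive the three filters.

    `columns` maps a feature name to a list of numeric values (None = missing).
    All three thresholds are illustrative assumptions.
    """
    kept = []
    for name, values in columns.items():
        if missing_share(values) > max_missing:
            continue  # too many missing values
        if dominant_share(values) > max_dominant:
            continue  # almost always the same value
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # highly collinear with a feature already kept
        kept.append(name)
    return kept

# Toy data with hypothetical column names.
toy = {
    "bedrooms":     [1, 2, 3, 4, 2, 1],
    "beds":         [1, 2, 3, 4, 2, 1],                  # duplicates bedrooms
    "has_license":  [None, None, None, None, 1, None],   # mostly missing
    "country_code": [1, 1, 1, 1, 1, 1],                  # a single value
    "bathrooms":    [1, 1, 2, 1, 2, 1],
}
kept = select_features(toy)
print(kept)
```

Which of two collinear features is kept depends on column order here; in practice that choice is usually made with domain knowledge.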
Step 2. Handling Missing Values
Missing values of dummy variables are treated as '0', indicating the absence of a categorical effect that might be expected to shift the outcome. For predictors where a value of '0' is rare and does not make sense given domain knowledge, the value is filled with the mean.
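One possible reading of the two rules in Step 2 is sketched below. The helper names and the example columns are hypothetical, and treating an implausible zero as missing before taking the mean is an assumption about how the rule was applied.

```python
def fill_dummy(values):
    """Treat a missing dummy (0/1) value as 0, i.e. the effect is absent."""
    return [0 if v is None else v for v in values]

def fill_with_mean(values, zero_is_missing=False):
    """Replace missing entries with the mean of the valid entries.

    With zero_is_missing=True, zeros are also treated as missing, which
    suits predictors where a zero contradicts domain knowledge.
    """
    def is_missing(v):
        return v is None or (zero_is_missing and v == 0)

    valid = [v for v in values if not is_missing(v)]
    if not valid:
        return list(values)  # nothing to estimate the mean from
    mean = sum(valid) / len(valid)
    return [mean if is_missing(v) else v for v in values]

# Hypothetical columns: an amenity dummy and a bed count.
has_wifi = fill_dummy([1, None, 0, 1])
beds = fill_with_mean([2, 0, 4, None, 3], zero_is_missing=True)
print(has_wifi)
print(beds)
```

To avoid leaking information, the mean would normally be computed on the training dataset only and then reused for the second dataset.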
Step 3. Data Encoding
- EXPLORATORY DATA ANALYSIS
The project's data are split into two files: a training dataset and a second dataset for validation and evaluation.
The training dataset includes 83 variables, such as review score, amenities, location, and room, and 10636 rows, each corresponding to a separate Airbnb listing in Sydney.
Ten models were built to predict Airbnb listing prices, including OLS, Ridge regression, Lasso regression, Elastic Net, Decision Tree, Bagging, Random Forest, Gradient Boosting, and XGBoost.
The root mean squared error (RMSE) is used to measure model performance. Specifically, the Lasso model performs best among the linear models, and Gradient Boosting performs best among the non-parametric and stacking models.
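For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the price. A minimal sketch, with made-up prices that are not results from this project:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]
score = rmse(actual, predicted)
print(round(score, 2))
```

Because the errors are squared before averaging, a single badly mispriced listing weighs more heavily than several small misses.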


We selected 78 variables to fit the lasso model. The lasso identified 9 of them as ineffective, so only 69 features remain. The figure illustrates the 20 features with the largest absolute coefficient values. The cross-validation hyperparameter was set to 5 to guard against overfitting.
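The way the lasso removes features can be illustrated with a small coordinate descent sketch. It is not the project's implementation: the penalty `alpha = 0.5`, the toy data, and the feature names `f1` to `f3` are assumptions, the data are assumed centred (no intercept), and in practice the penalty would be chosen by cross-validation. The objective used is (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_cd(X, y, alpha, n_iter=100):
    """Lasso coefficients by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

# Toy data: y = 3*f1 - 1.5*f2 + 0.1*f3, with mutually orthogonal columns.
names = ["f1", "f2", "f3"]
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1.6, 4.4, -4.6, -1.4]

w = lasso_cd(X, y, alpha=0.5)
remaining = [name for name, coef in zip(names, w) if coef != 0.0]
ranked = sorted(zip(names, w), key=lambda pair: abs(pair[1]), reverse=True)
print(remaining)
print(ranked)
```

The weak feature `f3` receives a coefficient of exactly zero, which is the sense in which the lasso marks features as ineffective. Sorting by absolute coefficient gives the kind of ranking shown in the figure, provided the features are on a comparable scale.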
The figure shows the features that have the most significant impact on training the gradient boosting model. In other words, these features reduced the error the most during the training process.
Our study of the Sydney Airbnb dataset found that the number of bedrooms, the number of bathrooms, and the number of guests accommodated positively influence the listing price: higher housing capacity raises the listing price. In addition, a fully equipped kitchen, wifi, air conditioning, free parking, and a pool are the most important amenities affecting the price of an Airbnb listing.
Listings in Manly, Pittwater, Mosman and the area near to Gold Coast are especially expensive. These relatively expensive listings have features in common.
Those places are where tourists and other visitors are present year-round. Such areas have consistent and strong demand for Airbnb rental properties and are ideal for investment.
Online customer reviews are of key importance to the Airbnb economy. The number and score of reviews have a strong impact on hosts' reputations. In particular, charging a cleaning fee has a severe impact on an Airbnb listing. To provide the best service, we suggest that hosts respond to guests as soon as possible, choose a flexible cancellation and charging policy, and work to improve their reviews.