What affects AirBnB house prices and how well can we predict them?

7 min read · Jul 13, 2021

House prices are influenced by many factors, and there still isn’t a well-established method to predict them accurately. In this post, I’ll analyze some of the features related to the price of AirBnB houses in Seattle and Boston, and check whether the data available on these features can help determine prices with machine learning algorithms.

I’ll follow the CRISP-DM process, answering the following questions:

Are there months of the year in which the price of a house is higher or lower? Which months are these? Is there any difference between the two cities?
Are there any differences in average price between the different neighbourhoods and home-types in the two cities?
Are prices and ratings in the two cities correlated?
Can we predict house prices using machine learning with sufficient accuracy, based on available data?

I’ll be looking at the AirBnB data for Seattle and Boston. For both cities, data is structured in 3 datasets: calendar, listings, reviews.

The calendar dataset reports, for each listing, its availability and price on a given date. The listings dataset contains information about the accommodations, such as neighbourhood, amenities and number of bedrooms, while the reviews dataset reports all the reviews for each listing.

Part 1: When are prices higher and when are they lower?

This question can be answered with the information available from the calendar dataset. The average monthly price of each listing is calculated as the average of the daily price values when each listing is available.
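As a rough sketch of this step, the snippet below computes the average monthly price per listing with pandas. The column names (`listing_id`, `date`, `available`, `price`), the `t`/`f` availability flags and the `$1,000.00` price format are assumptions about the calendar file, the helper `monthly_average_price` is hypothetical, and the rows are made-up values for illustration only.

```python
import pandas as pd

# Tiny made-up stand-in for the calendar dataset (column names are assumed).
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2, 2],
    "date": ["2016-01-04", "2016-01-05", "2016-04-10", "2016-04-11",
             "2016-01-04", "2016-04-10", "2016-04-11"],
    "available": ["t", "f", "t", "t", "t", "t", "t"],
    "price": ["$80.00", None, "$120.00", "$100.00",
              "$1,000.00", "$50.00", "$70.00"],
})


def monthly_average_price(calendar: pd.DataFrame) -> pd.DataFrame:
    """Return a listing x month table of average daily prices on available days."""
    # Keep only the days on which the listing is available.
    df = calendar[calendar["available"] == "t"].copy()
    # Turn strings such as "$1,000.00" into floats.
    df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    df["month"] = pd.to_datetime(df["date"]).dt.month
    # Average the daily prices per listing and month, one column per month.
    return df.groupby(["listing_id", "month"])["price"].mean().unstack("month")


monthly = monthly_average_price(calendar)
print(monthly)
```

Days on which a listing is unavailable are dropped before averaging, so they don’t distort the monthly figure.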

From these monthly averages, I can then determine the months in which most listings reach their maximum or minimum price. The following figures show the distributions of the maximum and the minimum values for Boston and Seattle.

From the graphs above, it can be seen that for Boston (the two top graphs) the month in which most listings reach their maximum price is October, while the month in which most listings reach their minimum price is December. For Seattle, maximums fall in April and minimums in January.

The situation for Seattle is easy to interpret, with April being the most expensive as there are holidays like Spring break and Easter (available data refers to 2016), while for Boston it is more complex to read. September and October are the peaks of the distribution of the maximums, but there are a lot of minimums in these months as well.
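The distributions of maximums and minimums discussed in this part could be obtained along these lines. This is a minimal sketch that assumes a listing-by-month table of average prices; the `monthly` table below is hypothetical and filled with made-up numbers.

```python
import pandas as pd

# Hypothetical listing x month table of average prices
# (rows: listings, columns: month numbers).
monthly = pd.DataFrame(
    {1: [80.0, 60.0, 90.0], 4: [110.0, 95.0, 85.0], 10: [100.0, 70.0, 120.0]},
    index=pd.Index([101, 102, 103], name="listing_id"),
)

# Month in which each listing reaches its highest / lowest average price.
max_month = monthly.idxmax(axis=1)
min_month = monthly.idxmin(axis=1)

# Number of listings that peak (or bottom out) in each month.
max_counts = max_month.value_counts().sort_index()
min_counts = min_month.value_counts().sort_index()

print(max_counts)
print(min_counts)
```

Note that `idxmax` and `idxmin` return the first matching month when a listing has the same price in several months, so listings with flat prices would need separate handling.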

Part 2: How do prices differ by neighbourhood and home type?

To answer the second question, I’ll look into the listings dataset, taking into consideration the information about price, neighbourhood, and accommodation type (e.g. Entire Home/Apt., Private or Shared Room).

In particular, I want to calculate the average price by grouping houses by neighbourhood and room type. The figure below shows the difference in average price for different neighbourhoods and room types.

For both cities, entire homes/apartments are on average more expensive than private or shared rooms, as one would expect. However, the neighbourhoods with the highest average price for entire homes aren’t the ones with the highest average price for private or shared rooms.

In Seattle, for example, Magnolia is the neighbourhood with the highest average price for entire homes/apartments, while the most expensive private rooms are in Downtown and the most expensive shared rooms are in Cascade.

Moreover, one would expect private rooms to be more expensive than shared rooms. In the two cities, almost all neighbourhoods confirm this assumption, with some exceptions in Boston, where Fenway, South End, Jamaica Plain, Allston, West Roxbury and Brighton have available shared rooms that are more expensive than private ones on average.

These results suggest that neighbourhood and accommodation type aren’t good enough predictors of house prices on their own.
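The grouping used in this part can be sketched with a pandas pivot table. The column names (`neighbourhood`, `room_type`, `price`) are assumptions about the listings file, and the prices are made-up values chosen only to illustrate the pattern, not figures from the data.

```python
import pandas as pd

# Made-up stand-in for the listings dataset (column names are assumed).
listings = pd.DataFrame({
    "neighbourhood": ["Magnolia", "Magnolia", "Downtown", "Downtown", "Downtown"],
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200.0, 60.0, 150.0, 90.0, 110.0],
})

# Average price per neighbourhood (rows) and room type (columns).
avg_price = listings.pivot_table(
    index="neighbourhood", columns="room_type", values="price", aggfunc="mean"
)

# Most expensive neighbourhood for each room type.
top_neighbourhood = avg_price.idxmax()

print(avg_price)
print(top_neighbourhood)
```

In this toy table the most expensive neighbourhood differs between room types, which is the kind of mismatch described above.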

Part 3: Are prices and ratings correlated?

The third question is whether price and ratings are correlated. I’ll use the price and rating variables from the listings dataset.

To get a visual understanding of the correlation between these two variables, I’ll use a scatter plot, which is shown below.

It is possible to see from the plots above that prices and ratings aren’t much correlated, but we can say that for higher prices, ratings tend to be high, while for low prices ratings can be low or high regardless.

This means that, in general, high prices are an indication of good ratings, but we can’t use ratings alone to predict house prices.
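Alongside the scatter plot, the strength of the relationship can be put into a number. The sketch below is only an illustration: the column name `review_scores_rating` is an assumption about the listings file, and the values are made up to mimic the pattern described above (cheap listings with mixed ratings, expensive listings with high ratings).

```python
import pandas as pd

# Made-up prices and ratings (column names are assumed).
listings = pd.DataFrame({
    "price": [50, 60, 70, 80, 300, 350, 400],
    "review_scores_rating": [70, 98, 85, 100, 96, 97, 99],
})

# Pearson correlation: strength of the linear relationship.
pearson = listings["price"].corr(listings["review_scores_rating"])

# Rank-based (Spearman) correlation, computed as the Pearson correlation of the ranks.
ranks = listings.rank()
spearman = ranks["price"].corr(ranks["review_scores_rating"])

# Lowest rating among cheap and among expensive listings.
cheap = listings[listings["price"] < 100]
expensive = listings[listings["price"] >= 100]
print(pearson, spearman)
print(cheap["review_scores_rating"].min(), expensive["review_scores_rating"].min())
```

A coefficient well below 1 together with a high minimum rating among expensive listings matches the reading of the scatter plot: high prices go with good ratings, but ratings say little about price.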

Part 4: Predicting prices

We saw above that variables like neighbourhood, accommodation type, rating and even time of year aren’t reliable enough on their own to predict house prices. To get the best possible prediction, several variables need to be taken into account together.

In order to predict prices, it is first necessary to make the data ready for modelling. The dataset that I’ll use is the listings dataset. When it comes to predicting quantitative variables, like house prices, regression techniques are used.

Different types of regression methods can be used to predict price. In this analysis I’ll focus on Linear Regression and two other techniques: Random Forest and Extreme Gradient Boosting (XGBoost). These algorithms require the data to be split into train and test sets, with the first used to fit the model and tune its parameters, and the second to check the model’s accuracy.
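The split-fit-evaluate workflow can be sketched with scikit-learn as follows. This is not the code behind the results in this post: the data is synthetic, the 70/30 split and the model settings are arbitrary assumptions, no parameter tuning is shown, and scikit-learn’s `GradientBoostingRegressor` is used as a stand-in for XGBoost so that the example only depends on one library.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a deliberately non-linear "price" (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 50 + 20 * np.sin(X[:, 0]) + 5 * X[:, 1] ** 2 + rng.normal(0, 5, size=500)

# Hold out 30% of the rows to check accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_s, y_s in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_s)
        rmse = float(np.sqrt(mean_squared_error(y_s, pred)))
        r2 = float(r2_score(y_s, pred))
        results[(name, split)] = (rmse, r2)
        print(f"{name:18s} {split:5s} RMSE={rmse:7.2f} R2={r2:6.3f}")
```

Comparing the train and test rows for each model gives a first hint of overfitting: a model that is much better on the train set than on the test set has memorised part of the training data.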

To measure the quality of the predicted prices, I’ll look at two metrics: the root-mean-square error (RMSE) and R-squared (R²). The first is the square root of the average of the squared differences between predicted and observed values; the second measures how well the regression model fits the observed data, as the share of the variance in the observed values that the model explains.
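Written out with NumPy, the two metrics look as follows. The function names `rmse` and `r_squared` are hypothetical helpers, and the prices are made-up values for illustration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Square root of the mean squared difference between predictions and observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def r_squared(y_true, y_pred) -> float:
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


observed = [100.0, 150.0, 200.0]

perfect = [100.0, 150.0, 200.0]    # exact predictions
mean_only = [150.0, 150.0, 150.0]  # always predicts the average price
reversed_ = [200.0, 150.0, 100.0]  # worse than predicting the average

for label, pred in [("perfect", perfect), ("mean only", mean_only), ("reversed", reversed_)]:
    print(label, rmse(observed, pred), r_squared(observed, pred))
```

Always predicting the average price gives an R² of 0, so a negative R² means the model does worse than that baseline.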

RMSE ranges from 0 to +∞, while R² ranges from -∞ to 1. The smaller the RMSE and the closer R² is to 1, the better the model predicts prices. A summary of the results for each model is shown below.

RMSE

R²

For both cities, Linear Regression provides worse results than Random Forest and XGBoost. For Seattle in particular, Linear Regression’s RMSE and R² on the test sample are extremely far from the optimal values. Random Forest and XGBoost differ slightly in their results: the first is more accurate on the train set, while the second is more accurate on the test set.

Ultimately, Random Forest and XGBoost provide better results than Linear Regression because the latter assumes that price has a linear relationship with the rest of the data. However, when it comes to complex variables like house prices, the mathematical relationship with the other variables is very likely not linear.

Therefore, ensemble methods, like Random Forest and Extreme Gradient Boost, perform better in predicting house prices, as they are implemented to easily capture non-linear relationships between variables.

Does this imply that Random Forest and XGBoost are always better than Linear Regression?

The answer depends on the regression task. An advantage of Linear Regression is that it can extrapolate to new data outside the range seen in training, while the predictions of tree-based ensemble methods like these two can’t fall outside the range of the prices observed in the training data.

As a practical example, if we wanted to predict house prices from the square feet variable alone, Linear Regression would be the best method, as it is correct to assume that price will increase with square feet.

However, if an ensemble method were used in this case, the predictions on new data outside the training range would still fall within the range of the observed prices, no matter how large the square feet value, which of course would be incorrect.
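This behaviour is easy to reproduce on synthetic data. The sketch below assumes a made-up linear relationship between square feet and price and uses scikit-learn’s `LinearRegression` and `RandomForestRegressor`; the numbers are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic training data: 300 to 2000 square feet, price roughly 0.1 per square foot.
rng = np.random.default_rng(0)
sqft_train = rng.uniform(300, 2000, size=200).reshape(-1, 1)
price_train = 0.1 * sqft_train.ravel() + rng.normal(0, 5, size=200)

linear = LinearRegression().fit(sqft_train, price_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    sqft_train, price_train
)

# A house far larger than anything seen in training.
sqft_new = np.array([[5000.0]])
linear_pred = float(linear.predict(sqft_new)[0])
forest_pred = float(forest.predict(sqft_new)[0])

print("highest training price:", price_train.max())
print("linear regression     :", linear_pred)
print("random forest         :", forest_pred)
```

The forest’s prediction is an average of training prices stored in its leaves, so it can’t exceed the highest price it was trained on, while the linear model keeps following the fitted line.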

Conclusion

In this article, we looked at how house prices differ based on different factors, and how we can use several variables to try to predict house prices.

We looked at price values throughout the year for Boston and Seattle and found that there are certain months where house prices are on average higher (October for Boston, April for Seattle) or lower (December for Boston, January for Seattle).
We saw that price and neighbourhood aren’t reliably related on their own, as the most expensive neighbourhood for entire apartments often isn’t the most expensive one for private or shared rooms.
In general, price and ratings are not correlated, but with higher prices ratings tend to be higher, which means that higher prices indicate good ratings.
We applied machine learning algorithms to try to predict house prices based on a series of features. Among Linear Regression, Random Forest and Extreme Gradient Boosting, we saw that the latter two perform better in this case. In general, however, we can’t determine an algorithm that is optimal for every situation, as it needs to be chosen case by case.

And you, how would you predict house prices?

To see more about the analysis discussed in this post, check the GitHub repository here.