How to predict HDB resale prices using 3 different Machine learning models (Linear Regression, Neural Networks, K Nearest Neighbor)
In this tutorial, we aim to find the best model for predicting HDB resale prices using three different machine learning models. For the impatient person, K nearest neighbours achieved the best result. To try out the prediction accuracy for yourself, you may go to hdbpricer.com
This is part 1 of a 4-part tutorial
1. Building a good prediction model
2. Hosting the model prediction as an API endpoint on Flask
3. Building a simple VueJS frontend for users to price their HDBs
4. Deploying the entire full stack application to the internet
The Git repository for this notebook can be found here
My hdbpricer colab notebook
Colab is Google’s hosted notebook service that allows you to use their GPU / TPUs for free, try it out!
Background
Well, at some point in my life I will want to buy a house (an HDB resale flat, to be exact). I figured that if I could understand prices well based on data, it would benefit me greatly. If you’re looking for a great example of a good pricing model for real estate in Singapore, I would recommend SRX.
Some inspiration (and code) for the first two models were taken from this US Housing data prediction example
This is the Dataset for HDB Resale prices
Importing packages
%matplotlib inline
import matplotlib.pyplot as plt
import math
import tensorflow as tf
from collections import defaultdict
import numpy as np
from numpy import unique
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, TensorBoard
Reading data
Here, I hosted my own dataset (no modifications) instead of using data.gov.sg’s API. We then use pandas to read the file into a dataframe. As you can see, I prefer to work on a separate copy so I do not need to reload the file every time I alter the dataframe.
Use dataframe.head() to take a sneak peek at the dataframe.
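The loading code itself isn’t shown above, so here is a minimal sketch of what it could look like. The filename resale-flat-prices.csv is just a placeholder for wherever you host or store the dataset.
import pandas as pd
# read the raw CSV once, then work on a copy so the original never needs reloading
raw_dataframe = pd.read_csv('resale-flat-prices.csv')  # placeholder filename
dataframe = raw_dataframe.copy()
dataframe.head()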
I would then try to understand the data better by plotting some graphs — statistics always helps!
#visualizing house prices
fig = plt.figure(figsize=(10,7))
fig.add_subplot(2,1,1)
sns.distplot(dataframe['resale_price'])
plt.tight_layout()
#visualizing square metres
fig = plt.figure(figsize=(20,12))
fig.add_subplot(2,2,1)
sns.scatterplot(dataframe['floor_area_sqm'], dataframe['resale_price'])
#visualizing town, flat type, flat model, storey range
fig = plt.figure(figsize=(15,7))
fig.add_subplot(2,2,1)
sns.countplot(dataframe['town'])
plt.xticks(rotation=90)
fig.add_subplot(2,2,2)
sns.countplot(dataframe['flat_type'])
plt.xticks(rotation=90)
fig.add_subplot(2,2,3)
sns.countplot(dataframe['flat_model'])
plt.xticks(rotation=90)
fig.add_subplot(2,2,4)
sns.countplot(dataframe['storey_range'])
plt.xticks(rotation=90)
plt.tight_layout()
Preprocess data
After getting a rough understanding of the data we are working with, we preprocess it to help us determine the input to our model.
#let's break date into years and months
dataframe['date'] = pd.to_datetime(dataframe['month'])
dataframe['month'] = dataframe['date'].apply(lambda date: date.month)
dataframe['year'] = dataframe['date'].apply(lambda date: date.year)
#get the number of years left on the lease as a continuous number (ignoring months)
dataframe['remaining_lease'] = dataframe['remaining_lease'].apply(lambda remaining_lease: int(remaining_lease[:2]))
#get the storey range as a continuous number (lower bound of the range)
dataframe['storey_range'] = dataframe['storey_range'].apply(lambda storey_range: int(storey_range[:2]))
#data visualization: house price vs months and years
fig = plt.figure(figsize=(16,5))
fig.add_subplot(1,2,1)
dataframe.groupby('month').mean()['resale_price'].plot()
fig.add_subplot(1,2,2)
dataframe.groupby('year').mean()['resale_price'].plot()
Geocoding
I would also like to geocode the addresses. However, since geocoders usually limit the number of addresses you can geocode for free, I decided to geocode the towns instead.
I used Geopy and GeoPandas as shown here: https://towardsdatascience.com/geocode-with-python-161ec1e62b89
#concat address
dataframe['address'] = dataframe['block'].map(str) + ', ' + dataframe['street_name'].map(str) + ', Singapore'

import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
import folium
from folium.plugins import FastMarkerCluster
from geopy.extra.rate_limiter import RateLimiter

#geocode by town (Singapore is so small that geocoding by address might not make much difference compared to geocoding by town)
town = [x for x in dataframe['town'].unique().tolist() if type(x) == str]
latitude = []
longitude = []
# geocode each town name with Nominatim
for i in range(0, len(town)):
    try:
        geolocator = Nominatim(user_agent="ny_explorer")
        loc = geolocator.geocode(town[i])
        latitude.append(loc.latitude)
        longitude.append(loc.longitude)
        print('The geographical coordinates of the location are {}, {}.'.format(loc.latitude, loc.longitude))
    except:
        # if the geolocator fails, append NaN so the lists keep the right size
        latitude.append(np.nan)
        longitude.append(np.nan)

# create a dataframe with the town, latitude and longitude
df_ = pd.DataFrame({'town': town,
                    'latitude': latitude,
                    'longitude': longitude})
# merge the coordinates back into the main dataframe on town
dataframe = dataframe.merge(df_, on='town', how='left')
If you would like to view the latitude/longitude values visually, you can opt for folium.
folium_map = folium.Map(location=[1.3521,103.8198],
zoom_start=12,
tiles='CartoDB dark_matter')
FastMarkerCluster(data=list(zip(dataframe['latitude'].values, dataframe['longitude'].values))).add_to(folium_map)
folium.LayerControl().add_to(folium_map)
folium_map
# check if there are any Null values
dataframe.isnull().sum()
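If any town failed to geocode, the counts above will be non-zero, and the NaN coordinates would trip up the models later. A simple way to handle this (my own addition, not in the original notebook) is to drop those rows:
# drop rows whose town could not be geocoded
dataframe = dataframe.dropna(subset=['latitude', 'longitude'])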
Encoding
I usually use label encoding or one-hot encoding, but this time I chose a more bare-metal approach (hard-coded dictionaries) that I also need for my webapp. This is not recommended if your data may contain categories you have not seen before.
townDict = {
    'ANG MO KIO': 1, 'BEDOK': 2, 'BISHAN': 3, 'BUKIT BATOK': 4, 'BUKIT MERAH': 5,
    'BUKIT PANJANG': 6, 'BUKIT TIMAH': 7, 'CENTRAL AREA': 8, 'CHOA CHU KANG': 9,
    'CLEMENTI': 10, 'GEYLANG': 11, 'HOUGANG': 12, 'JURONG EAST': 13, 'JURONG WEST': 14,
    'KALLANG/WHAMPOA': 15, 'MARINE PARADE': 16, 'PASIR RIS': 17, 'PUNGGOL': 18,
    'QUEENSTOWN': 19, 'SEMBAWANG': 20, 'SENGKANG': 21, 'SERANGOON': 22, 'TAMPINES': 23,
    'TOA PAYOH': 24, 'WOODLANDS': 25, 'YISHUN': 26,
}
flat_typeDict = {
    '1 ROOM': 1, '2 ROOM': 2, '3 ROOM': 3, '4 ROOM': 4,
    '5 ROOM': 5, 'EXECUTIVE': 6, 'MULTI-GENERATION': 7,
}
dataframe['town'] = dataframe['town'].replace(townDict, regex=True)
dataframe['flat_type'] = dataframe['flat_type'].replace(flat_typeDict, regex=True)
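Since the dictionaries are hard-coded, any town or flat type that is not listed would slip through unencoded. If you are worried about that, a small defensive sketch (my own suggestion, not part of the original notebook) is to map with an explicit fallback value:
# hypothetical helper: encode a column, returning `fallback` for unknown categories
def encode_with_fallback(series, mapping, fallback=-1):
    return series.map(lambda value: mapping.get(value, fallback))
# usage (as an alternative to the replace() calls above):
# dataframe['town'] = encode_with_fallback(dataframe['town'], townDict)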
Understanding correlations
I did not want to use all the columns of the tabular data, and instead wanted to find out what affected the prices most.
Side note: Do not remove too many columns as this might cause bias in your predictions.
#correlation matrix
corrmat = dataframe.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
To get more insight into what correlates most with resale prices, we do the following:
#resale_price correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'resale_price')['resale_price'].index
cm = np.corrcoef(dataframe[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
Drop columns
Drop the columns with poor correlations. As mentioned previously, we will try not to drop too many columns, as this might induce bias.
# drop some unnecessary columns
dataframe = dataframe.drop(['date', 'block', 'month', 'street_name',
                            'address', 'flat_model', 'year', 'remaining_lease'], axis=1)
This is what we will be working with for all 3 models.
Train — test split
Splitting the data into training and test sets. The y values in this case will be the resale prices.
X = dataframe.drop('resale_price',axis =1).values
y = dataframe['resale_price'].values
#splitting train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Scaling the data
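The scaling code itself did not make it into this post, so here is a minimal sketch of what this step could look like, assuming the MinMaxScaler imported earlier is fitted on the training set only and then applied to both sets:
# scale features into the [0, 1] range; fit on the training set only to avoid leakage
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)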
After all the data preparation, we will now move into the 3 models we are working with.
Model 1: Linear Regression
# Multiple Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#evaluate the model (intercept and slope)
print(regressor.intercept_)
print(regressor.coef_)
#predicting the test set result
y_pred = regressor.predict(X_test)
#put results as a DataFrame
coeff_df = pd.DataFrame(regressor.coef_, dataframe.drop('resale_price',axis =1).columns, columns=['Coefficient'])
#compare actual output values with predicted values
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
We will then evaluate the algorithm
# evaluate the performance of the algorithm (MAE - MSE - RMSE)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:',metrics.explained_variance_score(y_test,y_pred))
MAE: 72189.73698345898
MSE: 9413246271.327522
RMSE: 97021.88552758352
VarScore: 0.5966611367993274
As you can see, the RMSE is really high. I wouldn’t want my predictions to be off by almost $100,000 when prices average about $500,000.
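To put that error into context, a quick one-liner (not in the original notebook) shows the average resale price in the dataset:
# average resale price, for comparison against the RMSE
print(dataframe['resale_price'].mean())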
Model 2: Keras neural network
# Creating a Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.optimizers import Adam
# having 9 neurons is based on the number of available features
model = Sequential()
model.add(Dense(9,activation='relu'))
model.add(Dense(18,activation='relu'))
model.add(Dense(18,activation='relu'))
model.add(Dense(9,activation='relu'))
model.add(Dense(1))
model.compile(optimizer='Adam',loss='mse')
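The training call itself is missing from this write-up; a minimal sketch of what it could look like is below. The batch size, number of epochs, and use of the test set for validation are my own assumptions, not the original settings.
# train on the (scaled) training data; hyperparameters here are illustrative only
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test),
          batch_size=128, epochs=400)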
Time to predict and evaluate
y_pred = model.predict(X_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:',metrics.explained_variance_score(y_test,y_pred))
# Visualizing Our predictions
fig = plt.figure(figsize=(10,5))
plt.scatter(y_test,y_pred)
# Perfect predictions
plt.plot(y_test,y_test,'r')
MAE: 54379.245910929014
MSE: 5658879298.306035
RMSE: 75225.52291812956
VarScore: 0.7578333837833753
Well, an RMSE of 75,000 is definitely better than the simple linear regression. However, I am looking to be as accurate as possible when it comes to predictions, and the neural network model does not seem to fit the bill. Let’s also take a look at the predictions below.
y_pred_2 = []
for pred in y_pred:
    y_pred_2.append(pred[0])

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_2})
df1 = df.head(20)
df1
Not the best. Looking around online, K nearest neighbours seems to be a better fit for tabular data such as housing records, so let’s try that next.
Model 3: K nearest neighbours
We will use K nearest neighbours to try to improve our pricing accuracy. We use scikit-learn, with reference to this example: https://www.dataquest.io/blog/machine-learning-tutorial/
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(algorithm='brute')
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
rmse = mse ** (1/2)
print(rmse)
Here, we have an RMSE of 46496, which is far better than the previous models.
df = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
df1 = df.head(20)
df1
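If you want to squeeze a little more accuracy out of KNN, one option (my own suggestion, not part of the original notebook) is to tune the number of neighbours with cross-validation:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
# try a few values of n_neighbors; this grid is illustrative only
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsRegressor(algorithm='brute'),
                      param_grid,
                      scoring='neg_root_mean_squared_error',
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)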
Conclusion
K nearest neighbours gives the best predictions with a relatively low RMSE. The neural network comes in second, while linear regression comes in last.
We can definitely improve the models, for example by doing more with geolocation, getting more data, or adding features such as the renovation level of these flats. Once the input data gets better, I believe the K nearest neighbours model will yield an even better RMSE.
Stay tuned for my next blogpost — to see how this model went from a jupyter notebook into www.hdbpricer.com
Originally published at http://royleekiat.com on October 22, 2020.