Predicting House Prices using a Deep Neural Network: Case with the Boston Dataset
Introduction
Although my background is in biological sciences and genetic research, with a couple of publications to my name, I have been working in the data field at the mortgage company Freddie Mac for over 5 years now. I am very excited to combine these two sides of my experience in this series of articles on analyzing mortgage data using data science techniques in Python, with an emphasis on neural networks.
We will be focusing on the publicly available Boston housing dataset, which can be loaded from the scikit-learn library and comes with descriptions of the different fields. The objective is not only to build a neural network that can predict house prices with decent accuracy, but also to understand and explain how the different variables and features affect house pricing, which will be critical for your business partners.
Let’s get started!
Exploratory Data Analysis
Let’s start by loading our dataset and performing Exploratory Data Analysis (EDA) in order to familiarize ourselves with the data. Please read the documentation from the link in the introduction section to better understand what each field means.
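If you would like to follow along, here is a minimal loading sketch. I am assuming the classic load_boston loader (removed in scikit-learn 1.2, so you may need an older version or fetch_openml); the friendlier column names and the dollar rescaling below are my own conventions, chosen to match the figures quoted later in this article.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

# Load the Boston housing data into a single DataFrame
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# Rename a few fields to the friendlier names used throughout this article
df = df.rename(columns={'CRIM': 'Crime Rate',
                        'RM': '# of Rooms',
                        'NOX': 'N.O. Concentration',
                        'LSTAT': '% Lower Income',
                        'MEDV': 'Median Home Value'})

# The raw MEDV values are in thousands of 1970s dollars; multiplying by 10,000
# (an assumption on my part) reproduces the dollar figures quoted later on
df['Median Home Value'] = df['Median Home Value'] * 10_000

df.head()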
A great way to start exploring the relationship among your variables is to use seaborn’s heatmap to look at the correlation between the different variables. The correlation matrix will show you the degree to which a pair of variables are linearly related.
sns.heatmap(df.corr(), linewidths=1.5, linecolor='white', cmap='coolwarm', annot=True)
You can then use the correlation matrix to explore in more detail the variables that look highly correlated with our target variable. For example, our target, the median home value, is most highly correlated with “# of Rooms” and “% Lower Income”, with correlations of 0.7 and -0.74, respectively. Here, I picked these 2 correlated features and 2 additional features of my choice to satisfy my own curiosity (as a scientist, it’s important to explore your hunches!). I then create a feature-wise customizable pair plot called a PairGrid, which allows me to customize how each feature is plotted against the others, as highlighted below.
g = sns.PairGrid(df[['Crime Rate', '# of Rooms', 'N.O. Concentration', '% Lower Income']])
g.map_upper(plt.scatter, color="m")
g.map_lower(sns.kdeplot, cmap="Set2")
g.map_diag(sns.distplot);
From the scatterplots at the top right and the kernel density estimate (KDE) plots at the bottom left, we can get a better idea of the relationships between the variables. For example, you can clearly see the concave, inverse relationship between “Median Home Value” and “% Lower Income”, which tells us that the higher the percentage of lower-income households in a neighborhood, the lower the median home value (and vice versa).
For the purposes of this article, I will now jump to feature preparation for our deep neural network. However, I highly recommend exploring the data further so that the results of your network make more sense to you and you can better interpret the impact and effect of each feature on your network.
Feature Preparation
Thankfully for us, the Boston Dataset does not contain null values, which you can verify by performing the following operation:
df.isnull().sum()
In our case, this returns 0 across the board, meaning there are no null values in our dataset. Had there been nulls, we would have had to perform value imputation, which we won’t cover here but which you can read more about in this article. Generally speaking, however, in real life you will almost always encounter nulls.
Now let’s scale our data with scikit-learn’s MinMaxScaler, which we will first fit on the training dataset X_train and then apply to the test dataset X_test. This step is important when creating neural networks because it generally speeds up learning and leads to faster convergence. Please note that you could also use TensorFlow’s normalization layer, but I am going with a min-max scaler for this model.
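As a rough sketch of the split and scaling step (the target column name, the 70/30 split ratio, and the random seed are my own choices):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Separate the features from the target
X = df.drop('Median Home Value', axis=1)
y = df['Median Home Value']

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)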
Creating and Training our Neural Network
Alright, now that our data is scaled and ready to go, let’s create our neural network! We will be creating a Deep Neural Network using the Keras API. Remember, the only difference between a “regular” Neural Network and a “Deep” Neural Network is the number of hidden layers. If your network has 2 or more hidden layers, then it becomes “Deep.”
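As a rough sketch of what such a network could look like with the Keras API (the layer sizes below are illustrative choices of mine, not a prescription):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A small fully connected regression network with three hidden layers
model = Sequential([
    Dense(13, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(13, activation='relu'),
    Dense(13, activation='relu'),
    Dense(1)  # single output: the predicted median home value
])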
There is no perfect formula for how many perceptrons to use or how many layers to have. The most reliable way to configure these hyperparameters for your specific predictive modeling problem is systematic experimentation with a robust test harness. I recommend playing around with the network above and seeing if you can find a more optimal number of neurons/layers.
Alright, let’s train our model and add early stopping to make sure it does not overfit. The early stopping callback will monitor the validation loss (val_loss) over the epochs and stop the training if val_loss has not improved for x consecutive epochs. Here, we will use an x of 10 (this value is usually called the “patience”). Adding early stopping also allows us to pick a high number of epochs without worrying about overfitting. Lastly, I am using the Adam optimizer, since it has been shown to generally perform better than other gradient descent variants on MNIST data. Let’s set this up and train our model!
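A sketch of the callback and compilation step, monitoring val_loss with a patience of 10 and compiling with Adam and mean squared error as described above:
from tensorflow.keras.callbacks import EarlyStopping

# Stop training if val_loss has not improved for 10 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=10)

# Compile with the Adam optimizer and mean squared error as the loss
model.compile(optimizer='adam', loss='mse')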
model.fit(X_train, y_train, epochs=500, callbacks=[early_stop], validation_data=(X_test, y_test), verbose=0)
Using the model history logs, let’s look at our model performance over the epochs by looking at the loss function (in our case, mse, or mean squared error) and the val_loss, which tracks our model performance on our test data. For more information on why you need to split your data into training and testing, and what overfitting is, check out this great article on the topic.
loss_df = pd.DataFrame(model.history.history)
loss_df[['loss', 'val_loss']].plot()
plt.xlabel("Number of Epochs")
plt.ylabel("Loss")
plt.title("Training and Validation Loss Over Training Period", pad=12);
You will notice that our model is not overfitting, because our validation loss and training loss track each other closely, which is pretty awesome. It also looks like we are able to achieve a pretty low loss value, which we will explore further below.
Performance Evaluation
Now let’s evaluate the performance of our model against our testing data (X_test from the earlier split). Please note that although we used X_test to monitor the overfitting of our run (with val_loss), our model did not actually train or use X_test in any way, shape or form. It was only used to track overfitting by evaluating the loss on the test data without training on it.
We will use our model to predict the median house price for our X_test data, which, again, we kept on the side for this very purpose. We will store the predictions in a “predictions” variable, then compare them to the true values, y_test, and see how far off we are on average.
There are many ways to evaluate predictions for regression projects such as this one, but here we will compute the Mean Squared Error (MSE) and Mean Absolute Error (MAE) from sklearn.metrics. Here is a short article that explains regression model evaluation in more detail if you are interested.
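A sketch of the prediction and scoring step (exact numbers will vary from run to run):
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Predict on the held-out test set and compare against the true values
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
print(f"MAE: {mae:,.2f}")
print(f"MSE: {mse:,.2f}")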
So how well did we do? Our mean absolute error is about $28,659. Is that good? Well, the median home value in our dataset is $225,328, which makes our model on average roughly 13% off on its predicted values. Not too bad, although we can certainly do better!
Perhaps a better way to understand our accuracy is to visualize the residuals and get a grasp of how our errors are distributed.
# Flatten the (n, 1) prediction array and compute the residuals
arr_predictions = np.array([x[0] for x in predictions])
errors = y_test - arr_predictions

fig = plt.figure(figsize=(15, 5))

# Left panel: our predictions vs. the actual values
sub1 = fig.add_subplot(121)
plt.scatter(y_test, predictions)
# Perfect predictions fall on the red line
plt.plot(y_test, y_test, 'r')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Visualizing Residuals with Scatter Plot")

# Right panel: distribution of the residuals
sub2 = fig.add_subplot(122)
sns.distplot(errors)
plt.xlabel("Residuals")
plt.ylabel("Distribution")
plt.title("Visualizing Residuals with Dist Plot");
The chart on the left shows how our predictions compare to the actual values from our X_test dataset, with the red line representing a perfect prediction. You will notice that we consistently under-predict past $400,000. If we wanted to improve our model, this is most likely the area we would focus on first. The chart on the right-hand side shows the distribution of the residuals, which allows you to derive rough confidence intervals for your predictions, as in the sketch below.
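For example, one simple, empirical way to turn the residuals into an interval (this approach and the 90% level are my own illustration, not part of the original analysis):
# Empirical 90% interval from the test-set residuals (errors = y_test - predictions)
lower, upper = np.percentile(errors, [5, 95])
print(f"Roughly 90% of actual values fall between prediction {lower:+,.0f} and prediction {upper:+,.0f}")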
Feature Importance and Feature Effect
Your job as a data scientist does not end here. Should you end up using a neural network in your professional life, you will need to be able to explain your model to your customers. Please check out my next article, “Interpreting your Neural Networks with Feature Importance and Feature Effect”, which is a continuation of this one and covers how to interpret and explain your neural network using feature importance and feature-level analysis, so that you know what’s going on under the hood when you present it to your customers. I simply did not want this article to run too long.
Final Thoughts
I hope you enjoyed reading this article as much as I enjoyed writing it! I teach data science on the side at www.thepythonacademy.com, so if you’d like further training, or even want to learn it all from scratch, feel free to contact us on the website. I also plan to publish many articles on Machine Learning and AI on here, so feel free to follow me as well. Please share, like, connect, and comment, as I always love hearing from you. Thank you!