1 Introduction

This project develops a model to predict the price of single-family homes in Boulder County, Colorado. Although Zillow already has powerful models for home price prediction, every place has a unique context, which calls for more refined, site-specific models that take local characteristics into account. Such a model is difficult to build, however, because the factors that influence home price tend to be correlated with each other within the real estate system. Moreover, because Boulder is not a large county, a model specific to Boulder has limited data for training, a limitation that harms accuracy. In response to the request for a site-specific model, we develop a hedonic model that decomposes home price into the values of its constituent parts: internal characteristics such as housing quality, public services and amenities such as public transit, and spatial processes such as the clustering of home prices. We then use the model to predict home price in relation to changes in these features.

The final model that we develop to predict home price in Boulder explains around 72% of the variation in price, with a mean absolute percentage error (MAPE) of about 19% for the prices it predicts. On average, the model has an absolute error of about $130,000 per prediction; compared with the mean home price ($739,555), this is a decent level of accuracy. The MAPE of the predictions is distributed nearly randomly across space. Although the model has a slightly higher MAPE in high-income neighborhoods, the relatively small gap (3%) between the MAPE for high-income and low-income neighborhoods suggests that the model is generalizable. We therefore believe that our model, with its good accuracy and generalizability, will be useful for Zillow in predicting home prices in Boulder County, CO.

2 Data

2.1 Data Wrangling and Feature Engineering

We first download the county boundary data from the open data portal of Boulder County. We then use the county boundary to download spatial information on key amenities (parks, playgrounds, water bodies, restaurants, fast food outlets, companies, and bus stations) from OpenStreetMap. From the Census Bureau, we download spatial information on the census tracts within Boulder and on urban areas, which the Census defines as densely developed areas with a population of at least 2,500. We also manually collect K-12 school information from the website of the Boulder Valley School District and clean it into a CSV file.

We then convert raw variables into useful predictive features. We first eliminate outliers that are significantly more expensive than the general housing stock in Boulder. To develop indicators, we calculate the age of each home and its distance to the nearest one to three amenities of each type. We also create buffers to count how many amenities exist within a given distance (in miles) of a house. We recode some numeric variables, such as the number of bedrooms, into categorical variables, for example whether a house has more than 3 bedrooms. We spatially join the census tracts and urban areas to the housing data so that each house is assigned its census tract and an urban status of either urban or non-urban. Finally, we create a lag price variable, the mean sale price of the 5 nearest houses, to account for the clustering of home prices.
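
The nearest-amenity distance and lag price features described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report: the function names, the sample coordinates, and the choice to exclude a house from its own lag price are assumptions, and the coordinates are assumed to be in a projected system so that Euclidean distance is meaningful.

```python
import math

def knn_mean_distance(point, targets, k):
    """Mean Euclidean distance from `point` to its k nearest `targets`."""
    nearest = sorted(math.dist(point, t) for t in targets)[:k]
    return sum(nearest) / len(nearest)

def lag_price(index, coords, prices, k=5):
    """Mean sale price of the k houses nearest to house `index`.

    The house itself is excluded (an assumption about how the lag is built).
    """
    others = [
        (math.dist(coords[index], coords[j]), prices[j])
        for j in range(len(coords))
        if j != index
    ]
    others.sort(key=lambda pair: pair[0])
    nearest = others[:k]
    return sum(price for _, price in nearest) / len(nearest)

# Hypothetical projected coordinates and sale prices, for illustration only.
houses = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
prices = [500_000, 600_000, 700_000, 900_000]
parks = [(0.0, 3.0), (4.0, 0.0)]

print(knn_mean_distance(houses[0], parks, k=1))  # distance to the nearest park
print(lag_price(0, houses, prices, k=2))         # mean price of the 2 nearest houses
```

When the lag price is computed for houses whose price is to be predicted, only neighbors with known sale prices can contribute to the mean.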

2.2 Variable Descriptions

Our final model uses the following variables:

Internal Characteristics
- age: age of the house calculated by 2021 minus the effective year of the house.
- designCodeDscr: description of building’s design type
- qualityCodeDscr: description of the quality of the house as determined by the government staff
- TotalFinishedSF: total finished square feet, excluding basement square feet
- : whether the house has more than 3 bedrooms
- HeatingDscr: description of the type of heating system in the house
- Roof_CoverDscr: the material used on the roof

Public Service/Amenities
- park: distance from the house to the nearest park
- school: distance from the house to the nearest school
- restaurant_nn1: distance from the house to the nearest restaurant
- bus_stop_nn1: distance from the house to the nearest bus station
- company: distance from the house to the nearest company
- water_nn1: distance from the house to the nearest water body

Spatial Process
- GEOID: census tract that the house belongs to
- urban_status: whether the house is located in an urban area defined by the Census
- lagPrice: the average sale price of the house’s 5 nearest neighbors

The summary statistics for all numerical variables are presented below.

2.3 Data Selection: Correlation Matrix and Correlation Plots

Below is the correlation matrix of all the variables that we develop. We select and delete variables based on this matrix: if two variables are highly correlated, we include only one of them in our model.
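
The selection rule above can be sketched as a simple filter on pairwise Pearson correlations. This is an illustrative sketch only: the 0.8 cutoff, the keep-the-first-seen rule, and the feature name park_nn2 are assumptions, since the report does not state a threshold or how the surviving variable of a pair was chosen.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_collinear(features, threshold=0.8):
    """Keep features in insertion order, skipping any feature whose absolute
    correlation with an already-kept feature exceeds `threshold`."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical values: two nearly identical park distances and house age.
features = {
    "park_nn1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "park_nn2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "age": [30, 5, 22, 14, 9],
}
print(drop_collinear(features))
```

In practice the choice between two correlated variables would also weigh which one is easier to interpret or more strongly related to price.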

For categorical variables that cannot be incorporated into the correlation matrix, we plot price as a function of categorical variables to see if there is a correlation. Below are the plots for variables of interest that show a correlation to the home price.

2.4 Price Correlation Scatter Plots

The scatter plots below show home price plotted against 4 variables of interest. The age of a house is negatively correlated with home price, meaning that the older a house is, the lower its sale price tends to be; the fitted line is rather flat, however, and may be skewed by a few old houses with high values. The distance to the nearest bus stop is negatively correlated with home price, suggesting that price is likely to fall as accessibility to public transit decreases. The distance to the nearest company site is negatively correlated with home price, suggesting that the farther a house is from company sites, the lower its sale price; this correlation points to a relationship between home price and potential job opportunities. The distance to the nearest school is also negatively correlated with home price, meaning that the farther a house is from a school, the lower its sale price.

2.5 Map Home Sale Price

The map below visualizes the distribution of sale prices in Boulder County. Higher prices tend to concentrate in the city of Boulder, while lower prices tend to concentrate in northeast Boulder and to scatter across the less populated parts of the county.

2.6 Maps of Interesting Independent Variables

The following maps show the spatial distribution of three independent variables: urban status, distance to the nearest bus station, and the design style of the house. Each variable has its own distribution pattern, and each shows a certain degree of clustering that does not follow a single spatial process such as urban status. The different clustering patterns of these independent variables suggest that it is important to include all of them in the model.

3 Methods

3.1 Split the training and testing datasets

We first split our data into a training set (75%) and a test set (25%). Based on the correlation matrix above (Figure 1), we remove multicollinear variables as far as possible so that the remaining predictors are close to independent of each other.
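
A 75/25 partition like the one described above can be sketched as a seeded random shuffle. This is a minimal illustration: the report does not say whether the split was stratified (for example by neighbourhood), so the purely random split, the function name, and the seed are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of `rows` and cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical row identifiers standing in for house records.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

With categorical predictors such as census tract, a purely random split can leave a category unseen in training, which is one reason a stratified split is often preferred.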

3.3 Accounting for neighbourhoods effects

4 Results

4.1 Results of training dataset

Table 2 shows the summary results for our training dataset. According to the p-values and t-values, the majority of the predictors are significant. Variables categorized as internal characteristics (total finished size, quality code, age, etc.) and spatial structure (most neighbourhoods and the price lag) play the most significant roles in the OLS regression model of our training dataset. According to the coefficients, houses that are newer, of higher quality, larger, and with more bedrooms are associated with higher sale prices, which accords with our expectations. Except for the distance to the nearest restaurant, the public service variables are not statistically significant in our model. The adjusted R-squared is 0.7267, which means that 72.67% of the variance in sale price is explained by the predictors we chose.
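
For reference, the adjusted R-squared quoted above penalizes the ordinary R-squared for the number of predictors. The sketch below uses the standard formula; the sample size and predictor count in the example are made-up values, not those of the report's model.

```python
def r_squared(observed, predicted):
    """Share of variance explained: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjust R-squared for the number of predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical inputs: R-squared of 0.75 from 101 observations and 10 predictors.
print(adjusted_r_squared(0.75, n_obs=101, n_predictors=10))
```

Because each census tract enters the model as its own dummy variable, the neighbourhood term alone adds many predictors, which is why the adjusted value is the more informative one here.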

4.2 Results of test dataset

Table 3 shows the mean absolute error (AbsError) and the mean absolute percentage error (MAPE) for a single test set. Here we compare test set results with and without neighbourhood effects. Notably, the model with neighbourhood effects is more accurate: both MAPE and AbsError decrease once we introduce neighbourhoods into the model. However, a model with a MAPE of 18.9% still needs further improvement.
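
The two error measures in Table 3 can be computed as in the sketch below. The prices are invented for illustration, and MAPE is returned as a fraction (0.189 corresponds to 18.9%).

```python
def mean_absolute_error(observed, predicted):
    """Average absolute gap between observed and predicted prices."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_absolute_percentage_error(observed, predicted):
    """Average absolute gap as a fraction of the observed price."""
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical observed and predicted sale prices.
observed = [400_000, 800_000]
predicted = [440_000, 720_000]
print(mean_absolute_error(observed, predicted))
print(mean_absolute_percentage_error(observed, predicted))
```

MAPE weights a given dollar error more heavily for cheap houses than for expensive ones, so the two measures can rank models differently.
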
4.3 Cross validation

A 100-fold cross-validation is then conducted on our training dataset. The mean absolute error (MAE) across folds is 149,189.9, with a standard deviation of 23,366.62. The root mean square error (RMSE) is 283,864.8 and the R-squared is 0.7434937, which suggests that about 74.35% of the variance is explained in the validation data. According to the histogram, MAE, RMSE, and R-squared are not normally distributed across folds. Further analysis should be conducted to test the generalizability of the model.
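
The mechanics of k-fold cross-validation can be sketched as below. To keep the example self-contained it fits a one-predictor least-squares line rather than the report's full model, and the data are synthetic; the function names, fold assignment, and seed are all assumptions.

```python
import random
import statistics

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def k_fold_mae(x, y, k=10, seed=1):
    """Return the held-out MAE of each of k folds."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    fold_mae = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_simple_ols([x[i] for i in train],
                                          [y[i] for i in train])
        errors = [abs(y[i] - (intercept + slope * x[i])) for i in held_out]
        fold_mae.append(statistics.fmean(errors))
    return fold_mae

# Synthetic data: price rises with finished square feet, plus noise.
rng = random.Random(0)
sqft = [1000 + 50 * i for i in range(40)]
price = [100_000 + 200 * s + rng.gauss(0, 20_000) for s in sqft]
maes = k_fold_mae(sqft, price, k=10)
print(statistics.fmean(maes), statistics.stdev(maes))
```

The spread of the per-fold errors, not just their mean, is what indicates generalizability: a small standard deviation means the model performs similarly on every held-out subset.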

4.4 Predicted sale price as function of observed price

Figure 6.1 plots the predicted sale price as a function of the observed price. Comparing the fitted prediction line (green) with the perfect prediction line (orange) shows that our predictions are close to the observed prices for the majority of houses. As prices increase, however, the errors spread out. Comparing the baseline regression with the neighbourhood effects model in Figure 6.2, we notice that the baseline regression is already close to the perfect prediction, and the neighbourhood effects model fits slightly better still.

4.5 Residuals of test sets

Figure 7.1 shows the distribution of residuals for the test set. We observe some clusters in the east, in the southeast, and in the city of Boulder, suggesting that positive spatial autocorrelation exists. Figure 7.2 indicates that the observed Moran's I (the orange line), with a positive value of about 0.25, is significantly higher than all 999 randomly permuted values of I, which again confirms positive spatial autocorrelation in the residuals. According to Figure 7.3, prices generally increase as the spatial lag of errors increases. More house prices, including outliers, appear to lie above the best-fit line than below it, which suggests a clustering pattern in house prices.
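
The Moran's I permutation test summarized above can be sketched as follows. The spatial weights used in the report are not stated, so the row-standardized k-nearest-neighbor weights here are an assumption, and the grid of values is synthetic.

```python
import math
import random

def knn_weights(coords, k):
    """Row-standardized weights: each point gives 1/k to its k nearest neighbors."""
    weights = []
    for i, p in enumerate(coords):
        order = sorted((j for j in range(len(coords)) if j != i),
                       key=lambda j: math.dist(p, coords[j]))
        weights.append({j: 1.0 / k for j in order[:k]})
    return weights

def morans_i(values, weights):
    """Global Moran's I for `values` under sparse `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(w * dev[i] * dev[j]
                for i, row in enumerate(weights) for j, w in row.items())
    total_w = sum(w for row in weights for w in row.values())
    return (n / total_w) * cross / sum(d * d for d in dev)

def permutation_test(values, weights, n_perm=999, seed=0):
    """Observed I and a one-sided pseudo p-value from random relabelling."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Synthetic residuals on a 6 x 6 grid with a strong west-to-east gradient.
coords = [(float(x), float(y)) for x in range(6) for y in range(6)]
residuals = [x for x, _ in coords]
observed_i, p_value = permutation_test(residuals, knn_weights(coords, k=4))
print(observed_i, p_value)
```

An observed I that exceeds every permuted value yields the smallest possible pseudo p-value, 1 / (n_perm + 1), which is 0.001 for 999 permutations.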

Map of residuals

4.6 Mapping predicted values

Using the regression model, we predicted sale prices for the records where toPredict equals 0 in the original housing dataset. We then plotted our results for all sale prices, where toPredict is either 0 or 1, as shown in Figure 8. According to our predictions, the most expensive houses are concentrated in the city of Boulder and the southeast corner of Boulder County, while the cheapest houses are clustered close to the northeast corner of the city of Boulder.

4.7 MAPE by neighborhood

Using the test set predictions, the mean absolute percentage error (MAPE) by neighbourhood is mapped in Figure 9. Most neighbourhoods have a low MAPE, below 0.4 (40%). Notably, one neighbourhood in central Boulder County has a very high MAPE, exceeding 100%. This is probably because the area contains few houses, so a small number of poor predictions easily distorts its average.
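
Aggregating the error by neighbourhood amounts to a grouped mean of absolute percentage errors, as in the sketch below. The tract identifiers and prices are invented; the second tract has a single sale to illustrate how a sparse neighbourhood can show an extreme MAPE.

```python
from collections import defaultdict

def mape_by_group(records):
    """records: iterable of (group_id, observed, predicted) tuples.

    Returns the mean absolute percentage error per group, as a fraction.
    """
    ape = defaultdict(list)
    for group_id, observed, predicted in records:
        ape[group_id].append(abs(observed - predicted) / observed)
    return {group_id: sum(v) / len(v) for group_id, v in ape.items()}

# Hypothetical test-set predictions keyed by a made-up tract identifier.
records = [
    ("tract_A", 500_000, 450_000),
    ("tract_A", 800_000, 880_000),
    ("tract_A", 600_000, 570_000),
    ("tract_B", 300_000, 630_000),  # single sale, badly mispredicted
]
for tract, mape in mape_by_group(records).items():
    print(tract, round(mape, 3))
```

Reporting the number of test sales per neighbourhood alongside its MAPE makes such small-sample distortions easy to spot.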

4.8 Test generalizability under income context

To further test our model's generalizability, we split the neighbourhoods into high-income and low-income groups. We collected income data from the 2015-2019 ACS using tidycensus and calculated the median income of Boulder County (40453). Figure 11 shows the income context used for the generalizability test, in which most neighbourhoods fall in the wealthier group. MAPE is calculated for the baseline regression and for the neighbourhood effects model under each income context in Table 3, where we observe a lower MAPE for the models with neighbourhood effects. Neighbourhood effects therefore make the prediction model more generalizable.

5 Discussion

Our model is effective: the p-value of the F-statistic is smaller than 0.001 (p-value < 2.2e-16), and the model explains around 72% of the variation in prices.
Some of the more interesting variables are distance to the nearest school, distance to the nearest bus station, and distance to the nearest company. These are statistically significant in the baseline model but lose significance once we introduce the spatial process variables, including the neighborhood variable and urban status.
Important features for home price prediction include age, neighborhood, urban status, design style, quality, built-up area, number of bedrooms, heating system, roof material, distance to park, distance to restaurant, and the mean price of the nearest 5 houses. Specifically, the neighborhood, urban status, and lag price variables represent the spatial process of price clustering, in which houses in neighborhoods with higher home prices also tend to have higher prices. The spatial distributions of design style and roof material are correlated with the clustering of homes and correspond to the urban/non-urban division; specific design styles are concentrated in urban neighborhoods that tend to have higher home prices. Although the distances to amenities such as parks and restaurants stay significant after we introduce the spatial process variables, their significance decreases because they too are clustered in relation to the neighborhood effects. The correlations among variables are thus complicated in a spatial sense, reflecting a clustered distribution that is crucial for home price.
The model has a mean absolute error of 149189.9 and a mean absolute percentage error of 19%. Although the model has a slightly higher mean absolute percentage error in high-income neighborhoods, the relatively small gap between the MAPE for high-income and low-income neighborhoods suggests that the model can account for the spatial variation in prices. The residuals, although largely randomly distributed, tend to be higher in the city of Boulder; they tend to be lower, however, in Longmont, which is also an urban region. The error may therefore stem from spatial characteristics specific to the city of Boulder that our model fails to capture. For instance, the large young population of University of Colorado students in the city of Boulder may skew home prices.
In more remote regions where few observations exist, the residuals are mixed, indicating that the model might require more observations from the rural context to reduce them.

6 Conclusion

We would recommend our model to Zillow as a way to bring more local intelligence into their home price prediction model. Our model has a decent degree of accuracy and generalizability: it predicts home prices across neighborhoods and accounts for the spatial variation in prices. The model could be improved with more training data, since the training data for Boulder County is rather limited and we lack some key data, such as crime data, that would help train the model. In addition, the socioeconomic data accessible to us, including income, racial composition, and educational attainment, is census data that is highly collinear with the neighborhood variable, which we approximate with census tracts; as a result, we are unable to include socioeconomic data that is likely to be influential in our model. Further feature engineering, such as using log-transformed rather than raw values, might also reshape the distribution of the independent variables and produce a better-fitting model. We believe that Zillow can use the rich resources and data they already possess to train and improve our model with more variables and obtain even better prediction results.