House Price Prediction in Natural Hazard Prone Areas

Introduction

Four regression models were used: Linear Regression, Decision Tree Regression, Gradient Boosted Regression, and Random Forest Regression. Their performance was compared using R², and based on these scores, the better-performing model was suggested for predicting house prices.
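To make the comparison concrete, here is a minimal sketch of fitting the four model types and scoring each with R² on held-out data. It assumes scikit-learn and uses synthetic data from make_regression as a stand-in for the housing data; the hyperparameters and the names models and scores are illustrative, not the settings used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the real dataset).
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four model families named above, with illustrative settings.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Gradient Boosted Regression": GradientBoostingRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training set and record its R² on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # .score() returns R² for regressors

# Print the models from best to worst R².
for name, r2 in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R² = {r2:.3f}")
```

On real housing data the ranking can differ from what this synthetic, purely linear example produces, so the printed order here should not be read as a result.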

First, the data was divided into the independent variables X and the dependent variable y, with X used to predict the target variable y. X was created by dropping the price, id, and date columns from the new_df dataframe, and y was created from its price column. Several metrics can be used to measure the performance of regression models, such as mean squared error, root mean squared error, R-squared score, mean absolute deviation, and mean absolute percent error. Root mean squared error and R-squared score were used to evaluate the performance of the regression models. To save the metrics of each model, a data frame named metrics was created. Next, the data was split into training and testing sets: 80% of the randomly selected data was kept as the training set and the remaining 20% as the testing set. The model was trained on the 80% training data, and the 20% testing data was treated as an unseen future dataset on which to predict house prices.
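The preparation steps described above can be sketched as follows. Because the real new_df is not shown here, the sketch builds a small hypothetical one; the feature columns sqft_living and bedrooms, the column layout of metrics, and the random_state value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for new_df; only price, id and date are named in the text.
rng = np.random.default_rng(0)
n = 100
new_df = pd.DataFrame({
    "id": np.arange(n),
    "date": pd.date_range("2014-01-01", periods=n, freq="D"),
    "price": rng.uniform(100_000, 900_000, n),
    "sqft_living": rng.uniform(500, 4000, n),   # assumed feature
    "bedrooms": rng.integers(1, 6, n),          # assumed feature
})

# X: every column except the target and the identifier/date columns.
X = new_df.drop(columns=["price", "id", "date"])
# y: the target variable.
y = new_df["price"]

# Empty data frame for collecting each model's scores (column names assumed).
metrics = pd.DataFrame(columns=["model", "rmse", "r2"])

# Random 80/20 split; random_state is fixed only to make the sketch repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing random_state is a convenience for reproducibility; the split is still a random selection of rows.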

A Linear Regression model was built using the default parameters and fitted on the training dataset. The fitted model was then used to make predictions on the X_test data. Finally, mean squared error (MSE), root mean squared error (RMSE), R-squared score (r2_score), mean absolute deviation (MAD), and mean absolute percent error (MAPE) were calculated.
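A minimal sketch of this step with scikit-learn is shown below. It runs on synthetic data rather than the housing data, and it makes two assumptions worth flagging: MAD is interpreted as the mean absolute error between actual and predicted prices, and MAPE is computed by hand as a percentage.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and prices (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 500_000 + X @ np.array([50_000.0, 30_000.0, 10_000.0]) + rng.normal(scale=5_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Linear regression with default parameters, fitted on the training set.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                                      # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)
mad = mean_absolute_error(y_test, y_pred)                # MAD taken as mean absolute error (assumption)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100  # in percent; requires non-zero y_test

print(f"MSE={mse:.1f} RMSE={rmse:.1f} R2={r2:.3f} MAD={mad:.1f} MAPE={mape:.2f}%")
```

RMSE is in the same units as the price, which makes it easier to interpret than MSE, while R² is unitless and convenient for comparing models.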


Feature Selection

The backward elimination method of feature selection was used. Feature selection is the process of selecting a subset of relevant features that may improve the performance of the model. First, the worst attribute was removed from the feature set: date_sold_month was removed because it has a very weak correlation with the price of the house. Then, year_built_decade_mapped was removed from the feature set. Finally, SelectKBest, a univariate feature selection class from the sklearn library, was tried. Below are correlation coefficients for different features.
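As a rough illustration of the two ideas above, the sketch below ranks features by their correlation with price and then applies SelectKBest. The data is synthetic, the columns sqft_living and bedrooms are hypothetical, and the choice of f_regression as the scoring function and k=2 are assumptions, since the settings actually used are not stated.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: price depends on sqft_living only (assumption).
rng = np.random.default_rng(0)
n = 300
sqft_living = rng.uniform(500, 4000, n)
features = pd.DataFrame({
    "sqft_living": sqft_living,                  # hypothetical feature
    "bedrooms": rng.integers(1, 6, n),           # hypothetical feature
    "date_sold_month": rng.integers(1, 13, n),   # feature named in the text
})
price = pd.Series(150 * sqft_living + rng.normal(scale=20_000, size=n), name="price")

# Pearson correlation of each feature with price, strongest first.
correlations = features.corrwith(price).sort_values(ascending=False)
print(correlations)

# Univariate selection: keep the k features with the highest F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(features, price)
selected = features.columns[selector.get_support()].tolist()
print(selected)
```

A feature with a correlation close to zero, as date_sold_month has here by construction, is a natural candidate to drop first.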



towardsdatascience.com

House Price Prediction in Natural Hazard Prone Areas - Part 2