How to Predict Housing Prices with Linear Regression?

How-to-Predict-Housing-Prices-with-Linear-Regression

The final objective is to estimate the cost of a certain house in a Boston suburb. In 1970, the Boston Standard Metropolitan Statistical Area provided the information. To examine and modify the data, we will use several techniques such as data pre-processing and feature engineering. After that, we'll apply a statistical model like regression model to anticipate and monitor the real estate market.

Project Outline:

  • EDA
  • Feature Engineering
  • Pick and Train a Model
  • Interpret
  • Conclusion

EDA

Before using a statistical model, the EDA is a good step to go through in order to:

  • Recognize the data set
  • Check to see if any information is missing.
  • Find some outliers.
  • To get more out of the data, add, alter, or eliminate some features.

Importing the Libraries

  • Recognize the data set
  • Check to see if any information is missing.
  • Find some outliers.
  • To get more out of the data, add, alter, or eliminate some features.

# Import the libraries #Dataframe/Numerical libraries import pandas as pd import numpy as np #Data visualization import plotly.express as px import matplotlib import matplotlib.pyplot as plt import seaborn as sns #Machine learning model from sklearn.linear_model import LinearRegression

Reading the Dataset with Pandas

#Reading the data path='./housing.csv' housing_df=pd.read_csv(path,header=None,delim_whitespace=True)

 CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
00.0063218.02.3100.5386.57565.24.09001296.015.3396.904.9824.0
10.027310.07.0700.4696.42178.94.96712242.017.8396.909.1421.6
20.027290.07.0700.4697.18561.14.96712242.017.8392.834.0334.7
30.032370.02.1800.4586.99845.86.06223222.018.7394.632.9433.4
40.069050.02.1800.4587.14754.26.06223222.018.7396.905.3336.2
.............................................
5010.062630.011.9300.5736.59369.12.47861273.021.0391.999.6722.4
5020.045270.011.9300.5736.12076.72.28751273.021.0396.909.0820.6
5030.060760.011.9300.5736.97691.02.16751273.021.0396.905.6423.9
5040.109590.011.9300.5736.79489.32.38891273.021.0393.456.4822.0
5050.047410.011.9300.5736.03080.82.50501273.021.0396.907.8811.9

Have a Look at the Columns

Crime: It refers to a town's per capita crime rate.

ZN: It is the percentage of residential land allocated for 25,000 square feet.

Indus: The amount of non-retail business lands per town is referred to as the indus.

CHAS: CHAS denotes whether or not the land is surrounded by a river.

NOX: The NOX stands for nitric oxide content (part per 10m)

RM: The average number of rooms per home is referred to as RM.

AGE: The percentage of owner-occupied housing built before 1940 is referred to as AGE.

DIS: Weighted distance to five Boston employment centers are referred to as dis.

RAD: Accessibility to radial highways index

TAX: The TAX columns denote the rate of full-value property taxes per $10,000 dollars.

B: B=1000(Bk — 0.63)2 is the outcome of the equation, where Bk is the proportion of blacks in each town.

PTRATIO: It refers to the student-to-teacher ratio in each community.

LSTAT: It refers to the population's lower socioeconomic status.

MEDV: It refers to the 1000-dollar median value of owner-occupied residences.

Data Preprocessing

# Check if there is any missing values. housing_df.isna().sum() CRIM       0 ZN         0 INDUS      0 CHAS       0 NOX        0 RM         0 AGE        0 DIS        0 RAD        0 TAX        0 PTRATIO    0 B          0 LSTAT      0 MEDV       0 dtype: int64

No missing values are found

We examine our data's mean, standard deviation, and percentiles.

housing_df.describe()

Graph Data

 CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000

The crime, area, sector, nitric oxides, 'B' appear to have multiple outliers at first look because the minimum and maximum values are so far apart. In the Age columns, the mean and the Q2(50 percentile) do not match.

We might double-check it by examining the distribution of each column.

Inferences

  1. The rate of crime is rather low. The majority of values are in the range of 0 to 25. With a huge value and a value of zero.
  2. The majority of residential land is zoned for less than 25,000 square feet. Land zones larger than 25,000 square feet represent a small portion of the dataset.
  3. The percentage of non-retial commercial acres is mostly split between two ranges: 0-13 and 13-23.
  4. The majority of the properties are bordered by the river, although a tiny portion of the data is not.
  5. The content of nitrite dioxide has been trending lower from.3 to.7, with a little bump towards.8. It is permissible to leave a value in the range of 0.1–1.
  6. The number of rooms tends to cluster around the average.
  7. With time, the proportion of owner-occupied units rises.
  8. As the number of weights grows, the weight distance between 5 employment centers reduces. It could indicate that individuals choose to live in new high-employment areas.
  9. People choose to live in places with limited access to roadways (0-10). We have a 30th percentile outlier.
  10. The majority of dwelling taxes are in the range of $200-450, with large outliers around $700,000.
  11. The percentage of people with lower status tends to cluster around the median. The majority of persons are of lower social standing.

Because the model is overly generic, removing all outliers will underfit it. Keeping all outliers causes the model to overfit and become excessively accurate. The data's noise will be learned.

The approach is to establish a happy medium that prevents the model from becoming overly precise. When faced with a new set of data, however, they generalise well.

We'll keep numbers below 600 because there's a huge anomaly in the TAX column around 600.

new_df=housing_df[housing_df['TAX']<600]

Looking at the Distribution

Looking-at-the-Distribution

The overall distribution, particularly the TAX, PTRATIO, and RAD, has improved slightly.

Correlation

Correlation

Perfect correlation is denoted by the clear values. The medium correlation between the columns is represented by the reds, while the negative correlation is represented by the black.

With a value of 0.89, we can see that 'MEDV', which is the medium price we wish to anticipate, is substantially connected with the number of rooms 'RM'. The proportion of black people in area 'B' with a value of 0.19 is followed by the residential land 'ZN' with a value of 0.32 and the percentage of black people in area 'ZN' with a value of 0.32.

The metrics that are most connected with price will be plotted.

The-metrics-that-are-most-connected

Feature Engineering

Feature Scaling

Gradient descent is aided by feature scaling, which ensures that all features are on the same scale. It makes locating the local optimum much easier.

Mean standardization is one strategy to employ. It substitutes (target-mean) for the target to ensure that the feature has a mean of nearly zero.

def standard(X):    '''Standard makes the feature 'X' have a zero mean'''    mu=np.mean(X) #mean    std=np.std(X) #standard deviation    sta=(X-mu)/std # mean normalization    return mu,std,sta     mu,std,sta=standard(X) X=sta X

 CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
0-0.6091290.092792-1.019125-0.2809760.2586700.2791350.162095-0.167660-2.105767-0.235130-1.1368630.401318-0.933659
1-0.575698-0.598153-0.225291-0.280976-0.4237950.0492520.6482660.250975-1.496334-1.032339-0.0041750.401318-0.219350
2-0.575730-0.598153-0.225291-0.280976-0.4237951.1897080.0165990.250975-1.496334-1.032339-0.0041750.298315-1.096782
3-0.567639-0.598153-1.040806-0.280976-0.5325940.910565-0.5263500.773661-0.886900-1.3276010.4035930.343869-1.283945
4-0.509220-0.598153-1.040806-0.280976-0.5325941.132984-0.2282610.773661-0.886900-1.3276010.4035930.401318-0.873561
..........................................
501-0.519445-0.5981530.585220-0.2809760.6048480.3060040.300494-0.936773-2.105767-0.5746821.4456660.277056-0.128344
502-0.547094-0.5981530.585220-0.2809760.604848-0.4000630.570195-1.027984-2.105767-0.5746821.4456660.401318-0.229652
503-0.522423-0.5981530.585220-0.2809760.6048480.8777251.077657-1.085260-2.105767-0.5746821.4456660.401318-0.820331
504-0.444652-0.5981530.585220-0.2809760.6048480.6060461.017329-0.979587-2.105767-0.5746821.4456660.314006-0.676095
505-0.543685-0.5981530.585220-0.2809760.604848-0.5344100.715691-0.924173-2.105767-0.5746821.4456660.401318-0.435703

Choose and Train the Model

For the sake of the project, we'll apply linear regression.

Typically, we run numerous models and select the best one based on a particular criterion.

Linear regression is a sort of supervised learning model in which the response is continuous, as it relates to machine learning.

Form of Linear Regression

y= θX+θ1 or y= θ1+X1θ2 +X2θ3 + X3θ4

y is the target you will be predicting

0 is the coefficient

x is the input

We will Sklearn to develop and train the model

#Import the libraries to train the model from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression

Allow us to utilise the train/test method to learn a part of the data on one set and predict using another set using the train/test approach.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4) #Create and Train the model model=LinearRegression().fit(X_train,y_train) #Generate prediction predictions_test=model.predict(X_test) #Compute loss to evaluate the model coefficient= model.coef_ intercept=model.intercept_ print(coefficient,intercept) [7.22218258] 24.66379606613584

In this example, you will learn the model using below hypothesis:

Price= 24.85 + 7.18* Room

It is interpreted as:

For a decided price of a house:

A 7.18-unit increase in the price is connected with a growth in the number of rooms.

As a side note, this is an association, not a cause!

Interpretation

You will need a metric to determine whether our hypothesis was right. The RMSE approach will be used.

Root Means Square Error (RMSE) is defined as the square root of the mean of square error. The difference between the true and anticipated numbers called the error. It's popular because it can be expressed in y-units, which is the median price of a home in our scenario.

def rmse(predict,actual):    return np.sqrt(np.mean(np.square(predict - actual))) # Split the Data into train and test set X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4) #Create and Train the model model=LinearRegression().fit(X_train,y_train) #Generate prediction predictions_test=model.predict(X_test) #Compute loss to evaluate the model coefficient= model.coef_ intercept=model.intercept_ print(coefficient,intercept) loss=rmse(predictions_test,y_test) print('loss: ',loss) print(model.score(X_test,y_test)) #accuracy [7.43327725] 24.912055881970886 loss: 3.9673165450580714 0.7552661033654667 Loss will be 3.96

This means that y-units refer to the median value of occupied homes with 1000 dollars.

This will be less by 3960 dollars.

While learning the model you will have a high variance when you divide the data. Coefficient and intercept will vary. It's because when we utilized the train/test approach, we choose a set of data at random to place in either the train or test set. As a result, our theory will change each time the dataset is divided.

This problem can be solved using a technique called cross-validation.

Improvisation in the Model

With 'Forward Selection,' we'll iterate through each parameter to assist us choose the numbers characteristics to include in our model.

Forward Selection

  1. Choose the most appropriate variable (in our case based on high correlation)
  2. Add the next best variable to the model
  3. Some predetermined conditions must meet.

We'll use a random state of 1 so that each iteration yields the same outcome.

cols=[] los=[] los_train=[] scor=[] i=0 while i < len(high_corr_var):    cols.append(high_corr_var[i])        # Select inputs variables    X=new_df[cols]        #mean normalization    mu,std,sta=standard(X)    X=sta        # Split the data into training and testing    X_train,X_test,y_train,y_test= train_test_split(X,y,random_state=1)        #fit the model to the training    lnreg=LinearRegression().fit(X_train,y_train)        #make prediction on the training test    prediction_train=lnreg.predict(X_train)        #make prediction on the testing test    prediction=lnreg.predict(X_test)        #compute the loss on train test    loss=rmse(prediction,y_test)    loss_train=rmse(prediction_train,y_train)    los_train.append(loss_train)    los.append(loss)        #compute the score    score=lnreg.score(X_test,y_test)    scor.append(score)        i+=1

We have a big 'loss' with a smaller collection of variables, yet our system will overgeneralize in this scenario. Although we have a reduced 'loss,' we have a large number of variables. However, if the model grows too precise, it may not generalize well to new data.

In order for our model to generalize well with another set of data, we might use 6 or 7 features. The characteristic chosen is descending based on how strong the price correlation is.

high_corr_var ['RM', 'ZN', 'B', 'CHAS', 'RAD', 'DIS', 'CRIM', 'NOX', 'AGE', 'TAX', 'INDUS', 'PTRATIO', 'LSTAT']

With 'RM' having a high price correlation and LSTAT having a negative price correlation.

# Create a list of features names feature_cols=['RM','ZN','B','CHAS','RAD','CRIM','DIS','NOX'] #Select inputs variables X=new_df[feature_cols] # Split the data into training and testing sets X_train,X_test,y_train,y_test= train_test_split(X,y, random_state=1) # feature engineering mu,std,sta=standard(X) X=sta # fit the model to the trainning data lnreg=LinearRegression().fit(X_train,y_train) # make prediction on the testing test prediction=lnreg.predict(X_test) # compute the loss loss=rmse(prediction,y_test) print('loss: ',loss) lnreg.score(X_test,y_test) loss: 3.212659865936143 0.8582338376696363

The test set yielded a loss of 3.21 and an accuracy of 85%.

Other factors, such as alpha, the learning rate at which our model learns, could still be tweaked to improve our model. Alternatively, return to the preprocessing section and working to increase the parameter distribution.

For more details regarding scraping real estate data you can contact Scraping Intelligence today

https://www.websitescraper.com/how-to-predict-housing-prices-with-linear-regression.php

What is GEEK

Buddha Community

A Deep Dive into Linear Regression

Let’s begin our journey with the truth — machines never learn. What a typical machine learning algorithm does is find a mathematical equation that, when applied to a given set of training data, produces a prediction that is very close to the actual output.

Why is this not learning? Because if you change the training data or environment even slightly, the algorithm will go haywire! Not how learning works in humans. If you learned to play a video game by looking straight at the screen, you would still be a good player if the screen is slightly tilted by someone, which would not be the case in ML algorithms.

However, most of the algorithms are so complex and intimidating that it gives our mere human intelligence the feel of actual learning, effectively hiding the underlying math within. There goes a dictum that if you can implement the algorithm, you know the algorithm. This saying is lost in the dense jungle of libraries and inbuilt modules which programming languages provide, reducing us to regular programmers calling an API and strengthening further this notion of a black box. Our quest will be to unravel the mysteries of this so-called ‘black box’ which magically produces accurate predictions, detects objects, diagnoses diseases and claims to surpass human intelligence one day.

We will start with one of the not-so-complex and easy to visualize algorithm in the ML paradigm — Linear Regression. The article is divided into the following sections:

  1. Need for Linear Regression

  2. Visualizing Linear Regression

  3. Deriving the formula for weight matrix W

  4. Using the formula and performing linear regression on a real world data set

Note: Knowledge on Linear Algebra, a little bit of Calculus and Matrices are a prerequisite to understanding this article

Also, a basic understanding of python, NumPy, and Matplotlib are a must.


1) Need for Linear regression

Regression means predicting a real valued number from a given set of input variables. Eg. Predicting temperature based on month of the year, humidity, altitude above sea level, etc. Linear Regression would therefore mean predicting a real valued number that follows a linear trend. Linear regression is the first line of attack to discover correlations in our data.

Now, the first thing that comes to our mind when we hear the word linear is, a line.

Yes! In linear regression, we try to fit a line that best generalizes all the data points in the data set. By generalizing, we mean we try to fit a line that passes very close to all the data points.

But how do we ensure that this happens? To understand this, let’s visualize a 1-D Linear Regression. This is also called as Simple Linear Regression

#calculus #machine-learning #linear-regression-math #linear-regression #linear-regression-python #python

Rusty  Shanahan

Rusty Shanahan

1595563500

House Prices Prediction Using Deep Learning

In this tutorial, we’re going to create a model to predict House prices🏡 based on various factors across different markets.

Problem Statement

The goal of this statistical analysis is to help us understand the relationship between house features and how these variables are used to predict house price.

Objective

  • Predict the house price
  • Using two different models in terms of minimizing the difference between predicted and actual rating

Data used: Kaggle-kc_house Dataset

GitHub: you can find my source code here


Step 1: Exploratory Data Analysis (EDA)

First, Let’s import the data and have a look to see what kind of data we are dealing with:

#import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#import Data
Data = pd.read_csv('kc_house_data.csv')
Data.head(5).T
#get some information about our Data-Set
Data.info()
Data.describe().transpose()

Image for post

5 records of our dataset

Image for post

Information about the dataset, what kind of data types are your variables

Image for post

Statistical summary of your dataset

Our features are:

✔️**Date:**_ Date house was sold_

✔️**Price:**_ Price is prediction target_

✔️**_Bedrooms: _**Number of Bedrooms/House

✔️**Bathrooms:**_ Number of bathrooms/House_

✔️**Sqft_Living:**_ square footage of the home_

✔️**Sqft_Lot:**_ square footage of the lot_

✔️**Floors:**_ Total floors (levels) in house_

✔️**Waterfront:**_ House which has a view to a waterfront_

✔️**View:**_ Has been viewed_

✔️**Condition:**_ How good the condition is ( Overall )_

✔️**Grade:**_ grade given to the housing unit, based on King County grading system_

✔️**Sqft_Above:**_ square footage of house apart from basement_

✔️**Sqft_Basement:**_ square footage of the basement_

✔️**Yr_Built:**_ Built Year_

✔️**Yr_Renovated:**_ Year when house was renovated_

✔️**Zipcode:**_ Zip_

✔️**Lat:**_ Latitude coordinate_

✔️**_Long: _**Longitude coordinate

✔️**Sqft_Living15:**_ Living room area in 2015(implies — some renovations)_

✔️**Sqft_Lot15:**_ lotSize area in 2015(implies — some renovations)_


Let’s plot couple of features to get a better feel of the data

#visualizing house prices
fig = plt.figure(figsize=(10,7))
fig.add_subplot(2,1,1)
sns.distplot(Data['price'])
fig.add_subplot(2,1,2)
sns.boxplot(Data['price'])
plt.tight_layout()
#visualizing square footage of (home,lot,above and basement)
fig = plt.figure(figsize=(16,5))
fig.add_subplot(2,2,1)
sns.scatterplot(Data['sqft_above'], Data['price'])
fig.add_subplot(2,2,2)
sns.scatterplot(Data['sqft_lot'],Data['price'])
fig.add_subplot(2,2,3)
sns.scatterplot(Data['sqft_living'],Data['price'])
fig.add_subplot(2,2,4)
sns.scatterplot(Data['sqft_basement'],Data['price'])
#visualizing bedrooms,bathrooms,floors,grade
fig = plt.figure(figsize=(15,7))
fig.add_subplot(2,2,1)
sns.countplot(Data['bedrooms'])
fig.add_subplot(2,2,2)
sns.countplot(Data['floors'])
fig.add_subplot(2,2,3)
sns.countplot(Data['bathrooms'])
fig.add_subplot(2,2,4)
sns.countplot(Data['grade'])
plt.tight_layout()

With distribution plot of price, we can visualize that most of the prices are between 0 and around 1M with few outliers close to 8 million (fancy houses😉). It would make sense to drop those outliers in our analysis.

#linear-regression #machine-learning #python #house-price-prediction #deep-learning #deep learning

5 Regression algorithms: Explanation & Implementation in Python

Take your current understanding and skills on machine learning algorithms to the next level with this article. What is regression analysis in simple words? How is it applied in practice for real-world problems? And what is the possible snippet of codes in Python you can use for implementation regression algorithms for various objectives? Let’s forget about boring learning stuff and talk about science and the way it works.

#linear-regression-python #linear-regression #multivariate-regression #regression #python-programming

Angela  Dickens

Angela Dickens

1598352300

Regression: Linear Regression

Machine learning algorithms are not your regular algorithms that we may be used to because they are often described by a combination of some complex statistics and mathematics. Since it is very important to understand the background of any algorithm you want to implement, this could pose a challenge to people with a non-mathematical background as the maths can sap your motivation by slowing you down.

Image for post

In this article, we would be discussing linear and logistic regression and some regression techniques assuming we all have heard or even learnt about the Linear model in Mathematics class at high school. Hopefully, at the end of the article, the concept would be clearer.

**Regression Analysis **is a statistical process for estimating the relationships between the dependent variables (say Y) and one or more independent variables or predictors (X). It explains the changes in the dependent variables with respect to changes in select predictors. Some major uses for regression analysis are in determining the strength of predictors, forecasting an effect, and trend forecasting. It finds the significant relationship between variables and the impact of predictors on dependent variables. In regression, we fit a curve/line (regression/best fit line) to the data points, such that the differences between the distances of data points from the curve/line are minimized.

#regression #machine-learning #beginner #logistic-regression #linear-regression #deep learning

Singapore Housing Prices ML Prediction — Analyse Singapore’s Property Price

In this final part, I will share some popular machine learning algorithms to predict the housing prices and the live model that I have built. My objective is to find a model that can generate a high accuracy of the housing prices, based on the available dataset, such that given a new property and with the required information, we will know whether the property is over or under-valued.


Brief introduction of the machine learning algorithms used

I explore 5 machine learning algorithms that are used to predict the housing prices in Singapore, namely multi-linear regression, lasso, ridge, decision tree and neural network.

Multi-linear regression model, as its name suggest, is a widely used model that assumes linearity between the independent variables and dependent variable (price). This will be my baseline model for comparison.

Lasso and ridge are models to reduce model complexity and overfitting when there are too many parameters. For example, the lasso model will effectively shrink some of the variables, such that it only takes into account some of the important factors. While there are only 17 variables, in the dataset and the number of variables may not be considered extensive, it will still be a good exercise to analyse the effectiveness of these models.

Decision tree is an easily understandable model which uses a set of binary rules to achieve the target value. This is extremely useful for decision making as a tree diagram can be plotted to aid in understanding the importance of each variable (the higher the variable in the tree, the more important the variable).

Last, I have also explored a simple multi-layer perceptron neural network model. Simply put, the data inputs is put through a few layers of “filters” (feed forward hidden layers) and the model learns how to minimise the loss function by changing the values in the “filters” matrices.

#predictive-analytics #predictive-modeling #machine-learning #sklearn #housing-prices