Introduction

California is well known for earthquakes, and it is overdue for a major one. In the 1989 Loma Prieta earthquake, the world witnessed what liquefaction can do as blocks of San Francisco's Marina District were reduced to rubble. Some properties sit directly on fault lines, like the homes damaged in the South Napa earthquake that had fault traces running through them. When the ground actually splits during an earthquake, it can damage buildings and utility lines far beyond what shaking alone can do.

Apart from earthquake hazards, the 2018 wildfires burned more than 1.2 million acres and destroyed more than 1,200 homes in California, and the housing market in Northern California changed after that devastation.

Sellers of properties prone to "ground failure" are required by law to disclose that information to potential buyers, which can change a buyer's decision. Even so, the information isn't always easy to find until the transaction is already underway.

Problem Statement

Given a set of property details and natural hazard attributes, predict the median house price in the city of San Jose using Zillow property data combined with natural hazards data such as earthquake, landslide, and fire hazards. The main goal is to identify the major attributes contributing to house prices in San Jose.

Data Collection and Data Wrangling

The initial challenge was obtaining data, which was not readily available on Kaggle or similar sites. After extracting data from various sources using different data wrangling techniques, the next challenge was merging the different data frames into a single analyzable data frame using Python merge and append methods. The following sections discuss the data collection and data wrangling methods in detail.

Data Sources

Housing Data

Zillow property data for the city of San Jose were collected via web scraping, using Python and packages such as Selenium and BeautifulSoup.

Natural Hazards Data

Natural hazards data such as fault zone, landslide, liquefaction, and fire hazard data were collected from the sources described in the sections below.

Data Collection

Housing Data from Zillow

The Zillow API has built-in limits on data extraction and does not support downloading data by zip code or city. Therefore, the following fields were scraped for single-family and townhouse properties, both sold and for sale (by agent, by owner, new construction, foreclosures, coming soon), across the different zip codes in San Jose: latitude, longitude, address, zip, bedrooms, bathrooms, sqft, lot_size, year_built, price, sale_type, zestimate, date_sold, days_on_zillow, house_type, url.

Python code using the Selenium and BeautifulSoup packages was written to scrape these Zillow features for all zip codes in San Jose. Below is a sample of the scraped data.
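The parsing half of such a scraper can be sketched as follows. This is a minimal illustration only: the real scraper drives Selenium to render the listing pages, then passes the HTML to BeautifulSoup, and the class names below are hypothetical stand-ins, not Zillow's actual markup.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; in practice this comes from Selenium's driver.page_source.
html = """
<article class="list-card">
  <address class="list-card-addr">123 Example St, San Jose, CA 95125</address>
  <div class="list-card-price">$1,200,000</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
listings = []
for card in soup.find_all("article", class_="list-card"):
    # Pull the fields of interest out of each listing card.
    listings.append({
        "address": card.find("address").get_text(strip=True),
        "price": card.find("div", class_="list-card-price").get_text(strip=True),
    })
```

Each dictionary in `listings` then becomes one row of the scraped data frame.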


Sample scraped data from Zillow

Earthquake Hazard Zone Data

SQL-style queries were written to extract seismic hazard data from the California Geological Survey (CGS) web application. The problem was that CGS limits each query to 1,000 rows. The workaround was to first fetch only the object IDs with a Python program, then use those object IDs to fetch the actual rows in batches. Using the Python requests and json packages, all extracted data were stored in a CSV file. The resulting data frame has 261,195 features and 7 attributes.
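The two-step fetch can be sketched as below. The endpoint URL is a hypothetical placeholder (CGS serves this data through an ArcGIS-style REST service); only the batching logic is the point here.

```python
import requests

# Hypothetical CGS feature-service endpoint; the real layer URL differs.
LAYER_URL = ("https://services.example.com/arcgis/rest/services/"
             "SeismicHazardZones/FeatureServer/0/query")

def chunked(ids, size=1000):
    """Split the list of object IDs into batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def fetch_all_features(session=None):
    s = session or requests.Session()
    # Step 1: fetch only the object IDs, which are not subject to the row cap.
    r = s.get(LAYER_URL, params={"where": "1=1", "returnIdsOnly": "true", "f": "json"})
    object_ids = r.json()["objectIds"]
    # Step 2: fetch the actual rows, 1,000 IDs per request.
    rows = []
    for batch in chunked(object_ids):
        r = s.get(LAYER_URL, params={
            "objectIds": ",".join(map(str, batch)),
            "outFields": "*",
            "f": "json",
        })
        rows.extend(feature["attributes"] for feature in r.json()["features"])
    return rows
```

The collected `rows` can then be written to CSV with the csv or pandas library.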


Seismic hazard zone map


Sample seismic hazard data from CGS web application

Fault zone, liquefaction zone, and landslide zone column string values were converted to 1, 0, or NA based on the following conditions:

If Liquefaction Zone:

  • LIES WITHIN a Liquefaction Zone = 1
  • NOT within a liquefaction zone = 0
  • NOT been EVALUATED by CGS for liquefaction hazards = NA

If Landslide Zone:

  • LIES WITHIN a Landslide Zone = 1
  • NOT within a landslide zone = 0
  • NOT been EVALUATED by CGS for seismic landslide hazard = NA

If Fault Zone:

  • LIES WITHIN an Earthquake Fault Zone = 1
  • NOT WITHIN an Earthquake Fault Zone = 0
  • Not been EVALUATED by CGS = NA
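The recoding above can be sketched with pandas as follows; the description-column name `liq_desc` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def recode_zone(text):
    """Map a CGS zone description to 1 (inside), 0 (outside), or NaN (not evaluated)."""
    t = str(text).upper()
    if "NOT BEEN EVALUATED" in t:
        return np.nan
    if "LIES WITHIN" in t:
        return 1
    return 0

df = pd.DataFrame({"liq_desc": [
    "LIES WITHIN a Liquefaction Zone",
    "NOT within a liquefaction zone",
    "NOT been EVALUATED by CGS for liquefaction hazards",
]})
df["liquefaction"] = df["liq_desc"].apply(recode_zone)
```

The same function applies to the landslide and fault zone description columns.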

After converting the features as explained above, the seismic hazard data were stored in CSV format; a sample is shown below.


Sample formatted seismic hazard data

Fire Hazard Severity Zone Data

Fire hazard data are available only as a PDF. Very little of San Jose falls within a fire hazard severity zone, so parcel numbers and addresses were extracted manually for just that area. Total features and attributes: 53 & 6. The fire hazard severity zone map and sample output data are shown below.


Very high fire hazard severity zones

Data Wrangling

Merging Dataframes

After collecting Zillow properties, seismic hazard and fire hazard data, the next challenge was merging all data frames. The following steps were used for merging:

  • Zillow property data and seismic hazard data were merged on the ‘address’ column, and a ‘fire hazard’ column was added to the result.
  • Fire hazard and seismic hazard data were merged on the ‘address’ column.
  • The first merged data frame was then appended to the second.
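The merge steps can be sketched as below with toy inputs; the frame and column names are assumptions, and `pd.concat` is used in place of the now-deprecated `DataFrame.append`.

```python
import pandas as pd

# Toy inputs standing in for the scraped data; the real frames have more columns.
zillow_df = pd.DataFrame({"address": ["1 Main St", "2 Oak Ave"], "price": [900_000, 1_100_000]})
seismic_df = pd.DataFrame({"address": ["1 Main St", "2 Oak Ave", "3 Pine Rd"], "liquefaction": [1, 0, 1]})
fire_df = pd.DataFrame({"address": ["3 Pine Rd"], "price": [800_000]})

# Step 1: merge Zillow properties with seismic hazard data on 'address',
# then flag these rows as outside the fire hazard zone.
zillow_seismic = zillow_df.merge(seismic_df, on="address")
zillow_seismic["fire_hazard"] = 0

# Step 2: merge the fire hazard parcels with seismic hazard data on 'address'.
fire_seismic = fire_df.merge(seismic_df, on="address")
fire_seismic["fire_hazard"] = 1

# Step 3: append one merged frame to the other.
merged = pd.concat([zillow_seismic, fire_seismic], ignore_index=True)
```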


Merged dataframe info

Handling Duplicates

Duplicates were dropped based on parcel number and address; of the 13,993 rows, 13,617 remained after removing duplicates.
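A one-line pandas sketch of that deduplication, with assumed column names:

```python
import pandas as pd

# Toy frame with one exact duplicate on the (parcel_number, address) pair.
df = pd.DataFrame({
    "parcel_number": ["A1", "A1", "B2"],
    "address": ["1 Main St", "1 Main St", "2 Oak Ave"],
    "price": [900_000, 900_000, 1_100_000],
})

# Keep the first occurrence of each parcel-number/address pair.
deduped = df.drop_duplicates(subset=["parcel_number", "address"])
```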

Correcting Formats

  • The price column was of type object instead of integer, since prices were reported as "SOLD: $ — M". The strip method was used to remove the text, and the M units were converted to dollars.
  • A few houses with APT, UNT, or # in the address column were listed as single-family homes. Their house type was changed to townhouse.
  • Sold prices ranged from 2016 to 2019. To normalize them, each sold price was adjusted to the current price based on Redfin's median sale price change over that period.
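The price-string cleanup can be sketched as below. The exact raw strings are assumptions (the "K" case is included only for illustration):

```python
import pandas as pd

# Hypothetical raw strings in the form Zillow reports them.
prices = pd.Series(["SOLD: $1.2M", "SOLD: $950K"])

def to_dollars(text):
    """Strip the 'SOLD:' prefix and convert M/K suffixes to integer dollars."""
    s = text.replace("SOLD:", "").strip().lstrip("$")
    if s.endswith("M"):
        return int(float(s[:-1]) * 1_000_000)
    if s.endswith("K"):
        return int(float(s[:-1]) * 1_000)
    return int(float(s.replace(",", "")))

dollar_prices = prices.apply(to_dollars)
```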


Median sale price change over time (San Jose)

Correcting Error in Data

  • A few properties had implausibly low or high sold prices. These were cross-checked against other websites such as Redfin and Trulia, and 36 properties had their sold price and sold date corrected.
  • A few properties listed under $100,000 were sold in non-arm's-length transactions, so their prices are not real market prices. They were removed by computing the percentage difference between the Zestimate and the adjusted sold price and dropping properties with a difference above 38%, a threshold chosen from the frequency distribution plot.
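That percentage-difference filter can be sketched as follows; the column names are assumptions:

```python
import pandas as pd

# Toy frame: the middle row mimics a non-arm's-length $50K "sale".
df = pd.DataFrame({
    "zestimate": [1_000_000, 1_000_000, 900_000],
    "adj_price": [950_000, 50_000, 880_000],
})

# Percentage difference between Zestimate and adjusted sold price.
df["pct_diff"] = (df["zestimate"] - df["adj_price"]).abs() / df["zestimate"] * 100

# Drop rows where the difference exceeds the 38% threshold.
cleaned = df[df["pct_diff"] <= 38]
```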

Correcting Data Type

The date_sold column was converted from object type to datetime.

Handling Missing Data

  • Missing year_built values were replaced with the median.
  • Properties with zero bedrooms (studios) were removed; only properties with at least one bedroom were kept.
  • Properties missing bedroom, bathroom, and sqft details were removed.
  • Properties missing sqft were dropped.
  • Missing bedrooms and bathrooms were forward-filled after sorting by sqft.
  • Missing lot sizes were filled with the median.
  • date_sold was split into month and year for further exploratory analysis.
  • All missing values were handled except Zestimate and days on Zillow.
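The median-fill and sqft-sorted forward-fill steps can be sketched as below (toy values, assumed column names):

```python
import pandas as pd

df = pd.DataFrame({
    "sqft": [900, 1200, 1250, 2000],
    "bedrooms": [2, None, 3, 4],
    "lot_size": [None, 5000, 6000, None],
    "year_built": [1960, None, 1975, 1980],
})

# Median imputation for year_built and lot_size.
df["year_built"] = df["year_built"].fillna(df["year_built"].median())
df["lot_size"] = df["lot_size"].fillna(df["lot_size"].median())

# Sort by sqft, then forward-fill bedrooms so a missing value inherits
# from the most similar (next smaller) house.
df = df.sort_values("sqft")
df["bedrooms"] = df["bedrooms"].ffill()
```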

Handling Outliers

  • A box plot of normalized price was plotted to look for outliers. One house listed above $30M was checked on Zillow, found to be erroneous data, and removed.
  • Similarly, a few houses above $4M (shown in the figure below) were cross-checked with Zillow; one was not a single-family residence and was removed.
  • Remaining outliers were removed with the interquartile range (IQR) method:
  • The interquartile range of the data was calculated and multiplied by 1.5.
  • Any value above Q3 + 1.5 × IQR was treated as a suspected outlier and removed.
  • Any value below Q1 − 1.5 × IQR was treated as a suspected outlier and removed.
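The IQR rule above amounts to a few lines of pandas (toy prices, including one $30M bad record):

```python
import pandas as pd

prices = pd.Series([500_000, 750_000, 900_000, 1_100_000, 1_300_000, 30_000_000])

# Quartiles and the 1.5 * IQR fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences.
filtered = prices[(prices >= lower) & (prices <= upper)]
```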


Box plot (adjusted price)

Exploratory Data Analysis

After collecting and wrangling the data, exploratory analyses were carried out to answer the following questions:

  • Geospatial visualization of natural hazard homes in single family and townhouse
  • Most common features (bedrooms, bathrooms, year_built, liquefaction, fault zone, landslide, fire hazard) for single family and townhouse
  • Distribution of number of houses with price bin for single family and townhouse
  • Median number of bedrooms, bathrooms, sqft with price bin for single family and townhouse
  • Liquefaction, fault zone, landslide, fire hazard with price bin in single family and townhouse
  • Sold price distribution for various zip codes, hazards and non hazards for single family and townhouse
  • Best month to list a home for sale, for single family and townhouse
  • Most popular zip codes by sales for single family and townhouse
  • Number of homes sold from 2016–2019 for single family and townhouse
  • Influence of natural hazard on sold price
  • Median price/sqft for hazard and non-hazard homes
  • Median price trend over construction year
  • Impact of liquefaction, landslides, fault zone, fire hazard on price of single family and townhouse
  • People’s comment on houses in hazard areas
  • Correlation plot between features and price
  • Influential features to predict house price

Where are the hazard-prone areas? Which hazards are most and least common in the city? How are they distributed?

Before going deep into the numerical analysis, geospatial analyses were done to observe how hazards are distributed and which are most and least common in the city. The Folium and Basemap packages were used to plot these maps. Here are the geospatial analysis results for single-family homes and townhouses.

Fault zone points are concentrated between the east San Jose flats and the mountain range.


Fault zone distribution map (single family homes)


Fault zone distribution map (town homes)

Landslide zone points are concentrated along the eastern mountain range.


Landslide distribution map (single family homes)


Landslide distribution map (town homes)

Liquefaction is the most common hazard and is widely distributed across the city.


Liquefaction distribution map (single family homes)


Liquefaction distribution map (town homes)

Liquefaction zones are the most common hazard in San Jose, and where they combine with landslide zones, the risk is even worse.


Liquefaction and landslide distribution map (single family homes)

Among the collected data points, no property had the combined hazards of liquefaction, landslide, and fault zone. (These are the collected points, not the entire distribution across San Jose.) After the geospatial analyses, the number of houses in each category of features was explored, using bar plots of grouped data to see the distribution.

Which number of bedrooms is most popular?

A bar plot of house count by number of bedrooms shows that three bedrooms is the most popular configuration for both single-family homes and townhomes.


House Price Prediction in Natural Hazard Prone Areas