California is well known for earthquakes, and it is overdue for a major one. In the 1989 Loma Prieta earthquake, the world witnessed what liquefaction can do, as blocks of the Marina District were reduced to rubble. Some properties sit directly on top of fault lines, like the ones damaged in South Napa that had fault traces running through them. When the ground actually splits during an earthquake, it can damage buildings and utility lines far beyond what shaking alone can do.
Apart from the earthquake hazard, the 2018 wildfires ravaged more than 1.2 million acres and destroyed more than 1,200 homes in California. The housing market in Northern California changed after that devastation.
Sellers of properties that are prone to “ground failure” are required by law to disclose that information to potential buyers, which could change a buyer’s decision. However, the information isn’t always easy to find until the transaction is already underway.
The task: given a set of property details and natural hazard attributes, predict the median house price in San Jose using Zillow property data combined with natural hazard data such as earthquake, landslide, and fire hazards. The main goal is to identify the attributes that contribute most to house prices in San Jose.
The initial challenge was obtaining the data, which was not readily available from Kaggle or similar data sites. After the data were extracted from various sources with different data wrangling techniques, the next challenge was combining the separate data frames into a single analyzable data frame using the Python merge and append methods. The following section discusses the data collection and data wrangling methods in detail.
Zillow property data for San Jose were collected by web scraping with Python, using packages such as Selenium and BeautifulSoup.
Natural hazard data such as fault zone, landslide, liquefaction, and fire hazard data were collected from the sources below:
The Zillow API limits how much data can be extracted, and it has no provision for downloading data by zip code or by city. Therefore, the following fields were scraped for single-family homes and townhouses, both sold and for sale (by agent, by owner, new construction, foreclosures, coming soon), across the zip codes in San Jose: latitude, longitude, address, zip, bedrooms, bathrooms, sqft, lot_size, year_built, price, sale_type, zestimate, date_sold, days_on_zillow, house_type, and url.
Python code using the Selenium and BeautifulSoup packages was written to scrape these Zillow fields for all zip codes in San Jose. Sample scraped data from Zillow is shown below.
Sample scraped data from Zillow
SQL queries were written to extract seismic hazard data from the California Geological Survey (CGS) web application. However, CGS limits each SQL query to 1,000 rows when all attributes are requested. To work around this, a Python program first fetched only the object IDs and then used those IDs to fetch the actual rows. Using the Python requests and json packages, all extracted data were stored in a CSV file. The resulting dataframe has 261,195 features (rows) and 7 attributes (columns).
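The two-step retrieval described above can be sketched roughly as follows. This is a minimal sketch and not the original script: it assumes the CGS service behaves like an ArcGIS-style REST query endpoint (the `returnIdsOnly`, `objectIds`, `outFields`, and `f` parameters), and `get_json` is a hypothetical callable that sends one query and returns the decoded JSON.

```python
import csv


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all_rows(get_json, chunk_size=1000):
    """Fetch every row by first listing object IDs, then querying in chunks.

    `get_json(params)` is a hypothetical callable that sends one query
    and returns the decoded JSON response as a dict.
    """
    # Step 1: ask only for the object IDs (assumed not subject to the row cap).
    id_response = get_json({"where": "1=1", "returnIdsOnly": "true", "f": "json"})
    object_ids = sorted(id_response["objectIds"])

    # Step 2: request the full rows, at most `chunk_size` IDs per query.
    rows = []
    for id_chunk in chunked(object_ids, chunk_size):
        response = get_json({
            "objectIds": ",".join(str(i) for i in id_chunk),
            "outFields": "*",
            "f": "json",
        })
        rows.extend(feature["attributes"] for feature in response["features"])
    return rows


def write_csv(rows, path):
    """Write a list of attribute dicts to a CSV file."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

With the requests package, `get_json` could be as small as `lambda params: requests.get(QUERY_URL, params=params).json()`, where `QUERY_URL` stands for the service's query endpoint, which is not given here.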
Seismic hazard zone map
Sample seismic hazard data from CGS web application
The string values in the fault zone, liquefaction zone, and landslide zone columns were converted to 0, 1, or NA based on the following conditions:
If Liquefaction Zone:
LIES WITHIN a Liquefaction Zone = 1
NOT been EVALUATED by CGS for liquefaction hazards = NA
NOT within a liquefaction zone = 0
If Landslide Zone:
LIES WITHIN a Landslide Zone = 1
NOT been EVALUATED by CGS for seismic landslide hazard = NA
NOT within a landslide zone = 0
If Fault Zone:
LIES WITHIN an Earthquake Fault Zone = 1
NOT WITHIN an Earthquake Fault Zone = 0
Not been EVALUATED by CGS = NA
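The conversion rules above can be expressed as one small function. This is a sketch under assumptions: the function name `encode_zone` is hypothetical, Python's `None` stands in for NA, and the matching relies only on the phrases listed above, so any other wording is treated as NA.

```python
def encode_zone(description):
    """Map a CGS zone description to 1, 0, or None (standing in for NA)."""
    if description is None:
        return None
    text = description.lower()
    # Check the negative phrases first: "not within" also contains "within".
    if "not been evaluated" in text:
        return None
    if "not within" in text:
        return 0
    if "lies within" in text:
        return 1
    return None  # unrecognized wording is treated as NA


examples = [
    "LIES WITHIN a Liquefaction Zone",
    "NOT within a landslide zone",
    "NOT been EVALUATED by CGS for liquefaction hazards",
]
encoded = [encode_zone(text) for text in examples]
```

On a pandas column, the same function could be applied with `Series.map(encode_zone)`.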
After the conversion described above, the seismic hazard data was stored in CSV format. A sample is shown below.
Sample formatted seismic hazard data
Fire hazard data is available in PDF format. Only a very limited area of San Jose falls within a fire hazard severity zone, so parcel numbers and addresses were extracted manually for that area alone. The resulting data set has 53 features (rows) and 6 attributes (columns). The fire hazard severity zone map and sample output data are shown below.
Very high fire hazard severity zones
After collecting the Zillow property, seismic hazard, and fire hazard data, the next challenge was merging all the data frames. The following steps were used for merging:
Merged dataframe info
Duplicates were dropped based on parcel number and address. Of the 13,993 rows, 13,617 remained after removing duplicates.
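For illustration, merging the three data sets and dropping duplicates could look like the sketch below. The sample rows, the column names, and the choice of join keys (address, then parcel number plus address) are assumptions made for the example and are not the original merge steps.

```python
import pandas as pd

# Hypothetical miniature versions of the three data sets.
zillow = pd.DataFrame({
    "address": ["1 A St", "2 B St", "2 B St", "3 C St"],
    "price": [900000, 750000, 750000, 1200000],
})
seismic = pd.DataFrame({
    "parcel_number": ["001", "002", "003"],
    "address": ["1 A St", "2 B St", "3 C St"],
    "liquefaction_zone": [1, 0, 1],
})
fire = pd.DataFrame({
    "parcel_number": ["003"],
    "address": ["3 C St"],
    "fire_hazard": [1],
})

# Left joins keep every Zillow listing even when no hazard record matches.
merged = zillow.merge(seismic, on="address", how="left")
merged = merged.merge(fire, on=["parcel_number", "address"], how="left")

# Parcels missing from the fire list are assumed to lie outside the zone.
merged["fire_hazard"] = merged["fire_hazard"].fillna(0).astype(int)

# Drop rows that repeat the same parcel number and address.
deduplicated = merged.drop_duplicates(subset=["parcel_number", "address"])
```

A left join is used here so that a listing is never lost just because one hazard source has no record for it.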
Median sale price change over time (San Jose)
The date sold column was converted from object type to datetime.
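A minimal sketch of that conversion with pandas is shown here. The sample values and the `%m/%d/%Y` format are assumptions, since the actual date strings scraped from Zillow are not shown.

```python
import pandas as pd

# Hypothetical sample; the real date strings scraped from Zillow may differ.
listings = pd.DataFrame({
    "date_sold": ["01/15/2019", "11/02/2018", "not available"],
    "price": [950000, 870000, 910000],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
listings["date_sold"] = pd.to_datetime(
    listings["date_sold"], format="%m/%d/%Y", errors="coerce"
)

# With a datetime column, prices can be grouped by month of sale.
sold = listings.dropna(subset=["date_sold"])
monthly_median = sold.groupby(sold["date_sold"].dt.to_period("M"))["price"].median()
```

Once the column is a true datetime, time-based grouping such as the monthly median above becomes straightforward.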
Box plot (adjusted price)
After data collection and wrangling, exploratory analyses were carried out. The following questions came to mind, and the exploratory analyses were aimed at answering them.
Before going deep into the numerical part of the data analysis, geospatial analyses were done to observe how hazards are distributed and which hazards are most and least common in the city. The Folium and Basemap packages were used to plot these maps. Here are the geospatial analysis results for single-family homes and townhouses.
Fault zone points are concentrated between east San Jose and the mountain range.
Fault zone distribution map (single family homes)
Fault zone distribution map (town homes)
Landslide zone points are concentrated along the mountain range on the east side.
Landslide distribution map (single family homes)
Landslide distribution map (town homes)
Liquefaction is the most common hazard and is widely distributed across the city.
Liquefaction distribution map (single family homes)
Liquefaction distribution map (town homes)
Liquefaction zones are the most common hazard in San Jose, and the risk is even worse where they combine with landslide zones.
Liquefaction and landslide distribution map (single family homes)
Among the collected data points, there were none where the fault zone, liquefaction, and landslide hazards all coincide. These maps show only the collected data points, not the complete hazard distribution across San Jose. After the geospatial analyses, the number of houses in each feature category was explored. Bar plots of the grouped data were drawn to see how the data are distributed.
The bar plot of house count by number of bedrooms shows that the most popular number of bedrooms is 3 for both single-family homes and townhomes.
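The grouped counts behind such a bar plot can be computed as in the sketch below. The sample rows are hypothetical and only illustrate the grouping; the column names follow the scraped Zillow fields.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the scraped Zillow fields.
homes = pd.DataFrame({
    "house_type": ["single family", "single family", "single family",
                   "townhouse", "townhouse", "townhouse"],
    "bedrooms": [3, 4, 3, 2, 3, 3],
})

# Count houses for each (house type, bedroom count) pair.
bedroom_counts = (
    homes.groupby(["house_type", "bedrooms"])
    .size()
    .unstack(fill_value=0)
)

# Most common bedroom count per house type.
most_common = bedroom_counts.idxmax(axis=1)
```

Calling `bedroom_counts.T.plot(kind="bar")` would then draw the grouped bars, provided matplotlib is installed.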