Analyzing Geospatial Data by Puppies & Python

So you’re interested in data that has a geospatial component? Cool! Me too!

When I started my first project rooted in geospatial data, I did not realize the level of challenge I was taking on — not quite biting off more than I could chew, but definitely a mouthful for where I was at the time.

This is image title

photo credit: Chris Martin on Flickr

My quick google search lead me, and probably you, straight to [geopandas](http://geopandas.org/).

This is image title

This plus another 600+ lines of package conflicts… including NumPy.

I had a lot of package conflicts with the geopandas installation originally, including numpy and pretty much everything I had in my project environment. I then followed their directions and installed geopandas first thing in a new environment on the conda-forge channel. Worked like a charm!

In today’s example, I am doing some research for my next move: where can I find the highest density of pets in Seattle? I want to ogle all the pets as I walk, skip, and yog around my future neighborhood. For this, I downloaded files from the City of Seattle’s Open Data on pet licenses and shape data on the zip codes, municipal boundaries, and shoreline (so we can see the outline of Seattle we know and love).

This is image title
Photo Credit: Joshua T. Beck [CC BY-SA 4.0] on [wikimedia commons]

Take away #1: A shapefile is not just one file

When I downloaded and unzipped my first shapefile, I only moved the .shp file into my directory, thinking, “maybe these other files work with different programs.” But no.

Even though whatever package you’re using to read shape files will likely only require a single .shp file path, they need the whole file set in that same directory in order to work:

  • .shp — actual shape file — this is the geometry of your shapes
  • .shx — shape index format — helps to quickly move through the shapes
  • .dbf — this has other information about the shapes (again, it’s required, even if you don’t need it for your purpose. It’s frequently helpful with reference information about each shape, like the name.)

Sometimes there are other files too. Check out this article for more info. Now, instead of worrying too much about which I want or need, I just unzip the set in my target directory. Then you know you’ll have whatever good stuff your data source provided.

Also note: all the files must have the same filename, with the only difference being the extension/suffix. So if you like to change file names to remove spaces, capital letters, etc (just a recovering perfectionist thing?), make sure that you change the names for ALL of the files that came in the package, and watch out for typos.

When you’re thinking about a shapefile, always add that trailing s — shapefiles.

In the search for max pet density, it was just a couple lines of code to get me started with the shapes in GeoDataFrames:

import geopandas as gpd
# load zip code shape data
zip_codes_gdf = gpd.read_file('data/shapes/Zip_Codes.shp')
# load seattle shape
cities = gpd.read_file('data/shapes/Municipal_Boundaries.shp')
seattle = cities.loc[cities['CITYNAME'] == 'Seattle']
seattle = seattle.reset_index()
seattle_shp = seattle.loc[0, 'geometry']
# load seattle waterfront shape
waterfront = gpd.read_file('data/shapes/
                            City_of_Seattle_Shoreline.shp')

This is image title

Note here that seattle_shp is a shapely Polygon and calling it in a Jupyter notebook displays an image of the shape. Alternatively, waterfront is a GeoDataFrame — so you can visualize it with waterfront.plot(). Keeping track of object types will save you headaches and time! Or at least is a good place to start when something isn’t working as you expected.

First off, I noticed that the zip code file on Seattle Open Data actually has the shape files for zip codes in several Washington counties. I needed to whittle down the zip codes to only those in Seattle. I quickly realized that the zip code boundaries in the Seattle area do not align with the city boundaries.

Here’s where geopandas’s wrapping of all those other packages, including shapely, comes in handy (even if it makes it a struggle to install).

Take away #2: Shapely is your friend

In order to combine data from different sources and get exactly the data you want, I have found the shapely functions to be extremely useful. Conveniently, they work great on GeoDataFrames, computing on each line and with a single line of code (just call geo_df.**shapely_function**()). Some of the [shapely](https://shapely.readthedocs.io/en/stable/manual.html) functions are described in the [geopandas](http://geopandas.org/) docs, but geopandas does not include documentation on all the shapely functions. I was pleasantly surprised when the additional functions described in the shapely docs, but not in geopandas worked as well, especially intersection!

Within:

object.**within**(_other)_ — Returns True if the object’s boundary and interior intersect only with the interior of the other (not its boundary or exterior). This applies to all types and is the inverse of [**contains()**](https://shapely.readthedocs.io/en/stable/manual.html#object.contains).*

So I tried it out. Here, zip_codes_gdf.within(seattle_shp) returns a series of True and False, one for each record in the GeoDataFrame. This series, used with .loc creates a new GeoDataFrame containing only the records (zip codes and their shapes) that are wholly within Seattle.

seattle_zips_within = zip_codes_gdf.loc[zip_codes_gdf.within(seattle_shp)]

Next, I plotted my result, along with the Seattle outline and waterfront.

fig, ax = plt.subplots(figsize=(10,10))
seattle.plot(ax=ax, color='white', edgecolor='black', linewidth=2)
waterfront.plot(ax=ax, color='blue')
seattle_zips_within.plot(ax=ax, cmap='Paired', alpha=.8)

This is image title

Zip codes that cross into other cities beyond Seattle were excluded from this selection process using the within() function. This function would be great if the zip codes lined up with the city boundaries. Unfortunately, this is not the case, so I continued my search for a better zip code selection function.

Intersects

object.**intersects**(_other_) — Returns True if the boundary or interior of the object intersect in any way with those of the other. In other words, geometric objects intersect if they have any boundary or interior point in common.*

seattle_zips_intersects = zip_codes_gdf.loc[zip_codes_gdf.intersects(seattle_shp)]
fig, ax = plt.subplots(figsize=(10,10))
waterfront.plot(ax=ax, color='blue')
seattle_zips_intersects.plot(ax=ax, cmap='Paired', alpha=.8)
seattle.plot(ax=ax, color='#00000000', edgecolor='black', 
linewidth=2)

This is image title

Now we’re on the right track, with all of Seattle represented. Now I want to take out all that area outside the city limits. If I calculate the pet density using the number of pets registered in Seattle city limits and the whole (Seattle and non-Seattle) land area for these trans-border zip codes, this will under-represent the pet density in these Seattle neighborhoods.

This is image title
Photo Credit: George Hodan [CC0 Public Domain] from PublicDomainPictures

So I looked for another shapely function.

Intersection

object.**intersection**(_other_) — Returns a representation of the intersection of this object with the other geometric object.*

In other words, it returns a shape of the area where the two objects overlap.

seattle_zips_intersection = zip_codes_gdf.intersection(seattle_shp)
fig, ax = plt.subplots(figsize=(10,10))
waterfront.plot(ax=ax, color='black')
seattle_zips_intersection.plot(ax=ax, cmap='Paired', alpha=.8, 
                               edgecolor='gray')
seattle.plot(ax=ax, color='#00000000', edgecolor='black', 
             linewidth=2)

This is image title

I’m not yet sure what’s going on with all the pink in this map, but this is starting to look good! With the zip codes trimmed to the area within Seattle, as far as mapping goes, I’m ready to get rid of the Seattle outline marking the center of Hood Canal, or wherever Seattle technically ends, in favor of the waterway and zip code outlines.

With the map shapes prepped, I’m ready to put that pet data to use!

Take away #3: Geopandas.plot() uses pretty much all the argument’s you know and love from matplotlib.pyplot

First, I added the new intersection geometries determined in the last step to my GeoDataFrame. Note: Geopandas always plots the “geometry” column by default to plot your shape or complete shape-based calculations like those used above. You can use geo_object.**set_geometry**(_‘column_name')_ to switch which column geopandas will use as the geometry of interest. In this case, I decided to start afresh with a new, paired down GeoDataFrame instead — just the Seattle zip codes with their intersection geometry and area.

# add intersection geometry to my dataframe
zip_codes_gdf['intersection'] = seattle_zips_intersection
# create dataframe with the seattle in-city geometries
seattle_zip_geom = zip_codes_gdf.loc[:, ['ZIP', 'intersection']]
# name the geometry column 'geometry' for easy geopandas plotting
seattle_zip_geom.columns = ['ZIP', 'geometry']
# calculate the in-seattle area for each zipcode and add as a column
seattle_zip_geom['area'] = seattle_zip_geom.area
# drop zip codes that have no area in seattle
seattle_zip_geom = seattle_zip_geom.loc[seattle_zip_geom['area'] > 
                                        0]

To this, I added the pet counts and pet density, using just the intersection area to calculate the density.

# create dataframe with Seattle zips and their respective pet counts
seattle_pets_zip = seattle_zip_geom.merge(pet_counts, on='ZIP')
# add pet_density column
seattle_pets_zip['pet_density'] = (seattle_pets_zip['pets'] / 
                                   seattle_pets_zip['area'])

To color the map based on the column 'pet_density', set the column to the column name in the call to plot. This kind of map, colored based on a data value, is called a choropleth (vocab I wish I knew before… googling for help works better with more precise key words…)

fig, ax = plt.subplots(figsize=(10,10))
waterfront.plot(ax=ax, color='black')
seattle_pets_zip.plot(column='pet_density', ax=ax, edgecolor='gray')

This is image title

Ugh! All the zip codes look the same!

Then, I noticed that two zip codes (98146, 98168) had a much larger pet density, even though they weren’t visible at my map’s scale.

To investigate these locations more closely, I mapped them on the city streets using folium:

import folium
import geopandas as gpd
def create_geo_map(geo_df, index, location, zoom=17, 
                   tiles='Stamen Terrain'):
  """creates a folium map with the geometry highlighted
     
  Args:
    geo_df: a geopandas dataframe (must have a column 'geometry'
    index: the index of the row you want mapped
    location: [lat, long] list with the center location for the map
    zoom: start zoom level. Default: 17
    tiles: Folium tiles / design. Default: Stamen Terrain
  Returns:
    folium map with geometry highlighted
  """
  # create geodataframe for zip code
  zip_code = gpd.GeoDataFrame(geo_df.iloc[index]).T
  # get the geojson for the zipcode
  gjson = zip_code['geometry'].to_json()
  # create folium map
  m = folium.Map(location=location, tiles=tiles, zoom_start=zoom)
  # add geojson to the map
  folium.GeoJson(gjson, name='geojson').add_to(m)
  
  return m

Looking at these two zip codes:

This is image title

The first one is a single block in Downtown Seattle, next to the library, that does contain a high rise condo complex. I’m not sure why this block has its own zip code, but oh well. I can believe that there are 35 pets registered here.

This is image title

The second one is more of a mystery. This tiny section of the much larger zip code is an industrial area near the Boeing Field Airport. There is a Supervalue, but no visible housing on Google Maps satellite view (yes, I attempted to stalk these 35 pets— feel free to check it out yourself. For now, my working assumption is these are emotional support animals for the Supervalue workers.)

This is image title
Example emotional support animal. Photo Credit: Kevin Honold, used with permission

Upon further investigation, zip code 98168 has some areas in unincorporated King County, where the addresses might be labeled “Seattle”, but are not technically within Seattle city limits. I wonder if some non-Seattle resident pet owners in this unincorporated area registered their pets with the city, not realizing that they are technically not city residents. You would think that the city wouldn’t take registrations for non-city addresses… it remains a mystery.

Anyways, I chose to drop these two zip codes so that the larger zip codes’ pet densities are easier to distinguish. I also added in the legend, with the assist of the geopandas docs.

# adding a legend using code from http://geopandas.org/mapping.html
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, ax = plt.subplots(figsize=(10,10))
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)
waterfront.plot(ax=ax, color='black')
seattle_pets_zip_dropped.plot(column='pet_density', ax=ax, 
                              legend=True, cax=cax);

This is image title

So now I have a good idea of the most and least desirable neighborhoods for pet watching! I do have some lingering questions: What if we normalize for population density? What type of pets are we talking about — would this map look different for cats/dogs/goats/pigs (yes goats and pigs are registered with the city!)? But these questions will have to wait for another day…coming soon to a data science blog near you.

This is image title
Are these your neighbors? Photo Credit: Kevin Honold, used with permission

Parting tip: use plt.savefig('fig_name.png') to save your image before displaying the figure to avoid saving a blank canvas. For example, all in one jupyter notebook cell:

fig, ax = plt.subplots(figsize=(10,10))
waterfront.plot(ax=ax, color='blue')
plt.savefig('images/seattle_zips_intersects.png');

You’ll still get the image displayed and saved.

You can see the full code on my GitHub repo here.

#python #tensorflow

Analyzing Geospatial Data by Puppies & Python
8.55 GEEK