In my last article, I talked about cleaning the data for geocoding. Let me talk about geocoders for a moment. There are many options out there. And some you might have heard of. Sure, Google has a very easy to use, well built geocoding service. And if you are processing thousands upon thousands of addresses, I would advise you to look into them. I will not be using Google though. The problem with well-known services like google is that they are limited. There is a free tier structure to them, and past that is a paid service. And once you learn the format, then they will charge you to do more with it. Not a bad thing, but I am concentrating on quick, low-cost solutions that allow you to quickly get something out without having to acquire additional licensing.

I use Nominatim which uses the OpenStreetMap geocoding service. The geocoding can be wonky at times, but it is certainly consistent. So setting up some code ahead of time that follows that consistency can set you up to quickly process requests and get XY coordinates for a list of addresses. Something that OpenStreetMap doesn’t like is apartment numbers, or anything trailing road names. So this code I will walk you through will help remove those from your column. I hope you like python dictionaries because this will be a good practice in them.

This first section defines a function called rem_apt that takes in a Series and splits it into a list of strings. Then a temporary dictionary is created from the list. This gives us a key to work with for each value. Now I can get a count of how many strings are in the service address road name. Lastly, I create an empty list to store the keys to be removed.

Line 6 loops through the dictionary items AND loops through a variable called road_types. This was a csv that I created from the USPS list of road types found here. I copied and pasted the information into a csv file, and kept only the “Postal Service Standard Suffix Abbreviation” column. That was imported into my code as a dataframe of strings. These will be the road names I will look for and remove all characters after I found it. Again, consistency is key so know your data. I know that the road names in my data end in abbreviations. While those that use road names as part of the road do not abbreviate. Examples are “Avenue J”, “Highway 9”, and “County Road 72".

Starting at line 8, I check to see if any of the road names(item) are in the dictionary of strings(value) created from the service address road name. If it does find it, then continue on to check which value matches the item it found. I then set the rem_key value as the key +1 and make sure it is an integer.

Now I loop through the rest of the values to add each key after the found value to the rep_all_after list. I add a plus to the key until I get to the end of the dictionary. This stores the keys outside of this vested loop to be used next.

Now I can use that list of keys to pop(remove) all of the values preceding the road name out of the dictionary of strings. Finally returning a string of the remaining values.

This last bit of code iterates through the original cust_add_db with addressing to replace the rd value with the cleaned rd value provided by the rem_apt function.

#data-analysis #data-science #data #data-visualization

Geocoding on the Fly: Cleaning Apartment Numbers
1.10 GEEK