Using Scikit-learn’s Binary Trees to Efficiently Find Latitude and Longitude Neighbors

Bridging together sets of GPS coordinates without breaking your Python interpreter


Engineering features from latitude and longitude data can seem like a messy task that may tempt novices into writing their own apply function (or even worse: an enormous for loop). However, these brute-force approaches are pitfalls that unravel quickly as the size of the dataset increases.

For example, imagine you have a single dataset of n items. Explicitly comparing each of these n items against the n-1 others takes n(n-1) comparisons, which approaches n² as n grows. This means that each time the number of rows in your dataset doubles, the time it takes to find all nearest neighbors increases by roughly a factor of 4!
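To make the cost concrete, here is a minimal sketch of the brute-force approach using only the standard library. The helper names (haversine_km, brute_force_nearest) and the four city coordinates are illustrative assumptions, not part of the original guide.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def brute_force_nearest(points):
    """Compare every point against every other point.

    Returns the index of each point's nearest neighbor and the
    total number of distance computations performed.
    """
    comparisons = 0
    nearest = []
    for i, (lat_i, lon_i) in enumerate(points):
        best_j, best_d = None, math.inf
        for j, (lat_j, lon_j) in enumerate(points):
            if i == j:
                continue  # skip comparing a point with itself
            comparisons += 1
            d = haversine_km(lat_i, lon_i, lat_j, lon_j)
            if d < best_d:
                best_j, best_d = j, d
        nearest.append(best_j)
    return nearest, comparisons

# Illustrative (latitude, longitude) pairs in degrees:
# New York, Philadelphia, Los Angeles, San Francisco.
points = [
    (40.7128, -74.0060),
    (39.9526, -75.1652),
    (34.0522, -118.2437),
    (37.7749, -122.4194),
]

nearest, comparisons = brute_force_nearest(points)
print(nearest)      # nearest-neighbor index for each point
print(comparisons)  # n * (n - 1) distance computations
```

With n = 4 points this performs 4 × 3 = 12 distance computations. The same loop over 8 points would perform 8 × 7 = 56, and the ratio between successive doublings settles toward 4 as n grows.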

Fortunately, you do not need to calculate the distance between every pair of points. Scikit-learn ships with a few tree-based data structures that find neighbors efficiently by partitioning the points in space, so that most of the dataset can be ruled out without ever computing a distance to it.

They can be found within the neighbors module, and this guide will show you how to use two of these incredible classes, BallTree and KDTree, to tackle this problem with ease.
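As a quick preview, a nearest-neighbor lookup on latitude and longitude pairs might look like the sketch below. It assumes BallTree with the haversine metric, which expects [latitude, longitude] in radians and returns distances in radians; the city coordinates and the Earth radius of 6371 km are illustrative assumptions rather than values from this guide.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used to convert radians to km

# Illustrative [latitude, longitude] pairs in degrees:
# New York, Philadelphia, Los Angeles, San Francisco.
coords_deg = np.array([
    [40.7128, -74.0060],
    [39.9526, -75.1652],
    [34.0522, -118.2437],
    [37.7749, -122.4194],
])

# The haversine metric works on radians, so convert before building the tree.
coords_rad = np.radians(coords_deg)
tree = BallTree(coords_rad, metric="haversine")

# Querying the tree with its own points returns each point as its own
# closest match (distance 0), so ask for k=2 and keep the second column.
dist_rad, idx = tree.query(coords_rad, k=2)
nearest_idx = idx[:, 1]
nearest_km = dist_rad[:, 1] * EARTH_RADIUS_KM

print(nearest_idx)
print(nearest_km.round(1))
```

The tree is built once and can then answer many queries, which is where the savings over comparing every pair come from on larger datasets.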

Getting started

To begin, we load the libraries.

import numpy as np
from sklearn.neighbors import BallTree, KDTree

## This guide uses Pandas for increased clarity, but these processes
## can be done just as easily using only scikit-learn and NumPy.
import pandas as pd

Then we’ll make two sample DataFrames based on weather station locations that are publicly available from the National Oceanic and Atmospheric Administration.

towardsdatascience.com

Using Scikit-learn’s Binary Trees to Efficiently Find Latitude and Longitude Neighbors