Fast Geospatial Indexing with H3

In this article, I am following up on the material presented in two previous articles, namely on geospatial indexing with H3, and on how to query geographical data using the triangular inequality. I will use both methods to illustrate the implementation of two of the most useful geospatial queries: radius and KNN. We use these queries to retrieve locations from geospatial databases using reference points. The radius query returns all the points within a preset distance from the reference point, while the KNN query returns the K nearest ones.

My previous article on the triangular inequality discussed the geometric underpinnings of the indexing process and implemented a simple version of the radius query. While I hinted that the KNN was simple to implement, I left that as an exercise to the reader. In this article, I present a fully revised and complete implementation of both queries based on the triangular inequality.

While working at it, I wondered whether I could perform the same calculations using the H3 geospatial indexing method. Once hashed to their H3 indexes, all points are immediately and quickly indexable, but there is more to it. Due to its inherent geometry, the H3 hexagonal tiles have relevant properties that allow us to perform both the radius and KNN queries with impressive performance.

Let’s handle the unfinished business first, and then have a look at H3.

KNN and the Triangular Inequality

Let us start by completing the unfinished work of explaining how to implement KNN queries using the triangular inequality. We go back to the previous article’s GitHub repository and build it from there.

There are a few changes to note since the code was last published. The first change relates to how the calculation of the two foci. In the first incarnation of the Python code, these were hardcoded. Now, their locations depend on the centroid of the input locations, placed in the opposite hemisphere at a ninety-degree latitude difference. The foci themselves have a ninety-degree longitude difference, centered on the input locations’ centroid.

def calculate_sorted_distances(latitudes, longitudes, lat, lon):
    dist = gm.vec_haversine(latitudes, longitudes, lat, lon)
    idx = np.argsort(dist)
    return idx, dist[idx]

def __init__(self, locations: np.ndarray):
    self.lats = locations[:, 0]
    self.lons = locations[:, 1]

    min_lat, max_lat = self.lats.min(), self.lats.max()
    min_lon, max_lon = self.lons.min(), self.lons.max()

    h = gm.num_haversine(min_lat, min_lon, max_lat, min_lon)
    w = gm.num_haversine(min_lat, min_lon, min_lat, max_lon)

    self.density = locations.shape[0] / (w * h)

    if max_lat > 0:
        self.lat0 = self.lat1 = min_lat - 90
    else:
        self.lat0 = self.lat1 = max_lat + 90
    self.lon0 = (max_lon - min_lon) / 2 - 45
    self.lon1 = self.lon0 + 90

    self.idx0, self.sorted0 = calculate_sorted_distances(self.lats, self.lons, 
                                                         self.lat0, self.lon0)
    self.idx1, self.sorted1 = calculate_sorted_distances(self.lats, self.lons, 
                                                         self.lat1, self.lon1)

The data for each “spoke” consists of the array of sorted distances and the sorting indices for the original location data.

#towards-data-science #geographicdatascience #geospatial #programming #h3 #data science

KNN and the Triangular Inequality

towardsdatascience.com

Fast Geospatial Indexing with H3