Geographic Clustering with HDBSCAN

Your smartphone knows when you are at home or the office. At least, mine does, and it can even tell me when to leave to reach one of my usual destinations on time. We all accept that our smart devices collect information about our preferences and send it to the cloud for processing. The results come back as recommendations for shopping, food, mating, and when to leave the office and head home.

What is the magic behind inferring a usual location? How do the cloud computers go about finding the “places” where we live our lives? The answer involves collecting timestamped geographic locations and clustering them.

In this article, I will illustrate the process of geographic clustering using the HDBSCAN [1] algorithm and the Vehicle Energy Dataset. Please refer to my previous post, “The Medium-Sized Dataset,” where I present the dataset and the details of how to handle it with an SQLite database. The present article shares the same GitHub repository and builds upon it to add more features for geographic data analysis.

The clustering approach draws from another article named “Mapping the UK’s Traffic Accident Hotspots,” where I used DBSCAN to help create geofences around the most frequently reported traffic accident areas around the UK. Here I use an improved version of the famous density-based clustering algorithm to handle the complex clusters that naturally show up in urban environments and a very simple-minded but effective way of naming them using publicly available data.

The Vehicle Energy Dataset

In 2019 G. S. Oh, David J. Leblanc, and Huei Peng published a paper on the Vehicle Energy Dataset (VED) containing one year’s worth of vehicle trajectory and energy consumption data. These data were collected from November 2017 to November 2018 in Ann Arbor and refer to a sample of 383 vehicles of diverse types and power sources. Here, we will only use the geographic data present in the dataset and defer the analysis of energy dynamics to a future article.

The paper discusses an interesting approach to personal data de-identification and the study of use-cases related to fuel economy and energy efficiency. Most importantly, the survey collected over 374,000 miles of GPS data that we can use for this article’s geographic clustering purposes.

The study’s data collection process ensured driver anonymization through a relatively simple three-step approach. This process became quite relevant to this article because it produced, as a by-product, a critical piece of information: individualized vehicle trajectories. To anonymize the driver information, the study authors applied Random Fogging, Geo-fencing, and Major Intersection Bounding techniques. Random fogging removes observed locations near the start and the end of the trip, while geo-fencing clips observations outside a bounding box defined around the city’s boundaries. The authors also clipped trips around the first and last intersections. Besides driver anonymization, these procedures also have the benefit of producing individual trajectories that are readily usable.
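To make two of these steps more concrete, here is a minimal Python sketch of geo-fencing and a simplified form of random fogging. It is not the study authors' implementation: the function names `geofence` and `random_fog`, the bounding-box arguments, and the idea of dropping a random number of samples from each end of a trip are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def geofence(points: List[Point], south: float, west: float,
             north: float, east: float) -> List[Point]:
    """Keep only the points that fall inside the bounding box."""
    return [(lat, lon) for lat, lon in points
            if south <= lat <= north and west <= lon <= east]


def random_fog(points: List[Point], max_drop: int,
               rng: random.Random) -> List[Point]:
    """Drop between 0 and max_drop points from each end of a trip.

    Hypothetical simplification: the amount removed is a random
    sample count, chosen independently for the start and the end.
    """
    head = rng.randint(0, max_drop)
    tail = rng.randint(0, max_drop)
    return points[head:len(points) - tail]


# Illustrative trip; the coordinates and the box are made-up values.
trip = [(42.20, -83.80), (42.25, -83.75), (42.28, -83.74),
        (42.30, -83.70), (42.50, -83.50)]
clipped = geofence(trip, south=42.22, west=-83.80, north=42.33, east=-83.67)
fogged = random_fog(clipped, max_drop=1, rng=random.Random(42))
```

Whatever the exact rule, the effect is the same: the true origin and destination of a trip never appear in the published trajectory.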

I finished the previous article with an illustration of how to extract and display such trajectories. Here, I will use all the trajectories’ endpoints to identify places of interest for future analysis. Unfortunately, these will refer to the most used intersections in Ann Arbor, not the drivers’ final destinations themselves, as those are not present in the data. But this is enough to illustrate the process of building the geographic clusters defined by all the trajectories’ endpoints.
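As a rough sketch of what collecting the trajectories’ endpoints can look like, the Python function below groups GPS samples by trip and keeps the first and last location of each one. The row layout `(vehicle_id, trip_id, timestamp, latitude, longitude)` and the function name `trip_endpoints` are assumptions for illustration, not the repository's actual schema or code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (vehicle_id, trip_id, timestamp, latitude, longitude)
Row = Tuple[int, int, int, float, float]
Point = Tuple[float, float]


def trip_endpoints(rows: Iterable[Row]) -> List[Point]:
    """Return the start and end location of every trip."""
    trips: Dict[Tuple[int, int], List[Tuple[int, float, float]]] = defaultdict(list)
    for vehicle_id, trip_id, timestamp, lat, lon in rows:
        trips[(vehicle_id, trip_id)].append((timestamp, lat, lon))

    endpoints: List[Point] = []
    for key in sorted(trips):
        samples = sorted(trips[key])      # order the samples by timestamp
        _, start_lat, start_lon = samples[0]
        _, end_lat, end_lon = samples[-1]
        endpoints.append((start_lat, start_lon))
        endpoints.append((end_lat, end_lon))
    return endpoints
```

The resulting list of points is the kind of input a density-based clustering algorithm works on.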

Before we start, I invite you to clone the GitHub repository of this article and run the first four numbered notebooks. Each of the following sections details the next three Jupyter notebooks, leading you to the endpoint: cluster naming.
