Geospatial big data performance tests with Cosmos DB...

Several reasons can lead you to move away from traditional spatial storage and query engines such as Microsoft SQL Server, PostgreSQL or Oracle. Here we present the geospatial query capabilities and performance of an alternative to these well-established relational databases (RDBs), namely Azure Cosmos DB.

Limitations that you can encounter with RDBs occur under the following conditions:

- Scaling: when you deal with a large amount of data that outgrows a relational database.

- Query performance: when you need fast (short) response times on geospatial queries.

- Throughput: when you deal with fast-moving or streaming data, such as a huge number of incoming messages from IoT sensors or Digital Twins.

- Global reach: when your applications, users or incoming data come from different regions around the world.

In other words, these conditions arise when you deal with geospatial big data. One way to address them is Azure Cosmos DB, a service designed to scale throughput and storage elastically across any number of geographical regions. Here we focus on the performance of geospatial queries on big data within Azure Cosmos DB, demonstrated by query execution times on a large dataset, and show how to enrich data with Spark-based Azure Synapse Analytics.

Azure Cosmos DB

Azure Cosmos DB is Microsoft’s globally distributed, horizontally partitioned, multi-model database service. It offers service level agreements (SLAs) covering four dimensions: throughput, latency at the 99th percentile, availability, and consistency. Data resides in containers, which, depending on the database model used, are exposed as collections, tables, graphs, etc. Each container is also associated with a flexible unit of scale for transactions and queries. Data in a container is horizontally partitioned by a customer-specified partition key.
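To make the idea of horizontal partitioning by a partition key more concrete, here is a minimal sketch in Python. It is an illustration only: the hash function (SHA-256), the modulo assignment, the function name `partition_for` and the sample key values are all assumptions made for this example, and Cosmos DB's internal hashing and partition management are not described here.

```python
import hashlib


def partition_for(partition_key: str, partition_count: int) -> int:
    """Map a partition key value to a partition number.

    Hypothetical scheme for illustration: hash the key and take it
    modulo the number of partitions. Not Cosmos DB's actual algorithm.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Hypothetical documents; "deviceId" plays the role of the partition key.
documents = [
    {"id": "1", "deviceId": "sensor-a"},
    {"id": "2", "deviceId": "sensor-b"},
    {"id": "3", "deviceId": "sensor-a"},
]

# Documents sharing a partition key value always land on the same partition.
placement = {doc["id"]: partition_for(doc["deviceId"], 4) for doc in documents}
```

The point of the sketch is that the choice of partition key decides which documents end up together, which is why it matters for both scaling and query cost.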

Geospatial capabilities of Azure Cosmos DB

Azure Cosmos DB’s database engine is schema agnostic and provides support for JSON. The write-optimized engine understands spatial data represented in the GeoJSON standard and supports four spatial types (points, polygons, multi-polygons and line-strings). Query performance depends heavily on efficient indexing. The index works on a projected 2D plane and follows a two-step strategy: first, it applies a quadtree approach, progressively dividing the plane into cells; second, it maps these cells to a 1D index based on a Hilbert space-filling curve. The result is a strong and efficient index for both absolute position queries and relative nearest-neighbor searches.
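The two-step indexing strategy can be sketched in a few lines of Python. This is a simplified illustration, not Cosmos DB's implementation: the equirectangular mapping of longitude and latitude onto a square grid, the grid level, and the function names `cell_of`, `hilbert_index` and `index_point` are assumptions made for this example. The Hilbert mapping itself is the widely used iterative xy-to-distance algorithm.

```python
def cell_of(lon: float, lat: float, level: int) -> tuple[int, int]:
    """Step 1 (quadtree-style): locate the grid cell containing a point.

    Each level splits every cell into four, so level L gives a 2^L x 2^L grid.
    Assumes a plain equirectangular projection of lon/lat onto the plane.
    """
    n = 1 << level
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return x, y


def hilbert_index(n: int, x: int, y: int) -> int:
    """Step 2: map cell (x, y) on an n x n grid (n a power of 2) to its
    position along a Hilbert space-filling curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def index_point(lon: float, lat: float, level: int = 4) -> int:
    """Combine both steps: point -> grid cell -> 1D Hilbert index."""
    x, y = cell_of(lon, lat, level)
    return hilbert_index(1 << level, x, y)
```

Cells that are consecutive along the curve are always neighbors on the plane, so a range scan over the 1D index tends to touch points that are spatially close. That locality is what makes this combination attractive for position and nearest-neighbor queries.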

#big-data #azure-synapse-analytics

towardsdatascience.com
