Introduction

Storing and retrieving data from JSON fragments is a common need in many application scenarios, like IoT solutions or microservice-based architectures. You can persist these fragments in a variety of data stores, from blob or file shares to relational and non-relational databases, and there’s a long-standing debate in the industry about which database technology fits this task “better”.

Azure SQL Database offers several options for parsing, transforming and querying JSON data, and this article doesn’t claim to provide a definitive answer to that debate. Rather, it explores these options for common scenarios like loading and retrieving data, and benchmarks the results to give a clear indication of how Azure SQL Database performs when manipulating JSON data.

Our test bench

To reproduce a realistic scenario, we started from a JSON fragment representing a real telemetry message from a machine on a factory floor, containing ~60 attributes for an average size of 1.6 KB.
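As an illustrative stand-in (attribute names and values invented here, and far fewer than the real ~60), such a message could look like this, declared as a T-SQL variable we’ll reuse in the loading sketches below:

```sql
-- Illustrative stand-in for the telemetry message used in the tests;
-- attribute names and values are invented, the real fragment has ~60 attributes.
DECLARE @json nvarchar(max) = N'{
    "deviceId": 42,
    "plantId": "MilanNorth",
    "eventTimestamp": "2020-06-01T10:00:00.000Z",
    "temperature": 25.1,
    "pressure": 101.325,
    "vibration": 0.002,
    "spindleSpeed": 1200,
    "status": "RUNNING"
}';
```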

We then identified two options for table structures in which to persist these JSON messages and run the various tests.

The first one is storing plain JSON fragments in an nvarchar(max) column, an approach that customers who don’t want to deal with a relational model tend to adopt.
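A minimal sketch of such a table could look like this (table and column names are ours, not necessarily those used in the original tests; the ISJSON check is optional but keeps the column well-formed):

```sql
-- Plain-JSON option: the whole message lands in a single nvarchar(max) column.
CREATE TABLE dbo.TelemetryJson
(
    MessageId   bigint IDENTITY(1,1) NOT NULL,
    MessageData nvarchar(max) NOT NULL
        CONSTRAINT CK_TelemetryJson_IsJson CHECK (ISJSON(MessageData) = 1)
);
```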

The second approach we wanted to test is instead “shredding” all attributes of the JSON document into specific columns of a relational table.
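Again as a sketch, with a few representative typed columns standing in for the full ~60:

```sql
-- Schematized option: one typed column per JSON attribute.
CREATE TABLE dbo.TelemetryShredded
(
    MessageId      bigint IDENTITY(1,1) NOT NULL,
    DeviceId       int NOT NULL,
    PlantId        varchar(50) NOT NULL,
    EventTimestamp datetime2(3) NOT NULL,
    Temperature    decimal(9,4) NULL,
    Pressure       decimal(9,4) NULL,
    Vibration      decimal(9,4) NULL,
    SpindleSpeed   int NULL,
    Status         varchar(20) NULL
    -- ...remaining attributes as additional typed columns
);
```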

Loading data

The first scenario we want to test is how efficiently we can load one or many documents generated by an application into these two table structures. To eliminate all variability related to the application implementation and potential latency in database interactions, we dynamically generate row batches of various sizes (1, 1000, 10000, 100000, 3M rows) on the server side, randomizing some attribute values while inserting into the target table through a single INSERT…SELECT statement.
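A sketch of what that server-side generator could look like for the plain-JSON table; the row source and the single randomized attribute are our assumptions, and the real script may randomize more values:

```sql
-- Generate a synthetic batch server-side and load it with one INSERT…SELECT.
DECLARE @batchSize int = 1000;
DECLARE @json nvarchar(max) = N'{"deviceId":42,"temperature":25.1,"pressure":101.325}'; -- trimmed template

;WITH numbers AS
(
    -- cheap row source: a cross join of system views yields plenty of rows
    SELECT TOP (@batchSize) 0 AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.TelemetryJson (MessageData)
SELECT JSON_MODIFY(@json, '$.temperature',
                   CAST(RAND(CHECKSUM(NEWID())) * 100.0 AS decimal(9,4)))
FROM numbers;
```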

For the schematized table we take a similar approach to generate the synthetic rowset, but we use the OPENJSON function to automatically shred all JSON attributes into table columns, again inserting all rows with a single INSERT…SELECT statement.
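A corresponding sketch for the schematized table, with OPENJSON’s explicit schema listing just a few of the ~60 attributes:

```sql
-- Shred the JSON template into typed columns with OPENJSON while inserting.
DECLARE @batchSize int = 1000;
DECLARE @json nvarchar(max) =
    N'{"deviceId":42,"plantId":"MilanNorth","eventTimestamp":"2020-06-01T10:00:00","temperature":25.1,"pressure":101.325}';

;WITH numbers AS
(
    SELECT TOP (@batchSize) 0 AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.TelemetryShredded (DeviceId, PlantId, EventTimestamp, Temperature, Pressure)
SELECT j.deviceId,
       j.plantId,
       j.eventTimestamp,
       CAST(RAND(CHECKSUM(NEWID())) * 100.0 AS decimal(9,4)), -- randomized, as in the JSON variant
       j.pressure
FROM numbers
CROSS APPLY OPENJSON(@json)
WITH
(
    deviceId       int          '$.deviceId',
    plantId        varchar(50)  '$.plantId',
    eventTimestamp datetime2(3) '$.eventTimestamp',
    pressure       decimal(9,4) '$.pressure'
) AS j;
```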

For each table, we then measured loading these row batches with different indexing strategies (index sketches follow the list):

  1. Heap (no clustered index)
  2. Clustered index
  3. Clustered Columnstore (trickle: individual inserts wrapped in a BEGIN/COMMIT transaction)
  4. Clustered Columnstore (bulk inserts)
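As a sketch, the non-heap variants could be created like this (index names are ours):

```sql
-- #2: rowstore clustered index on the plain-JSON table
CREATE CLUSTERED INDEX IX_TelemetryJson_MessageId
    ON dbo.TelemetryJson (MessageId);

-- #3/#4: clustered columnstore index (replacing the rowstore index)
DROP INDEX IX_TelemetryJson_MessageId ON dbo.TelemetryJson;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_TelemetryJson
    ON dbo.TelemetryJson;
```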

For #3, it’s important to batch multiple individual insert operations and wrap them in an explicit transaction, to mitigate both the latency between the application and database layers and the latency introduced by the individual transaction log writes that each autocommitted insert would otherwise incur.
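A minimal sketch of that trickle pattern (loop size and table are illustrative):

```sql
-- Wrap many singleton inserts in one explicit transaction: the log is
-- hardened once at COMMIT instead of once per autocommitted statement.
DECLARE @json nvarchar(max) = N'{"deviceId":42,"temperature":25.1}'; -- trimmed template
DECLARE @i int = 0;

BEGIN TRANSACTION;
WHILE @i < 1000
BEGIN
    INSERT INTO dbo.TelemetryJson (MessageData) VALUES (@json);
    SET @i += 1;
END;
COMMIT TRANSACTION;
```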

This topic has been extensively covered in another article comparing and contrasting all batching options. Having multiple writers can also create concurrency issues that vary depending on the combination of indexing and the rows already present in the table: that article describes in detail the impact on concurrency and performance of the various options.

With Columnstore indexes, for example, each thread loads data exclusively into its own rowset by taking an X lock on it, which allows parallel data loading from concurrent sessions. It’s worth mentioning, though, that in typical IoT scenarios like the one described here, events and messages are usually sent to an event store like Azure IoT Hub or Kafka, and then processed by a relatively small number of event processors that execute the data loading, so concurrency tends to be less of an issue.

We then repeated our data loading tests with different Azure SQL Database instance sizes, namely Business Critical with 2, 16 and 40 vCores, to verify whether the higher “Max log rate” limit of larger instances impacts data loading performance.

Loading times for all options are represented in the following matrix:

Cells with a green background represent the best loading times for different batch sizes in each scale tier. As we can see, bulk loading plain JSON into a Clustered Columnstore index is around 3x faster than the schematized approach we tested, across pretty much all batch sizes.

Comparing the various indexing strategies within each data model instead, the differences between bulk loading into a heap, a rowstore clustered index or a clustered columnstore index are much more limited, in the range of 20-30%.

It’s also interesting to see the level of compression that Clustered Columnstore achieves for both the plain JSON and the schematized approach (from ~5 GB down to 204 MB and 122 MB respectively). In IoT scenarios where you have to store billions of messages per month, this can be a significant cost saving compared to many other database technologies.
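If you want to check the footprint on your own tables, sp_spaceused is one quick way to do it:

```sql
-- Report reserved, data and index sizes for both tables after loading.
EXEC sp_spaceused N'dbo.TelemetryJson';
EXEC sp_spaceused N'dbo.TelemetryShredded';
```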

Scaling from 2 vCores to 40 vCores only provides a ~20% performance improvement because, for this particular test, we execute the bulk load on a single connection and don’t max out resource consumption in any particular area (CPU, IO or memory). If we executed multiple bulk loading operations in parallel across many connections, we would clearly see a different picture.
