When it comes to data, let’s start with the obvious. Averages suck. As developers, we all know that percentiles are much more useful. Metrics like P90, P95, and P99 give us a much better indication of how our software is performing. The challenge, historically, has been how to track the underlying data and calculate the percentiles.

Today I will show you how amazingly easy it is to aggregate and create SQL-based percentile rollups with PostgreSQL and t-digest histograms!

The Problem: Percentiles Require Big Data

At Stackify, we deal with big data. We help developers track the performance of thousands of applications and billions of data points daily with Retrace. We track how long every transaction in your application takes to load, plus every SQL query, web service call, browser load timing, and more.

The problem is that to properly calculate percentiles, you need all the raw data. So if your app runs on 100 servers and each server handles 100 transactions a second, over the course of the month you need roughly 30,000,000,000 data points (100 servers × 100 transactions per second × about 2.6 million seconds in a month) to calculate a monthly percentile for a single performance metric.

What you can’t do is take percentiles every minute or hour from individual servers and then average them together. Averaging percentiles is a terrible idea and simply doesn’t work.
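
To see the trap in SQL terms, here is a minimal sketch using PostgreSQL’s built-in percentile_cont aggregate, against a hypothetical requests table with server_id and duration_ms columns:

-- True P95, calculated from all of the raw data across every server
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS global_p95
FROM requests;

-- The tempting shortcut: calculate a P95 per server, then average them
SELECT avg(p95) AS avg_of_p95s
FROM (
    SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95
    FROM requests
    GROUP BY server_id
) per_server;

The two numbers will generally disagree whenever the servers see different traffic volumes or latency distributions, because the percentile of a combined data set cannot be reconstructed from the percentiles of its parts.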

Histograms: How to Track Data for Calculating Percentiles

Let’s start with the basics. How do we track 30,000,000,000 data points just to calculate one simple P99 number for your custom dashboard?

The way to do this is with a data science technique that goes by many names, most usually histograms, data sketches, buckets, or binning. Other algorithms exist as well, but at the end of the day, they are all based on sampling and approximating your data.

A histogram is the most common term. Here is how Wikipedia describes it:

“To construct a histogram, the first step is to ‘bin’ (or ‘bucket’) the range of values - that is, divide the entire range of values into a series of intervals - and then count how many values fall into each interval.”

So basically you can create 100 “buckets” and sort those 30 billion data points into those buckets. Then you can track how many are in each bucket and the sum of the values in each bucket. Based on that information, you can approximate things like a median, P90, or P95 with a pretty high degree of accuracy.
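
As a rough sketch of the idea, PostgreSQL’s built-in width_bucket function can do the binning. This assumes the same hypothetical requests table with a duration_ms column and a fixed 0 to 10,000 ms range:

-- Sort every duration into one of 100 fixed-width buckets
SELECT width_bucket(duration_ms, 0, 10000, 100) AS bucket,
       count(*)         AS items_in_bucket,
       sum(duration_ms) AS sum_of_values
FROM requests
GROUP BY bucket
ORDER BY bucket;

To approximate P95 from a rollup like this, you walk the buckets in order until the running count passes 95% of the total. Fixed-width buckets are the weak spot of this naive approach, and that is exactly the part t-digest improves on.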

Introduction to t-digest for Calculating SQL Percentiles

At Stackify, we are always looking at different ways to ingest and calculate performance data. We recently ran across a very cool PostgreSQL extension called tdigest that implements t-digest histograms. Ted Dunning originally wrote a white paper on T-Digest back in 2013. It has slowly grown in popularity since then.

T-Digest is a high-performance algorithm for calculating percentiles. There are implementations of it in many languages, including Node.js, Python, Java, and, most importantly, PostgreSQL.
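
Getting a percentile out of the PostgreSQL extension is a one-liner. This sketch assumes the tdigest extension is installed on the server and uses its tdigest_percentile aggregate against the same hypothetical requests table:

CREATE EXTENSION IF NOT EXISTS tdigest;

-- P95 straight from the raw values; the second argument (100)
-- is the compression setting for the digest
SELECT tdigest_percentile(duration_ms, 100, 0.95) AS p95_ms
FROM requests;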

I will walk you through the high-level basics of how it works to give you a general understanding.

T-Digest works by dynamically calculating “centroids.” Think of these like buckets, but they are basically key data points spread across your data. As you add the first data points, it dynamically evaluates what the centroids should be and adapts as you continue to add more data. It’s a little magical.
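
In the PostgreSQL extension, the compression parameter is what caps how many centroids a digest will keep. As a small sketch, assuming the extension’s tdigest aggregate, you can build one of these digests straight from the raw values:

-- Build a single digest from the raw durations; a compression of 100
-- limits the digest to roughly 100 centroids
SELECT tdigest(duration_ms, 100) AS digest
FROM requests;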

Here is an example of what a t-digest looks like:

flags 0 count 37362 compression 100 centroids 51 (0.164000, 1) (0.165000, 1) (0.166000, 1) (0.166000, 1) (0.167000, 1) (0.504000, 3) (0.843000, 5) (1.185000, 7) (2.061000, 12) (1.915000, 11) (3.437000, 19) (7.813000, 40) (11.765000, 57) (15.448000, 72) (24.421000, 109) (49.816000, 211) (88.728000, 346) (147.814000, 538) (260.275000, 907) (420.212000, 1394) (679.826000, 2153) (854.042000, 2577) (1495.861000, 3815) (3435.648000, 5290) (3555.114000, 4491) (3366.077000, 4198) (3474.402000, 3748) (2631.066000, 2593) (1809.314000, 1773) (980.488000, 956) (1692.846000, 781) (106168.275000, 473) (166453.499000, 233) (168294.000000, 211) (87554.000000, 109) (59128.000000, 73) (42188.000000, 49) (28435.000000, 29) (20688.000000, 21) (14902.000000, 15) (11462.000000, 11) (9249.000000, 8) (5832.000000, 5) (4673.000000, 4) (3511.000000, 3) (2345.000000, 2) (1174.000000, 1) (1174.000000, 1) (1174.000000, 1) (1174.000000, 1) (1176.000000, 1)

In this example, I have 37,362 data points spread across 51 centroids, with a max of 100 centroids. Each centroid is a pair: the sum of the values in that bucket and how many items are in the bucket. So something like (3435.648000, 5290) means there are 5290 data points in that bucket, they add up to 3435.648, and they average about 0.649 each.

Based on these centroids, the t-digest library can calculate any percentile across these 37,362 data points almost instantly.
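
On the SQL side, this is what makes rollups practical: the digest is an ordinary value you can store now and merge later. Here is a minimal sketch, assuming a hypothetical hourly rollup table, a created_at column on requests, and the extension’s tdigest_percentile overload that accepts pre-aggregated digests:

-- One digest per hour instead of millions of raw rows
CREATE TABLE request_rollups (
    hour   timestamptz,
    digest tdigest
);

INSERT INTO request_rollups
SELECT date_trunc('hour', created_at) AS hour,
       tdigest(duration_ms, 100)      AS digest
FROM requests
GROUP BY 1;

-- Merge the stored hourly digests and read a monthly P99 off the result
SELECT tdigest_percentile(digest, 0.99) AS monthly_p99_ms
FROM request_rollups
WHERE hour >= date_trunc('month', now());

The heavy lifting happens once per hour, and the monthly percentile query only ever touches a few hundred stored digests instead of billions of raw rows.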
