
# SQL Percentile Aggregates and Rollups With PostgreSQL and t-digest

When it comes to data, let’s start with the obvious. Averages suck. As developers, we all know that percentiles are much more useful. Metrics like P90, P95, P99 give us a much better indication of how our software is performing. The challenge, historically, is how to track the underlying data and calculate the percentiles.

Today I will show you how amazingly easy it is to aggregate and create SQL-based percentile rollups with PostgreSQL and t-digest histograms!

### The Problem: Percentiles Require Big Data

At Stackify, we deal with big data. We help developers track the performance of thousands of applications and billions of data points daily with Retrace. We track how long every transaction in your application takes to load. Plus, every SQL query, web service call, browser load timing, etc, etc.

The problem is that, to properly calculate percentiles, you need all of the raw data. So if your app runs on 100 servers and handles 100 transactions a second on each, over the course of a month you accumulate roughly 30,000,000,000 data points just to calculate a monthly percentile for a single performance metric.

What you can’t do is take percentiles every minute or hour from individual servers and then average them together. Averaging percentiles is a terrible idea; it simply doesn’t work.

### Histograms: How to Track Data for Calculating Percentiles

Let’s start with the basics. How do we track 30,000,000,000 data points just to calculate one simple P99 number for your custom dashboard?

The way to do this is with a data science technique that goes by many names, most usually histograms, data sketches, buckets, or binning. People also use other types of algorithms, but at the end of the day they are all based on sampling and approximating your data.

A histogram is the most common term. Here is how Wikipedia describes it:

“To construct a histogram, the first step is to ‘bin’ (or ‘bucket’) the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval.”

So basically you can create 100 “buckets” and sort those 30 billion data points into those buckets. Then you can track how many are in each bucket and the sum of the values in each bucket. Based on that information, you can approximate things like a median, P90, P95 with a pretty high degree of accuracy.
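The bucketing idea is easy to try directly in PostgreSQL with the built-in `width_bucket()` function. This is just an illustrative sketch; the `response_times` table and `duration_ms` column are hypothetical names for this example:

```sql
-- Sort raw timings into 100 equal-width buckets between 0 and 10,000 ms,
-- tracking the count and sum per bucket.
SELECT width_bucket(duration_ms, 0, 10000, 100) AS bucket,
       count(*)                                 AS n,
       sum(duration_ms)                         AS total_ms
FROM response_times
GROUP BY bucket
ORDER BY bucket;
```

From the per-bucket counts you can walk the buckets until you pass 99% of the total count to approximate P99, with accuracy limited by the bucket width. t-digest improves on this fixed-width scheme by sizing its buckets adaptively.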

### Introduction to t-digest for Calculating SQL Percentiles

At Stackify, we are always looking at different ways to ingest and calculate performance data. We recently ran across a very cool PostgreSQL extension called tdigest that implements t-digest histograms. Ted Dunning originally wrote a white paper on T-Digest back in 2013. It has slowly grown in popularity since then.

T-Digest is a high-performance algorithm for calculating percentiles. There are implementations of it in many languages, including Node.js, Python, Java, and, most importantly, PostgreSQL.

I will walk you through some high-level basics of how it works to give you a general understanding.

T-Digest works by dynamically calculating “centroids.” Think of these like buckets, but they are basically key data points spread across your data. As you add the first data points, it dynamically evaluates what the centroids should be and adapts as you continue to add more data. It’s a little magical.

Here is an example of what a t-digest looks like:

``````
flags 0 count 37362 compression 100 centroids 51 (0.164000, 1) (0.165000, 1) (0.166000, 1) (0.166000, 1) (0.167000, 1) (0.504000, 3) (0.843000, 5) (1.185000, 7) (2.061000, 12) (1.915000, 11) (3.437000, 19) (7.813000, 40) (11.765000, 57) (15.448000, 72) (24.421000, 109) (49.816000, 211) (88.728000, 346) (147.814000, 538) (260.275000, 907) (420.212000, 1394) (679.826000, 2153) (854.042000, 2577) (1495.861000, 3815) (3435.648000, 5290) (3555.114000, 4491) (3366.077000, 4198) (3474.402000, 3748) (2631.066000, 2593) (1809.314000, 1773) (980.488000, 956) (1692.846000, 781) (106168.275000, 473) (166453.499000, 233) (168294.000000, 211) (87554.000000, 109) (59128.000000, 73) (42188.000000, 49) (28435.000000, 29) (20688.000000, 21) (14902.000000, 15) (11462.000000, 11) (9249.000000, 8) (5832.000000, 5) (4673.000000, 4) (3511.000000, 3) (2345.000000, 2) (1174.000000, 1) (1174.000000, 1) (1174.000000, 1) (1174.000000, 1) (1176.000000, 1)
``````

In this example, I have 37,362 data points spread across 51 centroids, with a max of 100 centroids. Each centroid records the sum of the values in its bucket and how many items are in the bucket. So something like (3435.648000, 5290) means the bucket holds 5,290 data points that add up to 3435.648, or about 0.649 on average.

Based on these buckets, the t-digest library can calculate any percentile across these 37,362 data points almost instantly.
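To give a taste of what this looks like in SQL, here is a minimal sketch using the tdigest extension. The table and column names are made up for illustration, and the aggregate signatures follow the extension’s README; treat this as a sketch rather than a drop-in script:

```sql
-- One-time setup (requires the tdigest extension to be installed on the server).
CREATE EXTENSION IF NOT EXISTS tdigest;

-- Compute percentiles straight from raw values, using compression 100.
-- response_times(duration_ms) is a hypothetical table.
SELECT tdigest_percentile(duration_ms, 100, 0.99) AS p99,
       tdigest_percentile(duration_ms, 100, 0.50) AS median
FROM response_times;

-- Or store one digest per hour so percentiles can be rolled up later
-- without keeping the raw data around.
CREATE TABLE response_time_rollups AS
SELECT date_trunc('hour', created_at) AS hour,
       tdigest(duration_ms, 100)      AS digest
FROM response_times
GROUP BY 1;
```

The key property is that stored digests can be merged, which is exactly what makes monthly rollups from hourly data possible without re-reading billions of raw points.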

#database #tutorial #sql #postgresql #sql percentile aggregates #t-digest


## Introduction to Structured Query Language SQL pdf

SQL stands for Structured Query Language. SQL is a language designed to store, manipulate, and query data held in relational databases. The first version of SQL appeared in 1974, when a group at IBM developed the first prototype of a relational database. The first commercial relational database was released by Relational Software, which later became Oracle.

Standards for SQL exist. However, the SQL that can be used on each of the major RDBMS today comes in different flavors. This is due to two reasons:

1. The SQL standard is fairly complex, and it is not practical to implement the entire standard.

2. Every database vendor needs a way to differentiate its product from others.

In this document, differences are noted where appropriate.

#programming books #beginning sql pdf #commands sql #download free sql full book pdf #introduction to sql pdf #introduction to sql ppt #introduction to sql #practical sql pdf #sql commands pdf with examples free download #sql commands #sql free bool download #sql guide #sql language #sql pdf #sql ppt #sql programming language #sql tutorial for beginners #sql tutorial pdf #sql #structured query language pdf #structured query language ppt #structured query language


## Welcome Back the T-SQL Debugger with SQL Complete – SQL Debugger

When you develop large chunks of T-SQL code with the help of the SQL Server Management Studio tool, it is essential to test the “live” behavior of your code by making sure that each small piece of code works fine and by being able to locate any error message that may cause a failure within that code.

The easiest way to do that would be to use the T-SQL debugger feature, which used to be built into the SQL Server Management Studio tool. But since the T-SQL debugger feature was removed completely from SQL Server Management Studio 18 and later editions, we need a replacement for it. This is because we cannot keep using old versions of SSMS just to retain the T-SQL debugger while missing out on the new features and bug fixes released in newer SSMS versions.

If you plan to wait for SSMS to bring back the T-SQL debugger feature, vote for the “Put Debugger back into SSMS 18” feedback item to ask Microsoft to reintroduce it.

As for me, I searched for an alternative to the built-in SSMS T-SQL debugger and found that Devart has rolled out a new T-SQL debugger feature in version 6.4 of its SQL Complete tool. SQL Complete is an add-in for Visual Studio and SSMS that offers script autocompletion capabilities, which help you develop and debug your SQL database projects.

The SQL Debugger feature of SQL Complete allows you to check the execution of your scripts, procedures, functions, and triggers step by step by adding breakpoints to the lines where you plan to start, suspend, evaluate, step through, and then continue the execution of your script.

You can download SQL Complete from the dbForge Download page and install it on your machine using a straightforward installation wizard. The wizard will ask you to specify the installation path for the SQL Complete tool and, from the versions installed on your machine, the versions of SSMS and Visual Studio into which you plan to install SQL Complete as an add-in, as shown below:

Once SQL Complete is fully installed on your machine, the dbForge SQL Complete installation wizard will notify you of whether the installation was completed successfully or the wizard faced any specific issue that you can troubleshoot and fix easily. If there are no issues, the wizard will provide you with an option to open the SSMS tool and start using the SQL Complete tool, as displayed below:

When you open SSMS, you will see a new “Debug” menu, under which you can navigate the SQL Debugger feature options. You will also see a list of icons used to control the debug mode of the T-SQL query at the leftmost side of the SSMS tool. If you cannot see the list, go to View -> Toolbars -> Debugger to make these icons visible.

During the debugging session, the SQL Debugger icons will be as follows:

The functionality of these icons within the SQL Debugger can be summarized as:

• Add Breakpoint: pause execution of the T-SQL script at a specific statement, allowing you to check the debugging information of the T-SQL statements, such as the values of parameters and variables.
• Step Into: navigate through the script statements one by one, allowing you to check how each statement behaves.
• Step Over: execute a called stored procedure as a single step if you are sure that it contains no error.
• Step Out: return from the stored procedure, function, or trigger to the main debugging window.
• Continue: execute the script until reaching the next breakpoint.
• Stop Debugging: terminate the debugging session.
• Restart: stop and start the current debugging session.
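To practice these controls, any small procedure with a few statements will do. Here is a minimal, hypothetical example you could set breakpoints in (all names are made up for the demo):

```sql
-- A tiny procedure for practicing breakpoints, Step Into, and Step Over.
CREATE OR ALTER PROCEDURE dbo.CountDown
    @Start INT
AS
BEGIN
    DECLARE @i INT = @Start;
    WHILE @i > 0              -- set a breakpoint here and watch @i change
    BEGIN
        PRINT CONCAT('i = ', @i);
        SET @i = @i - 1;
    END
END;
GO

EXEC dbo.CountDown @Start = 3;  -- step through this call with the debugger
```

Stepping into the `EXEC` call lets you watch `@i` decrement on each loop iteration, which is exactly the kind of “live” behavior check described above.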

#sql server #sql #sql debugger #sql server #sql server stored procedure #ssms #t-sql queries


## Backup Database using T-SQL Statements

##### Introduction

In this article, we will discuss how to back up a database in MS SQL Server using T-SQL statements.

We need to use the **BACKUP DATABASE** statement to create a full database backup, along with the name of the database and the device on which to store the backup file.
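A minimal sketch of the statement looks like this; the database name and file path are placeholders you would replace with your own:

```sql
-- Full backup of a database to a disk file (name and path are examples).
BACKUP DATABASE AdventureWorks
TO DISK = 'C:\Backups\AdventureWorks_Full.bak'
WITH NAME = 'AdventureWorks Full Database Backup',
     DESCRIPTION = 'Full backup taken via T-SQL';
```

The `TO DISK` clause names the backup device, and the `WITH` options attach a label and description to the backup set.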

#sql server #backup database #backup database using t-sql #sql query #sql tips #t-sql #tips


## The Easy Guide on How to Use Subqueries in SQL Server

Let’s say the chief credit and collections officer asks you to list the names of people, their unpaid balances per month, and their current running balance, and wants you to import this data into Excel. The purpose is to analyze the data and come up with an offer that makes payments lighter to mitigate the effects of the COVID-19 pandemic.

Do you opt to use a query and a nested subquery or a join? What decision will you make?

## SQL Subqueries – What Are They?

Before we do a deep dive into syntax, performance impact, and caveats, why not define a subquery first?

In the simplest terms, a subquery is a query within a query. The query that contains a subquery is the outer query, while we refer to the subquery as the inner query or inner select. Parentheses enclose a subquery, similar to the structure below:

``````
SELECT
    col1
    ,col2
    ,(subquery) as col3
FROM table1
[JOIN table2 ON table1.col1 = table2.col2]
WHERE col1 <operator> (subquery)
``````

We are going to look at the following points in this post:

• SQL subquery syntax depending on different subquery types and operators.
• When and in what sort of statements one can use a subquery.
• Performance implications vs. JOINs.
• Common caveats when using SQL subqueries.

As is customary, we provide examples and illustrations to enhance understanding. But bear in mind that the main focus of this post is on subqueries in SQL Server.

Now, let’s get started.

## Make SQL Subqueries That Are Self-Contained or Correlated

For one thing, subqueries are categorized based on their dependency on the outer query.

Let me describe what a self-contained subquery is.

Self-contained subqueries (sometimes referred to as non-correlated or simple subqueries) are independent of the tables in the outer query. Let me illustrate this:

``````
-- Get sales orders of customers from Southwest United States
-- (TerritoryID = 4)

USE [AdventureWorks]
GO
SELECT CustomerID, SalesOrderID
FROM Sales.SalesOrderHeader
WHERE CustomerID IN (SELECT [CustomerID]
                     FROM Sales.Customer
                     WHERE TerritoryID = 4)
``````

As demonstrated in the above code, the subquery (the part enclosed in parentheses) has no references to any column in the outer query. Additionally, you can highlight the subquery in SQL Server Management Studio and execute it without getting any runtime errors.

Which, in turn, leads to easier debugging of self-contained subqueries.

The next thing to consider is correlated subqueries. Compared to its self-contained counterpart, this one has at least one column being referenced from the outer query. To clarify, I will provide an example:

``````
USE [AdventureWorks]
GO
SELECT p.[FirstName], p.[LastName]
FROM Person.Person AS p
WHERE 1262000.00 IN
    (SELECT [SalesQuota]
     FROM Sales.SalesPersonQuotaHistory spq
     WHERE spq.BusinessEntityID = p.BusinessEntityID)
``````

Were you attentive enough to notice the reference to BusinessEntityID from the Person table? Well done!

Once a column from the outer query is referenced in the subquery, it becomes a correlated subquery. One more point to consider: if you highlight a correlated subquery and execute it on its own, an error will occur.

And yes, you are absolutely right: this makes correlated subqueries harder to debug.

To make debugging possible, follow these steps:

• Isolate the subquery.
• Replace the reference to the outer query with a constant value.

Isolating the subquery for debugging will make it look like this:

``````
SELECT [SalesQuota]
FROM Sales.SalesPersonQuotaHistory spq
WHERE spq.BusinessEntityID = 274   -- an example constant stands in for the outer reference
``````

Now, let’s dig a little deeper into the output of subqueries.

## Make SQL Subqueries With 3 Possible Returned Values

Well, first, let’s think about what return values we can expect from SQL subqueries.

In fact, there are 3 possible outcomes:

• A single value
• Multiple values
• Whole tables

### Single Value

Let’s start with single-valued output. This type of subquery can appear anywhere in the outer query where an expression is expected, like the WHERE clause.

``````
-- Output a single value which is the maximum or last TransactionID
USE [AdventureWorks]
GO
SELECT TransactionID, ProductID, TransactionDate, Quantity
FROM Production.TransactionHistory
WHERE TransactionID = (SELECT MAX(t.TransactionID)
                       FROM Production.TransactionHistory t)
``````

When you use the MAX() function, you retrieve a single value. That’s exactly what happened in our subquery above. Using the equals (=) operator tells SQL Server that you expect a single value. Another thing: if the subquery returns multiple values while using the equals (=) operator, you get an error similar to the one below:

``````
Msg 512, Level 16, State 1, Line 20
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
``````

### Multiple Values

Next, we examine the multi-valued output. This kind of subquery returns a list of values with a single column. Additionally, operators like IN and NOT IN will expect one or more values.

``````
-- Output multiple values: a list of customers with last names that start with 'I'

USE [AdventureWorks]
GO
SELECT [SalesOrderID], [OrderDate], [ShipDate], [CustomerID]
FROM Sales.SalesOrderHeader
WHERE [CustomerID] IN (SELECT c.[CustomerID] FROM Sales.Customer c
                       INNER JOIN Person.Person p ON c.PersonID = p.BusinessEntityID
                       WHERE p.LastName LIKE N'I%' AND p.PersonType = 'SC')
``````

### Whole Table Values

And last but not least, let’s delve into whole-table outputs.

``````
-- Output a table of values based on sales orders
USE [AdventureWorks]
GO
SELECT [ShipYear],
       COUNT(DISTINCT [CustomerID]) AS CustomerCount
FROM (SELECT YEAR([ShipDate]) AS [ShipYear],
             [CustomerID]
      FROM Sales.SalesOrderHeader) AS Shipments  -- the derived table needs an alias
GROUP BY [ShipYear]
ORDER BY [ShipYear]
``````

Have you noticed the FROM clause?

Instead of using a table, it used a subquery. This is called a derived table or a table subquery.

And now, let me present some ground rules for using this sort of query:

• All columns in the subquery should have unique names. Much like a physical table, a derived table should have unique column names.
• ORDER BY is not allowed unless TOP is also specified. That’s because the derived table represents a relational table where rows have no defined order.

In this case, a derived table has the benefits of a physical table. That’s why in our example, we can use COUNT() in one of the columns of the derived table.
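The ORDER BY rule above can be illustrated with a small, hypothetical sketch; the derived-table alias RecentOrders is a made-up name:

```sql
-- ORDER BY inside a derived table is only allowed together with TOP.
SELECT [SalesOrderID], [OrderDate]
FROM (SELECT TOP (10) [SalesOrderID], [OrderDate]
      FROM Sales.SalesOrderHeader
      ORDER BY [OrderDate] DESC) AS RecentOrders  -- alias is required
```

Without the `TOP (10)`, SQL Server rejects the inner `ORDER BY`, because a derived table represents an unordered relational table.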

That’s about all regarding subquery outputs. But before we go any further, you may have noticed that the logic behind the multiple-values example (and others as well) can also be implemented using a JOIN.

``````-- Output multiple values which is a list of customers with lastnames that start with 'I'
GO
SELECT o.[SalesOrderID], o.[OrderDate], o.[ShipDate], o.[CustomerID]