Recently, I got a request from a business user within my company to create a report which shows the number of consecutive days where money turnover was greater than X value. When I started to prepare the data from our SQL Server database, there were a few tricky parts to solve.

Sample Data

In order to demonstrate how I solved this, I will use_ the _**Stack Overflow**sample database. Let’s translate the business requests into the following: find a number of consecutive days where there were more than 50 posts and each of these posts had more than 50.000 views.

Table dbo.Posts contains around 17.2 million rows. First, let’s check how many posts with more than 50.000 views were per every single day:

SELECT convert(date,CreationDate) as CreationDate
    ,count(Id) as TotalPosts
FROM Posts 
WHERE CreationDate >= '20080731'
    AND CreationDate < '20140101'
        AND ViewCount > 50000
GROUP BY convert(date,CreationDate)
ORDER BY CreationDate

When I run this query, I’m getting the following results:

In order to better cope with running queries, I’ve created a non-clustered index on the CreationDate column, with ViewCount column included. Now, when I run the query to check dates where there were more than 50 posts with 50.000+ views, I am getting the following results:

;WITH CTE_POSTS as(
select convert(date,CreationDate) CreationDate
    ,count(Id) totalPosts
from Posts 
where CreationDate >= '20080731'
    and CreationDate < '20140101'
    and ViewCount > 50000
group by convert(date,CreationDate)
)

SELECT * 
FROM CTE_POSTS
WHERE totalPosts > 50
order by CreationDate

As you can notice, between September 15th 2008 and September 19th, all dates are included. Then, 20th and 21st are missing…This problem is well known as the “Gaps and Islands” problem. Here, the first bucket of dates (15th — 19th) represents “island”, while 20th and 21st represent “gap”.

Welcome to the island…

Let’s first identify our “islands” (buckets of consecutive dates which satisfies the criteria: in our case, dates with more than 50 posts which have 50.000+ views):

; WITH CTE_POSTS as(
SELECT CONVERT(DATE,CreationDate) CreationDate
    ,COUNT(Id) totalPosts
from Posts 
WHERE CreationDate >= '20080731'
    AND CreationDate < '20140101'
    AND ViewCount > 50000
GROUP BY convert(date,CreationDate)
HAVING COUNT(Id) > 50
)

,starting as (
SELECT CreationDate
    ,ROW_NUMBER() OVER(ORDER BY CreationDate) rnm
FROM CTE_POSTS as CTE1
WHERE NOT EXISTS (
    SELECT * 
    FROM CTE_POSTS as CTE2
    WHERE DATEDIFF(dd,CTE2.CreationDate,CTE1.CreationDate) = 1)
    )

,ending as (
SELECT CreationDate
    ,ROW_NUMBER() OVER(ORDER BY CreationDate) rnm
FROM CTE_POSTS as CTE1
WHERE NOT EXISTS (
    SELECT * 
    FROM CTE_POSTS as CTE2
    WHERE DATEDIFF(dd,CTE2.CreationDate,CTE1.CreationDate) = -1)
    )

SELECT s.CreationDate as startingDate
    ,e.CreationDate as endingDate 
        ,DATEDIFF(dd,s.CreationDate,e.CreationDate) as ConsecutiveDays
FROM starting s
INNER JOIN ending e on s.rnm = e.rnm

#data #database #sql #towards-data-science #sql-server #data analysis

Island adventures with T-SQL Window Functions
1.40 GEEK