In this article we will address:

  1. How to generate a table with a range of dates using Presto SQL
  2. Rules of thumb for joining tables that are supposed to complete missing data from one another.

Data integrity is one of the most important things we need to address before performing calculations over our data. Even with all the right intentions, we can sometimes miss that there is an error in our data. This can become very difficult when the mistake is not in the data we have but** in the data we don’t have**.

When performing a calculation that takes into account the number of samples in our data (calculating an average or median value) we need to address rows where the value is NULL or zero.

Let’s assume we run an online shop and want to see the average amount of daily purchases a customer makes over a month. In dates in which a customer doesn’t make any purchases we won’t have an indication for that in our data. If we ignore this issue and calculate the average amount of purchases for each customer, we will get an overestimate.

   customer_id    | order_date | purchase_amount |
10000100005411274 | 2020-04-11 |        1        |
10000100005411274 | 2020-04-16 |        1        |
10000100005411274 | 2020-04-18 |        2        |
10000100005411274 | 2020-04-21 |        2        |
10000100005411274 | 2020-04-24 |        1        |

If we calculate the customer’s daily average purchase amount without looking at our raw data, we would think his average purchase amount is 1.4 (what a customer!).

To overcome this issue, we must generate and match between all dates to all customers. Via Presto SQL we can do this in a simple query:

SELECT
     CAST(date_column AS DATE) date_column
 FROM
     (VALUES
         (SEQUENCE(date('2020-04-01'),
                   date('2020-04-30'),
                   INTERVAL '1' DAY)
         )
     ) AS t1(date_array)
 CROSS JOIN
     UNNEST(date_array) AS t2(date_column)

Using SEQUENCE, we will create an array with the dates in our range and preform a cross join between each element in the array to the array itself. The result is a column with a row for each of the different dates.

A quick alternative could be to extract all the different dates regardless of customer from our initial data and store it as a WITH AS statement as well.

Next, we will preform another cross join in order to match between our customers and the different dates in order to fill in the missing ones:

with all_dates as (
SELECT
     CAST(date_column AS DATE) date_column
 FROM
     (VALUES
         (SEQUENCE(date('2020-04-01'),
                   date('2020-04-30'),
                   INTERVAL '1' DAY)
         )
     ) AS t1(date_array)
 CROSS JOIN
     UNNEST(date_array) AS t2(date_column)
)
select distinct customer_id
               ,date_column as order_date
from customer_purchases
cross join all_dates

Last, we will join the table with the new matches between customer and dates to the initial table with our data.

#presto #data #sql #data-engineering #data-science

Save The Date
1.20 GEEK