SQL and Machine Learning have a few things in common. It’s easy to start with one as it doesn’t require a lot of coding. Also, code rarely crashes.

I would argue that the fact that the SQL queries don’t crash makes the Data Analysis even harder. How many datasets I’ve extracted from the database, that turned out to have wrong or missing data? Many!

If the code would simply crash, I’d know I screw it up. Data Scientists need to spend a considerable amount of time on data validation because an SQL query always returns something.

These are the 5 mistakes you should avoid when writing SQL queries.

  1. Not knowing in what order queries execute
  2. What do Window Functions actually do?
  3. Calculating average with CASE WHEN
  4. JOINs ON columns with missing values
  5. Not using temporary tables for complex queries

1. Not knowing in what order queries execute

Image for post

SQL query execution order

SQL has a low barrier to entry. You start writing queries — use a JOIN here and there, do some grouping and you’re already an expert (at least some people think so).

But does the so-called expert even know in what order do SQL queries execute?

SQL queries don’t start with SELECT — they do in the editor when we write them, but the database doesn’t start with SELECT.

The database starts executing queries with FROM and JOIN. That’s why we can use fields from JOINed tables in WHERE.

Why can’t we filter the result of GROUP BY in WHERE? Because GROUP BY executes after WHERE. Hence, the reason for HAVING.

At last, we come to SELECT. It selects which columns to include and defines which aggregations to calculate. Also, Window Functions execute here.

This explains why we get an error when we try to filter with the output of a Window Function in WHERE.

Note, databases use a query optimizer to optimize the execution of a query. The optimizer might change the order of some operations so that the query runs faster. This diagram is a high-level overview of what is happening behind the scenes.

2. What do Window Functions actually do?

Image for post

Example of a transformation with a SUM Window Function

Window Functions seemed cryptic to me when I first encountered them. Why use Window Functions as GROUP BY can aggregate the data?

Well, a Window Function (WF) simplifies many operations when designing queries:

  • WF allows access to the records right before and after the current record. See Lead and Lag functions.
  • WF can perform an additional aggregation on already aggregated data with GROUP BY. See the example in the image above, where I calculate sales all with a WF.
  • ROW_NUMBER WF enumerates the rows. We can also use it to remove duplicate records with it. Or to take a random sample.
  • As the name suggests WF can calculate statistics on a given window:
sum(sales) OVER (PARTITION BY CustomerID BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum

The WF above would calculate the cumulative sum from the first record to the current record.

Where did I do a mistake with Window Functions?

I didn’t take the time for a tutorial that would explain the basics and the power of Window Functions. Consequently, I avoided them and the queries became overcomplicated. Then bugs creep in.

Run the example above

Image for post

Running the example in SQL Fiddle

I got many requests from readers that would like to play with the example above. Now, you can run the example online in SQL Fiddle.

Here is the code, if you’d like to try it in your local database (it should work with PostgreSQL 9.3):

DROP TABLE IF EXISTS sales_table;
CREATE TEMPORARY TABLE sales_table
(
    key       varchar(6),
    customerID  varchar(10),
    productID   varchar(10),
    price     float
);
INSERT INTO sales_table
VALUES ('1', 'Customer1', 'Product1', 100),
       ('2', 'Customer1', 'Product1', 200),
       ('3', 'Customer1', 'Product2', 100),
       ('4', 'Customer2', 'Product2', 200),
       ('5', 'Customer2', 'Product3', 100);
SELECT customerID,
       productID,
       SUM(price) AS sales,
       SUM(SUM(price)) OVER (PARTITION BY customerID) AS sales_all
FROM sales_table
GROUP BY customerID, productID
ORDER BY customerID, productID

#data-science #data-analysis #sql #machine-learning #database

Don’t Repeat These 5 Mistakes with SQL
3.55 GEEK