SQL and Machine Learning have a few things in common. It’s easy to start with one as it doesn’t require a lot of coding. Also, code rarely crashes.
I would argue that the fact that the SQL queries don’t crash makes the Data Analysis even harder. How many datasets I’ve extracted from the database, that turned out to have wrong or missing data? Many!
If the code would simply crash, I’d know I screw it up. Data Scientists need to spend a considerable amount of time on data validation because an SQL query always returns something.
These are the 5 mistakes you should avoid when writing SQL queries.
SQL query execution order
SQL has a low barrier to entry. You start writing queries — use a JOIN here and there, do some grouping and you’re already an expert (at least some people think so).
But does the so-called expert even know in what order do SQL queries execute?
SQL queries don’t start with SELECT — they do in the editor when we write them, but the database doesn’t start with SELECT.
The database starts executing queries with FROM and JOIN. That’s why we can use fields from JOINed tables in WHERE.
Why can’t we filter the result of GROUP BY in WHERE? Because GROUP BY executes after WHERE. Hence, the reason for HAVING.
At last, we come to SELECT. It selects which columns to include and defines which aggregations to calculate. Also, Window Functions execute here.
This explains why we get an error when we try to filter with the output of a Window Function in WHERE.
Note, databases use a query optimizer to optimize the execution of a query. The optimizer might change the order of some operations so that the query runs faster. This diagram is a high-level overview of what is happening behind the scenes.
Example of a transformation with a SUM Window Function
Window Functions seemed cryptic to me when I first encountered them. Why use Window Functions as GROUP BY can aggregate the data?
Well, a Window Function (WF) simplifies many operations when designing queries:
sum(sales) OVER (PARTITION BY CustomerID BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum
The WF above would calculate the cumulative sum from the first record to the current record.
Where did I do a mistake with Window Functions?
I didn’t take the time for a tutorial that would explain the basics and the power of Window Functions. Consequently, I avoided them and the queries became overcomplicated. Then bugs creep in.
Run the example above
Running the example in SQL Fiddle
I got many requests from readers that would like to play with the example above. Now, you can run the example online in SQL Fiddle.
Here is the code, if you’d like to try it in your local database (it should work with PostgreSQL 9.3):
DROP TABLE IF EXISTS sales_table;
CREATE TEMPORARY TABLE sales_table
(
key varchar(6),
customerID varchar(10),
productID varchar(10),
price float
);
INSERT INTO sales_table
VALUES ('1', 'Customer1', 'Product1', 100),
('2', 'Customer1', 'Product1', 200),
('3', 'Customer1', 'Product2', 100),
('4', 'Customer2', 'Product2', 200),
('5', 'Customer2', 'Product3', 100);
SELECT customerID,
productID,
SUM(price) AS sales,
SUM(SUM(price)) OVER (PARTITION BY customerID) AS sales_all
FROM sales_table
GROUP BY customerID, productID
ORDER BY customerID, productID
#data-science #data-analysis #sql #machine-learning #database