Doing Data Analysis with SQL is pretty straightforward… until it is NOT! Don’t Repeat These 5 Mistakes with SQL: Not knowing in what order queries execute, What do Window Functions actually do? Calculating average with CASE WHEN, JOINs ON columns with missing values, Not using temporary tables for complex queries
SQL and Machine Learning have a few things in common. It’s easy to start with one as it doesn’t require a lot of coding. Also, code rarely crashes.
I would argue that the fact that the SQL queries don’t crash makes the Data Analysis even harder. How many datasets I’ve extracted from the database, that turned out to have wrong or missing data? Many!
If the code would simply crash, I’d know I screw it up. Data Scientists need to spend a considerable amount of time on data validation because an SQL query always returns something.
These are the 5 mistakes you should avoid when writing SQL queries.
SQL query execution order
SQL has a low barrier to entry. You start writing queries — use a JOIN here and there, do some grouping and you’re already an expert (at least some people think so).
But does the so-called expert even know in what order do SQL queries execute?
SQL queries don’t start with SELECT — they do in the editor when we write them, but the database doesn’t start with SELECT.
The database starts executing queries with FROM and JOIN. That’s why we can use fields from JOINed tables in WHERE.
Why can’t we filter the result of GROUP BY in WHERE? Because GROUP BY executes after WHERE. Hence, the reason for HAVING.
At last, we come to SELECT. It selects which columns to include and defines which aggregations to calculate. Also, Window Functions execute here.
This explains why we get an error when we try to filter with the output of a Window Function in WHERE.
Note, databases use a query optimizer to optimize the execution of a query. The optimizer might change the order of some operations so that the query runs faster. This diagram is a high-level overview of what is happening behind the scenes.
Example of a transformation with a SUM Window Function
Window Functions seemed cryptic to me when I first encountered them. Why use Window Functions as GROUP BY can aggregate the data?
Well, a Window Function (WF) simplifies many operations when designing queries:
sum(sales) OVER (PARTITION BY CustomerID BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum
The WF above would calculate the cumulative sum from the first record to the current record.
Where did I do a mistake with Window Functions?
I didn’t take the time for a tutorial that would explain the basics and the power of Window Functions. Consequently, I avoided them and the queries became overcomplicated. Then bugs creep in.
Run the example above
Running the example in SQL Fiddle
I got many requests from readers that would like to play with the example above. Now, you can run the example online in SQL Fiddle.
Here is the code, if you’d like to try it in your local database (it should work with PostgreSQL 9.3):
DROP TABLE IF EXISTS sales_table; CREATE TEMPORARY TABLE sales_table ( key varchar(6), customerID varchar(10), productID varchar(10), price float ); INSERT INTO sales_table VALUES ('1', 'Customer1', 'Product1', 100), ('2', 'Customer1', 'Product1', 200), ('3', 'Customer1', 'Product2', 100), ('4', 'Customer2', 'Product2', 200), ('5', 'Customer2', 'Product3', 100); SELECT customerID, productID, SUM(price) AS sales, SUM(SUM(price)) OVER (PARTITION BY customerID) AS sales_all FROM sales_table GROUP BY customerID, productID ORDER BY customerID, productID
SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.
Learning is a new fun in the field of Machine Learning and Data Science. In this article, we’ll be discussing 15 machine learning and data science projects.
Most popular Data Science and Machine Learning courses — August 2020. This list was last updated in August 2020 — and will be updated regularly so as to keep it relevant
You will discover Exploratory Data Analysis (EDA), the techniques and tactics that you can use, and why you should be performing EDA on your next problem.
Why should you learn R programming when you're aiming to learn data science? Here are six reasons why R is the right language for you.