Don’t Repeat These 5 Mistakes with SQL

Don’t Repeat These 5 Mistakes with SQL

Doing Data Analysis with SQL is pretty straightforward… until it is NOT! Don’t Repeat These 5 Mistakes with SQL: Not knowing in what order queries execute, What do Window Functions actually do? Calculating average with CASE WHEN, JOINs ON columns with missing values, Not using temporary tables for complex queries

SQL and Machine Learning have a few things in common. It’s easy to start with one as it doesn’t require a lot of coding. Also, code rarely crashes.

I would argue that the fact that the SQL queries don’t crash makes the Data Analysis even harder. How many datasets I’ve extracted from the database, that turned out to have wrong or missing data? Many!

If the code would simply crash, I’d know I screw it up. Data Scientists need to spend a considerable amount of time on data validation because an SQL query always returns something.

These are the 5 mistakes you should avoid when writing SQL queries.

  1. Not knowing in what order queries execute
  2. What do Window Functions actually do?
  3. Calculating average with CASE WHEN
  4. JOINs ON columns with missing values
  5. Not using temporary tables for complex queries

1. Not knowing in what order queries execute

Image for post

SQL query execution order

SQL has a low barrier to entry. You start writing queries — use a JOIN here and there, do some grouping and you’re already an expert (at least some people think so).

But does the so-called expert even know in what order do SQL queries execute?

SQL queries don’t start with SELECT — they do in the editor when we write them, but the database doesn’t start with SELECT.

The database starts executing queries with FROM and JOIN. That’s why we can use fields from JOINed tables in WHERE.

Why can’t we filter the result of GROUP BY in WHERE? Because GROUP BY executes after WHERE. Hence, the reason for HAVING.

At last, we come to SELECT. It selects which columns to include and defines which aggregations to calculate. Also, Window Functions execute here.

This explains why we get an error when we try to filter with the output of a Window Function in WHERE.

Note, databases use a query optimizer to optimize the execution of a query. The optimizer might change the order of some operations so that the query runs faster. This diagram is a high-level overview of what is happening behind the scenes.

2. What do Window Functions actually do?

Image for post

Example of a transformation with a SUM Window Function

Window Functions seemed cryptic to me when I first encountered them. Why use Window Functions as GROUP BY can aggregate the data?

Well, a Window Function (WF) simplifies many operations when designing queries:

  • WF allows access to the records right before and after the current record. See Lead and Lag functions.
  • WF can perform an additional aggregation on already aggregated data with GROUP BY. See the example in the image above, where I calculate sales all with a WF.
  • ROW_NUMBER WF enumerates the rows. We can also use it to remove duplicate records with it. Or to take a random sample.
  • As the name suggests WF can calculate statistics on a given window:
sum(sales) OVER (PARTITION BY CustomerID BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum

The WF above would calculate the cumulative sum from the first record to the current record.

Where did I do a mistake with Window Functions?

I didn’t take the time for a tutorial that would explain the basics and the power of Window Functions. Consequently, I avoided them and the queries became overcomplicated. Then bugs creep in.

Run the example above

Image for post

Running the example in SQL Fiddle

I got many requests from readers that would like to play with the example above. Now, you can run the example online in SQL Fiddle.

Here is the code, if you’d like to try it in your local database (it should work with PostgreSQL 9.3):

DROP TABLE IF EXISTS sales_table;
CREATE TEMPORARY TABLE sales_table
(
    key       varchar(6),
    customerID  varchar(10),
    productID   varchar(10),
    price     float
);
INSERT INTO sales_table
VALUES ('1', 'Customer1', 'Product1', 100),
       ('2', 'Customer1', 'Product1', 200),
       ('3', 'Customer1', 'Product2', 100),
       ('4', 'Customer2', 'Product2', 200),
       ('5', 'Customer2', 'Product3', 100);
SELECT customerID,
       productID,
       SUM(price) AS sales,
       SUM(SUM(price)) OVER (PARTITION BY customerID) AS sales_all
FROM sales_table
GROUP BY customerID, productID
ORDER BY customerID, productID

data-science data-analysis sql machine-learning database

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Introduction to Structured Query Language SQL pdf

SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.

15 Machine Learning and Data Science Project Ideas with Datasets

Learning is a new fun in the field of Machine Learning and Data Science. In this article, we’ll be discussing 15 machine learning and data science projects.

Most popular Data Science and Machine Learning courses — July 2020

Most popular Data Science and Machine Learning courses — August 2020. This list was last updated in August 2020 — and will be updated regularly so as to keep it relevant

Exploratory Data Analysis is a significant part of Data Science

You will discover Exploratory Data Analysis (EDA), the techniques and tactics that you can use, and why you should be performing EDA on your next problem.

Why You Should Learn R — Learn Data Science with Dataquest

Why should you learn R programming when you're aiming to learn data science? Here are six reasons why R is the right language for you.