Comparing Python and SQL for Building Data Pipelines🚀🚀🚀

Comparing Python and SQL for Building Data Pipelines🚀🚀🚀

Comparing Python and SQL for Building Data Pipelines - Breaking into the workforce as a web developer, my first interaction with databases and SQL was using an Object Relational Model (ORM). I was using the Django query sets API and had an excellent experience using the interface...

Breaking into the workforce as a web developer, my first interaction with databases and SQL was using an Object Relational Model (ORM). I was using the Django query sets API and had an excellent experience using the interface...

Thereon-after, I changed to a data engineering role and became much more involved in leveraging datasets to build AI. It became my responsibility to take the data from the user application and turn it into something usable by Data Scientists, a process commonly known as ETL.

The data on our production system was very messy and required a lot of transformations before anyone was going to be able to build AI on top of it. There were JSON columns that had different schemas per row, columns contained mixed data types and some rows had erroneous values (people saying that they were born before 1850 or in the future). As I set out on cleaning, aggregating and engineering features for the data, I tried to decide which language would be best for the task. Having used python all day every day before this, I knew that it could do the job. However, what I learned through this experience was that just because python could do the job doesn’t mean it should.

The first time I misjudged SQL is when I assumed that SQL couldn’t do complicated transformations

We are working with a time-series dataset where we wanted to track particular users over time. Privacy laws prevent us from knowing the specific dates of the user visits, so we decided that we would normalize the date of the record to the users first visit (ie 5 days after their first visit etc.). For our analysis, it was important to know the time since the last visit as well as the time since their first visit. A had two sample datasets, one with approximately 7.5 million rows measuring 6.5 GBs, and the other with 550 000 rows measuring 900 MB.

Using the python and SQL code seen below, I used the smaller dataset to first test the transformations. Python and SQL completed the task in 591 and 40.9 seconds respectively. This means that SQL was able to provide a speed-up of roughly 14.5X!

# connect to db using wrapper around psycopg2
db = DatabaseConnection(db='db', user='username', password='password')
# grab data from db and load into memory
df = db.run_query("SELECT * FROM cleaned_table;")
df = pd.DataFrame(df, columns=['user_id', 'series_id', 'timestamp'])
# calculate time since first visit
df = df.sort_values(['user_id', 'timestamp'], ascending=True).assign(time_since_first=df.groupby('user_id').timestamp.apply(lambda x: x - x.min()))
# calculate time since last visit
df = df.assign(time_since_last=df.sort_values(['timestamp'], ascending=True).groupby('user_id')['timestamp'].transform(pd.Series.diff))
# save df to compressed csv
df.to_csv('transform_time_test.gz', compression='gzip')

-- SQL equivalent
-- increase the working memory (be careful with this)
set work_mem='600MB';
-- create a dual index on the partition
CREATE INDEX IF NOT EXISTS user_time_index ON table(user_id, timestamp);
-- calculate time since last visit and time since first visit in one pass 
SELECT *, AGE(timestamp, LAG(timestamp, 1, timestamp) OVER w) AS time_since_last, AGE(timestamp, FIRST_VALUE(timestamp) OVER w) AS time_since_first FROM table WINDOW w AS (PARTITION BY user_id ORDER BY timestamp);

This SQL transformation was not only faster but the code is also more readable and thus easier to maintain. Here, I used the lag and first_value functions to find specific records in the users history (called a partition). I then used the age function to determine the time difference between visits.

What’s even more interesting is that when these transformation scripts were applied to the 6.5 GB dataset, python completely failed. Out of 3 attempts, python crashed 2 times and my computer completely froze the 3rd time… while SQL took 226 seconds.

More info:

The second time I misjudged SQL is when I thought that it couldn’t flatten irregular json

Another game changer for me was realizing that Postgres worked with JSON quite well. I initially thought that it would be impossible to flatten or parse json in postgres…I can’t believe that I was so dumb. If you want to relationize json and its schema is consistent between rows, then your best bet is probably to use Postgres built in ability to parse json.

-- SQL (the -> syntax is how you parse json)
SELECT user_json->'info'->>'name' as user_name FROM user_table;

On the other hand, half the json in my sample dataset isn’t valid json and thus is stored as text. In which case I was left with a choice, I could either recode the data to make it valid OR I could just drop the rows that didn’t follow the rules. To do this, I created a new SQL function called is_json that I could then use to qualify valid json in a WHERE clause.

-- SQL
create or replace function is_json(text)
returns boolean language plpgsql immutable as $
    perform $1::json;
    return true;
    when invalid_text_representation then 
        return false;
end $;
SELECT user_json->'info'->>'name' as user_name FROM user_table WHERE is_json(user_json);

Unfortunately, I found that the user_json had a different schema depending on what app version the user was on. Although this makes sense from an application development point of view, it makes it really expensive to conditionally parse every possibility per row. Was I destined to enter python again… not a chance! I found another function on stack-overflow written by a postgres god named klin.

-- SQL
create or replace function create_jsonb_flat_view
    (table_name text, regular_columns text, json_column text)
    returns text language plpgsql as $
    cols text;
    execute format ($ex$
        select string_agg(format('%2$s->>%%1$L "%%1$s"', key), ', ')
        from (
            select distinct key
            from %1$s, jsonb_each(%2$s)
            order by 1
            ) s;
        $ex$, table_name, json_column)
    into cols;
    execute format($ex$
        drop view if exists %1$s_view;
        create view %1$s_view as 
        select %2$s, %3$s from %1$s
        $ex$, table_name, regular_columns, cols);
    return cols;
end $;

This function was able to successfully flatten my json and solve my worst nightmare quite easily.

Final Comments

There is an idiom that declares Python as the second best language to do almost anything. I believe this to be true and in some instances have found the performance difference between Python and the ‘best’ language to be negligible. In this case however, python was unable to compete with SQL. These realizations along with readings I’ve been doing has completely changed my approach to ETL. I now work under the paradigm of “Do not move data to code, move code to your data”. Python moves your data to the code while SQL acts on it in place. What’s more is that I know that I’ve only scratched the surface of sql and postgres abilities. I’m looking forward to more awesome functionality, and the possibility of getting speed ups from using an analytical warehouse.

python sql database

What's new in Bootstrap 5 and when Bootstrap 5 release date?

How to Build Progressive Web Apps (PWA) using Angular 9

What is new features in Javascript ES2020 ECMAScript 2020

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Random Password Generator Online

HTML Color Picker online | HEX Color Picker | RGB Color Picker

Introduction to Structured Query Language SQL pdf

SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.

Python Database Connection - How to Connect Python with MySQL Database

This video on 'Python Database Connection', you will learn how to establish a connection between Python and MySQL DB and perform CRUD operations on it.

Python with SQL Server Database

In this article you will learn: How to connect through SQL Server Database using Python? How can we store some information into the SQL Server database using Python? How can we perform an update and delete operations? How we will retrieve stored information from the SQL Server database?

Connect to Microsoft SQL Server database on MacOS using Python

Connect to your MS SQL using python. The first thing you need is to install Homebrew. You need the copy the content in the square brackets which in my case is “ODBC Driver 13 for SQL Server”. Replace “ODBC Driver 13 for SQL Server” with the content you copied in the square brackets.

Basic Data Types in Python | Python Web Development For Beginners

In the programming world, Data types play an important role. Each Variable is stored in different data types and responsible for various functions. Python had two different objects, and They are mutable and immutable objects.