Python is very popular these days. Since Python is a general-purpose programming language, it can also be used to perform the Extract, Transform, Load (ETL) process. Different ETL modules are available, but today we’ll stick with the combination of Python and MySQL.
We’ll use Python to invoke stored procedures and prepare and execute SQL statements.
We’ll use two similar-but-different approaches. First, we’ll invoke stored procedures that will do the whole job, and after that we’ll analyze how we could do the same process without stored procedures by using MySQL code in Python.
Ready? Before we dig in, let’s look at the data model – or data models, as there’s two of them in this article.
We’ll need two data models, one to store our operational data and the other to store our reporting data.
The first model is shown in the picture above. This model is used to store operational (live) data for a subscription-based business. For more insight into this model, please take a look at our previous article, CREATING A DWH, PART ONE: A SUBSCRIPTION BUSINESS DATA MODEL.
https://my.vertabelo.com/model/IvcYd9W06dorp9zZxp9OfUGFyMlD7Z7s
Separating operational and reporting data is usually a very wise decision. To achieve that separation, we’ll need to create a data warehouse (DWH). We already did that; you can see the model in the picture above. This model is also described in detail in the post CREATING A DWH, PART TWO: A SUBSCRIPTION BUSINESS DATA MODEL.
Finally, we need to extract data from the live database, transform it, and load it into our DWH. We’ve already done this using SQL stored procedures. You can find a description of what we want to achieve along with some code examples in CREATING A DATA WAREHOUSE, PART 3: A SUBSCRIPTION BUSINESS DATA MODEL.
If you need additional information regarding DWHs, we recommend reading these articles:
Our task today is to replace the SQL stored procedures with Python code. We’re ready to make some Python magic. Let’s start with using only stored procedures in Python.
Before we start describing the process, it’s important to mention that we have two databases on our server.
The subscription_live
database is used to store transactional/live data, while the subscription_dwh
is our reporting database (DWH).
We’ve already described the stored procedures used to update dimension and fact tables. They will read data from the subscription_live
database, combine it with data in the subscription_dwh
database, and insert new data into the subscription_dwh
database. These two procedures are:
If you want to see the complete code for these procedures, read CREATING A DATA WAREHOUSE, PART 3: A SUBSCRIPTION BUSINESS DATA MODEL.
Now we’re ready to write a simple Python script that will connect to the server and perform the ETL process. Let’s first take a look at the whole script (etl_procedures.py). Then we’ll explain the most important parts.
# import MySQL connector
import mysql.connector
# connect to server
connection = mysql.connector.connect(user='', password='', host='127.0.0.1')
print('Connected to database.')
cursor = connection.cursor()
# I update dimensions
cursor.callproc('subscription_dwh.p_update_dimensions')
print('Dimension tables updated.')
# II update facts
cursor.callproc('subscription_dwh.p_update_facts')
print('Fact tables updated.')
# commit & close connection
cursor.close()
connection.commit()
connection.close()
print('Disconnected from database.')
etl_procedures.py
Python uses modules to store definitions and statements. You could use an existing module or write your own. Using existing modules will simplify your life because you’re using pre-written code, but writing your own module is also very useful. When you quit the Python interpreter and run it again, you’ll lose functions and variables you’ve previously defined. Of course, you don’t want to type the same code over and over again. To avoid that, you could store your definitions in a module and import it into Python.
Back to etl_procedures.py. In our program, we start with importing MySQL Connector:
# import MySQL connector
import mysql.connector
MySQL Connector for Python is used as a standardized driver that connects to a MySQL server/database. You’ll need to download it and install it if you haven’t previously done that. Besides connecting to the database, it offers a number of methods and properties for working with a database. We’ll use some of them, but you can check the complete documentation HERE.
Next, we’ll need to connect to our database:
# connect to server
connection = mysql.connector.connect(user='', password='', host='127.0.0.1')
print('Connected to database.')
cursor = connection.cursor()
The first line will connect to a server (in this case, I’m connecting to my local machine) using your credentials (replace and with actual values). While establishing a connection, you could also specify the database you want to connect to, as shown below:
connection = mysql.connector.connect(user='', password='', host='127.0.0.1', database='')
I’ve intentionally connected only to a server and not to a specific database because I’ll be using two databases located on the same server.
The next command – print
– is here just a notification that we were successfully connected. While it has no programming significance, it could be used to debug the code if something went wrong in the script.
The last line in this part is:
cursor = connection.cursor()
Cursors are the handler structure used to work with the data. We’ll use them for retrieving data from the database (SELECT), but also to modify the data (INSERT, UPDATE, DELETE). Before using a cursor, we need to create it. And that is what this line does.
The previous part was general and could be used for other database-related tasks. The following part of the code is specifically for ETL: calling our stored procedures with the cursor.callproc
command. It looks like this:
# 1. update dimensions
cursor.callproc('subscription_dwh.p_update_dimensions')
print('Dimension tables updated.')
# 2. update facts
cursor.callproc('subscription_dwh.p_update_facts')
print('Fact tables updated.')
Calling procedures is pretty much self-explanatory. After each call, a print command was added. Again, this just gives us a notification that everything went okay.
The final part of the script commits the database changes and closes all used objects:
# commit & close connection
cursor.close()
connection.commit()
connection.close()
print('Disconnected from database.')
Calling procedures is pretty much self-explanatory. After each call, a print command was added. Again, this just gives us a notification that everything went okay.
Committing is essential here; without it, there will be no changes to the database, even if you called a procedure or executed an SQL statement.
The last thing we need to do is to run our script. We’ll use the following commands in the Python Shell to achieve that:
import os file_path = ‘D://python_scripts’ os.chdir(file_path) exec(open(“etl_procedures.py”).read())
The script is executed and all changes are made in the database accordingly. The result can be seen in the picture below.
The approach presented above doesn’t differ a lot from the approach of calling stored procedures directly in MySQL. The only difference is that now we have a script that will do the whole job for us.
We could use another approach: putting everything inside the Python script. We’ll include Python statements, but we’ll also prepare SQL queries and execute them on the database. The source database (live) and the destination database (DWH) are the same as in the example with stored procedures.
Before we delve into this, let’s take a look at the complete script (etl_queries.py):
from datetime import date
# import MySQL connector
import mysql.connector
# connect to server
connection = mysql.connector.connect(user='', password='', host='127.0.0.1')
print('Connected to database.')
# 1. update dimensions
# 1.1 update dim_time
# date - yesterday
yesterday = date.fromordinal(date.today().toordinal()-1)
yesterday_str = '"' + str(yesterday) + '"'
# test if date is already in the table
cursor = connection.cursor()
query = (
"SELECT COUNT(*) "
"FROM subscription_dwh.dim_time "
"WHERE time_date = " + yesterday_str)
cursor.execute(query)
result = cursor.fetchall()
yesterday_subscription_count = int(result[0][0])
if yesterday_subscription_count == 0:
yesterday_year = 'YEAR("' + str(yesterday) + '")'
yesterday_month = 'MONTH("' + str(yesterday) + '")'
yesterday_week = 'WEEK("' + str(yesterday) + '")'
yesterday_weekday = 'WEEKDAY("' + str(yesterday) + '")'
query = (
"INSERT INTO subscription_dwh.`dim_time`(`time_date`, `time_year`, `time_month`, `time_week`, `time_weekday`, `ts`) "
" VALUES (" + yesterday_str + ", " + yesterday_year + ", " + yesterday_month + ", " + yesterday_week + ", " + yesterday_weekday + ", Now())")
cursor.execute(query)
# 1.2 update dim_city
query = (
"INSERT INTO subscription_dwh.`dim_city`(`city_name`, `postal_code`, `country_name`, `ts`) "
"SELECT city_live.city_name, city_live.postal_code, country_live.country_name, Now() "
"FROM subscription_live.city city_live "
"INNER JOIN subscription_live.country country_live ON city_live.country_id = country_live.id "
"LEFT JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
"WHERE city_dwh.id IS NULL")
cursor.execute(query)
print('Dimension tables updated.')
# 2. update facts
# 2.1 update customers subscribed
# delete old data for the same date
query = (
"DELETE subscription_dwh.`fact_customer_subscribed`.* "
"FROM subscription_dwh.`fact_customer_subscribed` "
"INNER JOIN subscription_dwh.`dim_time` ON subscription_dwh.`fact_customer_subscribed`.`dim_time_id` = subscription_dwh.`dim_time`.`id` "
"WHERE subscription_dwh.`dim_time`.`time_date` = " + yesterday_str)
cursor.execute(query)
# insert new data
query = (
"INSERT INTO subscription_dwh.`fact_customer_subscribed`(`dim_city_id`, `dim_time_id`, `total_active`, `total_inactive`, `daily_new`, `daily_canceled`, `ts`) "
" SELECT city_dwh.id AS dim_ctiy_id, time_dwh.id AS dim_time_id, SUM(CASE WHEN customer_live.active = 1 THEN 1 ELSE 0 END) AS total_active, SUM(CASE WHEN customer_live.active = 0 THEN 1 ELSE 0 END) AS total_inactive, SUM(CASE WHEN customer_live.active = 1 AND DATE(customer_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_new, SUM(CASE WHEN customer_live.active = 0 AND DATE(customer_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_canceled, MIN(NOW()) AS ts "
"FROM subscription_live.`customer` customer_live "
"INNER JOIN subscription_live.`city` city_live ON customer_live.city_id = city_live.id "
"INNER JOIN subscription_live.`country` country_live ON city_live.country_id = country_live.id "
"INNER JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
"INNER JOIN subscription_dwh.dim_time time_dwh ON time_dwh.time_date = " + yesterday_str + " "
"GROUP BY city_dwh.id, time_dwh.id")
cursor.execute(query)
# 2.2 update subscription statuses
# delete old data for the same date
query = (
"DELETE subscription_dwh.`fact_subscription_status`.* "
"FROM subscription_dwh.`fact_subscription_status` "
"INNER JOIN subscription_dwh.`dim_time` ON subscription_dwh.`fact_subscription_status`.`dim_time_id` = subscription_dwh.`dim_time`.`id` "
"WHERE subscription_dwh.`dim_time`.`time_date` = " + yesterday_str)
cursor.execute(query)
# insert new data
query = (
"INSERT INTO subscription_dwh.`fact_subscription_status`(`dim_city_id`, `dim_time_id`, `total_active`, `total_inactive`, `daily_new`, `daily_canceled`, `ts`) "
"SELECT city_dwh.id AS dim_ctiy_id, time_dwh.id AS dim_time_id, SUM(CASE WHEN subscription_live.active = 1 THEN 1 ELSE 0 END) AS total_active, SUM(CASE WHEN subscription_live.active = 0 THEN 1 ELSE 0 END) AS total_inactive, SUM(CASE WHEN subscription_live.active = 1 AND DATE(subscription_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_new, SUM(CASE WHEN subscription_live.active = 0 AND DATE(subscription_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_canceled, MIN(NOW()) AS ts "
"FROM subscription_live.`customer` customer_live "
"INNER JOIN subscription_live.`subscription` subscription_live ON subscription_live.customer_id = customer_live.id "
"INNER JOIN subscription_live.`city` city_live ON customer_live.city_id = city_live.id "
"INNER JOIN subscription_live.`country` country_live ON city_live.country_id = country_live.id "
"INNER JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
"INNER JOIN subscription_dwh.dim_time time_dwh ON time_dwh.time_date = " + yesterday_str + " "
"GROUP BY city_dwh.id, time_dwh.id")
cursor.execute(query)
print('Fact tables updated.')
# commit & close connection
cursor.close()
connection.commit()
connection.close()
print('Disconnected from database.')
etl_queries.py
Once more, we’ll need to import MySQL using the following code:
import mysql.connector
We’ll also import the datetime module, as shown below. We need this for date-related operations in Python:
from datetime import date
The process for connecting to the database is the same as in the previous example.
To update the dim_time
table, we’ll need to check if the value (for yesterday) is already in the table. We’ll have to use Python’s date functions (instead of SQL’s) to do this:
# date - yesterday
yesterday = date.fromordinal(date.today().toordinal()-1)
yesterday_str = '"' + str(yesterday) + '"'
The first line of code will return yesterday’s date in the date variable, while the second line will store this value as a string. We’ll need this as a string because we will concatenate it with another string when we build the SQL query.
Next, we’ll need to test if this date is already in the dim_time
table. After declaring a cursor, we’ll prepare the SQL query. To execute the query, we’ll use the cursor.execute
command:
# test if date is already in the table
cursor = connection.cursor()
query = (
"SELECT COUNT(*) "
"FROM subscription_dwh.dim_time "
"WHERE time_date = " + yesterday_str)
cursor.execute(query)
'"'
We’ll store the query result in the result variable. The result will have either 0 or 1 rows, so we can test the first column of the first row. It will contain either a 0 or a 1. (Remember, we can have the same date only once in a dimension table.)
If the date is not already in the table, we’ll prepare the strings that will be part of the SQL query:
result = cursor.fetchall()
yesterday_subscription_count = int(result[0][0])
if yesterday_subscription_count == 0:
yesterday_year = 'YEAR("' + str(yesterday) + '")'
yesterday_month = 'MONTH("' + str(yesterday) + '")'
yesterday_week = 'WEEK("' + str(yesterday) + '")'
yesterday_weekday = 'WEEKDAY("' + str(yesterday) + '")'
Finally, we’ll build a query and execute it. This will update the dim_time
table after it is committed. Please notice that I’ve used the complete path to the table, including the database name (subscription_dwh
).
query = (
"INSERT INTO subscription_dwh.`dim_time`(`time_date`, `time_year`, `time_month`, `time_week`, `time_weekday`, `ts`) "
" VALUES (" + yesterday_str + ", " + yesterday_year + ", " + yesterday_month + ", " + yesterday_week + ", " + yesterday_weekday + ", Now())")
cursor.execute(query)
Updating the dim_city
table is even simpler because we don’t need to test anything before the insert. We’ll actually include that test in the SQL query.
# 1.2 update dim_city
query = (
"INSERT INTO subscription_dwh.`dim_city`(`city_name`, `postal_code`, `country_name`, `ts`) "
"SELECT city_live.city_name, city_live.postal_code, country_live.country_name, Now() "
"FROM subscription_live.city city_live "
"INNER JOIN subscription_live.country country_live ON city_live.country_id = country_live.id "
"LEFT JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
"WHERE city_dwh.id IS NULL")
cursor.execute(query)
Here we prepare an execute the SQL query. Notice that I’ve again used the full paths to tables, including the names of both databases (subscription_live
and subscription_dwh
).
The last thing we need to do is to update our fact tables. The process is almost the same as updating dimension tables: we prepare queries and execute them. These queries are much more complex, but they are the same as the ones used in the stored procedures.
We’ve added one improvement compared to the stored procedures: deleting the existing data for the same date in the fact table. This will allow us to run a script multiple times for the same date. At the end, we’ll need to commit the transaction and close all objects and the connection.
We have a minor change in this part, which is calling a different script:
- import os
- file_path = 'D://python_scripts'
- os.chdir(file_path)
- exec(open("etl_queries.py").read())
Because we’ve used the same messages and the script completed successfully, the result is the same:
Today we saw one example of performing the ETL process with a Python script. There are other ways to do this, e.g. a number of open-source solutions that utilize Python libraries to work with databases and perform the ETL process. In the next article, we’ll play with one of them. In the meantime, feel free to share your experience with Python and ETL.
Recommended Reading
☞ Post Multipart Form Data in Python with Requests: Flask File Upload Example
☞ Get Modular with Python Functions
☞ Six Python Tips for Beginners
☞ 5 Python Frameworks You Should Learn 2019
☞ Data Preprocessing in Python
☞ Deep Learning from Scratch and Using Tensorflow in Python
☞ Deploying a python-django application using docker
#python #mysql