Feature Engineering in SQL and Python: A Hybrid Approach

Set up your workstation, reduce workplace clutter, maintain a clean namespace, and effortlessly keep your dataset up-to-date

I knew SQL long before learning about Pandas, and I was intrigued by the way Pandas faithfully emulates SQL. Stereotypically, SQL is for analysts, who crunch data into informative reports, whereas Python is for data scientists, who use data to build (and overfit) models. Although they are almost functionally equivalent, I’d argue both tools are essential for a data scientist to work efficiently. From my experience with Pandas, I’ve noticed the following:

  • I end up with many CSV files when exploring different features.
  • When I aggregate over a big dataframe, the Jupyter kernel simply dies.
  • I have multiple dataframes with confusing (and long) names in my kernel.
  • My feature engineering code looks ugly and is scattered over many cells.

Those problems were naturally solved when I began feature engineering directly in SQL. So in this post, I’ll share some of my favorite tricks by working through a take-home challenge dataset. If you know a little bit of SQL, it’s time to put it to good use.

Installing MySQL

To start with, you need a SQL server. I’m using MySQL in this post. You can get MySQL server by installing one of the local desktop servers such as MAMP, WAMP or XAMPP. There are many tutorials online, and it’s worth going through the trouble.

After setting up your server, make sure you have three items ready: username, password, and port number. Log in through Terminal by entering the following command (here the username is “root” and the password is 1234567):

mysql -uroot -p1234567

Then create a database called “Shutterfly” in the MySQL console (you can name it whatever you want). The two tables will be loaded into this database.

create database Shutterfly;

Install sqlalchemy

You’ll need Pandas and sqlalchemy to work with SQL in Python. I bet you already have Pandas. Then install sqlalchemy: activate the environment you use to launch Jupyter notebook, and enter:

pip install sqlalchemy

The sqlalchemy module also requires MySQLdb module. Depending on your OS, this can be installed using different commands.
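One common route (an assumption about your setup, not the only option) is the actively maintained mysqlclient fork, which provides the MySQLdb module:

pip install mysqlclient

On macOS or Linux, you may need the MySQL client headers first (e.g. brew install mysql-client or apt-get install default-libmysqlclient-dev) before the pip install succeeds.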

Load Dataset into MySQL Server

In this example, we’ll load data from two CSV files and engineer features directly in MySQL. To load the datasets, we need to instantiate an engine object using the username, password, port number, and database name. Two tables will be created: Online and Purchase (the latter loaded from Order.csv). A natural index will be created on each table.

from sqlalchemy import create_engine
import pandas as pd

username = "root"
password = "1234567"
port = 7777
database = "Shutterfly"

engine = create_engine('mysql+mysqldb://%s:%s@localhost:%i/%s'
                       % (username, password, port, database))

df_online = pd.read_csv("data/online.csv")
df_online.to_sql('Online', engine, if_exists='replace')

df_order = pd.read_csv("data/Order.csv")
df_order.to_sql('Purchase', engine, if_exists='replace')

In the MySQL console, you can verify that the tables have been created.

mysql> use shutterfly;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> show tables;
+----------------------+
| Tables_in_shutterfly |
+----------------------+
| Online |
| Purchase |
+----------------------+
2 rows in set (0.00 sec)

Split Dataset

This may seem counter-intuitive since we haven’t built any features yet. But it’s actually very neat, because all we need to do is split the dataset by index. By design, I also included the label (event2) which we’re trying to predict. When loading features, we will simply join the index with feature tables.

sql = "SELECT index, event2 FROM Online;"
df = pd.read_sql_query(sql, engine).set_index('index')

shuffle dataset, preserving index

df = df.sample(frac=1)

train_frac = 0.9
test_frac = 1 - train_frac

trn_cutoff = int(len(df) * train_frac)

df_trn = df[:trn_cutoff]
df_tst = df[trn_cutoff:]

df_trn.to_sql('trn_set', engine, if_exists='replace')
df_tst.to_sql('tst_set', engine, if_exists='replace')

In the MySQL console, you can verify that the training and test sets have been created.

mysql> select count(*) from trn_set;
+----------+
| count(*) |
+----------+
|   859296 |
+----------+
1 row in set (0.61 sec)

mysql> select count(*) from tst_set;
+----------+
| count(*) |
+----------+
|    95478 |
+----------+
1 row in set (0.07 sec)

mysql> select * from trn_set limit 5;
+--------+--------+
| index | event2 |
+--------+--------+
| 464836 | 7 |
| 623193 | 7 |
| 240268 | 3 |
| 512374 | 7 |
| 381816 | 4 |
+--------+--------+
5 rows in set (0.00 sec)

Feature Engineering

This is the heavy lifting part. I write SQL code directly in Sublime Text, and debug my code by pasting it into the MySQL console. Because this dataset is an event log, we must avoid leaking future information into each data point. As you can imagine, every feature needs to be aggregated over the history!

Joining tables is the slowest operation, so we want to get as many features as possible from each join. In this dataset, I implemented four types of join, resulting in four groups of features. The details are not important, but you can find all my SQL snippets here. Each snippet creates a table. The index is preserved and must match correctly to the response variable in the training set and test set. Each snippet is structured like this:

USE Shutterfly;

DROP TABLE IF EXISTS features_group_1;

CREATE TABLE IF NOT EXISTS features_group_1
SELECT o.`index`
,LEFT(o.dt, 10) AS day
,COUNT(*) AS order_count
,SUM(p.revenue) AS revenue_sum
,MAX(p.revenue) AS revenue_max
,MIN(p.revenue) AS revenue_min
,SUM(p.revenue) / COUNT(*) AS rev_p_order
,COUNT(p.prodcat1) AS prodcat1_count
,COUNT(p.prodcat2) AS prodcat2_count
,DATEDIFF(o.dt, MAX(p.orderdate)) AS days_last_order
,DATEDIFF(o.dt, MAX(CASE WHEN p.prodcat1 IS NOT NULL THEN p.orderdate ELSE NULL END)) AS days_last_prodcat1
,DATEDIFF(o.dt, MAX(CASE WHEN p.prodcat2 IS NOT NULL THEN p.orderdate ELSE NULL END)) AS days_last_prodcat2
,SUM(p.prodcat1 = 1) AS prodcat1_1_count
,SUM(p.prodcat1 = 2) AS prodcat1_2_count
,SUM(p.prodcat1 = 3) AS prodcat1_3_count
,SUM(p.prodcat1 = 4) AS prodcat1_4_count
,SUM(p.prodcat1 = 5) AS prodcat1_5_count
,SUM(p.prodcat1 = 6) AS prodcat1_6_count
,SUM(p.prodcat1 = 7) AS prodcat1_7_count
FROM Online AS o
JOIN Purchase AS p
ON o.custno = p.custno
AND p.orderdate <= o.dt
GROUP BY o.`index`;

ALTER TABLE features_group_1
ADD KEY ix_features_group_1_index (`index`);

To generate the feature tables, open a new Terminal, navigate to the folder containing the sql files, and enter the following commands and passwords. The first snippet creates some necessary indices that speed up the join operation. The next four snippets create four feature tables. Without the indices, the joining takes forever. With the indices, it takes about 20 minutes (not bad on a local machine).

mysql < add_index.sql -uroot -p1234567
mysql < feature_group_1.sql -uroot -p1234567
mysql < feature_group_2.sql -uroot -p1234567
mysql < feature_group_3.sql -uroot -p1234567
mysql < feature_group_4.sql -uroot -p1234567
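The exact statements live in add_index.sql in the linked snippets; roughly, they just need to cover the join keys. A sketch of the kind of statements it would contain (my guess at the shape, not the author’s verbatim file):

USE Shutterfly;
ALTER TABLE Online ADD KEY ix_online_custno_dt (custno, dt);
ALTER TABLE Purchase ADD KEY ix_purchase_custno_orderdate (custno, orderdate);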

Now you should have the following tables in the database. Note that the derived features are stored separately from the original event logs, which helps prevent confusion and disaster.

mysql> show tables;
+----------------------+
| Tables_in_shutterfly |
+----------------------+
| Online |
| Purchase |
| features_group_1 |
| features_group_2 |
| features_group_3 |
| features_group_4 |
| trn_set |
| tst_set |
+----------------------+
8 rows in set (0.01 sec)

Load Features

Here I wrote a utility function that pulls data from the MySQL server.

  • The function takes the table name “trn_set” (training set) or “tst_set” (test set) as input, plus an optional limit clause if you only want a subset of the data.
  • Unique columns, and columns with mostly missing values, are dropped.
  • The date column is mapped to month, to help capture seasonality effects.
  • Notice how the feature tables are joined in succession. This is actually efficient, because we are always joining on the index, a one-to-one mapping.

def load_dataset(split="trn_set", limit=None, ignore_categorical=False):
    sql = """
        SELECT o.*, f1.*, f2.*, f3.*, f4.*,
            EXTRACT(MONTH FROM o.dt) AS month
        FROM %s AS t
        JOIN Online AS o
            ON t.`index` = o.`index`
        JOIN features_group_1 AS f1
            ON t.`index` = f1.`index`
        JOIN features_group_2 AS f2
            ON t.`index` = f2.`index`
        JOIN features_group_3 AS f3
            ON t.`index` = f3.`index`
        JOIN features_group_4 AS f4
            ON t.`index` = f4.`index`
    """ % split
    if limit:
        sql += " LIMIT %i" % limit

    df = pd.read_sql_query(sql.replace('\n', ' ').replace('\t', ' '), engine)
    df.event1 = df.event1.fillna(0)
    X = df.drop(["index", "event2", "dt", "day", "session", "visitor", "custno"], axis=1)
    Y = df.event2
    return X, Y

Finally, let’s take a look at 5 training examples, and their features.

X_trn, Y_trn = load_dataset("trn_set", limit=5)

print(X_trn.head().T)

                        0       1       2       3       4
category                1       1       1       1       1
event1                  0       0       0       0       0
order_count             1       2       2       1       1
revenue_sum         30.49  191.33  191.33   76.96   66.77
revenue_max         30.49   98.38   98.38   76.96   66.77
revenue_min         30.49   92.95   92.95   76.96   66.77
rev_p_order         30.49  95.665  95.665   76.96   66.77
prodcat1_count          1       2       2       1       1
prodcat2_count          1       2       2       1       1
days_last_order       270     311     311     137     202
days_last_prodcat1    270     311     311     137     202
days_last_prodcat2    270     311     311     137     202
prodcat1_1_count        0       2       2       0       0
prodcat1_2_count        1       0       0       1       0
prodcat1_3_count        0       0       0       0       1
prodcat1_4_count        0       0       0       0       0
prodcat1_5_count        0       0       0       0       0
prodcat1_6_count        0       0       0       0       0
prodcat1_7_count        0       0       0       0       0
category_1_count        4      29      29       2       2
category_2_count        0      12      12       2       0
category_3_count        0       4       4       8      11
event1_1_count          0       6       6       0       0
event1_2_count          0       1       1       1       0
event1_4_count          0       0       0       1       0
event1_5_count          0       0       0       0       0
event1_6_count          0       0       0       0       0
event1_7_count          0       0       0       0       0
event1_8_count          0       1       1       0       0
event1_9_count          0       0       0       0       0
event1_10_count         0       0       0       0       0
event1_11_count         0       0       0       1       0
event2_null_count       4      37      37       9      13
event2_1_count          1      11      11       1       3
event2_2_count          0       1       1       0       0
event2_3_count          1       4       4       0       2
event2_4_count          0       4       4       2       1
event2_5_count          0       2       2       0       1
event2_6_count          0       2       2       0       0
event2_7_count          1      15      15       5       5
event2_8_count          1       6       6       3       1
event2_9_count          0       0       0       0       0
event2_10_count         0       0       0       1       0
last_category           1       1       1       1       1
last_event1          None    None    None    None    None
last_event2             8       7       7       8       3
last_revenue        30.49   92.95   98.38   76.96   66.77
last_prodcat1           2       1       1       2       3
last_prodcat2           3      11      89       3       9
month                  11       8       8       1      11

Now you have a well-defined dataset and feature set. You can tweak the scale of each feature and the handling of missing values to suit your model’s requirements.
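For models that do need complete, scaled inputs, a scikit-learn pipeline is one way to do both in a single step (a sketch; it assumes the columns you feed it are numeric, and keep in mind the caveat about informative missingness below):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# median-impute missing values, then standardize each feature
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_scaled = preprocess.fit_transform(X_trn.select_dtypes(include="number"))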

For tree-based methods, which are invariant to feature scaling, we can directly apply the model, and simply focus on tuning parameters! See an example of a plain-vanilla gradient boosting machine here.
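As a rough sketch of that step (my own minimal example, not the linked notebook), scikit-learn’s histogram-based gradient boosting can consume these features directly, since it handles NaNs natively, assuming the few non-numeric columns (e.g. last_event1) are dropped or encoded first:

from sklearn.ensemble import HistGradientBoostingClassifier  # scikit-learn >= 0.24
from sklearn.model_selection import train_test_split

# load the full training split from MySQL, keep numeric features only
X_trn, Y_trn = load_dataset("trn_set")
X_num = X_trn.select_dtypes(include="number")

# hold out 10% for a quick sanity check
X_fit, X_val, y_fit, y_val = train_test_split(X_num, Y_trn, test_size=0.1)

clf = HistGradientBoostingClassifier(max_iter=200)  # NaN-aware boosted trees
clf.fit(X_fit, y_fit)
print("validation accuracy:", clf.score(X_val, y_val))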

It is nice to see that the useful features all turned out to be engineered ones, except for the category feature. Our efforts paid off! Also, the most predictive feature of event2 is how many null values were observed in event2. This is an illustrative case where we cannot replace null values with the median or average, because the fact that they are missing is correlated with the response variable!

Summary

As you can see, we have no intermediate CSV files, a very clean namespace in our notebook, and our feature engineering code is reduced to a few straightforward SQL statements. There are two situations in which the SQL approach is even more efficient:

  • If your dataset is deployed on the cloud, you may be able to run distributed queries. Most SQL servers support distributed queries today. In Pandas, you would need an extension such as Dask DataFrame.
  • If you can afford to pull data in real time, you can create SQL views instead of tables. That way, every time you pull data in Python, it is always up-to-date (see the sketch after this list).
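For illustration, here is what one of the earlier feature tables would look like as a view (a sketch; the SELECT body is the same as in the CREATE TABLE snippet above). Note that MySQL views are not materialized, so each read recomputes the aggregation:

CREATE OR REPLACE VIEW features_group_1_view AS
SELECT o.`index`
      ,COUNT(*) AS order_count
      ,SUM(p.revenue) AS revenue_sum
      -- ... the remaining aggregations from features_group_1 ...
FROM Online AS o
JOIN Purchase AS p
    ON o.custno = p.custno
    AND p.orderdate <= o.dt
GROUP BY o.`index`;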

One fundamental restriction of this approach is that you must be able to directly connect to your SQL server in Python. If this is not possible, you may have to download the query result as a CSV file and load it in Python.

I hope you find this post helpful. Though I’m not advocating one method over another, it is necessary to understand the advantages and limitations of each method, and to have both ready in our toolkit, so we can apply whichever works best under the constraints.

Originally published by Shaw Lu at https://towardsdatascience.com/feature-engineering-in-sql-and-python-a-hybrid-approach-b52347cd2de4


Using Python and MySQL in the ETL Process

Python is very popular these days. Since Python is a general-purpose programming language, it can also be used to perform the Extract, Transform, Load (ETL) process. Different ETL modules are available, but today we’ll stick with the combination of Python and MySQL.

We’ll use Python to invoke stored procedures and prepare and execute SQL statements.

We’ll use two similar-but-different approaches. First, we’ll invoke stored procedures that will do the whole job, and after that we’ll analyze how we could do the same process without stored procedures by using MySQL code in Python.

Ready? Before we dig in, let’s look at the data model – or data models, as there are two of them in this article.

The Data Models

We’ll need two data models, one to store our operational data and the other to store our reporting data.


The first model is shown in the picture above. This model is used to store operational (live) data for a subscription-based business. For more insight into this model, please take a look at our previous article, CREATING A DWH, PART ONE: A SUBSCRIPTION BUSINESS DATA MODEL.

https://my.vertabelo.com/model/IvcYd9W06dorp9zZxp9OfUGFyMlD7Z7s

Separating operational and reporting data is usually a very wise decision. To achieve that separation, we’ll need to create a data warehouse (DWH). We already did that; you can see the model in the picture above. This model is also described in detail in the post CREATING A DWH, PART TWO: A SUBSCRIPTION BUSINESS DATA MODEL.

Finally, we need to extract data from the live database, transform it, and load it into our DWH. We’ve already done this using SQL stored procedures. You can find a description of what we want to achieve along with some code examples in CREATING A DATA WAREHOUSE, PART 3: A SUBSCRIPTION BUSINESS DATA MODEL.

If you need additional information regarding DWHs, we recommend reading the articles linked above.

Our task today is to replace the SQL stored procedures with Python code. We’re ready to make some Python magic. Let’s start with using only stored procedures in Python.

Method 1: ETL Using Stored Procedures

Before we start describing the process, it’s important to mention that we have two databases on our server.

The subscription_live database is used to store transactional/live data, while the subscription_dwh is our reporting database (DWH).

We’ve already described the stored procedures used to update dimension and fact tables. They will read data from the subscription_live database, combine it with data in the subscription_dwh database, and insert new data into the subscription_dwh database. These two procedures are:

  • p_update_dimensions – updates the dimension tables dim_time and dim_city.
  • p_update_facts – updates two fact tables, fact_customer_subscribed and fact_subscription_status.

If you want to see the complete code for these procedures, read CREATING A DATA WAREHOUSE, PART 3: A SUBSCRIPTION BUSINESS DATA MODEL.

Now we’re ready to write a simple Python script that will connect to the server and perform the ETL process. Let’s first take a look at the whole script (etl_procedures.py). Then we’ll explain the most important parts.

# import MySQL connector
import mysql.connector

# connect to server
connection = mysql.connector.connect(user='<username>', password='<password>', host='127.0.0.1')
print('Connected to database.')
cursor = connection.cursor()

# I update dimensions
cursor.callproc('subscription_dwh.p_update_dimensions')
print('Dimension tables updated.')

# II update facts
cursor.callproc('subscription_dwh.p_update_facts')
print('Fact tables updated.')

# commit & close connection
cursor.close()
connection.commit()
connection.close()
print('Disconnected from database.')


etl_procedures.py

Importing Modules and Connecting to the Database

Python uses modules to store definitions and statements. You could use an existing module or write your own. Using existing modules will simplify your life because you’re using pre-written code, but writing your own module is also very useful. When you quit the Python interpreter and run it again, you’ll lose functions and variables you’ve previously defined. Of course, you don’t want to type the same code over and over again. To avoid that, you could store your definitions in a module and import it into Python.

Back to etl_procedures.py. In our program, we start with importing MySQL Connector:

# import MySQL connector
import mysql.connector

MySQL Connector for Python is used as a standardized driver that connects to a MySQL server/database. You’ll need to download it and install it if you haven’t previously done that. Besides connecting to the database, it offers a number of methods and properties for working with a database. We’ll use some of them, but you can check the complete documentation here.

Next, we’ll need to connect to our database:

# connect to server
connection = mysql.connector.connect(user='<username>', password='<password>', host='127.0.0.1')
print('Connected to database.')
cursor = connection.cursor()

The first line will connect to a server (in this case, I’m connecting to my local machine) using your credentials (replace <username> and <password> with actual values). While establishing a connection, you could also specify the database you want to connect to, as shown below:

connection = mysql.connector.connect(user='<username>', password='<password>', host='127.0.0.1', database='<database_name>')

I’ve intentionally connected only to a server and not to a specific database because I’ll be using two databases located on the same server.

The next command – print – is here just a notification that we were successfully connected. While it has no programming significance, it could be used to debug the code if something went wrong in the script.

The last line in this part is:

cursor = connection.cursor()

Cursors are the handler structure used to work with the data. We’ll use them for retrieving data from the database (SELECT), but also to modify the data (INSERT, UPDATE, DELETE). Before using a cursor, we need to create it. And that is what this line does.

Calling Procedures

The previous part was general and could be used for other database-related tasks. The following part of the code is specifically for ETL: calling our stored procedures with the cursor.callproc command. It looks like this:

# 1. update dimensions
cursor.callproc('subscription_dwh.p_update_dimensions')
print('Dimension tables updated.')

# 2. update facts
cursor.callproc('subscription_dwh.p_update_facts')
print('Fact tables updated.')

Calling procedures is pretty much self-explanatory. After each call, a print command was added. Again, this just gives us a notification that everything went okay.

Commit and Close

The final part of the script commits the database changes and closes all used objects:

# commit & close connection
cursor.close()
connection.commit()
connection.close()
print('Disconnected from database.')


Committing is essential here; without it, there will be no changes to the database, even if you called a procedure or executed an SQL statement.
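The explicit commit keeps the whole ETL run in a single transaction. If you would rather not manage commits manually, mysql.connector can also be switched into autocommit mode right after connecting (shown as an option here, not what the script above does):

connection.autocommit = True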

Running the Script

The last thing we need to do is to run our script. We’ll use the following commands in the Python Shell to achieve that:

import os
file_path = 'D://python_scripts'
os.chdir(file_path)
exec(open("etl_procedures.py").read())
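Equivalently (assuming python is on your PATH), you can run the file directly from a terminal:

cd D://python_scripts
python etl_procedures.py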

The script is executed and all changes are made in the database accordingly.

Method 2: ETL Using Python and MySQL

The approach presented above doesn’t differ a lot from the approach of calling stored procedures directly in MySQL. The only difference is that now we have a script that will do the whole job for us.

We could use another approach: putting everything inside the Python script. We’ll include Python statements, but we’ll also prepare SQL queries and execute them on the database. The source database (live) and the destination database (DWH) are the same as in the example with stored procedures.

Before we delve into this, let’s take a look at the complete script (etl_queries.py):

from datetime import date

# import MySQL connector
import mysql.connector

# connect to server
connection = mysql.connector.connect(user='<username>', password='<password>', host='127.0.0.1')
print('Connected to database.')

# 1. update dimensions

# 1.1 update dim_time
# date - yesterday
yesterday = date.fromordinal(date.today().toordinal()-1)
yesterday_str = '"' + str(yesterday) + '"'
# test if date is already in the table
cursor = connection.cursor()
query = (
  "SELECT COUNT(*) "
  "FROM subscription_dwh.dim_time " 
  "WHERE time_date = " + yesterday_str)
cursor.execute(query)
result = cursor.fetchall()
yesterday_subscription_count = int(result[0][0])
if yesterday_subscription_count == 0:
  yesterday_year = 'YEAR("' + str(yesterday) + '")'
  yesterday_month = 'MONTH("' + str(yesterday) + '")'
  yesterday_week = 'WEEK("' + str(yesterday) + '")'
  yesterday_weekday = 'WEEKDAY("' + str(yesterday) + '")'
  query = (
    "INSERT INTO subscription_dwh.`dim_time`(`time_date`, `time_year`, `time_month`, `time_week`, `time_weekday`, `ts`) "
    "VALUES (" + yesterday_str + ", " + yesterday_year + ", " + yesterday_month + ", " + yesterday_week + ", " + yesterday_weekday + ", Now())")
  cursor.execute(query)

# 1.2 update dim_city
query = (
  "INSERT INTO subscription_dwh.`dim_city`(`city_name`, `postal_code`, `country_name`, `ts`) "
  "SELECT city_live.city_name, city_live.postal_code, country_live.country_name, Now() "
  "FROM subscription_live.city city_live "
  "INNER JOIN subscription_live.country country_live ON city_live.country_id = country_live.id "
  "LEFT JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
  "WHERE city_dwh.id IS NULL")
cursor.execute(query)

print('Dimension tables updated.')


# 2. update facts

# 2.1 update customers subscribed
# delete old data for the same date
query = (
  "DELETE subscription_dwh.`fact_customer_subscribed`.* "
  "FROM subscription_dwh.`fact_customer_subscribed` "
  "INNER JOIN subscription_dwh.`dim_time` ON subscription_dwh.`fact_customer_subscribed`.`dim_time_id` = subscription_dwh.`dim_time`.`id` "
  "WHERE subscription_dwh.`dim_time`.`time_date` = " + yesterday_str)
cursor.execute(query)
# insert new data
query = (
  "INSERT INTO subscription_dwh.`fact_customer_subscribed`(`dim_city_id`, `dim_time_id`, `total_active`, `total_inactive`, `daily_new`, `daily_canceled`, `ts`) "
  " SELECT city_dwh.id AS dim_ctiy_id, time_dwh.id AS dim_time_id, SUM(CASE WHEN customer_live.active = 1 THEN 1 ELSE 0 END) AS total_active, SUM(CASE WHEN customer_live.active = 0 THEN 1 ELSE 0 END) AS total_inactive, SUM(CASE WHEN customer_live.active = 1 AND DATE(customer_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_new, SUM(CASE WHEN customer_live.active = 0 AND DATE(customer_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_canceled, MIN(NOW()) AS ts "
  "FROM subscription_live.`customer` customer_live "
  "INNER JOIN subscription_live.`city` city_live ON customer_live.city_id = city_live.id "
  "INNER JOIN subscription_live.`country` country_live ON city_live.country_id = country_live.id "
  "INNER JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
  "INNER JOIN subscription_dwh.dim_time time_dwh ON time_dwh.time_date = " + yesterday_str + " " 
  "GROUP BY city_dwh.id, time_dwh.id")
cursor.execute(query)

# 2.2 update subscription statuses
# delete old data for the same date
query = (
  "DELETE subscription_dwh.`fact_subscription_status`.* "
  "FROM subscription_dwh.`fact_subscription_status` "
  "INNER JOIN subscription_dwh.`dim_time` ON subscription_dwh.`fact_subscription_status`.`dim_time_id` = subscription_dwh.`dim_time`.`id` "
  "WHERE subscription_dwh.`dim_time`.`time_date` = " + yesterday_str)
cursor.execute(query)
# insert new data
query = (
  "INSERT INTO subscription_dwh.`fact_subscription_status`(`dim_city_id`, `dim_time_id`, `total_active`, `total_inactive`, `daily_new`, `daily_canceled`, `ts`) "
  "SELECT city_dwh.id AS dim_ctiy_id, time_dwh.id AS dim_time_id, SUM(CASE WHEN subscription_live.active = 1 THEN 1 ELSE 0 END) AS total_active, SUM(CASE WHEN subscription_live.active = 0 THEN 1 ELSE 0 END) AS total_inactive, SUM(CASE WHEN subscription_live.active = 1 AND DATE(subscription_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_new, SUM(CASE WHEN subscription_live.active = 0 AND DATE(subscription_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_canceled, MIN(NOW()) AS ts "
  "FROM subscription_live.`customer` customer_live "
  "INNER JOIN subscription_live.`subscription` subscription_live ON subscription_live.customer_id = customer_live.id "
  "INNER JOIN subscription_live.`city` city_live ON customer_live.city_id = city_live.id "
  "INNER JOIN subscription_live.`country` country_live ON city_live.country_id = country_live.id "
  "INNER JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
  "INNER JOIN subscription_dwh.dim_time time_dwh ON time_dwh.time_date = " + yesterday_str + " "
  "GROUP BY city_dwh.id, time_dwh.id")
cursor.execute(query)

print('Fact tables updated.')

# commit & close connection
cursor.close()
connection.commit()
connection.close()
print('Disconnected from database.')

etl_queries.py

Importing Modules and Connecting to the Database

Once more, we’ll need to import MySQL using the following code:

import mysql.connector

We’ll also import the datetime module, as shown below. We need this for date-related operations in Python:

from datetime import date

The process for connecting to the database is the same as in the previous example.

Updating the dim_time Dimension

To update the dim_time table, we’ll need to check if the value (for yesterday) is already in the table. We’ll have to use Python’s date functions (instead of SQL’s) to do this:

# date - yesterday
yesterday = date.fromordinal(date.today().toordinal()-1)
yesterday_str = '"' + str(yesterday) + '"'


The first line of code computes yesterday’s date and stores it in the yesterday variable, while the second line stores this value as a string. We’ll need it as a string because we will concatenate it with another string when we build the SQL query.

Next, we’ll need to test if this date is already in the dim_time table. After declaring a cursor, we’ll prepare the SQL query. To execute the query, we’ll use the cursor.execute command:

# test if date is already in the table
cursor = connection.cursor()
query = (
  "SELECT COUNT(*) "
  "FROM subscription_dwh.dim_time " 
  "WHERE time_date = " + yesterday_str)
cursor.execute(query)
'"'

We’ll store the query result in the result variable. The result will have either 0 or 1 rows, so we can test the first column of the first row. It will contain either a 0 or a 1. (Remember, we can have the same date only once in a dimension table.)

If the date is not already in the table, we’ll prepare the strings that will be part of the SQL query:

result = cursor.fetchall()
yesterday_subscription_count = int(result[0][0])
if yesterday_subscription_count == 0:
  yesterday_year = 'YEAR("' + str(yesterday) + '")'
  yesterday_month = 'MONTH("' + str(yesterday) + '")'
  yesterday_week = 'WEEK("' + str(yesterday) + '")'
  yesterday_weekday = 'WEEKDAY("' + str(yesterday) + '")'

Finally, we’ll build a query and execute it. This will update the dim_time table after it is committed. Please notice that I’ve used the complete path to the table, including the database name (subscription_dwh).

  query = (
    "INSERT INTO subscription_dwh.`dim_time`(`time_date`, `time_year`, `time_month`, `time_week`, `time_weekday`, `ts`) "
    "VALUES (" + yesterday_str + ", " + yesterday_year + ", " + yesterday_month + ", " + yesterday_week + ", " + yesterday_weekday + ", Now())")
  cursor.execute(query)

Updating the dim_city Dimension

Updating the dim_city table is even simpler because we don’t need to test anything before the insert. We’ll actually include that test in the SQL query.

# 1.2 update dim_city
query = (
  "INSERT INTO subscription_dwh.`dim_city`(`city_name`, `postal_code`, `country_name`, `ts`) "
  "SELECT city_live.city_name, city_live.postal_code, country_live.country_name, Now() "
  "FROM subscription_live.city city_live "
  "INNER JOIN subscription_live.country country_live ON city_live.country_id = country_live.id "
  "LEFT JOIN subscription_dwh.dim_city city_dwh ON city_live.city_name = city_dwh.city_name AND city_live.postal_code = city_dwh.postal_code AND country_live.country_name = city_dwh.country_name "
  "WHERE city_dwh.id IS NULL")
cursor.execute(query)

Here we prepare and execute the SQL query. Notice that I’ve again used the full paths to tables, including the names of both databases (subscription_live and subscription_dwh).

Updating the Fact Tables

The last thing we need to do is to update our fact tables. The process is almost the same as updating dimension tables: we prepare queries and execute them. These queries are much more complex, but they are the same as the ones used in the stored procedures.

We’ve added one improvement compared to the stored procedures: deleting the existing data for the same date in the fact table. This will allow us to run a script multiple times for the same date. At the end, we’ll need to commit the transaction and close all objects and the connection.

Running the Script

We have a minor change in this part, which is calling a different script:

import os
file_path = 'D://python_scripts'
os.chdir(file_path)
exec(open("etl_queries.py").read())

Because we’ve used the same messages and the script completed successfully, the result is the same.

How Would You Use Python in ETL?

Today we saw one example of performing the ETL process with a Python script. There are other ways to do this, e.g. a number of open-source solutions that utilize Python libraries to work with databases and perform the ETL process. In the next article, we’ll play with one of them. In the meantime, feel free to share your experience with Python and ETL.



Getting Started with MySQL in Python

This tutorial explains how to download MySQL and create a connection.

MySQL Python: About MySQL

  • MySQL is a fast, easy-to-use relational database. It is currently the most popular open-source database.
  • MySQL is used by many small and big businesses. It is developed, marketed, and supported by MySQL AB, a Swedish company. It is written in C and C++.
  • MySQL is an open-source database, so you don’t have to pay a single penny to use it.
Download MySQL

Follow these steps:

  1. Go to the official MySQL website
  2. Choose the version number for the MySQL community server that you want.
MySQL Python Connector

MySQL Python Connector is used to access the MySQL database from Python. You need a database driver.

MySQL Connector/Python is a standardized database driver provided by MySQL.

To check whether the mysql.connector is available or not, we type the following command:

>>> import mysql.connector

If typing this produces an error like “No module named 'mysql'”, the module is not yet present.

Then, we have to install it: Python needs a MySQL driver to access the MySQL database.

So, next, we download the mysql-connector with the use of pip.

C:\Users\Nitin Arvind Shelke>pip install mysql-connector

After installation, we test whether it works or not. Let’s check with the following command:

>>> import mysql.connector

The above line imports the MySQL Connector Python module in your program, so you can use this module’s API to connect MySQL.

If the above code was executed with no errors, we can say that “MySQL Connector” is installed properly and get ready to use it.

>>> from mysql.connector import Error

The MySQL connector Error object is used to show us an error when we fail to connect to the database, or when any other database error occurs while working with it.
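For instance, a connection attempt can be wrapped so that a failure is reported cleanly (a minimal sketch; replace the credentials with your own):

>>> from mysql.connector import Error
>>> try:
...     connection = mysql.connector.connect( host = 'localhost', user = 'root', password = '[email protected]' )
...     print('Connected.')
... except Error as e:
...     print('Connection failed:', e)
...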

Creating a Connection to the Database

After installing the MySQL Python connector, we need to test it to make sure that it is working correctly, and you can connect to the MySQL database server without any problems. To verify the installation, use the following steps:

Type the following line of code:

>>> import mysql.connector

To establish a connection to the database, we should know the following parameters:

  • Host = localhost (in general, it is the same for all)
  • Database = mysql (you can set this as you wish)
  • User = root (the username)
  • Password = [email protected] (the password set while installing MySQL)

>>> mysql.connector.connect( host = 'localhost', database = 'mysql', user = 'root', password = '[email protected]')

Show the Available Database

You can check if a database exists by listing all the databases in your system with the “SHOW DATABASES” statement:

>>> my_database = mysql.connector.connect( host = 'localhost', database = 'mysql', user = 'root', password = '[email protected]')
>>> cursor = my_database.cursor()
>>> cursor.execute( " show databases " )
>>> for db in cursor:
...  print(db)
...

Output

('bank',)
('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)
>>>

Creating a Database

To create a database in MySQL, we use the “CREATE DATABASE” statement to create the database named “college”:

>>> my_database = mysql.connector.connect( host = 'localhost', user = 'root', password = '[email protected]' )
>>> cursor = my_database.cursor()
>>> cursor.execute( " CREATE DATABASE college "  )
>>> for db in cursor:
...  print(db)
...
>>> cursor.execute( " show databases " )
>>> for db in cursor:
...  print(db)
...

Creating the Tables

Next, we create the tables for the ‘college’ database.

It is compulsory to define the name of the database while creating the tables for it.

The syntax to create a table is:

CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  column3 datatype,
  ...,
  columnN datatype
);

Let’s create the tables students, department, and faculty for the database college.

>>> my_database = mysql.connector.connect ( host = 'localhost', database = 'college', user = 'root', password = '[email protected]' )
>>> cursor = my_database.cursor()
>>> cursor.execute( " CREATE TABLE students ( stud_id varchar(200), stud_name VARCHAR(215), address VARCHAR(215), city char(100)) " )
>>> cursor.execute( " CREATE TABLE department ( dept_id varchar(200), dept_name VARCHAR(215)) " )
>>> cursor.execute( " CREATE TABLE faculty ( faculty_id varchar(200), faculty_name VARCHAR(215) ) " )
Show the Tables

To display the tables, we will have to use the “SHOW TABLES” statement.

The following code displays all the tables present in the database “college”:

>>> cursor.execute( " SHOW TABLES " )
>>> for x in cursor:
...      print(x)
...
('department',)
('faculty',)
('students',)

Assign Primary Key in Table

Primary key: It is a minimal set of attributes (columns) in a table or relation that can uniquely identify tuples (rows) in that table.

For example, Student (Stud_Roll_No, Stud_Name, Addr)

In the student relation, the attribute Stud_Roll_No alone is a primary key, as each student has a unique ID that can identify the student record in the table.

>>> my_database = mysql.connector.connect ( host = 'localhost', database = 'college', user = 'root', password = '[email protected]' )
>>> cursor = my_database.cursor()
>>> cursor.execute( " CREATE TABLE students2 ( stud_id varchar(200) PRIMARY KEY, stud_name VARCHAR(215), address VARCHAR(215), city char(100)) " )

If the table already exists, use the ALTER TABLE keyword:

>>> my_database = mysql.connector.connect ( host = 'localhost', database = 'college', user = 'root', password = '[email protected]' )
>>> cursor = my_database.cursor()
>>> cursor.execute( " ALTER TABLE students ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY " )
Describe the Created Tables

The DESC keyword is used to describe a table in MySQL.

The following code describes the students table from the college database:

>>> cursor.execute("desc students")
>>> for x in cursor:
...     print(x)
...
('stud_id', 'varchar(200)', 'YES', '', None, '')
('stud_name', 'varchar(215)', 'YES', '', None, '')
('address', 'varchar(215)', 'YES', '', None, '')
('city', 'char(100)', 'YES', '', None, '')
>>>

Example 2

The following code describes the students2 (where stud_id is mentioned as primary key) table from the college database:

>>> cursor.execute("desc students2")
>>> for x in cursor:
...     print(x)
...
('stud_id', 'varchar(200)', 'NO', 'PRI', None, '')
('stud_name', 'varchar(215)', 'YES', '', None, '')
('address', 'varchar(215)', 'YES', '', None, '')
('city', 'char(100)', 'YES', '', None, '')
>>>

Insert Data Into the Table

To insert the data into the table, the “insert into” statement is used.

Let’s insert the data into the students table of the college database,

>>> my_database = mysql.connector.connect ( host = 'localhost', database = 'college', user = 'root', password = '[email protected]' )
>>> stm = " INSERT INTO students ( stud_id, stud_name, address, city ) VALUES ('101','Nitin Shelke', 'Congress Nagar', 'Amravati' ) "
>>> cursor = my_database.cursor()
>>> cursor.execute(stm)
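As a side note, mysql.connector also accepts parameterized inserts, which avoid hand-quoting each value, and a commit makes the new row permanent (a sketch of the same insert):

>>> stm = "INSERT INTO students (stud_id, stud_name, address, city) VALUES (%s, %s, %s, %s)"
>>> cursor.execute(stm, ('101', 'Nitin Shelke', 'Congress Nagar', 'Amravati'))
>>> my_database.commit()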
Display or Select the Inserted Data From the Table
>>> cursor.execute(" select * from students")
>>> for x in cursor:
...     print(x)
...
('101', 'Nitin Shelke', 'Congress Nagar', 'Amravati')

An alternate way is to use the fetchall() method.

>>> cursor.fetchall()
[('101', 'Nitin Shelke', 'Congress Nagar', 'Amravati')]

Thanks for reading! Let me know your thoughts in the comments.
