Set up your workstation, reduce workplace clutter, maintain a clean namespace, and effortlessly keep your dataset up-to-date
I knew SQL long before learning about Pandas, and I was intrigued by the way Pandas faithfully emulates SQL. Stereotypically, SQL is for analysts, who crunch data into informative reports, whereas Python is for data scientists, who use data to build (and overfit) models. Although the two are almost functionally equivalent, I'd argue both are essential for a data scientist to work efficiently. In my experience with Pandas, I've run into a few recurring pain points.
Those problems went away naturally once I began feature engineering directly in SQL. So in this post, I'll share some of my favorite tricks by working through a take-home challenge dataset. If you know a little bit of SQL, it's time to put it to good use.
To start, you need a SQL server. I'm using MySQL in this post. You can get a MySQL server by installing one of the local desktop servers such as MAMP, WAMP, or XAMPP. There are many tutorials online, and it's worth going through the trouble.
After setting up your server, make sure you have three items ready: username, password, and port number. Log in through Terminal by entering the following command (here the username is "root" and the password is 1234567):
mysql -uroot -p1234567
Then create a database called “Shutterfly” in the MySQL console (you can name it whatever you want). The two tables will be loaded into this database.
create database Shutterfly;
You'll need Pandas and sqlalchemy to work with SQL in Python. I bet you already have Pandas. To install sqlalchemy, activate the environment you use to launch Jupyter notebook and enter:
pip install sqlalchemy
The sqlalchemy module also requires the MySQLdb module. Depending on your OS, it can be installed using different commands.
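On Python 3, the MySQLdb module is typically provided by the mysqlclient package, so something along these lines usually works (you may also need your OS's MySQL client development headers first):

pip install mysqlclient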
In this example, we'll load data from two CSV files and engineer features directly in MySQL. To load the datasets, we need to instantiate an engine object using the username, password, port number, and database name. Two tables will be created: Online and Purchase. A natural index will be created on each table.
from sqlalchemy import create_engine
import pandas as pd

username = "root"
password = "1234567"
port = 7777
database = "Shutterfly"

engine = create_engine("mysql+mysqldb://%s:%s@localhost:%i/%s"
                       % (username, password, port, database))

df_online = pd.read_csv("data/online.csv")
df_online.to_sql("Online", engine, if_exists="replace")
df_order = pd.read_csv("data/Order.csv")
df_order.to_sql("Purchase", engine, if_exists="replace")
In MySQL console, you can verify that the tables have been created.
mysql> use shutterfly;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------------+
| Tables_in_shutterfly |
+----------------------+
| Online               |
| Purchase             |
+----------------------+
2 rows in set (0.00 sec)
Before engineering any features, we split the dataset into a training set and a test set. This may seem counter-intuitive since we haven't built any features yet, but it's actually very neat, because all we need to do is split the dataset by index. By design, I also included the label (event2) which we are trying to predict. When loading features, we will simply join the index with the feature tables.
sql = "SELECT `index`, event2 FROM Online;"
df = pd.read_sql_query(sql, engine).set_index('index')

# shuffle dataset, preserving index
df = df.sample(frac=1)

train_frac = 0.9
test_frac = 1 - train_frac
trn_cutoff = int(len(df) * train_frac)

df_trn = df[:trn_cutoff]
df_tst = df[trn_cutoff:]

df_trn.to_sql('trn_set', engine, if_exists='replace')
df_tst.to_sql('tst_set', engine, if_exists='replace')
In the MySQL console, you can verify that the training and test sets have been created.
mysql> select count(*) from trn_set;
+----------+
| count(*) |
+----------+
|   859296 |
+----------+
1 row in set (0.61 sec)

mysql> select count(*) from tst_set;
+----------+
| count(*) |
+----------+
|    95478 |
+----------+
1 row in set (0.07 sec)

mysql> select * from trn_set limit 5;
+--------+--------+
| index  | event2 |
+--------+--------+
| 464836 |      7 |
| 623193 |      7 |
| 240268 |      3 |
| 512374 |      7 |
| 381816 |      4 |
+--------+--------+
5 rows in set (0.00 sec)
This is the heavy-lifting part. I write SQL code directly in Sublime Text, and debug it by pasting it into the MySQL console. Because this dataset is an event log, we must avoid leaking future information into each data point. As you can imagine, every feature needs to be aggregated over the history!
Joining tables is the slowest operation, so we want to get as many features as possible out of each join. In this dataset, I implemented four types of join, resulting in four groups of features. The details are not important, but you can find all my SQL snippets here. Each snippet creates a table. The index is preserved and must match the response variable in the training and test sets correctly. Each snippet is structured like this:
USE Shutterfly;

DROP TABLE IF EXISTS features_group_1;

CREATE TABLE IF NOT EXISTS features_group_1
SELECT o.index
    ,LEFT(o.dt, 10) AS day
    ,COUNT(*) AS order_count
    ,SUM(p.revenue) AS revenue_sum
    ,MAX(p.revenue) AS revenue_max
    ,MIN(p.revenue) AS revenue_min
    ,SUM(p.revenue) / COUNT(*) AS rev_p_order
    ,COUNT(p.prodcat1) AS prodcat1_count
    ,COUNT(p.prodcat2) AS prodcat2_count
    ,DATEDIFF(o.dt, MAX(p.orderdate)) AS days_last_order
    ,DATEDIFF(o.dt, MAX(CASE WHEN p.prodcat1 IS NOT NULL THEN p.orderdate ELSE NULL END)) AS days_last_prodcat1
    ,DATEDIFF(o.dt, MAX(CASE WHEN p.prodcat2 IS NOT NULL THEN p.orderdate ELSE NULL END)) AS days_last_prodcat2
    ,SUM(p.prodcat1 = 1) AS prodcat1_1_count
    ,SUM(p.prodcat1 = 2) AS prodcat1_2_count
    ,SUM(p.prodcat1 = 3) AS prodcat1_3_count
    ,SUM(p.prodcat1 = 4) AS prodcat1_4_count
    ,SUM(p.prodcat1 = 5) AS prodcat1_5_count
    ,SUM(p.prodcat1 = 6) AS prodcat1_6_count
    ,SUM(p.prodcat1 = 7) AS prodcat1_7_count
FROM Online AS o
JOIN Purchase AS p
    ON o.custno = p.custno
    AND p.orderdate <= o.dt
GROUP BY o.index;

ALTER TABLE features_group_1
ADD KEY ix_features_group_1_index (`index`);
To generate the feature tables, open a new Terminal, navigate to the folder containing the SQL files, and enter the following commands (supplying your own username and password). The first snippet creates some necessary indices that speed up the join operations. The next four snippets create the four feature tables. Without the indices, the joins take forever; with them, the whole thing takes about 20 minutes (not bad on a local machine).
mysql -uroot -p1234567 < add_index.sql
mysql -uroot -p1234567 < feature_group_1.sql
mysql -uroot -p1234567 < feature_group_2.sql
mysql -uroot -p1234567 < feature_group_3.sql
mysql -uroot -p1234567 < feature_group_4.sql
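The contents of add_index.sql are not reproduced in this post, but as a rough, hypothetical sketch, it would index the join keys used by the feature queries (custno and orderdate), which is what lets MySQL avoid scanning the full Purchase table for every row of Online:

-- hypothetical sketch of add_index.sql (the real snippet is linked above)
USE Shutterfly;
ALTER TABLE Online ADD KEY ix_online_custno (custno);
ALTER TABLE Purchase ADD KEY ix_purchase_custno_orderdate (custno, orderdate);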
Now you should have the following tables in the database. Note that the derived features are stored separately from the original event logs, which helps prevent confusion and disaster.
mysql> show tables;
+----------------------+
| Tables_in_shutterfly |
+----------------------+
| Online               |
| Purchase             |
| features_group_1     |
| features_group_2     |
| features_group_3     |
| features_group_4     |
| trn_set              |
| tst_set              |
+----------------------+
8 rows in set (0.01 sec)
Here I wrote a utility function that pulls data from the MySQL server.
def load_dataset(split="trn_set", limit=None, ignore_categorical=False):
    sql = """
        SELECT o.*, f1.*, f2.*, f3.*, f4.*,
               EXTRACT(MONTH FROM o.dt) AS month
        FROM %s AS t
        JOIN Online AS o
            ON t.index = o.index
        JOIN features_group_1 AS f1
            ON t.index = f1.index
        JOIN features_group_2 AS f2
            ON t.index = f2.index
        JOIN features_group_3 AS f3
            ON t.index = f3.index
        JOIN features_group_4 AS f4
            ON t.index = f4.index
    """ % split
    if limit:
        sql += " LIMIT %i" % limit
    df = pd.read_sql_query(sql.replace("\n", " ").replace("\t", " "), engine)
    df.event1 = df.event1.fillna(0)
    X = df.drop(["index", "event2", "dt", "day", "session", "visitor", "custno"], axis=1)
    Y = df.event2
    return X, Y
Finally, let’s take a look at 5 training examples, and their features.
X_trn, Y_trn = load_dataset("trn_set", limit=5)
print(X_trn.head().T)
0 1 2 3 4
category 1 1 1 1 1
event1 0 0 0 0 0
order_count 1 2 2 1 1
revenue_sum 30.49 191.33 191.33 76.96 66.77
revenue_max 30.49 98.38 98.38 76.96 66.77
revenue_min 30.49 92.95 92.95 76.96 66.77
rev_p_order 30.49 95.665 95.665 76.96 66.77
prodcat1_count 1 2 2 1 1
prodcat2_count 1 2 2 1 1
days_last_order 270 311 311 137 202
days_last_prodcat1 270 311 311 137 202
days_last_prodcat2 270 311 311 137 202
prodcat1_1_count 0 2 2 0 0
prodcat1_2_count 1 0 0 1 0
prodcat1_3_count 0 0 0 0 1
prodcat1_4_count 0 0 0 0 0
prodcat1_5_count 0 0 0 0 0
prodcat1_6_count 0 0 0 0 0
prodcat1_7_count 0 0 0 0 0
category_1_count 4 29 29 2 2
category_2_count 0 12 12 2 0
category_3_count 0 4 4 8 11
event1_1_count 0 6 6 0 0
event1_2_count 0 1 1 1 0
event1_4_count 0 0 0 1 0
event1_5_count 0 0 0 0 0
event1_6_count 0 0 0 0 0
event1_7_count 0 0 0 0 0
event1_8_count 0 1 1 0 0
event1_9_count 0 0 0 0 0
event1_10_count 0 0 0 0 0
event1_11_count 0 0 0 1 0
event2_null_count 4 37 37 9 13
event2_1_count 1 11 11 1 3
event2_2_count 0 1 1 0 0
event2_3_count 1 4 4 0 2
event2_4_count 0 4 4 2 1
event2_5_count 0 2 2 0 1
event2_6_count 0 2 2 0 0
event2_7_count 1 15 15 5 5
event2_8_count 1 6 6 3 1
event2_9_count 0 0 0 0 0
event2_10_count 0 0 0 1 0
last_category 1 1 1 1 1
last_event1 None None None None None
last_event2 8 7 7 8 3
last_revenue 30.49 92.95 98.38 76.96 66.77
last_prodcat1 2 1 1 2 3
last_prodcat2 3 11 89 3 9
month 11 8 8 1 11
Now you have a well-defined dataset and feature set. You can tweak the scale of each feature and how you handle missing values to suit your model's requirements.
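For example, a minimal sketch of scaling the numeric features with scikit-learn (the sentinel fill value here is an arbitrary assumption) might look like this:

# minimal sketch: scale numeric features for a scale-sensitive model (assumes scikit-learn)
from sklearn.preprocessing import StandardScaler

X_num = X_trn.select_dtypes("number").fillna(-1)   # -1 is an arbitrary sentinel for missing values
X_scaled = StandardScaler().fit_transform(X_num)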
For tree-based methods, which are invariant to feature scaling, we can directly apply the model, and simply focus on tuning parameters! See an example of a plain-vanilla gradient boosting machine here.
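The linked example isn't reproduced here, but a minimal sketch of a plain-vanilla GBM on these features (assuming scikit-learn; the hyperparameters are placeholders) might look like this:

# minimal sketch of a plain-vanilla GBM -- not the author's linked example
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

X_trn, Y_trn = load_dataset("trn_set")
X_tst, Y_tst = load_dataset("tst_set")

# keep numeric columns and fill nulls with a sentinel value
# (if event2 contains nulls in your copy of the data, drop those rows first)
X_trn = X_trn.select_dtypes("number").fillna(-1)
X_tst = X_tst.select_dtypes("number").fillna(-1)

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3)
gbm.fit(X_trn, Y_trn)           # can take a while on the full training set
print(gbm.score(X_tst, Y_tst))  # accuracy on the test set

# which features matter most?
print(pd.Series(gbm.feature_importances_, index=X_trn.columns)
        .sort_values(ascending=False).head(10))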
It is nice to see that the useful features are all engineered ones, except for the category feature. Our efforts paid off! Also, the most predictive feature of event2 is how many null values were observed in event2. This is an illustrative case where we cannot replace null values with the median or mean, because the fact that they are missing is itself correlated with the response variable!
As you can see, we have no intermediate CSV files, a very clean namespace in our notebook, and our feature engineering code is reduced to a few straightforward SQL statements. There are situations in which the SQL approach is even more efficient.
One fundamental restriction of this approach is that you must be able to directly connect to your SQL server in Python. If this is not possible, you may have to download the query result as a CSV file and load it in Python.
I hope you find this post helpful. I'm not advocating one method over the other; rather, it is necessary to understand the advantages and limitations of each, and to have both ready in our toolkit, so we can apply whichever works best under the constraints at hand.
Originally published by Shaw Lu at https://towardsdatascience.com/feature-engineering-in-sql-and-python-a-hybrid-approach-b52347cd2de4