Introduction

There is nothing more powerful than learning something new, or taking a skill to the next level, by simply doing. In this article, I’ll use SQL (Postgres) to conduct my exploratory analysis and to create a transformed feature training set for my machine learning models.

Although Python and R are the current de facto languages for data science and machine learning, a solid foundation in SQL will, at times, help you iterate faster and let you quickly explore your data. At the end of the day, most of the data we work with lives in a relational database. Being able to write SQL queries fluently lets you tackle data with ease and quickly get started on your analysis.

Goal

Although Python would be my preferred approach, I wanted to see if I could do all my exploratory analysis and feature engineering in SQL. My goal is to show how powerful and complementary SQL can be to one’s workflow. Although one obvious limitation of SQL is its inability to visualize data natively, you can still get big gains by quickly writing queries and outputting results. Let’s get started!

My Setup

  • macOS Mojave Version 10.14.6
  • Postgres (PostgreSQL) 12.2
  • pgAdmin 4.21 (web-based SQL query tool)
  • psql (Terminal access to databases and tables)

In my follow-up article, I complete my supervised classification model with Python and share my highest score achieved on Kaggle’s Public Leaderboard. Check it out here!

Kaggle Titanic Competition: Model Building & Tuning in Python

Best Fitting Model, Feature & Permutation Importance, and Hyperparameter Tuning

towardsdatascience.com

Exploration

The sinking of the Titanic was a tragedy in which many lives were lost. In the context of this Kaggle competition, a bit of historical knowledge provides an important piece of information that will help create new features for predicting who lived and who died: the notion that women and children were to be saved first. This notion will play a big role in how I group and analyze the Kaggle dataset.

To get started, I downloaded the train.csv and test.csv files from Kaggle and imported the files to two tables I created in the Postgres database. Next, I combined the two tables to create my first working table (titanic_train_test_raw).
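As a sketch of that import-and-combine step, the Postgres statements below use the well-known Kaggle Titanic columns (test.csv has no Survived column, so those rows get NULL); the table names match my working table, but the file paths and `\copy` commands are assumptions you should adjust to your own setup.

```sql
-- Staging tables matching Kaggle's train.csv and test.csv schemas.
CREATE TABLE titanic_train (
    passengerid INT, survived INT, pclass INT, name TEXT, sex TEXT,
    age NUMERIC, sibsp INT, parch INT, ticket TEXT, fare NUMERIC,
    cabin TEXT, embarked TEXT
);

CREATE TABLE titanic_test (
    passengerid INT, pclass INT, name TEXT, sex TEXT,
    age NUMERIC, sibsp INT, parch INT, ticket TEXT, fare NUMERIC,
    cabin TEXT, embarked TEXT
);

-- Load the downloaded CSVs from psql (\copy reads from the client side;
-- the paths below are placeholders).
-- \copy titanic_train FROM 'train.csv' WITH (FORMAT csv, HEADER true);
-- \copy titanic_test  FROM 'test.csv'  WITH (FORMAT csv, HEADER true);

-- Combine both sets into one working table; test rows carry NULL survived.
CREATE TABLE titanic_train_test_raw AS
SELECT passengerid, survived, pclass, name, sex, age,
       sibsp, parch, ticket, fare, cabin, embarked
FROM titanic_train
UNION ALL
SELECT passengerid, NULL AS survived, pclass, name, sex, age,
       sibsp, parch, ticket, fare, cabin, embarked
FROM titanic_test;
```

Using `UNION ALL` (rather than `UNION`) keeps every row without a deduplication pass, and the NULL Survived values make it easy to split the combined table back into train and test sets later.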


Kaggle Titanic Competition in SQL