Missing values are a common problem in machine learning. Now that machine learning can be done directly in the database, it’s natural to ask how to handle data preparation with SQL alone, without other programming languages such as Python and R. Today we’ll see just how easy it is.

We’ll use Oracle Cloud for this article, as it’s free and can be used through SQL Developer Web, without any downloads or installations on your machine. If you decide to follow along, create a free OLTP database, and go to Service Console > Development > SQL Developer Web.

As for the dataset, we’ll use the well-known Titanic dataset for two reasons:

  • It’s simple and easy to understand
  • It contains enough missing values for us to play with

Once you have the dataset downloaded, you can use the _Upload Data_ functionality of SQL Developer Web to create the table and upload the data.

Change data types using your best judgment and you’re ready to roll!


Preparation and exploration

I don’t want to mess anything up with the source table, called titanic, so let’s make a copy of it:

CREATE TABLE cp_titanic AS 
SELECT * FROM titanic;
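
If you’d like a quick sanity check on the copy, here is a minimal sketch that compares the row counts of the two tables. It assumes both titanic and cp_titanic live in your current schema; the column aliases source_rows and copy_rows are just illustrative names.

```sql
-- Compare row counts of the source table and its copy.
-- Both numbers should match if the copy succeeded.
SELECT
    (SELECT COUNT(*) FROM titanic)    AS source_rows,
    (SELECT COUNT(*) FROM cp_titanic) AS copy_rows
FROM dual;
```

Keep in mind that CREATE TABLE ... AS SELECT copies the columns and the data, but not indexes or primary key constraints. That’s fine for a scratch table like this one.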

Let’s run a quick SELECT to verify everything is as it should be:

SELECT * FROM cp_titanic;
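
Since the goal is to deal with missing values, it also helps to know how many there are. The query below is a rough sketch of one way to count them per column. It assumes your uploaded table has columns named age, cabin, and embarked (the usual Titanic columns with gaps); adjust the names to whatever you chose during the upload.

```sql
-- COUNT(*) counts all rows, while COUNT(column) skips NULLs,
-- so the difference is the number of missing values in that column.
-- Column names (age, cabin, embarked) are assumptions about your table.
SELECT
    COUNT(*)                   AS total_rows,
    COUNT(*) - COUNT(age)      AS missing_age,
    COUNT(*) - COUNT(cabin)    AS missing_cabin,
    COUNT(*) - COUNT(embarked) AS missing_embarked
FROM cp_titanic;
```

Oracle treats empty strings in VARCHAR2 columns as NULL, so blank text cells from the CSV should be counted as missing here too.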
