Jupyter Notebook: Forget CSV, Fetch Data With Python

If you read a book, article, or blog about machine learning, chances are it will use training data from a CSV file. There’s nothing wrong with CSV, but let’s think about if it is really practical. Wouldn’t it be better to read data directly from the DB? Often, you can’t feed business data directly into ML training because it needs pre-processing — changing categorial data, calculating new data features, etc. Data preparation/transformation steps can be done quite easily with SQL while fetching original business data. Another advantage of reading data directly from DB is when data changes, it is easier to automate the ML model re-train process.

In this post, I describe how to call the Oracle DB from Jupyter notebook with Python code.

Step 1

Install cx_Oracle Python module:

python -m pip install cx_Oracle

This module helps to connect to the Oracle DB from Python.

Step 2

cx_Oracle enables us to execute SQL call from Python code, but to be able to call remote DB from Python script, we need to install and configure Oracle Instant Client on the machine where Python runs.

If you are using Ubuntu, install alien:

sudo apt-get update 
sudo apt-get install alien

Download RPM files for Oracle Instant Client and install with alien:

alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm

Add environment variables:

export ORACLE_HOME=/usr/lib/oracle/18.3/client64 
export PATH=$PATH:$ORACLE_HOME/bin

Step 3

Install Magic SQL Python modules:

pip install jupyter-sql 
pip install ipython-sql

Installation and configuration complete.

For today’s sample, I’m using the Pima Indians Diabetes Database. CSV data can be downloaded from here. I uploaded CSV data into the database table and will be fetching it through SQL directly in Jupyter notebook.

First of all, the connection is established to the DB and then SQL query is executed. The query result set is stored in a variable called result. Do you see %%sql — this magic SQL:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import cx_Oracle\n",
    "\n",
    "%load_ext sql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Connected: hr@'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('credentials.json') as f:\n",
    "    data = json.load(f)\n",
    "    username = data['username']\n",
    "    password = data['password']\n",
    "\n",
    "%sql oracle+cx_oracle://$username:$password@dbhost:1521/?service_name=ORCLPDB1.localdomain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * oracle+cx_oracle://hr:***@dbhost:1521/?service_name=ORCLPDB1.localdomain\n",
      "0 rows affected.\n",
      "Returning data to local variable result\n"
     ]
    }
   ],
   "source": [
    "%%sql result <<\n",
    "select TIMES_PREGNANT \"TIMES_PREGNANT\", GLUCOSE \"GLUCOSE\", BLOOD_PRESSURE \"BLOOD_PRESSURE\",\n",
    "       SKIN_FOLD_THICK \"SKIN_FOLD_THICK\", SERUM_INSULIN \"SERUM_INSULING\",\n",
    "       MASS_INDEX \"MASS_INDEX\", DIABETES_PEDIGREE \"DIABETES_PEDIGREE\", AGE \"AGE\",\n",
    "       CLASS_VAR \"CLASS_VAR\" from PIMA_INDIANS_DIABETES"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

Username and password must be specified while establishing a connection. To avoid sharing a password, make sure to read password value from the external source (it could be simple JSON file as in this example or a more advanced encoded token from keyring).

The beauty of this approach is that data fetched through SQL query is out-of-the-box available in Data Frame. Machine Learning engineers can work with the data in the same way as it would be loaded through CSV:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>times_pregnant</th>\n",
       "      <th>glucose</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>skin_fold_thick</th>\n",
       "      <th>serum_insuling</th>\n",
       "      <th>mass_index</th>\n",
       "      <th>diabetes_pedigree</th>\n",
       "      <th>age</th>\n",
       "      <th>class_var</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>106</td>\n",
       "      <td>70</td>\n",
       "      <td>28</td>\n",
       "      <td>135</td>\n",
       "      <td>34.2</td>\n",
       "      <td>0.142</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>155</td>\n",
       "      <td>52</td>\n",
       "      <td>27</td>\n",
       "      <td>540</td>\n",
       "      <td>38.7</td>\n",
       "      <td>0.24</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>101</td>\n",
       "      <td>58</td>\n",
       "      <td>35</td>\n",
       "      <td>90</td>\n",
       "      <td>21.8</td>\n",
       "      <td>0.155</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>120</td>\n",
       "      <td>80</td>\n",
       "      <td>48</td>\n",
       "      <td>200</td>\n",
       "      <td>38.9</td>\n",
       "      <td>1.162</td>\n",
       "      <td>41</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11</td>\n",
       "      <td>127</td>\n",
       "      <td>106</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39</td>\n",
       "      <td>0.19</td>\n",
       "      <td>51</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   times_pregnant  glucose  blood_pressure  skin_fold_thick  serum_insuling  \\\n",
       "0               1      106              70               28             135   \n",
       "1               2      155              52               27             540   \n",
       "2               2      101              58               35              90   \n",
       "3               1      120              80               48             200   \n",
       "4              11      127             106                0               0   \n",
       "\n",
       "  mass_index diabetes_pedigree  age  class_var  \n",
       "0       34.2             0.142   22          0  \n",
       "1       38.7              0.24   25          1  \n",
       "2       21.8             0.155   22          0  \n",
       "3       38.9             1.162   41          0  \n",
       "4         39              0.19   51          0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = result.DataFrame()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of rows in dataset: {df.shape[0]}\n",
      "0    500\n",
      "1    268\n",
      "Name: class_var, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print('Number of rows in dataset: {df.shape[0]}')\n",
    "print(df[df.columns[8]].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "times_pregnant     0.221898\n",
       "glucose            0.466581\n",
       "blood_pressure     0.065068\n",
       "skin_fold_thick    0.074752\n",
       "serum_insuling     0.130548\n",
       "age                0.238356\n",
       "class_var          1.000000\n",
       "Name: class_var, dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corrs = df.corr()['class_var'].abs()\n",
    "columns = corrs[corrs > .01].index\n",
    "corrs = corrs.filter(columns)\n",
    "corrs"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.

#python #database #oracle