Jupyter Notebook: Forget CSV, Fetch Data With Python

If you read a book, article, or blog about machine learning, chances are it will use training data from a CSV file. There’s nothing wrong with CSV, but let’s think about whether it is really practical. Wouldn’t it be better to read data directly from the DB? Often you can’t feed business data directly into ML training because it needs pre-processing: encoding categorical data, calculating new data features, and so on. These data preparation/transformation steps can be done quite easily in SQL while fetching the original business data. Another advantage of reading data directly from the DB is that when the data changes, it is easier to automate the ML model re-training process.
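
For example, features can be derived on the fly while fetching, using the SQL magic set up later in this post. The table and column names below are made up purely for illustration:

%%sql
select CASE WHEN EMPLOYMENT_STATUS = 'EMPLOYED' THEN 1 ELSE 0 END EMPLOYED_FLAG,
       round(LOAN_AMOUNT / NULLIF(ANNUAL_INCOME, 0), 2) DEBT_TO_INCOME,
       AGE
  from LOAN_APPLICATIONS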

In this post, I describe how to call the Oracle DB from Jupyter notebook with Python code.

Step 1

Install cx_Oracle Python module:

python -m pip install cx_Oracle

This module helps to connect to the Oracle DB from Python.
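
As a minimal sanity check (once the Instant Client from Step 2 below is in place), a connection can be opened directly; the host, service name, and credentials here are placeholders:

import cx_Oracle

# user, password, dsn -- all values here are placeholders
connection = cx_Oracle.connect("hr", "hr_password", "dbhost:1521/ORCLPDB1.localdomain")
cursor = connection.cursor()
cursor.execute("select 1 from dual")  # the simplest possible round trip
print(cursor.fetchone())
connection.close()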

Step 2

cx_Oracle lets us execute SQL calls from Python code, but to be able to call a remote DB from a Python script, we need to install and configure Oracle Instant Client on the machine where Python runs.

If you are using Ubuntu, install alien:

sudo apt-get update 
sudo apt-get install alien

Download RPM files for Oracle Instant Client and install with alien:

sudo alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm 
sudo alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm 
sudo alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm

Add environment variables:

export ORACLE_HOME=/usr/lib/oracle/18.3/client64 
export PATH=$PATH:$ORACLE_HOME/bin

Read more here.

Step 3

Install Magic SQL Python modules:

pip install jupyter-sql 
pip install ipython-sql

Installation and configuration complete.

For today’s sample, I’m using the Pima Indians Diabetes Database. The CSV data can be downloaded from here. I uploaded the CSV data into a database table and will fetch it through SQL directly in the Jupyter notebook.

First of all, a connection is established to the DB, and then the SQL query is executed. The query result set is stored in a variable called result. Notice the %%sql marker: this is the SQL magic in action:

import json
import cx_Oracle

%load_ext sql

with open('credentials.json') as f:
    data = json.load(f)
    username = data['username']
    password = data['password']

%sql oracle+cx_oracle://$username:$password@dbhost:1521/?service_name=ORCLPDB1.localdomain

%%sql result <<
select TIMES_PREGNANT "TIMES_PREGNANT", GLUCOSE "GLUCOSE", BLOOD_PRESSURE "BLOOD_PRESSURE",
       SKIN_FOLD_THICK "SKIN_FOLD_THICK", SERUM_INSULIN "SERUM_INSULIN",
       MASS_INDEX "MASS_INDEX", DIABETES_PEDIGREE "DIABETES_PEDIGREE", AGE "AGE",
       CLASS_VAR "CLASS_VAR" from PIMA_INDIANS_DIABETES

Output:

 * oracle+cx_oracle://hr:***@dbhost:1521/?service_name=ORCLPDB1.localdomain
0 rows affected.
Returning data to local variable result

Username and password must be specified when establishing a connection. To avoid sharing the password, make sure to read it from an external source (it could be a simple JSON file, as in this example, or a more advanced encrypted token from a keyring).
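
If you prefer the keyring route, a minimal sketch looks like this (it assumes the keyring package is installed and that a secret was stored beforehand; the service and user names are illustrative):

import keyring

# Retrieve a password previously stored with keyring.set_password(...)
password = keyring.get_password('oracle-db', 'hr')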

The beauty of this approach is that data fetched through the SQL query is available out of the box as a Pandas DataFrame. Machine learning engineers can work with the data exactly as if it had been loaded from CSV:

df = result.DataFrame()
df.head()

Output:

   times_pregnant  glucose  blood_pressure  skin_fold_thick  serum_insulin  mass_index  diabetes_pedigree  age  class_var
0               1      106              70               28            135        34.2              0.142   22          0
1               2      155              52               27            540        38.7              0.240   25          1
2               2      101              58               35             90        21.8              0.155   22          0
3               1      120              80               48            200        38.9              1.162   41          0
4              11      127             106                0              0        39.0              0.190   51          0

print(f'Number of rows in dataset: {df.shape[0]}')
print(df[df.columns[8]].value_counts())

Output:

Number of rows in dataset: 768
0    500
1    268
Name: class_var, dtype: int64

corrs = df.corr()['class_var'].abs()
columns = corrs[corrs > .01].index
corrs = corrs.filter(columns)
corrs

Output:

times_pregnant     0.221898
glucose            0.466581
blood_pressure     0.065068
skin_fold_thick    0.074752
serum_insulin      0.130548
age                0.238356
class_var          1.000000
Name: class_var, dtype: float64

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.

How to write a simple toy database in Python

How to write a simple toy database in Python: MySQL, PostgreSQL, Oracle, Redis, and many more, you name it. Databases are a really important piece of technology in the progress of human civilization. Today we can see how valuable data is, and keeping it safe and stable is where the database comes in!

For quite some time I had been thinking of creating my own toy database, just to understand, play around, and experiment with it. As Richard Feynman said:

“What I cannot create, I do not understand.”
So without further ado, let’s jump into the fun part: coding.

Let’s Start Coding…

For this Toy Database, we’ll use Python (my favorite ❤️). I named this database FooBarDB (I couldn’t find any other name 😉), but you can call it whatever you want!

So first let’s import some necessary Python libraries which are already available in Python Standard Library:

import json
import os

Yes, we only need these two libraries! We need json as our database will be based on JSON, and os for some path-related stuff.

Now let’s define the main class FoobarDB with some pretty basic functions, which I’ll explain below.

class FoobarDB(object):
    def __init__(self , location):
        self.location = os.path.expanduser(location)
        self.load(self.location)

    def load(self , location):
        if os.path.exists(location):
            self._load()
        else:
            self.db = {}
        return True

    def _load(self):
        self.db = json.load(open(self.location , "r"))

    def dumpdb(self):
        try:
            json.dump(self.db , open(self.location, "w+"))
            return True
        except:
            return False

Here we defined our main class with an __init__ function. Whenever we create a FoobarDB database, we only need to pass the location of the database. In __init__ we take the location parameter and expand ~ or ~user to the user’s home directory, so paths work the intended way. Finally, we put it in the self.location variable so the other methods of the class can access it later, and we call the load function, passing self.location as an argument.

. . . .
    def load(self , location):
        if os.path.exists(location):
            self._load()
        else:
            self.db = {}
        return True
. . . .

In the load function we take the location of the database as a parameter, then check whether the database file exists. If it exists, we load it with the _load() function (explained below); otherwise, we create an empty in-memory JSON object. Finally, we return True on success.

. . . . 

    def _load(self):
        self.db = json.load(open(self.location , "r"))
. . . .

In the _load function, we simply open the database file from the location stored in self.location, parse it as a JSON object, and load it into the self.db variable.

 def dumpdb(self):
        try:
            json.dump(self.db , open(self.location, "w+"))
            return True
        except:
            return False

And finally, the dumpdb function: its name says what it does. It takes the in-memory database (actually a JSON object) from the self.db variable and saves it to the database file. It returns True if saved successfully, otherwise it returns False.

Make It a Little More Usable… 😉

Wait a minute! 😐 A database is useless if it can’t store and retrieve data, isn’t it? Let’s go and add them also…😎

def set(self , key , value):
        try:
            self.db[str(key)] = value
            self.dumpdb()
            return True
        except Exception as e:
            print("[X] Error Saving Values to Database : " + str(e))
            return False

    def get(self , key):
        try:
            return self.db[key]
        except KeyError:
            print("No Value Can Be Found for " + str(key))  
            return False

    def delete(self , key):
        if not key in self.db:
            return False
        del self.db[key]
        self.dumpdb()
        return True

The set function adds data to the database. As our database is a simple key-value based database, we only take a key and a value as arguments.

First, we’ll try to add the key and value to the database and then save the database. If everything goes right it will return True. Otherwise, it will print an error message and return False. (We don’t want it to crash and erase our data every time an error occurs 😎).

def get(self, key):
        try:
            return self.db[key]
        except KeyError:
            print("No Value Can Be Found for " + str(key))
            return False

get is a simple function: we take key as an argument and try to return the value linked to that key from the database. If the key does not exist, False is returned along with a message.

def delete(self , key):
        if not key in self.db:
            return False
        del self.db[key]
        self.dumpdb()
        return True

The delete function removes a key, as well as its value, from the database. First, we make sure the key is present in the database; if not, we return False. Otherwise, we delete the key with the built-in del, which automatically deletes its value. Then we save the database and return True.

Now you might think: what if I’ve created a large database and want to reset it? In theory, we could call delete on every key, but that’s not practical, and it’s also very time-consuming! ⏳ So we can create a function to do this task…

    def resetdb(self):
        self.db={}
        self.dumpdb()
        return True

Here’s the function to reset the database: resetdb! It’s so simple: first we re-assign our in-memory database to an empty JSON object, then we just save it! And that’s it. Our database is now clean again.
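
Before wrapping up, here is a quick usage sketch (the file path is arbitrary):

db = FoobarDB("~/test.db")    # loads ~/test.db if it exists, else starts empty

db.set("name", "FooBar")      # store a key-value pair (also saves to disk)
print(db.get("name"))         # -> FooBar

db.delete("name")             # remove the key again
db.resetdb()                  # wipe the whole database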

Finally… 🎉

That’s it, friends! We have created our own toy database! 🎉🎉 Actually, FoobarDB is just a simple demo of a database. It’s like a cheap DIY toy: you can improve it any way you want. You can also add many other functions according to your needs, as sketched below.
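
One example of such an improvement, as a sketch: making saves atomic, so that a crash mid-write can’t corrupt the database file:

import json
import os
import tempfile

def dumpdb(self):
    try:
        # Write to a temporary file first, then atomically swap it into place
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.location) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.db, f)
        os.replace(tmp, self.location)
        return True
    except OSError:
        return False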

Full Source is Here 👉 bauripalash/foobardb

I hope, you enjoyed it! Let me know your suggestions, ideas or mistakes I’ve made in the comments below! 👇

Thank you! See you soon!

Python Connect MySQL Database

Databases are critical for storing and processing data, even when you work with a powerful programming language like Python. Ever wondered where this whole large set of data is stored or fetched from?

In this article, I’ll talk about the same and take you through the following aspects in detail.

  • What is a [database](https://morioh.com/topic/database "database")?
  • What is [MySQLdb](https://morioh.com/topic/mysql "MySQLdb")?
  • How does [Python](https://morioh.com/topic/python "Python") connect to a database?
  • Creating a Database
  • Database Operations - [CRUD](https://morioh.com/p/60b941830c01 "CRUD")

Let’s get started :)

What is a database?

A database is basically a collection of data structured in such a way that it can easily be retrieved, managed, and accessed in various ways. One of the simplest forms of database is a text database. Relational databases are the most popular kind of database system; they include systems such as MySQL, PostgreSQL, and Oracle.

Among all these databases, MySQL is one of the easiest to work with. Let me walk you through this in detail.

What is MySQLdb?

MySQL is an open-source, freely available relational database management system that uses Structured Query Language; MySQLdb refers to the interface that connects Python to it. Now, one of the most important questions here is: “What is SQL?”

SQL (Structured Query Language) is the standard language for relational databases. It allows users to carry out various operations on data: manipulating it, creating and dropping tables, and so on. In a nutshell, SQL allows you to do anything with the data.

Let’s move ahead and dive deep into Python database connection wherein you will learn how to connect with the database.

How does Python connect to a database?

It is very simple to connect Python with a database. The flow works like this: the Python program sends a connection request through the MySQL Connector/Python driver, the database accepts the connection, and a cursor is then used to execute statements and return the result data.
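
In code, the whole round trip boils down to a few lines (a sketch with placeholder credentials; installing the connector is covered next):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", passwd="password123")  # connection request
cur = conn.cursor()                 # cursor used to talk to the server
cur.execute("select version()")     # the statement is executed server-side
print(cur.fetchone())               # the result data comes back
cur.close()
conn.close()                        # always release the connection when done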

Before connecting to the MySQL database, make sure you have the MySQL Installer on your computer. It provides a comprehensive set of tools and installs MySQL with the following components:

  • MySQL Server
  • All available connectors
  • MySQL Workbench
  • MySQL Notifier
  • Tools for Excel and Microsoft Visual Studio
  • MySQL sample databases
  • MySQL documentation

The MySQL Installer itself can be downloaded from the official MySQL website.

Before proceeding, you should also make sure the MySQL connector module for Python is installed. Refer to the commands below for installing it from the command prompt and verifying it in PyCharm:

Using Pip:

Command:

pip install mysql-connector

Using PyCharm

To verify the installation, run a script containing only the import:

import mysql.connector

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

Process finished with exit code 0

Moving on in this article with Python Database Connection, let us see the parameters required to connect to the database:

  • Username - It is simply the username you use to work with the MySQL server; the default username is root.
  • Password - The password is set by the user during installation of the MySQL database. I am giving the password here as ‘password123’.
  • Host Name - This is basically the server name or IP address on which your MySQL server is running. If it is ‘localhost’, then the IP address is 127.0.0.1.

I will show you from a coding perspective how to connect Python with the MySQL database.

Example:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123")  # I have used 'host', 'user', 'passwd'
 
print(mydb)

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

<mysql.connector.connection_cext.CMySQLConnection object at 0x000001606D7BD6A0>

Process finished with exit code 0

Explanation: Here ‘mydb’ is just an instance. From the output, you can clearly see that it has connected to the database.

Next up in Python Database Connection, you will learn how to create a database.

Creating a Database:

Once the database connection is established, you are ready to create your own database which will be acting as a bridge between your python and MySQL server.

Let’s see the implementation part of it.

Example:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123")
mycursor=mydb.cursor()
mycursor.execute("create database harshdb")

Output:

C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

Process finished with exit code 0

Explanation:

  • In the above program, I have made use of a cursor, which is basically an object used to communicate with the entire MySQL server, through which I am able to create my own database.
  • You can see from the output that my database with the name “harshdb” is created. The name is custom, as you can give any name to your database.

If you want to see the databases on your MySQL server, you can run the following piece of code in PyCharm:

Example:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123")
mycursor=mydb.cursor()
mycursor.execute("show databases")
 
for db in mycursor:
    print(db)

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

('harshdb',)

('information_schema',)

('mysql',)

('performance_schema',)

('sakila',)

('sys',)

('world',)

Process finished with exit code 0

Explanation:

  • By implementing the above code, I have shown all the databases existing on the MySQL server.

Now that you have created your database, let’s dive deep into one of the most important aspects of Python Database Connection by doing few operations in it. Let us understand this in detail.

Database Operations [CRUD]:

There are numerous operations a programmer can perform using databases and SQL in order to have sound knowledge of database programming and MySQL.

I have demonstrated the CRUD operations below:

  • Create - This SQL statement is used to create a record in a table or, one could say, to create the table itself.
  • Read - It is used for fetching useful information from the database.
  • Update - This SQL statement is used for updating the records in the table.
  • Delete - As the name itself suggests, this command is used for deleting records from a table.

Let us look at each aspect in detail from the coding perspective.

Create Operation:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
 
mycursor=mydb.cursor()
 
mycursor.execute("create table employee(name varchar(250),sal int(20))")

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

Process finished with exit code 0

Explanation:

  • In the above program, I have created the table ‘employee’.
  • The table employee has two fields, ‘name’ and ‘sal’.
  • Here, the user id “root” and the password “password123” are used for accessing harshdb.

In order to see the table which I have created, refer to the following code in Python:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
mycursor.execute("show tables")
 
for tb in mycursor:
    print(tb)

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

('employee',)

Process finished with exit code 0


Now that you have seen how a table is created, let us look at how a user can fetch values from it.

Read Operation:

This particular operation happens in various stages. The first stage is to populate the table.

Code:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
 
sqlformula = "Insert into employee(name,sal) values(%s,%s)"  # the values clause uses %s placeholders
 
employees = [("harshit",200000),("rahul", 30000),("avinash", 40000),("amit", 50000),]  # an array of employees
 
 
mycursor.executemany(sqlformula, employees)  # passing the data
 
mydb.commit()  # saving the changes

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

Process finished with exit code 0

In the above code, I have populated the table using an array of employees, writing the SQL statements in Python.

Note that ‘harshit’ ends up appearing twice in the records; the insert was executed more than once.

Stage 2: In this stage, we will make use of the “select” SQL statement, where the actual read operation takes place:

  • fetchall() - This function fetches all the rows from the result of the last executed statement.
  • fetchone() - This function fetches a single row from the result of the last executed statement.

Code:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
 
mycursor.execute("select * from employee")
 
myresult = mycursor.fetchall()
 
for row in myresult:
    print(row)

Output:

('harshit', 200000)

('harshit', 200000)

('rahul', 30000)

('avinash', 40000)

('amit', 50000)

Process finished with exit code 0

Explanation: In the above code, we have made use of the function fetchall(). It fetches all the rows returned by the last executed statement.

Code:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
 
mycursor.execute("select name from employee")//selecting the field i want data to be fetched from
 
myresult = mycursor.fetchone()
 
for row in myresult:
    print(row)

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

harshit

Process finished with exit code 0

Explanation: In the above code, I have made use of the function fetchone(), which basically fetches a single row from the result of the last executed statement.

That was all about the read operation; let’s dive deep into the update operation.

Update Operation:

This SQL statement is used for updating the records in the table. Let’s implement the code and see how the changes are taking place.

Code:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
 
sql = "Update employee SET sal = 70000 WHERE name = 'harshit'"
 
mycursor.execute(sql)
 
mydb.commit()

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

Process finished with exit code 0

Explanation: We have updated the ‘sal’ column of the record ‘harshit’ in the above code.

If you select the rows again, you can clearly see that ‘sal’ for the record ‘harshit’ has been updated to 70000.

This was all about the update operation. Moving on with the “Python Connect MySQL Database” article, we will see the last operation, which is delete.

Delete Operation:

As the name itself suggests, the delete operation is used for deleting records from the table. Let’s understand it from a coding perspective.

Code:

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
 
sql = "DELETE FROM employee  WHERE name = 'harshit'"
 
mycursor.execute(sql)
 
mydb.commit()

Output:

C:\Users\Harshit_Kant\PycharmProjects\test1\venv\Scripts\python.exe C:/Users/Harshit_Kant/PycharmProjects/test1/venv/python-db-conn.py

Process finished with exit code 0

Explanation: In the above code, I have deleted the records named ‘harshit’, as that name was repeated twice.

If you query the table again, you can clearly see that the records named ‘harshit’ have been deleted. You can also perform other manipulations besides deleting whole records, such as clearing just the salary. I had defined only two fields, so the operations I could demonstrate are limited, but you can create more fields under the same table ‘employee’, or in any other table you create, as sketched below.
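
For example, a new column can be added to the existing table like this (a sketch; the ‘city’ column is purely hypothetical):

import mysql.connector
 
mydb=mysql.connector.connect(host="localhost",user="root",passwd="password123",database="harshdb")
mycursor=mydb.cursor()
 
# Add a hypothetical 'city' field to the existing employee table
mycursor.execute("alter table employee add column city varchar(100)")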

This brings us to the end of our article on “Python Connect MySQL Database”. I hope you are clear on all the concepts related to databases, MySQL, and database operations in Python. Make sure you practice as much as possible, and share your experience.

Python, Oracle ADWC and Machine Learning

Python, Oracle ADWC and Machine Learning - How to use Open Source tools to analyze data managed through Oracle Autonomous Data Warehouse Cloud (ADWC).

Introduction

Oracle Autonomous Database is the latest, modern evolution of Oracle Database technology: a technology that makes managing and analyzing large volumes of data in the Cloud easier, faster, and more powerful.

ADWC is the specialization of this technology for Data Warehousing and advanced data analysis. It simplifies uploading and transforming data, and makes advanced analytical tools accessible to business users and non-DBAs: the tools that are part of a Data Scientist’s baggage, so to speak.

In this article, however, I do not want just to examine in depth the tools available in ADWC; I want to take a slightly different point of view.

I will try to explain it with an example: imagine being a Data Scientist who knows, above all, the tools for Machine Learning and Data Exploration coming from the Open Source world. To give concrete examples, tools such as:

  • Python
  • Jupyter Notebook
  • The NumPy, Pandas, and scikit-learn modules
  • Optionally, Keras (if we want to use Deep Learning models)

But the data that needs to be analyzed is stored in an ADWC. And moreover, our “hero” has some knowledge of SQL (well, maybe the opposite would be strange) and understands that some processing is better done in SQL.

How can our “explorer” combine the best of both worlds?

Top down.

I do not want to turn this article into a sequence of commands to be executed without a clear understanding. So, even though I want to provide all the details to simplify the life of those who want to follow my tracks, I will proceed top-down: first explaining things at a higher level of abstraction, without too much detail, and adding the implementation details later.

The Cloud.

Like it or not, the Cloud is here to stay. ADWC, for now, is an Oracle Public Cloud service.

The idea that came to me, to avoid transferring too much data between my MacBook and the Cloud data center, is to place all the analysis tools on a VM in Oracle Cloud Infrastructure (OCI). For my purposes, a VM with Ubuntu Linux and eight OCPUs (cores) is more than enough, to be turned on only when needed.
On this VM I installed:

  • Anaconda Python distribution

  • Python modules: Pandas, cx_Oracle, scikit-learn

  • server for configuring Notebook Extension (nbextensions)

  • Oracle client libraries (instant client 18.3)

  • The libaio1 system library

I have also created a dedicated virtual Python environment in Anaconda (to be activated with the command “source activate <env-name>”).

With a little experience and patience, and some googling, the environment is up and running in two hours.

ADWC is created from the Oracle Cloud Infrastructure console. You just need to specify the database name, the disk space in TB, and the number of OCPUs. In about ten minutes it is up and running; not even enough time for a long coffee break.

How to connect to Oracle ADWC

Nothing really complex. In the end, ADWC is still an Oracle database, and the connection from a client is done in a manner well known to anyone who has worked with Oracle. The only really relevant detail is that the connection must be protected and encrypted.

To connect any client we must therefore:

  • download from the console of our ADWC the wallet, which contains the details identifying the database you are connecting to and the certificates required to validate the server and encrypt the traffic

  • place the wallet in a directory on our VM and unpack it (it is provided as a zip file)

  • have valid credentials (user and password) for a DB user with privileges to read the data

Well, maybe at this stage our Data Scientist will be helped by a colleague with a little more Oracle experience, or, with a little patience, can read the documentation.
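
For orientation, a minimal sketch of what the client side ends up looking like (the wallet path is illustrative; ‘db4ml_medium’ is one of the connect aliases defined in the wallet’s tnsnames.ora):

import os
import cx_Oracle

# Point the Oracle client at the directory where the wallet zip was unpacked
os.environ['TNS_ADMIN'] = '/home/ubuntu/wallet_db4ml'

# Encrypted connection using a connect alias from the wallet
connection = cx_Oracle.connect('ml_user', 'secret', 'db4ml_medium')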

How to interact with the database

Real fun begins here!

The tool that will be used is Jupyter Notebook, in which I have enabled the “ExecuteTime” extension, to record the execution times of each instruction.

In the Notebook we can interact with the DB (execute queries, etc) in two ways:

  • using Python instructions

  • using SQL statements directly

and the beauty is that we can combine the two modes according to our preferences, convenience, and the inspiration of the moment. Today SQL bores me to death and I prefer the Pythonic way of life? All right! No, this is too slow or too complex in Python? I’ll do it in SQL.

Let’s try to be more concrete. Let’s start with Python (today…):

import cx_Oracle
import pandas as pd
import config as cfg

connection = cx_Oracle.connect(cfg.USER, cfg.PWD, 'db4ml_medium')
data = pd.read_sql("select * from my_credit_scoring", con=connection)

Let’s leave the details aside for now and instead admire the simplicity: with just two lines of code we load all the data contained in the MY_CREDIT_SCORING table into a Pandas DataFrame.

At this point our beloved Data Scientist cannot hold back his happiness. He has already guessed how he can finish the story: if he wants, he can proceed in the way he has been taught by dozens of books with titles like “Machine Learning and Python”, and must only (Mmm… only?) do some pre-processing, data cleaning, scaling, model building, optimization of hyper-parameters, model training, fighting with overfitting, etc., etc. But he does not need any other Oracle know-how.

Now, one detail we do give (otherwise, what is the ExecuteTime extension for?): loading 100,000 records (each with 86 columns) from the ADWC database takes only a little over 6 seconds, thanks also to the high-speed, low-latency network that is part of Oracle Cloud Infrastructure.

The other way to interact with the DB uses the Notebook’s SQL magic extension:

%load_ext sql
%sql $cfg.CONN_STRING
%%sql 
select count(*) from my_credit_scoring

Another detail here: the second instruction makes the connection to the DB, and I have stored the connection string (CONN_STRING) in a config.py file, not wanting to show the username and password explicitly in the Notebook.
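
A minimal config.py might look like this (the values are, of course, placeholders):

# config.py -- kept out of the Notebook (and out of version control)
USER = 'ml_user'
PWD = 'secret'
CONN_STRING = 'oracle+cx_oracle://ml_user:secret@db4ml_medium'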

At this point, however, I want to add another detail (in the spirit of “combining the best of both worlds”): our Data Scientist has read-only access to the original data, through a table called CREDIT_SCORING_100K in the Oracle ADMIN schema. But since he assumes (you never know) that he will want to do some processing of this data (cleaning, etc.) in the DB, he creates a “private” copy of the table in his own schema.

This operation, done in SQL, is carried out with a single instruction!

%%sql 
create table my_credit_scoring as select * from admin.credit_scoring_100k

It takes only 7 seconds. Here we see the simplicity and power that come with ADWC. (Maybe I should have chosen a table of 10 million records, it would have been more impressive, but I didn’t have one at my fingertips, sorry.) Oracle ADWC is based on Oracle Exadata technology, and such an instruction fully exploits Exadata’s storage technology and SQL offloading.

A legitimate curiosity: what if I want to combine %sql and Python?

Ok, here’s an example:

%%sql result <<
select credit_score_bin, count(*) num from my_credit_scoring group by credit_score_bin

data_bin = result.DataFrame()

And the result is again stored in a Pandas DataFrame.

But what is the model for?

Right… beyond the technology, we must not lose sight of the business purpose. In the example chosen to illustrate the approach, the data is used to create a “credit scoring” model: a classification model that can be used by a bank or a finance company to decide whether a person is a good debtor, and therefore whether to accept his mortgage and/or loan application.

A model of this type, with such data, is implemented through a “Supervised Learning” approach and with a binary classifier (the CREDIT_SCORE_BIN column contains only two distinct values): we train the model, then we give it a series of data relating to a credit applicant, and the model says: “Good credit, GO ON” or “Other credit, NO GO”.

The complexity.

Here our ADWC has nothing to do with it, or can only help. Data in the real world is “dirty”, has “holes”, and must be cleaned up and completed. The table in my example is realistic from this point of view.

There are thousands of missing values (out of 100,000 samples), and many of scikit-learn’s ML algorithms do not work well under such conditions.

There are many features (85), and many are categorical, so they must be encoded (translated into numbers). The numerical features need to be brought onto the same scale.
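
As a sketch of this pre-processing (the label values and the imputation strategy below are assumptions, not the real schema):

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 'data' is the DataFrame loaded from MY_CREDIT_SCORING above
y = (data['CREDIT_SCORE_BIN'] == 'Good Credit').astype(int)  # assumed label values
X = data.drop(columns=['CREDIT_SCORE_BIN'])

# Encode the categorical features as numbers
X = pd.get_dummies(X, columns=X.select_dtypes(include='object').columns.tolist())

# Fill the missing values (median imputation is one reasonable choice)
X = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(X), columns=X.columns)

# Bring all the features onto the same scale, then split 80:20
X_std = StandardScaler().fit_transform(X)
X_train_std, X_test_std, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=0)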

Once the data is loaded into the Pandas DataFrame, we can use the “Python only” approach, and the processing times are acceptable.

Here a minimum of wisdom suggests not trying immediately a very complex model with all the features, but starting with a few tens of them, to establish a benchmark against which to compare more complex and articulated models.

A first benchmark is obvious: the accuracy of the model must be higher than 50%, otherwise we should change jobs (well, an official who decides whether to grant a loan by tossing a coin is not really the best, but in the absence of other tools…).

Models are like children: they learn what we teach them.

Oh yes, there is always the problem of “algorithmic bias”. We must be sure, as far as possible, not to bias the predictions by feeding in “only” certain types of data.

For example, if only a negligible fraction of the samples indicates “Good Credit”, our algorithm-to-be-instructed will learn that it must always answer “Other Credit”.

In SQL the check is easy:

%%sql 
select credit_score_bin, count(*) from my_credit_scoring group by credit_score_bin


The distribution turns out to be acceptable; obviously, those who grant loans tend to be “prudent”, after too many non-performing loans in the past.

Machine Learning Models

Even if we leave aside the topic “how to interact with Oracle ADWC” here, some will be curious to read how to build a binary classifier with scikit-learn.

I want to be clear: a few lines of Python code implementing “a model” are not a serious answer to the question. Models have hyper-parameters that need to be optimized, and this requires work (grid search) which is the really heavy part from a computational point of view.

But we can satisfy the curiosity (of course, I tried to quickly put together a model) without dedicating too much time to hyper-parameter optimization, with the warning that, without this activity, we have only taken a first step towards the solution.

Another question: which algorithm to use?

Answer: “There is no free lunch” (Wikipedia).

That said, I chose to use a Support Vector Machine (SVM). It is a type of algorithm that generally gives good results on problems with many features.

What is an SVM? More or less: in every classifier, we try to find a separation surface between the two sets. If we can find a hyperplane, the problem is said to be “linearly separable”.

The support vectors in an SVM are the data points that lie closest to the decision surface.

SVM maximizes the margin. In addition, using a non-linear kernel (like rbf) we can project the data points into a higher-dimensional space where the two sets (Good Credit, Other Credit) become linearly separable (this is more or less the idea).

The code in Python, using scikit-learn, is:

from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1.0, gamma=0.1, random_state=0)
svm.fit(X_train_std, y_train)

print('Train accuracy:', svm.score(X_train_std, y_train))
# Train accuracy: 0.8978987
print('Test accuracy:', svm.score(X_test_std, y_test))
# Test accuracy: 0.88145

Some notes:

  • I chose to use a non-linear kernel (rbf), assuming that the problem is not linearly separable; the choice is confirmed by the fact that with a linear kernel the accuracy is about three points lower;

  • RBF stands for Radial Basis Function;

  • The two values of the hyper-parameters (C, gamma) are just a starting point; optimization applies here;

  • The model has a slight overfitting (about 90% vs. 88%).

What accuracy do we get? A good 88%. Not bad considering that it is a first, quick step. It is another benchmark against which to compare successive, more complex models.

If we want to improve the model, the first serious step is hyper-parameter optimization. It can be done, using all the processing power of the VM (hence the 8 cores, not even that many), with the GridSearchCV class.

An example of the code:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search over the SVC hyper-parameters (5-fold CV, all cores with n_jobs=-1)
gsc = GridSearchCV(
        estimator=SVC(kernel='rbf'),
        param_grid={
            'C': [0.1, 1, 100, 1000],
            'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
        },
        cv=5, scoring='accuracy', verbose=0, n_jobs=-1)

Some reflections

The data is in ADWC, protected, and we have not downloaded any of it onto our VM. All good from a security point of view.

Given the premise (a Data Scientist who knows the Open Source stack well), I achieved the purpose mainly using scikit-learn and Pandas, but also some of the tools offered by ADWC where the advantages were obvious (e.g. the copy of the Oracle table).

What we have done can be fairly easily understood even by those who do not have much experience with Oracle. I hope.

Does the model development use ADWC? Well, up to a point. I’ve done a lot on the VM using Python and the Notebook, less on ADWC. But this was foreseen (again, see the premise).

Processing times?

Here the discussion risks getting long. Training the SVM, even with 80,000 samples in the training set (an 80:20 split), takes only about 10 minutes. If we look at CPU usage, we see that in practice only one thread works, so even with many cores available we would not be able to use them. But, as said before, the computationally complex part is hyper-parameter optimization, an operation that requires training the model N times (once for each choice of hyper-parameters). This part can easily be parallelized, using all the available cores. It is here that having many cores can make the difference (and the Cloud helps).

By choice, we did not use the algorithm implementations contained in ADWC as part of the Oracle Advanced Analytics option. Using that implementation, we could exploit the parallel engine of ADWC.

Python? Well, the good news is that at Oracle Open World (October 2018) Oracle announced the availability of a Python interface to all the Oracle Advanced Analytics algorithms. This interface, named OML4Py, will soon be available. In a future article I will examine it.

Some details

The best place to find information on how to install cx_Oracle, with all the details, is the official cx_Oracle documentation.

As always, there is never enough time to do exactly what I want, and the article has become longer than I wanted.

Therefore I have decided to dedicate another article to all the details. See you soon.

=======================================

Thanks for reading ❤️ If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter