Python, Oracle ADWC and Machine Learning - How to use Open Source tools to analyze data managed through Oracle Autonomous Data Warehouse Cloud (ADWC).

Introduction

Oracle Autonomous Database is the latest evolution of Oracle Database technology: a technology that makes managing and analyzing large volumes of data in the Cloud easier, faster and more powerful.

ADWC is the specialization of this technology for Data Warehousing and Advanced Data Analysis. It is a technology that simplifies loading and transforming data and makes advanced analytical tools accessible to Business Users and non-DBAs: the tools that are part of a Data Scientist's toolkit, so to speak.

In this article, however, I do not want merely to examine in depth the tools available in ADWC; I want to take a slightly different point of view.

I will try to explain it with an example: imagine being a Data Scientist who knows, above all, the Machine Learning and Data Exploration tools coming from the Open Source world. To give concrete examples, tools such as:

  • Python
  • Jupyter Notebook
  • The NumPy, Pandas and Scikit-learn modules
  • possibly Keras (if we want to use Deep Learning models)

But the data that needs to be analyzed is stored in ADWC. What's more, our “hero” has some knowledge of SQL (well, maybe the opposite would be strange) and understands that some processing is better done in SQL.

How can our “explorer” combine the best of both worlds?

Top down.

I do not want to turn this article into a sequence of commands to be executed without a clear understanding. So, even though I want to provide all the details to simplify the life of those who want to follow my tracks, I will proceed top-down: first I will explain things at a higher level of abstraction, without too much detail, and I promise to add the implementation details later.

The Cloud.

Like it or not, the Cloud is here to stay. ADWC, for now, is an Oracle Public Cloud service.

The idea that came to me, to avoid transferring too much data between my MacBook and the Cloud data center, is to place all the analysis tools on a VM in Oracle Cloud Infrastructure (OCI). For my purposes a VM with Ubuntu Linux and eight OCPUs (cores) is more than enough, to be turned on only when needed.
On this VM I installed:

  • Anaconda Python distribution

  • Python modules: Pandas, cx_Oracle, scikit-learn

  • the Jupyter Notebook extensions configurator (nbextensions)

  • Oracle client libraries (instant client 18.3)

  • The libaio1 system library

I have also created in Anaconda a dedicated virtual Python environment (to be activated with the command “source activate <env-name>”).

With a little experience and patience, and some “googling”, the environment is up & running in a couple of hours.
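
As a quick sanity check that the Python environment actually sees the Oracle client libraries, a couple of lines in the Notebook are enough (a minimal sketch; it only prints the Instant Client version):

import cx_Oracle

# Prints the version of the Oracle Instant Client that cx_Oracle has loaded (e.g. 18.3)
print(cx_Oracle.clientversion())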

ADWC is created from the Oracle Cloud Infrastructure console. You just need to specify the database name, the disk space in TB and the number of OCPUs. In about ten minutes it is up & running; not even the time for a long coffee break.

How to connect to Oracle ADWC

Nothing really complex. In the end, ADWC is always an Oracle database, and the connection from a client is done in a way well known to those who have worked with Oracle. The only really relevant detail is that the connection must be protected and encrypted.

To connect any client we must therefore:

  • download from the console of our ADWC the wallet, which contains the details identifying the database we are connecting to and the certificates required to validate the server and to encrypt the traffic

  • place the wallet in a directory on our VM and unpack it (it is provided as a zip file)

  • have valid credentials (user and password) of a DB user with privileges to read data

Well, maybe at this stage our Data Scientist will be helped by a colleague with a little more “Oracle experience”, or, with a little patience, he can read the documentation.
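
In practice, once the wallet has been unpacked, the Python side boils down to something like this (a sketch, not the exact setup: the wallet directory, user and password are placeholders, it assumes sqlnet.ora inside the wallet has been edited so that WALLET_LOCATION points to that same directory, and db4ml_medium is one of the service names defined in the wallet's tnsnames.ora):

import os
import cx_Oracle

# Assumption: the wallet zip has been extracted into this directory
os.environ['TNS_ADMIN'] = '/home/ubuntu/wallet'

# db4ml_medium is one of the pre-defined ADWC services listed in tnsnames.ora
connection = cx_Oracle.connect('MY_USER', 'MY_PASSWORD', 'db4ml_medium')
print(connection.version)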

How to interact with the database

Real fun begins here!

The tool that will be used is Jupyter Notebook, in which I have enabled the “ExecuteTime” extension, to record the execution times of each instruction.

In the Notebook we can interact with the DB (execute queries, etc) in two ways:

  • using Python instructions

  • using SQL statements directly

and the beauty is that we can combine the two modes according to our preferences, convenience and the inspiration of the moment. Today SQL bores me to death, I prefer the “Pythonic way of life”? All right! No, this is too slow or too complex in Python? I'll do it in SQL.

Let’s try to be more concrete. Let’s start with Python (today…):
import cx_Oracle
import pandas as pd
import config as cfg
connection = cx_Oracle.connect(cfg.USER, cfg.PWD, 'db4ml_medium')
data = pd.read_sql("select * from my_credit_scoring", con=connection)

Let’s leave the details aside for now. We should instead admire the simplicity: with just two lines of code we load all the data contained in the MY_CREDIT_SCORING table into a Pandas DataFrame.

At this point our beloved Data Scientist cannot contain his happiness. He has already guessed how he can finish the story: if he wants, he can proceed in the way he has been taught by dozens of books with titles such as “Machine Learning and Python”, and must only (Mmm… only?) do some pre-processing, data cleaning, scaling, model building, optimization of hyper-parameters, model training, fight with overfitting, etc. But he does not need any other Oracle know-how.

Now one detail (otherwise what is the purpose of the ExecuteTime extension?): loading 100,000 records (each with 86 columns) from the ADWC database takes only a little over 6 seconds, also thanks to the high-speed, low-latency network that is part of Oracle Cloud Infrastructure.
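
If the table were much bigger, the same read could be done in chunks to keep memory under control; a hedged variant (chunksize is a standard pandas option, and the value below is arbitrary):

# Read the table in chunks of 20,000 rows and concatenate them into a single DataFrame
chunks = pd.read_sql("select * from my_credit_scoring", con=connection, chunksize=20000)
data = pd.concat(chunks, ignore_index=True)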

The other way to interact with the DB uses the Notebook's “sql magic” extension:

%load_ext sql
%sql $cfg.CONN_STRING
%%sql 
select count(*) from my_credit_scoring

Here another detail: the second instruction makes the connection to the DB, and I have stored the connection string (CONN_STRING) in a config.py file, not wanting to show the username and password explicitly in the Notebook.
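
For completeness, the config.py file is nothing more than a handful of constants; a hypothetical example (names and values are obviously placeholders, and the connection string format assumes the SQLAlchemy cx_Oracle dialect used by the sql magic):

# config.py - kept out of the Notebook so that credentials are never shown in clear
USER = 'MY_USER'
PWD = 'MY_PASSWORD'

# SQLAlchemy-style connection string used by the %sql magic; db4ml_medium is the
# service name resolved through the wallet's tnsnames.ora
CONN_STRING = 'oracle+cx_oracle://MY_USER:MY_PASSWORD@db4ml_medium'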

At this point, however, I want to add another detail (of the series: “we want to combine the best of both worlds”): our Data Scientist has read-only access to the original data, through a table called CREDIT_SCORING_100K in the Oracle ADMIN schema. But since he assumes (you never know) that he will want to do some processing of this data (cleaning, etc.) in the DB, he creates a “private” copy of the table in his own schema.

This operation, done in SQL, is carried out with a single instruction!

%%sql 
create table my_credit_scoring as select * from admin.credit_scoring_100k

It takes only 7 seconds. Here we see the simplicity and power that come with ADWC. (Maybe I should have chosen a table of 10 million records, it would have been more impressive, but I didn't have one at my fingertips, sorry.) Oracle ADWC is based on Oracle Exadata technology, and such an instruction fully exploits Exadata's storage technology and SQL offloading.

A legitimate curiosity: what if I want to combine “%sql” and Python?

Ok, here’s an example:

%%sql result <<
select credit_score_bin, count(*) num from my_credit_scoring group by credit_score_bin

and then, in the next cell:

data_bin = result.DataFrame()

And the result is again stored in a Pandas DataFrame.

But what is the model for?

Right… beyond the technology, we must not lose sight of the business purpose. In the example chosen to illustrate the approach, the data is used to create a “credit scoring” model: a classification model that a bank or a finance company can use to decide whether a person is a good debtor and, therefore, whether to accept his mortgage and/or loan application.

A model of this type, with such data, is implemented through a “Supervised Learning” approach, with a binary classifier (the CREDIT_SCORE_BIN column contains only two distinct values): we train the model, then we give it the data of a credit applicant and the model answers: “Good credit, GO ON” or “Other credit, NO GO”.

The complexity.

Here ADWC has nothing to do with it; at most, it can help. Data in the real world is “dirty”, has “holes” and must be cleaned up and completed. The table in my example is realistic from this point of view.

There are thousands of missing values (out of 100,000 samples), and many of the scikit-learn ML algorithms do not work well under such conditions.

There are many features (85), and many of them are categorical, so they must be encoded (translated into numbers). The numerical features all need to be brought to the same scale.

Once the data is loaded into the Pandas DataFrame, we can use the “Python only” approach, and processing times are acceptable.
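
A minimal pre-processing sketch, just to fix ideas (it assumes the target column is CREDIT_SCORE_BIN, treats every non-numeric column as categorical, and uses simple median/placeholder imputation; the real pipeline may well be more refined):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Target (two distinct values) and features
y = data['CREDIT_SCORE_BIN']
X = data.drop(columns=['CREDIT_SCORE_BIN'])

# Fill the "holes": median for numeric columns, a placeholder for categorical ones
num_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(exclude='number').columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
X[cat_cols] = X[cat_cols].fillna('MISSING')

# Encode categorical features as numbers (one-hot encoding)
X = pd.get_dummies(X, columns=list(cat_cols))

# 80:20 train/test split, then bring all features to the same scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)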

Here a minimum of wisdom suggests not trying immediately a very complex model with all the features, but starting with a few tens of them, to establish a benchmark against which to compare more complex and articulated models.

A first benchmark is obvious: the accuracy of the model must be higher than 50%, otherwise we should change jobs (well, an official who decides whether to grant a loan by tossing a coin is not really the best, but in the absence of other tools…).

Models are like children, they learn what we teach.

Oh yes, there is always the problem of “algorithmic bias”. We must be sure, as far as possible, not to condition the predictions by feeding the model “only” certain types of data.

For example, if only a negligible fraction of samples is labeled “Good Credit”, our “algorithm-to-instruct” will learn that it must always answer “Other Credit”.

In SQL the check is easy:

%%sql 
select credit_score_bin, count(*) from my_credit_scoring group by credit_score_bin

The distribution can be acceptable: obviously those who grant loans tend to be “prudent”, having seen too many non-performing loans in the past.
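
The same check, for those who prefer to stay on the Python side, is a one-liner on the DataFrame loaded earlier (assuming the column comes back from Oracle with its upper-case name):

# Relative frequency of the two classes (Good Credit vs Other Credit)
print(data['CREDIT_SCORE_BIN'].value_counts(normalize=True))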

Machine Learning Models

Even though we are leaving the topic “how to interact with Oracle ADWC” here, some will be curious to see how to build a binary classifier with scikit-learn.

I want to be clear: a few lines of Python code implementing “a model” are not the serious answer to the question. Models have “hyper-parameters” that need to be optimized, and this requires work (grid search) which is the really heavy part from a computational point of view.

But we can satisfy the curiosity (of course I tried to quickly put together a model) without dedicating too much time to hyper-parameter optimization, with the warning that, without this activity, we have only taken a first step towards the solution.

Another question: which algorithm to use?

Answer: “There is no free lunch” (Wikipedia).

That said, I chose to use a Support Vector Machine (SVM). It is a type of algorithm that generally gives good results in problems with many features.

What is an SVM? More or less, in every classifier we try to find a separation surface between the two sets. If we can find a hyperplane, the problem is said to be “linearly separable”.

The Support Vectors in an SVM are the data points that lie closest to the decision surface.

SVM maximizes the margin. In addition, using a “non-linear” kernel (like rbf) we can project the data points into a higher-dimensional space where the two sets (Good Credit, Other Credit) become linearly separable (this is, more or less, the idea).

The code in Python, using scikit-learn, is:

from sklearn.svm import SVC
svm = SVC(kernel = 'rbf', C = 1.0, gamma=0.1, random_state = 0)
svm.fit(X_train_std, y_train)
print('Train accuracy:', svm.score(X_train_std, y_train))
Train accuracy: 0.8978987
print('Test accuracy:', svm.score(X_test_std, y_test))
Test accuracy: 0.88145

Some notes:

  • I chose to use a non-linear kernel (rbf), assuming that the problem is not linearly separable; the choice is confirmed by the fact that with a linear kernel the accuracy is about three points lower;

  • RBF stands for Radial Basis Function

  • The two values of the hyper-parameters (C, gamma) are set as a starting point; optimizations apply here;

  • The model has a slight overfitting (90 vs 88%).

What accuracy do we get? A good 88%. Not bad, considering that it is a first, quick step; another benchmark against which to compare successive, more complex models.

If we want to improve the model, the first serious step is hyper-parameter optimization. It can be done, using all the processing power of the VM (this is why 8 cores, not even that many), with the GridSearchCV class.

An example of the code:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

gsc = GridSearchCV(
        estimator=SVC(kernel='rbf', random_state=0),
        param_grid={
            'C': [0.1, 1, 100, 1000],
            'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
        },
        cv=5, scoring='accuracy', verbose=0, n_jobs=-1)
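
The grid search is then launched like any other estimator; a sketch of the usage (this is the step where the fit can take a long time and all the cores get used, thanks to n_jobs=-1):

# Runs the 5-fold cross-validated grid search in parallel on all available cores
gsc.fit(X_train_std, y_train)
print('Best hyper-parameters:', gsc.best_params_)
print('Best CV accuracy:', gsc.best_score_)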

Some reflections

Data is in ADWC, protected, and we have not downloaded any data onto our VM. All good from a security point of view.

Given the premise (a Data Scientist who knows the Open Source world well), I achieved the purpose using mainly scikit-learn and pandas, but also some of the tools offered by ADWC where the advantages were obvious (e.g. the copy of the Oracle table).

What we have done can be fairly easily understood even by those who do not have much experience on Oracle. I hope.

Does the model development use ADWC? Well, up to a point. I did a lot on the VM using Python and the Notebook, less in ADWC. But this was foreseen (again, see the premise).

Processing times?

Here the discussion risks getting long. Training the SVM, even with 80,000 samples in the training set (80:20 split), takes only about 10 minutes. If we look at CPU usage, we see that in practice only one thread works, so even with so many cores available we cannot use them. But, as said before, the computationally heavy part is the optimization of the hyper-parameters, an operation that requires training the model N times (once for each combination of hyper-parameters). This part can easily be parallelized, using all available cores. It is here that having so many cores can make the difference (and the Cloud helps).

By choice, we did not use the algorithm implementations contained in ADWC as part of the Oracle Advanced Analytics option. Using them, we could exploit the parallel engine of ADWC.

Python? Well. The good news is that at Oracle Open World (October 2018) Oracle announced the availability of a Python interface to all the Oracle Advanced Analytics algorithms. This interface, named OML4Py, will soon be available. I will examine it in a future article.

Some details

The best place to find information on how to install cx_Oracle is the official cx_Oracle installation documentation.

All the details.

As always, there is never enough time to do exactly what I want, and the article has become longer than I wanted.

Therefore I have decided to dedicate another article to all the details. See you soon.
