what is the difference between dbms_output, output variable, output script and output result in oracle

I'm a bit confused. What is the difference between dbms_output, an output variable, output script and output result in Oracle?


I want to write a stored procedure in Oracle that will be executed from my TypeScript code, but I'm a bit confused about where and how to return the result. Is it through an output variable, or using dbms_output only? Please help.

Spring Boot + JPA + Hibernate + Oracle


In this tutorial, we will learn how to create a Spring Boot application that communicates with an Oracle data source through Hibernate.


Prerequisites
  • Eclipse IDE (neon release)
  • Maven 4
  • Java 1.8
Create a Maven Project

Open Eclipse, then create a new Maven project and name it SpringBootHibernate.

At the end of this tutorial, we’ll get the following project structure:

pom.xml

Configure Spring Boot inside your pom.xml by adding the following parent dependency:

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.5.10.RELEASE</version>
</parent>

Then add a spring-boot-starter dependency in order to run the application as a standalone JAR application:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
</dependency>

Now, in order to make use of Spring Data JPA and Hibernate, we just need to add spring-boot-starter-data-jpa as a dependency:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>

As soon as we include Spring Boot Starter JPA in our project, we transitively get Spring Data JPA, Spring ORM, and Hibernate (the default JPA implementation), among other dependencies.

This is the whole pom.xml for reference:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.programmer.gate</groupId>
  <artifactId>SpringBootHibernate</artifactId>
  <packaging>jar</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <name>SpringBootHibernate</name>

  <properties>
       <maven.compiler.source>1.8</maven.compiler.source>
       <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.10.RELEASE</version>
    </parent>

  <dependencies>

      <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>

  </dependencies>

  <build>
      <plugins>
          <plugin>
              <groupId>org.springframework.boot</groupId>
              <artifactId>spring-boot-maven-plugin</artifactId>
          </plugin>
      </plugins>
  </build>

</project>

Add the Oracle Driver to the Classpath

In this tutorial, we’re going to override the default in-memory database provided by Spring Boot and use our own Oracle database.

For this purpose, we add “oracle-ojdbc6-11.2.0.3.jar” under WEB-INF/lib and define it in our classpath.

application.properties

Configure the Oracle data source and Hibernate in application.properties:

# create and drop tables and sequences, loads import.sql
spring.jpa.hibernate.ddl-auto=create-drop

# Oracle settings
spring.datasource.url=jdbc:oracle:thin:@localhost:1522:orcl
spring.datasource.username=HIBERNATE_TEST
spring.datasource.password=HIBERNATE_TEST
spring.datasource.driver-class-name=oracle.jdbc.driver.OracleDriver

# logging
logging.pattern.console=%d{yyyy-MM-dd HH:mm:ss} %-5level %logger{36} - %msg%n
logging.level.org.hibernate.SQL=debug

Entities

Our entities represent a player and a team with a one-to-many relationship. Each team could have many players, whereas a player could only play with a single team at a time.

So we create our entities under the com.programmer.gate.model package:

Player.java:

package com.programmer.gate.model;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.SequenceGenerator;

@Entity
public class Player {

    @Id
    @GeneratedValue(strategy=GenerationType.SEQUENCE, generator = "player_Sequence")
    @SequenceGenerator(name = "player_Sequence", sequenceName = "PLAYER_SEQ")
    private Long id;

    @Column(name = "name")
    private String name;

    @Column(name = "num")
    private int num;

    @Column(name = "position")
    private String position;

    @ManyToOne(fetch = FetchType.LAZY)
        @JoinColumn(name = "team_id", nullable = false)
        private Team team;

    public Player() {
    }
        // getters/setters
}

Team.java:

package com.programmer.gate.model;

import java.util.List;

import javax.persistence.CascadeType;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.OneToMany;
import javax.persistence.SequenceGenerator;

@Entity
public class Team {

    @Id
    @GeneratedValue(strategy=GenerationType.SEQUENCE, generator = "team_Sequence")
    @SequenceGenerator(name = "team_Sequence", sequenceName = "TEAM_SEQ")
    private Long id;

    @Column(name = "name")
    private String name;

    @OneToMany(cascade = CascadeType.ALL,
            fetch = FetchType.EAGER,
            mappedBy = "team")
    private List<Player> players;

    public Team() {
    }

        // getters/setters
}

Since we set spring.jpa.hibernate.ddl-auto=create-drop inside application.properties, our application will automatically create Player and Team entities in our database, along with their sequences and constraints.

Our application would also look for import.sql in the classpath and execute it, if found.

In our example, we define import.sql under src/main/resources in order to fill our tables with static data:

insert into Team (id,name) values(1,'Barcelona');

insert into Player (id, team_id, name, num, position) values(1,1,'Lionel Messi', 10, 'Forward');
insert into Player (id, team_id, name, num, position) values(2,1,'Andreas Inniesta', 8, 'Midfielder');
insert into Player (id, team_id, name, num, position) values(3,1,'Pique', 3, 'Defender');

Repositories

We define our repositories’ interfaces under com.programmer.gate.repository. Each repository extends Spring CrudRepository, which provides a default implementation for the basic find, save, and delete methods, so we don’t need to define implementation classes for them.

PlayerRepository:

package com.programmer.gate.repository;

import java.util.List;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

import com.programmer.gate.model.Player;

@Repository
public interface PlayerRepository extends CrudRepository<Player, Long> {

    List<Player> findByTeamId(long teamId);
}

TeamRepository:

package com.programmer.gate.repository;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

import com.programmer.gate.model.Team;

@Repository
public interface TeamRepository extends CrudRepository<Team, Long> {

    Team findByPlayers(long playerId);
}

Service

Now we define our service class, which holds the business logic of our application. Our service exposes two methods: getAllTeamPlayers() and addBarcelonaPlayer() (just rename it to your favorite club if you don’t like Barcelona!). Our service layer communicates directly with the repository layer.

SoccerService.java:

package com.programmer.gate.service;

import java.util.List;

public interface SoccerService {
    public List<String> getAllTeamPlayers(long teamId);
    public void addBarcelonaPlayer(String name, String position, int number);
}

SoccerServiceImpl:

package com.programmer.gate.service;

import java.util.ArrayList;
import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.programmer.gate.model.Player;
import com.programmer.gate.model.Team;
import com.programmer.gate.repository.PlayerRepository;
import com.programmer.gate.repository.TeamRepository;

@Service
public class SoccerServiceImpl implements SoccerService {

    @Autowired
    private PlayerRepository playerRepository;
    @Autowired
    private TeamRepository teamRepository;

    public List<String> getAllTeamPlayers(long teamId) {
        List<String> result = new ArrayList<String>();
        List<Player> players = playerRepository.findByTeamId(teamId);
        for (Player player : players) {
            result.add(player.getName());
        }

        return result;
    }

    public void addBarcelonaPlayer(String name, String position, int number) {

        Team barcelona = teamRepository.findOne(1L);

        Player newPlayer = new Player();
        newPlayer.setName(name);
        newPlayer.setPosition(position);
        newPlayer.setNum(number);
        newPlayer.setTeam(barcelona);
        playerRepository.save(newPlayer);
    }
}

Application.java

The final step is to create the Spring Boot initializer. This is the entry point of our application. We define Application.java under com.programmer.gate:

package com.programmer.gate;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import com.programmer.gate.service.SoccerService;

@SpringBootApplication
public class Application implements CommandLineRunner{

    @Autowired
    SoccerService soccerService;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... arg0) throws Exception {

        soccerService.addBarcelonaPlayer("Xavi Hernandez", "Midfielder", 6);

        List<String> players = soccerService.getAllTeamPlayers(1);
        for(String player : players)
        {
            System.out.println("Introducing Barca player => " + player);
        }
    }
}

P.S.: It’s worth mentioning that Spring Boot automatically scans for and registers entities, repositories, and services defined in the same package as, or in a sub-package of, the initializer class. So if we define Application.java under a different package, we need to explicitly specify the packages of the model, repository, and service.

Output:

When running the application as a standard Java app, we get the following output in the console:

2018-04-13 14:54:47 DEBUG org.hibernate.SQL - create sequence player_seq start with 1 increment by 1
2018-04-13 14:54:47 DEBUG org.hibernate.SQL - create sequence team_seq start with 1 increment by 1
2018-04-13 14:54:47 DEBUG org.hibernate.SQL - create table player (id number(19,0) not null, name varchar2(255 char), num number(10,0), position varchar2(255 char), team_id number(19,0) not null, primary key (id))
2018-04-13 14:54:47 DEBUG org.hibernate.SQL - create table team (id number(19,0) not null, name varchar2(255 char), primary key (id))
2018-04-13 14:54:47 DEBUG org.hibernate.SQL - alter table player add constraint FKdvd6ljes11r44igawmpm1mc5s foreign key (team_id) references team
2018-04-13 14:54:47 INFO  o.h.tool.hbm2ddl.SchemaExport - HHH000476: Executing import script '/import.sql'
2018-04-13 14:54:47 INFO  o.h.tool.hbm2ddl.SchemaExport - HHH000230: Schema export complete
2018-04-13 14:54:47 INFO  o.s.o.j.LocalContainerEntityManagerFactoryBean - Initialized JPA EntityManagerFactory for persistence unit 'default'
2018-04-13 14:54:48 INFO  o.s.j.e.a.AnnotationMBeanExporter - Registering beans for JMX exposure on startup
2018-04-13 14:54:48 DEBUG org.hibernate.SQL - select team0_.id as id1_1_0_, team0_.name as name2_1_0_, players1_.team_id as team_id5_0_1_, players1_.id as id1_0_1_, players1_.id as id1_0_2_, players1_.name as name2_0_2_, players1_.num as num3_0_2_, players1_.position as position4_0_2_, players1_.team_id as team_id5_0_2_ from team team0_, player players1_ where team0_.id=players1_.team_id(+) and team0_.id=?
2018-04-13 14:54:48 DEBUG org.hibernate.SQL - select player_seq.nextval from dual
2018-04-13 14:54:48 DEBUG org.hibernate.SQL - insert into player (name, num, position, team_id, id) values (?, ?, ?, ?, ?)
2018-04-13 14:54:48 INFO  o.h.h.i.QueryTranslatorFactoryInitiator - HHH000397: Using ASTQueryTranslatorFactory
2018-04-13 14:54:48 DEBUG org.hibernate.SQL - select player0_.id as id1_0_, player0_.name as name2_0_, player0_.num as num3_0_, player0_.position as position4_0_, player0_.team_id as team_id5_0_ from player player0_, team team1_ where player0_.team_id=team1_.id(+) and team1_.id=?

Introducing Barca player => Lionel Messi
Introducing Barca player => Andreas Inniesta
Introducing Barca player => Pique
Introducing Barca player => Xavi Hernandez

2018-04-13 14:54:49 INFO  com.programmer.gate.Application - Started Application in 4.213 seconds (JVM running for 4.555)

Source code

You can download the source code from this repository: spring-boot-jpa-hibernate.

Python, Oracle ADWC and Machine Learning


Python, Oracle ADWC and Machine Learning - How to use Open Source tools to analyze data managed through Oracle Autonomous Data Warehouse Cloud (ADWC).


Introduction

Oracle Autonomous Database is the latest, modern evolution of Oracle Database technology: a technology that makes managing and analyzing large volumes of data in the Cloud easier, faster and more powerful.

ADWC is the specialization of this technology for Data Warehousing and Advanced Data Analysis. It simplifies uploading and transforming data and makes advanced analytical tools accessible to Business Users and non-DBAs: the tools that are part of a Data Scientist's baggage, so to speak.

In this article, however, I do not just want to examine the tools available in ADWC in depth; I want to take a slightly different point of view.

I will try to explain it with an example: imagine being a Data Scientist who knows above all the tools for Machine Learning and Data Exploration coming from the Open Source world. To give concrete examples, tools such as:

  • Python
  • Jupyter Notebook
  • The NumPy, Pandas and Scikit-learn modules
  • possibly Keras (if we want to use Deep Learning models)

But the data that needs to be analyzed is stored in ADWC. What's more, our “hero” has some knowledge of SQL (well, maybe the opposite would be strange) and understands that some processing is best done in SQL.

How can our “explorer” combine the best of both worlds?

Top down.

I do not want to turn this article into a sequence of commands executed without a clear understanding. So, even if I want to provide all the details to simplify the life of those who want to follow my tracks, I will proceed top-down: I will first explain things at a higher level of abstraction, without too much detail, and I promise to add the implementation details later.

The Cloud.

Like it or not, the Cloud is here to stay. ADWC, for now, is an Oracle Public Cloud service.

The idea that came to me, to avoid transferring too much data between my MacBook and the Cloud data center, is to place all the analysis tools on a VM in Oracle Cloud Infrastructure (OCI). For my purposes, a VM with Ubuntu Linux and eight OCPUs (cores), turned on only when needed, is more than enough.
On this VM I installed:

  • Anaconda Python distribution

  • Python modules: Pandas, cx_Oracle, scikit-learn

  • server for configuring Notebook Extension (nbextensions)

  • Oracle client libraries (instant client 18.3)

  • The libaio1 system library

I have also created a dedicated virtual Python environment in Anaconda (to be activated with the command “source activate <env-name>”).

With a little experience and patience, and some “googling”, the environment is up and running in two hours.

ADWC is created from the Oracle Cloud Infrastructure console. You just need to specify the database name, the disk space in TB and the number of OCPUs. In about ten minutes it is up and running; not even enough time for a long coffee break.

How to connect to Oracle ADWC

Nothing really complex. In the end, ADWC is still an Oracle database, and the connection from a client is made in a way well known to anyone who has worked with Oracle. The only really relevant detail is that the connection must be protected and encrypted.

To connect any client we must therefore:

  • download from our ADWC console the wallet that contains the details identifying the database we are connecting to, plus the certificates required to validate the server and to encrypt the traffic

  • place the wallet in a directory on our VM and unpack it (it is provided as a zip file)

  • have valid credentials (user and password) for a DB user with privileges to read the data

Well, maybe at this stage our Data Scientist will be helped by a colleague with a little more “Oracle-experience”, or with a little patience can read the documentation.
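The steps above can be sketched in Python. The wallet directory, service name and credentials below are illustrative assumptions (the actual service names are listed in the wallet's tnsnames.ora); the connect call itself is left commented out, since it needs the Oracle client libraries and a reachable ADWC instance:

```python
import os

# Hypothetical locations/names -- adjust to your own wallet and ADWC service.
WALLET_DIR = "/home/ubuntu/wallet"   # directory where the wallet zip was unpacked
SERVICE = "db4ml_medium"             # a service name listed in tnsnames.ora

def prepare_connection(user, password):
    """Point the Oracle client at the wallet, then return connect arguments."""
    # sqlnet.ora and tnsnames.ora from the wallet are located via TNS_ADMIN.
    os.environ["TNS_ADMIN"] = WALLET_DIR
    return (user, password, SERVICE)

args = prepare_connection("ml_user", "a_valid_password")
# With the client libraries installed, the actual connection would be:
# import cx_Oracle
# connection = cx_Oracle.connect(*args)
```

From here, cx_Oracle.connect(*args) opens the encrypted connection described above.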

How to interact with the database

Real fun begins here!

The tool that will be used is Jupyter Notebook, in which I have enabled the “ExecuteTime” extension, to record the execution times of each instruction.

In the Notebook we can interact with the DB (execute queries, etc) in two ways:

  • using Python instructions

  • using SQL statements directly

and the beauty is that we can combine the two modes according to our preferences, convenience and the inspiration of the moment. Today SQL bores me to death and I prefer the “Pythonic way of life”? All right! No, this is too slow or too complex in Python? I'll do it in SQL.

Let’s try to be more concrete. Let’s start with Python (today…):

import cx_Oracle
import pandas as pd
import config as cfg

connection = cx_Oracle.connect(cfg.USER, cfg.PWD, 'db4ml_medium')
data = pd.read_sql("select * from my_credit_scoring", con=connection)

Let’s leave the details aside for now. We should instead admire the simplicity: with just two lines of code we load all the data contained in the MY_CREDIT_SCORING table into a Pandas DataFrame.

At this point our beloved Data Scientist cannot hold back his happiness. He has already guessed how he can finish the story: if he wants, he can proceed in the way taught by dozens of books with titles such as “Machine Learning and Python”, and must only (Mmm… only?) do some pre-processing, data cleaning, scaling, model building, optimization of hyper-parameters, model training, fighting with overfitting, etc. But he does not need any other Oracle know-how.

Now for one detail (otherwise what is the purpose of the ExecuteTime extension?): loading 100,000 records (each with 86 columns) from the ADWC database takes only a little over 6 seconds, thanks also to the high-speed, low-latency network that is part of Oracle Cloud Infrastructure.

The other way to interact with the DB uses the “magic extension sql” of the Notebook:

%load_ext sql
%sql $cfg.CONN_STRING

%%sql
select count(*) from my_credit_scoring

Another detail here: the second instruction makes the connection to the DB. I stored the connection string (CONN_STRING) in a config.py file, not wanting to show the username and password explicitly in the Notebook.
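As a sketch, such a config.py could look like this. The variable names USER, PWD and CONN_STRING match the snippets above; the values are of course placeholders:

```python
# config.py -- keep this file out of version control (e.g. via .gitignore)
USER = "my_user"          # placeholder credentials
PWD = "my_password"

# SQLAlchemy-style connection string consumed by the %sql magic.
CONN_STRING = "oracle+cx_oracle://" + USER + ":" + PWD + "@db4ml_medium"
```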

At this point, however, I want to add another detail (of the series: “we want to combine the best of both worlds”): our Data Scientist has read-only access to the original data, through a table called CREDIT_SCORING_100K in the Oracle ADMIN schema. But since he assumes (you never know) that he will want to do some processing of this data (cleaning, etc.) in the DB, he creates a “private” copy of the table in his own schema.

This operation, done in SQL, is carried out with a single instruction!

%%sql 
create table my_credit_scoring as select * from admin.credit_scoring_100k

It takes only 7 seconds. Here we see the simplicity and power of ADWC. (Maybe I should have chosen a table of 10 million records, which would have had more effect, but I didn't have one at my fingertips, sorry.) Oracle ADWC is based on Oracle Exadata technology, and such a statement fully exploits Exadata's storage technology and SQL offloading.

A legitimate curiosity: what if I want to combine “%sql” and Python?

Ok, here’s an example:

%%sql result <<
select credit_score_bin, count(*) num from my_credit_scoring group by credit_score_bin

data_bin = result.DataFrame()

And the result is again stored in a Pandas DataFrame.

But what is the model for?

Right… beyond the technology, we must not lose sight of the business purpose. In the example chosen to illustrate the approach, the data is used to create a “credit scoring” model: a classification model that a bank or finance company can use to decide whether a person is a good debtor, and therefore whether to accept his mortgage and/or loan application.

A model of this type, with such data, is implemented through a “Supervised Learning” approach, with a binary classifier (the CREDIT_SCORE_BIN column contains only two distinct values): we train the model, then we feed it data about a credit applicant, and the model answers: “Good Credit, GO ON” or “Other Credit, NO GO”.

The complexity.

Here ADWC can only help up to a point. Data in the real world is “dirty”: it has “holes” and must be cleaned up and completed. The table in my example is realistic from this point of view.

There are thousands of missing values (out of 100,000 samples), and many of scikit-learn's ML algorithms do not work well under such conditions.

There are many features (85), and many are categorical, so they must be encoded (translated into numbers). The numerical features also need to be brought to the same scale.

Once the data is loaded into the Pandas DataFrame we can use the “Python Only” approach and processing times are acceptable.
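Those pre-processing steps can be sketched in pandas. The column names below are invented for illustration; the real table has 85 features and 100,000 rows:

```python
import pandas as pd

# Toy stand-in for the MY_CREDIT_SCORING DataFrame loaded above.
df = pd.DataFrame({
    "age": [25, 40, None, 31],
    "occupation": ["clerk", "engineer", "clerk", None],
})

# 1. Fill the "holes": median for numeric columns, a sentinel for categoricals.
df["age"] = df["age"].fillna(df["age"].median())
df["occupation"] = df["occupation"].fillna("unknown")

# 2. Encode categorical features as numbers (one-hot encoding).
df = pd.get_dummies(df, columns=["occupation"])

# 3. Bring numeric features to the same scale (standardization).
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()
```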

Here a minimum of wisdom suggests not trying a very complex model with all the features right away, but starting with a few dozen, to establish a benchmark against which to compare more complex and articulated models.

A first benchmark is obvious: the accuracy of the model must be higher than 50%, otherwise we should change jobs (well, an official who decides whether to grant a loan by tossing a coin is not really the best, but in the absence of other tools…).

Models are like children: they learn what we teach them.

Oh yes, there is always the problem of “algorithmic bias”. We must be sure, as far as possible, not to condition the predictions by feeding in “only” certain types of data.

For example, if only a negligible fraction of samples indicates “Good Credit”, our algorithm-in-training will learn that it must always answer “Other Credit”.

In SQL the check is easy:

%%sql 
select credit_score_bin, count(*) from my_credit_scoring group by credit_score_bin
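To see what is at stake: a degenerate classifier that always predicts the majority class already scores the majority share as its accuracy, so the real bar is higher than 50%. A quick sketch with invented counts (the actual split comes from the query above):

```python
# Hypothetical counts per CREDIT_SCORE_BIN; the real numbers come from the
# "group by credit_score_bin" query against MY_CREDIT_SCORING.
counts = {"Good Credit": 26933, "Other Credit": 73067}

total = sum(counts.values())
# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(counts.values()) / total
print(f"majority-class baseline: {majority_baseline:.5f}")  # prints 0.73067
```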


This looks plausible: obviously, those who grant loans tend to be “prudent”, after too many “sufferings” (non-performing loans) in the past.

Machine Learning Models

Even if we leave the topic of “how to interact with Oracle ADWC” here, some readers will be curious to see how to build a binary classifier with scikit-learn.

I want to be clear: a few lines of Python code implementing “a model” are not a serious answer to the problem. Models have “hyper-parameters” that need to be optimized, and this requires work (grid search) which is the really heavy part from a computational point of view.

But we can satisfy the curiosity (of course I tried to quickly put together a model) without dedicating too much time to hyper-parameter optimization, with the warning that, without this activity, we have only taken a first step towards the solution.

Another question: which algorithm to use?

Answer: “There is no free lunch” (Wikipedia).

That said, I chose to use a Support Vector Machine (SVM): a type of algorithm that generally gives good results on problems with many features.

What is SVM? In every classifier, more or less, we try to find a separation surface between the two sets. If we can find a hyperplane, the problem is said to be “linearly separable”.

The support vectors in SVM are the data points that lie closest to the decision surface.

SVM maximizes the margin. In addition, using a “non-linear” kernel (like rbf) we can project the data points into a higher-dimensional space where the two sets (Good Credit, Other Credit) become linearly separable (this is more or less the idea).

The code in Python, using scikit-learn, is:

from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1.0, gamma=0.1, random_state=0)
svm.fit(X_train_std, y_train)

print('Train accuracy:', svm.score(X_train_std, y_train))
Train accuracy: 0.8978987
print('Test accuracy:', svm.score(X_test_std, y_test))
Test accuracy: 0.88145

Some notes:

  • I chose to use a non-linear kernel (rbf), assuming that the problem is not linearly separable; the choice is confirmed by the fact that with a linear kernel the accuracy is about three points lower;

  • RBF stands for Radial Basis Function

  • The two values of the hyper-parameters (C, gamma) are set as a starting point; optimization applies here;

  • The model has a slight overfitting (90 vs 88%).

What accuracy do we get? A good 88%. Not bad, considering that it is a first, quick step, and another benchmark against which to compare successive, more complex models.

If we want to improve the model, the first serious step is hyper-parameter optimization. It can be done, using all the processing power of the VM (hence the 8 cores, which is not even many), with the GridSearchCV class.

An example of the code:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

gsc = GridSearchCV(
        estimator=SVC(kernel='rbf'),
        param_grid={
            'C': [0.1, 1, 100, 1000],
            'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
        },
        cv=5, scoring='accuracy', verbose=0, n_jobs=-1)

Some reflections

The data is in ADWC, protected, and we have not downloaded any data onto our VM's disk. All right from a security point of view.

Given the premise (a Data Scientist who knows the Open Source world well), I achieved the purpose using mainly scikit-learn and Pandas, but also some of the tools offered by ADWC where the advantages were obvious (e.g. the copy of the Oracle table).

What we have done can be fairly easily understood even by those who do not have much experience on Oracle. I hope.

Does the model development use ADWC? Well, up to a point. I've done a lot on the VM using Python and the Notebook, and less in ADWC. But this was foreseen (again, see the premise).

Processing times?

Here the discussion risks running long. Training the SVM, even with 80,000 samples in the training set (80:20 split), takes only about 10 minutes. If we look at CPU usage, we see that in practice only one thread is working, so even with many cores available we cannot exploit them. But, as said before, the computationally complex part is hyper-parameter optimization, an operation that requires training the model N times (once for each choice of hyper-parameters). This part can easily be parallelized, using all available cores. It is here that having many cores can make the difference (and the Cloud helps).

By choice, we did not use the algorithm implementation contained in ADWC, as part of the Oracle Advanced Analytics Option. Using this implementation we could exploit the parallel engine of ADWC.

Python? Well, the good news is that at Oracle Open World (October 2018) Oracle announced a Python interface to all the Oracle Advanced Analytics algorithms. This interface, named OML4Py, will soon be available. I will examine it in a future article.

Some details

The best place to find information on how to install cx_Oracle is the official cx_Oracle installation documentation.

All the details.

As always, there is never enough time to do exactly what I want, and the article has become longer than I wanted.

Therefore I have decided to dedicate another article to all the details. See you soon.

=======================================


Jupyter Notebook: Forget CSV, Fetch Data With Python


In this post, see how to call the Oracle DB from Jupyter notebook with Python code.

If you read a book, article, or blog about machine learning, chances are it will use training data from a CSV file. There’s nothing wrong with CSV, but let’s think about whether it is really practical. Wouldn’t it be better to read data directly from the DB? Often, you can’t feed business data directly into ML training because it needs pre-processing: encoding categorical data, calculating new data features, etc. Data preparation/transformation steps can be done quite easily with SQL while fetching the original business data. Another advantage of reading data directly from the DB is that when the data changes, it is easier to automate the ML model re-training process.
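As an illustration of doing a transformation step in SQL during the fetch: the query below derives a new feature in SQL so the Python side receives ML-ready columns. The table and column names follow the Pima Indians example later in this post, and the derived OBESE column is an invented example; `connection` would be a cx_Oracle connection:

```python
# A derived feature computed in SQL while fetching, instead of in Python.
# MASS_INDEX >= 30 is the usual BMI threshold for obesity (illustrative rule).
query = """
select GLUCOSE,
       MASS_INDEX,
       case when MASS_INDEX >= 30 then 1 else 0 end as OBESE,
       CLASS_VAR
from PIMA_INDIANS_DIABETES
"""

# With a live DB connection, the ML-ready frame would be one call away:
# import pandas as pd
# df = pd.read_sql(query, con=connection)
```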

In this post, I describe how to call the Oracle DB from Jupyter notebook with Python code.

Step 1

Install cx_Oracle Python module:

python -m pip install cx_Oracle

This module helps to connect to the Oracle DB from Python.

Step 2

cx_Oracle enables us to execute SQL calls from Python code, but to call a remote DB from a Python script, we also need to install and configure the Oracle Instant Client on the machine where Python runs.

If you are using Ubuntu, install alien:

sudo apt-get update 
sudo apt-get install alien

Download RPM files for Oracle Instant Client and install with alien:

alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm

Add environment variables:

export ORACLE_HOME=/usr/lib/oracle/18.3/client64 
export PATH=$PATH:$ORACLE_HOME/bin

Read more here.

Step 3

Install Magic SQL Python modules:

pip install jupyter-sql 
pip install ipython-sql

Installation and configuration complete.

For today’s sample, I’m using the Pima Indians Diabetes Database. The CSV data can be downloaded from here. I uploaded the CSV data into a database table and will fetch it through SQL directly in the Jupyter notebook.

First of all, a connection is established to the DB, and then the SQL query is executed. The query result set is stored in a variable called result. Do you see %%sql? This is the SQL magic:

import json
import cx_Oracle

%load_ext sql

with open('credentials.json') as f:
    data = json.load(f)
    username = data['username']
    password = data['password']

%sql oracle+cx_oracle://$username:[email protected]:1521/?service_name=ORCLPDB1.localdomain

'Connected: [email protected]'

%%sql result <<
select TIMES_PREGNANT "TIMES_PREGNANT", GLUCOSE "GLUCOSE", BLOOD_PRESSURE "BLOOD_PRESSURE",
       SKIN_FOLD_THICK "SKIN_FOLD_THICK", SERUM_INSULIN "SERUM_INSULING",
       MASS_INDEX "MASS_INDEX", DIABETES_PEDIGREE "DIABETES_PEDIGREE", AGE "AGE",
       CLASS_VAR "CLASS_VAR" from PIMA_INDIANS_DIABETES

 * oracle+cx_oracle://hr:***@dbhost:1521/?service_name=ORCLPDB1.localdomain
0 rows affected.
Returning data to local variable result

A username and password must be specified when establishing the connection. To avoid sharing the password, make sure to read its value from an external source (it could be a simple JSON file, as in this example, or a more advanced encoded token from a keyring).

The beauty of this approach is that data fetched through the SQL query is available out of the box as a DataFrame. Machine learning engineers can work with the data in the same way as if it had been loaded from CSV:

df = result.DataFrame()
df.head()

   times_pregnant  glucose  blood_pressure  skin_fold_thick  serum_insuling  \
0               1      106              70               28             135
1               2      155              52               27             540
2               2      101              58               35              90
3               1      120              80               48             200
4              11      127             106                0               0

  mass_index diabetes_pedigree  age  class_var
0       34.2             0.142   22          0
1       38.7              0.24   25          1
2       21.8             0.155   22          0
3       38.9             1.162   41          0
4         39              0.19   51          0

print(f'Number of rows in dataset: {df.shape[0]}')
print(df[df.columns[8]].value_counts())

Number of rows in dataset: 768
0    500
1    268
Name: class_var, dtype: int64

corrs = df.corr()['class_var'].abs()
columns = corrs[corrs > .01].index
corrs = corrs.filter(columns)
corrs

times_pregnant     0.221898
glucose            0.466581
blood_pressure     0.065068
skin_fold_thick    0.074752
serum_insuling     0.130548
age                0.238356
class_var          1.000000
Name: class_var, dtype: float64

The sample Jupyter notebook is available on GitHub, along with a sample credentials JSON file.

ORA-31011: XML parsing failed while trying to get XML output

When trying to get a value from a CLOB XML column, I get the following error:


ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00601: Invalid token in: '/l-analysis/vehicle/odometer-reading/reading value/text()'
31011. 00000 -  "XML parsing failed"
*Cause:    XML parser returned an error while trying to parse the document.
*Action:   Check if the document to be parsed is valid.

I have the following XML output:

<?xml version="1.0" encoding="UTF-8"?>
<l-analysis>
   <vehicle>
      <make>SAAB</make>
      <model>9000 CSI</model>
      <body-style code="SL">Saloon</body-style>
      <vehicle-type code="7">Passenger car/van</vehicle-type>
      <odometer-reading latest="true">
         <reading value="154267213">154267213</reading>
         <reading-date value="2006-07-17">17-Jul-2006</reading-date>
      </odometer-reading>
   </vehicle>
</l-analysis>

To get the reading value, I am running this query:

select XMLTYPE(xml_data).EXTRACT('/l-analysis/vehicle/odometer-reading/reading value/text()').getStringVal() AS XML FROM lm

I believe the error is due to reading value having a space in it, because the same query works for another XML column that doesn't have spaces.

How could I get this working?
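For what it's worth, a path step with a space in it ("reading value") is not a valid XPath token, which matches the LPX-00601 message. Outside Oracle, the same document parses fine with a space-free path; a quick standard-library Python sketch (not the actual Oracle query) shows the number is reachable both as the reading element's text and as its value attribute:

```python
# Sketch only: shows that a space-free path to <reading> works on this document.
import xml.etree.ElementTree as ET

xml_data = """<l-analysis><vehicle><odometer-reading latest="true">
<reading value="154267213">154267213</reading>
</odometer-reading></vehicle></l-analysis>"""

root = ET.fromstring(xml_data)
reading = root.find("./vehicle/odometer-reading/reading")
print(reading.text)          # element text: 154267213
print(reading.get("value"))  # value attribute: 154267213
```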

An Introduction to Oracle SQL


An Introduction to Oracle SQL: A quick tour of the six principal clauses of the SELECT statement.

This course is a brief presentation of Oracle SQL.
We will install a version of Oracle Express Edition and Oracle SQL Developer for free.
We'll then look at the SELECT statement with its 6 main clauses (SELECT, FROM, WHERE, GROUP BY, HAVING and ORDER BY).
Following this, we'll look at what else you need to know, and resources (which range from free to very expensive) that will help you learn it.

What you'll learn

  • Download and install Oracle Express Edition and Oracle SQL Developer on your computer, for free
  • Write your own SELECT statements.
  • Learn the SELECT, FROM, WHERE, GROUP BY, HAVING and ORDER BY clauses.
  • Get an easy way to remember the order of these clauses.
  • Find out what more you need to know, and resources for learning it.