Create a chat app in the terminal using Python

Create a chat app that runs entirely in your terminal using Python.

Realtime chat is any online communication that delivers text messages from sender to receiver live, as they are sent. This tutorial will show you how to build a realtime terminal chat using Python and Pusher Channels.

Using the terminal for our chat keeps things lightweight: there is no browser to open, no JS libraries to load, and no frontend code. It also allows us to quickly test our ideas without worrying about what the user interface would look like.

Prerequisites

A basic understanding of Python is needed to follow this tutorial. You also need to have Python 3 and pip installed and configured on your machine.

Set up an app on Pusher

Pusher is a hosted service that makes it super-easy to add realtime data and functionality to web and mobile applications.

Pusher acts as a realtime layer between your servers and clients. Pusher maintains persistent connections to the clients - over WebSocket if possible, falling back to HTTP-based connectivity - so that as soon as your servers have new data to push to the clients, they can do so via Pusher.

If you do not already have one, head over to Pusher and create a free account. We will register a new app on the dashboard. The only compulsory options are the app name and cluster. A cluster represents the physical location of the Pusher server that will handle your app’s requests. Also, copy out your App ID, Key, and Secret from the App Keys section, as we will need them later on.

Creating our application

Initial steps

First, we need to install a package called virtualenv. Virtualenv helps manage isolated environments in Python, so we do not end up with conflicting libraries installed across projects. To install virtualenv, we run:

sudo pip install virtualenv

For Windows users, open PowerShell as admin and run:

pip install virtualenv

Once the install is completed, we can verify by running:

virtualenv --version

Next, let us create a new environment with Virtualenv:

virtualenv terminal-chat

Once the environment has been created, we move into the new directory and activate the environment:

    # change directory
    cd terminal-chat
    # activate environment
    source bin/activate

For Windows users, you can activate by running:

    # change directory
    cd terminal-chat
    # activate environment
    Scripts\activate

Next, we need to install the libraries we will use in this project. To install them, run:

    pip install termcolor pusher git+https://github.com/nlsdfnbch/Pysher.git python-dotenv

What are these packages we have installed? And what do they do? I’ll explain.

  • termcolor: ANSI color formatting for output in the terminal. This package formats the color of the output printed to the terminal (a short usage sketch follows this list). Note that the colors won't display in PowerShell or the Windows Command Prompt.
  • pusher: the official Python library for interacting with the Pusher HTTP API.
  • pysher: a Python module for handling Pusher WebSockets. This will handle event subscriptions using Pusher.
  • python-dotenv: a Python module that reads key-value pairs from a .env file and adds them to the environment variables.
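
As a quick illustration (this snippet is not part of the chat app itself), termcolor's colored() simply wraps a string in ANSI color codes before it is printed:

from termcolor import colored

# Each call returns the string wrapped in ANSI escape codes for the given color
print(colored("This prints in green", "green"))
print(colored("Something went wrong", "red"))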

Creating the entry point

Let us create a new .env file to hold the environment variables used to connect to Pusher. Create a file called .env and add your Pusher app ID, key, secret, and cluster respectively:

 PUSHER_APP_ID=YOUR_APP_ID
 PUSHER_APP_KEY=YOUR_APP_KEY
 PUSHER_APP_SECRET=YOUR_APP_SECRET
 PUSHER_APP_CLUSTER=YOUR_APP_CLUSTER

Next, create a file called terminalChat.py and add:

import getpass
from termcolor import colored
from dotenv import load_dotenv

load_dotenv(dotenv_path='.env')

class terminalChat():
    pusher = None
    channel = None
    chatroom = None
    clientPusher = None
    user = None
    users = {
        "samuel": "samuel'spassword",
        "daniel": "daniel'spassword",
        "tobi": "tobi'spassword",
        "sarah": "sarah'spassword"
    }
    chatrooms = ["sports", "general", "education", "health", "technology"]

    ''' The entry point of the application'''
    def main(self):
        self.login()
        self.selectChatroom()
        while True:
            self.getInput()

    ''' This function handles login to the system. In a real-world app,
    you might need to connect to APIs or a database to verify users '''
    def login(self):
        username = input("Please enter your username: ")
        password = getpass.getpass("Please enter %s's Password:" % username)
        if username in self.users:
            if self.users[username] == password:
                self.user = username
            else:
                print(colored("Your password is incorrect", "red"))
                self.login()
        else:
            print(colored("Your username is incorrect", "red"))
            self.login()

    ''' This function is used to select which chatroom you would like to connect to '''
    def selectChatroom(self):
        print(colored("Info! Available chatrooms are %s" % str(self.chatrooms), "blue"))
        chatroom = input(colored("Please select a chatroom: ", "green"))
        if chatroom in self.chatrooms:
            self.chatroom = chatroom
            self.initPusher()
        else:
            print(colored("No such chatroom in our list", "red"))
            self.selectChatroom()

    ''' This function is used to get the user's current message '''
    def getInput(self):
        message = input(colored("{}: ".format(self.user), "green"))

if __name__ == "__main__":
    terminalChat().main()

What is going on in the code above?

We import the colored function, which colors our console output, and load_dotenv, which loads environment variables from our .env file. We then call load_dotenv.

The terminalChat class is then defined, with some properties:

  • pusher : this property will hold the Pusher server instance once it is available.
  • channel: this property will hold the Pusher instance of the channel subscribed to.
  • chatroom: this property will hold the name of the channel the user wants to chat in.
  • clientPusher: this property will hold the Pusher client instance once it is available.
  • user: this property will hold the details of the currently logged in user.
  • users: this property holds a static dictionary of users who can log in, with their passwords as values. In a real-world application, this would usually come from a database.
  • chatrooms: this property holds a list of all available chat-rooms one can join.

Understanding the defined functions

We have defined four functions; here is how each of them works:

main: this is the entry point into our application. Here, we call the function to log in and the function to select a chat room. After this, we have a while loop that calls the getInput function. This loop keeps getInput running, so there is always a prompt available for typing new messages into the terminal.

login: the login function is as simple as the name implies. It is used to manage login into the app. In the function, we ask for both the username and password of the user. Next, we check if the username exists in our users dictionary and whether the password matches the user’s password. If all is well, we assign the user variable to the value of the user input.

Note: for the sake of this tutorial, we have a pre-defined dictionary of users. In your application, you may need to verify that the user exists in your database.

selectChatroom: as the name implies, this function enables the user to select a chatroom. First, it informs the user of the available chatrooms, then asks them to select one. Once a valid chatroom has been selected, we assign the chatroom variable to the selected room and call a method named initPusher (which we will create soon), which initializes and sets up Pusher to send and receive messages.

getInput: this function is simple. It shows an input prompt with the logged in user’s name in front, waiting for the user to enter a message. For now, it does nothing with the message; we will revisit this function once Pusher has been set up correctly.

Connecting the Pusher server and client to our app

If we remember, in the previous section above, we discussed the initPusher method which initializes and sets up Pusher to send and receive messages. Here is where we implement that function. First, we need to add the following imports to the top of our file:

# terminalChat.py
from pusher import Pusher
import pysher
import os
import json

Next, let’s go ahead and define initPusher and some other functions within our terminalChat class:

    ''' This function initializes both the Http server Pusher as well as the clientPusher'''
    def initPusher(self):
        self.pusher = Pusher(
            app_id=os.getenv('PUSHER_APP_ID', None),
            key=os.getenv('PUSHER_APP_KEY', None),
            secret=os.getenv('PUSHER_APP_SECRET', None),
            cluster=os.getenv('PUSHER_APP_CLUSTER', None)
        )
        self.clientPusher = pysher.Pusher(os.getenv('PUSHER_APP_KEY', None), os.getenv('PUSHER_APP_CLUSTER', None))
        self.clientPusher.connection.bind('pusher:connection_established', self.connectHandler)
        self.clientPusher.connect()

    ''' This function is called once pusher has successfully established a connection'''
    def connectHandler(self, data):
        self.channel = self.clientPusher.subscribe(self.chatroom)
        self.channel.bind('newmessage', self.pusherCallback)

    ''' This function is called once pusher receives a new event '''
    def pusherCallback(self, message):
        message = json.loads(message)
        if message['user'] != self.user:
            print(colored("{}: {}".format(message['user'], message['message']), "blue"))
            print(colored("{}: ".format(self.user), "green"))

In the initPusher function, we initialize a new Pusher instance in the pusher variable, passing in our APP_ID, APP_KEY, APP_SECRET and APP_CLUSTER respectively. Next, we initialize a new Pysher client for Pusher, passing in our APP_KEY and APP_CLUSTER. We then bind the pusher:connection_established event on the client connection, passing the connectHandler function as its callback. We do this to ensure that the client has been connected before we try to subscribe to a channel. After this is done, we call connect on the clientPusher.

You might have been wondering why we are using Pysher as the client library for Pusher here. It is because the default Pusher library only allows triggering of events, not subscribing to them. Pysher is a community library which allows us to subscribe to events using Python on the server.

In the connectHandler function, we receive an argument called data, which contains connection data from the newly established Pusher WebSocket connection. We subscribe to the channel that was chosen earlier, then bind to an event called newmessage, passing in the pusherCallback function as its callback.

In the pusherCallback method, we receive an argument called message, which holds the new message received from Pusher. Here, we parse the JSON message into a Python dictionary, then check that the message isn't from the currently logged in user before printing it to the screen alongside the sender’s name. We also print the logged in user’s name followed by a colon, so the user knows they can still type.

Updating the getInput function

Let’s update our getInput function, so we can trigger the message to Pusher once it is received:

    ''' This function is used to get the user's current message '''
    def getInput(self):
        message = input(colored("{}: ".format(self.user), "green"))
        self.pusher.trigger(self.chatroom, u'newmessage', {"user": self.user, "message": message})

Here, after receiving the message, we trigger a newmessage event to the current chatroom, passing the current user and the message sent.

Bringing it all together as one piece

Here is what our terminalChat.py looks like:

import getpass
from termcolor import colored
from pusher import Pusher
import pysher
from dotenv import load_dotenv
import os
import json

load_dotenv(dotenv_path='.env')

class terminalChat():
    pusher = None
    channel = None
    chatroom = None
    clientPusher = None
    user = None
    users = {
        "samuel": "samuel'spassword",
        "daniel": "daniel'spassword",
        "tobi": "tobi'spassword",
        "sarah": "sarah'spassword"
    }
    chatrooms = ["sports", "general", "education", "health", "technology"]

    ''' The entry point of the application'''
    def main(self):
        self.login()
        self.selectChatroom()
        while True:
            self.getInput()

    ''' This function handles login to the system. In a real world app,
    you might need to connect to APIs or a database to verify users '''
    def login(self):
        username = input("Please enter your username: ")
        password = getpass.getpass("Please enter %s's Password:" % username)
        if username in self.users:
            if self.users[username] == password:
                self.user = username
            else:
                print(colored("Your password is incorrect", "red"))
                self.login()
        else:
            print(colored("Your username is incorrect", "red"))
            self.login()

    ''' This function is used to select which chatroom you would like to connect to '''
    def selectChatroom(self):
        print(colored("Info! Available chatrooms are %s" % str(self.chatrooms), "blue"))
        chatroom = input(colored("Please select a chatroom: ", "green"))
        if chatroom in self.chatrooms:
            self.chatroom = chatroom
            self.initPusher()
        else:
            print(colored("No such chatroom in our list", "red"))
            self.selectChatroom()

    ''' This function initializes both the Http server Pusher as well as the clientPusher'''
    def initPusher(self):
        self.pusher = Pusher(
            app_id=os.getenv('PUSHER_APP_ID', None),
            key=os.getenv('PUSHER_APP_KEY', None),
            secret=os.getenv('PUSHER_APP_SECRET', None),
            cluster=os.getenv('PUSHER_APP_CLUSTER', None)
        )
        self.clientPusher = pysher.Pusher(os.getenv('PUSHER_APP_KEY', None), os.getenv('PUSHER_APP_CLUSTER', None))
        self.clientPusher.connection.bind('pusher:connection_established', self.connectHandler)
        self.clientPusher.connect()

    ''' This function is called once pusher has successfully established a connection'''
    def connectHandler(self, data):
        self.channel = self.clientPusher.subscribe(self.chatroom)
        self.channel.bind('newmessage', self.pusherCallback)

    ''' This function is called once pusher receives a new event '''
    def pusherCallback(self, message):
        message = json.loads(message)
        if message['user'] != self.user:
            print(colored("{}: {}".format(message['user'], message['message']), "blue"))
            print(colored("{}: ".format(self.user), "green"))

    ''' This function is used to get the user's current message '''
    def getInput(self):
        message = input(colored("{}: ".format(self.user), "green"))
        self.pusher.trigger(self.chatroom, u'newmessage', {"user": self.user, "message": message})

if __name__ == "__main__":
    terminalChat().main()

Here is what our chat looks like if we run python terminalChat.py:

Conclusion

We’ve seen how straightforward it is to add realtime chats to our terminal, thanks to Pusher Channels. Our demo app is a simple example. The same functionality could be used in many real world scenarios. 

Thanks for reading!

Originally published on https://pusher.com


How to Set up an SMS Notification With Python

Hi everyone :) Today I am beginning a new series of posts specifically aimed at Python beginners. The concept is rather simple: I'll do a fun project, in as few lines of code as possible, and will try out as many new tools as possible.

For example, today we will learn to use the Twilio API and the Twitch API, and we'll see how to deploy the project on Heroku. I'll show you how you can have your own "Twitch Live" SMS notifier, in 30 lines of code, and for 12 cents a month.

Prerequisite: You only need to know how to run Python on your machine and some basic commands in git (commit & push). If you need help with these, I can recommend these 2 articles to you:

Python 3 Installation & Setup Guide

The Ultimate Git Command Tutorial for Beginners from Adrian Hajdin.

What you'll learn:

  • Twitch API
  • Twilio API
  • Deploying on Heroku
  • Setting up a scheduler on Heroku

What you will build:

The specifications are simple: we want to receive an SMS as soon as a specific Twitcher is live streaming. We want to know when this person goes live and when they stop streaming. We want this whole thing to run by itself, all day long.

We will split the project into 3 parts. First, we will see how to programmatically know if a particular Twitcher is online. Then we will see how to receive an SMS when this happens. We will finish by seeing how to make this piece of code run every X minutes, so we never miss another moment of our favorite streamer's life.

Is this Twitcher live?

To know if a Twitcher is live, we can do one of two things: the first is to go to the Twitcher's URL and try to see if the "Live" badge is there.

Screenshot of a Twitcher live streaming.

This process involves scraping and is not easily doable in Python in fewer than 20 or so lines of code. Twitch runs a lot of JS code and a simple requests.get() won't be enough.

For scraping to work in this case, we would need to scrape this page inside Chrome to get the same content as what you see in the screenshot. This is doable, but it will take much more than 30 lines of code. If you'd like to learn more, don't hesitate to check my recent web scraping guide.

So instead of trying to scrape Twitch, we will use their API. For those unfamiliar with the term, an API is a programmatic interface that allows websites to expose their features and data to anyone, mainly developers. In Twitch's case, their API is exposed over HTTP, which means that we can get lots of information and do lots of things by just making a simple HTTP request.

Get your API key

To do this, you have to first create a Twitch API key. Many services enforce authentication for their APIs to ensure that no one abuses them or to restrict access to certain features by certain people.

Please follow these steps to get your API key:

  • Create a Twitch account
  • Now create a Twitch dev account -> "Signing up with Twitch" top right
  • Go to your "dashboard" once logged in
  • "Register your application"
  • Name -> Whatever, Oauth redirection URL -> http://localhost, Category -> Whatever

You should now see, at the bottom of your screen, your client-id. Keep this for later.

Is that Twitcher streaming now?

With your API key in hand, we can now query the Twitch API to have the information we want, so let's begin to code. The following snippet just consumes the Twitch API with the correct parameters and prints the response.

# requests is the go to package in python to make http request
# https://2.python-requests.org/en/master/
import requests

# This is one of the route where Twich expose data, 
# They have many more: https://dev.twitch.tv/docs
endpoint = "https://api.twitch.tv/helix/streams?"

# In order to authenticate we need to pass our api key through header
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}

# The previously set endpoint needs some parameter, here, the Twitcher we want to follow
# Disclaimer, I don't even know who this is, but he was the first one on Twich to have a live stream so I could have nice examples
params = {"user_login": "Solary"}

# It is now time to make the actual request
response = requests.get(endpoint, params=params, headers=headers)
print(response.json())

The output should look like this:

{
   'data':[
      {
         'id':'35289543872',
         'user_id':'174955366',
         'user_name':'Solary',
         'game_id':'21779',
         'type':'live',
         'title':"Wakz duoQ w/ Tioo - GM 400LP - On récupère le chall après les -250LP d'inactivité !",
         'viewer_count':4073,
         'started_at':'2019-08-14T07:01:59Z',
         'language':'fr',
         'thumbnail_url':'https://static-cdn.jtvnw.net/previews-ttv/live_user_solary-{width}x{height}.jpg',
         'tag_ids':[
            '6f655045-9989-4ef7-8f85-1edcec42d648'
         ]
      }
   ],
   'pagination':{
      'cursor':'eyJiIjpudWxsLCJhIjp7Ik9mZnNldCI6MX19'
   }
}

This data format is called JSON and is easily readable. The data object is an array that contains all the currently active streams matching the query; it is empty when the user is not streaming. The type key confirms that the stream is currently live (it will be an empty string in case of an error, for example).

So if we want to create a boolean variable in Python that stores whether the current user is streaming, all we have to append to our code is:

json_response = response.json()

# We get only streams
streams = json_response.get('data', [])

# We create a small function, (a lambda), that tests if a stream is live or not
is_active = lambda stream: stream.get('type') == 'live'
# We filter our array of streams with this function so we only keep streams that are active
streams_active = filter(is_active, streams)

# any returns True if streams_active has at least one element, else False
at_least_one_stream_active = any(streams_active)

print(at_least_one_stream_active)

At this point, at_least_one_stream_active is True when your favourite Twitcher is live.

Let's now see how to get notified by SMS.

Send me a text, NOW!

So to send a text to ourselves, we will use the Twilio API. Just go over there and create an account. When asked to confirm your phone number, please use the phone number you want to use in this project. This way you'll be able to use the $15 of free credit Twilio offers to new users. At around 1 cent a text, it should be enough for your bot to run for one year.

If you go on the console, you'll see your Account SID and your Auth Token; save them for later. Also click on the big red button "Get My Trial Number", follow the steps, and save this one for later too.

Sending a text with the Twilio Python API is very easy, as they provide a package that does the annoying stuff for you. Install the package with pip install twilio and just do:

from twilio.rest import Client

client = Client(<Your Account SID>, <Your Auth Token>)
client.messages.create(
    body='Test MSG', from_=<Your Trial Number>, to=<Your Real Number>)

And that is all you need to send yourself a text, amazing right?

Putting everything together

We will now put everything together and shorten the code a bit so we manage to stay under 30 lines of Python code.

import requests
from twilio.rest import Client

endpoint = "https://api.twitch.tv/helix/streams?"
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}
params = {"user_login": "Solary"}
response = requests.get(endpoint, params=params, headers=headers)
json_response = response.json()
streams = json_response.get('data', [])
is_active = lambda stream: stream.get('type') == 'live'
streams_active = filter(is_active, streams)
at_least_one_stream_active = any(streams_active)

if at_least_one_stream_active:
    client = Client(<Your Account SID>, <Your Auth Token>)
    client.messages.create(body='LIVE !!!', from_=<Your Trial Number>, to=<Your Real Number>)

Avoiding double notifications

This snippet works great, but should that snippet run every minute on a server, as soon as our favorite Twitcher goes live we will receive an SMS every minute.

We need a way to store the fact that we were already notified that our Twitcher is live and that we don't need to be notified anymore.

The good thing with the Twilio API is that it offers a way to retrieve our message history, so we just have to retrieve the last SMS we sent to see if we already sent a text notifying us that the twitcher is live.

Here is what we are going to do, in pseudocode:

if favorite_twitcher_live and last_sent_sms is not live_notification:
	send_live_notification()
if not favorite_twitcher_live and last_sent_sms is live_notification:
	send_live_is_over_notification()

This way we will receive a text as soon as the stream starts, as well as when it is over, and we won't get spammed - perfect, right? Let's code it:

# reusing our Twilio client
last_messages_sent = client.messages.list(limit=1)
last_message_id = last_messages_sent[0].sid
last_message_data = client.messages(last_message_id).fetch()
last_message_content = last_message_data.body

Let's now put everything together again:

import requests
from twilio.rest import Client

client = Client(<Your Account SID>, <Your Auth Token>)

endpoint = "https://api.twitch.tv/helix/streams?"
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}
params = {"user_login": "Solary"}
response = requests.get(endpoint, params=params, headers=headers)
json_response = response.json()
streams = json_response.get('data', [])
is_active = lambda stream: stream.get('type') == 'live'
streams_active = filter(is_active, streams)
at_least_one_stream_active = any(streams_active)

last_messages_sent = client.messages.list(limit=1)
if last_messages_sent:
    last_message_id = last_messages_sent[0].sid
    last_message_data = client.messages(last_message_id).fetch()
    last_message_content = last_message_data.body
    online_notified = "LIVE" in last_message_content
    offline_notified = not online_notified
else:
    online_notified, offline_notified = False, False

if at_least_one_stream_active and not online_notified:
    client.messages.create(body='LIVE !!!', from_=<Your Trial Number>, to=<Your Real Number>)
if not at_least_one_stream_active and not offline_notified:
    client.messages.create(body='OFFLINE !!!', from_=<Your Trial Number>, to=<Your Real Number>)

And voilà!

You now have a snippet of code, in less than 30 lines of Python, that will send you a text as soon as your favourite Twitcher goes Online / Offline, without spamming you.

We just now need a way to host and run this snippet every X minutes.

The quest for a host

To host and run this snippet we will use Heroku. Heroku is honestly one of the easiest ways to host an app on the web. The downside is that it is really expensive compared to other solutions out there. Fortunately for us, they have a generous free plan that will allow us to do what we want for almost nothing.

If you don't already have one, you need to create a Heroku account. You also need to download and install the Heroku client.

You now have to move your Python script to its own folder; don't forget to add a requirements.txt file to it. Its contents are:

requests
twilio

This is to ensure that Heroku downloads the correct dependencies.

cd into this folder and just do a heroku create --app <app name>.

If you go on your app dashboard you'll see your new app.

We now need to initialize a git repo and push the code on Heroku:

git init
heroku git:remote -a <app name>
git add .
git commit -am 'Deploy breakthrough script'
git push heroku master

Your app is now on Heroku, but it is not doing anything. Since this little script can't accept HTTP requests, going to <app name>.herokuapp.com won't do anything. But that should not be a problem.

To have this script running 24/7 we need to use a simple Heroku add-on called "Heroku Scheduler". To install this add-on, click on the "Configure Add-ons" button on your app dashboard.

Then, on the search bar, look for Heroku Scheduler:

Click on the result, and click on "Provision"

If you go back to your App dashboard, you'll see the add-on:

Click on the "Heroku Scheduler" link to configure a job. Then click on "Create Job". Here select "10 minutes", and for run command select python <name_of_your_script>.py. Click on "Save job".

While everything we have used so far on Heroku is free, the Heroku Scheduler will run the job on a $25/month instance, prorated to the second. Since this script takes approximately 3 seconds to run, running it every 10 minutes should cost you only about 12 cents a month.
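
To see roughly where that figure comes from, here is a back-of-the-envelope calculation (the 3 seconds per run is only an estimate):

# Rough cost estimate: a $25/month dyno, prorated to the second
seconds_per_month = 30 * 24 * 3600   # ~2,592,000 seconds
runs_per_month = 6 * 24 * 30         # one run every 10 minutes -> 4,320 runs
busy_seconds = runs_per_month * 3    # ~3 seconds per run -> 12,960 seconds
print(round(25 * busy_seconds / seconds_per_month, 2))   # ~0.12 dollars per month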

Ideas for improvements

I hope you liked this project and that you had fun putting it into place. In less than 30 lines of code, we did a lot, but this whole thing is far from perfect. Here are a few ideas to improve it:

  • Send yourself more information about the current streaming (game played, number of viewers ...)
  • Send yourself the duration of the last stream once the twitcher goes offline
  • Don't send yourself a text, but rather an email
  • Monitor multiple twitchers at the same time

Do not hesitate to tell me in the comments if you have more ideas.

Conclusion

I hope that you liked this post and that you learned things reading it. I truly believe that this kind of project is one of the best ways to learn new tools and concepts. I recently launched a web scraping API and learned a lot while building it.

Please tell me in the comments if you liked this format and if you want to do more.

I have many other ideas, and I hope you will like them. Do not hesitate to share what other things you build with this snippet, possibilities are endless.

Happy Coding.

Pierre

Don't want to miss my next post:

You can subscribe here to my newsletter.

Python Tutorial: Image processing with Python (Using OpenCV)

In this tutorial, you will learn how you can process images in Python using the OpenCV library.

OpenCV is a free open source library used in real-time image processing. It’s used to process images, videos, and even live streams, but in this tutorial, we will process images only as a first step. Before getting started, let’s install OpenCV.

Install OpenCV

To install OpenCV on your system, run the following pip command:

 pip install opencv-python

Now OpenCV is installed successfully and we are ready. Let’s have some fun with some images!

Rotate an Image

First of all, import the cv2 module.

 import cv2

Now to read the image, use the imread() method of the cv2 module, specify the path to the image in the arguments and store the image in a variable as below:

 img = cv2.imread("pyimg.jpg")

The image is now treated as a matrix of rows and columns of pixel values, stored in img.

Actually, if you check the type of the img, it will give you the following result:

>>> print(type(img))
 
<class 'numpy.ndarray'>

It’s a NumPy array! That's why image processing using OpenCV is so easy: you are working with a NumPy array the whole time.

To display the image, you can use the imshow() method of cv2.

cv2.imshow('Original Image', img) 
 
cv2.waitKey(0)

The waitKey() function takes a time in milliseconds as an argument, which is the delay before the window closes. Here we set the time to zero to show the window until we close it manually.

To rotate this image, you need the width and the height of the image because you will use them in the rotation process as you will see later.

 height, width = img.shape[0:2]

The shape attribute returns the height and width of the image matrix. If you print img.shape[0:2], you will get output like the following:
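
The original screenshot of this output is not reproduced here; as an illustration (the numbers below are hypothetical and depend on your image), the printed value is a (height, width) tuple:

print(img.shape[0:2])   # e.g. (342, 548) -> height 342 px, width 548 px (hypothetical values)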

Okay, now we have our image matrix and we want to get the rotation matrix. To get the rotation matrix, we use the getRotationMatrix2D() method of cv2. The syntax of getRotationMatrix2D() is:

 cv2.getRotationMatrix2D(center, angle, scale)

Here center is the center point of rotation, angle is the rotation angle in degrees, and scale is a scaling factor applied to the result (used here to make the rotated image fit on the screen).

To get the rotation matrix of our image, the code will be:

 rotationMatrix = cv2.getRotationMatrix2D((width/2, height/2), 90, .5)

The next step is to rotate our image with the help of the rotation matrix.

To rotate the image, we have a cv2 method named warpAffine which takes the original image, the rotation matrix of the image and the width and height of the image as arguments.

 rotatedImage = cv2.warpAffine(img, rotationMatrix, (width, height))

The rotated image is stored in the rotatedImage matrix. To show the image, use imshow() as below:

cv2.imshow('Rotated Image', rotatedImage)
 
cv2.waitKey(0)

After running the above lines of code, you will have the following output:

Crop an Image

First, we need to import the cv2 module and read the image and extract the width and height of the image:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
height, width = img.shape[0:2]

Now get the starting and ending indices of the rows and columns. These define the size of the newly created image. For example, taking row number 10 through row number 15 gives the height of the cropped image.

Similarly, taking column number 10 through column number 15 gives the width of the cropped image.

You can get the starting point by specifying the percentage value of the total height and the total width. Similarly, to get the ending point of the cropped image, specify the percentage values as below:

startRow = int(height*.15)
 
startCol = int(width*.15)
 
endRow = int(height*.85)
 
endCol = int(width*.85)

Now map these values to the original image. Note that you have to cast the starting and ending values to integers because when mapping, the indexes are always integers.

 croppedImage = img[startRow:endRow, startCol:endCol]

Here we specified the range from starting to ending of rows and columns.

Now display the original and cropped image in the output:

cv2.imshow('Original Image', img)
 
cv2.imshow('Cropped Image', croppedImage)
 
cv2.waitKey(0)

The result will be as follows:

Resize an Image

To resize an image, you can use the resize() method of OpenCV. In the resize method, you can either specify scaling factors for the x and y axes or the number of columns and rows, which determines the size of the output image.

Import and read the image:

import cv2
 
img = cv2.imread("pyimg.jpg")

Now using the resize method with axis values:

newImg = cv2.resize(img, (0,0), fx=0.75, fy=0.75)
 
cv2.imshow('Resized Image', newImg)
 
cv2.waitKey(0)

The result will be as follows:

Now using the row and column values to resize the image:

newImg = cv2.resize(img, (550, 350))
 
cv2.imshow('Resized Image', newImg)
 
cv2.waitKey(0)

We say we want 550 columns (the width) and 350 rows (the height).

The result will be:

Adjust Image Contrast

In the Python OpenCV module, there is no dedicated function to adjust image contrast, but the official OpenCV documentation suggests an equation that can adjust image brightness and image contrast at the same time.

 new_img = a * original_img + b

Here a is alpha which defines contrast of the image. If a is greater than 1, there will be higher contrast.

If the value of a is between 0 and 1 (smaller than 1 but greater than 0), there would be lower contrast. If a is 1, there will be no contrast effect on the image.

b stands for beta. The values of b vary from -127 to +127.

To implement this equation in Python OpenCV, you can use the addWeighted() method. We use the addWeighted() method because it clamps the output to the range 0 to 255 for a 24-bit color image.

The syntax of addWeighted() method is as follows:

 cv2.addWeighted(source_img1, alpha1, source_img2, alpha2, beta)

This syntax blends two images: the first source image (source_img1) weighted by alpha1 and the second source image (source_img2) weighted by alpha2, with beta added to each pixel of the sum.

If you only want to apply contrast in one image, you can add a second image source as zeros using NumPy.

Let’s work on a simple example. Import the following modules:

import cv2
 
import numpy as np

Read the original image:

 img = cv2.imread("pyimg.jpg")

Now apply the contrast. Since there is no other image, we will use the np.zeros which will create an array of the same shape and data type as the original image but the array will be filled with zeros.

contrast_img = cv2.addWeighted(img, 2.5, np.zeros(img.shape, img.dtype), 0, 0)
 
cv2.imshow('Original Image', img)
 
cv2.imshow('Contrast Image', contrast_img)
 
cv2.waitKey(0)

In the above code, the brightness is set to 0 as we only want to apply contrast.

The comparison of the original and contrast image is as follows:

Make an image blurry

Gaussian Blur

To make an image blurry, you can use the GaussianBlur() method of OpenCV.

GaussianBlur() uses a Gaussian kernel. The height and width of the kernel should be positive and odd.

Then you have to specify the standard deviations in the X and Y directions, sigmaX and sigmaY respectively. If only one is specified, both are considered the same.

Consider the following example:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
blur_image = cv2.GaussianBlur(img, (7,7), 0)
 
cv2.imshow('Original Image', img)
 
cv2.imshow('Blur Image', blur_image)
 
cv2.waitKey(0)

In the above snippet, the actual image is passed to GaussianBlur() along with the height and width of the kernel and the sigma value (0 here, which lets OpenCV compute it from the kernel size).

The comparison of the original and blurry image is as follows:

Median Blur

In median blurring, the median of all the pixels inside the kernel area is calculated, and the central pixel is replaced with this median value. Median blurring is used when there is salt-and-pepper noise in the image.

To apply median blurring, you can use the medianBlur() method of OpenCV.

Consider the following example where we have a salt and pepper noise in the image:

import cv2
 
img = cv2.imread("pynoise.png")
 
blur_image = cv2.medianBlur(img,5)

This applies a median blur with a kernel size of 5 to the noisy image. Now show the images:

cv2.imshow('Original Image', img)
 
cv2.imshow('Blur Image', blur_image)
 
cv2.waitKey(0)

The result will be like the following:

Another comparison of the original image and after blurring:

Detect Edges

To detect the edges in an image, you can use the Canny() method of cv2 which implements the Canny edge detector. The Canny edge detector is also known as the optimal detector.

The syntax to Canny() is as follows:

 cv2.Canny(image, minVal, maxVal)

Here minVal and maxVal are the minimum and maximum intensity gradient values respectively.

Consider the following code:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
edge_img = cv2.Canny(img,100,200)
 
cv2.imshow("Detected Edges", edge_img)
 
cv2.waitKey(0)

The output will be the following:

Here is the result of the above code on another image:

Convert image to grayscale (Black & White)

The easy way to convert an image to grayscale is to load it like this:

 img = cv2.imread("pyimg.jpg", 0)

There is another method, using the COLOR_BGR2GRAY conversion code.

To convert a color image into a grayscale image, use the COLOR_BGR2GRAY attribute of the cv2 module. This is demonstrated in the example below:

Import the cv2 module:

 import cv2

Read the image:

 img = cv2.imread("pyimg.jpg")

Use the cvtColor() method of the cv2 module which takes the original image and the COLOR_BGR2GRAY attribute as an argument. Store the resultant image in a variable:

 gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Display the original and grayscale images:

cv2.imshow("Original Image", img)
 
cv2.imshow("Gray Scale Image", gray_img)
 
cv2.waitKey(0)

The output will be as follows:

Centroid (Center of blob) detection

To find the center of an image, the first step is to convert the original image into grayscale. We can use the cvtColor() method of cv2 as we did before.

This is demonstrated in the following code:

import cv2
 
img = cv2.imread("py.jpg")
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

We read the image and convert it to a grayscale image. The new image is stored in gray_img.

Now we have to calculate the moments of the image. Use the moments() method of cv2. In the moments() method, the grayscale image will be passed as below:

 moment = cv2.moments(gray_img)
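
The original does not show how the X and Y coordinates used in the circle() call below are obtained from the moments; here is a minimal sketch, assuming the image region has a non-zero area (m00 != 0):

# The centroid is (m10/m00, m01/m00); cast to int to use as pixel coordinates
X = int(moment["m10"] / moment["m00"])
Y = int(moment["m01"] / moment["m00"])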

Finally, we have the center of the image. To highlight this center position, we can use the circle method which will create a circle in the given coordinates of the given radius.

The circle() method takes the img, the x and y coordinates where the circle will be created, the size, the color that we want the circle to be and the thickness.

 cv2.circle(img, (X, Y), 15, (205, 114, 101), 1)

The circle is created on the image.

cv2.imshow("Center of the Image", img)
 
cv2.waitKey(0)

The original image is:

After detecting the center, our image will be as follows:

Apply a mask for a colored image

Image masking means to apply some other image as a mask on the original image or to change the pixel values in the image.

To apply a mask on the image, we will use the HoughCircles() method of the OpenCV module. The HoughCircles() method detects the circles in an image. After detecting the circles, we can simply apply a mask on these circles.

The HoughCircles() method takes the original image, the Hough Gradient (which detects the gradient information in the edges of the circle), and the information from the following circle equation:

 (x - x_center)² + (y - y_center)² = r²

In this equation, (x_center, y_center) is the center of the circle and r is the radius of the circle.

Our original image is:

After detecting circles in the image, the result will be:

Okay, so we have the circles in the image and we can apply the mask. Consider the following code:

import cv2
 
import numpy as np
 
img1 = cv2.imread('pyimg.jpg')
 
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)

Detecting the circles in the image using the HoughCircles() code from OpenCV: Hough Circle Transform:

gray_img = cv2.medianBlur(cv2.cvtColor(img1, cv2.COLOR_RGB2GRAY), 3)
 
circles = cv2.HoughCircles(gray_img, cv2.HOUGH_GRADIENT, 1, 20, param1=50, param2=50, minRadius=0, maxRadius=0)
 
circles = np.uint16(np.around(circles))

To create the mask, use np.full which will return a NumPy array of given shape:

masking = np.full((img1.shape[0], img1.shape[1]), 0, dtype=np.uint8)
 
for j in circles[0, :]:
 
    cv2.circle(masking, (j[0], j[1]), j[2], (255, 255, 255), -1)

The next step is to combine the image and the masking array we created using the bitwise_or operator as follows:

 final_img = cv2.bitwise_or(img1, img1, mask=masking)

Display the resultant image:
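
The display code is not shown in the original; a minimal sketch following the pattern used elsewhere in this tutorial (the window title is arbitrary):

cv2.imshow('Masked Image', final_img)

cv2.waitKey(0)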

Extracting text from Image (OCR)

To extract text from an image, you can use Google Tesseract-OCR. You can download it from this link

Then you should install the pytesseract module which is a Python wrapper for Tesseract-OCR.
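
The install command is not shown in the original; pytesseract is normally installed with pip:

 pip install pytesseract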

The image from which we will extract the text from is as follows:

Now let’s convert the text in this image to a string of characters and display the text as a string on output:

Import the pytesseract module:

 import pytesseract

Set the path of the Tesseract-OCR executable file:

 pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

Now use the image_to_string method to convert the image into a string:

 print(pytesseract.image_to_string('pytext.png'))

The output will be as follows:

Works like a charm!

Detect and correct text skew

In this section, we will correct the text skew.

The original image is as follows:

Import the modules cv2, NumPy and read the image:

import cv2
 
import numpy as np
 
img = cv2.imread("pytext1.png")

Convert the image into a grayscale image:

 gray_img=cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Invert the grayscale image using bitwise_not:

 gray_img=cv2.bitwise_not(gray_img)

Select the x and y coordinates of the pixels greater than zero by using the column_stack method of NumPy:

 coordinates = np.column_stack(np.where(gray_img > 0))

Now we have to calculate the skew angle. We will use the minAreaRect() method of cv2 which returns an angle range from -90 to 0 degrees (where 0 is not included).

 ang=cv2.minAreaRect(coordinates)[-1]

The rotated angle of the text region will be stored in the ang variable. Now we add a condition for the angle: if the text region’s angle is smaller than -45 degrees, we add 90 degrees to it and negate the result; otherwise we simply negate the angle to make it positive.

if ang<-45:
 
    ang=-(90+ang)
 
else:
 
    ang=-ang

Calculate the center of the text region:

height, width = img.shape[:2]
 
center_img = (width / 2, height / 2)

Now that we have the angle of the text skew, we apply getRotationMatrix2D() to get the rotation matrix, then use the warpAffine() method to rotate the image by that angle (explained earlier).

rotationMatrix = cv2.getRotationMatrix2D(center_img, ang, 1.0)
 
rotated_img = cv2.warpAffine(img, rotationMatrix, (width, height), borderMode = cv2.BORDER_REFLECT)

Display the rotated image:

cv2.imshow("Rotated Image", rotated_img)
 
cv2.waitKey(0)

Color Detection

Let’s detect the green color from an image:

Import the modules cv2 for images and NumPy for image arrays:

import cv2
 
import numpy as np

Read the image and convert it into HSV using cvtColor():

img = cv2.imread("pydetect.png")
 
hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

Display the image:

 cv2.imshow("HSV Image", hsv_img)

Now create a NumPy array for the lower green values and the upper green values:

lower_green = np.array([34, 177, 76])
 
upper_green = np.array([255, 255, 255])

Use the inRange() method of cv2 to check if the given image array elements lie between array values of upper and lower boundaries:

 masking = cv2.inRange(hsv_img, lower_green, upper_green)

This will detect the green color.

Finally, display the original and resultant images:

 cv2.imshow("Original Image", img)

cv2.imshow("Green Color detection", masking)
 
cv2.waitKey(0)

Reduce Noise

To reduce noise from an image, OpenCV provides the following methods:

  1. fastNlMeansDenoising(): Removes noise from a grayscale image
  2. fastNlMeansDenoisingColored(): Removes noise from a colored image
  3. fastNlMeansDenoisingMulti(): Removes noise from grayscale image frames (a grayscale video)
  4. fastNlMeansDenoisingColoredMulti(): Same as 3 but works with colored frames

Let’s use fastNlMeansDenoisingColored() in our example:

Import the cv2 module and read the image:

import cv2
 
img = cv2.imread("pyn1.png")

Apply the denoising function, which takes the following arguments: the original image (src); the destination (we pass None since we store the result in a new variable); the filter strength; the value used to remove the colored noise (usually equal to the filter strength, or 10); the template patch size in pixels used to compute weights, which should always be odd (recommended size 7); and the window size in pixels used to compute the average for a given pixel.

 result = cv2.fastNlMeansDenoisingColored(img,None,20,10,7,21)

Display original and denoised image:

cv2.imshow("Original Image", img)
 
cv2.imshow("Denoised Image", result)
 
cv2.waitKey(0)

The output will be:

Get image contour

Contours are curves formed by joining the continuous points along a boundary in an image. Contours are used to detect objects.

The original image of which we are getting the contours of is given below:

Consider the following code where we used the findContours() method to find the contours in the image:

Import cv2 module:

 import cv2

Read the image and convert it to a grayscale image:

img = cv2.imread('py1.jpg')
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Find the threshold:

 retval, thresh = cv2.threshold(gray_img, 127, 255, 0)

Use findContours(), which takes the image (we pass the thresholded image here) and some flags. See the official findContours() documentation.

 img_contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

Draw the contours on the image using drawContours() method:

  cv2.drawContours(img, img_contours, -1, (0, 255, 0))

Display the image:

cv2.imshow('Image Contours', img)
 
cv2.waitKey(0)

The result will be:

Remove Background from an image

To remove the background from an image, we will find the contours to detect edges of the main object and create a mask with np.zeros for the background and then combine the mask and the image using the bitwise_and operator.

Consider the example below:

Import the modules (NumPy and cv2):

import cv2
 
import numpy as np

Read the image and convert the image into a grayscale image:

img = cv2.imread("py.jpg")
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Find the threshold:

 _, thresh = cv2.threshold(gray_img, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

In the threshold() method, the last argument defines the style of the threshold. See Official documentation of OpenCV threshold.

Find the image contours:

 img_contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2]

Sort the contours:

img_contours = sorted(img_contours, key=cv2.contourArea)
 
for i in img_contours:
 
    if cv2.contourArea(i) > 100:
 
        break

Generate the mask using np.zeros:

 mask = np.zeros(img.shape[:2], np.uint8)

Draw contours:

 cv2.drawContours(mask, [i],-1, 255, -1)

Apply the bitwise_and operator:

 new_img = cv2.bitwise_and(img, img, mask=mask)

Display the original image:

 cv2.imshow("Original Image", img)

Display the resultant image:

cv2.imshow("Image with background removed", new_img)
 
cv2.waitKey(0)

Image processing is fun when using OpenCV as you saw. I hope you find the tutorial useful. Keep coming back.

Thank you.

Introduction to Python Hex() Function for Beginners

The Python hex() function is used to convert any integer number (in base 10) to the corresponding hexadecimal number. Notably, the given input should be in base 10. The hex function is one of the built-in functions in Python 3, used to convert an integer number into its corresponding hexadecimal form.

Python Hex Example

The hex() function converts the integer to the corresponding hexadecimal number in string form and returns it.

The input integer argument can be in any base such as binary, octal, etc. Python will take care of converting them to hexadecimal format.

Syntax

hex(number)

number: an integer that will be converted into a hexadecimal value.
The function converts the number into hexadecimal form and then returns that hexadecimal number as a string.

Please note that the return value always starts with ‘0x’ (without quotes), which indicates that the number is in hexadecimal format.

# app.py

print("Enter the number: ")

# taking input from user
num = int(input())

# converting the number into hexadecimal form
h1 = hex(num)

# Printing hexadecimal form
print("The ", num, " in hexadecimal is: ", h1)

# Converting float number to hexadecimal form
print("\nEnter a float number")
num2 = float(input())

# converting into hexadecimal form
# for float we have to use float.hex() here
h2 = float.hex(num2)

# printing result
print("The ", num2, " in hexadecimal is: ", h2)

In the above example, we used the Python input() function to take the input from the user.

See the output.

Enter the number:
541
The  541  in hexadecimal is:  0x21d
    
Enter a float number
123.54
The  123.54  in hexadecimal is:  0x1.ee28f5c28f5c3p+6

Python hex() without 0x

See the following program.

# app.py

print("Enter the number: ")

# taking input from user
num = int(input())

# converting the number into hexadecimal form
h1 = hex(num)

# Printing hexadecimal form
# we have used string slicing here
print("The ", num, " in hexadecimal is: ", h1[2:])

# Converting float number to hexadecimal form
print("\nEnter a float number")
num2 = float(input())

# converting into hexadecimal form
h2 = float.hex(num2)

# printing result
print("The ", num2, " in hexadecimal is: ", h2[2:])

See the output.

Enter the number:
541
The  541  in hexadecimal is:  21d

Enter a float number
123.65
The  123.65  in hexadecimal is:  1.ee9999999999ap+6

In the above program, we have used string slicing to print the result without ‘0x’.

We slice from index 2 to the end of the string, i.e., h1[2:]; this skips the first two characters (‘0x’) and prints the rest of the string.
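
As a side note (this is not part of the original program), Python can also produce the hexadecimal string without the ‘0x’ prefix directly, using format() or an f-string:

# Alternative to slicing: format the integer as hex without the '0x' prefix
print(format(541, 'x'))   # 21d
print(f"{541:x}")         # 21d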

Hexadecimal representation of float in Python

See the following program.

# app.py

numberEL = 11.21
print(numberEL, 'in hex =', float.hex(numberEL))

numberK = 19.21
print(numberK, 'in hex =', float.hex(numberK))

See the output.

➜  pyt python3 app.py
11.21 in hex = 0x1.66b851eb851ecp+3
19.21 in hex = 0x1.335c28f5c28f6p+4
➜  pyt

Python hex() with object

See the following code.

# app.py

class AI:
    id = 0

    def __index__(self):
        print('__index__() function called')
        return self.rank


stockfish = AI()
stockfish.rank = 2900

print(hex(stockfish))

In the above example, we have defined the __index__() method so that we can use the object with the hex() function.

See the output.

➜  pyt python3 app.py
__index__() function called
0xb54
➜  pyt

How to convert hex string to int in Python

Without the 0x prefix, you need to specify the base explicitly. Otherwise, it won’t work.

See the following code.

# app.py

data = int("0xa", 16)
print(data)

With the 0x prefix, Python can distinguish hex and decimal automatically.

You must specify 0 as the base to invoke this prefix-guessing behavior; omitting the second parameter means base 10 is assumed.
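
This snippet is not part of the original examples; it is a small sketch of int() with base 0, which infers the base from the string's prefix:

# base 0 tells int() to infer the base from the prefix
print(int("0xa", 0))    # 10 (hex prefix)
print(int("0o12", 0))   # 10 (octal prefix)
print(int("10", 0))     # 10 (no prefix -> decimal)
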
If you want to convert a string to an int with a known base, pass that base to int() explicitly. Both of the strings below convert correctly this way.

# app.py

hexStrA = "0xffff"
hexStrB = "ffff"

print(int(hexStrA, 16))
print(int(hexStrB, 16))

See the output.

➜  pyt python3 app.py
65535
65535
➜  pyt

In all the above examples, we have used the Python int() method.

Thanks for reading!

Top 10 Books To Learn Python

This video on 'Top 10 Books To Learn Python' will suggest to you what we think are the best books for Python, even if you are an experienced programmer or a complete beginner. Below are the topics covered in this video:

  • Why Python?
  • Beginner-Level Python Books
  • Domain-Specific Python Books
  • Bonus Python Book

Links for the Python Books:

  1. Learning Python by Mark Lutz: http://bit.ly/2BR38aY
  2. Python Crash Course by Eric Matthews: http://bit.ly/2BLlJ8i
  3. Think Python by Allen Downey: http://bit.ly/2pjoXNC
  4. Python Programming by John M Zelle: http://bit.ly/31SkYon
  5. Python in a Nutshell by Alex Martelli: http://bit.ly/32UOyeh
  6. Programming Python by Mark Lutz: https://amzn.to/31Slhj1
  7. Effective Computation in Physics by Anthony Scopatz, Kathryn D. Huff: http://bit.ly/2BPD00c
  8. Python for Data Analysis by Wes McKinney: http://bit.ly/2pWCaMo
  9. Python Machine Learning by Sebastian Raschka and Vahid Mirjalili: https://amzn.to/36amOV3
  10. Django for Beginners by William S. Vincent: https://amzn.to/36lQtuG

A Complete Machine Learning Project Walk-Through in Python

A Complete Machine Learning Project Walk-Through in Python: Putting the machine learning pieces together; Model Selection, Hyperparameter Tuning, and Evaluation; Interpreting a machine learning model and presenting results

Reading through a data science book or taking a course, it can feel like you have the individual pieces, but don’t quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but persevering and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.

We’ll follow the general machine learning workflow step-by-step:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Along the way, we’ll see how each step flows into the next and how to specifically implement each part in Python. The complete project is available on GitHub, with the first notebook here.

(As a note, this problem was originally given to me as an “assignment” for a job screen at a start-up. After completing the work, I was offered the job, but then the CTO of the company quit and they weren’t able to bring on any new employees. I guess that’s how things go on the start-up scene!)

Problem Definition

The first step before we get coding is to understand the problem we are trying to solve and the available data. In this project, we will work with publicly available building energy data from New York City.

The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score.

The data includes the Energy Star Score, which makes this a supervised regression machine learning task:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

We want to develop a model that is both **accurate** — it can predict the Energy Star Score close to the true value — and **interpretable** — we can understand the model predictions. Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.

Data Cleaning

Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you mtcars and iris datasets). Real-world data is messy which means we need to clean and wrangle it into an acceptable format before we can even start the analysis. Data cleaning is an un-glamorous, but necessary part of most actual data science problems.

First, we can load in the data as a Pandas DataFrame and take a look:

import pandas as pd
import numpy as np

# Read in data into a dataframe 
data = pd.read_csv('data/Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')

# Display top of dataframe
data.head()

This is a subset of the full data, which contains 60 columns. Already, we can see a couple of issues: first, we know that we want to predict the Energy Star Score, but we don’t know what any of the columns mean. While this isn’t necessarily an issue — we can often make an accurate model without any knowledge of the variables — we want to focus on interpretability, and it might be important to understand at least some of the columns.

When I originally got the assignment from the start-up, I didn’t want to ask what all the column names meant, so I looked at the name of the file and decided to search for “Local Law 84”. That led me to this page, which explains that this is an NYC law requiring all buildings of a certain size to report their energy use. More searching brought me to all the definitions of the columns. Maybe looking at a file name is an obvious place to start, but for me this was a reminder to go slow so you don’t miss anything important!

We don’t need to study all of the columns, but we should at least understand the Energy Star Score, which is described as:

A 1-to-100 percentile ranking based on self-reported energy usage for the reporting year. The Energy Star score is a relative measure used for comparing the energy efficiency of buildings.

That clears up the first problem, but the second issue is that missing values are encoded as “Not Available”. This is a string in Python, which means that even the columns containing numbers will be stored as object datatypes, because Pandas converts a column with any strings into a column of all strings. We can see the datatypes of the columns using the dataframe.info() method:

# See the column data types and non-missing values
data.info()

Sure enough, some of the columns that clearly contain numbers (such as ft²), are stored as objects. We can’t do numerical analysis on strings, so these will have to be converted to number (specifically float) data types!

Here’s a little Python code that replaces all the “Not Available” entries with not a number (np.nan), which Pandas treats as a numeric (float) value, and then converts the relevant columns to the float datatype:

# Replace all occurrences of Not Available with numpy not a number
data = data.replace({'Not Available': np.nan})

# Iterate through the columns
for col in list(data.columns):
    # Select columns that should be numeric
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in col
            or 'therms' in col or 'gal' in col or 'Score' in col):
        # Convert the data type to float
        data[col] = data[col].astype(float)

Once the correct columns are numbers, we can start to investigate the data.

Missing Data and Outliers

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let’s get a sense of how many missing values are in each column (see the notebook for code).

(To create this table, I used a function from this Stack Overflow forum.)

While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem (here is a discussion), and for this project, we will remove any columns with more than 50% missing values.
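The notebook holds the exact helper used for that table, but as a rough sketch the same information can be computed directly with Pandas and used to apply the 50% rule:

# Percentage of missing values in each column (a sketch; the notebook's
# helper formats this as a nicer table)
missing_pct = data.isnull().mean() * 100

# Drop any column where more than 50% of the values are missing
cols_to_drop = missing_pct[missing_pct > 50].index
data = data.drop(columns = cols_to_drop)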

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

  • An observation below the first quartile minus 3 times the interquartile range
  • An observation above the third quartile plus 3 times the interquartile range

(For the code to remove the columns and the anomalies, see the notebook). At the end of the data cleaning and anomaly removal process, we are left with over 11,000 buildings and 49 features.
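As a minimal sketch of the extreme-outlier rule applied to a single numeric column (using Site EUI as an example; the notebook applies it as part of the cleaning step):

# Quartiles and interquartile range for one column
col = 'Site EUI (kBtu/ft²)'
first_quartile = data[col].describe()['25%']
third_quartile = data[col].describe()['75%']
iqr = third_quartile - first_quartile

# Keep only the rows inside [Q1 - 3 * IQR, Q3 + 3 * IQR]
data = data[(data[col] > (first_quartile - 3 * iqr)) &
            (data[col] < (third_quartile + 3 * iqr))]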

Exploratory Data Analysis

Now that the tedious — but necessary — step of data cleaning is complete, we can move on to exploring our data! Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.

In short, the goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find interesting parts of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Single Variable Plots

The goal is to predict the Energy Star Score (renamed to score in our data) so a reasonable place to start is examining the distribution of this variable. A histogram is a simple yet effective way to visualize the distribution of a single variable and is easy to make using matplotlib.

import matplotlib.pyplot as plt

# Histogram of the Energy Star Score
plt.style.use('fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 
plt.title('Energy Star Score Distribution');

This looks quite suspicious! The Energy Star score is a percentile rank, which means we would expect to see a uniform distribution, with each score assigned to the same number of buildings. However, a disproportionate number of buildings have either the highest, 100, or the lowest, 1, score (higher is better for the Energy Star score).

If we go back to the definition of the score, we see that it is based on “self-reported energy usage” which might explain the very high scores. Asking building owners to report their own energy usage is like asking students to report their own scores on a test! As a result, this probably is not the most objective measure of a building’s energy efficiency.

If we had an unlimited amount of time, we might want to investigate why so many buildings have very high and very low scores, which we could do by selecting these buildings and seeing what they have in common. However, our objective is only to predict the score, not to devise a better method of scoring buildings! We can make a note in our report that the scores have a suspect distribution, but our main focus is on predicting the score.

Looking for Relationships

A major part of EDA is searching for relationships between the features and the target. Variables that are correlated with the target are useful to a model because they can be used to predict the target. One way to examine the effect of a categorical variable (which takes on only a limited set of values) on the target is through a density plot using the seaborn library.

A density plot can be thought of as a smoothed histogram because it shows the distribution of a single variable. We can color a density plot by class to see how a categorical variable changes the distribution. The following code makes a density plot of the Energy Star Score colored by the type of building (limited to building types with more than 100 data points):

import seaborn as sns

# Create a list of building types with more than 100 observations
types = data.dropna(subset=['score'])
types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)

# Plot of distribution of scores for building categories
plt.figure(figsize=(12, 10))

# Plot each building
for b_type in types:
    # Select the building type
    subset = data[data['Largest Property Use Type'] == b_type]
    
    # Density plot of Energy Star scores
    sns.kdeplot(subset['score'].dropna(),
               label = b_type, shade = False, alpha = 0.8);
    
# label the plot
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);

We can see that the building type has a significant impact on the Energy Star Score. Office buildings tend to have a higher score while Hotels have a lower score. This tells us that we should include the building type in our modeling because it does have an impact on the target. As a categorical variable, we will have to one-hot encode the building type.

A similar plot can be used to show the Energy Star Score by borough:
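The code is not repeated here, but it mirrors the building-type plot above, only grouping by the Borough column (a sketch):

# Boroughs with more than 100 scored buildings
boroughs = data.dropna(subset=['score'])
boroughs = boroughs['Borough'].value_counts()
boroughs = list(boroughs[boroughs.values > 100].index)

# Density plot of scores for each borough
plt.figure(figsize=(12, 10))
for borough in boroughs:
    subset = data[data['Borough'] == borough]
    sns.kdeplot(subset['score'].dropna(), label = borough)

plt.xlabel('Energy Star Score'); plt.ylabel('Density')
plt.title('Density Plot of Energy Star Scores by Borough');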

The borough does not seem to have as large of an impact on the score as the building type. Nonetheless, we might want to include it in our model because there are slight differences between the boroughs.

To quantify relationships between variables, we can use the Pearson Correlation Coefficient. This is a measure of the strength and direction of a linear relationship between two variables. A score of +1 is a perfectly linear positive relationship and a score of -1 is a perfectly negative linear relationship. Several values of the correlation coefficient are shown below:

While the correlation coefficient cannot capture non-linear relationships, it is a good way to start figuring out how variables are related. In Pandas, we can easily calculate the correlations between any columns in a dataframe:

# Find all correlations with the score and sort 
correlations_data = data.corr()['score'].sort_values()

The most negative (left) and positive (right) correlations with the target:

There are several strong negative correlations between the features and the target, with the most negative being the different categories of EUI (these measures vary slightly in how they are calculated). The EUI — Energy Use Intensity — is the amount of energy used by a building divided by the square footage of the building. It is meant to be a measure of the efficiency of a building, with a lower value being better. Intuitively, these correlations make sense: as the EUI increases, the Energy Star Score tends to decrease.

Two-Variable Plots

To visualize relationships between two continuous variables, we use scatterplots. We can include additional information, such as a categorical variable, in the color of the points. For example, the following plot shows the Energy Star Score vs. Site EUI colored by the building type:

This plot lets us visualize what a correlation coefficient of -0.7 looks like. As the Site EUI decreases, the Energy Star Score increases, a relationship that holds steady across the building types.

The final exploratory plot we will make is known as the Pairs Plot. This is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we are using the seaborn visualization library and the PairGrid function to create a Pairs Plot with scatterplots on the upper triangle, histograms on the diagonal, and 2D kernel density plots and correlation coefficients on the lower triangle.

# Extract the columns to plot
plot_data = features[['score', 'Site EUI (kBtu/ft²)', 
                      'Weather Normalized Source EUI (kBtu/ft²)', 
                      'log_Total GHG Emissions (Metric Tons CO2e)']]

# Replace the inf with nan
plot_data = plot_data.replace({np.inf: np.nan, -np.inf: np.nan})

# Rename columns 
plot_data = plot_data.rename(columns = {'Site EUI (kBtu/ft²)': 'Site EUI', 
                                        'Weather Normalized Source EUI (kBtu/ft²)': 'Weather Norm EUI',
                                        'log_Total GHG Emissions (Metric Tons CO2e)': 'log GHG Emissions'})

# Drop na values
plot_data = plot_data.dropna()

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, size = 3)

# Upper is a scatter plot
grid.map_upper(plt.scatter, color = 'red', alpha = 0.6)

# Diagonal is a histogram
grid.map_diag(plt.hist, color = 'red', edgecolor = 'black')

# Bottom is correlation and density plot
grid.map_lower(corr_func);
grid.map_lower(sns.kdeplot, cmap = plt.cm.Reds)

# Title for entire plot
plt.suptitle('Pairs Plot of Energy Data', size = 36, y = 1.02);

To see interactions between variables, we look for where a row intersects with a column. For example, to see the correlation of Weather Norm EUI with score, we look in the Weather Norm EUI row and the score column and see a correlation coefficient of -0.67. In addition to looking cool, plots such as these can help us decide which variables to include in modeling.

Feature Engineering and Selection

Feature engineering and selection often provide the greatest return on time invested in a machine learning problem. First of all, let’s define what these two tasks are:

  • Feature engineering: the process of taking raw data and extracting or creating new features from it, for example by transforming variables or one-hot encoding categorical variables
  • Feature selection: the process of choosing only the most relevant features, dropping those that are redundant or uninformative

A machine learning model can only learn from the data we provide it, so ensuring that data includes all the relevant information for our task is crucial. If we don’t feed a model the correct data, then we are setting it up to fail and we should not expect it to learn!

For this project, we will take the following feature engineering steps:

  • One-hot encode the categorical variables (borough and property use type)
  • Add the natural log transformation of the numerical variables

One-hot encoding is necessary to include categorical variables in a model. A machine learning algorithm cannot understand a building type of “office”, so we have to record it as a 1 if the building is an office and a 0 otherwise.

Adding transformed features can help our model learn non-linear relationships within the data. Taking the square root, natural log, or various powers of features is common practice in data science and can be based on domain knowledge or what works best in practice. Here we will include the natural log of all numerical features.

The following code selects the numeric features, takes log transformations of these features, selects the two categorical features, one-hot encodes these features, and joins the two sets together. This seems like a lot of work, but it is relatively straightforward in Pandas!

# Copy the original data
features = data.copy()

# Select the numeric columns
numeric_subset = data.select_dtypes('number')

# Create columns with log of numeric columns
for col in numeric_subset.columns:
    # Skip the Energy Star Score column
    if col == 'score':
        continue
    else:
        numeric_subset['log_' + col] = np.log(numeric_subset[col])
        
# Select the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]

# One hot encode
categorical_subset = pd.get_dummies(categorical_subset)

# Join the two dataframes using concat
# Make sure to use axis = 1 to perform a column bind
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

After this process we have over 11,000 observations (buildings) with 110 columns (features). Not all of these features are likely to be useful for predicting the Energy Star Score, so now we will turn to feature selection to remove some of the variables.

Feature Selection

Many of the 110 features we have in our data are redundant because they are highly correlated with one another. For example, here is a plot of Site EUI vs Weather Normalized Site EUI which have a correlation coefficient of 0.997.

Features that are strongly correlated with each other are known as collinear and removing one of the variables in these pairs of features can often help a machine learning model generalize and be more interpretable. (I should point out we are talking about correlations of features with other features, not correlations with the target, which help our model!)

There are a number of methods to calculate collinearity between features, one of the most common being the variance inflation factor. In this project, we will use the correlation coefficient to identify and remove collinear features. We will drop one of a pair of features if the correlation coefficient between them is greater than 0.6. For the implementation, take a look at the notebook (and this Stack Overflow answer).

While this value may seem arbitrary, I tried several different thresholds, and this choice yielded the best model. Machine learning is an empirical field and is often about experimenting and finding what performs best! After feature selection, we are left with 64 total features and 1 target.
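The notebook has the full implementation, but the core of the idea looks roughly like this (a simplified sketch that, unlike the notebook version, does not explicitly protect the score column):

# Absolute correlations between every pair of features
threshold = 0.6
corr_matrix = features.corr().abs()

# Keep only the upper triangle so each pair is considered exactly once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

# Drop one feature from every pair correlated above the threshold
to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
features = features.drop(columns = to_drop)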

# Remove any columns with all na values
features  = features.dropna(axis=1, how = 'all')
print(features.shape)

(11319, 65)

Establishing a Baseline

We have now completed data cleaning, exploratory data analysis, and feature engineering. The final step to take before getting started with modeling is establishing a naive baseline. This is essentially a guess against which we can compare our results. If the machine learning models do not beat this guess, then we might have to conclude that machine learning is not suitable for the task, or that we need to try a different approach.

For regression problems, a reasonable naive baseline is to guess the median value of the target on the training set for all the examples in the test set. This sets a relatively low bar for any model to surpass.

The metric we will use is mean absolute error (mae) which measures the average absolute error on the predictions. There are many metrics for regression, but I like Andrew Ng’s advice to pick a single metric and then stick to it when evaluating models. The mean absolute error is easy to calculate and is interpretable.

Before calculating the baseline, we need to split our data into a training and a testing set:

  • Training set: the features and targets we provide to the model during training so it can learn a mapping between the two
  • Testing set: a held-out portion of the data used only to evaluate the trained model, simulating how it would perform on buildings it has never seen

We will use 70% of the data for training and 30% for testing:

from sklearn.model_selection import train_test_split

# targets is the 'score' column separated out from the features (see the notebook)
# Split into 70% training and 30% testing set
X, X_test, y, y_test = train_test_split(features, targets, 
                                        test_size = 0.3, 
                                        random_state = 42)

Now we can calculate the naive baseline performance:

# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))
The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164

The naive estimate is off by about 25 points on the test set. The score ranges from 1–100, so this represents an error of 25%, quite a low bar to surpass!

Conclusions

In this article we walked through the first three steps of a machine learning problem. After defining the question, we:

  1. Cleaned and formatted the raw data
  2. Performed an exploratory data analysis to learn about the dataset
  3. Developed a set of features for our models through feature engineering and selection

Finally, we also completed the crucial step of establishing a baseline against which we can judge our machine learning algorithms.

A Complete Machine Learning Walk-Through in Python (Part Two): Model Selection, Hyperparameter Tuning, and Evaluation

Model Evaluation and Selection

As a reminder, we are working on a supervised regression task: using New York City building energy data, we want to develop a model that can predict the Energy Star Score of a building. Our focus is on both accuracy of the predictions and interpretability of the model.

There are a ton of machine learning models to choose from and deciding where to start can be intimidating. While there are some charts that try to show you which algorithm to use, I prefer to just try out several and see which one works best! Machine learning is still a field driven primarily by empirical (experimental) rather than theoretical results, and it’s almost impossible to know ahead of time which model will do the best.

Generally, it’s a good idea to start out with simple, interpretable models such as linear regression, and if the performance is not adequate, move on to more complex, but usually more accurate methods. The following chart shows a (highly unscientific) version of the accuracy vs interpretability trade-off:

We will evaluate five different models covering the complexity spectrum:

  • Linear Regression
  • K-Nearest Neighbors Regression
  • Random Forest Regression
  • Support Vector Machine Regression
  • Gradient Boosted Regression

In this post we will focus on implementing these methods rather than the theory behind them. For anyone interested in learning the background, I highly recommend An Introduction to Statistical Learning (available free online) or Hands-On Machine Learning with Scikit-Learn and TensorFlow. Both of these textbooks do a great job of explaining the theory and showing how to effectively use the methods in R and Python respectively.

Imputing Missing Values

While we dropped the columns with more than 50% missing values when we cleaned the data, there are still quite a few missing observations. Machine learning models cannot deal with any absent values, so we have to fill them in, a process known as imputation.

First, we’ll read in all the data and remind ourselves what it looks like:

import pandas as pd
import numpy as np

# Read in data into dataframes 
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')

Training Feature Size:  (6622, 64)
Testing Feature Size:   (2839, 64)
Training Labels Size:   (6622, 1)
Testing Labels Size:    (2839, 1)

Every value that is NaN represents a missing observation. While there are a number of ways to fill in missing data, we will use a relatively simple method, median imputation. This replaces all the missing values in a column with the median value of the column.

In the following code, we create a Scikit-Learn Imputer object with the strategy set to median. We then train this object on the training data (using imputer.fit) and use it to fill in the missing values in both the training and testing data (using imputer.transform). This means missing values in the test data are filled in with the corresponding median value from the training data.

(We have to do imputation this way rather than training on all the data to avoid the problem of test data leakage, where information from the testing dataset spills over into the training data.)

from sklearn.preprocessing import Imputer

# Create an imputer object with a median filling strategy
# (in newer versions of Scikit-Learn, use SimpleImputer from sklearn.impute)
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(train_features)

# Transform both training data and testing data
X = imputer.transform(train_features)
X_test = imputer.transform(test_features)

Missing values in training features:  0
Missing values in testing features:   0

All of the features now have real, finite values with no missing examples.

Feature Scaling

Scaling refers to the general process of changing the range of a feature. This is necessary because features are measured in different units, and therefore cover different ranges. Methods such as support vector machines and K-nearest neighbors that take into account distance measures between observations are significantly affected by the range of the features and scaling allows them to learn. While methods such as Linear Regression and Random Forest do not actually require feature scaling, it is still best practice to take this step when we are comparing multiple algorithms.

We will scale the features by putting each one in a range between 0 and 1. This is done by taking each value of a feature, subtracting the minimum value of the feature, and dividing by the maximum minus the minimum (the range). This specific version of scaling is often called normalization and the other main version is known as standardization.

While this process would be easy to implement by hand, we can do it using a MinMaxScaler object in Scikit-Learn. The code for this method is identical to that for imputation except with a scaler instead of imputer! Again, we make sure to train only using training data and then transform all the data.

from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)

Every feature now has a minimum value of 0 and a maximum value of 1. Missing value imputation and feature scaling are two steps required in nearly any machine learning pipeline so it’s a good idea to understand how they work!

Implementing Machine Learning Models in Scikit-Learn

After all the work we spent cleaning and formatting the data, actually creating, training, and predicting with the models is relatively simple. We will use the Scikit-Learn library in Python, which has great documentation and a consistent model building syntax. Once you know how to make one model in Scikit-Learn, you can quickly implement a diverse range of algorithms.

We can illustrate one example of model creation, training (using .fit ) and testing (using .predict ) with the Gradient Boosting Regressor:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model
gradient_boosted = GradientBoostingRegressor()

# Fit the model on the training data
gradient_boosted.fit(X, y)

# Make predictions on the test data
predictions = gradient_boosted.predict(X_test)

# Evaluate the model
mae = np.mean(abs(predictions - y_test))

print('Gradient Boosted Performance on the test set: MAE = %0.4f' % mae)
Gradient Boosted Performance on the test set: MAE = 10.0132

Model creation, training, and testing are each one line! To build the other models, we use the same syntax, with the only change being the name of the algorithm. The results are presented below:
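As a sketch of that idea (hyperparameters left at their defaults, and assuming y and y_test are one-dimensional arrays of scores as used with mae above), the remaining models can be evaluated in a simple loop:

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# The same fit/predict/evaluate pattern works for every estimator
models = {'Linear Regression': LinearRegression(),
          'K-Nearest Neighbors': KNeighborsRegressor(),
          'Support Vector Machine': SVR(),
          'Random Forest': RandomForestRegressor(random_state = 42),
          'Gradient Boosted': GradientBoostingRegressor(random_state = 42)}

for name, estimator in models.items():
    estimator.fit(X, y)
    predictions = estimator.predict(X_test)
    print('%s Performance on the test set: MAE = %0.4f' % (name, mae(y_test, predictions)))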

To put these figures in perspective, the naive baseline calculated using the median value of the target was 24.5. Clearly, machine learning is applicable to our problem because of the significant improvement over the baseline!

The gradient boosted regressor (MAE = 10.013) slightly beats out the random forest (10.014 MAE). These results aren’t entirely fair because we are mostly using the default values for the hyperparameters. Especially in models such as the support vector machine, the performance is highly dependent on these settings. Nonetheless, from these results we will select the gradient boosted regressor for model optimization.

Hyperparameter Tuning for Model Optimization

In machine learning, after we have selected a model, we can optimize it for our problem by tuning the model hyperparameters.

First off, what are hyperparameters and how do they differ from parameters?

  • Model hyperparameters are settings chosen by the data scientist before training, such as the number of trees in a random forest or the maximum depth of each tree
  • Model parameters are what the algorithm learns from the training data, such as the coefficients in a linear regression

Controlling the hyperparameters affects the model performance by altering the balance between underfitting and overfitting in a model. Underfitting is when our model is not complex enough (it does not have enough degrees of freedom) to learn the mapping from features to target. An underfit model has high bias, which we can correct by making our model more complex.

Overfitting is when our model essentially memorizes the training data. An overfit model has high variance, which we can correct by limiting the complexity of the model through regularization. Both an underfit and an overfit model will not be able to generalize well to the testing data.

The problem with choosing the right hyperparameters is that the optimal set will be different for every machine learning problem! Therefore, the only way to find the best settings is to try out a number of them on each new dataset. Luckily, Scikit-Learn has a number of methods to allow us to efficiently evaluate hyperparameters. Moreover, projects such as TPOT by Epistasis Lab are trying to optimize the hyperparameter search using methods like genetic programming. In this project, we will stick to doing this with Scikit-Learn, but stay tuned for more work on the auto-ML scene!

Random Search with Cross Validation

The particular hyperparameter tuning method we will implement is called random search with cross validation:

  • Random search refers to how we choose the hyperparameter combinations to evaluate: we define a grid of possible values and randomly sample combinations from it, rather than exhaustively trying every one
  • Cross validation is how we assess each combination: the training data is split into K folds, the model is trained on K - 1 folds and evaluated on the remaining fold, and this is repeated until every fold has served as the validation set

The idea of K-Fold cross validation with K = 5 is shown below:

The entire process of performing random search with cross validation is:

  1. Set up a grid of hyperparameter values
  2. Randomly sample a combination of values from the grid
  3. Train a model with that combination
  4. Evaluate the model using K-fold cross validation
  5. Repeat for a fixed number of iterations and keep the combination with the best cross-validation score

Of course, we don’t actually do this manually, but rather let Scikit-Learn’s RandomizedSearchCV handle all the work!

Slight Diversion: Gradient Boosted Methods

Since we will be using the Gradient Boosted Regression model, I should give at least a little background! This model is an ensemble method, meaning that it is built out of many weak learners, in this case individual decision trees. While a bagging algorithm such as random forest trains the weak learners in parallel and has them vote to make a prediction, a boosting method like Gradient Boosting, trains the learners in sequence, with each learner “concentrating” on the mistakes made by the previous ones.

Boosting methods have become popular in recent years and frequently win machine learning competitions. The Gradient Boosting Method is one particular implementation that uses Gradient Descent to minimize the cost function by sequentially training learners on the residuals of previous ones. The Scikit-Learn implementation of Gradient Boosting is generally regarded as less efficient than other libraries such as XGBoost , but it will work well enough for our small dataset and is quite accurate.

Back to Hyperparameter Tuning

There are many hyperparameters to tune in a Gradient Boosted Regressor and you can look at the Scikit-Learn documentation for the details. We will optimize the following hyperparameters:

  • loss: the loss function to be optimized
  • n_estimators: the number of weak learners (decision trees) used in the boosting process
  • max_depth: the maximum depth of each decision tree
  • min_samples_leaf: the minimum number of examples required at a leaf node
  • min_samples_split: the minimum number of examples required to split a node
  • max_features: the maximum number of features considered when making splits

I’m not sure if there is anyone who truly understands how all of these interact, and the only way to find the best combination is to try them out!

In the following code, we build a hyperparameter grid, create a RandomizedSearchCV object, and perform hyperparameter search using 4-fold cross validation over 25 different combinations of hyperparameters:

from sklearn.model_selection import RandomizedSearchCV

# Loss function to be optimized
loss = ['ls', 'lad', 'huber']

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

# Create the model to use for hyperparameter tuning
model = GradientBoostingRegressor(random_state = 42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25, 
                               scoring = 'neg_mean_absolute_error',
                               n_jobs = -1, verbose = 1, 
                               return_train_score = True,
                               random_state=42)

# Fit on the training data
random_cv.fit(X, y)

After performing the search, we can inspect the RandomizedSearchCV object to find the best model:

# Find the best combination of settings
random_cv.best_estimator_

GradientBoostingRegressor(loss='lad', max_depth=5,
                          max_features=None,
                          min_samples_leaf=6,
                          min_samples_split=6,
                          n_estimators=500)

We can then use these results to perform grid search by choosing parameters for our grid that are close to these optimal values. However, further tuning is unlikely to significantly improve our model. As a general rule, proper feature engineering will have a much larger impact on model performance than even the most extensive hyperparameter tuning. It’s the law of diminishing returns applied to machine learning: feature engineering gets you most of the way there, and hyperparameter tuning generally only provides a small benefit.

One experiment we can try is to change the number of estimators (decision trees) while holding the rest of the hyperparameters steady. This directly lets us observe the effect of this particular setting. See the notebook for the implementation, but here are the results:
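A sketch of that experiment (the notebook measures the error with cross validation; here the tuned settings from above are held fixed and only n_estimators changes, again assuming y and y_test are one-dimensional score arrays):

from sklearn.ensemble import GradientBoostingRegressor

# Vary only the number of trees, holding the other tuned settings fixed
for n_trees in [100, 300, 500, 800, 1100, 1500]:
    model = GradientBoostingRegressor(loss = 'lad', max_depth = 5, max_features = None,
                                      min_samples_leaf = 6, min_samples_split = 6,
                                      n_estimators = n_trees, random_state = 42)
    model.fit(X, y)
    print('%4d trees: train MAE = %0.2f, test MAE = %0.2f'
          % (n_trees, mae(y, model.predict(X)), mae(y_test, model.predict(X_test))))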

As the number of trees used by the model increases, both the training and the testing error decrease. However, the training error decreases much more rapidly than the testing error and we can see that our model is overfitting: it performs very well on the training data, but is not able to achieve that same performance on the testing set.

We always expect at least some decrease in performance on the testing set (after all, the model can see the true answers for the training set), but a significant gap indicates overfitting. We can address overfitting by getting more training data, or by decreasing the complexity of our model through the hyperparameters. In this case, we will leave the hyperparameters where they are, but I encourage anyone to try and reduce the overfitting.

For the final model, we will use 800 estimators because that resulted in the lowest error in cross validation. Now, time to test out this model!

Evaluating on the Test Set

As responsible machine learning engineers, we made sure to not let our model see the test set at any point of training. Therefore, we can use the test set performance as an indicator of how well our model would perform when deployed in the real world.

Making predictions on the test set and calculating the performance is relatively straightforward. Here, we compare the performance of the default Gradient Boosted Regressor to the tuned model:
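The two models being compared are not constructed in the snippet below, so as a minimal sketch (using the settings found by the random search and the 800 estimators chosen above), they could be created and trained first:

from sklearn.ensemble import GradientBoostingRegressor

# Default model: out-of-the-box hyperparameters
default_model = GradientBoostingRegressor(random_state = 42)

# Final model: the tuned hyperparameters, with 800 trees
final_model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,
                                        max_features = None,
                                        min_samples_leaf = 6,
                                        min_samples_split = 6,
                                        n_estimators = 800, random_state = 42)

# Train both on the full training set
default_model.fit(X, y)
final_model.fit(X, y)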

# Make predictions on the test set using default and final model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

Default model performance on the test set: MAE = 10.0118.
Final model performance on the test set:   MAE = 9.0446.

Hyperparameter tuning improved the accuracy of the model by about 10%. Depending on the use case, 10% could be a massive improvement, but it came at a significant time investment!

We can also time how long it takes to train the two models using the %timeit magic command in Jupyter Notebooks. First is the default model:

%%timeit -n 1 -r 5
default_model.fit(X, y)

1.09 s ± 153 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

1 second to train seems very reasonable. The final tuned model is not so fast:

%%timeit -n 1 -r 5
final_model.fit(X, y)

12.1 s ± 1.33 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

This demonstrates a fundamental aspect of machine learning: it is always a game of trade-offs. We constantly have to balance accuracy vs interpretability, bias vs variance, accuracy vs run time, and so on. The right blend will ultimately depend on the problem. In our case, a 12 times increase in run-time is large in relative terms, but in absolute terms it’s not that significant.

Once we have the final predictions, we can investigate them to see if they exhibit any noticeable skew. On the left is a density plot of the predicted and actual values, and on the right is a histogram of the residuals:

The model predictions seem to follow the distribution of the actual values, although the peak in the density occurs closer to the median value (66) on the training set than to the true peak in density (which is near 100). The residuals are nearly normally distributed, although we see a few large negative values where the model predictions were far below the true values.

Conclusions

In this article we covered several steps in the machine learning workflow:

  • Imputing missing values and scaling features
  • Evaluating and comparing several machine learning models
  • Hyperparameter tuning the best model using random search with cross validation
  • Evaluating the tuned model on the test set

The results of this work showed us that machine learning is applicable to the task of predicting a building’s Energy Star Score using the available data. Using a gradient boosted regressor we were able to predict the scores on the test set to within 9.1 points of the true value. Moreover, we saw that hyperparameter tuning can increase the performance of a model at a significant cost in terms of time invested. This is one of many trade-offs we have to consider when developing a machine learning solution.

A Complete Machine Learning Walk-Through in Python (Part Three): Interpreting a machine learning model and presenting results

As a reminder, we are working through a supervised regression machine learning problem. Using New York City building energy data, we have developed a model which can predict the Energy Star Score of a building. The final model we built is a Gradient Boosted Regressor which is able to predict the Energy Star Score on the test data to within 9.1 points (on a 1–100 scale).

Model Interpretation

The gradient boosted regressor sits somewhere in the middle on the scale of model interpretability: the entire model is complex, but it is made up of hundreds of decision trees, which by themselves are quite understandable. We will look at three ways to understand how our model makes predictions:

  1. Feature importances
  2. Visualizing a single decision tree
  3. Local Interpretable Model-Agnostic Explanations (LIME)

The first two methods are specific to ensembles of trees, while the third — as you might have guessed from the name — can be applied to any machine learning model. LIME is a relatively new package and represents an exciting step in the ongoing effort to explain machine learning predictions.

Feature Importances

Feature importances attempt to show the relevance of each feature to the task of predicting the target. The technical details of feature importances are complex (they measure the mean decrease in impurity, or the reduction in error from including the feature), but we can use the relative values to compare which features are the most relevant. In Scikit-Learn, we can extract the feature importances from any ensemble of tree-based learners.

With model as our trained model, we can find the feature importances using model.feature_importances_. Then we can put them into a pandas DataFrame and display or plot the ten most important:

import pandas as pd

# model is the trained model
importances = model.feature_importances_

# train_features is the dataframe of training features
feature_list = list(train_features.columns)

# Extract the feature importances into a dataframe
feature_results = pd.DataFrame({'feature': feature_list, 
                                'importance': importances})

# Show the top 10 most important
feature_results = feature_results.sort_values('importance', 
                                              ascending = False).reset_index(drop=True)

feature_results.head(10)
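A quick sketch of the plotting half of that step, using the DataFrame’s built-in plotting:

import matplotlib.pyplot as plt

# Horizontal bar chart of the 10 most important features
feature_results.head(10).plot(x = 'feature', y = 'importance',
                              kind = 'barh', edgecolor = 'k',
                              legend = False, figsize = (10, 6))
plt.xlabel('Relative Importance'); plt.title('Feature Importances');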

The Site EUI (Energy Use Intensity) and the Weather Normalized Site Electricity Intensity are by far the most important features, accounting for over 66% of the total importance. After the top two features, the importance drops off significantly, which indicates we might not need to retain all 64 features in the data to achieve high performance. (In the Jupyter notebook, I take a look at using only the top 10 features and discover that the model is not quite as accurate.)

Based on these results, we can finally answer one of our initial questions: the most important indicators of a building’s Energy Star Score are the Site EUI and the Weather Normalized Site Electricity Intensity. While we do want to be careful about reading too much into the feature importances, they are a useful way to start to understand how the model makes its predictions.

Visualizing a Single Decision Tree

While the entire gradient boosting regressor may be difficult to understand, any one individual decision tree is quite intuitive. We can visualize any tree in the forest using the Scikit-Learn function [export_graphviz](http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html). We first extract a tree from the ensemble and then save it as a dot file:

from sklearn import tree

# Extract a single tree (number 105)
single_tree = model.estimators_[105][0]

# Save the tree to a dot file
tree.export_graphviz(single_tree, out_file = 'images/tree.dot', 
                     feature_names = feature_list)

Using the Graphviz visualization software we can convert the dot file to a png from the command line:

dot -Tpng images/tree.dot -o images/tree.png

The result is a complete decision tree:

This is a little overwhelming! Even though this tree only has a depth of 6 (the number of layers), it’s difficult to follow. We can modify the call to export_graphviz and limit our tree to a more reasonable depth of 2:
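A sketch of that call (the output filename here is just illustrative):

# Export the same tree again, limited to a depth of 2 for readability
tree.export_graphviz(single_tree, out_file = 'images/tree_small.dot',
                     feature_names = feature_list,
                     max_depth = 2, filled = True, rounded = True)

Running the same dot command on the new file then produces a much smaller, more readable tree.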

Each node (box) in the tree has four pieces of information:

  1. The question asked about the value of one feature of the data point: the answer determines whether we move left or right out of the node
  2. mse: a measure of the error at the node
  3. samples: the number of training examples in the node
  4. value: the estimate of the target for all the examples in the node

(Leaf nodes only have 2.–4. because they represent the final estimate and do not have any children).

A decision tree makes a prediction for a data point by starting at the top node, called the root, and working its way down through the tree. At each node, a yes-or-no question is asked of the data point. For example, the question for the node above is: Does the building have a Site EUI less than or equal to 68.95? If the answer is yes, the building is placed in the right child node, and if the answer is no, the building goes to the left child node.

This process is repeated at each layer of the tree until the data point is placed in a leaf node, at the bottom of the tree (the leaf nodes are cropped from the small tree image). The prediction for all the data points in a leaf node is the value. If there are multiple data points ( samples ) in a leaf node, they all get the same prediction. As the depth of the tree is increased, the error on the training set will decrease because there are more leaf nodes and the examples can be more finely divided. However, a tree that is too deep will overfit to the training data and will not be able to generalize to new testing data.

In the second article, we tuned a number of the model hyperparameters, which control aspects of each tree such as the maximum depth of the tree and the minimum number of samples required in a leaf node. These both have a significant impact on the balance of under vs over-fitting, and visualizing a single decision tree allows us to see how these settings work.

Although we cannot examine every tree in the model, looking at one lets us understand how each individual learner makes a prediction. This flowchart-based method seems much like how a human makes decisions, answering one question about a single value at a time. Decision-tree-based ensembles combine the predictions of many individual decision trees in order to create a more accurate model with less variance. Ensembles of trees tend to be very accurate, and also are intuitive to explain.

Local Interpretable Model-Agnostic Explanations (LIME)

The final tool we will explore for trying to understand how our model “thinks” is a new entry into the field of model explanations. LIME aims to explain a single prediction from any machine learning model by creating an approximation of the model locally near the data point, using a simple model such as linear regression (the full details can be found in the paper).

Here we will use LIME to examine a prediction the model gets completely wrong to see what it might tell us about why the model makes mistakes.

First we need to find the observation our model gets most wrong. We do this by training and predicting with the model and extracting the example on which the model has the greatest error:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model with the best hyperparamters
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6, 
                                  n_estimators=800, random_state=42)

# Fit and test on the features
model.fit(X, y)
model_pred = model.predict(X_test)

# Find the residuals
residuals = abs(model_pred - y_test)
    
# Extract the most wrong prediction
wrong = X_test[np.argmax(residuals), :]

print('Prediction: %0.4f' % model_pred[np.argmax(residuals)])
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])
Prediction: 12.8615
Actual Value: 100.0000

Next, we create the LIME explainer object, passing it our training data, the mode ('regression'), the training labels, and the names of the features in our data. Finally, we ask the explainer object to explain the wrong prediction, passing it the observation and the prediction function.

import lime
import lime.lime_tabular

# Create a lime explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(training_data = X, 
                                                   mode = 'regression',
                                                   training_labels = y,
                                                   feature_names = feature_list)


# Explanation for wrong prediction
exp = explainer.explain_instance(data_row = wrong, 
                                 predict_fn = model.predict)

# Plot the prediction explanation
exp.as_pyplot_figure();

The plot explaining this prediction is below:

Here’s how to interpret the plot: Each entry on the y-axis indicates one value of a variable and the red and green bars show the effect this value has on the prediction. For example, the top entry says the Site EUI is greater than 95.90 which subtracts about 40 points from the prediction. The second entry says the Weather Normalized Site Electricity Intensity is less than 3.80 which adds about 10 points to the prediction. The final prediction is an intercept term plus the sum of each of these individual contributions.

We can get another look at the same information by calling the explainer’s .show_in_notebook() method:

# Show the explanation in the Jupyter Notebook
exp.show_in_notebook()

This shows the reasoning process of the model on the left by displaying the contributions of each variable to the prediction. The table on the right shows the actual values of the variables for the data point.

For this example, the model prediction was about 12 and the actual value was 100! While initially this prediction may be puzzling, looking at the explanation, we can see this was not an extreme guess, but a reasonable estimate given the values for the data point. The Site EUI was relatively high and we would expect the Energy Star Score to be low (because EUI is strongly negatively correlated with the score), a conclusion shared by our model. In this case, the logic was faulty because the building had a perfect score of 100.

It can be frustrating when a model is wrong, but explanations such as these help us to understand why the model is incorrect. Moreover, based on the explanation, we might want to investigate why the building has a perfect score despite such a high Site EUI. Perhaps we can learn something new about the problem that would have escaped us without investigating the model. Tools such as this are not perfect, but they go a long way towards helping us understand the model which in turn can allow us to make better decisions.

Documenting Work and Reporting Results

An often overlooked part of any technical project is documentation and reporting. We can do the best analysis in the world, but if we do not clearly communicate the results, then they will not have any impact!

When we document a data science project, we take all the versions of the data and code and package them so that our project can be reproduced or built on by other data scientists. It’s important to remember that code is read more often than it is written, and we want to make sure our work is understandable both for others and for ourselves if we come back to it a few months later. This means putting helpful comments in the code and explaining our reasoning. I find Jupyter Notebooks to be a great tool for documentation because they allow explanations and code to sit side by side.

Jupyter Notebooks can also be a good platform for communicating findings to others. Using notebook extensions, we can hide the code from our final report, because, although it’s hard to believe, not everyone wants to see a bunch of Python code in a document!

Personally, I struggle with succinctly summarizing my work because I like to go through all the details. However, it’s important to understand your audience when you are presenting and tailor the message accordingly. With that in mind, here is my 30-second takeaway from the project:

  1. Using the New York City energy data, it is possible to build a model that can predict the Energy Star Score of a building to within about 9 points of the true value.
  2. The Site EUI and the Weather Normalized Site Electricity Intensity are the most important factors for predicting the Energy Star Score.
  3. Hyperparameter tuning improved the model’s test-set performance by about 10% over the default settings, at the cost of a significant increase in training time.

Originally, I was given this project as a job-screening “assignment” by a start-up. For the final report, they wanted to see both my work and my conclusions, so I developed a Jupyter Notebook to turn in. However, instead of converting directly to PDF in Jupyter, I converted it to a LaTeX .tex file that I then edited in TeXstudio before rendering to a PDF for the final version. The default PDF output from Jupyter has a decent appearance, but it can be significantly improved with a few minutes of editing. Moreover, LaTeX is a powerful document preparation system and it’s good to know the basics.

At the end of the day, our work is only as valuable as the decisions it enables, and being able to present results is a crucial skill. Furthermore, by properly documenting work, we allow others to reproduce our results, give us feedback so we can become better data scientists, and build on our work for the future.

Conclusions

Throughout this series of posts, we’ve walked through a complete end-to-end machine learning project. We started by cleaning the data, moved into model building, and finally looked at how to interpret a machine learning model. As a reminder, the general structure of a machine learning project is below:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

While the exact steps vary by project, and machine learning is often an iterative rather than linear process, this guide should serve you well as you tackle future machine learning projects. I hope this series has given you confidence to be able to implement your own machine learning solutions, but remember, none of us do this by ourselves! If you want any help, there are many incredibly supportive communities where you can look for advice.

A number of online resources were helpful throughout my learning process; the full list is linked in the original post.


*Originally published at* https://towardsdatascience.com

Thanks for reading

If you liked this post, share it with all of your programming buddies!

Follow us on Facebook | Twitter

Further reading

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Machine Learning, Data Science and Deep Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

A Complete Machine Learning Project Walk-Through in Python

Machine Learning: how to go from Zero to Hero

Top 18 Machine Learning Platforms For Developers

10 Amazing Articles On Python Programming And Machine Learning

100+ Basic Machine Learning Interview Questions and Answers