Python vs Java: Understand Object Oriented Programming

This article compares and contrasts object-oriented programming support in Python vs Java. By the end, you’ll be able to apply your knowledge of object-oriented programming to Python, reinterpret what you already know about Java objects in Python terms, and use objects in a Pythonic way.

Java programmers making a move to Python often struggle with Python’s approach to object-oriented programming (OOP). Python and Java take quite different approaches to objects, variable typing, and other language capabilities, which can make switching between the two languages confusing.

Over the course of this article, you’ll:

  • Build a basic class in both Java and Python
  • Explore how object attributes work in Python vs Java
  • Compare and contrast Java methods and Python functions
  • Discover inheritance and polymorphism mechanisms in both languages
  • Investigate reflection across Python vs Java
  • Apply everything in a complete class implementation in both languages

This article isn’t a primer on object-oriented programming. Rather, it compares object-oriented features and principles of Python vs Java. Readers should have a good knowledge of Java and some familiarity with coding in Python.

Table of Contents

  • Sample Classes in Python vs Java
  • Object Attributes
    ◦ Declaration and Initialization
    ◦ Public and Private
    ◦ Access Control
    ◦ self and this
  • Methods and Functions
  • Inheritance and Polymorphism
    ◦ Inheritance
    ◦ Types and Polymorphism
    ◦ Default Methods
    ◦ Operator Overloading
  • Reflection
    ◦ Examining an Object’s Type
    ◦ Examining an Object’s Attributes
    ◦ Calling Methods Through Reflection
  • Conclusion
Sample Classes in Python vs Java

To begin, you’ll implement the same small class in both Python and Java to illustrate the differences between them. You’ll make modifications to them as the article progresses.

First, assume you have the following Car class definition in Java:

public class Car {
    private String color;
    private String model;
    private int year;
    public Car(String color, String model, int year) {
        this.color = color;
        this.model = model;
        this.year = year;
    }
    public String getColor() {
        return color;
    }
    public String getModel() {
        return model;
    }
    public int getYear() {
        return year;
    }
}

Java classes are defined in files with the same name as the class. So, you have to save this class in a file named Car.java. Only one public class can be defined in each file.

A similar small Car class is written in Python as follows:

class Car:
    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year

In Python you can declare a class anywhere, in any file, at any time. Save this class in the file car.py.

Using these classes as your base, you can explore the basic components of classes and objects.

Object Attributes

All object-oriented languages have some way to store data about objects. In both Java and Python, data is stored in attributes, which are variables associated with specific objects.

One of the most significant differences between Python vs Java is how they define and manage class and object attributes. Some of these differences come from constraints imposed by the languages, while others come from best practices.

Declaration and Initialization

In Java, you declare attributes in the class body, outside of any methods, with a definite type. You must define class attributes before they are used:

public class Car {
    private String color;
    private String model;
    private int year;

In Python, you both declare and define attributes inside .__init__(), which is the equivalent of Java’s constructor:

def __init__(self, color, model, year):
    self.color = color
    self.model = model
    self.year = year

By prefixing the variable names with self, you tell Python these are attributes. Each instance of the class will have its own copy. All variables in Python are dynamically typed, and these attributes are no exception.
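For example, nothing stops you from later assigning a value of a completely different type to the same attribute. This short, hypothetical REPL session (the values are made up for illustration) shows .year holding an int and then a str:

>>> import car
>>> my_car = car.Car("yellow", "beetle", 1967)
>>> type(my_car.year)
<class 'int'>
>>> my_car.year = "nineteen sixty-seven"
>>> type(my_car.year)
<class 'str'>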

You can also create instance variables outside of .__init__(), but it’s not a best practice, as their scope is often confusing. If not used properly, instance variables created outside of .__init__() can lead to subtle bugs that are hard to find. For example, you can add a new attribute .wheels to a Car object like this:

>>> import car
>>> my_car = car.Car("yellow", "beetle", 1967)
>>> print(f"My car is {my_car.color}")
My car is yellow
>>> my_car.wheels = 5
>>> print(f"Wheels: {my_car.wheels}")
Wheels: 5

However, if you forget the my_car.wheels = 5 assignment, then Python displays an error:

>>> import car
>>> my_car = car.Car("yellow", "beetle", 1967)
>>> print(f"My car is {my_car.color}")
My car is yellow
>>> print(f"Wheels: {my_car.wheels}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Car' object has no attribute 'wheels'

In Python, when you declare a variable outside of a method, it’s treated as a class variable. Update the Car class as follows:

class Car:
    wheels = 0
    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year

This changes how you use the variable wheels. Instead of referring to it using an object, you refer to it using the class name:

>>> import car
>>> my_car = car.Car("yellow", "beetle", 1967)
>>> print(f"My car is {my_car.color}")
My car is yellow
>>> print(f"It has {car.Car.wheels} wheels")
It has 0 wheels
>>> print(f"It has {my_car.wheels} wheels")
It has 0 wheels

Note: In Python, you refer to a class variable using the following syntax:

  1. The name of the file containing the class, without the .py extension
  2. A dot
  3. The name of the class
  4. A dot
  5. The name of the variable

Since you saved the Car class in the file car.py, you refer to the class variable wheels as car.Car.wheels.

You can refer to my_car.wheels or car.Car.wheels, but be careful. Changing the value of the instance variable my_car.wheels will not change the value of the class variable car.Car.wheels:

>>> import car
>>> my_car = car.Car("yellow", "Beetle", "1966")
>>> my_other_car = car.Car("red", "corvette", "1999")
>>> print(f"My car is {my_car.color}")
My car is yellow
>>> print(f"It has {my_car.wheels} wheels")
It has 0 wheels
>>> print(f"My other car is {my_other_car.color}")
My other car is red
>>> print(f"It has {my_other_car.wheels} wheels")
It has 0 wheels
>>> # Change the class variable value
... car.Car.wheels = 4
>>> print(f"My car has {my_car.wheels} wheels")
My car has 4 wheels
>>> print(f"My other car has {my_other_car.wheels} wheels")
My other car has 4 wheels
>>> # Change the instance variable value for my_car
... my_car.wheels = 5
>>> print(f"My car has {my_car.wheels} wheels")
My car has 5 wheels
>>> print(f"My other car has {my_other_car.wheels} wheels")
My other car has 4 wheels

You define two Car objects:

  1. my_car
  2. my_other_car

At first, both of them have zero wheels. When you set the class variable using car.Car.wheels = 4, both objects now have four wheels. However, when you set the instance variable using my_car.wheels = 5, only that object is affected.

This means that there are now two different copies of the wheels attribute:

  1. A class variable that applies to all Car objects
  2. A specific instance variable applicable to the my_car object only

It isn’t difficult to accidentally refer to the wrong one and introduce subtle bugs.

Java’s equivalent to a class attribute is a static attribute:

public class Car {
    private String color;
    private String model;
    private int year;
    private static int wheels;

    public Car(String color, String model, int year) {
        this.color = color;
        this.model = model;
        this.year = year;
    }

    public static int getWheels() {
        return wheels;
    }

    public static void setWheels(int count) {
        wheels = count;
    }
}

Normally, you refer to static variables using the Java class name. You can also refer to a static variable through a class instance, as you would in Python, but it’s not considered a best practice.

Your Java class is getting long. One of the reasons why Java is more verbose than Python is the notion of public and private methods and attributes.

Public and Private

Java controls access to methods and attributes by differentiating between public data and private data.

In Java, it is expected that attributes are declared as private, or protected if subclasses need direct access to them. This limits access to these attributes from code outside the class. To provide access to private attributes, you declare public methods which set and retrieve data in a controlled manner (more on that later).

Recall from your Java class above that the color variable was declared as private. Therefore, this Java code will produce a compilation error on the line that tries to set .color directly:

Car myCar = new Car("blue", "Ford", 1972);

// Paint the car
myCar.color = "red";

If you don’t specify an access level, then the attribute defaults to package-private, which limits access to classes in the same package. You have to mark the attribute as public if you want this code to work.

However, declaring public attributes is not considered a best practice in Java. You’re expected to declare attributes as private, and use public access methods, such as the .getColor() and .getModel() shown in the code.

Python doesn’t have the same notion of private or protected data that Java does. Everything in Python is public. This code works with your existing Python class just fine:

>>> my_car = car.Car("blue", "Ford", 1972)

>>> # Paint the car
... my_car.color = "red"

Instead of private, Python has the notion of a non-public instance variable. Any variable whose name starts with an underscore character is considered non-public. This naming convention signals that the variable isn’t meant to be accessed from outside the class, but it’s only a convention, and you can still access the variable directly.

Add the following line to your Python Car class:

class Car:

    wheels = 0

    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year
        self._cupholders = 6

You can access the ._cupholders variable directly:

>>> import car
>>> my_car = car.Car("yellow", "Beetle", "1969")
>>> print(f"It was built in {my_car.year}")
It was built in 1969
>>> my_car.year = 1966
>>> print(f"It was built in {my_car.year}")
It was built in 1966
>>> print(f"It has {my_car._cupholders} cupholders.")
It has 6 cupholders.

Python lets you access ._cupholders, but IDEs such as VS Code may issue a warning through linters that support PEP 8.

Python further recognizes a double underscore prefix on a variable as a request to conceal the attribute. When Python sees a double underscore variable, it changes the variable name internally to make it difficult to access directly. This mechanism avoids accidents but still doesn’t make data impossible to access.

To show this mechanism in action, change the Python Car class again:

class Car:

    wheels = 0

    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year
        self.__cupholders = 6

Now, when you try to access the .__cupholders variable, you see the following error:

>>> import car
>>> my_car = car.Car("yellow", "Beetle", "1969")
>>> print(f"It was built in {my_car.year}")
It was built in 1969
>>> my_car.year = 1966
>>> print(f"It was built in {my_car.year}")
It was built in 1966
>>> print(f"It has {my_car.__cupholders} cupholders.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Car' object has no attribute '__cupholders'

So why doesn’t the .__cupholders attribute exist?

When Python sees an attribute with leading double underscores, it mangles the name: the attribute is renamed internally to an underscore, followed by the class name, followed by the original name. To use the attribute directly, you need to change the name you use as well:

>>> print(f"It has {my_car._Car__cupholders} cupholders")
It has 6 cupholders

When you use double underscores to conceal an attribute from the user, Python changes the name in a well-documented manner. This means that a determined developer can still access the attribute directly.

So if your Java attributes are declared private, and your Python attributes are prefaced with double underscores, then how do you provide and control access to the data they store?

Access Control

In Java, you access private attributes using setters and getters. To allow users to paint their cars, add the following code to your Java class:

public String getColor() {
    return color;
}

public void setColor(String color) {
    this.color = color;
}

Since .getColor() and .setColor() are public, anyone can call them to change or retrieve the car’s color. Java’s best practice of using private attributes accessed through public getters and setters is one of the reasons why Java code tends to be more verbose than Python.

As you saw above, you access attributes directly in Python. Since everything is public, you can access anything at any time from anywhere. You set and get attribute values directly by referring to their names. You can even delete attributes in Python, which isn’t possible in Java:

>>> my_car = Car("yellow", "beetle", 1969)
>>> print(f"My car was built in {my_car.year}")
My car was built in 1969
>>> my_car.year = 1966
>>> print(f"It was built in {my_car.year}")
It was built in 1966
>>> del my_car.year
>>> print(f"It was built in {my_car.year}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Car' object has no attribute 'year'

However, there are times you may want to control access to an attribute. In that case, you can use Python properties.

In Python, properties provide controllable access to class attributes using Python decorator syntax. Properties allow functions to be declared in Python classes that are analogous to Java getter and setter methods, with the added bonus of allowing you to delete attributes as well.

You can see how properties work by adding one to your Car class:

class Car:
    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year
        self._voltage = 12
    @property
    def voltage(self):
        return self._voltage

    @voltage.setter
    def voltage(self, volts):
        print("Warning: this can cause problems!")
        self._voltage = volts

    @voltage.deleter
    def voltage(self):
        print("Warning: the radio will stop working!")
        del self._voltage

Here, you expand the notion of the Car to include electric vehicles. You declare the ._voltage attribute in .__init__() to hold the battery voltage.

To provide controlled access, you define a function called voltage() that returns the non-public value. By decorating it with @property, you mark it as a getter that anyone can access directly.

Similarly, you define a setter function, also called voltage(), but decorated with @voltage.setter. Lastly, you decorate a third voltage() with @voltage.deleter, which allows controlled deletion of the attribute.

The names of the decorated functions are all the same, indicating they control access to the same attribute. The function names also become the name of the attribute you use to access the value. Here’s how these properties work in practice:

>>> from car import *
>>> my_car = Car("yellow", "beetle", 1969)
>>> print(f"My car uses {my_car.voltage} volts")
My car uses 12 volts
>>> my_car.voltage = 6
Warning: this can cause problems!
>>> print(f"My car now uses {my_car.voltage} volts")
My car now uses 6 volts
>>> del my_car.voltage
Warning: the radio will stop working!

Note that you use .voltage in the code above, not ._voltage. This tells Python to use the property functions you defined:

  • When you print the value of my_car.voltage, Python calls .voltage() decorated with @property.
  • When you assign a value to my_car.voltage, Python calls .voltage() decorated with @voltage.setter.
  • When you delete my_car.voltage, Python calls .voltage() decorated with @voltage.deleter.

The @property, @.setter, and @.deleter decorators make it possible to control access to attributes without requiring users to use different methods. You can even make attributes appear to be read-only properties by omitting the @.setter and @.deleter decorated functions, as sketched below.
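For example, here’s a minimal sketch of a read-only property, using a small, hypothetical Engine class rather than the Car class built in this article:

class Engine:
    def __init__(self, serial_number):
        self._serial_number = serial_number

    @property
    def serial_number(self):
        # Getter only: no @serial_number.setter is defined
        return self._serial_number

Reading engine.serial_number works like any other attribute, but assigning to it raises an AttributeError, because no setter was defined.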

self and this

In Java, a class refers to itself with the this reference:

public void setColor(String color) {
    this.color = color;
}

this is implicit in Java code: it doesn’t normally need to be written, unless there may be confusion between two variables with the same name.

You can write the same setter this way:

public void setColor(String newColor) {
    color = newColor;
}

Since Car has an attribute named .color, and there isn’t another variable with the same name in scope, a reference to that name works. The first example used this to differentiate between the attribute and the parameter, which are both named color.

In Python, the keyword self serves a similar purpose. It’s how you refer to member variables, but unlike Java’s this, it’s required if you want to create or refer to a member attribute:

class Car:
    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year
        self._voltage = 12

    @property
    def voltage(self):
        return self._voltage

Python requires each self in the code above. Each one either creates or refers to the attributes. If you omit them, then Python will create a local variable instead of an attribute.
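Here’s a small sketch of that pitfall, using a hypothetical repaint() method that forgets self:

class Car:
    def __init__(self, color):
        self.color = color

    def repaint(self, new_color):
        color = new_color  # Creates a local variable only: self.color is unchanged

After calling my_car.repaint("red") on a Car built this way, my_car.color still holds the original color.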

The difference between how you use self and this in Python and Java is due to underlying differences between the two languages and how they name variables and attributes.

Methods and Functions

This difference between Python and Java is, simply put, that Python has stand-alone functions, and Java doesn’t.

In Python, the following code is perfectly fine (and very common):

>>> def say_hi():
...     print("Hi!")
... 
>>> say_hi()
Hi!

You can call say_hi() from anywhere it’s visible. This function has no reference to self, which indicates that it’s a global function, not a class function. It can’t alter or store any data in any classes but can use local and global variables.

In contrast, every line of Java code you write belongs to a class. Functions can’t exist outside of a class, and by definition, all Java functions are methods. In Java, the closest you can get to a pure function is by using a static method:

public class Utils {
    public static void sayHi() {
        System.out.println("Hi!");
    }
}

Utils.sayHi() is callable from anywhere without first creating an instance of Utils. Since you can call sayHi() without creating an object first, the this reference doesn’t exist. However, this is still not a function in the sense that say_hi() is in Python.

Inheritance and Polymorphism

Inheritance and polymorphism are two fundamental concepts in object-oriented programming.

Inheritance allows objects to derive attributes and functionality from other objects, creating a hierarchy moving from more general objects to more specific. For example, a Car and a Boat are both specific types of Vehicles. Objects can inherit their behavior from a single parent object or multiple parent objects, and are referred to as child objects when they do.

Polymorphism allows two or more objects to behave like one another, which allows them to be used interchangeably. For example, if a method or function knows how to paint a Vehicle object, then it can also paint a Car or Boat object, since they inherit their data and behavior from the Vehicle.

These fundamental OOP concepts are implemented quite differently in Python vs Java.

Inheritance

Python supports multiple inheritance, or creating classes that inherit behavior from more than one parent class.

To see how this works, update the Car class by breaking it into two categories, one for vehicles, and one for devices that use electricity:

class Vehicle:
    def __init__(self, color, model):
        self.color = color
        self.model = model

class Device:
    def __init__(self):
        self._voltage = 12

class Car(Vehicle, Device):
    def __init__(self, color, model, year):
        Vehicle.__init__(self, color, model)
        Device.__init__(self)
        self.year = year

    @property
    def voltage(self):
        return self._voltage

    @voltage.setter
    def voltage(self, volts):
        print("Warning: this can cause problems!")
        self._voltage = volts

    @voltage.deleter
    def voltage(self):
        print("Warning: the radio will stop working!")
        del self._voltage

A Vehicle is defined as having .color and .model attributes. Then, a Device is defined to have a ._voltage attribute. Since the original Car object had these three attributes, it can be redefined to inherit both the Vehicle and Device classes. The color, model, and _voltage attributes will be part of the new Car class.

In the .__init__() for Car, you call the .__init__() methods for both of the parent classes to make sure everything is initialized properly. Once done, you can add any other functionality you want to your Car. In this case, you add a .year attribute specific to Car objects, and getter and setter methods for .voltage.
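Calling each parent’s .__init__() explicitly works, but Python also supports cooperative inheritance through super(), which walks the method resolution order (MRO) for you. Here’s a sketch of what that could look like for these classes (an alternative, not the version used in the rest of this article):

class Vehicle:
    def __init__(self, color, model, **kwargs):
        super().__init__(**kwargs)  # Continue along the MRO
        self.color = color
        self.model = model

class Device:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # Continue along the MRO
        self._voltage = 12

class Car(Vehicle, Device):
    def __init__(self, color, model, year):
        super().__init__(color=color, model=model)
        self.year = year

With super(), each class in the chain runs its initialization exactly once, but every cooperating class has to accept and pass along **kwargs.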

Functionally, the new Car class behaves as it always has. You create and use Car objects just as before:

>>> from car import *
>>> my_car = Car("yellow", "beetle", 1969)

>>> print(f"My car is {my_car.color}")
My car is yellow

>>> print(f"My car uses {my_car.voltage} volts")
My car uses 12 volts

>>> my_car.voltage = 6
Warning: this can cause problems!

>>> print(f"My car now uses {my_car.voltage} volts")
My car now uses 6 volts

Java, on the other hand, only supports single inheritance, which means classes in Java can inherit data and behavior from only a single parent class. However, Java objects can inherit behavior from many different interfaces. Interfaces provide a group of related methods an object must implement, and allow multiple child classes to behave similarly.

To see this in action, split the Java Car class into a parent class and an interface:

public class Vehicle {

    private String color;
    private String model;

    public Vehicle(String color, String model) {
        this.color = color;
        this.model = model;
    }

    public String getColor() {
        return color;
    }

    public String getModel() {
        return model;
    }
}

public interface Device {
    int getVoltage();
}

public class Car extends Vehicle implements Device {

    private int voltage;
    private int year;

    public Car(String color, String model, int year) {
        super(color, model);
        this.year = year;
        this.voltage = 12;
    }

    @Override
    public int getVoltage() {
        return voltage;
    }

    public int getYear() {
        return year;
    }
}

Remember that each class and interface needs to live in its own file.

As you did with Python, you create a new class called Vehicle to hold the more general vehicle related data and functionality. However, to add the Device functionality, you need to create an interface instead. This interface defines a single method to return the voltage of the Device.

Redefining the Car class requires you to inherit from Vehicle using extends and to implement the Device interface using implements. In the constructor, you call the parent class constructor using the built-in super(). Since there is only one parent class, it can only refer to the Vehicle constructor. To implement the interface, you write getVoltage() with the @Override annotation.

Rather than getting code reuse from Device as Python did, Java requires you to implement the same functionality in every class that implements the interface. Interfaces only declare the methods; they cannot define instance data or implementation details.

So why is this the case for Java? It all comes down to types.

Types and Polymorphism

Java’s strict type checking is what drives its interface design.

Every class and interface in Java is a type. Therefore, if two Java objects implement the same interface, then they are considered to be the same type with respect to that interface. This mechanism allows different classes to be used interchangeably, which is the definition of polymorphism.

You can implement device charging for your Java objects by creating a .charge() that takes a Device to charge. Any object that implements the Device interface can be passed to .charge(). This also means that classes that do not implement Device will generate a compilation error.

Create the following class in a file called Rhino.java:

public class Rhino {
}

Now you can create a new Main.java to implement .charge() and explore how Car and Rhino objects differ:

public class Main {
    public static void charge(Device device) {
       device.getVoltage();
    }

    public static void main(String[] args) throws Exception {
        Car car = new Car("yellow", "beetle", 1969);
        Rhino rhino = new Rhino();
        charge(car);
        charge(rhino);
    }
}

Here is what you should see when you try to build this code:

Information:2019-02-02 15:20 - Compilation completed with 
    1 error and 0 warnings in 4 s 395 ms
Main.java
Error:(43, 11) java: incompatible types: Rhino cannot be converted to Device

Since the Rhino class doesn’t implement the Device interface, it can’t be passed into .charge().

In contrast to Java’s strict variable typing, Python uses a concept called duck typing, which in basic terms means that if a variable “walks like a duck and quacks like a duck, then it’s a duck.” Instead of identifying objects by type, Python examines their behavior.

You can explore duck typing by implementing similar device charging capabilities for your Python Device class:

>>> def charge(device):
...     if hasattr(device, '_voltage'):
...         print(f"Charging a {device._voltage} volt device")
...     else:
...         print(f"I can't charge a {device.__class__.__name__}")
... 
>>> class Phone(Device):
...     pass
... 
>>> class Rhino:
...     pass
... 
>>> my_car = Car("yellow", "Beetle", "1966")
>>> my_phone = Phone()
>>> my_rhino = Rhino()

>>> charge(my_car)
Charging a 12 volt device
>>> charge(my_phone)
Charging a 12 volt device
>>> charge(my_rhino)
I can't charge a Rhino

charge() must check for the existence of the ._voltage attribute in the object it’s passed. Since the Device class defines this attribute, any class that inherits from it (such as Car and Phone) will have this attribute, and will therefore show they are charging properly. Classes that do not inherit from Device (like Rhino) may not have this attribute, and will not be able to charge (which is good, since charging rhinos can be dangerous).
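An equally Pythonic alternative to checking with hasattr() is the "easier to ask forgiveness than permission" (EAFP) style: just use the attribute and handle the AttributeError if it isn’t there. Here’s a minimal sketch of charge() rewritten that way, with the same behavior as above:

>>> def charge(device):
...     try:
...         print(f"Charging a {device._voltage} volt device")
...     except AttributeError:
...         print(f"I can't charge a {device.__class__.__name__}")
... 
>>> charge(my_rhino)
I can't charge a Rhino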

Default Methods

All Java classes descend from the Object class, which contains a set of methods every other class inherits. Subclasses can either override them or keep the defaults. The Object class defines the following methods:

class Object {
    boolean equals(Object obj) { ... }    
    int hashCode() { ... }    
    String toString() { ... }    
}

By default, equals() compares the address of the current Object with that of a second Object passed in, and hashCode() computes a unique identifier that is also based on the address of the current Object. These methods are used in many different contexts in Java. For example, utility classes, such as collections that sort objects based on value, need both of them.

toString() returns a String representation of the Object. By default, this is the name of the class and the address. This method is called automatically when an Object is passed to a method that requires a String argument, such as System.out.println():

Car car = new Car("yellow", "Beetle", 1969);
System.out.println(car);

Running this code will use the default .toString() to show the car object:

Car@<hashcode>

Not very useful, right? You can improve this by overriding the default .toString(). Add this method to your Java Car class:

@Override
public String toString() {
    return "Car: " + getColor() + " : " + getModel() + " : " + getYear();
}

Now, when you run the same sample code, you’ll see the following:

Car: yellow : Beetle : 1969

Python provides similar functionality with a set of common dunder (short for “double underscore”) methods. Every Python class inherits these methods, and you can override them to modify their behavior.

For string representations of an object, Python provides .__repr__() and .__str__(), which you can learn about in Pythonic OOP String Conversion: __repr__ vs __str__. The unambiguous representation of an object is returned by .__repr__(), while .__str__() returns a human-readable representation. These are roughly analogous to .hashCode() and .toString() in Java.

Like Java, Python provides default implementations of these dunder methods:

>>> my_car = Car("yellow", "Beetle", "1966")

>>> print(repr(my_car))
<car.Car object at 0x7fe4ca154f98>
>>> print(str(my_car))
<car.Car object at 0x7fe4ca154f98>

You can improve this output by overriding .__str__(), adding this to your Python Car class:

def __str__(self):
    return f'Car {self.color} : {self.model} : {self.year}'

This gives you a much nicer result:

>>> my_car = Car("yellow", "Beetle", "1966")

>>> print(repr(my_car))
<car.Car object at 0x7f09e9a7b630>
>>> print(str(my_car))
Car yellow : Beetle : 1966

Overriding the dunder method gave you a much more readable representation of your Car. You may want to override .__repr__() as well, as it is often useful for debugging.
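A common convention, shown here as a sketch you could add to the class, is to make .__repr__() return a string that looks like the call that would recreate the object:

def __repr__(self):
    return f'Car({self.color!r}, {self.model!r}, {self.year!r})'

With this in place, repr(my_car) shows something like Car('yellow', 'Beetle', '1966') instead of the default class name and memory address.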

Python offers a lot more dunder methods. Using dunder methods, you can define your object’s behavior during iteration, comparison, addition, or making an object callable directly, among other things.

Operator Overloading

Operator overloading refers to redefining how Python operators work when operating on user-defined objects. Python’s dunder methods allow you to implement operator overloading, something that Java doesn’t offer at all.

Modify your Python Car class with the following additional dunder methods:

class Car:
    def __init__(self, color, model, year):
        self.color = color
        self.model = model
        self.year = year

    def __str__(self):
        return f'Car {self.color} : {self.model} : {self.year}'

    def __eq__(self, other):
        return self.year == other.year

    def __lt__(self, other):
        return self.year < other.year

    def __add__(self, other):
        return Car(self.color + other.color, 
                   self.model + other.model, 
                   int(self.year) + int(other.year))

The table below shows the relationship between these dunder methods and the Python operators they represent:

  Dunder method    Operator    Purpose
  .__eq__()        ==          Do these two objects have the same year?
  .__lt__()        <           Which object has the earlier model year?
  .__add__()       +           Combine two objects into a single new one

When Python sees an expression containing objects, it calls any dunder methods defined that correspond to operators in the expression. The code below uses these new overloaded arithmetic operators on a couple of Car objects:

>>> my_car = Car("yellow", "Beetle", "1966")
>>> your_car = Car("red", "Corvette", "1967")

>>> print (my_car < your_car)
True
>>> print (my_car > your_car)
False
>>> print (my_car == your_car)
False
>>> print (my_car + your_car)
Car yellowred : BeetleCorvette : 3933 

There are many more operators you can overload using dunder methods. They offer a way to enrich your object’s behavior in a way that Java’s common base class default methods don’t.

Reflection

Reflection refers to examining an object or class from within the object or class. Both Java and Python offer ways to explore and examine the attributes and methods in a class.

Examining an Object’s Type

Both languages have ways to test or check an object’s type.

In Python, you use type() to display the type of a variable, and isinstance() to determine if a given variable is an instance or child of a specific class:

>>> my_car = Car("yellow", "Beetle", "1966")

>>> print(type(my_car))
<class 'car.Car'>
>>> print(isinstance(my_car, Car))
True
>>> print(isinstance(my_car, Device))
True

In Java, you query the object for its type using .getClass(), and use the instanceof operator to check for a specific class:

Car car = new Car("yellow", "beetle", 1969);

System.out.println(car.getClass());
System.out.println(car instanceof Car);

This code outputs the following:

class com.realpython.Car
true

Examining an Object’s Attributes

In Python, you can view every attribute and function contained in any object (including all the dunder methods) using dir(). To get the specific details of a given attribute or function, use getattr():

>>> print(dir(my_car))
['_Car__cupholders', '__add__', '__class__', '__delattr__', '__dict__', 
 '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
 '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__',
 '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__',
 '_voltage', 'color', 'model', 'voltage', 'wheels', 'year']

>>> print(getattr(my_car, "__format__"))
<built-in method __format__ of Car object at 0x7fb4c10f5438>

Java has similar capabilities, but the language’s access control and type safety make it more complicated to retrieve.

.getFields() retrieves a list of all publicly accessible attributes. However, since none of the attributes of Car are public, this code returns an empty array:

Field[] fields = car.getClass().getFields();

Java treats attributes and methods as separate entities, so public methods are retrieved using .getDeclaredMethods(). Since public attributes will have a corresponding getter method, one way to discover whether a class contains a specific property might look like this:

  • Use .getDeclaredMethods() to generate an array of all the methods.
  • Loop through all the methods returned, and for each one return true if the method:
    ◦ Begins with the word get or accepts zero arguments, and
    ◦ Doesn’t return void, and
    ◦ Includes the name of the property.
  • Otherwise, return false.

Here’s a quick-and-dirty example:
public static boolean getProperty(String name, Object object) throws Exception {
    Method[] declaredMethods = object.getClass().getDeclaredMethods();
    for (Method method : declaredMethods) {
        if (isGetter(method) && 
            method.getName().toUpperCase().contains(name.toUpperCase())) {
              return true;
        }
    }
    return false;
}
// Helper function to check whether the method is a getter
public static boolean isGetter(Method method) {
    if ((method.getName().startsWith("get") || 
         method.getParameterCount() == 0 ) && 
        !method.getReturnType().equals(void.class)) {
          return true;
    }
    return false;
}

getProperty() is your entry point. Call this with the name of an attribute and an object. It returns true if the property is found, and false if not.

Calling Methods Through Reflection

Both Java and Python provide mechanisms to call methods through reflection.

In the Java example above, instead of simply returning true if the property was found, you could call the method directly. Recall that .getDeclaredMethods() returns an array of Method objects. The Method object itself has a method called .invoke(), which will call the Method. So, instead of returning true when the matching method is found in getProperty(), you can return method.invoke(object) instead.

This capability exists in Python as well. However, since Python doesn’t differentiate between functions and attributes, you have to look specifically for entries that are callable:

>>> for method_name in dir(my_car):
...     if callable(getattr(my_car, method_name)):
...         print(method_name)
... 
__add__
__class__
__delattr__
__dir__
__eq__
__format__
__ge__
__getattribute__
__gt__
__init__
__init_subclass__
__le__
__lt__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__str__
__subclasshook__

Python methods are simpler to manage and call than in Java. Adding the () operator (and any required arguments) is all you need to do.

The code below will find an object’s .__str__() and call it through reflection:

>>> for method_name in dir(my_car):
...     attr = getattr(my_car, method_name)
...     if callable(attr):
...         if method_name == '__str__':
...             print(attr())
... 
Car yellow : Beetle : 1966

Here, every attribute returned by dir() is checked. You get the actual attribute object using getattr(), and check if it’s a callable function using callable(). If so, you then check if its name is __str__(), and then call it.

Conclusion

Throughout the course of this article, you learned how object-oriented principles differ in Python vs Java. As you read, you:

  • Built a basic class in both Java and Python
  • Explored how object attributes work in Python vs Java
  • Compared and contrasted Java methods and Python functions
  • Discovered inheritance and polymorphism mechanisms in both languages
  • Investigated reflection across Python vs Java
  • Applied everything in a complete class implementation in both languages

Understanding the differences in Python vs Java when handling objects, and the syntax choices each language makes, will help you apply best practices and make your next project smoother.

How to Set up an SMS Notification With Python


Hi everyone :) Today I am beginning a new series of posts specifically aimed at Python beginners. The concept is rather simple: I'll do a fun project, in as few lines of code as possible, and will try out as many new tools as possible.

For example, today we will learn to use the Twilio API and the Twitch API, and we'll see how to deploy the project on Heroku. I'll show you how you can have your own "Twitch Live" SMS notifier, in 30 lines of code, and for 12 cents a month.

Prerequisite: You only need to know how to run Python on your machine and some basic commands in git (commit & push). If you need help with these, I can recommend these 2 articles to you:

Python 3 Installation & Setup Guide

The Ultimate Git Command Tutorial for Beginners from Adrian Hajdin.

What you'll learn:

  • Twitch API
  • Twilio API
  • Deploying on Heroku
  • Setting up a scheduler on Heroku

What you will build:

The specifications are simple: we want to receive an SMS as soon as a specific Twitcher starts live streaming. We want to know when this person goes live and when they stop streaming. We want this whole thing to run by itself, all day long.

We will split the project into 3 parts. First, we will see how to programmatically know if a particular Twitcher is online. Then we will see how to receive an SMS when this happens. We will finish by seeing how to make this piece of code run every X minutes, so we never miss another moment of our favorite streamer's life.

Is this Twitcher live?

To know whether a Twitcher is live, we can do one of two things. The first is to go to the Twitcher's page and check whether the "Live" badge is there.

Screenshot of a Twitcher live streaming.

This process involves scraping and is not easily doable in Python in less than 20 or so lines of code. Twitch runs a lot of JS code and a simple requests.get() won't be enough.

For scraping to work in this case, we would need to render this page inside Chrome to get the same content as what you see in the screenshot. This is doable, but it will take much more than 30 lines of code. If you'd like to learn more, don't hesitate to check my recent web scraping guide.

So instead of trying to scrape Twitch, we will use their API. For those unfamiliar with the term, an API is a programmatic interface that allows websites to expose their features and data to anyone, mainly developers. In Twitch's case, their API is exposed through HTTP, which means that we can get lots of information and do lots of things by just making simple HTTP requests.

Get your API key

To do this, you have to first create a Twitch API key. Many services enforce authentication for their APIs to ensure that no one abuses them or to restrict access to certain features by certain people.

Please follow these steps to get your API key:

  • Create a Twitch account
  • Now create a Twitch dev account -> "Signing up with Twitch" top right
  • Go to your "dashboard" once logged in
  • "Register your application"
  • Name -> Whatever, Oauth redirection URL -> http://localhost, Category -> Whatever

You should now see, at the bottom of your screen, your client-id. Keep this for later.

Is that Twitcher streaming now?

With your API key in hand, we can now query the Twitch API to have the information we want, so let's begin to code. The following snippet just consumes the Twitch API with the correct parameters and prints the response.

# requests is the go-to package in Python for making HTTP requests
# https://2.python-requests.org/en/master/
import requests

# This is one of the routes where Twitch exposes data.
# They have many more: https://dev.twitch.tv/docs
endpoint = "https://api.twitch.tv/helix/streams?"

# In order to authenticate, we need to pass our API key through a header
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}

# The previously set endpoint needs some parameters, here the Twitcher we want to follow
# Disclaimer: I don't even know who this is, but he was the first one on Twitch to have a live stream, so I could have nice examples
params = {"user_login": "Solary"}

# It is now time to make the actual request
response = requests.get(endpoint, params=params, headers=headers)
print(response.json())

The output should look like this:

{
   'data':[
      {
         'id':'35289543872',
         'user_id':'174955366',
         'user_name':'Solary',
         'game_id':'21779',
         'type':'live',
         'title':"Wakz duoQ w/ Tioo - GM 400LP - On récupère le chall après les -250LP d'inactivité !",
         'viewer_count':4073,
         'started_at':'2019-08-14T07:01:59Z',
         'language':'fr',
         'thumbnail_url':'https://static-cdn.jtvnw.net/previews-ttv/live_user_solary-{width}x{height}.jpg',
         'tag_ids':[
            '6f655045-9989-4ef7-8f85-1edcec42d648'
         ]
      }
   ],
   'pagination':{
      'cursor':'eyJiIjpudWxsLCJhIjp7Ik9mZnNldCI6MX19'
   }
}

This data format is called JSON and is easily readable. The data key holds an array that contains all of the currently active streams. The type key confirms that the stream is currently live; it will be an empty string otherwise (in case of an error, for example).

So if we want to create a boolean variable in Python that stores whether the current user is streaming, all we have to append to our code is:

json_response = response.json()

# We get only streams
streams = json_response.get('data', [])

# We create a small function, (a lambda), that tests if a stream is live or not
is_active = lambda stream: stream.get('type') == 'live'
# We filter our array of streams with this function so we only keep streams that are active
streams_active = filter(is_active, streams)

# any returns True if streams_active has at least one element, else False
at_least_one_stream_active = any(streams_active)

print(at_least_one_stream_active)

At this point, at_least_one_stream_active is True when your favourite Twitcher is live.

Let's now see how to get notified by SMS.

Send me a text, NOW!

So to send a text to ourselves, we will use the Twilio API. Just go over there and create an account. When asked to confirm your phone number, please use the phone number you want to use in this project. This way you'll be able to use the $15 of free credit Twilio offers to new users. At around 1 cent a text, it should be enough for your bot to run for one year.

If you go to the console, you'll see your Account SID and your Auth Token; save them for later. Also click on the big red button "Get My Trial Number", follow the steps, and save this number for later too.

Sending a text with the Twilio Python API is very easy, as they provide a package that does the annoying stuff for you. Install the package with pip install twilio and just do:

from twilio.rest import Client

client = Client(<Your Account SID>, <Your Auth Token>)
client.messages.create(
    body='Test MSG', from_=<Your Trial Number>, to=<Your Real Number>)

And that is all you need to send yourself a text, amazing right?

Putting everything together

We will now put everything together, and shorten the code a bit so we stay under 30 lines of Python code.

import requests
from twilio.rest import Client
endpoint = "https://api.twitch.tv/helix/streams?"
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}
params = {"user_login": "Solary"}
response = requests.get(endpoint, params=params, headers=headers)
json_response = response.json()
streams = json_response.get('data', [])
is_active = lambda stream:stream.get('type') == 'live'
streams_active = filter(is_active, streams)
at_least_one_stream_active = any(streams_active)
if at_least_one_stream_active:
    client = Client(<Your Account SID>, <Your Auth Token>)
    client.messages.create(body='LIVE !!!', from_=<Your Trial Number>, to=<Your Real Number>)

Avoiding double notifications

This snippet works great, but if it ran every minute on a server, we would receive an SMS every minute for as long as our favorite Twitcher is live.

We need a way to store the fact that we were already notified that our Twitcher is live and that we don't need to be notified anymore.

The good thing with the Twilio API is that it offers a way to retrieve our message history, so we just have to retrieve the last SMS we sent to see if we already sent a text notifying us that the twitcher is live.

Here is what we are going to do, in pseudocode:

if favorite_twitcher_live and last_sent_sms is not live_notification:
	send_live_notification()
if not favorite_twitcher_live and last_sent_sms is live_notification:
	send_live_is_over_notification()

This way we will receive a text as soon as the stream starts, as well as when it is over. This way we won't get spammed - perfect right? Let's code it:

# reusing our Twilio client
last_messages_sent = client.messages.list(limit=1)
last_message_id = last_messages_sent[0].sid
last_message_data = client.messages(last_message_id).fetch()
last_message_content = last_message_data.body

Let's now put everything together again:

import requests
from twilio.rest import Client
client = Client(<Your Account SID>, <Your Auth Token>)

endpoint = "https://api.twitch.tv/helix/streams?"
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}
params = {"user_login": "Solary"}
response = requests.get(endpoint, params=params, headers=headers)
json_response = response.json()
streams = json_response.get('data', [])
is_active = lambda stream:stream.get('type') == 'live'
streams_active = filter(is_active, streams)
at_least_one_stream_active = any(streams_active)

last_messages_sent = client.messages.list(limit=1)
if last_messages_sent:
    last_message_id = last_messages_sent[0].sid
    last_message_data = client.messages(last_message_id).fetch()
    last_message_content = last_message_data.body
    online_notified = "LIVE" in last_message_content
    offline_notified = not online_notified
else:
    online_notified, offline_notified = False, False

if at_least_one_stream_active and not online_notified:
    client.messages.create(body='LIVE !!!', from_=<Your Trial Number>, to=<Your Real Number>)
if not at_least_one_stream_active and not offline_notified:
    client.messages.create(body='OFFLINE !!!', from_=<Your Trial Number>, to=<Your Real Number>)

And voilà!

You now have a snippet of code, in less than 30 lines of Python, that will send you a text as soon as your favourite Twitcher goes online or offline, and without spamming you.

We just now need a way to host and run this snippet every X minutes.

The quest for a host

To host and run this snippet we will use Heroku. Heroku is honestly one of the easiest ways to host an app on the web. The downside is that it is really expensive compared to other solutions out there. Fortunately for us, they have a generous free plan that will allow us to do what we want for almost nothing.

If you don't already, you need to create a Heroku account. You also need to download and install the Heroku client.

You now have to move your Python script to its own folder. Don't forget to add a requirements.txt file to it. The contents of the latter are:

requests
twilio

This is to ensure that Heroku downloads the correct dependencies.

cd into this folder and just do a heroku create --app <app name>.

If you go on your app dashboard you'll see your new app.

We now need to initialize a git repo and push the code on Heroku:

git init
heroku git:remote -a <app name>
git add .
git commit -am 'Deploy breakthrough script'
git push heroku master

Your app is now on Heroku, but it is not doing anything. Since this little script can't accept HTTP requests, going to <app name>.herokuapp.com won't do anything. But that should not be a problem.

To have this script running 24/7, we need to use a simple Heroku add-on called "Heroku Scheduler". To install this add-on, click on the "Configure Add-ons" button on your app dashboard.

Then, on the search bar, look for Heroku Scheduler:

Click on the result, and click on "Provision"

If you go back to your App dashboard, you'll see the add-on:

Click on the "Heroku Scheduler" link to configure a job. Then click on "Create Job". Here select "10 minutes", and for run command select python <name_of_your_script>.py. Click on "Save job".

While everything we used so far on Heroku is free, the Heroku Scheduler will run the job on the $25/month instance, but prorated to the second. Since this script approximately takes 3 seconds to run, for this script to run every 10 minutes you should just have to spend 12 cents a month.
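Here is the back-of-the-envelope math behind that estimate, assuming a 30-day month and roughly 3 seconds per run:

runs per month   = 6 per hour * 24 * 30        = 4,320 runs
compute time     = 4,320 runs * 3 s            = 12,960 s (about 3.6 hours)
price per second = $25 / (30 * 24 * 3,600 s)   ≈ $0.0000096
monthly cost     = 12,960 s * $0.0000096       ≈ $0.12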

Ideas for improvements

I hope you liked this project and that you had fun putting it into place. In less than 30 lines of code, we did a lot, but this whole thing is far from perfect. Here are a few ideas to improve it:

  • Send yourself more information about the current streaming (game played, number of viewers ...)
  • Send yourself the duration of the last stream once the twitcher goes offline
  • Don't send you a text, but rather an email
  • Monitor multiple twitchers at the same time

Do not hesitate to tell me in the comments if you have more ideas.

Conclusion

I hope that you liked this post and that you learned things reading it. I truly believe that this kind of project is one of the best ways to learn new tools and concepts. I recently launched a web scraping API, and I learned a lot while making it.

Please tell me in the comments if you liked this format and if you want to do more.

I have many other ideas, and I hope you will like them. Do not hesitate to share what other things you build with this snippet, possibilities are endless.

Happy Coding.

Pierre

Don't want to miss my next post:

You can subscribe here to my newsletter.

Python Tutorial: Image processing with Python (Using OpenCV)

In this tutorial, you will learn how you can process images in Python using the OpenCV library.

OpenCV is a free open source library used in real-time image processing. It’s used to process images, videos, and even live streams, but in this tutorial, we will process images only as a first step. Before getting started, let’s install OpenCV.

Install OpenCV

To install OpenCV on your system, run the following pip command:

 pip install opencv-python

Now OpenCV is installed successfully and we are ready. Let’s have some fun with some images!

Rotate an Image

First of all, import the cv2 module.

 import cv2

Now to read the image, use the imread() method of the cv2 module, specify the path to the image in the arguments and store the image in a variable as below:

 img = cv2.imread("pyimg.jpg")

The image is now stored in img and treated as a matrix whose rows and columns hold the pixel values.

Actually, if you check the type of the img, it will give you the following result:

>>> print(type(img))
 
<class 'numpy.ndarray'>

It’s a NumPy array! That’s why image processing with OpenCV is so easy: you are working with a NumPy array the whole time.

To display the image, you can use the imshow() method of cv2.

cv2.imshow('Original Image', img) 
 
cv2.waitKey(0)

The waitKey() function takes a time in milliseconds as an argument, which is the delay before the window closes. Here we set the time to zero to keep the window open until we close it manually.

To rotate this image, you need the width and the height of the image because you will use them in the rotation process as you will see later.

 height, width = img.shape[0:2]

The shape attribute returns the height and width of the image matrix, so printing img.shape[0:2] shows these two values for your image.

Okay, now we have our image matrix and we want to get the rotation matrix. To get the rotation matrix, we use the getRotationMatrix2D() method of cv2. The syntax of getRotationMatrix2D() is:

 cv2.getRotationMatrix2D(center, angle, scale)

Here center is the center point of rotation, angle is the rotation angle in degrees, and scale is a scaling factor, used here to make the rotated image fit on the screen.

To get the rotation matrix of our image, the code will be:

 rotationMatrix = cv2.getRotationMatrix2D((width/2, height/2), 90, .5)

The next step is to rotate our image with the help of the rotation matrix.

To rotate the image, we use the cv2 method warpAffine, which takes the original image, the rotation matrix of the image, and the width and height of the image as arguments.

 rotatedImage = cv2.warpAffine(img, rotationMatrix, (width, height))

The rotated image is stored in the rotatedImage matrix. To show the image, use imshow() as below:

cv2.imshow('Rotated Image', rotatedImage)
 
cv2.waitKey(0)

After running the above lines of code, you will have the following output:

Crop an Image

First, we need to import the cv2 module and read the image and extract the width and height of the image:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
height, width = img.shape[0:2]

Now get the starting and ending indexes of the rows and columns. These define the size of the newly created image. For example, starting from row number 10 and ending at row number 15 gives the height of the cropped image.

Similarly, starting from column number 10 and ending at column number 15 gives the width of the cropped image.

You can get the starting point by specifying the percentage value of the total height and the total width. Similarly, to get the ending point of the cropped image, specify the percentage values as below:

startRow = int(height*.15)
 
startCol = int(width*.15)
 
endRow = int(height*.85)
 
endCol = int(width*.85)

Now map these values to the original image. Note that you have to cast the starting and ending values to integers because when mapping, the indexes are always integers.

 croppedImage = img[startRow:endRow, startCol:endCol]

Here we specified the range from starting to ending of rows and columns.

Now display the original and cropped image in the output:

cv2.imshow('Original Image', img)
 
cv2.imshow('Cropped Image', croppedImage)
 
cv2.waitKey(0)

The result will be as follows:

Resize an Image

To resize an image, you can use the resize() method of OpenCV. In the resize() method, you can either specify scaling factors for the x and y axes or the number of rows and columns, which determines the size of the resulting image.

Import and read the image:

import cv2
 
img = cv2.imread("pyimg.jpg")

Now using the resize method with axis values:

newImg = cv2.resize(img, (0,0), fx=0.75, fy=0.75)
 
cv2.imshow('Resized Image', newImg)
 
cv2.waitKey(0)

The result will be as follows:

Now using the row and column values to resize the image:

newImg = cv2.resize(img, (550, 350))
 
cv2.imshow('Resized Image', newImg)
 
cv2.waitKey(0)

We say we want 550 columns (the width) and 350 rows (the height).

The result will be:

Adjust Image Contrast

The Python OpenCV module has no dedicated function to adjust image contrast, but the official OpenCV documentation suggests an equation that can adjust image brightness and image contrast at the same time.

 new_img = a * original_img + b

Here a is alpha, which controls the contrast of the image. If a is greater than 1, the contrast increases; if a is between 0 and 1, the contrast decreases; and if a is exactly 1, the contrast is unchanged.

b stands for beta, which controls the brightness; its values range from -127 to +127.

To implement this equation in Python OpenCV, you can use the addWeighted() method. We use addWeighted() because it keeps the output in the range of 0 to 255 for a 24-bit color image.

The syntax of addWeighted() method is as follows:

 cv2.addWeighted(source_img1, alpha1, source_img2, alpha2, beta)

This blends two images: the first source image (source_img1) weighted by alpha1 and the second source image (source_img2) weighted by alpha2, with beta added to the result.

If you only want to apply contrast in one image, you can add a second image source as zeros using NumPy.

Let’s work on a simple example. Import the following modules:

import cv2
 
import numpy as np

Read the original image:

 img = cv2.imread("pyimg.jpg")

Now apply the contrast. Since there is no second image, we use np.zeros, which creates an array of the same shape and data type as the original image but filled with zeros.

contrast_img = cv2.addWeighted(img, 2.5, np.zeros(img.shape, img.dtype), 0, 0)
 
cv2.imshow('Original Image', img)
 
cv2.imshow('Contrast Image', contrast_img)
 
cv2.waitKey(0)

In the above code, the brightness is set to 0 as we only want to apply contrast.

The comparison of the original and contrast image is as follows:

Make an image blurry

Gaussian Blur

To make an image blurry, you can use the GaussianBlur() method of OpenCV.

GaussianBlur() uses a Gaussian kernel. The height and width of the kernel should be positive and odd.

You also have to specify the standard deviations in the X and Y directions, sigmaX and sigmaY respectively. If only sigmaX is specified, sigmaY is taken to be the same.

Consider the following example:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
blur_image = cv2.GaussianBlur(img, (7,7), 0)
 
cv2.imshow('Original Image', img)
 
cv2.imshow('Blur Image', blur_image)
 
cv2.waitKey(0)

In the above snippet, the actual image is passed to GaussianBlur() along with height and width of the kernel and the X and Y directions.

The comparison of the original and blurry image is as follows:

Median Blur

In median blurring, the median of all the pixels inside the kernel area is calculated, and the central pixel is replaced with that median value. Median blurring is used when there is salt-and-pepper noise in the image.

To apply median blurring, you can use the medianBlur() method of OpenCV.

Consider the following example where we have a salt and pepper noise in the image:

import cv2
 
img = cv2.imread("pynoise.png")
 
blur_image = cv2.medianBlur(img,5)

This applies a median blur with a kernel size of 5 to the noisy image. Now show the images:

cv2.imshow('Original Image', img)
 
cv2.imshow('Blur Image', blur_image)
 
cv2.waitKey(0)

The result will be like the following:

Another comparison of the original image and after blurring:

Detect Edges

To detect the edges in an image, you can use the Canny() method of cv2 which implements the Canny edge detector. The Canny edge detector is also known as the optimal detector.

The syntax to Canny() is as follows:

 cv2.Canny(image, minVal, maxVal)

Here minVal and maxVal are the minimum and maximum intensity gradient values respectively.

Consider the following code:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
edge_img = cv2.Canny(img,100,200)
 
cv2.imshow("Detected Edges", edge_img)
 
cv2.waitKey(0)

The output will be the following:

Here is the result of the above code on another image:

Convert image to grayscale (Black & White)

The easy way to convert an image to grayscale is to load it with the flag 0:

 img = cv2.imread("pyimg.jpg", 0)

There is another way, using cvtColor() with the COLOR_BGR2GRAY flag.

To convert a color image into a grayscale image, use the COLOR_BGR2GRAY attribute of the cv2 module. This is demonstrated in the example below:

Import the cv2 module:

 import cv2

Read the image:

 img = cv2.imread("pyimg.jpg")

Use the cvtColor() method of the cv2 module which takes the original image and the COLOR_BGR2GRAY attribute as an argument. Store the resultant image in a variable:

 gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Display the original and grayscale images:

cv2.imshow("Original Image", img)
 
cv2.imshow("Gray Scale Image", gray_img)
 
cv2.waitKey(0)

The output will be as follows:

Centroid (Center of blob) detection

To find the center of an image, the first step is to convert the original image into grayscale. We can use the cvtColor() method of cv2 as we did before.

This is demonstrated in the following code:

import cv2
 
img = cv2.imread("py.jpg")
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

We read the image and convert it to a grayscale image. The new image is stored in gray_img.

Now we have to calculate the moments of the image. Use the moments() method of cv2. In the moments() method, the grayscale image will be passed as below:

 moment = cv2.moments(gray_img)
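
The center of mass follows directly from the moments: m10/m00 gives the x coordinate and m01/m00 gives the y coordinate. A minimal sketch of that step (the names X and Y are then reused when drawing the circle below):

# Compute the centroid (center of mass) from the image moments
X = int(moment["m10"] / moment["m00"])
 
Y = int(moment["m01"] / moment["m00"])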

Finally, we have the center of the image. To highlight this center position, we can use the circle() method, which draws a circle of a given radius at the given coordinates.

The circle() method takes the image, the x and y coordinates where the circle will be drawn, the radius, the color we want the circle to be, and the thickness.

 cv2.circle(img, (X, Y), 15, (205, 114, 101), 1)

The circle is created on the image.

cv2.imshow("Center of the Image", img)
 
cv2.waitKey(0)

The original image is:

After detecting the center, our image will be as follows:

Apply a mask for a colored image

Image masking means applying another image as a mask on the original image, or changing the pixel values of the image.

To apply a mask to the image, we will use the HoughCircles() method of the OpenCV module. The HoughCircles() method detects the circles in an image. After detecting the circles, we can simply apply a mask on them.

The HoughCircles() method takes the original image, the detection method (cv2.HOUGH_GRADIENT, which uses the gradient information at the edges of the circles), and parameters related to the following circle equation:

 (x - x_center)² + (y - y_center)² = r²

In this equation, (x_center, y_center) is the center of the circle and r is its radius.

Our original image is:

After detecting circles in the image, the result will be:

Okay, so we have the circles in the image and we can apply the mask. Consider the following code:

import cv2
 
import numpy as np
 
img1 = cv2.imread('pyimg.jpg')
 
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)

Detect the circles in the image using the HoughCircles() method (see OpenCV: Hough Circle Transform):

gray_img = cv2.medianBlur(cv2.cvtColor(img1, cv2.COLOR_RGB2GRAY), 3)
 
circles = cv2.HoughCircles(gray_img, cv2.HOUGH_GRADIENT, 1, 20, param1=50, param2=50, minRadius=0, maxRadius=0)
 
circles = np.uint16(np.around(circles))

To create the mask, use np.full, which returns a NumPy array of a given shape filled with a given value:

masking = np.full((img1.shape[0], img1.shape[1]), 0, dtype=np.uint8)
 
for j in circles[0, :]:
 
    cv2.circle(masking, (j[0], j[1]), j[2], (255, 255, 255), -1)

The next step is to combine the image and the masking array we created, using the bitwise_or operator as follows:

 final_img = cv2.bitwise_or(img1, img1, mask=masking)

Display the resultant image:
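
For example, following the same imshow pattern used throughout (the window title here is illustrative):

cv2.imshow("Image with circle mask applied", final_img)
 
cv2.waitKey(0)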

Extracting text from Image (OCR)

To extract text from an image, you can use Google Tesseract-OCR. You can download it from this link.

Then you should install the pytesseract module which is a Python wrapper for Tesseract-OCR.

The image from which we will extract the text from is as follows:

Now let’s convert the text in this image to a string and display it as a string in the output:

Import the pytesseract module:

 import pytesseract

Set the path of the Tesseract-OCR executable file:

 pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

Now use the image_to_string method to convert the image into a string:

 print(pytesseract.image_to_string('pytext.png'))

The output will be as follows:

Works like a charm!

Detect and correct text skew

In this section, we will correct the text skew.

The original image is as follows:

Import the modules cv2, NumPy and read the image:

import cv2
 
import numpy as np
 
img = cv2.imread("pytext1.png")

Convert the image into a grayscale image:

 gray_img=cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Invert the grayscale image using bitwise_not:

 gray_img=cv2.bitwise_not(gray_img)

Select the x and y coordinates of the pixels greater than zero by using the column_stack method of NumPy:

 coordinates = np.column_stack(np.where(gray_img > 0))

Now we have to calculate the skew angle. We will use the minAreaRect() method of cv2 which returns an angle range from -90 to 0 degrees (where 0 is not included).

 ang=cv2.minAreaRect(coordinates)[-1]

The rotated angle of the text region is stored in the ang variable. Now we add a condition on the angle: if the text region’s angle is smaller than -45 degrees, we add 90 degrees and negate it; otherwise we simply negate the angle to make it positive.

if ang<-45:
 
    ang=-(90+ang)
 
else:
 
    ang=-ang

Calculate the center of the text region:

height, width = img.shape[:2]
 
center_img = (width / 2, height / 2)

Now that we have the skew angle, we apply getRotationMatrix2D() to get the rotation matrix and then use the warpAffine() method to rotate the image (as explained earlier).

rotationMatrix = cv2.getRotationMatrix2D(center_img, ang, 1.0)
 
rotated_img = cv2.warpAffine(img, rotationMatrix, (width, height), borderMode = cv2.BORDER_REFLECT)

Display the rotated image:

cv2.imshow("Rotated Image", rotated_img)
 
cv2.waitKey(0)

Color Detection

Let’s detect the green color from an image:

Import the modules cv2 for images and NumPy for image arrays:

import cv2
 
import numpy as np

Read the image and convert it into HSV using cvtColor():

img = cv2.imread("pydetect.png")
 
hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

Display the image:

 cv2.imshow("HSV Image", hsv_img)

Now create a NumPy array for the lower green values and the upper green values:

lower_green = np.array([34, 177, 76])
 
upper_green = np.array([255, 255, 255])

Use the inRange() method of cv2 to check if the given image array elements lie between array values of upper and lower boundaries:

 masking = cv2.inRange(hsv_img, lower_green, upper_green)

This will detect the green color.

Finally, display the original and resultant images:

 cv2.imshow("Original Image", img)

cv2.imshow("Green Color detection", masking)
 
cv2.waitKey(0)

Reduce Noise

To reduce noise from an image, OpenCV provides the following methods:

  1. fastNlMeansDenoising(): Removes noise from a grayscale image
  2. fastNlMeansDenoisingColored(): Removes noise from a colored image
  3. fastNlMeansDenoisingMulti(): Removes noise from grayscale image frames (a grayscale video)
  4. fastNlMeansDenoisingColoredMulti(): Same as 3 but works with colored frames

Let’s use fastNlMeansDenoisingColored() in our example:

Import the cv2 module and read the image:

import cv2
 
img = cv2.imread("pyn1.png")

Apply the denoising function. It takes the original image (src), the destination (None here, since we store the result in a new variable), the filter strength, the value used to remove colored noise (usually equal to the filter strength or 10), the template patch size in pixels used to compute weights (which should always be odd; the recommended size is 7), and the window size in pixels used to compute the average for a given pixel (the recommended size is 21).

 result = cv2.fastNlMeansDenoisingColored(img,None,20,10,7,21)

Display original and denoised image:

cv2.imshow("Original Image", img)
 
cv2.imshow("Denoised Image", result)
 
cv2.waitKey(0)

The output will be:

Get image contour

Contours are curves that join the continuous points along the boundary of an object in an image. Contours are commonly used to detect objects.

The original image of which we are getting the contours of is given below:

Consider the following code where we used the findContours() method to find the contours in the image:

Import cv2 module:

 import cv2

Read the image and convert it to a grayscale image:

img = cv2.imread('py1.jpg')
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Find the threshold:

 retval, thresh = cv2.threshold(gray_img, 127, 255, 0)

Use findContours(), which takes the image (we pass the thresholded image here) and some flags. See the official findContours() documentation.

 img_contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

Draw the contours on the image using drawContours() method:

  cv2.drawContours(img, img_contours, -1, (0, 255, 0))

Display the image:

cv2.imshow('Image Contours', img)
 
cv2.waitKey(0)

The result will be:

Remove Background from an image

To remove the background from an image, we will find the contours to detect edges of the main object and create a mask with np.zeros for the background and then combine the mask and the image using the bitwise_and operator.

Consider the example below:

Import the modules (NumPy and cv2):

import cv2
 
import numpy as np

Read the image and convert the image into a grayscale image:

img = cv2.imread("py.jpg")
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Find the threshold:

 _, thresh = cv2.threshold(gray_img, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

In the threshold() method, the last argument defines the style of the threshold. See the official OpenCV threshold documentation.

Find the image contours:

 img_contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2]

Sort the contours by area and select the first one with an area larger than 100:

img_contours = sorted(img_contours, key=cv2.contourArea)
 
for i in img_contours:
 
    if cv2.contourArea(i) > 100:
 
        break

Generate the mask using np.zeros:

 mask = np.zeros(img.shape[:2], np.uint8)

Draw contours:

 cv2.drawContours(mask, [i],-1, 255, -1)

Apply the bitwise_and operator:

 new_img = cv2.bitwise_and(img, img, mask=mask)

Display the original image:

 cv2.imshow("Original Image", img)

Display the resultant image:

cv2.imshow("Image with background removed", new_img)
 
cv2.waitKey(0)

Image processing is fun when using OpenCV, as you saw. I hope you found the tutorial useful. Keep coming back.

Thank you.

Introduction to Python Hex() Function for Beginners

Introduction to Python Hex() Function for Beginners

Python hex() function is used to convert any integer number ( in base 10) to the corresponding hexadecimal number. Notably, the given input should be in base 10. Python hex function is one of the built-in functions in Python3, which is used to convert an integer number into its corresponding hexadecimal form.

Python Hex Example

The hex() function converts the integer to the corresponding hexadecimal number in string form and returns it.

The input integer argument can be in any base such as binary, octal, etc. Python will take care of converting them to hexadecimal format.
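
For example, a quick sketch showing that binary and octal integer literals convert just as well:

# Binary and octal literals are still integers, so hex() handles them directly
print(hex(10))        # 0xa
print(hex(0b1010))    # 0xa (binary literal for 10)
print(hex(0o17))      # 0xf (octal literal for 15)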

Syntax

hex(number)

number: It is an integer that will be converted into hexadecimal value.
This function converts the number into the hexadecimal form, and then it returns that hexadecimal number in a string format.

Please note that the return value always starts with ‘0x’ (without quotes), which indicates that the number is in hexadecimal format.

# app.py

print("Enter the number: ")

# taking input from user
num = int(input())

# converting the number into hexadecimal form
h1 = hex(num)

# Printing hexadecimal form
print("The ", num, " in hexadecimal is: ", h1)

# Converting float number to hexadecimal form
print("\nEnter a float number")
num2 = float(input())

# converting into hexadecimal form
# for float we have to use float.hex() here
h2 = float.hex(num2)

# printing result
print("The ", num2, " in hexadecimal is: ", h2)

In the above example, we used the Python input() function to take the input from the user.

See the output.

Enter the number:
541
The  541  in hexadecimal is:  0x21d
    
Enter a float number
123.54
The  123.54  in hexadecimal is:  0x1.ee28f5c28f5c3p+6

Python hex() without 0x

See the following program.

# app.py

print("Enter the number: ")

# taking input from user
num = int(input())

# converting the number into hexadecimal form
h1 = hex(num)

# Printing hexadecimal form
# we have used string slicing here
print("The ", num, " in hexadecimal is: ", h1[2:])

# Converting float number to hexadecimal form
print("\nEnter a float number")
num2 = float(input())

# converting into hexadecimal form
h2 = float.hex(num2)

# printing result
print("The ", num2, " in hexadecimal is: ", h2[2:])

See the output.

Enter the number:
541
The  541  in hexadecimal is:  21d

Enter a float number
123.65
The  123.65  in hexadecimal is:  1.ee9999999999ap+6

In the above program, we used string slicing to print the result without the ‘0x’ prefix.

We start the slice at position 2 and go to the end of the string, i.e., h1[2:]; this means the string is printed from position 2 onward, skipping the ‘0x’ prefix.

Hexadecimal representation of float in Python

See the following program.

# app.py

numberEL = 11.21
print(numberEL, 'in hex =', float.hex(numberEL))

numberK = 19.21
print(numberK, 'in hex =', float.hex(numberK))

See the output.

➜  pyt python3 app.py
11.21 in hex = 0x1.66b851eb851ecp+3
19.21 in hex = 0x1.335c28f5c28f6p+4
➜  pyt

Python hex() with object

See the following code.

# app.py

class AI:
    id = 0

    def __index__(self):
        print('__index__() function called')
        return self.rank


stockfish = AI()
stockfish.rank = 2900

print(hex(stockfish))

In the above example, we defined the __index__() method so that the object can be used with the hex() function.

See the output.

➜  pyt python3 app.py
__index__() function called
0xb54
➜  pyt

How to convert hex string to int in Python

A hex string cannot be converted with a plain int(value) call, because int() assumes base 10 by default. You need to specify the base explicitly; otherwise, it won’t work.

See the following code.

# app.py

data = int("0xa", 16)
print(data)

With the 0x prefix, Python can work out the base automatically: pass 0 as the base and int() infers the base from the prefix. (Omitting the base argument altogether means base 10, which would fail for a prefixed string like "0xa".)

If you know the base you are converting from, simply pass it to int() along with the string. With base 16, both of the strings below, with and without the 0x prefix, convert correctly.

# app.py

hexStrA = "0xffff"
hexStrB = "ffff"

print(int(hexStrA, 16))
print(int(hexStrB, 16))

See the output.

➜  pyt python3 app.py
65535
65535
➜  pyt
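
To illustrate the prefix-guessing behavior mentioned above, passing 0 as the base also works as long as the string carries a prefix:

print(int("0xffff", 0))   # 65535, base inferred from the 0x prefix
# int("ffff", 0) would raise a ValueError because there is no prefix to infer from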

In the examples above, we used the Python int() function.

Thanks for reading!

Top 10 Books To Learn Python

Top 10 Books To Learn Python

This video on 'Top 10 Books To Learn Python' will suggest to you what we think are the best books for Python, even if you are an experienced programmer or a complete beginner. Below are the topics covered in this video:

Why Python?

  • Beginner-Level Python Books
  • Domain-Specific Python Books
  • Bonus Python Book

Links for the Python Books:

  1. Learning Python by Mark Lutz: http://bit.ly/2BR38aY
  2. Python Crash Course by Eric Matthews: http://bit.ly/2BLlJ8i
  3. Think Python by Allen Downey: http://bit.ly/2pjoXNC
  4. Python Programming by John M Zelle: http://bit.ly/31SkYon
  5. Python in a Nutshell by Alex Martelli: http://bit.ly/32UOyeh
  6. Programming Python by Mark Lutz: https://amzn.to/31Slhj1
  7. Effective Computation in Physics by Anthony Scopatz, Kathryn D. Huff: http://bit.ly/2BPD00c
  8. Python for Data Analysis by Wes McKinney: http://bit.ly/2pWCaMo
  9. Python Machine Learning by Sebastian Raschka and Vahid Mirjalili: https://amzn.to/36amOV3
  10. Django for Beginners by William S. Vincent: https://amzn.to/36lQtuG

A Complete Machine Learning Project Walk-Through in Python

A Complete Machine Learning Project Walk-Through in Python

A Complete Machine Learning Project Walk-Through in Python: Putting the machine learning pieces together; Model Selection, Hyperparameter Tuning, and Evaluation; Interpreting a machine learning model and presenting results

Reading through a data science book or taking a course, it can feel like you have the individual pieces, but don’t quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but persevering through and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.

We’ll follow the general machine learning workflow step-by-step:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Along the way, we’ll see how each step flows into the next and how to specifically implement each part in Python. The complete project is available on GitHub, with the first notebook here.

(As a note, this problem was originally given to me as an “assignment” for a job screen at a start-up. After completing the work, I was offered the job, but then the CTO of the company quit and they weren’t able to bring on any new employees. I guess that’s how things go on the start-up scene!)

Problem Definition

The first step before we get coding is to understand the problem we are trying to solve and the available data. In this project, we will work with publicly available building energy data from New York City.

The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score.

The data includes the Energy Star Score, which makes this a supervised regression machine learning task:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

We want to develop a model that is both accurate — it can predict the Energy Star Score close to the true value — and interpretable — we can understand the model predictions. Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.

Data Cleaning

Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you mtcars and iris datasets). Real-world data is messy which means we need to clean and wrangle it into an acceptable format before we can even start the analysis. Data cleaning is an un-glamorous, but necessary part of most actual data science problems.

First, we can load in the data as a Pandas DataFrame and take a look:

import pandas as pd
import numpy as np

# Read in data into a dataframe 
data = pd.read_csv('data/Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')

# Display top of dataframe
data.head()

This is a subset of the full data which contains 60 columns. Already, we can see a couple issues: first, we know that we want to predict the ENERGY STAR Score but we don’t know what any of the columns mean. While this isn’t necessarily an issue — we can often make an accurate model without any knowledge of the variables — we want to focus on interpretability, and it might be important to understand at least some of the columns.

When I originally got the assignment from the start-up, I didn’t want to ask what all the column names meant, so I looked at the name of the file,

and decided to search for “Local Law 84”. That led me to this page which explains this is an NYC law requiring all buildings of a certain size to report their energy use. More searching brought me to all the definitions of the columns. Maybe looking at a file name is an obvious place to start, but for me this was a reminder to go slow so you don’t miss anything important!

We don’t need to study all of the columns, but we should at least understand the Energy Star Score, which is described as:

A 1-to-100 percentile ranking based on self-reported energy usage for the reporting year. The Energy Star score is a relative measure used for comparing the energy efficiency of buildings.

That clears up the first problem, but the second issue is that missing values are encoded as “Not Available”. This is a string in Python, which means that even columns containing numbers will be stored with the object datatype, because Pandas converts a column containing any strings into a column of all strings. We can see the datatypes of the columns using the dataframe.info() method:

# See the column data types and non-missing values
data.info()

Sure enough, some of the columns that clearly contain numbers (such as ft²), are stored as objects. We can’t do numerical analysis on strings, so these will have to be converted to number (specifically float) data types!

Here’s a little Python code that replaces all the “Not Available” entries with np.nan (not a number), which can be treated as a numeric value, and then converts the relevant columns to the float datatype:

# Replace all occurrences of Not Available with numpy not a number
data = data.replace({'Not Available': np.nan})

# Iterate through the columns
for col in list(data.columns):
    # Select columns that should be numeric
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in 
        col or 'therms' in col or 'gal' in col or 'Score' in col):
        # Convert the data type to float
        data[col] = data[col].astype(float)

Once the correct columns are numbers, we can start to investigate the data.

Missing Data and Outliers

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let’s get a sense of how many missing values are in each column (see the notebook for code).

(To create this table, I used a function from this Stack Overflow Forum).

While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem (here is a discussion), and for this project, we will remove any columns with more than 50% missing values.
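
A minimal sketch of both steps, counting the percentage of missing values per column and dropping the columns above the 50% threshold (the variable names here are illustrative, not the notebook's exact code):

# Percentage of missing values in each column
missing_percent = data.isnull().sum() / len(data) * 100

# Drop the columns where more than 50% of the values are missing
cols_to_drop = missing_percent[missing_percent > 50].index
data = data.drop(columns=cols_to_drop)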

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

  • Below the first quartile minus 3 times the interquartile range
  • Above the third quartile plus 3 times the interquartile range

(For the code to remove the columns and the anomalies, see the notebook). At the end of the data cleaning and anomaly removal process, we are left with over 11,000 buildings and 49 features.
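
As a sketch, removing extreme outliers for a single column (Site EUI is used here for illustration) might look like the following:

# First and third quartiles and interquartile range of the Site EUI column
first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%']
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']
iqr = third_quartile - first_quartile

# Keep only the rows that fall inside the extreme-outlier fences
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]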

Exploratory Data Analysis

Now that the tedious — but necessary — step of data cleaning is complete, we can move on to exploring our data! Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.

In short, the goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find interesting parts of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Single Variable Plots

The goal is to predict the Energy Star Score (renamed to score in our data) so a reasonable place to start is examining the distribution of this variable. A histogram is a simple yet effective way to visualize the distribution of a single variable and is easy to make using matplotlib.

import matplotlib.pyplot as plt

# Histogram of the Energy Star Score
plt.style.use('fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 
plt.title('Energy Star Score Distribution');

This looks quite suspicious! The Energy Star score is a percentile rank, which means we would expect to see a uniform distribution, with each score assigned to the same number of buildings. However, a disproportionate number of buildings have either the highest, 100, or the lowest, 1, score (higher is better for the Energy Star score).

If we go back to the definition of the score, we see that it is based on “self-reported energy usage” which might explain the very high scores. Asking building owners to report their own energy usage is like asking students to report their own scores on a test! As a result, this probably is not the most objective measure of a building’s energy efficiency.

If we had an unlimited amount of time, we might want to investigate why so many buildings have very high and very low scores, which we could do by selecting these buildings and seeing what they have in common. However, our objective is only to predict the score, not to devise a better method of scoring buildings! We can make a note in our report that the scores have a suspect distribution, but our main focus is on predicting the score.

Looking for Relationships

A major part of EDA is searching for relationships between the features and the target. Variables that are correlated with the target are useful to a model because they can be used to predict the target. One way to examine the effect of a categorical variable (which takes on only a limited set of values) on the target is through a density plot using the seaborn library.

A density plot can be thought of as a smoothed histogram because it shows the distribution of a single variable. We can color a density plot by class to see how a categorical variable changes the distribution. The following code makes a density plot of the Energy Star Score colored by the type of building (limited to building types with more than 100 data points):

# Density plots use the seaborn library
import seaborn as sns

# Create a list of building types with more than 100 measurements
types = data.dropna(subset=['score'])
types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)

# Plot of distribution of scores for building categories
plt.figure(figsize=(12, 10))

# Plot each building
for b_type in types:
    # Select the building type
    subset = data[data['Largest Property Use Type'] == b_type]
    
    # Density plot of Energy Star scores
    sns.kdeplot(subset['score'].dropna(),
               label = b_type, shade = False, alpha = 0.8);
    
# label the plot
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);

We can see that the building type has a significant impact on the Energy Star Score. Office buildings tend to have a higher score while Hotels have a lower score. This tells us that we should include the building type in our modeling because it does have an impact on the target. As a categorical variable, we will have to one-hot encode the building type.

A similar plot can be used to show the Energy Star Score by borough:

The borough does not seem to have as large of an impact on the score as the building type. Nonetheless, we might want to include it in our model because there are slight differences between the boroughs.

To quantify relationships between variables, we can use the Pearson Correlation Coefficient. This is a measure of the strength and direction of a linear relationship between two variables. A score of +1 is a perfectly linear positive relationship and a score of -1 is a perfectly negative linear relationship. Several values of the correlation coefficient are shown below:

While the correlation coefficient cannot capture non-linear relationships, it is a good way to start figuring out how variables are related. In Pandas, we can easily calculate the correlations between any columns in a dataframe:

# Find all correlations with the score and sort 
correlations_data = data.corr()['score'].sort_values()

The most negative (left) and positive (right) correlations with the target:

There are several strong negative correlations between the features and the target with the most negative the different categories of EUI (these measures vary slightly in how they are calculated). The EUI — Energy Use Intensity — is the amount of energy used by a building divided by the square footage of the buildings. It is meant to be a measure of the efficiency of a building with a lower score being better. Intuitively, these correlations make sense: as the EUI increases, the Energy Star Score tends to decrease.

Two-Variable Plots

To visualize relationships between two continuous variables, we use scatterplots. We can include additional information, such as a categorical variable, in the color of the points. For example, the following plot shows the Energy Star Score vs. Site EUI colored by the building type:

This plot lets us visualize what a correlation coefficient of -0.7 looks like. As the Site EUI decreases, the Energy Star Score increases, a relationship that holds steady across the building types.
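
A sketch of how such a scatterplot could be produced (reusing the types list from the density plot; the figure size and colors are illustrative):

# Energy Star Score vs Site EUI, colored by building type
plt.figure(figsize=(10, 8))
for b_type in types:
    subset = data[data['Largest Property Use Type'] == b_type]
    plt.scatter(subset['Site EUI (kBtu/ft²)'], subset['score'], label=b_type, alpha=0.6)
plt.xlabel('Site EUI (kBtu/ft²)'); plt.ylabel('Energy Star Score')
plt.legend();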

The final exploratory plot we will make is known as the Pairs Plot. This is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we are using the seaborn visualization library and the PairGrid function to create a Pairs Plot with scatterplots on the upper triangle, histograms on the diagonal, and 2D kernel density plots and correlation coefficients on the lower triangle.

# Extract the columns to  plot
plot_data = features[['score', 'Site EUI (kBtu/ft²)', 
                      'Weather Normalized Source EUI (kBtu/ft²)', 
                      'log_Total GHG Emissions (Metric Tons CO2e)']]

# Replace the inf with nan
plot_data = plot_data.replace({np.inf: np.nan, -np.inf: np.nan})

# Rename columns 
plot_data = plot_data.rename(columns = {'Site EUI (kBtu/ft²)': 'Site EUI', 
                                        'Weather Normalized Source EUI (kBtu/ft²)': 'Weather Norm EUI',
                                        'log_Total GHG Emissions (Metric Tons CO2e)': 'log GHG Emissions'})

# Drop na values
plot_data = plot_data.dropna()

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, size = 3)

# Upper is a scatter plot
grid.map_upper(plt.scatter, color = 'red', alpha = 0.6)

# Diagonal is a histogram
grid.map_diag(plt.hist, color = 'red', edgecolor = 'black')

# Bottom is correlation and density plot
grid.map_lower(corr_func);
grid.map_lower(sns.kdeplot, cmap = plt.cm.Reds)

# Title for entire plot
plt.suptitle('Pairs Plot of Energy Data', size = 36, y = 1.02);

To see interactions between variables, we look for where a row intersects with a column. For example, to see the correlation of Weather Norm EUI with score, we look in the Weather Norm EUI row and the score column and see a correlation coefficient of -0.67. In addition to looking cool, plots such as these can help us decide which variables to include in modeling.

Feature Engineering and Selection

Feature engineering and selection often provide the greatest return on time invested in a machine learning problem. First of all, let’s define what these two tasks are:

  • Feature engineering: the process of taking raw data and extracting or creating new features, for example by transforming variables or encoding categorical data, so a model can learn a mapping from features to target
  • Feature selection: the process of choosing the most relevant features and dropping redundant or uninformative ones

A machine learning model can only learn from the data we provide it, so ensuring that data includes all the relevant information for our task is crucial. If we don’t feed a model the correct data, then we are setting it up to fail and we should not expect it to learn!

For this project, we will take the following feature engineering steps:

  • One-hot encode the categorical variables (borough and largest property use type)
  • Add the natural log transformation of the numerical variables

One-hot encoding is necessary to include categorical variables in a model. A machine learning algorithm cannot understand a building type of “office”, so we have to record it as a 1 if the building is an office and a 0 otherwise.

Adding transformed features can help our model learn non-linear relationships within the data. Taking the square root, natural log, or various powers of features is common practice in data science and can be based on domain knowledge or what works best in practice. Here we will include the natural log of all numerical features.

The following code selects the numeric features, takes log transformations of these features, selects the two categorical features, one-hot encodes these features, and joins the two sets together. This seems like a lot of work, but it is relatively straightforward in Pandas!

# Copy the original data
features = data.copy()

# Select the numeric columns
numeric_subset = data.select_dtypes('number')

# Create columns with log of numeric columns
for col in numeric_subset.columns:
    # Skip the Energy Star Score column
    if col == 'score':
        continue
    else:
        numeric_subset['log_' + col] = np.log(numeric_subset[col])
        
# Select the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]

# One hot encode
categorical_subset = pd.get_dummies(categorical_subset)

# Join the two dataframes using concat
# Make sure to use axis = 1 to perform a column bind
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

After this process we have over 11,000 observations (buildings) with 110 columns (features). Not all of these features are likely to be useful for predicting the Energy Star Score, so now we will turn to feature selection to remove some of the variables.

Feature Selection

Many of the 110 features we have in our data are redundant because they are highly correlated with one another. For example, here is a plot of Site EUI vs Weather Normalized Site EUI which have a correlation coefficient of 0.997.

Features that are strongly correlated with each other are known as collinear and removing one of the variables in these pairs of features can often help a machine learning model generalize and be more interpretable. (I should point out we are talking about correlations of features with other features, not correlations with the target, which help our model!)

There are a number of methods to calculate collinearity between features, with one of the most common being the variance inflation factor. In this project, we will use the correlation coefficient to identify and remove collinear features. We will drop one feature from a pair if the correlation coefficient between them is greater than 0.6. For the implementation, take a look at the notebook (and this Stack Overflow answer).

While this value may seem arbitrary, I tried several different thresholds, and this choice yielded the best model. Machine learning is an empirical field and is often about experimenting and finding what performs best! After feature selection, we are left with 64 total features and 1 target.
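
A minimal sketch of this kind of correlation-based filtering (the helper below is an illustration of the 0.6 threshold, not the notebook's exact implementation; it should be applied to the features only, not the target):

def remove_collinear_features(df, threshold=0.6):
    # Absolute pairwise correlations between columns
    corr_matrix = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Drop every column that is too strongly correlated with an earlier column
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)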

# Remove any columns with all na values
features  = features.dropna(axis=1, how = 'all')
print(features.shape)

(11319, 65)

Establishing a Baseline

We have now completed data cleaning, exploratory data analysis, and feature engineering. The final step to take before getting started with modeling is establishing a naive baseline. This is essentially a guess against which we can compare our results. If the machine learning models do not beat this guess, then we might have to conclude that machine learning is not acceptable for the task or we might need to try a different approach.

For regression problems, a reasonable naive baseline is to guess the median value of the target on the training set for all the examples in the test set. This sets a relatively low bar for any model to surpass.

The metric we will use is mean absolute error (mae) which measures the average absolute error on the predictions. There are many metrics for regression, but I like Andrew Ng’s advice to pick a single metric and then stick to it when evaluating models. The mean absolute error is easy to calculate and is interpretable.

Before calculating the baseline, we need to split our data into a training and a testing set:

  • Training set: the features and target we provide to the model during training so it can learn a mapping between the two
  • Testing set: a held-out portion of the data used only to evaluate how well the trained model generalizes

We will use 70% of the data for training and 30% for testing:

from sklearn.model_selection import train_test_split

# Split into 70% training and 30% testing set
X, X_test, y, y_test = train_test_split(features, targets, 
                                        test_size = 0.3, 
                                        random_state = 42)

Now we can calculate the naive baseline performance:

# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))
The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164

The naive estimate is off by about 25 points on the test set. The score ranges from 1–100, so this represents an error of 25%, quite a low bar to surpass!

Conclusions

In this article we walked through the first three steps of a machine learning problem. After defining the question, we:

  1. Cleaned and formatted the raw data
  2. Performed an exploratory data analysis to learn about the dataset
  3. Engineered a set of features using log transformations and one-hot encoding, and selected among them by removing collinear features

Finally, we also completed the crucial step of establishing a baseline against which we can judge our machine learning algorithms.

A Complete Machine Learning Walk-Through in Python (Part Two): Model Selection, Hyperparameter Tuning, and Evaluation

Model Evaluation and Selection

As a reminder, we are working on a supervised regression task: using New York City building energy data, we want to develop a model that can predict the Energy Star Score of a building. Our focus is on both accuracy of the predictions and interpretability of the model.

There are a ton of machine learning models to choose from and deciding where to start can be intimidating. While there are some charts that try to show you which algorithm to use, I prefer to just try out several and see which one works best! Machine learning is still a field driven primarily by empirical (experimental) rather than theoretical results, and it’s almost impossible to know ahead of time which model will do the best.

Generally, it’s a good idea to start out with simple, interpretable models such as linear regression, and if the performance is not adequate, move on to more complex, but usually more accurate methods. The following chart shows a (highly unscientific) version of the accuracy vs interpretability trade-off:

We will evaluate five different models covering the complexity spectrum:

  • Linear Regression
  • K-Nearest Neighbors Regression
  • Random Forest Regression
  • Gradient Boosted Regression
  • Support Vector Machine Regression

In this post we will focus on implementing these methods rather than the theory behind them. For anyone interested in learning the background, I highly recommend An Introduction to Statistical Learning (available free online) or Hands-On Machine Learning with Scikit-Learn and TensorFlow. Both of these textbooks do a great job of explaining the theory and showing how to effectively use the methods in R and Python respectively.

Imputing Missing Values

While we dropped the columns with more than 50% missing values when we cleaned the data, there are still quite a few missing observations. Machine learning models cannot deal with any absent values, so we have to fill them in, a process known as imputation.

First, we’ll read in all the data and remind ourselves what it looks like:

import pandas as pd
import numpy as np

# Read in data into dataframes 
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')

Training Feature Size:  (6622, 64)
Testing Feature Size:   (2839, 64)
Training Labels Size:   (6622, 1)
Testing Labels Size:    (2839, 1)

Every value that is NaN represents a missing observation. While there are a number of ways to fill in missing data, we will use a relatively simple method, median imputation. This replaces all the missing values in a column with the median value of the column.

In the following code, we create a Scikit-Learn Imputer object with the strategy set to median. We then train this object on the training data (using imputer.fit) and use it to fill in the missing values in both the training and testing data (using imputer.transform). This means missing values in the test data are filled in with the corresponding median value from the training data.

(We have to do imputation this way rather than training on all the data to avoid the problem of test data leakage, where information from the testing dataset spills over into the training data.)

from sklearn.preprocessing import Imputer
# (in newer versions of Scikit-Learn, use sklearn.impute.SimpleImputer instead)

# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(train_features)

# Transform both training data and testing data
X = imputer.transform(train_features)
X_test = imputer.transform(test_features)

Missing values in training features:  0
Missing values in testing features:   0

All of the features now have real, finite values with no missing examples.

Feature Scaling

Scaling refers to the general process of changing the range of a feature. This is necessary because features are measured in different units, and therefore cover different ranges. Methods such as support vector machines and K-nearest neighbors that take into account distance measures between observations are significantly affected by the range of the features and scaling allows them to learn. While methods such as Linear Regression and Random Forest do not actually require feature scaling, it is still best practice to take this step when we are comparing multiple algorithms.

We will scale the features by putting each one in a range between 0 and 1. This is done by taking each value of a feature, subtracting the minimum value of the feature, and dividing by the maximum minus the minimum (the range). This specific version of scaling is often called normalization and the other main version is known as standardization.
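
In NumPy terms, that normalization step is just the following (a sketch; the MinMaxScaler used below does the same thing while remembering the training-set minima and maxima):

# Column-wise min-max normalization: every feature ends up in the range [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))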

While this process would be easy to implement by hand, we can do it using a MinMaxScaler object in Scikit-Learn. The code for this method is identical to that for imputation except with a scaler instead of imputer! Again, we make sure to train only using training data and then transform all the data.

from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)

Every feature now has a minimum value of 0 and a maximum value of 1. Missing value imputation and feature scaling are two steps required in nearly any machine learning pipeline so it’s a good idea to understand how they work!

Implementing Machine Learning Models in Scikit-Learn

After all the work we spent cleaning and formatting the data, actually creating, training, and predicting with the models is relatively simple. We will use the Scikit-Learn library in Python, which has great documentation and a consistent model building syntax. Once you know how to make one model in Scikit-Learn, you can quickly implement a diverse range of algorithms.

We can illustrate one example of model creation, training (using .fit), and testing (using .predict) with the Gradient Boosting Regressor:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model
gradient_boosted = GradientBoostingRegressor()

# Fit the model on the training data
gradient_boosted.fit(X, y)

# Make predictions on the test data
predictions = gradient_boosted.predict(X_test)

# Evaluate the model
mae = np.mean(abs(predictions - y_test))

print('Gradient Boosted Performance on the test set: MAE = %0.4f' % mae)
Gradient Boosted Performance on the test set: MAE = 10.0132

Model creation, training, and testing are each one line! To build the other models, we use the same syntax, with the only change being the name of the algorithm. The results are presented below:
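
A sketch of what that comparison loop might look like (the model list and names are illustrative; mae, X, y, X_test, and y_test come from the earlier steps):

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

models = {'Linear Regression': LinearRegression(),
          'K-Nearest Neighbors': KNeighborsRegressor(),
          'Random Forest': RandomForestRegressor(random_state=42),
          'Support Vector Machine': SVR()}

for name, model in models.items():
    model.fit(X, np.ravel(y))
    predictions = model.predict(X_test)
    print('%s Performance on the test set: MAE = %0.4f' % (name, mae(np.ravel(y_test), predictions)))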

To put these figures in perspective, the naive baseline calculated using the median value of the target was 24.5. Clearly, machine learning is applicable to our problem because of the significant improvement over the baseline!

The gradient boosted regressor (MAE = 10.013) slightly beats out the random forest (10.014 MAE). These results aren’t entirely fair because we are mostly using the default values for the hyperparameters. Especially in models such as the support vector machine, the performance is highly dependent on these settings. Nonetheless, from these results we will select the gradient boosted regressor for model optimization.

Hyperparameter Tuning for Model Optimization

In machine learning, after we have selected a model, we can optimize it for our problem by tuning the model hyperparameters.

First off, what are hyperparameters and how do they differ from parameters?

  • Model hyperparameters: settings chosen by the data scientist before training, such as the number of trees in a forest or the learning rate, which control how the algorithm learns
  • Model parameters: values the algorithm learns from the data during training, such as the coefficients in a linear regression

Controlling the hyperparameters affects the model performance by altering the balance between underfitting and overfitting in a model. Underfitting is when our model is not complex enough (it does not have enough degrees of freedom) to learn the mapping from features to target. An underfit model has high bias, which we can correct by making our model more complex.

Overfitting is when our model essentially memorizes the training data. An overfit model has high variance, which we can correct by limiting the complexity of the model through regularization. Both an underfit and an overfit model will not be able to generalize well to the testing data.

The problem with choosing the right hyperparameters is that the optimal set will be different for every machine learning problem! Therefore, the only way to find the best settings is to try out a number of them on each new dataset. Luckily, Scikit-Learn has a number of methods to allow us to efficiently evaluate hyperparameters. Moreover, projects such as TPOT by Epistasis Lab are trying to optimize the hyperparameter search using methods like genetic programming. In this project, we will stick to doing this with Scikit-Learn, but stay tuned for more work on the auto-ML scene!

Random Search with Cross Validation

The particular hyperparameter tuning method we will implement is called random search with cross validation:

  • Random search: the method used to pick hyperparameter values. We define a grid of settings and then, rather than exhaustively trying every combination (grid search), we randomly sample a fixed number of combinations, which is usually much cheaper for nearly the same result
  • Cross validation: the method used to evaluate each combination. Instead of a single fixed validation set, we split the training data into K folds and train and evaluate K times, each time holding out a different fold, then average the results

The idea of K-Fold cross validation with K = 5 is shown below:

The entire process of performing random search with cross validation is:

  1. Set up a grid of hyperparameter values
  2. Randomly sample a combination of settings from the grid
  3. Train and evaluate a model with those settings using K-Fold cross validation
  4. Repeat for a set number of iterations and keep the combination with the best cross validation score

Of course, we don’t actually do this manually, but rather let Scikit-Learn’s RandomizedSearchCV handle all the work!

Slight Diversion: Gradient Boosted Methods

Since we will be using the Gradient Boosted Regression model, I should give at least a little background! This model is an ensemble method, meaning that it is built out of many weak learners, in this case individual decision trees. While a bagging algorithm such as random forest trains the weak learners in parallel and has them vote to make a prediction, a boosting method like Gradient Boosting, trains the learners in sequence, with each learner “concentrating” on the mistakes made by the previous ones.

Boosting methods have become popular in recent years and frequently win machine learning competitions. The Gradient Boosting Method is one particular implementation that uses Gradient Descent to minimize the cost function by sequentially training learners on the residuals of previous ones. The Scikit-Learn implementation of Gradient Boosting is generally regarded as less efficient than other libraries such as XGBoost , but it will work well enough for our small dataset and is quite accurate.

Back to Hyperparameter Tuning

There are many hyperparameters to tune in a Gradient Boosted Regressor and you can look at the Scikit-Learn documentation for the details. We will optimize the following hyperparameters:

  • loss: the loss function to minimize
  • n_estimators: the number of weak learners (decision trees) used
  • max_depth: the maximum depth of each decision tree
  • min_samples_leaf: the minimum number of examples required at a leaf node
  • min_samples_split: the minimum number of examples required to split a node
  • max_features: the maximum number of features considered when making splits

I’m not sure if there is anyone who truly understands how all of these interact, and the only way to find the best combination is to try them out!

In the following code, we build a hyperparameter grid, create a RandomizedSearchCV object, and perform hyperparameter search using 4-fold cross validation over 25 different combinations of hyperparameters:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Loss function to be optimized
loss = ['ls', 'lad', 'huber']

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

# Create the model to use for hyperparameter tuning
model = GradientBoostingRegressor(random_state = 42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25, 
                               scoring = 'neg_mean_absolute_error',
                               n_jobs = -1, verbose = 1, 
                               return_train_score = True,
                               random_state=42)

# Fit on the training data
random_cv.fit(X, y)

After performing the search, we can inspect the RandomizedSearchCV object to find the best model:

# Find the best combination of settings
random_cv.best_estimator_

GradientBoostingRegressor(loss='lad', max_depth=5,
                          max_features=None,
                          min_samples_leaf=6,
                          min_samples_split=6,
                          n_estimators=500)

We can then use these results to perform grid search by choosing parameters for our grid that are close to these optimal values. However, further tuning is unlikely to significantly improve our model. As a general rule, proper feature engineering will have a much larger impact on model performance than even the most extensive hyperparameter tuning. It’s the law of diminishing returns applied to machine learning: feature engineering gets you most of the way there, and hyperparameter tuning generally only provides a small benefit.
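
As a sketch of what that follow-up grid search might look like (the grid values here are illustrative), we can keep the other tuned hyperparameters fixed and search only over the number of trees:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Search only over n_estimators, holding the other tuned values fixed
trees_grid = {'n_estimators': [400, 500, 600, 700, 800, 900]}

tuned_model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                        min_samples_leaf=6, min_samples_split=6,
                                        random_state=42)

grid_search = GridSearchCV(estimator=tuned_model, param_grid=trees_grid,
                           cv=4, scoring='neg_mean_absolute_error',
                           n_jobs=-1, verbose=1)
grid_search.fit(X, y)
print(grid_search.best_params_)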

One experiment we can try is to change the number of estimators (decision trees) while holding the rest of the hyperparameters steady. This directly lets us observe the effect of this particular setting. See the notebook for the implementation, but here are the results:

As the number of trees used by the model increases, both the training and the testing error decrease. However, the training error decreases much more rapidly than the testing error and we can see that our model is overfitting: it performs very well on the training data, but is not able to achieve that same performance on the testing set.
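
The notebook code isn’t reproduced here, but a sketch of what that experiment might look like is below: vary only the number of trees and record the training and testing error at each setting.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

train_error, test_error = [], []
tree_counts = [100, 200, 400, 600, 800, 1000]

for n in tree_counts:
    m = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6,
                                  n_estimators=n, random_state=42)
    m.fit(X, y)
    train_error.append(mean_absolute_error(y, m.predict(X)))
    test_error.append(mean_absolute_error(y_test, m.predict(X_test)))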

We always expect at least some decrease in performance on the testing set (after all, the model can see the true answers for the training set), but a significant gap indicates overfitting. We can address overfitting by getting more training data or by decreasing the complexity of our model through the hyperparameters. In this case, we will leave the hyperparameters where they are, but I encourage anyone to try to reduce the overfitting.

For the final model, we will use 800 estimators because that resulted in the lowest error in cross validation. Now, time to test out this model!

Evaluating on the Test Set

As responsible machine learning engineers, we made sure to not let our model see the test set at any point of training. Therefore, we can use the test set performance as an indicator of how well our model would perform when deployed in the real world.

Making predictions on the test set and calculating the performance is relatively straightforward. Here, we compare the performance of the default Gradient Boosted Regressor to the tuned model:

# Make predictions on the test set using default and final model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

Default model performance on the test set: MAE = 10.0118.
Final model performance on the test set:   MAE = 9.0446.
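
For reference, the MAE figures above can be computed with Scikit-Learn’s mean_absolute_error (assuming y_test holds the true test-set scores):

from sklearn.metrics import mean_absolute_error

print('Default model performance on the test set: MAE = %0.4f.' %
      mean_absolute_error(y_test, default_pred))
print('Final model performance on the test set:   MAE = %0.4f.' %
      mean_absolute_error(y_test, final_pred))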

Hyperparameter tuning reduced the test-set error of the model by about 10%. Depending on the use case, 10% could be a massive improvement, but it came at a significant time investment!

We can also time how long it takes to train the two models using the %timeit magic command in Jupyter Notebooks. First is the default model:

%%timeit -n 1 -r 5
default_model.fit(X, y)

1.09 s ± 153 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

1 second to train seems very reasonable. The final tuned model is not so fast:

%%timeit -n 1 -r 5
final_model.fit(X, y)

12.1 s ± 1.33 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

This demonstrates a fundamental aspect of machine learning: it is always a game of trade-offs. We constantly have to balance accuracy vs interpretability, bias vs variance, accuracy vs run time, and so on. The right blend will ultimately depend on the problem. In our case, a 12 times increase in run-time is large in relative terms, but in absolute terms it’s not that significant.

Once we have the final predictions, we can investigate them to see if they exhibit any noticeable skew. On the left is a density plot of the predicted and actual values, and on the right is a histogram of the residuals:
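
The figure itself isn’t reproduced here, but a minimal sketch of how those two plots can be generated (using matplotlib and seaborn, with final_pred and y_test from above) is:

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Density plot of predicted vs actual Energy Star Scores
sns.kdeplot(final_pred, label='Predictions', ax=axes[0])
sns.kdeplot(y_test, label='Actual values', ax=axes[0])
axes[0].set_xlabel('Energy Star Score')
axes[0].legend()

# Histogram of the residuals (prediction - actual)
axes[1].hist(final_pred - y_test, bins=20, edgecolor='black')
axes[1].set_xlabel('Residual')

plt.show()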

The model predictions seem to follow the distribution of the actual values, although the peak in the density occurs closer to the median value (66) on the training set than to the true peak in density (which is near 100). The residuals are nearly normally distributed, although we see a few large negative values where the model predictions were far below the true values.

Conclusions

In this article we covered several steps in the machine learning workflow:

  • Compared several machine learning models on a performance metric
  • Performed hyperparameter tuning on the best model using random search with cross validation
  • Evaluated the best model on the testing set

The results of this work showed us that machine learning is applicable to the task of predicting a building’s Energy Star Score using the available data. Using a gradient boosted regressor we were able to predict the scores on the test set to within 9.1 points of the true value. Moreover, we saw that hyperparameter tuning can increase the performance of a model at a significant cost in terms of time invested. This is one of many trade-offs we have to consider when developing a machine learning solution.

A Complete Machine Learning Walk-Through in Python (Part Three): Interpreting a machine learning model and presenting results

As a reminder, we are working through a supervised regression machine learning problem. Using New York City building energy data, we have developed a model which can predict the Energy Star Score of a building. The final model we built is a Gradient Boosted Regressor which is able to predict the Energy Star Score on the test data to within 9.1 points (on a 1–100 scale).

Model Interpretation

The gradient boosted regressor sits somewhere in the middle on the scale of model interpretability: the entire model is complex, but it is made up of hundreds of decision trees, which by themselves are quite understandable. We will look at three ways to understand how our model makes predictions:

  1. Feature importances
  2. Visualizing a single decision tree
  3. LIME: Local Interpretable Model-Agnostic Explanations

The first two methods are specific to ensembles of trees, while the third — as you might have guessed from the name — can be applied to any machine learning model. LIME is a relatively new package and represents an exciting step in the ongoing effort to explain machine learning predictions.

Feature Importances

Feature importances attempt to show the relevance of each feature to the task of predicting the target. The technical details of feature importances are complex (they measure the mean decrease in impurity, or the reduction in error from including the feature), but we can use the relative values to compare which features are the most relevant. In Scikit-Learn, we can extract the feature importances from any ensemble of tree-based learners.

With model as our trained model, we can find the feature importances using model.feature_importances_. Then we can put them into a pandas DataFrame and display or plot the top ten most important:

import pandas as pd

# model is the trained model
importances = model.feature_importances_

# train_features is the dataframe of training features
feature_list = list(train_features.columns)

# Extract the feature importances into a dataframe
feature_results = pd.DataFrame({'feature': feature_list, 
                                'importance': importances})

# Show the top 10 most important
feature_results = feature_results.sort_values('importance', 
                                              ascending = False).reset_index(drop=True)

feature_results.head(10)
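
To plot rather than just display the top ten, a short matplotlib sketch using the feature_results dataframe built above could be:

import matplotlib.pyplot as plt

top_ten = feature_results.head(10)

# Horizontal bar chart, most important feature at the top
plt.barh(top_ten['feature'][::-1], top_ten['importance'][::-1])
plt.xlabel('Relative importance')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()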

The Site EUI (Energy Use Intensity) and the Weather Normalized Site Electricity Intensity are by far the most important features, accounting for over 66% of the total importance. After the top two features, the importance drops off significantly, which indicates we might not need to retain all 64 features in the data to achieve high performance. (In the Jupyter notebook, I take a look at using only the top 10 features and discover that the model is not quite as accurate.)
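
That notebook experiment isn’t reproduced here, but a sketch of it might look like the following, assuming X, X_test, y, and y_test are NumPy arrays whose columns follow the order of feature_list:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Indices of the ten most important features
top_features = list(feature_results['feature'][:10])
top_idx = [feature_list.index(f) for f in top_features]

# Retrain on the reduced feature set and compare the test-set error
reduced_model = GradientBoostingRegressor(random_state=42)
reduced_model.fit(X[:, top_idx], y)
reduced_pred = reduced_model.predict(X_test[:, top_idx])
print('Top-10 feature model MAE: %0.4f' % mean_absolute_error(y_test, reduced_pred))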

Based on these results, we can finally answer one of our initial questions: the most important indicators of a building’s Energy Star Score are the Site EUI and the Weather Normalized Site Electricity Intensity. While we do want to be careful about reading too much into the feature importances, they are a useful way to start to understand how the model makes its predictions.

Visualizing a Single Decision Tree

While the entire gradient boosting regressor may be difficult to understand, any one individual decision tree is quite intuitive. We can visualize any tree in the forest using the Scikit-Learn function [export_graphviz](http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html). We first extract a tree from the ensemble and then save it as a dot file:

from sklearn import tree

# Extract a single tree (number 105)
single_tree = model.estimators_[105][0]

# Save the tree to a dot file
tree.export_graphviz(single_tree, out_file = 'images/tree.dot', 
                     feature_names = feature_list)

Using the Graphviz visualization software we can convert the dot file to a png from the command line:

dot -Tpng images/tree.dot -o images/tree.png

The result is a complete decision tree:

This is a little overwhelming! Even though this tree only has a depth of 6 (the number of layers), it’s difficult to follow. We can modify the call to export_graphviz and limit our tree to a more reasonable depth of 2:
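
That modified call might look something like this (the output file name is just an example); export_graphviz accepts a max_depth argument as well as options to color and round the nodes:

# Export only the top two levels of the same tree for a readable visualization
tree.export_graphviz(single_tree, out_file = 'images/tree_small.dot',
                     feature_names = feature_list,
                     max_depth = 2, filled = True, rounded = True,
                     precision = 2)

dot -Tpng images/tree_small.dot -o images/tree_small.png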

Each node (box) in the tree has four pieces of information:

  1. The question asked about one feature of the data point, which determines whether the data point goes left or right out of the node
  2. A measure of the node’s error (the split criterion, such as the mean squared error) for the samples it contains
  3. The number of samples (data points) in the node
  4. The value, the node’s estimate of the target for all the samples it contains

(Leaf nodes only have 2.–4. because they represent the final estimate and do not have any children).

A decision tree makes a prediction for a data point by starting at the top node, called the root, and working its way down through the tree. At each node, a yes-or-no question is asked of the data point. For example, the question for the node above is: Does the building have a Site EUI less than or equal to 68.95? If the answer is yes, the building is placed in the right child node, and if the answer is no, the building goes to the left child node.

This process is repeated at each layer of the tree until the data point is placed in a leaf node, at the bottom of the tree (the leaf nodes are cropped from the small tree image). The prediction for all the data points in a leaf node is the value. If there are multiple data points (samples) in a leaf node, they all get the same prediction. As the depth of the tree is increased, the error on the training set will decrease because there are more leaf nodes and the examples can be more finely divided. However, a tree that is too deep will overfit to the training data and will not be able to generalize to new testing data.

In the second article, we tuned a number of the model hyperparameters, which control aspects of each tree such as the maximum depth of the tree and the minimum number of samples required in a leaf node. These both have a significant impact on the balance of under vs over-fitting, and visualizing a single decision tree allows us to see how these settings work.

Although we cannot examine every tree in the model, looking at one lets us understand how each individual learner makes a prediction. This flowchart-based method seems much like how a human makes decisions, answering one question about a single value at a time. Decision-tree-based ensembles combine the predictions of many individual decision trees in order to create a more accurate model with less variance. Ensembles of trees tend to be very accurate, and also are intuitive to explain.

Local Interpretable Model-Agnostic Explanations (LIME)

The final tool we will explore for trying to understand how our model “thinks” is a new entry into the field of model explanations. LIME aims to explain a single prediction from any machine learning model by creating an approximation of the model locally near the data point using a simple model such as linear regression (the full details can be found in the paper).

Here we will use LIME to examine a prediction the model gets completely wrong to see what it might tell us about why the model makes mistakes.

First we need to find the observation our model gets most wrong. We do this by training and predicting with the model and extracting the example on which the model has the greatest error:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Create the model with the best hyperparamters
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6, 
                                  n_estimators=800, random_state=42)

# Fit and test on the features
model.fit(X, y)
model_pred = model.predict(X_test)

# Find the residuals
residuals = abs(model_pred - y_test)
    
# Extract the most wrong prediction
wrong = X_test[np.argmax(residuals), :]

print('Prediction: %0.4f' % model_pred[np.argmax(residuals)])
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])

Prediction: 12.8615
Actual Value: 100.0000

Next, we create the LIME explainer object passing it our training data, the mode, the training labels, and the names of the features in our data. Finally, we ask the explainer object to explain the wrong prediction, passing it the observation and the prediction function.

import lime
import lime.lime_tabular

# Create a lime explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(training_data = X, 
                                                   mode = 'regression',
                                                   training_labels = y,
                                                   feature_names = feature_list)


# Explanation for wrong prediction
exp = explainer.explain_instance(data_row = wrong, 
                                 predict_fn = model.predict)

# Plot the prediction explanation
exp.as_pyplot_figure();

The plot explaining this prediction is below:

Here’s how to interpret the plot: Each entry on the y-axis indicates one value of a variable, and the red and green bars show the effect this value has on the prediction. For example, the top entry says the Site EUI is greater than 95.90, which subtracts about 40 points from the prediction. The second entry says the Weather Normalized Site Electricity Intensity is less than 3.80, which adds about 10 points to the prediction. The final prediction is an intercept term plus the sum of each of these individual contributions.

We can get another look at the same information by calling the explainer’s .show_in_notebook() method:

# Show the explanation in the Jupyter Notebook
exp.show_in_notebook()

This shows the reasoning process of the model on the left by displaying the contributions of each variable to the prediction. The table on the right shows the actual values of the variables for the data point.

For this example, the model prediction was about 12 and the actual value was 100! While initially this prediction may be puzzling, looking at the explanation, we can see this was not an extreme guess, but a reasonable estimate given the values for the data point. The Site EUI was relatively high and we would expect the Energy Star Score to be low (because EUI is strongly negatively correlated with the score), a conclusion shared by our model. In this case, the logic was faulty because the building had a perfect score of 100.

It can be frustrating when a model is wrong, but explanations such as these help us to understand why the model is incorrect. Moreover, based on the explanation, we might want to investigate why the building has a perfect score despite such a high Site EUI. Perhaps we can learn something new about the problem that would have escaped us without investigating the model. Tools such as this are not perfect, but they go a long way towards helping us understand the model which in turn can allow us to make better decisions.

Documenting Work and Reporting Results

An often overlooked part of any technical project is documentation and reporting. We can do the best analysis in the world, but if we do not clearly communicate the results, then they will not have any impact!

When we document a data science project, we take all the versions of the data and code and package them so that our project can be reproduced or built on by other data scientists. It’s important to remember that code is read more often than it is written, and we want to make sure our work is understandable both for others and for ourselves if we come back to it a few months later. This means putting helpful comments in the code and explaining your reasoning. I find Jupyter Notebooks to be a great tool for documentation because they allow for explanations and code one after the other.

Jupyter Notebooks can also be a good platform for communicating findings to others. Using notebook extensions, we can hide the code from our final report, because although it’s hard to believe, not everyone wants to see a bunch of Python code in a document!
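
One way to do this without a notebook extension is nbconvert’s --no-input flag (available in recent versions of nbconvert), which strips the code cells from the exported report; the notebook name here is just a placeholder:

jupyter nbconvert --to html --no-input final_report.ipynb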

Personally, I struggle with succinctly summarizing my work because I like to go through all the details. However, it’s important to understand your audience when you are presenting and tailor the message accordingly. With that in mind, here is my 30-second takeaway from the project:

  1. Using the New York City energy data, a machine learning model can predict the Energy Star Score of a building to within about 9.1 points of the true value.
  2. The Site EUI and the Weather Normalized Site Electricity Intensity are the most important factors for predicting the Energy Star Score.

Originally, I was given this project as a job-screening “assignment” by a start-up. For the final report, they wanted to see both my work and my conclusions, so I developed a Jupyter Notebook to turn in. However, instead of converting directly to PDF in Jupyter, I converted it to a LaTeX .tex file that I then edited in TeXstudio before rendering to a PDF for the final version. The default PDF output from Jupyter has a decent appearance, but it can be significantly improved with a few minutes of editing. Moreover, LaTeX is a powerful document preparation system and it’s good to know the basics.
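
For reference, nbconvert can produce the .tex file directly (again, the notebook name is a placeholder):

jupyter nbconvert --to latex final_report.ipynb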

At the end of the day, our work is only as valuable as the decisions it enables, and being able to present results is a crucial skill. Furthermore, by properly documenting work, we allow others to reproduce our results, give us feedback so we can become better data scientists, and build on our work for the future.

Conclusions

Throughout this series of posts, we’ve walked through a complete end-to-end machine learning project. We started by cleaning the data, moved into model building, and finally looked at how to interpret a machine learning model. As a reminder, the general structure of a machine learning project is below:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

While the exact steps vary by project, and machine learning is often an iterative rather than linear process, this guide should serve you well as you tackle future machine learning projects. I hope this series has given you confidence to be able to implement your own machine learning solutions, but remember, none of us do this by ourselves! If you want any help, there are many incredibly supportive communities where you can look for advice.

A few resources I have found helpful throughout my learning process are listed under Further reading at the end of this post.

*Originally published at https://towardsdatascience.com*

Thanks for reading


Further reading

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Machine Learning, Data Science and Deep Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

A Complete Machine Learning Project Walk-Through in Python

Machine Learning: how to go from Zero to Hero

Top 18 Machine Learning Platforms For Developers

10 Amazing Articles On Python Programming And Machine Learning

100+ Basic Machine Learning Interview Questions and Answers