Python for Financial Analysis and Algorithmic Trading

This Python for Finance tutorial introduces you to financial analysis, algorithmic trading, and much more.

Getting Started With Python for Finance

Before you go into trading strategies, it’s a good idea to get the hang of the basics first. This first part of the tutorial will focus on explaining the Python basics that you need to get started.

Stocks & Trading

When a company wants to grow, undertake new projects or expand, it can issue stock to raise capital. A stock represents a share in the ownership of a company and is issued in return for money. Stocks are bought and sold: buyers and sellers trade existing, previously issued shares. The price at which a stock trades can move independently of the company’s success: prices instead reflect supply and demand. This means that whenever a stock is considered desirable, whether because of success, popularity or other factors, its price will go up.

Note that stocks are not the same as bonds: with bonds, companies raise money through borrowing, either as a loan from a bank or by issuing debt.

As you just read, buying and selling, or trading, is essential when you’re talking about stocks, but it is certainly not limited to them: trading is the act of buying or selling an asset, which could be a financial security, such as a stock or a bond, or a tangible product, such as gold or oil.

Stock trading is then the process in which the cash paid for stocks is converted into a share in the ownership of a company, which can later be converted back into cash by selling, hopefully at a profit. To achieve a profitable return, you either go long or short in markets: you either buy shares thinking that the stock price will go up so that you can sell at a higher price in the future, or you sell your stock, expecting that you can buy it back at a lower price and realize a profit. When you follow a fixed plan to go long or short in markets, you have a trading strategy.

Developing a trading strategy is something that goes through a couple of phases, just like when you, for example, build machine learning models: you formulate a strategy and specify it in a form that you can test on your computer, you do some preliminary testing or backtesting, you optimize your strategy and lastly, you evaluate the performance and robustness of your strategy.

Trading strategies are usually verified by backtesting: you reconstruct, with historical data, trades that would have occurred in the past using the rules that are defined with the strategy that you have developed. This way, you can get an idea of the effectiveness of your strategy, and you can use it as a starting point to optimize and improve your strategy before applying it to real markets. Of course, this all relies heavily on the underlying theory or belief that any strategy that has worked out well in the past will likely also work out well in the future, and that any strategy that has performed poorly in the past will probably also do badly in the future.

Time Series Data

A time series is a sequence of numerical data points taken at successive equally spaced points in time. In investing, a time series tracks the movement of the chosen data points, such as the stock price, over a specified period of time with data points recorded at regular intervals. If you’re still in doubt about what this would exactly look like, take a look at the following example:

[Figure: example time series plot of a stock price over time]

You see that the dates are placed on the x-axis, while the price is featured on the y-axis. The “successive equally spaced points in time” in this case means that the days featured on the x-axis are spaced 14 days apart: note, for example, the difference between 4/5/2005 and the next point, 4/19/2005.

However, what you’ll often see when you’re working with stock data is not just two columns containing period and price observations; most of the time, you’ll have five columns that contain the period and the opening, high, low and closing prices for that period. This means that, if your period is set at a daily level, the observations for that day will give you an idea of the opening and closing price for that day and the extreme high and low price movements for a particular stock during that day.

For now, you have an idea of the basic concepts that you need to know to go through this tutorial. These concepts will come back soon enough, and you’ll learn more about them later on in this tutorial.

Setting Up The Workspace

Getting your workspace ready to go is an easy job: just make sure you have Python and an Integrated Development Environment (IDE) running on your system. However, there are some ways in which you can get started that are maybe a little easier when you’re just starting out.

Take for instance Anaconda, a high-performance distribution of Python and R that includes over 100 of the most popular Python, R and Scala packages for data science. Additionally, installing Anaconda gives you access to over 720 packages that can easily be installed with conda, the package, dependency and environment manager that is included in Anaconda. And, besides all that, you’ll get the Jupyter Notebook and the Spyder IDE with it.

That sounds like a good deal, right?

You can install Anaconda from here, and don’t forget to check out how to set up your Jupyter Notebook.

Of course, Anaconda is not your only option: you can also check out the Canopy Python distribution (which doesn’t come free), or try out the Quant Platform.

The latter offers you a couple of additional advantages over using, for example, Jupyter or the Spyder IDE, since it provides everything you need specifically to do financial analytics in your browser! With the Quant Platform, you’ll gain access to GUI-based financial engineering, interactive Python-based financial analytics and your own Python-based analytics library. What’s more, you’ll also have access to a forum where you can discuss solutions or questions with peers!

Python Basics For Finance: Pandas

When you’re using Python for finance, you’ll often find yourself using the data manipulation package Pandas. Other packages, such as NumPy, SciPy and Matplotlib, will also come up once you start digging deeper.

For now, let’s focus on Pandas and using it to analyze time series data. This section will explain how you can import data, explore and manipulate it with Pandas. On top of all of that, you’ll learn how you can perform common financial analyses on the data that you imported.

Importing Financial Data Into Python

The pandas-datareader package allows for reading in data from sources such as Google, the World Bank and more. If you want an up-to-date list of the data sources that are made available with this function, go to the documentation. You used to be able to access data from Yahoo! Finance directly, but that access has since been deprecated; check out this video by Matt Macarty that shows a workaround. For this tutorial, you will use the package to read in data from Yahoo! Finance. Make sure to install the package first by installing the latest release version via pip with pip install pandas-datareader.

Tip: if you want to install the latest development version or if you experience any issues, you can read up on the installation instructions here.

import pandas_datareader as pdr
import datetime 
aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2006, 10, 1), 
                          end=datetime.datetime(2012, 1, 1))

Note that the Yahoo API endpoint has recently changed and that, if you want to start working with the library on your own, you’ll need to install a temporary fix until the patch has been merged into the master branch before you can pull in data from Yahoo! Finance with pandas-datareader. Make sure to read up on the issue here before you start on your own!

No worries, though, for this tutorial, the data has been loaded in for you so that you don’t face any issues while learning about finance in Python with Pandas.

It’s wise to consider, though, that even though pandas-datareader offers a lot of options to pull data into Python, it isn’t the only package that you can use to pull in financial data: you can also make use of a library such as Quandl, for example, to get data from Quandl’s WIKI database:

import quandl 
aapl = quandl.get("WIKI/AAPL", start_date="2006-10-01", end_date="2012-01-01")

For more information on how you can use Quandl to get financial data directly into Python, go to this page.

Lastly, if you’ve already been working in finance for a while, you’ll probably know that Excel is also often used to manipulate data. In such cases, you should know that you can integrate Python with Excel.

Working With Time Series Data

The first thing that you want to do when you finally have the data in your workspace is getting your hands dirty. However, now that you’re working with time series data, this might not seem as straightforward, since your index now contains DateTime values.

No worries, though! Let’s start step-by-step and explore the data first with some functions that you might already know if you have some prior programming experience with R or if you’ve previously worked with Pandas.

Either way, you’ll see it’s pretty straightforward!

As you saw in the code chunk above, you have used pandas_datareader to import data into your workspace. The resulting object aapl is a DataFrame, which is a 2-dimensional labeled data structure with columns of potentially different types. Now, one of the first things that you probably do when you have a regular DataFrame on your hands, is running the head() and tail() functions to take a peek at the first and the last rows of your DataFrame. Luckily, this doesn’t change when you’re working with time series data!

Tip: also make sure to use the describe() function to get some useful summary statistics about your data.

Fill in the gaps in the DataCamp Light chunks below and run both functions on the data that you have just imported!

script.py

# Return first rows of `aapl`
aapl.____()

# Return last rows of `aapl`
aapl.____()

# Describe `aapl`
aapl._______()

solution.py

# Return first rows of `aapl`
aapl.head()

# Return last rows of `aapl`
aapl.tail()

# Describe `aapl`
aapl.describe()

As you saw in the introduction, this data contains four columns with the opening and closing prices per day and the extreme high and low price movements for the Apple stock for each day. Additionally, you also get two extra columns: Volume and Adj Close.

The former column is used to register the number of shares that were traded during a single day. The latter, on the other hand, is the adjusted closing price: it’s the closing price of the day that has been slightly adapted to include any corporate actions, such as dividends and stock splits, that occurred at any time before the next day’s open. You can use this column when you’re examining historical returns or performing a detailed analysis of past performance.

Note how the index or row labels contain dates, and how your columns or column labels contain numerical values.

Tip: save this data to a csv file with the to_csv() function from pandas, and use the read_csv() function to read the data back into Python. This is extremely handy in cases where, for example, the Yahoo API endpoint has changed and you don’t have access to your data any longer :)

import pandas as pd
aapl.to_csv('data/aapl_ohlc.csv')
df = pd.read_csv('data/aapl_ohlc.csv', header=0, index_col='Date', parse_dates=True)

Now that you have briefly inspected the first lines of your data and have taken a look at some summary statistics, it’s time to go a little bit deeper.

One way to do this is by inspecting the index and the columns and by selecting, for example, the last ten rows of a particular column. The latter is called subsetting because you take a small subset of your data. The result of the subsetting is a Series, which is a one-dimensional labeled array that is capable of holding any type.

Remember that the DataFrame structure was a two-dimensional labeled array with columns that potentially hold different types of data.

Check all of this out in the exercise below. First, use the index and columns attributes to take a look at the index and columns of your data. Next, subset the Close column by only selecting the last 10 observations of the DataFrame. Make use of the square brackets [] to isolate the last ten values. You might already know this way of subsetting from other programming languages, such as R. To conclude, assign the latter to a variable ts and then check what type ts is by using the type() function:

script.py

# Inspect the index 
aapl._____

# Inspect the columns
aapl._______

# Select only the last 10 observations of `Close`
ts = ____['Close'][-10:]

# Check the type of `ts` 
type(__)
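
If you’re working through this outside the interactive chunk, the filled-in version of this exercise could look as follows (assuming aapl is the DataFrame you loaded earlier):

# Inspect the index 
aapl.index

# Inspect the columns
aapl.columns

# Select only the last 10 observations of `Close`
ts = aapl['Close'][-10:]

# Check the type of `ts` 
type(ts)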

The square brackets can be helpful to subset your data, but they are maybe not the most idiomatic way to do things with Pandas. That’s why you should also take a look at the loc() and iloc() functions: you use the former for label-based indexing and the latter for positional indexing.

In practice, this means that you can pass the label of the row labels, such as 2007 and 2006-11-01, to the loc() function, while you pass integers such as 22 and 43 to the iloc() function.

Complete the exercise below to understand how both loc() and iloc() work:

script.py

# Inspect the first rows of November-December 2006
print(aapl.loc[pd.Timestamp('2006-11-01'):pd.Timestamp('2006-12-31')].______)

# Inspect the first rows of 2007 
print(aapl.loc['2007'].______)

# Inspect November 2006
print(aapl.____[22:43])

# Inspect the 'Open' and 'Close' values at 2006-11-01 and 2006-12-01
print(aapl.iloc[[22,43], [0, 3]])
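
For reference, one way to complete the chunk above is shown below; it assumes that pandas has been imported as pd, as in the earlier code:

# Inspect the first rows of November-December 2006
print(aapl.loc[pd.Timestamp('2006-11-01'):pd.Timestamp('2006-12-31')].head())

# Inspect the first rows of 2007 
print(aapl.loc['2007'].head())

# Inspect November 2006
print(aapl.iloc[22:43])

# Inspect the 'Open' and 'Close' values at 2006-11-01 and 2006-12-01
print(aapl.iloc[[22,43], [0, 3]])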

Tip: if you look closely at the results of the subsetting, you’ll notice that certain days are missing in the data. If you look more closely at the pattern, you’ll see that it’s usually two or three days that are missing. These days are typically weekend days or public holidays and aren’t part of your data. This is nothing to worry about: it’s completely normal, and you don’t have to fill in these missing days.

Besides indexing, you might also want to explore some other techniques to get to know your data a little bit better. You never know what else will show up. Let’s sample 20 rows from the data set and then resample the data so that aapl is at the monthly level instead of the daily level. You can make use of the sample() and resample() functions to do this:

script.py

# Sample 20 rows
sample = aapl.______(20)

# Print `sample`
print(______)

# Resample to monthly level 
monthly_aapl = aapl.________('M').mean()

# Print `monthly_aapl`
print(____________)
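
In case you want to check your answers, the completed chunk might look like this:

# Sample 20 rows
sample = aapl.sample(20)

# Print `sample`
print(sample)

# Resample to monthly level 
monthly_aapl = aapl.resample('M').mean()

# Print `monthly_aapl`
print(monthly_aapl)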

Very straightforward, isn’t it?

The resample() function is often used because it provides elaborate control and more flexibility on the frequency conversion of your time series: besides specifying new time intervals yourself and specifying how you want to handle missing data, you also have the option to indicate how you want to resample your data, as you can see in the code example above. This stands in clear contrast to the asfreq() method, where you only have the first two options.

Tip: try this out for yourself in the IPython console of the above DataCamp Light chunk. Pass in aapl.asfreq("M", method="bfill") to see what happens!
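
To make the contrast concrete, here is a small sketch that puts the two approaches side by side; the variable names monthly_mean and monthly_asfreq are just illustrative:

# Downsample to month-end frequency, aggregating each month with the mean
monthly_mean = aapl.resample('M').mean()

# Take the value at each month-end instead, back-filling where no observation exists
monthly_asfreq = aapl.asfreq('M', method='bfill')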

Lastly, before you take your data exploration to the next level and start visualizing your data and performing some common financial analyses, you might already begin to calculate the differences between the opening and closing prices per day. You can quickly perform this arithmetic operation with the help of Pandas; just subtract the values in the Close column from the values in the Open column of your aapl data. Or, in other words, deduct aapl.Close from aapl.Open. You store the result in a new column of the aapl DataFrame called diff, and then you delete it again with the help of del:

script.py

# Add a column `diff` to `aapl` 
aapl['diff'] = aapl.Open - aapl.Close

# Delete the new `diff` column
del aapl['diff']

Tip: make sure to comment out the last line of code so that the new column of your aapl DataFrame doesn’t get removed and you can check the results of your arithmetic operation!

Of course, knowing the gains in absolute terms might already help you to get an idea of whether you’re making a good investment, but as a quant, you might be more interested in a more relative means of measuring your stock’s value, like how much the value of a particular stock has gone up or gone down. A way to do this is by calculating the daily percentage change.

This is good to know for now, but don’t worry about it just yet; you’ll go deeper into this in a bit!

This section introduced you to some ways to explore your data before you start performing further analyses.

Visualizing Time Series Data

Next to exploring your data by means of head(), tail() and indexing, you might also want to visualize your time series data. Thanks to Pandas’ plotting integration with Matplotlib, this task becomes easy: just use the plot() function and pass the relevant arguments to it. Additionally, you can also add the grid argument to indicate that the plot should have a grid in the background.

Let’s examine and run the code below to see how you can make this plot!

script.py

# Import Matplotlib's `pyplot` module as `plt`
import matplotlib.pyplot as plt

# Plot the closing prices for `aapl`
aapl['Close'].plot(grid=True)

# Show the plot
plt.show()

[Figure: line plot of the Apple closing prices over time]

Common Financial Analysis

Now that you have an idea of your data, what time series data is about and how you can use pandas to explore your data quickly, it’s time to dive deeper into some of the common financial analyses that you can do so that you can actually start working towards developing a trading strategy.

In the rest of this section, you’ll learn more about the returns, moving windows, volatility calculation and Ordinary Least-Squares Regression (OLS).

Returns

The simple daily percentage change doesn’t take into account dividends and other factors; it represents the percentage change in the value of a stock over a single day of trading. The daily percentage change is easy to calculate, as the Pandas package includes a pct_change() function to make your life easier:

script.py

# Import `numpy` as `np`
import numpy as np

# Assign `Adj Close` to `daily_close`
daily_close = aapl[['___________']]

# Daily returns
daily_pct_change = daily_close.__________()

# Replace NA values with 0
daily_pct_change.fillna(0, inplace=True)

# Inspect daily returns
print(______________)

# Daily log returns
daily_log_returns = np.log(daily_close.pct_change()+1)

# Print daily log returns
print(daily_log_returns)
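
For those who want to check their work, the blanks in the chunk above can be filled in as follows:

# Assign `Adj Close` to `daily_close`
daily_close = aapl[['Adj Close']]

# Daily returns
daily_pct_change = daily_close.pct_change()

# Replace NA values with 0
daily_pct_change.fillna(0, inplace=True)

# Inspect daily returns
print(daily_pct_change)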

Note that you calculate the log returns to get a better insight into the growth of your returns over time.

Knowing how to calculate the daily percentage change is nice, but what about when you want to know the monthly or quarterly returns? In such cases, you can fall back on the resample() function, which you already saw in the first part of this tutorial.

script.py

# Resample `aapl` to business months, take last observation as value 
monthly = aapl._________('BM').apply(lambda x: x[-1])

# Calculate the monthly percentage change
monthly.pct_change()

# Resample `aapl` to quarters, take the mean as value per quarter
quarter = aapl.resample("4M").mean()

# Calculate the quarterly percentage change
quarter.__________()
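
Filled in, the chunk above would look roughly like this:

# Resample `aapl` to business months, take last observation as value 
monthly = aapl.resample('BM').apply(lambda x: x[-1])

# Calculate the monthly percentage change
monthly.pct_change()

# Resample `aapl` to quarters, take the mean as value per quarter
quarter = aapl.resample("4M").mean()

# Calculate the quarterly percentage change
quarter.pct_change()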

Using pct_change() is quite the convenience, but it also obscures how exactly the daily percentages are calculated. That’s why you can alternatively make use of Pandas’ shift() function instead of using pct_change(). You then divide the daily_close values by daily_close.shift(1) and subtract 1. By using this approach, however, you will be left with NA values at the beginning of the resulting DataFrame.

Tip: compare the result of the following code with the result that you had obtained in the first DataCamp Light chunk to clearly see the difference between these two methods of calculating the daily percentage change.

script.py

# Daily returns
daily_pct_change = ___________ / daily_close.shift(1) - 1

# Print `daily_pct_change`
print(___________)
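
A completed version of this chunk, under the same assumptions as before, could read:

# Daily returns, calculated manually with `shift()`
daily_pct_change = daily_close / daily_close.shift(1) - 1

# Print `daily_pct_change`
print(daily_pct_change)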

Tip: calculate the daily log returns with the help of Pandas’ shift() function. Try it out in the IPython console of this DataCamp Light chunk! (For those who can’t find the solution, try out this line of code: daily_log_returns_shift = np.log(daily_close / daily_close.shift(1))).

For your reference, the calculation of the daily percentage change is based on the following formula:

r_t = (p_t / p_{t-1}) - 1, where p is the price, t is the time (a day in this case) and r is the return.

Additionally, you can plot the distribution of daily_pct_change:

script.py

# Import matplotlib
import matplotlib.pyplot as plt

# Plot the distribution of `daily_pct_change`
daily_pct_change.hist(bins=50)

# Show the plot
plt.show()

# Pull up summary statistics
print(daily_pct_change.describe())

Python for Financial Analysis and Algorithmic Trading

The distribution looks very symmetrical and normally distributed: the daily changes center around the bin 0.00. Note, though, how you can and should use the results of the describe() function, applied on daily_pct_change, to correctly interpret the results of the histogram. You will see that the mean is also very close to the 0.00 bin and that the standard deviation is 0.02. Also, take a look at the percentiles to know how many of your data points fall below -0.010672, 0.001677 and 0.014306.

The cumulative daily rate of return is useful to determine the value of an investment at regular intervals. You can calculate the cumulative daily rate of return by using the daily percentage change values, adding 1 to them and calculating the cumulative product with the resulting values:

script.py

# Calculate the cumulative daily returns
cum_daily_return = (1 + ________________).cumprod()

# Print `cum_daily_return`
print(_________________)
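
Filled in, this chunk becomes:

# Calculate the cumulative daily returns
cum_daily_return = (1 + daily_pct_change).cumprod()

# Print `cum_daily_return`
print(cum_daily_return)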

Note that you can again use Matplotlib to quickly plot the cum_daily_return; just add the plot() function to it and, optionally, determine the figsize or the size of the figure:

script.py

# Import matplotlib
import matplotlib.pyplot as plt 

# Plot the cumulative daily returns
cum_daily_return.plot(figsize=(12,8))

# Show the plot
plt.show()

[Figure: plot of the cumulative daily returns]

Very easy, isn’t it? Now, if you don’t want to see the daily returns, but rather the monthly returns, remember that you can easily use the resample() function to bring the cum_daily_return to the monthly level:

script.py

# Resample the cumulative daily return to cumulative monthly return 
cum_monthly_return = cum_daily_return.________("M").mean()

# Print the `cum_monthly_return`
print(_________________)
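
For reference, the completed chunk looks like this:

# Resample the cumulative daily return to cumulative monthly return 
cum_monthly_return = cum_daily_return.resample("M").mean()

# Print the `cum_monthly_return`
print(cum_monthly_return)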

Knowing how to calculate the returns is a valuable skill, but you’ll often see that these numbers don’t really say much when you don’t compare them to other stocks. That’s why you’ll often see examples where two or more stocks are compared. In the rest of this section, you’ll focus on getting more data from Yahoo! Finance so that you can calculate the daily percentage change and compare the results.

Note that, if you want to be doing this, you’ll need to have a more thorough understanding of Pandas and how you can manipulate your data with Pandas!

Let’s start! Get more data from Yahoo! Finance first. You can easily do this by making a function that takes in the ticker or symbol of the stock, a start date and an end date. The inner function data() takes a ticker and retrieves the data from startdate to enddate, so that the outer get() function can map it over all tickers and return a DataFrame that concatenates the mapped data, keyed by ticker.

Check out the code below, where the stock data from Apple, Microsoft, IBM, and Google are loaded and gathered into one big DataFrame:

def get(tickers, startdate, enddate):
  def data(ticker):
    return (pdr.get_data_yahoo(ticker, start=startdate, end=enddate))
  datas = map(data, tickers)
  return(pd.concat(datas, keys=tickers, names=['Ticker', 'Date']))

tickers = ['AAPL', 'MSFT', 'IBM', 'GOOG']
all_data = get(tickers, datetime.datetime(2006, 10, 1), datetime.datetime(2012, 1, 1))

Note that this code originally was used in “Mastering Pandas for Finance”. It was updated for this tutorial to the new standards. Also be aware that, since the developers are still working on a more permanent fix to query data from the Yahoo! Finance API, it could be that you need to import the fix_yahoo_finance package. You can find the installation instructions here or check out the Jupyter notebook that goes along with this tutorial.

For the rest of this tutorial, you’re safe, as the data has been loaded in for you!

Now, the result of these lines of code, you ask? Check it out:

[Figure: the first rows of the combined DataFrame, indexed by Ticker and Date]

You can then use the big DataFrame to start making some interesting plots:

script.py

# Import matplotlib
import matplotlib.pyplot as plt 

# Isolate the `Adj Close` values and transform the DataFrame
daily_close_px = all_data[['Adj Close']].reset_index().pivot('Date', 'Ticker', 'Adj Close')

# Calculate the daily percentage change for `daily_close_px`
daily_pct_change = daily_close_px.pct_change()

# Plot the distributions
daily_pct_change.hist(bins=50, sharex=True, figsize=(12,8))

# Show the resulting plot
plt.show()

[Figure: histograms of the daily percentage change for each of the four tickers]

Another useful plot is the scatter matrix. You can easily create it by using the pandas library. Don’t forget to add the scatter_matrix() function to your code so that you actually make a scatter matrix :) As arguments, you pass daily_pct_change, and with the diagonal argument you indicate that you want a Kernel Density Estimate (KDE) plot on the diagonal. Additionally, you can set the transparency with the alpha argument and the figure size with figsize.

script.py

# Import matplotlib
import matplotlib.pyplot as plt

# Plot a scatter matrix with the `daily_pct_change` data 
pd.scatter_matrix(daily_pct_change, diagonal='kde', alpha=0.1,figsize=(12,12))

# Show the plot
plt.show()

[Figure: scatter matrix of the daily percentage changes with KDE plots on the diagonal]

Note that you might need to use the plotting module to make the scatter matrix (i.e. pd.plotting.scatter_matrix()) when you’re working locally. Also, it’s good to know that the Kernel Density Estimate plot estimates the probability density function of a random variable.

Congratulations! You’ve successfully made it through the first common financial analysis, where you explored returns! Now it’s time to move on to the second one: moving windows.

Moving Windows

With moving windows, you compute a statistic on a window of data representing a particular period of time and then slide the window across the data by a specified interval. That way, the statistic is continually calculated as long as the window falls within the dates of the time series.

There are a lot of functions in Pandas to calculate moving windows, such as rolling_mean(), rolling_std(), and so on. See all of them here.

However, note that most of them are deprecated, so it’s best to use rolling() in combination with mean(), std() or a similar function, depending on which type of moving window you want to calculate.

But what does a moving window exactly mean for you?

The exact meaning, of course, depends on the statistic that you’re applying to the data. For example, a rolling mean smoothes out short-term fluctuations and highlights longer-term trends in data.

script.py

# Isolate the adjusted closing prices 
adj_close_px = aapl['_________']

# Calculate the moving average
moving_avg = adj_close_px._______(window=40).mean()

# Inspect the result
print(__________[-10:])
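
Filled in, the chunk above might look as follows:

# Isolate the adjusted closing prices 
adj_close_px = aapl['Adj Close']

# Calculate the moving average
moving_avg = adj_close_px.rolling(window=40).mean()

# Inspect the result
print(moving_avg[-10:])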

Tip: try out some of the other standard moving windows functions that come with the Pandas package, such as rolling_max(), rolling_var() or rolling_median(), in the IPython console. Note that you can also use rolling() in combination with max(), var() or median() to accomplish the same results! Remember that you can find more functions if you click on the link that’s provided in the text on top of this DataCamp Light chunk.

Of course, you might not really understand what all of this is about. Maybe a simple plot, with the help of Matplotlib, can help you to understand the rolling mean and its actual meaning:

script.py

# Import matplotlib 
import matplotlib.pyplot as plt

# Short moving window rolling mean
aapl['42'] = adj_close_px.rolling(window=42).mean()

# Long moving window rolling mean
aapl['252'] = adj_close_px.rolling(window=252).mean()

# Plot the adjusted closing price, the short and long windows of rolling means
aapl[['Adj Close', '42', '252']].plot()

# Show plot
plt.show()

[Figure: Apple’s adjusted closing price with the short and long rolling means]

Volatility Calculation

The volatility of a stock is a measurement of the change in variance in the returns of a stock over a specific period of time. It is common to compare the volatility of a stock with another stock to get a feel for which may have less risk, or to a market index to examine the stock’s volatility relative to the overall market. Generally, the higher the volatility, the riskier the investment in that stock, which is one of the factors that drives the choice of investing in one stock over another.

The moving historical standard deviation of the log returns, also known as the moving historical volatility, might be of more interest. You can compute it by taking a rolling standard deviation of the returns and multiplying it by the square root of the window size, for example data.rolling(window).std() * math.sqrt(window); the older pd.rolling_std(data, window=x) * math.sqrt(window) form is deprecated.

script.py

# Import matplotlib
import matplotlib.pyplot as plt 

# Define the minimum of periods to consider 
min_periods = 75 

# Calculate the volatility
vol = daily_pct_change.rolling(min_periods).std() * np.sqrt(min_periods) 

# Plot the volatility
vol.plot(figsize=(10, 8))

# Show the plot
plt.show()

[Figure: plot of the moving historical volatility]

The volatility is calculated by taking a rolling window standard deviation on the percentage change in a stock. You can clearly see this in the code, because you pass daily_pct_change and min_periods to rolling() before chaining std().

Note that the size of the window can and will change the overall result: if you make the window wider and min_periods larger, your result will become less representative; if you make it smaller and the window more narrow, the result will come closer to the standard deviation.

Considering all of this, you see that it’s definitely a skill to get the right window size based upon the data sampling frequency.

Ordinary Least-Squares Regression (OLS)

After all of these calculations, you might also want to perform a more traditional statistical analysis of your financial data, such as an Ordinary Least-Squares (OLS) regression.

To do this, you have to make use of the statsmodels library, which not only provides you with the classes and functions to estimate many different statistical models but also allows you to conduct statistical tests and perform statistical data exploration.

Note that you could indeed do the OLS regression with Pandas, but that its ols module is deprecated and will be removed in future versions. It is therefore wise to use the statsmodels package.

script.py

# Import the `api` module of `statsmodels` under alias `sm`
import statsmodels.api as sm

# Isolate the adjusted closing price
all_adj_close = all_data[['Adj Close']]

# Calculate the returns 
all_returns = np.log(all_adj_close / all_adj_close.shift(1))

# Isolate the AAPL returns 
aapl_returns = all_returns.iloc[all_returns.index.get_level_values('Ticker') == 'AAPL']
aapl_returns.index = aapl_returns.index.droplevel('Ticker')

# Isolate the MSFT returns
msft_returns = all_returns.iloc[all_returns.index.get_level_values('Ticker') == 'MSFT']
msft_returns.index = msft_returns.index.droplevel('Ticker')

# Build up a new DataFrame with AAPL and MSFT returns
return_data = pd.concat([aapl_returns, msft_returns], axis=1)[1:]
return_data.columns = ['AAPL', 'MSFT']

# Add a constant 
X = sm.add_constant(return_data['AAPL'])

# Construct the model
model = sm.OLS(return_data['MSFT'],X).fit()

# Print the summary
print(model.summary())

Note that you add [1:] to the concatenation of the AAPL and MSFT return data so that you don’t have any NaN values that can interfere with your model.

Things to look out for when you’re studying the result of the model summary are the following:

  • The Dep. Variable, which indicates which variable is the response in the model
  • The Model, in this case, is OLS. It’s the model you’re using in the fit
  • Additionally, you also have the Method to indicate how the parameters of the model were calculated. In this case, you see that this is set at Least Squares.

Up until now, you haven’t seen much new information. You have basically set all of these in the code that you ran in the DataCamp Light chunk. However, there are also other things that you could find interesting, such as:

  • The number of observations (No. Observations). Note that you could also derive this with the Pandas package by using the info() function. Run return_data.info() in the IPython console of the DataCamp Light chunk above to confirm this.
  • The degree of freedom of the residuals (DF Residuals)
  • The number of parameters in the model, indicated by DF Model. Note that this number doesn’t include the constant term, which was added to X with sm.add_constant() in the code above.

This was basically the whole left column that you went over. The right column gives you some more insight into the goodness of the fit. You see, for example:

  • R-squared, which is the coefficient of determination. This score indicates how well the regression line approximates the real data points. In this case, the result is 0.280. In percentages, this means that the score is at 28%. When the score is 0%, it indicates that the model explains none of the variability of the response data around its mean. Of course, a score of 100% indicates the opposite.
  • You also see the Adj. R-squared score, which at first sight gives the same number. However, the calculation behind this metric adjusts the R-Squared value based on the number of observations and the degrees-of-freedom of the residuals (registered in DF Residuals). The adjustment in this case hasn’t had much effect, as the result of the adjusted score is still the same as the regular R-squared score.
  • The F-statistic measures how significant the fit is. It is calculated by dividing the mean squared error of the model by the mean squared error of the residuals. The F-statistic for this model is 514.2.
  • Next, there’s also the Prob (F-statistic), which indicates the probability that you would get the result of the F-statistic, given the null hypothesis that they are unrelated.
  • The Log-likelihood indicates the log of the likelihood function, which is, in this case 3513.2.
  • The AIC is the Akaike Information Criterion: this metric adjusts the log-likelihood based on the number of observations and the complexity of the model. The AIC of this model is -7022.
  • Lastly, the BIC or the Bayesian Information Criterion, is similar to the AIC that you just have seen, but it penalizes models with more parameters more severely. Given the fact that this model only has one parameter (check DF Model), the BIC score will be the same as the AIC score.

Below the first part of the model summary, you see reports for each of the model’s coefficients:

  • The estimated value of the coefficient is registered at coef.
  • std err is the standard error of the estimate of the coefficient.
  • There’s also the t-statistic value, which you’ll find under t. This metric is used to measure how statistically significant a coefficient is.
  • P > |t| is the p-value for the null hypothesis that the coefficient equals 0. If it is less than the significance level, often 0.05, it indicates that there is a statistically significant relationship between the term and the response. In this case, you see that the constant has a p-value of 0.198, while AAPL is at 0.000.

Lastly, there is a final part of the model summary in which you’ll see other statistical tests to assess the distribution of the residuals:

  • Omnibus, which is the Omnibus D’Agostino’s test: it provides a combined statistical test for the presence of skewness and kurtosis.
  • The Prob(Omnibus) is the Omnibus metric turned into a probability.
  • Next, the Skew or Skewness measures the symmetry of the data about the mean.
  • The Kurtosis gives an indication of the shape of the distribution, as it compares the amount of data close to the mean with those far away from the mean (in the tails).
  • Durbin-Watson is a test for the presence of autocorrelation, and the Jarque-Bera is another test of the skewness and kurtosis. You can also turn the result of this test into a probability, as you can see in Prob (JB).
  • Lastly, you have the Cond. No., which indicates the degree of multicollinearity.

You can plot the Ordinary Least-Squares Regression with the help of Matplotlib:

script.py

# Import matplotlib
import matplotlib.pyplot as plt

# Plot returns of AAPL and MSFT
plt.plot(return_data['AAPL'], return_data['MSFT'], 'r.')

# Add an axis to the plot
ax = plt.axis()

# Initialize `x`
x = np.linspace(ax[0], ax[1] + 0.01)

# Plot the regression line
plt.plot(x, model.params[0] + model.params[1] * x, 'b', lw=2)

# Customize the plot
plt.grid(True)
plt.axis('tight')
plt.xlabel('Apple Returns')
plt.ylabel('Microsoft returns')

# Show the plot
plt.show()

[Figure: scatter plot of AAPL versus MSFT returns with the fitted regression line]

Note that you can also use the rolling correlation of returns as a way to crosscheck your results. You can handily make use of the Matplotlib integration with Pandas to call the plot() function on the results of the rolling correlation:

script.py

# Import matplotlib 
import matplotlib.pyplot as plt

# Plot the rolling correlation
return_data['MSFT'].rolling(window=252).corr(return_data['AAPL']).plot()

# Show the plot
plt.show()

[Figure: rolling correlation between the MSFT and AAPL returns]

Building A Trading Strategy With Python

Now that you have done some primary analyses of your data, it’s time to formulate your first trading strategy. But before you go into all of this, why not first get to know some of the most common trading strategies? After a short introduction, you’ll undoubtedly move on more easily to developing your own trading strategy.

Common Trading Strategies

From the introduction, you’ll remember that a trading strategy is a fixed plan to go long or short in markets, but you didn’t really get much more information than that. In general, there are two common trading strategies: the momentum strategy and the reversion strategy.

Firstly, the momentum strategy is also called divergence or trend trading. When you follow this strategy, you do so because you believe the movement of a quantity will continue in its current direction. Stated differently, you believe that stocks have momentum or upward or downward trends, that you can detect and exploit.

Some examples of this strategy are the moving average crossover, the dual moving average crossover, and turtle trading:

  • The moving average crossover is when the price of an asset moves from one side of a moving average to the other. This crossover represents a change in momentum and can be used as a point of making the decision to enter or exit the market. You’ll see an example of this strategy, which is the “hello world” of quantitative trading later on in this tutorial.
  • The dual moving average crossover occurs when a short-term average crosses a long-term average. This signal is used to identify that momentum is shifting in the direction of the short-term average. A buy signal is generated when the short-term average crosses the long-term average and rises above it, while a sell signal is triggered by a short-term average crossing the long-term average and falling below it.
  • Turtle trading is a popular trend following strategy that was initially taught by Richard Dennis. The basic strategy is to buy futures on a 20-day high and sell on a 20-day low.

Secondly, there is the reversion strategy, which is also known as convergence or cycle trading. This strategy departs from the belief that the movement of a quantity will eventually reverse. This might seem a little bit abstract, but it becomes clearer with an example: the mean reversion strategy, where you believe that stocks return to their mean and that you can exploit the deviations from that mean.

That already sounds a whole lot more practical, right?

Another example of this strategy, besides the mean reversion strategy, is pairs trading mean-reversion, which is similar to the mean reversion strategy. Whereas the mean reversion strategy basically states that stocks return to their mean, the pairs trading strategy extends this: if two stocks can be identified that have a relatively high correlation, a change in the price difference between the two can be used to signal trading events when one of the two moves out of correlation with the other. That means that if the correlation between two stocks has decreased, the stock with the higher price can be considered to be in a short position: it should be sold, because the higher-priced stock will return to the mean. The lower-priced stock, on the other hand, will be in a long position, because its price will rise as the correlation returns to normal.

Besides these two most frequent strategies, there are also other ones that you might come across once in a while, such as the forecasting strategy, which attempts to predict the direction or value of a stock, in this case, in subsequent future time periods based on certain historical factors. There’s also the High-Frequency Trading (HFT) strategy, which exploits the sub-millisecond market microstructure.

That’s all music for the future; for now, let’s focus on developing your first trading strategy!

A Simple Trading Strategy

As you read above, you’ll start with the “hello world” of quantitative trading: the moving average crossover. The strategy that you’ll be developing is simple: you create two separate Simple Moving Averages (SMA) of a time series with differing lookback periods, let’s say, 40 days and 100 days. If the short moving average exceeds the long moving average, then you go long; if the long moving average exceeds the short moving average, then you exit.

Remember that when you go long, you think that the stock price will go up and that you will be able to sell it at a higher price in the future (= buy signal); when you go short, you sell your stock, expecting that you can buy it back at a lower price and realize a profit (= sell signal).

This simple strategy might seem quite complex when you’re just starting out, but let’s take this step by step:

  • First define your two different lookback periods: a short window and a long window. You set up two variables and assign one integer per variable. Make sure that the integer that you assign to the short window is shorter than the integer that you assign to the long window variable!
  • Next, make an empty signals DataFrame, but do make sure to copy the index of your aapl data so that you can start calculating the daily buy or sell signal for your aapl data.
  • Create a column in your empty signals DataFrame that is named signal and initialize it by setting the value for all rows in this column to 0.0.
  • After the preparatory work, it’s time to create the set of short and long simple moving averages over the respective long and short time windows. Make use of the rolling() function to start your rolling window calculations: within the function, specify the window and min_periods, and set the center argument. In practice, this will result in a rolling() function to which you have passed either short_window or long_window, 1 as the minimum number of observations in the window that are required to have a value, and False, so that the labels are not set at the center of the window. Next, don’t forget to also chain the mean() function so that you calculate the rolling mean.
  • After you have calculated the mean average of the short and long windows, you should create a signal when the short moving average crosses the long moving average, but only for the period greater than the shortest moving average window. In Python, this will result in a condition: signals['short_mavg'][short_window:] > signals['long_mavg'][short_window:]. Note that you add the [short_window:] to comply with the condition “only for the period greater than the shortest moving average window”. When the condition is true, the initialized value 0.0 in the signal column will be overwritten with 1.0. A “signal” is created! If the condition is false, the original value of 0.0 will be kept and no signal is generated. You use the NumPy where() function to set up this condition. Much like you read just now, the variable to which you assign this result is signals['signal'][short_window:], because you only want to create signals for the period greater than the shortest moving average window!
  • Lastly, you take the difference of the signals in order to generate actual trading orders. In other words, in this column of your signals DataFrame, you’ll be able to distinguish between long and short positions, whether you’re buying or selling stock.

Try all of this out in the DataCamp Light chunk below:

script.py

# Initialize the short and long windows
short_window = 40
long_window = 100

# Initialize the `signals` DataFrame with the `signal` column
signals = pd.DataFrame(index=aapl.index)
signals['signal'] = 0.0

# Create short simple moving average over the short window
signals['short_mavg'] = aapl['Close'].rolling(window=short_window, min_periods=1, center=False).mean()

# Create long simple moving average over the long window
signals['long_mavg'] = aapl['Close'].rolling(window=long_window, min_periods=1, center=False).mean()

# Create signals
signals['signal'][short_window:] = np.where(signals['short_mavg'][short_window:] 
                                            > signals['long_mavg'][short_window:], 1.0, 0.0)   

# Generate trading orders
signals['positions'] = signals['signal'].diff()

# Print `signals`
print(signals)

This wasn’t too hard, was it? Print out the signals DataFrame and inspect the results. Important to grasp here is what the positions and the signal columns mean in this DataFrame. You’ll see that it will become very important when you move on!

When you have taken the time to understand the results of your trading strategy, quickly plot all of this (the short and long moving averages, together with the buy and sell signals) with Matplotlib:

script.py

# Import `pyplot` module as `plt`
import matplotlib.pyplot as plt

# Initialize the plot figure
fig = plt.figure()

# Add a subplot and label for y-axis
ax1 = fig.add_subplot(111,  ylabel='Price in $')

# Plot the closing price
aapl['Close'].plot(ax=ax1, color='r', lw=2.)

# Plot the short and long moving averages
signals[['short_mavg', 'long_mavg']].plot(ax=ax1, lw=2.)

# Plot the buy signals
ax1.plot(signals.loc[signals.positions == 1.0].index, 
         signals.short_mavg[signals.positions == 1.0],
         '^', markersize=10, color='m')

# Plot the sell signals
ax1.plot(signals.loc[signals.positions == -1.0].index, 
         signals.short_mavg[signals.positions == -1.0],
         'v', markersize=10, color='k')

# Show the plot
plt.show()

[Figure: Apple’s closing price with the short and long moving averages and the buy and sell signals]

The result is pretty cool, isn’t it?

Backtesting The Trading Strategy

Now that you’ve got your trading strategy at hand, it’s a good idea to also backtest it and calculate its performance. But right before you go deeper into this, you might want to know just a little bit more about the pitfalls of backtesting, what components are needed in a backtester and what Python tools you can use to backtest your simple algorithm.

If, however, you’re already well up to date, you can simply move on to the implementation of your backtester!

Backtesting Pitfalls

Backtesting is, besides just “testing a trading strategy”, testing the strategy on relevant historical data to make sure that it’s an actual viable strategy before you start making moves. With backtesting, a trader can simulate and analyze the risk and profitability of trading with a specific strategy over a period of time. However, when you’re backtesting, it’s a good idea to keep in mind that there are some pitfalls, which might not be obvious to you when you’re just starting out.

For example, there are external events, such as market regime shifts, regulatory changes or macroeconomic events, which definitely influence your backtesting. Also, liquidity constraints, such as a ban on short sales, could affect your backtesting heavily.

Next, there are pitfalls which you might introduce yourself when you, for example, overfit a model (optimization bias), when you ignore strategy rules because you think it’s better like that (interference), or when you accidentally introduce information into past data (lookahead bias).

These are just a few pitfalls that you need to take into account mainly after this tutorial, when you go and make your own strategies and backtest them.

Backtesting Components

Besides the pitfalls, it’s good to know that your backtester usually consists of four essential components, which should usually be present in every backtester:

  • A data handler, which is an interface to a set of data,
  • A strategy, which generates a signal to go long or go short based on the data,
  • A portfolio, which generates orders and manages Profit & Loss (also known as “PnL”), and
  • An execution handler, which sends the order to the broker and receives the “fills” or signals that the stock has been bought or sold.

Besides these four components, there are many more that you can add to your backtester, depending on the complexity. You can definitely go a lot further than just these four components. However, for this beginner tutorial, you’ll just focus on getting these basic components to work in your code.

Python Tools

To implement the backtesting, you can make use of some other tools besides Pandas, which you have already used extensively in the first part of this tutorial to perform some financial analyses on your data. Apart from Pandas, there are, for example, also NumPy and SciPy, which provide vectorization, optimization and linear algebra routines that you can use when you’re developing trading strategies.

Also Scikit-Learn, the Python machine learning library, can come in handy when you’re working with forecasting strategies, as it offers everything you need to create regression and classification models. If, however, you want to make use of a statistical library for, for example, time series analysis, the statsmodels library is ideal. You briefly used this library already in this tutorial when you were performing the Ordinary Least-Squares Regression (OLS).

Lastly, there are also the IbPy and Zipline libraries. The former offers you a Python API for the Interactive Brokers online trading system: you’ll get all the functionality to connect to Interactive Brokers, request stock ticker data, submit orders for stocks and more. The latter is the algorithmic trading library that powers the Quantopian platform, which you’ll read about below.

Implementation Of A Simple Backtester

As you read above, a simple backtester consists of a strategy, a data handler, a portfolio and an execution handler. You have already implemented a strategy above, and you also have access to a data handler, namely the pandas-datareader package or the Pandas library that you use to get your saved data from a csv or Excel file back into Python. The components that are still left to implement are the execution handler and the portfolio.

However, since you’re just starting out, you won’t focus on implementing an execution handler just yet. Instead, you’ll see below how you can get started on creating a portfolio that can generate orders and manage the profit and loss:

  • First off, you’ll set a variable initial_capital to hold your initial capital and create a new DataFrame positions. Once again, you copy the index from another DataFrame; in this case, this is the signals DataFrame, because you want to consider the time frame for which you have generated the signals.
  • Next, you create a new column AAPL in the DataFrame. On the days that the signal is 1 and the short moving average crosses the long moving average (for the period greater than the shortest moving average window), you’ll buy 100 shares. On the days on which the signal is 0, the final result will be 0 as a result of the operation 100*signals['signal'].
  • A new DataFrame portfolio is created to store the market value of an open position.
  • Next, you create a DataFrame that stores the differences in positions (or number of shares).
  • Then the real backtesting begins: you add a new column to the portfolio DataFrame named holdings, which stores the value of the positions or shares you have bought, multiplied by the ‘Adj Close’ price.
  • Your portfolio also contains a cash column, which is the capital that you still have left to spend: it is calculated by taking your initial_capital and subtracting your holdings (the price that you paid for buying stock).
  • You’ll also add a total column to your portfolio DataFrame, which contains the sum of your cash and the holdings that you own, and
  • Lastly, you also add a returns column to your portfolio, in which you’ll store the returns.

script.py

# Set the initial capital
initial_capital = float(100000.0)

# Create a DataFrame `positions`
positions = pd.DataFrame(index=signals.index).fillna(0.0)

# Buy 100 shares
positions['AAPL'] = 100*signals['signal']   

# Initialize the portfolio with value owned   
portfolio = positions.multiply(aapl['Adj Close'], axis=0)

# Store the difference in shares owned 
pos_diff = positions.diff()

# Add `holdings` to portfolio
portfolio['holdings'] = (positions.multiply(aapl['Adj Close'], axis=0)).sum(axis=1)

# Add `cash` to portfolio
portfolio['cash'] = initial_capital - (pos_diff.multiply(aapl['Adj Close'], axis=0)).sum(axis=1).cumsum()   

# Add `total` to portfolio
portfolio['total'] = portfolio['cash'] + portfolio['holdings']

# Add `returns` to portfolio
portfolio['returns'] = portfolio['total'].pct_change()

# Print the first lines of `portfolio`
print(portfolio.head())

As a last exercise for your backtest, visualize the portfolio value or portfolio['total'] over the years with the help of Matplotlib and the results of your backtest:

script.py

# Import the `pyplot` module as `plt`
import matplotlib.pyplot as plt

# Create a figure
fig = plt.figure()

ax1 = fig.add_subplot(111, ylabel='Portfolio value in $')

# Plot the equity curve in dollars
portfolio['total'].plot(ax=ax1, lw=2.)

ax1.plot(portfolio.loc[signals.positions == 1.0].index, 
         portfolio.total[signals.positions == 1.0],
         '^', markersize=10, color='m')
ax1.plot(portfolio.loc[signals.positions == -1.0].index, 
         portfolio.total[signals.positions == -1.0],
         'v', markersize=10, color='k')

# Show the plot
plt.show()

[Figure: the total portfolio value over time with buy and sell markers]

Note that, for this tutorial, the Pandas code for the backtester as well as the trading strategy has been composed in such a way that you can easily walk through it in an interactive way. In a real-life application, you might opt for a more object-oriented design with classes, which contain all the logic.

Backtesting With Zipline & Quantopian

You have now seen how you can implement a backtester with Python’s popular data manipulation package Pandas. However, you can also see that it’s easy to make mistakes and that this might not be the most fail-safe option to use every time: you need to build most of the components from scratch, even though you already leverage Pandas to get your results.

That’s why it’s common to use a backtesting platform, such as Quantopian, for your backtesters. Quantopian is a free, community-centered, hosted platform for building and executing trading strategies. It’s powered by zipline, a Python library for algorithmic trading. You can use the library locally, but for the purpose of this beginner tutorial, you’ll use Quantopian to write and backtest your algorithm. Before you can do this, though, make sure that you first sign up and log in.

Next, you can get started pretty easily. Click “New Algorithm” to start writing up your trading algorithm or select one of the examples that has already been coded up for you to get a better feeling of what you’re exactly dealing with :)

[Screenshot: the Quantopian algorithm editor]

Let’s start simple and make a new algorithm, still following our example of the moving average crossover, which is the standard example that you’ll find in the zipline Quickstart guide.

It so happens that this example is very similar to the simple trading strategy that you implemented in the previous section. You’ll see, though, that the structure in the code chunk below and in the screenshot above is somewhat different from what you have seen up until now in this tutorial: you start from two function definitions, initialize() and handle_data():

def initialize(context):
    context.sym = symbol('AAPL')
    context.i = 0


def handle_data(context, data):
    # Skip first 300 days to get full windows
    context.i += 1
    if context.i < 300:
        return

    # Compute averages
    # history() has to be called with the same params
    # from above and returns a pandas dataframe.
    short_mavg = data.history(context.sym, 'price', 100, '1d').mean()
    long_mavg = data.history(context.sym, 'price', 300, '1d').mean()

    # Trading logic
    if short_mavg > long_mavg:
        # order_target orders as many shares as needed to
        # achieve the desired number of shares.
        order_target(context.sym, 100)
    elif short_mavg < long_mavg:
        order_target(context.sym, 0)

    # Save values for later inspection
    record(AAPL=data.current(context.sym, "price"),
           short_mavg=short_mavg,
           long_mavg=long_mavg)

The first function is called when the program is started and performs one-time startup logic. As an argument, the initialize() function takes a context, which is used to store state during a backtest or live trading and can be referenced in different parts of the algorithm, as you can see in the code above: context comes back, among other places, in the computation of the first moving average window. You also see that you assign the result of looking up a security (a stock in this case) by its symbol (AAPL in this case) to context.sym.

The handle_data() function is called once per minute during simulation or live trading to decide what orders, if any, should be placed each minute. The function requires context and data as input: the context is the same object that you just read about, while data is an object that stores several API functions, such as current(), which retrieves the most recent value of a given field for a given asset, and history(), which gets trailing windows of historical pricing or volume data. Both functions appear in the code above: history() is used to compute the moving averages and current() to record the latest price.

Note that the code that you type into the Quantopian console will only work on the platform itself and not in, for example, your local Jupyter Notebook!

You’ll see that the data object allows you to retrieve the price, which is forward-filled: the last known price is returned if there is one. If there is none, a NaN value will be returned.

Another object that comes in handy is the portfolio, which stores important information about... your portfolio. As you can see from an expression such as context.portfolio.positions, this object is stored in the context and is therefore accessible within the core functions. The positions that you just read about store Position objects, which include information such as the number of shares held and the price paid. Additionally, the portfolio has a cash property to retrieve the current amount of cash in your portfolio, and the positions object has an amount property to inspect the whole number of shares in a certain position.

The order_target() function places an order to adjust a position to a target number of shares. If there is no existing position in the asset, an order is placed for the full target amount. If there is a position, an order is placed for the difference between the target number of shares or contracts and the number currently held. Placing a negative target order results in a short position equal to the negative number specified.
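In other words, order_target() trades the difference between where you are and where you want to be. A toy illustration of that rule in plain Python (not the zipline implementation, just the arithmetic it describes):

def order_target_shares(current_shares, target_shares):
    """Number of shares to trade to move from the current to the target position."""
    return target_shares - current_shares

print(order_target_shares(0, 100))    # no position yet -> buy 100
print(order_target_shares(100, 100))  # already at target -> trade 0
print(order_target_shares(100, 0))    # close the position -> -100 (sell 100)
print(order_target_shares(0, -50))    # negative target -> -50 (open a short of 50)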

Tip: if you have any more questions about the functions or objects, make sure to check the Quantopian Help page, which contains more information about everything that you have briefly seen in this tutorial (and much more).

When you have created your strategy with the initialize() and handle_data() functions (or copy-pasted the above code) into the console on the left-hand side of your interface, just press the “Build Algorithm” button to build the code and run a backtest. If you press the “Run Full Backtest” button, a full backtest is run, which is basically the same as the one that runs when you build the algorithm, but you’ll be able to see a lot more detail. The backtesting, whether ‘simple’ or full, can take a while; make sure to keep an eye on the progress bar at the top of the page!


You can find more information on how to get started with Quantopian here.

Note that Quantopian is an easy way to get started with zipline, but that you can always move on to using the library locally in, for example, your Jupyter notebook.

Improving The Trading Strategy

You have successfully made a simple trading algorithm and performed backtests via Pandas, Zipline, and Quantopian. It's fair to say that you've been introduced to trading with Python. However, when you have coded up the trading strategy and backtested it, your work doesn't stop yet; you might want to improve your strategy. One or more algorithms may be used to improve the model on a continuous basis, such as KMeans, k-Nearest Neighbors (KNN), Classification or Regression Trees, and the Genetic Algorithm. This will be the topic of a future DataCamp tutorial.

Apart from those algorithms, you saw that you can improve your strategy by working with multi-symbol portfolios. Incorporating only one company or symbol into your strategy often doesn't say much; you'll also see this come back in the evaluation of your moving average crossover strategy (a small multi-symbol sketch follows below). Other things that you can add or do differently include using a risk management framework or using event-driven backtesting to help mitigate the lookahead bias that you read about earlier. There are still many other ways in which you could improve your strategy, but for now, this is a good basis to start from!
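As a rough idea of what a multi-symbol version could look like, here is a minimal sketch, where get_prices() is a hypothetical helper that returns a DataFrame with an 'Adj Close' column (for example built on pandas-datareader, as used earlier in this tutorial):

import pandas as pd

short_window, long_window = 40, 100
tickers = ['AAPL', 'MSFT', 'IBM']

all_signals = {}
for ticker in tickers:
    prices = get_prices(ticker)['Adj Close']   # hypothetical data-loading helper
    signals = pd.DataFrame(index=prices.index)
    signals['short_mavg'] = prices.rolling(short_window, min_periods=1).mean()
    signals['long_mavg'] = prices.rolling(long_window, min_periods=1).mean()
    signals['signal'] = (signals['short_mavg'] > signals['long_mavg']).astype(float)
    signals['positions'] = signals['signal'].diff()
    all_signals[ticker] = signals

# Combine everything into one DataFrame with a column level per ticker
multi_signals = pd.concat(all_signals, axis=1)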

Evaluating Moving Average Crossover Strategy

Improving your strategy doesn't mean that you're finished just yet! You can easily use Pandas to calculate some metrics to further judge your simple trading strategy. First, you can use the Sharpe ratio to find out whether your portfolio's returns are the result of smart investment decisions or of taking on a lot of extra risk.

The ideal situation is, of course, that the returns are considerable but that the additional risk of investing is as small as possible. That's why the greater the portfolio's Sharpe ratio, the better: more return is being earned per unit of additional risk. Usually, a ratio greater than 1 is considered acceptable by investors, 2 is very good, and 3 is excellent.

Let’s see how your algorithm does!

script.py

# Import numpy (if you haven't already) for the square root
import numpy as np

# Isolate the returns of your strategy
returns = portfolio['returns']

# annualized Sharpe ratio
sharpe_ratio = np.sqrt(252) * (returns.mean() / returns.std())

# Print the Sharpe ratio
print(sharpe_ratio)

Note that the risk-free rate is excluded from the Sharpe ratio calculation in this tutorial, and that the Sharpe ratio is usually not considered as a standalone metric: it's usually compared to that of other stocks or strategies. The best way to approach this issue is thus by extending your original trading strategy with more data (from other companies)!
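If you do want to account for a risk-free rate, one common variant subtracts a daily risk-free return from the strategy returns before annualizing. A small sketch, assuming a made-up annual risk-free rate of 1%:

import numpy as np

annual_risk_free = 0.01                   # assumed annual risk-free rate (illustrative)
daily_risk_free = annual_risk_free / 252  # rough daily equivalent

excess_returns = returns - daily_risk_free
sharpe_ratio_rf = np.sqrt(252) * (excess_returns.mean() / excess_returns.std())
print(sharpe_ratio_rf)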

Next, you can also calculate the Maximum Drawdown, which is used to measure the largest single drop from peak to trough in the value of a portfolio, before a new peak is achieved. In other words, the score indicates the downside risk of a portfolio built with a certain strategy.

script.py

# Define a trailing 252 trading day window
window = 252

# Calculate the max drawdown in the past window days for each day 
rolling_max = aapl['Adj Close'].rolling(window, min_periods=1).max()
daily_drawdown = aapl['Adj Close']/rolling_max - 1.0

# Calculate the minimum (negative) daily drawdown
max_daily_drawdown = daily_drawdown.rolling(window, min_periods=1).min()

# Plot the results
daily_drawdown.plot()
max_daily_drawdown.plot()

# Show the plot
plt.show()


Note that you set min_periods to 1 because you want the first 252 days of data to use an expanding window.

Next up is the Compound Annual Growth Rate (CAGR), which gives you a constant rate of return over the time period. In other words, the rate tells you what you really earn, on an annualized basis, by the end of your investment period. You can calculate this rate by dividing the investment's ending value (EV) by its beginning value (BV), raising the result to the power of 1/n, where n is the number of periods, and subtracting 1 from the result: that's your CAGR!

Maybe a formula is more clear:

CAGR = (EV / BV) ^ (1 / n) - 1

Note that, in the code chunk below, the number of periods n is expressed in years, so the exponent 1/n becomes 365.0 (days per year) divided by the number of days covered by the data.

script.py

# Get the number of days in `aapl`
days = (aapl.index[-1] - aapl.index[0]).days

# Calculate the CAGR 
cagr = (aapl['Adj Close'].iloc[-1] / aapl['Adj Close'].iloc[0]) ** (365.0 / days) - 1

# Print the CAGR
print(cagr)

Besides these two metrics, there are also many others that you could consider, such as the distribution of returns, trade-level metrics, …

What Now?

Well done, you’ve made it through this Python Finance introduction tutorial! You’ve covered a lot of ground, but there’s still so much more for you to discover!

You should also check out Yves Hilpisch’s Python For Finance book, which is a great book for those who already have gathered some background into Finance, but not so much in Python. “Mastering Pandas for Data Science” by Michael Heydt is also recommended for those who want to get started with Finance in Python! Also make sure to check out Quantstart’s articles for guided tutorials on algorithmic trading and this complete series on Python programming for finance.

If you're more interested in continuing your journey into finance with R, consider taking Datacamp's Quantitative Analyst with R track. And in the meantime, stay tuned for our second post on starting finance with Python, and check out the Jupyter notebook of this tutorial.

The information provided on this site is not financial advice and none of the authors are financial professionals. The material provided on this Website should be used for informational purposes only and in no way should be relied upon for financial advice. We make no representations as to accuracy, completeness, suitability, or validity, of any information. We will not be liable for any errors, omissions, or any losses, injuries, or damages arising from its display or use. All information is provided AS IS with no warranties, and confers no rights. Also, note that such material is not updated regularly and some of the information may not, therefore, be current. Please be sure to consult your own financial advisor when making decisions regarding your financial management. The ideas and strategies mentioned in this blog should never be used without first assessing your own personal and financial situation, or without consulting a financial professional.

Introduction to Python Hex() Function for Beginners


Python's hex() function is used to convert an integer into its corresponding hexadecimal representation, returned as a string. It is one of the built-in functions in Python 3.


Python Hex Example
The hex() function converts the integer to the corresponding hexadecimal number in string form and returns it.

The input integer argument can be written in any base, such as binary or octal; Python will take care of converting it to hexadecimal format.
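For instance, integer literals written in binary or octal end up as the expected hexadecimal string; a quick illustration:

print(hex(0b1010))  # binary literal   -> 0xa
print(hex(0o17))    # octal literal    -> 0xf
print(hex(-42))     # negative integer -> -0x2a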

Syntax

hex(number)

number: the integer that will be converted into a hexadecimal value.
This function converts the number into hexadecimal form and then returns that hexadecimal number as a string.

Please note that the return value always starts with ‘0x’ (without quotes), which indicates that the number is in hexadecimal format.

# app.py

print("Enter the number: ")

# taking input from user
num = int(input())

# converting the number into hexadecimal form
h1 = hex(num)

# Printing hexadecimal form
print("The ", num, " in hexadecimal is: ", h1)

# Converting float number to hexadecimal form
print("\nEnter a float number")
num2 = float(input())

# converting into hexadecimal form
# for float we have to use float.hex() here
h2 = float.hex(num2)

# printing result
print("The ", num2, " in hexadecimal is: ", h2)

In the above example, we used the Python input() function to take the input from the user.

See the output.

Enter the number:
541
The  541  in hexadecimal is:  0x21d
    
Enter a float number
123.54
The  123.54  in hexadecimal is:  0x1.ee28f5c28f5c3p+6

Python hex() without 0x

See the following program.

# app.py

print("Enter the number: ")

# taking input from user
num = int(input())

# converting the number into hexadecimal form
h1 = hex(num)

# Printing hexadecimal form
# we have used string slicing here
print("The ", num, " in hexadecimal is: ", h1[2:])

# Converting float number to hexadecimal form
print("\nEnter a float number")
num2 = float(input())

# converting into hexadecimal form
h2 = float.hex(num2)

# printing result
print("The ", num2, " in hexadecimal is: ", h2[2:])

See the output.

Enter the number:
541
The  541  in hexadecimal is:  21d

Enter a float number
123.65
The  123.65  in hexadecimal is:  1.ee9999999999ap+6

In the above program, we have used string slicing to print the result without ‘0x’.

We started the slice at index 2 and ran it to the end of the string, i.e., h1[2:]; this prints every character from position 2 onwards, skipping the ‘0x’ prefix.
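Instead of slicing, you could also format the number without the prefix directly, using Python's standard format specifications rather than hex():

num = 541

print(format(num, 'x'))   # '21d'   - hexadecimal without the 0x prefix
print(f"{num:x}")         # '21d'   - same thing with an f-string
print(f"{num:#x}")        # '0x21d' - the '#' adds the prefix back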

Hexadecimal representation of float in Python

See the following program.

# app.py

numberEL = 11.21
print(numberEL, 'in hex =', float.hex(numberEL))

numberK = 19.21
print(numberK, 'in hex =', float.hex(numberK))

See the output.

➜  pyt python3 app.py
11.21 in hex = 0x1.66b851eb851ecp+3
19.21 in hex = 0x1.335c28f5c28f6p+4
➜  pyt

Python hex() with object

See the following code.

# app.py

class AI:
    rank = 0  # default rank

    def __index__(self):
        # hex() calls __index__() to obtain the integer to convert
        print('__index__() function called')
        return self.rank


stockfish = AI()
stockfish.rank = 2900

print(hex(stockfish))

In the above example, we have used the index() method so that we can use it with hex() function.

See the output.

➜  pyt python3 app.py
__index__() function called
0xb54
➜  pyt

How to convert hex string to int in Python

Without the 0x prefix, you need to specify the base explicitly. Otherwise, it won’t work.

See the following code.

# app.py

data = int("0xa", 16)
print(data)

With the 0x prefix, Python can distinguish hex and decimal automatically, but only if you specify 0 as the base to invoke this prefix-guessing behavior; omitting the second parameter means that base 10 is assumed.
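A quick illustration of the base-0 behavior:

print(int("0xa", 0))   # 10 - the 0x prefix selects base 16
print(int("0o12", 0))  # 10 - the 0o prefix selects base 8
print(int("10", 0))    # 10 - no prefix means base 10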
If you want to convert the string to an int, pass the string to int() along with the base you are converting from. Strings with or without the 0x prefix will work in this way.

# app.py

hexStrA = "0xffff"
hexStrB = "ffff"

print(int(hexStrA, 16))
print(int(hexStrB, 16))

See the output.

➜  pyt python3 app.py
65535
65535
➜  pyt

In all the above examples, we have used the Python int() function.

Thanks for reading!

Python Tutorial: Image processing with Python (Using OpenCV)


In this tutorial, you will learn how you can process images in Python using the OpenCV library.


OpenCV is a free open source library used in real-time image processing. It’s used to process images, videos, and even live streams, but in this tutorial, we will process images only as a first step. Before getting started, let’s install OpenCV.


Install OpenCV

To install OpenCV on your system, run the following pip command:

 pip install opencv-python

Now OpenCV is installed successfully and we are ready. Let’s have some fun with some images!

Rotate an Image

First of all, import the cv2 module.

 import cv2

Now to read the image, use the imread() method of the cv2 module, specify the path to the image in the arguments and store the image in a variable as below:

 img = cv2.imread("pyimg.jpg")

The image is now treated as a matrix with rows and columns values stored in img.

Actually, if you check the type of the img, it will give you the following result:

>>> print(type(img))
 
<class 'numpy.ndarray'>

It's a NumPy array! That's why image processing with OpenCV is so easy: all the time, you are working with a NumPy array.

To display the image, you can use the imshow() method of cv2.

cv2.imshow('Original Image', img) 
 
cv2.waitKey(0)

The waitKey() function takes a time in milliseconds as an argument, which is the delay before the window closes. Here we set the time to zero to show the window until we close it manually.

To rotate this image, you need the width and the height of the image because you will use them in the rotation process as you will see later.

 height, width = img.shape[0:2]

The shape attribute returns the height and width of the image matrix. If you print img.shape[0:2], you will see the height and width of your image printed as a tuple.

Okay, now we have our image matrix and we want to get the rotation matrix. To get the rotation matrix, we use the getRotationMatrix2D() method of cv2. The syntax of getRotationMatrix2D() is:

 cv2.getRotationMatrix2D(center, angle, scale)

Here center is the center point of the rotation, angle is the rotation angle in degrees, and scale is a scaling factor that makes the image fit on the screen.

To get the rotation matrix of our image, the code will be:

 rotationMatrix = cv2.getRotationMatrix2D((width/2, height/2), 90, .5)

The next step is to rotate our image with the help of the rotation matrix.

To rotate the image, we have a cv2 method named warpAffine, which takes the original image, the rotation matrix of the image, and the width and height of the image as arguments.

 rotatedImage = cv2.warpAffine(img, rotationMatrix, (width, height))

The rotated image is stored in the rotatedImage matrix. To show the image, use imshow() as below:

cv2.imshow('Rotated Image', rotatedImage)
 
cv2.waitKey(0)

After running the above lines of code, you will have the following output:

Crop an Image

First, we need to import the cv2 module and read the image and extract the width and height of the image:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
height, width = img.shape[0:2]

Now get the starting and ending index of the row and column. This will define the size of the newly created image. For example, starting from row number 10 up to row number 15 gives the height of the cropped image.

Similarly, starting from column number 10 up to column number 15 gives its width.

You can get the starting point by specifying the percentage value of the total height and the total width. Similarly, to get the ending point of the cropped image, specify the percentage values as below:

startRow = int(height*.15)
 
startCol = int(width*.15)
 
endRow = int(height*.85)
 
endCol = int(width*.85)

Now map these values to the original image. Note that you have to cast the starting and ending values to integers because when mapping, the indexes are always integers.

 croppedImage = img[startRow:endRow, startCol:endCol]

Here we specified the range from starting to ending of rows and columns.

Now display the original and cropped image in the output:

cv2.imshow('Original Image', img)
 
cv2.imshow('Cropped Image', croppedImage)
 
cv2.waitKey(0)

The result will be as follows:

Resize an Image

To resize an image, you can use the resize() method of OpenCV. In the resize method, you can either specify scale factors for the x and y axes or the number of rows and columns, which determines the size of the output image.

Import and read the image:

import cv2
 
img = cv2.imread("pyimg.jpg")

Now using the resize method with axis values:

newImg = cv2.resize(img, (0,0), fx=0.75, fy=0.75)
 
cv2.imshow('Resized Image', newImg)
 
cv2.waitKey(0)

The result will be as follows:

Now using the row and column values to resize the image:

newImg = cv2.resize(img, (550, 350))
 
cv2.imshow('Resized Image', newImg)
 
cv2.waitKey(0)

We say we want 550 columns (the width) and 350 rows (the height).

The result will be:

Adjust Image Contrast

In the Python OpenCV module, there is no dedicated function to adjust image contrast, but the official OpenCV documentation suggests an equation that can adjust image brightness and image contrast at the same time.

 new_img = a * original_img + b

Here a is alpha, which defines the contrast of the image. If a is greater than 1, there will be higher contrast.

If the value of a is between 0 and 1 (smaller than 1 but greater than 0), there would be lower contrast. If a is 1, there will be no contrast effect on the image.

b stands for beta. The values of b vary from -127 to +127.

To implement this equation in Python OpenCV, you can use the addWeighted() method. We use the addWeighted() method as it generates the output in the range of 0 to 255 for a 24-bit color image.

The syntax of addWeighted() method is as follows:

 cv2.addWeighted(source_img1, alpha1, source_img2, alpha2, beta)

This syntax blends two images: the first source image (source_img1) with a weight of alpha1 and the second source image (source_img2) with a weight of alpha2; the last argument is a scalar added to each pixel sum.

If you only want to apply contrast in one image, you can add a second image source as zeros using NumPy.

Let’s work on a simple example. Import the following modules:

import cv2
 
import numpy as np

Read the original image:

 img = cv2.imread("pyimg.jpg")

Now apply the contrast. Since there is no second image, we will use np.zeros, which creates an array of the same shape and data type as the original image, but filled with zeros.

contrast_img = cv2.addWeighted(img, 2.5, np.zeros(img.shape, img.dtype), 0, 0)
 
cv2.imshow('Original Image', img)
 
cv2.imshow('Contrast Image', contrast_img)
 
cv2.waitKey(0)

In the above code, the brightness is set to 0 as we only want to apply contrast.

The comparison of the original and contrast image is as follows:

Make an image blurry

Gaussian Blur

To make an image blurry, you can use the GaussianBlur() method of OpenCV.

The GaussianBlur() method uses a Gaussian kernel. The height and width of the kernel should both be positive, odd numbers.

Then you have to specify the standard deviations in the X and Y directions, sigmaX and sigmaY respectively. If only sigmaX is specified, sigmaY is taken to be the same.

Consider the following example:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
blur_image = cv2.GaussianBlur(img, (7,7), 0)
 
cv2.imshow('Original Image', img)
 
cv2.imshow('Blur Image', blur_image)
 
cv2.waitKey(0)

In the above snippet, the actual image is passed to GaussianBlur() along with height and width of the kernel and the X and Y directions.

The comparison of the original and blurry image is as follows:

Median Blur

In median blurring, the median of all the pixels inside the kernel area is calculated, and the central pixel is replaced with that median value. Median blurring is used when there is salt-and-pepper noise in the image.

To apply median blurring, you can use the medianBlur() method of OpenCV.

Consider the following example where we have a salt and pepper noise in the image:

import cv2
 
img = cv2.imread("pynoise.png")
 
blur_image = cv2.medianBlur(img,5)

This applies a median blur with a kernel size of 5 to the noisy image. Now show the images:

cv2.imshow('Original Image', img)
 
cv2.imshow('Blur Image', blur_image)
 
cv2.waitKey(0)

The result will be like the following:

Another comparison of the original image and after blurring:

Detect Edges

To detect the edges in an image, you can use the Canny() method of cv2 which implements the Canny edge detector. The Canny edge detector is also known as the optimal detector.

The syntax to Canny() is as follows:

 cv2.Canny(image, minVal, maxVal)

Here minVal and maxVal are the minimum and maximum intensity gradient values respectively.

Consider the following code:

import cv2
 
img = cv2.imread("pyimg.jpg")
 
edge_img = cv2.Canny(img,100,200)
 
cv2.imshow("Detected Edges", edge_img)
 
cv2.waitKey(0)

The output will be the following:

Here is the result of the above code on another image:

Convert image to grayscale (Black & White)

The easy way to convert an image to grayscale is to load it like this (the second argument, 0, tells imread() to load the image in grayscale):

 img = cv2.imread("pyimg.jpg", 0)

There is another method using BGR2GRAY.

To convert a color image into a grayscale image, use the BGR2GRAY attribute of the cv2 module. This is demonstrated in the example below:

Import the cv2 module:

 import cv2

Read the image:

 img = cv2.imread("pyimg.jpg")

Use the cvtColor() method of the cv2 module which takes the original image and the COLOR_BGR2GRAY attribute as an argument. Store the resultant image in a variable:

 gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Display the original and grayscale images:

cv2.imshow("Original Image", img)
 
cv2.imshow("Gray Scale Image", gray_img)
 
cv2.waitKey(0)

The output will be as follows:

Centroid (Center of blob) detection

To find the center of an image, the first step is to convert the original image into grayscale. We can use the cvtColor() method of cv2 as we did before.

This is demonstrated in the following code:

import cv2
 
img = cv2.imread("py.jpg")
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

We read the image and convert it to a grayscale image. The new image is stored in gray_img.

Now we have to calculate the moments of the image. Use the moments() method of cv2. In the moments() method, the grayscale image will be passed as below:

 moment = cv2.moments(gray_img)
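The X and Y coordinates used below come from these moments; this intermediate step is not shown in the original snippet, but it is typically computed like this:

# Centroid from the first-order moments (m00 is the zeroth-order moment)
X = int(moment["m10"] / moment["m00"])
Y = int(moment["m01"] / moment["m00"])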

Finally, we have the center of the image. To highlight this center position, we can use the circle() method, which will create a circle at the given coordinates with the given radius.

The circle() method takes the img, the x and y coordinates where the circle will be created, the size, the color that we want the circle to be and the thickness.

 cv2.circle(img, (X, Y), 15, (205, 114, 101), 1)

The circle is created on the image.

cv2.imshow("Center of the Image", img)
 
cv2.waitKey(0)

The original image is:

After detecting the center, our image will be as follows:

Apply a mask for a colored image

Image masking means to apply some other image as a mask on the original image or to change the pixel values in the image.

To apply a mask on the image, we will use the HoughCircles() method of the OpenCV module. The HoughCircles() method detects the circles in an image. After detecting the circles, we can simply apply a mask on these circles.

The HoughCircles() method takes the original image, the Hough Gradient (which detects the gradient information in the edges of the circle), and the information from the following circle equation:

 (x - x_center)^2 + (y - y_center)^2 = r^2

In this equation, (x_center, y_center) is the center of the circle and r is its radius.

Our original image is:

After detecting circles in the image, the result will be:

Okay, so we have the circles in the image and we can apply the mask. Consider the following code:

import cv2
 
import numpy as np
 
img1 = cv2.imread('pyimg.jpg')
 
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)

Detecting the circles in the image using the HoughCircles() code from OpenCV: Hough Circle Transform:

gray_img = cv2.medianBlur(cv2.cvtColor(img1, cv2.COLOR_RGB2GRAY), 3)
 
circles = cv2.HoughCircles(gray_img, cv2.HOUGH_GRADIENT, 1, 20, param1=50, param2=50, minRadius=0, maxRadius=0)
 
circles = np.uint16(np.around(circles))

To create the mask, use np.full which will return a NumPy array of given shape:

masking = np.full((img1.shape[0], img1.shape[1]), 0, dtype=np.uint8)
 
for j in circles[0, :]:
 
    cv2.circle(masking, (j[0], j[1]), j[2], (255, 255, 255), -1)

The next step is to combine the image and the masking array we created using the bitwise_or operator as follows:

 final_img = cv2.bitwise_or(img1, img1, mask=masking)

Display the resultant image:
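The display step itself is not shown above; following the same pattern as the earlier examples, it would look something like this:

cv2.imshow('Masked Image', final_img)
 
cv2.waitKey(0)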

Extracting text from Image (OCR)

To extract text from an image, you can use Google Tesseract-OCR. You can download it from this link.

Then you should install the pytesseract module which is a Python wrapper for Tesseract-OCR.

The image from which we will extract the text is as follows:

Now let’s convert the text in this image to a string of characters and display the text as a string on output:

Import the pytesseract module:

 import pytesseract

Set the path of the Tesseract-OCR executable file:

 pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

Now use the image_to_string method to convert the image into a string:

 print(pytesseract.image_to_string('pytext.png'))

The output will be as follows:

Works like charm!

Detect and correct text skew

In this section, we will correct the text skew.

The original image is as follows:

Import the modules cv2, NumPy and read the image:

import cv2
 
import numpy as np
 
img = cv2.imread("pytext1.png")

Convert the image into a grayscale image:

 gray_img=cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Invert the grayscale image using bitwise_not:

 gray_img=cv2.bitwise_not(gray_img)

Select the x and y coordinates of the pixels greater than zero by using the column_stack method of NumPy:

 coordinates = np.column_stack(np.where(gray_img > 0))

Now we have to calculate the skew angle. We will use the minAreaRect() method of cv2 which returns an angle range from -90 to 0 degrees (where 0 is not included).

 ang=cv2.minAreaRect(coordinates)[-1]

The rotated angle of the text region will be stored in the ang variable. Now we add a condition for the angle: if the text region's angle is smaller than -45 degrees, we add 90 degrees to it and flip the sign; otherwise, we simply flip the sign to make the angle positive.

if ang<-45:
 
    ang=-(90+ang)
 
else:
 
    ang=-ang

Calculate the center of the text region:

height, width = img.shape[:2]
 
center_img = (width / 2, height / 2)

Now that we have the angle of the text skew, we apply getRotationMatrix2D() to get the rotation matrix and then use the warpAffine() method to rotate the image by that angle (explained earlier).

rotationMatrix = cv2.getRotationMatrix2D(center_img, ang, 1.0)
 
rotated_img = cv2.warpAffine(img, rotationMatrix, (width, height), borderMode = cv2.BORDER_REFLECT)

Display the rotated image:

cv2.imshow("Rotated Image", rotated_img)
 
cv2.waitKey(0)

Color Detection

Let’s detect the green color from an image:

Import the modules cv2 for images and NumPy for image arrays:

import cv2
 
import numpy as np

Read the image and convert it into HSV using cvtColor():

img = cv2.imread("pydetect.png")
 
hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

Display the image:

 cv2.imshow("HSV Image", hsv_img)

Now create a NumPy array for the lower green values and the upper green values:

lower_green = np.array([34, 177, 76])
 
upper_green = np.array([255, 255, 255])

Use the inRange() method of cv2 to check if the given image array elements lie between array values of upper and lower boundaries:

 masking = cv2.inRange(hsv_img, lower_green, upper_green)

This will detect the green color.

Finally, display the original and resultant images:

 cv2.imshow("Original Image", img)

cv2.imshow("Green Color detection", masking)
 
cv2.waitKey(0)

Reduce Noise

To reduce noise from an image, OpenCV provides the following methods:

  1. fastNlMeansDenoising(): Removes noise from a grayscale image
  2. fastNlMeansDenoisingColored(): Removes noise from a colored image
  3. fastNlMeansDenoisingMulti(): Removes noise from grayscale image frames (a grayscale video)
  4. fastNlMeansDenoisingColoredMulti(): Same as 3 but works with colored frames

Let’s use fastNlMeansDenoisingColored() in our example:

Import the cv2 module and read the image:

import cv2
 
img = cv2.imread("pyn1.png")

Apply the denoising function. Its arguments are, in order: the original image (src); the destination (None here, since we store the result in a new variable); the filter strength; the filter strength for the color components, used to remove colored noise (usually equal to the filter strength, or 10); the template patch size in pixels used to compute weights, which should always be odd (the recommended size is 7); and the search window size in pixels used to compute a weighted average for the given pixel.

 result = cv2.fastNlMeansDenoisingColored(img,None,20,10,7,21)

Display original and denoised image:

cv2.imshow("Original Image", img)
 
cv2.imshow("Denoised Image", result)
 
cv2.waitKey(0)

The output will be:

Get image contour

Contours are the curves in an image that join continuous points together along a boundary. Contours are used to detect the objects in an image.

The original image of which we are getting the contours of is given below:

Consider the following code where we used the findContours() method to find the contours in the image:

Import cv2 module:

 import cv2

Read the image and convert it to a grayscale image:

img = cv2.imread('py1.jpg')
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Find the threshold:

 retval, thresh = cv2.threshold(gray_img, 127, 255, 0)

Use findContours(), which takes the image (we passed the thresholded image here) and some flags. See the official findContours() documentation.

 img_contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

Draw the contours on the image using drawContours() method:

  cv2.drawContours(img, img_contours, -1, (0, 255, 0))

Display the image:

cv2.imshow('Image Contours', img)
 
cv2.waitKey(0)

The result will be:

Remove Background from an image

To remove the background from an image, we will find the contours to detect edges of the main object and create a mask with np.zeros for the background and then combine the mask and the image using the bitwise_and operator.

Consider the example below:

Import the modules (NumPy and cv2):

import cv2
 
import numpy as np

Read the image and convert the image into a grayscale image:

img = cv2.imread("py.jpg")
 
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Find the threshold:

 _, thresh = cv2.threshold(gray_img, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

In the threshold() method, the last argument defines the style of the threshold. See Official documentation of OpenCV threshold.

Find the image contours:

 img_contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2]

Sort the contours:

img_contours = sorted(img_contours, key=cv2.contourArea)
 
for i in img_contours:
 
    if cv2.contourArea(i) > 100:
 
        break

Generate the mask using np.zeros:

 mask = np.zeros(img.shape[:2], np.uint8)

Draw contours:

 cv2.drawContours(mask, [i],-1, 255, -1)

Apply the bitwise_and operator:

 new_img = cv2.bitwise_and(img, img, mask=mask)

Display the original image:

 cv2.imshow("Original Image", img)

Display the resultant image:

cv2.imshow("Image with background removed", new_img)
 
cv2.waitKey(0)

Image processing is fun when using OpenCV as you saw. I hope you find the tutorial useful. Keep coming back.

Thank you.

How to Set up an SMS Notification With Python



Hi everyone :) Today I am beginning a new series of posts specifically aimed at Python beginners. The concept is rather simple: I'll do a fun project, in as few lines of code as possible, and will try out as many new tools as possible.

For example, today we will learn to use the Twilio API, the Twitch API, and we'll see how to deploy the project on Heroku. I'll show you how you can have your own "Twitch Live" SMS notifier, in 30 lines of code, and for 12 cents a month.

Prerequisite: You only need to know how to run Python on your machine and some basic commands in git (commit & push). If you need help with these, I can recommend these 2 articles to you:

Python 3 Installation & Setup Guide

The Ultimate Git Command Tutorial for Beginners from Adrian Hajdin.

What you'll learn:

  • Twitch API
  • Twilio API
  • Deploying on Heroku
  • Setting up a scheduler on Heroku

What you will build:

The specifications are simple: we want to receive an SMS as soon as a specific Twitcher is live streaming. We want to know when this person goes live and when they stop streaming. We want this whole thing to run by itself, all day long.

We will split the project into 3 parts. First, we will see how to programmatically know if a particular Twitcher is online. Then we will see how to receive an SMS when this happens. We will finish by seeing how to make this piece of code run every X minutes, so we never miss another moment of our favorite streamer's life.

Is this Twitcher live?

To know if a Twitcher is live, the first thing we could do is go to the Twitcher's URL and try to see if the "Live" badge is there.

Screenshot of a Twitcher live streaming.

This process involves scraping and is not easily doable in Python in less than 20 or so lines of code. Twitch runs a lot of JS code and a simple requests.get() won't be enough.

For scraping to work in this case, we would need to render this page inside Chrome to get the same content as what you see in the screenshot. This is doable, but it would take much more than 30 lines of code. If you'd like to learn more, don't hesitate to check my recent web scraping guide.

So instead of trying to scrape Twitch, we will use their API. For those unfamiliar with the term, an API is a programmatic interface that allows websites to expose their features and data to anyone, mainly developers. In Twitch's case, their API is exposed through HTTP, which means that we can get lots of information and do lots of things by just making a simple HTTP request.

Get your API key

To do this, you have to first create a Twitch API key. Many services enforce authentication for their APIs to ensure that no one abuses them or to restrict access to certain features by certain people.

Please follow these steps to get your API key:

  • Create a Twitch account
  • Now create a Twitch dev account -> "Signing up with Twitch" top right
  • Go to your "dashboard" once logged in
  • "Register your application"
  • Name -> Whatever, Oauth redirection URL -> http://localhost, Category -> Whatever

You should now see, at the bottom of your screen, your client-id. Keep this for later.

Is that Twitcher streaming now?

With your API key in hand, we can now query the Twitch API to have the information we want, so let's begin to code. The following snippet just consumes the Twitch API with the correct parameters and prints the response.

# requests is the go to package in python to make http request
# https://2.python-requests.org/en/master/
import requests

# This is one of the route where Twich expose data, 
# They have many more: https://dev.twitch.tv/docs
endpoint = "https://api.twitch.tv/helix/streams?"

# In order to authenticate we need to pass our api key through header
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}

# The previously set endpoint needs some parameter, here, the Twitcher we want to follow
# Disclaimer, I don't even know who this is, but he was the first one on Twich to have a live stream so I could have nice examples
params = {"user_login": "Solary"}

# It is now time to make the actual request
response = requests.get(endpoint, params=params, headers=headers)
print(response.json())

The output should look like this:

{
   'data':[
      {
         'id':'35289543872',
         'user_id':'174955366',
         'user_name':'Solary',
         'game_id':'21779',
         'type':'live',
         'title':"Wakz duoQ w/ Tioo - GM 400LP - On récupère le chall après les -250LP d'inactivité !",
         'viewer_count':4073,
         'started_at':'2019-08-14T07:01:59Z',
         'language':'fr',
         'thumbnail_url':'https://static-cdn.jtvnw.net/previews-ttv/live_user_solary-{width}x{height}.jpg',
         'tag_ids':[
            '6f655045-9989-4ef7-8f85-1edcec42d648'
         ]
      }
   ],
   'pagination':{
      'cursor':'eyJiIjpudWxsLCJhIjp7Ik9mZnNldCI6MX19'
   }
}

This data format is called JSON and is easily readable. The data object is an array that contains all the currently active streams. The key type ensures that the stream is currently live. This key will be empty otherwise (in case of an error, for example).

So if we want to create a boolean variable in Python that stores whether the current user is streaming, all we have to append to our code is:

json_response = response.json()

# We get only streams
streams = json_response.get('data', [])

# We create a small function, (a lambda), that tests if a stream is live or not
is_active = lambda stream: stream.get('type') == 'live'
# We filter our array of streams with this function so we only keep streams that are active
streams_active = filter(is_active, streams)

# any returns True if streams_active has at least one element, else False
at_least_one_stream_active = any(streams_active)

print(at_least_one_stream_active)

At this point, at_least_one_stream_active is True when your favourite Twitcher is live.

Let's now see how to get notified by SMS.

Send me a text, NOW!

So to send a text to ourselves, we will use the Twilio API. Just go over there and create an account. When asked to confirm your phone number, please use the phone number you want to use in this project. This way you'll be able to use the $15 of free credit Twilio offers to new users. At around 1 cent a text, it should be enough for your bot to run for one year.

If you go to the console, you'll see your Account SID and your Auth Token; save them for later. Also click on the big red button "Get My Trial Number", follow the steps, and save this one for later too.

Sending a text with the Twilio Python API is very easy, as they provide a package that does the annoying stuff for you. Install the package with pip install twilio and just do:

from twilio.rest import Client
client = Client(<Your Account SID>, <Your Auth Token>)
client.messages.create(
	body='Test MSG',from_=<Your Trial Number>,to=<Your Real Number>)

And that is all you need to send yourself a text, amazing right?

Putting everything together

We will now put everything together, and shorten the code a bit so we manage to stay under 30 lines of Python code.

import requests
from twilio.rest import Client
endpoint = "https://api.twitch.tv/helix/streams?"
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}
params = {"user_login": "Solary"}
response = requests.get(endpoint, params=params, headers=headers)
json_response = response.json()
streams = json_response.get('data', [])
is_active = lambda stream:stream.get('type') == 'live'
streams_active = filter(is_active, streams)
at_least_one_stream_active = any(streams_active)
if at_least_one_stream_active:
    client = Client(<Your Account SID>, <Your Auth Token>)
    client.messages.create(body='LIVE !!!', from_=<Your Trial Number>, to=<Your Real Number>)

Avoiding double notifications

This snippet works great, but should that snippet run every minute on a server, as soon as our favorite Twitcher goes live we will receive an SMS every minute.

We need a way to store the fact that we were already notified that our Twitcher is live and that we don't need to be notified anymore.

The good thing with the Twilio API is that it offers a way to retrieve our message history, so we just have to retrieve the last SMS we sent to see if we already sent a text notifying us that the twitcher is live.

Here is what we are going to do, in pseudocode:

if favorite_twitcher_live and last_sent_sms is not live_notification:
	send_live_notification()
if not favorite_twitcher_live and last_sent_sms is live_notification:
	send_live_is_over_notification()

This way we will receive a text as soon as the stream starts, as well as when it is over, and we won't get spammed - perfect, right? Let's code it:

# reusing our Twilio client
last_messages_sent = client.messages.list(limit=1)
last_message_id = last_messages_sent[0].sid
last_message_data = client.messages(last_message_id).fetch()
last_message_content = last_message_data.body

Let's now put everything together again:

import requests
from twilio.rest import Client
client = Client(<Your Account SID>, <Your Auth Token>)

endpoint = "https://api.twitch.tv/helix/streams?"
headers = {"Client-ID": "<YOUR-CLIENT-ID>"}
params = {"user_login": "Solary"}
response = requests.get(endpoint, params=params, headers=headers)
json_response = response.json()
streams = json_response.get('data', [])
is_active = lambda stream:stream.get('type') == 'live'
streams_active = filter(is_active, streams)
at_least_one_stream_active = any(streams_active)

last_messages_sent = client.messages.list(limit=1)
if last_messages_sent:
    last_message_id = last_messages_sent[0].sid
    last_message_data = client.messages(last_message_id).fetch()
    last_message_content = last_message_data.body
    online_notified = "LIVE" in last_message_content
    offline_notified = not online_notified
else:
    online_notified, offline_notified = False, False

if at_least_one_stream_active and not online_notified:
    client.messages.create(body='LIVE !!!', from_=<Your Trial Number>, to=<Your Real Number>)
if not at_least_one_stream_active and not offline_notified:
    client.messages.create(body='OFFLINE !!!', from_=<Your Trial Number>, to=<Your Real Number>)

And voilà!

You now have a snippet of code, in less than 30 lines of Python, that will send you a text as soon as your favourite Twitcher goes online or offline, and without spamming you.

We just now need a way to host and run this snippet every X minutes.

The quest for a host

To host and run this snippet we will use Heroku. Heroku is honestly one of the easiest ways to host an app on the web. The downside is that it is really expensive compared to other solutions out there. Fortunately for us, they have a generous free plan that will allow us to do what we want for almost nothing.

If you don't have one already, you need to create a Heroku account. You also need to download and install the Heroku client.

You now have to move your Python script into its own folder; don't forget to add a requirements.txt file to it. The content of the latter is:

requests
twilio

This is to ensure that Heroku downloads the correct dependencies.

cd into this folder and just do a heroku create --app <app name>.

If you go on your app dashboard you'll see your new app.

We now need to initialize a git repo and push the code on Heroku:

git init
heroku git:remote -a <app name>
git add .
git commit -am 'Deploy breakthrough script'
git push heroku master

Your app is now on Heroku, but it is not doing anything. Since this little script can't accept HTTP requests, going to <app name>.herokuapp.com won't do anything. But that should not be a problem.

To have this script running 24/7, we need to use a simple Heroku add-on called "Heroku Scheduler". To install this add-on, click on the "Configure Add-ons" button on your app dashboard.

Then, on the search bar, look for Heroku Scheduler:

Click on the result, and click on "Provision"

If you go back to your App dashboard, you'll see the add-on:

Click on the "Heroku Scheduler" link to configure a job. Then click on "Create Job". Here select "10 minutes", and for run command select python <name_of_your_script>.py. Click on "Save job".

While everything we used so far on Heroku is free, the Heroku Scheduler will run the job on the $25/month instance, but prorated to the second. Since this script takes approximately 3 seconds to run, running it every 10 minutes should only cost you about 12 cents a month (see the rough calculation below).
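As a rough back-of-the-envelope check of that figure (approximate, since the exact run time and billing details vary):

runs_per_month = 30 * 24 * 6            # one run every 10 minutes
seconds_used = runs_per_month * 3       # roughly 3 seconds per run
hours_used = seconds_used / 3600        # about 3.6 hours of dyno time
monthly_cost = hours_used * (25 / 730)  # $25/month instance, ~730 hours in a month

print(round(monthly_cost, 2))           # roughly 0.12 dollars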

Ideas for improvements

I hope you liked this project and that you had fun putting it into place. In less than 30 lines of code, we did a lot, but this whole thing is far from perfect. Here are a few ideas to improve it:

  • Send yourself more information about the current streaming (game played, number of viewers ...)
  • Send yourself the duration of the last stream once the twitcher goes offline
  • Don't send you a text, but rather an email
  • Monitor multiple twitchers at the same time

Do not hesitate to tell me in the comments if you have more ideas.

Conclusion

I hope that you liked this post and that you learned things reading it. I truly believe that this kind of project is one of the best ways to learn new tools and concepts; I recently launched a web scraping API and learned a lot while making it.

Please tell me in the comments if you liked this format and if you want to do more.

I have many other ideas, and I hope you will like them. Do not hesitate to share what other things you build with this snippet, possibilities are endless.

Happy Coding.

Pierre

Don't want to miss my next post:

You can subscribe here to my newsletter.

Top 10 Books To Learn Python



This video on 'Top 10 Books To Learn Python' will suggest to you what we think are the best books for Python, even if you are an experienced programmer or a complete beginner. Below are the topics covered in this video:

Why Python?

  • Beginner-Level Python Books
  • Domain-Specific Python Books
  • Bonus Python Book

Links for the Python Books:

  1. Learning Python by Mark Lutz: http://bit.ly/2BR38aY
  2. Python Crash Course by Eric Matthews: http://bit.ly/2BLlJ8i
  3. Think Python by Allen Downey: http://bit.ly/2pjoXNC
  4. Python Programming by John M Zelle: http://bit.ly/31SkYon
  5. Python in a Nutshell by Alex Martelli: http://bit.ly/32UOyeh
  6. Programming Python Mark Lutz: https://amzn.to/31Slhj1
  7. Effective Computation in Physics by Anthony Scopatz, Kathryn D. Huff: http://bit.ly/2BPD00c
  8. Python for Data Analysis by Wes McKinney: http://bit.ly/2pWCaMo
  9. Python Machine Learning by Sebastian Raschka and Vahid Mirjalili: https://amzn.to/36amOV3
  10. Django for Beginners by William S. Vincent: https://amzn.to/36lQtuG

A Complete Machine Learning Project Walk-Through in Python


A Complete Machine Learning Project Walk-Through in Python: Putting the machine learning pieces together; Model Selection, Hyperparameter Tuning, and Evaluation; Interpreting a machine learning model and presenting results

Reading through a data science book or taking a course, it can feel like you have the individual pieces but don't quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but persevering through and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.

We’ll follow the general machine learning workflow step-by-step:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Along the way, we’ll see how each step flows into the next and how to specifically implement each part in Python. The complete project is available on GitHub, with the first notebook here.

(As a note, this problem was originally given to me as an “assignment” for a job screen at a start-up. After completing the work, I was offered the job, but then the CTO of the company quit and they weren’t able to bring on any new employees. I guess that’s how things go on the start-up scene!)

Problem Definition

The first step before we get coding is to understand the problem we are trying to solve and the available data. In this project, we will work with publicly available building energy data from New York City.

The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score.

The data includes the Energy Star Score, which makes this a supervised regression machine learning task:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

We want to develop a model that is both accurate — it can predict the Energy Star Score close to the true value — and interpretable — we can understand the model predictions. Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.

Data Cleaning

Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you mtcars and iris datasets). Real-world data is messy which means we need to clean and wrangle it into an acceptable format before we can even start the analysis. Data cleaning is an un-glamorous, but necessary part of most actual data science problems.

First, we can load in the data as a Pandas DataFrame and take a look:

import pandas as pd
import numpy as np

# Read in data into a dataframe 
data = pd.read_csv('data/Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')

# Display top of dataframe
data.head()

This is a subset of the full data, which contains 60 columns. Already, we can see a couple of issues: first, we know that we want to predict the ENERGY STAR Score, but we don't know what any of the columns mean. While this isn't necessarily an issue — we can often make an accurate model without any knowledge of the variables — we want to focus on interpretability, and it might be important to understand at least some of the columns.

When I originally got the assignment from the start-up, I didn’t want to ask what all the column names meant, so I looked at the name of the file,

and decided to search for “Local Law 84”. That led me to this page which explains this is an NYC law requiring all buildings of a certain size to report their energy use. More searching brought me to all the definitions of the columns. Maybe looking at a file name is an obvious place to start, but for me this was a reminder to go slow so you don’t miss anything important!

We don’t need to study all of the columns, but we should at least understand the Energy Star Score, which is described as:

A 1-to-100 percentile ranking based on self-reported energy usage for the reporting year. The Energy Star score is a relative measure used for comparing the energy efficiency of buildings.

That clears up the first problem, but the second issue is that missing values are encoded as "Not Available". This is a string in Python, which means that even the columns with numbers will be stored as object datatypes, because Pandas converts a column containing any strings into a column of all strings. We can see the datatypes of the columns using the dataframe.info() method:

# See the column data types and non-missing values
data.info()

Sure enough, some of the columns that clearly contain numbers (such as ft²), are stored as objects. We can’t do numerical analysis on strings, so these will have to be converted to number (specifically float) data types!

Here’s a little Python code that replaces all the “Not Available” entries with not a number (np.nan), which can be interpreted as a number (specifically a float), and then converts the relevant columns to the float datatype:

# Replace all occurrences of Not Available with numpy not a number
data = data.replace({'Not Available': np.nan})

# Iterate through the columns
for col in list(data.columns):
    # Select columns that should be numeric
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in 
        col or 'therms' in col or 'gal' in col or 'Score' in col):
        # Convert the data type to float
        data[col] = data[col].astype(float)

Once the correct columns are numbers, we can start to investigate the data.

Missing Data and Outliers

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let’s get a sense of how many missing values are in each column (see the notebook for code).

(To create this table, I used a function from this Stack Overflow Forum).

While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem (here is a discussion), and for this project, we will remove any columns with more than 50% missing values.
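
The notebook’s helper isn’t reproduced here, but a minimal sketch of the idea (the function name is mine; the 50% threshold is the one mentioned above) could look like this:

# Sketch: percentage of missing values per column, then drop
# any column that is more than 50% missing
def missing_values_table(df):
    # Count and percentage of missing values for each column
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    
    # Combine into one table, keeping only columns with missing values
    table = pd.concat([mis_val, mis_val_percent], axis = 1)
    table.columns = ['Missing Values', '% of Total Values']
    return table[table['Missing Values'] > 0].sort_values('% of Total Values', ascending = False)

missing_df = missing_values_table(data)

# Drop the columns with more than 50% missing values
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
data = data.drop(columns = missing_columns)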

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

  • Below the first quartile − 3 × the interquartile range
  • Above the third quartile + 3 × the interquartile range

(For the code to remove the columns and the anomalies, see the notebook). At the end of the data cleaning and anomaly removal process, we are left with over 11,000 buildings and 49 features.
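
The removal itself is only a few lines; here is a rough sketch of the extreme-outlier rule applied to a single column (using the Site EUI column purely for illustration):

# Sketch: apply the extreme-outlier rule to one column
first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%']
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']

# Interquartile range
iqr = third_quartile - first_quartile

# Keep only the buildings within the extreme-outlier bounds
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]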

Exploratory Data Analysis

Now that the tedious — but necessary — step of data cleaning is complete, we can move on to exploring our data! Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.

In short, the goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find interesting parts of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Single Variable Plots

The goal is to predict the Energy Star Score (renamed to score in our data) so a reasonable place to start is examining the distribution of this variable. A histogram is a simple yet effective way to visualize the distribution of a single variable and is easy to make using matplotlib.

import matplotlib.pyplot as plt

# Histogram of the Energy Star Score
plt.style.use('fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 
plt.title('Energy Star Score Distribution');

This looks quite suspicious! The Energy Star score is a percentile rank, which means we would expect to see a uniform distribution, with each score assigned to the same number of buildings. However, a disproportionate number of buildings have either the highest, 100, or the lowest, 1, score (higher is better for the Energy Star score).

If we go back to the definition of the score, we see that it is based on “self-reported energy usage” which might explain the very high scores. Asking building owners to report their own energy usage is like asking students to report their own scores on a test! As a result, this probably is not the most objective measure of a building’s energy efficiency.

If we had an unlimited amount of time, we might want to investigate why so many buildings have very high and very low scores, which we could do by selecting these buildings and seeing what they have in common. However, our objective is only to predict the score and not to devise a better method of scoring buildings! We can make a note in our report that the scores have a suspect distribution, but our main focus is on predicting the score.

Looking for Relationships

A major part of EDA is searching for relationships between the features and the target. Variables that are correlated with the target are useful to a model because they can be used to predict the target. One way to examine the effect of a categorical variable (which takes on only a limited set of values) on the target is through a density plot using the seaborn library.

A density plot can be thought of as a smoothed histogram because it shows the distribution of a single variable. We can color a density plot by class to see how a categorical variable changes the distribution. The following code makes a density plot of the Energy Star Score colored by the type of building (limited to building types with more than 100 data points):

import seaborn as sns

# Create a list of building types with more than 100 observations
types = data.dropna(subset=['score'])
types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)

# Plot of distribution of scores for building categories
plt.figure(figsize = (12, 10))

# Plot each building
for b_type in types:
    # Select the building type
    subset = data[data['Largest Property Use Type'] == b_type]
    
    # Density plot of Energy Star scores
    sns.kdeplot(subset['score'].dropna(),
               label = b_type, shade = False, alpha = 0.8);
    
# label the plot
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);

We can see that the building type has a significant impact on the Energy Star Score. Office buildings tend to have a higher score while Hotels have a lower score. This tells us that we should include the building type in our modeling because it does have an impact on the target. As a categorical variable, we will have to one-hot encode the building type.

A similar plot can be used to show the Energy Star Score by borough:

The borough does not seem to have as large of an impact on the score as the building type. Nonetheless, we might want to include it in our model because there are slight differences between the boroughs.

To quantify relationships between variables, we can use the Pearson Correlation Coefficient. This is a measure of the strength and direction of a linear relationship between two variables. A score of +1 is a perfectly linear positive relationship and a score of -1 is a perfectly negative linear relationship. Several values of the correlation coefficient are shown below:

While the correlation coefficient cannot capture non-linear relationships, it is a good way to start figuring out how variables are related. In Pandas, we can easily calculate the correlations between any columns in a dataframe:

# Find all correlations with the score and sort 
correlations_data = data.corr()['score'].sort_values()

The most negative (left) and positive (right) correlations with the target:
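
The table itself comes from the notebook; one quick way to look at the two ends of correlations_data is simply:

# Print the most negative and most positive correlations with the score
print(correlations_data.head(15), '\n')
print(correlations_data.tail(15))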

There are several strong negative correlations between the features and the target, with the most negative being the different categories of EUI (these measures vary slightly in how they are calculated). The EUI — Energy Use Intensity — is the amount of energy used by a building divided by the square footage of the building. It is meant to be a measure of the efficiency of a building, with a lower score being better. Intuitively, these correlations make sense: as the EUI increases, the Energy Star Score tends to decrease.

Two-Variable Plots

To visualize relationships between two continuous variables, we use scatterplots. We can include additional information, such as a categorical variable, in the color of the points. For example, the following plot shows the Energy Star Score vs. Site EUI colored by the building type:

This plot lets us visualize what a correlation coefficient of -0.7 looks like. As the Site EUI decreases, the Energy Star Score increases, a relationship that holds steady across the building types.

The final exploratory plot we will make is known as the Pairs Plot. This is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we are using the seaborn visualization library and the PairGrid function to create a Pairs Plot with scatterplots on the upper triangle, histograms on the diagonal, and 2D kernel density plots and correlation coefficients on the lower triangle.

# Extract the columns to plot
plot_data = features[['score', 'Site EUI (kBtu/ft²)', 
                      'Weather Normalized Source EUI (kBtu/ft²)', 
                      'log_Total GHG Emissions (Metric Tons CO2e)']]

# Replace the inf with nan
plot_data = plot_data.replace({np.inf: np.nan, -np.inf: np.nan})

# Rename columns 
plot_data = plot_data.rename(columns = {'Site EUI (kBtu/ft²)': 'Site EUI', 
                                        'Weather Normalized Source EUI (kBtu/ft²)': 'Weather Norm EUI',
                                        'log_Total GHG Emissions (Metric Tons CO2e)': 'log GHG Emissions'})

# Drop na values
plot_data = plot_data.dropna()

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, size = 3)

# Upper is a scatter plot
grid.map_upper(plt.scatter, color = 'red', alpha = 0.6)

# Diagonal is a histogram
grid.map_diag(plt.hist, color = 'red', edgecolor = 'black')

# Bottom is correlation and density plot
grid.map_lower(corr_func);
grid.map_lower(sns.kdeplot, cmap = plt.cm.Reds)

# Title for entire plot
plt.suptitle('Pairs Plot of Energy Data', size = 36, y = 1.02);

To see interactions between variables, we look for where a row intersects with a column. For example, to see the correlation of Weather Norm EUI with score, we look in the Weather Norm EUI row and the score column and see a correlation coefficient of -0.67. In addition to looking cool, plots such as these can help us decide which variables to include in modeling.

Feature Engineering and Selection

Feature engineering and selection often provide the greatest return on time invested in a machine learning problem. First of all, let’s define what these two tasks are:

  • Feature engineering: the process of taking raw data and extracting or creating new features from it, for example by transforming variables (taking the natural log, square root, and so on) or one-hot encoding categorical variables so a model can use them
  • Feature selection: the process of choosing the most relevant features in the data, which means removing features that are redundant or uninformative so the model can generalize and stay interpretable

A machine learning model can only learn from the data we provide it, so ensuring that data includes all the relevant information for our task is crucial. If we don’t feed a model the correct data, then we are setting it up to fail and we should not expect it to learn!

For this project, we will take the following feature engineering steps:

  • One-hot encode the categorical variables (borough and largest property use type)
  • Add in the natural log transformation of the numerical variables

One-hot encoding is necessary to include categorical variables in a model. A machine learning algorithm cannot understand a building type of “office”, so we have to record it as a 1 if the building is an office and a 0 otherwise.

Adding transformed features can help our model learn non-linear relationships within the data. Taking the square root, natural log, or various powers of features is common practice in data science and can be based on domain knowledge or what works best in practice. Here we will include the natural log of all numerical features.

The following code selects the numeric features, takes log transformations of these features, selects the two categorical features, one-hot encodes these features, and joins the two sets together. This seems like a lot of work, but it is relatively straightforward in Pandas!

# Copy the original data
features = data.copy()

# Select the numeric columns
numeric_subset = data.select_dtypes('number')

# Create columns with log of numeric columns
for col in numeric_subset.columns:
    # Skip the Energy Star Score column
    if col == 'score':
        continue
    # Take the natural log of the column (zeros become -inf, handled later)
    numeric_subset['log_' + col] = np.log(numeric_subset[col])
        
# Select the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]

# One hot encode
categorical_subset = pd.get_dummies(categorical_subset)

# Join the two dataframes using concat
# Make sure to use axis = 1 to perform a column bind
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

After this process we have over 11,000 observations (buildings) with 110 columns (features). Not all of these features are likely to be useful for predicting the Energy Star Score, so now we will turn to feature selection to remove some of the variables.

Feature Selection

Many of the 110 features we have in our data are redundant because they are highly correlated with one another. For example, here is a plot of Site EUI vs Weather Normalized Site EUI which have a correlation coefficient of 0.997.

Features that are strongly correlated with each other are known as collinear and removing one of the variables in these pairs of features can often help a machine learning model generalize and be more interpretable. (I should point out we are talking about correlations of features with other features, not correlations with the target, which help our model!)

There are a number of methods to calculate collinearity between features, with one of the most common being the variance inflation factor. In this project, we will use the correlation coefficient to identify and remove collinear features. We will drop one of a pair of features if the correlation coefficient between them is greater than 0.6. For the implementation, take a look at the notebook (and this Stack Overflow answer).
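
The core idea is easy to sketch, though. Assuming we work on the absolute correlation matrix with the 0.6 threshold (and leave the score column out of the check), one version looks like this:

# Sketch: drop one feature out of every highly correlated pair
# Absolute correlation matrix of the features (excluding the target)
corr_matrix = features.drop(columns = ['score']).corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

# Columns correlated above 0.6 with an earlier column
to_drop = [column for column in upper.columns if any(upper[column] > 0.6)]
features = features.drop(columns = to_drop)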

While this value may seem arbitrary, I tried several different thresholds, and this choice yielded the best model. Machine learning is an empirical field and is often about experimenting and finding what performs best! After feature selection, we are left with 64 total features and 1 target.

# Remove any columns with all na values
features = features.dropna(axis=1, how = 'all')
print(features.shape)

(11319, 65)

Establishing a Baseline

We have now completed data cleaning, exploratory data analysis, and feature engineering. The final step to take before getting started with modeling is establishing a naive baseline. This is essentially a guess against which we can compare our results. If the machine learning models do not beat this guess, then we might have to conclude that machine learning is not suitable for the task, or that we need to try a different approach.

For regression problems, a reasonable naive baseline is to guess the median value of the target on the training set for all the examples in the test set. This sets a relatively low bar for any model to surpass.

The metric we will use is mean absolute error (mae) which measures the average absolute error on the predictions. There are many metrics for regression, but I like Andrew Ng’s advice to pick a single metric and then stick to it when evaluating models. The mean absolute error is easy to calculate and is interpretable.

Before calculating the baseline, we need to split our data into a training and a testing set:

  • Training set: the features along with the answers (targets) that we give to the model during training so it can learn a mapping between the two
  • Testing set: the features and targets that we hold out and use only to evaluate the trained model, so the score reflects performance on data the model has never seen

We will use 70% of the data for training and 30% for testing:

from sklearn.model_selection import train_test_split

# Split into 70% training and 30% testing set
# ('targets' is the score column, separated from the features beforehand in the notebook)
X, X_test, y, y_test = train_test_split(features, targets, 
                                        test_size = 0.3, 
                                        random_state = 42)

Now we can calculate the naive baseline performance:

# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))
The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164

The naive estimate is off by about 25 points on the test set. The score ranges from 1–100, so this represents an error of 25%, quite a low bar to surpass!

Conclusions

In this article we walked through the first three steps of a machine learning problem. After defining the question, we:

  1. Cleaned and formatted the raw data
  2. Performed an exploratory data analysis to learn about the dataset
  3. Developed a set of features to use in our models through feature engineering and selection

Finally, we also completed the crucial step of establishing a baseline against which we can judge our machine learning algorithms.

A Complete Machine Learning Walk-Through in Python (Part Two): Model Selection, Hyperparameter Tuning, and Evaluation

Model Evaluation and Selection

As a reminder, we are working on a supervised regression task: using New York City building energy data, we want to develop a model that can predict the Energy Star Score of a building. Our focus is on both accuracy of the predictions and interpretability of the model.

There are a ton of machine learning models to choose from and deciding where to start can be intimidating. While there are some charts that try to show you which algorithm to use, I prefer to just try out several and see which one works best! Machine learning is still a field driven primarily by empirical (experimental) rather than theoretical results, and it’s almost impossible to know ahead of time which model will do the best.

Generally, it’s a good idea to start out with simple, interpretable models such as linear regression, and if the performance is not adequate, move on to more complex, but usually more accurate methods. The following chart shows a (highly unscientific) version of the accuracy vs interpretability trade-off:

We will evaluate five different models covering the complexity spectrum:

  • Linear Regression
  • K-Nearest Neighbors Regression
  • Random Forest Regression
  • Gradient Boosted Regression
  • Support Vector Machine Regression

In this post we will focus on implementing these methods rather than the theory behind them. For anyone interested in learning the background, I highly recommend An Introduction to Statistical Learning (available free online) or Hands-On Machine Learning with Scikit-Learn and TensorFlow. Both of these textbooks do a great job of explaining the theory and showing how to effectively use the methods in R and Python respectively.

Imputing Missing Values

While we dropped the columns with more than 50% missing values when we cleaned the data, there are still quite a few missing observations. Machine learning models cannot deal with any absent values, so we have to fill them in, a process known as imputation.

First, we’ll read in all the data and remind ourselves what it looks like:

import pandas as pd
import numpy as np

# Read in data into dataframes 
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')

print('Training Feature Size: ', train_features.shape)
print('Testing Feature Size:  ', test_features.shape)
print('Training Labels Size:  ', train_labels.shape)
print('Testing Labels Size:   ', test_labels.shape)

Training Feature Size:  (6622, 64)
Testing Feature Size:   (2839, 64)
Training Labels Size:   (6622, 1)
Testing Labels Size:    (2839, 1)

Every value that is NaN represents a missing observation. While there are a number of ways to fill in missing data, we will use a relatively simple method, median imputation. This replaces all the missing values in a column with the median value of the column.

In the following code, we create a Scikit-Learn Imputer object with the strategy set to median. We then train this object on the training data (using imputer.fit) and use it to fill in the missing values in both the training and testing data (using imputer.transform). This means missing values in the test data are filled in with the corresponding median value from the training data.

(We have to do imputation this way rather than training on all the data to avoid the problem of test data leakage, where information from the testing dataset spills over into the training data.)

from sklearn.preprocessing import Imputer
# (in newer versions of Scikit-Learn, use sklearn.impute.SimpleImputer instead)

# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(train_features)

# Transform both training data and testing data
X = imputer.transform(train_features)
X_test = imputer.transform(test_features)

print('Missing values in training features: ', np.sum(np.isnan(X)))
print('Missing values in testing features:  ', np.sum(np.isnan(X_test)))

Missing values in training features:  0
Missing values in testing features:   0

All of the features now have real, finite values with no missing examples.

Feature Scaling

Scaling refers to the general process of changing the range of a feature. This is necessary because features are measured in different units, and therefore cover different ranges. Methods such as support vector machines and K-nearest neighbors that take into account distance measures between observations are significantly affected by the range of the features and scaling allows them to learn. While methods such as Linear Regression and Random Forest do not actually require feature scaling, it is still best practice to take this step when we are comparing multiple algorithms.

We will scale the features by putting each one in a range between 0 and 1. This is done by taking each value of a feature, subtracting the minimum value of the feature, and dividing by the maximum minus the minimum (the range). This specific version of scaling is often called normalization and the other main version is known as standardization.

While this process would be easy to implement by hand, we can do it using a MinMaxScaler object in Scikit-Learn. The code for this method is identical to that for imputation except with a scaler instead of imputer! Again, we make sure to train only using training data and then transform all the data.

from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)

Every feature now has a minimum value of 0 and a maximum value of 1. Missing value imputation and feature scaling are two steps required in nearly any machine learning pipeline so it’s a good idea to understand how they work!

Implementing Machine Learning Models in Scikit-Learn

After all the work we spent cleaning and formatting the data, actually creating, training, and predicting with the models is relatively simple. We will use the Scikit-Learn library in Python, which has great documentation and a consistent model building syntax. Once you know how to make one model in Scikit-Learn, you can quickly implement a diverse range of algorithms.

We can illustrate one example of model creation, training (using .fit), and testing (using .predict) with the Gradient Boosting Regressor:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model
gradient_boosted = GradientBoostingRegressor()

# Fit the model on the training data
gradient_boosted.fit(X, y)

# Make predictions on the test data
predictions = gradient_boosted.predict(X_test)

# Evaluate the model
mae = np.mean(abs(predictions - y_test))

print('Gradient Boosted Performance on the test set: MAE = %0.4f' % mae)
Gradient Boosted Performance on the test set: MAE = 10.0132

Model creation, training, and testing are each one line! To build the other models, we use the same syntax, with the only change being the name of the algorithm. The results are presented below:
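
The full comparison loop lives in the notebook; a sketch of it (the specific model settings here are assumptions) looks like this:

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# The five candidate models, mostly with default settings
models = {'Linear Regression': LinearRegression(),
          'Support Vector Machine': SVR(),
          'Random Forest': RandomForestRegressor(random_state = 60),
          'Gradient Boosted': GradientBoostingRegressor(random_state = 60),
          'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors = 10)}

# Fit each model on the training data and report its MAE on the test set
for name, candidate in models.items():
    candidate.fit(X, y)
    print('%s Performance on the test set: MAE = %0.4f' % (name, mae(y_test, candidate.predict(X_test))))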

To put these figures in perspective, the naive baseline calculated using the median value of the target was 24.5. Clearly, machine learning is applicable to our problem because of the significant improvement over the baseline!

The gradient boosted regressor (MAE = 10.013) slightly beats out the random forest (10.014 MAE). These results aren’t entirely fair because we are mostly using the default values for the hyperparameters. Especially in models such as the support vector machine, the performance is highly dependent on these settings. Nonetheless, from these results we will select the gradient boosted regressor for model optimization.

Hyperparameter Tuning for Model Optimization

In machine learning, after we have selected a model, we can optimize it for our problem by tuning the model hyperparameters.

First off, what are hyperparameters and how do they differ from parameters?

  • Model hyperparameters are settings for a machine learning algorithm that we choose before training, such as the number of trees in a random forest or the number of neighbors in K-nearest neighbors
  • Model parameters are what the algorithm learns during training, such as the weights in a linear regression

Controlling the hyperparameters affects the model performance by altering the balance between underfitting and overfitting in a model. Underfitting is when our model is not complex enough (it does not have enough degrees of freedom) to learn the mapping from features to target. An underfit model has high bias, which we can correct by making our model more complex.

Overfitting is when our model essentially memorizes the training data. An overfit model has high variance, which we can correct by limiting the complexity of the model through regularization. Both an underfit and an overfit model will not be able to generalize well to the testing data.

The problem with choosing the right hyperparameters is that the optimal set will be different for every machine learning problem! Therefore, the only way to find the best settings is to try out a number of them on each new dataset. Luckily, Scikit-Learn has a number of methods to allow us to efficiently evaluate hyperparameters. Moreover, projects such as TPOT by Epistasis Lab are trying to optimize the hyperparameter search using methods like genetic programming. In this project, we will stick to doing this with Scikit-Learn, but stay tuned for more work on the auto-ML scene!

Random Search with Cross Validation

The particular hyperparameter tuning method we will implement is called random search with cross validation:

  • Random search refers to how we choose the hyperparameters to evaluate: we define a grid of candidate values and then, rather than exhaustively trying every combination (grid search), we randomly sample combinations from the grid
  • Cross validation is how we evaluate each combination: we split the training data into K folds, train on K − 1 folds, evaluate on the held-out fold, rotate through all the folds, and average the results, so no separate validation set has to be carved out of the training data

The idea of K-Fold cross validation with K = 5 is shown below:

The entire process of performing random search with cross validation is:

  1. Set up a grid of candidate hyperparameter values
  2. Randomly sample a combination of settings from the grid
  3. Train a model with those settings
  4. Evaluate the model with K-fold cross validation
  5. Repeat for a fixed number of iterations and keep the settings that performed best

Of course, we don’t actually do this manually, but rather let Scikit-Learn’s RandomizedSearchCV handle all the work!

Slight Diversion: Gradient Boosted Methods

Since we will be using the Gradient Boosted Regression model, I should give at least a little background! This model is an ensemble method, meaning that it is built out of many weak learners, in this case individual decision trees. While a bagging algorithm such as random forest trains the weak learners in parallel and has them vote to make a prediction, a boosting method like Gradient Boosting, trains the learners in sequence, with each learner “concentrating” on the mistakes made by the previous ones.

Boosting methods have become popular in recent years and frequently win machine learning competitions. The Gradient Boosting Method is one particular implementation that uses Gradient Descent to minimize the cost function by sequentially training learners on the residuals of previous ones. The Scikit-Learn implementation of Gradient Boosting is generally regarded as less efficient than other libraries such as XGBoost, but it will work well enough for our small dataset and is quite accurate.

Back to Hyperparameter Tuning

There are many hyperparameters to tune in a Gradient Boosted Regressor and you can look at the Scikit-Learn documentation for the details. We will optimize the following hyperparameters:

  • loss: the loss function to minimize
  • n_estimators: the number of weak learners (decision trees) used in the boosting process
  • max_depth: the maximum depth of each decision tree
  • min_samples_leaf: the minimum number of examples required at a leaf node
  • min_samples_split: the minimum number of examples required to split a node
  • max_features: the maximum number of features considered when making splits

I’m not sure if there is anyone who truly understands how all of these interact, and the only way to find the best combination is to try them out!

In the following code, we build a hyperparameter grid, create a RandomizedSearchCV object, and perform hyperparameter search using 4-fold cross validation over 25 different combinations of hyperparameters:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Loss function to be optimized
loss = ['ls', 'lad', 'huber']

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

# Create the model to use for hyperparameter tuning
model = GradientBoostingRegressor(random_state = 42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25, 
                               scoring = 'neg_mean_absolute_error',
                               n_jobs = -1, verbose = 1, 
                               return_train_score = True,
                               random_state=42)

# Fit on the training data
random_cv.fit(X, y)

After performing the search, we can inspect the RandomizedSearchCV object to find the best model:

# Find the best combination of settings
random_cv.best_estimator_

GradientBoostingRegressor(loss='lad', max_depth=5,
                          max_features=None,
                          min_samples_leaf=6,
                          min_samples_split=6,
                          n_estimators=500)

We can then use these results to perform grid search by choosing parameters for our grid that are close to these optimal values. However, further tuning is unlikely to significantly improve our model. As a general rule, proper feature engineering will have a much larger impact on model performance than even the most extensive hyperparameter tuning. It’s the law of diminishing returns applied to machine learning: feature engineering gets you most of the way there, and hyperparameter tuning generally only provides a small benefit.

One experiment we can try is to change the number of estimators (decision trees) while holding the rest of the hyperparameters steady. This directly lets us observe the effect of this particular setting. See the notebook for the implementation, but here are the results:
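
The notebook has the full implementation; a condensed sketch of the experiment, using a grid search over only the number of trees while the other tuned settings stay fixed (the grid values here are illustrative), is:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Grid over the number of trees only
trees_grid = {'n_estimators': [100, 200, 400, 800, 1200]}

# Hold the other tuned settings fixed
model = GradientBoostingRegressor(loss = 'lad', max_depth = 5, max_features = None,
                                  min_samples_leaf = 6, min_samples_split = 6,
                                  random_state = 42)

# Grid search with 4-fold cross validation, scored by (negative) mean absolute error
grid_search = GridSearchCV(estimator = model, param_grid = trees_grid, cv = 4,
                           scoring = 'neg_mean_absolute_error', verbose = 1,
                           n_jobs = -1, return_train_score = True)
grid_search.fit(X, y)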

As the number of trees used by the model increases, both the training and the testing error decrease. However, the training error decreases much more rapidly than the testing error and we can see that our model is overfitting: it performs very well on the training data, but is not able to achieve that same performance on the testing set.

We always expect at least some decrease in performance on the testing set (after all, the model can see the true answers for the training set), but a significant gap indicates overfitting. We can address overfitting by getting more training data, or by decreasing the complexity of our model through the hyperparameters. In this case, we will leave the hyperparameters where they are, but I encourage anyone to try and reduce the overfitting.

For the final model, we will use 800 estimators because that resulted in the lowest error in cross validation. Now, time to test out this model!

Evaluating on the Test Set

As responsible machine learning engineers, we made sure to not let our model see the test set at any point of training. Therefore, we can use the test set performance as an indicator of how well our model would perform when deployed in the real world.

Making predictions on the test set and calculating the performance is relatively straightforward. Here, we compare the performance of the default Gradient Boosted Regressor to the tuned model:

# Make predictions on the test set using the default and the final (tuned) model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

print('Default model performance on the test set: MAE = %0.4f.' % mae(y_test, default_pred))
print('Final model performance on the test set:   MAE = %0.4f.' % mae(y_test, final_pred))

Default model performance on the test set: MAE = 10.0118.
Final model performance on the test set:   MAE = 9.0446.

Hyperparameter tuning improved the accuracy of the model by about 10%. Depending on the use case, 10% could be a massive improvement, but it came at a significant time investment!

We can also time how long it takes to train the two models using the %timeit magic command in Jupyter Notebooks. First is the default model:

%%timeit -n 1 -r 5
default_model.fit(X, y)

1.09 s ± 153 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

1 second to train seems very reasonable. The final tuned model is not so fast:

%%timeit -n 1 -r 5
final_model.fit(X, y)

12.1 s ± 1.33 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

This demonstrates a fundamental aspect of machine learning: it is always a game of trade-offs. We constantly have to balance accuracy vs interpretability, bias vs variance, accuracy vs run time, and so on. The right blend will ultimately depend on the problem. In our case, a 12 times increase in run-time is large in relative terms, but in absolute terms it’s not that significant.

Once we have the final predictions, we can investigate them to see if they exhibit any noticeable skew. On the left is a density plot of the predicted and actual values, and on the right is a histogram of the residuals:

The model predictions seem to follow the distribution of the actual values, although the peak in the density of the predictions occurs closer to the median value (66) of the training set than to the true peak in density (which is near 100). The residuals are nearly normally distributed, although we see a few large negative values where the model predictions were far below the true values.
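
The figures themselves aren’t reproduced here, but a sketch of how they can be made from final_pred and y_test is:

import matplotlib.pyplot as plt
import seaborn as sns

# Density plot of the final predictions against the test values
plt.figure(figsize = (8, 8))
sns.kdeplot(final_pred, label = 'Predictions')
sns.kdeplot(y_test, label = 'Values')
plt.xlabel('Energy Star Score'); plt.ylabel('Density')
plt.title('Test Values and Predictions')

# Histogram of the residuals (prediction minus true value)
plt.figure(figsize = (8, 8))
plt.hist(final_pred - y_test, bins = 20, edgecolor = 'black')
plt.xlabel('Error'); plt.ylabel('Count')
plt.title('Distribution of Residuals');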

Conclusions

In this article we covered several steps in the machine learning workflow:

  • Imputing missing values and scaling the features
  • Evaluating and comparing several machine learning models on a performance metric
  • Hyperparameter tuning of the best model using random search with cross validation
  • Evaluating the best model on the test set

The results of this work showed us that machine learning is applicable to the task of predicting a building’s Energy Star Score using the available data. Using a gradient boosted regressor we were able to predict the scores on the test set to within 9.1 points of the true value. Moreover, we saw that hyperparameter tuning can increase the performance of a model at a significant cost in terms of time invested. This is one of many trade-offs we have to consider when developing a machine learning solution.

A Complete Machine Learning Walk-Through in Python (Part Three): Interpreting a machine learning model and presenting results

As a reminder, we are working through a supervised regression machine learning problem. Using New York City building energy data, we have developed a model which can predict the Energy Star Score of a building. The final model we built is a Gradient Boosted Regressor which is able to predict the Energy Star Score on the test data to within 9.1 points (on a 1–100 scale).

Model Interpretation

The gradient boosted regressor sits somewhere in the middle on the scale of model interpretability: the entire model is complex, but it is made up of hundreds of decision trees, which by themselves are quite understandable. We will look at three ways to understand how our model makes predictions:

  1. Feature importances
  2. Visualizing a single decision tree
  3. LIME: Local Interpretable Model-Agnostic Explanations

The first two methods are specific to ensembles of trees, while the third — as you might have guessed from the name — can be applied to any machine learning model. LIME is a relatively new package and represents an exciting step in the ongoing effort to explain machine learning predictions.

Feature Importances

Feature importances attempt to show the relevance of each feature to the task of predicting the target. The technical details of feature importances are complex (they measure the mean decrease in impurity, or the reduction in error from including the feature), but we can use the relative values to compare which features are the most relevant. In Scikit-Learn, we can extract the feature importances from any ensemble of tree-based learners.

With model as our trained model, we can find the feature importances using model.feature_importances_. Then we can put them into a pandas DataFrame and display or plot the top ten most important:

import pandas as pd

# model is the trained model
importances = model.feature_importances_

# train_features is the dataframe of training features
feature_list = list(train_features.columns)

# Extract the feature importances into a dataframe
feature_results = pd.DataFrame({'feature': feature_list, 
                                'importance': importances})

# Show the top 10 most important
feature_results = feature_results.sort_values('importance', 
                                              ascending = False).reset_index(drop=True)

feature_results.head(10)

The Site EUI (Energy Use Intensity) and the Weather Normalized Site Electricity Intensity are by far the most important features, accounting for over 66% of the total importance. After the top two features, the importance drops off significantly, which indicates we might not need to retain all 64 features in the data to achieve high performance. (In the Jupyter notebook, I take a look at using only the top 10 features and discover that the model is not quite as accurate; a sketch of that experiment follows.)
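
A rough sketch of that reduced-feature experiment (the model settings are simplified, and X and X_test are the imputed and scaled arrays from Part Two):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Indices of the 10 most important features
most_important = list(feature_results['feature'][:10])
indices = [feature_list.index(f) for f in most_important]

# Keep only those columns of the training and testing arrays
X_reduced = X[:, indices]
X_test_reduced = X_test[:, indices]

# Train and evaluate a model on the reduced feature set
reduced_model = GradientBoostingRegressor(random_state = 42)
reduced_model.fit(X_reduced, y)
reduced_pred = reduced_model.predict(X_test_reduced)
print('Reduced-feature MAE = %0.4f' % np.mean(abs(reduced_pred - y_test)))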

Based on these results, we can finally answer one of our initial questions: the most important indicators of a building’s Energy Star Score are the Site EUI and the Weather Normalized Site Electricity Intensity. While we do want to be careful about reading too much into the feature importances, they are a useful way to start to understand how the model makes its predictions.

Visualizing a Single Decision Tree

While the entire gradient boosting regressor may be difficult to understand, any one individual decision tree is quite intuitive. We can visualize any tree in the forest using the Scikit-Learn function export_graphviz (http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html). We first extract a tree from the ensemble and then save it as a dot file:

from sklearn import tree

# Extract a single tree (number 105)
single_tree = model.estimators_[105][0]

# Save the tree to a dot file
tree.export_graphviz(single_tree, out_file = 'images/tree.dot', 
                     feature_names = feature_list)

Using the Graphviz visualization software we can convert the dot file to a png from the command line:

dot -Tpng images/tree.dot -o images/tree.png

The result is a complete decision tree:

This is a little overwhelming! Even though this tree only has a depth of 6 (the number of layers), it’s difficult to follow. We can modify the call to export_graphviz and limit our tree to a more reasonable depth of 2:
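
The modified call just passes a max_depth argument to export_graphviz (the extra styling options are optional):

# Save a small version of the tree, limited to a depth of 2
tree.export_graphviz(single_tree, out_file = 'images/tree_small.dot',
                     feature_names = feature_list, max_depth = 2,
                     filled = True, rounded = True, precision = 2)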

Each node (box) in the tree has four pieces of information:

  1. The question asked about the value of one feature of the data point: the answer determines which child node the data point moves to next
  2. mse: a measure of the error of the node
  3. samples: the number of examples (data points) in the node
  4. value: the estimate of the target for all the examples in the node

(Leaf nodes only have 2.–4. because they represent the final estimate and do not have any children).

A decision tree makes a prediction for a data point by starting at the top node, called the root, and working its way down through the tree. At each node, a yes-or-no question is asked of the data point. For example, the question for the node above is: Does the building have a Site EUI less than or equal to 68.95? If the answer is yes, the building is placed in the right child node, and if the answer is no, the building goes to the left child node.

This process is repeated at each layer of the tree until the data point is placed in a leaf node, at the bottom of the tree (the leaf nodes are cropped from the small tree image). The prediction for all the data points in a leaf node is the value. If there are multiple data points ( samples ) in a leaf node, they all get the same prediction. As the depth of the tree is increased, the error on the training set will decrease because there are more leaf nodes and the examples can be more finely divided. However, a tree that is too deep will overfit to the training data and will not be able to generalize to new testing data.

In the second article, we tuned a number of the model hyperparameters, which control aspects of each tree such as the maximum depth of the tree and the minimum number of samples required in a leaf node. These both have a significant impact on the balance of under vs over-fitting, and visualizing a single decision tree allows us to see how these settings work.

Although we cannot examine every tree in the model, looking at one lets us understand how each individual learner makes a prediction. This flowchart-based method seems much like how a human makes decisions, answering one question about a single value at a time. Decision-tree-based ensembles combine the predictions of many individual decision trees in order to create a more accurate model with less variance. Ensembles of trees tend to be very accurate, and also are intuitive to explain.

Local Interpretable Model-Agnostic Explanations (LIME)

The final tool we will explore for trying to understand how our model “thinks” is a new entry into the field of model explanations. LIME aims to explain a single prediction from any machine learning model by creating an approximation of the model locally near the data point using a simple model such as linear regression (the full details can be found in the paper).

Here we will use LIME to examine a prediction the model gets completely wrong to see what it might tell us about why the model makes mistakes.

First we need to find the observation our model gets most wrong. We do this by training and predicting with the model and extracting the example on which the model has the greatest error:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model with the best hyperparameters
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6, 
                                  n_estimators=800, random_state=42)

# Fit and test on the features
model.fit(X, y)
model_pred = model.predict(X_test)

# Find the residuals
residuals = abs(model_pred - y_test)
    
# Extract the most wrong prediction
wrong = X_test[np.argmax(residuals), :]

print('Prediction: %0.4f' % model_pred[np.argmax(residuals)])
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])
Prediction: 12.8615
Actual Value: 100.0000

Next, we create the LIME explainer object, passing it our training data, the mode (regression), the training labels, and the names of the features in our data. Finally, we ask the explainer object to explain the wrong prediction, passing it the observation and the prediction function.

import lime 

# Create a lime explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(training_data = X, 
                                                   mode = 'regression',
                                                   training_labels = y,
                                                   feature_names = feature_list)


# Explanation for wrong prediction
exp = explainer.explain_instance(data_row = wrong, 
                                 predict_fn = model.predict)

# Plot the prediction explanation
exp.as_pyplot_figure();

The plot explaining this prediction is below:

Here’s how to interpret the plot: Each entry on the y-axis indicates one value of a variable and the red and green bars show the effect this value has on the prediction. For example, the top entry says the Site EUI is greater than 95.90 which subtracts about 40 points from the prediction. The second entry says the Weather Normalized Site Electricity Intensity is less than 3.80 which adds about 10 points to the prediction. The final prediction is an intercept term plus the sum of each of these individual contributions.

We can get another look at the same information by calling the explainer’s .show_in_notebook() method:

# Show the explanation in the Jupyter Notebook
exp.show_in_notebook()

This shows the reasoning process of the model on the left by displaying the contributions of each variable to the prediction. The table on the right shows the actual values of the variables for the data point.

For this example, the model prediction was about 12 and the actual value was 100! While initially this prediction may be puzzling, looking at the explanation, we can see this was not an extreme guess, but a reasonable estimate given the values for the data point. The Site EUI was relatively high and we would expect the Energy Star Score to be low (because EUI is strongly negatively correlated with the score), a conclusion shared by our model. In this case, the logic was faulty because the building had a perfect score of 100.

It can be frustrating when a model is wrong, but explanations such as these help us to understand why the model is incorrect. Moreover, based on the explanation, we might want to investigate why the building has a perfect score despite such a high Site EUI. Perhaps we can learn something new about the problem that would have escaped us without investigating the model. Tools such as this are not perfect, but they go a long way towards helping us understand the model which in turn can allow us to make better decisions.

Documenting Work and Reporting Results

An often overlooked part of any technical project is documentation and reporting. We can do the best analysis in the world, but if we do not clearly communicate the results, then they will not have any impact!

When we document a data science project, we take all the versions of the data and code and package them so our project can be reproduced or built on by other data scientists. It’s important to remember that code is read more often than it is written, and we want to make sure our work is understandable both for others and for ourselves if we come back to it a few months later. This means putting helpful comments in the code and explaining your reasoning. I find Jupyter Notebooks to be a great tool for documentation because they allow for explanations and code one after the other.

Jupyter Notebooks can also be a good platform for communicating findings to others. Using notebook extensions, we can hide the code from our final report, because although it’s hard to believe, not everyone wants to see a bunch of Python code in a document!

Personally, I struggle with succinctly summarizing my work because I like to go through all the details. However, it’s important to understand your audience when you are presenting and tailor the message accordingly. With that in mind, here is my 30-second takeaway from the project:

  1. Using New York City energy data, it is possible to build a model that can predict the Energy Star Score of a building to within about 9 points of the true value.
  2. The Site EUI and the Weather Normalized Site Electricity Intensity are the most important factors for predicting the Energy Star Score.

Originally, I was given this project as a job-screening “assignment” by a start-up. For the final report, they wanted to see both my work and my conclusions, so I developed a Jupyter Notebook to turn in. However, instead of converting directly to PDF in Jupyter, I converted it to a LaTeX .tex file that I then edited in TeXstudio before rendering to a PDF for the final version. The default PDF output from Jupyter has a decent appearance, but it can be significantly improved with a few minutes of editing. Moreover, LaTeX is a powerful document preparation system and it’s good to know the basics.

At the end of the day, our work is only as valuable as the decisions it enables, and being able to present results is a crucial skill. Furthermore, by properly documenting work, we allow others to reproduce our results, give us feedback so we can become better data scientists, and build on our work for the future.

Conclusions

Throughout this series of posts, we’ve walked through a complete end-to-end machine learning project. We started by cleaning the data, moved into model building, and finally looked at how to interpret a machine learning model. As a reminder, the general structure of a machine learning project is below:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

While the exact steps vary by project, and machine learning is often an iterative rather than linear process, this guide should serve you well as you tackle future machine learning projects. I hope this series has given you confidence to be able to implement your own machine learning solutions, but remember, none of us do this by ourselves! If you want any help, there are many incredibly supportive communities where you can look for advice.


*Originally published at* https://towardsdatascience.com

