An intuitive guide to differencing time series in Python

While working with time series, sooner or later you will encounter the term differencing. In this article, I will do my best to provide a simple introduction to the theory that goes easy on the maths. Then, I will show two different approaches you can follow in Python. Let’s start.

Theory

Before I explain what differencing is, I need to quickly introduce another concept that is crucial when working with time series data: stationarity. There are quite a few great articles out there that go deep into what stationarity is, including the distinction between the weak and strong variants. However, for the sake of this article, we will focus on a very basic definition.

It all comes down to the fact that time series data is different from the other kinds of data you can encounter while working on regression problems, for example, predicting the price of houses in the Boston area. That is because time series are characterized by temporal structure, which in practice means that the order of the data points matters.

To give some examples, time series data can exhibit a trend (an increasing and/or decreasing pattern, for example, in the production of some goods or in sales) or seasonality (when some time periods exhibit different patterns, for example, increased tourism-related income during the summer months). From the statistical side, a trend means that the mean changes over time, while seasonality means that the mean (and sometimes the variance) changes in a repeating, periodic pattern. In both cases, we are dealing with a non-stationary series.

So a stationary series is basically a time series that has stable/constant statistical properties (mean, variance, etc.) over time. In other words, the statistical properties of such a series do not depend on the time at which it is observed. And why do we care about that? Simply put, such series are much easier to work with and to predict accurately. Many approaches to time series modeling either assume or require the underlying time series to be stationary.
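To make the idea of "stable statistical properties" a bit more concrete, here is a minimal sketch that compares the mean of the first and second half of two synthetic series. The data, the seed, the slope of the trend and the helper name `half_means` are all assumptions made up for this illustration, and comparing halves is only a rough visual-style check, not a formal test for stationarity.

```python
import numpy as np

# Synthetic data, made up for illustration only
rng = np.random.default_rng(42)
n = 200
t = np.arange(n)

noise = rng.normal(loc=0.0, scale=1.0, size=n)  # constant mean and variance
trending = 0.1 * t + noise                      # the mean grows with time

def half_means(series):
    """Return the mean of the first and of the second half of a series."""
    half = len(series) // 2
    return series[:half].mean(), series[half:].mean()

noise_first, noise_second = half_means(noise)
trend_first, trend_second = half_means(trending)

print(f"noise:    {noise_first:.2f} vs {noise_second:.2f}")
print(f"trending: {trend_first:.2f} vs {trend_second:.2f}")
```

For the pure noise series the two means should be close to each other, while for the trending series the second half has a clearly higher mean, which is exactly the kind of time-dependent behavior that makes a series non-stationary.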

I will leave the details of testing for stationarity (for example, with the Augmented Dickey-Fuller test) for another article and come right back to the main topic: differencing. Differencing is one of the possible methods of dealing with non-stationary data, and it is used to try to make such a series stationary. In practice, it means subtracting the previous observation from the current one, following the formula:

diff(t) = x(t) - x(t - 1)

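where diff(t) is the value of the differenced series at time t and x(t) is the observation of the original series at time t. Because the first observation has no predecessor, the differenced series is one observation shorter than the original. The transformation is simple enough, but I will illustrate some small nuances in the practical example below.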
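As a quick sanity check of the formula, the sketch below applies it to a tiny series with a perfectly linear trend. The numbers are made up for illustration. It computes the difference in two ways: by hand with `shift`, and with the built-in `diff` method of a pandas `Series`.

```python
import pandas as pd

# Toy series with a linear trend (values made up for illustration)
x = pd.Series([10, 13, 16, 19, 22], name="x")

# Manual differencing: x(t) - x(t - 1)
manual = x - x.shift(1)

# The same transformation using the built-in pandas method
builtin = x.diff()

print(builtin.tolist())
```

Both approaches give the same result. The first value is NaN, because there is no earlier observation to subtract, and the remaining values are all equal to 3, so the linear trend is gone after differencing.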


towardsdatascience.com
