Time Series Prediction with TensorFlow

In this article, we focus on ‘Time Series Data’, which is a part of sequence models. In essence, this is data that changes over time, such as the weather of a particular place, the trend in behaviour of a group of people, the rate of change of some quantity, the movement of a body in 2D or 3D space, or the closing price of a particular stock in the markets.

Analysis of time series data can be done for anything that has a ‘time’ factor involved in it.

So what can machine learning help us achieve over time series data?

1) Forecasting — The most conspicuous application of all is forecasting the future based on historical data collected.

2) Imputation of data — Projecting back into the past to estimate data we never collected. It can also be used to fill in missing values in the data.

3) Detecting anomalies — For example, time series analysis can be used to detect potential denial-of-service attacks in traffic data.

4) Detecting patterns — For example, recognizing words in a series of sound waves, as in speech recognition.

There are certain keywords that always come up when dealing with time series data.

Trend

A trend means the time series has a specific direction in which the data is moving. The above picture is of a series with an upward trend.

Seasonality

Patterns repeating at predictable intervals. The above is the data for a website for software developers. The data moves upward for 5 units and down for 2, so for each individual hump, the dips at its beginning and end correspond to the weekends, and the data in between corresponds to the working days.

Combination

The above picture combines the two types: data with both an upward trend and seasonality.

White Noise

This is just a set of random values that can’t be used for prediction; such a series is referred to as white noise.

Autocorrelation

There is no seasonality or trend, and the spikes appear to occur at random intervals, but the spikes aren’t random: between them there is a very deterministic type of decay.

The next time step is 99% the value of the previous time step plus an occasional spike. The above is an autocorrelated series, namely one that correlates with a delayed copy of itself, also called a lag.

The values in the highlighted box seem to have a strong autocorrelation. A time series like this is described as having memory, where steps are dependent on previous ones. The spikes, which cannot be predicted from past values, are often called innovations.

Therefore we now know that a time series is typically a combination of trend, seasonality, autocorrelation and noise.
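To make these components concrete, here is a minimal numpy sketch that simulates a series combining them (all values are illustrative):

# Simulate four "years" of daily data combining the components above
import numpy as np

time = np.arange(4 * 365)                        # daily time steps
trend = 0.05 * time                              # steady upward direction
season = 10 * np.sin(2 * np.pi * time / 365)     # pattern repeating every 365 steps
noise = np.random.randn(len(time)) * 2           # unpredictable white noise
series = 50 + trend + season + noise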

Machine learning models are trained to spot patterns, and based on these patterns they predict the future. For the most part this works for time series too, except for the noise, which is unpredictable. Still, it gives the model the useful intuition that patterns that existed in the past may recur in the future.

A stationary (time) series is one whose statistical properties such as the mean, variance and autocorrelation are all constant over time. Hence, a non-stationary series is one whose statistical properties change over time.

Of course, in real-life time series there can also be unforeseen events that influence the data. Take the above graph, for example: if it were for a stock, the “Big Event” that caused such a change might have been a financial crisis or a scandal.

One way to predict the data is using “Naive Forecasting”, i.e. taking the last value and assuming that the next value will be the same one.

Dividing the data into three parts for training, validation and testing is called “Fixed Partitioning”. While doing this, one has to ensure that each period contains a whole number of seasons.

E.g., if the time series has a yearly seasonality, then each period must contain a span of 1, 2 or 3 whole years. If we take a period that spans 1 1/2 years, some months will be represented more than others.

This is a bit different from a non-time-series dataset, where we picked random values for each split and it didn’t affect the end result. Here, however, the choice of span and period affects the final result, as the data is time sensitive.

Another thing to consider: while training, we can use the strategy portrayed above to split the data into three parts. However, once training is complete and the model is ready, we should retrain it on the entire dataset, because the portion used for testing is the most recent data relative to the current point in time and is the strongest signal for determining future values.
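A minimal sketch of fixed partitioning and of naive forecasting, reusing the time and series arrays from the earlier sketch (the split index is illustrative):

# Fixed partitioning: slice sequentially, never randomly
split_time = 3 * 365                      # keep whole seasons in each period
time_train, x_train = time[:split_time], series[:split_time]
time_valid, x_valid = time[split_time:], series[split_time:]

# Naive forecasting: predict each value as the previous one
naive_forecast = series[split_time - 1:-1]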

Metrics for evaluating performance

The most basic method for forecasting is the ‘Moving Average’. The idea is that the yellow line in the above graph is the average of the blue values taken over a fixed time frame called an ‘Averaging Window’.

This gives a nice smooth curve for the time series, but it does not anticipate trend or seasonality. It can even perform worse than a naive forecast, giving a large MSE.
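A sketch of such a moving-average forecast (an illustrative helper, reused in later sketches):

def moving_average_forecast(series, window_size):
    # Forecast the next value as the mean of the previous window_size values
    forecasts = []
    for t in range(len(series) - window_size):
        forecasts.append(series[t:t + window_size].mean())
    return np.array(forecasts)

# Forecast the validation period with a 30-step window
moving_avg = moving_average_forecast(series, 30)[split_time - 30:]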

One way to fix this is to remove the trend and seasonality with a method called ‘Differencing’. In this method, instead of studying the series itself, we study the difference between the value at time ‘t’ and the value at time ‘t-365’. Depending on the period of the time series, the interval over which differencing is done can change.

Doing this we get the above result. The above values have no trend and no seasonality.

We can then use a moving average to forecast the differenced time series. The yellow line therefore gives us the moving average of this data, but it is just a forecast of the differences. To get back the real values, we add the value at ‘t-365’ to the forecast of the differences.

In the picture below, since the seasonality period is 365 days, we subtract the value at ‘t-365’ from the value at time ‘t’.

The result of the above operation is shown in the above graph. Compared to the previous moving average forecast, we get a lower ‘MSE’. This is a slightly better version than the naive method.

We would notice that the forecast still has a lot of noise. This is because, while adding back the past values, the noise from the ‘t-365’ data got added into our prediction. To avoid that, we can take a moving average of the ‘t-365’ values and then add that to our differenced forecast.

If we do that, we end up with the above forecast depicted by the yellow line. The above curve is smooth and has a better ‘MSE’.
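A sketch of the full differencing pipeline described above, assuming 365-step seasonality and the moving_average_forecast helper from earlier (window lengths are illustrative):

# Remove trend and seasonality by differencing against t - 365
diff_series = series[365:] - series[:-365]
diff_forecast = moving_average_forecast(diff_series, 50)

# diff_forecast[j] predicts diff_series[j + 50] = series[j + 415] - series[j + 50],
# so add back the value at t - 365 to recover the series itself
forecast = diff_forecast + series[50:-365]

# Smoothing the added-back past values as well removes their noise
smooth_past = moving_average_forecast(series, 10)   # roughly centred on series[j + 10]
forecast_smooth = diff_forecast + smooth_past[40:-365]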

Let’s get our hands dirty with the actual coding part. We’ll follow a few steps mentioned below to simulate a time series dataset and attempt to build a model on it.

We create a dataset with values ranging from 0 to 9 and explicitly convert it to a numpy array.

The dataset is broken into windows of size 5, i.e. each window contains 5 instances of data. A shift of 1 causes each successive window to start 1 step later than the previous one.

The drop_remainder flag causes the residual windows that do not contain a full set of 5 data points to be dropped.

Each window of data is flat-mapped into batches of 5.

Mapping is then done so that the first 4 instances of each window become the features and the last instance becomes the label. The dataset is then shuffled with a buffer size equal to the range of the data.

Data is shuffled in order to avoid Sequence Bias.

Sequence bias is when the order of things can impact the selection of things. For example, if I were to ask you your favorite TV show, and listed “Game of Thrones”, “Killing Eve”, “Travelers” and “Doctor Who” in that order, you’re probably more likely to select ‘Game of Thrones’ as you are familiar with it, and it’s the first thing you see. Even if it is equal to the other TV shows. So, when training data in a dataset, we don’t want the sequence to impact the training in a similar way, so it’s good to shuffle them up.

The dataset is broken into batches of 2, with each x a list of training instances and each y the corresponding label.
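Putting those steps together, a sketch of the whole pipeline in tf.data (mirroring the steps described above on the 0–9 toy range):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)                        # values 0..9
dataset = dataset.window(5, shift=1, drop_remainder=True)  # windows of 5, shifted by 1
dataset = dataset.flat_map(lambda window: window.batch(5)) # flatten each window to a batch of 5
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))  # first 4 -> x, last -> y
dataset = dataset.shuffle(buffer_size=10)                  # shuffle to avoid sequence bias
dataset = dataset.batch(2).prefetch(1)                     # batches of 2
for x, y in dataset:
    print(x.numpy(), y.numpy())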

The data is split so that the training portion contains the seasonality and trend the model needs to learn.

The data is broken down into different windows using the function below.
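A sketch of such a function, consistent with the tf.data pipeline shown earlier (the signature and parameter names are assumptions):

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    # Wrap the raw series and cut it into windows of window_size + 1
    dataset = tf.data.Dataset.from_tensor_slices(series)
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    # Shuffle, then split each window into features (all but last) and label (last)
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.map(lambda window: (window[:-1], window[-1]))
    return dataset.batch(batch_size).prefetch(1)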

A windowed instance of the dataset, named dataset, is created and the model is built.
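A sketch of that setup, using a single Dense layer (i.e. linear regression over each window; all hyperparameters are assumptions):

window_size = 20
dataset = windowed_dataset(x_train, window_size, batch_size=32, shuffle_buffer=1000)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, input_shape=[window_size])
])
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6, momentum=0.9))
model.fit(dataset, epochs=100, verbose=0)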

Iterating over each window, we try to predict the validation data.
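A sketch of that loop, assuming the series, split_time and model objects from the earlier sketches:

forecast = []
for t in range(len(series) - window_size):
    # Each prediction uses the window_size values ending at step t + window_size
    forecast.append(model.predict(series[t:t + window_size][np.newaxis]))

forecast = forecast[split_time - window_size:]   # keep only the validation period
results = np.array(forecast)[:, 0, 0]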

We incorporate a learning rate scheduler in an attempt to gauge the optimum learning rate for the model.

We plot the series of different learning rates that have been implemented and identify the best suited learning rate for our model.
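A sketch of the scheduler and the diagnostic plot (sweep range and epoch count are assumptions):

import matplotlib.pyplot as plt

# Grow the learning rate exponentially each epoch, then plot loss vs. rate
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10 ** (epoch / 20))
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-8, momentum=0.9))
history = model.fit(dataset, epochs=100, callbacks=[lr_schedule], verbose=0)

plt.semilogx(history.history["lr"], history.history["loss"])
plt.show()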

From the graph above we observe that the lowest cost is associated with a learning rate somewhere around 1e-5. However, an interesting thing to notice is that the graph is quite shaky around this point and a few subsequent ones. Using this learning rate might adversely affect our prediction model. Therefore the best-suited learning rate seems to be 1e-6, where the curve is both smooth and low.

We now plot the loss function with respect to epochs.

It may seem that the model has stopped learning, since the loss curve appears to saturate after the first few epochs, but this is not the case. When we plot the loss function excluding the first 10 epochs, we notice that the loss keeps decreasing. This means that even though the learning rate is quite small, the model continues to learn.

Finally, we plot the graph for the data, blue being the original and yellow the predicted.

Our mean squared error is given below.
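As a sketch, the metrics can be computed on the validation slice with tf.keras:

mse = tf.keras.metrics.mean_squared_error(x_valid, results).numpy()
mae = tf.keras.metrics.mean_absolute_error(x_valid, results).numpy()
print(mse, mae)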

Recurrent Layer

We begin to use machine learning techniques to predict future values of the data. In the above picture, we have an input batch of 4 windows, so at each time step the memory cell receives a vector of shape (4, 1). If the memory cell has 3 neurons, its output at each time step has shape (4, 3). Therefore, if each window spans 30 time steps, the overall output of the layer has shape (4, 30, 3).

The previous diagram was an example of a sequence-to-sequence network, where a sequence is given as input and a sequence is received as output, i.e. the output has the shape (4, 30, 3). This type of architecture is useful when we need to stack one RNN layer above the other.

The diagram above is of a sequence-to-vector network, where a sequence is given as input but we receive a vector as output: only the output of the final time step is kept, so the output has the shape (4, 3).

To demonstrate the working of the sequence-to-vector model, we use the above diagram. For the first layer, “return_sequences = True” causes the layer to emit an output at every time step, producing a sequence that is fed into the next layer. The next layer has, by default, “return_sequences = False”, which causes only a vector to be given as output.

Notice the input_shape = [None, 1] parameter. TensorFlow assumes the first dimension is the batch size, which is never specified here, so any batch size will do. The first listed dimension is the number of time steps; setting it to None means the RNN can handle sequences of any length. The final value is 1 because the data is univariate.

If we add “return_sequences = True” to the second layer as well, that layer now gives a sequence as output instead of a vector. The Dense layer then receives a sequence as input; TensorFlow handles this by applying the same Dense layer independently to the output of each time step.
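A sketch of the stacked RNN just described (SimpleRNN and the layer sizes are assumptions):

model = tf.keras.models.Sequential([
    # Returns a full sequence so the next RNN layer receives one
    tf.keras.layers.SimpleRNN(40, return_sequences=True, input_shape=[None, 1]),
    # return_sequences defaults to False: outputs a single vector
    tf.keras.layers.SimpleRNN(40),
    tf.keras.layers.Dense(1),
])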

LSTMs

When we looked at an RNN, it appeared to be something like the above diagram. Here the individual nodes are fed batches of input data X[i] and give an output y[i], while passing the state from one node to the next, the state being an important factor in calculating subsequent outputs.

The problem with this approach is that, as the state passes from one node to another, its value diminishes and its effect gradually fades.

To cope with this, a new approach called LSTMs (Long Short-Term Memory networks) is applied.

LSTMs mitigate this problem by introducing a cell state that is carried from cell to cell and from time step to time step, so the state can be better maintained across the sequence.

The cell state can be both uni-directional as shown above or bi-directional as shown below.

This means that the data in the earlier window can have a greater impact on the overall projection than in the case of RNNs.

Implementing the LSTMs in our code.
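A sketch of a first LSTM version, using a single bidirectional layer (layer size and optimizer settings are assumptions):

model = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32), input_shape=[None, 1]),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6, momentum=0.9))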

Upon adding another layer of bi-directional LSTM to our model.
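A sketch of that stacked variant; the first layer must now return sequences so the second one receives a sequence:

model = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32, return_sequences=True), input_shape=[None, 1]),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
])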

The above code gives us the following output:

Adding yet another layer of bidirectional LSTMs gives the following results.

We don’t observe much difference; in fact, the MAE has worsened, i.e. our model’s performance has dropped.

Therefore, one must experiment with different types of layers in the model and select the optimal one.

Implementing Convolutions to the model

Why do we use convolutions in a time series analysis problem?

One reason is that they are cheaper to train.

Second, they are well suited to many practical problems. For example, in certain instances where we apply RNNs, we assume that every data point that occurred previously is required to guess the next one. RNNs work on this concept, taking into consideration all the historical data to predict the next item. In fact, we often don’t need the entire history.

To understand this, consider the example of a bullet projectile. To predict the next value of the projectile, we don’t need all the historical projectile data. If the projectile data is a set of ‘n’ instances, we require only the latest ‘k’ instances to predict the next value, where k < n.

Thus one can limit the time dependence and instead use a CNN architecture to model a time series. Each prediction becomes a time instance dependent on the previous ‘k’ instances, which is the receptive field of the convolutions. Each CNN filter learns a set of rules and applies those rules to the portions of the data that best fit them. This therefore becomes a non-linear system.
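A sketch of such a convolutional model (filter count and kernel size are assumptions; causal padding keeps each output dependent only on past values):

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=32, kernel_size=5,
                           strides=1, padding="causal",
                           activation="relu",
                           input_shape=[None, 1]),   # 1-D input, unlike 2-D images
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
])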

Notice the ‘input shape’, which treats the data as 1-dimensional, in contrast to the 2-dimensional input that convolutions normally process for images.

The above plot is obtained after we apply the conv layers. It’s a huge improvement over the earlier plots. The plot, however, can still be improved by:

1. Training it for a longer period of time.

One can see from the plot that the model is still learning, albeit at a slow pace, and therefore if we continue to train it for longer, the overall performance of the model can improve.

2. Another method is to make the model bi-directional.

However, we notice that the model is now beginning to overfit the data and the ‘MAE’ has in fact increased.

When we plot the data, we can clearly see that there is a lot of noise, and since we are feeding the model data in mini-batches, we end up with this result. This suggests that the above step is a step in the right direction, and with a few tweaks to hyperparameters like the batch size, we can improve our results. We can experiment with the batch size to get to a proper result.

Process of tuning our model


We take the example of the ‘Sunspot activity’ dataset. We train our model using a basic ‘Dense’ layer network with two layers having 10 neurons each. We therefore get the above result.
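A sketch of that baseline network (window size and learning rate are assumptions):

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=[window_size]),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-7, momentum=0.9))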

Indeed when we zoom in we can see how our forecast behaves when compared to the original data.

We can see that the train window size is 20 which basically means we have 20 time slices worth of data. Each time slice corresponds to a month.

Upon viewing the data we realize that we have a seasonality of 11 to 22 years.

So we now use a window size of 132, which is equal to 11 years (132 months = 11 years).

But we notice that the MAE actually increased. Therefore increasing the window size didn’t work.

When we analyze our data, we find that even though it has an 11-year seasonality, the noise makes it behave like typical noisy time series data.

Therefore, we change the ‘window size’ to 30 and focus more on the amount of data being given for training and testing. Giving more data to training improves the efficiency of the model.

Now the ‘MAE’ has reduced to 15.14. We can make further changes by changing the number of input nodes. Since our window size has increased to 30, we can set the number of input neurons to 30. However, this caused our network to perform worse than before; on the other hand, tweaking the ‘learning rate’ improved the model.

Let’s appreciate the fact that changing the:

1. Batch Size

2. No. of neurons

3. Window size

4. Training data size

5. Learning rate

resulted in many changes in our model, some improving prediction performance and some degrading it. Each is equally important and needs to be experimented with to achieve better results.

This brings us to the end of the documented course. One should appreciate the multitude of ways in which machine learning can be used to predict on sequential and time series data. However, what we have witnessed here are just the superficial aspects and there is a lot more to delve deeper into. With what we have learnt over the course of this article, we are ready to take up new and complex challenges. To mention one… ‘Multivariate time series data’.

I hope this motivates you to actually take up the course and try to implement it yourself.

Predicting the Stock price Using TensorFlow

A simple deep learning model for stock price prediction using TensorFlow

For a recent hackathon that we did at STATWORX, some of our team members scraped minute-by-minute S&P 500 data from the Google Finance API. The data consisted of the index as well as the stock prices of the S&P 500’s constituents. Having this data at hand, the idea of developing a deep learning model for predicting the S&P 500 index based on the 500 constituents’ prices one minute earlier immediately came to mind.

Playing around with the data and building the deep learning model with TensorFlow was fun, so I decided to write my first Medium.com story: a little TensorFlow tutorial on predicting S&P 500 stock prices. What you will read is not an in-depth tutorial but a high-level introduction to the important building blocks and concepts of TensorFlow models. The Python code I’ve created is not optimized for efficiency but for understandability. The dataset I’ve used can be downloaded from here (40MB).

Note that this story is a hands-on tutorial on TensorFlow. Actual prediction of stock prices is a really challenging and complex task that requires tremendous effort, especially at higher frequencies, such as the minute frequency used here.

Importing and preparing the data

Our team exported the scraped stock data from our scraping server as a csv file. The dataset contains n = 41266 minutes of data ranging from April to August 2017 on 500 stocks as well as the total S&P 500 index price. Index and stocks are arranged in wide format.

# Import libraries and data
import numpy as np
import pandas as pd

data = pd.read_csv('data_stocks.csv')
# Drop date variable
data = data.drop(['DATE'], axis=1)
# Dimensions of dataset
n = data.shape[0]
p = data.shape[1]
# Make data a numpy array
data = data.values

The data was already cleaned and prepared, meaning missing stock and index prices were LOCF’ed (last observation carried forward), so that the file did not contain any missing values.

A quick look at the S&P time series, plotted (before the conversion to a numpy array) using pyplot.plot(data['SP500']):

Time series plot of the S&P 500 index.

Note: This is actually the lead of the S&P 500 index, meaning, its value is shifted 1 minute into the future (this has already been done in the dataset). This operation is necessary since we want to predict the next minute of the index and not the current minute. Technically speaking, each row in the dataset contains the price of the S&P500 at t+1 and the constituent’s prices at T=t.
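For illustration, creating such a lead yourself would amount to a one-line pandas shift on the raw frame (a hypothetical df, shown only as a sketch; the provided dataset already contains the shift, so it must not be applied again):

# Shift the target one minute into the future relative to the features
df['SP500'] = df['SP500'].shift(-1)
# The last row now has no future value and would be dropped
df = df.dropna()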

Preparing training and test data

The dataset was split into training and test data. The training data contained 80% of the total dataset. The data was not shuffled but sequentially sliced. The training data ranges from April to approximately the end of July 2017; the test data ends at the end of August 2017.

# Training and test data
train_start = 0
train_end = int(np.floor(0.8*n))
test_start = train_end
test_end = n
data_train = data[np.arange(train_start, train_end), :]
data_test = data[np.arange(test_start, test_end), :]

There are a lot of different approaches to time series cross validation, such as rolling forecasts with and without refitting or more elaborate concepts such as time series bootstrap resampling. The latter involves repeated samples from the remainder of the seasonal decomposition of the time series in order to simulate samples that follow the same seasonal pattern as the original time series but are not exact copies of its values.

Data scaling

Most neural network architectures benefit from scaling the inputs (sometimes also the output). Why? Because most common activation functions of the network’s neurons such as tanh or sigmoid are defined on the [-1, 1] or [0, 1] interval respectively. Nowadays, rectified linear unit (ReLU) activations are commonly used activations which are unbounded on the axis of possible activation values. However, we will scale both the inputs and targets anyway. Scaling can be easily accomplished in Python using sklearn’s MinMaxScaler.

# Scale data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
# Build X and y
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]

Remark: Caution must be exercised regarding what part of the data is scaled and when. A common mistake is to scale the whole dataset before the training and test split is applied. Why is this a mistake? Because scaling invokes the calculation of statistics, e.g. the min/max of a variable. When performing time series forecasting in real life, you do not have information from future observations at the time of forecasting. Therefore, the scaling statistics have to be computed on the training data and then applied to the test data. Otherwise, you use future information at the time of forecasting, which commonly biases forecasting metrics in a positive direction.

Introduction to TensorFlow

TensorFlow is a great piece of software and currently the leading deep learning and neural network computation framework. It is based on a C++ low level backend but is usually controlled via Python (there is also a neat TensorFlow library for R, maintained by RStudio). TensorFlow operates on a graph representation of the underlying computational task. This approach allows the user to specify mathematical operations as elements in a graph of data, variables and operators. Since neural networks are actually graphs of data and mathematical operations, TensorFlow is just perfect for neural networks and deep learning. Check out this simple example (stolen from our deep learning introduction from our blog):


A very simple graph that adds two numbers together.

In the figure above, two numbers are supposed to be added. Those numbers are stored in two variables, a and b. The two values are flowing through the graph and arrive at the square node, where they are being added. The result of the addition is stored into another variable, c. Actually, a, b and c can be considered as placeholders. Any numbers that are fed into a and b get added and are stored into c. This is exactly how TensorFlow works. The user defines an abstract representation of the model (neural network) through placeholders and variables. Afterwards, the placeholders get "filled" with real data and the actual computations take place. The following code implements the toy example from above in TensorFlow:

# Import TensorFlow
import tensorflow as tf

# Define a and b as placeholders
a = tf.placeholder(dtype=tf.int8)
b = tf.placeholder(dtype=tf.int8)

# Define the addition
c = tf.add(a, b)

# Initialize the graph
graph = tf.Session()

# Run the graph
graph.run(c, feed_dict={a: 5, b: 4})

After having imported the TensorFlow library, two placeholders are defined using tf.placeholder(). They correspond to the two blue circles on the left of the image above. Afterwards, the mathematical addition is defined via tf.add(). The result of the computation is c = 9. With placeholders set up, the graph can be executed with any integer value for a and b. Of course, the former problem is just a toy example. The required graphs and computations in a neural network are much more complex.

Placeholders

As mentioned before, it all starts with placeholders. We need two placeholders in order to fit our model: X contains the network's inputs (the stock prices of all S&P 500 constituents at time T = t) and Y the network's outputs (the index value of the S&P 500 at time T = t + 1).

The shapes of the placeholders correspond to [None, n_stocks] and [None], meaning that the inputs are a 2-dimensional matrix and the outputs a 1-dimensional vector. It is crucial to understand which input and output dimensions the neural net needs in order to design it properly.

# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_stocks])
Y = tf.placeholder(dtype=tf.float32, shape=[None])

The None argument indicates that at this point we do not yet know the number of observations that will flow through the neural net graph in each batch, so we keep it flexible. We will later define the variable batch_size that controls the number of observations per training batch.

Variables

Besides placeholders, variables are another cornerstone of the TensorFlow universe. While placeholders are used to store input and target data in the graph, variables are used as flexible containers within the graph that are allowed to change during graph execution. Weights and biases are represented as variables in order to adapt during training. Variables need to be initialized prior to model training. We will get into that a little later in more detail.

The model consists of four hidden layers. The first layer contains 1024 neurons, slightly more than double the size of the inputs. Subsequent hidden layers are always half the size of the previous layer, which means 512, 256 and finally 128 neurons. A reduction of the number of neurons for each subsequent layer compresses the information the network identifies in the previous layers. Of course, other network architectures and neuron configurations are possible but are out of scope for this introduction level article.

# Model architecture parameters
n_stocks = 500
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1
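# Note: weight_initializer and bias_initializer are defined in the
# Initializers section below; they must be created before this block runs.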
# Layer 1: Variables for hidden weights and biases
W_hidden_1 = tf.Variable(weight_initializer([n_stocks, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
# Layer 2: Variables for hidden weights and biases
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
# Layer 3: Variables for hidden weights and biases
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
# Layer 4: Variables for hidden weights and biases
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))

# Output layer: Variables for output weights and biases
W_out = tf.Variable(weight_initializer([n_neurons_4, n_target]))
bias_out = tf.Variable(bias_initializer([n_target]))

It is important to understand the required variable dimensions between input, hidden and output layers. As a rule of thumb in multilayer perceptrons (MLPs, the type of networks used here), the second dimension of the previous layer is the first dimension in the current layer for weight matrices. This might sound complicated but is essentially just each layer passing its output as input to the next layer. The biases dimension equals the second dimension of the current layer’s weight matrix, which corresponds to the number of neurons in this layer.

Designing the network architecture

After the definition of the required weight and bias variables, the network topology, i.e. the architecture of the network, needs to be specified. Hereby, placeholders (data) and variables (weights and biases) need to be combined into a system of sequential matrix multiplications.

Furthermore, the hidden layers of the network are transformed by activation functions. Activation functions are important elements of the network architecture since they introduce non-linearity to the system. There are dozens of possible activation functions out there, one of the most common is the rectified linear unit (ReLU) which will also be used in this model.

# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3), bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4), bias_hidden_4))

# Output layer (must be transposed)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))

The image below illustrates the network architecture. The model consists of three major building blocks. The input layer, the hidden layers and the output layer. This architecture is called a feedforward network. Feedforward indicates that the batch of data solely flows from left to right. Other network architectures, such as recurrent neural networks, also allow data flowing “backwards” in the network.

Cool technical illustration of our feedforward network architecture.

Cost function

The cost function of the network is used to generate a measure of deviation between the network’s predictions and the actual observed training targets. For regression problems, the mean squared error (MSE) function is commonly used. MSE computes the average squared deviation between predictions and targets. Basically, any differentiable function can be implemented in order to compute a deviation measure between predictions and targets.

# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))

However, the MSE exhibits certain properties that are advantageous for the general optimization problem to be solved.

Optimizer

The optimizer takes care of the necessary computations that are used to adapt the network’s weight and bias variables during training. Those computations invoke the calculation of so-called gradients, which indicate the direction in which the weights and biases have to be changed during training in order to minimize the network’s cost function. The development of stable and speedy optimizers is a major field in neural network and deep learning research.

# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)

Here the Adam Optimizer is used, which is one of the current default optimizers in deep learning development. Adam stands for “Adaptive Moment Estimation” and can be considered as a combination between two other popular optimizers AdaGrad and RMSProp.

Initializers

Initializers are used to initialize the network’s variables before training. Since neural networks are trained using numerical optimization techniques, the starting point of the optimization problem is one of the key factors to find good solutions to the underlying problem. There are different initializers available in TensorFlow, each with different initialization approaches. Here, I use the tf.variance_scaling_initializer(), which is one of the default initialization strategies.

# Initializers
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg", distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()

Note that with TensorFlow it is possible to define multiple initialization functions for different variables within the graph. However, in most cases, a unified initialization is sufficient.

Fitting the neural network

After having defined the placeholders, variables, initializers, cost functions and optimizers of the network, the model needs to be trained. Usually, this is done by minibatch training. During minibatch training random data samples of n = batch_size are drawn from the training data and fed into the network. The training dataset gets divided into n / batch_size batches that are sequentially fed into the network. At this point the placeholders X and Y come into play. They store the input and target data and present them to the network as inputs and targets.

A sampled data batch of X flows through the network until it reaches the output layer. There, TensorFlow compares the model’s predictions against the actual observed targets Y in the current batch. Afterwards, TensorFlow conducts an optimization step and updates the network’s parameters, corresponding to the selected learning scheme. After having updated the weights and biases, the next batch is sampled and the process repeats itself. The procedure continues until all batches have been presented to the network. One full sweep over all batches is called an epoch.

The training of the network stops once the maximum number of epochs is reached or another stopping criterion defined by the user applies.

# Make Session
net = tf.Session()
# Run initializer
net.run(tf.global_variables_initializer())

# Setup interactive plot
import matplotlib.pyplot as plt
plt.ion()
fig = plt.figure()
ax1 = fig.add_subplot(111)
line1, = ax1.plot(y_test)
line2, = ax1.plot(y_test*0.5)
plt.show()

# Number of epochs and batch size
epochs = 10
batch_size = 256

for e in range(epochs):

    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]

    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})

        # Show progress
        if np.mod(i, 5) == 0:
            # Prediction
            pred = net.run(out, feed_dict={X: X_test})
            line2.set_ydata(pred)
            plt.title('Epoch ' + str(e) + ', Batch ' + str(i))
            file_name = 'img/epoch_' + str(e) + '_batch_' + str(i) + '.jpg'
            plt.savefig(file_name)
            plt.pause(0.01)
# Print final MSE after Training
mse_final = net.run(mse, feed_dict={X: X_test, Y: y_test})
print(mse_final)

During training, we evaluate the network’s predictions on the test set — the data which is not learned, but set aside — for every 5th batch and visualize them. Additionally, the images are exported to disk and later combined into a video animation of the training process (see below). The model quickly learns the shape and location of the time series in the test data and is able to produce an accurate prediction after some epochs. Nice!

Video animation of the network’s test data prediction (orange) during training.

One can see that the network rapidly adapts to the basic shape of the time series and continues to learn finer patterns of the data. This also corresponds to the Adam learning scheme that lowers the learning rate during model training in order not to overshoot the optimization minimum. After 10 epochs, we have a pretty close fit to the test data! The final test MSE equals 0.00078 (it is very low because the target is scaled). The mean absolute percentage error of the forecast on the test set is equal to 5.31%, which is pretty good. Note that this is just a fit to the test data, not an actual out-of-sample metric in a real-world scenario.

Please note that there are tons of ways of further improving this result: design of layers and neurons, choosing different initialization and activation schemes, introduction of dropout layers, early stopping and so on. Furthermore, different types of deep learning models, such as recurrent neural networks, might achieve better performance on this task. However, this is not the scope of this introductory post.

Conclusion and outlook

The release of TensorFlow was a landmark event in deep learning research. Its flexibility and performance allow researchers to develop all kinds of sophisticated neural network architectures as well as other ML algorithms. However, flexibility comes at the cost of longer time-to-model cycles compared to higher-level APIs such as Keras or MxNet. Nonetheless, I am sure that TensorFlow will make its way to becoming the de facto standard in neural network and deep learning development in research and practical applications. Many of our customers are already using TensorFlow or starting to develop projects that employ TensorFlow models. Also, our data science consultants at STATWORX are heavily using TensorFlow for deep learning and neural net research and development. Let’s see what Google has planned for the future of TensorFlow. One thing that is missing, at least in my opinion, is a neat graphical user interface for designing and developing neural net architectures with a TensorFlow backend. Maybe this is something Google is already working on ;)

Update: I’ve added both the Python script as well as a (zipped) dataset to a Github repository. Feel free to clone and fork.

Final remarks

If you have any comments or questions on my story, feel free to comment below! I will try to answer them. Also, feel free to use my code or share this story with your peers on social platforms of your choice. Follow me on LinkedIn or Twitter, if you want to stay in touch.

Make sure you also check the awesome STATWORX Blog for more interesting data science, ML and AI content straight from our office in Frankfurt, Germany!

If you’re interested in more quality content like this, join my mailing list, constantly bringing you new data science, machine learning and AI reads and treats from me and my team right into your inbox!

Guide to Python Programming Language


Description
The course will lead you from beginner level to advanced in the Python programming language. You do not need any prior knowledge of Python, of any other programming language, or even of programming itself to join the course and become an expert on the topic.

The course is being continuously developed, with new lectures added regularly.

Please see the Promo and free sample video to get to know more.

Hope you will enjoy it.

Basic knowledge
- An enthusiast mind
- A computer
- Basic knowledge of how to use a computer
- An Internet connection

What will you learn
- Become an expert in the Python programming language
- Build applications with the Python programming language

Learn Python Programming


Description

Learn Python Programming and increase your python programming skills with Coder Kovid.

Python is the fastest-growing programming language of this era. You can use Python to do almost everything: web development, software development, cognitive development, machine learning, artificial intelligence, etc. You should learn Python programming and increase your programming skills.

In this Learn Python Programming course, you don't need any prior programming knowledge. Every beginner can start with it.

Basic knowledge
- No prior knowledge is needed to take this course

What will you learn
- Write basic syntax of the Python programming language
- Create basic real-world applications
- Program in a fluent manner
- Get familiar with the programming environment