Build Your Own Image Classifier In TensorFlow

Building a Convolutional Neural Network for Image Classification with TensorFlow

A Convolutional Neural Network (CNN) is a special type of deep neural network that performs impressively in computer vision problems such as image classification, object detection, etc. In this article, we are going to create an image classifier with TensorFlow by implementing a CNN to classify cats and dogs.

With traditional programming, it is not possible to build scalable solutions for problems like computer vision, since it is not feasible to write an algorithm generalized enough to identify the nature of images. With machine learning, we can build an approximation that is sufficient for such use-cases by training a model on given examples and predicting for unseen data.

How Does a CNN Work?

A CNN is constructed from multiple convolution layers, pooling layers, and dense layers.

The idea of the convolution layer is to transform the input image in order to extract features (e.g. the ears, nose, and legs of cats and dogs) that distinguish them correctly. This is done by convolving the image with a kernel. A kernel is specialized to extract a certain feature, and multiple kernels can be applied to a single image to capture multiple features.


How a kernel is applied to an image to extract features

Usually, an activation function (e.g. tanh, ReLU) is applied to the convolved values to increase the non-linearity.

The job of the pooling layer is to reduce the image size. It keeps only the most important features and removes the less important areas of the image. This also reduces the computational cost. The most popular pooling strategies are max-pooling and average-pooling.

The size of the pooling matrix determines the image reduction. For example, a 2x2 pooling matrix will reduce the image size by 50%.

How max-pooling and average-pooling work

This series of convolution and pooling layers identifies the features, and is followed by dense layers that perform the actual learning and prediction.


Layers of a CNN

Building the Image Classifier

A CNN is a deep neural network that requires significant computational power for training. Moreover, to obtain sufficient accuracy, a large dataset is needed to construct a model that generalizes to unseen data. Hence, I am running the code in Google Colab, a platform for research purposes that supports GPU-enabled hardware, which gives a huge boost for training as well.

Download and load the dataset

This dataset contains 2000 jpg images of cats and dogs. First, we need to download the dataset and extract it (here, the data is downloaded to the /tmp directory of the Colab instance).


Downloading dataset
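The original post shows this step as a screenshot; below is a minimal sketch of the download step. The archive URL is an assumption (the commonly used cats_and_dogs_filtered archive), not taken from the original.

import urllib.request

# Hypothetical dataset URL; substitute the archive you actually use
DATA_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
urllib.request.urlretrieve(DATA_URL, '/tmp/cats_and_dogs_filtered.zip')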


Extracting the dataset
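A corresponding extraction sketch using Python's standard library, following the file name from the download step above:

import zipfile

# Extract the archive into /tmp (assuming it contains a top-level folder)
with zipfile.ZipFile('/tmp/cats_and_dogs_filtered.zip', 'r') as archive:
    archive.extractall('/tmp')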

The above code segments download the dataset and extract it to the /tmp directory. The extracted directory has two subdirectories named train and validation, holding the training and testing data. Inside each of them, there are two subdirectories for cats and dogs. We can easily load the training and testing data for the two classes with the TensorFlow data generator.

Setting the paths of testing and validation images
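A sketch of the path setup, assuming the extracted folder is named cats_and_dogs_filtered:

import os

base_dir = '/tmp/cats_and_dogs_filtered'  # assumed name of the extracted folder
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')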


Load data with the TensorFlow image generator
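A sketch of the two generators described below, using Keras' ImageDataGenerator with the directory variables from the previous step:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1]
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

# Load images in batches of 20, resized to 150x150
train_generator = train_datagen.flow_from_directory(
    train_dir, target_size=(150, 150), batch_size=20, class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
    validation_dir, target_size=(150, 150), batch_size=20, class_mode='binary')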

Here we have two data generators, for the train and test data. When loading the data, rescaling is applied to normalize the pixel values, which helps the model converge faster. Moreover, the data is loaded in batches of 20 images, and all images are resized to 150x150. This also normalizes any images that come in different sizes.

Constructing the model

Since the data is ready, now we can build the model. Here I am going to add 3 convolutional layers, each followed by a max-pooling layer. Then there is a Flatten layer, and finally there are 2 dense layers.


Construct the CNN model
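A sketch of the architecture described below. The 16 kernels of size 3x3 in the first layer and the single sigmoid output neuron come from the text; the filter counts of the deeper convolution layers (32, 64) and the 512 units in the hidden dense layer are assumptions.

import tensorflow as tf

model = tf.keras.models.Sequential([
    # Block 1: 16 kernels of size 3x3, followed by 2x2 max-pooling
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
                           input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Blocks 2 and 3 (filter counts assumed)
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Flatten feature maps into a 1D vector for the dense layers
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),  # width is a tunable hyperparameter
    tf.keras.layers.Dense(1, activation='sigmoid')  # binary output: cat vs. dog
])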

In the first convolution layer, I have added 16 kernels of size 3x3. Once the image is convolved with a kernel, it is passed through a ReLU activation to obtain non-linearity. The input shape of this layer is 150x150x3, since we resized all images to 150x150 and, being color images, they have 3 RGB channels.

In the max-pooling layer, I have added a 2x2 kernel so that the maximum value is taken while reducing the image size by 50%.

There are 3 such blocks (convolution plus max-pooling) to extract the features of the images. If very complex features need to be learned, more layers should be added, making the model deeper.

The Flatten layer takes the output of the previous max-pooling layer and converts it to a 1D array so it can be fed into the Dense layers. A dense layer is a regular layer of neurons in a neural network; this is where the actual learning happens, by adjusting the weights. Here we have 2 such dense layers, and since this is binary classification, there is only 1 neuron in the output layer. The number of neurons in the other dense layer is a hyperparameter that can be tuned for the best accuracy.

Train the model

Since we have constructed the model, we can now compile it.


Compile the model
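A sketch of the compile step matching the description below; the learning rate is an assumption:

from tensorflow.keras.optimizers import RMSprop

model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),  # learning rate assumed
              metrics=['accuracy'])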

Here we need to define how to calculate the loss. Since this is binary classification, we can use binary_crossentropy. With the optimizer parameter, we specify how to adjust the weights in the network so that the loss is reduced; there are many options, and here I use RMSprop. Finally, the metrics parameter is used to estimate how good our model is, and here we use accuracy.

Now we can start training the model.


Train the model
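A sketch of the training call, using the step counts and epoch count from the description below (recent TensorFlow versions accept generators directly in fit):

history = model.fit(
    train_generator,
    steps_per_epoch=100,   # 2000 training images / batch size of 20
    epochs=15,
    validation_data=validation_generator,
    validation_steps=50,
    verbose=1)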

Here we are passing the train and validation generators we used to load our data. Since our data generator uses a batch size of 20, we need 100 steps_per_epoch to cover all 2000 training images, and 50 steps for the validation images. The epochs parameter sets the number of training iterations, and the verbose parameter shows the progress of each iteration during training.

Results


Results after 15 epochs

After 15 epochs, the model scored 98.9% accuracy on the training set but only 71.5% on the validation set. This is a clear indication that our model has overfitted: it performs really well on the training set but poorly on unseen data.

To solve the overfitting problem, we can either add regularization to avoid over-complicating the model, or add more data to the training set to make the model generalize better to unseen data. Since we have a very small dataset (2000 images) for training, adding more data should fix the issue.

Collecting more data to train a model is an overwhelming task in machine learning, since the new data must be preprocessed again. But when working with images, especially in image classification, there is no need to collect more data: this can be fixed with a technique called Image Augmentation.

Image Augmentation

The idea of Image Augmentation is to construct new images from the existing ones by resizing, zooming, rotating, etc. With this approach, the model is able to capture more features than before and generalizes better to unseen data.

For example, let's assume that most of the cats in our training set look like the following image, showing the full body of a cat. The model will try to learn the shape of the cat's body from these images.

Due to this, the classifier might fail to correctly identify images like the following, since it hasn't been trained with similar examples.

But with image augmentation, we can construct new images from existing ones to make the classifier learn new features. With zooming, for example, we can construct a new image like the one below, helping the learner classify images like the one above that it previously failed to classify correctly.


Zoomed image from the original image with image augmentation

Adding image augmentation is really easy with the TensorFlow image generator. When image augmentation is applied, the original dataset remains untouched and all the manipulations are done in memory. The following code segment shows how to add this functionality.


Adding image augmentation when loading data
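A sketch of an augmenting training generator; the specific ranges are assumptions chosen as typical values:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,       # random rotations up to 40 degrees
    width_shift_range=0.2,   # random horizontal shifts
    height_shift_range=0.2,  # random vertical shifts
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
# The validation generator keeps only the rescaling, without augmentation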

Here, image rotating, shifting, zooming, and a few other manipulation techniques are applied to generate new samples in the training set.

Once we apply image augmentation, it is possible to obtain 86% training accuracy and 81% testing accuracy. As you can see, the model is no longer overfitted as before, and for a very small dataset like this, the accuracy is impressive. You can improve it further by tuning hyperparameters such as the optimizer, the number of dense layers, and the number of neurons in each layer.

Codeless ML with TensorFlow and AI Platform

Codeless ML with TensorFlow and AI Platform - Building an end-to-end machine learning pipeline without writing any ML code.

Originally published by Gad Benram at blog.doit-intl.com

Advances in AI frameworks enable developers to create and deploy deep learning models with as little effort as clicking a few buttons on the screen. Using a UI or an API based on Tensorflow Estimators, models can be built and served without writing a single line of machine learning code.

70 years ago, only a handful of experts knew how to create computer programs, because the process of programming required very high theoretical and technical specialization. Over the years, humans have created increasingly higher levels of abstraction and encapsulation of programming, allowing less-skilled personnel to create software with very basic tools (see Wix for example). The exact same process is occurring these days with machine learning, only it is advancing much faster. In this blog post we will write down a simple script that will generate a full machine learning pipeline.

Truly codeless?

This post contains two types of code. The first is a SQL query to generate the dataset; this part of the code could be replaced by tools like Google Cloud Dataprep. The other type involves API calls using a Python client library, all of which are available through the AI Platform UI. When I say codeless, I mean that at no point will you need to import TensorFlow or other ML libraries.

In this demo, I will use the Chicago Taxi Trips open dataset in Google BigQuery to predict the travel time of a taxi based on pickup location, desired drop-off, and the time of ride start. The model will be trained and deployed using Google Cloud services that wrap Tensorflow.

The entire code sample can be found in this GitHub repository.

Extract Features using BigQuery

Based on an EDA shown in this notebook, I created a SQL query to generate a training dataset:

WITH dataset AS( SELECT 
          EXTRACT(HOUR FROM  trip_start_timestamp) trip_start_hour
        , EXTRACT(DAYOFWEEK FROM  trip_start_timestamp) trip_start_weekday
        , EXTRACT(WEEK FROM  trip_start_timestamp) trip_start_week
        , EXTRACT(DAYOFYEAR FROM  trip_start_timestamp) trip_start_yearday
        , EXTRACT(MONTH FROM  trip_start_timestamp) trip_start_month
        , (trip_miles * 1.60934 ) / ((trip_seconds + .01) / (60 * 60)) trip_speed_kmph
        , trip_miles
        , pickup_latitude
        , pickup_longitude
        , dropoff_latitude
        , dropoff_longitude
        , pickup_community_area
        , dropoff_community_area
        , ST_DISTANCE(
          (ST_GEOGPOINT(pickup_longitude,pickup_latitude)),
          (ST_GEOGPOINT(dropoff_longitude,dropoff_latitude))) air_distance
        , CAST (trip_seconds AS FLOAT64) trip_seconds
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` 
        WHERE RAND() < (3000000/112860054) --sample maximum ~3M records 
                AND  trip_start_timestamp < '2016-01-01'
                AND pickup_location IS NOT NULL
                AND dropoff_location IS NOT NULL)
    SELECT 
         trip_seconds
        , air_distance
        , pickup_latitude
        , pickup_longitude
        , dropoff_latitude
        , dropoff_longitude
        , pickup_community_area
        , dropoff_community_area
        , trip_start_hour
        , trip_start_weekday
        , trip_start_week
        , trip_start_yearday
        , trip_start_month
    FROM dataset
    WHERE trip_speed_kmph BETWEEN 5 AND 90

feature extraction script

In the repo, you will be able to see how I execute the query using a Python client and export it to GCS.
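A hedged sketch of what that client code might look like; the destination table, bucket path, and variable names here are hypothetical, not taken from the repo:

from google.cloud import bigquery

client = bigquery.Client(project=project_id)  # project_id assumed to be defined

# Materialize the query results into a table (QUERY holds the SQL above)
job_config = bigquery.QueryJobConfig(
    destination='{}.chicago_taxi.training_data'.format(project_id))
client.query(QUERY, job_config=job_config).result()

# Export as a headerless CSV, per the note below
extract_config = bigquery.ExtractJobConfig(print_header=False)
client.extract_table(
    '{}.chicago_taxi.training_data'.format(project_id),
    'gs://my-bucket/processed_data/train.csv',  # hypothetical bucket
    job_config=extract_config).result()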

Important! In order for the AI platform to build a model with this data, the first column must be the target variable and the CSV export should not contain a header.

Submit hyper-parameter tuning job and deploy

After I have my dataset containing a few hundred thousand rides, I define a simple neural network architecture based on the TensorFlow Estimator API, with a parameter space to search. This specific spec creates a neural network with 3 hidden layers that solves a regression task (predicting the expected trip time). It launches 50 trials to search for optimal settings for the learning rate, regularization factors, and maximum steps.

{
    "scaleTier": "CUSTOM",
    "masterType": "standard_gpu",
    "args": [
        "--preprocess",
        "--validation_split=0.2",
        "--model_type=regression",
        "--hidden_units=120,60,60",
        "--batch_size=128",
        "--eval_frequency_secs=128",
        "--optimizer_type=ftrl",
        "--use_wide",
        "--embed_categories",
        "--dnn_learning_rate=0.001",
        "--dnn_optimizer_type=ftrl"
    ],
    "hyperparameters": {
        "goal": "MINIMIZE",
        "params": [
            {
                "parameterName": "max_steps",
                "minValue": 100,
                "maxValue": 60000,
                "type": "INTEGER",
                "scaleType": "UNIT_LINEAR_SCALE"
            },
            {
                "parameterName": "learning_rate",
                "minValue": 0.0001,
                "maxValue": 0.5,
                "type": "DOUBLE",
                "scaleType": "UNIT_LINEAR_SCALE"
            },
            {
                "parameterName": "l1_regularization_strength",
                "maxValue": 1,
                "type": "DOUBLE",
                "scaleType": "UNIT_LINEAR_SCALE"
            },
            {
                "parameterName": "l2_regularization_strength",
                "maxValue": 1,
                "type": "DOUBLE",
                "scaleType": "UNIT_LINEAR_SCALE"
            },
            {
                "parameterName": "l2_shrinkage_regularization_strength",
                "maxValue": 1,
                "type": "DOUBLE",
                "scaleType": "UNIT_LINEAR_SCALE"
            }
        ],
        "maxTrials": 50,
        "maxParallelTrials": 10,
        "hyperparameterMetricTag": "loss",
        "enableTrialEarlyStopping": True
    },
    "region": "us-central1",
    "jobDir": "{JOB_DIR}",
    "masterConfig": {
        "imageUri": "gcr.io/cloud-ml-algos/wide_deep_learner_gpu:latest"
    }
}

Provided the spec above, I can use a Python client to launch a training job:

from datetime import datetime

def train_hyper_params(cloudml_client, training_inputs):
    """Submit the hyper-parameter tuning job to AI Platform."""
    job_name = 'chicago_travel_time_training_{}'.format(
        datetime.utcnow().strftime('%Y%m%d%H%M%S'))
    project_name = 'projects/{}'.format(project_id)
    job_spec = {'jobId': job_name, 'trainingInput': training_inputs}
    response = cloudml_client.projects().jobs().create(
        body=job_spec, parent=project_name).execute()
    print(response)

I use the API client to monitor the job run, and, when the job is done, I deploy and test the model.
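A minimal polling sketch for the monitoring step; the helper below is illustrative, assuming project_name is built as in the training function above:

import time

def wait_for_job(cloudml_client, job_name):
    """Poll the AI Platform job until it reaches a terminal state."""
    jobs = cloudml_client.projects().jobs()
    while True:
        job = jobs.get(name='{}/jobs/{}'.format(project_name, job_name)).execute()
        if job['state'] in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return job
        time.sleep(60)  # check once a minute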

def create_model(cloudml_client):
    """
    Creates a Model entity in AI Platform
    :param cloudml_client: discovery client
    """
    models = cloudml_client.projects().models()
    create_spec = {'name': model_name}

    models.create(body=create_spec,
                  parent=project_name).execute()

def deploy_version(cloudml_client, job_results):
    """
    Deploying the best trial's model to AI Platform
    :param cloudml_client: discovery client
    :param job_results: response of the finished AI platform job
    """
    models = cloudml_client.projects().models()

    training_outputs = job_results['trainingOutput']
    version_spec = {
        "name": model_version,
        "isDefault": False,
        "runtimeVersion": training_outputs['builtInAlgorithmOutput']['runtimeVersion'],

        # Assuming the trials are sorted by performance (best is first)
        "deploymentUri": training_outputs['trials'][0]['builtInAlgorithmOutput']['modelPath'],
        "framework": training_outputs['builtInAlgorithmOutput']['framework'],
        "pythonVersion": training_outputs['builtInAlgorithmOutput']['pythonVersion'],
        "autoScaling": {
            'minNodes': 0
        }
    }

    versions = models.versions()
    response = versions.create(
        body=version_spec,
        parent='{}/models/{}'.format(project_name, model_name)).execute()
    return response

With this, I completed the deployment of a machine learning pipeline using only API calls.

Get predictions

In order to get predictions, I load part of the test set records to the memory and send it to the deployed version for inference:

import pandas as pd
from googleapiclient import discovery

def validate_model():
    """
    Function to validate the model results
    """
    df_val = pd.read_csv('{}/processed_data/test.csv'.format(job_dir))

    # Submit only 10 samples to the server, ignore the first column (= target column)
    instances = [", ".join(x) for x in df_val.iloc[:10, 1:].astype(str).values.tolist()]
    service = discovery.build('ml', 'v1')
    version_name = 'projects/{}/models/{}'.format(project_id, model_name)

    if model_version is not None:
        version_name += '/versions/{}'.format(model_version)

    response = service.projects().predict(
        name=version_name,
        body={'instances': instances}
    ).execute()

    if 'error' in response:
        raise RuntimeError(response['error'])

    return response['predictions']

Getting predictions




Predicting the Stock price Using TensorFlow

A simple deep learning model for stock price prediction using TensorFlow

For a recent hackathon that we did at STATWORX, some of our team members scraped minute-by-minute S&P 500 data from the Google Finance API. The data consisted of the index as well as the stock prices of the S&P 500's constituents. Having this data at hand, the idea of developing a deep learning model that predicts the S&P 500 index based on the constituents' prices one minute earlier immediately came to mind.

Playing around with the data and building the deep learning model with TensorFlow was fun, so I decided to write my first Medium.com story: a little TensorFlow tutorial on predicting S&P 500 stock prices. What you will read is not an in-depth tutorial, but more a high-level introduction to the important building blocks and concepts of TensorFlow models. The Python code I've created is not optimized for efficiency but for understandability. The dataset I've used can be downloaded from here (40MB).

Note that this story is a hands-on tutorial on TensorFlow. Actual prediction of stock prices is a really challenging and complex task that requires tremendous effort, especially at higher frequencies, such as the minute-level data used here.

Importing and preparing the data

Our team exported the scraped stock data from our scraping server as a csv file. The dataset contains n = 41266 minutes of data ranging from April to August 2017 on 500 stocks as well as the total S&P 500 index price. Index and stocks are arranged in wide format.

# Import libraries
import numpy as np
import pandas as pd

# Import data
data = pd.read_csv('data_stocks.csv')
# Drop date variable
data = data.drop(['DATE'], 1)
# Dimensions of dataset
n = data.shape[0]
p = data.shape[1]
# Make data a numpy array
data = data.values

The data was already cleaned and prepared, meaning missing stock and index prices were LOCF’ed (last observation carried forward), so that the file did not contain any missing values.

A quick look at the S&P time series using pyplot.plot(data['SP500']):

Time series plot of the S&P 500 index.
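For completeness, a sketch of that plotting call; note that it must run before the data.values conversion above, while data is still a DataFrame:

import matplotlib.pyplot as plt

plt.plot(data['SP500'])
plt.show()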

Note: This is actually the lead of the S&P 500 index, meaning its value is shifted 1 minute into the future (this has already been done in the dataset). This operation is necessary since we want to predict the next minute of the index and not the current minute. Technically speaking, each row in the dataset contains the price of the S&P 500 at t+1 and the constituents' prices at T=t.

Preparing training and test data

The dataset was split into training and test data. The training data contained 80% of the total dataset. The data was not shuffled but sequentially sliced. The training data ranges from April to approximately the end of July 2017; the test data ends at the end of August 2017.

# Training and test data
train_start = 0
train_end = int(np.floor(0.8*n))
test_start = train_end
test_end = n
data_train = data[np.arange(train_start, train_end), :]
data_test = data[np.arange(test_start, test_end), :]

There are a lot of different approaches to time series cross validation, such as rolling forecasts with and without refitting or more elaborate concepts such as time series bootstrap resampling. The latter involves repeated samples from the remainder of the seasonal decomposition of the time series in order to simulate samples that follow the same seasonal pattern as the original time series but are not exact copies of its values.

Data scaling

Most neural network architectures benefit from scaling the inputs (and sometimes also the output). Why? Because the most common activation functions of the network's neurons, such as tanh or sigmoid, are defined on the [-1, 1] or [0, 1] interval respectively. Nowadays, rectified linear unit (ReLU) activations are commonly used; they are unbounded on the axis of possible activation values. However, we will scale both the inputs and targets anyway. Scaling can be easily accomplished in Python using sklearn's MinMaxScaler.

# Scale data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
# Build X and y
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]

Remark: Caution must be taken regarding what part of the data is scaled and when. A common mistake is to scale the whole dataset before the training and test split is applied. Why is this a mistake? Because scaling invokes the calculation of statistics, e.g. the min/max of a variable. When performing time series forecasting in real life, you do not have information from future observations at the time of forecasting. Therefore, the calculation of scaling statistics has to be conducted on the training data and then applied to the test data. Otherwise, you use future information at the time of forecasting, which commonly biases forecasting metrics in a positive direction.

Introduction to TensorFlow

TensorFlow is a great piece of software and currently the leading deep learning and neural network computation framework. It is based on a C++ low level backend but is usually controlled via Python (there is also a neat TensorFlow library for R, maintained by RStudio). TensorFlow operates on a graph representation of the underlying computational task. This approach allows the user to specify mathematical operations as elements in a graph of data, variables and operators. Since neural networks are actually graphs of data and mathematical operations, TensorFlow is just perfect for neural networks and deep learning. Check out this simple example (stolen from our deep learning introduction from our blog):


A very simple graph that adds two numbers together.

In the figure above, two numbers are supposed to be added. Those numbers are stored in two variables, a and b. The two values are flowing through the graph and arrive at the square node, where they are being added. The result of the addition is stored into another variable, c. Actually, a, b and c can be considered as placeholders. Any numbers that are fed into a and b get added and are stored into c. This is exactly how TensorFlow works. The user defines an abstract representation of the model (neural network) through placeholders and variables. Afterwards, the placeholders get "filled" with real data and the actual computations take place. The following code implements the toy example from above in TensorFlow:

# Import TensorFlow
import tensorflow as tf

# Define a and b as placeholders
a = tf.placeholder(dtype=tf.int8)
b = tf.placeholder(dtype=tf.int8)

# Define the addition
c = tf.add(a, b)

# Initialize the graph
graph = tf.Session()

# Run the graph
graph.run(c, feed_dict={a: 5, b: 4})

After having imported the TensorFlow library, two placeholders are defined using tf.placeholder(). They correspond to the two blue circles on the left of the image above. Afterwards, the mathematical addition is defined via tf.add(). The result of the computation is c = 9. With placeholders set up, the graph can be executed with any integer value for a and b. Of course, the former problem is just a toy example. The required graphs and computations in a neural network are much more complex.

Placeholders

As mentioned before, it all starts with placeholders. We need two placeholders in order to fit our model: X contains the network's inputs (the stock prices of all S&P 500 constituents at time T = t) and Y the network's outputs (the index value of the S&P 500 at time T = t + 1).

The shapes of the placeholders correspond to [None, n_stocks] and [None], meaning that the inputs are a 2-dimensional matrix and the outputs are a 1-dimensional vector. It is crucial to understand which input and output dimensions the neural net needs in order to design it properly.

# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_stocks])
Y = tf.placeholder(dtype=tf.float32, shape=[None])

The None argument indicates that at this point we do not yet know the number of observations that flow through the neural net graph in each batch, so we keep it flexible. We will later define the variable batch_size that controls the number of observations per training batch.

Variables

Besides placeholders, variables are another cornerstone of the TensorFlow universe. While placeholders are used to store input and target data in the graph, variables are used as flexible containers within the graph that are allowed to change during graph execution. Weights and biases are represented as variables in order to adapt during training. Variables need to be initialized prior to model training. We will get into that in more detail a little later.

The model consists of four hidden layers. The first layer contains 1024 neurons, slightly more than double the size of the inputs. Subsequent hidden layers are always half the size of the previous layer, which means 512, 256 and finally 128 neurons. A reduction of the number of neurons for each subsequent layer compresses the information the network identifies in the previous layers. Of course, other network architectures and neuron configurations are possible but are out of scope for this introduction level article.

# Model architecture parameters
n_stocks = 500
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1
# Layer 1: Variables for hidden weights and biases
W_hidden_1 = tf.Variable(weight_initializer([n_stocks, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
# Layer 2: Variables for hidden weights and biases
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
# Layer 3: Variables for hidden weights and biases
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
# Layer 4: Variables for hidden weights and biases
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))

# Output layer: Variables for output weights and biases
W_out = tf.Variable(weight_initializer([n_neurons_4, n_target]))
bias_out = tf.Variable(bias_initializer([n_target]))

It is important to understand the required variable dimensions between input, hidden and output layers. As a rule of thumb in multilayer perceptrons (MLPs, the type of networks used here), the second dimension of the previous layer is the first dimension of the current layer's weight matrix. This might sound complicated, but it is essentially just each layer passing its output as input to the next layer. The bias dimension equals the second dimension of the current layer's weight matrix, which corresponds to the number of neurons in this layer.

Designing the network architecture

After the definition of the required weight and bias variables, the network topology, i.e. the architecture of the network, needs to be specified. Hereby, placeholders (data) and variables (weights and biases) need to be combined into a system of sequential matrix multiplications.

Furthermore, the hidden layers of the network are transformed by activation functions. Activation functions are important elements of the network architecture since they introduce non-linearity to the system. There are dozens of possible activation functions out there, one of the most common is the rectified linear unit (ReLU) which will also be used in this model.

# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3), bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4), bias_hidden_4))

# Output layer (must be transposed)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))

The image below illustrates the network architecture. The model consists of three major building blocks. The input layer, the hidden layers and the output layer. This architecture is called a feedforward network. Feedforward indicates that the batch of data solely flows from left to right. Other network architectures, such as recurrent neural networks, also allow data flowing “backwards” in the network.

Cool technical illustration of our feedforward network architecture.

Cost function

The cost function of the network is used to generate a measure of deviation between the network’s predictions and the actual observed training targets. For regression problems, the mean squared error (MSE) function is commonly used. MSE computes the average squared deviation between predictions and targets. Basically, any differentiable function can be implemented in order to compute a deviation measure between predictions and targets.

# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))

However, the MSE exhibits certain properties that are advantageous for the general optimization problem to be solved.

Optimizer

The optimizer takes care of the necessary computations used to adapt the network's weight and bias variables during training. Those computations invoke the calculation of so-called gradients, which indicate the direction in which the weights and biases have to be changed during training in order to minimize the network's cost function. The development of stable and speedy optimizers is a major field in neural network and deep learning research.

# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)

Here the Adam optimizer is used, which is one of the current default optimizers in deep learning development. Adam stands for "Adaptive Moment Estimation" and can be considered a combination of two other popular optimizers, AdaGrad and RMSProp.

Initializers

Initializers are used to initialize the network's variables before training. Since neural networks are trained using numerical optimization techniques, the starting point of the optimization problem is one of the key factors in finding good solutions to the underlying problem. There are different initializers available in TensorFlow, each with a different initialization approach. Here, I use the tf.variance_scaling_initializer(), which is one of the default initialization strategies.

# Initializers
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg", distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()

Note that with TensorFlow it is possible to define multiple initialization functions for different variables within the graph. However, in most cases, a unified initialization is sufficient.

Fitting the neural network

After having defined the placeholders, variables, initializers, cost functions and optimizers of the network, the model needs to be trained. Usually, this is done by minibatch training. During minibatch training random data samples of n = batch_size are drawn from the training data and fed into the network. The training dataset gets divided into n / batch_size batches that are sequentially fed into the network. At this point the placeholders X and Y come into play. They store the input and target data and present them to the network as inputs and targets.

A sampled data batch of X flows through the network until it reaches the output layer. There, TensorFlow compares the model's predictions against the actual observed targets Y in the current batch. Afterwards, TensorFlow conducts an optimization step and updates the network's parameters, corresponding to the selected learning scheme. After having updated the weights and biases, the next batch is sampled and the process repeats itself. The procedure continues until all batches have been presented to the network. One full sweep over all batches is called an epoch.

The training of the network stops once the maximum number of epochs is reached or another stopping criterion defined by the user applies.

# Make Session
net = tf.Session()
# Run initializer
net.run(tf.global_variables_initializer())

# Setup interactive plot
plt.ion()
fig = plt.figure()
ax1 = fig.add_subplot(111)
line1, = ax1.plot(y_test)
line2, = ax1.plot(y_test*0.5)
plt.show()

# Number of epochs and batch size
epochs = 10
batch_size = 256

for e in range(epochs):

    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]

    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})

        # Show progress
        if np.mod(i, 5) == 0:
            # Prediction
            pred = net.run(out, feed_dict={X: X_test})
            line2.set_ydata(pred)
            plt.title('Epoch ' + str(e) + ', Batch ' + str(i))
            file_name = 'img/epoch_' + str(e) + '_batch_' + str(i) + '.jpg'
            plt.savefig(file_name)
            plt.pause(0.01)
# Print final MSE after Training
mse_final = net.run(mse, feed_dict={X: X_test, Y: y_test})
print(mse_final)

During the training, we evaluate the network's predictions on the test set — the data which is not learned, but set aside — for every 5th batch and visualize them. Additionally, the images are exported to disk and later combined into a video animation of the training process (see below). The model quickly learns the shape and location of the time series in the test data and is able to produce an accurate prediction after some epochs. Nice!

Video animation of the network’s test data prediction (orange) during training.

One can see that the network rapidly adapts to the basic shape of the time series and continues to learn finer patterns of the data. This also corresponds to the Adam learning scheme, which lowers the learning rate during model training in order not to overshoot the optimization minimum. After 10 epochs, we have a pretty close fit to the test data! The final test MSE equals 0.00078 (it is very low because the target is scaled). The mean absolute percentage error of the forecast on the test set is equal to 5.31%, which is pretty good. Note that this is just a fit to the test data; there are no actual out-of-sample metrics in a real-world scenario.

Please note that there are tons of ways of further improving this result: design of layers and neurons, choosing different initialization and activation schemes, introducing dropout layers, early stopping and so on. Furthermore, different types of deep learning models, such as recurrent neural networks, might achieve better performance on this task. However, this is not in the scope of this introductory post.

Conclusion and outlook

The release of TensorFlow was a landmark event in deep learning research. Its flexibility and performance allow researchers to develop all kinds of sophisticated neural network architectures as well as other ML algorithms. However, flexibility comes at the cost of longer time-to-model cycles compared to higher-level APIs such as Keras or MxNet. Nonetheless, I am sure that TensorFlow will make its way to becoming the de facto standard in neural network and deep learning development in research and practical applications. Many of our customers are already using TensorFlow or starting to develop projects that employ TensorFlow models. Also, our data science consultants at STATWORX are heavily using TensorFlow for deep learning and neural net research and development. Let's see what Google has planned for the future of TensorFlow. One thing that is missing, at least in my opinion, is a neat graphical user interface for designing and developing neural net architectures with a TensorFlow backend. Maybe this is something Google is already working on ;)

Update: I’ve added both the Python script as well as a (zipped) dataset to a Github repository. Feel free to clone and fork.

Final remarks

If you have any comments or questions on my story, feel free to comment below! I will try to answer them. Also, feel free to use my code or share this story with your peers on social platforms of your choice. Follow me on LinkedIn or Twitter, if you want to stay in touch.

Make sure you also check the awesome STATWORX Blog for more interesting data science, ML and AI content straight from our office in Frankfurt, Germany!

If you’re interested in more quality content like this, join my mailing list, constantly bringing you new data science, machine learning and AI reads and treats from me and my team right into your inbox!

TensorFlow 2.0: Natural Language Processing

NLP in TensorFlow 2.0: using the Transformers library with the Keras API and the TensorFlow TPUStrategy to fine-tune a state-of-the-art Transformer model.

Introduction

Hugging Face is an NLP-focused startup with a large open-source community, in particular around the Transformers library. Transformers is a Python-based library that exposes an API for using many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. Those architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:

pip install transformers

The library has seen super-fast growth in PyTorch and has recently been ported to TensorFlow 2.0, offering an API that now works with Keras’ fit API, TensorFlow Extended, and TPUs 👏. This blog post is dedicated to the use of the Transformers library using TensorFlow: using the Keras API as well as the TensorFlow TPUStrategy to fine-tune a State-of-The-Art Transformer model.

Library & Philosophy

Transformers is based around the concept of pre-trained transformer models. These transformer models come in different shapes, sizes, and architectures, and each has its own way of accepting input data: via tokenization.

The library builds on three main classes: a configuration class, a tokenizer class, and a model class.

  • The configuration class: the configuration class hosts relevant information concerning the model we will be using, such as the number of layers and the number of attention heads. Below is an example of a BERT configuration file, for the pre-trained weights bert-base-cased. The configuration classes host these attributes with various I/O methods and standardized name properties.

bert-base-cased-config.json

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
  • The tokenizer class: the tokenizer class takes care of converting Python strings into arrays or tensors of integers, which are indices into the model's vocabulary. It has many handy features revolving around the tokenization of a string into tokens. This tokenization varies according to the model; therefore, each model has its own tokenizer.
  • The model class: the model class holds the neural network modeling logic itself. When using a TensorFlow model, it inherits from tf.keras.layers.Layer, which means it can be used very simply with Keras' fit API or trained using a custom training loop and GradientTape.

Joy in simplicity

The advantage of using Transformers lies in the straightforward, model-agnostic API. Loading a pre-trained model, along with its tokenizer, can be done in a few lines of code. Here is an example of loading the BERT and GPT-2 TensorFlow models as well as their tokenizers:

model_load.py

from transformers import (TFBertModel, BertTokenizer,
                         TFGPT2Model, GPT2Tokenizer)

bert_model = TFBertModel.from_pretrained("bert-base-cased")  # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

gpt2_model = TFGPT2Model.from_pretrained("gpt2")  # Automatically loads the config
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Loading architectures is model-agnostic

The weights are downloaded from HuggingFace’s S3 bucket and cached locally on your machine. The models are ready to be used for inference or finetuned if need be. Let’s see that in action.
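Before moving on, a quick inference sanity check with the objects loaded above (the sample sentence is illustrative; the tuple indexing follows the Transformers API of that era):

# Tokenize a sample sentence and run it through the TF BERT model
input_ids = bert_tokenizer.encode("Transformers is awesome!", return_tensors="tf")
outputs = bert_model(input_ids)
last_hidden_state = outputs[0]  # shape: (batch_size, sequence_length, hidden_size)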

Fine-tuning a Transformer model

Fine-tuning a model is made easy thanks to some methods available in the Transformers library. The next parts are structured as follows:

  • Loading text data and pre-processing it
  • Defining the hyper-parameters
  • Training (with Keras on CPU/GPU and with TPUStrategy)

Building an input pipeline

We have made an accompanying colab notebook to get you on track fast with all the code. We'll leverage the tensorflow_datasets package for data loading. tensorflow_datasets provides us with a tf.data.Dataset, which can be fed into our glue_convert_examples_to_features method.

This method will make use of the tokenizer to tokenize the input and add special tokens at the beginning and the end of sequences (such as [SEP] and [CLS], for instance) if such additional tokens are required by the model. This method returns a tf.data.Dataset holding the featurized inputs.

We can then shuffle this dataset and batch it in batches of 32 units using standard tf.data.Dataset methods.

loading_data.py

import tensorflow_datasets
from transformers import glue_convert_examples_to_features

data = tensorflow_datasets.load("glue/mrpc")

train_dataset = data["train"]
validation_dataset = data["validation"]

train_dataset = glue_convert_examples_to_features(train_dataset, bert_tokenizer, 128, 'mrpc')
validation_dataset = glue_convert_examples_to_features(validation_dataset, bert_tokenizer, 128, 'mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
validation_dataset = validation_dataset.batch(64)

Building an input pipeline for our model

Training with Keras’ fit method

Training a model using Keras’ fit method has never been simpler. Now that we have the input pipeline setup, we can define the hyperparameters, and call the Keras’ fit method with our dataset.

keras_fit.py

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

bert_history = bert_model.fit(
	train_dataset,
	epochs=2,
	steps_per_epoch=115,
	validation_data=validation_dataset,
	validation_steps=7
)

Training with Strategy

Training with a strategy gives you better control over what happens during the training. By switching between strategies, the user can select the distributed fashion in which the model is trained: from multi-GPUs to TPUs.

As of the time of writing, TPUStrategy is the only surefire way to train a model on a TPU using TensorFlow 2. Building a custom loop using a strategy makes even more sense in that regard, as strategies may easily be switched around and training on multi-GPU would require practically no code change.

Building a custom loop requires a bit of work to set up; therefore, the reader is advised to open the following colab notebook to get a better grasp of the subject at hand. It does not go into the detail of tokenization as the first colab does, but it shows how to build an input pipeline that will be used by the TPUStrategy.

This makes use of a Google Cloud Platform bucket as a means to host data, as TPUs are complicated to handle when using local filesystems. The colab notebook is available here.

Transformers now has access to TensorFlow APIs - So what?

The main selling point of the Transformers library is its model agnostic and simple API. Acting as a front-end to models that obtain state-of-the-art results in NLP, switching between models according to the task at hand is extremely easy.

As an example, here’s the complete script to fine-tune BERT on a language classification task (MRPC):

BERT_keras.py

import tensorflow as tf
import tensorflow_datasets
from transformers import (TFBertForSequenceClassification, BertTokenizer,
                          glue_convert_examples_to_features)

model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

data = tensorflow_datasets.load("glue/mrpc")
train_dataset = data["train"]
train_dataset = glue_convert_examples_to_features(train_dataset, tokenizer, 128, 'mrpc')

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.fit(train_dataset, epochs=3)

However, in a production environment, memory is scarce, and you would like to use a smaller model instead; switching to DistilBERT, for example. Simply swap the model and tokenizer lines for these two in order to do so:

DistilBERT_keras.py

model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

As a platform hosting 10+ Transformer architectures, Transformers makes it very easy to use, fine-tune, and compare the models that have transfigured the deep-learning-for-NLP field. It serves as a backend for many downstream apps that leverage transformer models and is in use in production by many different companies. We'll welcome any question or issue you might have on our GitHub repository.