Building efficient data pipelines using TensorFlow

Having efficient data pipelines is of paramount importance for any machine learning model. In this blog, we will learn how to use TensorFlow’s Dataset module tf.data to build efficient data pipelines.

Motivation

Most introductory articles on TensorFlow introduce you to the feed_dict method of feeding data to the model. feed_dict processes the input data in a single thread: while the data is being loaded and processed on the CPU, the GPU sits idle, and while the GPU is training on a batch, the CPU sits idle. The TensorFlow developers advise against using this method for training or for repeated validation on the same dataset.
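For reference, here is a minimal sketch of the feed_dict pattern described above (assuming TensorFlow 1.x); on every training step, the batch is prepared in Python and pushed through a placeholder:

import tensorflow as tf  # assumes TensorFlow 1.x

# the graph consumes data through a placeholder
x = tf.placeholder(tf.float32, shape=[None])
doubled = x * 2

with tf.Session() as sess:
    # each step, the batch is prepared on the CPU and copied into the graph
    batch = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(sess.run(doubled, feed_dict={x: batch}))  # [0. 2. 4. 6. 8.]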

tf.data improves performance by prefetching the next batch of data asynchronously so that the GPU does not have to wait for data. You can also parallelize the loading and preprocessing of the dataset.
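As a rough sketch of what that looks like (the num_parallel_calls and prefetch values here are illustrative, not tuned recommendations):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(1000))
dataset = (dataset
           .map(lambda x: x * 2, num_parallel_calls=4)  # preprocess on several CPU threads
           .batch(32)
           .prefetch(1))  # keep the next batch ready while the current one trains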

In this blog post, we will cover Datasets and Iterators. We will learn how to create Datasets from source data, apply transformations to them, and then consume the data using Iterators.

How to create Datasets?

TensorFlow provides various methods to create Datasets from numpy arrays, text files, CSV files, tensors, etc. Let's look at a few of them below.

  • from_tensor_slices: Accepts one or more numpy arrays or tensors. A Dataset created with this method will emit one element at a time.
# imports used throughout this post
import numpy as np
import tensorflow as tf

# source data - numpy array
data = np.arange(10)
# create a dataset from the numpy array
dataset = tf.data.Dataset.from_tensor_slices(data)

The object dataset is a TensorFlow Dataset object.

  • from_tensors: Also accepts one or more numpy arrays or tensors. A Dataset created with this method will emit all of the data as a single element.
data = tf.range(10)
dataset = tf.data.Dataset.from_tensors(data)

  • from_generator: Creates a Dataset whose elements are generated by a function.

def generator():
  for i in range(10):
    yield 2*i

dataset = tf.data.Dataset.from_generator(generator, (tf.int32))

Operations on Datasets

  • Batches: Combines consecutive elements of the Dataset into a single batch. Useful when you want to train on smaller batches of data to avoid out-of-memory errors.
data = np.arange(10, 40)

# create batches of 10
dataset = tf.data.Dataset.from_tensor_slices(data).batch(10)

# create the iterator to consume the data
iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

You can skip the code where we create an iterator and print the elements of the Dataset; we will learn about Iterators in detail later in this post. The output is:

[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
  • Zip: Creates a Dataset by zipping together other Datasets. Useful when you have features and labels and need to provide pairs of features and labels for training the model.
datax = np.arange(10, 20)
datay = np.arange(11, 21)
datasetx = tf.data.Dataset.from_tensor_slices(datax)
datasety = tf.data.Dataset.from_tensor_slices(datay)
dcombined = tf.data.Dataset.zip((datasetx, datasety)).batch(2)
iterator = dcombined.make_one_shot_iterator()
next_ele = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

The output is

(array([10, 11]), array([11, 12]))
(array([12, 13]), array([13, 14]))
(array([14, 15]), array([15, 16]))
(array([16, 17]), array([17, 18]))
(array([18, 19]), array([19, 20]))
  • Repeat: Used to repeat the Dataset.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))
dataset = dataset.repeat(count=2)
iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

The output is

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
  • Map: Used to transform the elements of the Dataset. Useful in cases where you want to transform your raw data before feeding into the model.
def map_fnc(x):
    return x * 2

data = np.arange(10)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.map(map_fnc)
iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

The output is

0 2 4 6 8 10 12 14 16 18

Creating Iterators

We have learned various ways to create Datasets and apply transformations to them, but how do we consume the data? TensorFlow provides Iterators for exactly that.

The iterator is not aware of the number of elements in the Dataset. Its get_next function creates an operation in the TensorFlow graph which, when run in a session, returns the next value from the iterator. Once the Dataset is exhausted, it throws a tf.errors.OutOfRangeError exception.

Let’s look at various Iterators that TensorFlow provides.

  • One-shot iterator: This is the most basic form of iterator. It requires no explicit initialization and iterates over the data only once; once exhausted, it cannot be re-initialized.
data = np.arange(10, 15)
# create the dataset
dataset = tf.data.Dataset.from_tensor_slices(data)
# create the iterator
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    val = sess.run(next_element)
    print(val)
  • Initializable iterator: This iterator requires you to initialize it explicitly by running its iterator.initializer operation. You can define a tf.placeholder and pass new data to it each time you run the initializer.
# define two placeholders to accept the min and max values
min_val = tf.placeholder(tf.int32, shape=[])
max_val = tf.placeholder(tf.int32, shape=[])
data = tf.range(min_val, max_val)
dataset = tf.data.Dataset.from_tensor_slices(data)
iterator = dataset.make_initializable_iterator()
next_ele = iterator.get_next()

with tf.Session() as sess:

    # initialize the iterator with a range of values from 10 to 15
    sess.run(iterator.initializer, feed_dict={min_val: 10, max_val: 15})
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

    # initialize the iterator with a range of values from 1 to 10
    sess.run(iterator.initializer, feed_dict={min_val: 1, max_val: 10})
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

  • Reinitializable iterator: This iterator can be initialized from different Dataset objects that have the same structure. Each dataset can pass through its own transformation pipeline.
def map_fnc(ele):
    return ele * 2

min_val = tf.placeholder(tf.int32, shape=[])
max_val = tf.placeholder(tf.int32, shape=[])
data = tf.range(min_val, max_val)

# define separate datasets for training and validation
train_dataset = tf.data.Dataset.from_tensor_slices(data)
val_dataset = tf.data.Dataset.from_tensor_slices(data).map(map_fnc)

# create an iterator
iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_initializer = iterator.make_initializer(train_dataset)
val_initializer = iterator.make_initializer(val_dataset)
next_ele = iterator.get_next()

with tf.Session() as sess:

    # initialize the iterator with a range of values from 10 to 15
    sess.run(train_initializer, feed_dict={min_val: 10, max_val: 15})
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

    # initialize the iterator with a range of values from 1 to 10
    sess.run(val_initializer, feed_dict={min_val: 1, max_val: 10})
    try:
        while True:
            val = sess.run(next_ele)
            print(val)
    except tf.errors.OutOfRangeError:
        pass

  • Feedable iterator: Can be used to switch between iterators over different Datasets. Useful when you have several Datasets and want more control over which iterator to use at any given point.
def map_fnc(ele):
    return ele * 2

min_val = tf.placeholder(tf.int32, shape=[])
max_val = tf.placeholder(tf.int32, shape=[])
data = tf.range(min_val, max_val)
train_dataset = tf.data.Dataset.from_tensor_slices(data)
val_dataset = tf.data.Dataset.from_tensor_slices(data).map(map_fnc)

train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_initializer = train_val_iterator.make_initializer(train_dataset)
val_initializer = train_val_iterator.make_initializer(val_dataset)

test_dataset = tf.data.Dataset.from_tensor_slices(tf.range(10, 15))
test_iterator = test_dataset.make_one_shot_iterator()

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
next_ele = iterator.get_next()

with tf.Session() as sess:

    train_val_handle = sess.run(train_val_iterator.string_handle())
    test_handle = sess.run(test_iterator.string_handle())

    # training
    sess.run(train_initializer, feed_dict={min_val: 10, max_val: 15})
    try:
        while True:
            val = sess.run(next_ele, feed_dict={handle: train_val_handle})
            print(val)
    except tf.errors.OutOfRangeError:
        pass

    # validation
    sess.run(val_initializer, feed_dict={min_val: 1, max_val: 10})
    try:
        while True:
            val = sess.run(next_ele, feed_dict={handle: train_val_handle})
            print(val)
    except tf.errors.OutOfRangeError:
        pass

    # testing
    try:
        while True:
            val = sess.run(next_ele, feed_dict={handle: test_handle})
            print(val)
    except tf.errors.OutOfRangeError:
        pass

We have learned about the various iterators. Let's apply that knowledge to a practical dataset: we will train the LeNet-5 model on the famous MNIST dataset. This tutorial will not dive into the details of implementing LeNet-5, as that is beyond the scope of this article.

LeNet-5 Model

Let's import the MNIST data from the tensorflow library. The MNIST database contains 60,000 training images and 10,000 testing images. Each image is of size 28×28×1. We need to resize it to 32×32×1 for the LeNet-5 model.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", reshape=False, one_hot = True)
X_train, y_train = mnist.train.images, mnist.train.labels
X_val, y_val = mnist.validation.images, mnist.validation.labels
X_test, y_test = mnist.test.images, mnist.test.labels
X_train = np.pad(X_train, ((0,0), (2,2), (2,2), (0,0)), 'constant')
X_val = np.pad(X_val, ((0,0), (2,2), (2,2), (0,0)), 'constant')
X_test = np.pad(X_test, ((0,0), (2,2), (2,2), (0,0)), 'constant')

Let’s define the forward propagation of the model.

def forward_pass(X):
    W1 = tf.get_variable("W1", [5, 5, 1, 6], initializer=tf.contrib.layers.xavier_initializer(seed=0))
    # for conv layer 2
    W2 = tf.get_variable("W2", [5, 5, 6, 16], initializer=tf.contrib.layers.xavier_initializer(seed=0))

    Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='VALID')
    A1 = tf.nn.relu(Z1)
    P1 = tf.nn.max_pool(A1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
    Z2 = tf.nn.conv2d(P1, W2, strides=[1, 1, 1, 1], padding='VALID')
    A2 = tf.nn.relu(Z2)
    P2 = tf.nn.max_pool(A2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
    P2 = tf.contrib.layers.flatten(P2)

    Z3 = tf.contrib.layers.fully_connected(P2, 120)
    Z4 = tf.contrib.layers.fully_connected(Z3, 84)
    Z5 = tf.contrib.layers.fully_connected(Z4, 10, activation_fn=None)
    return Z5

Let’s define the model operations

def model(X, Y):
    logits = forward_pass(X)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=Y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.0009)
    learner = optimizer.minimize(cost)
    correct_predictions = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))

    return (learner, accuracy)

We have now created the model. Before deciding on the iterator to use for our model, let's look at the typical requirements of a machine learning model.

  1. Training the data over batches: Datasets can be very large. To prevent out-of-memory errors, we need to train on small batches of data.
  2. Train the model over n passes of the dataset: Typically you want to run your training over multiple passes (epochs) of the dataset.
  3. Validate the model at each epoch: You need to validate the model at each epoch to check its performance.
  4. Finally, test your model on unseen data: After the model is trained, you would like to test it on unseen data.

Let’s see the pros and cons of each iterator.

  • One-shot iterator: The Dataset can't be re-initialized once exhausted. To train for more epochs, you would need to repeat the Dataset before feeding it to the iterator, which requires a lot of memory if the dataset is large. It also doesn't provide any option to validate the model.
epochs = 10
batch_size = 64
iterations = len(y_train) * epochs

tf.reset_default_graph()
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))

# need to repeat the dataset for the number of epochs,
# as all of the data has to be fed to the iterator at once
dataset = dataset.repeat(epochs).batch(batch_size)
iterator = dataset.make_one_shot_iterator()

X_batch, Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    total_accuracy = 0
    try:
        while True:
            temp_accuracy, _ = sess.run([accuracy, learner])
            total_accuracy += temp_accuracy

    except tf.errors.OutOfRangeError:
        pass

    print('Avg training accuracy is {}'.format((total_accuracy * batch_size) / iterations))

  • Initializable iterator: You can dynamically switch between the training and validation Datasets. However, in this case both Datasets need to go through the same transformation pipeline.
epochs = 10
batch_size = 64

tf.reset_default_graph()
X_data = tf.placeholder(tf.float32, [None, 32, 32, 1])
Y_data = tf.placeholder(tf.float32, [None, 10])

dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data))
dataset = dataset.batch(batch_size)
iterator = dataset.make_initializable_iterator()
X_batch, Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(epochs):

        # train the model
        sess.run(iterator.initializer, feed_dict={X_data: X_train, Y_data: y_train})
        total_train_accuracy = 0
        no_train_examples = len(y_train)
        try:
            while True:
                temp_train_accuracy, _ = sess.run([accuracy, learner])
                total_train_accuracy += temp_train_accuracy * batch_size
        except tf.errors.OutOfRangeError:
            pass

        # validate the model
        sess.run(iterator.initializer, feed_dict={X_data: X_val, Y_data: y_val})
        total_val_accuracy = 0
        no_val_examples = len(y_val)
        try:
            while True:
                temp_val_accuracy = sess.run(accuracy)
                total_val_accuracy += temp_val_accuracy * batch_size
        except tf.errors.OutOfRangeError:
            pass

        print('Epoch {}'.format(epoch + 1))
        print("---------------------------")
        print('Training accuracy is {}'.format(total_train_accuracy / no_train_examples))
        print('Validation accuracy is {}'.format(total_val_accuracy / no_val_examples))

  • Re-initializable iterator: This iterator overcomes the limitation of the initializable iterator by using two separate Dataset objects, each of which can go through its own preprocessing pipeline. The iterator is created using the tf.data.Iterator.from_structure method.
def map_fnc(X, Y):
    return X, Y

epochs = 10
batch_size = 64

tf.reset_default_graph()
X_data = tf.placeholder(tf.float32, [None, 32, 32, 1])
Y_data = tf.placeholder(tf.float32, [None, 10])

train_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size).map(map_fnc)
val_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size)

iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
X_batch, Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)
train_initializer = iterator.make_initializer(train_dataset)
val_initializer = iterator.make_initializer(val_dataset)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(epochs):

        # train the model
        sess.run(train_initializer, feed_dict={X_data: X_train, Y_data: y_train})
        total_train_accuracy = 0
        no_train_examples = len(y_train)
        try:
            while True:
                temp_train_accuracy, _ = sess.run([accuracy, learner])
                total_train_accuracy += temp_train_accuracy * batch_size
        except tf.errors.OutOfRangeError:
            pass

        # validate the model
        sess.run(val_initializer, feed_dict={X_data: X_val, Y_data: y_val})
        total_val_accuracy = 0
        no_val_examples = len(y_val)
        try:
            while True:
                temp_val_accuracy = sess.run(accuracy)
                total_val_accuracy += temp_val_accuracy * batch_size
        except tf.errors.OutOfRangeError:
            pass

        print('Epoch {}'.format(epoch + 1))
        print("---------------------------")
        print('Training accuracy is {}'.format(total_train_accuracy / no_train_examples))
        print('Validation accuracy is {}'.format(total_val_accuracy / no_val_examples))

  • Feedable iterator: This iterator provides the option of switching between various iterators. You can create a re-initializable iterator for training and validation, and for inference/testing, where you need only one pass over the dataset, you can use the one-shot iterator.
epochs = 10
batch_size = 64

tf.reset_default_graph()
X_data = tf.placeholder(tf.float32, [None, 32, 32, 1])
Y_data = tf.placeholder(tf.float32, [None, 10])

train_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test.astype(np.float32))).batch(batch_size)

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
X_batch, Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)

train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_iterator = train_val_iterator.make_initializer(train_dataset)
val_iterator = train_val_iterator.make_initializer(val_dataset)
test_iterator = test_dataset.make_one_shot_iterator()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_val_string_handle = sess.run(train_val_iterator.string_handle())
    test_string_handle = sess.run(test_iterator.string_handle())

    for epoch in range(epochs):

        # train the model
        sess.run(train_iterator, feed_dict={X_data: X_train, Y_data: y_train})
        total_train_accuracy = 0
        no_train_examples = len(y_train)
        try:
            while True:
                temp_train_accuracy, _ = sess.run([accuracy, learner], feed_dict={handle: train_val_string_handle})
                total_train_accuracy += temp_train_accuracy * batch_size
        except tf.errors.OutOfRangeError:
            pass

        # validate the model (accuracy only, no training step)
        sess.run(val_iterator, feed_dict={X_data: X_val, Y_data: y_val})
        total_val_accuracy = 0
        no_val_examples = len(y_val)
        try:
            while True:
                temp_val_accuracy = sess.run(accuracy, feed_dict={handle: train_val_string_handle})
                total_val_accuracy += temp_val_accuracy * batch_size
        except tf.errors.OutOfRangeError:
            pass

        print('Epoch {}'.format(epoch + 1))
        print("---------------------------")
        print('Training accuracy is {}'.format(total_train_accuracy / no_train_examples))
        print('Validation accuracy is {}'.format(total_val_accuracy / no_val_examples))

    print("Testing the model --------")

    total_test_accuracy = 0
    no_test_examples = len(y_test)
    try:
        while True:
            temp_test_accuracy = sess.run(accuracy, feed_dict={handle: test_string_handle})
            total_test_accuracy += temp_test_accuracy * batch_size
    except tf.errors.OutOfRangeError:
        pass

    print('Testing accuracy is {}'.format(total_test_accuracy / no_test_examples))

Thanks for reading the blog. The code examples used in this blog can be found in this jupyter notebook.

Do leave your comments below if you have any questions or if you have any suggestions for improving this blog.

Originally published by Animesh Agarwal at https://towardsdatascience.com/

Learn More

☞ Complete Guide to TensorFlow for Deep Learning with Python

☞ Modern Deep Learning in Python

☞ TensorFlow 101: Introduction to Deep Learning

☞ Tensorflow Bootcamp For Data Science in Python

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Data Science A-Z™: Real-Life Data Science Exercises Included

Machine Learning In Node.js With TensorFlow.js

TensorFlow.js is a new version of the popular open-source library which brings deep learning to JavaScript. Developers can now define, train, and run machine learning models using the high-level library API.

Pre-trained models mean developers can now easily perform complex tasks like visual recognition, music generation, or human pose detection with just a few lines of JavaScript.

Having started as a front-end library for web browsers, recent updates added experimental support for Node.js. This allows TensorFlow.js to be used in backend JavaScript applications without having to use Python.

Reading about the library, I wanted to test it out with a simple task... 🧐

Use TensorFlow.js to perform visual recognition on images using JavaScript from Node.js
Unfortunately, most of the documentation and example code provided uses the library in a browser. Project utilities provided to simplify loading and using pre-trained models have not yet been extended with Node.js support. Getting this working ended up with me spending a lot of time reading the TypeScript source files for the library. 👎

However, after a few days' hacking, I managed to get this completed! Hurrah! 🤩

Before we dive into the code, let's start with an overview of the different TensorFlow libraries.

TensorFlow

TensorFlow is an open-source software library for machine learning applications. TensorFlow can be used to implement neural networks and other deep learning algorithms.

Released by Google in November 2015, TensorFlow was originally a Python library. It used either CPU or GPU-based computation for training and evaluating machine learning models. The library was initially designed to run on high-performance servers with expensive GPUs.

Recent updates have extended the software to run in resource-constrained environments like mobile devices and web browsers.

TensorFlow Lite

Tensorflow Lite, a lightweight version of the library for mobile and embedded devices, was released in May 2017. This was accompanied by a new series of pre-trained deep learning models for vision recognition tasks, called MobileNet. MobileNet models were designed to work efficiently in resource-constrained environments like mobile devices.

TensorFlow.js

Following Tensorflow Lite, TensorFlow.js was announced in March 2018. This version of the library was designed to run in the browser, building on an earlier project called deeplearn.js. WebGL provides GPU access to the library. Developers use a JavaScript API to train, load and run models.

TensorFlow.js was recently extended to run on Node.js, using an extension library called tfjs-node.

The Node.js extension is an alpha release and still under active development.

Importing Existing Models Into TensorFlow.js

Existing TensorFlow and Keras models can be executed using the TensorFlow.js library. Models need converting to a new format using this tool before execution. Pre-trained and converted models for image classification, pose detection and k-nearest neighbours are available on Github.

Using TensorFlow.js in Node.js

Installing TensorFlow Libraries

TensorFlow.js can be installed from the NPM registry.

npm install @tensorflow/tfjs @tensorflow/tfjs-node
// or...
npm install @tensorflow/tfjs @tensorflow/tfjs-node-gpu

Both Node.js extensions use native dependencies which will be compiled on demand.

Loading TensorFlow Libraries

TensorFlow's JavaScript API is exposed from the core library. Extension modules to enable Node.js support do not expose additional APIs.

const tf = require('@tensorflow/tfjs')
// Load the binding (CPU computation)
require('@tensorflow/tfjs-node')
// Or load the binding (GPU computation)
require('@tensorflow/tfjs-node-gpu')

Loading TensorFlow Models

TensorFlow.js provides an NPM library (tfjs-models) to ease loading pre-trained & converted models for image classification, pose detection, and k-nearest neighbours.

The MobileNet model used for image classification is a deep neural network trained to identify 1000 different classes.

In the project's README, the following example code is used to load the model.

import * as mobilenet from '@tensorflow-models/mobilenet';

// Load the model.
const model = await mobilenet.load();

One of the first challenges I encountered was that this does not work on Node.js.

Error: browserHTTPRequest is not supported outside the web browser.

Looking at the source code, the mobilenet library is a wrapper around the underlying tf.Model class. When the load() method is called, it automatically downloads the correct model files from an external HTTP address and instantiates the TensorFlow model.

The Node.js extension does not yet support HTTP requests to dynamically retrieve models. Instead, models must be manually loaded from the filesystem.

After reading the source code for the library, I managed to create a work-around...

Loading Models From a Filesystem

Rather than calling the module's load method, if the MobileNet class is created manually, the auto-generated path variable which contains the HTTP address of the model can be overwritten with a local filesystem path. Having done this, calling the load method on the class instance will trigger the filesystem loader class, rather than trying to use the browser-based HTTP loader.

const path = "mobilenet/model.json"
const mn = new mobilenet.MobileNet(1, 1);
mn.path = `file://${path}`
await mn.load()

Awesome, it works!

But where do the model files come from?

MobileNet Models

Models for TensorFlow.js consist of two file types, a model configuration file stored in JSON and model weights in a binary format. Model weights are often sharded into multiple files for better caching by browsers.

Looking at the automatic loading code for MobileNet models, the model configuration and weight shards are retrieved from a public storage bucket at this address.

https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v${version}_${alpha}_${size}/

The template parameters in the URL refer to the model versions listed here. Classification accuracy results for each version are also shown on that page.

According to the source code, only MobileNet v1 models can be loaded using the tensorflow-models/mobilenet library.

The HTTP retrieval code loads the model.json file from this location and then recursively fetches all referenced model weights shards. These files are in the format groupX-shard1of1.

Downloading Models Manually

Saving all model files to a filesystem can be achieved by retrieving the model configuration file, parsing out the referenced weight files and downloading each weight file manually.

I want to use the MobileNet V1 model with a 1.0 alpha value and an image size of 224 pixels. This gives me the following URL for the model configuration file.

https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_1.0_224/model.json

Once this file has been downloaded locally, I can use the jq tool to parse all the weight file names.

$ cat model.json | jq -r ".weightsManifest[].paths[0]"
group1-shard1of1
group2-shard1of1
group3-shard1of1
...

Using the sed tool, I can prefix these names with the HTTP URL to generate URLs for each weight file.

$ cat model.json | jq -r ".weightsManifest[].paths[0]" | sed 's/^/https:\/\/storage.googleapis.com\/tfjs-models\/tfjs\/mobilenet_v1_1.0_224\//'
https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_1.0_224/group1-shard1of1
https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_1.0_224/group2-shard1of1
https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_1.0_224/group3-shard1of1
...

Using the parallel and curl commands, I can then download all of these files to my local directory.

cat model.json | jq -r ".weightsManifest[].paths[0]" | sed 's/^/https:\/\/storage.googleapis.com\/tfjs-models\/tfjs\/mobilenet_v1_1.0_224\//' |  parallel curl -O

Classifying Images

This example code is provided by TensorFlow.js to demonstrate returning classifications for an image.

const img = document.getElementById('img');

// Classify the image.
const predictions = await model.classify(img);

This does not work on Node.js due to the lack of a DOM.

The classify method accepts numerous DOM elements (canvas, video, image) and will automatically retrieve and convert image bytes from these elements into a tf.Tensor3D class which is used as the input to the model. Alternatively, the tf.Tensor3D input can be passed directly.

Rather than trying to use an external package to simulate a DOM element in Node.js, I found it easier to construct the tf.Tensor3D manually.

Generating Tensor3D from an Image

Reading the source code for the method used to turn DOM elements into Tensor3D classes, the following input parameters are used to generate the Tensor3D class.

const values = new Int32Array(image.height * image.width * numChannels);
// fill pixels with pixel channel bytes from image
const outShape = [image.height, image.width, numChannels];
const input = tf.tensor3d(values, outShape, 'int32');

values is an Int32Array which contains a sequential list of channel values for each pixel. numChannels is the number of channel values per pixel.

Creating Input Values For JPEGs

The jpeg-js library is a pure javascript JPEG encoder and decoder for Node.js. Using this library the RGB values for each pixel can be extracted.

const pixels = jpeg.decode(buffer, true);

This will return a Uint8Array with four channel values (RGBA) for each pixel (width * height). The MobileNet model only uses the three colour channels (RGB) for classification, ignoring the alpha channel. This code converts the four channel array into the correct three channel version.

const numChannels = 3;
const numPixels = image.width * image.height;
const values = new Int32Array(numPixels * numChannels);

for (let i = 0; i < numPixels; i++) {
  for (let channel = 0; channel < numChannels; ++channel) {
    values[i * numChannels + channel] = pixels[i * 4 + channel];
  }
}

MobileNet Models Input Requirements

The MobileNet model being used classifies images of width and height 224 pixels. Input tensors must contain float values, between -1 and 1, for each of the three channels' pixel values.

Images of different dimensions need to be resized before classification. Additionally, pixel values from the JPEG decoder are in the range 0 to 255, rather than -1 to 1, so they also need converting prior to classification.

TensorFlow.js has library methods to make this process easier but, fortunately for us, the tfjs-models/mobilenet library automatically handles this issue! 👍

Developers can pass in Tensor3D inputs of type int32 and different dimensions to the classify method and it converts the input to the correct format prior to classification. Which means there's nothing to do... Super 🕺🕺🕺.

Obtaining Predictions

MobileNet models in Tensorflow are trained to recognise entities from the top 1000 classes in the ImageNet dataset. The models output the probabilities that each of those entities is in the image being classified.

The full list of trained classes for the model being used can be found in this file.

The tfjs-models/mobilenet library exposes a classify method on the MobileNet class to return the top X classes with highest probabilities from an image input.

const predictions = await mn_model.classify(input, 10);

predictions is an array of X classes and probabilities in the following format.

{
  className: 'panda',
  probability: 0.9993536472320557
}

Example

Having worked out how to use the TensorFlow.js library and MobileNet models on Node.js, this script will classify an image given as a command-line argument.

source code

testing it out

npm install

wget http://bit.ly/2JYSal9 -O panda.jpg

node script.js mobilenet/model.json panda.jpg

If everything worked, the following output should be printed to the console.

classification results: [ {
    className: 'giant panda, panda, panda bear, coon bear',
    probability: 0.9993536472320557 
} ]

The image is correctly classified as containing a Panda with 99.93% probability! 🐼🐼🐼

Conclusion

TensorFlow.js brings the power of deep learning to JavaScript developers. Using pre-trained models with the TensorFlow.js library makes it simple to extend JavaScript applications with complex machine learning tasks with minimal effort and code.

Having been released as a browser-based library, TensorFlow.js has now been extended to work on Node.js, although not all of the tools and utilities support the new runtime. With a few days' hacking, I was able to use the library with the MobileNet models for visual recognition on images from a local file.

Getting this working in the Node.js runtime means I now move on to my next idea... making this run inside a serverless function! Come back soon to read about my next adventure with TensorFlow.js.

Originally published by James Thomas at dev.to

=================================

Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter

Learn More

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Python for Data Science and Machine Learning Bootcamp

☞ Machine Learning, Data Science and Deep Learning with Python

☞ [2019] Machine Learning Classification Bootcamp in Python

☞ Introduction to Machine Learning & Deep Learning in Python

☞ Machine Learning Career Guide – Technical Interview

☞ Machine Learning Guide: Learn Machine Learning Algorithms

☞ Machine Learning Basics: Building Regression Model in Python

☞ Machine Learning using Python - A Beginner’s Guide

How to Build a Python Data Science Container using Docker

In this article we will start with building a Python data science container - let's get started...

Artificial Intelligence(AI) and Machine Learning(ML) are literally on fire these days. Powering a wide spectrum of use-cases ranging from self-driving cars to drug discovery and to God knows what. AI and ML have a bright and thriving future ahead of them.

On the other hand, Docker revolutionized the computing world through the introduction of ephemeral, lightweight containers. Containers package all the software required to run inside an image (a bunch of read-only layers) with a COW (copy-on-write) layer to persist the data.

Python Data Science Packages

Our Python data science container makes use of the following super cool python packages:

  1. NumPy: NumPy or Numeric Python supports large, multi-dimensional arrays and matrices. It provides fast precompiled functions for mathematical and numerical routines. In addition, NumPy optimizes Python programming with powerful data structures for efficient computation of multi-dimensional arrays and matrices.

  2. SciPy: SciPy provides useful functions for regression, minimization, Fourier-transformation, and many more. Based on NumPy, SciPy extends its capabilities. SciPy’s main data structure is again a multidimensional array, implemented by Numpy. The package contains tools that help with solving linear algebra, probability theory, integral calculus, and many more tasks.

  3. Pandas: Pandas offer versatile and powerful tools for manipulating data structures and performing extensive data analysis. It works well with incomplete, unstructured, and unordered real-world data — and comes with tools for shaping, aggregating, analyzing, and visualizing datasets.

  4. SciKit-Learn: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. It is one of the best-known machine-learning libraries for python. The Scikit-learn package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. The primary emphasis is upon ease of use, performance, documentation, and API consistency. With minimal dependencies and easy distribution under the simplified BSD license, SciKit-Learn is widely used in academic and commercial settings. Scikit-learn exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems.

  5. Matplotlib: Matplotlib is a Python 2D plotting library, capable of producing publication quality figures in a wide variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

  6. NLTK: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Building the Data Science Container

Python is fast becoming the go-to language for data scientists and for this reason we are going to use Python as the language of choice for building our data science container.

The Base Alpine Linux Image

Alpine Linux is a tiny Linux distribution designed for power users who appreciate security, simplicity and resource efficiency.

As claimed by Alpine:

Small. Simple. Secure. Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox.

The Alpine image is surprisingly tiny, no more than 8 MB, with only minimal packages installed to reduce the attack surface of the container. This makes Alpine the image of choice for our data science container.

Downloading and Running an Alpine Linux container is as simple as:

$ docker container run --rm alpine:latest cat /etc/os-release

In our Dockerfile, we can simply use the Alpine base image as:

FROM alpine:latest

Talk is cheap, let's build the Dockerfile

Now let’s work our way through the Dockerfile.

The FROM directive sets alpine:latest as the base image. The WORKDIR directive sets /var/www as the working directory for our container. The ENV PACKAGES variable lists the system packages required for the container, like git, blas, and libgfortran, while the Python packages for our data science container are defined in their own ENV variable.

We have combined all the commands under a single Dockerfile RUN directive to reduce the number of layers which in turn helps in reducing the resultant image size.

Building and tagging the image

Now that we have our Dockerfile defined, navigate to the folder with the Dockerfile using the terminal and build the image using the following command:

$ docker build -t faizanbashir/python-datascience:2.7 -f Dockerfile .

The -t flag is used to name (and optionally tag) the image in the 'name:tag' format. The -f flag is used to specify the name of the Dockerfile (the default is 'PATH/Dockerfile').

Running the container

We have successfully built and tagged the docker image, now we can run the container using the following command:

$ docker container run --rm -it faizanbashir/python-datascience:2.7 python

Voila, we are greeted by the sight of a python shell ready to perform all kinds of cool data science stuff.

Python 2.7.15 (default, Aug 16 2018, 14:17:09)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
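As a quick, illustrative check (not part of the original article) that the core stack described earlier imports and works inside that shell:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# fit a toy linear model: y = 3x + 2
df = pd.DataFrame({"x": np.arange(10), "y": 3 * np.arange(10) + 2})
model = LinearRegression().fit(df[["x"]], df["y"])
print("slope=%.1f intercept=%.1f" % (model.coef_[0], model.intercept_))  # slope=3.0 intercept=2.0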

Our container comes with Python 2.7, but don’t be sad if you wanna work with Python 3.6. Lo, behold the Dockerfile for Python 3.6:

Build and tag the image like so:

$ docker build -t faizanbashir/python-datascience:3.6 -f Dockerfile .

Run the container like so:

$ docker container run --rm -it faizanbashir/python-datascience:3.6 python

With this, you have a ready to use container for doing all kinds of cool data science stuff.

Serving Puddin’

Figures, you have the time and resources to set up all this stuff. In case you don’t, you can pull the existing images that I have already built and pushed to Docker’s registry Docker Hub using:

# For Python 2.7 pull
$ docker pull faizanbashir/python-datascience:2.7

# For Python 3.6 pull
$ docker pull faizanbashir/python-datascience:3.6

After pulling the images, you can use them directly, extend them in your own Dockerfile, or use them as images in your docker-compose or stack file.

Aftermath

The world of AI, ML is getting pretty exciting these days and will continue to become even more exciting. Big players are investing heavily in these domains. About time you start to harness the power of data, who knows it might lead to something wonderful.

You can check out the code here.

Linear Regression using TensorFlow 2.0

Are you looking for a deep learning library that’s one of the most popular and widely-used in this world? Do you want to use a GPU and highly-parallel computation for your machine learning model training? Then look no further than TensorFlow.


Created by the team at Google, TensorFlow is an open source library for numerical computation and machine learning. Undoubtedly, TensorFlow is one of the most popular deep learning libraries, and in recent weeks, Google released the full version of TensorFlow 2.0.

Python developers around the world should be excited about TensorFlow 2.0, as it's more Pythonic than earlier versions. To help us get started working with TensorFlow 2.0, let's work through an example with linear regression.

Getting Started

Before we start, let me remind you that if you have TensorFlow 2.0 installed on your machine, code written for linear regression with TensorFlow 1.x may not work. For example, tf.placeholder, which works with TensorFlow 1.x, won't work with 2.0. You'll get the error AttributeError: module 'tensorflow' has no attribute 'placeholder'.

If you want to run the existing code (written in version 1.x) with version 2.0, you have two options:

  1. Run your TensorFlow 2.0 installation in version 1.x mode by using the two lines of code below:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
  2. Modify your code to work with version 2.0 and take advantage of its new features (see the sketch below).
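As a rough sketch of option 2 (my illustration, not from the original post): in 2.0 there is no placeholder or Session; tensors are evaluated eagerly, so you pass values in directly.

import tensorflow as tf  # TensorFlow 2.x

x = tf.constant([1.0, 2.0, 3.0])
y = x * 2 + 1          # runs immediately, no Session required
print(y.numpy())       # [3. 5. 7.]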
Linear Regression with TensorFlow 2.0

In this article, we’re going to use TensorFlow 2.0-compatible code to train a linear regression model.

Linear regression is an algorithm that finds a linear relationship between a dependent variable and one or more independent variables. The dependent variable is also called a label and independent variables are called features.

We’ll start by importing the necessary libraries. Let’s import three, namely numpy, tensorflow, and matplotlib, as shown below:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

Before coding further, let’s make sure we’ve got the current version of TensorFlow ready to go.
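A one-line check along these lines will do (the exact version string on your machine will differ):

import tensorflow as tf

print(tf.__version__)  # e.g. '2.0.0'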

Our next step is to create synthetic data for the model, as shown below. Assuming the equation of a line, y = mx + c, we've taken the slope of the line m as 2 and the constant value c as 0.9. We've also introduced some noise using np.random, because we don't want the model to fit a perfectly straight line; we want it to work on unseen data.

# actual weight = 2 and actual bias = 0.9
x = np.linspace(0, 3, 120)
y = 2 * x + 0.9 + np.random.randn(*x.shape) * 0.3

Let's plot the data to see whether it has a linear pattern. We're using Matplotlib for plotting, as in the sketch below. The data points clearly show the pattern we're looking for; notice that the data isn't a perfectly straight line.
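A minimal plotting sketch (the marker size and labels are my own choices):

# assumes x, y, and plt from the snippets above
plt.scatter(x, y, s=10, label="synthetic data")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()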

After visualizing our data, let's create a class called LinearModel that has two methods: __init__ and __call__. __init__ sets the starting values of the weight and bias, and __call__ returns values as per the straight-line equation y = mx + c.

class LinearModel:
    def __call__(self, x):
        return self.Weight * x + self.Bias
    
    def __init__(self):
        self.Weight = tf.Variable(11.0)
        self.Bias = tf.Variable(12.0)

Now let's define the loss and train functions for the model. The train function takes four parameters: linear_model (the model instance), x (independent variable), y (dependent variable), and lr (learning rate).

The loss function takes two parameters: y (actual value of dependent variable) and pred (predicted value of dependent variable).

Note that we're using the tf.square function to get the square of the difference between y and the predicted value, and then the tf.reduce_mean method to calculate the mean of those squared differences.

Note that the tf.GradientTape method is used for automatic differentiation, computing the gradient of a computation with respect to its input variables.
Hence, all operations executed inside the context of a tf.GradientTape are recorded.

def loss(y, pred):
    return tf.reduce_mean(tf.square(y - pred))

def train(linear_model, x, y, lr=0.12):
    with tf.GradientTape() as t:
        current_loss = loss(y, linear_model(x))

    lr_weight, lr_bias = t.gradient(current_loss, [linear_model.Weight, linear_model.Bias])
    linear_model.Weight.assign_sub(lr * lr_weight)
    linear_model.Bias.assign_sub(lr * lr_bias)

Here we’re defining the number of epochs as 80 and using a for loop to train the model. Note that we’re printing the epoch count and loss for each epoch using that same for loop. We’ve used 0.12 for learning rate, and we’re calculating the loss in each epoch by calling our loss function inside the for loop as shown below.

linear_model = LinearModel()
Weights, Biases = [], []
epochs = 80
for epoch_count in range(epochs):
    Weights.append(linear_model.Weight.numpy()) 
    Biases.append(linear_model.Bias.numpy())
    real_loss = loss(y, linear_model(x))
    train(linear_model, x, y, lr=0.12)
    print(f"Epoch count {epoch_count}: Loss value: {real_loss.numpy()}")

The output during model training shows how the loss value decreases as the epoch count increases. Note that, initially, the loss was very high because we initialized the model with arbitrary starting values for the weight and bias. Once the model starts learning, the loss starts decreasing.

And finally, we'd like to know the weight and bias values, as well as the RMSE for the model.
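One way to report them, as a sketch that reuses the loss function and model defined above (exact numbers will vary with the random noise):

# assumes linear_model, loss, x, y from the snippets above
print("Weight:", linear_model.Weight.numpy())   # should be close to 2.0
print("Bias:", linear_model.Bias.numpy())       # should be close to 0.9
rmse = tf.sqrt(loss(y, linear_model(x)))
print("RMSE:", rmse.numpy())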

End notes

I hope you enjoyed creating and evaluating a linear regression model with TensorFlow 2.0.

You can find the complete code here.

Happy Machine Learning :)

The easy way to work with CSV, JSON, and XML in Python

Originally published by George Seif at towardsdatascience.com

Python's superior flexibility and ease of use are what make it one of the most popular programming languages, especially for data scientists. A big part of that is how simple it is to work with large datasets.

Every technology company today is building up a data strategy. They’ve all realised that having the right data: insightful, clean, and as much of it as possible, gives them a key competitive advantage. Data, if used effectively, can offer deep, beneath the surface insights that can’t be discovered anywhere else.

Over the years, the list of possible formats that you can store your data in has grown significantly. But, there are 3 that dominate in their everyday usage: CSV, JSON, and XML. In this article, I’m going to share with you the easiest ways to work with these 3 popular data formats in Python!

CSV Data

A CSV file is the most common way to store your data. You'll find that most of the data coming from Kaggle competitions is stored in this way. We can both read and write a CSV using the built-in Python csv library. Usually, we'll read the data into a list of lists.


Check out the code below. When we run csv.reader(), all of our CSV data becomes accessible. Calling next(csvreader) reads a single line from the CSV; every time you call it, it moves to the next line. We can also loop through every row of the CSV using a for loop, as in for row in csvreader. Make sure you have the same number of columns in each row; otherwise, you'll likely run into errors when working with your list of lists.

import csv 
filename = "my_data.csv"
  
fields = [] 
rows = [] 
  
# Reading csv file 
with open(filename, 'r') as csvfile: 
    # Creating a csv reader object 
    csvreader = csv.reader(csvfile) 
      
    # Extracting field names in the first row 
    fields = next(csvreader)  # csvreader.next() in Python 2
  
    # Extracting each data row one by one 
    for row in csvreader: 
        rows.append(row)
  
# Printing out the first 5 rows 
for row in rows[:5]: 
    print(row)

Writing to CSV in Python is just as easy. Set up your field names in a single list, and your data in a list of lists. This time we’ll create a writer() object and use it to write our data to file very similarly to how we did the reading.

import csv

# Field names 
fields = ['Name', 'Goals', 'Assists', 'Shots'] 
  
# Rows of data in the csv file 
rows = [ ['Emily', '12', '18', '112'], 
         ['Katie', '8', '24', '96'], 
         ['John', '16', '9', '101'], 
         ['Mike', '3', '14', '82']]
         
filename = "soccer.csv"
  
# Writing to csv file 
with open(filename, 'w+') as csvfile: 
    # Creating a csv writer object 
    csvwriter = csv.writer(csvfile) 
      
    # Writing the fields 
    csvwriter.writerow(fields) 
      
    # Writing the data rows 
    csvwriter.writerows(rows)

Of course, installing the wonderful Pandas library will make working with your data far easier once you’ve read it into a variable. Reading from CSV is a single line as is writing it back to file!

import pandas as pd

filename = "my_data.csv"


# Read in the data
data = pd.read_csv(filename)


# Print the first 5 rows
print(data.head(5))


# Write the data to file
data.to_csv("new_data.csv", sep=",", index=False)

We can even use Pandas to convert from CSV to a list of dictionaries with a quick one-liner. Once you have the data formatted as a list of dictionaries, we’ll use the dicttoxml library to convert it to XML format. We’ll also save it to file as a JSON!

import pandas as pd
from dicttoxml import dicttoxml
import json

# Building our dataframe
data = {'Name': ['Emily', 'Katie', 'John', 'Mike'],
        'Goals': [12, 8, 16, 3],
        'Assists': [18, 24, 9, 14],
        'Shots': [112, 96, 101, 82]
        }


df = pd.DataFrame(data, columns=data.keys())


# Converting the dataframe to a dictionary
# Then save it to file
data_dict = df.to_dict(orient="records")
with open('output.json', "w+") as f:
    json.dump(data_dict, f, indent=4)


# Converting the dataframe to XML
# Then save it to file
xml_data = dicttoxml(data_dict).decode()
with open("output.xml", "w+") as f:
    f.write(xml_data)

JSON Data

JSON provides a clean and easily readable format because it maintains a dictionary-style structure. Just like CSV, Python has a built-in module for JSON that makes reading and writing super easy! When we read in the JSON, it becomes a dictionary (or a list of dictionaries). We then write that data back to file.

import json
import pandas as pd

# Read the data from file
# We now have a Python dictionary
with open('data.json') as f:
    data_listofdict = json.load(f)
    
# We can do the same thing with pandas
data_df = pd.read_json('data.json', orient='records')


# We can write a dictionary to JSON like so
# Use 'indent' and 'sort_keys' to make the JSON
# file look nice
with open('new_data.json', 'w+') as json_file:
    json.dump(data_listofdict, json_file, indent=4, sort_keys=True)


# And again the same thing with pandas
export = data_df.to_json('new_data.json', orient='records')

And as we saw before, once we have our data you can easily convert to CSV via pandas or use the built-in Python CSV module. When converting to XML, the dicttoxml library is always our friend.

import json
import pandas as pd
import csv

# Read the data from file
# We now have a Python dictionary
with open('data.json') as f:
    data_listofdict = json.load(f)
    
# Writing a list of dicts to CSV
keys = data_listofdict[0].keys()
with open('saved_data.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_listofdict)

XML Data

XML is a bit of a different beast from CSV and JSON. Generally, CSV and JSON are widely used due to their simplicity. They’re both easy and fast to read, write, and interpret as a human. There’s no extra work involved and parsing a JSON or CSV is very lightweight.


XML on the other hand tends to be a bit heavier. You’re sending more data, which means you need more bandwidth, more storage space, and more run time. But XML does come with a few extra features over JSON and CSV: you can use namespaces to build and share standard structures, better representation for inheritance, and an industry standardised way of representing your data with XML schema, DTD, etc.

To read in the XML data, we'll use Python's built-in XML module with the sub-module ElementTree. From there, we can convert the ElementTree object to a dictionary using the xmltodict library. Once we have a dictionary, we can convert it to CSV, JSON, or a Pandas DataFrame like we saw above!

import xml.etree.ElementTree as ET
import xmltodict
import json

tree = ET.parse('output.xml')
xml_data = tree.getroot()


xmlstr = ET.tostring(xml_data, encoding='utf8', method='xml')




data_dict = dict(xmltodict.parse(xmlstr))


print(data_dict)


with open('new_data_2.json', 'w+') as json_file:
    json.dump(data_dict, json_file, indent=4, sort_keys=True)

Like to learn?

Follow me on twitter where I post all about the latest and greatest AI, Technology, and Science! Connect with me on LinkedIn too!


Recommended Reading

Want to learn more about coding in Python? The Python Crash Course book is the best resource out there for learning how to code in Python!

And just a heads up, I support this blog with Amazon affiliate links to great books, because sharing great books helps everyone! As an Amazon Associate I earn from qualifying purchases.

--------------------------------------------------------------------------------------------------------------------------------------

Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter

Learn More

☞ Machine Learning with Python, Jupyter, KSQL and TensorFlow

☞ Python and HDFS for Machine Learning

☞ Applied Deep Learning with PyTorch - Full Course

☞ Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Python Training

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Python for Data Science and Machine Learning Bootcamp

☞ Data Science, Deep Learning, & Machine Learning with Python

☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks

☞ Artificial Intelligence A-Z™: Learn How To Build An AI

Introducing TensorFlow Datasets

Public datasets fuel the machine learning research rocket (h/t Andrew Ng), but it’s still too difficult to simply get those datasets into your machine learning pipeline. Every researcher goes through the pain of writing one-off scripts to download and prepare every dataset they work with, which all have different source formats and complexities. Not anymore.

Today, we’re pleased to introduce TensorFlow Datasets (GitHub), which exposes public research datasets as tf.data.Datasets (https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and as NumPy arrays. It does all the grungy work of fetching the source data and preparing it into a common format on disk, and it uses the tf.data API (https://www.tensorflow.org/guide/datasets) to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models. We’re launching with 29 popular research datasets such as MNIST, Street View House Numbers, the 1 Billion Word Language Model Benchmark, and the Large Movie Reviews Dataset, and will add more in the months to come; we hope that you join in and add a dataset yourself.

tl;dr

# Install: pip install tensorflow-datasets
import tensorflow as tf
import tensorflow_datasets as tfds

mnist_data = tfds.load("mnist")
mnist_train, mnist_test = mnist_data["train"], mnist_data["test"]
assert isinstance(mnist_train, tf.data.Dataset)

Try tfds out in a Colab notebook.
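
To give a flavour of how these pipelines feed a model, here is a minimal, hedged sketch of training a tf.keras classifier on the MNIST Dataset loaded above. The normalisation step, model architecture, and hyperparameters are illustrative choices, not part of the library.

import tensorflow as tf
import tensorflow_datasets as tfds

# Load MNIST; each example is a dict with "image" and "label"
mnist_train = tfds.load("mnist")["train"]

# Scale pixel values to [0, 1] and return (features, label) pairs
def normalize(example):
    image = tf.cast(example["image"], tf.float32) / 255.0
    return image, example["label"]

# Shuffle, batch, and prefetch so the GPU is never starved of data
train_pipeline = (mnist_train
                  .map(normalize)
                  .shuffle(10000)
                  .batch(32)
                  .prefetch(tf.data.experimental.AUTOTUNE))

# A tiny illustrative classifier consuming the pipeline directly
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_pipeline, epochs=1)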

tfds.load and DatasetBuilder

Every dataset is exposed as a DatasetBuilder, which knows:

  • Where to download the data from and how to extract it and write it to a standard format (DatasetBuilder.download_and_prepare: https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetBuilder#download_and_prepare).
  • How to load it from disk (DatasetBuilder.as_dataset: https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetBuilder#as_dataset).
  • And all the information about the dataset, like the names, types, and shapes of all the features, the number of records in each split, the source URLs, citation for the dataset or associated paper, etc. (DatasetBuilder.info: https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetBuilder#info).

You can directly instantiate any of the DatasetBuilders or fetch them by string with tfds.builder (https://www.tensorflow.org/datasets/api_docs/python/tfds/builder):

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Fetch the dataset directly
mnist = tfds.image.MNIST()
# or by string name
mnist = tfds.builder('mnist')

# Describe the dataset with DatasetInfo
assert mnist.info.features['image'].shape == (28, 28, 1)
assert mnist.info.features['label'].num_classes == 10
assert mnist.info.splits['train'].num_examples == 60000

# Download the data, prepare it, and write it to disk
mnist.download_and_prepare()

# Load data from disk as tf.data.Datasets
datasets = mnist.as_dataset()
train_dataset, test_dataset = datasets['train'], datasets['test']
assert isinstance(train_dataset, tf.data.Dataset)

# And convert the Dataset to NumPy arrays if you'd like
for example in tfds.as_numpy(train_dataset):
  image, label = example['image'], example['label']
  assert isinstance(image, np.ndarray)

as_dataset() accepts a batch_size argument which will give you batches of examples instead of one example at a time. For small datasets that fit in memory, you can pass batch_size=-1 to get the entire dataset at once as a tf.Tensor. All tf.data.Datasets can easily be converted to iterables of NumPy arrays using tfds.as_numpy() (https://www.tensorflow.org/datasets/api_docs/python/tfds/as_numpy).
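
For example, here is a hedged sketch of pulling the entire MNIST training split into NumPy arrays in one shot (assuming it comfortably fits in memory):

import tensorflow_datasets as tfds

# batch_size=-1 returns the full split as tensors;
# tfds.as_numpy then converts them to NumPy arrays
train_data = tfds.load("mnist", split="train", batch_size=-1)
numpy_train = tfds.as_numpy(train_data)

images, labels = numpy_train["image"], numpy_train["label"]
print(images.shape, labels.shape)  # expected: (60000, 28, 28, 1) (60000,)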

As a convenience, you can do all the above with tfds.load (https://www.tensorflow.org/datasets/api_docs/python/tfds/load), which fetches the DatasetBuilder by name, calls download_and_prepare(), and calls as_dataset().

import tensorflow as tf
import tensorflow_datasets as tfds

datasets = tfds.load("mnist")
train_dataset, test_dataset = datasets["train"], datasets["test"]
assert isinstance(train_dataset, tf.data.Dataset)

You can also easily get the DatasetInfo (https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetInfo) object from tfds.load by passing with_info=True. See the API documentation for all the options.
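
For example, a small sketch of grabbing the DatasetInfo alongside the data:

import tensorflow_datasets as tfds

# with_info=True returns the DatasetInfo object alongside the datasets
datasets, info = tfds.load("mnist", with_info=True)

print(info.features)
print(info.splits["train"].num_examples)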

Dataset Versioning

Every dataset is versioned (builder.info.version) so that you can rest assured that the data doesn’t change underneath you and that results are reproducible. For now, we guarantee that if the data changes, the version will be incremented.

Note that while we do guarantee the data values and splits are identical given the same version, we do not currently guarantee the ordering of records for the same version.
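
As a quick illustration (a minimal sketch; the printed version string depends on the tensorflow-datasets release you have installed):

import tensorflow_datasets as tfds

builder = tfds.builder("mnist")

# DatasetInfo carries the dataset version described above
print(builder.info.version)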

Dataset Configuration

Datasets with different variants are configured with named BuilderConfigs. For example, the Large Movie Review Dataset (tfds.text.IMDBReviews: https://www.tensorflow.org/datasets/datasets#imdb_reviews) could have different encodings for the input text (for example, plain text, or a character encoding, or a subword encoding). The built-in configurations are listed with the dataset documentation and can be addressed by string, or you can pass in your own configuration.

# See the built-in configs
configs = tfds.text.IMDBReviews.builder_configs
assert "bytes" in configs

# Address a built-in config with tfds.builder
imdb = tfds.builder("imdb_reviews/bytes")
# or when constructing the builder directly
imdb = tfds.text.IMDBReviews(config="bytes")
# or use your own custom configuration
my_encoder = tfds.features.text.ByteTextEncoder(additional_tokens=['hello'])
my_config = tfds.text.IMDBReviewsConfig(
    name="my_config",
    version="1.0.0",
    text_encoder_config=tfds.features.text.TextEncoderConfig(encoder=my_encoder),
)
imdb = tfds.text.IMDBReviews(config=my_config)
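
As a brief follow-on sketch, once a configuration is chosen, loading works just as before; the "name/config" string form also works with tfds.load:

import tensorflow_datasets as tfds

# The "name/config" string form works with tfds.load as well
datasets = tfds.load("imdb_reviews/bytes")

# Or drive a builder for the chosen config by hand
imdb = tfds.builder("imdb_reviews/bytes")
imdb.download_and_prepare()
train_reviews = imdb.as_dataset()["train"]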

See the section on dataset configuration in our documentation on adding a dataset.

Text Datasets and Vocabularies

Text datasets can often be painful to work with because of different encodings and vocabulary files. tensorflow-datasets makes this much easier. It ships with many text tasks and includes three kinds of TextEncoders, all of which support Unicode:

  • ByteTextEncoder for byte/character-level encodings.
  • TokenTextEncoder for word-level encodings based on a vocabulary file.
  • SubwordTextEncoder for subword-level encodings, with byte-level fallback so that it is fully invertible.

The encoders, along with their vocabulary sizes, can be accessed through DatasetInfo:

imdb = tfds.builder("imdb_reviews/subwords8k")

# Get the TextEncoder from DatasetInfo
encoder = imdb.info.features["text"].encoder
assert isinstance(encoder, tfds.features.text.SubwordTextEncoder)

# Encode, decode
ids = encoder.encode("Hello world")
assert encoder.decode(ids) == "Hello world"

# Get the vocabulary size
vocab_size = encoder.vocab_size
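
To round this out, here is a small, hedged usage sketch that decodes one encoded review back into readable text (it downloads and prepares IMDB first, which takes a moment):

import tensorflow_datasets as tfds

imdb = tfds.builder("imdb_reviews/subwords8k")
imdb.download_and_prepare()

encoder = imdb.info.features["text"].encoder
train_ds = imdb.as_dataset()["train"]

# Under this config, "text" arrives as an array of subword ids,
# so we decode it back into a string
for example in tfds.as_numpy(train_ds.take(1)):
    print(encoder.decode(example["text"]))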

Both TensorFlow and TensorFlow Datasets will be working to improve text support even further in the future.

Getting started

Our documentation site is the best place to start using tensorflow-datasets. Here are some additional pointers for getting started:

  • Try tfds in the Colab notebook linked above.
  • Browse the API documentation for tfds.load, DatasetBuilder, and tfds.as_numpy.
  • Read the documentation on adding a dataset if you would like to contribute one.

We expect to be adding datasets in the coming months, and we hope that the community will join in. Open a GitHub Issue to request a dataset, vote on which datasets should be added next, discuss implementation, or ask for help. Pull Requests are very welcome! Add a popular dataset to contribute to the community, or if you have your own data, contribute it to TFDS to make your data famous!

Now that data is easy, happy modeling!

Acknowledgements

We’d like to thank Stefan Webb of Oxford for allowing us to use the tensorflow-datasets PyPI name. Thanks Stefan!

We’d also like to thank Lukasz Kaiser and the Tensor2Tensor project for inspiring and guiding tensorflow/datasets. Thanks Lukasz! T2T will be migrating to tensorflow/datasets soon.

Originally published by TensorFlow at https://medium.com/tensorflow
