TensorFlow’s custom data format, TFRecord, is really useful. The files are natively supported by the blazing-fast tf.data API, support distributed datasets, and leverage parallel I/O. But they can be somewhat overwhelming at first. This post serves as a practical introduction.

Overview

We will first go over the concept behind TFRecord files. With this in mind, we can then work with image data; we will use both a small and a large dataset. Expanding our knowledge, we then work with audio data. The last large domain is text, which we’ll cover as well. To combine all of this, we create an artificial multi-data-type dataset and, you guessed it, write it to TFRecords as well.

TFRecord’s layout

When I started my deep learning research, I naively stored my data scattered across the disk. To make things worse, I polluted my directories with thousands of small files, on the order of a few KB each. The cluster I was working on at the time was not amused, and it took quite some time to load all these files.

This is where TFRecords (or large numpy arrays, for that matter) come in handy: Instead of storing the data scattered around, forcing the disks to jump between blocks, we simply store the data in a sequential layout. We can visualize this concept in the following way:

Visualization created by the author

The TFRecord file can be seen as a wrapper around all the single data samples. Every single data sample is called an Example, and is essentially a dictionary storing the mapping between a key and our actual data.

Now, the seemingly complicated part is this: When you want to write your data to TFRecords, you first have to convert your data to a Feature. These features are then the inner components of one Example:

Visualization created by the author

So far, so good. But how does this differ from storing your data in a compressed numpy array or a pickle file? Two things: First, the TFRecord file is stored sequentially, enabling fast streaming due to low access times. Second, TFRecord files are natively integrated into TensorFlow’s tf.data API, which makes batching, shuffling, caching, and the like straightforward.
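To make this concrete, here is a minimal sketch of such a pipeline; the file name data.tfrecord is just a placeholder, not a file used elsewhere in this guide:

```python
import tensorflow as tf

# Minimal tf.data pipeline over a TFRecord file ("data.tfrecord" is a placeholder path)
dataset = (
    tf.data.TFRecordDataset(["data.tfrecord"])
    .shuffle(buffer_size=1024)   # shuffle the serialized examples
    .batch(32)                   # group them into batches
    .prefetch(tf.data.AUTOTUNE)  # overlap data loading with training
)
```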

As a bonus, if you ever have the chance and the computing resources to do multi-worker training, you can distribute the dataset across your machines.
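A rough sketch of how that can look, assuming a MultiWorkerMirroredStrategy and a cluster already configured via the TF_CONFIG environment variable (both are assumptions, not something we set up in this post):

```python
import tensorflow as tf

# Assumes TF_CONFIG describes the cluster of workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

dataset = tf.data.TFRecordDataset(["data.tfrecord"]).batch(32)
# Shard and distribute the batched dataset across all participating workers.
dist_dataset = strategy.experimental_distribute_dataset(dataset)
```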

On a code level, feature creation happens with a few convenient helper methods, which we will talk about in more detail later on.
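As a preview, a common version of these helpers, following the example in the TensorFlow documentation, looks roughly like this:

```python
import tensorflow as tf

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
```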

To write data to TFRecord files, you first create a dictionary that says

I want to store this data point under this key
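A short sketch of what that could look like, reusing the helpers from above; the keys label and image_raw and the raw bytes are made up purely for illustration:

```python
import tensorflow as tf

# "I want to store this data point under this key"
feature_dict = {
    "label": _int64_feature(3),
    "image_raw": _bytes_feature(b"raw image bytes go here"),
}
example = tf.train.Example(features=tf.train.Features(feature=feature_dict))

# Serialize the Example and append it to the TFRecord file.
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    writer.write(example.SerializeToString())
```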

When reading from TFRecord files, you invert this process by creating a dictionary that says

I have these keys; fill this placeholder with the value stored at this key
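Continuing the illustrative example from above, the reading side could look like this:

```python
import tensorflow as tf

# "I have these keys; fill this placeholder with the value stored at this key"
feature_description = {
    "label": tf.io.FixedLenFeature([], tf.int64),
    "image_raw": tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized_example):
    # Parse one serialized Example back into a dictionary of tensors.
    return tf.io.parse_single_example(serialized_example, feature_description)

dataset = tf.data.TFRecordDataset(["data.tfrecord"]).map(parse_example)
```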

Let us see how this looks in action.
