A Practical Guide to TFRecords

TensorFlow’s custom data format TFRecord is really useful. The files are supported natively by the blazing-fast tf.data API, support distributed datasets, and leverage parallel I/O. But they are somewhat overwhelming at first. This post serves as a practical introduction.

Overview

We will first go over the concept behind TFRecord files. With this in mind, we can then go on to work with image data; we will use both a small and a large dataset. Expanding our knowledge, we then work with audio data. The last large domain is the text domain, which we’ll cover as well. To combine all this, we create an artificial multi-data-type dataset and, you guessed it, write it to TFRecords as well.

TFRecord’s layout

When I started my deep learning research, I naively stored my data scattered over the disk. To make things worse, I polluted my directories with thousands of small files, in the order of a few KB. The cluster I was then working on was not amused. And it took quite some time to get all these files loaded.

This is where TFRecords (or large numpy arrays, for that matter) come in handy: Instead of storing the data scattered around, forcing the disks to jump between blocks, we simply store the data in a sequential layout. We can visualize this concept in the following way:

Visualization created by the author

The TFRecord file can be seen as a wrapper around all the single data samples. Every single data sample is called an Example, and is essentially a dictionary storing the mapping between a key and our actual data.

Now, the seemingly complicated part is this: When you want to write your data to TFRecords, you first have to convert your data to a Feature. These features are then the inner components of one Example:

Visualization created by the author

So far, so good. But how does this differ from storing your data in a compressed numpy array or a pickle file? Two things: first, the TFRecord file is stored sequentially, enabling fast streaming due to low access times. And secondly, TFRecord files are natively integrated into TensorFlow's tf.data API, easily enabling batching, shuffling, caching, and the like.

As a bonus, if you ever have the chance and the computing resources to do multi-worker training, you can distribute the dataset across your machines.

On a code level, the feature creation happens with these convenient methods, which we will talk about later on:

To write data to TFRecord files, you first create a dictionary that says

I want to store this data point under this key
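A minimal sketch of the writing side (the keys "label" and "text" and the filename are made up for this example):

```python
import tensorflow as tf

# "I want to store this data point under this key":
# map each key to its data, wrapped in a tf.train.Feature.
feature_dict = {
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "text": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"hello tfrecord"])
    ),
}

# Bundle the features into one Example and serialize it to the file.
example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    writer.write(example.SerializeToString())
```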

When reading from TFRecord files, you invert this process by creating a dictionary that says

I have these keys; fill this placeholder with the value stored at each key
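As a self-contained sketch (the filename and the "label" key are invented for this example), the placeholders are tf.io.FixedLenFeature entries that tell the parser the type and shape to expect:

```python
import tensorflow as tf

# First write a single Example so the reading side has data to parse.
example = tf.train.Example(features=tf.train.Features(feature={
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
}))
with tf.io.TFRecordWriter("demo.tfrecord") as writer:
    writer.write(example.SerializeToString())

# The inverted dictionary: each expected key maps to a placeholder
# describing the type and shape of the stored value.
feature_description = {
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def _parse(serialized):
    return tf.io.parse_single_example(serialized, feature_description)

# tf.data streams the serialized Examples and maps the parser over them.
dataset = tf.data.TFRecordDataset("demo.tfrecord").map(_parse)
parsed = next(iter(dataset))
```

From here, the usual tf.data methods (batch, shuffle, cache) apply directly to the parsed dataset.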

Let us see how this looks in action.
