A Practical Guide to TFRecords
TensorFlow’s custom data format TFRecord is really useful. The files are supported natively by the blazing-fast tf.data API, support distributed datasets, and leverage parallel I/O. But they can be somewhat overwhelming at first. This post serves as a practical introduction.
We will first go over the concept behind TFRecord files. With this in mind, we then work with image data, using both a small and a large dataset. Expanding our knowledge, we move on to audio data, and then cover the last large domain, text. To combine all of this, we finally create an artificial multi-data-type dataset and, you guessed it, write it to TFRecords as well.
When I started my deep learning research, I naively stored my data scattered over the disk. To make things worse, I polluted my directories with thousands of small files, on the order of a few KB each. The cluster I was working on at the time was not amused, and it took quite some time to load all of these files.
This is where TFRecords (or large numpy arrays, for that matter) come in handy: Instead of storing the data scattered around, forcing the disks to jump between blocks, we simply store the data in a sequential layout. We can visualize this concept in the following way:
Visualization created by the author
The TFRecord file can be seen as a wrapper around all the single data samples. Every single data sample is called an Example, and is essentially a dictionary storing the mapping between a key and our actual data.
Now, the seemingly complicated part is this: When you want to write your data to TFRecords, you first have to convert your data to a Feature. These features are then the inner components of one Example:
Visualization created by the author
So far, so good. But what is the difference compared to storing your data in a compressed numpy array or a pickle file? Two things: First, the TFRecord file is stored sequentially, enabling fast streaming due to low access times. Second, TFRecord files are natively integrated into TensorFlow's tf.data API, easily enabling batching, shuffling, caching, and the like.
As a bonus, if you ever have the chance and the computing resources to do multi-worker training, you can distribute the dataset across your machines.
On a code level, the feature creation happens with these convenient methods, which we will talk about later on:
To write data to TFRecord files, you first create a dictionary that says
I want to store this data point under this key
When reading from TFRecord files, you invert this process by creating a dictionary that says
I have these keys; fill this placeholder with the value stored at that key
Let us see how this looks in action.
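As a minimal sketch of the full round trip, the snippet below writes two toy samples to a TFRecord file and reads them back. The filename `demo.tfrecord` and the `"label"`/`"value"` schema are made up for illustration:

```python
import tensorflow as tf

def serialize_example(label, value):
    # Writing: the dictionary says "store this data point under this key".
    feature = {
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        "value": tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Every call to write() appends one serialized Example to the file.
with tf.io.TFRecordWriter("demo.tfrecord") as writer:
    for label, value in [(0, 0.1), (1, 0.2)]:
        writer.write(serialize_example(label, value))

# Reading: the parse dictionary mirrors the write dictionary, with a
# placeholder describing the type and shape stored under each key.
feature_description = {
    "label": tf.io.FixedLenFeature([], tf.int64),
    "value": tf.io.FixedLenFeature([], tf.float32),
}

def parse_example(serialized):
    return tf.io.parse_single_example(serialized, feature_description)

dataset = tf.data.TFRecordDataset("demo.tfrecord").map(parse_example)
```

Because the result is a regular tf.data dataset, you can chain `.shuffle()`, `.batch()`, and `.prefetch()` onto it as usual.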