Python has come into its own in the fields of big data and ML thanks to a great community and a ton of useful libraries. Read on to learn how to take advantage!


The Python platform is an incredibly powerful alternative to closed-source (and expensive!) platforms like MATLAB or Mathematica. Over the years, with the active development of packages like NumPy and SciPy (for general scientific computing) and platforms like TensorFlow, Keras, Theano, and PyTorch, the power available to everyone today via the Python environment is staggering. Add things like Jupyter notebooks, and, for most of us, the deal is sealed.

Personally, I stopped using MATLAB almost five years ago. Granted, MATLAB has an incredible array of software modules available in just about any discipline you can imagine, and Python doesn't have that magnitude of modules available (well, not yet, at least). But for the deep learning work I do every day, the Python platform has been phenomenal.

I use a couple of tools for machine learning today. When I'm working on cybersecurity, I tend to use pip as my module manager and virtualenvwrapper (with virtualenv, duh) as my environment manager. For machine learning, I use Anaconda. I appreciate Anaconda because it provides both module management and environment management in a single tool. I would use it for cybersecurity work too, but it is focused on scientific computing, and many of the system-oriented modules I use aren't available via Anaconda and need to be installed via pip.

I also install NumPy, scikit-learn, Jupyter, IPython, and ipdb. I use the base functionality of these for machine learning projects. I’ll usually install some combination of TensorFlow, Keras, or PyTorch, depending on what I’m working on. I use tmux and powerline too, but these aren’t Python modules (well, powerline is, via powerline-status). They are pretty, though, and I really like how they integrate with IPython. Finally, I install H5py.

H5py is what I wanted to talk to you about today. A surprising number of people aren’t familiar with it, nor are they familiar with the underlying data storage format, HDF5. They should be.

Python has its own fully functional data serialization format. Everyone who’s worked with Python for any amount of time knows and loves pickle files. They’re convenient, built-in, and easy to save and load. But they can be BIG. And I don’t mean kind of big. I mean many gigabytes (terabytes?) big, especially when using imagery. And let’s not even think about video.
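To make that concrete, here's a minimal sketch of the pickle round-trip (the array shape and file name are just illustrative). The whole object is serialized as a single blob, so file size tracks in-memory size:

import pickle
import numpy as np

# Roughly 600 MB of image-like data (1000 x 224 x 224 x 3 float32 values).
features = np.random.rand(1000, 224, 224, 3).astype(np.float32)

# Pickling writes the entire object out as one blob...
with open('features.pkl', 'wb') as f:
    pickle.dump(features, f)

# ...and loading pulls the entire blob back into RAM at once.
with open('features.pkl', 'rb') as f:
    features = pickle.load(f)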

HDF5 (Hierarchical Data Format 5) is a data storage system originally designed for use with large geospatial datasets. It evolved from HDF4, an earlier storage format created by the HDF Group. It solves some significant drawbacks of using pickle files to store large datasets: not only does it keep stored datasets smaller, it also eliminates load lag and has a much smaller memory footprint.

Storage Size

HDF5, via H5py, gives you the same flexibility with regard to stored data types as NumPy and SciPy, which means you can be very specific about the size of each element in a tensor. When you have millions of individual data elements, there's a pretty significant difference between a 16-bit and a 32-bit data width.
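As a hedged sketch (the shapes and names here are made up), specifying the element width is a one-argument affair in H5py:

import h5py
import numpy as np

# Pixel values in [0, 1024) fit comfortably in 16 bits.
pixels = np.random.randint(0, 1024, size=(100, 512, 512))

with h5py.File('pixels.h5', 'w') as f:
    # Same data, half the on-disk width: int16 instead of int32.
    f.create_dataset('pixels', data=pixels, dtype='int16')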

You can also specify compression algorithms and options when creating and saving a dataset, including LZF, GZIP, and SZIP (other filters, like LZO, are available through third-party plugins). You can control how aggressive the compression is, too. This is a huge deal: with sparse datasets, the ability to compress their elements provides huge space savings. I typically use GZIP at the highest compression level, and it's remarkable how much room you can save. On one image dataset I recently created, the model I was using forced me to store a binary value in an int64. Compression allowed me to eliminate almost all the empty overhead on those binary values, shrinking the archive by 40% compared to the previous int8 implementation (which had been saving the binary value as ASCII, using the entire width of the field).
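Turning compression on is just a couple more keyword arguments. Here's a sketch using GZIP at its most aggressive level (the dataset itself is hypothetical):

import h5py
import numpy as np

# Sparse binary labels, almost all zeros, stored wide because the model demands it.
labels = np.zeros(5_000_000, dtype=np.int64)

with h5py.File('labels.h5', 'w') as f:
    f.create_dataset('labels', data=labels,
                     compression='gzip',  # 'lzf' and 'szip' work here too
                     compression_opts=9)  # GZIP level 0-9; 9 is the most aggressive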

Load Lag

Pickle files need to be loaded completely into the process address space before they can be used. They are serialized memory-resident objects, and to be accessed they need to be, well, memory-resident, right? HDF5 files just don't care about that.

HDF5 is a hierarchical set of data objects (big shock, right, since hierarchical is the first word in the name?). So, it’s more like a file system than a single file. This is important.

Because it’s more of a filesystem than a single data file, you don’t need to load all the contents of the file at once. HDF5 and H5py load a small driver into memory and that driver is responsible for accessing data from the HDF5 data file. This way, you only load what you need. If you’ve ever tried to load large pickle files, you know what a big deal this is. And not only can you load the data quickly, you can access it quickly via comfortable Pythonic data access interfaces, like indexing, slicing, and list comprehensions.
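In practice, that looks like ordinary NumPy indexing, except the reads happen lazily. A sketch, assuming a file images.h5 with a 4-D image dataset under images/train (both names are illustrative):

import h5py

with h5py.File('images.h5', 'r') as f:
    dset = f['images/train']   # just a handle; no pixel data has been read yet
    batch = dset[0:32]         # reads only the first 32 images from disk
    for image in dset[32:64]:  # iteration and slicing stay Pythonic throughout
        pass                   # ...do something with each image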

Data Footprint

Not having to load all your data every time you need to use it also gives you a much smaller data footprint in runtime memory. When you're training deep networks on high-resolution true-color imagery, where pixel depth is on the order of 32 bits, you are using A LOT of memory. You need to free up as much memory as you can to train your model so you can finish in a few days instead of a few weeks. Setting aside terabytes (or gigabytes, for that matter) of memory just to store data is a waste of resources; with HDF5, you don't have to.

HDF5 is essentially a key/value store, organized as a tree. You have access to two kinds of objects: datasets and groups. Datasets are, well, datasets. Groups are collections of datasets, and you access both via keys. Datasets are leaf elements in the storage graph; groups are internal nodes. Groups can hold other groups or datasets; datasets can only contain data. Both groups and datasets can have arbitrary metadata (again stored as key/value pairs) associated with them. In HDF5-ese, this metadata is called attributes. Accessing a dataset is as easy as this:

import h5py as h5

with h5.File('filename.h5', 'r') as f:
    group = f['images']
    dataset = group['my dataset']
    # Go ahead, use the dataset! I dare you!

Figure 1: Getting your HDF5 on, Python style.
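For completeness, here's a hedged sketch of how a file like the one in Figure 1 might have been written in the first place (the shape and dtype are illustrative):

import h5py
import numpy as np

with h5py.File('filename.h5', 'w') as f:
    group = f.create_group('images')    # an internal node in the tree
    group.create_dataset('my dataset',  # a leaf node holding the actual data
                         data=np.zeros((100, 64, 64), dtype='uint8'))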

H5py, the Python interface to HDF5 files, is easy to use. It supports modern with semantics as well as traditional open/close semantics. And with attributes, you don't have to depend on naming conventions to carry metadata about stored datasets (like image resolution, provenance, or time of creation); you store that data as attributes on the object itself.
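Attributes behave like a small dictionary hanging off a group or dataset. Continuing the sketch above (the keys and values are made up):

import h5py

with h5py.File('filename.h5', 'a') as f:
    dset = f['images/my dataset']
    # Metadata lives on the object itself, not in the file name.
    dset.attrs['resolution'] = (1920, 1080)
    dset.attrs['source'] = 'field-survey-2018'

    for key, value in dset.attrs.items():
        print(key, value)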

Python is frequently used in data analysis today, in both statistical analysis and machine learning, and many of us use its native serialization format for our data. While pickle files are easy to use, they bog down when dealing with large amounts of data. HDF5 is a data storage system designed for huge geospatial datasets, and it picks up perfectly where pickle files leave off.
