Python and HDFS for Machine Learning

Python and HDFS for Machine Learning

Python has come into its own in the fields of big data and ML thanks to a great community and a ton of useful libraries. Read on to learn how to take advantage!

Python has come into its own in the fields of big data and ML thanks to a great community and a ton of useful libraries. Read on to learn how to take advantage!

This article is featured in the new DZone Guide to Big Data: Volume, Variety, and Velocity. Get your free copy for insightful articles, industry stats, and more!

The Python platform is an incredibly powerful alternative to closed source (and expensive!) platforms like MATLAB or Mathematica. Over the years, with the active development of packages like NumPy and SciPy (for general scientific computing) and platforms like TensorFlow, Keras, Theano, and PyTorch, the power available to everyone today via the Python environment is staggering. Add things like Jupyter notebooks, and, for most of us, the deal is sealed.

Personally, I stopped using MATLAB almost five years ago. MATLAB has an incredible array of software modules available in just about any discipline you can imagine, granted, and Python doesn’t have that magnitude of modules available (well, not yet at least). But for the deep learning work I do every day, the Python platform has been phenomenal.

I use a couple of tools for machine learning today. When I’m working on cybersecurity, I tend to use pip as my module manager, and a virtual env wrapper (with virtual env, duh) as my environment manager. For machine learning, I use Anaconda. I appreciate Anaconda because it provides both module management and environment management in a single tool. I would use it for cybersecurity work too, but it is scientific computing focused, and many of the system-oriented modules I use aren’t available via Anaconda and need to be installed via pip.

I also install NumPy, scikit-learn, Jupyter, IPython, and ipdb. I use the base functionality of these for machine learning projects. I’ll usually install some combination of TensorFlow, Keras, or PyTorch, depending on what I’m working on. I use tmux and powerline too, but these aren’t Python modules (well, powerline is, via powerline-status). They are pretty, though, and I really like how they integrate with IPython. Finally, I install H5py.

H5py is what I wanted to talk to you about today. A surprising number of people aren’t familiar with it, nor are they familiar with the underlying data storage format, HDF5. They should be.

Python has its own fully functional data serialization format. Everyone who’s worked with Python for any amount of time knows and loves pickle files. They’re convenient, built-in, and easy to save and load. But they can be BIG. And I don’t mean kind of big. I mean many gigabytes (terabytes?) big, especially when using imagery. And let’s not even think about video.

HDF5 (Hierarchical Data Format 5) is a data storage system originally designed for use with large geospatial datasets. It evolved from HDF4, another storage format created by the HDF Group. It solves some significant drawbacks associated with the use of pickle files to store large datasets — not only does it help control the size of stored datasets, it eliminates load lag, and has a much smaller memory footprint.

Storage Size

HDF5, via H5py, provides you with the same kind of flexibility with regard to stored data types as NumPy and SciPy. This provides you with the ability to be very specific when specifying the size of elements of a tensor. When you have millions of individual data elements, there’s a pretty significant difference between using a 16-bit or 32-bit data width.

You can also specify compression algorithms and options when creating and saving a dataset, including LZF, LZO, GZIP, and SZIP. You can specify the aggression of the compression algorithm too. This is a huge deal — with sparse datasets, the ability to compress elements in those datasets provides huge space savings. I typically use GZIP with the highest level of compression, and it’s remarkable how much room you can save. On one image dataset I recently created, I was forced to use an int64 to store a binary value because of the model I was using. Compression allowed me to eliminate almost all the empty overhead on these binary values, shrinking the archive by 40% from the previous int8 implementation (which was saving the binary value as ASCII, using the entire width of the field).

Load Lag

Pickle files need to be completely loaded into the process address space to be used. They are serialized memory resident objects, and to be accessed they need to be, well, memory residents, right? HDF5 files just don’t care about that.

HDF5 is a hierarchical set of data objects (big shock, right, since hierarchical is the first word in the name?). So, it’s more like a file system than a single file. This is important.

Because it’s more of a filesystem than a single data file, you don’t need to load all the contents of the file at once. HDF5 and H5py load a small driver into memory and that driver is responsible for accessing data from the HDF5 data file. This way, you only load what you need. If you’ve ever tried to load large pickle files, you know what a big deal this is. And not only can you load the data quickly, you can access it quickly via comfortable Pythonic data access interfaces, like indexing, slicing, and list comprehensions.

Data Footprint

Not having to load all your data every time you need to use it also gives you a much smaller data footprint in runtime memory. When you’re training deep networks using high-resolution true color imagery, where your pixel depth is on the order of 32 bits, you are using A LOT of memory. You need to free up as much memory as you can to train your model so you can finish in a few days instead of a few weeks. Setting aside terabytes (or gigabytes, for that matter) of memory just to store data is a waste of resources, using HDF5 you don’t have to.

HDF5 is essentially a key/value store, stored as a tree. You have access to either Datasets or Groups. Datasets are, well, datasets. Groups are collections of datasets, and you access them both via keys. Datasets are leaf elements in the storage graph, groups are internal nodes. Groups can hold other groups or datasets; datasets can only contain data. Both groups and datasets can have arbitrary metadata (again stored as key-value pairs) associated with them. In HDF5-ese, this metadata is called attributes. Accessing a dataset is as easy as this:

import h5py as h5
with h5.File('filename.h5', 'r') as f:
group = f['images']
dataset = group['my dataset']
# Go ahead, use the dataset! I dare you!

Figure 1: Getting your HDF5 on, Python style.

H5py, the Python interface to HDF5 files, is easy to use. It supports modern with semantics, as well as traditional open/close semantics. And with attributes, you don’t have to depend on naming conventions to provide metadata on stored datasets (like image resolution, or provenance, or time of creation). You store that data as attributes on the object itself.

Python is frequently used in data analysis today, both in statistical data analysis and machine learning. Many of us use native serialization formats for our data as well. While pickle files are easy to use, they bog down when dealing with large amounts of data. HDF5 is a data storage system designed for huge geospatial data sets and picks up perfectly where pickle files leave off.


Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter

Learn More

☞ Data Science, Deep Learning, & Machine Learning with Python

☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Python for Data Science and Machine Learning Bootcamp

☞ Machine Learning, Data Science and Deep Learning with Python

☞ [2019] Machine Learning Classification Bootcamp in Python

☞ Introduction to Machine Learning & Deep Learning in Python

☞ Machine Learning Career Guide – Technical Interview

☞ Machine Learning Guide: Learn Machine Learning Algorithms

☞ Machine Learning Basics: Building Regression Model in Python

☞ Machine Learning using Python - A Beginner’s Guide

Machine Learning, Data Science and Deep Learning with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks. Introducing Tensorflow, Using Tensorflow, Introducing Keras, Using Keras, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Learning Deep Learning, Machine Learning with Neural Networks, Deep Learning Tutorial with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks

Explore the full course on Udemy (special discount included in the link):

In less than 3 hours, you can understand the theory behind modern artificial intelligence, and apply it with several hands-on examples. This is machine learning on steroids! Find out why everyone’s so excited about it and how it really works – and what modern AI can and cannot really do.

In this course, we will cover:
• Deep Learning Pre-requistes (gradient descent, autodiff, softmax)
• The History of Artificial Neural Networks
• Deep Learning in the Tensorflow Playground
• Deep Learning Details
• Introducing Tensorflow
• Using Tensorflow
• Introducing Keras
• Using Keras to Predict Political Parties
• Convolutional Neural Networks (CNNs)
• Using CNNs for Handwriting Recognition
• Recurrent Neural Networks (RNNs)
• Using a RNN for Sentiment Analysis
• The Ethics of Deep Learning
• Learning More about Deep Learning

At the end, you will have a final challenge to create your own deep learning / machine learning system to predict whether real mammogram results are benign or malignant, using your own artificial neural network you have learned to code from scratch with Python.

Separate the reality of modern AI from the hype – by learning about deep learning, well, deeply. You will need some familiarity with Python and linear algebra to follow along, but if you have that experience, you will find that neural networks are not as complicated as they sound. And how they actually work is quite elegant!

This is hands-on tutorial with real code you can download, study, and run yourself.

Python Tutorial - Learn Python for Machine Learning and Web Development

Python Tutorial - Learn Python for Machine Learning and Web Development

Python tutorial for beginners - Learn Python for Machine Learning and Web Development. Can Python be used for machine learning? Python is widely considered as the preferred language for teaching and learning ML (Machine Learning). Can I use Python for web development? Python can be used to build server-side web applications. Why Python is suitable for machine learning? How Python is used in AI? What language is best for machine learning?

Python tutorial for beginners - Learn Python for Machine Learning and Web Development


  • 00:00:00 Introduction
  • 00:01:49 Installing Python 3
  • 00:06:10 Your First Python Program
  • 00:08:11 How Python Code Gets Executed
  • 00:11:24 How Long It Takes To Learn Python
  • 00:13:03 Variables
  • 00:18:21 Receiving Input
  • 00:22:16 Python Cheat Sheet
  • 00:22:46 Type Conversion
  • 00:29:31 Strings
  • 00:37:36 Formatted Strings
  • 00:40:50 String Methods
  • 00:48:33 Arithmetic Operations
  • 00:51:33 Operator Precedence
  • 00:55:04 Math Functions
  • 00:58:17 If Statements
  • 01:06:32 Logical Operators
  • 01:11:25 Comparison Operators
  • 01:16:17 Weight Converter Program
  • 01:20:43 While Loops
  • 01:24:07 Building a Guessing Game
  • 01:30:51 Building the Car Game
  • 01:41:48 For Loops
  • 01:47:46 Nested Loops
  • 01:55:50 Lists
  • 02:01:45 2D Lists
  • 02:05:11 My Complete Python Course
  • 02:06:00 List Methods
  • 02:13:25 Tuples
  • 02:15:34 Unpacking
  • 02:18:21 Dictionaries
  • 02:26:21 Emoji Converter
  • 02:30:31 Functions
  • 02:35:21 Parameters
  • 02:39:24 Keyword Arguments
  • 02:44:45 Return Statement
  • 02:48:55 Creating a Reusable Function
  • 02:53:42 Exceptions
  • 02:59:14 Comments
  • 03:01:46 Classes
  • 03:07:46 Constructors
  • 03:14:41 Inheritance
  • 03:19:33 Modules
  • 03:30:12 Packages
  • 03:36:22 Generating Random Values
  • 03:44:37 Working with Directories
  • 03:50:47 Pypi and Pip
  • 03:55:34 Project 1: Automation with Python
  • 04:10:22 Project 2: Machine Learning with Python
  • 04:58:37 Project 3: Building a Website with Django

Thanks for reading

If you liked this post, share it with all of your programming buddies!

Follow us on Facebook | Twitter

Further reading

Complete Python Bootcamp: Go from zero to hero in Python 3

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python and Django Full Stack Web Developer Bootcamp

Complete Python Masterclass

Python Programming Tutorial | Full Python Course for Beginners 2019 👍

Top 10 Python Frameworks for Web Development In 2019

Python for Financial Analysis and Algorithmic Trading

Building A Concurrent Web Scraper With Python and Selenium

Machine Learning Full Course - Learn Machine Learning

Machine Learning Full Course - Learn Machine Learning

This complete Machine Learning full course video covers all the topics that you need to know to become a master in the field of Machine Learning.

Machine Learning Full Course | Learn Machine Learning | Machine Learning Tutorial

It covers all the basics of Machine Learning (01:46), the different types of Machine Learning (18:32), and the various applications of Machine Learning used in different industries (04:54:48).This video will help you learn different Machine Learning algorithms in Python. Linear Regression, Logistic Regression (23:38), K Means Clustering (01:26:20), Decision Tree (02:15:15), and Support Vector Machines (03:48:31) are some of the important algorithms you will understand with a hands-on demo. Finally, you will see the essential skills required to become a Machine Learning Engineer (04:59:46) and come across a few important Machine Learning interview questions (05:09:03). Now, let's get started with Machine Learning.

Below topics are explained in this Machine Learning course for beginners:

  1. Basics of Machine Learning - 01:46

  2. Why Machine Learning - 09:18

  3. What is Machine Learning - 13:25

  4. Types of Machine Learning - 18:32

  5. Supervised Learning - 18:44

  6. Reinforcement Learning - 21:06

  7. Supervised VS Unsupervised - 22:26

  8. Linear Regression - 23:38

  9. Introduction to Machine Learning - 25:08

  10. Application of Linear Regression - 26:40

  11. Understanding Linear Regression - 27:19

  12. Regression Equation - 28:00

  13. Multiple Linear Regression - 35:57

  14. Logistic Regression - 55:45

  15. What is Logistic Regression - 56:04

  16. What is Linear Regression - 59:35

  17. Comparing Linear & Logistic Regression - 01:05:28

  18. What is K-Means Clustering - 01:26:20

  19. How does K-Means Clustering work - 01:38:00

  20. What is Decision Tree - 02:15:15

  21. How does Decision Tree work - 02:25:15 

  22. Random Forest Tutorial - 02:39:56

  23. Why Random Forest - 02:41:52

  24. What is Random Forest - 02:43:21

  25. How does Decision Tree work- 02:52:02

  26. K-Nearest Neighbors Algorithm Tutorial - 03:22:02

  27. Why KNN - 03:24:11

  28. What is KNN - 03:24:24

  29. How do we choose 'K' - 03:25:38

  30. When do we use KNN - 03:27:37

  31. Applications of Support Vector Machine - 03:48:31

  32. Why Support Vector Machine - 03:48:55

  33. What Support Vector Machine - 03:50:34

  34. Advantages of Support Vector Machine - 03:54:54

  35. What is Naive Bayes - 04:13:06

  36. Where is Naive Bayes used - 04:17:45

  37. Top 10 Application of Machine Learning - 04:54:48

  38. How to become a Machine Learning Engineer - 04:59:46

  39. Machine Learning Interview Questions - 05:09:03