In this article, I’ll show you how to use a combination of built-in functions, the C-API, and Cython to quickly and easily put together your own super-fast custom data loader for NumPy/Pandas. First, we’ll review a common structure that’s often used for storing binary data, and then write code to load some sample data. Along the way, we’ll take brief detours into the C-API and the Python buffer protocol so that you understand how all the pieces work. There’s a lot here, but don’t worry — it’s all very straightforward and we’ll make sure that the most important parts of the code are generic and reusable. You can also follow along with a working notebook here. When we’re done, you’ll be able to easily adapt the code to your specific data format and get back to analysis!

Binary Data Formats

For our purposes, a binary data file is nothing more than a large array of bytes that encodes a series of data elements such as integers, floats, or character arrays. While there are many formats for the binary encoding, one common format consists of a series of individual ‘records’ stored back-to-back. Within each record, the first few bytes typically encode a header that specifies the length (in bytes) of the record, as well as other identifying information that allows the user to decode the data.
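To make the record structure concrete, here is a minimal sketch that walks a byte buffer one record at a time using only the standard library. The header layout (a little-endian 2-byte record length followed by a 2-byte type ID) is an assumption made purely for illustration, and `iter_records` is a hypothetical helper name; your own format's documentation determines the real layout.

```python
import struct

# Hypothetical header: little-endian uint16 total record length (header included),
# followed by a uint16 record type ID. Real formats will differ.
HEADER = struct.Struct("<HH")

def iter_records(buf):
    """Yield (type_id, payload_bytes) for each record stored back-to-back in buf."""
    offset = 0
    while offset < len(buf):
        length, type_id = HEADER.unpack_from(buf, offset)
        # Guard against a bad length field so we never loop forever or overrun.
        if length < HEADER.size or offset + length > len(buf):
            raise ValueError(f"corrupt record at byte {offset}")
        yield type_id, buf[offset + HEADER.size : offset + length]
        offset += length  # the length field tells us where the next record starts

# Build two tiny records by hand and read them back.
rec_a = HEADER.pack(HEADER.size + 4, 1) + struct.pack("<f", 0.5)
rec_b = HEADER.pack(HEADER.size + 2, 2) + struct.pack("<h", -3)
records = list(iter_records(rec_a + rec_b))
```

Because each header carries the record length, the reader can hop from one record to the next without knowing how to decode every payload.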

Generally, there will be multiple record types in the file, all of which share a common header format. For example, binary data from a car’s computer might have one record type for driver controls such as the brake pedal and steering wheel positions, and another type to record engine statistics such as fuel consumption and temperature.
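As a rough sketch of how a shared header lets a reader tell record types apart, the example below describes two record types with NumPy structured dtypes and dispatches on a type ID. Every field name, type ID, and byte width here (`length`, `type_id`, `brake`, `rpm`, and so on) is hypothetical and chosen only to mirror the car example; none of it comes from a real format.

```python
import numpy as np

# Hypothetical shared header: record length in bytes and a type ID.
header_fields = [("length", "<u2"), ("type_id", "<u2")]
header_dt = np.dtype(header_fields)

# Hypothetical record types, each beginning with the shared header.
controls_dt = np.dtype(header_fields + [("brake", "<f4"), ("steering", "<f4")])
engine_dt = np.dtype(
    header_fields + [("fuel_rate", "<f4"), ("temperature", "<f4"), ("rpm", "<u2")]
)
dtype_by_id = {1: controls_dt, 2: engine_dt}

def make_record(dt, type_id, **values):
    """Build the bytes of one sample record (for demonstration only)."""
    rec = np.zeros(1, dtype=dt)
    rec["length"] = dt.itemsize
    rec["type_id"] = type_id
    for name, value in values.items():
        rec[name] = value
    return rec.tobytes()

buf = make_record(controls_dt, 1, brake=0.25, steering=-0.5) + make_record(
    engine_dt, 2, fuel_rate=1.5, temperature=90.0, rpm=3000
)

decoded = []
offset = 0
while offset < len(buf):
    # Read the common header first, then reinterpret the same bytes
    # with the dtype that matches the record's type ID.
    header = np.frombuffer(buf, dtype=header_dt, count=1, offset=offset)[0]
    dt = dtype_by_id[int(header["type_id"])]
    decoded.append(np.frombuffer(buf, dtype=dt, count=1, offset=offset)[0])
    offset += int(header["length"])
```

Reading the header before the body is what makes mixed files workable: the reader only needs the common header layout up front, and can look up everything else from the type ID.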

In order to load binary data, you need to refer to documentation for your binary format to know exactly how the bytes encode data. For the purposes of demonstration, we’ll work with sample data laid out like this:

Image by author

In the next section, we’ll see how to deal with the simple case where the data contains only a single record type.


Loading binary data to NumPy/Pandas