A Hybrid Apache Arrow/Numpy DataFrame with Vaex Version 4.0

The Vaex DataFrame has always been very fast. Built from the ground up to be out of core (the size of your disk is the limit), it pushes the limits of what single machines can do in the context of big data analysis.
Starting from version 2, we added better support for string data, giving an almost 1000x speedup compared to Pandas at the time. To support this seemingly trivial datatype, we had to choose a disk and memory format, and we did not want to reinvent the wheel. Apache Arrow was an obvious choice, but it did not meet our requirements at that time. We added string support to Vaex anyway, in a future-compatible way, so that when the time came (now!) we could adopt Apache Arrow without rendering existing data obsolete or requiring data conversions. For compatibility with Apache Arrow, we developed the vaex-arrow package, which made interoperability with Vaex smooth, at the cost of a possible memory copy here and there.


towardsdatascience.com
