Processing a couple of gigabytes of data on one’s laptop is usually an uphill task, unless the laptop has high RAM and a whole lot of compute power.

That notwithstanding, data scientists still have to look for alternative solutions to deal with this problem. Some of the hacks involve tweaking Pandas to enable it to process huge datasets, buying a GPU machine, or purchasing compute power on the cloud. In this piece, we’ll see how we can use Dask to work with large datasets on our local machines.
Dask and Python

Dask is a flexible library for parallel computing in Python. It’s built to integrate nicely with other open-source projects such as NumPy, Pandas, and scikit-learn. In Dask, Dask arrays are the equivalent of NumPy Arrays, Dask DataFrames the equivalent of Pandas DataFrames, and Dask-ML the equivalent of scikit-learn.

#dask #machine learning #pandas #python

Machine Learning in Dask
2.30 GEEK