At work we visualise and analyze very large datasets: on a typical day, around 65 million records and 20 GB of data. That volume makes analysis over a range of many days challenging, and the size of the data forces our analyses to cover a shorter period than we would like.

I recently discovered the Dask library, hence I wanted to write an article on it for anyone who wants to get started on this amazing tool.

We use the typical Python data toolkit for our ETL jobs, but the sheer volume of data is too large for our standard tools, `numpy` and `pandas`, to handle. There are distributed computing frameworks, like Spark, that handle the heavy lifting. While Spark could do the job, moving from the Python data toolkit to Spark is a radical change.

So here comes Dask!

Dask is designed to extend the `numpy` and `pandas` packages to data processing problems that are too large to fit in memory. It breaks a large processing job into many smaller tasks that are handled by `numpy` or `pandas`, and then reassembles the results into a coherent whole. This happens behind a seamless interface designed to mimic the `numpy` / `pandas` interfaces.
