Lots of people talk about “democratizing” data science and machine learning. What could be more democratic (in the sense of widely accessible) than SQL, PyData, and scaling data science to larger datasets and models?

Dask is rapidly becoming a go-to technology for scalable computing. Despite a strong and flexible dataframe API, Dask has historically not supported SQL for querying most raw data.

In this post, we look at dask-sql, an exciting new open-source library that offers a SQL front-end to Dask. Follow along with this notebook. You can also load it up on Coiled Cloud if you want to access some serious Dask clusters for free with a single click! To do so, log into Coiled Cloud here, navigate to our example notebooks, and launch the dask-sql notebook.

Along the way, we:

  • Launch a Dask cluster and use dask-sql to run SQL queries on it,
  • Perform some basic speed tests,
  • Use SQL and cached data to turbocharge our analytics,
  • Investigate the built-in SQL helper functions in dask-sql,
  • Provide an example of fast plotting from big data.

Many thanks to Nils Braun, the creator of dask-sql, for his thoughtful and constructive feedback on this post.


Getting started with Dask and SQL