Moving from Pandas to Spark

When your datasets start getting large, a move to Spark can increase speed and save time.

Most data science workflows start with Pandas. Pandas is an awesome library that lets you do a variety of transformations and can handle different kinds of data, such as CSV or JSON files. I love Pandas; I even made a podcast on it called “Why Pandas is the New Excel”. I still think Pandas is an awesome library in a data scientist’s arsenal. However, there comes a point where the datasets you are working on get too big and Pandas, which holds its data in the memory of a single machine, starts running out of memory. This is where Spark comes into the picture.

I am writing this blog post in a Q&A format, covering questions you might have and that I also had when I was getting started.



towardsdatascience.com
