If you have some basic knowledge of data analysis with Python Pandas, are curious about PySpark, and don’t know where to start, tag along.

Python Pandas encouraged us to leave Excel tables behind and to look at data from a coder’s perspective instead. Data sets grew bigger and bigger, moving from databases to data files and on to data lakes. Some smart minds behind Apache Spark blessed us with a Scala-based framework that processes these larger amounts of data in a reasonable time. Since Python is the go-to language for data science nowadays, a Python API called PySpark soon became available.

For a while now I have been trying to conquer this Spark interface, with its non-Pythonic syntax, that everybody in the big data world praises. It took me a few attempts and it’s still a work in progress. However, in this post I want to show you, a fellow PySpark beginner, how to replicate the same analysis you would otherwise do with Pandas.

You can find the data analysis example we are going to look at in the book “Python for Data Analysis” by Wes McKinney. The aim of that analysis is to find the top-ranked movies in the MovieLens 1M data set, which is collected and maintained by the GroupLens Research project at the University of Minnesota.

As a coding environment I used Kaggle, since its notebooks come with the basic data science modules installed and are ready to go with two clicks.

You can also find the complete analysis with the PySpark code in this Kaggle notebook and the Pandas code in this one. We won’t replicate the full analysis here, but instead focus on the syntax differences between handling Pandas and PySpark dataframes. I will always show the Pandas code first, followed by the PySpark equivalent.

The basic functions that we need for this analysis are:

  • Loading data from CSV format
  • Combining datasets from different tables
  • Extracting information
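
To give a first taste of these three steps side by side, here is a minimal sketch, not the code from the notebooks. The sample rows are made up, the column names and the helper functions `write_samples`, `mean_ratings_pandas` and `mean_ratings_pyspark` are hypothetical, and the `::` separator is assumed to mirror the MovieLens 1M file layout. The PySpark half also assumes Spark 3.0 or later, since older versions reject multi-character separators.

```python
from pathlib import Path

import pandas as pd

# Made-up sample rows in an assumed "::"-separated layout (not real MovieLens data).
RATINGS = "1::10::5\n2::10::4\n1::20::3\n2::20::2\n"
MOVIES = "10::Movie A\n20::Movie B\n"


def write_samples(folder):
    """Write the two sample files into `folder` and return their paths."""
    folder = Path(folder)
    ratings_path = folder / "ratings.dat"
    movies_path = folder / "movies.dat"
    ratings_path.write_text(RATINGS)
    movies_path.write_text(MOVIES)
    return str(ratings_path), str(movies_path)


def mean_ratings_pandas(ratings_path, movies_path):
    # 1. Load: a multi-character separator needs the slower python engine.
    ratings = pd.read_csv(ratings_path, sep="::", engine="python",
                          names=["user_id", "movie_id", "rating"])
    movies = pd.read_csv(movies_path, sep="::", engine="python",
                         names=["movie_id", "title"])
    # 2. Combine: merge the two tables on their shared key column.
    data = ratings.merge(movies, on="movie_id")
    # 3. Extract: mean rating per title, best first (returns a Series).
    return data.groupby("title")["rating"].mean().sort_values(ascending=False)


def mean_ratings_pyspark(ratings_path, movies_path):
    # Imported here so the Pandas half runs without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # 1. Load: column names and types are passed as a DDL schema string.
    ratings = spark.read.csv(ratings_path, sep="::",
                             schema="user_id INT, movie_id INT, rating INT")
    movies = spark.read.csv(movies_path, sep="::",
                            schema="movie_id INT, title STRING")
    # 2. Combine: join() plays the role of merge().
    data = ratings.join(movies, on="movie_id")
    # 3. Extract: groupBy().agg() replaces groupby()[column].mean().
    result = (data.groupBy("title")
                  .agg(F.avg("rating").alias("mean_rating"))
                  .orderBy(F.desc("mean_rating")))
    # Nothing is computed until an action such as collect() is called.
    return [(row["title"], row["mean_rating"]) for row in result.collect()]
```

To try it, create a temporary folder, pass it to `write_samples`, and hand the two returned paths to either function. Both compute the same ranking; the main difference is that Pandas hands back a Series right away, while PySpark only does the work once an action like `collect()` is called.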


Syntax Gotchas Writing PySpark When Knowing Pandas