Learning PySpark Locally Before Moving to a Multi-node Databricks Cluster Environment

I have come across several frustrating PySpark tutorials that promise to teach PySpark in under five minutes 🙄. They are clickbait and lack the depth needed to get me started and keep me going. So I decided to write a project-driven tutorial, in the hope of helping others like me, rather than showing you isolated code snippets and how-tos. I will focus on a list of questions and use PySpark to answer them. You may follow along by grabbing the dataset and code here. At the end of this article, I have also included some excellent resources I enjoyed learning from. Happy learning!

Guiding Questions:

  1. Who were the winners of the D1 division of German football (Bundesliga) in the last decade?
  2. Which teams were relegated in the past 10 years?
  3. Does Oktoberfest affect the performance of Bundesliga teams?
  4. Which season of Bundesliga was the most competitive in the last decade?
  5. What’s the best month to watch Bundesliga?


A Project-driven Approach to Learning PySpark