Moving from Pandas to Spark with Scala isn’t as challenging as you might think; your code will run faster, and you’ll probably end up writing better code too.
In my experience as a Data Engineer, building data pipelines in Pandas often means regularly increasing resources to keep up with growing memory usage, and we frequently hit runtime errors caused by unexpected data types or nulls. Using Spark with Scala instead, solutions feel more robust and are easier to refactor and extend.
In this article we’ll run through the following:
Spark has become popular due to its ability to process large data sets at speed
For a visual comparison of run times, see the chart below from Databricks, which shows that Spark is significantly faster than Pandas, and that Pandas runs out of memory at a lower threshold.
https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
Spark has a rich ecosystem
Spark provides a familiar API, so using Scala instead of Python won’t feel like a huge learning curve. Here are a few reasons why you might want to use Scala:
In general, we’ll use Datasets where we can, because they’re type safe, more efficient, and more readable: it’s clear what data we can expect in the Dataset.
To create our Dataset we first need to create a case class, which is similar to a data class in Python, and is really just a way to specify a data structure.
For example, let’s create a case class called FootballTeam, with a few fields:
case class FootballTeam(
    name: String,
    league: String,
    matches_played: Int,
    goals_this_season: Int,
    top_goal_scorer: String,
    wins: Int
)
Now, let’s create an instance of this case class:
val brighton: FootballTeam =
  FootballTeam(
    "Brighton and Hove Albion",
    "Premier League",
    matches_played = 29,
    goals_this_season = 32,
    top_goal_scorer = "Neil Maupay",
    wins = 6
  )
Let’s create another instance called manCity, and then we’ll create a Dataset with these two FootballTeams:
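As a sketch, manCity might look something like this (the statistics here are illustrative values, not real season figures):

```scala
// FootballTeam case class as defined above
case class FootballTeam(
    name: String,
    league: String,
    matches_played: Int,
    goals_this_season: Int,
    top_goal_scorer: String,
    wins: Int
)

// Illustrative values only -- not real season statistics
val manCity: FootballTeam =
  FootballTeam(
    "Manchester City",
    "Premier League",
    matches_played = 28,
    goals_this_season = 68,
    top_goal_scorer = "Sergio Aguero",
    wins = 18
  )
```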
val teams: Dataset[FootballTeam] =
  spark.createDataset(Seq(brighton, manCity))
Another way to do this is:
val teams: Dataset[FootballTeam] =
spark.createDataFrame(Seq(brighton, manCity)).as[FootballTeam]
The second way is useful when reading from an external data source that returns a DataFrame: you can then cast it to your Dataset, giving you a typed collection.
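As a minimal sketch of that pattern, assuming a CSV file whose header columns match the case class fields (the file path is hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("football").getOrCreate()
// Needed to bring the encoder for .as[FootballTeam] into scope
import spark.implicits._

// Read a DataFrame from an external source, then cast to a typed Dataset.
// "data/football_teams.csv" is a hypothetical path; inferSchema maps the
// numeric columns to Int so they line up with the case class fields.
val teamsFromFile: Dataset[FootballTeam] =
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/football_teams.csv")
    .as[FootballTeam]
```

From here on, any column missing or mistyped in the source file surfaces as an error at the `.as[FootballTeam]` cast, rather than as a surprise later in the pipeline.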
#pandas #spark #scala #data-science #developer