Moving from Pandas to Spark with Scala isn’t as challenging as you might think; your code will run faster, and you’ll probably end up writing better code too.

In my experience as a Data Engineer, building data pipelines in Pandas often means regularly increasing resources to keep up with growing memory usage, and runtime errors due to unexpected data types or nulls are common. Using Spark with Scala instead, solutions feel more robust and are easier to refactor and extend.

In this article we’ll run through the following:

  1. Why you should use Spark with Scala over Pandas
  2. How the Scala Spark API really isn’t too different from the Pandas API
  3. How to get started using either a Jupyter notebook or your favourite IDE

What is Spark?

  • Spark is an Apache open-source framework
  • It can be used as a library and run on a “local” cluster, or run on a Spark cluster (a minimal local setup is sketched after this list)
  • On a Spark cluster the code can be executed in a distributed way, with a single master node and multiple worker nodes that share the load
  • Even on a local cluster you will still see performance improvements over Pandas, and we’ll go through why below
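If you want to see what that “local” cluster looks like in code, here’s a minimal sketch of creating a SparkSession that runs locally using all available cores (the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

// "local[*]" runs Spark in-process, with one worker thread per logical core
val spark: SparkSession = SparkSession
  .builder()
  .appName("pandas-to-spark")   // placeholder app name
  .master("local[*]")
  .getOrCreate()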

Why use Spark?

Spark has become popular due to its ability to process large data sets at speed:

  • By default, Spark is multi-threaded whereas Pandas is single-threaded
  • Spark code can be executed in a distributed way, on a Spark Cluster, whereas Pandas runs on a single machine
  • Spark is lazy, which means it will only execute when you collect (i.e. when you actually need to return something); in the meantime it builds up an execution plan and finds the optimal way to execute your code (see the sketch after this list)
  • This differs from Pandas, which is eager and executes each step as it reaches it
  • Spark is also less likely to run out of memory as it will start using disk when it reaches its memory limit
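To make the laziness point concrete, here’s a small sketch (the DataFrame df and its columns are hypothetical) showing that transformations only build up the execution plan, and nothing runs until an action is called:

import org.apache.spark.sql.functions.col

// Transformations are lazy: these lines only add steps to the execution plan
val highScorers = df
  .filter(col("goals_this_season") > 50)
  .select("name", "wins")

// Nothing has executed yet - an action such as collect() or count() triggers
// the optimised plan
val results = highScorers.collect()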

For a visual comparison of run time see the below chart from Databricks, where we can see that Spark is significantly faster than Pandas, and also that Pandas runs out of memory at a lower threshold.

[Chart: Databricks single-node benchmark comparing Spark and Pandas runtime]

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

Spark has a rich ecosystem

  • Data science libraries such as the built-in Spark ML, or GraphX for graph algorithms
  • Spark Streaming for real-time data processing
  • Interoperability with other systems and file types (orc, parquet, etc.)
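For example, reading a Parquet file and writing the same data back out as ORC are both one-liners in the DataFrame API (the paths here are just placeholders):

// Read Parquet into a DataFrame, then write it out as ORC
val events = spark.read.parquet("/data/events.parquet")
events.write.orc("/data/events.orc")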

Why use Scala instead of PySpark?

Spark provides a familiar API, so using Scala instead of Python won’t feel like a huge learning curve. Here are a few reasons why you might want to use Scala:

  • Scala is a statically typed language, which means your code will likely have fewer runtime errors than with Python
  • Scala also lets you create immutable objects, so when referencing an object you can be confident its state hasn’t been mutated between creating it and calling it (there’s a small sketch of both after this list)
  • Spark is written in Scala, so new features are available in Scala before Python
  • For Data Scientists and Data Engineers working together, using Scala can help with collaboration, because of the type safety and immutability of Scala code
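As a small sketch of what the first two points buy you, case class fields are immutable by default, and type mismatches are caught at compile time rather than at runtime:

case class Team(name: String, wins: Int)

val team = Team("Brighton", 6)

// team.wins = 7                    // won't compile: case class fields are immutable vals
// val bad = Team("Brighton", "6")  // won't compile: "6" is a String, not an Int

// To "change" a value, create a new copy rather than mutating the original
val updated = team.copy(wins = 7)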

Spark core concepts

  • DataFrame: a Spark DataFrame is a data structure that is very similar to a Pandas DataFrame
  • Dataset: a Dataset is a typed DataFrame, which can be very useful for ensuring your data conforms to your expected schema
  • RDD: this is the core data structure in Spark, upon which DataFrames and Datasets are built

In general, we’ll use Datasets where we can, because they’re type safe, more efficient, and improve readability as it’s clear what data we can expect in the Dataset.
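To make the relationship between the three concrete, here’s a small sketch (using the FootballTeam case class defined in the next section, and a placeholder file path) showing the same data viewed as each structure:

import spark.implicits._   // encoders needed for the typed conversion

val df = spark.read.parquet("/data/teams.parquet")   // DataFrame (i.e. Dataset[Row])
val ds = df.as[FootballTeam]                         // typed Dataset[FootballTeam]
val rdd = df.rdd                                     // the underlying RDD[Row]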

Datasets

To create our Dataset we first need to create a case class, which is similar to a data class in Python, and is really just a way to specify a data structure.

For example, let’s create a case class called FootballTeam, with a few fields:

case class FootballTeam(
    name: String,
    league: String,
    matches_played: Int,
    goals_this_season: Int,
    top_goal_scorer: String,
    wins: Int
)

Now, let’s create an instance of this case class:

val brighton: FootballTeam =
    FootballTeam(
      "Brighton and Hove Albion",
      "Premier League",
      matches_played = 29,
      goals_this_season = 32,
      top_goal_scorer = "Neil Maupay",
      wins = 6
    )

Let’s create another instance called manCity.
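The figures below are just placeholders to illustrate the shape of the data:

val manCity: FootballTeam =
    FootballTeam(
      "Manchester City",
      "Premier League",
      matches_played = 28,
      goals_this_season = 68,
      top_goal_scorer = "Sergio Aguero",
      wins = 18
    )

Now we’ll create a Dataset with these two FootballTeams: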

val teams: Dataset[FootballTeam] =
      spark.createDataset(Seq(brighton, manCity))

Another way to do this is:

val teams: Dataset[FootballTeam] = 
      spark.createDataFrame(Seq(brighton, manCity)).as[FootballTeam]

The second approach can be useful when reading from an external data source that returns a DataFrame, as you can then cast it to your Dataset and end up with a typed collection.
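As a sketch of that pattern (the file path and read options are placeholders, and spark.implicits._ needs to be in scope for the conversion), reading a CSV into a DataFrame and casting it to a typed Dataset might look like this:

val teamsFromFile: Dataset[FootballTeam] = spark.read
  .option("header", "true")          // the file has a header row
  .option("inferSchema", "true")     // let Spark infer the column types
  .csv("/data/football_teams.csv")
  .as[FootballTeam]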

