Apache Spark is an open-source distributed computing framework, but it doesn’t manage the cluster of machines it runs on. You need a cluster manager (also called a scheduler) for that. The most commonly used one is Apache Hadoop YARN. Support for running Spark on Kubernetes was added in version 2.3, and Spark-on-k8s adoption has been accelerating ever since.
If you’re curious about the core notions of Spark-on-Kubernetes, how it differs from YARN, and its benefits and drawbacks, read our previous article: The Pros And Cons of Running Spark on Kubernetes. For a deeper dive, you can also watch our session at Spark Summit 2020: Running Apache Spark on Kubernetes: Best Practices and Pitfalls.
In this article, we present benchmarks comparing the performance of running Spark on Kubernetes versus YARN. Our results indicate that Kubernetes has caught up with YARN: there are no longer any significant performance differences between the two. In particular, we compare shuffle performance between YARN and Kubernetes, and give you critical tips to make shuffle performant when running Spark on Kubernetes.
We used the well-known TPC-DS benchmark to compare YARN and Kubernetes, as it is one of the most widely used benchmarks for Apache Spark and for distributed computing in general. The TPC-DS benchmark consists of two things: data and queries.
The performance of a distributed computing framework is multi-dimensional: both cost and duration should be taken into account. For example, which is better: a query that lasts 10 hours and costs $10, or a 1-hour query that costs $200? This depends on the needs of your company.
In this benchmark, we gave the same fixed amount of resources to YARN and Kubernetes. As a result, the cost of a query is directly proportional to its duration. This allows us to compare the two schedulers on a single dimension: duration.
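To make the proportionality concrete, here is a minimal sketch. It assumes the cluster is billed at a flat hourly rate. The rate and the durations below are made-up illustrative values, not figures from this benchmark.

```python
def query_cost(duration_hours: float, cluster_hourly_rate: float) -> float:
    """Cost of a query on a fixed-size cluster billed at a flat hourly rate."""
    return duration_hours * cluster_hourly_rate


# Hypothetical hourly price of the whole cluster (not a real benchmark figure).
CLUSTER_HOURLY_RATE = 2.0

# Hypothetical durations, in hours, of the same query on two schedulers.
durations = {"scheduler_a": 1.5, "scheduler_b": 1.2}

costs = {
    name: query_cost(hours, CLUSTER_HOURLY_RATE)
    for name, hours in durations.items()
}

# With identical resources, the cost ratio equals the duration ratio,
# so comparing durations is enough to compare costs.
cost_ratio = costs["scheduler_a"] / costs["scheduler_b"]
duration_ratio = durations["scheduler_a"] / durations["scheduler_b"]
```

Because the hourly rate cancels out in the ratio, the comparison does not depend on the actual price of the instances.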
This benchmark compares Spark running on Data Mechanics (deployed on Google Kubernetes Engine) with Spark running on Dataproc (GCP’s managed Hadoop offering).
Driver: n2-standard-4 instance
5 executors on n2-highmem-4 instances
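As a rough illustration, a setup like this could be expressed as Spark configuration properties passed to spark-submit. Only the executor count comes from the setup above. The core counts are assumptions based on the 4 vCPUs of these instance types, the helper name is hypothetical, and memory settings are deliberately left out because the article does not state them.

```python
def spark_submit_conf_flags(conf: dict) -> list:
    """Turn a dict of Spark properties into `--conf key=value` arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Sketch of the benchmark setup as Spark properties.
BENCHMARK_CONF = {
    # From the article: 5 executors.
    "spark.executor.instances": 5,
    # Assumption: one executor per n2-highmem-4 node using all 4 vCPUs.
    # In practice some CPU is usually reserved for system daemons.
    "spark.executor.cores": 4,
    # Assumption: the driver uses the 4 vCPUs of its n2-standard-4 node.
    "spark.driver.cores": 4,
}

flags = spark_submit_conf_flags(BENCHMARK_CONF)
```

The resulting list can be appended to a spark-submit command line. The same properties apply to both schedulers, which is what keeps the resources fixed across the comparison.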
We ran each query 5 times and reported the median duration.
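The median is used because it is less sensitive than the mean to a single unusually slow or fast run. A minimal sketch of this aggregation, with made-up durations rather than real measurements:

```python
from statistics import median


def median_duration(run_durations_s):
    """Return the median duration, in seconds, of repeated runs of one query."""
    if not run_durations_s:
        raise ValueError("at least one run is required")
    return median(run_durations_s)


# Hypothetical durations (seconds) of 5 runs of the same query.
# The fourth run is an outlier, for example caused by a slow node.
runs = [41.2, 39.8, 40.5, 55.1, 40.1]

reported = median_duration(runs)  # 40.5: the outlier does not affect the result
```

With these values the mean would be pulled up to about 43.3 seconds by the outlier, while the median stays at 40.5 seconds.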
We used the recently released 3.0 version of Spark in this benchmark. It brings substantial performance improvements over Spark 2.4, which we’ll show in a future blog post.