How-to guide: Set up, manage & monitor Spark on Kubernetes (with code examples)

Earlier this year at Spark + AI Summit, we had the pleasure of presenting our session on the best practices and pitfalls of running Apache Spark on Kubernetes (K8s).

In this post we’d like to expand on that presentation and talk to you about:

  1. What is Kubernetes?
  2. Why run Spark on Kubernetes?
  3. Getting started with Spark on Kubernetes
  4. Optimizing performance and cost
  5. Monitoring your Spark applications on Kubernetes
  6. The future of Spark on Kubernetes

If you’re already familiar with K8s and why Spark on Kubernetes might be a fit for you, feel free to skip the first couple of sections and get straight to the meat of the post!

What is Kubernetes (K8s)?

Kubernetes (also known as Kube or K8s) is an open-source container orchestration system initially developed at Google, open-sourced in 2014, and now maintained by the Cloud Native Computing Foundation. Kubernetes is used to automate the deployment, scaling, and management of containerized apps, most commonly Docker containers.

It offers many features critical to stability, security, performance, and scalability, like:

  1. Horizontal Scalability
  2. Automated Rollouts & Rollbacks
  3. Load Balancing
  4. Secrets & Config Management
  5. …and many more

Kubernetes has become the standard for infrastructure management in the traditional software development world. But it isn’t as popular in the big data scene, which is too often stuck with older technologies like Hadoop YARN. That is, until Spark-on-Kubernetes joined the game!
