The Story of a Migration from EMR to Spark on Kubernetes

In this article, the co-founder of Lingk tells the story of their migration from EMR to the Spark-on-Kubernetes platform managed by Data Mechanics: their goals, the architecture of the solution, the challenges they had to address, and the results they obtained.

Goals of this migration

Lingk.io is a data loading, data pipelines, and integration platform built on top of Apache Spark, serving commercial customers, with expertise in the education sector. In a few clicks from their visual interface, their customers can load, deduplicate, and enrich data from dozens of sources.

Under the hood, Lingk used AWS EMR (Elastic MapReduce) to power their product, but they were facing a few issues:

  • EMR required too much infrastructure management for their DevOps team, which had limited Spark experience: picking the right cluster instance types, memory settings, Spark configurations, and so on.
  • Their total AWS costs were high. They suspected that EMR's autoscaling policies were inefficient and that a lot of compute resources were being wasted.
  • Spark apps took 40 seconds to start on average. That is a long time for Lingk’s end users to wait, particularly when they are building a new data pipeline or integration.
  • The core Spark application was stuck on an earlier version because upgrading to Spark 3.0+ caused unexplained performance regressions.
