Overcoming Apache Spark’s biggest pain points

I first used Apache Spark about six years ago, when knowing it was proof of initiation into the world of “Big Data” analytics. There was no question that mastering Spark was a duty of any would-be data scientist or data engineer; after all, how else could one leverage massive amounts of data and distributed CPU computing to create the best possible machine learning models?

If I could have looked ahead to 2020 back then, I would perhaps have been a bit surprised that a large percentage of ML/AI practitioners still do not use Spark, or use it only for data engineering and not for machine learning. Part of the reason, naturally, is the partial shift of interest toward GPU-oriented rather than CPU-oriented machine learning techniques, especially deep learning. But for most applications outside image and natural language processing, where the usefulness of CPU-oriented techniques is unquestionable, it is surprising that many data scientists still rely heavily on single-machine ML tools such as Scikit-learn and the non-distributed versions of XGBoost and LightGBM.

Personally, I feel this is a pity because, when used properly, Spark is an incredibly powerful tool for anyone who works with data. It saves us from wasting time figuring out how to fit large datasets into memory and processors, and it gives us full control of the data analytics workflow, including extracting data, generating models, and deploying models into production and testing.

Having conducted workshops and coached dozens of data scientists and engineers on Apache Spark, I have come to understand the biggest struggles users typically face with the tool, why they happen, and how to overcome them. This two-part guide is meant for those who want not only to be able to use Spark, but also to truly understand its internals in order to solve complex problems and produce high-performance, stable code.

Note that I assume the reader already has a basic understanding of Spark: for example, what Spark drivers and executors are, that datasets are divided into partitions, what lazy evaluation is, and what Spark’s basic data structures are.
