Fetch Failed Exception in Apache Spark: Decrypting the most common causes

Fetch Failed Exception in Apache Spark: Decrypting the most common causes

Most Spark developers spend considerable time in troubleshooting the Fetch Failed Exceptions observed during shuffle operations. This story would serve you the most common causes of a Fetch Failed Exception and would reveal the results of a recent poll conducted on the Exception.

Shuffle operations are the backbone of almost all Spark Jobs that are aimed at data aggregation, joins, or data restructuring. During a shuffle operation, the data is shuffled across various nodes of the cluster via a two-step process:

a) Shuffle Write: Shuffle map tasks write the data to be shuffled in a disk file, the data is arranged in the file according to shuffle reduce tasks. Bunch of shuffle data corresponding to a shuffle reduce task written by a shuffle map task is called a shuffle block. Further, each of the shuffle map tasks informs the driver about the written shuffle data.

b) Shuffle Read: Shuffle reduce tasks queries the driver about the locations of their shuffle blocks. Then these tasks establish connections with the executors hosting their shuffle blocks and start fetching the required shuffle blocks. Once a block is fetched, it is available for further computation in the reduce task.

To know more about the shuffle process, you can refer to my earlier story titled: [Revealing Apache Spark Shuffling Magic_](https://medium.com/swlh/revealing-apache-spark-shuffling-magic-b2c304306142)._

The two-step process of a shuffle although sounds simple, but is operationally intensive as it involves data sorting, disk writes/reads, and network transfers. Therefore, there is always a question mark on the reliability of a shuffle operation, and the evidence of this unreliability is the commonly encountered ‘FetchFailed Exception’ during the shuffle operation. Most Spark developers spend considerable time in troubleshooting this widely encountered exception. First, they try to find out the root cause of the exception, and then accordingly put the right fix for the same.

A Fetch Failed Exception, reported in a shuffle reduce task, indicates the failure in reading of one or more shuffle blocks from the hosting executors. Debugging a FetchFailed Exception is quite challenging since it can occur due to multiple reasons. Finding and knowing the right reason is very important because this would help you in putting the right fix to overcome the Exception.

Troubleshooting hundreds of Spark Jobs in recent time has realized me that Fetch Failed Exception mainly comes due to the following reasons:

  • Out of Heap memory on Executors
  • Low Memory Overhead on Executors
  • Shuffle block greater than 2 GB
  • Network TimeOut.

To understand the frequency of these reasons, I also conducted a poll recently on the most followed LinkedIn group on Spark, Apache Spark. To my surprise, quite a lot of people participated in the poll and submitted their opinion which kind of further establishes the fact that people frequently encounter this exception in their Spark Jobs. Here are the results of the poll:

Image for post

Results of the Poll conducted on Fetch Failed Exception in LinkedIn Apache Spark Group

According to the poll results, ‘Out of Heap memory on a Executor’ and the ‘Shuffle block greater than 2 GB’ are the most voted reasons. These are then followed by ‘Network Timeout’ and ‘Low memory overhead on a Executor’.

techno programming artificial-intelligence software-development data-science

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Offshore Software Development - Best Practices

To make the most out of the benefits of offshore software development, you should understand the crucial factors that affect offshore development.

Data Science Course in Dallas

Become a data analysis expert using the R programming language in this [data science](https://360digitmg.com/usa/data-science-using-python-and-r-programming-in-dallas "data science") certification training in Dallas, TX. You will master data...

50 Data Science Jobs That Opened Just Last Week

Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments. Our latest survey report suggests that as the overall Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments, data scientists and AI practitioners should be aware of the skills and tools that the broader community is working on. A good grip in these skills will further help data science enthusiasts to get the best jobs that various industries in their data science functions are offering.

Data Science Explained | Data Science For Beginners | Data Science

This Data Science tutorial will help you in understanding what is Data Science, why we need Data Science, prerequisites for learning Data Science, what does ...

AI or Data Science? | Artificial Intelligence And Data Science Career

There are many intersections and overlaps between AI and data science. AI has numerous subsets, like Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP). With many career opportunities in both fields, there are lots of conflicting perspectives on educational paths for starting a career in one of these fields.