This article lists the four most common reasons for a FetchFailed exception in Apache Spark.

Shuffle operations are the backbone of almost all Spark jobs that aggregate, join, or restructure data. During a shuffle operation (without the support of an external shuffle service), data moves across the nodes of the cluster in a two-step process:

a) Shuffle Write: Each shuffle map task writes the data to be shuffled to a disk file, arranged according to the shuffle reduce tasks that will consume it. The portion of shuffle data that one shuffle map task writes for one shuffle reduce task is called a shuffle block. Each shuffle map task also informs the driver about the shuffle data it has written.

b) Shuffle Read: Shuffle reduce tasks query the driver for the locations of their shuffle blocks. They then establish connections with the executors hosting those blocks and start fetching them. Once a block is fetched, it is available for further computation in the reduce task.
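To make the two steps concrete, here is a minimal toy model in plain Python. It is not Spark code: the names `shuffle_write`, `shuffle_read`, `partition_for`, `run_word_count`, the `FetchFailed` class, and the executor IDs are all hypothetical stand-ins chosen for illustration. The sketch assumes two map tasks on two executors and shows how a reduce task fails when an executor hosting one of its blocks is gone.

```python
from collections import defaultdict


class FetchFailed(Exception):
    """Toy stand-in for Spark's FetchFailed error (not the real class)."""


def partition_for(key, num_reducers):
    # Deterministic toy partitioner (an assumption for this sketch).
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_write(map_id, executor, records, num_reducers, disks, driver):
    """Group one map task's records into per-reducer blocks and register them."""
    blocks = defaultdict(list)
    for key, value in records:
        blocks[partition_for(key, num_reducers)].append((key, value))
    for reduce_id, block in blocks.items():
        disks[executor][(map_id, reduce_id)] = block  # "disk file" on the executor
        driver[reduce_id].append((map_id, executor))  # report location to the driver


def shuffle_read(reduce_id, disks, driver, lost_executors=()):
    """Ask the driver where the blocks live, then fetch each one."""
    fetched = []
    for map_id, executor in driver[reduce_id]:
        if executor in lost_executors:
            raise FetchFailed(f"block ({map_id}, {reduce_id}) on {executor}")
        fetched.extend(disks[executor][(map_id, reduce_id)])
    return fetched


def run_word_count(lost_executors=()):
    num_reducers = 2
    disks = defaultdict(dict)   # executor -> {(map_id, reduce_id): block}
    driver = defaultdict(list)  # reduce_id -> [(map_id, executor), ...]
    map_inputs = {
        0: ("exec-1", [("a", 1), ("b", 1)]),
        1: ("exec-2", [("a", 1), ("c", 1)]),
    }
    for map_id, (executor, records) in map_inputs.items():
        shuffle_write(map_id, executor, records, num_reducers, disks, driver)
    counts = {}
    for reduce_id in range(num_reducers):
        for key, value in shuffle_read(reduce_id, disks, driver, lost_executors):
            counts[key] = counts.get(key, 0) + value
    return counts
```

Calling `run_word_count()` completes the aggregation, while `run_word_count(lost_executors={"exec-2"})` raises the toy `FetchFailed` as soon as a reduce task tries to fetch a block that was written on the lost executor.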

Although the two-step shuffle process sounds simple, it is operationally intensive because it involves data sorting, disk writes and reads, and network transfers. The reliability of a shuffle operation is therefore always in question, and the evidence of this unreliability is the commonly encountered FetchFailed exception. Most Spark developers spend considerable time troubleshooting this exception: first they find its root cause, and then they apply the appropriate fix.

Having troubleshot hundreds of Spark jobs recently, I have found that the FetchFailed exception mainly occurs for the following reasons:

  1. Out of Heap memory on Executors
  2. Low Memory Overhead on Executors
  3. Shuffle block greater than 2 GB
  4. Network timeout
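As a rough orientation, each reason can be paired with the Spark configuration properties that are commonly examined when it is suspected. The pairing and the values in the sketch below are assumptions for illustration, not recommendations from this article; the helper `to_submit_args` and the dictionary `TUNING_BY_REASON` are hypothetical names, while the property keys themselves are standard Spark settings.

```python
# Illustrative values only; they are assumptions, not tuned recommendations.
TUNING_BY_REASON = {
    "Out of heap memory on executors": {
        "spark.executor.memory": "8g",
    },
    "Low memory overhead on executors": {
        "spark.executor.memoryOverhead": "2g",
    },
    "Shuffle block greater than 2 GB": {
        # More partitions means smaller blocks; this key applies to
        # DataFrame/SQL shuffles, RDD jobs would repartition instead.
        "spark.sql.shuffle.partitions": "800",
    },
    "Network timeout": {
        "spark.network.timeout": "300s",
        "spark.shuffle.io.maxRetries": "10",
        "spark.shuffle.io.retryWait": "30s",
    },
}


def to_submit_args(reasons):
    """Turn the suspected reasons into spark-submit `--conf key=value` arguments."""
    args = []
    for reason in reasons:
        for key, value in TUNING_BY_REASON[reason].items():
            args += ["--conf", f"{key}={value}"]
    return args
```

For example, `to_submit_args(["Network timeout"])` yields the three `--conf` pairs for the timeout and retry settings, which could be appended to a `spark-submit` command line.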

Let’s understand each of these reasons in detail:

