A Spark stage can be understood as a compute block that processes the data partitions of a distributed collection, with the block executing in parallel across a cluster of computing nodes. Spark builds the parallel execution flow for a Spark application out of one or more such stages. Stages provide modularity, reliability, and resiliency to Spark application execution. Below are the important aspects of Spark stages:
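The idea of a stage as one compute block applied to every partition in parallel can be illustrated with a minimal local sketch. This is not Spark code: the partitioned list and the `compute_block` function are assumptions made purely for illustration, with a thread pool standing in for the cluster.

```python
# Simplified model (not Spark) of a stage: the same compute block runs
# once per partition, and the per-partition tasks execute in parallel.
from concurrent.futures import ThreadPoolExecutor

def compute_block(partition):
    # The per-partition work a stage would run as one task.
    return sum(x * x for x in partition)

# A "distributed" collection, modeled locally as a list of partitions.
partitions = [[1, 2], [3, 4], [5, 6]]

with ThreadPoolExecutor() as pool:
    # One task per partition, executed in parallel -- analogous to
    # a stage launching one task for each partition it must compute.
    results = list(pool.map(compute_block, partitions))

print(results)  # [5, 25, 61]
```

In real Spark the partitions live on different cluster nodes and the tasks are scheduled onto executors, but the shape of the computation is the same: one function, many partitions, parallel execution.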
Stages are created, executed, and monitored by the DAG scheduler: Every running Spark application has a DAG scheduler instance associated with it. The scheduler creates stages in response to the submission of a Job, where a Job essentially represents an RDD execution plan (also called the RDD DAG) corresponding to an action taken in the Spark application. Multiple Jobs may be submitted to the DAG scheduler if multiple actions are taken in the application. For each Job submitted to it, the DAG scheduler creates one or more stages, builds a stage DAG that captures the dependencies between the stages, and then plans an execution schedule for the created stages in accordance with that DAG. The scheduler also monitors the status of stage execution, which may turn out to be success, partial success, or failure. Accordingly, the scheduler attempts stage re-execution, concludes Job success or failure, or schedules dependent stages as per the stage DAG.
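The stage-creation step described above can be sketched with a toy model: walk the RDD lineage backwards from the RDD on which the action was taken, and start a new stage each time a wide (shuffle) dependency is crossed, so that each stage contains only narrow dependencies. The `RDD` class and `build_stages` function below are illustrative assumptions, not Spark's actual `DAGScheduler` code, which is far more involved.

```python
# Toy model of how a DAG scheduler might split an RDD lineage into
# stages at shuffle (wide) dependencies. Names and structure are
# simplifying assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class RDD:
    name: str
    parent: "RDD | None" = None
    wide: bool = False  # True if this RDD depends on its parent via a shuffle

def build_stages(final_rdd):
    """Walk the lineage backwards from the final RDD, closing the
    current stage whenever a wide (shuffle) dependency is crossed."""
    stages, current = [], []
    rdd = final_rdd
    while rdd is not None:
        current.append(rdd.name)
        if rdd.wide:  # shuffle boundary: this RDD starts a new stage
            stages.append(current)
            current = []
        rdd = rdd.parent
    stages.append(current)
    # Reverse so stages and the RDDs inside them read in execution order.
    return [list(reversed(s)) for s in reversed(stages)]

# Lineage for a word-count-like Job:
# textFile -> flatMap -> map -> (shuffle) -> reduceByKey
lineage = RDD("reduceByKey", RDD("map", RDD("flatMap", RDD("textFile"))), wide=True)
print(build_stages(lineage))
# [['textFile', 'flatMap', 'map'], ['reduceByKey']]
```

The two resulting lists correspond to the two stages Spark would create for such a Job: an upstream stage of narrow transformations that ends at the shuffle, and a downstream stage that consumes the shuffled data.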


Unraveling the Staged Execution in Apache Spark