When a Spark application runs on a Yarn cluster in cluster deploy mode, the driver container, which runs the application master, is the first container launched by the cluster resource manager. After initializing its components, the application master launches the primary driver thread in the same container. The driver thread runs the main method of the Spark application. The first thing the main method does is initialize the Spark context, which hosts the key driver components responsible for driving and supervising the application's execution on the cluster. Once the Spark context is initialized, the driver thread starts executing the required Spark actions on the cluster using the services of the Spark context.
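
The launch order described above can be mimicked with a toy model in plain Python. This is only a sketch of the sequence, not Spark's code: `ToySparkContext`, `user_main`, `application_master` and the `events` log are hypothetical names made up for illustration, and the "action" is just a local sum rather than work shipped to executors.

```python
import threading

events = []  # ordered log of the startup steps, for illustration only


class ToySparkContext:
    """Hypothetical stand-in for the Spark context (not the real class)."""

    def __init__(self, app_name):
        self.app_name = app_name
        events.append("spark context initialized")

    def run_action(self, data):
        # Stand-in for an action; a real driver would schedule it on executors.
        events.append("action executed")
        return sum(data)


def user_main():
    """Stand-in for the main method of the Spark application."""
    events.append("main method started")
    sc = ToySparkContext("toy-app")    # first thing main does
    result = sc.run_action([1, 2, 3])  # actions go through the context
    events.append(f"result={result}")


def application_master():
    """Stand-in for the application master inside the driver container."""
    events.append("application master initialized")
    # The driver thread lives in the same process (container) as the master.
    driver = threading.Thread(target=user_main, name="Driver")
    driver.start()
    driver.join()  # wait for the application's main method to finish
    events.append("driver thread finished")


application_master()
for event in events:
    print(event)
```

The printed log follows the order in the paragraph: the application master comes up first, the main method starts in the driver thread, the context is created, and only then does an action run.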

Here is the big picture, showing the key components of the driver container of a Spark application running on a Yarn cluster.

Key Components in a Driver container of a Spark Application running on a Yarn Cluster

**Application Master:** Every Spark application is provided with an application master by the cluster resource manager, which starts it in the driver container of the Spark application. Once started, the application master invokes the main method of the Spark application in a separate driver thread inside the same driver container. It also sets up a communication endpoint so that the driver thread and the application master can exchange messages. Finally, the application master starts a resource allocator, which acts as an agent that fulfills the driver thread's requests for computing resources (executors) in the cluster.
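
To make the roles of the endpoint and the resource allocator more concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not Spark's implementation: `ToyApplicationMaster`, `ToyAllocator`, the message names `request_executors` and `stop`, and the fixed `cluster_capacity` are all hypothetical. A queue plays the part of the communication endpoint, and the allocator simply hands out executor names until the requested total or the assumed capacity is reached.

```python
import queue
import threading

results = []  # executor lists seen by the driver thread, for illustration


class ToyAllocator:
    """Hypothetical agent that turns executor requests into grants."""

    def __init__(self, cluster_capacity):
        self.cluster_capacity = cluster_capacity  # assumed fixed cluster size
        self.granted = []

    def allocate(self, requested_total):
        # Grant executors up to the requested total, capped by capacity.
        target = min(requested_total, self.cluster_capacity)
        while len(self.granted) < target:
            self.granted.append(f"executor-{len(self.granted) + 1}")
        return list(self.granted)


class ToyApplicationMaster:
    """Hypothetical application master: endpoint plus allocator."""

    def __init__(self, cluster_capacity):
        self.endpoint = queue.Queue()  # driver thread -> application master
        self.allocator = ToyAllocator(cluster_capacity)

    def run(self, user_main):
        # Run the application's main method in a separate driver thread.
        driver = threading.Thread(
            target=user_main, args=(self.endpoint,), name="Driver"
        )
        driver.start()
        # Serve the driver's requests until it says it is done.
        while True:
            kind, payload, reply_box = self.endpoint.get()
            if kind == "stop":
                break
            if kind == "request_executors":
                reply_box.put(self.allocator.allocate(payload))
        driver.join()


def user_main(endpoint):
    """Stand-in for the driver thread asking for computing resources."""
    reply_box = queue.Queue()
    for wanted in (2, 5):  # second request exceeds the toy capacity of 4
        endpoint.put(("request_executors", wanted, reply_box))
        results.append(reply_box.get())
    endpoint.put(("stop", None, None))


ToyApplicationMaster(cluster_capacity=4).run(user_main)
print(results)
```

The driver thread never talks to the cluster directly in this sketch. It only posts requests to the endpoint, and the allocator decides what can actually be granted, which is the division of labor the paragraph describes.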

Deep Dive Into the Apache Spark Driver on a Yarn Cluster