In this post, I am going to discuss an interesting pattern with broadcast variables that comes in handy. Before going into more detail, let us refresh what Spark Accumulators are.

A shared variable that can be accumulated, i.e., has a commutative and associative “add” operation. Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using [**_value_**](https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.Accumulator.value). Updates from the workers get propagated automatically to the driver program.

_Source:_ https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.Accumulator
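As a quick illustration of that definition, here is a minimal sketch, assuming a local SparkContext (the variable name total is mine): worker tasks add to the accumulator, and only the driver reads its value.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

total = sc.accumulator(0)  # shared variable owned by the driver
# Each worker task adds its values; .add() is equivalent to += inside a task
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))
print(total.value)  # reading .value is only allowed on the driver; prints 10
```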

Three Commandments of Accumulators

  1. Accumulators can only be used for a commutative and associative “add” operation. For any other operation, we have to use a custom implementation (a sketch follows this list); more on that later.
  2. An accumulator can be “updated” in a worker task, but that task cannot access its value.
  3. An accumulator can be both updated and accessed in the driver program.
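To give a concrete idea of what a custom implementation can look like, here is a minimal sketch using pyspark.AccumulatorParam to accumulate values into a list instead of a numeric sum. The ListAccumulatorParam name and the local SparkContext are my own assumptions for illustration, not something from the post.

```python
from pyspark import SparkContext, AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    """Custom accumulator that collects values into a list."""

    def zero(self, initial_value):
        # Starting value for each task's local copy of the accumulator
        return []

    def addInPlace(self, acc, value):
        # 'value' is either a single element added by a task
        # or another partial list being merged on the driver
        if isinstance(value, list):
            acc.extend(value)
        else:
            acc.append(value)
        return acc

sc = SparkContext.getOrCreate()
collected = sc.accumulator([], ListAccumulatorParam())
sc.parallelize([1, 2, 3]).foreach(lambda x: collected.add(x))
print(collected.value)  # e.g. [1, 2, 3]; ordering is not guaranteed
```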

A Few Lines of Code Are Worth a Thousand Words

Let us walk through a simple example of an accumulator.
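Here is a minimal sketch of that example, assuming a local SparkContext; the names cnt, add_items, and global_accumulator match the walkthrough below.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Accumulator defined at the global (module) level
cnt = sc.accumulator(0)

def add_items(x):
    # Declare cnt as global so this worker-side function can update it
    global cnt
    cnt += x

def global_accumulator():
    global cnt
    rdd = sc.parallelize([1, 2, 3])
    # Apply add_items to every element of the RDD on the workers
    rdd.foreach(add_items)
    # Only the driver program can read the accumulated value
    print(cnt.value)

global_accumulator()  # prints 6
```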

In the example code above, cnt is defined at the global level. The add_items method adds its input x to cnt. add_items is then applied to each item of the RDD inside the global_accumulator method. This is a typical use of an accumulator, and the final call to global_accumulator prints 6, the sum of 1, 2, and 3. Note that we need to declare cnt as global inside these methods; otherwise Python treats it as a local variable there and the code fails with an undefined-variable error.

