In this post, I am going to discuss an interesting pattern involving broadcast variables that comes in handy. Before going into the details, let us refresh what Spark Accumulators are.
> A shared variable that can be accumulated, i.e., has a commutative and associative “add” operation. Worker tasks on a Spark cluster can add values to an Accumulator with the `+=` operator, but only the driver program is allowed to access its value, using [**_value_**](https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.Accumulator.value). Updates from the workers get propagated automatically to the driver program.

_Source:_ https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.Accumulator
Let us walk through a simple example of an accumulator.
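A minimal version might look like the sketch below. I assume a local `SparkContext` and an RDD built from `[1, 2, 3]`; the names `cnt`, `add_items`, and `global_accumulator` match the walkthrough that follows.

```python
from pyspark import SparkContext

sc = SparkContext("local", "accumulator_example")

# cnt is defined at the global level as an accumulator starting at 0
cnt = sc.accumulator(0)

def add_items(x):
    # declare cnt as global so the += below refers to the module-level accumulator
    global cnt
    cnt += x  # worker tasks can only add to the accumulator

def global_accumulator(rdd):
    # apply add_items to each item of the rdd on the workers
    rdd.foreach(add_items)
    # only the driver is allowed to read the accumulated value
    print(cnt.value)

global_accumulator(sc.parallelize([1, 2, 3]))  # prints 6
```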
In the above example code, `cnt` is defined at the global level. The `add_items` method adds its input `x` to `cnt`, and `add_items` is later applied to each item of the RDD inside the `global_accumulator` method. This is a typical use of an accumulator: the final call to `global_accumulator` prints 6, which is the sum of 1, 2, and 3. Note that we need to declare `cnt` as global; otherwise the various methods would not be able to access it and it would come up as undefined.