Broadcasting PySpark Accumulators

In this post, I am going to discuss an interesting pattern involving a broadcast that comes in handy. Before going into more details, let us refresh what Spark Accumulators are.

A shared variable that can be accumulated, i.e., has a commutative and associative “add” operation. Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using [**_value_**](https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.Accumulator.value). Updates from the workers get propagated automatically to the driver program.

_Source:_ https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.Accumulator

Three Commandments of Accumulators

  1. Accumulators can only be used for commutative and associative “add” operations. For any other operation, we have to use a custom implementation (a minimal sketch follows this list). More on that later.
  2. The accumulator can be “updated” in a worker task, but that task can’t access its value.
  3. The accumulator can be updated and accessed in the driver program.
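
As a quick preview of what such a custom implementation can look like, here is a minimal sketch (an illustration, not code from this post) that uses PySpark's AccumulatorParam to accumulate a list instead of a number:

```python
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    def zero(self, initial_value):
        # Starting value for each task's local copy of the accumulator
        return []

    def addInPlace(self, acc1, acc2):
        # Merge two partial results (the merge should stay order-insensitive)
        acc1.extend(acc2)
        return acc1

sc = SparkContext.getOrCreate()
seen = sc.accumulator([], ListAccumulatorParam())

def track(x):
    global seen
    seen += [x]  # wrap the item in a list so addInPlace can extend it

sc.parallelize([1, 2, 3]).foreach(track)
print(seen.value)  # e.g. [1, 2, 3]; ordering is not guaranteed
```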

A Few Lines of Code Are Worth a Thousand Words

Let us walk through a simple example of an accumulator, sketched below.
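
A minimal sketch of such an example could look like this (the names cnt, add_items, and global_accumulator follow the walkthrough below; the SparkContext setup and exact function bodies are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Accumulator defined at the global (driver) level
cnt = sc.accumulator(0)

def add_items(x):
    # Worker tasks may only add to the accumulator, never read it
    global cnt
    cnt += x

def global_accumulator(rdd):
    rdd.foreach(add_items)
    # Only the driver is allowed to read the accumulated value
    print(cnt.value)

global_accumulator(sc.parallelize([1, 2, 3]))  # prints 6
```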

In the above example code, cnt is defined at the global level. The add_items method adds the input x to cnt. add_items is later applied to each item of the rdd in the global_accumulator method. This is a typical use of an accumulator, and in the end the call to global_accumulator will print 6, which is the summation of 1, 2, and 3. Note that we need to declare cnt as global, otherwise the various methods won't be able to access it and it would come up as undefined.

#pyspark #object-oriented #data-science #spark #programming

Seamus Quitzon

Laravel real time event broadcasting with socket.io example

In today's technology age, real-time data has become essential. So here, in this article, I will share Laravel real-time event broadcasting with socket.io. In this example we will see how we can broadcast real-time events.

Basically, to achieve this in our Laravel application, we will use the predis package (a Redis client for PHP), queues, socket.io, laravel-echo-server and event broadcasting.

So, we will need to install the things listed above, which are not provided in a Laravel application by default. I will walk through all the steps to implement real-time event broadcasting, starting from a fresh installation of a Laravel application. If you have already installed a Laravel application, you can jump directly to the next step. So without wasting time, let's start the implementation process and follow the steps given below.

Step 1: Install Laravel

To install a fresh Laravel application, you just need to run the following command in your terminal or command prompt.

composer create-project --prefer-dist laravel/laravel realtimeapp

I have given the name realtimeapp to this application; you are free to give it any name of your choice.

Now, run the following commands to give write permissions to the storage and cache directories.

sudo chmod -R 777 /var/www/html/realtimeapp/storage
sudo chmod -R 777 /var/www/html/realtimeapp/bootstrap/cache

Step 2: Install predis

In this second step, we will need to install predis. So open your terminal and run the following command.

composer require predis/predis

Step 3: Create event for broadcasting

Now we will need to create an event for broadcasting, and in the event file we will set the channel and write the message. So run the following command to create the event.

php artisan make:event SendMessage

The above command will create an event file SendMessage.php under the app/Events directory. So open this file and update it like below.

app/Events/SendMessage.php

<?php

namespace App\Events;

use Illuminate\Broadcasting\Channel;
use Illuminate\Queue\SerializesModels;
use Illuminate\Broadcasting\PrivateChannel;
use Illuminate\Broadcasting\PresenceChannel;
use Illuminate\Foundation\Events\Dispatchable;
use Illuminate\Broadcasting\InteractsWithSockets;
use Illuminate\Contracts\Broadcasting\ShouldBroadcast;
use Illuminate\Contracts\Broadcasting\ShouldBroadcastNow;

class SendMessage implements ShouldBroadcastNow
{
    use Dispatchable, InteractsWithSockets, SerializesModels;

    public $data = ['asas'];

    /**
     * Create a new event instance.
     *
     * @return void
     */
    public function __construct()
    {

    }

    /**
     * Get the channels the event should broadcast on.
     *
     * @return \Illuminate\Broadcasting\Channel|array
     */
    public function broadcastOn()
    {
        return new Channel('user-channel');
    }

    /**
     * The event's broadcast name.
     *
     * @return string
     */
    public function broadcastAs()
    {
        return 'UserEvent';
    }
    /**
     * Get the data to broadcast with the event.
     *
     * @return array
     */
    public function broadcastWith()
    {
        return ['title'=>'Notification message will go here'];
    }
}

#laravel #how to broadcast events in laravel #laraevl socket.io #laravel broadcasting event #laravel event broadcasting #laravel realtime event broadcasting with socket.io

Data Transformation in PySpark

Data is now growing faster than processing speeds. One of the many solutions to this problem is to parallelise our computing on large clusters. A tool that allows us to do just that is PySpark.

However, PySpark requires you to think about data differently.

Instead of looking at a dataset row-wise, PySpark encourages you to look at it column-wise. This was a difficult transition for me at first. I'll tell you the main tricks I learned so you don't have to waste your time searching for the answers.
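
As a small illustration of that column-wise mindset (a hypothetical sketch, not from the original article; the DataFrame and column names are made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# A toy DataFrame; in Pandas you might be tempted to loop over rows
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Column-wise thinking: describe the new column as an expression over the
# whole "age" column instead of computing it row by row
df = df.withColumn("age_next_year", F.col("age") + 1)
df.show()
```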

#data-transformation #introduction-to-pyspark #pyspark #data-science

Introduction to PySpark — Part 2

In the previous blog, we started with an introduction to Apache Spark: why it is preferred, its features and advantages, and its architecture and working, along with its industrial use cases. In this article we'll get started with PySpark, Apache Spark using Python! By the end of this article, you'll have a better understanding of what PySpark is, why we choose Python for Spark, and its features and advantages, followed by a quick installation guide to set up PySpark on your own computer. Finally, this article will shed some light on some of the important concepts in Spark needed to proceed further.

What is PySpark?

_Image source: Databricks_

As we have already discussed, Apache Spark supports Python along with other languages, to make things easier for developers who are more comfortable working with Python. Python, being a relatively easier programming language to learn and use than Spark's native language Scala, is preferred by many for developing Spark applications. As we all know, Python is the de facto language for many data analytics workloads. While Apache Spark is the most extensively used big data framework today, Python is one of the most widely used programming languages, especially for data science. So why not integrate them? This is where PySpark, Python for Spark, comes in. PySpark was released in order to support Python with Apache Spark. As many data scientists and analysts use Python for its rich libraries, integrating it with Spark gives the best of both worlds. With strong support from the open source community, PySpark was developed using the Py4j library to interface with the RDDs in Apache Spark from Python. High-speed data processing, powerful caching, real-time and in-memory computation, and low latency are some of the features of PySpark that make it better than other data processing frameworks.

Why choose Python for Spark?

_Image source: becominghuman.ai_

Python is easier to learn and use than other programming languages, thanks to its syntax and standard libraries. Being dynamically typed, Python allows Spark's RDDs to hold objects of multiple types. Moreover, Python has an extensive and rich set of libraries for a wide range of utilities like machine learning, natural language processing, visualization, local data transformations and many more.
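
A tiny illustration of that point about dynamic typing (a hypothetical snippet, not from the original article):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A single RDD holding values of several Python types at once
mixed = sc.parallelize([1, "two", 3.0, (4, "four")])
print(mixed.map(lambda v: type(v).__name__).collect())
# ['int', 'str', 'float', 'tuple']
```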

While Python has many libraries like Pandas, NumPy and SciPy for data analysis and manipulation, these libraries are memory-bound and designed for a single-node system. Hence, they are not ideal for working with very large datasets on the order of terabytes and petabytes. With Pandas, scalability is an issue. In cases of real-time or near real-time flows of data, where a large amount of data needs to be brought into an integrated space for transformation, processing and analysis, Pandas would not be an optimal choice. Instead, we need a framework that does the work faster and more efficiently by means of distributed and pipelined processing. This is where PySpark comes into action.

#pyspark #introduction-to-pyspark #data-science #python #apache-spark #apache

Paula Hall

Syntax Gotchas Writing PySpark When Knowing Pandas

If you have some basic knowledge of data analysis with Python Pandas, are curious about PySpark, and don't know where to start, tag along.

Python Pandas encouraged us to leave Excel tables behind and to look at data from a coder's perspective instead. Data sets became bigger and bigger, turning from databases into data files and then into data lakes. Some smart minds from Apache blessed us with the Scala-based framework Spark to process these bigger amounts of data in a reasonable time. Since Python is the go-to language for data science nowadays, a Python API called PySpark soon became available.

For a while now I have been trying to conquer this Spark interface with its non-Pythonic syntax that everybody in the big data world praises. It took me a few attempts, and it's still a work in progress. However, in this post I want to show you, if you are also starting to learn PySpark, how to replicate the same analysis you would otherwise do with Pandas.

The data analysis example we are going to look at can be found in the book “Python for Data Analysis” by Wes McKinney. In that analysis, the aim is to find the top ranked movies in the MovieLens 1M data set, which is acquired and maintained by the GroupLens Research project at the University of Minnesota.

As a coding framework I used Kaggle, since it comes with the convenience of notebooks that have the basic data science modules installed and are ready to go with two clicks.

You can also find the complete analysis and the PySpark code in this Kaggle notebook, and the Pandas code in this one. We won't replicate the same analysis here, but instead focus on the syntax differences when handling Pandas and PySpark dataframes. I will always show the Pandas code first, followed by the PySpark equivalent.

The basic functions that we need for this analysis are listed below (a short Pandas vs. PySpark sketch follows the list):

  • Loading data from csv format
  • Combining datasets from different tables
  • Extracting information
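
Below is a small side-by-side sketch of these three steps (a hypothetical illustration rather than code from the notebooks mentioned above; the file and column names are placeholders):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Loading data from CSV format
ratings_pd = pd.read_csv("ratings.csv")   # user_id, movie_id, rating
movies_pd = pd.read_csv("movies.csv")     # movie_id, title

ratings_sp = spark.read.csv("ratings.csv", header=True, inferSchema=True)
movies_sp = spark.read.csv("movies.csv", header=True, inferSchema=True)

# 2. Combining datasets from different tables
merged_pd = ratings_pd.merge(movies_pd, on="movie_id")   # Pandas join
merged_sp = ratings_sp.join(movies_sp, on="movie_id")    # PySpark join

# 3. Extracting information: average rating per title, highest first
top_pd = merged_pd.groupby("title")["rating"].mean().sort_values(ascending=False)
top_sp = (merged_sp.groupBy("title")
                   .avg("rating")
                   .orderBy("avg(rating)", ascending=False))

print(top_pd.head())
top_sp.show(5)
```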

#python #data-science #pyspark #introduction-to-pyspark #pandas-dataframe