Spark is widely regarded as a fast engine for processing large volumes of data, often cited as up to 100 times faster than MapReduce. It achieves this speed in two ways: it uses distributed processing, breaking data into smaller chunks that can be computed in parallel across machines, and it relies on in-memory rather than disk-based processing, which makes computation much faster.

Spark Streaming is one of the most important parts of the Big Data ecosystem. It is an extension of the core Apache Spark API used to process Big Data. Basically, it ingests data from sources like Twitter in real time, processes it using functions and algorithms, and pushes the results out to databases and other stores.

How to initiate Spark Streaming?

Configuring Spark

First we configure Spark and tell it where it should run: in local mode, on a standalone Spark cluster, or on a Mesos, YARN, or Kubernetes cluster. If you are unfamiliar with these terms, don't worry. These are cluster management systems Spark relies on for tasks such as checking node health and scheduling jobs. If you choose local mode as the master, you specify the number of cores on your local machine that you want Spark to use; the more cores, the better the parallelism. Specifying * means use all the cores on your system. Then we set the app name, which is the name we give to our Spark application.

SparkConf conf = new SparkConf().setAppName("SparkApp").setMaster("local[*]");

Creation of Streaming Context Object

Then we create a JavaStreamingContext object, which is the entry point for all streaming functionality. It provides methods to create JavaDStream and JavaPairDStream from input sources, which we'll discuss further on. When creating the JavaStreamingContext object, we need to specify a batch interval: Spark Streaming divides the incoming data into batches, so the final results are also generated in batches. The batch interval tells Spark how long to collect data for each batch; for example, with a 1-minute interval, each batch contains the data received over the last minute.
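Putting the two steps together, a minimal setup might look like the sketch below. The class name `StreamingSetup` is just an illustrative choice; the configuration reuses the `SparkConf` from above, and `Durations.minutes(1)` sets the 1-minute batch interval described here.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSetup {
    public static void main(String[] args) throws InterruptedException {
        // Local mode, using all available cores; app name as before
        SparkConf conf = new SparkConf()
                .setAppName("SparkApp")
                .setMaster("local[*]");

        // Batch interval of 1 minute: Spark collects incoming data for
        // 60 seconds, then processes that data as one micro-batch
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.minutes(1));

        // ... define DStreams and transformations here ...

        // Nothing runs until start() is called; awaitTermination()
        // blocks the driver so the streaming job keeps running
        jssc.start();
        jssc.awaitTermination();
    }
}
```

Note that the context does nothing until `start()` is called, and once started, no new streaming computations can be added to it.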

#apache-spark #spark #java #streaming #kafka

Spark Streaming for Beginners