In today’s world we have a lot of data. And this data will only grow more and more in future. According to a study, in 2020, the data produced is abound 44 zettabytes each day. And by 2025, approximately 463 exabytes would be created every 24 hours worldwide. Do you ever imagine how one can store or process this much data ?A solution to this is Apache Spark and in this blog I’m going to discuss about Spark Streaming here.
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources such as Kafka, Flume etc. and process it using functions such as map, reduce etc. After processing we can push data to filesystem, databases and even to live dashboards.
In Spark Streaming we work on near real time data. It divides the received input stream into batches. The Spark Engine processes the batches and generate final output in batches.
DStream (also known as discretized stream) is an abstraction of Spark Streaming. It represents a continuous stream of data. You can create DStreams in two ways :
Internally, a DStream is a sequence of RDD and so we can also say that it is a continuous stream of RDD . RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark which are an immutable collection of objects which computes on the different node of the cluster. Every RDD in DStream contains data from the certain interval. Also if you will apply any operation on a DStream, it applies to all the underlying RDDs.
Spark streaming provides 2 categories of Spark Streaming Sources. You can create an input DStream using these sources.
The categories are following :
Also, every input stream(except file stream) has a receiver object. The work of receiver object is to receive the data from input stream and store it in Spark’s memory.
There are two kinds of receiver(based on some sources allow acknowledgement and some not):
You can create multiple input DStreams to receive multiple streams of data in parallel. This will create multiple receivers according to the number of streams.
#scala #spark #spark streaming #spark dstream