Teresa Jerde

1597452410

Spark Structured Streaming – Stateful Streaming

Welcome back, folks, to this blog series on Spark Structured Streaming. This post continues from the earlier one, “Internals of Structured Streaming”, and covers Stateful Streaming in Spark Structured Streaming. So let’s get started.

Let’s start with the basics of what Stateful Stream Processing is. To understand that, we first need to understand what Stateless Stream Processing is.

In the previous posts of this series, I discussed Stateless Stream Processing.

You can read them before moving ahead: Introduction to Structured Streaming and Internals of Structured Streaming.

#analytics #apache spark #big data and fast data #ml #ai and data engineering #scala #spark #streaming #streaming solutions #tech blogs #stateful streaming #structured streaming

Gerhard Brink

1622108520

Stateful stream processing with Apache Flink (part 1): An introduction

Apache Flink, a 4th generation Big Data processing framework, provides robust **stateful stream processing capabilities**. Over the next few parts of this blog series, we will learn what stateful stream processing is and how we can use Flink to write a stateful streaming application.

What is stateful stream processing?

In general, stateful stream processing is an application design pattern for processing an unbounded stream of events. Stateful stream processing means that a **“state”** is shared between events (stream entities), so past events can influence the way current events are processed.

Let’s try to understand it with a real-world scenario. Suppose we have a system that is responsible for generating a report of the total number of vehicles that pass through a toll plaza per hour and per day. To achieve this, we save the count of the vehicles that passed through the toll plaza within each hour. That count is then added to the counts of the following hours to find the total number of vehicles that passed through the toll plaza within 24 hours. The count we are storing here is nothing but the “state” of the application.
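
The toll plaza scenario can be sketched in a few lines of plain Python. This is only an illustration of the idea of state, not Flink code: the class `TollCounter` and its methods are hypothetical names made up for this example, and everything runs in a single process.

```python
from collections import defaultdict


class TollCounter:
    """Hypothetical helper: keeps per-hour vehicle counts as state across events."""

    def __init__(self):
        # The state: hour of day -> number of vehicles seen in that hour.
        self.hourly_counts = defaultdict(int)

    def on_vehicle(self, hour):
        # Each incoming event updates the state left behind by earlier events.
        self.hourly_counts[hour] += 1

    def daily_total(self):
        # The 24-hour report is accumulated from the stored hourly counts.
        return sum(self.hourly_counts.values())


counter = TollCounter()
for hour in [0, 0, 1, 1, 1, 23]:  # one entry per vehicle, tagged with its hour
    counter.on_vehicle(hour)

print(dict(counter.hourly_counts))  # {0: 2, 1: 3, 23: 1}
print(counter.daily_total())        # 6
```

Without `hourly_counts`, each vehicle event would be processed in isolation and the daily total could not be computed from the stream alone.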

This may seem very simple, but in a distributed system stateful stream processing is very hard to achieve. It is much more difficult to scale because different workers need to share the state. Flink provides ease of use, high efficiency, and high reliability for **_state management_** in a distributed environment.

#apache flink #big data and fast data #flink #streaming #streaming solutions #big data analytics #fast data analytics #flink streaming #stateful streaming #streaming analytics

Spark Streaming: Adding Spark to Streaming

In today’s world we have a lot of data, and it will only keep growing. According to one widely cited estimate, the world held around 44 zettabytes of data in total by 2020, and by 2025 approximately 463 exabytes would be created every 24 hours worldwide. Have you ever wondered how one can store or process this much data? One solution is Apache Spark, and in this blog I’m going to discuss Spark Streaming.

What is Spark Streaming?

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can ingest data from many sources, such as Kafka and Flume, and process it using functions such as map and reduce. After processing, we can push the data to file systems, databases, and even live dashboards.

In Spark Streaming we work on near-real-time data. It divides the received input stream into batches. The Spark engine processes these batches and generates the final output, also in batches.
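
As a rough illustration of this batching step, here is a minimal sketch in plain Python. It does not use the Spark API: the function `micro_batches`, the event tuples, and the one-second interval are all assumptions made up for the example.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of `batch_interval` seconds."""
    batches = {}
    for timestamp, value in events:
        # Every event falls into exactly one interval, identified by its index.
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    # Return the non-empty batches in time order.
    return [batches[batch_id] for batch_id in sorted(batches)]


# Hypothetical input stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

batches = micro_batches(events, batch_interval=1.0)
# "Processing" each batch independently: here we simply count its records.
results = [len(batch) for batch in batches]

print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
print(results)  # [2, 1, 2]
```

The output is itself a sequence of per-batch results, which mirrors the "input in batches, output in batches" flow described above.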

Spark DStream

A DStream (discretized stream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data. You can create DStreams in two ways:

  • By taking an input stream from sources such as Kafka, Flume etc.
  • By applying functions on input DStream that will produce another DStream.

Internally, a DStream is a sequence of RDDs, so we can also call it a continuous stream of RDDs. An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is partitioned and computed across the nodes of the cluster. Every RDD in a DStream contains the data from a certain interval. Any operation you apply on a DStream is applied to all the underlying RDDs.
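
The relationship between a DStream and its RDDs can be pictured with a small toy model in plain Python. The class `MiniDStream` is a hypothetical stand-in, not the Spark `DStream` class: each tuple plays the role of one immutable per-interval RDD, and `map` is applied to every one of them.

```python
class MiniDStream:
    """Toy model of a DStream: an ordered sequence of immutable per-interval batches."""

    def __init__(self, batches):
        # Tuples stand in for RDDs here because they are immutable collections.
        self.batches = [tuple(batch) for batch in batches]

    def map(self, fn):
        # An operation on the stream is applied to every underlying batch
        # and produces a new stream instead of modifying the old one.
        return MiniDStream([tuple(fn(item) for item in batch) for batch in self.batches])


# Two intervals of text lines (made-up data).
lines = MiniDStream([["a b", "c"], ["d e f"]])
word_counts = lines.map(lambda line: len(line.split()))

print(word_counts.batches)  # [(2, 1), (3,)]
print(lines.batches)        # [('a b', 'c'), ('d e f',)]  (the input is unchanged)
```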

Spark Streaming Sources and Receivers

Spark Streaming provides two categories of streaming sources. You can create an input DStream using these sources.

The categories are as follows:

  • Basic sources: Sources directly available in the StreamingContext API. For example: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking by adding extra dependencies.

Also, every input stream (except a file stream) has a receiver object. The job of the receiver is to receive the data from the input stream and store it in Spark’s memory.

There are two kinds of receivers, depending on whether or not the source supports acknowledgment:

  • **Reliable Receiver**: A reliable receiver correctly sends acknowledgment to a reliable source when the data has been received and stored in Spark with replication.
  • Unreliable Receiver: An unreliable receiver does not send acknowledgment to a source.
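
The difference between the two kinds of receiver can be sketched in plain Python. Nothing here is the Spark receiver API: `Source`, `reliable_receiver`, and `unreliable_receiver` are hypothetical names, and replication is left out to keep the example short.

```python
class Source:
    """Hypothetical source that tracks which records have not been acknowledged."""

    def __init__(self, records):
        self._records = list(records)
        self._position = 0
        self.unacked = set()

    def fetch(self):
        # Hand out the next record, or None when the source is exhausted.
        if self._position >= len(self._records):
            return None
        record = self._records[self._position]
        self._position += 1
        self.unacked.add(record)
        return record

    def ack(self, record):
        self.unacked.discard(record)


def reliable_receiver(source, store):
    for record in iter(source.fetch, None):
        store.append(record)  # store the record first...
        source.ack(record)    # ...then acknowledge it to the source


def unreliable_receiver(source, store):
    for record in iter(source.fetch, None):
        store.append(record)  # stored, but the source is never told


reliable_source, reliable_store = Source(["r1", "r2"]), []
reliable_receiver(reliable_source, reliable_store)

unreliable_source, unreliable_store = Source(["r1", "r2"]), []
unreliable_receiver(unreliable_source, unreliable_store)

print(reliable_source.unacked)            # set()
print(sorted(unreliable_source.unacked))  # ['r1', 'r2']
```

Because the reliable receiver acknowledges only after storing, a source that supports acknowledgment could resend anything still listed in `unacked` after a failure.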

You can create multiple input DStreams to receive multiple streams of data in parallel. This will create multiple receivers according to the number of streams.

#scala #spark #spark streaming #spark dstream

Art Lind

1601496000

Stateful Streaming in Spark

Apache Spark is a fast, general-purpose cluster computing system. In Spark, we can do both batch processing and stream processing. Its stream processing is near real-time, which means that it processes the data in micro-batches. I discussed Spark Streaming in more detail in my previous blog. In this blog, I’ll discuss Stateful Streaming in Spark. So let’s start!

#big data #scala #spark #spark streaming #streaming api

Chelsie Towne

1597801783

Spark Structured Streaming – Handling Late Data

Welcome back, folks, to this blog series on Spark Structured Streaming. This post continues from the earlier one, “Understanding Stateful Streaming”, and covers handling late-arriving data in Spark Structured Streaming. So let’s get started.

Handling Late Data

With window aggregates (discussed in the previous blog), Spark automatically takes care of late data. Every aggregate window is like a bucket: as soon as we receive data for a new time window, we automatically open a bucket and start counting the records that fall into it. These buckets stay open, so data that arrives even 5 hours late still updates its old bucket and increments the count.
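
The bucket behaviour can be sketched in plain Python. This is a simplified model rather than the Structured Streaming API: the 10-minute window size, the helper `window_start`, and the event times are assumptions chosen for the example.

```python
from collections import defaultdict

WINDOW_SIZE = 10 * 60  # assumed window length: 10 minutes, in seconds


def window_start(event_time):
    """Return the start of the window (bucket) that an event time falls into."""
    return event_time - (event_time % WINDOW_SIZE)


# Bucket state: window start -> number of records counted so far.
counts = defaultdict(int)

# Event times in seconds, listed in arrival order. The last record (45)
# arrives late, long after newer windows have already been opened.
arrivals = [30, 95, 610, 700, 1300, 45]

for event_time in arrivals:
    # Records are bucketed by event time, not arrival order, so the late
    # record still lands in (and increments) its original window.
    counts[window_start(event_time)] += 1

print(dict(counts))  # {0: 3, 600: 2, 1200: 1}
```

The late record with event time 45 raises the count of the first window from 2 to 3, which is the "old bucket is still updated" behaviour described above.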

#analytics #apache spark #ml #ai and data engineering #scala #spark #tech blogs #structured streaming #watermark