In this Spark Structured Streaming blog series, we will take a deep look at what Structured Streaming is, explained in plain, layman-friendly language. So let’s get started.

Introduction

Structured Streaming is a stream processing engine built on top of the Spark SQL engine that uses the Spark SQL APIs. It is fast, scalable and fault-tolerant. It provides rich, unified, high-level APIs in the form of DataFrames and Datasets, which allow us to deal with complex data and a wide variety of workloads. Like Spark’s batch processing, it also has a rich ecosystem of data sources that it can read from and write to.

Philosophy

The philosophy behind the development of Structured Streaming is that:

“We as end users should not have to reason about streaming.”

What this means is that we, as end users, should only write batch-like queries, and it is Spark’s job to figure out how to run them on a continuous stream of data and continuously update the result as new data flows in.

Background

The thought/realization that the developers of Structured Streaming had, and which led to its development, is:

“We can always treat a stream of data as an unbounded table.”

This means that every record in the data stream is like a row that is appended to the table.

Thus, we can represent both batches (static, bounded data) and streams (unbounded data) as tables, which allows us to express our computations in terms of tables and let Spark figure out how to actually run them on either static data or streaming data.
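To make this concrete, here is a minimal sketch of the idea, assuming a directory of JSON event files with an eventType column (both of which are my own illustrative assumptions, not part of this post’s example): the same batch-like computation is written once, and only the read call decides whether it runs over a static table or over the unbounded, streaming one.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("unbounded-table-sketch").getOrCreate()
    import spark.implicits._

    // Batch: a static, bounded table read from a directory of JSON files
    val staticDF = spark.read.json("/data/events/")

    // Streaming: the same directory treated as an unbounded table;
    // every new file adds rows that are "appended" to that table
    val streamingDF = spark.readStream
      .schema(staticDF.schema)   // streaming file sources need an explicit schema
      .json("/data/events/")

    // The same batch-like query works on both representations
    val countsStatic    = staticDF.groupBy($"eventType").count()
    val countsStreaming = streamingDF.groupBy($"eventType").count()

The streaming counts only start updating once the query is attached to a sink and started, which is exactly what the rest of this post walks through.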

Structure of Streaming Query

To understand the structure of a streaming query, let’s look at a simple example.

(This example is taken from Databricks.)

Let’s say we have set up an ETL pipeline in which we receive JSON data from Kafka, want to parse this data and convert it into a structured form, and finally write it out as Parquet files. We also want end-to-end fault-tolerance guarantees, as we don’t want any failure to drop data or create duplicate data.

Reading the data (Defining Source)

The first step is to create a DataFrame from Kafka i.e. we need to specify where to read the data from. In this case, we need to specify the following things:

  1. The format of the source as “kafka”
  2. The IP addresses of the Kafka brokers (bootstrap servers)
  3. The topic name from which the data is to be consumed
  4. The offsets from which we want to consume data; this can be earliest, latest or any custom offset

There are multiple built-in supported sources such as File, Kafka and Kinesis. We can also have multiple input streams and join or union those streams together, as sketched below. (We will discuss this in an upcoming blog.)
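As a quick, hedged illustration of that point (the schema, directory path and variable names below are assumptions of mine, not part of the Databricks example), a second stream could be defined from a file source and, if the schemas match, unioned with the Kafka stream:

    import org.apache.spark.sql.types.{StructType, StringType, DoubleType}

    // Hypothetical schema for the incoming JSON records
    val eventSchema = new StructType()
      .add("deviceId", StringType)
      .add("temperature", DoubleType)

    // A second streaming source reading JSON files dropped into a directory
    val fileStreamDF = spark.readStream
      .format("json")
      .schema(eventSchema)           // file sources require an explicit schema
      .load("/ingest/landing/")      // hypothetical landing directory

    // Two streams with the same schema can be combined into one
    // val combinedDF = kafkaStreamDF.union(fileStreamDF)

With that noted, the full Kafka-to-Parquet pipeline of our example looks like this: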

    // Imports needed by this example
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.streaming.{OutputMode, Trigger}
    import scala.concurrent.duration._
    import spark.implicits._

    // `schema` is assumed to be a StructType describing the JSON payload
    spark.readStream

      // Defining Source
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "topicName")
      .option("startingOffsets", "latest")
      .load

      // Transformation
      .select($"value".cast(StringType))
      .select(from_json($"value", schema).as("data"))

      // Defining Sink
      .writeStream
      .format("parquet")
      .option("path", "...")

      // Processing Trigger
      .trigger(Trigger.ProcessingTime(1.minutes))

      // Output Mode
      .outputMode(OutputMode.Append)

      // Checkpoint Location
      .option("checkpointLocation", "...")

      .start()

The spark.readStream ... load part of this code returns a DataFrame, the single unified API for manipulating both batch and streaming data in Spark with the same operations, while the final start() call returns a StreamingQuery that runs the pipeline continuously.
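As a small follow-up sketch (the parsedStream variable is hypothetical and simply stands for the transformed streaming DataFrame from the example above), binding the result of start() to a value gives us that StreamingQuery handle, which we can use to monitor and control the running pipeline:

    import org.apache.spark.sql.streaming.StreamingQuery

    // Hypothetical: parsedStream is the transformed streaming DataFrame from above
    val query: StreamingQuery = parsedStream.writeStream
      .format("parquet")
      .option("path", "...")
      .option("checkpointLocation", "...")
      .start()

    println(query.status)     // is the query active, is a trigger being processed, etc.
    query.awaitTermination()  // keep the driver alive until the query stops or fails

Without awaitTermination(), the driver program would exit immediately, because start() is non-blocking and the query runs in the background.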
