Welcome back, folks, to this blog series on Spark Structured Streaming. This blog is a continuation of the earlier post, "Understanding Stateful Streaming", and covers handling late-arriving data in Spark Structured Streaming. So let's get started.

Handling Late Data

With window aggregates (discussed in the previous blog), Spark automatically takes care of late data. Every aggregate window acts like a bucket: as soon as we receive data for a new time window, Spark opens a bucket and starts counting the records that fall into it. By default (without a watermark), these buckets stay open indefinitely, so a record arriving even 5 hours late still finds its old bucket and increments the count there.
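To make the bucketing idea concrete, here is a minimal sketch in plain Scala (no Spark, all names illustrative): each fixed-size window is keyed by its start time, and an event is assigned to a bucket purely by its event timestamp, so a late arrival still lands in, and updates, its original bucket.

```scala
// Sketch of window bucketing: events carry a timestamp (minutes since
// midnight here, for simplicity) and are counted per 10-minute window.
object WindowBuckets {
  val windowMinutes = 10

  // Map an event's timestamp to the start of its window bucket.
  def windowStart(eventMinute: Int): Int =
    (eventMinute / windowMinutes) * windowMinutes

  // Count events per window, regardless of the order they arrive in:
  // a late event is grouped into its original bucket by timestamp.
  def countByWindow(eventMinutes: Seq[Int]): Map[Int, Int] =
    eventMinutes.groupBy(windowStart).map { case (w, es) => w -> es.size }

  def main(args: Array[String]): Unit = {
    // Arrival order: 12:05, then 12:21, then a late 12:07 record.
    val arrivals = Seq(725, 741, 727)
    // The 12:00-12:10 bucket (start = 720) ends up with count 2,
    // because the late 12:07 event updated the old bucket.
    println(countByWindow(arrivals)) // Map(720 -> 2, 740 -> 1)
  }
}
```

In real Spark Structured Streaming the same effect comes from grouping by `window(col("timestamp"), "10 minutes")`, with `withWatermark` deciding how long a bucket's state is kept around before it is dropped.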

