Dataflow Under the Hood: the origin story

Google Cloud’s Dataflow, part of our smart analytics platform, is a streaming analytics service that unifies stream and batch data processing. To get a better understanding of Dataflow, it helps to also understand its history, which starts with MillWheel.

A history of Dataflow

Like many projects at Google, MillWheel started in 2008 with a tiny team and a bold idea. When this project started, our team (led by Paul Nordstrom), wanted to create a system that did for streaming data processing what MapReduce had done for batch data processing—provide robust abstractions and scale to massive size. In those early days, we had a handful of key internal Google customers (from Search and Ads ), who were driving requirements for the system and pressure-testing the latest versions. What MillWheel did was build pipelines operating on click logs to attempt to compute real-time session information in order to better understand how to improve systems like Search for our customers. Up until this point, session information was computed on a daily basis, spinning up a colossal number of machines in the wee hours of the morning to produce results in time for when engineers logged on that morning. MillWheel aimed to change that by spreading that load over the entire day, resulting in more predictable resource usage, as well as vastly improved data freshness. Since a session can be an arbitrary length of time, this Search use case helped provide early motivation for key MillWheel concepts like watermarks and timers.

Alongside this session’s use case, we started working with the Google Zeitgeist team—now Google Trends—to look at an early version of trending queries from search traffic. In order to do this, we needed to compare current traffic for a given keyword to historical traffic so that we could determine fluctuations compared to the baseline. This drove a lot of the early work that we did around state aggregation and management, as well as efficiency improvements to the system, to handle cases like first-time queries or one-and-done queries that we’d never see again.

#google cloud platform #data analytics #cloud

A history of Dataflow

cloud.google.com

Dataflow Under the Hood: the origin story