Google Cloud Dataflow is a fully-managed service for executing Apache Beam pipelines within the Google Cloud Platform(GCP). In a recent blog post, Google announced a new, more services-based architecture called Runner v2 to Dataflow – which will include multi-language support for all of its language SDKs.

The company redesigned the Dataflow runner for Apache Beam in a second version, offering:

  • Multi-language support
  • Increased parity across SDKs, including state and timer support in Python
  • More I/O’s for Python developers using the cross-language framework, including Kafka I/O
  • Custom container support
  • Increased throughput using SplittableDoFns
  • Improved performance

With the multi-language support, development teams can share components within their organization written in their preferred language and weave them into a single, high-performance, distributed processing pipeline, Google stated in the blog post. Before the second version of Runner, this was not possible.

Runner V2 has a more efficient and portable worker architecture rewritten in C++, which is based on Apache Beam’s new portability framework. Moreover, Google packaged this framework together with Dataflow Shuffle for batch jobs and Streaming Engine for streaming jobs, allowing them to provide a standard feature set from now on across all language-specific SDKs, as well as share bug fixes and performance improvements. The critical component in the architecture is the worker Virtual Machines (VM), which run the entire pipeline and have access to the various SDKs.

#google #etl #database #cloud #data #google cloud platform #big data #devops #development #architecture & design #ai # ml & data engineering #news

Google Announces a New, More Services-Based Architecture Called Runner V2
2.50 GEEK