There are various big data technologies on the market, such as Hadoop, Apache Spark, and Apache Flink, and maintaining them is a big challenge for both developers and businesses. Which tool is best for batch and streaming data? Is the performance of a single tool sufficient for our use case? How should you integrate different data sources? If these questions keep coming up in your business, you may want to consider Apache Beam.

[Apache Beam] is an open-source, unified model for constructing both batch and streaming data processing pipelines. Beam provides language-specific SDKs for writing pipelines against the Beam Model, such as Java, Python, and Go, and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, [Google Cloud Dataflow], and Hazelcast Jet.

We will be running this pipeline on Google Cloud Platform products, so you need to take advantage of the free tier, which lets you use these products up to their specified free usage limits. New users also get $300 to spend on Google Cloud Platform products during the [free trial].

Here we are going to use the [Python SDK] and Cloud Dataflow to run the pipeline.

The Anatomy of a Data Pipeline

Key Concepts of a Pipeline

  • **Pipeline:** manages a directed acyclic graph (DAG) of PTransforms and PCollections that is ready for execution.
  • **PCollection:** represents a collection of bounded or unbounded data.
  • **PTransform:** transforms input PCollections into output PCollections.
  • **PipelineRunner:** represents where and how the pipeline should execute.
  • **I/O transform:** Beam comes with a number of “IOs”: library PTransforms that read or write data to various external storage systems.
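
To see how these concepts map to code, here is a minimal sketch using the Python SDK; the file paths are placeholders and the default DirectRunner is used so it can run locally.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pipeline: holds the DAG of PTransforms and PCollections;
# the default runner (DirectRunner) executes it locally.
with beam.Pipeline(options=PipelineOptions()) as p:
    lines = (
        p
        | 'ReadInput' >> beam.io.ReadFromText('input.txt')   # I/O transform (source) -> PCollection
    )
    upper = lines | 'ToUpper' >> beam.Map(str.upper)          # PTransform: PCollection in, PCollection out
    upper | 'WriteOutput' >> beam.io.WriteToText('output')    # I/O transform (sink)
```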

I have clipped some commonly used higher-level transforms (PTransforms) below; we are going to use some of them in our pipeline.

Common Transforms in a Pipeline

ParDo is a primary Beam transform for generic parallel processing, which is not shown in the above image. The ParDo processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm: a ParDo transform considers each element in the input PCollection, performs some processing on that element, and emits zero, one, or multiple elements to an output PCollection.
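
A minimal ParDo sketch, assuming a PCollection of comma-separated text lines; the hypothetical `SplitCsvLine` DoFn below emits one list of fields per non-empty line and nothing for blank lines, illustrating the zero-or-more output behaviour.

```python
import apache_beam as beam

class SplitCsvLine(beam.DoFn):
    """Emits the fields of a CSV line, or nothing for blank lines."""

    def process(self, element):
        if not element.strip():
            return                    # zero output elements for blank lines
        yield element.split(',')      # one output element per non-empty line

# Applying the DoFn with ParDo:
# rows = lines | 'SplitLines' >> beam.ParDo(SplitCsvLine())
```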

**Pipe ‘|’** is the operator used to apply transforms, and each transform can optionally be given a unique label. Transforms can be chained, and we can compose arbitrary shapes of transforms; at runtime they are represented as a DAG, as the example below shows.
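
Because transforms can branch as well as chain, two transforms can read from the same PCollection, so the resulting graph is a DAG rather than a straight line; the values and labels below are purely illustrative.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    nums = p | 'CreateNums' >> beam.Create([1, 2, 3, 4, 5])

    # Two branches consume the same PCollection, so the pipeline forms a DAG.
    evens = nums | 'KeepEvens' >> beam.Filter(lambda n: n % 2 == 0)
    squares = nums | 'Square' >> beam.Map(lambda n: n * n)
```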

The above concepts are the core of building an Apache Beam pipeline, so let’s move on and create our first batch pipeline, which will clean the dataset and write it to BigQuery.

Basic flow of the pipeline

Pipeline Flow

  1. Read the data from a Google Cloud Storage bucket (batch).
  2. Apply some transformations, such as splitting the data on the comma separator, dropping unwanted columns, converting data types, etc.
  3. Write the data to the data sink (BigQuery) and analyze it (a code sketch of this flow follows the list).
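
A minimal sketch of this flow; the project, region, bucket, dataset, and table names are placeholders, and the cleaning step is a naive stand-in (the assumed column positions and a more careful, typed version follow the dataset description below).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder settings: swap in your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',        # use 'DirectRunner' to test locally first
    project='your-gcp-project',
    region='us-central1',
    temp_location='gs://your-bucket/temp',
)

def to_row(line):
    # Naive placeholder cleaning: split on commas and keep two assumed columns.
    fields = line.split(',')
    return {'name': fields[4], 'style': fields[5]}

with beam.Pipeline(options=options) as p:
    (
        p
        | 'ReadFromGCS' >> beam.io.ReadFromText('gs://your-bucket/beers.csv',
                                                skip_header_lines=1)
        | 'CleanRows' >> beam.Map(to_row)
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'your-gcp-project:beer_dataset.beers',
            schema='name:STRING,style:STRING',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```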

Here we are going to use the Craft Beers Dataset from Kaggle.

Description of the beer dataset

  • **abv:** the alcoholic content by volume, with 0 being no alcohol and 1 being pure alcohol.
  • **ibu:** International Bittering Units, which specify how bitter a drink is.
  • **name:** the name of the beer.
  • **style:** the beer style (lager, ale, IPA, etc.).
  • **brewery_id:** a unique identifier for the brewery that produces this beer.
  • **ounces:** the size of the beer in ounces.
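
Given these columns, the cleaning step from the earlier skeleton can be fleshed out as the DoFn sketched below; the column positions are an assumption (the raw Kaggle CSV also carries a row index and an id column that we drop), so adjust the indices to match your file.

```python
import csv

import apache_beam as beam

class ParseBeerRow(beam.DoFn):
    """Turns one CSV line into a typed dict, dropping rows that fail to parse."""

    def process(self, element):
        # csv.reader handles quoted fields (e.g. beer names containing commas).
        fields = next(csv.reader([element]))
        # Assumed column order: index, abv, ibu, id, name, style, brewery_id, ounces.
        try:
            yield {
                'abv': float(fields[1]) if fields[1] else None,
                'ibu': float(fields[2]) if fields[2] else None,
                'name': fields[4].strip(),
                'style': fields[5].strip(),
                'brewery_id': int(fields[6]),
                'ounces': float(fields[7]),
            }
        except (ValueError, IndexError):
            return  # skip malformed rows: a DoFn may emit zero elements

# In the earlier skeleton, swap the naive mapping for:
#   | 'CleanRows' >> beam.ParDo(ParseBeerRow())
# with a matching schema such as:
#   'abv:FLOAT,ibu:FLOAT,name:STRING,style:STRING,brewery_id:INTEGER,ounces:FLOAT'
```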

We will upload this dataset to a Google Cloud Storage bucket.
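
If you prefer to do the upload from Python rather than the Cloud Console, a minimal sketch with the google-cloud-storage client could look like this; the project, bucket, and file names are placeholders.

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client(project='your-gcp-project')
bucket = client.bucket('your-bucket')
# Upload the local CSV so the pipeline can read it from gs://your-bucket/beers.csv
bucket.blob('beers.csv').upload_from_filename('beers.csv')
```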
