There are various big data technologies on the market, such as Hadoop, Apache Spark, and Apache Flink, and maintaining them is a big challenge for both developers and businesses. Which tool is best for batch and streaming data? Are the performance and speed of one particular tool enough for our use case? How should you integrate different data sources? If these questions often come up in your business, you may want to consider Apache Beam.
[Apache Beam] is an open-source, unified model for constructing both batch and streaming data processing pipelines. Beam provides language-specific SDKs for writing pipelines against the Beam Model, such as Java, Python, and Go, as well as Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, [Google Cloud Dataflow], and Hazelcast Jet.
We will be running this pipeline using Google Cloud Platform products, so you will need to take advantage of the free tier, which lets you use these products up to their specified free usage limits. New users also get $300 to spend on Google Cloud Platform products during the [free trial].
Here we are going to use the [Python SDK] and Cloud Dataflow to run the pipeline.
I have clipped some commonly used higher-level transforms (PTransforms) below; we are going to use some of them in our pipeline.
ParDo is a primary Beam transform for generic parallel processing (it is not shown in the image above). The ParDo processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm: a ParDo transform considers each element in the input PCollection, performs some processing on that element, and emits zero, one, or multiple elements to an output PCollection.
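As a minimal sketch (the `ParseBeerRow` DoFn and its field layout below are hypothetical, just to illustrate the idea), a ParDo can parse raw CSV lines and emit zero elements for malformed rows and one element for valid ones:

```python
import apache_beam as beam

class ParseBeerRow(beam.DoFn):
    """Hypothetical DoFn: parses a CSV line and drops malformed rows."""

    def process(self, element):
        fields = element.split(',')
        if len(fields) < 2:
            # Emit zero elements for malformed input.
            return
        # Emit one element for a valid row; a DoFn may also yield many.
        yield {'name': fields[0], 'style': fields[1]}

with beam.Pipeline() as pipeline:
    beers = (
        pipeline
        | 'CreateRows' >> beam.Create(['Pale Ale,American IPA', 'bad_row'])
        | 'ParseRows' >> beam.ParDo(ParseBeerRow())
        | 'PrintRows' >> beam.Map(print)
    )
```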
**Pipe (`|`)** is the operator for applying transforms, and each transform can optionally be given a unique label. Transforms can be chained, and we can compose arbitrary shapes of transforms; at runtime they are represented as a DAG (directed acyclic graph).
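Here is a small word-count-style sketch (with made-up data) showing the pipe operator and per-step labels applied with `>>`; the chained steps form the DAG that the runner executes:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    style_counts = (
        pipeline
        | 'CreateStyles' >> beam.Create(['lager', 'ale', 'ipa', 'ale'])
        | 'PairWithOne' >> beam.Map(lambda style: (style, 1))
        | 'CountPerStyle' >> beam.CombinePerKey(sum)
        | 'PrintCounts' >> beam.Map(print)
    )
```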
The concepts above are the core of creating an Apache Beam pipeline, so let's move on and build our first batch pipeline, which will clean the dataset and write it to BigQuery.
Here we are going to use the Craft Beers Dataset from Kaggle.
Description of the beer dataset:

- **abv**: The alcoholic content by volume, with 0 being no alcohol and 1 being pure alcohol
- **ibu**: International bittering units, which specify how bitter a drink is
- **name**: The name of the beer
- **style**: Beer style (lager, ale, IPA, etc.)
- **brewery_id**: Unique identifier for the brewery that produces this beer
- **ounces**: Size of the beer in ounces
We will upload this dataset to a Google Cloud Storage bucket.
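Once the file is in the bucket, the batch pipeline can read it from there. As a rough sketch of where we are headed, the pipeline below reads the CSV from Cloud Storage, parses and cleans each row, and writes the result to BigQuery using the Dataflow runner. The project, region, bucket, table name, and the assumed column order are placeholders; adjust them to your own setup and verify the column positions against the actual CSV header.

```python
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_and_clean(line):
    # Assumed column order: index, abv, ibu, id, name, style, brewery_id, ounces.
    # Verify against the actual CSV header before running.
    row = next(csv.reader([line]))
    return {
        'abv': float(row[1]) if row[1] else None,
        'ibu': float(row[2]) if row[2] else None,
        'name': row[4],
        'style': row[5],
        'brewery_id': int(row[6]),
        'ounces': float(row[7]),
    }

options = PipelineOptions(
    runner='DataflowRunner',
    project='your-gcp-project',             # placeholder
    region='us-central1',                   # placeholder
    temp_location='gs://your-bucket/temp',  # placeholder
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadCSV' >> beam.io.ReadFromText(
            'gs://your-bucket/beers.csv', skip_header_lines=1)
        | 'ParseAndClean' >> beam.Map(parse_and_clean)
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'your-gcp-project:beers_dataset.beers',
            schema=('abv:FLOAT,ibu:FLOAT,name:STRING,style:STRING,'
                    'brewery_id:INTEGER,ounces:FLOAT'),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```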
#dataflow #batch-processing #data-pipeline #apache-beam #bigquery #data-analysis