I am starting a new series of articles about what I build and what I learn on weekends.

As I mentioned before, I have always believed that to be a great Engineering Leader, we should know how to “fight.”

This weekend, I will build a batch-based product data pipeline using GCP stacks.

Here is the data flow:

  1. We are going to scrape the Amazon Audible website and use it as a mock product data source. (Disclaimer: this is for self-learning purposes only.)
  2. Transform the data using Apache Beam + Dataflow (see the sketch after this list).
  3. Upload the transformed data to GCS.
  4. Load the data from GCS into BigQuery.
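
To make steps 2 to 4 concrete, here is a minimal Apache Beam pipeline sketch in Go. The bucket name, file paths, and the `normalize` transform are placeholders I made up for illustration; the real transformation would parse and clean the scraped product records.

```go
package main

import (
	"context"
	"flag"
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

// normalize is a placeholder transform; the real parsing and cleaning
// of the scraped product records would go here.
func normalize(line string) string {
	return strings.TrimSpace(line)
}

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()

	// Read the raw scraped records from GCS (hypothetical bucket/paths).
	raw := textio.Read(s, "gs://my-audible-bucket/raw/products-*.json")

	// Transform each record.
	cleaned := beam.ParDo(s, normalize, raw)

	// Write the transformed records back to GCS for the BigQuery load.
	textio.Write(s, "gs://my-audible-bucket/clean/products.json", cleaned)

	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}
```

Run it locally with the direct runner, or pass `--runner=dataflow` (plus the usual project, region, and staging-location flags) to execute it on Dataflow. From GCS, step 4 is then a standard BigQuery load job, e.g. via the `bq load` CLI.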

Let’s start with the high-level architecture.

[Figure: Batch Data Pipeline with GCP stacks]

Web Scraping

We are going to scrape the entire catalog of 515,845 Audible items using Go’s excellent concurrency features. Here are the detailed steps:

  1. Scrape the category page to get each category’s link and the total number of pages it has (a concurrency sketch follows).
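
As a sketch of this step, the snippet below fetches the category index and fans out across the discovered links with a bounded worker pool, which is the Go concurrency pattern in play. The index URL and the `a.category-link` selector are assumptions for illustration (the real Audible markup would need to be inspected), and goquery is just one convenient HTML-parsing choice.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

// fetchCategoryLinks scrapes the category index page and returns each
// category link. Both the URL and the CSS selector are hypothetical.
func fetchCategoryLinks(indexURL string) ([]string, error) {
	resp, err := http.Get(indexURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	doc.Find("a.category-link").Each(func(_ int, sel *goquery.Selection) {
		if href, ok := sel.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return links, nil
}

func main() {
	links, err := fetchCategoryLinks("https://www.audible.com/categories") // hypothetical URL
	if err != nil {
		panic(err)
	}

	// Fan out across categories with a bounded worker pool so we never
	// have more than a handful of requests in flight at once.
	sem := make(chan struct{}, 8)
	var wg sync.WaitGroup
	for _, link := range links {
		wg.Add(1)
		sem <- struct{}{}
		go func(url string) {
			defer wg.Done()
			defer func() { <-sem }()
			// Per-category scraping (paging through product lists) goes here.
			fmt.Println("scraping", url)
		}(link)
	}
	wg.Wait()
}
```

Bounding the pool with a buffered channel keeps the fan-out polite: goroutines make concurrency cheap, but an unbounded burst of requests would likely get throttled or blocked.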

