In Part 1, we looked at extracting a csv file from an FTP server and loading it into Google BigQuery using Cloud Functions. In this article, we will be doing the same thing but this time, we will be extracting data from a MySQL database instead.

There are a lot of ETL tools out there and sometimes they can be overwhelming, especially when you simply want to copy a file from point A to B. So today, I am going to show you how to extract data from a MySQL database (Extract), modify it (Transform) and load it into a Google BigQuery table (Load) using python 3.6 and Google Cloud Functions.

In this article, we will look to do the following:

  • Set up a Cloud Function
  • Extract data
  • Transform data
  • Load data
  • Automate our pipeline

Firstly, what is ETL?

Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source or in a different context than the source. — Wikipedia

This is what our ETL pipeline will look like in the end:

Image for post

ETL Pipeline (Created with Lucidchart)

Google Cloud FunctionsCloud Functions (CF) is Google Cloud’s Serverless platform set to execute scripts responding to specified events, such as a HTTP request or a database update. An alternative to CF is AWS Lambda or Azure Functions.

Setting up your Cloud Function

  • Go to the Cloud Functions Overview page.
  • Make sure that the project for which you enabled Cloud Functions is selected.
  • Click Create function.
  • Name your function.
  • In the Trigger field, select HTTP Trigger.
  • In the Source code field, select Inline editor. In this exercise, we will be writing some custom code, so you can delete the default code in the editor.
  • Use the Runtime dropdown to select a runtime.
  • Authentication: In this example, we will “Allow unauthenticated invocations” but it is recommended to use a service account to access the Cloud Function and grant “permission invoker” for cloud function on the Cloud Scheduler service account that we will use. More details at the end of the article.

Make sure your runtime is set to “Python 3.7” and under “Advanced options” change the region to one closest to you. As at the writing of this post, CF isn’t available in every Google data-centre region, so check here to see where Cloud Functions is enabled.

#data-science #sql #python #data-engineering #google-cloud-platform #function

 Building a Simple ETL Pipeline with Python
23.65 GEEK