What is Glue? A Full ETL Pipeline Explained

Ever wondered how big tech companies design their production ETL pipelines? Curious how terabytes (or more) of data are pulled from their sources, parsed efficiently, and loaded into a database or other storage where data scientists and data analysts can easily use them?

In this post, I will explain in detail (with graphical representations!) the design and implementation of an ETL process using AWS services (Glue, S3, Redshift). Readers with no previous exposure to AWS Glue or the AWS stack, or even without deep development experience, should be able to follow along easily. This walkthrough should serve as a good starting guide for anyone interested in using AWS Glue.

Before we dive into the walkthrough, let’s briefly answer three (3) commonly asked questions:

What actually is AWS Glue?

What are the features and advantages of using Glue?

What is the real-world scenario?

What is AWS Glue?

So what is Glue? AWS Glue is simply a serverless ETL tool. ETL refers to three (3) steps that most Data Analytics / Machine Learning workflows need: Extract, Transform, Load. That means extracting data from a source, transforming it into the shape the applications need, and then loading it into the data warehouse. And AWS helps us make the magic happen. The AWS console UI offers a straightforward way to carry out the whole task from start to finish, without writing extra scripts.
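
To make the three steps concrete, here is a minimal sketch of extract, transform, and load in plain Python. It does not use AWS Glue at all: the inline CSV text stands in for a source file, an in-memory SQLite database stands in for the data warehouse, and the `orders` table, its columns, and every function name are hypothetical, chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input, standing in for a source file (e.g. a CSV in a bucket).
RAW_CSV = """order_id,amount,country
1,19.99,us
2,,de
3,5.00,US
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, upper-case the country."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> transform -> load and return the number of rows in the table."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    # In-memory SQLite is an assumption standing in for a real warehouse.
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(RAW_CSV, connection))
```

Each function maps to one letter of ETL, so the same shape carries over when the source, the transformation rules, or the target change. A managed service such as Glue takes over the running and scaling of this kind of job.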
