I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. AWS Data Wrangler's integration with multiple AWS big data services like S3, the Glue Catalog, Athena, databases, EMR, and others makes life simple for engineers. It also lets you import packages like Pandas and PyArrow to help write transformations.
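To give a feel for that integration, here is a minimal sketch of pulling a Glue Catalog table into a Pandas DataFrame via Athena. It assumes awswrangler 2.x is installed and AWS credentials are configured; the database and table names are placeholders.

```python
import awswrangler as wr

# Query a Glue Catalog table through Athena and get a pandas DataFrame back.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM sales LIMIT 10",  # hypothetical table name
    database="my_glue_database",         # hypothetical Glue Catalog database
)

# From here, ordinary pandas transformations apply.
print(df.head())
```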

In this blog post I will walk you through a hypothetical use case: reading data from a Glue Catalog table to obtain a filter value, then using that value to retrieve data from Redshift. I will create a Glue connection to Redshift, use AWS Data Wrangler with AWS Glue 2.0 to read data from the Glue Catalog table, retrieve the filtered data from the Redshift database, and write the resulting dataset to S3 (a rough sketch of that flow follows below). Along the way I will also cover troubleshooting Glue network connection issues.
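Here is a rough sketch of what that flow could look like with awswrangler 2.x inside a Glue 2.0 Python job. The database, table, connection, and bucket names are placeholders, not the ones used later in this post, and the package is assumed to be available on the job (for example via the `--additional-python-modules` job parameter).

```python
import awswrangler as wr

# 1. Read the lookup table from the Glue Catalog (Parquet data on S3) to obtain the filter value.
lookup_df = wr.s3.read_parquet_table(
    database="my_glue_database",   # hypothetical Glue Catalog database
    table="filter_values",         # hypothetical lookup table
)
filter_value = lookup_df["customer_id"].iloc[0]

# 2. Connect to Redshift through a pre-created Glue connection and pull the filtered rows.
con = wr.redshift.connect(connection="my-redshift-glue-connection")  # hypothetical connection name
orders_df = wr.redshift.read_sql_query(
    sql=f"SELECT * FROM public.orders WHERE customer_id = '{filter_value}'",
    con=con,
)
con.close()

# 3. Write the result set to S3 as Parquet and register it back in the Glue Catalog.
wr.s3.to_parquet(
    df=orders_df,
    path="s3://my-bucket/curated/filtered_orders/",  # hypothetical bucket/prefix
    dataset=True,
    database="my_glue_database",
    table="filtered_orders",
    mode="overwrite",
)
```

The Glue connection referenced in step 2 is what gives the job network access to the Redshift cluster, which is exactly where the network troubleshooting discussed later comes into play.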

AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing.

#aws #etl #aws-glue

Using AWS Data Wrangler with AWS Glue Job 2.0