AWS Glue is a pay-as-you-go, serverless ETL tool that requires very little infrastructure setup. It automates much of the effort involved in writing, executing and monitoring ETL jobs. If your data is structured, you can take advantage of crawlers, which infer the schema, identify file formats and populate metadata in the Glue Data Catalog. Based on your specified ETL criteria, Glue can automatically generate Python or Scala code for you, and it provides a UI for job monitoring and scheduling.
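To make the crawler idea concrete, here is a minimal sketch of registering a crawler over an S3 prefix with boto3. The crawler name, IAM role ARN, database name and bucket path are all hypothetical placeholders, and the helper functions are illustrative rather than part of any AWS library. Only `create_and_start_crawler` talks to AWS, so it needs boto3 and valid credentials.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build the keyword arguments for the Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,                               # IAM role the crawler assumes
        "DatabaseName": database,                       # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # where the raw files live
    }


def create_and_start_crawler(request: Dict[str, Any]) -> None:
    """Create the crawler and kick off a run (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Once a crawler like this has run, the inferred tables appear in the Data Catalog and can be used as sources for Glue jobs.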

In comparison, EMR is a big data platform designed to reduce the cost of processing and analysing huge amounts of data. It is a managed service where you configure your own cluster of EC2 instances. You have complete control over the configuration and can install Hadoop ecosystem components, which makes EMR an incredibly flexible, and correspondingly complex, service. Its use cases are vast: data scientists can use EMR to run machine learning jobs with the TensorFlow library, analysts can run SQL queries on Presto, and engineers can use EMR's integration with streaming technologies such as Kinesis or Spark Streaming. The list goes on!
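By contrast, an EMR cluster has to be described explicitly. The sketch below builds a request for boto3's `run_job_flow` call. The cluster name, release label, instance types and node counts are assumptions chosen for illustration, and the role names are the EMR defaults, which may not exist in your account. The helper functions are hypothetical, not an AWS API.

```python
from typing import Any, Dict, List


def build_cluster_request(name: str, applications: List[str], core_nodes: int) -> Dict[str, Any]:
    """Build the keyword arguments for the EMR client's run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one that suits you
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile name
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role name
    }


def launch_cluster(request: Dict[str, Any]) -> str:
    """Start the cluster and return its ID (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical cluster with the tools mentioned above installed.
cluster_request = build_cluster_request(
    name="analytics-cluster",
    applications=["Spark", "Presto", "TensorFlow"],
    core_nodes=2,
)
```

The amount of detail you must supply here (instance groups, roles, applications) is the flexibility and the complexity that Glue hides from you.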

You could replace Glue with EMR but not vice versa: EMR has far more capabilities than its serverless counterpart.

AWS Glue Vs. EMR: Differentiating two of the best ETL platforms