In this project we will demonstrate the use of:
✅Airflow to orchestrate and manage the data pipeline
✅AWS EMR for the heavy data processing
✅Airflow to create the EMR cluster and terminate it once processing is complete, to save on cost
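
The create, process, terminate flow listed above can be sketched roughly as follows. This is a minimal illustration, not the DAG from the linked repository: the bucket name, script path, EMR release label, instance types, and task IDs are all placeholder assumptions, and it assumes Airflow 2.4+ with the apache-airflow-providers-amazon package installed. The Airflow imports sit inside build_dag() so the configuration dictionaries can be inspected without Airflow present.

```python
# Hypothetical values throughout: the bucket, script path, release label and
# instance sizes are placeholders, not taken from the linked repository.
SCRIPT_S3_URI = "s3://my-example-bucket/scripts/transform.py"

# Cluster definition, in the shape expected by the EMR RunJobFlow API.
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-transient-cluster",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Keep the cluster up between steps; Airflow terminates it explicitly.
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# One Spark step, submitted through EMR's command-runner.jar.
SPARK_STEPS = [
    {
        "Name": "transform",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", SCRIPT_S3_URI],
        },
    }
]


def build_dag():
    """Build the DAG: create cluster -> add step -> wait -> terminate."""
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
        EmrTerminateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # The create task pushes the new cluster ID to XCom; later tasks pull it.
    cluster_id = "{{ ti.xcom_pull(task_ids='create_emr_cluster') }}"

    with DAG(
        dag_id="emr_spark_sketch",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # triggered manually in this sketch
        catchup=False,
    ) as dag:
        create = EmrCreateJobFlowOperator(
            task_id="create_emr_cluster",
            job_flow_overrides=JOB_FLOW_OVERRIDES,
        )
        add = EmrAddStepsOperator(
            task_id="add_steps",
            job_flow_id=cluster_id,
            steps=SPARK_STEPS,
        )
        wait = EmrStepSensor(
            task_id="wait_for_step",
            job_flow_id=cluster_id,
            step_id="{{ ti.xcom_pull(task_ids='add_steps')[0] }}",
        )
        terminate = EmrTerminateJobFlowOperator(
            task_id="terminate_emr_cluster",
            job_flow_id=cluster_id,
            # Run even if the step failed, so the cluster does not keep billing.
            trigger_rule="all_done",
        )
        create >> add >> wait >> terminate
    return dag
```

For Airflow to pick this up, a file in the DAGs folder would need a module-level line such as dag = build_dag(). The trigger_rule="all_done" on the terminate task is the part that serves the cost-saving goal: the cluster is shut down whether the Spark step succeeds or fails.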
Code:
---------------
https://github.com/SatadruMukherjee/Data-Preprocessing-Models/blob/main/ingest.sh
https://github.com/SatadruMukherjee/Data-Preprocessing-Models/blob/main/transform.py
https://github.com/SatadruMukherjee/Data-Preprocessing-Models/blob/main/airflow_emr_spark_s3_snowflake.py
https://github.com/SatadruMukherjee/Data-Preprocessing-Models/blob/main/airflow_emr_s3_snowflake_setup.txt
Subscribe: https://www.youtube.com/@KnowledgeAmplifier1/featured
#airflow #spark