AWS Step Functions is a fully managed service that coordinates a series of steps and chains them together into a state machine for automating tasks. It supports visual workflows, and state machines are defined as JSON structures in the Amazon States Language (ASL). In addition, state machines can be run on a schedule through an Amazon CloudWatch Events rule that uses a cron expression.
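To make the scheduling idea concrete, here is a minimal Python sketch of how such a rule might be assembled. It assumes boto3 and its `events` client (`put_rule` and `put_targets`); the rule name, ARNs, role, and input payload are hypothetical placeholders, not values from this walkthrough.

```python
import json


def build_schedule_rule(rule_name: str, cron_expression: str) -> dict:
    """Keyword arguments for the CloudWatch Events (EventBridge) put_rule call."""
    return {
        "Name": rule_name,
        # Six cron fields: minutes hours day-of-month month day-of-week year
        "ScheduleExpression": f"cron({cron_expression})",
        "State": "ENABLED",
    }


def build_rule_target(rule_name: str, state_machine_arn: str,
                      role_arn: str, execution_input: dict) -> dict:
    """Keyword arguments for put_targets, pointing the rule at a state machine."""
    return {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "start-state-machine",
                "Arn": state_machine_arn,
                # Role that lets the rule start executions of the state machine
                "RoleArn": role_arn,
                "Input": json.dumps(execution_input),
            }
        ],
    }


def schedule_state_machine(rule_kwargs: dict, target_kwargs: dict) -> None:
    """Create the rule and attach the target. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above work without boto3

    events = boto3.client("events")
    events.put_rule(**rule_kwargs)
    events.put_targets(**target_kwargs)


# Hypothetical values, for illustration only.
rule = build_schedule_rule("demo-nightly-run", "0 6 * * ? *")  # 06:00 UTC daily
target = build_rule_target(
    "demo-nightly-run",
    "arn:aws:states:us-east-1:123456789012:stateMachine:Demo",
    "arn:aws:iam::123456789012:role/demo-events-invoke-sfn",
    {"run": "nightly"},
)
```

Calling `schedule_state_machine(rule, target)` would create the rule in your account. The builders are kept separate so the request payloads can be inspected before anything is sent to AWS.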

In this blog, I will walk you through 1.) how to orchestrate data processing jobs via Amazon EMR and 2.) how to apply batch transform on a trained machine learning model to write predictions via Amazon SageMaker.

Step Functions can be integrated with a wide variety of AWS services, including AWS Lambda, AWS Fargate, AWS Batch, AWS Glue, Amazon ECS, Amazon SQS, Amazon SNS, Amazon DynamoDB, and more.

Example 1: Orchestrate Data Processing Jobs via Amazon EMR

1a.) Let’s view our input sample dataset (dummy data from my favorite video game) in Amazon S3.

[Image: input sample dataset in Amazon S3. Image by Author]

1b.) Next, I will create a state machine that spins up an EMR cluster (group of EC2 instances) via ASL.

	"Create_Infra": {
	      "Type": "Task",
	      "Resource": "arn:<partition>:states:::elasticmapreduce:createCluster.sync",
	      "Parameters": {
	        "Name": "Demo",
	        "VisibleToAllUsers": true,
	        "ReleaseLabel": "emr-5.29.0",
	        "Applications": [
	          {
	            "Name": "Hadoop"
	          },
	          {
	            "Name": "Spark"
	          },
	          {
	            "Name": "Hive"
	          },
	          {
	            "Name": "Sqoop"
	          }
	        ],
	        "ServiceRole": "EMR_DefaultRole",
	        "JobFlowRole": "EMR_EC2_DefaultRole",
	        "LogUri": "s3://aws-logs-<account-id>-<region>/elasticmapreduce/",
	        "Instances": {
	          "KeepJobFlowAliveWhenNoSteps": true,
	          "InstanceGroups": [
	            {
	              "Name": "Master Instance Group",
	              "InstanceRole": "MASTER",
	              "InstanceCount": 1,
	              "InstanceType": "m5.xlarge",
	              "Market": "ON_DEMAND"
	            },
	            {
	              "Name": "Core Instance Group",
	              "InstanceRole": "CORE",
	              "InstanceCount": 1,
	              "InstanceType": "m5.xlarge",
	              "Market": "ON_DEMAND"
	            },
	            {
	              "Name": "Task Instance Group",
	              "InstanceRole": "TASK",
	              "InstanceCount": 2,
	              "InstanceType": "m5.xlarge",
	              "Market": "ON_DEMAND"
	            }
	          ],
	          "Ec2KeyName": "<ec2-key>",
	          "Ec2SubnetId": "<subnet>",
	          "EmrManagedMasterSecurityGroup": "<security-group>",
	          "EmrManagedSlaveSecurityGroup": "<security-group>",
	          "ServiceAccessSecurityGroup": "<security-group>"
	        }
	      },
	      "ResultPath": "$.cluster",
	      "Next": "Example_Job_Step_1"
	    }
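The `Create_Infra` state stores its result under `$.cluster` and hands off to `Example_Job_Step_1`. As a rough sketch of how that next state could pick up the cluster ID, the Python snippet below builds a hypothetical follow-on state using the `elasticmapreduce:addStep.sync` integration and checks that every `Next` target is defined. The step contents (the `spark-submit` arguments, `ActionOnFailure`, the `$.step_1` result path) and the `aws` partition are assumptions for illustration, not the job used in this post.

```python
import json

# Hypothetical follow-on state; its key matches "Next" in Create_Infra.
example_job_step_1 = {
    "Type": "Task",
    # Service integration ARNs leave the region and account fields empty.
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {
        # Read the cluster ID that Create_Infra stored under $.cluster.
        "ClusterId.$": "$.cluster.ClusterId",
        "Step": {
            "Name": "Example_Job_Step_1",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Placeholder script location; substitute your own.
                "Args": ["spark-submit", "s3://<bucket>/<script>.py"],
            },
        },
    },
    "ResultPath": "$.step_1",
    "End": True,
}


def undefined_targets(states: dict) -> set:
    """Return names referenced by "Next" that are missing from the state map."""
    referenced = {s["Next"] for s in states.values() if "Next" in s}
    return referenced - set(states)


# Create_Infra is abbreviated here; its full Parameters block is shown above.
definition = {
    "StartAt": "Create_Infra",
    "States": {
        "Create_Infra": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": {"Name": "Demo"},
            "ResultPath": "$.cluster",
            "Next": "Example_Job_Step_1",
        },
        "Example_Job_Step_1": example_job_step_1,
    },
}

# Fail early if a transition points at a state that does not exist.
assert not undefined_targets(definition["States"])
print(json.dumps(definition, indent=2))
```

Because `KeepJobFlowAliveWhenNoSteps` is set to `true`, the cluster keeps running after its steps finish, so a real workflow would likely end with a state that terminates the cluster rather than stopping at the job step.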


Orchestrating Transient Data Analytics Workflows via AWS Step Functions