AWS Step Functions is a fully managed service that coordinates a series of steps and chains them into a state machine for automating tasks. It supports visual workflows, and state machines are defined as JSON structures in the Amazon States Language (ASL). In addition, state machines can be scheduled with an Amazon CloudWatch Events rule that uses a cron expression.
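
To make the JSON structure concrete, here is a minimal sketch of a state machine definition in ASL. The state name `Hello`, the comment text, and the result string are hypothetical and chosen only for illustration; the top-level `StartAt` and `States` fields and the `Pass` state type are part of ASL itself.

```json
{
  "Comment": "Minimal illustrative state machine (names are hypothetical)",
  "StartAt": "Hello",
  "States": {
    "Hello": {
      "Type": "Pass",
      "Result": "Hello from Step Functions",
      "End": true
    }
  }
}
```

For scheduling, a CloudWatch Events rule takes a schedule expression. For example, `cron(0 12 * * ? *)` would trigger the rule every day at 12:00 UTC.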

In this blog, I will walk you through 1.) how to orchestrate data processing jobs via Amazon EMR and 2.) how to apply batch transform on a trained machine learning model to write predictions via Amazon SageMaker.

Step Functions can be integrated with a wide variety of AWS services including AWS Lambda, AWS Fargate, AWS Batch, AWS Glue, Amazon ECS, Amazon SQS, Amazon SNS, Amazon DynamoDB, and more.

Example 1: Orchestrate Data Processing Jobs via Amazon EMR

1a.) Let’s view our input sample dataset (dummy data from my favorite video game) in Amazon S3.

1b.) Next, I will create a state machine that spins up an EMR cluster (group of EC2 instances) via ASL.

	"Create_Infra": {
	      "Type": "Task",
	      "Resource": "arn:<partition>:states:<region>:<account-id>:elasticmapreduce:createCluster.sync",
	      "Parameters": {
	        "Name": "Demo",
	        "VisibleToAllUsers": true,
	        "ReleaseLabel": "emr-5.29.0",
	        "Applications": [
	          { "Name": "Hadoop" },
	          { "Name": "Spark" },
	          { "Name": "Hive" },
	          { "Name": "Sqoop" }
	        ],
	        "ServiceRole": "EMR_DefaultRole",
	        "JobFlowRole": "EMR_EC2_DefaultRole",
	        "LogUri": "s3://aws-logs-<account-id>-<region>/elasticmapreduce/",
	        "Instances": {
	          "KeepJobFlowAliveWhenNoSteps": true,
	          "InstanceGroups": [
	            {
	              "Name": "Master Instance Group",
	              "InstanceRole": "MASTER",
	              "InstanceCount": 1,
	              "InstanceType": "m5.xlarge",
	              "Market": "ON_DEMAND"
	            },
	            {
	              "Name": "Core Instance Group",
	              "InstanceRole": "CORE",
	              "InstanceCount": 1,
	              "InstanceType": "m5.xlarge",
	              "Market": "ON_DEMAND"
	            },
	            {
	              "Name": "Task Instance Group",
	              "InstanceRole": "TASK",
	              "InstanceCount": 2,
	              "InstanceType": "m5.xlarge",
	              "Market": "ON_DEMAND"
	            }
	          ],
	          "Ec2KeyName": "<ec2-key>",
	          "Ec2SubnetId": "<subnet>",
	          "EmrManagedMasterSecurityGroup": "<security-group>",
	          "EmrManagedSlaveSecurityGroup": "<security-group>",
	          "ServiceAccessSecurityGroup": "<security-group>"
	        }
	      },
	      "ResultPath": "$.cluster",
	      "Next": "Example_Job_Step_1"
	}
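
Because `ResultPath` is set to `$.cluster`, the result of the create-cluster task is attached to the state input under a `cluster` key instead of replacing it. A rough sketch of what the data handed to the next state might look like is below. The `<cluster-id>` value is a placeholder, and the result may carry more fields than shown here; this only illustrates where the cluster ID ends up.

```json
{
  "cluster": {
    "ClusterId": "<cluster-id>"
  }
}
```

Under that assumption, a later state could pick up the ID in its `Parameters` with a path reference such as `"ClusterId.$": "$.cluster.ClusterId"`.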

