AWS Step Functions is a great orchestration service, but it can be quite expensive at large scale. In this article, we discuss two alternative serverless architectures for cost-effective workflow management in enterprise-scale projects.

AWS Step Functions is a great service for orchestrating multi-step workflows with complex logic. It’s fast to implement, relatively easy to use and just works. The problem is its price.

For smaller projects, it’s a feasible solution. But for large-scale, enterprise-grade orchestration, with hundreds of millions of executions each running dozens of steps, it can be cost-prohibitive.
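To put rough numbers on this, here is a minimal Python sketch of a Standard Workflows bill. It assumes the $25-per-million state-transition rate for Standard Workflows, roughly one transition per step, and ignores the free tier; the workload figures are hypothetical.

```python
# Assumption: $25 per million state transitions (Standard Workflows rate),
# with the free tier ignored.
STANDARD_PRICE_PER_MILLION_TRANSITIONS = 25

def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Estimate the Standard Workflows bill: every state transition is charged."""
    transitions = executions * transitions_per_execution
    return transitions * STANDARD_PRICE_PER_MILLION_TRANSITIONS / 1_000_000

# Hypothetical workload: 200 million executions of 30 transitions each.
print(f"${standard_cost(200_000_000, 30):,.0f}")  # prints $150,000
```

At that volume, the transition charge alone reaches six figures before paying for any of the work the tasks themselves perform.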

Why Step Functions Is Expensive

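Behind the scenes, a Step Functions execution runs synchronously with our resources: it stays active while the tasks it invokes are working and waits for their results. We end up paying both for the task and for the orchestrator waiting on it. This double-billing issue is one side of the Serverless trilemma.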

The recently announced Express Workflows replaced the $25-per-million state-transition charge with a $1-per-million request charge and created a new billing dimension: execution duration, which includes the time spent waiting on tasks. And guess what? Duration is charged exactly like AWS Lambda: per memory-second (GB-second), rounded up to the nearest 100 milliseconds.
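The duration dimension can be sketched the same way. This minimal Python example assumes a $1-per-million request charge and a duration rate of $0.00001667 per GB-second (Lambda’s first-tier rate at the time; check current AWS pricing). The memory figure is an input you would have to estimate for your own workflow, and the workload is hypothetical.

```python
import math

# Assumed rates (check current AWS pricing): $1 per million requests and
# $0.00001667 per GB-second of execution duration.
PRICE_PER_MILLION_REQUESTS = 1.0
PRICE_PER_GB_SECOND = 0.00001667

def billed_duration_ms(duration_ms: float) -> int:
    """Round a duration up to the next multiple of 100 ms."""
    return math.ceil(duration_ms / 100) * 100

def express_cost(executions: int, duration_ms: float, memory_mb: int) -> float:
    """Estimate the Express Workflows bill for executions of equal duration and memory."""
    request_cost = executions * PRICE_PER_MILLION_REQUESTS / 1_000_000
    gb_seconds = (memory_mb / 1024) * (billed_duration_ms(duration_ms) / 1000)
    return request_cost + executions * gb_seconds * PRICE_PER_GB_SECOND

# Hypothetical workload: one million 250 ms executions at 64 MB, billed as 300 ms each.
print(round(express_cost(1_000_000, 250, 64), 4))  # prints 1.3126
```

If the task is itself a Lambda function running for those same 250 ms, Lambda bills that time again separately; that overlap is the double billing mentioned earlier.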

This is like having a Lambda function running a finite-state machine implementation, which triggers other resources and then sits idle, waiting for their responses.

AWS recommends Express Workflows when tasks have short execution times. Standard Workflows probably include a markup to cover the risk of long-running executions.

This is suboptimal, but it’s understandable why AWS went that route. Without access to the underlying code of tasks, it’s virtually impossible to provide the full Step Functions feature set without synchronous execution and double billing.

Affordable Orchestration Solutions

For large-scale and enterprise-level workflows that cannot afford the resources wasted by the Step Functions model, there are at least a couple of alternatives. One could certainly come up with a dozen more, but the two we cover are enough to illustrate our point while staying 100% serverless, which is our goal.

I should note up front that either of the two will probably take more effort to implement than Step Functions. How much more depends on your workflow requirements.

Real-World Code Examples

We are planning to open-source code examples illustrating the architectures below, along with CloudFormation and CDK templates for easy deployment in your own AWS accounts.

Orchestration With EventBridge

EventBridge is a serverless event bus that routes events from sources to targets based on rules. Sounds a bit like Task and Choice states in Step Functions, right?

With the Schema Registry feature, it became even easier to configure EventBridge to work like a workflow orchestration mechanism. We can organize event schemas into registries, logical groups that resemble how workflows are organized in Step Functions.

Any part of your application can send an event to an Event Bus, where it is matched against the bus’s rules to determine which targets should receive it. The schemas describing those events are defined in JSON following the OpenAPI standard.

Event patterns determine how events are processed depending on the fields and values present in them. Content-based filtering (prefix matching, numeric ranges, existence checks and anything-but matching) provides even more granularity.
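To make the matching semantics concrete, here is a minimal local sketch in Python of how an event pattern is evaluated. It is not EventBridge’s implementation and supports only a subset of the syntax: exact values, nested objects, and the `prefix` and `exists` filters. The source, detail-type and field names are hypothetical.

```python
from typing import Any

def _condition_matches(condition: Any, value: Any, present: bool) -> bool:
    """Evaluate one element of a pattern array against one event value."""
    if isinstance(condition, dict):
        if "exists" in condition:
            # {"exists": true} requires the field; {"exists": false} forbids it.
            return condition["exists"] == present
        if "prefix" in condition:
            return present and isinstance(value, str) and value.startswith(condition["prefix"])
        raise ValueError(f"filter not supported by this sketch: {condition}")
    # Plain values are exact matches.
    return present and value == condition

def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field in the pattern matches the event."""
    for key, rule in pattern.items():
        present = key in event
        value = event.get(key)
        if isinstance(rule, dict):
            # A nested object in the pattern must match a nested object in the event.
            if not isinstance(value, dict) or not matches(rule, value):
                return False
            continue
        # A list means "any of these"; an array in the event matches if any element does.
        candidates = value if isinstance(value, list) and value else [value]
        if not any(_condition_matches(c, v, present) for c in rule for v in candidates):
            return False
    return True

# Hypothetical rule: route completed "validate" steps for orders without errors.
pattern = {
    "source": ["myapp.orders"],
    "detail-type": ["StepCompleted"],
    "detail": {
        "step": ["validate"],
        "orderId": [{"prefix": "ord-"}],
        "error": [{"exists": False}],
    },
}
event = {
    "source": "myapp.orders",
    "detail-type": "StepCompleted",
    "detail": {"step": "validate", "orderId": "ord-42"},
}
print(matches(pattern, event))  # prints True
```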

An Event Bus, however, limits itself to receiving events and routing them to the appropriate target(s). Unlike Step Functions, it won’t keep track of what targets are working on or react to their responses automatically.

Another potential downside is that EventBridge is a relatively new service. Developer knowledge and tooling for it are still not as mature as they are for Step Functions. Dashbird, for example, just announced support for Step Functions in its architectural insights engine. While more advanced tools are not yet available for EventBridge, CloudWatch already supports it for basic metric monitoring.

The architecture we are discussing could involve, for example, one Event Bus and multiple Lambda functions, each responsible for one step of the process. At the end of its step, each Lambda function sends a new event to the same Event Bus describing the step it just completed, so that EventBridge can match it and route it to the next step in the process.
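A single step in this chain might look like the following minimal Python sketch. The bus name, event source, `StepCompleted` detail type and the `validate` step are all hypothetical. The handler publishes through an injectable function so it can be exercised locally; the default sends the event with boto3’s `put_events` call.

```python
import json
from typing import Callable

# Hypothetical names, for illustration only.
EVENT_BUS_NAME = "workflow-bus"
EVENT_SOURCE = "myapp.orders"

Publisher = Callable[[list], None]

def boto3_publisher(entries: list) -> None:
    """Send entries to EventBridge with boto3 (imported lazily, only on AWS)."""
    import boto3
    response = boto3.client("events").put_events(Entries=entries)
    # PutEvents can partially fail; surface that so the invocation is retried.
    if response.get("FailedEntryCount", 0):
        raise RuntimeError(f"failed to publish entries: {response['Entries']}")

def step_completed_entry(step: str, detail: dict) -> dict:
    """Build a PutEvents entry announcing that `step` has finished."""
    return {
        "EventBusName": EVENT_BUS_NAME,
        "Source": EVENT_SOURCE,
        "DetailType": "StepCompleted",
        "Detail": json.dumps({"step": step, **detail}),
    }

def validate_order_handler(event: dict, context=None, publish: Publisher = boto3_publisher) -> dict:
    """Lambda handler for a hypothetical 'validate' step."""
    order = event["detail"]
    # Step-specific work goes here; this sketch only checks one field.
    is_valid = order.get("amount", 0) > 0
    publish([step_completed_entry("validate", {"orderId": order["orderId"], "valid": is_valid})])
    return {"valid": is_valid}
```

A second rule on the same bus, matching `StepCompleted` events whose `detail.step` is `validate`, would then route the event to the next step’s function.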

Cutting Step-Functions Costs on Enterprise-Scale Workflows