Best practices for using AWS StepFunctions

lukvonstrom

Lukas Fruntke

Posted on January 22, 2022

In this post you will learn some of the most useful patterns and tricks I have picked up while building Step Functions workflows for clients at Accenture.

Make use of the Step Functions Workflow Studio

Since its introduction in June 2021, Step Functions Workflow Studio has proven its value to me several times. As it's a low-code editor with the most common configurations already baked in, the effort of writing and designing a workflow in Amazon States Language has plummeted. The minutes and hours spent handwriting workflows of 100+ lines are finally over, which is something you should not ignore when working with Step Functions.

As visualized below, the option to create a workflow from scratch in Workflow Studio directly is already present:
Step Functions creation

As is the option to edit pre-existing workflows directly in Step Functions:
Step Functions edit


Utilize the service integrations

While AWS offers some 17 "optimized" service integrations (for the definitive list see here), which come with custom options for integrating with the specific services, AWS has also released a way to call the APIs of nearly all AWS services directly, as described in this article. This allows you to scrap some of the utility Lambdas that only exist to paper over missing functionality in a service, and let Step Functions make the call instead.
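
As a rough sketch of what such a direct API call can look like, the snippet below builds a Task state as a plain JavaScript object and prints it as Amazon States Language JSON. The resource ARN follows the `arn:aws:states:::aws-sdk:serviceName:apiAction` pattern. The state name, the choice of S3 `listObjectsV2`, and the bucket name are assumptions made up for illustration.

```javascript
// Hypothetical state: lists the objects in a bucket without a helper Lambda.
const listObjectsState = {
  "List Objects": {
    Type: "Task",
    // AWS SDK integration pattern: aws-sdk:<service>:<apiAction>
    Resource: "arn:aws:states:::aws-sdk:s3:listObjectsV2",
    Parameters: {
      // Placeholder bucket name, replace it with your own
      Bucket: "my-example-bucket"
    },
    End: true
  }
};

// Print the state as JSON, ready to paste into a workflow definition
console.log(JSON.stringify(listObjectsState, null, 2));
```

Note that the parameter names of these integrations are written in PascalCase, as in `Bucket` above.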


Use .waitForTaskToken

By using .waitForTaskToken, you can pause the workflow until a task, such as a Lambda function, reports back that it has finished.

Be aware that you need to pass the task token in the payload for the Lambda yourself, as Step Functions does not inject it automatically.

Screenshot of waitForTaskToken
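
To make the point about the payload concrete, here is a minimal sketch of such a task, again written as a JavaScript object that is printed as Amazon States Language JSON. The token is read from the context object via `$$.Task.Token`. The function name and the `input` key are placeholders I made up; the key `MyTaskToken` is simply the name this post uses in its Lambda code.

```javascript
// Hypothetical Task state that pauses until the task token is sent back.
const waitForCallbackState = {
  "Invoke Lambda Task": {
    Type: "Task",
    Resource: "arn:aws:states:::lambda:invoke.waitForTaskToken",
    Parameters: {
      // Placeholder function name
      FunctionName: "my-example-function",
      Payload: {
        // Forward the state input (the key name `input` is arbitrary)
        "input.$": "$",
        // The token lives in the context object ($$), not in the state input
        "MyTaskToken.$": "$$.Task.Token"
      }
    },
    End: true
  }
};

console.log(JSON.stringify(waitForCallbackState, null, 2));
```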

Example

This example shows how to report success or failure back to Step Functions with the task token, using the AWS SDK for JavaScript v3.
import {
  SFNClient,
  SendTaskFailureCommand,
  SendTaskSuccessCommand
} from "@aws-sdk/client-sfn";
const client = new SFNClient();

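// Resume the paused task. `output` must be a JSON string.
async function success(taskToken, output) {
  const stepFunctionsCommand = new SendTaskSuccessCommand({
    taskToken,
    output: JSON.stringify(output ?? {})
  });
  await client.send(stepFunctionsCommand);
}

// Fail the paused task. `cause` and `error` must both be strings.
async function failure(taskToken, cause, error) {
  const stepFunctionsCommand = new SendTaskFailureCommand({
    taskToken,
    cause,
    error
  });
  await client.send(stepFunctionsCommand);
}

async function main(event) {
  try {
    // ... do the actual work here, then report the result (placeholder output)
    await success(event.MyTaskToken, { status: "done" })
  } catch (error) {
    console.error(error)
    // $metadata is only present on errors thrown by the AWS SDK
    const {
      requestId,
      cfId,
      extendedRequestId
    } = error.$metadata ?? {};
    const cause = JSON.stringify({
      requestId,
      cfId,
      extendedRequestId
    });
    await failure(event.MyTaskToken, cause, error.name)
  }
}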


Utilize Heartbeats to fail fast

As Step Functions executions can run for up to a year (at least in Standard workflows), it is imperative to avoid stuck executions. One way of doing this when integrating with Lambda is the heartbeat setting and its matching API. It allows developers to specify a maximum interval within which a heartbeat has to be sent back to Step Functions. If the deadline is missed, the task fails with a timeout error instead of hanging.

Example

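In the following example, a task is specified with a maximum heartbeat interval of 10 minutes. Heartbeats are sent with a task token, so the task uses the .waitForTaskToken integration pattern.
{
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "HeartbeatSeconds": 600
}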

This code example shows how to send a heartbeat with the AWS SDK for JavaScript v3.

import {
  SFNClient,
  SendTaskHeartbeatCommand
} from "@aws-sdk/client-sfn";
const client = new SFNClient();

async function heartbeat(taskToken) {
  const stepFunctionsCommand = new SendTaskHeartbeatCommand({
    taskToken
  });
  await client.send(stepFunctionsCommand);
}

async function main(event) {
    await heartbeat(event.MyTaskToken)

    // some expensive calculation

    await heartbeat(event.MyTaskToken)
}
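
If the expensive calculation takes longer than the heartbeat interval, two manual heartbeats will not be enough. One possible approach, sketched below under my own assumptions, is a small helper that keeps sending heartbeats on a timer while the work runs. The helper name `withHeartbeat` and its parameters are hypothetical; the function that actually sends the heartbeat is passed in, so the sketch does not depend on the SDK client.

```javascript
// Runs `work` while calling `sendHeartbeat` every `intervalMs` milliseconds.
// `sendHeartbeat` is injected, e.g. () => heartbeat(event.MyTaskToken).
async function withHeartbeat(sendHeartbeat, intervalMs, work) {
  const timer = setInterval(() => {
    // A failed heartbeat should not crash the process; log it and keep working.
    Promise.resolve()
      .then(sendHeartbeat)
      .catch((err) => console.error("heartbeat failed", err));
  }, intervalMs);
  try {
    return await work();
  } finally {
    // Always stop the timer, even if the work throws
    clearInterval(timer);
  }
}
```

Inside a handler this could be called as `await withHeartbeat(() => heartbeat(event.MyTaskToken), 60000, doExpensiveWork)`, where `doExpensiveWork` stands for your own function. Pick an interval comfortably below HeartbeatSeconds.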


Define a Catch Handler

Rationale

To quote Werner Vogels, Amazon CTO:

everything fails, all the time

Be prepared for the wildly different and sometimes unexpected errors the AWS APIs can throw by catching them, just as you would in a Lambda function (if you don't do that, we are having a whole different conversation).

Example

In this pretty simple example, the catch block is used on a Lambda task. It works on other tasks as well. This example uses a catch-all error code; for other error codes see here.

As it is usually a good idea to get notified when an error occurs, this example publishes to an SNS topic, which may have an email subscription. I'd recommend using this only for really critical errors, as you may otherwise miss important errors in your then-cluttered inbox.

Catch flow
{
  "Invoke Lambda Task": {
    // [..]
    "Catch": [
      {
        "ErrorEquals": [
          "States.ALL"
        ],
        "Next": "SNS Publish"
      }
    ]
    // [..]
  },
  "SNS Publish": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {
      "Message.$": "$"
    },
    "Next": "Fail"
  },
  "Fail": {
    "Type": "Fail"
  }
}
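
The catch-all above treats every failure the same way. If you would rather route one specific failure differently, a Node.js Lambda can throw an error with a distinctive name, and a catcher can match that name in ErrorEquals. The following is only a sketch: the error name `OrderNotFoundError`, the handler, and the state name "Handle Missing Order" are all hypothetical.

```javascript
// Hypothetical error type; a catcher can match it with
// "ErrorEquals": ["OrderNotFoundError"].
class OrderNotFoundError extends Error {
  constructor(message) {
    super(message);
    this.name = "OrderNotFoundError";
  }
}

// Hypothetical Lambda handler that fails with the named error.
async function handler(event) {
  if (!event.orderId) {
    throw new OrderNotFoundError("no orderId in the input");
  }
  return { orderId: event.orderId };
}

// The specific catcher is listed first; States.ALL has to come last.
const catchers = [
  { ErrorEquals: ["OrderNotFoundError"], Next: "Handle Missing Order" },
  { ErrorEquals: ["States.ALL"], Next: "SNS Publish" }
];
```

Catchers are evaluated in order, so the specific entry handles the named error and everything else still ends up in the SNS notification.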


Make use of the inbuilt retries

Instead of catching an error and terminating the flow, you can use retries to recover from the error or to wait for a desired state. This is useful, for example, when awaiting results from APIs like those of a Glue Crawler, which need repeated polling and potentially exponential backoff.

Example

This example shows a retry on specific Lambda exceptions. The IntervalSeconds parameter defines the delay before the first retry is attempted. The BackoffRate parameter specifies the multiplier applied to that delay after each unsuccessful attempt. Step Functions will retry after 2, 4, 8, 16, 32 and 64 seconds, limited to 6 retries by the MaxAttempts parameter.

{
  "Invoke Lambda Task": {
    // [..]
    "Retry": [
      {
        "ErrorEquals": [
          "Lambda.ServiceException",
          "Lambda.AWSLambdaException",
          "Lambda.SdkClientException"
        ],
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2
      }
    ]
    // [..]
  }
}
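
To check the retry schedule for other settings, the arithmetic can be reproduced in a few lines. This is a small sketch of the calculation only, considering just the three parameters used above; the function name `retryDelays` is my own.

```javascript
// Computes the wait in seconds before each retry attempt:
// intervalSeconds * backoffRate^(attempt index)
function retryDelays(intervalSeconds, backoffRate, maxAttempts) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(intervalSeconds * Math.pow(backoffRate, attempt));
  }
  return delays;
}

console.log(retryDelays(2, 2, 6)); // [ 2, 4, 8, 16, 32, 64 ]
```

Summing the values gives 126 seconds of waiting in the worst case, which is worth comparing against any timeout you have set on the task.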


Further reading

A nice paper by López et al. from 2018 compares the various orchestration platforms of the different hyperscalers:
DOI - 10.1109/UCC-Companion.2018.00049
ArXiv - arXiv:1807.11248
Be aware, though, that the services have developed over the three years since then, so it's an exercise for the reader to recognize how far Step Functions has come.


Header image by Gabriel Santos Fotografia via Pexels
