Using Step Functions to handle feature flags

We use feature flags to control how we release parts of a product. Instead of adding conditional statements in the code, we can use Step Functions to decide if the feature flag is enabled.

1. The scenario

Bob's company has a popular application and wants to release a new feature. They want to test it thoroughly in the development environment first. But they are also required to push changes to production continually, so the feature must not block the deployment pipeline.

One way to manage this problem is to use feature flags. AWS AppConfig, part of the Systems Manager ecosystem, is a service that lets us deliver feature flags and configuration objects to our application.

One way to incorporate them into the code is to use conditional statements that check whether we have enabled the feature flag for the given environment. But because Bob wants to minimize code changes and reduce code complexity (and because it's fun), he decided to use Step Functions instead of if statements.

He created a separate Lambda function with the new feature (NewFeature), which exists parallel to the existing code (ExistingFeature).

Let's see how this experiment worked out.

2. AppConfig concepts

When getting a feature flag from AppConfig, we must provide three parameters.

An application is a namespace or a folder that contains configurations, feature flags and environments for the given application.

The environment is the target for the feature flag. We can name it as we like. In this example, we'll have two environments, dev and prod. We enable the feature flag in dev, which runs the new code. We keep the existing code in prod.

The last element is the configuration profile, which can be a feature flag or a freeform configuration. This example will use a feature flag.

I won't describe how to create applications, environments and configuration profiles in AppConfig. I'll provide a link that explains the process at the end of the post.

3. Getting the feature flag

First, we fetch the feature flag state (enabled or disabled) for the given environment from AppConfig.

3.1. Lambda extension

Luckily, we (and Bob) build serverless applications and use Lambda functions. AWS provides an extension that we can integrate with our function as a layer.

If we use SAM templates to create the resources, we can add the extension like this:

AppConfigFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: nodejs20.x
    Layers:
      - 'arn:aws:lambda:eu-central-1:066940009817:layer:AWS-AppConfig-Extension-Arm64:49'
    # more Lambda properties

The layer ARN is different for each region and Lambda function architecture, so you need to find the right one for your scenario.

3.2. The code

We can now call the extension from the GetFeatureFlag function. The code can look like this:

import axios from 'axios';

const {
  // The extension listens on port 2772 unless this variable overrides it
  AWS_APPCONFIG_EXTENSION_HTTP_PORT = '2772',
  APPCONFIG_APPLICATION_NAME,
  APPCONFIG_ENVIRONMENT_NAME,
  APPCONFIG_CONFIGURATION_NAME,
} = process.env;

const client = axios.create({
  baseURL: `http://localhost:${AWS_APPCONFIG_EXTENSION_HTTP_PORT}`,
  timeout: 5000,
});

export const handler = async () => {
  // 1. Fetch the feature flag from AppConfig. The path stays on one line
  // because a line break inside the template literal would end up in the URL.
  const path = `/applications/${APPCONFIG_APPLICATION_NAME}/environments/${APPCONFIG_ENVIRONMENT_NAME}/configurations/${APPCONFIG_CONFIGURATION_NAME}`;
  const config = await client.get(path);
  // 2. Return the feature flag as the value of the config property
  return {
    config: config.data,
  };
};

AWS_APPCONFIG_EXTENSION_HTTP_PORT defaults to 2772, which we can leave as is.

We can have an environment variable for each mandatory AppConfig parameter: application, environment (dev or prod in this case) and configuration profile (1). This way, when we deploy the resources to multiple environments, the function will know the feature flag state for the given environment.
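Since the request path depends entirely on these three variables, it can help to build it in one place and fail fast when one of them is missing. The following is a minimal sketch, not part of the original function: buildConfigPath is a hypothetical helper name, and the sample values (my-app, dev, my-flags) are made up for illustration.

```javascript
// Hypothetical helper: builds the extension's request path from environment
// variables and throws when one of the three mandatory values is missing.
const buildConfigPath = (env) => {
  const names = [
    'APPCONFIG_APPLICATION_NAME',
    'APPCONFIG_ENVIRONMENT_NAME',
    'APPCONFIG_CONFIGURATION_NAME',
  ];
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  // Encode each value so unusual characters cannot break the path
  const [application, environment, configuration] = names.map((name) =>
    encodeURIComponent(env[name]),
  );
  return `/applications/${application}/environments/${environment}/configurations/${configuration}`;
};

// Sample values, made up for illustration
console.log(
  buildConfigPath({
    APPCONFIG_APPLICATION_NAME: 'my-app',
    APPCONFIG_ENVIRONMENT_NAME: 'dev',
    APPCONFIG_CONFIGURATION_NAME: 'my-flags',
  }),
);
// /applications/my-app/environments/dev/configurations/my-flags
```

In the handler, the result of buildConfigPath(process.env) could be passed to client.get.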

The response from AppConfig will be similar to the following:

{
  "isAllowed": {
    "enabled": true
  }
}

As we can see, AppConfig returns an object of feature flag objects. isAllowed is the feature flag's very creative name. The presented value refers to the dev environment because the flag is enabled there. The value would be enabled: false in prod. We encapsulate the feature flag value in the config property of the returned object (2).

3.3. Permissions

The function's execution role must allow the appconfig:StartConfigurationSession and appconfig:GetLatestConfiguration permissions.

4. Using Step Functions

GetFeatureFlag is part of the state machine, so its return value (the feature flag name and its state) will be the input of the next state.

In this case, it's a Choice state, where we decide if we call the existing function or the one with the new feature.

The state's definition can look like this:

"IsFeatureFlagEnabled": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.config.isAllowed.enabled",
      "BooleanEquals": true,
      "Next": "NewFeature"
    }
  ],
  "Default": "ExistingFeature"
}

When the feature flag's value is enabled: true, Step Functions will call the NewFeature function. Otherwise, it will invoke ExistingFeature. From this point, the flow can continue as usual.
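To make the routing rule concrete, here is the same decision written as plain JavaScript. It only imitates what the Choice state does for us and is not meant to be deployed. nextState is a hypothetical name, and falling back to ExistingFeature when the flag is missing is a choice made for this sketch, not necessarily how Step Functions treats a missing path.

```javascript
// Plain-JavaScript stand-in for the IsFeatureFlagEnabled Choice state.
// Only an exact boolean true selects the new feature, like BooleanEquals: true.
const nextState = (input) =>
  input?.config?.isAllowed?.enabled === true ? 'NewFeature' : 'ExistingFeature';

console.log(nextState({ config: { isAllowed: { enabled: true } } })); // NewFeature
console.log(nextState({ config: { isAllowed: { enabled: false } } })); // ExistingFeature
```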

We have successfully eliminated the if block from the code!

5. AppConfig interactions with Step Functions

What if we wanted to remove the GetFeatureFlag Lambda function and make Step Functions interact with AppConfig directly? We can do that, but there are a few things to consider.

5.1. What's going on in the background?

With a few lines of code in the function handler (1), the Lambda AppConfig extension does a complex job in the background.

First, it calls the StartConfigurationSession API endpoint, which sends back an InitialConfigurationToken. Then, it invokes the GetLatestConfiguration endpoint, which returns the feature flag object seen above.

It then calls GetLatestConfiguration at a configured interval (45 seconds by default, adjustable with the AWS_APPCONFIG_EXTENSION_POLL_INTERVAL_SECONDS environment variable) and caches the result.
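The session, poll and cache steps can be sketched in a few lines. This is not the extension's actual source, only an illustration of the flow under some assumptions: createFlagReader and the api object are hypothetical, the two api methods stand in for the real AppConfig calls, and the response field names follow the GetLatestConfiguration output. The sketch uses the poll interval from the response rather than a fixed setting.

```javascript
// Sketch of the session-and-cache flow. `api` is a hypothetical object whose
// two async methods stand in for the AppConfig API calls.
const createFlagReader = (api, params, now = Date.now) => {
  let token; // token to send with the next GetLatestConfiguration call
  let cached; // last configuration we received
  let nextPollAt = 0; // earliest time (ms) at which we poll again

  return async () => {
    if (cached !== undefined && now() < nextPollAt) {
      return cached; // still fresh: no API call
    }
    if (token === undefined) {
      const session = await api.startConfigurationSession(params);
      token = session.InitialConfigurationToken;
    }
    const latest = await api.getLatestConfiguration({ ConfigurationToken: token });
    token = latest.NextPollConfigurationToken;
    nextPollAt = now() + latest.NextPollIntervalInSeconds * 1000;
    // An empty Configuration means nothing changed, so the cache is kept
    if (latest.Configuration) {
      cached = JSON.parse(latest.Configuration);
    }
    return cached;
  };
};

// Fake API for a local dry run; real code would call AppConfig instead
const fakeApi = {
  calls: 0,
  async startConfigurationSession() {
    return { InitialConfigurationToken: 'token-0' };
  },
  async getLatestConfiguration() {
    this.calls += 1;
    return {
      Configuration: '{"isAllowed":{"enabled":true}}',
      NextPollConfigurationToken: `token-${this.calls}`,
      NextPollIntervalInSeconds: 60,
    };
  },
};

const readFlags = createFlagReader(fakeApi, {});
readFlags()
  .then(() => readFlags()) // second read is served from the cache
  .then((flags) => console.log(flags.isAllowed.enabled, fakeApi.calls));
```

In the dry run, the second read happens inside the poll interval, so the fake API is called only once.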

5.2. Doing the same with Step Functions

We can remove this Lambda function from the architecture and delegate the AppConfig API calls to Step Functions. But in this case, we have to manage everything that the AppConfig extension does for us.

Only the beginning of the workflow changes: the StartConfigurationSession and GetLatestConfiguration states replace the Lambda function. The Choice state and everything after it remain the same.

Step Functions integrates with 10,000+ AWS APIs, including StartConfigurationSession and GetLatestConfiguration.

The StartConfigurationSession state requires the mandatory AppConfig parameters we used in the HTTP call inside the Lambda handler. The state's API parameters section can look like this:

{
  "ApplicationIdentifier.$": "$.ApplicationIdentifier",
  "ConfigurationProfileIdentifier.$": "$.ConfigurationProfileIdentifier",
  "EnvironmentIdentifier.$": "$.EnvironmentIdentifier"
}

We assume the state's input contains the ApplicationIdentifier, ConfigurationProfileIdentifier and EnvironmentIdentifier properties.

The state's output (InitialConfigurationToken) will be the input of the following state, GetLatestConfiguration. This state needs one mandatory parameter called ConfigurationToken. The relevant part of the definition can look like this:

{
  "ConfigurationToken.$": "$.InitialConfigurationToken"
}

The output will be similar to this:

{
  "Configuration": "{\"isAllowed\":{\"enabled\":true}}",
  "ContentType": "application/json",
  "NextPollConfigurationToken": "TOKEN",
  "NextPollIntervalInSeconds": 60
}

As we can see, the Configuration property contains the feature flag as expected.
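One detail is easy to miss: Configuration is a JSON string here, not an object, so a path like $.config.isAllowed.enabled can't read the flag from it directly. Inside the state machine, the States.StringToJson intrinsic function is one way to do the conversion. The JavaScript sketch below only illustrates the reshaping that has to happen; toChoiceInput is a hypothetical name, and it assumes the output has the shape shown above.

```javascript
// Hypothetical helper: turns the GetLatestConfiguration output into the
// { config: ... } shape that the Choice state reads.
const toChoiceInput = (output) => ({
  config: JSON.parse(output.Configuration),
});

// Sample output mirroring the example response
const sampleOutput = {
  Configuration: '{"isAllowed":{"enabled":true}}',
  ContentType: 'application/json',
};

console.log(toChoiceInput(sampleOutput).config.isAllowed.enabled); // true
```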

5.3. It might not be a good idea

But there's something else here.

The GetLatestConfiguration call returns a token in the NextPollConfigurationToken property. AWS recommends that clients use it for subsequent calls to the endpoint.

The documentation also recommends caching the feature flag instead of continually fetching it from AppConfig. We should take this advice because AWS charges for GetLatestConfiguration calls, so we want to reduce the number of invocations!

It means that the client that calls the state machine should also provide the current token in the input. The first state could check if the request contains the token. In this case, the state machine could jump to the GetLatestConfiguration state. If the client can't provide the token (for example, because it's the first call), the state machine could call StartConfigurationSession.
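The routing rule for that first state could look like the following sketch. It is written in plain JavaScript for illustration only. firstApiCall is a hypothetical name, and so is the ConfigurationToken input property, since the post doesn't fix how the client would pass the token.

```javascript
// Hypothetical routing rule for the first state: reuse the caller's token
// when it is present, otherwise start a new configuration session.
const firstApiCall = (input) =>
  typeof input?.ConfigurationToken === 'string' && input.ConfigurationToken !== ''
    ? 'GetLatestConfiguration'
    : 'StartConfigurationSession';

console.log(firstApiCall({})); // StartConfigurationSession
console.log(firstApiCall({ ConfigurationToken: 'TOKEN' })); // GetLatestConfiguration
```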

Alternatively, the state machine could store the token somewhere externally, for example, in a DynamoDB table. But this solution would add at least two extra API calls (read and update token) to the flow.

All of these would increase complexity. For this reason, I would keep the Lambda function with the AppConfig extension.

6. Considerations

It's not only feature flags that we can configure in AppConfig. It's possible to store more complex configuration objects, too.

As mentioned above, we can have multiple feature flags for the same application and environment. If this is the case, we'll need a more complex Choice state configuration, which can lead to harder-to-manage states. Alternatively, Bob can write multiple if statements in the code, one for each feature flag.

7. Summary

AppConfig can store feature flags and other configurations we can use in our applications. With the help of the AppConfig Agent or the Lambda extension, we can fetch the feature flag from AppConfig. The extension follows the AWS-recommended flow of API calls and caches the feature flag.

We can use Step Functions and incorporate different code versions based on the feature flag value into our application.

8. Further reading

Creating feature flags and free form configuration data in AWS AppConfig - Guide to create applications, environments and configuration profiles

AWS AppConfig workshop - Get your hands dirty

Getting started with Lambda - How to create a Lambda function

Input and Output Processing in Step Functions - Data flow manipulation
