9 Surprises using AWS EventBridge Scheduler
Frédéric Barthelet
Posted on December 1, 2022
AWS released its new AWS EventBridge Scheduler service, dedicated to planning tasks in your application. The service is available in all regions using the SDK, the CDK, the CLI and the web management console.
Since the service was released, I've been thoroughly migrating existing workloads that leveraged homemade scheduling features using DynamoDB TTL or any other contraption. This article serves as a discovery report, describing the good, the bad and my general recommendations when it comes to using this new serverless managed scheduling service, in an effort to save someone else the trouble of figuring out the use cases and limits of the service.
What is the Scheduler
The Scheduler is an AWS managed service, dedicated to scheduling one-time or recurring triggers targeting AWS services actions.
There are 3 types of schedules:
- One-time schedules - December 21st at 7AM UTC
- Rate-based schedules, allowing recurring tasks using frequency rate - every 2 hours
- Cron-based schedules, allowing recurring tasks using a cron expression - every Friday at 4PM
Schedules can be grouped in Schedule Groups. Both Schedules and Schedule Groups can be provisioned using a CRUD API on the service.
Rate-based and cron-based schedules can be restricted to a specific timeframe using a start date and an end date. One-time and cron-based schedules are sensitive to an optional timezone parameter in which the schedule expression should be evaluated.
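To make the three schedule types more concrete, here is a minimal sketch of the corresponding expressions and of the arguments a CreateSchedule call could take. The helper name `schedule_kwargs` and the example schedule names are hypothetical; the keys follow the CreateSchedule API as I understand it, so double check them against the API reference.

```python
# The three examples above, written as schedule expressions.
ONE_TIME = "at(2022-12-21T07:00:00)"   # December 21st at 7AM
RATE = "rate(2 hours)"                 # every 2 hours
CRON = "cron(0 16 ? * FRI *)"          # every Friday at 4PM


def schedule_kwargs(name, expression, target, timezone="UTC", start=None, end=None):
    """Build CreateSchedule arguments (hypothetical helper, not part of any SDK)."""
    kwargs = {
        "Name": name,
        "ScheduleExpression": expression,
        # Only one-time and cron-based schedules are sensitive to the timezone.
        "ScheduleExpressionTimezone": timezone,
        # Flexible time window disabled: trigger as close as possible to the requested time.
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": target,
    }
    # Start and end dates only make sense for recurring schedules.
    if start is not None:
        kwargs["StartDate"] = start
    if end is not None:
        kwargs["EndDate"] = end
    return kwargs
```

With boto3, the result could then be passed along as `boto3.client("scheduler").create_schedule(**kwargs)`.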
Each schedule can trigger one target. There are 3 types of targets:
- Templated Isomorphic targets
- Templated Service-specific targets
- Universal targets
Unlike Universal targets that require the Scheduler to bootstrap an execution environment to use the AWS SDK, both Templated Isomorphic and Service-specific targets are most likely leveraging EventBridge API Destination features to interact with the destination AWS services using their HTTP interface.
Templated Isomorphic targets
You can trigger the following services APIs using the same generic target definition at the creation of your schedule:
The target definition only requires an arn (like the one of the Lambda function to be invoked or the Step Functions state machine to be started). It allows the use of an optional input field.
Templated Service-specific targets
You can trigger the following services APIs using additional service-specific options at the creation of your schedule:
As with templated isomorphic targets, the schedule definition only requires an arn. In addition to the optional input field, one service-specific attribute, named ${service}Parameters, can be added to the schedule target definition in order to further configure the action. For example, you can provide a SqsParameters parameter in order to specify the MessageGroupId value for the SQS SendMessage target.
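As an illustration, a templated SQS target could be built along these lines. This is a sketch only: the function name is hypothetical and the ARNs are placeholders. It also sets RoleArn, which the CreateSchedule API expects on every target (see the authorization section further down).

```python
import json


def sqs_templated_target(queue_arn, role_arn, body, message_group_id=None):
    """Build a templated SQS SendMessage target (hypothetical helper)."""
    target = {
        "Arn": queue_arn,           # the queue itself is the target
        "RoleArn": role_arn,        # role assumed by the Scheduler to send the message
        "Input": json.dumps(body),  # optional input, passed as a string
    }
    if message_group_id is not None:
        # Service-specific attribute, following the ${service}Parameters naming.
        target["SqsParameters"] = {"MessageGroupId": message_group_id}
    return target
```

Calling it without `message_group_id` gives the plain arn plus input shape shared with the isomorphic targets.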
Universal targets
Universal targets allow schedules to target any AWS service and any corresponding action using an abstracted compute environment with the SDK capabilities. Universal targets are defined using the magic string arn:aws:scheduler:::aws-sdk:${service}:${apiAction} as an arn. This feature is similar to the AWS Step Functions SDK services integration released in September 2021.
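A minimal sketch of a universal target definition, assuming the input is the JSON-serialized request of the targeted action (the helper name, queue URL and role ARN are made up for the example):

```python
import json


def universal_target(service, api_action, role_arn, parameters):
    """Build a universal target from a service name and an API action (hypothetical helper)."""
    return {
        # The "magic string": no region, no account id, just service and action.
        "Arn": f"arn:aws:scheduler:::aws-sdk:{service}:{api_action}",
        "RoleArn": role_arn,
        # Request parameters of the targeted action, serialized as JSON.
        "Input": json.dumps(parameters),
    }


send_message = universal_target(
    "sqs",
    "sendMessage",
    "arn:aws:iam::123456789012:role/scheduler",  # placeholder role
    {
        "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/123456789012/jobs.fifo",  # placeholder
        "MessageBody": "hello",
        "MessageGroupId": "tenant-1",
    },
)
```

Here every parameter is named exactly as in the SQS SendMessage documentation, which is the main appeal of this target type.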
The good surprises
The Scheduler is right on time!
It's not clearly written in the documentation, but Marcia Villalba's release post mentions a granularity of one minute. We can safely assume schedule precision is within a 60-second margin when the flexible time window is disabled. In practice, running tests over 10,000 data points, using both Universal and Templated Targets invoking the same Lambda function, the results show both modes successfully trigger within 50 seconds.
In addition, the delta between scheduled time and invocation time is roughly uniformly distributed from 0 to 50 seconds.
Being right on time with a precision of a minute is a huge step forward compared to the 48-hour guarantee on DynamoDB TTL expiration. Of course, this DynamoDB garbage collection feature was never intended for precise scheduling, but measured results were encouraging and a lot of applications still relied on this mechanism. Lately, the delta has considerably increased and the Scheduler release felt like a blessing!
If you require a scheduling mechanism with a precision to the exact second, have a look at the CDK Scheduler.
Authorization has a per-Schedule granularity
The EventBridge Scheduler allows use of a different role for each Schedule, similar to EventBridge Rules current behavior. Granularity with minimal policy documents is therefore easily enforceable and misconfiguration can be avoided thanks to thorough permission management.
This Schedule specific role should include permissions for the targeted service and its corresponding action. This role should also be assumable by the Scheduler service.
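For reference, a Schedule role along those lines could be described with the two policy documents below. This is a sketch: the helper names are hypothetical, and the Lambda action in the usage note is only one example of a targeted service.

```python
import json


def scheduler_trust_policy():
    """Trust policy making the role assumable by the Scheduler service."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "scheduler.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    })


def target_permissions(action, resource_arn):
    """Minimal permissions policy: one action on one resource."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": action,
            "Resource": resource_arn,
        }],
    })
```

For a Schedule invoking a single Lambda function, `target_permissions("lambda:InvokeFunction", function_arn)` would be all the role needs.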
Since one-time Schedules will mostly be provisioned at runtime (for instance, to send a reminder email to a user 10 days after their initial connection), please note that the role assumed by the compute unit in charge of creating the Schedule should:
- allow the schedule:CreateSchedule action
- allow the iam:PassRole action for the role to be used by the Schedule
Schedules access patterns are relevant
Schedules can be grouped in Schedule Groups (by default, they are created in a group conveniently named default). Schedule ARNs are predictable: no technical IDs are involved in the management of Schedules and Schedule Groups. A Schedule ARN follows this syntax: arn:aws:scheduler:${region}:${accountId}:schedule/${scheduleGroupName}/${scheduleName}.
Listing existing Schedules with the ListSchedules action allows multiple access patterns:
- list all Schedules
- list all Schedules in a specific Schedule Group
- list all Schedules whose name has a specific prefix
- list all Schedules in a specific Schedule Group and whose name has a specific prefix
This allows for clear business logic separation, as in multi-tenant applications, ensuring no collision occurs within code when it comes to handling Schedules dedicated to a single tenant. It strangely resembles a composite primary key access pattern on a DynamoDB table, where Schedule Group names act as separate partitions and Schedules as distinct items whose name is the sort key.
Get, Update and Delete actions, however, require an exact identifier: Name and GroupName (which is equivalent to providing the ARN).
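The four listing patterns and the predictable ARN can be sketched as follows. The helper names are hypothetical; GroupName and NamePrefix are the ListSchedules filters as I understand them.

```python
def list_schedules_kwargs(group_name=None, name_prefix=None):
    """Build ListSchedules filters; both are optional and can be combined."""
    kwargs = {}
    if group_name is not None:
        kwargs["GroupName"] = group_name    # the "partition"
    if name_prefix is not None:
        kwargs["NamePrefix"] = name_prefix  # the "sort key begins with"
    return kwargs


def schedule_arn(region, account_id, group_name, name):
    """Rebuild a Schedule ARN from its group and name, no lookup required."""
    return f"arn:aws:scheduler:{region}:{account_id}:schedule/{group_name}/{name}"
```

For instance, `list_schedules_kwargs("tenant-42", "reminder-")` would target all reminders of a single tenant, and could be passed to boto3 as `client.list_schedules(**kwargs)`.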
Scheduler is protected against recursive calls
Universal targets leverage a dedicated compute unit to execute SDK actions for a specific Schedule. Not all SDK services and actions are included in this environment. For instance, all actions of the Scheduler are excluded, preventing unintended recursive calls that may break the bank! I did, however, have a lot of fun trying to create a recursive Schedule targeting arn:aws:scheduler:::aws-sdk:schedule:createSchedule with the same payload.
The not so good surprises
Schedules remain visible after their job is done
The Scheduler does not distinguish still-relevant and irrelevant Schedules. What I call irrelevant Schedules are:
- one-time Schedules whose scheduled date has passed and whose target was successfully invoked
- recurring Schedules whose end date has passed and whose target was successfully invoked on all occurrences
- deactivated Schedules. Those are the only irrelevant Schedules that can be identified and filtered out of listing operations
Those Schedules are indeed irrelevant since there are no remaining tasks associated with them. Except for debugging purposes, they have no remaining impact on the overall application behavior.
Keeping those irrelevant Schedules around induces various problems.
Irrelevant Schedules count towards the per-region quota of 1 million Schedules. While this quota can be increased, any limitation impacting an application's history (namely, all Schedules that were ever created in the context of a specific application) is doomed to be a critical problem at some point. Remember disk space issues induced by endlessly writing application logs? We're finding ourselves in the exact same situation here.
In addition, no validation occurs at Schedule creation to ensure newly created ones are not already irrelevant at the time of creation.
Finally, there is no efficient way to list remaining relevant Schedules at any time on a given workload.
Templated Targets input field mapping is highly inconsistent
The optional input field that you can use on templated targets is highly inconsistent. I had to experiment quite a lot with schedules to be able to match it with the API reference documentation of each target and produce the following few mappings:
- EventBridge → PutEvents: input is mapped to the Entries[0].Detail field
- Kinesis Data Firehose → PutRecord: input is mapped to the Record.Data field
- Lambda → Invoke: input is mapped to the Payload field
- Amazon SNS → Publish: input is mapped to the Message field
- Amazon SQS → SendMessage: input is mapped to the MessageBody field
- Step Functions → StartExecution: input is mapped to the input field
The Scheduler supported action list is inconsistent
Actions relative to Schedules are referenced with schedule:${action}, while actions relative to Schedule Groups are referenced with scheduler:${action} in policy documents. It's a small detail, but it can be really troublesome the first time you write a policy document for the Scheduler. You can find a full list of all actions in the Scheduler documentation.
Update action has a replace all strategy
Missing optional fields in an update statement are replaced with their default value. Updating a Schedule therefore has unintended side effects if any value that should remain unchanged is not provided in the payload.
Prefix attribute for ListSchedules action regex does not match name attribute regex
Schedule name has the following regexp: ^[0-9a-zA-Z-_.]+$
Listing Schedules using a name prefix filter only accepts an argument following this regexp: ^[a-zA-Z][0-9a-zA-Z-_]*$
Long story short: you can only use the name prefix filter access pattern for Schedules whose name starts with an alphabetical character. I initially designed my one-time Schedule names to start with the ISO8601 representation of the scheduled date to circumvent the irrelevant Schedules issue. This proved to be a wrong design choice, since all ISO8601 representations start with a number and therefore cannot be used as the prefix attribute in a ListSchedules operation.
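The mismatch is easy to reproduce locally with the two regexps quoted above. The Schedule names below are made-up examples, not the ones I actually used.

```python
import re

# Regexps as documented for the Schedule name and for the name prefix filter.
NAME_PATTERN = re.compile(r"^[0-9a-zA-Z-_.]+$")
PREFIX_PATTERN = re.compile(r"^[a-zA-Z][0-9a-zA-Z-_]*$")


def is_valid_name(name):
    return NAME_PATTERN.match(name) is not None


def is_valid_prefix(prefix):
    return PREFIX_PATTERN.match(prefix) is not None


# A date-first name is accepted at creation...
date_first_ok = is_valid_name("2022-12-21_reminder-user-42")
# ...but its date part cannot be used as a prefix when listing.
date_prefix_ok = is_valid_prefix("2022-12")
# Putting an alphabetical marker first makes the prefix usable.
alpha_prefix_ok = is_valid_prefix("reminder-2022-12")
```

Starting names with a business prefix and putting the date second keeps both regexps happy.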
Should you use the AWS EventBridge Scheduler?
You're currently using CloudWatch Rules or EventBridge Rules
Rules using cron-based and rate-based schedules should be migrated to EventBridge Scheduler. You'll be able to reach more target types than with the existing EventBridge target catalog. You'll also be able to remove a few Lambda functions whose sole purpose was to use the SDK to reach a service not integrated with EventBridge. Finally, the Scheduler's free tier includes 14 million invocations each month. Your application may well stay below that, in which case you'll remain free of charge after migrating to this new service.
You're currently using DynamoDB TTL
Using DynamoDB TTL to schedule one-time tasks can now be definitely deprecated. The pricing impact of removing DynamoDB and Lambda altogether from the required infrastructure to implement such scheduling mechanism is worth it. Even if you're currently fine with the 48 hours window of DynamoDB TTL, you should rely on the Scheduler with the corresponding flexible time window parameter.
Key take-aways
Always use Universal Targets
Indeed, universal targets have quite a few advantages:
- Targets catalog: All templated targets can be achieved with universal targets. Universal targets cover almost all AWS services and actions.
- Developer Experience: You can safely rely on the targeted service's action documentation instead of hoping you're aiming for the correct field using the input shortcut provided by templated targets. Your schedule definitions will be consistent and self-explanatory since they won't rely on EventBridge Scheduler specific shorthand syntax.
- Configurability: You can use all allowed configuration options for the action you want to trigger.
- Cost: There are no additional charges for using universal targets; it costs the same while doing much more!
- Schedule precision: Contrary to what I initially assumed, universal targets are slightly closer to the requested scheduled time (considering p90 delta).
Provision at the right time
Schedules and Schedule Groups can be created at deploy time, using any IaC framework, or at runtime, using the SDK. A few recommendations regarding when to provision which:
- Almost always, Schedule Groups should be provisioned at deploy time.
- Schedule Groups used for tenancy segregation in multi-tenant applications are the only groups that should be provisioned at runtime, at the time of tenant creation.
- Recurring Schedules without start and end dates are relevant throughout the entire lifespan of an application, they should be provisioned using IaC at deploy time.
- One-time Schedules and recurring Schedules with a given timeframe should be created at runtime, resulting from a user action, within their respective previously provisioned Schedule Groups.
If you need to use UpdateSchedule actions, always use GetSchedule beforehand as starting point for your command payload
Indeed, update actions in EventBridge Scheduler use a replace all attributes strategy. If you omit a value that was previously given (at creation or previous update), the Schedule will use the default value for the corresponding missing attributes. This can lead to unexpected behavior.
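A minimal sketch of this read-then-update approach: start from the GetSchedule response, keep only the fields UpdateSchedule accepts, then apply your changes. The helper name is hypothetical, and the field list is my reading of the API reference at the time of writing, so treat it as an assumption to verify.

```python
# Fields accepted by UpdateSchedule (assumption: check against the API reference).
UPDATABLE_FIELDS = (
    "Name", "GroupName", "Description", "ScheduleExpression",
    "ScheduleExpressionTimezone", "StartDate", "EndDate", "State",
    "FlexibleTimeWindow", "KmsKeyArn", "Target", "ActionAfterCompletion",
)


def update_payload(current, **changes):
    """Merge changes into an existing Schedule definition (hypothetical helper).

    `current` is expected to be a GetSchedule response; read-only attributes
    such as Arn or CreationDate are dropped.
    """
    payload = {key: current[key] for key in UPDATABLE_FIELDS if key in current}
    payload.update(changes)
    return payload
```

With boto3, this could look like `client.update_schedule(**update_payload(client.get_schedule(Name=name, GroupName=group), State="DISABLED"))`.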
Prefer the use of UTC
Timezone management is a pain, always. The Scheduler tries to compensate with an optional timezone parameter and handles daylight saving time shifts.
In most cases, if you want to avoid strange timezone behavior, prefer relying on a date management library to convert the scheduled time of a one-time Schedule to UTC before creating it.
Cron-based Schedules are the only relevant Schedules that might benefit from timezone sensitive settings.
Rate-based Schedules are unaffected by this setting.
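As an example of such a conversion, here is a small sketch using only the Python standard library. The function name is hypothetical, and the fixed +01:00 offset in the usage note stands in for whatever timezone-aware value your date library gives you.

```python
from datetime import datetime, timedelta, timezone


def utc_at_expression(local_time):
    """Convert a timezone-aware datetime to a UTC one-time schedule expression."""
    if local_time.tzinfo is None:
        # A naive datetime is ambiguous, refuse to guess its timezone.
        raise ValueError("expected a timezone-aware datetime")
    utc_time = local_time.astimezone(timezone.utc)
    return f"at({utc_time.strftime('%Y-%m-%dT%H:%M:%S')})"


# 8AM in a UTC+1 timezone is 7AM UTC.
expression = utc_at_expression(
    datetime(2022, 12, 21, 8, 0, tzinfo=timezone(timedelta(hours=1)))
)
```

The resulting expression can then be used without any timezone parameter on the Schedule.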
Always use a DLQ on your Schedules
If it can fail, it will fail at some point. CloudWatch metrics available for the Scheduler cannot distinguish failed target invocations on a per-Schedule basis. Provisioning a dead-letter queue and referencing it as the destination for all your schedules is a must-have!
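Attaching a DLQ is a matter of adding a DeadLetterConfig to the target definition. The sketch below also sets a RetryPolicy; the helper name and the default values are arbitrary choices for the example, not recommendations.

```python
def with_dead_letter_queue(target, dlq_arn, max_retries=3, max_event_age_seconds=3600):
    """Return a copy of a target definition with a DLQ and a retry policy (hypothetical helper)."""
    return {
        **target,
        # SQS queue receiving the events that could not be delivered.
        "DeadLetterConfig": {"Arn": dlq_arn},
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_event_age_seconds,
        },
    }
```

As far as I can tell, the Schedule role also needs to be allowed to send messages to that queue, so remember to extend its policy accordingly.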
Implement a regular cleanup process for irrelevant Schedules
Regular cleaning of now irrelevant Schedules should be implemented to keep the total number of Schedules under control and avoid reaching the 1 million quota for the service. You can rely on a rate-based Schedule to regularly invoke a Lambda function dedicated to listing and deleting irrelevant Schedules. You can adjust the retention period during which you still want to keep a Schedule around, for debugging purposes, by changing your programmatic filtering parameters.
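The filtering part of such a cleanup function could look like the sketch below. It assumes each Schedule has been fetched with GetSchedule (listing alone does not return the schedule expression), that one-time expressions were created in UTC, and that EndDate is a timezone-aware datetime. The function name and the 7-day default retention are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

AT_EXPRESSION = re.compile(r"^at\((\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\)$")


def is_irrelevant(schedule, now, retention=timedelta(days=7)):
    """Tell whether a Schedule is past its last occurrence by more than the retention period."""
    cutoff = now - retention
    match = AT_EXPRESSION.match(schedule.get("ScheduleExpression", ""))
    if match:
        # One-time Schedule: compare its scheduled date (assumed UTC) to the cutoff.
        scheduled = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        return scheduled.replace(tzinfo=timezone.utc) < cutoff
    # Recurring Schedule: only irrelevant once its end date has passed.
    end_date = schedule.get("EndDate")
    return end_date is not None and end_date < cutoff
```

Schedules for which this returns True can then be deleted using their Name and GroupName.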
Conclusion
All things considered, the new AWS EventBridge Scheduler service feels like a blessing, especially for one-time Schedules where there were no robust alternatives on AWS. Google Cloud Platform has had Cloud Tasks since 2018 for this specific purpose. It's nice to see AWS matching the offer and providing, almost for free, a dedicated managed service with a precise scheduling mechanism.
At the time of publishing this discovery report, some questions remain unanswered. Among the various subjects I'll dig into, but save the findings for a separate article, you'll find:
- why and when to use the flexible time window parameter
- why and when to use the client token. How can you ensure idempotency when you interact with the Scheduler
- what kind of L2/L3 CDK Construct can and should be implemented to ease up integration of this service