How Epilot Builds a Powerful Webhook Feature with AWS
Sebastian
Posted on January 17, 2024
Introduction
At epilot, we're committed to simplifying the sale and technical implementation of renewable energy solutions through a digital foundation, supporting energy suppliers, grid operators, and solution providers in the energy transition.
One of our features is the integration of webhooks for data synchronization with third-party systems. This allows for timely updates and efficient data exchange, a crucial factor in enhancing our customer service in the energy sector.
In this blog post, we'll dive into how we have harnessed the power of AWS to build a robust webhook feature, enhancing our service capabilities and offering our clients an even more powerful and reliable platform.
The Evolution of Our Webhook System
The initial version of our webhook feature was developed around the time AWS launched a new product known as API Destinations. The concept is simple but powerful: an API Destination is set as the target of an EventBridge rule, and it forwards each matched event as a request to the configured third-party service. One significant advantage of this approach is that EventBridge connections secure the webhook requests. Securing webhook requests is a common challenge: many platforms either do not support it or only support a signing secret. With EventBridge connections, securing a request becomes versatile and robust, offering options like basic authentication (username/password), API keys (e.g., Authorization: ), or OAuth, which larger enterprise customers frequently demand. This method eliminates the need for us to manually store client credentials, as the API Destination handles the signing and forwarding of the request.
The following sketch shows the components of our initial webhook architecture:
The user creates a webhook configuration through our UI. A Lambda function creates an API Destination and an EventBridge connection, then attaches the connection to the API Destination. Next, an EventBridge rule is created with the API Destination as its target. Whenever an event matches this rule, the target is invoked. Failed requests are forwarded to a Dead Letter Queue (DLQ). A Lambda function picks up messages from the queue and stores these events in a table so that failed events can be displayed to the user.
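To make the provisioning flow more concrete, here is a minimal sketch of what such a Lambda function could do. It is not our production code: the function name, the resource naming scheme, and the x-api-key header are assumptions made for illustration, and the `events` argument is assumed to behave like the boto3 EventBridge client (`boto3.client("events")`). Note that the API requires the connection to exist first, because its ARN is passed when the API Destination is created.

```python
import json


def provision_webhook(events, name, url, api_key, event_pattern, dlq_arn, role_arn):
    """Create connection -> API Destination -> rule -> target for one webhook.

    `events` is assumed to behave like boto3.client("events").
    All resource names below are hypothetical.
    """
    # The connection holds the customer's credentials (here: an API key header).
    connection = events.create_connection(
        Name=f"{name}-connection",
        AuthorizationType="API_KEY",
        AuthParameters={
            "ApiKeyAuthParameters": {
                "ApiKeyName": "x-api-key",  # hypothetical header name
                "ApiKeyValue": api_key,
            }
        },
    )
    # The API Destination points at the third-party endpoint and uses the connection.
    destination = events.create_api_destination(
        Name=f"{name}-destination",
        ConnectionArn=connection["ConnectionArn"],
        InvocationEndpoint=url,
        HttpMethod="POST",
    )
    # The rule decides which events trigger the webhook.
    events.put_rule(Name=f"{name}-rule", EventPattern=json.dumps(event_pattern))
    # The destination becomes the rule's target; failed deliveries go to the DLQ.
    events.put_targets(
        Rule=f"{name}-rule",
        Targets=[
            {
                "Id": f"{name}-target",
                "Arn": destination["ApiDestinationArn"],
                "RoleArn": role_arn,
                "DeadLetterConfig": {"Arn": dlq_arn},
            }
        ],
    )
    return destination["ApiDestinationArn"]
```

Passing the client in as an argument keeps the function easy to test with a stand-in object; in a real Lambda function one would pass the boto3 client.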
Caveats
As our platform scaled with increased traffic and users, unforeseen issues surfaced. The architecture we initially implemented revealed deficiencies in areas we hadn't anticipated:
Frequent Timeouts: Our customers often synchronise data generated by epilot with their systems, some of which may be slower and unable to handle requests asynchronously. A notable limitation of API Destination is its strict 5-second timeout on requests. This constraint is frequently encountered when syncing data with third-party systems, as their response times can easily exceed this duration.
Payload Size: EventBridge has a hard event size limit of 256 KB. While this is a substantial data allowance, we occasionally reach this limit with data-heavy events. In serverless environments, a typical way to work around such limits is the claim-check pattern: store the payload externally and pass only a reference to it. However, this approach is not supported by API Destination.
No Analytics: Monitoring within EventBridge remains a complex issue, particularly in determining the success of requests and reflecting this in the user interface. While Dead Letter Queue (DLQ) setups make it possible to capture failed events, the challenge lies in effectively tracking and displaying successful events.
Was the request successful?: In our platform, webhooks can be triggered by automations. An automation is a set of predefined actions, such as triggering a webhook. We often received feedback from customers who found it confusing when webhook actions appeared to be successful but ultimately failed. Given the 'fire-and-forget' nature of webhooks, a challenge arises: How can we promptly display a failure when a request doesn't go through successfully?
No static IP support: Larger enterprise customers often require the support of static IPs for using webhook features, which poses a challenge as API Destinations currently do not offer this capability.
How AWS Step Functions Fulfilled Our Requirements
The lack of the features mentioned above showed that we needed a new webhook architecture.
The AWS Step Functions team recently published a new HTTP task, which is very similar to API Destination. It can reuse an EventBridge connection to authorize the request, and it forwards the request to a third-party system. It has no CDK support yet, and it has to be stable for some months before we adopt it. This announcement, however, gave us the idea of using a Step Function to implement our webhook architecture. With Step Functions we can:
- remove any timeout issues
- call them synchronously (30s API GW timeout) and asynchronously (no timeout)
- create a Lambda task that forwards the request manually, which:
  - allows us to use the claim-check pattern and send larger payloads
  - can run within a VPC, so a static IP is easy to add
  - gives us complete control over how we fetch and store the HTTP response
- store all HTTP responses in a new event table
- easily extend the Step Function with new features when necessary
Playing around with the awesome Step Function builder gets us the following output:
The goal is to use as few Lambda functions as possible to mitigate cascading cold starts. The Step Function architecture itself is straightforward and consists of the following tasks:
- GetItemTask: Fetch the webhook configuration to know where and how to send the event.
- PutItemTask: Persist an event to DynamoDB with some initial data and an 'in_progress' state.
- LambdaInvokeTask: Call the third party with the input of the state machine. When the input contains an s3_key, hydrate the payload first.
- LambdaInvokeTask: Set the event to 'failed' or 'succeeded' based on the HTTP response.
- LambdaInvokeTask (exceptions): Catch unknown exceptions, raise alerts, and set the event to 'failed' as well.
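The task list above could translate into a state machine definition roughly like the following. This is a sketch in Amazon States Language, built as a Python dictionary, and not our actual definition: the table names, function names, key schema, state names, and result paths are all hypothetical. Only the service integration resources (dynamodb:getItem, dynamodb:putItem, lambda:invoke) and the States.ALL catch-all are standard.

```python
import json

# Hypothetical resource names, for illustration only.
CONFIG_TABLE = "webhook-configs"
EVENT_TABLE = "webhook-events"
SEND_FN = "webhook-send-request"
UPDATE_FN = "webhook-update-event"
FAILURE_FN = "webhook-handle-exception"

# Route any otherwise unhandled error to the exception handler state.
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleException"}]


def lambda_task(function_name, result_path, next_state=None):
    """Build a Lambda invoke task that receives the whole state as payload."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "ResultPath": result_path,
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


definition = {
    "Comment": "Webhook delivery (illustrative sketch)",
    "StartAt": "GetConfig",
    "States": {
        # GetItemTask: load the webhook configuration.
        "GetConfig": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem",
            "Parameters": {"TableName": CONFIG_TABLE, "Key": {"id": {"S.$": "$.config_id"}}},
            "ResultPath": "$.config",
            "Catch": CATCH_ALL,
            "Next": "CreateEvent",
        },
        # PutItemTask: persist the event with an initial 'in_progress' state.
        "CreateEvent": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": EVENT_TABLE,
                "Item": {"id": {"S.$": "$$.Execution.Name"}, "status": {"S": "in_progress"}},
            },
            "ResultPath": None,  # JSON null: discard the result, keep the input
            "Catch": CATCH_ALL,
            "Next": "SendRequest",
        },
        # LambdaInvokeTask: call the third party (hydrating from S3 if needed).
        "SendRequest": {**lambda_task(SEND_FN, "$.response", "UpdateEvent"), "Catch": CATCH_ALL},
        # LambdaInvokeTask: mark the event 'failed' or 'succeeded'.
        "UpdateEvent": {**lambda_task(UPDATE_FN, "$.update"), "Catch": CATCH_ALL},
        # LambdaInvokeTask (exceptions): alert and mark the event 'failed'.
        "HandleException": lambda_task(FAILURE_FN, "$.failure"),
    },
}

print(json.dumps(definition, indent=2))
```

Every state except the handler itself catches States.ALL, so an unexpected error in any step still ends with the event marked as 'failed' instead of staying 'in_progress'.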
This results in the following (high level) architecture sketch:
We're updating our system to use AWS Step Functions instead of API Destinations with EventBridge. This change is pretty straightforward, so we don't need any complex migration scripts. We can still use the EventBridge connections we already have, but we'll need to attach them manually to our Lambda tasks for now. We're hoping to automate this attachment soon by using the new HTTP task mentioned above.
For event publishing, we're using a new API endpoint, /webhook/{config_id}/trigger?sync=true|false. The endpoint checks if the data is bigger than 256 KB and, if so, stores it on S3. After that, it triggers the Step Function either in the background or synchronously. This setup is great because it means consumers don't have to worry about permissions; they just need to set up our webhook client. Of course, the consumer can still use the old method of just sending an EventBridge event to trigger the webhook like before.
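The size check and the later hydration step form the two halves of the claim-check pattern. The sketch below illustrates the idea under a few assumptions: the function names and the key layout are hypothetical, and a plain dictionary stands in for the S3 bucket so the example runs without AWS access. In a real implementation, the dictionary reads and writes would be S3 get and put calls.

```python
import json
import uuid

MAX_INLINE_BYTES = 256 * 1024  # the 256 KB limit mentioned in the post


def to_state_machine_input(config_id, payload, bucket):
    """Return the state machine input, offloading large payloads to `bucket`.

    `bucket` is a dict standing in for an S3 bucket (key -> bytes).
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        # Small enough: pass the payload along inline.
        return {"config_id": config_id, "payload": payload}
    # Too large: store the payload and pass only a reference (the "claim check").
    s3_key = f"webhook-payloads/{config_id}/{uuid.uuid4()}.json"  # hypothetical key layout
    bucket[s3_key] = body
    return {"config_id": config_id, "s3_key": s3_key}


def hydrate(state_input, bucket):
    """Resolve the payload inside the sending Lambda task."""
    if "s3_key" in state_input:
        return json.loads(bucket[state_input["s3_key"]])
    return state_input["payload"]


if __name__ == "__main__":
    bucket = {}
    small = to_state_machine_input("cfg-1", {"hello": "world"}, bucket)
    large = to_state_machine_input("cfg-1", {"data": "x" * (300 * 1024)}, bucket)
    print("payload" in small, "s3_key" in large, len(bucket))
```

A production version would likely use a threshold somewhat below the hard limit, since the state machine input also carries fields other than the payload.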
Conclusion
API Destination proved to be an excellent service for creating a basic webhook feature, but its limitations led us to transition to AWS Step Functions. This shift has enabled us to offer our customers enhanced capabilities, including static IP support, improved analytics, handling of larger payloads, and the elimination of timeout issues. With Step Functions, we now have the flexibility to scale and evolve our architecture to meet our growing needs and those of our customers.
Do you want to work on features like this? Check out our career page or reach out to me on Twitter.