AWS API Gateway supports websockets.

Unfortunately, the service does not persist connection data or reconnect clients automatically when the internet connection is flaky, and I could not find any example projects with those features. In this article, I will explore a potential solution using Lambda and DynamoDB.

Firstly, how do API Gateway websockets work?

AWS uses routes to execute different actions. Consider a chat application workflow, assuming Lambda is used for compute:

Client connects to websocket, firing the $connect route and associated Lambda.
Client sends JSON payload {action: 'sendmessage'}, firing the sendmessage route.
Server can send data to the client by posting to the socketUrl with the client's ConnectionId.
If client sends JSON payload without action, the $default route is fired.
Client disconnects, firing the $disconnect route.
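To make the routing step concrete, here is a minimal sketch of how a payload maps to a route, assuming the API uses a route selection expression of $request.body.action. The function names and the route table are my own illustration, not part of any AWS SDK. The second helper builds the URL a server would post to in order to reach one client.

```python
import json

# Hypothetical route table for the chat example. "$default" is the fallback route.
ROUTES = {"sendmessage", "$default"}


def select_route(raw_body: str, routes=ROUTES) -> str:
    """Pick a route the way a $request.body.action selection expression would."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return "$default"  # non-JSON payloads fall through to $default
    action = payload.get("action") if isinstance(payload, dict) else None
    if isinstance(action, str) and action in routes:
        return action
    return "$default"  # missing or unknown action


def connection_url(api_id: str, region: str, stage: str, connection_id: str) -> str:
    """Build the @connections URL used to post data to one connected client."""
    socket_url = f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}"
    return f"{socket_url}/@connections/{connection_id}"
```

For example, select_route('{"action": "sendmessage"}') returns "sendmessage", while a payload without an action returns "$default".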

Limitations with API Gateway websocket:

Every time the $connect route is called, a new ConnectionId is created. To persist connection on flaky internet, the client must store an ID.
Lambda is ephemeral, so a database like DynamoDB is required to persist connection data. Connection data shouldn't be kept in memory, because it is lost if the function instance fails or is recycled.
The 10 minute idle connection timeout can be avoided by sending a periodic ping/pong request.
The maximum websocket connection duration is 2 hours, after which a new connection session is required.
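The two time limits above can be turned into a small decision function for the client. This is only a sketch: the safety margin and the function name are assumptions of mine, and the timestamps are plain seconds so that the logic is easy to test.

```python
IDLE_TIMEOUT_S = 10 * 60      # idle timeout from the list above
MAX_DURATION_S = 2 * 60 * 60  # maximum connection duration from the list above
SAFETY_MARGIN_S = 60          # assumption: act one minute before each limit


def next_action(now: float, connected_at: float, last_activity_at: float) -> str:
    """Return "reconnect", "ping" or "wait" for the current connection."""
    if now - connected_at >= MAX_DURATION_S - SAFETY_MARGIN_S:
        return "reconnect"  # a ping cannot extend the 2 hour limit
    if now - last_activity_at >= IDLE_TIMEOUT_S - SAFETY_MARGIN_S:
        return "ping"       # keep the connection from going idle
    return "wait"
```

A client could call this on a timer and act on the result. Checking the duration limit first matters, because a ping only resets the idle timer.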

A solution to reconnecting websockets on AWS

This solution uses custom actions instead of $connect and $disconnect to manage connections, which shifts connection management to the client instead of AWS. DynamoDB will persist connection data and provide an event-driven architecture for returning messages as well as interfacing with external services.

Case 1: Ideal connection

After opening a websocket connection, the client will send an empty socketId to the open route, which generates a new socketId for the client. The socketUrl, ID and currentId are stored in DynamoDB.
Client calls external service “ping”, providing socketUrl and socketId. These details are used to update the DynamoDB table.
Updating DynamoDB will trigger the message Lambda to post any stored messages in the updated row to the client.
Client is responsible for triggering the close route, which will delete the associated row in DynamoDB.
If client fails to close the connection, a TTL (time to live) data deletion timer can be specified for DynamoDB.
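The steps in Case 1 can be sketched with a plain dictionary standing in for the DynamoDB table. Everything here is illustrative: the attribute names (connectionId, messages, expiresAt), the TTL length and the function names are assumptions, and a real Lambda would replace the dictionary operations with DynamoDB calls.

```python
import time
import uuid

TTL_SECONDS = 2 * 60 * 60  # assumption: expire rows after the 2 hour connection limit


def open_route(table, connection_id, socket_url, socket_id="", now=None):
    """Create a row for a new client, or refresh the row of a known socketId."""
    now = time.time() if now is None else now
    row = table.get(socket_id) if socket_id else None
    if row is None:
        # Empty or unknown socketId: generate a fresh one for the client.
        socket_id = uuid.uuid4().hex
        row = {"socketId": socket_id, "messages": []}
    row.update(
        connectionId=connection_id,
        socketUrl=socket_url,
        expiresAt=int(now) + TTL_SECONDS,  # DynamoDB TTL expects epoch seconds
    )
    table[socket_id] = row
    return socket_id


def store_message(table, socket_id, message):
    """Stand-in for the external service writing a result into the row."""
    row = table.get(socket_id)
    if row is None:
        return False
    row["messages"].append(message)
    return True


def close_route(table, socket_id):
    """Delete the row when the client closes the connection cleanly."""
    table.pop(socket_id, None)
```

The client keeps the returned socketId and sends it with later requests. Because open_route also accepts an existing socketId, the same handler covers the reconnect path.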

Case 2: Flaky connection

If the internet connection is poor, the websocket connection may close on the client side.

When the client is offline, new messages will be stored in DynamoDB.
A “back online” event listener on the client will create a new websocket connection.
Client will send an existing socketId to the open route, updating DynamoDB with the new ConnectionId. This triggers an event to send the last message back to the client.
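The event-driven part can be sketched as a handler for the DynamoDB stream. This is a rough outline under several assumptions: the row uses connectionId (string) and messages (list of strings) attributes, the stream is configured to include the new image, and the send callable is a hypothetical wrapper around whatever posts data to the websocket connection.

```python
def pending_deliveries(event):
    """Collect (connection_id, message) pairs from a DynamoDB Streams event."""
    deliveries = []
    for record in event.get("Records", []):
        # Deleted rows (REMOVE) have nothing to deliver.
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        # NewImage holds the row after the write, in DynamoDB's typed JSON.
        image = record.get("dynamodb", {}).get("NewImage", {})
        connection_id = image.get("connectionId", {}).get("S")
        messages = [m["S"] for m in image.get("messages", {}).get("L", []) if "S" in m]
        if connection_id:
            deliveries.extend((connection_id, m) for m in messages)
    return deliveries


def handler(event, send):
    """Post every stored message to the row's current connection."""
    sent = 0
    for connection_id, message in pending_deliveries(event):
        send(connection_id, message)
        sent += 1
    return sent
```

In a deployed Lambda, send could wrap the post_to_connection call of the boto3 apigatewaymanagementapi client. To deliver only the last message, as described above, the handler could post just the final element of the messages list.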

Case 3: Long connection

For long-lived sessions, having the client disconnect and reconnect at an interval of under 10 minutes works around the connection time limits described above.

Limitations to consider

The DynamoDB stream adds latency when delivering message updates to the client.
Connection data will not persist if the client page is refreshed. Also, storing the socketId in local storage is not ideal when multiple tabs are used.

Reconnecting websockets on AWS

Meng Lin