Data integrity in Ably Pub/Sub

ccarriazo

Carolina Carriazo

Posted on November 21, 2024

Data integrity in Ably Pub/Sub

When you publish a message to Ably Pub/Sub, you can be confident that the message will be delivered to subscribing clients, wherever they are in the world.

Ably is fast: we have a 99th percentile transmit latency of <50ms from any of our 635 global PoPs, that receive at least 1% of our global traffic. But being fast isn’t enough; Ably is also dependable and scalable. Ably doesn’t sacrifice data integrity for speed or scale; it’s fast and safe.

This post describes the Ably Pub/Sub architecture and features that guarantee your Pub/Sub message is delivered, in order, exactly once, to clients globally; while protecting against regional data center failures, individual instance failures, and cross-region network partitions.

How Ably regional Pub/Sub architecture and persistence works

Each region in Ably is capable of operating entirely independently, but regions also coordinate with each other to share and replicate messages globally.

In each region, a single Pub/Sub channel has exactly one primary location across a fleet of Ably servers. When a message is published by a client attached to that region, the message is processed and stored by that single Pub/Sub channel location before the message is ACKed. Once the publishing client receives the ACK, they can be confident that the message will not be lost, and will be delivered to all subscribing clients.

As well as exactly one primary location, an Ably Pub/Sub channel also has exactly one secondary location. Both the primary and secondary location durably store a copy of the message before the ACK is sent to the client.

How Ably's global pub/sub architecture persists messages across primary and secondary locations in each region for durability.

If the primary location of a Pub/Sub channel fails, the secondary location is ready to take over. The secondary location already has a copy of all the required message data and becomes the primary. A new secondary location is created and the message data is replicated to that new location. This means that clients are isolated from individual instance failure.

The message is immediately replicated to the primary and secondary locations in other regions globally who have subscribing clients. Ably will store up to 14 copies of each message, globally, to mitigate against the failure of entire regions of the Ably service.

All messages are persisted durably for two minutes, but Pub/Sub channels can be configured to persist messages for longer periods of time using the persisted messages feature. Persisted messages are additionally written to Cassandra. Multiple copies of the message are stored in a quorum of globally-distributed Cassandra nodes.

 How Ably SDK clients interact with the Pub/Sub architecture

Each Ably region is capable of operating independently, so Ably SDK clients can connect to any region. By default clients connect to the region providing the lowest latency, but if an issue with that region is detected (perhaps the region is erroring or is slow to respond), the SDKs will connect to another fallback region, and continue operating as normal.

Clients are isolated from region failures. There is no single point of failure in the regional Ably architecture. Regional failures will not affect the global availability of Ably, because clients will fallback to another region and continue operating.

Exactly-once delivery

Ably Pub/Sub messages support exactly-once delivery, which is achieved through two mechanisms: idempotent publishing, and message delivery on the SDK.

Idempotent publishing: Pub/Sub messages have a unique ID field which is used to deduplicate messages. When a message arrives in a region – either because it was published by an Ably SDK connected to that region, or replicated from another region – the primary Pub/Sub channel location verifies the message’s uniqueness by checking its ID. This idempotency check is performed against 2 minutes of message history, and ensures against a client accidentally publishing a message twice. The message is persisted at the primary location and checked for uniqueness in a single atomic operation, which guards against a race between checking uniqueness and durably storing the message.

The primary location in each region performs the idempotency check both for messages that are published to it directly, and for messages replicated from another region, so that the SDKs can retry the publish against another region, and that message will still be checked for uniqueness and not delivered to subscribing clients twice.

Message delivery on the SDK: On the subscribing client, the messages are delivered with a series ID. If the client disconnects, they can provide the last-seen series ID when reconnecting with the resume operation. This resume operation allows the client to pick back up from the exact point it was processed in the stream of Pub/Sub messages, ensuring no duplicates or gaps in the message stream. By default, the SDKs will retry failed message publishes.

Message ordering

The order of messages published by a Pub/Sub WebSocket client on a single WebSocket will remain fully ordered. The regional location of the Pub/Sub channel will share those messages in order with all the other regional locations. Those Pub/Sub messages will be delivered in the same relative order to all subscribing clients, regardless of the region that client is connected to.

To make sure that regions in Ably Pub/Sub can operate independently, there’s no guaranteed order between clients connected and publishing to different regions. This is to allow the two clients, connected to different regions, to be able to publish concurrently at high throughputs and low latency without needing to coordinate globally with each other. The messages published by each client in their respective WebSocket connections will always retain the same order relative to other messages on that same WebSocket connection.

In classic distributed system parlance, each client retains their own causal consistency so that Hello -> World never becomes World -> Hello.

How Ably ensures ordered message delivery across and within regions. Messages retain their order relative to other messages published to the same region.

So, based on what we know about regional Ably architecture, the message ordering that each region sees is defined by when those messages arrive in that region; either by being replicated from another region, or by a client connected to that region performing a message publish. Once the order of messages is established in a single region, all the clients connected to that region will see those messages in the exact same order, but clients in a different region might have a slightly different regional ordering.

Ably's data integrity: a summary

When using a Pub/Sub product, you want it to “just work”. You don’t want to have to worry about if your subscribing clients are going to receive the message you published, or if the messages are going to keep their published order, or if there’s failure in some data center that Ably is running in. You don’t want to have to care. We’ve spent a long time thinking about failure cases, and designing a system that’s both super fast, but also retains the integrity of the data published on Pub/Sub channels.

This post describes the internals for some of the features we use to ensure that Ably Pub/Sub “just works”.

💖 💪 🙅 🚩
ccarriazo
Carolina Carriazo

Posted on November 21, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Data integrity in Ably Pub/Sub
webdev Data integrity in Ably Pub/Sub

November 21, 2024