What Byzantine Generals Teach Us About AWS SQS
Ali Ogun
Posted on September 9, 2023
Ever heard of the Byzantine Generals? Click here and check it out. But in a nutshell: Generals trying to launch a coordinated attack on a city, but some of them are secretly plotting to confuse the troops. It's like trying to plan a surprise party when a few friends are determined to spill the beans.
Now, let's step into the tech world. We're talking about messages traveling through a web of computers and bouncing around. They need to arrive in the right order. This is where the message queues comes in. In AWS the service is called Amazon Simple Queue Service (SQS).
So, how come they are related you are wondering. Right? Stay still. Today, we'll uncover the connection between Byzantine Generals and SQS. It's a story of how ancient problems can teach us a lot about modern tech. So grab your coffee, and let's dive in!
AWS SQS and Message Ordering:
In AWS, SQS is the very first service that was published for public use in 2004. In cloud computing, where information flows like a river through Amazon Web Services (AWS). Among its many offerings, Amazon Simple Queue Service (SQS) stands as a message queue service for Asynchronous messaging. It's essential to create decoupled resilient applications.
Within AWS SQS, two distinct queue types emerge: Standard Queues and FIFO (First-In-First-Out) Queues.
Standard Queues : It offers a best-effort approach to message ordering. The reason is here standard queues prioritize scalability and high throughput, which may lead to messages being processed out of order. But why? We'll come to that in a moment.
FIFO Queues : These queues guarantee that messages are delivered in the exact same order. When message ordering is a crucial assurance for applications, such as processing financial transactions or completing e-commerce sales, this service comes in handy.
Okay, here that question comes to mind. What is best-effort and why my messages come out of order in the standard queue?
To make things easy to understand, we use often illustrations to grab that kind of abstract concept like SQS. You can see that it looks like a solid pipe. But is it? What is going on under the hood really? Let's take a look at the image below.
This image is from AWS reinvent event. You can see that it's represented as a solid pipe. But let's take a look at the next image.
In this image, messages seem randomly distributed but of course, sorted with best-effort. What is best-effort? Why order is not guaranteed? Let's zoom into the pipe itself.
If we look closer, we understand the reason why it's not a solid pipe. *Because it is distributed on Amazon SQS servers!
What do I mean? When you push a message into the queue, Amazon SQS redundantly stores a message in more than one availability zone (AZ). Message copies are stored in multiple AZs, no single computer, network, or AZ failure can make messages inaccessible. It's perfect right! A standard queue makes a best effort to preserve the order of messages, but more than one copy of a message might be delivered out of order.
Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability.
On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message. If this occurs, the copy of the message isn't deleted on that unavailable server, and you might get that message copy again when you receive messages. *
If you don't mind the order of the messages in your use case, it brings you some superiority in terms of scalability. As long as you keep pushing new messages, AWS can spin up more and more SQS servers and keep handling your messages. Well, here the price that you pay is an ordering issue.
Okay, but how does SQS FIFO Queue work its magic?
Here the AWS team put in hard work to guarantee and to provide exactly-once processing. The exact details of how AWS SQS maintains the order of messages in a FIFO queue are not publicly disclosed, as this is considered part of AWS's internal architecture and proprietary information.
It's a blend of intelligent mechanisms. Each message entering an SQS FIFO queue carries a unique Message Deduplication ID to prevent duplicates. Additionally, messages are organized into Message Groups defined by a Message Group ID. Messages within the same group are processed sequentially, respecting the order of entry. The AWS team here making an extra effort to make sure the order is maintained which comes with a cost. It's more expensive than Standard Queue and has a limited number of transactions per second (TPS).
The second thing is the maximum timeout for storing a given message deduplication ID is 5 minutes. A duplicate message would be accepted and continued processing in the SQS FIFO queue if it arrived more than five minutes after the original. If exactly-once were a crucial requirement, other distributed system components would have to be responsible for ensuring message uniqueness.
So, let's come back to our topic. How all those are related to the Byzantine Generals?
Understanding the Byzantine Generals Problem:
To grasp the connection between Byzantine Generals and the modern-day challenges of message ordering, let's take a closer look at the historical problem.
Imagine an ancient army preparing to lay siege to an enemy's city. The success of their mission depends on synchronized attacks from multiple fronts. But here's the catch: not all the generals can be trusted. Some of them might be traitors and they can sabotage the plan.
In the face of this problem, the loyal generals must somehow agree on a common strategy, even though they can't be sure if the orders they receive from their fellow commanders are genuine or deceitful. The challenge, known as the "Byzantine Generals Problem", is to achieve consensus in a distributed and potentially unreliable communication network.
The complexity of contemporary distributed systems corresponds to this historical issue in a world where integrity is of utmost importance. Our digital systems struggle with faulty parts, network issues, and the necessity of maintaining the right sequence of events, much like the devoted generals trying to plan their attack.
The Byzantine Generals Problem serves as a timeless parable of the difficulties inherent in distributed decision-making. It underscores the vital importance of achieving consensus and maintaining order, whether on the battlefield of ancient Byzantium or in today's cloud computing infrastructure.
Message Ordering Challenges in Distributed Systems such as SQS:
Maintaining the proper sequence of messages is comparable to orchestrating a difficult symphony in the interconnected world of distributed computing, where data passes across networks. Consider a situation where you must make sure a set of activities takes place exactly as planned.
Distributed systems(such as SQS) often involve multiple nodes(most probably different AZs) or components that communicate and collaborate to achieve a common goal. However, these systems operate in a volatile environment with uncertainty. Network delays, hardware failures, and asynchronous communication can introduce chaos into what should be an orderly sequence of events. This quest for precise message ordering is a fundamental challenge in distributed systems.
In the AWS world, AWS Simple Queue Service (SQS) comes to the rescue and AWS team manages this chaos for us. SQS offers a solution to the solve the message ordering problem. However, it's not the end of the story. SQS FIFO queues guarantee in-order and exactly-once processing only within the queue itself, not in the entire distributed system that would use the queue.
The Total Order Broadcast problem appears in the field of distributed computing. According to the Total sequence Broadcast, messages must be sent to every participant in the same sequence, regardless of who they are to. In a distributed system, FIFO delivery of messages is basically a special case of the issue with extra order restrictions. Understanding the constraints of potential FIFO delivery problem solutions is aided by research on the Total Order Broadcast problem. Simply put, the Total Order Broadcast problem is equivalent to the problem of achieving distributed consensus, i.e. having all participants in a distributed system agree on message delivery order.
Although a SQS FIFO queue is undoubtedly a very helpful tool in developing a solution to this issue, it is not a solution in and of itself. The task of guaranteeing exactly-once publication outside the queue typically falls to us developers.
Posted on September 9, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.