Testing Application Resilience: How to Stop Amazon ElastiCache Cluster and Manage Traffic
Devam Parikh
Posted on October 13, 2023
Introduction
As developers, it is crucial to test the resiliency of our applications and understand how they handle failures or disruptions. In this blog post, we will explore a scenario where we need to stop an Amazon ElastiCache cluster to see how our application behaves when Redis is unavailable. Although ElastiCache clusters cannot be stopped, we will discuss alternative approaches to achieve our testing objective.
Understanding Amazon ElastiCache
Amazon ElastiCache for Redis is a managed in-memory data store that provides low-latency access to data for modern applications. It can serve as a cache or as a primary data store. When a cluster has replicas, the primary node replicates data to them asynchronously, as in open-source Redis.
Challenges with Stopping ElastiCache Cluster
Unlike an EC2 instance, an ElastiCache cluster has no stop action: nodes can be rebooted and clusters can be deleted, but a cluster cannot be paused and resumed later. However, we can explore other methods to create scenarios where our application experiences Redis unavailability.
Blocking Incoming Traffic using Security Groups
To simulate Redis unavailability, we can block incoming traffic to the ElastiCache cluster. Security groups act as virtual firewalls, controlling inbound and outbound traffic. By removing all the inbound rules for the ElastiCache cluster, we can prevent any incoming requests from reaching it.
However, it is essential to understand that security groups are stateful and rely on connection tracking[1]. When a rule is removed, connections that are already tracked are not interrupted immediately. Thus, our application may stay connected to the ElastiCache cluster even after the inbound rules are gone.
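Removing the inbound rules can also be scripted so that they are easy to put back after the test. The following is a minimal sketch, not the method used for this post: it assumes an EC2 client object (for example one created with boto3) is passed in, and the function names are hypothetical. It relies on the describe_security_groups, revoke_security_group_ingress and authorize_security_group_ingress calls of that client.

```python
def revoke_all_ingress(ec2_client, group_id):
    """Remove every inbound rule from one security group.

    Returns the removed rules so they can be restored after the test.
    ec2_client is assumed to behave like a boto3 EC2 client.
    """
    response = ec2_client.describe_security_groups(GroupIds=[group_id])
    permissions = response["SecurityGroups"][0]["IpPermissions"]
    if permissions:
        # The rule descriptions returned by describe are passed back as-is.
        ec2_client.revoke_security_group_ingress(
            GroupId=group_id, IpPermissions=permissions
        )
    return permissions


def restore_ingress(ec2_client, group_id, permissions):
    """Re-create the inbound rules saved by revoke_all_ingress."""
    if permissions:
        ec2_client.authorize_security_group_ingress(
            GroupId=group_id, IpPermissions=permissions
        )
```

With boto3 installed, ec2_client would typically be boto3.client("ec2") and group_id the ID of the security group attached to the cluster. Keep the returned rules somewhere safe: as described above, already tracked connections may keep working after the call, so the effect is only visible on new connections.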
Addressing the Issue
Two methods can be used to tackle this issue:
1. Restarting the Application: Restarting the application terminates its existing connections and forces it to open new ones, which the security group now blocks. This lets us validate how the application handles Redis unavailability.
2. Using Network ACLs: Network Access Control Lists (ACLs)[2] operate at the subnet level and allow or deny specific inbound or outbound traffic. Unlike security groups, network ACLs are stateless, meaning they don't automatically allow response traffic. Because every packet is checked against the rules, a deny rule in either direction also cuts off connections that are already established.
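Whichever method is used, the behaviour under test is usually something like the read path below. This is a hedged sketch of a cache-aside read with a fallback, not code from the application discussed here: get_with_fallback, cache_get and load_from_source are hypothetical names, and the exception types to catch depend on the Redis client library in use, so they are passed in as a parameter.

```python
def get_with_fallback(key, cache_get, load_from_source,
                      cache_errors=(ConnectionError, TimeoutError)):
    """Read a value from the cache, falling back to the source of truth.

    cache_get and load_from_source are callables taking the key.
    cache_errors lists the exceptions that mean "cache unavailable";
    a Redis client library usually raises its own exception classes,
    which would be passed here instead of the built-in defaults.
    Returns a (value, origin) tuple where origin is "cache" or "source".
    """
    try:
        value = cache_get(key)
    except cache_errors:
        # Cache is unreachable: serve the request from the source instead.
        return load_from_source(key), "source"
    if value is None:
        # Ordinary cache miss.
        return load_from_source(key), "source"
    return value, "cache"
```

During the test, a healthy application should keep answering with origin "source" (more slowly) instead of failing. Setting short connect and read timeouts on the Redis client matters here, because blocked traffic tends to show up as a timeout and not as an immediate error.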
Network ACL in Depth
You can use your VPC's default network ACL or create a custom one, with rules similar to those of security groups, to add a layer of security to your VPC at no additional cost.
The following diagram depicts a VPC with two subnets, each with its own network ACL. When traffic enters the VPC (such as from a peered VPC, VPN connection, or the internet), the router directs it to its destination.
Network ACL A determines which traffic destined for subnet 1 is allowed to enter it, and which traffic destined for a location outside subnet 1 is allowed to leave it. Similarly, network ACL B regulates traffic entering and leaving subnet 2.
Creating a Custom Network ACL
As illustrated in the figure below, this is how I've configured the denial of incoming traffic from my application to the ElastiCache cluster.
A network ACL comprises both inbound and outbound rules, each capable of allowing or denying traffic. These rules are numbered from 1 to 32766.
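A deny rule of this kind can also be described in code. The sketch below only builds the parameters for one entry; the helper name build_deny_entry and all IDs and CIDR ranges used with it are illustrative assumptions, and 6379 is assumed because it is the default Redis port. The keys match the parameters of the create_network_acl_entry call on a boto3 EC2 client.

```python
def build_deny_entry(network_acl_id, rule_number, source_cidr,
                     port=6379, egress=False):
    """Build the parameters for a network ACL entry that denies TCP traffic.

    source_cidr is the address range of the application's subnet.
    port defaults to 6379, the default Redis port (an assumption here).
    """
    if not 1 <= rule_number <= 32766:
        raise ValueError("rule_number must be between 1 and 32766")
    return {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,
        "Protocol": "6",            # IANA protocol number for TCP
        "RuleAction": "deny",
        "Egress": egress,           # False means an inbound rule
        "CidrBlock": source_cidr,
        "PortRange": {"From": port, "To": port},
    }
```

With boto3, the entry could then be created with ec2.create_network_acl_entry(**build_deny_entry(...)) and removed after the test with delete_network_acl_entry. Choose a rule number lower than that of any allow rule covering the same traffic, otherwise the deny never takes effect.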
When determining whether to allow or deny traffic, AWS evaluates the rules in order, starting with the lowest numbered rule. As soon as a rule matches the traffic, it is applied, and no further rules are assessed.
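The evaluation order can be illustrated with a small simulation. This is a simplified model under stated assumptions, not AWS's implementation: AclRule and evaluate are hypothetical names, only the source address and destination port are matched, and the final return stands in for the catch-all rule that denies anything no numbered rule matched.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int      # rule number, 1 to 32766
    action: str      # "allow" or "deny"
    cidr: str        # source range, e.g. "10.0.1.0/24"
    port_from: int
    port_to: int


def evaluate(rules, source_ip, port):
    """Return the action of the lowest numbered rule that matches."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_range = address in ipaddress.ip_network(rule.cidr)
        if in_range and rule.port_from <= port <= rule.port_to:
            return rule.action   # first match wins, later rules are ignored
    return "deny"                # nothing matched: traffic is denied


rules = [
    AclRule(100, "allow", "0.0.0.0/0", 0, 65535),
    AclRule(90, "deny", "10.0.1.0/24", 6379, 6379),
]
print(evaluate(rules, "10.0.1.15", 6379))  # deny: rule 90 is checked before rule 100
print(evaluate(rules, "10.0.2.15", 6379))  # allow: only rule 100 matches
```

In this example the deny rule works only because its number (90) is lower than that of the broad allow rule (100). Numbered above 100, it would never be reached.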
Conclusion
Testing application resilience is essential to ensure smooth operation in challenging scenarios. While ElastiCache offers no way to stop a cluster, alternative approaches such as blocking incoming traffic using security groups or employing network ACLs can help simulate Redis unavailability. By understanding the statefulness of security groups and the statelessness of network ACLs, we can effectively test our application's behaviour when critical resources are not available.
In summary, remember these key points:
ElastiCache clusters cannot be stopped; they can only be rebooted or deleted, so Redis unavailability has to be simulated at the network level.
Security groups are stateful, meaning existing connections persist when rules are modified.
Network ACLs are stateless and can be used to block traffic, potentially breaking existing connections.
References:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html
[2] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html