Designing for cloud redundancy is more than just application-level

I saw this posted on LinkedIn and wrote up an answer to contribute, but there is more to this topic than fits in 750 characters.

"One of the key principles of cloud resilience is to design your cloud architecture with redundancy in mind. Redundancy means having multiple copies or backups of your resources, such as servers, databases, storage, and network components, across different locations, zones, or regions. This way, if one of them fails or becomes unavailable, you can switch to another one without affecting your users or customers. Redundancy also helps you balance the load and distribute the traffic among your resources, improving performance and scalability."

My experience with redundancy, resiliency, and disaster recovery planning

I have a good bit of experience in disaster recovery (DR) planning for applications and databases. For three years, when working for a Fortune 100 bank, I led the DR efforts for all of the Human Resources systems. This covered servers, applications, databases, storage, security, written plans, testing those plans, gathering the teams in HR and IT to run those plans, and working with other divisions on larger company plans. In addition, I had to know what software and library versions we were on. Essentially, I had to have a full view of what risks we had, how to mitigate them (or not), and what we needed to do to meet legal, regulatory, and business requirements. After that, I helped subsequent companies with their planning, and then many customers of the companies I worked for as they adopted NoSQL databases and moved to the cloud.

What do most people overlook with cloud redundancy and resiliency?

The most overlooked aspect I have seen my entire career, especially for smaller companies, is a defined and signed-off Recovery Time Objective (RTO) and Recovery Point Objective (RPO). In other words, how long can your application/business be down, and how much data can you stand to lose? Most people say they need zero for both...right up until they see the cost and effort. That said, with these two metrics defined and signed off by management, you choose which technologies get you to those metrics and how much you're willing to pay for such a solution. Or do you need to reassess and back those numbers off?

To put this another way, for every second you are down, how much money is the business losing? Can you get an actual number? It's not easy, but IMO you should try. It will be an eye-opening exercise. One customer in a past job estimated they lost ~$25 per second of downtime! Two years later, their business had grown, and the number was ~$40 per second. With these numbers, they over-engineered and tested every piece of infrastructure or code they had. Your number may be a little lower than theirs, but you should try to figure out that number nonetheless. At least have something to shoot for. What dawned on me was that the time it took to type in the SSH command and my password for a single server/instance cost the company $100. Therefore, 100% automation was critical. Restoring from backups costs money in data transfer, employee time, and downtime, so we had to do things to limit the need to restore from backups. I use this as only one example, but you can see how such an exercise would lead to finding holes in your planning. What in your infrastructure is like this example?
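The arithmetic behind that realization is simple enough to sketch. The snippet below is only an illustration: the function name `downtime_cost` is made up for this post, the per-second figures are the customer estimates quoted above, and the four-second login is my assumption, implied by $100 at ~$25 per second.

```python
def downtime_cost(cost_per_second: float, seconds_down: float) -> float:
    """Estimate the money lost during an outage of the given length."""
    return cost_per_second * seconds_down


# Per-second estimates quoted above: ~$25, and ~$40 two years later.
for rate in (25, 40):
    login = downtime_cost(rate, 4)        # assumed ~4 s to SSH in by hand
    one_hour = downtime_cost(rate, 3600)  # a one-hour outage
    print(f"${rate}/s -> manual login: ${login:,.0f}, one hour down: ${one_hour:,.0f}")
```

Plug in your own per-second estimate and the duration of each manual step in your recovery plan, and the steps worth automating first tend to stand out.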

A related topic that people overlook is how much customer trust you stand to lose, but that is far more difficult to calculate.

Another often overlooked aspect of designing for redundancy is testing that redundancy and your written plans. Are you backing up the correct files, repositories, databases, etc.? Can you restore those backups at all? How long does that take? How quickly can you deploy in a new region? If you rely on vendors, what are their RTO/RPO or SLA numbers? If they don't have those numbers or their numbers are higher than yours and you 100% depend on them, you must adjust your numbers or DR solution to take their numbers into account.
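One way to make the vendor question concrete is to list every hard dependency next to your own targets and flag the gaps. This is a rough sketch, not a real tool: the `Dependency` class, the `find_gaps` function, and the sample vendors and numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dependency:
    name: str
    rto_seconds: Optional[float]  # None means the vendor publishes no number
    rpo_seconds: Optional[float]


def find_gaps(target_rto: float, target_rpo: float, deps: List[Dependency]) -> List[str]:
    """Return a description of every dependency that cannot meet the targets."""
    gaps = []
    for dep in deps:
        if dep.rto_seconds is None or dep.rpo_seconds is None:
            gaps.append(f"{dep.name}: no published RTO/RPO")
            continue
        if dep.rto_seconds > target_rto:
            gaps.append(f"{dep.name}: RTO {dep.rto_seconds}s exceeds target {target_rto}s")
        if dep.rpo_seconds > target_rpo:
            gaps.append(f"{dep.name}: RPO {dep.rpo_seconds}s exceeds target {target_rpo}s")
    return gaps


# Hypothetical vendors and numbers, for illustration only.
deps = [
    Dependency("managed-database", rto_seconds=900, rpo_seconds=60),
    Dependency("payment-gateway", rto_seconds=14400, rpo_seconds=0),
    Dependency("email-provider", rto_seconds=None, rpo_seconds=None),
]
for gap in find_gaps(target_rto=3600, target_rpo=300, deps=deps):
    print(gap)
```

Every line it prints is a conversation to have: either with the vendor, or with management about adjusting your own numbers.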

On top of that, do you test once per quarter to be sure of your plan? If you lose part of your infrastructure, do your teams (both business and technical) know where the written plans are? Even more so, have they practiced the steps of the plan so they know what to do and how to act to get back to 100%? The more often you test, the more sure you can be that your solution works and that you can meet the RTO/RPO.
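A drill is more useful when it produces a number you can hold up against the RTO. Here is a minimal sketch, assuming your restore procedure can be wrapped in a single Python callable that returns True on success; `timed_drill` and `fake_restore` are hypothetical names, and the stand-in restore does nothing real.

```python
import time
from typing import Callable, Dict


def timed_drill(restore: Callable[[], bool], rto_seconds: float) -> Dict[str, object]:
    """Run one restore drill and record whether it finished within the RTO."""
    start = time.monotonic()
    try:
        succeeded = bool(restore())
    except Exception:
        # A restore that blows up is a failed drill, not a crashed report.
        succeeded = False
    elapsed = time.monotonic() - start
    return {
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
        "met_rto": succeeded and elapsed <= rto_seconds,
    }


def fake_restore() -> bool:
    """Stand-in for a real restore script; replace with your own procedure."""
    time.sleep(0.01)
    return True


result = timed_drill(fake_restore, rto_seconds=3600)
print(result)
```

Keeping the result from each quarterly drill gives you a trend line, which is a lot more convincing to management than "we think it works."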

If your management team doesn't deem these activities critical to the business, that's a problem. This is akin to owning a large cruise ship that doesn't have enough lifeboats or Personal Flotation Devices (PFDs), and whose crew and customers don't know where any of this gear is or what the procedures are in an emergency. Do you want to be a customer on that cruise ship?

If your management doesn't prioritize these activities, then you cannot be sure you can meet any RPO or RTO set...and guess who will be blamed when things inevitably go wrong? You. This is why you need a realistic RTO and RPO in writing and a written plan to back it up. Those metrics will guide you through writing and testing your plan, and what level of redundancy you need.

Conclusion

There is far more to this topic than meets the eye, and it is very nuanced. Your path will depend on your company, capabilities, the industry you're in, and more. Even with that, you still need a plan in writing, and that overall plan is probably more than what you have today. Yesterday was the best time to have a well-architected, resilient, well-practiced disaster recovery plan. The second best time is now: start, set goals, practice, refine, rinse, and repeat.

Kirk Kirkconnell
