Business continuity and disaster recovery blueprints for enterprises
Stefano d'Antonio
Posted on November 25, 2021
Planning for disasters and having recovery processes in place is critical to any business; whether you run an e-commerce platform or a financial institution, a disruption of IT services means a loss of revenue and reputation for your enterprise.
What changes is the tolerance to those outages.
If your system is down for 10 minutes, it can either mean that users will happily come back and play your online game later, or that you will lose millions in critical transaction fees and your clients will move to a competitor whose system is more reliable.
Why do I need to account for downtime?
Hardware failure is inevitable; a rack of servers can suffer from faulty power supplies, network switches, overheating and so on, despite all the prevention measures in place. Software can also fail, and bugs or attacks can cause disruptions to the service.
If a component develops a fault, your workload on that hardware/software can be disrupted and data can also be lost.
Terminology
Before we dive deeper into the different options, let's clarify a few terms that may not be familiar if you do not regularly deal with systems reliability:
Recovery Time Objective (RTO)
This is what the business defines as the maximum acceptable time it takes to get the system back up in case of an outage. It can be informally or contractually agreed with consumers.
E.G.
Your system is deployed in the West Europe Azure region on Virtual Machines within a single Availability Zone and a single fault/update domain.
The rack of servers that hosts your VM fails and users cannot reach your website anymore.
How long do you deem it acceptable to wait until the system is back up in a different rack/datacenter? That's RTO in a nutshell.
Recovery Point Objective (RPO)
How much data can you afford to lose? RPO is measured as the time between the last recoverable copy of your data and the outage.
This is all about data; it has nothing to do with your system being back to life.
E.G.
Your rack of servers from the previous example develops a fault on its hard drives and data is lost.
Assuming you back up your data regularly, how old is your last backup? That's the RPO.
If you back up the data at 9:00PM every day, the worst-case scenario is that the outage happens at 8:59PM: your backup will be almost a day old and you will have lost ~24 hours of data. 24h is your RPO. You could be lucky and the outage may happen at 9:01PM, right after a completed backup; then you have lost only 1 minute of data, but you need to account for the worst-case scenario.
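To make the worst-case arithmetic concrete, here is a trivial sketch (the dates are made up for illustration):

```python
from datetime import datetime

last_backup = datetime(2021, 11, 24, 21, 0)   # daily backup completed at 9:00PM
outage      = datetime(2021, 11, 25, 20, 59)  # failure one minute before the next backup

print(outage - last_backup)  # 23:59:00 -> roughly 24 hours of data lost (your RPO)
```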
Service-level agreement (SLA)
This is probably the most familiar term; likely you have heard of SLA as it is all over the place for cloud resources.
This is the contractually guaranteed availability of a service over the year, which translates into a maximum acceptable downtime.
In Azure, if a service is down for more than that time, you can ask for credits.
It indicates the provider's confidence in the availability of the service; things could still go wrong and an outage could last longer, but it is extremely unlikely, as this figure comes out of careful internal BCDR planning by Microsoft for each service.
You often hear "three nines, four nines, ..."; this refers to the digits in the percentage. E.G. 99.9 -> three nines, 99.9999 -> six nines...
How do we translate that into time? Let's consider a 99.99% SLA: that means the service could be unavailable for 0.01% of the year; OK, what's that in actual time the system can be down?
Daily: 8s
Weekly: 1m 0s
Monthly: 4m 22s
Quarterly: 13m 8s
Yearly: 52m 35s
You could very well have a server unreachable for 8 seconds every day over the year, or in one go for 52 minutes. Having your system down for an hour could have a massive impact in certain domains, yet 99.99% is quite a good number.
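If you want to reproduce the figures above (give or take a second of rounding), here is a minimal Python sketch:

```python
# Convert an SLA percentage into the maximum downtime for each period.
PERIODS_IN_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": 30.42,     # average month length
    "Quarterly": 91.31,   # average quarter length
    "Yearly": 365.25,
}

def downtime(sla_percent: float, days: float) -> str:
    seconds = (1 - sla_percent / 100) * days * 24 * 60 * 60
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"

for period, days in PERIODS_IN_DAYS.items():
    print(f"{period}: {downtime(99.99, days)}")
```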
Composite SLA
If you consider a single service, 99.99% is a common figure, but your system is unlikely to be that simple; it will usually be composed of multiple chained components.
E.G.
One web app talking to a back-end API talking to a database server.
Even if each component has a 99.99% SLA, in the worst-case scenario they could all go down sequentially: the web app is down for 52m and then comes back up; when the web app is back, the API is down for 52m and then comes back up; when the API is back up, the DB is down for 52m... 2 hours and 36 minutes in total. OK, that syzygy is quite unlikely to happen, but it is nonetheless possible and the service provider bears no responsibility if it does.
The provider would have respected their contractual SLAs for each component, so no credits for you, but your system would have still been down for hours.
You can use this calculator to convert percentage into time over the year: Calculate time from SLA
Those are the Azure services SLAs in a nice map: Azure charts SLA
As discussed, a given system is usually composed of different parts; it is possible to calculate the composite SLA with the formulas here: Composite SLA
I have built a tool to do this for you by adding components and defining dependencies, it's on GitHub: https://github.com/UnoSD/SlaCalculator
2022-02-03 update: I have finally published a graphical web app version of the SlaCalculator, please find it here: http://wiki.unosd.com/slacalculator/
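For illustration, here is a minimal Python sketch of the two formulas linked above: multiply availabilities for components chained in series, and multiply unavailabilities for redundant deployments in parallel.

```python
# Components chained in series: multiply the availabilities.
def serial(*slas: float) -> float:
    availability = 1.0
    for sla in slas:
        availability *= sla / 100
    return availability * 100

# Redundant deployments in parallel: multiply the unavailabilities.
def parallel(*slas: float) -> float:
    unavailability = 1.0
    for sla in slas:
        unavailability *= 1 - sla / 100
    return (1 - unavailability) * 100

chain = serial(99.99, 99.99, 99.99)   # web app -> API -> database
print(chain)                          # ~99.97%: lower than any single component
print(parallel(chain, chain))         # same chain in two regions: ~99.99999%
```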
This article is more about business guidance so I will not dive deeper into the technical aspects, but in Azure you can leverage distribution within a datacenter (availability sets), across datacenters (availability zones) and across regions (region pairs) to maximise the SLA for your solution, see the picture below:
Picture from: Azure resiliency infographic
High availability (HA)
This is quite a generic term and, frankly, you will find it mostly in marketing papers. It means a system resilient to failures, but it does not come with a unit of measure: what I consider highly available could be something with a 70% SLA and a manual failover, while someone else would call HA only a 99.999% SLA. If you would like a less "abrupt" explanation, have a look at the comprehensive Wikipedia page
Disaster recovery (DR)
By now, if you have gone through the whole article, you should know this is our main focus here. A dictionary definition could be: a set of policies, tools and processes to recover data/compute from an unforeseen disaster. All the options available to implement this are in the next section.
Different levels of DR
Bear in mind that different levels can apply to data and to compute. If your web portal is inaccessible, that's still a disaster, but it is usually not as bad as losing data; as long as you have a recent backup of your data, you are in a much better position and can tolerate the system being out of service for a while, since you can bring it back up to the same state where you left it.
No DR
This is the most self-explanatory option; no plan, no resources and no costs.
This can still be a valid option in certain scenarios where your service is not critical and users can wait days/months to access your system again.
If my blog was down for a month, I would be disappointed, but I could just start all over again on a new platform from scratch. There are also real businesses that could tolerate this, but it is quite rare. It could apply to some internal employee systems in certain companies.
You could still have a backup of your articles (like I do) on your laptop and you could restore that on a different platform if dev.to is unavailable; if I have a process to back that up on every change, to me that would count as having a manual DR plan for data, but no DR for compute.
Manual
This option has some overlap with the previous one. When I refer to "manual", I do not mean someone noticing the system is down, trying to fix it and putting it back up by clicking around cloud portals and uploading the site somewhere else; I classify that as "no DR" for the purposes of this article.
Manual DR is a well-documented, step-by-step procedure to react to outages.
A cloud operations team of system administrators will get a notification of service disruption and will be able to respond accordingly and start the failover process.
The process could be: go into the Azure portal, create a new virtual machine, upload the web application, switch the DNS configuration to point to the new server. All documented, and the environment could be back up in a matter of hours.
This is how many organisations approach DR, but it is significantly slow (it can take from hours to days). Problems can happen outside office hours, so you need a team on call during the night, you must make sure they are skilled and that there is no single point of failure (people on holiday, sick leave, leaving the company and so on...), and despite all of this, human error is still a significant factor.
This approach may sound appealing as there is no additional cost for redundant infrastructure, but the TCO (total cost of ownership) of the solution must take into account people, training and errors.
Infrastructure as code
This is the first automated approach. We remove human error from the equation, the RTO becomes much more predictable, and you can test the process regularly and measure reaction times.
You need to make sure your development teams (or your DevOps engineers, or sysadmins) understand and build scripted environments. This is good practice in any case, but not always the reality in the IT world: most legacy applications are still deployed manually on bespoke infrastructure.
We think of "the cloud" as virtually unlimited scalability, but, in reality, the cloud is just yet another datacentre with its own physical and virtual limitations. There is the chance that, when you have an outage and try to redeploy your entire infrastructure in a different zone/region, you can hit capacity limits. This could be a disaster from which you cannot recover if you are not prepared.
A solution in the Azure world would be to purchase capacity reservations: you pay for the guarantee that you will have that capacity when you need to fail over. This option costs significantly more than just a repository with your scripts, but it can save the day during an emergency.
What you will save is the cost of managing the inactive resources: no OS patching, no application upgrades, no security alerts and so on.
The whole deployment process should ideally be automated with CD pipelines to create environments and deploy application workloads with minimal to no configuration effort in case of a disaster.
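As a rough illustration (the subscription, resource group, region and template file are hypothetical), a fail-over pipeline step could redeploy the same ARM template to a secondary region with the Azure SDK for Python:

```python
import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Recreate the resource group in the secondary region...
client.resource_groups.create_or_update("dr-rg", {"location": "northeurope"})

# ...and deploy the same template the primary environment was built from.
with open("environment.json") as template_file:
    template = json.load(template_file)

client.deployments.begin_create_or_update(
    "dr-rg",
    "dr-failover",
    {"properties": {"mode": "Incremental", "template": template, "parameters": {}}},
).result()  # wait for the deployment to complete
```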
Cold environment
Now definitions start to become more woolly and more technology specific.
A cold environment is an environment that is already deployed, but stopped.
Cost savings may vary depending on the resources.
In Azure this could mean: VM in a deallocated state, Azure Firewall and Application Gateway stopped et cetera.
When a VM is deallocated, you don't pay for its compute, you just pay for the storage disks which preserve its state (a tiny fraction of the cost). This means that you can start it up in half the time and you don't have to install software on it afterwards, as you have everything ready on your disks. The same goes for other resources and for auto-scaling cloud-native services (Azure SQL serverless, Cosmos DB et cetera).
You can still have capacity problems and you can solve them in the same way as with the previous approach (capacity reservations). The main difference is that this approach is faster, as you only have to start your environment rather than provision and prepare it, which can take 50% or more of the recovery time.
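A minimal sketch of what "starting the cold environment" could look like, assuming a hypothetical resource group and VM names and the azure-mgmt-compute package:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Start the deallocated VMs of the cold environment; the disks already hold
# the OS and the applications, so no provisioning or installation is needed.
pollers = [
    compute.virtual_machines.begin_start("cold-dr-rg", vm_name)
    for vm_name in ("web-01", "api-01", "db-01")
]
for poller in pollers:
    poller.result()  # block until each VM is running
```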
This approach seems better than the IaC/CD one, but has a significant downside:
You still have to keep your environment up to date!
Your applications and the OSs are already deployed onto the storage disks or hosting platforms; you need a frequent (possibly automated) process for updating your inactive environment too, in case it is needed.
Imagine you install Windows Server 2016 with the 2017/11 security patches and on 2017/12/14 a dangerous new security bug surfaces; you do not upgrade the system as it is in hibernation, then you need to fail over in 2018/01 and you end up with a vulnerable environment that could be hacked during the fail-over period.
Warm standby environment
This strategy bears similar or equal costs to a fully replicated production environment.
You have a second production environment, you keep it up to date with applications, configuration and OS patches, and you treat it as if it were production; the difference?
This environment does not process any data, does not actively run any task or interact with any user. It is just there, burning money.
Why would you do that? To have an almost-instantaneous fail-over. If the primary system has an outage, you can automate a DNS or load balancer change to immediately direct the traffic to the secondary environment. In Azure, all the load balancers (DNS/Layer 7/Layer 4) include health probes to automatically fail over to a secondary environment if the primary does not provide the expected response.
To save on costs, you could have a "smaller scale" warm standby environment (fewer CPUs, less RAM, cheaper SKUs for services et cetera); you will need your users to tolerate a slower experience in the "rare" event of outages.
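To give an idea of what a load balancer probe actually checks, here is a deliberately tiny sketch of a health endpoint (the path and port are arbitrary); in a real system it would also verify dependencies such as the database:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Return 200 while the environment is healthy; when this stops
            # responding, the load balancer probe fails over to the standby.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```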
Active/active
Finally, we come to the "state of the art" of BCDR strategies.
This is the same as the warm standby, but with a small difference that has a massive impact: both environments are active production environments.
This usually involves a load balancer at the ingress of the system that distributes the load to both environments, usually in a round-robin fashion, but it can also be more sophisticated and distribute load based on geographic latency and, in case of an outage in one region, direct all the traffic to the only working environment.
The massive difference between this and the previous option is that you can now make the most of both environments. Theoretically you could reduce the capacity (and cost) of each environment to 50% and send half of the users to one and half to the other, keeping response times consistent. Most cloud-native services also offer no-downtime scaling, so you could scale up one environment in case of failure of the other.
Why do people still use warm standby (active/passive) when they could go active/active? In reality, applications (in particular legacy ones) are often stateful: you cannot just handle one request on one server and the next on a different server; that can break existing applications, or not-so-well-designed new ones. For this reason warm standby is quite popular, and it still requires careful planning, as switching to the fail-over environment could be complicated and could corrupt the state; it often requires connection draining or remediation if the application does not support handling traffic on multiple hosts.
Active/active is truly what the goal should be when migrating legacy systems and the design for all new systems.
Active/active strategy in practice
Active/active is "easy" in theory; in practice, as discussed, it can be hard for stateful applications and single-tenanted solutions (where each user/group needs dedicated infrastructure).
Start small: these ideas can apply to a whole solution or individually to parts of it.
You could apply this strategy to part of the system or to new/rewritten components.
E.G.
You need to add a new background service to your application.
Instead of running this on the same infrastructure, consider building a separate stateless microservice,
add a load balancer to distribute the tasks,
think about concurrency when storing the results to a data store,
use asynchronous message queues to send requests,
create two queues, one per region, and add retries with fail-over to your application and distributed reads on the microservice,
avoid strict ordering requirements et cetera.
This is just a simple example and we could go on forever with best practices to build a resilient, highly available solution.
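As a hedged sketch of the two-queues-with-fail-over idea (the namespaces, connection strings and queue name are made up), the application could try the primary region's queue first and fall back to the secondary one, using the azure-servicebus package:

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Placeholder connection strings for one Service Bus queue in each region.
PRIMARY_CONN = "Endpoint=sb://primary-weu.servicebus.windows.net/;..."
SECONDARY_CONN = "Endpoint=sb://secondary-neu.servicebus.windows.net/;..."
QUEUE_NAME = "work-items"

def send_with_failover(payload: str) -> None:
    """Try the primary region's queue first, fall back to the secondary."""
    for conn in (PRIMARY_CONN, SECONDARY_CONN):
        try:
            with ServiceBusClient.from_connection_string(conn) as client:
                with client.get_queue_sender(QUEUE_NAME) as sender:
                    sender.send_messages(ServiceBusMessage(payload))
                return
        except Exception:
            continue  # region unavailable: try the next one
    raise RuntimeError("Both regions rejected the message")

send_with_failover('{"task": "resize-image", "id": 42}')
```

The microservice would then read from both queues, which is also why strict ordering requirements are best avoided.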
Education
Education is key to innovation; it is a culture that needs to be encouraged by leadership and built from the ground up, in every single line of code.
With strong guidance and a good enterprise skilling programme, you can educate developers to build resilience into each piece of code, and that enables a good, modern architecture for the whole system.
Stateful applications cannot be made stateless by infrastructure and architecture alone; that needs to start at the code level.
How can you drive this? Hire talent with a growth mindset, nurture them by providing opportunities to learn in person, virtually and on demand, and promote cloud adoption.
Microsoft Learn, LinkedIn Learning, Pluralsight and so on; there are plenty of platforms with excellent material on stateless, cloud-native, modern architectures.
RTO/RPO image: "Graphic representation of RPO and RTO in case of an incident" by Own work is licensed under CC BY-SA 4.0.