System design: Designing for High Availability

Designing for high availability is critical when building systems that must keep functioning with minimal downtime, even in the face of failures. High availability (HA) aims to reduce the risk of outages and keep services available to users as much of the time as possible. Let’s explore the key concepts: redundancy, failover, active-active vs. active-passive architectures, fault tolerance, and disaster recovery, with practical insights and examples.

Redundancy and Failover Strategies

Redundancy

At its core, redundancy involves duplicating components in a system to ensure continuous operation if one component fails. In a highly available system, redundancy can be applied at multiple levels: servers, databases, networks, storage, etc.

Types of Redundancy

Hardware Redundancy: If a server’s hardware fails, redundant servers (physical or virtual) can take over.
- Example: You might have two servers in a data center where one is a mirror of the other. If one goes down, the second takes over, ideally with little or no visible service disruption.
Data Redundancy: Data is replicated across multiple databases or storage solutions to ensure availability in case of data corruption or failure in one instance.
- Example: Replicating a database across multiple regions ensures that data can still be accessed from another region if one goes down.
Network Redundancy: Duplicate network paths and devices (routers, switches) ensure that even if part of the network fails, traffic can still be routed through alternative paths.
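
To get a feel for how much redundancy buys, a rough calculation can help. The sketch below assumes replicas fail independently of each other, which is an assumption and often an optimistic one, since shared power, networks, or software bugs can take several replicas down together. The function names are illustrative.

```go
package main

import (
    "fmt"
    "math"
)

// combinedAvailability returns the availability of n redundant replicas,
// each with availability a, when the service stays up as long as at least
// one replica is up. It assumes failures are independent, which real
// systems only approximate.
func combinedAvailability(a float64, n int) float64 {
    return 1 - math.Pow(1-a, float64(n))
}

// downtimePerYear converts an availability fraction into expected hours
// of downtime over a 365-day year.
func downtimePerYear(a float64) float64 {
    return (1 - a) * 365 * 24
}

func main() {
    for n := 1; n <= 3; n++ {
        a := combinedAvailability(0.99, n)
        fmt.Printf("replicas=%d availability=%.6f downtime=%.2f h/year\n",
            n, a, downtimePerYear(a))
    }
}
```

Under that independence assumption, two replicas that are each 99% available give roughly 99.99% combined availability, which is why even a single mirror can make a large difference.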

Failover Strategies

Failover is the process where a backup component automatically takes over when the primary component fails. Depending on the system design, failover strategies can be implemented with minimal downtime or even without any noticeable downtime.

Cold Failover: Backup components are activated only when the primary component fails. Cold failovers are slower because the backup has to be booted up and initialized.
- Example: A backup database server that is offline until the primary server fails. Once the primary server is detected as offline, the backup server starts up.
Hot Failover: The backup component is fully operational and takes over immediately when the primary component fails. This results in minimal downtime.
- Example: Having two web servers running simultaneously behind a load balancer, where one takes over seamlessly if the other fails.
Warm Failover: The backup system is running in a reduced capacity and is brought to full operation when the primary fails. It strikes a balance between cold and hot failover in terms of cost and recovery time.
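
Whichever variant is used, failover starts with detecting that the primary is down. One common approach is to count consecutive failed health checks, so that a single dropped check does not trigger a switch. The following is a minimal sketch of that idea; `Monitor`, `Observe`, and the `promote` hook are hypothetical names, not part of any library.

```go
package main

import "fmt"

// Monitor triggers failover after maxMisses consecutive failed health
// checks of the primary. All names here are illustrative.
type Monitor struct {
    maxMisses  int
    misses     int
    failedOver bool
    promote    func() // hypothetical hook that activates the backup
}

// Observe records the result of one health check. It calls promote the
// first time the miss threshold is reached and reports whether failover
// has happened.
func (m *Monitor) Observe(healthy bool) bool {
    if m.failedOver {
        return true
    }
    if healthy {
        m.misses = 0 // any success resets the streak
        return false
    }
    m.misses++
    if m.misses >= m.maxMisses {
        m.failedOver = true
        m.promote()
    }
    return m.failedOver
}

func main() {
    m := &Monitor{maxMisses: 3, promote: func() {
        fmt.Println("primary unresponsive, promoting backup")
    }}

    // One transient failure followed by a success does not fail over.
    checks := []bool{true, false, true, false, false, false}
    for i, ok := range checks {
        fmt.Printf("check %d healthy=%v failedOver=%v\n", i, ok, m.Observe(ok))
    }
}
```

What the `promote` hook has to do is what separates the three strategies: booting and initializing a server (cold), scaling up a reduced instance (warm), or simply redirecting traffic to an instance that is already serving (hot).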

Practical Example of Failover in Go

In Go, you could implement a failover strategy by using multiple database connections. If the primary database fails, the backup database takes over.

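package main

import (
    "database/sql"
    "fmt"

    _ "github.com/lib/pq" // PostgreSQL driver
)

var primaryDB, backupDB *sql.DB

// connectPrimary opens the primary database and falls back to the backup
// if the primary cannot be reached.
func connectPrimary() {
    var err error
    primaryDB, err = sql.Open("postgres", "user=primary dbname=mydb sslmode=disable")
    if err == nil {
        // sql.Open only validates its arguments; Ping actually contacts the server.
        err = primaryDB.Ping()
    }
    if err != nil {
        fmt.Println("Primary DB connection failed, attempting to connect to backup:", err)
        connectBackup()
    }
}

// connectBackup opens the backup database and reports whether it is reachable.
func connectBackup() {
    var err error
    backupDB, err = sql.Open("postgres", "user=backup dbname=mydb sslmode=disable")
    if err == nil {
        err = backupDB.Ping()
    }
    if err != nil {
        fmt.Println("Backup DB connection failed:", err)
    } else {
        fmt.Println("Connected to backup database!")
    }
}

func main() {
    connectPrimary()
}

In this example:

The connectPrimary function opens the primary database and verifies the connection with Ping, because sql.Open on its own does not contact the server. If that check fails, it tries the backup database using the connectBackup function. Note that this only handles a failure at startup; failures that occur later, while queries are running, need a health check or retry around each call.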

Active-Active vs Active-Passive Architectures

Active-Active Architecture

In an active-active setup, all components (e.g., servers, databases) are active and share the workload. This architecture provides not only redundancy but also load balancing, as traffic is distributed across multiple instances.

Advantages:

Higher performance as multiple systems handle traffic simultaneously.
More fault-tolerant because if one system fails, the others continue handling the load.

Example:
A website running in an active-active configuration might have two web servers behind a load balancer. Both servers are processing incoming requests, and if one fails, the load balancer redirects traffic to the other.
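
The load balancer behavior in this example can be sketched as a round-robin picker that skips unhealthy servers. This is a simplified illustration, not a production balancer: the `Backend` and `Balancer` types are hypothetical, health is set by hand rather than by a real health checker, and the code is not safe for concurrent use.

```go
package main

import (
    "errors"
    "fmt"
)

// Backend is one active server. Healthy would normally be maintained by
// a health checker; here it is set by hand for illustration.
type Backend struct {
    Name    string
    Healthy bool
}

// Balancer spreads requests across all healthy backends in round-robin
// order. It is not safe for concurrent use.
type Balancer struct {
    backends []*Backend
    next     int
}

var ErrNoBackend = errors.New("no healthy backend available")

// Pick returns the next healthy backend, skipping failed ones. It looks
// at each backend at most once per call.
func (b *Balancer) Pick() (*Backend, error) {
    for i := 0; i < len(b.backends); i++ {
        c := b.backends[b.next]
        b.next = (b.next + 1) % len(b.backends)
        if c.Healthy {
            return c, nil
        }
    }
    return nil, ErrNoBackend
}

func main() {
    web1 := &Backend{Name: "web-1", Healthy: true}
    web2 := &Backend{Name: "web-2", Healthy: true}
    lb := &Balancer{backends: []*Backend{web1, web2}}

    route := func() {
        be, err := lb.Pick()
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        fmt.Println("request routed to", be.Name)
    }

    route() // both servers share the load
    route()
    web1.Healthy = false // simulate web-1 failing
    route()              // all traffic now goes to web-2
    route()
}
```

One design consequence worth noting: after a failure the surviving servers carry the whole load, so each one needs enough spare capacity to absorb its share.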

Active-Passive Architecture

In an active-passive setup, only the primary (active) system is actively handling requests, while the backup (passive) system is on standby. The passive system is activated only in the case of a failure in the active system.

Advantages:

Easier to maintain and less complex than active-active.
The backup system does no regular work and can often run at reduced capacity, which can lower operating cost, at the price of a slower switchover.

Example:
An e-commerce site might have a primary database running queries. If this database goes down, a backup replica of the database (which was passive) will take over. The switchover usually involves some downtime, though it can be kept brief.
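
The switchover described here can be sketched as a pair of nodes where only the active one receives requests and the roles are swapped on failure. `Node` and `Pair` are hypothetical stand-ins for real database clients, and promoting on the first error is a simplification: a real system would usually confirm the failure first, so that two nodes never act as primary at the same time.

```go
package main

import (
    "errors"
    "fmt"
)

// Node is a stand-in for a database server; Query is a hypothetical call.
type Node struct {
    Name string
    Up   bool
}

func (n *Node) Query(q string) (string, error) {
    if !n.Up {
        return "", errors.New(n.Name + " is down")
    }
    return n.Name + " handled: " + q, nil
}

// Pair routes every request to the active node. The passive node receives
// traffic only after it has been promoted.
type Pair struct {
    active, passive *Node
}

// Query sends the request to the active node. If that fails, it swaps the
// roles of the two nodes and retries once on the newly active node.
func (p *Pair) Query(q string) (string, error) {
    res, err := p.active.Query(q)
    if err == nil {
        return res, nil
    }
    p.active, p.passive = p.passive, p.active // promote the standby
    return p.active.Query(q)
}

func main() {
    primary := &Node{Name: "db-primary", Up: true}
    replica := &Node{Name: "db-replica", Up: true}
    pair := &Pair{active: primary, passive: replica}

    res, err := pair.Query("SELECT 1")
    fmt.Println(res, err)

    primary.Up = false // simulate the primary going down
    res, err = pair.Query("SELECT 1")
    fmt.Println(res, err)
}
```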

Designing for Fault Tolerance

Fault tolerance ensures that a system can continue to operate correctly even when parts of it fail. This involves identifying single points of failure and designing them out of the system.

Techniques for Fault Tolerance

Redundancy and Failover (as described earlier)
Graceful Degradation: When parts of the system fail, the rest of the system continues to work with reduced functionality.
- Example: If a microservices-based application experiences failure in the recommendation service, the system might still be able to serve the homepage without recommendations.
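Retry Mechanism: When an operation fails, the system retries it a few times, ideally with a growing delay between attempts, before giving up. Retries are only safe for operations that can be repeated without side effects (idempotent operations).
- Example: A payment system retries a failed transaction several times before notifying the user of failure, using a unique transaction ID so that a retry cannot charge the customer twice.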
Circuit Breaker Pattern: Protects your system from failing components by “opening” the circuit when repeated failures occur. During the open state, requests are immediately rejected instead of being forwarded to the failing component.
- Example: In an API gateway, if a downstream service consistently fails, the circuit breaker stops sending requests to it and, after a cooldown, lets a trial request through to check whether it has recovered.
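
The last three techniques are often combined. The sketch below shows one possible shape: a `retry` helper with exponential backoff, a small `Breaker` that opens after consecutive failures, and a caller that degrades gracefully when the call fails. All names, thresholds, and the failing `fetchRecs` function are assumptions for illustration, and the breaker is not safe for concurrent use.

```go
package main

import (
    "errors"
    "fmt"
    "time"
)

var (
    ErrOpen = errors.New("circuit breaker is open")
    errDown = errors.New("recommendation service down") // simulated failure
)

// retry calls op up to attempts times (attempts must be at least 1),
// doubling the wait between tries. Only use it for idempotent operations.
func retry(attempts int, wait time.Duration, op func() error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = op(); err == nil {
            return nil
        }
        if i < attempts-1 {
            time.Sleep(wait)
            wait *= 2
        }
    }
    return err
}

// Breaker opens after maxFailures consecutive failures and rejects calls
// until cooldown has passed, after which it lets a trial call through.
type Breaker struct {
    maxFailures int
    cooldown    time.Duration
    failures    int
    openedAt    time.Time
}

func (b *Breaker) Call(op func() error) error {
    if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
        return ErrOpen // fail fast without touching the failing component
    }
    if err := op(); err != nil {
        b.failures++
        if b.failures >= b.maxFailures {
            b.openedAt = time.Now() // open, or re-open after a failed trial
        }
        return err
    }
    b.failures = 0 // a success closes the circuit
    return nil
}

func main() {
    b := &Breaker{maxFailures: 2, cooldown: time.Minute}
    fetchRecs := func() error { return errDown }

    for i := 0; i < 4; i++ {
        err := b.Call(func() error {
            return retry(2, 10*time.Millisecond, fetchRecs)
        })
        if err != nil {
            // Graceful degradation: render the page without recommendations.
            fmt.Println("serving homepage without recommendations:", err)
        }
    }
}
```

In this run the first two calls reach the failing service (each retried once), and the remaining calls are rejected immediately by the open breaker, so the homepage stays fast even while the recommendation service is down.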

Disaster Recovery Planning

Disaster recovery (DR) ensures that your system can recover from catastrophic failures like data center outages, cyber-attacks, or natural disasters. DR typically involves:

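Backups: Regular backups of data limit how much critical information can be lost in a failure. Anything written after the most recent backup may still be lost, so the backup interval sets the worst-case data loss.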
Geographical Redundancy: Deploying systems in multiple geographic locations to ensure availability in case of a regional outage.

Disaster Recovery Techniques

Backup and Restore
- You can back up data to a remote storage solution like Amazon S3, and if your primary database fails, you can restore data from the backup.
- Example: You’re running a MySQL database and have daily backups to an external storage system. If the primary database fails due to corruption, you can restore the latest backup.
Pilot Light
- The core part of the system is always running in a minimal fashion. In case of a disaster, you can scale up the necessary services to restore full functionality.
- Example: A minimal read-only version of your website might always be online in a secondary region. If the primary region fails, you scale it up to handle both reads and writes.
Warm Standby
- A scaled-down version of the entire system runs in standby mode. It can be scaled up when the primary site goes down.
- Example: A backup data center runs with enough infrastructure to handle a small amount of traffic. When the primary data center fails, the backup infrastructure is scaled up.
Multi-Site (Hot Standby)
- A full-fledged duplicate of your system runs in a different geographic location. Traffic is routed to both systems, but one location is designated as the primary.
- Example: An e-commerce platform that operates in both the US and Europe. If the US region goes down, traffic can be fully routed to the European region.
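
For the Backup and Restore approach, one practical detail is choosing which backup to restore: it has to predate the corruption, and everything written after it is lost. The following sketch illustrates that selection with made-up sample data; the `Backup` type, the `latestBefore` function, and the dates are all hypothetical.

```go
package main

import (
    "errors"
    "fmt"
    "time"
)

// Backup describes one stored snapshot. The fields are illustrative.
type Backup struct {
    ID      string
    TakenAt time.Time
}

var ErrNoBackup = errors.New("no backup predates the failure")

// latestBefore returns the most recent backup taken strictly before
// cutoff, for example the moment corruption was introduced.
func latestBefore(backups []Backup, cutoff time.Time) (Backup, error) {
    var best Backup
    found := false
    for _, b := range backups {
        if b.TakenAt.Before(cutoff) && (!found || b.TakenAt.After(best.TakenAt)) {
            best, found = b, true
        }
    }
    if !found {
        return Backup{}, ErrNoBackup
    }
    return best, nil
}

// day returns midnight UTC on the given day of January 2024. It exists
// only to build sample data.
func day(d int) time.Time {
    return time.Date(2024, time.January, d, 0, 0, 0, 0, time.UTC)
}

func main() {
    backups := []Backup{
        {ID: "daily-01", TakenAt: day(1)},
        {ID: "daily-02", TakenAt: day(2)},
        {ID: "daily-03", TakenAt: day(3)},
    }
    corruptedAt := day(2).Add(18 * time.Hour)

    b, err := latestBefore(backups, corruptedAt)
    if err != nil {
        fmt.Println("cannot restore:", err)
        return
    }
    // Everything written after the chosen backup is lost unless it can
    // be replayed from another source.
    fmt.Printf("restore %s, losing up to %v of writes\n", b.ID, corruptedAt.Sub(b.TakenAt))
}
```

With daily backups, the sample run restores `daily-02` and loses up to 18 hours of writes, which shows why the other techniques in this list keep a live copy of the data instead of relying on backups alone.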

Practical Example of Disaster Recovery in AWS

In a cloud setup like AWS, you can design disaster recovery using services such as:

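Amazon RDS with Multi-AZ: RDS provides automatic failover between primary and secondary databases in different availability zones. Because those zones are in the same region, this protects against a zone failure rather than a full regional outage.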
Amazon S3 for backups: Regular backups of critical data stored in Amazon S3.
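CloudFront for global distribution: For static assets, you can distribute content globally using CloudFront, which serves cached content from edge locations close to users. If the primary origin becomes unavailable, CloudFront can be configured with origin failover to fetch content from a secondary origin in another region.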

Final Thoughts on High Availability Design

Designing for high availability requires a holistic approach, ensuring that every component in your system can handle failure gracefully. Through redundancy, failover strategies, active-active or active-passive architectures, and proper disaster recovery planning, your system can achieve the resilience required for minimal downtime and a seamless user experience.