Reliability Best Practices - AWS Well-Architected Framework Study Guide
Alec Dutcher
Posted on March 1, 2022
- Foundational requirements are those whose scope extends beyond a single workload or project
- It’s the responsibility of AWS to satisfy the requirement for sufficient networking and compute capacity
- Service quotas (aka service limits) exist to prevent accidentally provisioning more resources than needed and to limit request rates on API operations to protect services from abuse
- Monitor and manage these quotas for all workload environments
- Ask:
  - How do you manage service quotas and constraints?
  - How do you plan your network topology?
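As a rough illustration of monitoring quotas, the sketch below flags any quota whose utilization has crossed a warning threshold. The `QuotaUsage` type, the `quotas_near_limit` helper, the 80% threshold, and the sample numbers are all hypothetical. In a real environment the limits and usage figures would come from your own quota and monitoring data rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    """One quota with its limit and current usage (hypothetical shape)."""
    name: str
    limit: float
    used: float


def quotas_near_limit(quotas, threshold=0.8):
    """Return (name, utilization) pairs for quotas at or above the threshold."""
    flagged = []
    for q in quotas:
        if q.limit <= 0:
            # Skip quotas with no meaningful limit to avoid dividing by zero.
            continue
        utilization = q.used / q.limit
        if utilization >= threshold:
            flagged.append((q.name, round(utilization, 2)))
    return flagged


if __name__ == "__main__":
    # Illustrative numbers only, not real account data.
    sample = [
        QuotaUsage("running-instances", limit=100, used=85),
        QuotaUsage("elastic-ips", limit=10, used=2),
    ]
    print(quotas_near_limit(sample))
```

Running the same check against every workload environment (development, staging, production) is one way to act on the "monitor and manage" point above.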
- SDKs take the complexity out of coding by providing language-specific APIs for AWS services
- Distributed systems rely on communications networks to interconnect components, such as servers or services
- The workload must operate reliably despite data loss or latency in these networks
- Components must operate in a way that does not negatively impact other components
- Ask:
  - How do you design your workload service architecture?
  - How do you design interactions in a distributed system to prevent failures?
  - How do you design interactions in a distributed system to mitigate or withstand failures?
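One common way to withstand transient network failures, which these notes do not name explicitly, is to retry with exponential backoff and jitter. The sketch below is a minimal version under stated assumptions: `call_with_retries`, its parameters, and the default delays are hypothetical, and the set of retryable exceptions would depend on the client library you actually use.

```python
import random
import time


def call_with_retries(
    operation,
    max_attempts=4,
    base_delay=0.1,
    max_delay=2.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
    rng=random.random,
):
    """Call operation(), retrying transient errors with capped backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # Backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)
```

Limiting both the number of attempts and the maximum delay keeps a struggling dependency from being flooded with retries, which supports the earlier point that components must not negatively impact each other.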
- Anticipate and accommodate changes to achieve reliable operation
- Changes include those imposed on your workload (e.g., spikes in demand) and those from within (e.g., feature deployments and security patches)
- Monitor the behavior of a workload and automate the response to KPIs
- Ask:
  - How do you monitor workload resources?
  - How do you design your workload to adapt to changes in demand?
  - How do you implement change?
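To make "automate the response to KPIs" concrete, here is a simplified proportional scaling rule: scale capacity by the ratio of the observed metric to its target, then clamp to configured bounds. This is an illustrative sketch, not the algorithm any AWS scaling service actually uses. The `desired_capacity` function, its parameters, and the bounds are assumptions.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size=1, max_size=10):
    """Suggest a capacity that would bring the metric back toward its target.

    Assumes the metric (e.g. average CPU %) scales inversely with capacity.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Proportional rule: more load per unit than targeted means more units.
    raw = math.ceil(current * metric_value / target_value)
    # Never go below the floor or above the ceiling.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 90% CPU with a 60% target suggests scaling out to 6.
    print(desired_capacity(4, 90.0, 60.0))
```

The clamp matters as much as the ratio: the upper bound caps cost during a demand spike, and the lower bound preserves a baseline of capacity when load drops.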
- Be aware of failures as they occur and take action to avoid impact on availability
- Take advantage of automation to react to monitoring data
- Regularly back up your data and test your backup files
- Test failure response on a regular schedule and ensure that such testing is also triggered after significant workload changes
- Actively track KPIs, as well as the recovery time objective (RTO) and recovery point objective (RPO)
- Ask:
  - How do you back up data?
  - How do you use fault isolation to protect your workload?
  - How do you design your workload to withstand component failures?
  - How do you test reliability?
  - How do you plan for disaster recovery (DR)?
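Tracking RTO and RPO can be as simple as comparing timestamps from a recovery test against the objectives. The sketch below assumes a hypothetical `recovery_report` helper and made-up times. The data-loss window is measured from the last good backup to the failure, and downtime from the failure to restored service.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup, failure, restored, rpo, rto):
    """Compare one recovery test (or incident) against RPO and RTO targets."""
    if not (last_backup <= failure <= restored):
        raise ValueError("expected last_backup <= failure <= restored")
    data_loss_window = failure - last_backup  # data written here is lost
    downtime = restored - failure             # time the workload was unavailable
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    # Illustrative timeline: backup at 12:00, failure at 12:45, restored at 15:00.
    t0 = datetime(2022, 3, 1, 12, 0)
    report = recovery_report(
        last_backup=t0,
        failure=t0 + timedelta(minutes=45),
        restored=t0 + timedelta(hours=3),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=2),
    )
    print(report["rpo_met"], report["rto_met"])  # True False
```

In this example the backup frequency satisfies the one-hour RPO, but the 2 hour 15 minute restore misses the two-hour RTO, which is the kind of gap a scheduled failure test is meant to reveal.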