Appendix: Reliability (Failure Management) - AWS Well-Architected Framework Study Guide
Alec Dutcher
Posted on March 7, 2022
How do you back up data?
- Identify and back up all data that needs to be backed up, or reproduce the data from sources
- Secure and encrypt backups
- Perform data backup automatically
- Perform periodic recovery of the data to verify backup integrity and processes
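
The last point, verifying backups by actually restoring them, can be illustrated with a small sketch. It assumes a file-based backup and uses plain file copies as stand-ins for a real backup and restore job; the names `sha256_of`, `verify_restore`, and `demo` are hypothetical and not part of any AWS API.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, restored: Path) -> bool:
    """A restore passes only if the restored bytes match the original."""
    return sha256_of(original) == sha256_of(restored)


def demo() -> Tuple[bool, bool]:
    """Run one good and one corrupted restore; return both check results."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "data.db"
        source.write_bytes(b"customer records")

        backup = Path(tmp) / "data.db.bak"
        shutil.copyfile(source, backup)      # stand-in for a real backup job

        restored = Path(tmp) / "restored.db"
        shutil.copyfile(backup, restored)    # stand-in for a real restore
        good = verify_restore(source, restored)

        restored.write_bytes(b"corrupted")   # simulate a damaged restore
        bad = verify_restore(source, restored)
        return good, bad
```

In practice a check like this would run on a schedule against a restored copy in an isolated environment, so a failed restore is discovered before the backup is actually needed.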
How do you use fault isolation to protect your workload?
- Deploy the workload to multiple locations
- Automate recovery for components constrained to a single location
- Use bulkhead architectures to limit scope of impact
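
As a rough in-process illustration of the bulkhead idea, the sketch below caps how many concurrent calls may enter one dependency, so a slow dependency cannot consume all shared capacity. The `Bulkhead` class and `try_run` method are hypothetical names; at the infrastructure level the same idea is usually applied by partitioning the workload into isolated cells rather than with a semaphore.

```python
import threading


class Bulkhead:
    """Caps concurrent calls into one dependency (one 'compartment')."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, func, *args):
        """Return (accepted, result); reject at once when the compartment is full."""
        # Non-blocking acquire: callers are turned away instead of queueing,
        # which keeps a saturated dependency from tying up the caller too.
        if not self._slots.acquire(blocking=False):
            return False, None
        try:
            return True, func(*args)
        finally:
            self._slots.release()
```

A caller that receives `(False, None)` can fall back to a cached response or return an error quickly, which limits the scope of impact to the one saturated compartment.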
How do you design your workload to withstand component failures?
- Monitor all components of the workload to detect failures
- Fail over to healthy resources
- Automate healing on all layers
- Use static stability to prevent bimodal behavior
- Send notifications when events impact availability
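
Three of these practices (monitoring, failing over to healthy resources, and sending notifications) can be combined in a minimal sketch. It assumes an ordered list of endpoints, a caller-supplied health check, and a caller-supplied notification callback; `route`, `is_healthy`, and `notify` are hypothetical names, and a managed service such as a load balancer or DNS health check would normally do this work.

```python
from typing import Callable, List, Optional


def route(
    endpoints: List[str],
    is_healthy: Callable[[str], bool],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Return the first healthy endpoint, notifying when availability is affected."""
    for index, endpoint in enumerate(endpoints):
        if is_healthy(endpoint):
            if index > 0:
                # The preferred endpoint was skipped, so tell someone.
                notify(f"failed over to {endpoint}")
            return endpoint
    notify("no healthy endpoint available")
    return None
```

For example, with `endpoints=["primary", "standby"]` and a health check that reports only `standby` as healthy, the function returns `"standby"` and sends one failover notification.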
How do you test reliability?
- Use playbooks to investigate failures
- Perform post-incident analysis
- Test functional requirements
- Test scaling and performance requirements
- Test resiliency using chaos engineering
- Conduct game days regularly
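
A much-reduced, in-process stand-in for a chaos experiment is to inject faults into a dependency and check that the calling code still meets its expectations. The sketch below assumes a dependency that fails a fixed number of times and a simple retry wrapper; `FlakyDependency` and `call_with_retry` are hypothetical names, and real chaos engineering injects faults into running infrastructure rather than into a test double.

```python
class FlakyDependency:
    """Test double that fails its first `failures` calls, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.remaining = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(operation, attempts: int):
    """Call `operation` up to `attempts` times; re-raise the last failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
    raise last_error
```

The experiment states a hypothesis ("the caller tolerates two consecutive failures"), injects exactly that fault, and then checks the outcome. A production retry loop would also add backoff and jitter, which are left out here to keep the sketch short.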
How do you plan for disaster recovery (DR)?
- Define recovery objectives for downtime and data loss
- Use defined recovery strategies to meet the recovery objectives
- Test the disaster recovery implementation to validate it
- Manage configuration drift at the DR site or region
- Automate recovery
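
Recovery objectives are usually expressed as a recovery time objective (RTO, the maximum tolerable downtime) and a recovery point objective (RPO, the maximum tolerable data-loss window). The sketch below shows how the timestamps from a DR test could be checked against those objectives; `RecoveryObjectives` and `evaluate_drill` are hypothetical names, and the timestamps in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data-loss window


def evaluate_drill(
    objectives: RecoveryObjectives,
    last_backup: datetime,
    failure: datetime,
    restored: datetime,
) -> Dict[str, bool]:
    """Compare one drill's measured data loss and downtime with the objectives."""
    data_loss = failure - last_backup   # work done since the last recovery point
    downtime = restored - failure       # time until service was back
    return {
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }
```

With an RTO of 1 hour and an RPO of 15 minutes, a drill where the last backup was taken 10 minutes before the failure and service returned 90 minutes after it would meet the RPO but miss the RTO, which points to the recovery strategy (not the backup frequency) as the thing to improve.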