Phase 3 of 5  ·  Quality & Trust
Week 13 / 20   ·   Ch 14

Resilience Engineering

"When AWS goes down, your app shouldn't"

📚 Ch 14: Resilience  ·  🔄 Recognize-Resist-Recover-Adapt  ·  ☁️ Cloud Resilience Patterns  ·  ⏱ ~20 min read

🔍Concept Deep Dives


🌊

The Resilience Cycle

Recognize (detect threat) → Resist (defend) → Recover (restore) → Adapt (prevent recurrence).

When to Use

Design phase and incident response. Every resilient system goes through this cycle.

Real-World Example

AWS us-east-1 outage (December 2021): teams that already had multi-region failover in place (Recognize→Resist→Recover) could keep serving users from another region.

✓ Advantages

  • Systematic framework
  • Covers before + during + after failure
  • Drives architecture decisions

⚠ Watch Out

  • Expensive to implement fully
  • Requires runbooks and practice
Recognize: monitoring, anomaly detection
    ↓
Resist: redundancy, rate limiting, DDoS protection
    ↓
Recover: failover, restore from backup, rollback
    ↓
Adapt: postmortem, fix root cause, update runbooks
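The Recover step can be sketched as a client that tries regions in order. This is a minimal illustration, not a production failover design: `call_with_failover`, `fake_fetch`, and the choice of `us-west-2` as the secondary region are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence


class AllRegionsFailed(Exception):
    """Raised when no region could serve the request."""


def call_with_failover(regions: Sequence[str], fetch: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return fetch(region)      # Recover: the first healthy region wins
        except Exception as exc:      # Recognize: the failure is detected here
            errors[region] = exc      # keep the evidence for the postmortem (Adapt)
    raise AllRegionsFailed(errors)


# Hypothetical backend: the primary region is down, the secondary is healthy.
def fake_fetch(region: str) -> str:
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return f"served from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fake_fetch)
print(result)  # served from us-west-2
```

In a real system the failover decision usually sits in DNS or a load balancer rather than in each client, but the order of events is the same.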
🧩

Resilient System Design

Design patterns: redundancy, graceful degradation, bulkheads, circuit breakers, timeouts.

When to Use

Any system with uptime requirements — especially microservices and distributed systems.

Real-World Example

Netflix circuit breaker: if the Recommendations service is slow, show cached or empty recommendations instead of letting the whole page time out.
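The fallback idea in this example can be sketched with a timeout and a cache. This is an illustrative sketch only, not how Netflix implements it: `slow_recommendations`, `recommendations_or_fallback`, the `_cache` contents, and the 0.1 s timeout are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)
# Hypothetical last-known-good data, keyed by user id.
_cache = {"user-1": ["Cached pick A", "Cached pick B"]}


def slow_recommendations(user_id):
    """Stand-in for a slow downstream service (hypothetical)."""
    time.sleep(0.5)
    return ["Fresh pick"]


def recommendations_or_fallback(user_id, fetch, timeout_s=0.1):
    """Return fresh data if the call finishes in time, else cached or empty data."""
    future = _executor.submit(fetch, user_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Too slow: degrade gracefully. The worker thread is not cancelled;
        # it finishes in the background and its result is discarded.
        return _cache.get(user_id, [])
    except Exception:
        # The call failed outright: same fallback.
        return _cache.get(user_id, [])


print(recommendations_or_fallback("user-1", slow_recommendations))
# ['Cached pick A', 'Cached pick B']
```

The page still renders with slightly stale data, which is the point of graceful degradation: the caller decides how long it is willing to wait.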

✓ Advantages

  • Prevents cascade failures
  • Graceful degradation beats total failure
  • Self-healing systems

⚠ Watch Out

  • Complex to implement and test
  • Harder to debug
Circuit Breaker Pattern:
Closed (normal) → too many failures → Open (reject fast)
       ↑                                      │
       └─────────── Half-Open (test) ←────────┘
Bulkhead: isolate failures to one pool
Timeout: never wait forever
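The three circuit breaker states can be sketched as a small class. This is a minimal, single-threaded illustration under assumptions: the class name, the consecutive-failure threshold, and the injectable `clock` parameter are choices made for the example, not a specific library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long to stay open
        self.clock = clock                          # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast; downstream is unhealthy")
            self.state = "half-open"  # timeout elapsed: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each remote call as `breaker.call(fetch, url)` and treat `CircuitOpenError` as a signal to use a fallback. A version shared across threads would also need a lock around the state changes.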
🌍

Sociotechnical Resilience

Resilience is not just technical — people, processes, and organizations must also adapt to adverse conditions.

When to Use

Real incident response — technical resilience fails without trained teams and clear processes.

Real-World Example

Boeing 737 MAX: the failure was not purely technical. The flaw in the MCAS flight-control software was compounded by organizational factors (pilot training, communication, safety culture).

✓ Advantages

  • Addresses real root causes
  • Improves incident response culture
  • Builds institutional knowledge

⚠ Watch Out

  • Harder to 'fix' than technical issues
  • Requires cultural change
Technical Resilience: → Redundancy, failover, circuit breakers
Organizational Resilience: → Incident training, runbooks, postmortems
Cultural Resilience: → Psychological safety, blame-free culture

📋Quick Reference

Ch 14 Cheat Sheet: Resilience Engineering
Resilience Cycle
Recognize → Resist → Recover → Adapt. Design for all 4 phases.
Circuit Breaker
Stop calling a failing service. Fail fast. Give it time to recover. Resume when healthy.
Bulkhead
Isolate components so failure in one doesn't cascade. Named after ship compartments.
Graceful Degradation
Partial functionality > total failure. Show cached data, disable non-essential features.
Timeout
Always set timeouts on external calls. Never wait forever.
Retry with Backoff
Retry failed calls with exponential backoff + jitter to avoid thundering herd.
Postmortem
Blameless analysis of what went wrong. Focus on systems, not people.
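The "Retry with Backoff" entry can be sketched as follows. This is one common variant ("full jitter", where each delay is drawn uniformly between zero and an exponentially growing cap); the function names and default values are assumptions for illustration.

```python
import random
import time


def backoff_delays(retries, base_s=0.1, cap_s=5.0, rng=random.random):
    """Return one delay per retry, drawn uniformly from [0, min(cap_s, base_s * 2**n))."""
    return [rng() * min(cap_s, base_s * (2 ** n)) for n in range(retries)]


def retry(func, attempts=4, base_s=0.1, cap_s=5.0, sleep=time.sleep):
    """Call func up to `attempts` times, sleeping with backoff + jitter between tries."""
    delays = backoff_delays(attempts - 1, base_s, cap_s)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise              # out of retries: surface the last error
            sleep(delays[attempt])
```

The jitter matters because clients that fail at the same moment would otherwise all retry at the same moments too, which recreates the load spike that caused the failure. In practice, retry only operations that are safe to repeat.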
Sommerville's Key Points — Ch 14
Author's own summary from the end of the chapter.
  1. Resilience: ability to maintain essential services despite adverse conditions.
  2. 4-stage resilience cycle: recognize, resist, recover, adapt.
  3. Resilient design patterns: redundancy, circuit breakers, bulkheads, graceful degradation.
  4. Sociotechnical resilience: both technical AND organizational/human factors matter.
  5. Chaos engineering validates resilience; passive hope is not enough.
  6. Blameless postmortems drive adaptation: learn from failures to prevent recurrence.

🧠Quiz — Test Yourself

Think through your answer first, then check it against the answer below each question.

Q1
Recall
What is a circuit breaker pattern and when should it open?
A circuit breaker wraps a remote call and monitors failures. Closed state: normal operation. If failures exceed a threshold (e.g., 50% in 60 seconds), the circuit opens and subsequent calls fail immediately without waiting. After a timeout, it goes half-open to test whether the service has recovered: a successful trial call closes the circuit, a failed one reopens it. Open the circuit when the downstream service is slow or failing and you don't want the failure to cascade.
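A threshold such as "50% in 60 seconds" can be evaluated over a sliding window of recent calls. The sketch below is illustrative: the class name and the `min_calls` guard (which stops a single early failure from counting as a 100% failure rate) are assumptions, not part of the answer above.

```python
from collections import deque


class FailureRateWindow:
    def __init__(self, window_s=60.0, threshold=0.5, min_calls=4):
        self.window_s = window_s    # how far back to look, in seconds
        self.threshold = threshold  # open when the failure rate exceeds this
        self.min_calls = min_calls  # hypothetical guard against tiny samples
        self.events = deque()       # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        self.events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop calls that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def should_open(self, now):
        self._evict(now)
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total > self.threshold
```

Timestamps are passed in explicitly so the logic can be exercised without waiting for real time to pass.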
Q2
Apply
What is the difference between resilience and reliability?
Reliability: probability of correct operation under normal conditions. Resilience: ability to maintain service under ADVERSE conditions (attacks, failures, unexpected events). A reliable system might still fail completely under a DDoS attack, whereas a resilient system degrades gracefully.
Q3
Analyze
Why should postmortems be blameless?
Because blame-focused cultures hide incidents and discourage transparency. When people fear punishment, they don't report near-misses, don't share root causes honestly, and don't document failure modes. Blameless postmortems focus on SYSTEM failures, not people. This builds psychological safety and leads to actual improvement.
Up Next → Week 14
Software Reuse
How npm changed software engineering — and the risks nobody told you about