Phase 3 of 5 · Quality & Trust

Week 13 / 20 · Ch 14

Resilience Engineering

"When AWS goes down, your app shouldn't"

📚 Ch 14: Resilience · 🔄 Recognize-Resist-Recover-Adapt · ☁️ Cloud Resilience Patterns · ⏱ ~20 min read

💡Key Concepts

Resilience is the ability to maintain essential services under adverse conditions — attacks, failures, unexpected events.

🌊

The Resilience Cycle

Recognize (detect threat) → Resist (defend) → Recover (restore) → Adapt (prevent recurrence).

Framework

🧩

Resilient System Design

Design patterns: redundancy, graceful degradation, bulkheads, circuit breakers, timeouts.

Patterns

🌍

Sociotechnical Resilience

Resilience is not just technical — people, processes, and organizations must also adapt to adverse conditions.

People+Tech

🔍Concept Deep Dives


🌊

The Resilience Cycle

Recognize (detect threat) → Resist (defend) → Recover (restore) → Adapt (prevent recurrence).


When to Use

Design phase and incident response. Every resilient system goes through this cycle.

Real-World Example

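AWS us-east-1 outage (December 2021): teams with multi-region failover already in place could detect the regional failure, shift traffic to a healthy region, and keep serving users (Recognize → Resist → Recover).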

✓ Advantages

Systematic framework
Covers before + during + after failure
Drives architecture decisions

⚠ Watch Out

Expensive to implement fully
Requires runbooks and practice

Recognize: monitoring, anomaly detection
  ↓
Resist: redundancy, rate limiting, DDoS protection
  ↓
Recover: failover, restore from backup, rollback
  ↓
Adapt: postmortem, fix root cause, update runbooks
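
The Recognize and Recover steps above can be sketched as a health-checked failover across regions. This is a minimal illustration, not how any cloud provider implements failover: the function name `pick_region`, the caller-supplied `is_healthy` probe, and the `status` table are all hypothetical.

```python
from typing import Callable, Optional, Sequence

def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose health probe passes, in priority order."""
    for region in regions:
        try:
            if is_healthy(region):   # Recognize: probe each region in turn
                return region        # Recover: route traffic to a healthy one
        except Exception:
            continue                 # a probe that raises counts as unhealthy
    return None                      # nothing healthy: the caller must degrade

# Hypothetical status table: the primary region is down, the secondary is up.
status = {"us-east-1": False, "us-west-2": True}
active = pick_region(["us-east-1", "us-west-2"], lambda r: status[r])
print(active)
```

Returning `None` instead of raising leaves the degradation decision to the caller, which is one possible design choice among several.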

🧩

Resilient System Design

Design patterns: redundancy, graceful degradation, bulkheads, circuit breakers, timeouts.


When to Use

Any system with uptime requirements — especially microservices and distributed systems.

Real-World Example

Netflix-style circuit breaker: if the Recommendations service is slow, show cached results or an empty row instead of letting the whole page time out.

✓ Advantages

Prevents cascade failures
Graceful degradation beats total failure
Self-healing systems

⚠ Watch Out

Complex to implement and test
Harder to debug

Circuit Breaker Pattern:
Closed (normal) → too many failures → Open (reject fast)
      ↑                                      │
      └────── Half-Open (test) ←─────────────┘

Bulkhead: isolate failures to one pool
Timeout: never wait forever
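
The three states in the diagram can be sketched as a small state machine. This is an illustrative sketch under assumptions, not a production library: the class name, the consecutive-failure threshold, and the injectable `clock` parameter (used so the example can be exercised without real waiting) are all hypothetical choices.

```python
import time
from typing import Any, Callable, Optional

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None      # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("failing fast; downstream presumed unhealthy")
        try:
            # Closed: a normal call. Half-open: a single trial call.
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or reopen) and restart timer
            raise
        self.failures = 0
        self.opened_at = None                  # a success closes the circuit
        return result
```

A caller would typically catch `CircuitOpenError` and serve a fallback, which is where graceful degradation connects to this pattern.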

🌍

Sociotechnical Resilience

Resilience is not just technical — people, processes, and organizations must also adapt to adverse conditions.


When to Use

Real incident response — technical resilience fails without trained teams and clear processes.

Real-World Example

Boeing 737 MAX: the failure was not purely technical. Organizational factors (pilot training, communication, safety culture) contributed to the tragedy.

✓ Advantages

Addresses real root causes
Improves incident response culture
Builds institutional knowledge

⚠ Watch Out

Harder to 'fix' than technical issues
Requires cultural change

Technical Resilience: → Redundancy, failover, circuit breakers
Organizational Resilience: → Incident training, runbooks, postmortems
Cultural Resilience: → Psychological safety, blame-free culture

📋Quick Reference

Ch 14 Cheat Sheet: Resilience Engineering

Resilience Cycle

Recognize → Resist → Recover → Adapt. Design for all 4 phases.

Circuit Breaker

Stop calling a failing service. Fail fast. Give it time to recover. Resume when healthy.

Bulkhead

Isolate components so failure in one doesn't cascade. Named after ship compartments.
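
One common way to build a bulkhead is to cap how many calls may be in flight to a given dependency, so a slow dependency can exhaust only its own slots. The sketch below assumes a thread-based service and uses a semaphore; the class and exception names are hypothetical.

```python
import threading
from typing import Any, Callable

class BulkheadFullError(Exception):
    """Raised when every slot for this dependency is already in use."""

class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        # One semaphore per dependency keeps its failures in its own pool.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means a stall in one cannot consume the capacity reserved for the others.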

Graceful Degradation

Partial functionality > total failure. Show cached data, disable non-essential features.
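
A minimal sketch of the "show cached data" idea, assuming a hypothetical `fetch` callable that stands in for a recommendations service and an in-memory dictionary as the cache:

```python
from typing import Callable, Dict, List

_last_good: Dict[str, List[str]] = {}   # last successful result per user

def recommendations(user_id: str,
                    fetch: Callable[[str], List[str]]) -> List[str]:
    """Return fresh results when possible, stale or empty ones otherwise."""
    try:
        fresh = fetch(user_id)
        _last_good[user_id] = fresh      # remember the latest good answer
        return fresh
    except Exception:
        # Degrade: serve stale data, or an empty list if nothing is cached.
        return _last_good.get(user_id, [])
```

The page still renders when the dependency fails; it just shows older or fewer recommendations.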

Timeout

Always set timeouts on external calls. Never wait forever.
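
When a client library offers its own timeout parameter, that is usually the simplest option. For calls that do not, one possible approach is to bound the wait with a worker thread, as in this sketch. The helper name `call_with_timeout` is hypothetical, and note the limitation: the worker is not killed on timeout, the caller merely stops waiting for it.

```python
import concurrent.futures
import time
from typing import Any, Callable

def call_with_timeout(fn: Callable[..., Any], timeout_s: float, *args: Any) -> Any:
    """Run fn(*args) and wait at most timeout_s seconds for its result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        # Raises concurrent.futures.TimeoutError if the deadline passes.
        return future.result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)   # do not block on a still-running worker

# A fast call returns normally; a slow one is abandoned after the deadline.
print(call_with_timeout(lambda: 42, 1.0))
try:
    call_with_timeout(time.sleep, 0.05, 0.5)
except concurrent.futures.TimeoutError:
    print("timed out")
```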

Retry with Backoff

Retry failed calls with exponential backoff + jitter to avoid thundering herd.
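
As a sketch of what this looks like in practice, the helper below doubles the delay ceiling on each attempt and sleeps for a random time below that ceiling (often called "full jitter"). The function name, default values, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(fn: Callable[[], T], attempts: int = 4,
                       base: float = 0.1, cap: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                             # out of attempts: give up
            ceiling = min(cap, base * 2 ** attempt)  # 0.1, 0.2, 0.4, ... up to cap
            sleep(random.uniform(0, ceiling))     # jitter spreads clients apart
    raise AssertionError("unreachable")
```

Without jitter, clients that failed together would all retry at the same instants and hit the recovering service in synchronized waves.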

Postmortem

Blameless analysis of what went wrong. Focus on systems, not people.

Sommerville's Key Points — Ch 14

Author's own summary from the end of the chapter.

1. Resilience: ability to maintain essential services despite adverse conditions.
2. Four-stage resilience cycle: recognize, resist, recover, adapt.
3. Resilient design patterns: redundancy, circuit breakers, bulkheads, graceful degradation.
4. Sociotechnical resilience: both technical AND organizational/human factors matter.
5. Chaos engineering validates resilience: passive hope is not enough.
6. Blameless postmortems drive adaptation: learn from failures to prevent recurrence.
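
The chaos engineering point can be illustrated with a tiny fault-injection experiment: wrap a dependency so it fails at a chosen rate, then check that the caller's fallback path actually works. Everything here is a hypothetical sketch (the `with_faults` wrapper, the price lookup, the failure rate); real chaos tooling injects faults at the infrastructure level rather than inside application code.

```python
import random
from typing import Any, Callable

class InjectedFault(Exception):
    """A deliberately simulated dependency failure."""

def with_faults(fn: Callable[..., Any], failure_rate: float,
                rng: random.Random) -> Callable[..., Any]:
    """Wrap fn so that a fraction of calls raise InjectedFault."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < failure_rate:
            raise InjectedFault("chaos experiment: simulated failure")
        return fn(*args, **kwargs)
    return wrapped

def price_with_fallback(lookup: Callable[[str], str], item: str) -> str:
    try:
        return lookup(item)
    except Exception:
        return "price unavailable"   # the behavior the experiment verifies

rng = random.Random(0)
flaky_lookup = with_faults(lambda item: "9.99", 0.5, rng)
results = [price_with_fallback(flaky_lookup, "book") for _ in range(100)]
# Every call should yield either a real price or the fallback, never a crash.
print(set(results) <= {"9.99", "price unavailable"})
```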

🧠Quiz — Test Yourself


Recall

What is a circuit breaker pattern and when should it open?

A circuit breaker wraps a remote call and monitors its failures. Closed: calls pass through normally. If the failure rate exceeds a threshold (e.g., 50% within 60 seconds), the circuit opens and subsequent calls fail immediately instead of waiting on the unhealthy service. After a timeout, it moves to Half-Open and lets a trial call through: success closes the circuit, failure reopens it. Open the circuit when a downstream service is slow or failing and you want to stop that failure from cascading.
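
The "50% in 60 seconds" style of threshold can be sketched as a sliding window of recent call outcomes. The class name and the `min_calls` guard (which avoids opening on one or two unlucky calls) are assumptions added for illustration, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque
from typing import Deque, Tuple

class FailureRateWindow:
    def __init__(self, window_s: float = 60.0, threshold: float = 0.5,
                 min_calls: int = 4) -> None:
        self.window_s = window_s      # how far back to look, in seconds
        self.threshold = threshold    # failure fraction that trips the breaker
        self.min_calls = min_calls    # hypothetical guard against tiny samples
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, ok)

    def record(self, now: float, ok: bool) -> None:
        self._events.append((now, ok))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop outcomes that are older than the window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def should_open(self, now: float) -> bool:
        self._evict(now)
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > self.threshold
```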

Apply

What is the difference between resilience and reliability?

Reliability: the probability of failure-free operation over a given period under expected operating conditions. Resilience: the ability to maintain essential service under ADVERSE conditions (attacks, failures, unexpected events). A reliable system might still fail completely under a DDoS attack. A resilient system degrades gracefully.

Analyze

Why should postmortems be blameless?

Because blame-focused cultures hide incidents and discourage transparency. When people fear punishment, they don't report near-misses, don't share root causes honestly, and don't document failure modes. Blameless postmortems focus on SYSTEM failures, not people. This builds psychological safety and leads to actual improvement.
