Phase 3 of 5 · Quality & Trust

Week 11 / 20 · Ch 11

Reliability Engineering

"How Netflix achieves 99.99% uptime — and what you can learn from it"

📚 Ch 11: Reliability Engineering · 📊 Reliability Metrics · ⚡ Fault Tolerance · ⏱ ~20 min read

💡Key Concepts

Reliability is engineered, not hoped for. Redundancy, fault tolerance, and chaos testing are the tools.

📊

Reliability Metrics

POFOD, ROCOF, MTTF, MTTR, MTBF, Availability — the language of reliability engineering.

Measurement

🔄

Redundancy and Fault Tolerance

Duplicate critical components so one failure doesn't cause system failure.

High Availability

🐒

Chaos Engineering

Intentionally inject failures in production to test resilience. Netflix's Chaos Monkey kills random servers.

Netflix Method

🔧

Recovery-Oriented Computing

Design for fast recovery, not just failure prevention. MTTR matters as much as MTTF.

Recovery

🔍Concept Deep Dives

📊

Reliability Metrics

POFOD, ROCOF, MTTF, MTTR, MTBF, Availability — the language of reliability engineering.

When to Use

Defining SLAs, architecture decisions, post-incident analysis.

Real-World Example

AWS SLA: 99.99% monthly uptime across AZs. Translated: at most about 4.3 minutes of downtime in a 30-day month, or about 52.6 minutes if the same target is held over a full year.
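The translation from an availability percentage to a downtime budget is simple arithmetic. A minimal sketch, assuming a 30-day month and a 365-day year (the function name and constants are illustrative, not part of any AWS tooling):

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Maximum downtime, in minutes, that an availability target allows over a period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 60.0


HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year

monthly = downtime_budget_minutes(99.99, HOURS_PER_MONTH)
yearly = downtime_budget_minutes(99.99, HOURS_PER_YEAR)

print(f"99.99% allows {monthly:.2f} min/month and {yearly:.2f} min/year")
```

Each extra "nine" divides the budget by ten, which is why moving from 99.9% to 99.99% is usually an architecture decision rather than a tuning exercise.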

✓ Advantages

Objective measurement
Foundation for SLA contracts
Guides architecture

⚠ Watch Out

Hard to measure in practice
Requires extensive telemetry

POFOD: P(failure on demand), e.g. 1/1000
ROCOF: failures per time unit, e.g. 2/hour
MTTF: mean time to failure, e.g. 500 hrs
MTTR: mean time to repair, e.g. 2 hrs
Availability: MTTF/(MTTF+MTTR) = 500/502 ≈ 99.6%
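The availability formula above can be checked in a few lines. This is a small sketch using the example figures from this section (500 hrs, 2 hrs); the function names are illustrative.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures = time running + time being repaired."""
    return mttf_hours + mttr_hours


a = availability(500, 2)
print(f"Availability: {a:.2%}")        # 99.60%
print(f"MTBF: {mtbf(500, 2)} hours")   # 502 hours
```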

🔄

Redundancy and Fault Tolerance

Duplicate critical components so one failure doesn't cause system failure.

When to Use

Any system where downtime is unacceptable — finance, healthcare, infrastructure.

Real-World Example

Netflix runs in 3 AWS regions. One region failure → traffic shifts automatically to the other two.

✓ Advantages

No single point of failure
Automatic failover
Graceful degradation

⚠ Watch Out

2-3x infrastructure cost
Complex to keep in sync
Distributed systems bugs

Active-Passive:
[Primary] ──(fails)──→ [Standby takes over]

Active-Active:
[Server 1] ──┐
[Server 2] ──┼──→ [Load Balancer] → Users
[Server 3] ──┘
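The active-passive pattern can be sketched as "try the primary, fall back to the standby". The example below is a toy, in-process illustration: `primary` and `standby` are hypothetical stand-ins for real servers, and a production failover would also need health checks and state replication.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could handle the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try replicas in order (primary first) and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica(request)
        except Exception:
            # Any error is treated as "this replica is down"; move on to the next one.
            failures += 1
    raise AllReplicasFailed(f"{failures} replica(s) failed for {request!r}")


def primary(request: str) -> str:
    raise ConnectionError("primary is down")  # simulated failure


def standby(request: str) -> str:
    return f"standby handled {request}"


result = call_with_failover([primary, standby], "GET /health")
print(result)  # standby handled GET /health
```

Active-active differs in that every replica serves traffic all the time, so losing one reduces capacity instead of triggering a takeover.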

🐒

Chaos Engineering

Intentionally inject failures in production to test resilience. Netflix's Chaos Monkey kills random servers.

When to Use

When your system needs to prove — not just claim — resilience.

Real-World Example

Netflix Chaos Monkey (2011): randomly terminates EC2 instances in production. Forced engineers to build truly resilient services.

✓ Advantages

Reveals hidden dependencies
Forces resilient design
Finds issues before real failures do

⚠ Watch Out

Needs production traffic to be effective
Risky if not prepared
Cultural change required

Chaos Monkey: kill random instances
Chaos Gorilla: take down entire AZ
Chaos Kong: take down entire AWS region
→ If the system survives, it's truly resilient
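The shape of a Chaos Monkey style experiment is: confirm steady state, kill a random instance, confirm the service still answers. The sketch below simulates that loop in memory. It is not Netflix's implementation; the `Fleet` class and the instance ids are hypothetical.

```python
import random


class Fleet:
    """Toy in-memory fleet standing in for real servers (hypothetical)."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def terminate_random(self, rng):
        """Chaos step: remove one randomly chosen healthy instance."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.remove(victim)
        return victim

    def serve(self, request):
        """Route the request to any remaining healthy instance."""
        if not self.healthy:
            raise RuntimeError("no healthy instances left")
        handler = min(self.healthy)
        return f"{handler} served {request}"


rng = random.Random(42)  # seeded so the experiment is repeatable
fleet = Fleet(["i-a", "i-b", "i-c"])

baseline = fleet.serve("GET /")        # steady state before the experiment
victim = fleet.terminate_random(rng)   # inject the failure
response = fleet.serve("GET /")        # does the service still answer?

print(f"terminated {victim}; {len(fleet.healthy)} instances left")
print(response)
```

With a fleet of one, the same experiment raises `RuntimeError`, which is exactly the kind of single point of failure the exercise is meant to surface.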

🔧

Recovery-Oriented Computing

Design for fast recovery, not just failure prevention. MTTR matters as much as MTTF.

When to Use

Systems where perfect reliability is impossible — optimize recovery speed instead.

Real-World Example

Google SRE saying: 'Hope is not a strategy.' In practice: design for failure, optimize for recovery.

✓ Advantages

Faster incident response
Forces runbook creation
Reduces MTTR significantly

⚠ Watch Out

Requires investment in observability
Needs practiced incident response

Failure Prevention (MTTF) + Fast Recovery (MTTR) = High Availability
MTTF↑ = harder, more expensive
MTTR↓ = often faster ROI
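Why MTTR matters as much as MTTF: availability depends only on the ratio MTTR/MTTF, so halving repair time buys the same availability as doubling time to failure. A quick check, reusing the 500 hr / 2 hr example from this chapter:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


baseline = availability(500, 2)       # the chapter's example system
doubled_mttf = availability(1000, 2)  # fail half as often
halved_mttr = availability(500, 1)    # recover twice as fast

print(f"baseline:     {baseline:.4%}")
print(f"double MTTF:  {doubled_mttf:.4%}")
print(f"halve MTTR:   {halved_mttr:.4%}")
```

Both improvements land on the same availability; which one is cheaper depends on the system, but faster recovery (better alerts, runbooks, automated rollback) is often the easier lever.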

📋Quick Reference

Ch 11 Cheat Sheet: Reliability Engineering

POFOD

Probability of failure on demand. Ex: 1/1000 = 0.001.

ROCOF

Rate of occurrence of failures per time unit. Ex: 0.02 failures/hour.

MTTF

Mean time to failure. How long until next failure on average.

MTTR

Mean time to repair. How long to recover from failure.

MTBF

Mean time between failures = MTTF + MTTR.

Redundancy

Duplicate components so one failure doesn't cause system failure.

Chaos Engineering

Intentional failure injection to test resilience. Netflix Chaos Monkey.

Recovery-oriented

Design for fast recovery, not just failure prevention. MTTR matters.
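The cheat-sheet metrics can be estimated from simple operational records. A minimal sketch, assuming you log the hours of operation before each failure and the hours spent on each repair; all sample numbers below are made up for illustration.

```python
from statistics import mean


def pofod(failed_demands: int, total_demands: int) -> float:
    """Probability of failure on demand: failed requests / all requests."""
    return failed_demands / total_demands


def rocof(failure_count: int, operating_hours: float) -> float:
    """Rate of occurrence of failures: failures per hour of operation."""
    return failure_count / operating_hours


# Hypothetical log: hours of operation before each failure, and hours to repair it.
uptime_hours = [480.0, 510.0, 510.0]
repair_hours = [1.5, 2.0, 2.5]

mttf = mean(uptime_hours)   # mean time to failure
mttr = mean(repair_hours)   # mean time to repair
mtbf = mttf + mttr          # mean time between failures
failure_rate = rocof(len(uptime_hours), sum(uptime_hours))

print(f"POFOD = {pofod(1, 1000)}")
print(f"ROCOF = {failure_rate} failures/hour")
print(f"MTTF = {mttf} h, MTTR = {mttr} h, MTBF = {mtbf} h")
```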

Sommerville's Key Points — Ch 11

Author's own summary from the end of the chapter.

1. Reliability metrics: POFOD (probability), ROCOF (rate), MTTF/MTTR (time-based).
2. Reliability-centered design: identify critical components and add redundancy.
3. Fault tolerance: system continues operating despite component failures.
4. Active-passive and active-active redundancy patterns for high availability.
5. Chaos engineering: inject failures intentionally to test resilience (Netflix Chaos Monkey).
6. Recovery-oriented computing: optimize MTTR, not just MTTF.

🧠Quiz — Test Yourself

Think through your answer first, then reveal.

Recall

Calculate availability given MTTF=500hrs, MTTR=2hrs. Is this acceptable for a banking system?

Availability = 500/(500+2) ≈ 99.6%. For banking? Probably not: banks typically require 99.95%+ availability. At 99.6%, the system is down for roughly 2.9 hours in a 30-day month (0.4% of 720 hours), which is too much for ATMs and online banking.

Apply

What is the difference between fault tolerance and resilience?

Fault tolerance: system continues CORRECT operation despite component failures (via redundancy, voting, error correction). Resilience: broader — system maintains essential services even under attack, unexpected load, or partial failure. Fault tolerance is one mechanism for achieving resilience.
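Voting is one of the fault-tolerance mechanisms named above: run the same computation on several replicas and accept the answer a strict majority agrees on. A small sketch of such a voter (the function name and sample values are illustrative):

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of replicas, or raise if none exists."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


# Three replicas compute the same result; one of them is faulty.
result = majority_vote([42, 42, 41])
print(result)  # 42
```

With three replicas the voter masks one wrong answer. If all three disagree, it raises instead of guessing, so the failure is detected even though it is not masked.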

Analyze

What is chaos engineering and why is it run in production (not just staging)?

Chaos engineering = intentional failure injection. It is run in production because staging lacks real traffic patterns, real dependencies, and real data volumes, so failures behave differently under production load. Netflix's Chaos Monkey taught this: staging tests couldn't predict production resilience.
