Phase 3 of 5  ·  Quality & Trust
Week 11 / 20   ·   Ch 11

Reliability Engineering

"How Netflix achieves 99.99% uptime — and what you can learn from it"

📚 Ch 11 — Reliability Engineering  ·  📊 Reliability Metrics  ·  ⚡ Fault Tolerance  ·  ⏱ ~20 min read

🔍 Concept Deep Dives


📊

Reliability Metrics

POFOD, ROCOF, MTTF, MTTR, MTBF, Availability — the language of reliability engineering.

When to Use

Defining SLAs, architecture decisions, post-incident analysis.

Real-World Example

AWS SLA: 99.99% monthly uptime for workloads spread across multiple AZs. Translated: at most about 4.3 minutes of downtime in a 30-day month, or about 52.6 minutes per year.
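The arithmetic behind that translation can be sketched in a few lines. This is a minimal illustration, not an official SLA calculator: the function name `downtime_budget_minutes` is hypothetical, and the 365-day year and 30-day month are simplifying assumptions.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Return the maximum downtime (in minutes) allowed within a period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 60.0


HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month

# Compare a few common availability targets ("the nines").
for target in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_budget_minutes(target, HOURS_PER_YEAR)
    per_month = downtime_budget_minutes(target, HOURS_PER_MONTH)
    print(f"{target}% -> {per_year:.1f} min/year, {per_month:.1f} min/month")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% usually requires architectural changes rather than more careful operations.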

✓ Advantages

  • Objective measurement
  • Foundation for SLA contracts
  • Guides architecture

⚠ Watch Out

  • Hard to measure in practice
  • Requires extensive telemetry
POFOD: P(failure on demand), e.g. 1/1000
ROCOF: failures per time unit, e.g. 2/hour
MTTF: mean time to failure, e.g. 500 hrs
MTTR: mean time to repair, e.g. 2 hrs
MTBF: mean time between failures = MTTF + MTTR = 502 hrs
Availability: MTTF/(MTTF+MTTR) = 500/502 ≈ 99.6%
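To make the metrics concrete, here is a small sketch that derives them from an incident log. The `Incident` record, the `reliability_metrics` and `pofod` helpers, and the sample data are all hypothetical; real telemetry is messier, and the numbers are chosen only so the result matches the 500 hr / 2 hr example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    uptime_before_hours: float  # how long the system ran before this failure
    repair_hours: float         # how long it took to restore service


def reliability_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute time-based reliability metrics from a list of incidents."""
    n = len(incidents)
    if n == 0:
        raise ValueError("need at least one incident")
    total_up = sum(i.uptime_before_hours for i in incidents)
    total_repair = sum(i.repair_hours for i in incidents)
    mttf = total_up / n
    mttr = total_repair / n
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mttf + mttr,
        "availability": mttf / (mttf + mttr),
        "ROCOF": n / total_up,  # failures per operating hour
    }


def pofod(failed_requests: int, total_requests: int) -> float:
    """Probability of failure on demand: failed requests / all requests."""
    return failed_requests / total_requests


# Hypothetical log: two failures, averaging 500 hrs of uptime and 2 hrs of repair.
log = [Incident(400, 1.5), Incident(600, 2.5)]
metrics = reliability_metrics(log)
print(metrics)
print(pofod(1, 1000))
```

POFOD is computed separately because it is measured per request rather than per unit of time, which makes it the natural metric for systems that are invoked rarely but must work when called.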
🔄

Redundancy and Fault Tolerance

Duplicate critical components so one failure doesn't cause system failure.

When to Use

Any system where downtime is unacceptable — finance, healthcare, infrastructure.

Real-World Example

Netflix runs in 3 AWS regions. One region failure → traffic shifts automatically to the other two.

✓ Advantages

  • No single point of failure
  • Automatic failover
  • Graceful degradation

⚠ Watch Out

  • 2-3x infrastructure cost
  • Complex to keep in sync
  • Distributed systems bugs
Active-Passive:
  [Primary] ──(fails)──→ [Standby takes over]

Active-Active:
  [Server 1] ──┐
  [Server 2] ──┼──→ [Load Balancer] → Users
  [Server 3] ──┘
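The active-passive pattern can be sketched as a caller that tries the primary first and falls back to the standby. This is an in-process illustration under stated assumptions, not how Netflix or any load balancer implements failover: `call_with_failover`, `primary`, and `standby` are hypothetical names, and a `ConnectionError` stands in for "replica is down".

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:  # assumption: this error means "replica down"
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


def primary(request: str) -> str:
    raise ConnectionError("primary is down")  # simulated outage


def standby(request: str) -> str:
    return f"standby handled {request}"


result = call_with_failover([primary, standby], "GET /account")
print(result)  # prints "standby handled GET /account"
```

In a real deployment the hard part is not the retry loop but keeping the standby's state in sync with the primary, which is where the "complex to keep in sync" warning comes from.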
🐒

Chaos Engineering

Intentionally inject failures in production to test resilience. Netflix's Chaos Monkey kills random servers.

When to Use

When your system needs to prove — not just claim — resilience.

Real-World Example

Netflix Chaos Monkey (2011): randomly terminates EC2 instances in production. Forced engineers to build truly resilient services.

✓ Advantages

  • Reveals hidden dependencies
  • Forces resilient design
  • Finds issues before real failures do

⚠ Watch Out

  • Needs production traffic to be effective
  • Risky if not prepared
  • Cultural change required
Chaos Monkey: kill random instances
Chaos Gorilla: take down entire AZ
Chaos Kong: take down entire AWS region
→ If system survives, it's truly resilient
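The core idea of killing a random instance and checking whether the service survives can be simulated in memory. The sketch below is not Netflix's Chaos Monkey and makes no cloud API calls: the fleet dictionary, `terminate_random_instance`, `service_is_up`, and the "at least 2 healthy instances" rule are all assumptions made for illustration.

```python
import random


def terminate_random_instance(fleet: dict[str, bool], rng: random.Random) -> str:
    """Mark one randomly chosen healthy instance as down and return its id."""
    healthy = sorted(name for name, up in fleet.items() if up)
    if not healthy:
        raise RuntimeError("no healthy instances left")
    victim = rng.choice(healthy)
    fleet[victim] = False
    return victim


def service_is_up(fleet: dict[str, bool], min_healthy: int) -> bool:
    """Assumed health rule: the service works if enough instances are healthy."""
    return sum(fleet.values()) >= min_healthy


fleet = {"web-1": True, "web-2": True, "web-3": True}
rng = random.Random(42)  # seeded so the experiment is repeatable

victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim}; service up: {service_is_up(fleet, min_healthy=2)}")
```

A real chaos experiment adds the pieces this sketch leaves out: a hypothesis about steady-state behavior, a limited blast radius, and a way to abort if customer impact exceeds what was expected.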
🔧

Recovery-Oriented Computing

Design for fast recovery, not just failure prevention. MTTR matters as much as MTTF.

When to Use

Systems where perfect reliability is impossible — optimize recovery speed instead.

Real-World Example

Google SRE philosophy: 'Hope is not a strategy. Design for failure, optimize for recovery.'

✓ Advantages

  • Faster incident response
  • Forces runbook creation
  • Reduces MTTR significantly

⚠ Watch Out

  • Requires investment in observability
  • Needs practiced incident response
Failure Prevention (MTTF) + Fast Recovery (MTTR) = High Availability
MTTF↑ = harder, more expensive
MTTR↓ = often faster ROI
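A quick calculation shows why recovery speed deserves as much attention as failure prevention. The sketch below reuses the 500 hr / 2 hr figures from the metrics example; the `availability` helper is a hypothetical name for the MTTF/(MTTF+MTTR) formula.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


baseline = availability(500, 2)      # starting point
double_mttf = availability(1000, 2)  # failures happen half as often
halve_mttr = availability(500, 1)    # recovery takes half as long

print(f"baseline:     {baseline:.4%}")
print(f"MTTF doubled: {double_mttf:.4%}")
print(f"MTTR halved:  {halve_mttr:.4%}")
```

Both improvements lift availability from about 99.60% to about 99.80%, because the formula depends only on the ratio of MTTR to MTTF. Halving repair time buys the same availability as doubling time to failure, and it is often the cheaper of the two.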

📋 Quick Reference

Ch 11 Cheat Sheet — Reliability Engineering
POFOD
Probability of failure on demand. Ex: 1/1000 = 0.001.
ROCOF
Rate of occurrence of failures per time unit. Ex: 0.02 failures/hour.
MTTF
Mean time to failure. How long until next failure on average.
MTTR
Mean time to repair. How long to recover from failure.
MTBF
Mean time between failures = MTTF + MTTR.
Redundancy
Duplicate components so one failure doesn't cause system failure.
Chaos Engineering
Intentional failure injection to test resilience. Netflix Chaos Monkey.
Recovery-oriented
Design for fast recovery, not just failure prevention. MTTR matters.
Sommerville's Key Points — Ch 11
Author's own summary from the end of the chapter.
  • 1. Reliability metrics: POFOD (probability), ROCOF (rate), MTTF/MTTR (time-based).
  • 2. Reliability-centered design: identify critical components and add redundancy.
  • 3. Fault tolerance: system continues operating despite component failures.
  • 4. Active-passive and active-active redundancy patterns for high availability.
  • 5. Chaos engineering: inject failures intentionally to test resilience (Netflix Chaos Monkey).
  • 6. Recovery-oriented computing: optimize MTTR, not just MTTF.

🧠 Quiz — Test Yourself

Think through your answer first, then reveal.

Q1
Recall
Calculate availability given MTTF=500hrs, MTTR=2hrs. Is this acceptable for a banking system?
Availability = 500/(500+2) ≈ 99.6%. For banking? Probably not: banks typically require 99.95%+ availability. At 99.6%, the system is down roughly 2.9 hours per month (0.4% of about 720 hours), which is too much for ATMs and online banking.
Q2
Apply
What is the difference between fault tolerance and resilience?
Fault tolerance: system continues CORRECT operation despite component failures (via redundancy, voting, error correction). Resilience: broader — system maintains essential services even under attack, unexpected load, or partial failure. Fault tolerance is one mechanism for achieving resilience.
Q3
Analyze
What is chaos engineering and why is it run in production (not just staging)?
Chaos engineering = intentional failure injection. It must run in production because staging doesn't have real traffic patterns, real dependencies, or real data volumes, and failures behave differently under production load. Netflix's Chaos Monkey taught this: staging tests couldn't predict production resilience.
Up Next → Week 12
Security Engineering
OWASP Top 10 in plain English — every junior dev needs this