Key Concepts
Reliability is engineered, not hoped for. Redundancy, fault tolerance, and chaos testing are the tools.
Reliability Metrics
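POFOD (probability of failure on demand), ROCOF (rate of occurrence of failures), MTTF (mean time to failure), MTTR (mean time to repair), MTBF (mean time between failures), and availability: the language of reliability engineering.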
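As a rough illustration of how these metrics relate, the sketch below derives each one from a single monitoring window. The `ObservationWindow` type and its sample figures are hypothetical, and the code assumes the common convention that MTBF = MTTF + MTTR; some texts define MTBF differently.

```python
from dataclasses import dataclass


@dataclass
class ObservationWindow:
    """Hypothetical summary of one monitoring period (illustrative only)."""
    uptime_hours: float    # total time the system was operational
    repair_hours: float    # total time spent restoring service
    failures: int          # number of failures observed
    requests: int          # number of service demands made
    failed_requests: int   # demands that ended in failure


def mttf(w: ObservationWindow) -> float:
    """Mean time to failure: average operating time before a failure."""
    return w.uptime_hours / w.failures


def mttr(w: ObservationWindow) -> float:
    """Mean time to repair: average time to restore service."""
    return w.repair_hours / w.failures


def mtbf(w: ObservationWindow) -> float:
    """Mean time between failures, assuming MTBF = MTTF + MTTR."""
    return mttf(w) + mttr(w)


def availability(w: ObservationWindow) -> float:
    """Fraction of time the system is operational."""
    return mttf(w) / (mttf(w) + mttr(w))


def rocof(w: ObservationWindow) -> float:
    """Rate of occurrence of failures, per operating hour."""
    return w.failures / w.uptime_hours


def pofod(w: ObservationWindow) -> float:
    """Probability of failure on demand."""
    return w.failed_requests / w.requests


# Made-up numbers: 5 failures over 990 operating hours, 10 hours of repair.
window = ObservationWindow(
    uptime_hours=990.0, repair_hours=10.0, failures=5,
    requests=100_000, failed_requests=20,
)
print(mttf(window), mttr(window), mtbf(window), availability(window))
```

With these sample figures, MTTF is 198 hours, MTTR is 2 hours, MTBF is 200 hours, and availability is 99%.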
Redundancy and Fault Tolerance
Duplicate critical components so one failure doesn't cause system failure.
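To see why duplication helps, the minimal sketch below computes the availability of a group of replicas where any single working replica is enough. It assumes replicas fail independently, which real systems often violate (shared power, shared deploys, shared bugs), so treat the result as an upper bound. The function name is illustrative.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas when one survivor is sufficient.

    Assumes independent failures: the group is down only when every
    replica is down at the same time.
    """
    probability_all_down = (1 - component_availability) ** replicas
    return 1 - probability_all_down


for n in (1, 2, 3):
    print(n, parallel_availability(0.99, n))
```

Under the independence assumption, two replicas at 99% each give roughly 99.99%, and three give roughly 99.9999%.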
Chaos Engineering
Intentionally inject failures in production to test resilience. Netflix's Chaos Monkey kills random servers.
Recovery-Oriented Computing
Design for fast recovery, not just failure prevention. MTTR matters as much as MTTF.
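The claim that MTTR matters as much as MTTF follows from the steady-state formula availability = MTTF / (MTTF + MTTR). The sketch below, using made-up figures, shows that halving repair time buys the same availability as doubling time to failure.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1000 hours, takes 4 hours to repair.
baseline = availability(1000.0, 4.0)
double_mttf = availability(2000.0, 4.0)   # fail half as often
halve_mttr = availability(1000.0, 2.0)    # recover twice as fast

print(baseline, double_mttf, halve_mttr)
```

Both improvements reduce to the same ratio, 2·MTTF / (2·MTTF + MTTR), so the cheaper of the two is usually the better investment.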
Concept Deep Dives
Reliability Metrics
When to Use
Defining SLAs, architecture decisions, post-incident analysis.
Real-World Example
AWS SLA: 99.99% monthly uptime. Translated: at most about 4.3 minutes of downtime in a 30-day month, or roughly 52.6 minutes over a full year.
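The downtime arithmetic behind an availability target can be checked with a few lines. This is a simple sketch; the function name is illustrative, and it ignores how a specific SLA defines or excludes downtime.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Maximum downtime, in minutes, allowed by an availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


monthly = downtime_budget_minutes(99.99, 30)    # 30-day month
yearly = downtime_budget_minutes(99.99, 365)    # 365-day year
print(round(monthly, 2), round(yearly, 2))
```

For 99.99%, this works out to about 4.32 minutes per 30-day month and about 52.56 minutes per 365-day year.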
✓ Advantages
- Objective measurement
- Foundation for SLA contracts
- Guides architecture
⚠ Watch Out
- Hard to measure in practice
- Requires extensive telemetry
Redundancy and Fault Tolerance
When to Use
Any system where downtime is unacceptable — finance, healthcare, infrastructure.
Real-World Example
Netflix runs in 3 AWS regions. One region failure → traffic shifts automatically to the other two.
✓ Advantages
- No single point of failure
- Automatic failover
- Graceful degradation
⚠ Watch Out
- 2-3x infrastructure cost
- Complex to keep in sync
- Distributed systems bugs
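Automatic failover can be sketched as trying each replica in order until one answers. The example below is a toy, in-process version under stated assumptions: the replica functions, the `AllReplicasFailed` exception, and the use of `ConnectionError` to signal an outage are all hypothetical stand-ins, not how any particular load balancer works.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to the first replica that responds."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is unavailable; record the error and try the next.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


def broken_region(request: str) -> str:
    """Hypothetical replica that is currently down."""
    raise ConnectionError("region unavailable")


def healthy_region(request: str) -> str:
    """Hypothetical replica that is serving traffic."""
    return f"served: {request}"


result = call_with_failover([broken_region, healthy_region], "GET /home")
print(result)
```

The caller never sees the first region's failure, which is the point of failover. The costs listed above show up once replicas hold state that must be kept in sync.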
Chaos Engineering
When to Use
When your system needs to prove — not just claim — resilience.
Real-World Example
Netflix Chaos Monkey (2011): randomly terminates EC2 instances in production. Forced engineers to build truly resilient services.
✓ Advantages
- Reveals hidden dependencies
- Forces resilient design
- Finds issues before real failures do
⚠ Watch Out
- Needs production traffic to be effective
- Risky if not prepared
- Cultural change required
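The core loop of a chaos experiment (pick a victim, terminate it, check that the service still holds) can be illustrated with an in-memory simulation. This is not Chaos Monkey's implementation and calls no cloud API; the `Cluster` class, instance IDs, and `min_healthy` threshold are hypothetical.

```python
import random


class Cluster:
    """Toy model of a service backed by several instances."""

    def __init__(self, instance_ids, min_healthy: int):
        self.running = set(instance_ids)
        self.min_healthy = min_healthy  # instances needed to keep serving

    def terminate(self, instance_id: str) -> None:
        self.running.discard(instance_id)

    def is_serving(self) -> bool:
        return len(self.running) >= self.min_healthy


def chaos_round(cluster: Cluster, rng: random.Random) -> str:
    """Terminate one randomly chosen running instance and return its ID."""
    victim = rng.choice(sorted(cluster.running))
    cluster.terminate(victim)
    return victim


rng = random.Random(42)  # seeded so the experiment is repeatable
cluster = Cluster(["i-a", "i-b", "i-c"], min_healthy=2)

victim = chaos_round(cluster, rng)
print(f"terminated {victim}; still serving: {cluster.is_serving()}")
```

With three instances and a threshold of two, the service survives one termination but not a second, which is the kind of margin a real experiment is meant to expose.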
Recovery-Oriented Computing
When to Use
Systems where perfect reliability is impossible — optimize recovery speed instead.
Real-World Example
Google's SRE practice is often summed up as 'Hope is not a strategy': design for failure and optimize for recovery.
✓ Advantages
- Faster incident response
- Forces runbook creation
- Reduces MTTR significantly
⚠ Watch Out
- Requires investment in observability
- Needs practiced incident response
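Reducing MTTR starts with measuring it. The sketch below computes mean time to recovery from a list of incident records; the timestamps are invented for illustration, and a real pipeline would pull them from an incident tracker or monitoring system.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, restored_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),
    (datetime(2024, 2, 11, 22, 15), datetime(2024, 2, 11, 22, 30)),
    (datetime(2024, 3, 7, 6, 0), datetime(2024, 3, 7, 7, 0)),
]


def mean_time_to_recovery(records) -> timedelta:
    """Average duration between detection and restoration."""
    total = sum(
        (restored - detected for detected, restored in records),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(records)


print(mean_time_to_recovery(incidents))
```

The three sample incidents last 45, 15, and 60 minutes, so the mean is 40 minutes. Tracking this figure over time shows whether runbooks and practice drills are paying off.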
Quick Reference
1. Reliability metrics: POFOD (probability), ROCOF (rate), MTTF/MTTR (time-based).
2. Reliability-centered design: identify critical components and add redundancy.
3. Fault tolerance: system continues operating despite component failures.
4. Active-passive and active-active redundancy patterns for high availability.
5. Chaos engineering: inject failures intentionally to test resilience (Netflix Chaos Monkey).
6. Recovery-oriented computing: optimize MTTR, not just MTTF.
Quiz — Test Yourself
Think through your answer first, then reveal.