Key Concepts
Dependability is not one thing — it's a family of properties. You can't retrofit trust into a system that wasn't designed for it.
Availability
The probability that a system is operational at any given point in time.
Reliability
The probability that a system performs correctly over a given time period.
Safety
The probability that a system will not cause damage to people or the environment.
Security
The ability to protect the system from malicious attacks and unauthorized access.
Resilience
The ability to continue delivering services in the presence of partial system failure.
Concept Deep Dives
Availability
When to Use
Any system where downtime has cost — e-commerce, healthcare, infrastructure.
Real-World Example
AWS targets 99.99% availability, which allows about 52.6 minutes of downtime per year. 99.9% allows about 8.76 hours per year; 99% allows about 87.6 hours per year.
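The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year; the function name `downtime_hours_per_year` is made up for this illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year permitted by an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# Print the yearly downtime budget for two, three, and four nines.
for target in (0.99, 0.999, 0.9999):
    hours = downtime_hours_per_year(target)
    print(f"{target:.2%}: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the step from 99.9% to 99.99% is so much harder than it looks.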
✓ Advantages
- Measurable metric
- Foundation for SLAs
- Drives redundancy design
⚠ Watch Out
- High availability is costly: each additional nine needs more redundancy and operational effort
- Availability ≠ correctness
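To see why availability targets drive redundancy design, consider a rough model in which a service stays up as long as at least one of `n` replicas is up. This sketch assumes replica failures are independent, which real deployments rarely achieve in full (shared power, networks, and bugs correlate failures); the function name is hypothetical.

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of n redundant replicas, ASSUMING independent failures.

    The service is down only when every replica is down at once, so
    unavailability is (1 - a) ** n.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - replica_availability) ** replicas


# Under the independence assumption, two 99% replicas reach about 99.99%.
for n in (1, 2, 3):
    print(f"{n} replica(s): {parallel_availability(0.99, n):.6f}")
```

The model also shows the cost side: every replica added is hardware and operations paid for in full, while the gain shrinks as correlated failures start to dominate.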
Reliability
When to Use
Systems where incorrect operation causes harm or financial loss.
Real-World Example
A bank transfer must be reliable — a system that's available but transfers wrong amounts is not reliable.
✓ Advantages
- Focuses on correct behavior, not just uptime
- Measurable: POFOD (probability of failure on demand), ROCOF (rate of occurrence of failures), MTTF (mean time to failure)
⚠ Watch Out
- Hard to achieve 100% reliability
- Trade-off with performance
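The three reliability metrics named above can be estimated from simple operational counts. A minimal sketch, assuming failures are counted over one observation window; the `ObservationWindow` type and the sample numbers are invented for illustration, not measurements from any real system:

```python
from dataclasses import dataclass


@dataclass
class ObservationWindow:
    demands: int            # service requests made during the window
    failed_demands: int     # requests that failed or returned a wrong result
    operating_hours: float  # total time the system was in operation
    failures: int           # failures observed during that time


def pofod(w: ObservationWindow) -> float:
    """Probability of failure on demand: failed requests per request."""
    return w.failed_demands / w.demands


def rocof(w: ObservationWindow) -> float:
    """Rate of occurrence of failures: failures per operating hour."""
    return w.failures / w.operating_hours


def mttf(w: ObservationWindow) -> float:
    """Mean time to failure: operating hours per observed failure."""
    if w.failures == 0:
        raise ValueError("no failures observed; MTTF cannot be estimated")
    return w.operating_hours / w.failures


# Hypothetical window: 10,000 requests and 1,000 hours of operation.
window = ObservationWindow(demands=10_000, failed_demands=2,
                           operating_hours=1_000.0, failures=4)
print(f"POFOD = {pofod(window):.4f}")
print(f"ROCOF = {rocof(window):.4f} failures/hour")
print(f"MTTF  = {mttf(window):.1f} hours")
```

POFOD suits systems that are called occasionally (such as a protection system), while ROCOF and MTTF suit systems that run continuously.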
Safety
When to Use
Safety-critical systems: medical devices, avionics, industrial control, autonomous vehicles.
Real-World Example
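Toyota unintended acceleration (2009-2010 recalls): crashes attributed to the problem caused deaths. A 2011 NHTSA/NASA study found no electronic cause, but expert testimony in a 2013 trial (Bookout v. Toyota) identified defects in the electronic throttle control software. Safety engineering aims to eliminate this class of hazard before deployment.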
✓ Advantages
- Prevents catastrophic failure
- Required by regulation in critical domains
⚠ Watch Out
- Expensive (redundancy, certification)
- Can conflict with performance
Security
When to Use
Always — there is no system that doesn't need security.
Real-World Example
Equifax breach (2017): an unpatched Apache Struts vulnerability (CVE-2017-5638) exposed the personal data of about 147 million people.
✓ Advantages
- Protects assets and users
- Required by regulation (GDPR, HIPAA)
⚠ Watch Out
- Adds complexity
- Security vs usability trade-off
- Never 100% secure
Resilience
When to Use
Systems that must survive failures, attacks, or unexpected events.
Real-World Example
Netflix Chaos Monkey: randomly terminates production instances to verify that services tolerate the loss. A service that can't survive that kind of chaos isn't resilient.
✓ Advantages
- Systems survive partial failures
- Business continuity
- Graceful degradation
⚠ Watch Out
- Complex to design
- Expensive to test
- Automatic recovery may mask underlying bugs
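Graceful degradation can be sketched as a fallback around a dependency that may fail. The example below is illustrative only: `recommend`, `broken_service`, and the titles are hypothetical names, and a production system would typically add timeouts, retries, and metrics.

```python
from typing import Callable, List


def recommend(
    user_id: str,
    fetch_personalized: Callable[[str], List[str]],
    popular_titles: List[str],
) -> List[str]:
    """Return personalized titles, or a generic list if that dependency fails."""
    try:
        return fetch_personalized(user_id)
    except Exception as exc:
        # Report the failure instead of hiding it: a silent fallback is
        # exactly how resilience ends up masking bugs.
        print(f"degraded mode for {user_id}: {exc!r}")
        return popular_titles


def broken_service(user_id: str) -> List[str]:
    # Stand-in for a partial failure of one dependency.
    raise TimeoutError("personalization service timed out")


# The caller still gets a usable answer while the dependency is down.
print(recommend("u1", broken_service, ["Top 10 today"]))
```

The user sees a less tailored result rather than an error page, and the logged failure keeps the underlying problem visible to operators.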
Quick Reference
1. Dependability: availability, reliability, safety, security, and resilience are all required for trustworthy systems.
2. Availability: system operational when needed. Measured as MTTF/(MTTF+MTTR).
3. Reliability: system delivers correct service. Measured as POFOD or ROCOF.
4. Safety: no harm to people/environment. Critical in safety-critical systems.
5. Security: protection from malicious attacks. CIA triad: confidentiality, integrity, availability.
6. Resilience: maintain service despite failures. Recognize, resist, recover, adapt.
7. Fault → Error → Failure: the chain from cause to visible wrong behavior.
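The availability formula in the list above, MTTF/(MTTF+MTTR), is easy to check numerically. A minimal sketch, taking MTTR as mean time to repair and using made-up figures; the function name is an assumption for this example:

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as MTTF / (MTTF + MTTR): the fraction of time in service."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    if mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR cannot both be zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 999 hours on average, takes 1 hour to repair.
print(f"{steady_state_availability(999.0, 1.0):.3%}")

# Halving repair time improves availability without changing failure frequency.
print(f"{steady_state_availability(999.0, 0.5):.3%}")
```

The second call illustrates a practical point: faster recovery raises availability even when the system fails just as often, which is why availability and reliability are tracked as separate properties.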
Quiz — Test Yourself
Think through your answer first, then reveal.