Phase 3 of 5  ·  Quality & Trust
Week 10 / 20   ·   Ch 10

Dependability and Security

"What makes software trustworthy? 5 properties every engineer must know."

📚 Ch 10: Dependability  ·  🛡️ 5 Properties  ·  🔐 Security by Design  ·  ⏱ ~20 min read

🔍Concept Deep Dives


🔄

Availability

The probability that a system is operational at any given point in time.

When to Use

Any system where downtime has cost — e-commerce, healthcare, infrastructure.

Real-World Example

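Some AWS service SLAs commit to 99.99% uptime, which allows about 52.6 minutes of downtime per year. 99.9% allows about 8.76 hours per year; 99% allows about 87.6 hours (more than 3.6 days) per year.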

✓ Advantages

  • Measurable metric
  • Foundation for SLAs
  • Drives redundancy design

⚠ Watch Out

  • High availability = high cost
  • Availability ≠ correctness
Availability = MTTF / (MTTF + MTTR)
  MTTF = Mean Time To Failure
  MTTR = Mean Time To Repair
  99.99% = 52 min downtime/year
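
The formula above can be checked with a few lines of Python. This is a minimal sketch, not a monitoring tool: the function names and the example MTTF/MTTR figures are assumptions made for illustration, and a year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical service: one failure per 1,000 hours, 1 hour to repair.
    a = availability(mttf_hours=1000, mttr_hours=1)
    print(f"availability = {a:.4%}")

    # Downtime budget for the common "nines" targets.
    for target in (0.99, 0.999, 0.9999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:.2%} -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Note how the formula rewards shorter repairs as much as rarer failures: halving MTTR improves availability without changing how often the system fails.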
📊

Reliability

The probability that a system performs correctly over a given time period.

When to Use

Systems where incorrect operation causes harm or financial loss.

Real-World Example

A bank transfer must be reliable — a system that's available but transfers wrong amounts is not reliable.

✓ Advantages

  • Focuses on correct behavior, not just uptime
  • Measurable (POFOD, ROCOF, MTTF)

⚠ Watch Out

  • Hard to achieve 100% reliability
  • Trade-off with performance
POFOD: Probability of failure on demand
ROCOF: Rate of occurrence of failures
MTTF: Mean time to failure
High availability ≠ high reliability
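
As a rough illustration of how these three metrics could be computed from test or operational logs, here is a small Python sketch. The `Observation` record and its field names are hypothetical, and MTTF is estimated simply as operating time divided by the number of failures, which assumes failures are spread roughly evenly over the observation period.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary of an observation period (illustrative fields)."""
    demands: int            # service requests made
    failed_demands: int     # requests that resulted in a failure
    operating_hours: float  # total observed operating time
    failures: int           # failures seen during that time


def pofod(obs: Observation) -> float:
    """Probability of failure on demand: failed demands / total demands."""
    return obs.failed_demands / obs.demands


def rocof(obs: Observation) -> float:
    """Rate of occurrence of failures: failures per operating hour."""
    return obs.failures / obs.operating_hours


def mttf(obs: Observation) -> float:
    """Mean time to failure in hours (the reciprocal of ROCOF here)."""
    return obs.operating_hours / obs.failures


if __name__ == "__main__":
    obs = Observation(demands=10_000, failed_demands=2,
                      operating_hours=500.0, failures=4)
    print(f"POFOD = {pofod(obs):.4f}")
    print(f"ROCOF = {rocof(obs):.3f} failures/hour")
    print(f"MTTF  = {mttf(obs):.1f} hours")
```

POFOD suits systems that are called on rarely (a protection system), while ROCOF and MTTF suit systems that run continuously or handle long transactions.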
🛡️

Safety

The probability that a system will not cause damage to people or the environment.

When to Use

Safety-critical systems: medical devices, avionics, industrial control, autonomous vehicles.

Real-World Example

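Toyota unintended acceleration (2009): fatal crashes led to large recalls, and later expert analysis presented in litigation identified defects in the throttle control software as a possible cause. Systematic safety engineering aims to prevent this class of failure.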

✓ Advantages

  • Prevents catastrophic failure
  • Required by regulation in critical domains

⚠ Watch Out

  • Expensive (redundancy, certification)
  • Can conflict with performance
Safety-Critical Systems:
  • Avionics (Boeing 737 MAX)
  • Medical devices (insulin pumps)
  • Industrial control (nuclear plants)
  • Autonomous vehicles
🔐

Security

The ability to protect the system from malicious attacks and unauthorized access.

When to Use

Always — there is no system that doesn't need security.

Real-World Example

Equifax breach (2017): unpatched Apache Struts vulnerability exposed 147 million records.

✓ Advantages

  • Protects assets and users
  • Required by regulation (GDPR, HIPAA)

⚠ Watch Out

  • Adds complexity
  • Security vs usability trade-off
  • Never 100% secure
Security Properties:
  • Confidentiality (only authorized users can see data)
  • Integrity (data is not tampered with)
  • Availability (accessible when needed)
  • Authentication + Authorization
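
To make the integrity property concrete, the sketch below uses Python's standard `hmac` module to detect tampering with a message. It assumes sender and receiver already share a secret key; the message format and function names are invented for illustration. It demonstrates integrity and origin authentication only, not confidentiality, because the message itself is not encrypted.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # shared secret (assumed pre-distributed)
    original = b"transfer amount=100 to=alice"  # hypothetical message format
    tag = sign(key, original)

    tampered = b"transfer amount=900 to=alice"
    print("original accepted:", verify(key, original, tag))
    print("tampered accepted:", verify(key, tampered, tag))
```

Using `hmac.compare_digest` rather than `==` avoids leaking information through comparison timing.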
🌊

Resilience

The ability to continue delivering services in the presence of partial system failure.

When to Use

Systems that must survive failures, attacks, or unexpected events.

Real-World Example

Netflix Chaos Monkey: intentionally kills production servers to test resilience. If it can't survive chaos, it's not resilient.

✓ Advantages

  • Systems survive partial failures
  • Business continuity
  • Graceful degradation

⚠ Watch Out

  • Complex to design
  • Expensive to test
  • May mask bugs
Recognize → Resist → Recover → Adapt (Resilience cycle — system stays operational even when parts fail)
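
Graceful degradation, one of the ideas behind the cycle above, can be sketched as "retry the primary dependency, then fall back to a reduced answer". The helper below is a hypothetical illustration, not a production pattern: real systems usually add backoff delays, timeouts, and circuit breakers, and the service names in the example are made up.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       attempts: int = 3) -> T:
    """Try the primary dependency a few times, then degrade to the fallback."""
    for _ in range(attempts):
        try:
            return primary()       # normal service
        except ConnectionError:
            continue               # failure recognized: try again
    return fallback()              # recover with a degraded answer


if __name__ == "__main__":
    def personalized_recommendations() -> list[str]:
        # Stand-in for a remote call that is currently failing.
        raise ConnectionError("recommendation service unreachable")

    def popular_titles() -> list[str]:
        # Static, cached answer: less useful, but the page still loads.
        return ["Popular title A", "Popular title B"]

    print(call_with_fallback(personalized_recommendations, popular_titles))
```

The caveat from the Watch Out list applies here: a silent fallback can mask bugs, so each degraded response should also be logged or counted.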

📋Quick Reference

Ch 10 Cheat Sheet: Dependability and Security
Availability
P(operational at given time). MTTF/(MTTF+MTTR). 99.9% = 8.7hr/yr downtime.
Reliability
P(correct operation over time period). POFOD, ROCOF, MTTF metrics.
Safety
P(no harm to people/environment). Critical in avionics, medical, industrial systems.
Security
Protection from malicious attack. CIA triad: Confidentiality, Integrity, Availability.
Resilience
Continue operating despite partial failure. Recognize → Resist → Recover → Adapt.
Fault vs Failure
Fault = cause (bug, hardware). Error = incorrect state. Failure = visible wrong behavior.
Dependability
Umbrella term: availability + reliability + safety + security + resilience.
Sommerville's Key Points — Ch 10
Author's own summary from the end of the chapter.
  1. Dependability: availability, reliability, safety, security, and resilience are all required for trustworthy systems.
  2. Availability: system operational when needed. Measured as MTTF/(MTTF+MTTR).
  3. Reliability: system delivers correct service. Measured as POFOD or ROCOF.
  4. Safety: no harm to people/environment. Critical in safety-critical systems.
  5. Security: protection from malicious attacks. CIA triad: confidentiality, integrity, availability.
  6. Resilience: maintain service despite failures. Recognize, resist, recover, adapt.
  7. Fault → Error → Failure: the chain from cause to visible wrong behavior.

🧠Quiz — Test Yourself

Think through your answer first, then reveal.

Q1
Recall
What is the difference between availability and reliability? Give an example where a system has high availability but low reliability.
Availability = system is operational. Reliability = system operates correctly. Example: a database server that is always online (high availability) but occasionally returns wrong query results (low reliability). Available but not reliable.
Q2
Apply
Why can't you just 'add security later'?
Security is an architectural concern — it affects every layer of the system. Authentication, encryption, input validation, access control — these are design decisions. Retrofitting security onto a system not designed for it requires rebuilding large parts of it. Security must be designed in from day 1.
Q3
Analyze
Explain the fault-error-failure chain with an example.
Fault = root cause (e.g., integer overflow bug in code). Error = incorrect internal state (counter wraps to negative). Failure = visible wrong behavior (system rejects valid orders). Testing and fault tolerance aim to break this chain before it reaches 'Failure.'
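
The chain in the Q3 answer can be traced in a few lines of Python. Python integers do not overflow, so this sketch simulates a 32-bit signed counter with a hypothetical `to_int32` helper; the order-ID scenario and all function names are invented for illustration.

```python
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap a Python int the way a 32-bit signed counter would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def next_order_id(current: int) -> int:
    # FAULT: the counter is stored in 32 bits with no overflow check.
    return to_int32(current + 1)


def accept_order(order_id: int) -> bool:
    # Validation assumes that order IDs are always positive.
    return order_id > 0


if __name__ == "__main__":
    order_id = next_order_id(INT32_MAX)  # ERROR: internal state is now negative
    accepted = accept_order(order_id)    # FAILURE: a valid order is rejected
    print(f"order_id={order_id}, accepted={accepted}")
```

The fault sits dormant until the counter reaches its limit, which is why such defects can survive testing. An overflow check in `next_order_id` would remove the fault, while a fault-tolerant design would detect the negative ID and recover before the customer sees a rejection.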
Up Next → Week 11
Reliability Engineering
How Netflix achieves 99.99% uptime — and what you can learn from it