Phase 3 of 5 · Quality & Trust

Week 10 / 20 · Ch 10

Dependability and Security

"What makes software trustworthy? 5 properties every engineer must know."

📚 Ch 10: Dependability · 🛡️ 5 Properties · 🔐 Security by Design · ⏱ ~20 min read

💡Key Concepts

Dependability is not one thing — it's a family of properties. You can't retrofit trust into a system that wasn't designed for it.

🔄

Availability

The probability that a system is operational at any given point in time.

Uptime

📊

Reliability

The probability that a system performs correctly over a given time period.

Correctness

🛡️

Safety

The probability that a system will not cause damage to people or the environment.

Critical

🔐

Security

The ability to protect the system from malicious attacks and unauthorized access.

Protection

🌊

Resilience

The ability to continue delivering services in the presence of partial system failure.

Recovery

🔍Concept Deep Dives

🔄

Availability

The probability that a system is operational at any given point in time.

When to Use

Any system where downtime has cost — e-commerce, healthcare, infrastructure.

Real-World Example

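Cloud providers such as AWS publish availability targets of 99.99% for some services, which allows about 52.6 minutes of downtime per year. 99.9% allows about 8.8 hours per year; 99% allows about 87.6 hours per year.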

✓ Advantages

Measurable metric
Foundation for SLAs
Drives redundancy design

⚠ Watch Out

High availability = high cost
Availability ≠ correctness

Availability = MTTF / (MTTF + MTTR)
MTTF = Mean Time To Failure
MTTR = Mean Time To Repair
99.99% = about 52.6 min downtime/year
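The formula above can be turned into a small calculator. This is a minimal sketch: the function names and the sample MTTF/MTTR figures are assumptions chosen for illustration, not values from the chapter.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical service: fails every 999 hours on average, 1 hour to repair.
    a = availability(mttf_hours=999, mttr_hours=1)
    print(f"availability = {a:.4f}")
    for target in (0.99, 0.999, 0.9999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:.2%} -> {minutes:.1f} min/year ({minutes / 60:.1f} h)")
```

Note that shortening repair time (MTTR) raises availability just as effectively as lengthening the time between failures, which is why fast recovery is a common design goal.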

📊

Reliability

The probability that a system performs correctly over a given time period.

When to Use

Systems where incorrect operation causes harm or financial loss.

Real-World Example

A bank transfer must be reliable — a system that's available but transfers wrong amounts is not reliable.

✓ Advantages

Focuses on correct behavior, not just uptime
Measurable (POFOD, ROCOF, MTTF)

⚠ Watch Out

Hard to achieve 100% reliability
Trade-off with performance

POFOD: Probability of failure on demand
ROCOF: Rate of occurrence of failures
MTTF: Mean time to failure
High availability ≠ high reliability
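As a rough illustration, the three metrics can be estimated from simple counts collected during testing or operation. The `ObservationLog` class, its field names, and the sample numbers are hypothetical; the estimators are the plain ratio forms and assume the counts are non-zero.

```python
from dataclasses import dataclass


@dataclass
class ObservationLog:
    """Hypothetical summary of an observation period (names are illustrative)."""
    demands: int            # number of service requests made
    failed_demands: int     # requests that resulted in a failure
    operating_hours: float  # total time the system was running
    failures: int           # failures observed during that time


def pofod(log: ObservationLog) -> float:
    """Probability of failure on demand: failed demands / total demands."""
    return log.failed_demands / log.demands


def rocof(log: ObservationLog) -> float:
    """Rate of occurrence of failures: failures per operating hour."""
    return log.failures / log.operating_hours


def mttf(log: ObservationLog) -> float:
    """Mean time to failure, estimated as operating hours per failure."""
    return log.operating_hours / log.failures


# Made-up figures for a sample observation period.
log = ObservationLog(demands=10_000, failed_demands=2,
                     operating_hours=1_000, failures=4)
print(pofod(log), rocof(log), mttf(log))
```

POFOD suits systems that are called on rarely (a protection system, for instance), while ROCOF and MTTF suit systems in continuous use. With these simple estimators, MTTF is the reciprocal of ROCOF.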

🛡️

Safety

The probability that a system will not cause damage to people or the environment.

When to Use

Safety-critical systems: medical devices, avionics, industrial control, autonomous vehicles.

Real-World Example

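Toyota unintended acceleration (2009-2010 recalls): crashes involving sudden acceleration were linked to deaths. Regulators initially pointed to floor mats and sticking pedals, but expert testimony in a 2013 trial identified defects in the electronic throttle control software as a possible cause. Rigorous safety engineering is meant to catch this class of defect before release.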

✓ Advantages

Prevents catastrophic failure
Required by regulation in critical domains

⚠ Watch Out

Expensive (redundancy, certification)
Can conflict with performance

Safety-Critical Systems:
- Avionics (Boeing 737 MAX)
- Medical devices (insulin pumps)
- Industrial control (nuclear plants)
- Autonomous vehicles

🔐

Security

The ability to protect the system from malicious attacks and unauthorized access.

When to Use

Always — there is no system that doesn't need security.

Real-World Example

Equifax breach (2017): unpatched Apache Struts vulnerability exposed 147 million records.

✓ Advantages

Protects assets and users
Required by regulation (GDPR, HIPAA)

⚠ Watch Out

Adds complexity
Security vs usability trade-off
Never 100% secure

Security Properties:
- Confidentiality (only authorized see data)
- Integrity (data not tampered with)
- Availability (accessible when needed)
- Authentication + Authorization
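To make one of these properties concrete, here is a minimal sketch of an integrity check using a keyed hash (HMAC) from Python's standard library. The message contents and the `sign`/`verify` helper names are assumptions for illustration; a real system would also need key management, which is out of scope here.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag that travels alongside the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


key = secrets.token_bytes(32)            # shared secret (hypothetical setup)
message = b"transfer 100 to account 42"  # illustrative payload
tag = sign(key, message)

print(verify(key, message, tag))                        # untouched message
print(verify(key, b"transfer 900 to account 42", tag))  # tampered message
```

The first check succeeds and the second fails, because any change to the message changes the expected tag. This covers integrity only; confidentiality would additionally require encryption.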

🌊

Resilience

The ability to continue delivering services in the presence of partial system failure.

When to Use

Systems that must survive failures, attacks, or unexpected events.

Real-World Example

Netflix Chaos Monkey: intentionally kills production servers to test resilience. If it can't survive chaos, it's not resilient.

✓ Advantages

Systems survive partial failures
Business continuity
Graceful degradation

⚠ Watch Out

Complex to design
Expensive to test
May mask bugs

Recognize → Resist → Recover → Adapt (Resilience cycle — system stays operational even when parts fail)
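Graceful degradation, listed among the advantages above, can be sketched as a retry-then-fallback wrapper. This is one simple pattern under stated assumptions, not a full resilience strategy: the helper name, the retry count, and the two example services are all hypothetical.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       retries: int = 2) -> Tuple[T, bool]:
    """Try `primary` up to retries + 1 times, then degrade to `fallback`.

    Returns (result, degraded) so callers can tell which path answered.
    """
    for _ in range(retries + 1):
        try:
            return primary(), False
        except Exception:  # broad on purpose for the sketch; narrow it in real code
            continue       # failure recognized; try again
    return fallback(), True


def flaky_recommendations() -> list:
    # Simulated partial failure of a non-essential service.
    raise ConnectionError("recommendation service unavailable")


def cached_top_sellers() -> list:
    # Reduced but still useful answer served from a local cache.
    return ["top-seller-1", "top-seller-2"]


items, degraded = call_with_fallback(flaky_recommendations, cached_top_sellers)
print(items, degraded)
```

Returning the `degraded` flag addresses the "may mask bugs" risk noted above: the caller can log or alert on degraded responses instead of silently hiding the failure.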

📋Quick Reference

Ch 10 Cheat Sheet: Dependability and Security

Availability

P(operational at given time). MTTF/(MTTF+MTTR). 99.9% = about 8.8 hr/yr downtime.

Reliability

P(correct operation over time period). POFOD, ROCOF, MTTF metrics.

Safety

P(no harm to people/environment). Critical in avionics, medical, industrial systems.

Security

Protection from malicious attack. CIA triad: Confidentiality, Integrity, Availability.

Resilience

Continue operating despite partial failure. Recognize → Resist → Recover → Adapt.

Fault vs Error vs Failure

Fault = cause (bug, hardware). Error = incorrect state. Failure = visible wrong behavior.

Dependability

Umbrella term: availability + reliability + safety + security + resilience.

Sommerville's Key Points — Ch 10

Author's own summary from the end of the chapter.

1. Dependability: availability, reliability, safety, security, resilience. All are required for trustworthy systems.
2. Availability: system operational when needed. Measured as MTTF/(MTTF+MTTR).
3. Reliability: system delivers correct service. Measured as POFOD or ROCOF.
4. Safety: no harm to people/environment. Critical in safety-critical systems.
5. Security: protection from malicious attacks. CIA triad: confidentiality, integrity, availability.
6. Resilience: maintain service despite failures. Recognize, resist, recover, adapt.
7. Fault → Error → Failure: the chain from cause to visible wrong behavior.

🧠Quiz — Test Yourself

Think through your answer first, then reveal.

Recall

What is the difference between availability and reliability? Give an example where a system has high availability but low reliability.

Availability = system is operational. Reliability = system operates correctly. Example: a database server that is always online (high availability) but occasionally returns wrong query results (low reliability). Available but not reliable.

Apply

Why can't you just 'add security later'?

Security is an architectural concern — it affects every layer of the system. Authentication, encryption, input validation, access control — these are design decisions. Retrofitting security onto a system not designed for it requires rebuilding large parts of it. Security must be designed in from day 1.
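Input handling is one place where the design decision shows up directly in code. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table, the function names, and the attack string are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "user")])


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced into the SQL text itself.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver passes the input strictly as data.
    query = "SELECT name, role FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


payload = "nobody' OR '1'='1"  # classic injection string
print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(len(find_user_safe(conn, payload)))    # no rows match
```

If the unsafe pattern is used throughout a codebase, fixing it later means touching every query, which is the retrofitting cost the answer describes.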

Analyze

Explain the fault-error-failure chain with an example.

Fault = root cause (e.g., integer overflow bug in code). Error = incorrect internal state (counter wraps to negative). Failure = visible wrong behavior (system rejects valid orders). Testing and fault tolerance aim to break this chain before it reaches 'Failure.'
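The chain in this answer can be traced in a few lines of code. Python integers do not overflow, so this sketch simulates a signed 16-bit counter; the class and function names, and the choice of 16 bits, are assumptions made for the illustration.

```python
def to_int16(value: int) -> int:
    """Wrap an integer the way a signed 16-bit register would."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value


class OrderCounter:
    def __init__(self, start: int = 0) -> None:
        self.count = start

    def next_id(self) -> int:
        # FAULT: the counter is kept in 16 bits with no overflow check.
        self.count = to_int16(self.count + 1)
        return self.count


def accept_order(order_id: int) -> bool:
    # Downstream validation assumes order IDs are always positive.
    return order_id > 0


counter = OrderCounter(start=32766)
first = counter.next_id()    # 32767: still a valid state
second = counter.next_id()   # wraps to -32768: ERROR (incorrect internal state)

print(first, accept_order(first))
print(second, accept_order(second))  # valid order rejected: FAILURE
```

The fault sits in the code for a long time without effect. It only becomes an error once the counter reaches its limit, and only becomes a failure when another component acts on the bad value. A range check in `next_id` would break the chain at the error stage.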
