- Failures are expensive
- Disk failures in practice
- HDDs seem to fail often
- 1-13% replaced annually
- SSDs seem more reliable
- 1-3% replaced annually
- Studies look at replacement events only
- The definition of a "failed" hard drive is important
- Media errors
- Hard errors (Permanent failures)
- Extrinsic failures (Manufacturing defects, etc.)
- Intrinsic failures (Wear-out)
- Soft errors (Transient errors)
- Faulty component behaviors
- Byzantine behavior: may send arbitrary messages
- Fail-stop behavior: stops and does not send messages
- Examples of fail-stop failures
- Media failure: I/O device code bugs, disk HW failures -> Loss of durable data
- System failure: DB bug, OS fault, HW failure -> Loss of volatile data but durable memory (disk) survives
- Transaction failure
- Code aborts, based on input/database inconsistency
- Mechanical aborts caused by concurrency control solutions to isolation
- Frequent events, "instant" recovery needed
- Storage redundancy + Repeatable computation
- Short, non-shared, deterministic programs
- OS, framework, or user destroys partial changes then reruns program
- Builds on external storage independently protected (RAID/Replicas)
- Long running, non-shared, deterministic programs
- Examples: Extract/Transform/Load jobs
- Periodic state checkpoints to durable, independent, protected storage
- On failure: isolate failed component/system, restart from checkpoint
- Dependent components can trigger failure detection
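The checkpoint-and-restart pattern for long-running deterministic jobs can be sketched as follows. This is a minimal illustration, not a production design: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON file format, and the `crash_at` parameter used to simulate a failure are all assumptions made for the example. A local file stands in for the durable, independently protected storage.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Persist state atomically: write a temp file, fsync it, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see either the old or the new checkpoint, never a mix

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)

def run_job(records, path, every=100, crash_at=None):
    """Sum the records, checkpointing every `every` items (hypothetical example job)."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(records)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated failure")
        state["total"] += records[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

After a failure, calling `run_job` again with the same inputs resumes from the last checkpoint. Work done after that checkpoint is lost and recomputed, which is safe only because the job is deterministic and not shared.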
- Micro-reboot restarts individual long-running system software components
- Fine-grained, transactional components: restart and reinitialize fast
- State segregation: prevent corruption by storing important state externally
- Loosely coupled components: well-defined, well-enforced boundaries
- Retry-able requests: inter-component interactions use timeouts
- Expiring locks (leases): clean-up simplification
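To illustrate why expiring locks simplify clean-up, here is a minimal single-process sketch of a lease. The `Lease` class and its method names are hypothetical, and the injectable `clock` parameter is an assumption added so the behavior can be exercised without real waiting. A real lease service would also have to account for clock skew between holder and grantor.

```python
import time

class Lease:
    """A lock that expires on its own, so a crashed holder cannot block others forever."""

    def __init__(self, duration, clock=time.monotonic):
        self.duration = duration
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, owner):
        """Grant the lease if it is free, expired, or being renewed by its current holder."""
        now = self.clock()
        if self.holder is None or now >= self.expires_at or self.holder == owner:
            self.holder = owner
            self.expires_at = now + self.duration
            return True
        return False

    def release(self, owner):
        """Give the lease up early; a no-op if `owner` does not hold it."""
        if self.holder == owner:
            self.holder = None
```

If the holder is micro-rebooted and never calls `release`, no explicit clean-up is needed: the next `acquire` after the expiry time simply succeeds.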
- Concurrent, shared data (database) multi-app systems
- Shared state interaction through ACID transactions and write-ahead logging
- External state independently protected
- Concurrent, shared-nothing replicated systems (may be no external state)
- Replicated state machines, driven by coordinating replica changes
- Goal: multiple users manipulate shared data safely
- ACID properties of a transaction
- Atomicity: an operation is done all-or-nothing
- Consistency: user-specified constraints applied before commit
- Isolation: partial changes not visible to other users' code (less complex)
- Durability: changes survive subsequent failures (storage and process redundancy)
- "AID" provided by database system, "C" (mostly) by programmer
- Database is consistent if and only if contents result only from successful transactions
- Integrity constraints (partial consistency) may be enforced by DB
- Goal: isolation mechanism guaranteeing serializability
- Assuming well-formed/consistent transactions seeking isolation
- Simple locking fails to provide isolation if transactions interleave mutation/locking
- 2PL: acquire no lock after releasing any
- Strict 2PL: release no lock before committing, avoids cascading aborts
- Locks held for a long time increase blocking and decrease concurrency
- Optimistic methods don't lock but may abort and retry
- Faster if conflict is rare, but risks livelock if not
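The two-phase rule can be made concrete with a small sketch. The `TwoPhaseTxn` class below is a hypothetical helper, not a database API: it only tracks which locks a transaction holds and rejects any lock request made after a release. Calling only `lock` and `commit` gives strict 2PL behavior, since every lock is then held until commit.

```python
import threading

class TwoPhaseTxn:
    """Tracks a transaction's locks and enforces the 2PL growing/shrinking phases."""

    def __init__(self):
        self.held = []
        self.shrinking = False  # becomes True at the first release

    def lock(self, lk):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after a release")
        lk.acquire()
        self.held.append(lk)

    def unlock(self, lk):
        """Early release (plain 2PL); starts the shrinking phase."""
        self.shrinking = True
        self.held.remove(lk)
        lk.release()

    def commit(self):
        """Release everything still held; under strict 2PL this is the only release point."""
        self.shrinking = True
        while self.held:
            self.held.pop().release()
```

This sketch does not detect deadlock or handle aborts; it only shows where the rule "acquire no lock after releasing any" is checked.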
- Log changes durably before the database changes themselves become durable
- Write-ahead logging
- REDO: repeat completed transaction on old DB data
- Partial system or total media failure
- UNDO: rollback aborted transaction
- Transaction or system failure
- Only if uncommitted transactions are allowed to change durable media
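A toy recovery pass shows how REDO and UNDO use the log. The record layout (`("update", txn, key, old, new)` and `("commit", txn)`) and the `recover` function are assumptions made for this sketch. It also assumes strict 2PL, so a key has at most one uncommitted writer at a time. Real systems add checkpoints and log sequence numbers.

```python
def recover(log, disk):
    """Rebuild a consistent database image from a write-ahead log.

    log:  records in the order they were written
    disk: dict holding whatever reached durable storage before the crash
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = dict(disk)

    # UNDO: scan backward, restoring old values written by uncommitted transactions.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old, _new = rec
            db[key] = old

    # REDO: scan forward, reapplying new values written by committed transactions.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _old, new = rec
            db[key] = new

    return db
```

For example, if committed transaction `T1` set `x` from 0 to 1 but the page never reached disk, and uncommitted `T2` set `y` from 5 to 6 and that page did reach disk, recovery redoes `x = 1` and undoes `y` back to 5.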
- State machine: code + data + input command = deterministic output
- Assuming non-faulty replicas: same initial state + same input = same final state + same output
- Fail-stop failures: 1 surviving replica is sufficient
- Byzantine failures: non-faulty survivors must win a vote
- Need 2t+1 replicas to survive t malicious failures
- Common tools: Paxos, Apache ZooKeeper
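The 2t+1 bound follows from the vote: with at most t faulty replicas out of 2t+1, the t+1 non-faulty ones produce identical outputs and outnumber any value the faulty ones can agree on. A minimal sketch of such a voter, where the function name `vote` and its error handling are assumptions for illustration:

```python
from collections import Counter

def vote(outputs, t):
    """Return the output produced by at least t+1 replicas, tolerating up to t faulty ones."""
    if len(outputs) < 2 * t + 1:
        raise ValueError("need at least 2t+1 replicas to mask t Byzantine failures")
    value, count = Counter(outputs).most_common(1)[0]
    if count < t + 1:
        raise RuntimeError("no output is backed by t+1 replicas")
    return value
```

With three replicas and `t = 1`, `vote([42, 42, 7], 1)` returns 42: the single arbitrary output cannot outvote the two correct ones.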
- Agreement: deliver every request to all non-faulty machines
- A coordinator/client specifies a request and the rest agree
- Ordering: ensure the same order of execution at all non-faulty machines
- Assign identifier to requests and execute in identifier order
- Use a clock: Logical clock, Real-time clock, Replica-generated clock
- Two events are not concurrent if one "happens before" the other
- Replicated state machine wants same order of changes at all replicas
- Every machine maintains a counter for its orderable events
- A message arrives with sender's counter: receiver advances counter past the sender's
- $C' = \max(C, C_{msg}) + 1$
- Resolve ties by adding machine/thread ID as lower-order bits
- Define a total order that is consistent with "happens before"
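The counter rules above can be sketched as a small logical clock. The class and method names are hypothetical. Timestamps are returned as `(counter, machine_id)` tuples, so Python's tuple comparison plays the role of appending the machine ID as lower-order bits.

```python
class LamportClock:
    """Per-machine logical clock producing totally ordered (counter, machine_id) stamps."""

    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.counter = 0

    def tick(self):
        """Stamp a local event or an outgoing message."""
        self.counter += 1
        return (self.counter, self.machine_id)

    def receive(self, msg_counter):
        """Advance past the sender's counter: C' = max(C, C_msg) + 1."""
        self.counter = max(self.counter, msg_counter) + 1
        return (self.counter, self.machine_id)
```

If machine 1 sends a message stamped by `tick()` and machine 2 stamps its arrival with `receive()`, the receive stamp always compares greater than the send stamp, so the total order respects "happens before". Two events with equal counters on different machines are ordered by machine ID.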
- Problem: to decide which request to execute next, a replica must know that no request with a lower logical clock can still arrive
- Require messages between two machines arrive in order (e.g., TCP)
- Delay execution at a replica until it has seen a larger logical clock from all non-faulty machines
- New problem: waiting for later messages is undesirable
- Forces heartbeat messages, and significant latency
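The delayed-execution rule, and the reason it forces heartbeats, can be seen in a small sketch. `OrderedReplica` and its methods are hypothetical names. The sketch assumes in-order delivery per sender, as with TCP, and a fixed, known set of non-faulty machines. A message with `request=None` models a heartbeat that carries only a clock value.

```python
import heapq

class OrderedReplica:
    """Buffers requests and executes them in (counter, sender) order once they are stable."""

    def __init__(self, machine_ids):
        self.latest = {m: 0 for m in machine_ids}  # highest counter seen from each machine
        self.pending = []                          # min-heap of (counter, sender, request)
        self.executed = []

    def on_message(self, counter, sender, request=None):
        self.latest[sender] = max(self.latest[sender], counter)
        if request is not None:
            heapq.heappush(self.pending, (counter, sender, request))
        self._drain()

    def _drain(self):
        while self.pending:
            counter, sender, _ = self.pending[0]
            # Stable only when every other machine has shown a larger counter,
            # so none of them can still send a request that sorts earlier.
            others = (c for m, c in self.latest.items() if m != sender)
            if not all(c > counter for c in others):
                break
            self.executed.append(heapq.heappop(self.pending)[2])
```

A replica that has received requests from machines 1 and 2 but nothing from machine 3 executes nothing until machine 3 sends something, even a heartbeat. That waiting is the latency cost noted above.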
- Real-time clocks can fix this iff clock skew < message delivery time
- Replicas can negotiate an order by communicating among themselves
- At the cost of extra messaging