Module 04 — Write-Ahead Logging & Recovery
Track: Database Internals — Orbital Object Registry
Position: Module 4 of 6
Source material: Database Internals — Alex Petrov, Chapters 9–10; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0045
Classification: DATA LOSS INCIDENT
Subject: 2,400 TLE records lost after unplanned power failure
At 03:17 UTC, a PDU failure at Ground Station Bravo caused an unclean shutdown of the OOR storage engine. The active memtable contained approximately 2,400 TLE updates from the preceding 12-minute pass window. Because the memtable is a volatile in-memory structure, all 2,400 records were lost. The LSM engine restarted with only the previously flushed SSTables, leaving the catalog 12 minutes stale. Two conjunction alerts were delayed because the missing TLEs contained the most recent orbital elements for objects in a close-approach trajectory.
Directive: Implement a write-ahead log. Every mutation must be logged to durable storage before it is applied to the memtable. On crash recovery, replay the WAL to reconstruct the memtable to its pre-crash state.
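The ordering the directive asks for can be sketched in a few lines. This is a minimal illustration, not the project's reference design: the `DurableMemtable` class, the tab-separated text format, and the file name `oor.wal` are all assumptions made for the example.

```python
import os
import tempfile


class DurableMemtable:
    """Hypothetical wrapper: a dict memtable guarded by an append-only log."""

    def __init__(self, wal_path):
        self.memtable = {}
        # Append mode: existing log contents are never overwritten.
        self.wal = open(wal_path, "ab")

    def put(self, key, value):
        # 1. Log the mutation first (illustrative format: key<TAB>value).
        self.wal.write(f"{key}\t{value}\n".encode("utf-8"))
        # 2. Push Python's buffer to the OS, then ask the OS to persist it.
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 3. Only now apply the mutation to the volatile memtable.
        self.memtable[key] = value

    def close(self):
        self.wal.close()


wal_path = os.path.join(tempfile.mkdtemp(), "oor.wal")
engine = DurableMemtable(wal_path)
engine.put("25544", "TLE-epoch-A")
engine.put("25544", "TLE-epoch-B")
engine.close()

# Every mutation is in the log, in order, even though the memtable
# only holds the latest value per key.
with open(wal_path, "rb") as f:
    logged = f.read().decode("utf-8").splitlines()
```

Calling `fsync` on every write is the simplest way to honor the write-ahead rule, and also the slowest. Lesson 1 covers ways to reduce that cost.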
Learning Outcomes
After completing this module, you will be able to:
- Explain the write-ahead rule and why it is the foundation of crash recovery
- Implement a WAL that logs key-value operations to an append-only file with checksummed records
- Describe the ARIES recovery protocol — analysis, redo, and undo phases
- Implement crash recovery by replaying the WAL to reconstruct the memtable
- Design a checkpointing strategy that limits WAL replay time after a crash
- Reason about the tradeoff between fsync frequency and durability guarantees
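As a preview of what a checksummed record might look like, here is a small sketch. The layout (CRC32, then LSN, key length, value length, key, value) and the helper names `encode_record` and `decode_record` are assumptions for illustration; the lessons may define a different format.

```python
import struct
import zlib

# Hypothetical layout, little-endian with no padding:
#   crc32 (u32) | lsn (u64) | key_len (u32) | value_len (u32) | key | value
CRC = struct.Struct("<I")
BODY_HEADER = struct.Struct("<QII")


def encode_record(lsn, key, value):
    """Serialize one log record; the CRC covers everything after itself."""
    body = BODY_HEADER.pack(lsn, len(key), len(value)) + key + value
    return CRC.pack(zlib.crc32(body)) + body


def decode_record(buf):
    """Return (lsn, key, value), or None if the record is truncated or corrupt."""
    min_size = CRC.size + BODY_HEADER.size
    if len(buf) < min_size:
        return None
    (crc,) = CRC.unpack_from(buf, 0)
    lsn, key_len, value_len = BODY_HEADER.unpack_from(buf, CRC.size)
    end = min_size + key_len + value_len
    if len(buf) < end:
        return None  # torn write: the payload never fully reached disk
    if zlib.crc32(buf[CRC.size:end]) != crc:
        return None  # bit rot or a partially overwritten record
    key_start = min_size
    return lsn, buf[key_start:key_start + key_len], buf[key_start + key_len:end]


record = encode_record(7, b"25544", b"tle-bytes")
decoded = decode_record(record)
```

Covering the LSN and length fields with the checksum, not just the payload, means a corrupted length cannot send the reader off into the wrong part of the file undetected.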
Lesson Summary
Lesson 1 — WAL Fundamentals
The write-ahead rule, log record format, LSN ordering, and the WAL's role in the LSM write path. Why a write is durable only once fsync has returned, and what each fsync call costs in latency.
Key question: What is the maximum data loss window under group commit with 50ms batch intervals?
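To make the group commit idea concrete, the sketch below buffers records and pays for a single `fsync` per batch. The `GroupCommitWal` class is hypothetical, and the batch boundary is triggered by an explicit `commit_batch()` call; a real engine would presumably drive it from a timer at the batch interval.

```python
import os
import tempfile


class GroupCommitWal:
    """Hypothetical group-commit log: many appends, one fsync per batch."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.pending = []      # accepted but NOT yet durable
        self.fsync_calls = 0

    def append(self, record):
        # Cheap: no disk I/O happens here.
        self.pending.append(record)

    def commit_batch(self):
        """Persist all pending records with one write and one fsync."""
        if not self.pending:
            return 0
        self.f.write(b"".join(self.pending))
        self.f.flush()
        os.fsync(self.f.fileno())
        self.fsync_calls += 1
        count = len(self.pending)
        self.pending.clear()
        return count

    def close(self):
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), "group.wal")
wal = GroupCommitWal(path)
for rec in (b"put 25544\n", b"put 43013\n", b"put 48274\n"):
    wal.append(rec)
flushed = wal.commit_batch()   # three records, a single fsync
```

Whatever sits in `pending` at the moment of a power failure is lost, so in this sketch the exposure is bounded by how long records can wait there before the next `commit_batch()` completes.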
Lesson 2 — Crash Recovery
The ARIES recovery protocol adapted for the LSM engine. Analysis phase (determine which WAL records need replay), redo phase (replay committed operations into the memtable), and how the manifest tracks SSTable state for consistent restart.
Key question: If the engine crashes during a compaction, does it use the old or new SSTables on recovery?
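The analysis and redo steps described for this lesson can be sketched as a single scan over the log. This is a simplified illustration under stated assumptions: the record layout is invented for the example, and `flushed_lsn` stands in for whatever the manifest records as the highest LSN already persisted in SSTables.

```python
import struct
import zlib

# Hypothetical layout: crc32 (u32) | lsn (u64) | key_len (u32) | value_len (u32)
HEADER = struct.Struct("<IQII")


def encode(lsn, key, value):
    body = struct.pack("<QII", lsn, len(key), len(value)) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body


def replay(log, flushed_lsn):
    """Rebuild the memtable from raw log bytes.

    Records with lsn <= flushed_lsn are skipped (already in SSTables).
    Replay stops at the first truncated or corrupt record, which is
    treated as the torn tail left by the crash.
    """
    memtable = {}
    last_lsn = flushed_lsn
    offset = 0
    while offset + HEADER.size <= len(log):
        crc, lsn, key_len, value_len = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        end = start + key_len + value_len
        # The CRC covers everything after the 4-byte checksum field.
        if end > len(log) or zlib.crc32(log[offset + 4:end]) != crc:
            break
        if lsn > flushed_lsn:
            memtable[log[start:start + key_len]] = log[start + key_len:end]
            last_lsn = lsn
        offset = end
    return memtable, last_lsn


# Simulated crash: the fourth record was only partially written.
log = (
    encode(1, b"a", b"old")
    + encode(2, b"a", b"new")
    + encode(3, b"b", b"x")
    + encode(4, b"c", b"y")[:-2]
)
memtable, last_lsn = replay(log, flushed_lsn=1)
```

Stopping at the first bad record rather than skipping over it is a deliberate choice in this sketch: anything after a torn write cannot be trusted to be in order.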
Lesson 3 — Checkpointing
Fuzzy checkpoints that snapshot the LSM state without blocking writes. How checkpoints bound WAL replay time and enable WAL truncation. The tradeoff between checkpoint frequency and recovery time.
Key question: What is the maximum WAL replay time for the OOR with 60-second checkpoint intervals?
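One way to reason about the bound is a back-of-the-envelope estimate: the WAL to replay is at most what accumulated since the last checkpoint. The write rate below is derived from the incident report (2,400 updates in 12 minutes); the replay throughput of 10,000 records per second is a placeholder assumption, not a measured figure, and the estimate ignores the time the checkpoint itself takes.

```python
def max_replay_seconds(checkpoint_interval_s, write_rate_per_s, replay_rate_per_s):
    """Rough upper bound on WAL replay time after a crash."""
    records_to_replay = checkpoint_interval_s * write_rate_per_s
    return records_to_replay / replay_rate_per_s


# From the incident report: 2,400 TLE updates over a 12-minute pass window.
write_rate = 2400 / (12 * 60)    # about 3.33 updates per second

# ASSUMPTION: replay throughput is a made-up placeholder; measure your own.
worst_case = max_replay_seconds(
    checkpoint_interval_s=60,
    write_rate_per_s=write_rate,
    replay_rate_per_s=10_000,
)
```

Halving the checkpoint interval halves this bound, at the price of checkpointing twice as often, which is the tradeoff this lesson examines.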
Capstone Project — Durable TLE Update Pipeline
Add WAL-based durability to the Module 3 LSM engine. Every write is logged before entering the memtable. On simulated crash, the engine recovers to a consistent state by replaying the WAL. Full brief in project-durable-pipeline.md.
File Index
module-04-wal-recovery/
├── README.md
├── lesson-01-wal-fundamentals.md
├── lesson-01-quiz.toml
├── lesson-02-crash-recovery.md
├── lesson-02-quiz.toml
├── lesson-03-checkpointing.md
├── lesson-03-quiz.toml
└── project-durable-pipeline.md
Prerequisites
- Module 3 (LSM Trees & Compaction) completed
What Comes Next
Module 5 (Transactions & Isolation) adds concurrent read/write support with MVCC snapshot isolation, building on the durable foundation established here.