Meridian Space Academy
Meridian Space Academy is the internal engineering training program for Meridian Space Systems. It exists to bring senior engineers — most arriving from web, fintech, or systems backgrounds where Rust's role is established but the operational domain is not — up to working competence on the production data systems that power Meridian's flight operations.
The curriculum is organized as five tracks of six modules each. Every module is grounded in a real piece of Meridian's stack: a service we operate, a pipeline we maintain, or a system we are actively rebuilding. The lessons read like internal engineering documentation because that is what they are.
The mission
Meridian operates a 48-satellite Earth observation constellation, twelve ground station sites distributed across four continents, and a forward operations node at the Artemis lunar base. The original control plane was a Python codebase written for a six-satellite constellation in 2018. It does not scale. Single-threaded I/O loops cannot keep up with the uplink and downlink schedule of forty-eight active vehicles, the GIL prevents the parallelism we need on a fundamentally I/O-bound workload, and the type system has let an unmanageable amount of operational fragility accumulate over six years.
The replacement is being written in Rust. This curriculum is how new engineers — and existing engineers moving into the platform team — develop the mental model required to contribute to that replacement without making the system worse.
Prerequisites
Three or more years of Rust experience. The curriculum assumes fluency with ownership, borrowing, lifetimes, trait bounds, generics, error handling with Result, and Cargo workspaces. There are no introductory lessons on these topics; references are made as needed but the basics are not re-taught.
Engineers who have not yet shipped Rust to production should complete the Rustlings exercises and read Programming Rust (Blandy, Orendorff, Tindall) before starting. This is not a beginner curriculum, and starting it without the prerequisites in place produces frustration without learning.
The five tracks
Foundation — Mission Control Platform (shipped)
Async Rust, concurrency, message passing, networking, data layout, and performance profiling, anchored in the rebuild of the legacy Python control plane. Modules cover the tokio runtime and task lifecycle, shared-state vs. message-passing concurrency, channel patterns for telemetry fan-in and fan-out, TCP and UDP servers for ground station traffic, cache-friendly data layouts for hot paths, and CPU and allocation profiling. Foundation is the prerequisite for every other track.
Database Internals — Orbital Object Registry (shipped)
A storage engine for the Orbital Object Registry — the system of record for every tracked orbital object, its TLE history, and the conjunction analyses computed against it. Modules cover page-level storage and the buffer pool, B+ tree indexes, LSM trees and compaction, write-ahead logging and crash recovery, transactions and isolation levels, and query processing with the Volcano model. The capstone projects compose into a working single-node OLTP storage engine over the course of the track.
Data Pipelines — Space Domain Awareness Fusion (shipped)
Real-time sensor fusion across radar, optical, and inter-satellite link feeds, producing the conjunction alerts that flight ops acts on. Modules cover stream processing semantics, pipeline orchestration internals, watermarks and event-time, exactly-once delivery, and backpressure. The mission framing is the SDA Fusion service that ingests tens of thousands of observations per second from heterogeneous sources and emits a unified, deduplicated catalog.
Data Lakes — Artemis Base Cold Archive (planned)
A versioned mission-data lakehouse for the Artemis base archive. Modules cover columnar formats, table format internals (Parquet and Iceberg), partition layout and clustering, compaction and table maintenance, and time-travel and lineage. The mission framing is the cold-archive storage tier for high-resolution sensor data that flows down from the lunar base on a multi-day cadence and must remain queryable for the operational lifetime of the program.
Distributed Systems — Constellation Network (planned)
Consensus, replication, and partitioning across the 48-satellite grid and its ground footprint. Modules cover failure detection, leader election and Raft, replication and quorum, sharding and rebalancing, and split-brain recovery. The mission framing is the inter-satellite coordination layer that maintains constellation-wide state — orbital schedules, contact windows, downlink priorities — without a stable connection to any single ground site.
Recommended order
Foundation first. Every other track depends on async Rust, channels, and the data-oriented performance habits introduced there.
After Foundation, the Database Internals, Data Pipelines, Data Lakes, and Distributed Systems tracks may be taken in any order. They reference one another where useful — the Database track's WAL chapter is referenced from the Distributed Systems track's replication chapter, for example — but each is self-contained and does not require the others as a prerequisite.
How a module works
Every module follows the same structure:
- Three lessons. Written readings with code examples and source citations where applicable. Lessons that synthesize from training knowledge rather than a primary source are flagged with a Source note callout at the top.
- Three quizzes. One quiz at the end of each lesson, mixing multiple choice, free response, and Rust tracing problems. A score of seventy percent or higher is required to mark a lesson complete.
- One capstone project brief. A production-realistic engineering task that exercises the module's material. Projects are scoped to one to two weeks of focused work and include explicit acceptance criteria, suggested architecture, and the relevant Meridian operational context.
A module is complete when all three lessons are passed and the capstone project brief has been worked through end to end. The capstone is not optional reading; it is where the module's material becomes a real system.
Source material
Each lesson cites the primary text it draws from. The core references across the curriculum are Async Rust (Flitton, Morton), Database Internals (Petrov), Designing Data-Intensive Applications (Kleppmann), The Linux Programming Interface (Kerrisk), and the Raft and Spanner papers. Lessons that synthesize beyond a primary source are explicitly marked.
Module 01 — Async Rust Fundamentals
Track: Foundation — Mission Control Platform
Position: Module 1 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 1, 2, 7
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Async Telemetry Ingestion Broker
- Prerequisites
- What Comes Next
Mission Context
Meridian's legacy Python control plane was built for a 6-satellite constellation. It handles ground station connections sequentially: one connection at a time, blocking on each telemetry frame before moving to the next. At 48 satellites across 12 ground station sites, this architecture is the primary bottleneck in the control plane. During peak pass windows, the broker accumulates up to 40 seconds of delivery lag — unacceptable for conjunction avoidance workflows that require sub-10-second frame delivery.
This module establishes the async Rust foundation for the replacement system. Before writing any production control plane code, you need an accurate model of how async Rust executes — not at the surface level of #[tokio::main], but at the level of futures, the polling contract, executor scheduling, and task lifecycle. Every architectural decision in the modules that follow depends on this model.
What You Will Learn
By the end of this module you will be able to:
- Implement the Future trait directly and trace the polling lifecycle from first call to completion
- Explain the waker contract and identify futures that will silently stall due to missing waker registration
- Distinguish between tokio::spawn, tokio::join!, and tokio::select! and apply each to the correct concurrency pattern
- Configure a Tokio runtime explicitly via Builder, size worker and blocking thread pools for a given workload profile, and isolate high-frequency I/O workloads from blocking work
- Cancel tasks safely using .abort() and tokio::time::timeout, understanding exactly where and when the future is dropped
- Implement a graceful shutdown sequence with a bounded drain deadline, using RAII and CancellationToken for async cleanup
Lessons
Lesson 1 — The async/await Model: Futures, Polling, and the Executor Loop
Covers the Future trait, the poll function, Poll::Ready vs Poll::Pending, the waker contract, Pin, and the executor's task queue. Includes a manually implemented future to make the state machine mechanics explicit.
Key question this lesson answers: What actually happens at every await point, and what causes a task to silently stall?
→ lesson-01-async-await-model.md / lesson-01-quiz.toml
Lesson 2 — The Tokio Runtime: Spawning Tasks, the Scheduler, and Thread Pools
Covers Tokio's multi-thread work-stealing scheduler, the distinction between worker threads and blocking threads, tokio::task::spawn_blocking, and explicit runtime configuration via Builder. Includes a dual-runtime pattern for isolating ingress and housekeeping workloads.
Key question this lesson answers: How do you configure the runtime for your actual workload rather than the defaults, and when does blocking work need to leave the async executor?
→ lesson-02-tokio-runtime.md / lesson-02-quiz.toml
Lesson 3 — Task Lifecycle: Cancellation, Timeouts, and JoinHandle Management
Covers JoinHandle<T> semantics, cooperative cancellation with .abort(), RAII cleanup on cancellation, tokio::time::timeout, tokio::select! for racing futures, and a complete graceful shutdown pattern using broadcast and a bounded drain deadline.
Key question this lesson answers: How do you cleanly terminate a task — whether it completes normally, times out, or receives a shutdown signal — without leaking resources or corrupting state?
→ lesson-03-task-lifecycle.md / lesson-03-quiz.toml
Capstone Project — Async Telemetry Ingestion Broker
Build the TCP ingress layer for Meridian's replacement control plane. The broker accepts concurrent connections from ground stations, reads length-prefixed telemetry frames, fans each frame out to multiple downstream handlers via a broadcast channel, and shuts down gracefully on Ctrl-C with a 10-second drain deadline.
Acceptance is against 7 verifiable criteria including concurrent connection handling, broadcast fan-out correctness, slow-handler isolation, and bounded graceful shutdown.
→ project-async-telemetry-broker.md
Prerequisites
This module assumes you are comfortable with Rust ownership, borrowing, traits, and closures. It does not re-explain language fundamentals. If you are new to async Rust generally, the module starts from first principles at the trait level — but it expects you to read Rust error messages without assistance.
What Comes Next
Module 2 — Concurrency Primitives builds directly on this foundation: now that you understand how the executor runs tasks, Module 2 covers how those tasks share state safely — Mutex, RwLock, atomics, and memory ordering. The ground station command queue project in Module 2 connects directly to the telemetry broker you build here.
Lesson 1 — The async/await Model: Futures, Polling, and the Executor Loop
Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 1–2
Context
Meridian's legacy Python control plane was designed for a 6-satellite constellation. It handles ground station connections sequentially: accept a connection, process its telemetry frame, move to the next connection. At 6 satellites, this was acceptable. At 48 satellites across 12 ground station sites, it is a bottleneck. A single slow uplink from a station in the Atacama Desert holds up frames from every other active connection. The Python GIL makes true parallelism on this I/O-bound workload impossible without forking processes, which multiplies memory overhead and complicates shared state.
The replacement control plane is being written in Rust with tokio. Before writing any of that system, you need an accurate mental model of how async Rust actually executes code — not at the level of the tokio macro, but at the level of futures, the polling protocol, and the executor's task queue. Misunderstanding this model is the root cause of most async Rust bugs in production: dropped wakers, blocking the executor thread, and state machine explosions that are impossible to reason about.
This lesson covers the mechanics that every await desugars into. By the end, you should be able to read a future's poll implementation and trace exactly when it will make progress, when it will yield, and what will wake it back up. That skill is indispensable when debugging a hung ground station connection at 0300.
Core Concepts
What Async Actually Is
Async programming does not add CPU cores. It reorganizes work so that dead time — waiting for a network response, waiting for a disk write — is used to make progress on other tasks. The classic analogy: you do not stand still while the kettle boils. You put the bread in the toaster. The key insight is that both tasks share one pair of hands but interleave their execution during wait periods.
In Rust, this interleaving is explicit and zero-cost. There is no runtime scheduler running on a background OS thread intercepting your code. Instead, you write state machines, and the Rust compiler compiles async fn into those state machines for you. await is a yield point — a place where the current task volunteers to give up the thread so another task can run.
This is the critical difference from threads: with threads, preemption is involuntary. With async tasks, yield is voluntary, at every await. A task that never hits an await — one that runs a tight CPU loop — will starve every other task on that executor thread. This is not hypothetical. In Meridian's uplink pipeline, a single malformed frame that triggers O(n²) validation holds the entire thread if there's no await in the hot path.
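The starvation effect can be reproduced without any runtime. The sketch below is a toy round-robin driver, not Tokio's scheduler; the names YieldNow, worker, and run are hypothetical, and it uses the Future trait that the next section introduces. Two tasks share one thread, and the order of their log entries shows what happens when one of them never yields.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing: the toy driver re-polls every live task each round.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical yield point: returns Pending exactly once, then Ready.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref(); // ask to be polled again before yielding
        Poll::Pending
    }
}

type Log = Rc<RefCell<Vec<String>>>;

/// Records one entry per step; yields between steps only if `cooperative`.
async fn worker(name: &'static str, steps: u32, cooperative: bool, log: Log) {
    for i in 0..steps {
        log.borrow_mut().push(format!("{name}{i}"));
        if cooperative {
            YieldNow(false).await;
        }
    }
}

/// Round-robin over two tasks on one thread; returns the order of their steps.
fn run(a_is_cooperative: bool) -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let mut tasks: Vec<Pin<Box<dyn Future<Output = ()>>>> = Vec::new();
    tasks.push(Box::pin(worker("a", 3, a_is_cooperative, log.clone())));
    tasks.push(Box::pin(worker("b", 3, true, log.clone())));

    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        // Keep only the tasks that are still pending after this poll.
        tasks.retain_mut(|task| task.as_mut().poll(&mut cx).is_pending());
    }
    let order = log.borrow().clone();
    order
}

fn main() {
    println!("both yield:  {:?}", run(true));
    println!("task a hogs: {:?}", run(false));
}
```

When both tasks yield, their steps alternate. When task a has no await in its loop, it finishes all of its steps inside a single poll and task b does not start until a is done.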
The Future Trait
Every value produced by an async fn or an async block implements Future. The trait is:
    pub trait Future {
        type Output;

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
    }
Poll has two variants: Poll::Ready(value) when the computation is complete, and Poll::Pending when the future cannot yet make progress and should be woken up later.
The poll function is not async. This matters: futures are polled synchronously. The executor calls poll, it runs synchronously until it either completes or hits a point where it cannot proceed, and then it returns. If it returns Pending, it is the future's responsibility to arrange for a wake-up. If it returns Pending without registering a waker, the task will never run again — a silent deadlock.
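To see that poll is an ordinary synchronous call, the following sketch drives a future by hand with no runtime at all. Countdown, CountingWaker, and drive are hypothetical names chosen for illustration; the waker only counts how often it is invoked, so the relationship between Pending results and wake-ups is visible.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that only counts how many times it was invoked.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical future that reports Pending `remaining` times before completing.
struct Countdown {
    remaining: u32,
}

impl Future for Countdown {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining == 0 {
            return Poll::Ready("frame ready");
        }
        self.remaining -= 1;
        // Request another poll before returning Pending.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Polls the future in a plain loop and returns (polls, wakes, output).
fn drive(pending_rounds: u32) -> (u32, usize, &'static str) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Countdown { remaining: pending_rounds };
    let mut polls = 0;
    loop {
        polls += 1;
        // Countdown is Unpin, so Pin::new is sufficient here.
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return (polls, counter.wakes.load(Ordering::SeqCst), out);
        }
    }
}

fn main() {
    let (polls, wakes, out) = drive(3);
    println!("polls={polls} wakes={wakes} output={out}");
}
```

Each Pending result is paired with exactly one wake request, and the final poll returns Ready without touching the waker.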
The Waker Contract
Context carries a Waker. The Waker is a handle that, when called, schedules the associated task back onto the executor's run queue. The contract is: if poll returns Pending, it must have called cx.waker().wake_by_ref() or stored the waker to be called later when the awaited resource becomes available.
Violating this contract — returning Pending without registering the waker — produces a future that stalls forever with no error. The executor sees a pending task, never reschedules it, and the task silently vanishes from the run queue. At the Meridian scale, this manifests as a ground station connection that goes quiet mid-session: no error, no disconnect, just silence until the session timeout fires.
The executor side of this contract: when a waker is called, the executor re-queues the task and eventually calls poll again. The future may be polled many times before it completes. The state it needs to resume must be owned by the future struct itself — this is why async Rust desugars async fn into a struct that holds all local variables as fields.
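Both halves of the contract fit in a short std-only sketch. LinkReady, Shared, ThreadWaker, and block_on are hypothetical names, and block_on is a deliberately minimal single-future executor rather than anything Tokio ships. The future stores the waker before returning Pending; the background thread that stands in for the awaited resource calls it once the state changes.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::{Duration, Instant};

/// State shared between the future and the thread that completes it.
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

/// Hypothetical future that resolves once a background thread reports the link is up.
struct LinkReady {
    shared: Arc<Mutex<Shared>>,
}

impl LinkReady {
    fn after(delay: Duration) -> Self {
        let shared = Arc::new(Mutex::new(Shared { done: false, waker: None }));
        let for_thread = Arc::clone(&shared);
        thread::spawn(move || {
            thread::sleep(delay); // stand-in for the real wait
            let mut state = for_thread.lock().unwrap();
            state.done = true;
            // Resource side of the contract: wake whoever registered interest.
            if let Some(waker) = state.waker.take() {
                waker.wake();
            }
        });
        Self { shared }
    }
}

impl Future for LinkReady {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut state = self.shared.lock().unwrap();
        if state.done {
            return Poll::Ready(());
        }
        // Future side of the contract: store the waker BEFORE returning Pending.
        // Without this line, nothing is obliged to wake the task again.
        state.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// Waker that unparks the thread running `block_on`.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future executor: poll, then sleep until the waker fires.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        thread::park(); // may wake spuriously; the loop simply polls again
    }
}

fn wait_for_link(delay: Duration) -> Duration {
    let start = Instant::now();
    block_on(LinkReady::after(delay));
    start.elapsed()
}

fn main() {
    let waited = wait_for_link(Duration::from_millis(50));
    println!("link ready after {waited:?}");
}
```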
Pinning
The poll signature takes Pin<&mut Self> rather than &mut Self. Pin prevents the future from being moved in memory after it has been pinned. This matters because async state machines frequently contain self-referential structures: a future that awaits another future may hold a reference into its own fields. If the outer future were moved, that reference would dangle.
Pin enforces at compile time that once you call poll, the future cannot be moved. For futures composed entirely of Unpin types (most standard types), this is a no-op. For futures holding references into themselves — which the compiler generates automatically from async fn — it is essential.
Practical implication: you cannot call poll directly on a future obtained from an async fn without first pinning it via Box::pin(future) or tokio::pin!(future). tokio::spawn handles this for you; you only encounter it directly when building custom executors or when polling a future by reference inside select!.
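The two pinning routes look like this when polling by hand. This is a minimal sketch using only the standard library: std::pin::pin! plays the role that tokio::pin! plays in Tokio code, and read_sync_marker is a hypothetical async fn that completes on its first poll.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for a future that completes immediately.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for an async operation; completes on its first poll.
async fn read_sync_marker() -> u16 {
    std::future::ready(42u16).await
}

/// Heap pinning: Box::pin moves the future to the heap and pins it there.
fn poll_boxed() -> Poll<u16> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut: Pin<Box<dyn Future<Output = u16>>> = Box::pin(read_sync_marker());
    let result = fut.as_mut().poll(&mut cx);
    result
}

/// Stack pinning: pin! pins the future inside the current stack frame.
fn poll_on_stack() -> Poll<u16> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(read_sync_marker());
    // Note: Pin::new(&mut read_sync_marker()) would not compile, because the
    // future generated for an async fn does not implement Unpin.
    let result = fut.as_mut().poll(&mut cx);
    result
}

fn main() {
    println!("boxed: {:?}", poll_boxed());
    println!("stack: {:?}", poll_on_stack());
}
```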
tokio::pin! — Polling a Future by Reference
tokio::pin! pins a value to the current stack frame in place, making it safe to poll by mutable reference. The common situation where this matters: you need to start an async operation once and track its progress across multiple iterations of a select! loop, rather than restarting it fresh on every iteration.
Consider fetching a TLE catalog update while simultaneously processing incoming session commands. The fetch should run to completion in the background; the command loop should not restart it on each iteration:
    use tokio::sync::mpsc;
    use tokio::time::{sleep, Duration};

    async fn fetch_tle_update() -> Vec<u8> {
        // Simulate a slow catalog fetch (~200ms in production).
        sleep(Duration::from_millis(200)).await;
        vec![0u8; 64] // placeholder TLE payload
    }

    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::channel::<String>(8);

        // Spawn a sender to simulate incoming commands, one every 80ms,
        // so the 200ms fetch completes before the final RESET arrives.
        tokio::spawn(async move {
            for cmd in ["REPOINT", "STATUS", "RESET"] {
                sleep(Duration::from_millis(80)).await;
                let _ = tx.send(cmd.to_string()).await;
            }
        });

        // Create the future ONCE, outside the loop.
        let tle_fetch = fetch_tle_update();
        // Pin it to the stack so we can poll it by reference (&mut tle_fetch).
        tokio::pin!(tle_fetch);

        let mut tle_done = false;

        loop {
            tokio::select! {
                // Poll the same future instance each iteration.
                // Calling fetch_tle_update() here instead would create a
                // brand-new future each iteration and discard all progress.
                tle = &mut tle_fetch, if !tle_done => {
                    println!("TLE update received: {} bytes", tle.len());
                    tle_done = true;
                }
                Some(cmd) = rx.recv() => {
                    println!("command received: {cmd}");
                    if cmd == "RESET" {
                        break;
                    }
                }
                else => break,
            }
        }
    }
Two things to notice. First, tle_fetch is created before the loop and pinned with tokio::pin!. Inside select!, &mut tle_fetch polls the same future on every iteration — it accumulates progress across polls. If you wrote fetch_tle_update() directly inside select!, you would get a new future each time and the fetch would restart from zero on every loop iteration.
Second, the , if !tle_done precondition disables the branch once the fetch has completed. This is essential: if the branch stays enabled after the future resolves, select! would poll an already-completed future on the next iteration, causing an "async fn resumed after completion" panic. The precondition guards against this.
The Executor Loop
The executor maintains a run queue of tasks ready to be polled. Its loop is approximately:
- Pop a task from the ready queue.
- Call poll on it.
- If Poll::Ready, the task is done; drop it.
- If Poll::Pending, the task is parked. It will be re-queued only when its waker is called.
Tasks are not re-polled speculatively. They are polled exactly when woken. This means a task can sit in Pending state indefinitely if nothing triggers its waker — which is the correct behavior for a task waiting on a network connection that has gone silent.
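The loop above can be sketched as a small single-threaded executor built on the standard library. Everything here (Task, Gate, WaitForGate, spawn, run) is a hypothetical illustration, not Tokio's implementation. One task parks on a gate and a second task opens it; the poll count shows that the parked task is polled again only because its waker was called.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

type BoxFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

/// A task owns its future plus a handle to the ready queue it re-enters when woken.
struct Task {
    future: Mutex<Option<BoxFuture>>,
    ready_queue: SyncSender<Arc<Task>>,
}

impl Wake for Task {
    fn wake(self: Arc<Self>) {
        // Waking a task means pushing it back onto the ready queue.
        let queue = self.ready_queue.clone();
        let _ = queue.send(self);
    }
}

/// Shared flag with one parked waiter: a stand-in for the awaited resource.
#[derive(Default)]
struct Gate {
    open: bool,
    waiter: Option<Waker>,
}

struct WaitForGate(Arc<Mutex<Gate>>);

impl Future for WaitForGate {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut gate = self.0.lock().unwrap();
        if gate.open {
            return Poll::Ready(());
        }
        gate.waiter = Some(cx.waker().clone()); // register before returning Pending
        Poll::Pending
    }
}

fn open_gate(gate: &Mutex<Gate>) {
    let mut gate = gate.lock().unwrap();
    gate.open = true;
    if let Some(waker) = gate.waiter.take() {
        waker.wake();
    }
}

fn spawn(queue: &SyncSender<Arc<Task>>, fut: impl Future<Output = ()> + Send + 'static) {
    let future: BoxFuture = Box::pin(fut);
    let task = Arc::new(Task { future: Mutex::new(Some(future)), ready_queue: queue.clone() });
    queue.send(task).expect("ready queue closed");
}

/// Runs until no task can ever be woken again; returns the number of polls.
fn run(ready: Receiver<Arc<Task>>) -> usize {
    let mut polls = 0;
    while let Ok(task) = ready.recv() {
        // Pop a ready task, then poll it with a waker that points back at it.
        let mut slot = task.future.lock().unwrap();
        if let Some(mut fut) = slot.take() {
            let waker = Waker::from(task.clone());
            let mut cx = Context::from_waker(&waker);
            polls += 1;
            if fut.as_mut().poll(&mut cx).is_pending() {
                *slot = Some(fut); // Pending: park until the waker re-queues it
            } // Ready: the future is dropped here
        }
    }
    polls
}

fn demo() -> (usize, Vec<String>) {
    // Capacity 16 is ample for two tasks; a full queue would block this toy.
    let (queue, ready) = sync_channel::<Arc<Task>>(16);
    let gate = Arc::new(Mutex::new(Gate::default()));
    let log = Arc::new(Mutex::new(Vec::new()));

    let (waiter_gate, waiter_log) = (gate.clone(), log.clone());
    spawn(&queue, async move {
        WaitForGate(waiter_gate).await;
        waiter_log.lock().unwrap().push("waiter resumed".to_string());
    });
    let (opener_gate, opener_log) = (gate.clone(), log.clone());
    spawn(&queue, async move {
        opener_log.lock().unwrap().push("opening gate".to_string());
        open_gate(&opener_gate);
    });

    drop(queue); // only tasks hold senders now, so run() ends when they finish
    let polls = run(ready);
    let entries = log.lock().unwrap().clone();
    (polls, entries)
}

fn main() {
    let (polls, entries) = demo();
    println!("{polls} polls, log = {entries:?}");
}
```

The waiter is polled twice (once to park, once after the wake) and the opener once, for three polls in total.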
tokio::join! polls multiple futures concurrently on the same task: no new OS threads, no new tasks, just interleaved polling within the same scheduler slot. tokio::spawn, by contrast, creates a new independent task, places it on the executor's ready queue, and lets it be scheduled on any worker thread.
Code Examples
Implementing Future Directly — A Telemetry Frame Validator
This example implements Future manually to illustrate what async/await desugars into. In production this would be an async fn, but seeing the state machine explicitly clarifies exactly when control yields and what triggers resumption.
The scenario: validating an incoming telemetry frame header requires checking a CRC that is computed in a background thread pool. The future polls a oneshot channel for the result.
    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};
    use tokio::sync::oneshot;

    /// Represents a frame whose header CRC is being validated asynchronously.
    /// The validation runs on a blocking thread; this future waits for its result.
    pub struct FrameValidationFuture {
        // oneshot::Receiver implements Future directly, but we wrap it here
        // to show the polling mechanics explicitly.
        receiver: oneshot::Receiver<bool>,
    }

    impl FrameValidationFuture {
        pub fn new(receiver: oneshot::Receiver<bool>) -> Self {
            Self { receiver }
        }
    }

    impl Future for FrameValidationFuture {
        type Output = Result<(), String>;

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
            // Pin::new is safe here because oneshot::Receiver is Unpin.
            // For a self-referential type we'd need unsafe or box-pinning.
            match Pin::new(&mut self.receiver).poll(cx) {
                Poll::Ready(Ok(true)) => Poll::Ready(Ok(())),
                Poll::Ready(Ok(false)) => {
                    Poll::Ready(Err("CRC validation failed".to_string()))
                }
                Poll::Ready(Err(_)) => {
                    // Sender dropped without sending: the validator thread panicked
                    // or was cancelled. Treat as a validation failure, not a panic.
                    Poll::Ready(Err("Validator thread terminated unexpectedly".to_string()))
                }
                // The result is not ready yet. The oneshot::Receiver has already
                // registered cx's waker; the sender wakes it when a value is sent.
                // We return Pending; the executor parks this task.
                Poll::Pending => Poll::Pending,
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = oneshot::channel::<bool>();

        // Simulate the CRC validator running on a blocking thread pool.
        tokio::spawn(async move {
            // In production: tokio::task::spawn_blocking(|| compute_crc(...)).await
            // Here we just send a valid result immediately.
            let _ = tx.send(true);
        });

        let validation = FrameValidationFuture::new(rx);
        match validation.await {
            Ok(()) => println!("Frame header valid, forwarding to telemetry pipeline"),
            Err(e) => eprintln!("Frame rejected: {e}"),
        }
    }
The poll implementation delegates to the inner oneshot::Receiver's own poll. When Receiver::poll returns Pending, it has already stored the waker from cx internally. When tx.send(true) fires, the sender wakes that stored waker, which re-queues this task. No manual waker management is needed here because we compose with a type that already handles it correctly.
This is the pattern to follow when building custom futures: compose with existing futures and channel primitives wherever possible. Write unsafe waker code only when you are bridging to a non-async notification source (an epoll fd, a hardware interrupt, a C library callback).
Concurrent Polling with tokio::join! vs. Sequential await
Sequential await for two telemetry frame fetches from different ground stations means the second fetch does not start until the first completes:
    // SEQUENTIAL: total latency = latency(station_a) + latency(station_b)
    let frame_a = fetch_frame("gs-atacama").await?;
    let frame_b = fetch_frame("gs-svalbard").await?;
tokio::join! polls both concurrently on the same task. While one is pending, the executor can drive the other forward:
    use anyhow::Result;
    use tokio::net::TcpStream;

    async fn fetch_frame(station_id: &str) -> Result<Vec<u8>> {
        // Simplified: in production this reads from a persistent connection pool.
        let mut _stream = TcpStream::connect(format!("{station_id}:7777")).await?;
        // ... read frame bytes ...
        Ok(vec![]) // placeholder
    }

    #[tokio::main]
    async fn main() -> Result<()> {
        // CONCURRENT: total latency ≈ max(latency_a, latency_b)
        // Both futures are polled in the same task; no new OS threads are created.
        let (frame_a, frame_b) = tokio::join!(
            fetch_frame("gs-atacama"),
            fetch_frame("gs-svalbard")
        );

        // Both results are available here; handle errors independently.
        match (frame_a, frame_b) {
            (Ok(a), Ok(b)) => {
                println!("Received {} + {} bytes from ground stations", a.len(), b.len());
            }
            (Err(e), _) | (_, Err(e)) => {
                eprintln!("Ground station fetch failed: {e}");
            }
        }

        Ok(())
    }
tokio::join! is appropriate when the futures are independent and you need both results. If you only need the first result and want to cancel the loser, use tokio::select!. If the futures have no data dependency and need to run across multiple threads simultaneously, tokio::spawn each and join the handles.
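What "polled concurrently on the same task" means can be made concrete with a hand-written two-future join. This is a sketch of the idea, not tokio::join!'s actual expansion; Join2 and SlowFetch are hypothetical names, and the busy polling loop stands in for an executor. The poll log shows the two fetches interleaving inside one outer future.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

type PollLog = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical fetch that needs `remaining` extra polls before yielding a byte count.
struct SlowFetch {
    station: &'static str,
    remaining: u32,
    bytes: usize,
    log: PollLog,
}

impl Future for SlowFetch {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        self.log.borrow_mut().push(self.station);
        if self.remaining == 0 {
            return Poll::Ready(self.bytes);
        }
        self.remaining -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Polls both inner futures from a single outer future until both have finished.
struct Join2<A: Future, B: Future> {
    a: Pin<Box<A>>,
    b: Pin<Box<B>>,
    a_out: Option<A::Output>,
    b_out: Option<B::Output>,
}

impl<A, B> Future for Join2<A, B>
where
    A: Future,
    B: Future,
    A::Output: Unpin,
    B::Output: Unpin,
{
    type Output = (A::Output, B::Output);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        // Poll each side only while it has not produced its output yet.
        if this.a_out.is_none() {
            if let Poll::Ready(out) = this.a.as_mut().poll(cx) {
                this.a_out = Some(out);
            }
        }
        if this.b_out.is_none() {
            if let Poll::Ready(out) = this.b.as_mut().poll(cx) {
                this.b_out = Some(out);
            }
        }
        if this.a_out.is_some() && this.b_out.is_some() {
            Poll::Ready((this.a_out.take().unwrap(), this.b_out.take().unwrap()))
        } else {
            Poll::Pending
        }
    }
}

fn demo() -> ((usize, usize), Vec<&'static str>) {
    let log: PollLog = Rc::new(RefCell::new(Vec::new()));
    let mut joined = Join2 {
        a: Box::pin(SlowFetch { station: "atacama", remaining: 2, bytes: 128, log: log.clone() }),
        b: Box::pin(SlowFetch { station: "svalbard", remaining: 1, bytes: 64, log: log.clone() }),
        a_out: None,
        b_out: None,
    };
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let out = loop {
        if let Poll::Ready(out) = Pin::new(&mut joined).poll(&mut cx) {
            break out;
        }
    };
    let polls = log.borrow().clone();
    (out, polls)
}

fn main() {
    let (out, polls) = demo();
    println!("outputs = {out:?}, poll order = {polls:?}");
}
```

The outer future completes after three polls, which is the slower fetch's count, rather than the five that running the two fetches back to back would need.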
Key Takeaways
- The Future trait's poll method is synchronous. An async runtime is a loop that calls poll on ready tasks; it does not preempt running tasks. A future that does significant CPU work without an await will monopolize its executor thread.
- If poll returns Poll::Pending without registering the context's waker, the task is silently parked forever. Always verify that the resource you're awaiting will call the waker when it becomes available.
- Pin<&mut Self> exists to prevent futures from being moved after polling begins. For futures containing self-referential state (which the compiler generates automatically), this is load-bearing. Hand-written futures whose fields are all Unpin are themselves Unpin; the constraint bites when you poll a compiler-generated future directly, as in a custom executor or a select! loop that polls by reference.
- tokio::join! achieves concurrency within a single task by interleaved polling. It does not create threads or new tasks. Use it for independent I/O operations that should proceed simultaneously but whose results you need together.
- tokio::pin! pins a future to the current stack frame so it can be polled by mutable reference across multiple select! iterations. Use it when you need to start an operation once and track its progress, not restart it on each loop. Always pair it with a precondition (, if !done) to prevent polling the future after it has already resolved.
- Every async fn is compiled into a state machine struct. Variables live across await points become fields of that struct. Understanding this explains why async Rust futures can be large, why they must be pinned, and why capturing large values across await points inflates memory use.
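The last takeaway can be checked with std::mem::size_of_val. This is a small sketch with hypothetical function names; exact sizes depend on the compiler version, so the code compares against the buffer size rather than asserting specific numbers.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// Keeps a 4 KiB buffer alive across the await, so the buffer is stored in the future.
async fn holds_buffer_across_await() -> u8 {
    let frame = [0u8; 4096];
    ready(()).await;
    frame[0]
}

/// Finishes with the buffer before the await; only the one-byte result stays live.
async fn drops_buffer_before_await() -> u8 {
    let first = {
        let frame = [0u8; 4096];
        frame[0]
    };
    ready(()).await;
    first
}

/// Returns (size of the first future, size of the second future) in bytes.
fn future_sizes() -> (usize, usize) {
    // Calling an async fn only constructs the state machine; nothing runs yet.
    let big = size_of_val(&holds_buffer_across_await());
    let small = size_of_val(&drops_buffer_before_await());
    (big, small)
}

fn main() {
    let (big, small) = future_sizes();
    println!("buffer held across await: {big} bytes");
    println!("buffer dropped before await: {small} bytes");
}
```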
Lesson 2 — The Tokio Runtime: Spawning Tasks, the Scheduler, and Thread Pools
Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 2 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapter 7
Context
The Meridian control plane receives telemetry from 48 satellite uplinks simultaneously. Each uplink connection is long-lived: a ground station holds a persistent TCP session with the control plane and streams frames at irregular intervals driven by orbital geometry and antenna alignment. Alongside these connections, the control plane runs housekeeping tasks — session health checks, TLE refresh from the catalog, and periodic flush of buffered frames to the downstream aggregator.
The #[tokio::main] macro stands up a default multi-threaded runtime and runs your async main function in it. For prototyping and simple services, this is sufficient. For a system with the throughput profile and operational requirements of Meridian's control plane, you need to understand what that runtime is actually doing — how many threads it allocates, how it distributes work across them, what happens when a blocking operation enters the mix, and how to configure it for your actual workload rather than the defaults.
This lesson covers Tokio's scheduler architecture, the distinction between async tasks and blocking tasks, how to size thread pools for I/O-bound vs. compute-bound work, and how to configure the runtime explicitly via Builder. The goal is not to tune prematurely — it is to understand the model well enough to make deliberate choices rather than accepting defaults that may be wrong for your system.
Core Concepts
The Multi-Thread Scheduler
Tokio's default multi_thread scheduler maintains a pool of worker threads, by default one per logical CPU core. Each worker thread has a local run queue, and the runtime also keeps one global queue. A task spawned from code already running on a worker normally lands on that worker's local queue; a task spawned from outside the runtime goes onto the global queue. Idle workers steal tasks from busy workers' local queues. This work-stealing design keeps all workers busy when there is backlog, at the cost of some cross-thread synchronization.
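Each worker runs the same loop from Lesson 1: pop a ready task, call poll, park it if it returns Pending (it re-enters a queue only when its waker fires), drop it if Ready. A worker prefers its local queue. When the local queue is empty it checks the global queue, and if that is empty too it attempts to steal tasks from other workers' local queues. The global_queue_interval configuration controls how many local-queue tasks a worker processes before checking the global queue. Older Tokio releases use a fixed default of 61 for the multi-thread scheduler; newer releases choose the interval dynamically unless you set it, so check the Builder documentation for the version you run. Lowering this value gives tasks waiting on the global queue lower latency at the cost of more global-queue contention.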
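The effect of global_queue_interval is easier to see in a toy model of a single worker. The sketch below is not Tokio's implementation and ignores stealing entirely; run_order and position_of_global_task are hypothetical names. The worker takes from its local queue except on every Nth tick, when it looks at the global queue first.

```rust
use std::collections::VecDeque;

/// Toy model of ONE worker choosing its next task. Not Tokio's implementation.
fn run_order(
    mut local: VecDeque<&'static str>,
    mut global: VecDeque<&'static str>,
    global_queue_interval: u32,
) -> Vec<&'static str> {
    assert!(global_queue_interval > 0, "interval must be non-zero");
    let mut order = Vec::new();
    let mut tick = 0u32;
    loop {
        tick += 1;
        let next = if tick % global_queue_interval == 0 {
            // Every Nth tick: check the global queue first.
            global.pop_front().or_else(|| local.pop_front())
        } else {
            // Otherwise: local queue first, global queue only if local is empty.
            local.pop_front().or_else(|| global.pop_front())
        };
        match next {
            Some(task) => order.push(task),
            None => return order,
        }
    }
}

/// Index at which the single globally queued task runs, behind six local tasks.
fn position_of_global_task(interval: u32) -> usize {
    let local = VecDeque::from(["l1", "l2", "l3", "l4", "l5", "l6"]);
    let global = VecDeque::from(["g1"]);
    let order = run_order(local, global, interval);
    order.iter().position(|task| *task == "g1").unwrap()
}

fn main() {
    println!("interval 61: global task runs at index {}", position_of_global_task(61));
    println!("interval 2:  global task runs at index {}", position_of_global_task(2));
}
```

With a large interval the globally queued task waits behind the whole local backlog; with a small interval it is picked up after a single local task.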
The current_thread runtime (the default for #[tokio::test], and selectable with #[tokio::main(flavor = "current_thread")]) runs all tasks on the calling thread. It is appropriate for services that are purely I/O-bound with no CPU-intensive tasks and where single-threaded throughput is sufficient. The Meridian control plane uses the multi-thread runtime.
Worker Threads and Blocking Threads
Tokio distinguishes between two kinds of threads:
Worker threads run the async executor loop. They poll futures and run async task code. There should be enough of them to saturate your I/O capacity without exceeding your core count. A typical production setting is num_cpus::get(), which Builder::new_multi_thread() uses by default.
Blocking threads are spawned on demand by tokio::task::spawn_blocking. They run synchronous, blocking code — file I/O, CPU-intensive computation, synchronous library calls — in a separate thread pool that does not interfere with the async worker threads. The key rule: never perform blocking I/O or long CPU work directly on an async worker thread. Doing so parks that thread for the duration, reducing effective parallelism and starving other tasks.
max_blocking_threads caps the number of blocking threads that can exist simultaneously. The default is 512. For the Meridian control plane, which may process TLE bulk imports concurrently with live uplinks, sizing this correctly prevents runaway thread creation under load spikes.
tokio::spawn and Task Identity
tokio::spawn hands a new task to the runtime's scheduler and returns a JoinHandle<T>, a handle to the spawned task's eventual output. The task is independent of the spawner: if the spawner drops the handle, the task continues running (though its output is lost). If you need the task's output, keep the handle and .await it. If you need to cancel the task, call .abort() on the handle.
Spawned tasks must be 'static — they cannot borrow from the spawning scope. If the task needs data from the spawner, move it in with async move { ... }, clone cheaply clonable data (like Arc-wrapped state), or use channels to communicate.
A common mistake is spawning a task per frame without any admission control. At 48 uplinks with 100 frames per second each, that is 4,800 task spawns per second for frame processing alone. Task creation has overhead, and nothing in this pattern bounds the number of frames in flight. For Meridian's frame processing workload, a bounded task pool or a pipeline of fixed workers is more appropriate than an unbounded spawn-per-frame pattern.
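The shape of a fixed-worker pipeline can be sketched with the standard library alone. To keep the example runnable without a runtime, it deliberately swaps in OS threads and std::sync::mpsc::sync_channel as stand-ins; in the control plane the workers would be Tokio tasks reading from an async channel. process_frames and its parameters are hypothetical. The bounded queue is the admission control: once queue_depth frames are waiting, the producer blocks instead of creating more work.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Feeds frames through a bounded queue to a fixed number of workers and
/// returns the total number of payload bytes processed.
fn process_frames(frames: Vec<Vec<u8>>, workers: usize, queue_depth: usize) -> usize {
    // Bounded queue: send() blocks when `queue_depth` frames are already waiting.
    let (tx, rx) = sync_channel::<Vec<u8>>(queue_depth);
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut bytes = 0usize;
                loop {
                    // The lock is released at the end of this statement.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(frame) => bytes += frame.len(), // stand-in for real processing
                        Err(_) => break,                   // channel closed: no more frames
                    }
                }
                bytes
            })
        })
        .collect();

    for frame in frames {
        tx.send(frame).expect("all workers exited early");
    }
    drop(tx); // close the queue so the workers drain it and stop

    handles.into_iter().map(|handle| handle.join().unwrap()).sum()
}

fn main() {
    // 48 uplinks x 100 frames: one second of traffic at the rate quoted above.
    let frames: Vec<Vec<u8>> = (0..48 * 100).map(|_| vec![0u8; 16]).collect();
    let total = process_frames(frames, 4, 64);
    println!("processed {total} bytes with 4 workers");
}
```

The number of workers and the queue depth are fixed up front, so a burst of frames shows up as producer back-pressure rather than as thousands of new tasks.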
Configuring the Runtime Explicitly
The #[tokio::main] macro is shorthand for building a default runtime and blocking on the async main function. Replacing it with an explicit Builder gives fine-grained control:
```rust
use tokio::runtime::Builder;

// Placeholder for the application's async entry point.
async fn async_main() {}

fn main() {
    let runtime = Builder::new_multi_thread()
        .worker_threads(8)
        .max_blocking_threads(16)
        .thread_name("meridian-worker")
        .thread_stack_size(2 * 1024 * 1024)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async_main());
}
```
The runtime is a value. You can have multiple runtimes in the same process — useful when you need strict resource isolation between subsystems (e.g., keeping the telemetry ingress runtime separate from the housekeeping runtime so a housekeeping spike does not starve active uplinks).
Code Examples
Explicit Runtime Configuration for the Meridian Control Plane
Meridian's control plane has two distinct workload profiles that benefit from isolated runtimes: the high-frequency telemetry ingress path (many short-lived I/O tasks) and the housekeeping path (fewer, slower tasks including blocking TLE catalog refreshes). Sharing a single runtime risks head-of-line blocking when a TLE import saturates the blocking thread pool.
```rust
use std::sync::LazyLock;
use tokio::runtime::{Builder, Runtime};

// Ingress runtime: tuned for concurrent I/O. One worker per core,
// minimal blocking threads since real blocking work routes to the
// housekeeping runtime.
static INGRESS_RUNTIME: LazyLock<Runtime> = LazyLock::new(|| {
    Builder::new_multi_thread()
        .worker_threads(num_cpus::get())
        .max_blocking_threads(4)
        .thread_name("meridian-ingress")
        .thread_stack_size(2 * 1024 * 1024)
        .on_thread_start(|| tracing::debug!("ingress worker started"))
        .enable_all()
        .build()
        .expect("failed to build ingress runtime")
});

// Housekeeping runtime: fewer workers, more blocking threads for
// catalog refreshes and frame archival.
static HOUSEKEEPING_RUNTIME: LazyLock<Runtime> = LazyLock::new(|| {
    Builder::new_multi_thread()
        .worker_threads(2)
        .max_blocking_threads(32)
        .thread_name("meridian-housekeeping")
        .enable_all()
        .build()
        .expect("failed to build housekeeping runtime")
});

async fn handle_uplink_session() {
    // This runs on an ingress worker thread.
    // Long-running I/O awaits are fine here.
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tracing::info!("uplink session processed");
}

async fn refresh_tle_catalog() {
    // CPU + blocking I/O: route to spawn_blocking so we do not
    // block an async worker thread for the duration of the refresh.
    tokio::task::spawn_blocking(|| {
        // Synchronous HTTP fetch + file write; blocks for ~200ms.
        tracing::info!("TLE catalog refreshed");
    })
    .await
    .expect("TLE refresh panicked");
}

fn main() {
    // Ingress and housekeeping run in separate thread pools.
    // A TLE refresh spike cannot starve active uplink sessions.
    std::thread::spawn(|| {
        HOUSEKEEPING_RUNTIME.block_on(async {
            loop {
                refresh_tle_catalog().await;
                tokio::time::sleep(tokio::time::Duration::from_secs(300)).await;
            }
        });
    });

    INGRESS_RUNTIME.block_on(async {
        // In production: bind TCP listener, accept connections,
        // spawn handle_uplink_session per connection.
        for _ in 0..48 {
            INGRESS_RUNTIME.spawn(handle_uplink_session());
        }
        tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
    });
}
```
The on_thread_start hook enables per-thread tracing setup. In a production deployment, this is where you would initialize thread-local metrics recorders. The thread_name setting surfaces in top, htop, and perf output — essential when profiling which runtime is responsible for CPU usage.
Dispatching Blocking Work Correctly
The most common async-correctness mistake in production Rust services is calling blocking code on an async worker thread. The rule is simple but frequently violated: if a function does not have async in its signature and it does any I/O or takes more than a few hundred microseconds, it belongs in spawn_blocking.
```rust
use anyhow::Result;
use tokio::task;

/// Parse and validate a TLE record from a raw string.
/// TLE parsing is synchronous and O(n) with input length.
/// On a 100KB batch, this can take several milliseconds.
fn parse_tle_batch_blocking(raw: String) -> Result<Vec<String>> {
    // Synchronous parsing: no I/O, but CPU-bound for large inputs.
    raw.lines()
        .filter(|l| l.starts_with("1 ") || l.starts_with("2 "))
        .map(|l| Ok(l.to_string()))
        .collect()
}

async fn ingest_tle_update(raw_batch: String) -> Result<Vec<String>> {
    // Moving raw_batch into spawn_blocking satisfies the 'static bound.
    // The closure executes on a blocking thread; we await the JoinHandle.
    let records = task::spawn_blocking(move || parse_tle_batch_blocking(raw_batch))
        .await
        // The outer error is a JoinError (task panicked or was aborted).
        // Propagate it as an application error.
        .map_err(|e| anyhow::anyhow!("TLE parser panicked: {e}"))??;
    Ok(records)
}

#[tokio::main]
async fn main() -> Result<()> {
    let raw = "1 25544U 98067A 21275.52500000 .00001234 00000-0 12345-4 0 9999\n\
               2 25544 51.6400 337.6640 0007417 62.6000 297.5200 15.48889583300000\n"
        .to_string();

    let records = ingest_tle_update(raw).await?;
    println!("Parsed {} TLE records", records.len());
    Ok(())
}
```
The double ? on .await.map_err(...)?? deserves explanation: spawn_blocking returns a JoinHandle<T>, and awaiting that handle yields Result<T, JoinError>. Here T is itself the Result<Vec<String>, anyhow::Error> returned by parse_tle_batch_blocking, so the awaited value is a Result nested inside a Result. The first ? propagates the JoinError (after mapping it), the second propagates the inner application error. Collapsing these correctly is a common stumbling point. Do not use .unwrap() on a JoinHandle in production code; a parser panic should not take down the ingress runtime.
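The nested-Result shape is not specific to Tokio. As a dependency-free analogy, std::thread::JoinHandle::join also returns an outer Result whose Err means the worker panicked, wrapped around whatever the closure returned. The sketch below uses that to show the two ? operators peeling one layer each; parse_len and parse_on_worker are hypothetical stand-ins, not the lesson's parser.

```rust
use std::thread;

/// Stand-in for a fallible synchronous parser (hypothetical).
fn parse_len(raw: &str) -> Result<usize, String> {
    if raw.is_empty() {
        Err("empty batch".to_string())
    } else {
        Ok(raw.lines().count())
    }
}

/// Outer layer: did the worker finish? Inner layer: did the parser succeed?
pub fn parse_on_worker(raw: String, simulate_panic: bool) -> Result<usize, String> {
    let joined: Result<Result<usize, String>, _> = thread::spawn(move || {
        if simulate_panic {
            panic!("simulated parser bug");
        }
        parse_len(&raw)
    })
    .join();

    // First `?` handles the outer (join) failure after mapping it to the
    // application error type; second `?` handles the inner (parser) failure.
    let count = joined.map_err(|_| "worker panicked".to_string())??;
    Ok(count)
}
```

All three outcomes (success, parser error, worker panic) come back to the caller as ordinary values, which is the property the spawn_blocking version relies on.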
Key Takeaways
- Tokio's multi-thread scheduler uses work-stealing across a pool of worker threads (defaulting to one per logical CPU). Tasks spawned via tokio::spawn are queued on the runtime and picked up by idle workers.
- Worker threads and blocking threads serve different purposes. Never run synchronous blocking I/O or long CPU computation on a worker thread. Use tokio::task::spawn_blocking to route blocking work to the dedicated blocking thread pool.
- Explicit Builder configuration lets you control thread counts, stack sizes, thread names, and lifecycle hooks. Use it in production to separate high-frequency I/O workloads from lower-frequency blocking workloads, preventing one from starving the other.
- tokio::spawn creates a task with 'static lifetime. If you need to share data from the spawning scope, move it into the task with async move, wrap it in Arc, or communicate via channels.
- Multiple runtimes in the same process are a valid pattern for resource isolation. Ingress and housekeeping workloads with fundamentally different resource profiles benefit from separate thread pools rather than competing on a shared executor.
Lesson 3 — Task Lifecycle: Cancellation, Timeouts, and JoinHandle Management
Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 3 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 2, 7
Context
The Meridian control plane manages connections that span orbital passes. A ground station connection is live while the target satellite is above the horizon — typically 8 to 12 minutes. When the pass ends, the connection should be torn down cleanly: in-flight frames flushed, session state persisted, downstream consumers notified. If the control plane is restarted mid-pass — rolling deploy, crash recovery, OOM kill — active tasks must be cancelled in a way that does not corrupt shared state or leave downstream systems with partial data.
Understanding task lifecycle is not optional for this system. Tasks that outlive their useful scope waste resources. Tasks cancelled without cleanup leave corrupted state. Tasks that silently swallow their errors make incident response a guessing game. The Tokio JoinHandle, the .abort() call, and tokio::time::timeout are the instruments for managing these concerns; this lesson covers each one in depth.
Core Concepts
The JoinHandle and Task Output
tokio::spawn returns JoinHandle<T>. The handle has two primary uses: waiting for the task's output with .await, and cancelling the task with .abort().
.await on a JoinHandle<T> produces Result<T, JoinError>. JoinError indicates one of two things: the task panicked, or it was aborted. Distinguishing them matters:
```rust
use tokio::task::JoinHandle;

async fn report_outcome(handle: JoinHandle<u32>) {
    match handle.await {
        Ok(_value) => { /* normal completion */ }
        Err(e) if e.is_panic() => { /* task panicked: log and recover */ }
        Err(e) if e.is_cancelled() => { /* task was aborted */ }
        Err(_) => unreachable!(),
    }
}
```
If you drop a JoinHandle without awaiting it, the task continues running — it is not cancelled. This is the correct behavior for fire-and-forget tasks. If you need the task to stop when the handle is dropped, use tokio_util::task::AbortOnDropHandle (a wrapper that calls .abort() on drop) or implement the same pattern manually.
Task Cancellation with .abort()
.abort() sends a cancellation signal to the task. The task does not stop immediately — it is cancelled at the next .await point. This is cooperative cancellation: the task's state machine is dropped when it next yields to the executor, which runs the Drop implementation of any held values.
The implication: resources guarded by RAII are dropped correctly on cancellation. A tokio::net::TcpStream held by the task will be closed. A MutexGuard will be released. A tokio::fs::File will be closed, although writes still in flight are not guaranteed to reach disk unless flush was awaited before the drop. What is not guaranteed is that code after the .await where cancellation occurred will run. If you have cleanup logic that must run regardless of cancellation, it must be in a Drop impl, not in code that follows an .await.
```rust
// Placeholders so the example is self-contained.
struct SessionState;
async fn process_frames() {}
async fn persist_session_state(_id: u64) {}

// This cleanup logic may NOT run if the task is cancelled at the await:
async fn session_handler(id: u64) {
    process_frames().await; // <-- task may be cancelled here
    // The following line may never execute if aborted above.
    persist_session_state(id).await; // NOT guaranteed on cancellation
}

// This cleanup logic WILL run on cancellation because it is in Drop:
struct Session {
    id: u64,
    state: SessionState,
}

impl Drop for Session {
    fn drop(&mut self) {
        // Synchronous cleanup only; no async here.
        // Flush to a synchronous in-memory buffer; a separate housekeeping
        // task drains the buffer to persistent storage.
        tracing::info!(session_id = self.id, "session dropped, state buffered");
    }
}
```
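The claim that Drop runs while code after the .await does not can be checked without a runtime, because cancellation is nothing more than dropping a suspended future. The sketch below is a minimal demonstration using only the standard library: it polls a future once by hand with a no-op waker, then drops it. SessionGuard, NoopWake, and cancel_mid_await are hypothetical names introduced for the illustration.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical guard that records that its Drop ran.
struct SessionGuard {
    dropped: Arc<AtomicBool>,
}
impl Drop for SessionGuard {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::SeqCst);
    }
}

/// Returns (guard_dropped, code_after_await_ran) after cancelling mid-await.
pub fn cancel_mid_await() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let ran_after_await = Arc::new(AtomicBool::new(false));
    let (d, a) = (Arc::clone(&dropped), Arc::clone(&ran_after_await));

    let mut session = Box::pin(async move {
        let _guard = SessionGuard { dropped: d };
        std::future::pending::<()>().await; // suspends here and never resumes
        a.store(true, Ordering::SeqCst); // stands in for post-await cleanup
    });

    // One poll runs the body up to the first suspension point.
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let first_poll = session.as_mut().poll(&mut cx);
    assert!(matches!(first_poll, Poll::Pending));

    // Dropping the suspended future is what cancellation amounts to.
    drop(session);

    (dropped.load(Ordering::SeqCst), ran_after_await.load(Ordering::SeqCst))
}
```

cancel_mid_await() returns (true, false): the guard's Drop ran, and the statement after the suspension point did not.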
CancellationToken and TaskTracker
broadcast and watch channels work for shutdown signalling, but tokio-util provides two purpose-built primitives that are cleaner for the specific problem of cooperative shutdown.
CancellationToken is a cloneable, shareable cancellation handle. Any clone of a token represents the same cancellation event: when .cancel() is called on any one of them, all clones see it. Tasks wait on .cancelled(), which returns a future that resolves when the token is cancelled:
```rust
use tokio::time::{sleep, Duration};
use tokio_util::sync::CancellationToken;

async fn uplink_session(station_id: u32, token: CancellationToken) {
    loop {
        tokio::select! {
            // cancelled() is just a future; it composes naturally with select!.
            _ = token.cancelled() => {
                // Run async cleanup here before returning.
                // This is the key advantage over .abort(): we choose when to stop
                // and can flush state, send final messages, close connections.
                tracing::info!(station_id, "session received cancellation; draining");
                flush_pending_frames().await;
                break;
            }
            _ = process_next_frame(station_id) => {
                // Normal frame processing continues until token is cancelled.
            }
        }
    }
}

async fn flush_pending_frames() {
    sleep(Duration::from_millis(10)).await; // placeholder
}

async fn process_next_frame(_id: u32) {
    sleep(Duration::from_millis(50)).await; // placeholder
}

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();

    // Clone the token for each task; all clones share the same cancellation.
    let handles: Vec<_> = (0..4)
        .map(|id| {
            let t = token.clone();
            tokio::spawn(uplink_session(id, t))
        })
        .collect();

    // Simulate running for a short time then shutting down.
    sleep(Duration::from_millis(120)).await;

    // Cancel all sessions simultaneously with one call.
    token.cancel();

    for handle in handles {
        let _ = handle.await;
    }
    tracing::info!("all sessions shut down");
}
```
The critical difference from .abort(): when the token fires, the task's select! arm runs, giving the task the opportunity to execute async cleanup before it exits. .abort() drops the future at the next .await with no opportunity for the task to run any further code.
CancellationToken::child_token() creates a child that is cancelled when the parent is cancelled, but can also be cancelled independently. Use this for hierarchical shutdown: cancel the top-level token to shut down everything, or cancel a child token to shut down one subsystem while leaving others running.
TaskTracker solves the drain-waiting problem more cleanly than collecting JoinHandles into a Vec. Spawn tasks through the tracker; call .close() when no more tasks will be added; then .wait() to block until all tracked tasks finish:
```rust
use tokio::time::{sleep, Duration};
use tokio_util::task::TaskTracker;

#[tokio::main]
async fn main() {
    let tracker = TaskTracker::new();
    let token = tokio_util::sync::CancellationToken::new();

    for station_id in 0..12u32 {
        let t = token.clone();
        tracker.spawn(async move {
            tokio::select! {
                _ = t.cancelled() => {
                    tracing::info!(station_id, "session shutting down");
                }
                _ = sleep(Duration::from_secs(300)) => {
                    tracing::info!(station_id, "session pass complete");
                }
            }
        });
    }

    // Signal that no more tasks will be spawned.
    // wait() will not resolve until close() has been called.
    tracker.close();

    // Trigger shutdown.
    sleep(Duration::from_millis(50)).await;
    token.cancel();

    // Block until all 12 sessions finish their cleanup.
    tracker.wait().await;
    tracing::info!("all sessions drained");
}
```
tracker.wait() only resolves after both conditions are true: all spawned tasks have completed, and tracker.close() has been called. The close() requirement prevents a race where wait() resolves between the last task finishing and the next one being spawned. Always call close() before wait().
tokio::time::timeout
tokio::time::timeout(duration, future) wraps any future and adds a deadline. If the future does not complete within the duration, it is cancelled and the wrapper returns Err(tokio::time::error::Elapsed).
```rust
use tokio::time::{timeout, Duration};

// Placeholder for the real ground station fetch.
async fn fetch_frame(_station: &str) -> anyhow::Result<Vec<u8>> {
    Ok(Vec::new())
}

async fn fetch_frame_with_deadline(station: &str) -> anyhow::Result<Vec<u8>> {
    timeout(Duration::from_secs(5), fetch_frame(station))
        .await
        // Elapsed is returned as Err; map it to an application error.
        .map_err(|_| anyhow::anyhow!("ground station {station} timed out after 5s"))?
}
```
A critical detail: timeout cancels the inner future when the deadline fires — with the same cooperative semantics as .abort(). The future is dropped at its next .await point after the deadline. If the future holds a database transaction or has submitted writes that should be rolled back on timeout, the transaction handle's Drop must handle the rollback.
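The rollback-in-Drop requirement is an instance of the commit-guard pattern. The following is a toy sketch under loud assumptions: Txn is a hypothetical guard, the "database" is an in-memory Vec behind a Mutex, and rollback is a simple truncate that is only correct with a single writer. A real driver's transaction type has its own semantics; the point is only that an uncommitted guard undoes its writes when dropped, which is what happens when a timeout drops the future holding it.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical transaction guard over an in-memory row store.
pub struct Txn {
    rows: Arc<Mutex<Vec<String>>>,
    start_len: usize,
    committed: bool,
}

impl Txn {
    /// Remember how many rows existed when the transaction began.
    pub fn begin(rows: Arc<Mutex<Vec<String>>>) -> Self {
        let start_len = rows.lock().expect("store lock poisoned").len();
        Txn { rows, start_len, committed: false }
    }

    /// Tentatively append a row; it survives only if `commit` is called.
    pub fn write(&self, row: &str) {
        self.rows
            .lock()
            .expect("store lock poisoned")
            .push(row.to_string());
    }

    /// Consume the guard and keep its writes.
    pub fn commit(mut self) {
        self.committed = true;
        // `self` is dropped here; Drop sees `committed` and leaves the rows alone.
    }
}

impl Drop for Txn {
    fn drop(&mut self) {
        if !self.committed {
            // Dropped without commit (for example, the owning future was
            // cancelled by a timeout): undo everything written since `begin`.
            if let Ok(mut rows) = self.rows.lock() {
                rows.truncate(self.start_len);
            }
        }
    }
}
```

Because the rollback lives in Drop, it needs no cooperation from the cancelled code path: any exit that skips commit, including cancellation at an .await, leaves the store as it was.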
For scenarios where you want to retry on timeout, wrap the timeout in a loop. For scenarios where you want to give a task one deadline with no retry, timeout is the right primitive. For scenarios where you want to cancel based on an external signal (graceful shutdown, satellite pass end), use CancellationToken or tokio::select! with a shutdown receiver.
tokio::select! for Racing Futures
tokio::select! polls multiple futures concurrently and completes with the first one that becomes ready, cancelling the others. It is the right tool for:
- Racing a task against a timeout
- Racing a task against a shutdown signal
- Implementing priority receive patterns on multiple channels
```rust
use tokio::sync::oneshot;

async fn session_with_shutdown(
    session: impl std::future::Future<Output = ()>,
    mut shutdown: oneshot::Receiver<()>,
) {
    tokio::select! {
        _ = session => {
            tracing::info!("session completed normally");
        }
        _ = &mut shutdown => {
            // Shutdown signal received; the session future is cancelled here.
            // RAII cleanup in the session's Drop runs.
            tracing::info!("session cancelled: shutdown signal received");
        }
    }
}
```
The branch that wins is executed; the branches that lose are cancelled (futures dropped at their next await point). If you need to do async cleanup when the losing branch is cancelled, you cannot do it inside select! — you need CancellationToken combined with a cleanup task.
Important: all branches of a select! run concurrently on the same task. They are never truly simultaneous — only one executes at a time — but they are polled in interleaved fashion within a single scheduler slot. This is distinct from tokio::spawn, which creates a new task that can run on a different worker thread. select! is lightweight concurrent multiplexing; spawn is independent parallel scheduling.
select! Loop Patterns and Branch Preconditions
select! is most often used inside a loop. Two patterns come up constantly in production systems.
Multi-channel drain with else: when a session task needs to drain from multiple upstream channels until all are closed:
```rust
use tokio::sync::mpsc;

async fn drain_uplinks(
    mut primary: mpsc::Receiver<Vec<u8>>,
    mut redundant: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        tokio::select! {
            // select! randomly picks which ready branch to check first;
            // this prevents the redundant channel from always being starved
            // if the primary is consistently busy.
            Some(frame) = primary.recv() => {
                process_frame(frame, "primary");
            }
            Some(frame) = redundant.recv() => {
                process_frame(frame, "redundant");
            }
            // else fires when ALL patterns fail: both channels returned None,
            // meaning both are closed. This is the clean exit condition.
            else => {
                tracing::info!("all uplink channels closed; drain complete");
                break;
            }
        }
    }
}

fn process_frame(frame: Vec<u8>, source: &str) {
    tracing::debug!(bytes = frame.len(), source, "frame processed");
}
```
The else branch is not optional when you pattern-match on Some(...). If both channels close and there is no else, select! will panic because no branch can make progress. Always include else when all branches use fallible patterns.
Branch preconditions: the , if condition syntax disables a branch before select! evaluates it. This is essential when polling a pinned future by reference inside a loop — once the future completes, the branch must be disabled or the next iteration will attempt to poll an already-resolved future, causing a panic:
```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

async fn catalog_refresh() -> Vec<u8> {
    sleep(Duration::from_millis(100)).await;
    vec![0u8; 128]
}

#[tokio::main]
async fn main() {
    let (tx, mut cmd_rx) = mpsc::channel::<String>(8);
    // Queue one command, then drop the sender so recv() yields None once
    // the channel drains. A sender that stays alive would leave recv()
    // pending forever after the refresh branch is disabled.
    tx.send("status".to_string()).await.expect("receiver is alive");
    drop(tx);

    let refresh = catalog_refresh();
    tokio::pin!(refresh);
    let mut refresh_done = false;

    for _ in 0..5 {
        tokio::select! {
            // Branch is disabled once refresh_done = true.
            // Without this precondition: panic on a later iteration.
            result = &mut refresh, if !refresh_done => {
                println!("catalog refreshed: {} bytes", result.len());
                refresh_done = true;
            }
            Some(cmd) = cmd_rx.recv() => {
                println!("command: {cmd}");
            }
            else => break,
        }
    }
}
```
When the precondition is false, select! simply skips that branch. If all branches are disabled by preconditions, select! panics — so structure your logic to ensure at least one branch is always eligible or an else handles the case.
Graceful Shutdown Pattern
A production service needs a defined shutdown sequence. For the Meridian control plane:
- Stop accepting new connections.
- Signal active session tasks to finish or cancel.
- Wait for tasks to drain (with a deadline — do not wait forever).
- Flush pending telemetry to downstream consumers.
- Exit cleanly.
```rust
use std::time::Duration;
use tokio::sync::broadcast;

struct ShutdownCoordinator {
    sender: broadcast::Sender<()>,
}

impl ShutdownCoordinator {
    fn new() -> Self {
        let (sender, _) = broadcast::channel(1);
        Self { sender }
    }

    fn subscribe(&self) -> broadcast::Receiver<()> {
        self.sender.subscribe()
    }

    async fn shutdown(&self, tasks: Vec<tokio::task::JoinHandle<()>>) {
        // Signal all subscribers.
        let _ = self.sender.send(());

        // Give tasks 10 seconds to drain. After the deadline we stop waiting;
        // the remaining handles are dropped, which detaches their tasks
        // rather than aborting them.
        let deadline = Duration::from_secs(10);
        let _ = tokio::time::timeout(deadline, async {
            for handle in tasks {
                // Ignore individual task errors during shutdown.
                let _ = handle.await;
            }
        })
        .await;
    }
}
```
The coordinator sends a shutdown signal over a broadcast channel. Each session task holds a Receiver and uses tokio::select! to race its work against the shutdown signal. After broadcasting, shutdown awaits all handles with a 10-second deadline. Any task that has not completed by then is left to the OS — in a containerized environment, the container will be killed by the orchestrator anyway.
Code Examples
Managing a Satellite Pass Session with Full Lifecycle Control
A pass session has a well-defined lifetime: it starts when the satellite rises above the ground station horizon and ends when it sets. The session task must complete cleanly if the pass ends normally, exit gracefully on shutdown, and time out if the satellite goes silent mid-pass (antenna tracking failure, power anomaly).
```rust
use anyhow::Result;
use tokio::{
    sync::oneshot,
    time::{timeout, Duration},
};
use tracing::{info, warn};

#[derive(Debug)]
struct PassSession {
    satellite_id: u32,
    ground_station: String,
}

impl Drop for PassSession {
    fn drop(&mut self) {
        // Synchronous state flush; no async.
        // In production, push final state to a lock-free ring buffer
        // that a background writer drains to persistent storage.
        info!(
            satellite_id = self.satellite_id,
            ground_station = %self.ground_station,
            "PassSession dropped; flushing state synchronously"
        );
    }
}

impl PassSession {
    async fn run(&mut self) -> Result<()> {
        info!(satellite_id = self.satellite_id, "pass session started");

        // Simulate frame processing loop.
        // In production: read frames from TcpStream, validate, forward.
        for frame_num in 0u32..100 {
            tokio::time::sleep(Duration::from_millis(50)).await;
            info!(frame = frame_num, "frame processed");
        }
        Ok(())
    }
}

async fn manage_pass(
    satellite_id: u32,
    ground_station: String,
    pass_duration: Duration,
    mut shutdown_rx: oneshot::Receiver<()>,
) -> Result<()> {
    let mut session = PassSession {
        satellite_id,
        ground_station,
    };

    // Race: session completion, pass duration timeout, or shutdown signal.
    tokio::select! {
        result = timeout(pass_duration, session.run()) => {
            match result {
                Ok(Ok(())) => info!(satellite_id, "pass completed normally"),
                Ok(Err(e)) => warn!(satellite_id, "session error: {e}"),
                Err(_) => warn!(satellite_id, "pass duration exceeded; session timed out"),
            }
        }
        _ = &mut shutdown_rx => {
            // The run() future is dropped here. PassSession::drop runs when
            // `session` goes out of scope at the end of this function,
            // flushing state before the task exits.
            warn!(satellite_id, "pass cancelled: shutdown received");
        }
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let (shutdown_tx, shutdown_rx) = oneshot::channel::<()>();

    let handle = tokio::spawn(manage_pass(
        25544,
        "gs-svalbard".to_string(),
        Duration::from_secs(30),
        shutdown_rx,
    ));

    // Simulate shutdown signal after 1 second.
    tokio::time::sleep(Duration::from_secs(1)).await;
    let _ = shutdown_tx.send(());

    match handle.await {
        Ok(Ok(())) => info!("task completed"),
        Ok(Err(e)) => warn!("task error: {e}"),
        Err(e) if e.is_cancelled() => warn!("task was aborted externally"),
        Err(e) => warn!("task panicked: {e}"),
    }

    Ok(())
}
```
Key decisions in this code: the Drop impl handles synchronous cleanup, which is guaranteed to run whether the session completes normally, times out, or is cancelled. The select! gives the session three possible exit paths with distinct log entries — observable, diagnosable behavior rather than silent state corruption. The outer .await on the handle distinguishes between clean task exit, application errors, external abort, and panics.
Key Takeaways
- JoinHandle<T> awaits as Result<T, JoinError>. Distinguish between panics and cancellation using e.is_panic() / e.is_cancelled(). Never .unwrap() a JoinHandle in production code without a comment explaining the invariant.
- Dropping a JoinHandle does not cancel the task. Call .abort() explicitly if you need cancellation on drop. .abort() is cooperative: the task stops at its next .await point, not immediately.
- Async cleanup after an .await is not guaranteed on cancellation. Put mandatory cleanup in Drop (synchronous) or use CancellationToken to intercept the shutdown signal and run async teardown before the task exits.
- tokio::time::timeout wraps any future with a deadline. On expiry, it cancels the inner future at its next .await. Resources held by the cancelled future are dropped via RAII; no manual cleanup is needed if your types implement Drop correctly.
- tokio::select! runs all branches on the same task; they multiplex, they do not parallelize. Branches randomly compete for selection when multiple are ready, which prevents starvation. Use tokio::spawn when you need true independent scheduling; use select! when you need lightweight concurrency within a single task.
- select! branch preconditions (, if condition) disable a branch before evaluation. Always use them with pinned futures in loops to prevent the "async fn resumed after completion" panic.
- In select! loops, always include an else branch when all active branches use fallible patterns like Some(...). The else branch fires when all patterns fail to match, typically when all channels are closed, and provides the clean exit condition.
- CancellationToken (from tokio-util) is the preferred cancellation primitive for cooperative shutdown. Cloning shares the same cancellation event. .cancelled().await composes naturally with select! and, unlike .abort(), allows the task to run async cleanup before exiting.
- TaskTracker (from tokio-util) is the preferred drain primitive for shutdown. Spawn tasks through the tracker, call .close() when done spawning, then .wait().await to block until all tasks finish. This avoids the JoinHandle Vec pattern and correctly handles the close/wait ordering requirement.
Project — Async Telemetry Ingestion Broker
Module: Foundation — M01: Async Rust Fundamentals
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- System Specification
- Expected Output
- Acceptance Criteria
- Frame Format Reference
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0041 — Telemetry Ingestion Broker Replacement
The legacy Python telemetry broker is being decommissioned. It accepted connections sequentially on a single thread and could not keep up beyond 12 concurrent ground station feeds. With constellation expansion to 48 LEO satellites and 12 active ground station sites, the broker routinely falls behind during peak pass windows, buffering up to 40 seconds of lag before flushing — unacceptable for conjunction avoidance workflows that require sub-10-second delivery.
Your task is to implement the replacement broker in Rust using Tokio. The broker must accept concurrent TCP connections from ground stations, parse incoming telemetry frames, and fan each frame out to multiple registered downstream handlers — without blocking on any single slow handler.
The broker does not perform conjunction computation. It is a pure ingress and distribution layer. Correctness, throughput, and clean lifecycle management are the acceptance criteria.
System Specification
Connection Model
- Ground stations connect over TCP to a configurable bind address.
- Each connection streams telemetry frames encoded as length-prefixed byte sequences: a 4-byte big-endian u32 length header followed by length bytes of payload.
- Connections are persistent for the duration of a satellite pass (8–12 minutes). They may drop and reconnect within a pass without notice.
- The broker must handle up to 48 concurrent connections without degradation.
Frame Routing
- Registered downstream handlers receive every frame via a bounded tokio::sync::broadcast channel.
- If a slow handler's receiver falls behind and the broadcast channel fills, it is the handler's problem; the broker must not block or slow its ingress path to accommodate a slow consumer.
- The broker logs a warning when a receiver falls behind (broadcast returns RecvError::Lagged).
Lifecycle
- The broker accepts a shutdown signal (a tokio::sync::watch or oneshot channel) and performs graceful shutdown:
  - Stop accepting new connections.
  - Signal all active session tasks to drain and exit.
  - Wait up to 10 seconds for tasks to finish.
  - Force-abort any remaining tasks and exit.
- Session tasks must flush their in-progress frame before shutting down (complete the current frame read, then exit; do not abort mid-frame).
Expected Output
A binary crate (meridian-broker) that:
- Binds a TCP listener on a configurable address (default 0.0.0.0:7777).
- Spawns a new async task per incoming connection.
- Each task reads frames using the length-prefix protocol.
- Each parsed frame is sent over a broadcast::Sender<Frame>.
- A configurable number of simulated downstream handler tasks subscribe to the broadcast channel and print/log received frames.
- Ctrl-C triggers graceful shutdown with the sequence described above.
The binary should run, accept at least one connection from telnet or netcat with hand-crafted bytes, and log frame receipt and shutdown cleanly.
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Broker accepts ≥2 simultaneous TCP connections without either blocking the other | Yes — connect two nc sessions concurrently |
| 2 | Frames are delivered to all registered downstream handlers | Yes — log output shows frame receipt on each handler |
| 3 | A slow downstream handler does not stall frame ingestion on other connections | Yes — add a tokio::time::sleep in one handler; other connections continue at full rate |
| 4 | Ctrl-C triggers graceful shutdown; in-progress frame reads complete before the task exits | Yes — observable in log output |
| 5 | If shutdown drain exceeds 10 seconds, remaining tasks are aborted | Yes — simulate a stuck task and verify the process exits within 11 seconds |
| 6 | No .unwrap() on JoinHandle::await or channel send/receive in production paths | Yes — code review |
| 7 | spawn_blocking is used for any synchronous I/O or CPU-intensive frame processing | Yes — code review |
Frame Format Reference
```text
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Payload Length (u32 BE)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Payload (variable length)                   |
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
A Frame struct for the purpose of this project:
```rust
#[derive(Clone, Debug)]
pub struct Frame {
    pub station_id: String,
    pub payload: Vec<u8>,
}
```
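For crafting test input and checking the wire format by hand, the layout above can be exercised with a small standard-library sketch. encode_frame, decode_frame, and MAX_PAYLOAD are hypothetical helper names; the 65,536-byte cap is an assumed limit for the sketch, and the decoder here is a blocking one over std::io::Read, not the async reader the broker itself needs.

```rust
use std::io::{self, Read};

/// Assumed upper bound on payload size for this sketch.
pub const MAX_PAYLOAD: usize = 65_536;

/// Encode one frame: 4-byte big-endian length, then the payload.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u32::MAX as usize, "payload too large for u32 length");
    let len = payload.len() as u32;
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Blocking decoder over any `Read`: header first, then exactly `len` bytes.
pub fn decode_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    reader.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header) as usize;
    if len > MAX_PAYLOAD {
        // Reject before allocating: an untrusted length must not size a buffer.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("frame length {len} exceeds {MAX_PAYLOAD}"),
        ));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

encode_frame(b"hello") yields the nine bytes 00 00 00 05 68 65 6c 6c 6f. The same frame can be sent to a broker on the default port with printf '\000\000\000\005hello' | nc 127.0.0.1 7777.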
Hints
Hint 1 — Reading length-prefixed frames
tokio::io::AsyncReadExt provides .read_exact(&mut buf) which reads exactly buf.len() bytes or returns an error. Use it to read the 4-byte header, parse the length, allocate the payload buffer, and read the payload:
```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

async fn read_frame(stream: &mut TcpStream) -> anyhow::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf).await?;
    let len = u32::from_be_bytes(len_buf) as usize;

    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload).await?;
    Ok(payload)
}
```
Hint 2 — Broadcast channel for fan-out
tokio::sync::broadcast::channel(capacity) returns (Sender<T>, Receiver<T>). Additional receivers are created with sender.subscribe(). Receivers that fall behind by more than capacity messages receive Err(RecvError::Lagged(n)) — not an error in the fatal sense, just a signal that they missed n messages. Log it and continue receiving.
use tokio::sync::broadcast;

let (tx, _rx) = broadcast::channel::<Frame>(256);

// Downstream handler
let mut rx = tx.subscribe();
tokio::spawn(async move {
    loop {
        match rx.recv().await {
            Ok(frame) => { /* process */ }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                tracing::warn!(missed = n, "handler fell behind");
            }
            Err(broadcast::error::RecvError::Closed) => break,
        }
    }
});
Hint 3 — Graceful shutdown with watch channel
tokio::sync::watch is well-suited for broadcasting a shutdown signal to an arbitrary number of tasks: one sender, many receivers, each receiver can check the current value or wait for a change.
use tokio::sync::watch;

let (shutdown_tx, shutdown_rx) = watch::channel(false);

// In each session task:
let mut shutdown = shutdown_rx.clone();
tokio::select! {
    result = read_frames(&mut stream) => { /* ... */ }
    _ = shutdown.changed() => {
        tracing::info!("shutdown received, finishing current frame");
        // complete current frame read if mid-frame, then return
    }
}

// In shutdown handler:
let _ = shutdown_tx.send(true);
Hint 4 — Collecting JoinHandles for drain
Keep a Vec<JoinHandle<()>> of spawned session tasks. During shutdown, wrap the drain loop in tokio::time::timeout:
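use std::time::Duration;

let drain_deadline = Duration::from_secs(10);
let drain_result = tokio::time::timeout(drain_deadline, async {
    for handle in session_handles {
        let _ = handle.await; // ignore individual task errors
    }
}).await;

if drain_result.is_err() {
    tracing::warn!("drain deadline exceeded; some tasks may not have flushed");
}
When the timeout fires, the JoinHandles that were moved into the drain future are dropped. Dropping a JoinHandle detaches its task rather than cancelling it, so the stragglers keep running. To meet the abort requirement, keep something you can still reach after the timeout (for example an AbortHandle obtained from JoinHandle::abort_handle, or a tokio::task::JoinSet) and abort the remaining tasks explicitly.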
Reference Implementation
Reveal reference implementation (attempt the project first)
// src/main.rs
use anyhow::Result;
use std::sync::Arc;
use tokio::{
    net::{TcpListener, TcpStream},
    sync::{broadcast, watch},
    time::{timeout, Duration},
};
use tokio::io::AsyncReadExt;
use tracing::{info, warn};

#[derive(Clone, Debug)]
pub struct Frame {
    pub station_id: String,
    pub payload: Vec<u8>,
}

async fn read_frame(stream: &mut TcpStream) -> Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf).await?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        // Reject oversized frames: likely a protocol error or malicious client.
        anyhow::bail!("frame length {len} exceeds maximum 65536 bytes");
    }
    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload).await?;
    Ok(payload)
}

async fn handle_connection(
    mut stream: TcpStream,
    station_id: String,
    frame_tx: broadcast::Sender<Frame>,
    mut shutdown_rx: watch::Receiver<bool>,
) {
    info!(station = %station_id, "connection established");
    loop {
        tokio::select! {
            // Poll the read branch first, so a frame that is already complete
            // is delivered before the shutdown signal is observed.
            biased;
            result = read_frame(&mut stream) => {
                match result {
                    Ok(payload) => {
                        let frame = Frame {
                            station_id: station_id.clone(),
                            payload,
                        };
                        // Broadcast errors mean all receivers dropped; broker is shutting down.
                        if frame_tx.send(frame).is_err() {
                            break;
                        }
                    }
                    Err(e) => {
                        // EOF or read error: connection dropped.
                        info!(station = %station_id, "connection closed: {e}");
                        break;
                    }
                }
            }
            _ = shutdown_rx.changed() => {
                if *shutdown_rx.borrow() {
                    info!(station = %station_id, "shutdown: completing current frame then exiting");
                    // `biased` only fixes the polling order. A read_frame future that is
                    // still pending is dropped when this branch wins, so a frame that has
                    // only partly arrived is discarded.
                    break;
                }
            }
        }
    }
    info!(station = %station_id, "connection handler exiting");
}

fn spawn_handler(
    id: usize,
    mut rx: broadcast::Receiver<Frame>,
) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        loop {
            match rx.recv().await {
                Ok(frame) => {
                    info!(
                        handler = id,
                        station = %frame.station_id,
                        bytes = frame.payload.len(),
                        "frame received"
                    );
                }
                Err(broadcast::error::RecvError::Lagged(n)) => {
                    warn!(handler = id, missed = n, "handler fell behind (lagged)");
                }
                Err(broadcast::error::RecvError::Closed) => {
                    info!(handler = id, "broadcast channel closed, handler exiting");
                    break;
                }
            }
        }
    })
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let bind_addr = "0.0.0.0:7777";
    let listener = TcpListener::bind(bind_addr).await?;
    info!("meridian broker listening on {bind_addr}");

    let (frame_tx, _) = broadcast::channel::<Frame>(256);
    // changed() takes &mut self, so the receiver held by main must be mutable.
    let (shutdown_tx, mut shutdown_rx) = watch::channel(false);

    // Spawn 3 downstream handlers.
    let mut handler_handles: Vec<tokio::task::JoinHandle<()>> = (0..3)
        .map(|i| spawn_handler(i, frame_tx.subscribe()))
        .collect();

    // Ctrl-C handler.
    let shutdown_tx = Arc::new(shutdown_tx);
    let shutdown_tx_ctrlc = shutdown_tx.clone();
    tokio::spawn(async move {
        tokio::signal::ctrl_c().await.expect("failed to listen for ctrl-c");
        info!("ctrl-c received, initiating graceful shutdown");
        let _ = shutdown_tx_ctrlc.send(true);
    });

    let mut session_handles: Vec<tokio::task::JoinHandle<()>> = Vec::new();
    let mut conn_id = 0usize;

    loop {
        // Stop accepting new connections once shutdown is signalled.
        if *shutdown_rx.borrow() {
            break;
        }
        tokio::select! {
            accept = listener.accept() => {
                match accept {
                    Ok((stream, addr)) => {
                        conn_id += 1;
                        let station_id = format!("gs-{conn_id}@{addr}");
                        let handle = tokio::spawn(handle_connection(
                            stream,
                            station_id,
                            frame_tx.clone(),
                            shutdown_rx.clone(),
                        ));
                        session_handles.push(handle);
                    }
                    Err(e) => warn!("accept error: {e}"),
                }
            }
            _ = shutdown_rx.changed() => {
                if *shutdown_rx.borrow() {
                    break;
                }
            }
        }
    }

    info!("draining {} active sessions (10s deadline)", session_handles.len());

    // Drop the broadcast sender so downstream handlers see Closed after drain.
    drop(frame_tx);

    let drain_result = timeout(Duration::from_secs(10), async {
        for handle in session_handles {
            let _ = handle.await;
        }
        for handle in handler_handles.drain(..) {
            let _ = handle.await;
        }
    })
    .await;

    if drain_result.is_err() {
        warn!("drain deadline exceeded, forcing exit");
    } else {
        info!("all tasks drained cleanly");
    }

    Ok(())
}
Cargo.toml dependencies:
[dependencies]
tokio = { version = "1", features = ["full"] }
anyhow = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
Testing the broker manually:
# Terminal 1: run the broker
RUST_LOG=info cargo run
# Terminal 2: send a frame (4-byte length prefix = 5, then "hello")
printf '\x00\x00\x00\x05hello' | nc localhost 7777
# Terminal 3: send concurrently
printf '\x00\x00\x00\x08meridian' | nc localhost 7777
# Ctrl-C in Terminal 1 to trigger graceful shutdown
Reflection
After completing this project, you have built the entry point for Meridian's control plane ingress. The patterns used here — broadcast fan-out, select!-driven shutdown, bounded drain with timeout, JoinHandle collection — recur throughout the rest of the Foundation modules and into the Data Pipelines track.
Consider for further exploration: what happens if the broker receives 10,000 connections? At what point does the spawn-per-connection model become a problem, and what is the alternative? How would you add backpressure from downstream handlers back to the ingress path without stalling the broker? These questions are the starting point for Module 3 (Message Passing Patterns).
Module 02 — Concurrency Primitives
Track: Foundation — Mission Control Platform
Position: Module 2 of 6
Source material: Rust Atomics and Locks — Mara Bos, Chapters 1–3
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Ground Station Command Queue
- Prerequisites
- What Comes Next
Mission Context
The Meridian control plane is not a purely async system. The async runtime handles the high-frequency I/O path. But the control plane also runs CPU-bound conjunction checks, synchronous vendor libraries with C FFI, a shared priority command table written by multiple connections and read by the session dispatcher, and lock-free statistics counters sampled by the monitoring dashboard.
The Python system handled shared state with a global threading lock. In six months of operation, that lock has caused four production incidents. This module establishes the Rust concurrency model that replaces it — not by eliminating shared state, but by giving you the type-level guarantees and primitive toolkit to reason about it precisely.
What You Will Learn
By the end of this module you will be able to:
- Distinguish OS threads from async tasks at the scheduling level, and route work to the correct model for its characteristics: blocking work to spawn_blocking or scoped threads, I/O-bound work to async tasks
- Use Send and Sync to reason about which types can cross thread boundaries, and understand why Rc, Cell, and raw pointers opt out
- Implement shared mutable state with Mutex<T> and RwLock<T>, manage MutexGuard lifetimes correctly, and handle lock poisoning appropriately
- Identify the three deadlock patterns that cause most production incidents and apply the structural patterns that prevent them
- Use tokio::sync::Mutex when locks must be held across .await points in async code
- Apply atomic operations (fetch_add, compare_exchange, load/store) and select the correct memory ordering (Relaxed, Acquire/Release, SeqCst) for the intended guarantee
Lessons
Lesson 1 — Threads vs Async Tasks: When to Use Each and Why
Covers std::thread::spawn vs tokio::spawn, the preemptive vs cooperative scheduling distinction, thread::scope for scoped threads with borrowed data, Send and Sync as the compiler's enforcement mechanism, and spawn_blocking as the bridge between the two models.
Key question this lesson answers: When a piece of work needs to happen concurrently, how do you decide between an OS thread and an async task — and what goes wrong if you choose incorrectly?
→ lesson-01-threads-vs-async.md / lesson-01-quiz.toml
Lesson 2 — Shared State: Mutex, RwLock, and Avoiding Deadlocks
Covers Mutex<T> mechanics (locking, MutexGuard, RAII unlock, lock poisoning), RwLock<T> and the read-heavy access pattern, MutexGuard lifetime pitfalls in async code, tokio::sync::Mutex, Condvar for blocking on data conditions, and the three deadlock patterns with structural prevention strategies.
Key question this lesson answers: How do you share mutable data between threads correctly, and what are the failure modes that Rust's type system does not prevent?
→ lesson-02-shared-state.md / lesson-02-quiz.toml
Lesson 3 — Atomics and Memory Ordering: Acquire/Release/SeqCst in Practice
Covers atomic types and the operations they support (load/store, fetch_add/fetch_sub, compare_exchange), memory ordering (Relaxed, Acquire/Release, AcqRel, SeqCst), the happens-before relationship established by the Acquire/Release pair, and the practical decision of when to use atomics vs a mutex.
Key question this lesson answers: When is a mutex overkill, and how do you safely share single values between threads without locking — including ensuring the processor and compiler do not reorder the operations that matter?
→ lesson-03-atomics.md / lesson-03-quiz.toml
Capstone Project — Ground Station Command Queue
Build a typed, concurrent priority command queue for Meridian's mission operations system. The queue accepts commands from multiple concurrent ground station producer threads, dispatches them to a consumer in priority order with FIFO tie-breaking, blocks producers when at capacity (without dropping commands), exposes lock-free metrics readable without acquiring the queue lock, and shuts down gracefully by draining remaining commands.
Acceptance is against 7 verifiable criteria including correct priority dispatch, non-busy-waiting, lock-free metrics access, and clean shutdown drain.
→ project-command-queue.md
Prerequisites
Module 1 (Async Rust Fundamentals) must be complete. This module assumes you understand how async tasks are scheduled and why blocking an async worker thread is harmful — that understanding is the foundation for the threads-vs-async distinction in Lesson 1 and the async Mutex guidance in Lesson 2.
What Comes Next
Module 3 — Message Passing Patterns builds the next layer: rather than sharing state between concurrent actors, you pass ownership of data through channels. The command queue from this module's project is extended in Module 3 with a tokio::sync::mpsc front-end, moving backpressure into async channel semantics.
Lesson 1 — Threads vs Async Tasks: When to Use Each and Why
Module: Foundation — M02: Concurrency Primitives
Position: Lesson 1 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapter 1
Context
The Meridian control plane is not a purely async system. The async runtime handles the high-frequency path — accepting ground station connections, reading telemetry frames, routing them to downstream consumers. But the control plane also runs work that has no business on an async worker thread: a vendor-supplied TLE validation library with a synchronous C FFI, a CPU-intensive conjunction check that processes several hundred orbital elements per pass, and a legacy configuration parser that performs synchronous file I/O.
The Python system handled this by running everything on threads, leaning on the GIL to serialize concurrent access. The Rust replacement needs a deliberate model. The first decision you make when writing any new piece of the control plane is: does this go on an async task or an OS thread? Getting this wrong produces either a system that starves its async executor with blocking work, or one that spawns OS threads unnecessarily, paying per-thread stack overhead at scale.
This lesson establishes the model. Every rule here has a corresponding failure mode that has been observed in Meridian's staging environment.
Source: Rust Atomics and Locks, Chapter 1 (Bos)
Core Concepts
The Fundamental Difference
An OS thread is scheduled by the kernel. The kernel decides when it runs, when it is preempted, and which CPU core it runs on. The thread has its own stack (typically 2–8 MB by default), and blocking — whether on I/O, a mutex, or std::thread::sleep — is perfectly safe: the kernel parks the thread and runs something else.
An async task is scheduled by the executor. It runs until it voluntarily yields at an await point. It shares executor worker threads with other tasks. Blocking on the worker thread — calling a synchronous library, running a long computation, sleeping with std::thread::sleep — starves every other task scheduled on that thread. There is no kernel to preempt you and run something else.
This is the core rule: any call that can block a thread for non-trivial time belongs on an OS thread, not on an async worker thread. In Tokio, the mechanism is spawn_blocking, which routes the closure to a dedicated blocking thread pool. From the async side, it looks like an awaitable future. On the execution side, it gets a real OS thread.
std::thread::spawn — Ownership and Lifetimes
std::thread::spawn takes a closure that is Send + 'static. The 'static requirement means the thread cannot borrow from the spawning scope — it must own everything it uses, or access data through shared references that are themselves 'static (like Arc).
use std::sync::Arc;
use std::thread;

fn main() {
    let catalog = Arc::new(vec!["ISS", "CSS", "STARLINK-1"]);

    let handle = thread::spawn({
        let catalog = Arc::clone(&catalog);
        move || {
            // catalog is owned by this thread: no borrow, no lifetime issue.
            println!("Thread sees {} objects", catalog.len());
        }
    });

    handle.join().unwrap();
    println!("Main sees {} objects", catalog.len());
}
The Arc::clone before the move is idiomatic: clone the handle, not the data. The thread gets its own Arc pointer (cheap — one atomic increment), and both threads share the underlying Vec. When both Arcs drop, the Vec deallocates.
thread::scope — Scoped Threads with Borrowed Data
The 'static requirement on spawn prevents borrowing stack data. thread::scope lifts this restriction: threads spawned within a scope are guaranteed to finish before the scope exits, which allows them to borrow data from the enclosing frame.
use std::thread;

fn validate_tle_batch(records: &[String]) -> usize {
    let mid = records.len() / 2;
    let (left, right) = records.split_at(mid);

    // Scoped threads can borrow `left` and `right`: no Arc, no clone.
    thread::scope(|s| {
        let left_handle = s.spawn(|| left.iter().filter(|r| r.starts_with("1 ")).count());
        let right_handle = s.spawn(|| right.iter().filter(|r| r.starts_with("1 ")).count());
        // scope blocks here until both threads finish.
        left_handle.join().unwrap() + right_handle.join().unwrap()
    })
}

fn main() {
    let records: Vec<String> = (0..100)
        .map(|i| format!("{} {:05}U record", if i % 2 == 0 { "1" } else { "2" }, i))
        .collect();
    println!("{} valid TLE lines", validate_tle_batch(&records));
}
thread::scope is the right tool for data-parallel CPU work over a borrowed slice, which is exactly the conjunction check pattern in the Meridian pipeline. The input is shared without cloning it or wrapping it in an Arc, and there is no 'static constraint. The compiler enforces that the borrowed data outlives the scope.
Send and Sync — The Type System's Enforcement
Rust enforces thread safety through two marker traits (Rust Atomics and Locks, Ch. 1):
Send: a type is Send if ownership of a value of that type can be transferred to another thread. Arc<T> is Send (if T: Send + Sync), but Rc<T> is not — Rc's reference count is non-atomic and would race if shared across threads.
Sync: a type is Sync if it can be shared between threads by shared reference. i32 is Sync. Cell<i32> is not — mutating through a shared reference is not safe across threads.
The compiler enforces these automatically. You cannot accidentally send an Rc<T> to another thread: thread::spawn requires Send, and Rc does not implement it. You cannot share a RefCell<T> across threads either: sharing by reference requires Sync (Arc<T> is only Send when T: Sync), and RefCell does not implement Sync. Wrapping the value in a Mutex<T> instead restores Sync, because Mutex<T> only requires T: Send.
Both are auto traits: a struct whose fields are all Send is itself Send, and likewise for Sync. The common exceptions are raw pointers (*const T, *mut T) and Rc, which are neither Send nor Sync; Cell and RefCell, which are Send but not Sync; and types that wrap OS handles that are not thread-safe. When you implement a type that wraps these, you must opt in to Send/Sync manually with unsafe impl, accepting responsibility for the invariant.
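The marker traits can be probed directly at compile time. The sketch below is illustrative only: `assert_send`, `assert_sync`, and `VendorHandle` are hypothetical names, and the safety argument for the `unsafe impl` is an assumption about an imaginary C library, not a statement about Meridian's vendor code.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

// Compile-time probes: a call only type-checks when the bound holds.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

/// Hypothetical stand-in for a handle returned by a C library.
/// The raw pointer field makes it neither Send nor Sync by default.
struct VendorHandle {
    raw: *mut u8,
}

// ASSUMPTION for this sketch: the imaginary library allows a handle to move
// between threads as long as only one thread uses it at a time.
// That justifies Send. It does not justify Sync, so Sync is not implemented.
unsafe impl Send for VendorHandle {}

impl VendorHandle {
    fn new(value: u8) -> Self {
        Self { raw: Box::into_raw(Box::new(value)) }
    }

    /// Consume the handle, free the allocation, and return the stored byte.
    fn into_value(self) -> u8 {
        // SAFETY: `raw` came from Box::into_raw in `new` and is consumed exactly once.
        let boxed = unsafe { Box::from_raw(self.raw) };
        *boxed
    }
}

fn main() {
    assert_send::<Arc<Mutex<u64>>>();
    assert_sync::<Arc<Mutex<u64>>>();
    assert_send::<Cell<u32>>(); // Cell<u32> can move to another thread...
    // assert_sync::<Cell<u32>>();         // ...but cannot be shared: does not compile.
    // assert_send::<std::rc::Rc<u32>>();  // Rc is neither: does not compile.
    assert_send::<VendorHandle>();

    let handle = VendorHandle::new(7);
    // The method call moves the whole handle into the closure.
    let value = thread::spawn(move || handle.into_value()).join().unwrap();
    println!("read {value} through the moved handle");
}
```

Uncommenting either of the two disabled probes turns the violation into a compile error, which is the point: the mistake never reaches a running system.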
Choosing the Right Model
The decision tree for any piece of work in the control plane:
| Work type | Right model | Mechanism |
|---|---|---|
| Concurrent TCP connections, channel receive/send | Async task | tokio::spawn |
| CPU-bound computation (conjunction check, CRC) | Blocking thread | spawn_blocking |
| Synchronous vendor library (C FFI) | Blocking thread | spawn_blocking |
| Synchronous file I/O (std::fs) | Blocking thread | spawn_blocking |
| Data-parallel work over borrowed data | Scoped threads | thread::scope |
| Independent long-running background service | OS thread | thread::spawn |
The cost difference matters at scale. Every OS thread carries a stack reservation (2 MiB by default for threads spawned through Rust's standard library, 8 MB for the Linux pthread default; physical pages are committed only as they are touched), a kernel thread structure, and scheduling overhead. A Tokio task is a heap allocation sized to its future, typically a few hundred bytes for a small one. The control plane at 48 uplinks can sustain thousands of concurrent tasks trivially; it cannot sustain thousands of OS threads without careful stack-size tuning.
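Stack-size tuning for OS threads goes through `std::thread::Builder`. A minimal sketch follows; `run_small_stack_workers` is a hypothetical helper, and the 64 KiB figure is an assumption for illustration that is only safe for workers with shallow call stacks.

```rust
use std::io;
use std::thread;

/// Spawn `count` named workers with an explicit stack size and sum their results.
fn run_small_stack_workers(count: usize, stack_bytes: usize) -> io::Result<usize> {
    let mut handles = Vec::with_capacity(count);
    for i in 0..count {
        // Builder::spawn returns io::Result, so a refused thread is an error
        // value here rather than a panic as with thread::spawn.
        let handle = thread::Builder::new()
            .name(format!("worker-{i}"))
            .stack_size(stack_bytes)
            .spawn(move || i * 2)?;
        handles.push(handle);
    }
    Ok(handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .sum())
}

fn main() -> io::Result<()> {
    // Assumed figure: 64 KiB per worker instead of the default reservation.
    let total = run_small_stack_workers(8, 64 * 1024)?;
    println!("sum of worker results: {total}");
    Ok(())
}
```

The platform may round the requested size up to its own minimum, so treat the value as a lower bound on what each thread reserves.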
Code Examples
Mixing Async and Blocking: The Vendor TLE Validator
The TLE validation library provided by Meridian's orbit data vendor is a synchronous C library wrapped in a Rust FFI crate. It performs checksum validation and orbital element range checking — purely CPU work, no I/O, but it takes 2–15ms per record depending on complexity. Calling it from an async task would stall the executor for the duration.
use std::time::Duration;
use tokio::task;

// Simulates a synchronous vendor library call.
// In production: calls into the C FFI wrapper.
fn validate_tle_sync(line1: &str, line2: &str) -> Result<(), String> {
    // Vendor library does checksum + orbital element bounds checking.
    // Blocks for 2–15ms depending on record complexity.
    std::thread::sleep(Duration::from_millis(5)); // placeholder
    if line1.starts_with("1 ") && line2.starts_with("2 ") {
        Ok(())
    } else {
        Err(format!("malformed TLE: {line1}"))
    }
}

async fn validate_tle_async(line1: String, line2: String) -> Result<(), String> {
    // Move strings into the blocking closure.
    // spawn_blocking runs on the dedicated blocking thread pool;
    // async worker threads are not touched.
    task::spawn_blocking(move || validate_tle_sync(&line1, &line2))
        .await
        // JoinError means the blocking thread panicked.
        .map_err(|e| format!("validator panicked: {e}"))?
}

#[tokio::main]
async fn main() {
    // All 48 sessions can submit validation concurrently.
    // Each runs on the blocking pool; none stall the async workers.
    let tasks: Vec<_> = (0..6).map(|i| {
        tokio::spawn(validate_tle_async(
            format!("1 {:05}U 98067A 21275.52 .00001234 00000-0 12345-4 0 999{i}", i),
            format!("2 {:05} 51.6400 337.6640 0007417 62.6000 297.5200 15.4888958300000{i}", i),
        ))
    }).collect();

    for (i, t) in tasks.into_iter().enumerate() {
        match t.await.unwrap() {
            Ok(()) => println!("record {i}: valid"),
            Err(e) => println!("record {i}: {e}"),
        }
    }
}
Scoped Threads for Parallel Conjunction Screening
The conjunction screening pass runs every 10 minutes against the full 50k-object catalog. It splits the catalog across CPU cores using scoped threads. The catalog is a large Vec<OrbitalRecord> — no clone, no Arc, just borrowed slices distributed across workers.
use std::thread;

#[derive(Clone)]
struct OrbitalRecord {
    norad_id: u32,
    altitude_km: f64,
}

struct ConjunctionAlert {
    object_a: u32,
    object_b: u32,
    closest_approach_km: f64,
}

fn screen_shard(shard: &[OrbitalRecord], threshold_km: f64) -> Vec<ConjunctionAlert> {
    // Simplified: real implementation computes relative positions via SGP4.
    shard.windows(2)
        .filter(|pair| (pair[0].altitude_km - pair[1].altitude_km).abs() < threshold_km)
        .map(|pair| ConjunctionAlert {
            object_a: pair[0].norad_id,
            object_b: pair[1].norad_id,
            closest_approach_km: (pair[0].altitude_km - pair[1].altitude_km).abs(),
        })
        .collect()
}

fn run_conjunction_screen(catalog: &[OrbitalRecord], threshold_km: f64) -> Vec<ConjunctionAlert> {
    let num_cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);
    // max(1) keeps chunks() from panicking when the catalog is empty.
    let shard_size = ((catalog.len() + num_cores - 1) / num_cores).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = catalog
            .chunks(shard_size)
            .map(|shard| s.spawn(move || screen_shard(shard, threshold_km)))
            .collect();

        handles.into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

fn main() {
    let catalog: Vec<OrbitalRecord> = (0..1000)
        .map(|i| OrbitalRecord { norad_id: i, altitude_km: 400.0 + (i as f64 * 0.3) })
        .collect();

    let alerts = run_conjunction_screen(&catalog, 5.0);
    println!("{} conjunction alerts generated", alerts.len());
}
Each shard runs on its own OS thread via thread::scope, borrowing its slice without any heap allocation for sharing. The scope blocks until all workers finish, then results are collected. This is the correct pattern for data-parallel CPU work where all input data is available upfront and results need to be aggregated.
Key Takeaways
- OS threads are preemptively scheduled by the kernel. Async tasks are cooperatively scheduled by the executor. Blocking on an async worker thread (any call that does not yield at await) starves other tasks on that thread.
- Use spawn_blocking for any synchronous, blocking, or CPU-intensive work that originates in an async context. It routes work to a dedicated thread pool separate from the async workers.
- thread::scope allows scoped threads to borrow data from the enclosing frame without Arc or 'static constraints. It is the right tool for data-parallel work over borrowed slices. The scope blocks until all spawned threads finish.
- Send and Sync are marker traits enforced at compile time. Send permits transferring ownership across threads; Sync permits sharing by reference. Violating these constraints (sending Rc, sharing Cell) is a compile error, not a runtime race.
- The thread vs async decision is about scheduling model, not concurrency. Both models run work concurrently. The difference is what happens when work blocks: OS threads can block safely; async tasks cannot.
Lesson 2 — Shared State: Mutex, RwLock, and Avoiding Deadlocks
Module: Foundation — M02: Concurrency Primitives
Position: Lesson 2 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapter 1
Context
The Meridian command queue maintains a shared priority table: incoming operator commands are written by the command ingress task, read by the session dispatch task, and occasionally queried by the monitoring dashboard. The Python system used a global dictionary with a threading lock. In production, that lock has been involved in three separate deadlock incidents — two in the same deployment week — all caused by the same root pattern: lock acquired, function called, that function also acquires the same lock.
Rust does not prevent deadlocks at compile time. But it gives you the tools to reason about them precisely: Mutex<T> and RwLock<T> make the protected data visible in the type signature, MutexGuard makes it impossible to access data without holding the lock, and RAII makes it impossible to forget to release it. This lesson covers how these primitives work, the failure modes that remain after Rust's type system has done its job, and the patterns that prevent them.
Source: Rust Atomics and Locks, Chapter 1 (Bos)
Core Concepts
Mutex<T> — Exclusive Access with RAII
std::sync::Mutex<T> wraps a value of type T and enforces that only one thread can access it at a time. The data is inaccessible without locking. There is no way to accidentally read T without going through .lock().
.lock() returns LockResult<MutexGuard<'_, T>>. The MutexGuard dereferences to T and automatically releases the lock when it drops. There is no .unlock() method. The lock is released when the guard goes out of scope — or, critically, when it is explicitly dropped.
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let command_count = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4).map(|_| {
        let counter = Arc::clone(&command_count);
        thread::spawn(move || {
            for _ in 0..1000 {
                // Lock is acquired here. Guard is dropped at end of block.
                let mut count = counter.lock().unwrap();
                *count += 1;
                // Guard dropped here: lock released before next iteration.
            }
        })
    }).collect();

    for h in handles {
        h.join().unwrap();
    }

    println!("commands processed: {}", command_count.lock().unwrap());
}
The Arc provides shared ownership across threads (Rc is not Send and will not compile here). The Mutex provides exclusive access. This is the standard pattern for shared mutable state between threads.
Lock Poisoning
When a thread panics while holding a Mutex lock, the mutex is marked poisoned. Subsequent calls to .lock() return Err(PoisonError). The data is still accessible through the error — err.into_inner() returns the MutexGuard — but the poison signals that the data may be in an inconsistent state.
In practice, most Meridian code uses .unwrap() on mutex locks. This is deliberate: if a thread panics while holding the command queue lock, it is not safe to continue operating on potentially corrupted queue state. Propagating the panic is the correct response. The cases where you would recover from a poisoned mutex are rare and require domain-specific knowledge about what "inconsistent" means for that data.
One place where .unwrap() is wrong: in a test or in a thread that genuinely needs to clean up a partially-written state. In those cases, match on the LockResult explicitly.
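Matching on the LockResult looks like the sketch below. `lock_or_reset` is a hypothetical helper, and clearing the queue is an assumed recovery policy chosen purely for illustration; the right policy depends on what "inconsistent" means for the data in question.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Lock the queue, recovering the guard if a previous holder panicked.
/// ASSUMPTION for this sketch: discarding the contents is an acceptable recovery.
fn lock_or_reset(queue: &Mutex<Vec<u32>>) -> MutexGuard<'_, Vec<u32>> {
    match queue.lock() {
        Ok(guard) => guard,
        Err(poisoned) => {
            // into_inner hands back the guard despite the poison flag.
            let mut guard = poisoned.into_inner();
            guard.clear();
            guard
        }
    }
}

fn main() {
    let queue = Arc::new(Mutex::new(vec![1u32, 2, 3]));

    // Poison the mutex by panicking while the guard is held.
    let q = Arc::clone(&queue);
    let result = thread::spawn(move || {
        let mut guard = q.lock().unwrap();
        guard.push(4);
        panic!("simulated failure mid-update");
    })
    .join();
    assert!(result.is_err());
    assert!(queue.is_poisoned());

    let mut guard = lock_or_reset(&queue);
    guard.push(10);
    println!("queue after recovery: {:?}", *guard);
}
```

Recovering the guard does not clear the poison flag, so later calls to `lock()` keep returning `Err` and keep taking the recovery branch.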
MutexGuard Lifetime — A Common Bug
The most common Mutex bug in Rust code is holding a guard longer than intended, or, worse, holding it across an .await point in async code. A guard held across an await keeps the mutex locked for the duration of the async operation. If another task tries to acquire the same lock, it blocks its async worker thread, because std::sync::Mutex::lock blocks rather than yields. The compiler catches some of these cases, since std::sync::MutexGuard is not Send and tokio::spawn requires a Send future, but not all of them.
use std::sync::Mutex;

fn main() {
    let data = Mutex::new(vec![1u32, 2, 3]);

    // Pitfall: without the explicit drop below, `guard` would stay alive
    // for the whole block and the second lock() would never succeed.
    {
        let guard = data.lock().unwrap();
        if guard.contains(&2) {
            drop(guard); // Must explicitly drop before re-locking.
            data.lock().unwrap().push(4);
        }
    }

    println!("{:?}", data.lock().unwrap());
}
In async code that has to hold a lock across an .await, use tokio::sync::Mutex instead of std::sync::Mutex. It yields to the executor while waiting for the lock rather than blocking the thread. Even then, avoid holding a tokio::sync::MutexGuard across an .await that might take a long time: the lock stays held for the duration of that await, which stalls every other task waiting on it.
RwLock<T> — Read Concurrency, Write Exclusivity
RwLock<T> distinguishes between reads and writes. Multiple readers can hold the lock simultaneously; a writer requires exclusive access. This is the concurrent version of RefCell.
It is appropriate when reads are frequent and writes are rare. For the Meridian session state table: many tasks read current session state, but writes only happen when sessions start or end. An RwLock allows those many concurrent reads without serializing them.
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

type SessionTable = Arc<RwLock<HashMap<u32, String>>>;

fn register_session(table: &SessionTable, id: u32, station: String) {
    // Write lock: exclusive.
    table.write().unwrap().insert(id, station);
}

fn query_session(table: &SessionTable, id: u32) -> Option<String> {
    // Read lock: concurrent with other readers.
    table.read().unwrap().get(&id).cloned()
}

fn main() {
    let table: SessionTable = Arc::new(RwLock::new(HashMap::new()));
    register_session(&table, 25544, "gs-svalbard".into());

    let readers: Vec<_> = (0..4).map(|_| {
        let t = Arc::clone(&table);
        thread::spawn(move || {
            // All four reader threads can hold the read lock simultaneously.
            println!("{:?}", query_session(&t, 25544));
        })
    }).collect();

    for r in readers {
        r.join().unwrap();
    }
}
RwLock is not always faster than Mutex. If writes are frequent, readers pay the overhead of checking for pending writers. On some platforms, RwLock can starve writers if readers continuously hold the lock. Profile before committing to RwLock as an optimisation. For the common case of a hot write path with rare reads, Mutex is simpler and often faster.
Deadlock Patterns and How to Prevent Them
The classic deadlock involves two threads acquiring two locks in opposite order, but a single thread and a single non-reentrant lock are enough. Rust's type system prevents neither. Three patterns cause the vast majority of deadlocks in production:
Lock ordering violation: Thread A acquires lock 1 then lock 2. Thread B acquires lock 2 then lock 1. Each holds what the other needs. Prevention: establish a global lock acquisition order and document it. If the command queue lock must always be acquired before the session table lock, enforce that convention in code review.
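One way to make a lock order hard to violate is to take both locks in exactly one place. The sketch below assumes a hypothetical `ControlState` with a documented order (commands first, then sessions); `with_both` and `run_workers` are illustrative names, not part of the Meridian codebase.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical shared state. Documented order: `commands` first, then `sessions`.
struct ControlState {
    commands: Mutex<Vec<String>>,
    sessions: Mutex<Vec<u32>>,
}

impl ControlState {
    fn new() -> Self {
        Self { commands: Mutex::new(Vec::new()), sessions: Mutex::new(Vec::new()) }
    }

    /// The only function that holds both locks, so the order lives in one place.
    fn with_both<R>(&self, f: impl FnOnce(&mut Vec<String>, &mut Vec<u32>) -> R) -> R {
        let mut commands = self.commands.lock().unwrap(); // always first
        let mut sessions = self.sessions.lock().unwrap(); // always second
        f(&mut *commands, &mut *sessions)
    }
}

/// Every worker touches both tables, always through `with_both`.
fn run_workers(state: &Arc<ControlState>, workers: u32) {
    let handles: Vec<_> = (0..workers)
        .map(|id| {
            let state = Arc::clone(state);
            thread::spawn(move || {
                state.with_both(|commands, sessions| {
                    commands.push(format!("CMD-{id:04}"));
                    sessions.push(id);
                });
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    let state = Arc::new(ControlState::new());
    run_workers(&state, 4);
    state.with_both(|commands, sessions| {
        println!("{} commands, {} sessions", commands.len(), sessions.len());
    });
}
```

The convention still needs review discipline, because nothing stops a new method from locking the fields directly; the helper only makes the compliant path the easiest one.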
Re-entrant locking: std::sync::Mutex is not reentrant. A thread that calls .lock() on a mutex it already holds never gets the lock a second time: the standard library leaves the exact behavior unspecified, and in practice the call deadlocks or panics. This is the source of Meridian's production incidents: a function acquires the lock and calls a helper, and the helper acquires the same lock.
Prevention: keep lock-holding code flat. Do not call functions while holding a lock unless you can verify they do not acquire the same lock. If a function is callable both with and without a lock held, split it into two versions or restructure the locking scope.
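Splitting a function into a locking entry point and a helper that expects the lock to be held can be sketched as follows. `CommandTable`, the `_locked` suffix, and the "priority 3 and above is urgent" rule are all assumptions made for the example.

```rust
use std::sync::Mutex;

/// Hypothetical table of (priority, payload) pairs.
struct CommandTable {
    pending: Mutex<Vec<(u8, String)>>,
}

impl CommandTable {
    fn new() -> Self {
        Self { pending: Mutex::new(Vec::new()) }
    }

    /// Entry point that takes the lock once, then calls only `*_locked` helpers.
    fn push_and_count_urgent(&self, priority: u8, payload: &str) -> usize {
        let mut pending = self.pending.lock().unwrap();
        pending.push((priority, payload.to_string()));
        // Calling self.urgent_count() here would try to lock a second time.
        Self::urgent_count_locked(&pending)
    }

    /// Entry point for callers that do not hold the lock.
    fn urgent_count(&self) -> usize {
        let pending = self.pending.lock().unwrap();
        Self::urgent_count_locked(&pending)
    }

    /// Helper for callers that already hold the lock. It receives the data
    /// rather than `&self`, so it has no way to lock again.
    fn urgent_count_locked(pending: &[(u8, String)]) -> usize {
        pending.iter().filter(|(priority, _)| *priority >= 3).count()
    }
}

fn main() {
    let table = CommandTable::new();
    table.push_and_count_urgent(1, "ping");
    let urgent = table.push_and_count_urgent(5, "safe-mode");
    println!("urgent after push: {urgent}, urgent on query: {}", table.urgent_count());
}
```

Passing the guarded data into the helper instead of `&self` turns "must be called with the lock held" from a comment into something the signature enforces.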
Holding guards across blocking calls: In synchronous code, this means holding a MutexGuard while calling a function that blocks on I/O. In async code, it means holding a std::sync::MutexGuard across an .await.
Prevention: minimize the scope of guards. Acquire, mutate, release. Do not hold a lock while doing I/O. In async code, use tokio::sync::Mutex or restructure to release the lock before awaiting.
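The acquire, mutate, release discipline often comes down to moving data out under the lock and doing the slow work afterwards. A minimal sketch, with `flush_pending` as a hypothetical function and any `Write` implementation standing in for the real I/O sink:

```rust
use std::io::{self, Write};
use std::sync::Mutex;

/// Take the pending batch under the lock, then write it with no lock held.
fn flush_pending<W: Write>(pending: &Mutex<Vec<String>>, sink: &mut W) -> io::Result<usize> {
    // The guard lives only inside this block; take() leaves an empty Vec behind.
    let batch: Vec<String> = {
        let mut guard = pending.lock().unwrap();
        std::mem::take(&mut *guard)
    }; // lock released here

    // Slow I/O runs without the lock, so producers can keep pushing.
    for line in &batch {
        writeln!(sink, "{line}")?;
    }
    Ok(batch.len())
}

fn main() -> io::Result<()> {
    let pending = Mutex::new(vec!["CMD-0001".to_string(), "CMD-0002".to_string()]);
    let mut stdout = io::stdout();
    let written = flush_pending(&pending, &mut stdout)?;
    println!("flushed {written} commands");
    Ok(())
}
```

The trade-off is that a failed write loses the batch unless the caller puts it back, so error handling has to be decided explicitly rather than left to the lock.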
Code Examples
The Meridian Priority Command Queue
The command queue receives operator commands from the ground network interface. Commands have integer priorities. The session dispatcher reads the highest-priority pending command. Multiple ground network connections can write concurrently.
    use std::collections::BinaryHeap;
    use std::sync::{Arc, Condvar, Mutex};
    use std::thread;
    use std::time::Duration;

    #[derive(Eq, PartialEq)]
    struct Command {
        priority: u8,
        payload: String,
    }

    impl Ord for Command {
        fn cmp(&self, other: &Self) -> std::cmp::Ordering {
            // Higher priority = higher value in max-heap.
            self.priority.cmp(&other.priority)
        }
    }

    impl PartialOrd for Command {
        fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
            Some(self.cmp(other))
        }
    }

    struct CommandQueue {
        // Mutex + Condvar is the standard pattern for blocking producers/consumers.
        inner: Mutex<BinaryHeap<Command>>,
        available: Condvar,
    }

    impl CommandQueue {
        fn new() -> Arc<Self> {
            Arc::new(Self {
                inner: Mutex::new(BinaryHeap::new()),
                available: Condvar::new(),
            })
        }

        fn push(&self, cmd: Command) {
            self.inner.lock().unwrap().push(cmd);
            // Notify one waiting consumer that data is available.
            self.available.notify_one();
        }

        fn pop_blocking(&self) -> Command {
            let mut queue = self.inner.lock().unwrap();
            // Condvar::wait releases the mutex and blocks until notified,
            // then reacquires the mutex before returning.
            loop {
                if let Some(cmd) = queue.pop() {
                    return cmd;
                }
                queue = self.available.wait(queue).unwrap();
            }
        }
    }

    fn main() {
        let queue = CommandQueue::new();

        // Producer threads simulate ground network connections.
        let producers: Vec<_> = (0..3).map(|i| {
            let q = Arc::clone(&queue);
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(i * 10));
                q.push(Command {
                    priority: (i as u8 % 3) + 1,
                    payload: format!("CMD-{i:04}"),
                });
                println!("producer {i}: pushed priority {}", (i as u8 % 3) + 1);
            })
        }).collect();

        // Consumer runs on a separate thread: simulates the session dispatcher.
        let q = Arc::clone(&queue);
        let consumer = thread::spawn(move || {
            for _ in 0..3 {
                let cmd = q.pop_blocking();
                println!("dispatcher: executing '{}' (priority {})", cmd.payload, cmd.priority);
            }
        });

        for p in producers { p.join().unwrap(); }
        consumer.join().unwrap();
    }
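The Condvar solves the busy-wait problem: without it, the consumer would spin, repeatedly locking the mutex to check queue.is_empty() and wasting CPU. Condvar::wait atomically releases the mutex and parks the thread, then reacquires the mutex before returning. The loop around wait is required because a wait can return spuriously, without a matching notify, so the condition is re-checked after every wakeup. The .unwrap() on lock() is intentional: if a producer panics while holding the lock, possibly leaving the queue in an inconsistent state, the consumer should not continue silently.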
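The standard library also provides Condvar::wait_while, which folds the re-check loop into the call. The following is a minimal sketch of the same blocking pop written that way; the `Queue` type and `demo` function are hypothetical, and a plain FIFO of integers stands in for the command heap.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical FIFO queue used to illustrate `wait_while`.
struct Queue {
    items: Mutex<VecDeque<u32>>,
    available: Condvar,
}

impl Queue {
    fn push(&self, value: u32) {
        self.items.lock().unwrap().push_back(value);
        self.available.notify_one();
    }

    fn pop_blocking(&self) -> u32 {
        let guard = self.items.lock().unwrap();
        // wait_while keeps waiting for as long as the predicate is true,
        // re-checking it after every wakeup (including spurious ones).
        let mut guard = self
            .available
            .wait_while(guard, |items| items.is_empty())
            .unwrap();
        guard.pop_front().expect("wait_while returned, so the queue is non-empty")
    }
}

/// Spawns one producer and blocks until its value arrives.
fn demo() -> u32 {
    let queue = Arc::new(Queue {
        items: Mutex::new(VecDeque::new()),
        available: Condvar::new(),
    });
    let producer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || queue.push(42))
    };
    let value = queue.pop_blocking();
    producer.join().unwrap();
    value
}

fn main() {
    println!("received {}", demo());
}
```

Whether the explicit loop or wait_while reads better is a style choice; both handle spurious wakeups correctly.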
Key Takeaways
- Mutex<T> makes protected data inaccessible without locking. MutexGuard is the only way to reach the data, and it releases the lock on drop. There is no way to forget to unlock, but there are ways to hold the lock longer than intended.
- Lock poisoning marks a mutex as potentially inconsistent when a thread panics while holding it. Most production code uses .unwrap() on locks, propagating the panic. Recover from a poisoned mutex only when you can correct the inconsistent state.
- RwLock<T> allows concurrent reads and exclusive writes. It is appropriate when reads are dominant. It is not always faster than Mutex on write-heavy paths; profile before optimizing.
- Three deadlock patterns cover most production incidents: lock ordering violations (acquire in inconsistent order across threads), re-entrant locking (acquiring a lock you already hold), and holding guards across blocking calls. Document lock acquisition order and minimize guard scope.
- In async code, std::sync::Mutex::lock blocks the OS thread, which parks the async worker. Use tokio::sync::Mutex when the lock may be contended and the wait must yield to the executor. Never hold any MutexGuard across a slow .await.
- Condvar is the correct primitive for blocking on a data condition (waiting for a non-empty queue, waiting for a flag). It atomically releases the mutex and parks the thread, avoiding busy-waiting.
Lesson 3 — Atomics and Memory Ordering: Acquire/Release/SeqCst in Practice
Module: Foundation — M02: Concurrency Primitives
Position: Lesson 3 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapters 2–3
Context
The Meridian control plane increments a frame counter every time a telemetry frame is received — 4,800 times per second at full uplink load across 48 satellites. The per-session heartbeat timer fires every 100ms. The frame drop rate is sampled by the monitoring dashboard every second. None of these operations need the overhead of a mutex lock. They need a single integer that multiple threads can read and write without data races.
This is the domain of atomics. std::sync::atomic provides integer and boolean types that support safe concurrent mutation without locking. The operations are indivisible — they either complete entirely or have not happened yet — which prevents the torn reads and non-atomic increments that would corrupt counters under concurrent access.
But atomics are not free. The memory ordering argument on every atomic operation — Relaxed, Acquire, Release, AcqRel, SeqCst — controls what guarantees the processor and compiler make about the ordering of operations across threads. Getting this wrong produces bugs that are invisible in development and intermittent in production.
Core Concepts
What Atomic Operations Guarantee
An atomic operation is indivisible: it either completes entirely before any other operation on the same variable, or it has not happened yet (Rust Atomics and Locks, Ch. 2). Two threads simultaneously performing counter += 1 on a plain integer is a data race, which is undefined behavior; safe Rust refuses to compile it, and it takes unsafe code to get there. The read-modify-write is three separate steps (load, add, store), and their interleaving across threads is unpredictable. Two threads simultaneously calling counter.fetch_add(1, Relaxed) is defined and correct: each fetch_add is a single atomic step.
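A minimal sketch of that guarantee, assuming nothing beyond the standard library (`count_frames` is a hypothetical name): several threads increment one AtomicU64, and the total is exact once they have all finished.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::thread;

/// Increments one shared counter from `threads` threads and returns the total.
fn count_frames(threads: u64, frames_per_thread: u64) -> u64 {
    let counter = AtomicU64::new(0);
    // Scoped threads may borrow `counter` because the scope joins them
    // all before it returns.
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..frames_per_thread {
                    // One indivisible read-modify-write; no increment is lost.
                    counter.fetch_add(1, Relaxed);
                }
            });
        }
    });
    counter.load(Relaxed)
}

fn main() {
    println!("total frames: {}", count_frames(4, 10_000));
}
```

Relaxed is enough here because only the counter itself matters, and the end of the scope (which joins every thread) makes all increments visible to the final load.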
The available types live in std::sync::atomic: AtomicBool, AtomicI8/U8 through AtomicI64/U64, AtomicIsize/Usize, and AtomicPtr<T> (which of these exist depends on the target platform). All support mutation through a shared reference (&AtomicUsize): they provide interior mutability, built on UnsafeCell, without the runtime borrow checks of RefCell and without the locking of Mutex.
Every atomic operation takes an Ordering argument. The ordering is not about the value — it is about the visibility of other memory operations to other threads.
Load, Store, and Fetch-and-Modify
The three basic operation families:
Load and store — read or write the atomic value:
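    use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

    static FRAME_COUNT: AtomicU64 = AtomicU64::new(0);

    fn record_frame() {
        // A fetch-and-modify operation (covered next). A separate load
        // followed by a store would lose increments under contention.
        FRAME_COUNT.fetch_add(1, Relaxed);
    }

    fn read_frame_count() -> u64 {
        FRAME_COUNT.load(Relaxed) // load: read the current value
    }

    fn reset_frame_count() {
        FRAME_COUNT.store(0, Relaxed); // store: overwrite the value
    }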
Fetch-and-modify — atomically modify the value and return the previous value (Rust Atomics and Locks, Ch. 2):
    use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

    fn main() {
        let counter = AtomicU64::new(100);
        let old = counter.fetch_add(23, Relaxed);
        assert_eq!(old, 100); // returned the value before the add
        assert_eq!(counter.load(Relaxed), 123); // value after the add
    }
The main fetch-and-modify operations are fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor, fetch_nand, fetch_max, fetch_min, and swap. Use these in preference to compare-and-exchange when the operation fits: they are simpler, and on many targets the compiler can map them to a single hardware instruction.
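Two of the less obvious operations in use, as a hedged sketch: fetch_max keeps a high-water mark and swap implements "read and reset" in one step. `LinkStats` and its fields are hypothetical names chosen for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Hypothetical per-link statistics.
struct LinkStats {
    largest_frame: AtomicU64,
    drops_since_sample: AtomicU64,
}

impl LinkStats {
    fn new() -> Self {
        Self {
            largest_frame: AtomicU64::new(0),
            drops_since_sample: AtomicU64::new(0),
        }
    }

    /// Raises the high-water mark if `bytes` exceeds it.
    fn record_frame(&self, bytes: u64) {
        self.largest_frame.fetch_max(bytes, Relaxed);
    }

    fn record_drop(&self) {
        self.drops_since_sample.fetch_add(1, Relaxed);
    }

    /// Returns the drops since the previous sample and resets the counter.
    /// A separate load followed by store(0) could lose a drop recorded
    /// between the two calls; swap cannot.
    fn take_drops(&self) -> u64 {
        self.drops_since_sample.swap(0, Relaxed)
    }
}

fn main() {
    let stats = LinkStats::new();
    stats.record_frame(512);
    stats.record_frame(2048);
    stats.record_frame(1024);
    stats.record_drop();
    stats.record_drop();
    println!("largest frame: {}", stats.largest_frame.load(Relaxed));
    println!("drops this interval: {}", stats.take_drops());
    println!("drops next interval: {}", stats.take_drops());
}
```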
compare_exchange — The General Atomic Primitive
compare_exchange atomically checks whether the current value equals an expected value, and if so, replaces it with a new value. It returns the previous value on success, and the actual current value on failure (Rust Atomics and Locks, Ch. 2):
    use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

    fn increment_if_below(a: &AtomicU32, limit: u32) -> bool {
        let mut current = a.load(Relaxed);
        loop {
            if current >= limit {
                return false;
            }
            match a.compare_exchange(current, current + 1, Relaxed, Relaxed) {
                Ok(_) => return true,    // successfully incremented
                Err(v) => current = v,   // another thread changed it; retry
            }
        }
    }

    fn main() {
        let seq = AtomicU32::new(0);
        println!("{}", increment_if_below(&seq, 5)); // true
    }
The loop-and-retry pattern is fundamental: load the current value, compute the desired new value without holding any lock, then swap atomically only if the value has not changed since the load. If it has changed, retry. This is a lock-free algorithm: no thread blocks, and a compare_exchange fails only because another thread's update succeeded, so the system as a whole always makes progress even when an individual thread has to retry.
compare_exchange_weak may spuriously fail (return Err even when the value matches) on some architectures. Use it in loops where spurious failure just triggers another iteration. Use the strong version when you need a guarantee that success or failure is definitive.
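As a sketch of the loop case, here is the bounded increment written with compare_exchange_weak, followed by the same logic using fetch_update, the standard-library method that wraps this retry loop. The function names are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Bounded increment using the weak CAS. A spurious failure just takes
/// another trip around the loop with the freshly observed value.
fn increment_if_below(a: &AtomicU32, limit: u32) -> bool {
    let mut current = a.load(Relaxed);
    loop {
        if current >= limit {
            return false;
        }
        match a.compare_exchange_weak(current, current + 1, Relaxed, Relaxed) {
            Ok(_) => return true,
            Err(actual) => current = actual,
        }
    }
}

/// The same operation via fetch_update, which runs the closure in a
/// CAS loop until it succeeds or the closure returns None.
fn increment_if_below_update(a: &AtomicU32, limit: u32) -> bool {
    a.fetch_update(Relaxed, Relaxed, |v| if v < limit { Some(v + 1) } else { None })
        .is_ok()
}

fn main() {
    let seq = AtomicU32::new(3);
    println!("{}", increment_if_below(&seq, 5));
    println!("{}", increment_if_below_update(&seq, 5));
    println!("{}", increment_if_below(&seq, 5));
    println!("final value: {}", seq.load(Relaxed));
}
```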
The ABA problem: if a value changes from A to B and back to A between the load and the CAS, compare_exchange will succeed even though the value was modified. For simple counters and flags this is harmless; for pointer-based data structures it can be a correctness issue.
Memory Ordering — The Model
Processors and compilers reorder operations when it does not change single-threaded program behavior. In concurrent code, these reorderings can change observed behavior across threads. Memory ordering tells the compiler and processor what reorderings are permissible around a given atomic operation (Rust Atomics and Locks, Ch. 3).
Relaxed — no ordering guarantees beyond consistency on the single atomic variable. All threads see modifications of a given atomic in the same total order, but operations on different variables may be reordered arbitrarily. Use for statistics counters and progress indicators where you only care about the eventual value, not the timing relationship with other operations.
Release (stores) / Acquire (loads) — the most important pair. A release-store establishes a happens-before relationship with any subsequent acquire-load that reads the stored value:
    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering::{Acquire, Release, Relaxed}};
    use std::thread;

    static DATA: AtomicU64 = AtomicU64::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        thread::spawn(|| {
            DATA.store(12345, Relaxed);  // (1) write data
            READY.store(true, Release);  // (2) publish: everything before this is visible...
        });

        while !READY.load(Acquire) {     // (3) ...once this returns true.
            std::hint::spin_loop();
        }
        println!("{}", DATA.load(Relaxed)); // guaranteed to print 12345
    }
Once the acquire-load at (3) sees true, the happens-before relationship guarantees that (1) is visible. Without the Acquire/Release pair (using Relaxed on both), the main thread could see READY as true while still reading 0 from DATA.
The names come from the mutex pattern: a mutex unlock is a release-store; a mutex lock-acquire is an acquire-load. Everything the thread did before releasing the mutex is visible to the thread that acquires it next.
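To make the analogy concrete, here is a deliberately minimal spin lock in the spirit of the one developed in Rust Atomics and Locks. It is an illustrative sketch, not something to use in place of std::sync::Mutex: `SpinLock`, `with_lock`, and `locked_sum` are hypothetical names, and the lock stays held forever if the closure panics.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering::{Acquire, Release}};
use std::thread;

/// Illustrative spin lock: `locked` is the only synchronisation.
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: access to `value` is serialised by `locked`.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Lock: the Acquire swap pairs with the previous holder's Release
        // store, so everything that holder wrote is visible here.
        while self.locked.swap(true, Acquire) {
            std::hint::spin_loop();
        }
        // Safety: the flag is set by this thread, so no other thread is
        // inside the critical section.
        let result = f(unsafe { &mut *self.value.get() });
        // Unlock: Release publishes the writes made inside the closure.
        self.locked.store(false, Release);
        result
    }
}

/// Increments a plain u32 under the spin lock from several threads.
fn locked_sum(threads: u32, per_thread: u32) -> u32 {
    let lock = SpinLock::new(0u32);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    lock.with_lock(|v| *v += 1);
                }
            });
        }
    });
    lock.with_lock(|v| *v)
}

fn main() {
    println!("sum: {}", locked_sum(4, 1_000));
}
```

The protected value is an ordinary integer, not an atomic. The Acquire on lock and Release on unlock are what make the plain `*v += 1` safe across threads.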
AcqRel — both Acquire and Release in a single operation. Used for read-modify-write operations (like fetch_add or compare_exchange) that must both see all prior releases and publish all prior stores.
SeqCst — sequentially consistent: the strongest ordering. All SeqCst operations across all threads form a single total order that every thread agrees on. This is stronger than Acquire/Release and is rarely needed. Use it when you have two threads each setting a flag and then reading the other's flag, and you need to guarantee that at least one thread sees the other's write (Rust Atomics and Locks, Ch. 3). In nearly all other cases, Acquire/Release is sufficient.
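The two-flag case can be sketched as follows (`raise_and_check` is a hypothetical name). Each thread raises its own flag and then reads the other's. With SeqCst on all four operations, both stores sit in one total order, so whichever store comes second is followed by a load that must observe the first. With Acquire/Release alone, both loads would be allowed to return false.

```rust
use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
use std::thread;

/// Returns what each thread observed of the other's flag.
fn raise_and_check() -> (bool, bool) {
    let a = AtomicBool::new(false);
    let b = AtomicBool::new(false);
    thread::scope(|s| {
        let t1 = s.spawn(|| {
            a.store(true, SeqCst);
            b.load(SeqCst) // did thread 2 raise its flag yet?
        });
        let t2 = s.spawn(|| {
            b.store(true, SeqCst);
            a.load(SeqCst) // did thread 1 raise its flag yet?
        });
        (t1.join().unwrap(), t2.join().unwrap())
    })
}

fn main() {
    let (t1_saw_b, t2_saw_a) = raise_and_check();
    // At least one of the two must be true under SeqCst.
    println!("t1 saw b: {t1_saw_b}, t2 saw a: {t2_saw_a}");
}
```

Which thread sees the other is a race and varies between runs; the guarantee is only that the "neither saw the other" outcome is excluded.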
When to Reach for Atomics vs Mutex
Atomics are not a general replacement for mutexes. They are appropriate for:
- Single-value counters and flags (frame counts, connection counts, shutdown flags)
- Lock-free reference counting (the internal mechanism of Arc)
- Progress indicators shared between threads
- Single-producer/single-consumer patterns where acquire/release establishes the necessary ordering
Mutexes are appropriate for:
- Protecting multi-field structs where all fields must be updated atomically
- Any operation that requires a multi-step transaction
- Data structures that cannot be represented as a single atomic value
Reaching for SeqCst everywhere is tempting, but it is not a good default: it has higher cost on some architectures (notably ARM), the extra strength is rarely needed, and it hides which ordering the code actually depends on. Start with Acquire/Release. If your correctness argument requires a global total order across multiple atomics, then SeqCst is warranted.
Code Examples
Multi-Thread Frame Counter with Atomic Statistics
The telemetry pipeline tracks three counters: total frames received, total frames dropped (due to backpressure), and bytes processed. These are written by 48 uplink tasks and read by the monitoring dashboard. A mutex would serialize all 48 writes; atomics let them proceed in parallel.
    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering::{Acquire, Relaxed, Release}};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    struct PipelineMetrics {
        frames_received: AtomicU64,
        frames_dropped: AtomicU64,
        bytes_processed: AtomicU64,
        // Shutdown flag: Release on write, Acquire on read.
        shutdown: AtomicBool,
    }

    impl PipelineMetrics {
        fn new() -> Arc<Self> {
            Arc::new(Self {
                frames_received: AtomicU64::new(0),
                frames_dropped: AtomicU64::new(0),
                bytes_processed: AtomicU64::new(0),
                shutdown: AtomicBool::new(false),
            })
        }

        fn record_frame(&self, bytes: u64) {
            // Relaxed: these counters are for monitoring only.
            // The exact ordering relative to other threads' stores doesn't matter;
            // we only care about the eventual totals.
            self.frames_received.fetch_add(1, Relaxed);
            self.bytes_processed.fetch_add(bytes, Relaxed);
        }

        fn record_drop(&self) {
            self.frames_dropped.fetch_add(1, Relaxed);
        }

        fn signal_shutdown(&self) {
            // Release: everything this thread wrote before the store is visible
            // to any thread that reads `true` with Acquire. It does not publish
            // other threads' counter updates.
            self.shutdown.store(true, Release);
        }

        fn should_stop(&self) -> bool {
            // Acquire: pairs with the Release store in signal_shutdown().
            self.shutdown.load(Acquire)
        }

        fn snapshot(&self) -> (u64, u64, u64) {
            (
                self.frames_received.load(Relaxed),
                self.frames_dropped.load(Relaxed),
                self.bytes_processed.load(Relaxed),
            )
        }
    }

    fn main() {
        let metrics = PipelineMetrics::new();

        // Simulate 4 uplink tasks.
        let workers: Vec<_> = (0..4).map(|i| {
            let m = Arc::clone(&metrics);
            thread::spawn(move || {
                for _ in 0..100 {
                    if m.should_stop() { break; }
                    m.record_frame(1024);
                    if i == 0 { m.record_drop(); } // simulate drops on uplink 0
                }
            })
        }).collect();

        // Monitoring thread samples every 5ms.
        let m = Arc::clone(&metrics);
        let monitor = thread::spawn(move || {
            for _ in 0..3 {
                thread::sleep(Duration::from_millis(5));
                let (recv, dropped, bytes) = m.snapshot();
                println!("recv={recv} drop={dropped} bytes={bytes}");
            }
            m.signal_shutdown();
        });

        for w in workers { w.join().unwrap(); }
        monitor.join().unwrap();

        // join() synchronizes with each worker, so these totals are exact.
        let (recv, dropped, bytes) = metrics.snapshot();
        println!("final: recv={recv} drop={dropped} bytes={bytes}");
    }
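The Acquire/Release pair on the shutdown flag guarantees that a worker which observes should_stop() as true also sees everything the signalling thread wrote before calling signal_shutdown(). It does not order the workers' own Relaxed counter updates: those are published to the main thread by join(), which is why the final snapshot is exact while the monitor's intermediate samples are only approximate. In this example the signalling thread writes nothing else that the workers read, so Relaxed on the flag would also be correct; Acquire/Release is the conservative choice for a flag that may later guard data.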
Key Takeaways
- Atomic operations are indivisible: a fetch_add on an AtomicU64 is a single step with no observable intermediate state. Plain integer += is not atomic, and unsynchronized concurrent modification is undefined behavior.
- fetch_add and friends return the value before the operation. This is intentional: it lets you use the old value directly, for example to hand out unique sequence numbers or tickets.
- compare_exchange is the general-purpose lock-free primitive. The loop-and-retry pattern (load, compute, CAS, retry on failure) enables lock-free algorithms where no thread ever blocks.
- Relaxed ordering gives only modification order on a single variable. It is correct for statistics counters and progress indicators where cross-variable ordering does not matter.
- Acquire/Release establishes happens-before across threads. A release-store publishes all preceding memory operations; an acquire-load that reads that value sees all of them. This is what makes mutex unlock/lock, the final drop of an Arc, and cross-thread data handoffs safe.
- SeqCst provides a global total order across all SeqCst operations on all threads. Use it only when you need to coordinate two or more flags where the relative order matters globally. In practice, Acquire/Release covers the vast majority of use cases.
Project — Ground Station Command Queue
Module: Foundation — M02: Concurrency Primitives
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- System Specification
- Expected Output
- Acceptance Criteria
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Operations Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0044 — Priority Command Queue Implementation
The legacy Python command queue used a global dictionary with a threading lock. Over the past six months it has been involved in four production incidents: two deadlocks from re-entrant locking, one priority inversion where a low-priority housekeeping command blocked an emergency SAFE_MODE injection, and one data race when a monitoring process read the queue mid-write.
The replacement must be a typed, concurrent priority queue in Rust. It accepts mission-critical commands from multiple concurrent ground network connections, dispatches them in priority order to the session controller, and exposes lock-free metrics to the monitoring system — without the failure modes of the Python implementation.
System Specification
Command Model
Commands have a u8 priority (0 = lowest, 255 = highest). Predefined priorities:
| Command type | Priority |
|---|---|
| SAFE_MODE | 255 |
| ABORT_PASS | 200 |
| REPOINT | 100 |
| STATUS_REQUEST | 50 |
| HOUSEKEEPING | 10 |
A Command struct:
    #[derive(Debug, Eq, PartialEq)]
    pub struct Command {
        pub priority: u8,
        pub kind: CommandKind,
        pub issued_at: std::time::Instant,
    }

    #[derive(Debug, Eq, PartialEq)]
    pub enum CommandKind {
        SafeMode,
        AbortPass,
        // Integer fields, matching the reference implementation: f32 does not
        // implement Eq, so float fields would not compile with derive(Eq).
        Repoint { azimuth: u32, elevation: u32 },
        StatusRequest,
        Housekeeping,
    }
Queue Behaviour
- Multiple producer threads (one per ground station connection) push commands concurrently.
- One consumer thread (the session dispatcher) pops the highest-priority command. If multiple commands share the same priority, the oldest (by issued_at) is dispatched first.
- When the queue is empty, the consumer blocks without busy-waiting.
- The queue has a configurable capacity. If full, a push from a producer blocks until space is available. Blocking producers must not block the consumer.
Metrics
The following counters are maintained lock-free and available to the monitoring system without acquiring any lock:
- commands_pushed: total commands ever pushed (all priorities)
- commands_dispatched: total commands ever dispatched
- safe_mode_count: number of SAFE_MODE commands dispatched (priority 255)
Shutdown
The queue accepts a shutdown signal. On shutdown:
- No new pushes are accepted; producers get an Err(QueueShutdown).
- The consumer drains any remaining commands in priority order.
- Once the queue is empty and shutdown is signalled, the consumer returns.
Expected Output
A library crate (meridian-cmdqueue) with:
- A CommandQueue type with push, pop, and shutdown methods
- An Arc<Metrics> accessible from the CommandQueue with the three lock-free counters
- A binary that demonstrates: 3 producer threads pushing 5 commands each, 1 consumer thread dispatching them in priority order, a monitoring thread sampling metrics every 20ms, and shutdown after all producers finish
The output should clearly show commands being dispatched in priority order (not FIFO).
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Commands dispatched in priority order (highest first, then oldest-first within priority) | Yes — log output order |
| 2 | Consumer blocks without busy-waiting when queue is empty | Yes — no >5% CPU when idle (measure with top) |
| 3 | Multiple concurrent producers do not cause data races | Yes — test suite passes under loom or ThreadSanitizer, run with cargo test -- --test-threads=1 |
| 4 | Metrics counters are readable without acquiring any queue lock | Yes — code review: metrics accessed via atomics only |
| 5 | Shutdown drains remaining commands before consumer exits | Yes — log shows all pushed commands dispatched before exit |
| 6 | Producer push blocks (does not drop commands) when queue is at capacity | Yes — test with capacity=2 and 10 concurrent pushes |
| 7 | No .unwrap() on Mutex::lock() without a comment on the invariant | Yes — code review |
Hints
Hint 1 — Implementing priority + FIFO ordering on BinaryHeap
BinaryHeap is a max-heap. To get "highest priority first, then oldest first within the same priority," implement Ord on Command to compare first by priority (descending), then by issued_at (ascending — older is higher priority):
    use std::cmp::Ordering;
    use std::time::Instant;

    struct Command {
        priority: u8,
        issued_at: Instant,
    }

    impl Ord for Command {
        fn cmp(&self, other: &Self) -> Ordering {
            self.priority.cmp(&other.priority)
                .then_with(|| other.issued_at.cmp(&self.issued_at)) // older = higher
        }
    }

    impl PartialOrd for Command {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }

    impl PartialEq for Command {
        fn eq(&self, other: &Self) -> bool {
            self.priority == other.priority && self.issued_at == other.issued_at
        }
    }

    impl Eq for Command {}
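One caveat with a timestamp tie-break is that Instant only guarantees non-decreasing values, so two commands can carry the same issued_at. The sketch below shows an alternative tie-break that is not part of the specification: a sequence number assigned at push time, wrapped in Reverse so that lower (older) numbers sort higher in the max-heap. `PriorityFifo` is a hypothetical type with a string payload standing in for the command.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical priority queue with strict FIFO order within a priority.
struct PriorityFifo {
    // Tuples compare field by field: priority first, then the reversed
    // sequence number. Sequence numbers are unique, so the payload never
    // takes part in the ordering.
    heap: BinaryHeap<(u8, Reverse<u64>, String)>,
    next_seq: u64,
}

impl PriorityFifo {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    fn push(&mut self, priority: u8, payload: &str) {
        self.heap.push((priority, Reverse(self.next_seq), payload.to_string()));
        self.next_seq += 1;
    }

    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, payload)| payload)
    }
}

fn main() {
    let mut queue = PriorityFifo::new();
    queue.push(10, "housekeeping-1");
    queue.push(255, "safe-mode");
    queue.push(10, "housekeeping-2");
    queue.push(100, "repoint");
    while let Some(payload) = queue.pop() {
        println!("{payload}");
    }
}
```

In the concurrent queue the sequence counter would live inside the mutex-protected state, so assigning it needs no extra synchronisation.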
Hint 2 — Blocking push with capacity using Mutex + two Condvars
Two Condvars: one signals "not full" (wake a blocked producer), one signals "not empty" (wake the consumer):
    use std::collections::BinaryHeap;
    use std::sync::{Condvar, Mutex};

    struct QueueInner<T> {
        heap: BinaryHeap<T>,
        capacity: usize,
        shutdown: bool,
    }

    struct CommandQueue<T> {
        inner: Mutex<QueueInner<T>>,
        not_empty: Condvar,
        not_full: Condvar,
    }
Push blocks on not_full when the heap is at capacity; pop blocks on not_empty when the heap is empty. Each operation notifies the other condvar after completing.
Hint 3 — Lock-free metrics with atomics
Counters increment in push and pop, which both hold the mutex. But the monitoring thread must read without the mutex. Use atomics for all three counters — write from inside the locked section (ordering is Relaxed since the mutex provides the actual happens-before relationship), read from the monitoring thread with Relaxed:
    use std::sync::atomic::AtomicU64;
    use std::sync::Arc;

    pub struct Metrics {
        pub commands_pushed: AtomicU64,
        pub commands_dispatched: AtomicU64,
        pub safe_mode_count: AtomicU64,
    }

    impl Metrics {
        pub fn new() -> Arc<Self> {
            Arc::new(Self {
                commands_pushed: AtomicU64::new(0),
                commands_dispatched: AtomicU64::new(0),
                safe_mode_count: AtomicU64::new(0),
            })
        }
    }
Hint 4 — Shutdown drain sequence
Set shutdown = true in the inner state while holding the mutex, then notify_all() on both condvars. In push, check shutdown after acquiring the lock and return Err if set. In pop, check shutdown && heap.is_empty() — if both are true, return None to signal the consumer to exit:
    // In pop:
    let mut inner = self.inner.lock().unwrap();
    loop {
        if let Some(cmd) = inner.heap.pop() {
            self.not_full.notify_one();
            return Some(cmd);
        }
        if inner.shutdown {
            return None; // Queue is empty and shutdown: consumer exits
        }
        inner = self.not_empty.wait(inner).unwrap();
    }
Reference Implementation
    // src/lib.rs
    use std::cmp::Ordering;
    use std::collections::BinaryHeap;
    use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
    use std::sync::{Arc, Condvar, Mutex};
    use std::time::Instant;

    #[derive(Debug)]
    pub struct QueueShutdown;

    impl std::fmt::Display for QueueShutdown {
        fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            write!(f, "command queue is shut down")
        }
    }

    #[derive(Debug, Eq, PartialEq)]
    pub enum CommandKind {
        SafeMode,
        AbortPass,
        Repoint { azimuth: u32, elevation: u32 },
        StatusRequest,
        Housekeeping,
    }

    #[derive(Debug, Eq, PartialEq)]
    pub struct Command {
        pub priority: u8,
        pub kind: CommandKind,
        pub issued_at: Instant,
    }

    impl Ord for Command {
        fn cmp(&self, other: &Self) -> Ordering {
            self.priority
                .cmp(&other.priority)
                // Within the same priority, older commands go first.
                .then_with(|| other.issued_at.cmp(&self.issued_at))
        }
    }

    impl PartialOrd for Command {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }

    pub struct Metrics {
        pub commands_pushed: AtomicU64,
        pub commands_dispatched: AtomicU64,
        pub safe_mode_count: AtomicU64,
    }

    impl Metrics {
        fn new() -> Arc<Self> {
            Arc::new(Self {
                commands_pushed: AtomicU64::new(0),
                commands_dispatched: AtomicU64::new(0),
                safe_mode_count: AtomicU64::new(0),
            })
        }
    }

    struct Inner {
        heap: BinaryHeap<Command>,
        capacity: usize,
        shutdown: bool,
    }

    pub struct CommandQueue {
        inner: Mutex<Inner>,
        not_empty: Condvar,
        not_full: Condvar,
        pub metrics: Arc<Metrics>,
    }

    impl CommandQueue {
        pub fn new(capacity: usize) -> Arc<Self> {
            Arc::new(Self {
                inner: Mutex::new(Inner {
                    heap: BinaryHeap::with_capacity(capacity),
                    capacity,
                    shutdown: false,
                }),
                not_empty: Condvar::new(),
                not_full: Condvar::new(),
                metrics: Metrics::new(),
            })
        }

        pub fn push(&self, cmd: Command) -> Result<(), QueueShutdown> {
            // Invariant for every unwrap() on this mutex: no code path is
            // expected to panic while holding it, so a poisoned lock means a
            // bug and the panic is propagated.
            let mut inner = self.inner.lock().unwrap();
            loop {
                if inner.shutdown {
                    return Err(QueueShutdown);
                }
                if inner.heap.len() < inner.capacity {
                    inner.heap.push(cmd);
                    // Relaxed: the mutex provides the happens-before. These are
                    // statistics only; no cross-variable ordering needed.
                    self.metrics.commands_pushed.fetch_add(1, Relaxed);
                    self.not_empty.notify_one();
                    return Ok(());
                }
                // Queue full: block until space opens or shutdown.
                inner = self.not_full.wait(inner).unwrap(); // see invariant above
            }
        }

        pub fn pop(&self) -> Option<Command> {
            let mut inner = self.inner.lock().unwrap(); // see invariant in push
            loop {
                if let Some(cmd) = inner.heap.pop() {
                    self.metrics.commands_dispatched.fetch_add(1, Relaxed);
                    // The specification counts SAFE_MODE commands dispatched,
                    // so the counter is updated here, not in push.
                    if matches!(cmd.kind, CommandKind::SafeMode) {
                        self.metrics.safe_mode_count.fetch_add(1, Relaxed);
                    }
                    self.not_full.notify_one();
                    return Some(cmd);
                }
                if inner.shutdown {
                    return None;
                }
                inner = self.not_empty.wait(inner).unwrap(); // see invariant in push
            }
        }

        pub fn shutdown(&self) {
            let mut inner = self.inner.lock().unwrap(); // see invariant in push
            inner.shutdown = true;
            // Wake all blocked producers and the consumer.
            self.not_empty.notify_all();
            self.not_full.notify_all();
        }
    }
    // src/main.rs (demonstration binary)
    use std::sync::atomic::Ordering::Relaxed;
    use std::sync::Arc;
    use std::thread;
    use std::time::{Duration, Instant};

    use meridian_cmdqueue::{Command, CommandKind, CommandQueue};

    fn main() {
        let queue = CommandQueue::new(20);
        let metrics = Arc::clone(&queue.metrics);

        // Three producer threads simulate ground network connections.
        let producers: Vec<_> = (0..3u8).map(|gs| {
            let q = Arc::clone(&queue);
            thread::spawn(move || {
                let priorities = [255u8, 200, 100, 50, 10];
                for &priority in &priorities {
                    thread::sleep(Duration::from_millis(gs as u64 * 5));
                    let kind = match priority {
                        255 => CommandKind::SafeMode,
                        200 => CommandKind::AbortPass,
                        100 => CommandKind::Repoint { azimuth: 180, elevation: 45 },
                        50 => CommandKind::StatusRequest,
                        _ => CommandKind::Housekeeping,
                    };
                    match q.push(Command { priority, kind, issued_at: Instant::now() }) {
                        Ok(()) => println!("gs-{gs}: pushed priority {priority}"),
                        Err(e) => println!("gs-{gs}: push rejected: {e}"),
                    }
                }
            })
        }).collect();

        // Consumer thread: the session dispatcher.
        let q = Arc::clone(&queue);
        let consumer = thread::spawn(move || {
            while let Some(cmd) = q.pop() {
                println!("dispatcher: {:?} (priority {})", cmd.kind, cmd.priority);
                thread::sleep(Duration::from_millis(10));
            }
            println!("dispatcher: queue drained, exiting");
        });

        // Monitoring thread: reads the atomics without touching the queue lock.
        let monitor = thread::spawn(move || {
            for _ in 0..4 {
                thread::sleep(Duration::from_millis(20));
                println!(
                    "metrics: pushed={} dispatched={} safe_mode={}",
                    metrics.commands_pushed.load(Relaxed),
                    metrics.commands_dispatched.load(Relaxed),
                    metrics.safe_mode_count.load(Relaxed),
                );
            }
        });

        for p in producers { p.join().unwrap(); }
        queue.shutdown();
        consumer.join().unwrap();
        monitor.join().unwrap();
    }
Reflection
The command queue built here uses all three concurrency layers from this module: OS threads for the producer and consumer, Mutex + Condvar for blocking coordination, and atomics for the metrics that must be readable without acquiring any lock. The relationship between these layers — the mutex providing the happens-before for the atomic writes, the condvar providing the non-busy-waiting block, the atomics avoiding any lock on the read path — is the pattern used throughout the Meridian control plane.
The natural next question: the blocking push is correct but puts an upper bound on producer throughput. In Module 3, this queue is extended with a tokio::sync::mpsc front-end that moves the backpressure into async channel semantics rather than blocking OS threads.
Module 03 — Message Passing Patterns
Track: Foundation — Mission Control Platform
Position: Module 3 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 3, 6, 7, 8
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Multi-Source Telemetry Aggregator
- Prerequisites
- What Comes Next
Mission Context
Module 2 built shared-state concurrency: Mutex, RwLock, atomics. Those primitives protect data that multiple actors need to touch. This module takes the complementary approach: instead of sharing data, pass ownership through channels. Producers and consumers are decoupled — each owns its state exclusively, communicating only through typed messages.
For the Meridian control plane, message passing is the primary architecture for the telemetry pipeline. 48 satellite uplinks funnel frames into a priority-ordered aggregator. TLE catalog updates fan out to every active session simultaneously. The shutdown signal propagates to all tasks through a watched value. None of these require shared mutable state — they compose entirely from channel primitives.
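The ownership-passing idea can be sketched with the standard library before any async runtime is involved. This is a blocking analogue, not the tokio API the lessons use: std::sync::mpsc::sync_channel gives a bounded channel whose send blocks when the buffer is full, and the receiver's loop ends once every sender has been dropped. The `aggregate` function and the (uplink id, sequence) message shape are assumptions made for the example.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Fans frames from several producer threads into one consumer and
/// returns how many frames the consumer received.
fn aggregate(uplinks: u32, frames_per_uplink: u32) -> u32 {
    // Bounded channel: send() blocks once 8 frames are waiting, which
    // is the backpressure on fast producers.
    let (tx, rx) = sync_channel::<(u32, u32)>(8);

    let producers: Vec<_> = (0..uplinks)
        .map(|id| {
            let tx = tx.clone(); // one sender per producer
            thread::spawn(move || {
                for seq in 0..frames_per_uplink {
                    // Ownership of the message moves into the channel.
                    tx.send((id, seq)).unwrap();
                }
                // This producer's sender is dropped when the closure ends.
            })
        })
        .collect();

    // Drop the original sender, otherwise the loop below never ends.
    drop(tx);

    let mut received = 0;
    // Iteration stops when all senders are gone and the buffer is empty.
    for (_uplink, _seq) in rx {
        received += 1;
    }

    for producer in producers {
        producer.join().unwrap();
    }
    received
}

fn main() {
    println!("frames aggregated: {}", aggregate(4, 25));
}
```

No mutex appears anywhere: each frame has exactly one owner at a time, and the channel is the only point of contact between producers and consumer.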
What You Will Learn
By the end of this module you will be able to:
- Create bounded mpsc channels, size them for backpressure, clone senders for multiple concurrent producers, and design consumer loops that terminate cleanly when all senders drop
- Implement the actor pattern: an async task that owns its state exclusively and exposes all operations as messages, using oneshot channels for request-response within the message protocol
- Distribute events to all subscribers using broadcast, handle RecvError::Lagged correctly, and size the broadcast capacity for the slowest realistic consumer
- Distribute current state to many readers using watch, understand the difference between event distribution and state distribution, and apply Arc<T> inside a watch for cheap config reads
- Merge multiple independent async sources into one stream using shared-sender MPSC (uniform sources), select! { biased; } (priority sources), and a router actor (dynamic sources)
- Choose between mpsc, broadcast, watch, and oneshot given a fan-in or fan-out requirement
Lessons
Lesson 1 — tokio::sync::mpsc: Bounded Channels, Backpressure, and Sender Cloning
Covers mpsc::channel(capacity), Sender::clone for multiple producers, send().await as the backpressure mechanism, try_send for non-blocking producers, the consumer loop termination on sender drop, oneshot for request-response, and the actor pattern as the structural idiom that emerges from MPSC channels.
Key question this lesson answers: How do you safely move work between concurrent async tasks without shared state, and what ensures slow consumers are not overwhelmed by fast producers?
→ lesson-01-mpsc.md / lesson-01-quiz.toml
Lesson 2 — Broadcast and Watch Channels: Fan-Out Patterns
Covers broadcast::channel(capacity) for event fan-out (every subscriber gets every message), RecvError::Lagged handling, watch::channel(initial) for state fan-out (latest value, change notification), borrow() for reads that never await, and the decision matrix for choosing between mpsc, broadcast, and watch.
Key question this lesson answers: How do you distribute one event or one value to many concurrent tasks, and when does missing an intermediate update matter?
→ lesson-02-broadcast-watch.md / lesson-02-quiz.toml
Lesson 3 — Fan-In Aggregation: Merging Streams from Multiple Satellite Feeds
Covers shared-sender MPSC for uniform fan-in, select! { biased; } for priority fan-in with two priority levels, message tagging with typed source identifiers, and the router actor for dynamic fan-in (sources registered and removed at runtime).
Key question this lesson answers: How do you merge many independent async sources into one stream with control over priority, fairness, and dynamic source registration?
→ lesson-03-fan-in.md / lesson-03-quiz.toml
Capstone Project — Multi-Source Telemetry Aggregator
Build the full telemetry aggregation pipeline: a router actor with dynamic source registration, a priority fan-in that ensures emergency frames are never delayed behind routine telemetry, a bounded frame processor with backpressure, a broadcast fan-out to downstream consumers, atomic pipeline statistics exposed through a watch channel, and a clean shutdown sequence.
Acceptance is against 7 verifiable criteria including emergency frame priority, dynamic source registration, backpressure enforcement, lossless shutdown drain, and lagged broadcast handling.
→ project-telemetry-aggregator.md
Prerequisites
Modules 1 and 2 must be complete. Module 1 established how async tasks are scheduled and why they cooperatively yield — essential for understanding why bounded channel backpressure works without blocking threads. Module 2 established the shared-state model that message passing replaces — understanding both models is necessary to choose the right one for a given problem.
What Comes Next
Module 4 — Network Programming connects the message-passing pipeline to the network. The telemetry aggregator from this module gains a TCP listener front-end, turning the router actor into a full ground station connection broker that accepts connections from the 12 Meridian ground station sites.
Lesson 1 — tokio::mpsc: Bounded Channels, Backpressure, and Sender Cloning
Module: Foundation — M03: Message Passing Patterns
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapter 8
Context
Module 2's command queue used a Mutex<BinaryHeap> plus a Condvar to share state between threads. That approach works, but it couples the producers and consumer through a shared data structure — every access requires acquiring the same lock, and the consumer must hold the lock while inspecting queue contents. Under contention at 48-uplink load, that lock becomes a bottleneck.
The alternative model is message passing: producers send values into a channel; the consumer receives from it. There is no shared data structure, no explicit locking, and no Arc to pass around. The channel itself manages all synchronization. The backpressure mechanism is built in: when the channel is full, send yields rather than blocking a thread, and the async runtime can schedule other work while the producer waits.
This lesson covers tokio::sync::mpsc — the multi-producer, single-consumer channel that is the workhorse of most async Rust systems. It also covers oneshot for request-response patterns and introduces the actor model as the structural pattern that emerges naturally from combining channels with task ownership.
Source: Async Rust, Chapter 8 (Flitton & Morton)
Core Concepts
MPSC Channels: The Model
tokio::sync::mpsc::channel(capacity) creates a bounded channel and returns a (Sender<T>, Receiver<T>) pair. The capacity is the maximum number of messages that can sit in the channel before senders must wait:
```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity of 32: up to 32 messages can be buffered.
    // If the receiver falls behind, the 33rd send will yield.
    let (tx, mut rx) = mpsc::channel::<String>(32);

    tokio::spawn(async move {
        tx.send("frame-001".to_string()).await.unwrap();
    });

    while let Some(msg) = rx.recv().await {
        println!("received: {msg}");
    }
}
```
recv() returns None once all Sender handles have been dropped and every buffered message has been received. This is the clean shutdown signal for a consumer loop. No explicit close call is needed; the channel closes naturally when the last sender drops.
Sender is Clone: Multiple Producers
Sender<T> implements Clone. Each clone is an independent handle to the same channel. This is the "multi-producer" part of MPSC — any number of tasks can hold a Sender and push messages concurrently. The receiver sees messages from all senders interleaved, in the order they are delivered to the channel.
```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u32, Vec<u8>)>(64);

    // Each uplink session gets its own cloned sender.
    for satellite_id in 0..4u32 {
        let tx = tx.clone();
        tokio::spawn(async move {
            for seq in 0u8..3 {
                let frame = vec![satellite_id as u8, seq];
                // Yields if channel is full: backpressure in action.
                tx.send((satellite_id, frame)).await
                    .expect("aggregator task dropped");
            }
        });
    }

    // Drop the original sender so the channel closes when all
    // spawned tasks finish. Without this drop, rx.recv() never
    // returns None: the original sender keeps the channel alive.
    drop(tx);

    while let Some((sat, frame)) = rx.recv().await {
        println!("sat {sat}: {:?}", frame);
    }
}
```
The drop of the original tx after spawning is important and easy to forget. If any Sender clone outlives its usefulness, the channel stays open and the consumer loop blocks forever. The idiomatic pattern is to clone before spawning and drop the original.
Backpressure and Capacity Sizing
A bounded channel applies backpressure: when the channel reaches capacity, send().await yields and does not return until the consumer has drained a slot. This is the async equivalent of a blocking queue — it prevents fast producers from overwhelming a slow consumer.
try_send is the non-blocking variant. It returns Err(TrySendError::Full(_)) immediately if the channel is full rather than yielding. Use it when the producer should take an alternative action (log, drop, route to overflow) rather than applying backpressure:
```rust
use tokio::sync::mpsc;

async fn forward_or_drop(tx: &mpsc::Sender<Vec<u8>>, frame: Vec<u8>) {
    match tx.try_send(frame) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(frame)) => {
            // Aggregator is falling behind: record the drop and continue.
            // In production: increment a metrics counter here.
            tracing::warn!(bytes = frame.len(), "frame dropped: aggregator full");
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            tracing::error!("aggregator task has exited");
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(8);

    // Demonstrate try_send behaviour
    for i in 0u8..12 {
        forward_or_drop(&tx, vec![i]).await;
    }
    drop(tx);

    let mut count = 0;
    while rx.recv().await.is_some() {
        count += 1;
    }
    println!("received {count} frames (8 max due to capacity)");
}
```
Capacity sizing: too small causes unnecessary producer backpressure; too large hides a slow consumer until the buffer is exhausted. For the Meridian aggregator, a capacity of 2–4× the expected burst size is a reasonable starting point. Profile under realistic load.
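The sizing rule can be written down as a small helper. This is a sketch under stated assumptions: the name `starting_capacity` and the `headroom` parameter are hypothetical, and only the 2–4× range comes from the text. The result is a starting point for profiling, not a tuned value.

```rust
/// Hypothetical helper: derive a starting channel capacity from the
/// expected burst size, using the 2-4x rule of thumb.
fn starting_capacity(expected_burst: usize, headroom: usize) -> usize {
    // Keep the multiplier inside the recommended 2..=4 range.
    let factor = headroom.clamp(2, 4);
    // Saturate instead of overflowing, and never return 0:
    // tokio's mpsc::channel panics on a zero capacity.
    expected_burst.saturating_mul(factor).max(1)
}
```

For an expected burst of 16 frames and a headroom factor of 3, `starting_capacity(16, 3)` gives 48, which would then be passed to `mpsc::channel`.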
unbounded_channel() provides no capacity limit — senders never yield. Use it only when backpressure is handled at an outer layer and unbounded buffering is acceptable (e.g., a metrics sink that can absorb any burst). Unbounded channels can cause OOM if the consumer is slower than the producers.
oneshot: Request-Response
tokio::sync::oneshot is a single-message channel: exactly one send, exactly one receive. It is the correct primitive for request-response patterns, where a task sends a request and needs to await the result:
```rust
use tokio::sync::{mpsc, oneshot};

enum ControlMsg {
    GetQueueDepth { reply: oneshot::Sender<usize> },
    Flush,
}

async fn aggregator(mut rx: mpsc::Receiver<ControlMsg>) {
    let mut queue: Vec<Vec<u8>> = Vec::new();

    while let Some(msg) = rx.recv().await {
        match msg {
            ControlMsg::GetQueueDepth { reply } => {
                // reply.send consumes the sender, so it can only respond once.
                let _ = reply.send(queue.len());
            }
            ControlMsg::Flush => {
                println!("flushing {} frames", queue.len());
                queue.clear();
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<ControlMsg>(8);
    tokio::spawn(aggregator(rx));

    // Ask the aggregator for its current queue depth.
    let (reply_tx, reply_rx) = oneshot::channel::<usize>();
    tx.send(ControlMsg::GetQueueDepth { reply: reply_tx }).await.unwrap();
    let depth = reply_rx.await.unwrap();
    println!("aggregator queue depth: {depth}");
}
```
The oneshot::Sender is embedded in the message itself. When the aggregator handles the message, it sends back through the oneshot and the caller's reply_rx.await resolves. This pattern — sometimes called the "mailbox" or "actor" pattern — eliminates the need for any shared state between the caller and the aggregator.
The Actor Pattern
An actor is an async task that owns its state exclusively and exposes its functionality entirely through message passing (Async Rust, Ch. 8). No locks, no shared Arc, no exposed fields. Every operation on the actor's state happens sequentially within the actor's message loop — concurrent safety is structural, not from locking.
The advantages: the actor's state is never accessed concurrently. There are no data races by construction. Testing is straightforward — send messages, check responses. Adding operations means adding enum variants, not adding lock guards.
The tradeoffs: all operations are async (each call involves a channel send and an await). If many callers need responses simultaneously, the actor is a serialization point. If the actor's work is CPU-intensive, it blocks its own message loop. Both are solvable — the first with multiple actors, the second with spawn_blocking inside the loop — but they require deliberate design.
Code Examples
Telemetry Frame Aggregator Actor
The aggregator is an actor: it owns the frame buffer exclusively, receives frames and control messages through a single channel, and responds to queries via embedded oneshot channels. No locks anywhere.
```rust
use tokio::sync::{mpsc, oneshot};
use std::collections::VecDeque;

const MAX_BUFFER: usize = 1000;

#[derive(Debug)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence: u64,
    payload: Vec<u8>,
}

enum AggregatorMsg {
    /// A new frame from an uplink session.
    Frame(TelemetryFrame),
    /// Request: how many frames are buffered?
    Depth { reply: oneshot::Sender<usize> },
    /// Drain the buffer and return all frames.
    Drain { reply: oneshot::Sender<Vec<TelemetryFrame>> },
}

async fn run_aggregator(mut rx: mpsc::Receiver<AggregatorMsg>) {
    let mut buffer: VecDeque<TelemetryFrame> = VecDeque::with_capacity(MAX_BUFFER);

    while let Some(msg) = rx.recv().await {
        match msg {
            AggregatorMsg::Frame(frame) => {
                if buffer.len() >= MAX_BUFFER {
                    tracing::warn!(
                        satellite_id = frame.satellite_id,
                        "buffer full, dropping oldest frame"
                    );
                    buffer.pop_front();
                }
                buffer.push_back(frame);
            }
            AggregatorMsg::Depth { reply } => {
                let _ = reply.send(buffer.len());
            }
            AggregatorMsg::Drain { reply } => {
                let frames: Vec<_> = buffer.drain(..).collect();
                let _ = reply.send(frames);
            }
        }
    }

    tracing::info!("aggregator: all senders dropped, shutting down");
}

/// A typed handle to the aggregator actor.
/// Hides the channel internals from callers.
#[derive(Clone)]
struct AggregatorHandle {
    tx: mpsc::Sender<AggregatorMsg>,
}

impl AggregatorHandle {
    fn spawn(capacity: usize) -> Self {
        let (tx, rx) = mpsc::channel(capacity);
        tokio::spawn(run_aggregator(rx));
        Self { tx }
    }

    async fn send_frame(&self, frame: TelemetryFrame) -> anyhow::Result<()> {
        self.tx.send(AggregatorMsg::Frame(frame)).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))
    }

    async fn depth(&self) -> anyhow::Result<usize> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(AggregatorMsg::Depth { reply: reply_tx }).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))?;
        reply_rx.await.map_err(|_| anyhow::anyhow!("aggregator dropped reply"))
    }

    async fn drain(&self) -> anyhow::Result<Vec<TelemetryFrame>> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(AggregatorMsg::Drain { reply: reply_tx }).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))?;
        reply_rx.await.map_err(|_| anyhow::anyhow!("aggregator dropped reply"))
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();

    let agg = AggregatorHandle::spawn(128);

    // Simulate 4 concurrent uplink sessions each sending 3 frames.
    let tasks: Vec<_> = (0..4u32).map(|sat_id| {
        let agg = agg.clone();
        tokio::spawn(async move {
            for seq in 0u64..3 {
                agg.send_frame(TelemetryFrame {
                    satellite_id: sat_id,
                    sequence: seq,
                    payload: vec![sat_id as u8; 64],
                }).await.unwrap();
            }
        })
    }).collect();

    for t in tasks {
        t.await.unwrap();
    }

    println!("buffered: {}", agg.depth().await?);
    let frames = agg.drain().await?;
    println!("drained {} frames", frames.len());

    Ok(())
}
```
The AggregatorHandle is the public API. Callers see send_frame, depth, and drain — they never interact with the channel directly. The handle is Clone, so it can be shared freely across tasks by cloning, with no Arc<Mutex<...>> needed.
Key Takeaways
- tokio::sync::mpsc::channel(capacity) creates a bounded channel. The capacity is the backpressure valve: send().await yields when the channel is full, preventing fast producers from overwhelming slow consumers.
- Sender<T> is Clone. Every clone is an independent producer on the same channel. The channel closes when all senders drop. Always drop the original sender after spawning cloned senders, or the consumer loop will block forever.
- try_send is the non-blocking variant. Use it when the producer should take an alternative action (drop, log, route to overflow) rather than yielding. Prefer send().await when backpressure is the correct response.
- oneshot is the single-message channel for request-response patterns. Embed the oneshot::Sender in the message to allow the receiver to reply exactly once. The Sender is consumed on send, so using it more than once is a compile error.
- The actor pattern, an async task that owns its state exclusively and receives all operations as messages, eliminates shared state and all associated locking. It is the structural pattern that emerges naturally from MPSC channels in async systems.
Lesson 2 — Broadcast and Watch Channels: Fan-Out Patterns for Telemetry Distribution
Module: Foundation — M03: Message Passing Patterns
Position: Lesson 2 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 6, 7
Context
MPSC channels move work from many producers to one consumer. The inverse problem is fan-out: distributing one event to many consumers. The Meridian control plane has two distinct fan-out requirements that call for different solutions.
The first: when a TLE catalog update arrives, every active uplink session needs to process it. Each session must see the update — no session should receive it twice, and no session should miss it. This is an event-distribution problem.
The second: the shutdown flag. Every task in the control plane needs to know when the system is shutting down, but they do not need to receive a separate "shutdown event" — they just need to be able to check the current value at any time. This is a state-distribution problem.
Tokio provides a dedicated primitive for each. broadcast solves event distribution: every subscriber receives every message. watch solves state distribution: subscribers observe the latest value and are notified when it changes.
Source: Async Rust, Chapters 6–7 (Flitton & Morton)
Core Concepts
tokio::sync::broadcast — Every Subscriber Gets Every Message
broadcast::channel(capacity) returns a (Sender<T>, Receiver<T>) pair. Additional receivers are created by calling sender.subscribe() — each receiver gets its own position in the channel and receives every message sent after the subscription point.
```rust
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (tx, _rx) = broadcast::channel::<String>(16);

    // Each session gets its own receiver.
    let mut session_a = tx.subscribe();
    let mut session_b = tx.subscribe();

    tx.send("TLE-UPDATE-2024-001".to_string()).unwrap();

    // Both sessions receive the same message independently.
    println!("A: {}", session_a.recv().await.unwrap());
    println!("B: {}", session_b.recv().await.unwrap());
}
```
Sender::send does not require await — it is synchronous. Messages are placed in a ring buffer; receivers read from their own position in that buffer.
The Lagged Error and What to Do With It
The broadcast channel has a fixed capacity ring buffer. If a slow receiver falls behind by more than capacity messages, it loses its position in the buffer. The next recv() call returns Err(RecvError::Lagged(n)), where n is the number of messages missed.
This is not a fatal error. The receiver continues to work: it missed n messages, and its next recv() resumes from the oldest message still held in the buffer. Whether missing messages is acceptable depends on the use case. For TLE catalog updates, a session that missed 3 updates can request a fresh fetch. For an audit log, missing messages is a compliance issue.
```rust
use tokio::sync::broadcast;

async fn session_loop(mut rx: broadcast::Receiver<Vec<u8>>) {
    loop {
        match rx.recv().await {
            Ok(frame) => {
                // Normal path.
                process_update(frame).await;
            }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                // Receiver fell behind: n messages were lost from this receiver's view.
                // Log the gap and continue; the next recv will succeed.
                tracing::warn!(missed = n, "session fell behind broadcast, requesting resync");
                request_catalog_resync().await;
            }
            Err(broadcast::error::RecvError::Closed) => {
                // All senders dropped: broadcast channel is done.
                tracing::info!("broadcast channel closed, session exiting");
                break;
            }
        }
    }
}

async fn process_update(_frame: Vec<u8>) {}
async fn request_catalog_resync() {}
```
Capacity sizing for broadcast is more sensitive than for MPSC. The slowest subscriber determines whether lagging occurs. If subscribers have variable processing speeds, size the capacity to accommodate the slowest realistic consumer under load, plus a safety margin.
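A back-of-the-envelope version of this sizing is sketched below. Everything in it is an assumption made for illustration: the function name `broadcast_capacity`, its parameters, and the model itself, which treats the required capacity as the number of events published while the slowest subscriber is stalled, padded by a safety margin.

```rust
use std::time::Duration;

/// Hypothetical sizing helper: events published during the worst
/// expected subscriber stall, plus a fractional safety margin.
fn broadcast_capacity(events_per_sec: f64, worst_stall: Duration, safety_margin: f64) -> usize {
    // Backlog that accumulates while the slowest subscriber is not reading.
    let backlog = events_per_sec * worst_stall.as_secs_f64();
    // Pad the backlog, round up, and never return zero.
    let padded = backlog * (1.0 + safety_margin);
    (padded.ceil() as usize).max(1)
}
```

With 20 events per second, a worst-case stall of 500 ms, and a 50% margin, the estimate is 15 slots. The stall figure should come from measurement under load rather than from a guess.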
tokio::sync::watch — Latest Value, Change Notification
watch::channel(initial_value) creates a single-value channel: the sender can update the value at any time, and receivers are notified when it changes. Receivers always see the latest value; intermediate values may be missed if the sender updates faster than the receiver reads.
```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel::<bool>(false);

    // Clone the receiver for multiple tasks.
    let mut rx2 = rx.clone();

    tokio::spawn(async move {
        // Wait for the value to change.
        rx2.changed().await.unwrap();
        println!("shutdown signal received");
    });

    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tx.send(true).unwrap();
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
}
```
watch::Receiver::borrow() returns the current value without waiting. changed().await waits for the next change and then lets you borrow() the new value. This is the pattern for config reloading: tasks watch for a config change, then read the new config with borrow().
watch is the correct primitive for the Meridian shutdown flag — much better than a broadcast channel. The shutdown event needs to be observed once by each task, and latecomers (tasks that check the flag after shutdown is signalled) need to see true immediately. A broadcast receiver created after the shutdown send would miss the message. A watch receiver always sees the current state.
Choosing Between mpsc, broadcast, and watch
| Pattern | Channel | Use when |
|---|---|---|
| Work queue: one item consumed once | mpsc | 48 sessions each send frames to one aggregator |
| Event broadcast: every subscriber gets every event | broadcast | TLE update delivered to all active sessions |
| State sync: subscribers need the latest value | watch | Shutdown flag, config updates, current orbital state |
| One-shot reply | oneshot | Request-response within an actor message |
The key question: does each message need to be consumed exactly once (mpsc), by every subscriber (broadcast), or is only the latest value relevant (watch)?
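The key question can be encoded as a small lookup, which is sometimes handy in design reviews. The enum and function names below are hypothetical and exist only to restate the table; nothing here is part of Tokio.

```rust
/// Hypothetical description of who must observe a message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Delivery {
    ExactlyOneConsumer,
    EverySubscriber,
    LatestValueOnly,
    SingleReply,
}

/// The four channel kinds from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChannelKind {
    Mpsc,
    Broadcast,
    Watch,
    Oneshot,
}

/// Map a delivery requirement to the channel the table recommends.
fn choose_channel(delivery: Delivery) -> ChannelKind {
    match delivery {
        Delivery::ExactlyOneConsumer => ChannelKind::Mpsc,
        Delivery::EverySubscriber => ChannelKind::Broadcast,
        Delivery::LatestValueOnly => ChannelKind::Watch,
        Delivery::SingleReply => ChannelKind::Oneshot,
    }
}
```

For example, a TLE update that every session must process maps to `choose_channel(Delivery::EverySubscriber)`, which returns `ChannelKind::Broadcast`.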
watch for Configuration Distribution
A common pattern in the Meridian control plane: runtime configuration loaded at startup and potentially reloaded via a management API. All tasks need to read the current config, and they need to be notified when it changes:
```rust
use tokio::sync::watch;
use std::sync::Arc;

#[derive(Clone, Debug)]
struct ControlPlaneConfig {
    max_frame_size: usize,
    session_timeout_secs: u64,
}

async fn uplink_session(
    satellite_id: u32,
    mut config_rx: watch::Receiver<Arc<ControlPlaneConfig>>,
) {
    loop {
        // Read current config without awaiting. borrow() holds a short
        // read lock, so clone the Arc and release the borrow right away.
        let config = config_rx.borrow().clone();

        tokio::select! {
            // Process frames using current config.
            _ = tokio::time::sleep(
                tokio::time::Duration::from_secs(config.session_timeout_secs)
            ) => {
                tracing::warn!(satellite_id, "session timeout");
                break;
            }
            // React to config changes mid-session.
            Ok(()) = config_rx.changed() => {
                let new_config = config_rx.borrow().clone();
                tracing::info!(
                    satellite_id,
                    max_frame = new_config.max_frame_size,
                    "config reloaded"
                );
                // Loop continues with new config.
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let initial = Arc::new(ControlPlaneConfig {
        max_frame_size: 65536,
        session_timeout_secs: 600,
    });
    let (config_tx, config_rx) = watch::channel(Arc::clone(&initial));

    // Spawn a few sessions.
    for sat_id in 0..3u32 {
        let rx = config_rx.clone();
        tokio::spawn(uplink_session(sat_id, rx));
    }

    // Simulate a config reload.
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
    config_tx.send(Arc::new(ControlPlaneConfig {
        max_frame_size: 32768,
        session_timeout_secs: 300,
    })).unwrap();

    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
}
```
Wrapping the config in an Arc means a task that copies the value out of borrow() clones only the pointer, not the full config struct. The Arc::clone is cheap (one atomic increment); the config data is shared read-only across tasks.
Code Examples
TLE Catalog Update Broadcaster
When the orbit data pipeline ingests a new TLE batch, it publishes the update over a broadcast channel. Every active session task receives the update and can refresh its orbital prediction model.
```rust
use tokio::sync::broadcast;
use std::sync::Arc;

#[derive(Clone, Debug)]
struct TleUpdate {
    batch_id: u32,
    records: Arc<Vec<String>>,
}

async fn session_task(
    satellite_id: u32,
    mut tle_rx: broadcast::Receiver<TleUpdate>,
    shutdown_rx: tokio::sync::watch::Receiver<bool>,
) {
    let mut shutdown = shutdown_rx.clone();

    loop {
        tokio::select! {
            result = tle_rx.recv() => {
                match result {
                    Ok(update) => {
                        tracing::info!(
                            satellite_id,
                            batch = update.batch_id,
                            records = update.records.len(),
                            "TLE update applied"
                        );
                    }
                    Err(broadcast::error::RecvError::Lagged(n)) => {
                        tracing::warn!(satellite_id, missed = n, "TLE lag, resyncing");
                    }
                    Err(broadcast::error::RecvError::Closed) => break,
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    break;
                }
            }
        }
    }

    tracing::info!(satellite_id, "session task exiting");
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    let (tle_tx, _) = broadcast::channel::<TleUpdate>(32);
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);

    // Spawn 4 sessions, each with its own broadcast receiver.
    for sat_id in 0..4u32 {
        let tle_rx = tle_tx.subscribe();
        let sd = shutdown_rx.clone();
        tokio::spawn(session_task(sat_id, tle_rx, sd));
    }

    // Publish a TLE update to all sessions.
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tle_tx.send(TleUpdate {
        batch_id: 42,
        records: Arc::new(vec!["1 25544U...".to_string(); 100]),
    }).unwrap();

    // Trigger shutdown.
    tokio::time::sleep(tokio::time::Duration::from_millis(20)).await;
    shutdown_tx.send(true).unwrap();

    tokio::time::sleep(tokio::time::Duration::from_millis(20)).await;
}
```
The combination of broadcast for events and watch for state is idiomatic Tokio. The broadcast channel delivers the catalog update to every session independently; the watch channel distributes the shutdown signal to all tasks simultaneously. The select! in the session loop races the two — whichever fires first wins.
Key Takeaways
- broadcast::channel(capacity) distributes every message to every subscriber. Subscribers receive from their own position in a ring buffer. Additional receivers come from sender.subscribe() (or Receiver::resubscribe() on an existing receiver); a receiver created after a message is sent does not receive that message retroactively.
- RecvError::Lagged(n) is recoverable. A lagged receiver missed n messages but can continue receiving future ones. Whether missing messages is acceptable is application-specific; always handle it explicitly rather than treating it as a fatal error.
- watch::channel(initial) is for state distribution: the latest value, not every intermediate value. borrow() reads without waiting. changed().await waits for the next update. Receivers created after a send see the current value immediately.
- Use broadcast when every subscriber must receive every event. Use watch when subscribers need the current state and can tolerate missing intermediate updates. Use mpsc when each message should be consumed by exactly one task.
- Arc<Config> wrapped in a watch channel is the idiomatic pattern for distributing read-heavy configuration to many tasks. The watch notify is cheap; the config read is a borrow() that never awaits and holds only a short read lock, so clone the Arc and drop the borrow promptly.
Lesson 3 — Fan-In Aggregation: Merging Streams from Multiple Satellite Feeds
Module: Foundation — M03: Message Passing Patterns
Position: Lesson 3 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 3, 8
Context
Lesson 1 covered moving data from many producers to one consumer via MPSC. That is fan-in at its simplest: all producers push to the same channel. But the Meridian aggregator's real requirements are more demanding. The 48 uplink sessions produce at different rates. Archived replay feeds produce at a different priority level than live feeds. A session that goes silent should not block the aggregator from processing the other 47. A priority command frame from a SAFE_MODE event should not wait behind a queue of housekeeping frames.
These requirements call for structured fan-in: merging multiple independent async sources into one stream, with control over priority, fairness, and behaviour when sources are slow or silent. This lesson covers three fan-in patterns — shared-sender MPSC, select!-based merge with priority, and the router actor pattern — and when to use each.
Source: Async Rust, Chapters 3 & 8 (Flitton & Morton)
Core Concepts
Shared-Sender Fan-In: The Simple Case
The simplest fan-in is the one already established in Lesson 1: clone the Sender, give each producer a clone, and let the single Receiver consume them all. Every message enters the same queue; the consumer sees them in arrival order.
```rust
use tokio::sync::mpsc;

async fn uplink_producer(satellite_id: u32, tx: mpsc::Sender<(u32, Vec<u8>)>) {
    for seq in 0u8..5 {
        let frame = vec![satellite_id as u8, seq];
        if tx.send((satellite_id, frame)).await.is_err() {
            break; // Aggregator shut down.
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u32, Vec<u8>)>(256);

    for sat_id in 0..4u32 {
        tokio::spawn(uplink_producer(sat_id, tx.clone()));
    }
    drop(tx);

    while let Some((sat, frame)) = rx.recv().await {
        println!("sat {sat}: {:?}", frame);
    }
}
```
This is correct and efficient for uniform, same-priority inputs. It has one limitation: arrival order provides no priority control. A SAFE_MODE frame from satellite 7 waits behind whatever housekeeping frames arrived first.
select!-Based Priority Fan-In
When sources have different priorities, select! can implement a priority receive by always checking a high-priority channel before a lower-priority one. Tokio's select! macro randomly selects among ready branches for fairness, but the biased modifier overrides this and evaluates branches in source order:
```rust
use tokio::sync::mpsc;

async fn priority_aggregator(
    mut high: mpsc::Receiver<Vec<u8>>,
    mut low: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        // biased: always check high-priority first.
        // Without biased, both channels are polled in random order, so
        // low-priority frames could be dispatched before high-priority ones
        // if both are ready simultaneously.
        tokio::select! {
            biased;
            Some(frame) = high.recv() => {
                println!("HIGH: {} bytes", frame.len());
            }
            Some(frame) = low.recv() => {
                println!("LOW: {} bytes", frame.len());
            }
            else => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (high_tx, high_rx) = mpsc::channel::<Vec<u8>>(64);
    let (low_tx, low_rx) = mpsc::channel::<Vec<u8>>(256);

    // High-priority: SAFE_MODE and emergency commands.
    tokio::spawn(async move {
        high_tx.send(vec![0xFF; 8]).await.unwrap(); // emergency frame
    });

    // Low-priority: housekeeping telemetry.
    tokio::spawn(async move {
        for _ in 0..3 {
            low_tx.send(vec![0x00; 64]).await.unwrap();
        }
    });

    priority_aggregator(high_rx, low_rx).await;
}
```
biased is important here. Without it, if both channels have messages ready, select! randomly picks which to process — a high-priority frame could wait behind three low-priority frames. With biased, the high-priority channel is always drained first. The tradeoff: if the high-priority channel receives messages faster than they are processed, the low-priority channel is starved. For mission-critical applications like SAFE_MODE injection, this is the intended behaviour.
This pattern directly implements what Async Rust Chapter 3 builds when constructing a priority async queue with HIGH_CHANNEL and LOW_CHANNEL — the concept is the same, applied to async channels rather than thread queues.
Tagging Messages with Source Identity
When fan-in merges undifferentiated Vec<u8> frames from multiple sources, the consumer cannot determine which satellite the frame came from. Tag messages at the producer side with an enum or a source identifier:
```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum FeedKind {
    LiveUplink { satellite_id: u32 },
    ArchivedReplay { mission_id: String },
}

#[derive(Debug)]
struct TaggedFrame {
    source: FeedKind,
    sequence: u64,
    payload: Vec<u8>,
}

async fn live_uplink(sat_id: u32, tx: mpsc::Sender<TaggedFrame>) {
    for seq in 0u64..3 {
        let _ = tx.send(TaggedFrame {
            source: FeedKind::LiveUplink { satellite_id: sat_id },
            sequence: seq,
            payload: vec![sat_id as u8; 32],
        }).await;
    }
}

async fn replay_feed(mission: String, tx: mpsc::Sender<TaggedFrame>) {
    for seq in 0u64..2 {
        let _ = tx.send(TaggedFrame {
            source: FeedKind::ArchivedReplay { mission_id: mission.clone() },
            sequence: seq,
            payload: vec![0xAA; 128],
        }).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<TaggedFrame>(128);

    for sat_id in 0..3u32 {
        tokio::spawn(live_uplink(sat_id, tx.clone()));
    }
    tokio::spawn(replay_feed("ARTEMIS-IV".to_string(), tx.clone()));
    drop(tx);

    while let Some(frame) = rx.recv().await {
        match &frame.source {
            FeedKind::LiveUplink { satellite_id } => {
                println!("live sat {satellite_id} seq {}: {} bytes",
                    frame.sequence, frame.payload.len());
            }
            FeedKind::ArchivedReplay { mission_id } => {
                println!("replay {mission_id} seq {}: {} bytes",
                    frame.sequence, frame.payload.len());
            }
        }
    }
}
```
Using an enum for source identity is more robust than a raw integer: the compiler enforces that all source types are handled. When a new source type is added, match exhaustiveness forces updates at all handling sites.
The Router Actor Pattern
For more than two or three sources, or when sources are created dynamically (e.g., a new ground station connection comes online mid-session), a router actor is the correct abstraction. The router owns a set of active input channels, polls them all, and forwards to a single output channel. This is the pattern Async Rust Chapter 8 builds as the foundation of its actor system.
    use tokio::sync::mpsc;
    use std::collections::HashMap;

    #[derive(Debug)]
    struct TaggedFrame {
        source_id: u32,
        payload: Vec<u8>,
    }

    enum RouterMsg {
        /// Register a new uplink feed.
        AddFeed { source_id: u32, feed: mpsc::Receiver<Vec<u8>> },
        /// Remove an uplink feed (session ended).
        RemoveFeed { source_id: u32 },
    }

    async fn router_actor(
        mut ctrl: mpsc::Receiver<RouterMsg>,
        out: mpsc::Sender<TaggedFrame>,
    ) {
        // Tokio's mpsc doesn't provide a built-in multi-receiver select,
        // so we use a secondary MPSC where all feeds forward their frames.
        let (internal_tx, mut internal_rx) = mpsc::channel::<TaggedFrame>(512);
        let mut feed_handles: HashMap<u32, tokio::task::JoinHandle<()>> = HashMap::new();

        loop {
            tokio::select! {
                // Control messages: add or remove feeds.
                Some(msg) = ctrl.recv() => {
                    match msg {
                        RouterMsg::AddFeed { source_id, mut feed } => {
                            let fwd_tx = internal_tx.clone();
                            let handle = tokio::spawn(async move {
                                while let Some(payload) = feed.recv().await {
                                    if fwd_tx.send(TaggedFrame { source_id, payload }).await.is_err() {
                                        break; // Router shut down.
                                    }
                                }
                                tracing::debug!(source_id, "feed task exiting");
                            });
                            feed_handles.insert(source_id, handle);
                        }
                        RouterMsg::RemoveFeed { source_id } => {
                            if let Some(handle) = feed_handles.remove(&source_id) {
                                handle.abort(); // Feed task no longer needed.
                            }
                        }
                    }
                }
                // Frames from all registered feeds, already fan-in'ed via internal channel.
                Some(frame) = internal_rx.recv() => {
                    if out.send(frame).await.is_err() {
                        break; // Downstream consumer has shut down.
                    }
                }
                else => break,
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (ctrl_tx, ctrl_rx) = mpsc::channel::<RouterMsg>(8);
        let (out_tx, mut out_rx) = mpsc::channel::<TaggedFrame>(256);

        tokio::spawn(router_actor(ctrl_rx, out_tx));

        // Register two satellite feeds dynamically.
        for sat_id in [25544u32, 48274] {
            let (feed_tx, feed_rx) = mpsc::channel::<Vec<u8>>(32);
            ctrl_tx.send(RouterMsg::AddFeed {
                source_id: sat_id,
                feed: feed_rx,
            }).await.unwrap();

            tokio::spawn(async move {
                for i in 0u8..3 {
                    feed_tx.send(vec![i; 16]).await.unwrap();
                }
            });
        }
        drop(ctrl_tx);

        tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;

        let mut count = 0;
        while let Ok(frame) = tokio::time::timeout(
            tokio::time::Duration::from_millis(20),
            out_rx.recv()
        ).await {
            if let Some(f) = frame {
                println!("sat {}: {} bytes", f.source_id, f.payload.len());
                count += 1;
            } else {
                break;
            }
        }
        println!("total frames: {count}");
    }
Each registered feed gets a dedicated forwarding task that moves frames to the router's internal channel. The router selects between control messages (add/remove feeds) and forwarded frames. Adding a new satellite source at runtime is a single ctrl_tx.send(RouterMsg::AddFeed {...}) call — no restructuring of the select loop.
Key Takeaways
- Shared-sender MPSC is the simplest fan-in: all producers clone the Sender, and the consumer reads from the single Receiver. Use it when sources have equal priority and arrival order is acceptable.
- select! with biased implements priority fan-in: the first branch is always evaluated before the second. Use it for two or three sources with different priority levels. Without biased, select! randomizes branch selection, so a high-priority source is not guaranteed to be drained first when both are ready.
- Tag messages at the source with a typed identifier (enum or struct field) rather than relying on arrival order to infer provenance. An enum exhaustiveness check at the consumer forces all source types to be handled explicitly.
- The router actor pattern handles dynamic fan-in: sources can be registered and deregistered at runtime via control messages. Each source gets a dedicated forwarding task that converts its Receiver into tagged frames on the internal channel. The router selects between control and data messages.
- Fan-in and fan-out compose: an aggregator can receive from a router (fan-in) and forward to a broadcast channel (fan-out), building a full hub-and-spoke telemetry pipeline from these primitives.
Project — Multi-Source Telemetry Aggregator
Module: Foundation — M03: Message Passing Patterns
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- System Specification
- Expected Output
- Acceptance Criteria
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0047 — Telemetry Aggregation Pipeline
The control plane currently receives telemetry from 48 LEO satellite uplinks and two archived replay feeds simultaneously during mission replay operations. Each source produces frames at independent rates. Emergency command frames from any source must be processed before routine telemetry. Downstream analytics consumers need every frame; a monitoring dashboard needs only the latest pipeline statistics.
Your task is to build the telemetry aggregation pipeline that connects these sources to their consumers. The pipeline must: fan-in all sources into a priority-ordered stream, fan-out to a downstream frame processor and to a monitoring dashboard, apply backpressure so fast sources cannot overwhelm the processor, and shut down cleanly when signalled.
System Specification
Frame Types
    #[derive(Debug, Clone)]
    pub enum FramePriority {
        Emergency, // SAFE_MODE, ABORT commands
        Routine,   // Standard telemetry
    }

    #[derive(Debug, Clone)]
    pub struct Frame {
        pub source_id: u32,
        pub source_kind: SourceKind,
        pub priority: FramePriority,
        pub sequence: u64,
        pub payload: Vec<u8>,
    }

    #[derive(Debug, Clone)]
    pub enum SourceKind {
        LiveUplink,
        ArchivedReplay,
    }
Pipeline Architecture
    [Uplink 0..48] ──┐
                     ├─► [Router Actor] ─► [Priority Fan-In] ─► [Frame Processor]
    [Replay 0..2]  ──┘                                                  │
                                                                        ▼
                                                            [Broadcast: all frames]
                                                                 │             │
                                                            [Dashboard]    [Archive]

    [watch: shutdown] ──────────────────────────────────────► All tasks
    [watch: stats]    ◄──────────────────── Frame Processor (updates atomically)
Behavioural Requirements
Fan-in: Frames from live uplinks and archived replays are merged via a router actor that supports dynamic source registration. Emergency frames must be prioritised over routine frames when both are available simultaneously.
Backpressure: The frame processor has a bounded input channel (capacity 64). When the processor is saturated, backpressure propagates up to the priority fan-in, which in turn applies pressure to the router's internal channel. Routine sources are slowed; emergency frames still make progress due to priority ordering.
Fan-out: Every processed frame is sent over a broadcast channel to all downstream consumers. The monitoring dashboard subscribes; an archive writer task subscribes. The dashboard is allowed to lag and handles RecvError::Lagged gracefully.
Stats: The pipeline maintains three AtomicU64 counters: frames_routed, frames_processed, emergency_count. These are exposed via a watch channel as a PipelineStats snapshot, updated by the frame processor after each frame.
Shutdown: A watch<bool> shutdown signal is distributed to all tasks. On signal: (1) stop accepting new frames from sources, (2) drain the priority fan-in channel, (3) close the broadcast channel, (4) all tasks exit within 5 seconds.
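The backpressure requirement above can be illustrated without Tokio. The following is a minimal std-only sketch, assuming a synchronous analogue is acceptable for intuition: std::sync::mpsc::sync_channel stands in for the bounded tokio::sync::mpsc channel, and a blocking send stands in for send().await. The function name run_pipeline and the capacity of 4 are illustrative, not part of the specification.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Pushes `total` frames through a channel of capacity `cap` and
/// returns the sequence numbers the consumer saw, in arrival order.
fn run_pipeline(total: u64, cap: usize) -> Vec<u64> {
    let (tx, rx) = sync_channel::<u64>(cap);

    let producer = thread::spawn(move || {
        for seq in 0..total {
            // Blocks while the channel already holds `cap` frames.
            // That wait is the backpressure: the producer slows down
            // instead of dropping or buffering without bound.
            tx.send(seq).expect("consumer hung up");
        }
        // `tx` is dropped here, which ends the consumer's iteration.
    });

    // The consumer drains at its own pace; nothing is lost if it is slow.
    let received: Vec<u64> = rx.iter().collect();
    producer.join().expect("producer panicked");
    received
}

fn main() {
    let received = run_pipeline(1_000, 4);
    println!("received {} frames through a 4-slot channel", received.len());
}
```

The point of the sketch is the invariant, not the timing: however slow the consumer is, every frame arrives and in order, which is what acceptance criterion 3 asks you to verify for the async pipeline.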
Expected Output
A binary that:
- Starts a router actor accepting dynamic source registration
- Registers 4 live uplink sources (each sending 10 frames) and 1 replay source (sending 5 frames)
- Marks 2 of the frames from each live uplink source as Emergency
- Runs a frame processor that logs each frame with its priority and source
- Runs a monitoring task that reads watch<PipelineStats> every 50ms and prints stats
- Runs a downstream archive task subscribed to the broadcast channel
- Sends shutdown signal after all sources finish; all tasks exit cleanly
The output should clearly show emergency frames being processed before routine frames from the same batch.
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Emergency frames processed before queued routine frames from the same source | Yes — log order |
| 2 | New sources can be registered at runtime via the router control channel | Yes — sources registered mid-run |
| 3 | Frame processor channel capacity is enforced — producers yield when full | Yes — add tokio::time::sleep in processor and verify producers do not drop frames |
| 4 | All downstream consumers receive every processed frame via broadcast | Yes — counts match between processor and archive consumer |
| 5 | Stats watch channel provides latest snapshot without acquiring any lock | Yes — code review: only atomic loads in stats read path |
| 6 | Shutdown drains the fan-in channel before exiting | Yes — no frames lost after shutdown signal |
| 7 | Lagged broadcast receivers log a warning and continue — they do not crash | Yes — introduce a slow archive task and verify Lagged is handled |
Hints
Hint 1 — Priority fan-in with biased select!
Use two channels from the router: one for emergency frames, one for routine. The priority fan-in selects with biased:
    async fn priority_fan_in(
        mut emergency_rx: tokio::sync::mpsc::Receiver<Frame>,
        mut routine_rx: tokio::sync::mpsc::Receiver<Frame>,
        out_tx: tokio::sync::mpsc::Sender<Frame>,
        mut shutdown: tokio::sync::watch::Receiver<bool>,
    ) {
        loop {
            tokio::select! {
                biased;
                Some(f) = emergency_rx.recv() => {
                    if out_tx.send(f).await.is_err() { break; }
                }
                Some(f) = routine_rx.recv() => {
                    if out_tx.send(f).await.is_err() { break; }
                }
                Ok(()) = shutdown.changed() => {
                    if *shutdown.borrow() { break; }
                }
                else => break,
            }
        }
    }
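The ordering rule in this hint can be checked in isolation. Below is a minimal std-only sketch of the same idea, assuming a synchronous analogue is good enough to build intuition: try_recv on the emergency queue first, then on the routine queue, mirrors the branch order that biased enforces. The names Priority, next_frame, and demo_order are hypothetical and not part of the project API.

```rust
use std::sync::mpsc::{channel, Receiver};

#[derive(Debug, PartialEq)]
enum Priority {
    Emergency,
    Routine,
}

/// Returns the next frame, always checking the emergency queue first.
/// `None` means neither queue has a frame ready right now.
fn next_frame(emergency: &Receiver<u32>, routine: &Receiver<u32>) -> Option<(Priority, u32)> {
    if let Ok(seq) = emergency.try_recv() {
        return Some((Priority::Emergency, seq));
    }
    routine.try_recv().ok().map(|seq| (Priority::Routine, seq))
}

/// Queues two routine frames, then one emergency frame, and returns
/// the order in which `next_frame` hands them out.
fn demo_order() -> Vec<(Priority, u32)> {
    let (emergency_tx, emergency_rx) = channel::<u32>();
    let (routine_tx, routine_rx) = channel::<u32>();

    routine_tx.send(1).unwrap();
    routine_tx.send(2).unwrap();
    emergency_tx.send(100).unwrap(); // arrives last, served first

    let mut order = Vec::new();
    while let Some(item) = next_frame(&emergency_rx, &routine_rx) {
        order.push(item);
    }
    order
}

fn main() {
    for (priority, seq) in demo_order() {
        println!("{priority:?} frame {seq}");
    }
}
```

As with biased, the guarantee only covers frames that are already queued when the choice is made; an emergency frame that arrives while a routine frame is being forwarded still waits for that one send to finish.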
Hint 2 — Router actor with dynamic registration
The router forwards all sources to two internal channels split by priority. Each registered source gets a forwarding task:
    enum RouterMsg {
        AddSource {
            source_id: u32,
            source_kind: SourceKind,
            feed: tokio::sync::mpsc::Receiver<Frame>,
        },
        RemoveSource { source_id: u32 },
    }
The forwarding task reads from the feed and sends to the appropriate internal channel based on frame.priority.
Hint 3 — Stats snapshot with watch + atomics
The frame processor updates atomic counters after each frame, then sends a snapshot to the watch channel:
    use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
    use std::sync::Arc;

    #[derive(Clone, Debug, Default)]
    pub struct PipelineStats {
        pub frames_processed: u64,
        pub emergency_count: u64,
    }

    struct StatsTracker {
        frames_processed: AtomicU64,
        emergency_count: AtomicU64,
        tx: tokio::sync::watch::Sender<PipelineStats>,
    }

    impl StatsTracker {
        fn record(&self, is_emergency: bool) {
            self.frames_processed.fetch_add(1, Relaxed);
            if is_emergency {
                self.emergency_count.fetch_add(1, Relaxed);
            }
            // Publish a snapshot: receivers always see the latest.
            let _ = self.tx.send(PipelineStats {
                frames_processed: self.frames_processed.load(Relaxed),
                emergency_count: self.emergency_count.load(Relaxed),
            });
        }
    }
Hint 4 — Broadcast fan-out with lagged handling
    async fn archive_consumer(
        mut rx: tokio::sync::broadcast::Receiver<Frame>,
    ) {
        let mut archived = 0u64;
        loop {
            match rx.recv().await {
                Ok(frame) => {
                    archived += 1;
                    tracing::debug!(
                        source = frame.source_id,
                        seq = frame.sequence,
                        "archived"
                    );
                }
                Err(tokio::sync::broadcast::error::RecvError::Lagged(n)) => {
                    // Archive fell behind: note the gap and continue.
                    tracing::warn!(missed = n, "archive lagged");
                }
                Err(tokio::sync::broadcast::error::RecvError::Closed) => {
                    tracing::info!(total = archived, "archive consumer done");
                    break;
                }
            }
        }
    }
Reference Implementation
Reveal reference implementation
    // This reference implementation is intentionally condensed.
    // A production implementation would split into modules.
    use tokio::sync::{broadcast, mpsc, watch};
    use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
    use std::sync::Arc;
    use std::collections::HashMap;
    use tokio::time::{sleep, Duration};

    #[derive(Debug, Clone)]
    pub enum FramePriority { Emergency, Routine }

    #[derive(Debug, Clone)]
    pub enum SourceKind { LiveUplink, ArchivedReplay }

    #[derive(Debug, Clone)]
    pub struct Frame {
        pub source_id: u32,
        pub source_kind: SourceKind,
        pub priority: FramePriority,
        pub sequence: u64,
        pub payload: Vec<u8>,
    }

    #[derive(Clone, Debug, Default)]
    pub struct PipelineStats {
        pub frames_processed: u64,
        pub emergency_count: u64,
    }

    enum RouterMsg {
        AddSource {
            source_id: u32,
            feed: mpsc::Receiver<Frame>,
        },
    }

    async fn router_actor(
        mut ctrl: mpsc::Receiver<RouterMsg>,
        emergency_tx: mpsc::Sender<Frame>,
        routine_tx: mpsc::Sender<Frame>,
    ) {
        let (internal_tx, mut internal_rx) = mpsc::channel::<Frame>(512);
        let mut handles: HashMap<u32, tokio::task::JoinHandle<()>> = HashMap::new();

        loop {
            tokio::select! {
                Some(msg) = ctrl.recv() => {
                    match msg {
                        RouterMsg::AddSource { source_id, mut feed } => {
                            let fwd = internal_tx.clone();
                            let h = tokio::spawn(async move {
                                while let Some(frame) = feed.recv().await {
                                    if fwd.send(frame).await.is_err() { break; }
                                }
                            });
                            handles.insert(source_id, h);
                        }
                    }
                }
                Some(frame) = internal_rx.recv() => {
                    // Split by priority so the fan-in stage can prefer emergencies.
                    let dest = match frame.priority {
                        FramePriority::Emergency => &emergency_tx,
                        FramePriority::Routine => &routine_tx,
                    };
                    if dest.send(frame).await.is_err() { break; }
                }
                else => break,
            }
        }
    }

    async fn priority_fan_in(
        mut emerg_rx: mpsc::Receiver<Frame>,
        mut routine_rx: mpsc::Receiver<Frame>,
        out_tx: mpsc::Sender<Frame>,
        mut shutdown: watch::Receiver<bool>,
    ) {
        loop {
            tokio::select! {
                biased;
                Some(f) = emerg_rx.recv() => {
                    if out_tx.send(f).await.is_err() { break; }
                }
                Some(f) = routine_rx.recv() => {
                    if out_tx.send(f).await.is_err() { break; }
                }
                Ok(()) = shutdown.changed() => {
                    if *shutdown.borrow() { break; }
                }
                else => break,
            }
        }
    }

    async fn frame_processor(
        mut rx: mpsc::Receiver<Frame>,
        bcast_tx: broadcast::Sender<Frame>,
        stats_tx: watch::Sender<PipelineStats>,
        processed: Arc<AtomicU64>,
        emergency: Arc<AtomicU64>,
    ) {
        while let Some(frame) = rx.recv().await {
            let is_emerg = matches!(frame.priority, FramePriority::Emergency);
            tracing::info!(
                source = frame.source_id,
                seq = frame.sequence,
                priority = if is_emerg { "EMERGENCY" } else { "routine" },
                "processed"
            );
            processed.fetch_add(1, Relaxed);
            if is_emerg { emergency.fetch_add(1, Relaxed); }
            let _ = stats_tx.send(PipelineStats {
                frames_processed: processed.load(Relaxed),
                emergency_count: emergency.load(Relaxed),
            });
            let _ = bcast_tx.send(frame);
        }
    }

    async fn archive_consumer(mut rx: broadcast::Receiver<Frame>) {
        let mut count = 0u64;
        loop {
            match rx.recv().await {
                Ok(_) => count += 1,
                Err(broadcast::error::RecvError::Lagged(n)) => {
                    tracing::warn!(missed = n, "archive lagged");
                }
                Err(broadcast::error::RecvError::Closed) => {
                    tracing::info!(total = count, "archive done");
                    break;
                }
            }
        }
    }

    #[tokio::main]
    async fn main() {
        tracing_subscriber::fmt::init();

        let (shutdown_tx, shutdown_rx) = watch::channel(false);
        let (stats_tx, mut stats_rx) = watch::channel(PipelineStats::default());
        let (bcast_tx, _) = broadcast::channel::<Frame>(128);
        let (ctrl_tx, ctrl_rx) = mpsc::channel::<RouterMsg>(8);
        let (emerg_tx, emerg_rx) = mpsc::channel::<Frame>(64);
        let (routine_tx, routine_rx) = mpsc::channel::<Frame>(256);
        let (proc_tx, proc_rx) = mpsc::channel::<Frame>(64);

        let processed = Arc::new(AtomicU64::new(0));
        let emergency = Arc::new(AtomicU64::new(0));

        // Start pipeline tasks.
        tokio::spawn(router_actor(ctrl_rx, emerg_tx, routine_tx));
        tokio::spawn(priority_fan_in(emerg_rx, routine_rx, proc_tx, shutdown_rx.clone()));
        tokio::spawn(frame_processor(
            proc_rx,
            bcast_tx.clone(),
            stats_tx,
            Arc::clone(&processed),
            Arc::clone(&emergency),
        ));
        tokio::spawn(archive_consumer(bcast_tx.subscribe()));

        // Register 4 live uplink sources. The replay source from the brief is
        // omitted in this condensed version; it registers the same way with
        // SourceKind::ArchivedReplay.
        for sat_id in 0..4u32 {
            let (feed_tx, feed_rx) = mpsc::channel::<Frame>(32);
            ctrl_tx.send(RouterMsg::AddSource { source_id: sat_id, feed: feed_rx }).await.unwrap();
            tokio::spawn(async move {
                for seq in 0u64..10 {
                    // The first two frames from each source are emergencies.
                    let priority = if seq < 2 {
                        FramePriority::Emergency
                    } else {
                        FramePriority::Routine
                    };
                    feed_tx.send(Frame {
                        source_id: sat_id,
                        source_kind: SourceKind::LiveUplink,
                        priority,
                        sequence: seq,
                        payload: vec![sat_id as u8; 32],
                    }).await.unwrap();
                    sleep(Duration::from_millis(5)).await;
                }
            });
        }

        // Stats monitor.
        tokio::spawn(async move {
            for _ in 0..4 {
                sleep(Duration::from_millis(50)).await;
                stats_rx.changed().await.ok();
                let s = stats_rx.borrow().clone();
                println!("stats: processed={} emergency={}", s.frames_processed, s.emergency_count);
            }
        });

        sleep(Duration::from_millis(300)).await;
        println!("sending shutdown");
        shutdown_tx.send(true).unwrap();
        sleep(Duration::from_millis(100)).await;

        println!("final: processed={} emergency={}",
            processed.load(Relaxed), emergency.load(Relaxed));
    }
Reflection
This project assembles the full message-passing toolkit from Module 3. The router actor provides dynamic fan-in with independent source lifecycle management. The priority fan-in ensures queued emergency frames are taken ahead of queued routine traffic. The broadcast channel distributes every processed frame to all downstream consumers. The watch channel distributes state (shutdown signal and pipeline stats) without requiring consumers to manage a lock of their own; a watch borrow holds only a short-lived internal read lock.
The pattern here — router → priority queue → processor → broadcast — recurs throughout Meridian's data pipeline architecture. In Module 4 (Network Programming), the router actor gains TCP listener integration, turning it into a full ground station connection broker.
Module 04 — Network Programming
Track: Foundation — Mission Control Platform
Position: Module 4 of 6
Source material: Tokio tutorial I/O and Framing chapters; reqwest documentation; tokio::net API docs
Quiz pass threshold: 70% on all three lessons to unlock the project
Note on source book: Network Programming with Rust (Chanda, 2018) uses pre-async/await Tokio 0.1 APIs that are incompatible with current Tokio 1.x. Lesson content is grounded in the current Tokio tutorial and API documentation rather than that book.
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Ground Station Network Client
- Prerequisites
- What Comes Next
Mission Context
The Meridian control plane's telemetry pipeline now has a complete message-passing architecture (Module 3). What it still lacks is the network layer: the actual TCP connections from ground stations that feed the pipeline. This module builds that layer — connecting the abstract pipeline to the physical network.
The control plane operates three distinct network protocols simultaneously: persistent TCP sessions with ground stations (framed, long-lived, must reconnect on failure), UDP datagrams from SDA radar and optical sensors (high-frequency, latency-sensitive, loss-tolerant), and outbound HTTP calls to the TLE catalog API and mission operations endpoints (request-response, with retry logic).
What You Will Learn
By the end of this module you will be able to:
- Build async TCP servers with tokio::net::TcpListener, spawn per-connection tasks, handle EOF correctly, and shut down the accept loop cleanly via a watch channel shutdown signal
- Use AsyncReadExt::read_exact for length-prefix framing, split sockets with TcpStream::split() and into_split() for concurrent read/write, and wrap writers in BufWriter to reduce syscall overhead
- Add per-session timeouts to detect silent connections (antenna tracking failures, network blackouts) without leaving ghost sessions open
- Bind and use tokio::net::UdpSocket in both connected and unconnected modes, understand why UDP receive buffers must be sized to the maximum datagram, and apply try_send rather than blocking in high-frequency sensor pipelines
- Build a production reqwest::Client with appropriate timeout configuration, share it via Clone across async tasks, use error_for_status() correctly, and implement exponential backoff retry logic that distinguishes retryable server errors from non-retryable client errors
Lessons
Lesson 1 — TCP Servers with tokio::net: Listeners, Connection Handling, and Graceful Shutdown
Covers TcpListener::bind and the accept loop, AsyncReadExt/AsyncWriteExt extension traits, read_exact for framing, EOF handling, TcpStream::split() vs into_split(), BufWriter for write batching, read timeouts, and graceful shutdown of both the accept loop and individual connections.
Key question this lesson answers: How do you build a TCP server that handles many concurrent connections correctly — reading frames, handling EOF, splitting for bidirectional I/O, and shutting down cleanly?
→ lesson-01-tcp-servers.md / lesson-01-quiz.toml
Lesson 2 — UDP and Datagram Protocols: Low-Latency Sensor Data
Covers UdpSocket::bind, recv_from/send_to semantics, connected vs unconnected mode, concurrent send/receive via Arc<UdpSocket>, buffer sizing and IP fragmentation, OS socket buffer tuning with socket2, and the decision between UDP and TCP for high-frequency sensor pipelines.
Key question this lesson answers: When does UDP's lack of ordering and reliability become an advantage, and how do you structure a receiver that does not block on a slow downstream consumer?
→ lesson-02-udp.md / lesson-02-quiz.toml
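As a preview of the non-blocking receiver structure this lesson asks about, here is a minimal std-only sketch. It assumes std::sync::mpsc::sync_channel as a stand-in for the bounded Tokio channel the lesson uses, and it assumes that dropping the newest sample is acceptable for loss-tolerant sensor data. ForwardStats and forward_all are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

#[derive(Debug, Default, PartialEq)]
struct ForwardStats {
    forwarded: u64,
    dropped: u64,
}

/// Offers each datagram to the consumer without ever blocking the receive loop.
fn forward_all(datagrams: &[Vec<u8>], tx: &SyncSender<Vec<u8>>) -> ForwardStats {
    let mut stats = ForwardStats::default();
    for datagram in datagrams {
        match tx.try_send(datagram.clone()) {
            Ok(()) => stats.forwarded += 1,
            // Consumer is behind: drop this sample and count it.
            // A fresher reading will follow shortly.
            Err(TrySendError::Full(_)) => stats.dropped += 1,
            // Consumer is gone: nothing more can be delivered.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    stats
}

fn main() {
    // Capacity 2, and the consumer (`_rx`) never drains during the burst.
    let (tx, _rx) = sync_channel::<Vec<u8>>(2);
    let burst: Vec<Vec<u8>> = (0..5u8).map(|i| vec![i; 8]).collect();
    let stats = forward_all(&burst, &tx);
    println!("forwarded={} dropped={}", stats.forwarded, stats.dropped);
}
```

Counting drops instead of blocking keeps the socket read loop running at line rate; the drop counter is what you would export as a metric to notice a chronically slow consumer.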
Lesson 3 — HTTP Clients with reqwest: Async REST Calls
Covers reqwest::Client construction and sharing, ClientBuilder timeout configuration, error_for_status(), .json() for serialization/deserialization, retry logic with exponential backoff and jitter, status-code-based retry decisions, and multiple clients for services with different SLOs.
Key question this lesson answers: How do you build a robust HTTP client that handles transient failures without hammering a rate-limited API, and correctly distinguishes retryable errors from permanent ones?
→ lesson-03-http-clients.md / lesson-03-quiz.toml
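The status-code-based retry decision can be sketched as a pure function before any HTTP client is involved. This is a minimal illustration with stated assumptions, not the lesson's implementation: it treats every 5xx as retryable, and it additionally treats 408 and 429 as retryable after a delay. RetryDecision and classify are hypothetical names.

```rust
/// What the caller should do after receiving a response with this status.
#[derive(Debug, PartialEq)]
enum RetryDecision {
    Success,
    Retry,
    GiveUp,
}

fn classify(status: u16) -> RetryDecision {
    match status {
        200..=299 => RetryDecision::Success,
        // Assumption: request timeouts and rate limiting are worth
        // retrying, but only after backing off.
        408 | 429 => RetryDecision::Retry,
        // Server-side failures are usually transient.
        500..=599 => RetryDecision::Retry,
        // Other client errors will not improve by resending the same request.
        _ => RetryDecision::GiveUp,
    }
}

fn main() {
    for status in [200u16, 404, 429, 503] {
        println!("{status} -> {:?}", classify(status));
    }
}
```

Keeping the decision separate from the retry loop makes it easy to unit test and to adjust per service, for example if a particular endpoint should never be retried on 500.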
Capstone Project — Ground Station Network Client
Build the full ground station client: connects to a TCP endpoint using the length-prefix framing protocol, automatically reconnects on failure with exponential backoff, runs a background TLE refresh via HTTP with retry logic, forwards received frames to the downstream aggregator pipeline via try_send, publishes session state via a watch channel, and shuts down cleanly including a GOODBYE frame to the peer.
Acceptance is against 7 verifiable criteria including automatic reconnection, bounded backoff, 5-minute failure timeout, TLE retry, non-blocking frame forwarding, mid-frame shutdown safety, and state machine correctness.
→ project-gs-client.md
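The bounded backoff and failure-timeout criteria can be explored with plain arithmetic first. The sketch below is one possible shape under assumed parameters: a 1-second base delay, a 30-second cap, and a 5-minute give-up window (only the 5-minute figure comes from the acceptance criteria). Jitter is left out so the schedule is deterministic. backoff_delay and reconnect_schedule are hypothetical names.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // checked_pow / checked_mul keep large attempt counts from overflowing.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}

/// The delays that fit before the cumulative wait would exceed `give_up_after`.
/// Assumes `base` is non-zero.
fn reconnect_schedule(base: Duration, cap: Duration, give_up_after: Duration) -> Vec<Duration> {
    let mut schedule = Vec::new();
    let mut waited = Duration::ZERO;
    for attempt in 0u32.. {
        let delay = backoff_delay(attempt, base, cap);
        if waited + delay > give_up_after {
            break; // The next wait would cross the failure timeout.
        }
        waited += delay;
        schedule.push(delay);
    }
    schedule
}

fn main() {
    let schedule = reconnect_schedule(
        Duration::from_secs(1),
        Duration::from_secs(30),
        Duration::from_secs(300),
    );
    let total: Duration = schedule.iter().sum();
    println!("{} attempts over {:?}: {:?}", schedule.len(), total, schedule);
}
```

In the real client each delay would be followed by a connection attempt, and a successful connect would reset the attempt counter to zero.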
Prerequisites
Modules 1–3 must be complete. Module 1 established the async task model and tokio::select! — both used extensively in connection handlers. Module 3 established the message-passing pipeline that network frames feed into. Understanding mpsc::Sender and try_send from Module 3 is prerequisite to the UDP and TCP lessons' discussion of non-blocking frame forwarding.
What Comes Next
Module 5 — Data-Oriented Design in Rust shifts from I/O to computation: how to lay out structs for CPU cache efficiency, when to use struct-of-arrays vs array-of-structs, and arena allocation for high-throughput frame processing. The telemetry frames arriving via the TCP and UDP clients from this module are processed in bulk in Module 5.
Lesson 1 — TCP Servers with tokio::net: Listeners, Connection Handling, and Graceful Shutdown
Module: Foundation — M04: Network Programming
Position: Lesson 1 of 3
Source: Tokio tutorial — I/O and Framing chapters (tokio.rs/tokio/tutorial)
Source note: Network Programming with Rust (Chanda) uses pre-async/await Tokio 0.1 APIs that are incompatible with current Tokio 1.x. This lesson is grounded in the current Tokio tutorial and Tokio 1.x API documentation.
Context
Every uplink session in the Meridian control plane begins with a TCP connection from a ground station. The Module 1 broker project sketched this accept loop in broad strokes. This lesson provides the complete model: how TcpListener binds and accepts connections, how to split a socket for concurrent read and write, how AsyncReadExt and AsyncWriteExt handle framed protocols, how a connection handler exits cleanly on EOF or error, and how the accept loop itself shuts down gracefully without leaking tasks.
The patterns here are not specific to Meridian. Every TCP server in Rust's async ecosystem — from a Redis clone to a satellite control plane — uses the same building blocks. Understanding them at the structural level means you can build, debug, and extend any such system.
Core Concepts
TcpListener — Binding and Accepting
tokio::net::TcpListener::bind(addr) binds the socket and returns a TcpListener. listener.accept().await waits for the next incoming connection and returns a (TcpStream, SocketAddr) pair. The accept call is async — while waiting, the executor can run other tasks.
    use tokio::net::TcpListener;

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let listener = TcpListener::bind("0.0.0.0:7777").await?;

        loop {
            // `?` ends the server on any accept error. That keeps this first
            // example short; transient errors are discussed below.
            let (socket, addr) = listener.accept().await?;
            println!("connection from {addr}");

            // Each connection gets its own task.
            tokio::spawn(async move {
                handle_connection(socket).await;
            });
        }
    }

    async fn handle_connection(_socket: tokio::net::TcpStream) {
        // ... read frames, process, respond
    }
The accept loop spawns a new task per connection and immediately loops back to accept the next one. The connection handler runs concurrently with all other handlers and with the accept loop itself. This is the fundamental async TCP server structure.
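One critical detail: an error from listener.accept() does not always mean the listener is broken. Transient errors such as ECONNABORTED (the peer gave up before the connection was accepted) or EMFILE (the process has run out of file descriptors) should be logged and retried. Tokio handles EAGAIN internally, so it never surfaces from accept().await. An unrecoverable error (e.g., the listener fd was closed) should terminate the loop. A simple approach is to log the error and continue. For a production-grade implementation, add an exponential backoff on repeated errors, because a persistent condition such as EMFILE otherwise turns the accept loop into a busy spin.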
AsyncRead, AsyncWrite, and Their Extension Traits
tokio::net::TcpStream implements both AsyncRead and AsyncWrite, but you almost never call their methods directly. Instead you use the extension traits AsyncReadExt and AsyncWriteExt (from tokio::io), which provide ergonomic higher-level methods:
| Method | Description |
|---|---|
| read(&mut buf) | Read up to buf.len() bytes; returns 0 on EOF |
| read_exact(&mut buf) | Read exactly buf.len() bytes; errors on EOF |
| read_u32(), read_u64(), etc. | Read a big-endian integer |
| write_all(&buf) | Write all bytes in buf |
| write_u32(n), etc. | Write a big-endian integer |
read_exact is the right primitive for fixed-size framing (like Meridian's 4-byte length prefix). It guarantees the buffer is fully populated before returning, handling the case where the underlying read returns fewer bytes than requested.
EOF handling: read() returning Ok(0) means the remote has closed the write half of the connection. Any subsequent read() will also return Ok(0). When you see this, exit the read loop — continuing to call read() on a closed stream creates a 100% CPU spin loop.
    use tokio::io::AsyncReadExt;
    use tokio::net::TcpStream;

    async fn read_frame(stream: &mut TcpStream) -> anyhow::Result<Option<Vec<u8>>> {
        let mut len_buf = [0u8; 4];
        // read_exact returns Err(UnexpectedEof) if the peer closes before all
        // four header bytes arrive.
        match stream.read_exact(&mut len_buf).await {
            Ok(()) => {}
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => {
                // Treated as end-of-stream. This is a clean close when it lands on
                // a frame boundary; read_exact cannot tell that apart from a header
                // cut short after one to three bytes.
                return Ok(None);
            }
            Err(e) => return Err(e.into()),
        }
        let len = u32::from_be_bytes(len_buf) as usize;
        // Reject oversized lengths before allocating.
        if len > 65_536 {
            anyhow::bail!("frame too large: {len} bytes");
        }
        let mut payload = vec![0u8; len];
        // EOF in the middle of a payload is a real error and propagates.
        stream.read_exact(&mut payload).await?;
        Ok(Some(payload))
    }
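To see the framing round trip end to end, and to distinguish a close at a frame boundary from a header that was cut short, the following std-only sketch may help. It is a synchronous analogue, assuming std::io::Read over an in-memory Cursor in place of Tokio's AsyncReadExt over a socket. encode_frame and MAX_FRAME are illustrative names; the 65,536-byte limit mirrors the check above.

```rust
use std::io::{self, Cursor, ErrorKind, Read};

const MAX_FRAME: usize = 65_536;

/// Encodes `payload` as a 4-byte big-endian length followed by the bytes.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= MAX_FRAME, "payload exceeds MAX_FRAME");
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Reads one frame. `Ok(None)` means the stream ended exactly at a frame boundary.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut header = [0u8; 4];
    let mut filled = 0;
    // Fill the header by hand so zero bytes read (clean close) can be told
    // apart from a partial header (truncated stream).
    while filled < header.len() {
        match reader.read(&mut header[filled..]) {
            Ok(0) if filled == 0 => return Ok(None),
            Ok(0) => return Err(ErrorKind::UnexpectedEof.into()),
            Ok(n) => filled += n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    let len = u32::from_be_bytes(header) as usize;
    if len > MAX_FRAME {
        return Err(io::Error::new(ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?; // UnexpectedEof here means a truncated frame
    Ok(Some(payload))
}

fn main() {
    let mut wire = encode_frame(b"SAFE_MODE");
    wire.extend(encode_frame(b"telemetry"));
    let mut cursor = Cursor::new(wire);
    while let Some(frame) = read_frame(&mut cursor).expect("well-formed stream") {
        println!("{} bytes: {}", frame.len(), String::from_utf8_lossy(&frame));
    }
}
```

The same header-filling loop carries over to the async version by swapping read for AsyncReadExt::read, at the cost of a few more lines than read_exact.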
Splitting a Socket: io::split and TcpStream::split
A TcpStream implements both AsyncRead and AsyncWrite, but Rust's borrow rules prevent passing &mut stream to two concurrent operations at the same time. To read and write concurrently — for example, to handle a bidirectional protocol or to send heartbeat responses while reading frames — the socket must be split.
TcpStream::split() splits by reference. Both halves must remain on the same task, but the read and write can be used independently within a single select! or sequential pair. Zero cost — no Arc, no Mutex.
TcpStream::into_split() splits by value. Each half can be sent to a different task, at the cost of a single Arc and no Mutex. The generic io::split(stream) also splits by value and works for any type that implements AsyncRead + AsyncWrite, but it internally uses an Arc and a Mutex, so prefer into_split() when the stream is a TcpStream.
    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::net::TcpStream;

    async fn bidirectional_handler(stream: TcpStream) -> anyhow::Result<()> {
        // into_split: value split, so each half can move to a separate task.
        let (mut reader, mut writer) = stream.into_split();

        // Write task: sends periodic heartbeats.
        let write_task = tokio::spawn(async move {
            loop {
                tokio::time::sleep(tokio::time::Duration::from_secs(30)).await;
                if writer.write_all(b"HEARTBEAT\n").await.is_err() {
                    break;
                }
            }
        });

        // Read loop: runs on the current task and processes incoming bytes.
        let mut buf = vec![0u8; 4096];
        let result = loop {
            match reader.read(&mut buf).await {
                Ok(0) => break Ok(()), // EOF
                Ok(n) => tracing::debug!(bytes = n, "bytes received"),
                Err(e) => break Err(e.into()),
            }
        };

        // Stop the heartbeat task on every exit path, including read errors.
        write_task.abort();
        result
    }
Use TcpStream::split() (reference) when both read and write stay in one task. Use TcpStream::into_split() (value) when they need to move to separate tasks.
BufWriter — Reducing Syscalls on the Write Path
Each write_all call on an unbuffered socket issues at least one write syscall. For a protocol that sends many small writes (header bytes, then payload bytes), the overhead accumulates. Wrapping the write half in tokio::io::BufWriter buffers writes and flushes them in larger batches:
    use tokio::io::{AsyncWriteExt, BufWriter};
    use tokio::net::TcpStream;

    // Takes the connection's long-lived BufWriter by reference, so the socket
    // stays open across frames. Create it once per connection with
    // BufWriter::new(stream); the default internal buffer is 8KB.
    async fn write_framed(
        writer: &mut BufWriter<TcpStream>,
        payload: &[u8],
    ) -> anyhow::Result<()> {
        // For payloads smaller than the buffer, these two writes go to the
        // internal buffer, not to the socket.
        let len = payload.len() as u32;
        writer.write_all(&len.to_be_bytes()).await?;
        writer.write_all(payload).await?;
        // flush() pushes the buffered bytes to the socket, typically in one syscall.
        writer.flush().await?;
        Ok(())
    }
Always call flush() after writing a complete logical unit (a frame, a response). If you return from the handler without flushing, buffered data is silently dropped when the BufWriter drops.
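The batching effect is easy to observe with a writer that counts calls. The sketch below is a std-only analogue, assuming std::io::BufWriter in place of tokio::io::BufWriter and a hypothetical CountingSink in place of a socket; each write on the sink stands for one syscall.

```rust
use std::io::{self, BufWriter, Write};

/// A stand-in for a socket that counts how many times `write` is called.
#[derive(Default)]
struct CountingSink {
    writes: usize,
    bytes: Vec<u8>,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Header then payload straight to the sink: one underlying write each.
fn unbuffered_writes(payload: &[u8]) -> io::Result<usize> {
    let mut sink = CountingSink::default();
    sink.write_all(&(payload.len() as u32).to_be_bytes())?;
    sink.write_all(payload)?;
    Ok(sink.writes)
}

/// Same frame through a BufWriter: both pieces leave in a single write on flush.
fn buffered_writes(payload: &[u8]) -> io::Result<usize> {
    let mut writer = BufWriter::new(CountingSink::default());
    writer.write_all(&(payload.len() as u32).to_be_bytes())?;
    writer.write_all(payload)?;
    writer.flush()?;
    Ok(writer.get_ref().writes)
}

fn main() -> io::Result<()> {
    let payload = [0u8; 16];
    println!("unbuffered: {} writes", unbuffered_writes(&payload)?);
    println!("buffered:   {} writes", buffered_writes(&payload)?);
    Ok(())
}
```

One difference to keep in mind: std's BufWriter attempts a flush when it is dropped, whereas the async BufWriter cannot, which is why the explicit flush() matters in the Tokio version.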
Graceful Shutdown of the Accept Loop
A simple loop { listener.accept().await? } has no shutdown path. The pattern from Lesson 3 of Module 1 applies here: race the accept against a shutdown signal with select!:
    use tokio::net::TcpListener;
    use tokio::sync::watch;

    async fn accept_loop(
        listener: TcpListener,
        mut shutdown: watch::Receiver<bool>,
    ) {
        loop {
            tokio::select! {
                accept = listener.accept() => {
                    match accept {
                        Ok((socket, addr)) => {
                            tracing::info!(%addr, "connection accepted");
                            let sd = shutdown.clone();
                            tokio::spawn(async move {
                                connection_handler(socket, sd).await;
                            });
                        }
                        Err(e) => {
                            tracing::warn!("accept error: {e}");
                            // Continue: transient errors are normal.
                        }
                    }
                }
                Ok(()) = shutdown.changed() => {
                    if *shutdown.borrow() {
                        tracing::info!("accept loop shutting down");
                        break;
                    }
                }
            }
        }
    }

    async fn connection_handler(
        _socket: tokio::net::TcpStream,
        _shutdown: watch::Receiver<bool>,
    ) {
        // Read frames; check shutdown between reads.
    }
Pass the watch::Receiver into each connection handler so that individual connections can also respond to the shutdown signal — stopping mid-read cleanly rather than being forcibly dropped.
Code Examples
Production Ground Station TCP Server
A complete TCP server for a Meridian ground station connection. Reads length-prefixed frames, forwards them to the telemetry aggregator from Module 3, and shuts down cleanly.
    use anyhow::Result;
    use tokio::{
        io::{AsyncReadExt, AsyncWriteExt},
        net::{TcpListener, TcpStream},
        sync::{mpsc, watch},
        time::{timeout, Duration},
    };
    use tracing::{info, warn};

    #[derive(Debug)]
    struct TelemetryFrame {
        station_id: String,
        payload: Vec<u8>,
    }

    async fn read_frame(stream: &mut TcpStream) -> Result<Option<Vec<u8>>> {
        let mut len_buf = [0u8; 4];
        match stream.read_exact(&mut len_buf).await {
            Ok(()) => {}
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
            Err(e) => return Err(e.into()),
        }
        let len = u32::from_be_bytes(len_buf) as usize;
        if len > 65_536 {
            anyhow::bail!("frame too large: {len}");
        }
        let mut buf = vec![0u8; len];
        stream.read_exact(&mut buf).await?;
        Ok(Some(buf))
    }

    async fn handle_connection(
        mut stream: TcpStream,
        station_id: String,
        frame_tx: mpsc::Sender<TelemetryFrame>,
        mut shutdown: watch::Receiver<bool>,
    ) {
        info!(station = %station_id, "session started");

        loop {
            tokio::select! {
                // Poll the read branch first, so a frame that has already fully
                // arrived is forwarded before the shutdown signal is observed.
                biased;

                frame = timeout(Duration::from_secs(60), read_frame(&mut stream)) => {
                    match frame {
                        // Session timeout: ground station went silent.
                        Err(_elapsed) => {
                            warn!(station = %station_id, "session timeout");
                            break;
                        }
                        Ok(Ok(Some(payload))) => {
                            if frame_tx.send(TelemetryFrame {
                                station_id: station_id.clone(),
                                payload,
                            }).await.is_err() {
                                break; // Aggregator shut down.
                            }
                        }
                        Ok(Ok(None)) => {
                            info!(station = %station_id, "connection closed by peer");
                            break;
                        }
                        Ok(Err(e)) => {
                            warn!(station = %station_id, "read error: {e}");
                            break;
                        }
                    }
                }
                Ok(()) = shutdown.changed() => {
                    if *shutdown.borrow() {
                        info!(station = %station_id, "shutdown signal: closing session");
                        break;
                    }
                }
            }
        }

        // Send a clean close to the peer.
        let _ = stream.shutdown().await;
        info!(station = %station_id, "session ended");
    }

    pub async fn run_tcp_server(
        bind_addr: &str,
        frame_tx: mpsc::Sender<TelemetryFrame>,
        shutdown: watch::Receiver<bool>,
    ) -> Result<()> {
        let listener = TcpListener::bind(bind_addr).await?;
        info!("ground station server listening on {bind_addr}");

        let mut conn_id = 0usize;
        let mut sd = shutdown.clone();

        loop {
            tokio::select! {
                accept = listener.accept() => {
                    match accept {
                        Ok((socket, addr)) => {
                            conn_id += 1;
                            let station_id = format!("gs-{conn_id}@{addr}");
                            tokio::spawn(handle_connection(
                                socket,
                                station_id,
                                frame_tx.clone(),
                                shutdown.clone(),
                            ));
                        }
                        // Transient accept errors are logged and retried rather
                        // than ending the server.
                        Err(e) => warn!("accept error: {e}"),
                    }
                }
                Ok(()) = sd.changed() => {
                    if *sd.borrow() { break; }
                }
            }
        }

        info!("accept loop exited");
        Ok(())
    }

    #[tokio::main]
    async fn main() -> Result<()> {
        tracing_subscriber::fmt::init();

        let (frame_tx, mut frame_rx) = mpsc::channel::<TelemetryFrame>(256);
        let (shutdown_tx, shutdown_rx) = watch::channel(false);

        // Frame consumer.
        tokio::spawn(async move {
            while let Some(frame) = frame_rx.recv().await {
                info!(station = %frame.station_id, bytes = frame.payload.len(), "frame received");
            }
        });

        // Shutdown after 2 seconds for demo purposes.
        let sd = shutdown_tx;
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_secs(2)).await;
            let _ = sd.send(true);
        });

        run_tcp_server("0.0.0.0:7777", frame_tx, shutdown_rx).await
    }
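Several production decisions are embedded here. The timeout around read_frame handles silent connections (antenna loss, network blackout) without leaving ghost sessions open. stream.shutdown() sends a TCP FIN to the peer on clean exit. The biased select! polls the read branch first, so a frame that is already available is processed before the shutdown branch is considered. It does not protect a frame that is only partially read: read_exact is not cancellation-safe, so if the shutdown branch completes while read_frame is still waiting for bytes, the bytes already consumed are discarded along with the cancelled future. A handler that must never lose a partial frame should check the shutdown flag between frames rather than race it against the read.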
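The framing rules that read_frame enforces can be exercised without a socket. The following is a minimal synchronous sketch that substitutes std::io::Read for TcpStream; encode_frame, decode_frame, decode_all, and MAX_FRAME_LEN are illustrative names that do not appear in the server above, and the 65,536-byte cap simply mirrors its limit.

```rust
use std::io::{self, Cursor, Read};

/// Upper bound on payload size, mirroring the 65_536 limit in read_frame.
const MAX_FRAME_LEN: usize = 65_536;

/// Prefix the payload with its length as a 4-byte big-endian u32.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Read one frame. EOF while reading the length header is treated as a
/// clean close (Ok(None)); EOF inside the payload is an error.
fn decode_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match reader.read_exact(&mut len_buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("frame too large: {len}"),
        ));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(Some(payload))
}

/// Decode every frame in an in-memory byte buffer.
fn decode_all(bytes: &[u8]) -> io::Result<Vec<Vec<u8>>> {
    let mut cursor = Cursor::new(bytes);
    let mut frames = Vec::new();
    while let Some(frame) = decode_frame(&mut cursor)? {
        frames.push(frame);
    }
    Ok(frames)
}
```

Feeding decode_all two concatenated frames returns both payloads, while a buffer cut off inside a payload yields an UnexpectedEof error. That is the same distinction the server draws between a peer that closed between frames and one that vanished mid-frame.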
Key Takeaways
- TcpListener::bind().await binds the socket; listener.accept().await yields a (TcpStream, SocketAddr). Spawn a task per connection and loop back immediately: the accept loop should never be blocked by connection handling.
- read() returning Ok(0) is EOF: the remote closed its write half. Continuing to call read() after EOF creates a spin loop. Always exit the read loop on Ok(0).
- read_exact is the correct primitive for fixed-size framing. It handles short reads internally and returns UnexpectedEof if the connection closes before the buffer is filled.
- Use TcpStream::split() for same-task read/write splitting (zero cost). Use TcpStream::into_split() when the read and write halves must move to separate tasks.
- BufWriter batches small writes. Always call flush() after writing a complete logical unit: tokio's BufWriter discards unflushed data when it is dropped.
- Add a timeout to reads in long-lived connections. Ground stations go silent without warning. A 60-second read timeout detects ghost sessions that would otherwise hold open resources indefinitely.
Lesson 2 — UDP and Datagram Protocols: Low-Latency Sensor Data
Module: Foundation — M04: Network Programming
Position: Lesson 2 of 3
Source: Synthesized from training knowledge and tokio::net::UdpSocket documentation
Source note: This lesson synthesizes from current tokio::net::UdpSocket API documentation and training knowledge. The following concepts would benefit from verification against the source book if the API has changed: split() on UdpSocket, recv_from/send_to semantics, and connect()-vs-unconnected modes.
Context
The Meridian Space Domain Awareness network includes optical sensors and radar installations that report raw detection events at high frequency with strict latency requirements. A radar return needs to reach the conjunction analysis pipeline in under 50ms. At that latency budget, TCP's per-packet acknowledgment and retransmission overhead is a liability, not a feature. When the occasional dropped packet is acceptable — or when the application layer manages its own loss detection — UDP is the right transport.
UDP is a datagram protocol: each send and recv corresponds to exactly one discrete packet. There are no streams, no connection establishment, no ordering guarantees, and no retransmission. What you get is low overhead, minimal kernel buffering, and latency that is bounded only by the network, not by protocol machinery.
This lesson covers tokio::net::UdpSocket: binding, sending, receiving, splitting for concurrent send/receive, and the design decisions around UDP in a high-frequency sensor pipeline.
Core Concepts
UDP Socket Basics
UdpSocket::bind(addr) creates a UDP socket bound to a local address. Unlike TCP, there is no accept loop and no connection concept. A single bound socket can send to any address and receive from any address:
```rust
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Bind to receive on all interfaces, port 9090.
    let socket = UdpSocket::bind("0.0.0.0:9090").await?;
    let mut buf = [0u8; 1024];

    loop {
        // recv_from returns the number of bytes and the sender's address.
        let (n, addr) = socket.recv_from(&mut buf).await?;
        println!("received {n} bytes from {addr}: {:?}", &buf[..n]);

        // Echo back.
        socket.send_to(&buf[..n], addr).await?;
    }
}
```
recv_from waits for the next datagram. If the incoming datagram is larger than buf, the excess bytes are silently discarded — there is no partial read concept in UDP. Size your buffer to the maximum expected datagram, not the average.
Connected Mode vs. Unconnected Mode
An unconnected UDP socket can communicate with any remote address. A connected UDP socket is associated with one specific remote address via socket.connect(addr) — this is not a TCP handshake, just a filter on the local OS socket:
```rust
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0").await?; // OS assigns port

    // "Connect" to the sensor: enables send/recv instead of send_to/recv_from.
    // Datagrams from other addresses are filtered out.
    socket.connect("192.168.1.100:5500").await?;

    socket.send(b"POLL").await?;

    let mut buf = [0u8; 256];
    let n = socket.recv(&mut buf).await?;
    println!("sensor response: {:?}", &buf[..n]);

    Ok(())
}
```
After connect(), use send/recv instead of send_to/recv_from. The OS filters datagrams to only those from the connected address, which is useful for point-to-point sensor polling. For a server receiving from many sensors, use the unconnected mode with recv_from.
Splitting for Concurrent Send/Receive
A single UdpSocket can receive and send from different tasks at the same time. In tokio 1.x, UdpSocket has no split() or into_split() method. Because its I/O methods take &self, wrapping the socket in an Arc and cloning the handle into each task does the same job:
```rust
use std::sync::Arc;
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let socket = Arc::new(UdpSocket::bind("0.0.0.0:9090").await?);

    // For UdpSocket, Arc-sharing is the idiomatic split pattern
    // because both send_to and recv_from take &self.
    let recv_socket = Arc::clone(&socket);
    let send_socket = Arc::clone(&socket);

    let recv_task = tokio::spawn(async move {
        let mut buf = [0u8; 1024];
        loop {
            let (n, addr) = recv_socket.recv_from(&mut buf).await.unwrap();
            println!("recv {n} bytes from {addr}");
        }
    });

    let send_task = tokio::spawn(async move {
        // Periodic heartbeat to a known sensor address.
        loop {
            tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
            send_socket
                .send_to(b"HEARTBEAT", "192.168.1.100:5500")
                .await
                .unwrap();
        }
    });

    let _ = tokio::join!(recv_task, send_task);
    Ok(())
}
```
UdpSocket's send_to and recv_from take &self (shared reference), so wrapping in Arc lets multiple tasks share the same socket without splitting. This differs from TcpStream where read and write require &mut self.
Buffer Sizing and Packet Loss
UDP datagrams over IPv4 carry at most 65,507 bytes of payload (the 65,535-byte IP packet limit minus a 20-byte IP header and an 8-byte UDP header), but practical limits are lower. A datagram that exceeds the network MTU (typically 1500 bytes on Ethernet) is fragmented at the IP layer. If any fragment is lost, the entire datagram is discarded. For high-frequency sensor data, keep individual datagrams at or under 1472 bytes of payload (1500 MTU - 20 IP header - 8 UDP header) to avoid fragmentation.
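The arithmetic behind the 1472-byte figure, and the cost of exceeding it, can be worked through in a few lines. This is a sketch under stated assumptions: IPv4 headers without options, an MTU of at least the 68-byte IPv4 minimum, and fragments lost independently of one another. The function names are illustrative.

```rust
/// Fixed header sizes in bytes (IPv4 without options, UDP).
const IPV4_HEADER: usize = 20;
const UDP_HEADER: usize = 8;

/// Largest UDP payload that fits in one unfragmented IPv4 packet.
/// Assumes `mtu` is at least 68, the IPv4 minimum.
fn max_unfragmented_payload(mtu: usize) -> usize {
    mtu - IPV4_HEADER - UDP_HEADER
}

/// Number of IP fragments needed to carry a UDP payload of `len` bytes.
fn fragment_count(len: usize, mtu: usize) -> usize {
    // Every fragment except the last carries a multiple of 8 data bytes.
    let per_fragment = (mtu - IPV4_HEADER) / 8 * 8;
    // The UDP header travels inside the IP payload of the first fragment.
    let ip_payload = UDP_HEADER + len;
    (ip_payload + per_fragment - 1) / per_fragment
}

/// Probability that the whole datagram arrives, assuming each fragment
/// is lost independently with probability `loss`.
fn delivery_probability(fragments: usize, loss: f64) -> f64 {
    (1.0 - loss).powi(fragments as i32)
}
```

With a 1500-byte MTU, a 1472-byte payload travels as one packet and a 1473-byte payload as two. A maximum-size 65,507-byte datagram needs 45 fragments, so at 1% per-fragment loss only about 64% of such datagrams would arrive intact under the independence assumption.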
Buffer the receive socket at the OS level with SO_RCVBUF if sensor bursts arrive faster than the application can drain them. This requires socket2 or nix crate access to set socket options before wrapping in tokio::net::UdpSocket:
```rust
use socket2::{Domain, Socket, Type};
use std::net::SocketAddr;
use tokio::net::UdpSocket;

async fn bind_with_large_buffer(addr: &str) -> anyhow::Result<UdpSocket> {
    let addr: SocketAddr = addr.parse()?;

    // Match the socket domain to the address family (IPv4 or IPv6).
    let socket = Socket::new(Domain::for_address(addr), Type::DGRAM, None)?;
    socket.set_reuse_address(true)?;
    // Request a 4MB receive buffer to absorb radar bursts. The kernel may
    // clamp this to its configured maximum (net.core.rmem_max on Linux).
    socket.set_recv_buffer_size(4 * 1024 * 1024)?;
    socket.bind(&addr.into())?;
    // tokio requires the std socket to be in non-blocking mode.
    socket.set_nonblocking(true)?;

    Ok(UdpSocket::from_std(socket.into())?)
}
```
When to Choose UDP over TCP
| Situation | Preferred |
|---|---|
| Radar/optical detection events, < 50ms latency budget | UDP |
| Telemetry frames requiring ordered delivery and reliability | TCP |
| Configuration commands — must not be lost | TCP |
| Periodic status heartbeats where loss is acceptable | UDP |
| Bulk TLE catalog transfer | TCP |
| High-frequency position updates where only latest matters | UDP |
The core tradeoff: TCP adds ordering, reliability, and flow control at the cost of latency and per-connection overhead. UDP provides a raw datagram channel — if reliability matters, implement it yourself (sequence numbers, ACKs, retransmission) at the application layer.
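As an illustration of the lightest form of application-layer loss detection, the sketch below tracks a per-sender sequence number and reports gaps. It assumes each datagram carries a u32 sequence counter that wraps around; that field, and the names LossTracker and Observation, are hypothetical and not part of any Meridian wire format described in this module.

```rust
/// Result of observing one sequence number.
#[derive(Debug, PartialEq)]
enum Observation {
    /// The datagram that was expected next.
    InOrder,
    /// `missing` datagrams were skipped before this one.
    Gap { missing: u32 },
    /// A duplicate, or a datagram that arrived after a newer one.
    Stale,
}

/// Tracks the next expected sequence number for a single sender.
struct LossTracker {
    next_expected: Option<u32>,
    /// Upper bound on lost datagrams: late arrivals are not subtracted.
    lost: u64,
}

impl LossTracker {
    fn new() -> Self {
        Self { next_expected: None, lost: 0 }
    }

    fn observe(&mut self, seq: u32) -> Observation {
        let expected = match self.next_expected {
            Some(e) => e,
            None => {
                // First datagram from this sender: accept it as the baseline.
                self.next_expected = Some(seq.wrapping_add(1));
                return Observation::InOrder;
            }
        };
        // Wrapping distance read as signed, so the counter can roll over.
        let delta = seq.wrapping_sub(expected) as i32;
        if delta < 0 {
            return Observation::Stale;
        }
        self.next_expected = Some(seq.wrapping_add(1));
        if delta == 0 {
            Observation::InOrder
        } else {
            self.lost += delta as u64;
            Observation::Gap { missing: delta as u32 }
        }
    }
}
```

Detection alone is often enough for sensor feeds: the receiver can log or export the loss count without ever requesting a retransmission, which keeps the latency profile of plain UDP.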
Code Examples
SDA Radar Sensor Receiver
The Meridian SDA network has radar stations that broadcast detection events as UDP datagrams. The receiver processes them and forwards them to the conjunction analysis pipeline. Packet loss is tolerable: a missed radar return is regrettable, but the next sweep arrives within 250ms, whereas a return that arrives late is of no use to the pipeline.
```rust
use std::net::SocketAddr;
use tokio::net::UdpSocket;
use tokio::sync::mpsc;
use tokio::time::Duration;

#[derive(Debug)]
struct RadarDetection {
    sensor_id: u32,
    azimuth_deg: f32,
    elevation_deg: f32,
    range_km: f32,
    timestamp_ms: u64,
}

fn parse_detection(buf: &[u8], addr: SocketAddr) -> Option<RadarDetection> {
    // Wire format: 4-byte sensor_id | 4-byte azimuth (f32 BE) |
    //              4-byte elevation (f32 BE) | 4-byte range (f32 BE) |
    //              8-byte timestamp (u64 BE)
    if buf.len() < 24 {
        return None; // Malformed datagram: discard silently.
    }
    let sensor_id = u32::from_be_bytes(buf[0..4].try_into().ok()?);
    let azimuth = f32::from_be_bytes(buf[4..8].try_into().ok()?);
    let elevation = f32::from_be_bytes(buf[8..12].try_into().ok()?);
    let range = f32::from_be_bytes(buf[12..16].try_into().ok()?);
    let timestamp = u64::from_be_bytes(buf[16..24].try_into().ok()?);

    tracing::debug!(%addr, sensor_id, "detection received");

    Some(RadarDetection {
        sensor_id,
        azimuth_deg: azimuth,
        elevation_deg: elevation,
        range_km: range,
        timestamp_ms: timestamp,
    })
}

async fn radar_receiver(
    bind_addr: &str,
    tx: mpsc::Sender<RadarDetection>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) -> anyhow::Result<()> {
    let socket = UdpSocket::bind(bind_addr).await?;
    tracing::info!("radar receiver listening on {bind_addr}");

    let mut buf = [0u8; 1472]; // Stay under MTU to avoid fragmentation.

    loop {
        tokio::select! {
            biased;

            recv = socket.recv_from(&mut buf) => {
                match recv {
                    Ok((n, addr)) => {
                        if let Some(detection) = parse_detection(&buf[..n], addr) {
                            // Non-blocking: drop if the pipeline is full rather
                            // than blocking the receive loop. A queued radar sweep
                            // is useless by the time it clears the backlog.
                            if tx.try_send(detection).is_err() {
                                tracing::warn!("detection pipeline full, datagram dropped");
                            }
                        }
                    }
                    Err(e) => {
                        tracing::warn!("recv error: {e}");
                        // UDP recv errors are typically transient: continue.
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    tracing::info!("radar receiver shutting down");
                    break;
                }
            }
        }
    }
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();

    let (tx, mut rx) = mpsc::channel::<RadarDetection>(512);
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);

    tokio::spawn(radar_receiver("0.0.0.0:9090", tx, shutdown_rx));

    // Consumer: conjunction analysis pipeline.
    tokio::spawn(async move {
        while let Some(det) = rx.recv().await {
            tracing::info!(
                sensor = det.sensor_id,
                az = det.azimuth_deg,
                el = det.elevation_deg,
                range = det.range_km,
                "detection processed"
            );
        }
    });

    // Demo: shut down after 5 seconds.
    tokio::time::sleep(Duration::from_secs(5)).await;
    shutdown_tx.send(true)?;
    tokio::time::sleep(Duration::from_millis(100)).await;
    Ok(())
}
```
try_send instead of send().await is deliberate here. If the conjunction pipeline is saturated, blocking the radar receive loop means subsequent datagrams pile up in the OS socket buffer and eventually overflow it too. Dropping one detection and keeping the receive loop running is the correct behaviour for high-frequency sensor data where recency matters more than completeness.
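The drop-on-full policy can be observed in isolation with the standard library's bounded channel, whose try_send behaves analogously to the one on tokio's mpsc::Sender. The sketch below is an analogue only, not the receiver's actual code; ForwardStats, forward_or_drop, and burst are illustrative names. Unlike the receiver above, it separates a full channel from a disconnected one, which tokio also reports as distinct TrySendError variants (Full and Closed).

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Counts of what happened to forwarded items.
#[derive(Debug, Default, PartialEq)]
struct ForwardStats {
    forwarded: u64,
    dropped: u64,
}

/// Enqueue without blocking. Returns false once the consumer is gone,
/// so the caller can stop its receive loop.
fn forward_or_drop<T>(tx: &SyncSender<T>, item: T, stats: &mut ForwardStats) -> bool {
    match tx.try_send(item) {
        Ok(()) => {
            stats.forwarded += 1;
            true
        }
        // Channel full: discard this item and keep receiving.
        Err(TrySendError::Full(_)) => {
            stats.dropped += 1;
            true
        }
        Err(TrySendError::Disconnected(_)) => false,
    }
}

/// Push `n` items into a channel of the given capacity that nobody drains.
fn burst(capacity: usize, n: u32) -> ForwardStats {
    // Keep the receiver alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut stats = ForwardStats::default();
    for i in 0..n {
        forward_or_drop(&tx, i, &mut stats);
    }
    stats
}
```

A burst of five items into a capacity-2 channel forwards two and drops three without ever blocking the sender. Counting drops, rather than only logging them, gives operations a metric to alert on when the downstream pipeline falls behind.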
Key Takeaways
- UDP is a datagram protocol: each send/recv is one discrete packet with no ordering, reliability, or congestion control. Use it when latency matters more than reliability, or when the application layer manages loss detection.
- recv_from returns the number of bytes received and the sender's address. If the datagram is larger than the buffer, excess bytes are silently discarded. Size receive buffers to the maximum expected datagram, not the average.
- connect() on a UDP socket is not a handshake. It sets a default remote address and filters incoming datagrams from other addresses. Use connected mode for point-to-point polling; use unconnected mode for servers receiving from many sources.
- UdpSocket's send_to and recv_from take &self. Wrapping in Arc lets multiple tasks share one socket without a formal split, unlike TcpStream, which requires into_split() or split() for concurrent access.
- Keep datagrams under 1472 bytes on Ethernet networks to avoid IP fragmentation. A single lost IP fragment drops the entire datagram.
- In high-frequency sensor pipelines, use try_send rather than send().await when forwarding to a downstream channel. Blocking the receive loop on a full channel is worse than dropping one datagram.
Lesson 3 — HTTP Clients with reqwest: Async REST Calls to Meridian's Mission API
Module: Foundation — M04: Network Programming
Position: Lesson 3 of 3
Source: Synthesized from reqwest documentation and training knowledge
Source note: This lesson synthesizes from reqwest 0.12.x API documentation and training knowledge. Verify connection pool configuration options against the current reqwest::ClientBuilder docs if behaviour differs.
Context
The Meridian control plane is not an island. It fetches TLE updates from the external Space-Track catalog API, posts conjunction alerts to the mission operations REST endpoint, and retrieves ground station configuration from an internal config service. All of these are HTTP calls — outbound, async, with retry logic and timeouts.
reqwest is the standard async HTTP client for Rust. It wraps hyper (the underlying HTTP implementation) with a high-level, ergonomic API, built-in connection pooling, JSON support through serde, and configurable timeout and retry behaviour. Understanding how to use it correctly — particularly how Client is shared, how connection pools work, and how to handle failures robustly — is essential for any Rust service that communicates with external APIs.
Core Concepts
Client — Shared, Pooled, Long-Lived
reqwest::Client manages a connection pool internally. Building a Client is comparatively expensive: it allocates the pool, sets up TLS configuration, and initializes the DNS resolver. A Client is designed to be created once and cloned cheaply for sharing across tasks.
```rust
use reqwest::Client;
use std::time::Duration;

fn build_client() -> anyhow::Result<Client> {
    Ok(Client::builder()
        // Overall request timeout: connection + headers + body.
        .timeout(Duration::from_secs(30))
        // How long to wait for the TCP connection to establish.
        .connect_timeout(Duration::from_secs(5))
        // Keep connections alive for reuse: avoids a TCP handshake per request.
        .pool_idle_timeout(Duration::from_secs(90))
        .pool_max_idle_per_host(10)
        // User-Agent header for all requests.
        .user_agent("meridian-control-plane/1.0")
        .build()?)
}
```
Client is Clone — cloning it is a reference count increment that shares the same underlying connection pool. Pass a Client to tasks by cloning, not by wrapping in Arc<Mutex<Client>>. The Arc is already inside Client.
Never create a new Client per request. Each new Client is a new connection pool — you lose all the benefit of connection reuse and accumulate resource overhead proportional to your request rate.
Making Requests
The basic request pattern: call a method on the Client to get a RequestBuilder, add headers and body, call .send().await, check the status, and deserialize the response:
```rust
use reqwest::Client;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct TleRecord {
    norad_id: u32,
    name: String,
    line1: String,
    line2: String,
}

async fn fetch_tle(client: &Client, norad_id: u32) -> anyhow::Result<TleRecord> {
    let url = format!("https://api.meridian.internal/tle/{norad_id}");

    let response = client
        .get(&url)
        .header("X-API-Key", "mission-control-key")
        .send()
        .await?;

    // error_for_status() converts 4xx/5xx responses into Err.
    // Without this, a 404 or 500 is not an error: you receive the body.
    let response = response.error_for_status()?;

    let record: TleRecord = response.json().await?;
    Ok(record)
}
```
error_for_status() is important. A 404 or 503 does not cause .send().await to return Err — only network errors do. If you omit error_for_status(), a 500 response body is deserialized as if it were a valid TleRecord, producing a confusing JSON parse error rather than a clear HTTP error.
Sending JSON Bodies
For POST and PUT requests with JSON bodies, use .json(&value) on the RequestBuilder. It serializes the value with serde, sets the Content-Type: application/json header, and sets the body:
```rust
use reqwest::Client;
use serde::Serialize;

#[derive(Serialize)]
struct ConjunctionAlert {
    object_a: u32,
    object_b: u32,
    tca_seconds: f64,
    miss_distance_km: f64,
}

async fn post_alert(client: &Client, alert: &ConjunctionAlert) -> anyhow::Result<()> {
    client
        .post("https://api.meridian.internal/alerts")
        .json(alert)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```
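.json() requires the json feature on reqwest, which is not part of the default feature set and must be enabled in Cargo.toml (features = ["json"]). For large payloads that should be streamed rather than buffered in memory, use .body(reqwest::Body::wrap_stream(stream)) instead; wrap_stream requires the stream feature.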
Retry Logic with Exponential Backoff
External APIs fail transiently — rate limits, brief outages, transient DNS failures. A single retry with a fixed delay is rarely sufficient. Exponential backoff with jitter is the standard approach: wait 1s, then 2s, then 4s, with random jitter to avoid thundering herds:
```rust
use reqwest::{Client, StatusCode};
use tokio::time::{sleep, Duration};

async fn fetch_with_retry(
    client: &Client,
    url: &str,
    max_attempts: u32,
) -> anyhow::Result<String> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        let result = client.get(url).send().await;

        match result {
            Ok(resp) if resp.status().is_success() => {
                return Ok(resp.text().await?);
            }
            Ok(resp) if resp.status() == StatusCode::TOO_MANY_REQUESTS => {
                // Respect Retry-After header if present, otherwise backoff.
                let retry_after = resp
                    .headers()
                    .get("Retry-After")
                    .and_then(|v| v.to_str().ok())
                    .and_then(|s| s.parse::<u64>().ok())
                    .unwrap_or(0);
                let delay = if retry_after > 0 {
                    Duration::from_secs(retry_after)
                } else {
                    backoff_delay(attempt)
                };
                tracing::warn!(attempt, url, ?delay, "rate limited, backing off");
                if attempt >= max_attempts {
                    anyhow::bail!("rate limit exhausted");
                }
                sleep(delay).await;
            }
            // 5xx and 408 are treated as transient.
            Ok(resp)
                if resp.status().is_server_error()
                    || resp.status() == StatusCode::REQUEST_TIMEOUT =>
            {
                tracing::warn!(attempt, url, status = %resp.status(), "transient HTTP error");
                if attempt >= max_attempts {
                    anyhow::bail!("server error after {max_attempts} attempts");
                }
                sleep(backoff_delay(attempt)).await;
            }
            Ok(resp) => {
                // Remaining 4xx client errors are not retryable.
                anyhow::bail!("request failed: HTTP {}", resp.status());
            }
            Err(e) if e.is_connect() || e.is_timeout() => {
                tracing::warn!(attempt, url, "network error: {e}");
                if attempt >= max_attempts {
                    return Err(e.into());
                }
                sleep(backoff_delay(attempt)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
}

fn backoff_delay(attempt: u32) -> Duration {
    use std::time::SystemTime;

    // Exponential backoff: 1s, 2s, 4s, 8s, 16s, capped at 30s.
    // `attempt` is 1-based, so the first retry waits 1s.
    let exp = attempt.saturating_sub(1).min(5);
    let base = Duration::from_secs((1u64 << exp).min(30));

    // Add up to 1s of jitter to avoid a thundering herd. The clock's
    // sub-second part keeps this example dependency-free; a random
    // number generator spreads retries more evenly.
    let jitter_ms = SystemTime::now()
        .duration_since(SystemTime::UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_millis();
    base + Duration::from_millis(u64::from(jitter_ms))
}
```
Retry strategy by status code:
- 5xx (server error): Retry with backoff. These are usually transient server issues.
- 429 (too many requests): Retry with backoff, respecting the Retry-After header.
- 408 (request timeout) or connection/timeout errors: Retry with backoff.
- 4xx (client errors) except 408 and 429: Do not retry. The request itself is at fault.
- Success: Return immediately.
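The policy in the list above can be expressed as a pure function that is easy to unit-test separately from any HTTP client. This is one possible encoding rather than a reqwest API: RetryDecision and classify are illustrative names, the status is passed as a plain u16, and only the delta-seconds form of Retry-After is parsed (the header may also carry an HTTP date, which this sketch falls back to backoff for).

```rust
use std::time::Duration;

/// What the retry loop should do with a response.
#[derive(Debug, PartialEq)]
enum RetryDecision {
    /// 2xx: hand the response to the caller.
    Done,
    /// The server said how long to wait.
    RetryAfter(Duration),
    /// Transient failure: wait for the next backoff delay.
    RetryWithBackoff,
    /// Not retryable: surface the error.
    Fail,
}

/// Classify an HTTP status code, given the raw Retry-After value if any.
fn classify(status: u16, retry_after: Option<&str>) -> RetryDecision {
    match status {
        200..=299 => RetryDecision::Done,
        429 => match retry_after.and_then(|v| v.trim().parse::<u64>().ok()) {
            Some(secs) if secs > 0 => RetryDecision::RetryAfter(Duration::from_secs(secs)),
            // Missing, zero, or date-formatted header: use normal backoff.
            _ => RetryDecision::RetryWithBackoff,
        },
        408 | 500..=599 => RetryDecision::RetryWithBackoff,
        // Everything else, including the remaining 4xx codes, is final.
        _ => RetryDecision::Fail,
    }
}
```

Keeping the decision separate from the I/O means the retry loop reduces to a match on RetryDecision, and the table of status codes can be tested without a mock server.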
Configuring Timeouts Correctly
A single .timeout(Duration) sets the overall request timeout (connection + sending + receiving). For fine-grained control:
```rust
use reqwest::Client;
use std::time::Duration;

fn build_production_client() -> anyhow::Result<Client> {
    Ok(Client::builder()
        // TCP connection timeout: fail fast if the service is unreachable.
        .connect_timeout(Duration::from_secs(3))
        // Total time budget for the entire request (all phases).
        .timeout(Duration::from_secs(15))
        // How long an idle connection can sit in the pool before being closed.
        .pool_idle_timeout(Duration::from_secs(60))
        .build()?)
}
```
For the Meridian TLE catalog API — a slow external service that can take up to 10 seconds to respond during load — set the timeout to 12–15 seconds. For the internal mission ops REST endpoint on the same datacenter network, 3–5 seconds is appropriate. Do not use the same Client configuration for both if the timeout requirements differ significantly — build two clients.
Code Examples
TLE Catalog HTTP Client for the Control Plane
The control plane fetches TLE updates from Space-Track on a 10-minute schedule. It also posts conjunction reports to the mission operations REST endpoint. This example shows both directions, fetching and posting, with retry logic and a shared client.
```rust
use anyhow::{Context, Result};
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::time::Duration;
use tokio::time::sleep;

#[derive(Debug, Deserialize, Clone)]
pub struct TleRecord {
    pub norad_id: u32,
    pub name: String,
    pub line1: String,
    pub line2: String,
    pub epoch: String,
}

#[derive(Debug, Serialize)]
pub struct ConjunctionReport {
    pub object_a_id: u32,
    pub object_b_id: u32,
    pub tca_unix: f64,
    pub miss_distance_km: f64,
    pub probability: f64,
}

pub struct MissionApiClient {
    client: Client,
    base_url: String,
    api_key: String,
}

impl MissionApiClient {
    pub fn new(base_url: String, api_key: String) -> Result<Self> {
        let client = Client::builder()
            .connect_timeout(Duration::from_secs(5))
            .timeout(Duration::from_secs(20))
            .pool_max_idle_per_host(4)
            .user_agent("meridian-control-plane/1.0")
            .build()
            .context("failed to build HTTP client")?;
        Ok(Self { client, base_url, api_key })
    }

    /// Fetch a single TLE record with up to 3 attempts.
    pub async fn get_tle(&self, norad_id: u32) -> Result<TleRecord> {
        let url = format!("{}/tle/{norad_id}", self.base_url);
        let mut attempt = 0u32;
        loop {
            attempt += 1;
            let response = self.client
                .get(&url)
                .header("X-API-Key", &self.api_key)
                .send()
                .await;

            match response {
                Ok(resp) if resp.status().is_success() => {
                    return resp.json::<TleRecord>().await
                        .context("failed to parse TLE response");
                }
                Ok(resp) if resp.status().is_server_error() && attempt < 3 => {
                    tracing::warn!(norad_id, attempt, status = %resp.status(), "retrying");
                    sleep(Duration::from_secs(1 << attempt)).await;
                }
                Ok(resp) => {
                    anyhow::bail!("TLE fetch failed: HTTP {}", resp.status());
                }
                Err(e) if (e.is_connect() || e.is_timeout()) && attempt < 3 => {
                    tracing::warn!(norad_id, attempt, "network error: {e}, retrying");
                    sleep(Duration::from_secs(1 << attempt)).await;
                }
                Err(e) => return Err(e).context("TLE fetch network error"),
            }
        }
    }

    /// Post a conjunction report to the mission operations endpoint.
    pub async fn post_conjunction(&self, report: &ConjunctionReport) -> Result<()> {
        self.client
            .post(format!("{}/conjunctions", self.base_url))
            .header("X-API-Key", &self.api_key)
            .json(report)
            .send()
            .await
            .context("failed to send conjunction report")?
            .error_for_status()
            .context("conjunction report rejected")?;
        Ok(())
    }

    /// Fetch all active TLEs in a specified altitude band (batch request).
    pub async fn get_tle_batch(&self, min_km: u32, max_km: u32) -> Result<Vec<TleRecord>> {
        self.client
            .get(format!("{}/tle/batch", self.base_url))
            .query(&[("min_alt_km", min_km), ("max_alt_km", max_km)])
            .header("X-API-Key", &self.api_key)
            .send()
            .await?
            .error_for_status()?
            .json::<Vec<TleRecord>>()
            .await
            .context("failed to parse TLE batch response")
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let api = MissionApiClient::new(
        "https://api.meridian.internal".to_string(),
        "mission-control-key".to_string(),
    )?;

    // Periodic TLE refresh loop.
    let api_ref = std::sync::Arc::new(api);
    let refresh_api = std::sync::Arc::clone(&api_ref);
    tokio::spawn(async move {
        loop {
            match refresh_api.get_tle(25544).await {
                Ok(tle) => tracing::info!(name = %tle.name, "TLE refreshed"),
                Err(e) => tracing::error!("TLE refresh failed: {e}"),
            }
            sleep(Duration::from_secs(600)).await;
        }
    });

    // Post a conjunction report.
    api_ref.post_conjunction(&ConjunctionReport {
        object_a_id: 25544,
        object_b_id: 48274,
        tca_unix: 1_735_000_000.0,
        miss_distance_km: 0.8,
        probability: 0.003,
    }).await?;

    sleep(Duration::from_secs(1)).await;
    Ok(())
}
```
The MissionApiClient wraps the reqwest::Client and encodes the API contract (base URL, auth header, response types) in one place. Callers interact with typed methods rather than raw HTTP primitives. Wrapping the MissionApiClient in Arc lets the spawned refresh task and main share one instance: tokio::spawn requires a 'static future, so a plain &MissionApiClient cannot be moved into the task. Because Client is already internally reference-counted, deriving Clone on MissionApiClient is an alternative to Arc, at the cost of cloning the two Strings. Functions that are awaited directly rather than spawned can simply take &MissionApiClient.
Key Takeaways
- Create one Client per configuration profile and share it across tasks via Clone. Each new Client is a new connection pool; creating one per request wastes connection setup overhead and defeats pooling.
- Always call error_for_status() after .send().await unless you explicitly want to handle 4xx/5xx response bodies. HTTP error responses do not return Err from send().
- Use .json(&value) for serializing request bodies with serde. Use .json::<T>() on the response for deserialization. Both require the json feature, which must be enabled explicitly in Cargo.toml.
- Distinguish retryable errors (5xx, 429, 408, connection/timeout errors) from non-retryable ones (other 4xx client errors). Apply exponential backoff with jitter for retryable failures. Respect Retry-After headers on 429 responses.
- Set connect_timeout separately from the overall .timeout. A short connect timeout (3–5s) fails fast on unreachable services without waiting for the full request timeout budget.
- For different external services with different latency profiles and rate limits, use separate Client instances with separate configurations rather than sharing one client across everything.
Project — Ground Station Network Client
Module: Foundation — M04: Network Programming
Prerequisite: All three module quizzes passed (≥70%)
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0051 — Ground Station Network Client Implementation
The Meridian control plane currently uses a Python subprocess to manage ground station TCP connections. It provides no reconnection logic, no session health monitoring, and no integration with the TLE catalog API for per-session orbital data refresh. Under antenna tracking interruptions, sessions drop and are never re-established. Under Space-Track API rate limiting, TLE data becomes stale without any backoff or retry.
Your task is to build the ground station network client — the component that owns the full lifecycle of a ground station TCP session: connect, read frames, reconnect on failure, refresh TLE data via HTTP, and shut down cleanly.
System Specification
Connection Management
The client connects to a ground station TCP endpoint (host:port). The length-prefix framing protocol from Lesson 1 applies: 4-byte big-endian u32 length header followed by length bytes of payload.
On connection loss (EOF, read error, timeout), the client reconnects automatically with exponential backoff: 1s, 2s, 4s, 8s, up to 30s maximum. If reconnection fails for more than 5 minutes total, the client marks the station as Failed and stops retrying.
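The reconnect schedule is simple enough to pin down as arithmetic before writing any networking code. The sketch below assumes the sequence continues doubling through 16s before hitting the 30s cap, and that the five-minute budget counts only time spent waiting between attempts; both are readings of the specification rather than stated requirements, and the names are illustrative.

```rust
use std::time::Duration;

const MAX_DELAY: Duration = Duration::from_secs(30);
const GIVE_UP_AFTER: Duration = Duration::from_secs(300);

/// Delay before the given retry (0-based): 1s, 2s, 4s, 8s, 16s, then 30s.
fn reconnect_delay(retry: u32) -> Duration {
    // Cap the shift so large retry counts cannot overflow.
    let secs = 1u64 << retry.min(5);
    Duration::from_secs(secs).min(MAX_DELAY)
}

/// Number of retries whose cumulative waits fit inside the give-up
/// window, ignoring the time spent in the connect attempts themselves.
fn retries_within_window() -> u32 {
    let mut waited = Duration::ZERO;
    let mut retry = 0;
    while waited + reconnect_delay(retry) <= GIVE_UP_AFTER {
        waited += reconnect_delay(retry);
        retry += 1;
    }
    retry
}
```

Under these assumptions the client makes 13 retries before the window closes: five during the doubling phase (31s of waiting in total) and eight more at the 30s cap. Real runs will fit fewer, since each failed connect attempt also consumes time.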
Session Lifecycle
Connecting → Connected → Receiving frames → [disconnect] → Reconnecting → Connected → ...
→ [shutdown signal] → Disconnecting → Stopped
→ [5 min failure] → Failed
The current session state is tracked as an enum and exposed via a watch channel so monitoring systems can observe it.
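One way to keep the lifecycle honest is to write the allowed transitions as a pure function and reject everything else. The sketch below is a simplified reading of the diagram above: it omits the data carried by each state, folds "Receiving frames" into Connected, and uses State, Event, and next_state as hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Connecting,
    Connected,
    Reconnecting,
    Disconnecting,
    Stopped,
    Failed,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    ConnectOk,
    ConnectErr,
    ConnectionLost,
    RetryWindowExceeded,
    ShutdownSignal,
    SocketClosed,
}

/// Return the next state, or None if the event is not valid in `state`.
fn next_state(state: State, event: Event) -> Option<State> {
    let next = match (state, event) {
        (State::Connecting, Event::ConnectOk) => State::Connected,
        (State::Connecting, Event::ConnectErr) => State::Reconnecting,
        (State::Connected, Event::ConnectionLost) => State::Reconnecting,
        (State::Reconnecting, Event::ConnectOk) => State::Connected,
        (State::Reconnecting, Event::ConnectErr) => State::Reconnecting,
        (State::Reconnecting, Event::RetryWindowExceeded) => State::Failed,
        // Shutdown is accepted from any live state.
        (
            State::Connecting | State::Connected | State::Reconnecting,
            Event::ShutdownSignal,
        ) => State::Disconnecting,
        (State::Disconnecting, Event::SocketClosed) => State::Stopped,
        // Stopped and Failed are terminal; anything else is a logic error.
        _ => return None,
    };
    Some(next)
}
```

Routing every state change through a function like this before publishing to the watch channel makes acceptance criterion 7 straightforward to test: an observer should never see a transition for which next_state returns None.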
TLE Refresh
Each active session periodically fetches the TLE record for the session's assigned satellite from the mission API (GET /tle/{norad_id}). The refresh interval is configurable (default: 10 minutes). The HTTP client uses a connect_timeout of 3s and overall timeout of 15s. On 5xx or network errors, the refresh is retried with exponential backoff (up to 3 attempts). On 429, the backoff respects a Retry-After header if present.
Frame Forwarding
Successfully received frames are forwarded to a tokio::sync::mpsc::Sender<Frame>. The frame includes the station ID, the session's current TLE record (if available), and the raw payload. If the downstream channel is full, the frame is dropped and a warning is logged.
Shutdown
A watch::Receiver<bool> shutdown signal is accepted. On signal: complete the current frame read (do not abort mid-frame), flush any buffered writes (send a final GOODBYE frame to the peer), close the TCP connection cleanly, and exit.
Expected Output
A library crate (meridian-gs-client) with:
- A GroundStationClient struct with a run() method
- A SessionState enum and watch channel for state observation
- A Frame struct forwarded to the downstream channel
- A test binary that: connects to a local echo server (you implement a minimal echo server in the test), receives 5 frames, triggers reconnect by having the echo server drop the connection, verifies reconnection, then triggers shutdown
The test binary output should clearly show: initial connection, frame receipt, connection drop, reconnection, and clean shutdown.
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Client reconnects automatically on connection loss with exponential backoff | Yes — drop the server connection and verify reconnection in logs |
| 2 | Reconnection backoff is bounded at 30 seconds | Yes — check timing between reconnect attempts under sustained failure |
| 3 | Client marks station as Failed after 5 minutes of failed reconnections | Yes — simulate sustained connection refusal and verify state transition |
| 4 | TLE refresh runs on the configured interval and retries on 5xx/network errors | Yes — mock server returning 503 then 200 |
| 5 | Frame forwarding uses try_send — channel-full does not block the receive loop | Yes — code review and test with a slow downstream consumer |
| 6 | Shutdown completes the current frame before exiting | Yes — send a large frame and trigger shutdown mid-send; frame arrives complete |
| 7 | Session state transitions are correctly published to the watch channel | Yes — observer task sees all transitions in order |
Hints
Hint 1 — Session state machine
```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SessionState {
    Connecting { attempt: u32 },
    Connected { since: std::time::Instant },
    Reconnecting { attempt: u32, next_retry: std::time::Instant },
    Disconnecting,
    Failed { reason: String },
    Stopped,
}
```
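Publish state changes via watch::Sender<SessionState>. Observers call borrow() to read the current state or changed().await to wait for the next transition. A watch channel retains only the most recent value, so an observer that needs to see every transition must call changed().await again promptly after each one; a slow observer sees only the latest state.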
Hint 2 — Reconnect loop structure
```rust
use std::time::{Duration, Instant};
use tokio::sync::{mpsc, watch};

async fn run_with_reconnect(
    config: &ClientConfig,
    tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
    state_tx: watch::Sender<SessionState>,
) {
    let mut attempt = 0u32;
    // Start of the current run of failures. It is reset after every session
    // so the 5-minute limit measures sustained failure, not total uptime.
    let mut failing_since = Instant::now();

    loop {
        if *shutdown.borrow() {
            break;
        }
        if failing_since.elapsed() > Duration::from_secs(300) {
            let _ = state_tx.send(SessionState::Failed {
                reason: "reconnection window exceeded".into(),
            });
            break;
        }

        let _ = state_tx.send(SessionState::Connecting { attempt });

        match tokio::net::TcpStream::connect(&config.addr).await {
            Ok(stream) => {
                attempt = 0; // Reset on successful connection.
                let _ = state_tx.send(SessionState::Connected {
                    since: Instant::now(),
                });
                // Run the session until it disconnects or shutdown is signalled.
                run_session(stream, config, &tx, &mut shutdown, &state_tx).await;
                if *shutdown.borrow() {
                    break;
                }
                // The session worked, so the failure window starts over.
                failing_since = Instant::now();
            }
            Err(e) => {
                tracing::warn!("connection failed (attempt {attempt}): {e}");
            }
        }

        attempt += 1;
        // 2, 4, 8, 16 seconds, then capped at 30.
        let delay = Duration::from_secs((1u64 << attempt.min(5)).min(30));
        let _ = state_tx.send(SessionState::Reconnecting {
            attempt,
            next_retry: Instant::now() + delay,
        });
        tokio::time::sleep(delay).await;
    }
}
```
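The delay expression in the hint is easy to misread, so it can help to isolate it as a pure function and check the schedule it produces. This is a minimal sketch; backoff_delay and total_backoff are hypothetical helper names, and the arithmetic is copied from the hint above.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (1-based): 2^attempt seconds,
/// with the exponent capped at 5 and the result capped at 30 seconds.
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs((1u64 << attempt.min(5)).min(30))
}

/// Total time spent sleeping across the first `attempts` retries.
pub fn total_backoff(attempts: u32) -> Duration {
    (1..=attempts).map(backoff_delay).sum()
}
```

The schedule is 2, 4, 8, 16, 30, 30, ... seconds. The first five retries sleep for 60 seconds in total, and thirteen retries sleep for exactly 300 seconds, which is useful when sizing the sustained-failure test for criteria 2 and 3.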
Hint 3 — TLE refresh as a background task per session
Spawn a TLE refresh task when the session connects. Abort it when the session disconnects. Use a watch::Sender<Option<TleRecord>> to share the current TLE with the frame handler:
```rust
async fn run_tle_refresh(
    http: reqwest::Client,
    norad_id: u32,
    interval: std::time::Duration,
    tle_tx: tokio::sync::watch::Sender<Option<TleRecord>>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            _ = tokio::time::sleep(interval) => {
                match fetch_tle_with_retry(&http, norad_id, 3).await {
                    Ok(tle) => { let _ = tle_tx.send(Some(tle)); }
                    Err(e) => tracing::warn!(norad_id, "TLE refresh failed: {e}"),
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
        }
    }
}
```
Hint 4 — Sending a GOODBYE frame on clean shutdown
```rust
use tokio::io::AsyncWriteExt;

async fn send_goodbye(stream: &mut tokio::net::TcpStream) {
    const GOODBYE: &[u8] = b"GOODBYE";
    let len = (GOODBYE.len() as u32).to_be_bytes();
    // Best-effort: ignore errors (we're shutting down anyway).
    let _ = stream.write_all(&len).await;
    let _ = stream.write_all(GOODBYE).await;
    let _ = stream.flush().await;
    let _ = stream.shutdown().await;
}
```
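The GOODBYE hint writes a 4-byte big-endian length followed by the payload. A sketch of that framing as pure functions makes the wire format testable without a socket. The names encode_frame, decode_frame, and MAX_FRAME_LEN are hypothetical; the 65,536-byte limit is the one the reference implementation's reader enforces.

```rust
/// Largest payload the decoder accepts (matches the reference reader's limit).
pub const MAX_FRAME_LEN: usize = 65_536;

/// Prefix `payload` with its length as a 4-byte big-endian integer.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Try to split one frame off the front of `buf`.
/// Returns Ok(None) when more bytes are needed, and
/// Ok(Some((payload, bytes_consumed))) when a whole frame is present.
pub fn decode_frame(buf: &[u8]) -> Result<Option<(&[u8], usize)>, String> {
    if buf.len() < 4 {
        return Ok(None);
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    if len > MAX_FRAME_LEN {
        return Err(format!("frame too large: {len}"));
    }
    if buf.len() < 4 + len {
        return Ok(None); // Header seen, payload still incomplete.
    }
    Ok(Some((&buf[4..4 + len], 4 + len)))
}
```

Encoding GOODBYE yields 11 bytes: 00 00 00 07 followed by the seven ASCII characters. The echo server in the test binary can use the same two functions to recognise the farewell frame.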
Reference Implementation
```rust
use anyhow::Result;
use reqwest::Client as HttpClient;
use serde::Deserialize;
use std::time::{Duration, Instant};
use tokio::{
    io::{AsyncReadExt, AsyncWriteExt},
    net::TcpStream,
    sync::{mpsc, watch},
    time::{sleep, timeout},
};
use tracing::{info, warn};

#[derive(Debug, Clone, Deserialize)]
pub struct TleRecord {
    pub norad_id: u32,
    pub name: String,
    pub line1: String,
    pub line2: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SessionState {
    Connecting { attempt: u32 },
    Connected,
    Reconnecting { attempt: u32 },
    Failed { reason: String },
    Stopped,
}

#[derive(Debug)]
pub struct Frame {
    pub station_id: String,
    pub tle: Option<TleRecord>,
    pub payload: Vec<u8>,
}

pub struct ClientConfig {
    pub station_id: String,
    pub addr: String,
    pub norad_id: u32,
    pub api_base_url: String,
    pub tle_refresh_interval: Duration,
}

/// Read one length-prefixed frame. Returns Ok(None) when the peer closes
/// the connection before a new frame starts.
async fn read_frame(stream: &mut TcpStream) -> Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match stream.read_exact(&mut len_buf).await {
        Ok(_) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e.into()),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        anyhow::bail!("frame too large: {len}");
    }
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf).await?;
    Ok(Some(buf))
}

/// Fetch the TLE, retrying up to 3 attempts on 5xx and network errors.
async fn fetch_tle(http: &HttpClient, base_url: &str, norad_id: u32) -> Result<TleRecord> {
    let url = format!("{base_url}/tle/{norad_id}");
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        match http.get(&url).send().await {
            Ok(r) if r.status().is_success() => {
                return Ok(r.json::<TleRecord>().await?);
            }
            Ok(r) if r.status().is_server_error() && attempt < 3 => {
                warn!(norad_id, attempt, "TLE fetch server error, retrying");
                sleep(Duration::from_secs(1 << attempt)).await;
            }
            Ok(r) => anyhow::bail!("TLE fetch: HTTP {}", r.status()),
            Err(e) if (e.is_connect() || e.is_timeout()) && attempt < 3 => {
                warn!(norad_id, attempt, "TLE fetch network error, retrying");
                sleep(Duration::from_secs(1 << attempt)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
}

async fn run_session(
    mut stream: TcpStream,
    config: &ClientConfig,
    http: &HttpClient,
    frame_tx: &mpsc::Sender<Frame>,
    tle_tx: &watch::Sender<Option<TleRecord>>,
    mut shutdown: watch::Receiver<bool>,
) {
    // Kick off the TLE refresh task for this session.
    let (session_shutdown_tx, session_shutdown_rx) = watch::channel(false);
    let tle_refresh = {
        let http = http.clone();
        let base = config.api_base_url.clone();
        let norad_id = config.norad_id;
        let interval = config.tle_refresh_interval;
        let tle_tx = tle_tx.clone();
        tokio::spawn(async move {
            let mut sd = session_shutdown_rx;
            loop {
                tokio::select! {
                    _ = sleep(interval) => {
                        match fetch_tle(&http, &base, norad_id).await {
                            // send_replace stores the value even when no receiver
                            // is subscribed; send() would fail and discard it.
                            Ok(tle) => { let _ = tle_tx.send_replace(Some(tle)); }
                            Err(e) => warn!("TLE refresh failed: {e}"),
                        }
                    }
                    Ok(()) = sd.changed() => {
                        if *sd.borrow() { break; }
                    }
                }
            }
        })
    };

    loop {
        tokio::select! {
            biased;
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    info!(station = %config.station_id, "shutdown: sending GOODBYE");
                    let _ = session_shutdown_tx.send(true);
                    let payload = b"GOODBYE";
                    let len = (payload.len() as u32).to_be_bytes();
                    let _ = stream.write_all(&len).await;
                    let _ = stream.write_all(payload).await;
                    let _ = stream.flush().await;
                    let _ = stream.shutdown().await;
                    break;
                }
            }
            // Wait for readability only. readable() is cancel-safe, so a shutdown
            // that wins this select never interrupts a partly read frame.
            ready = timeout(Duration::from_secs(60), stream.readable()) => {
                match ready {
                    Err(_) => {
                        warn!(station = %config.station_id, "session timeout");
                        break;
                    }
                    Ok(Err(e)) => {
                        warn!(station = %config.station_id, "read error: {e}");
                        break;
                    }
                    Ok(Ok(())) => {}
                }
                // Data is pending: read the whole frame outside the select so it
                // runs to completion before shutdown is checked again.
                match timeout(Duration::from_secs(60), read_frame(&mut stream)).await {
                    Err(_) => {
                        warn!(station = %config.station_id, "session timeout");
                        break;
                    }
                    Ok(Ok(Some(payload))) => {
                        let tle = tle_tx.borrow().clone();
                        let f = Frame {
                            station_id: config.station_id.clone(),
                            tle,
                            payload,
                        };
                        if frame_tx.try_send(f).is_err() {
                            warn!(station = %config.station_id, "frame dropped: pipeline full");
                        }
                    }
                    Ok(Ok(None)) => {
                        info!(station = %config.station_id, "peer closed connection");
                        break;
                    }
                    Ok(Err(e)) => {
                        warn!(station = %config.station_id, "read error: {e}");
                        break;
                    }
                }
            }
        }
    }

    let _ = session_shutdown_tx.send(true);
    let _ = tle_refresh.await;
}

pub async fn run_client(
    config: ClientConfig,
    frame_tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
    state_tx: watch::Sender<SessionState>,
) {
    let http = HttpClient::builder()
        .connect_timeout(Duration::from_secs(3))
        .timeout(Duration::from_secs(15))
        .build()
        .expect("failed to build HTTP client");

    // No receiver is kept: readers call tle_tx.borrow(), writers call send_replace().
    let (tle_tx, _) = watch::channel::<Option<TleRecord>>(None);
    let mut attempt = 0u32;
    // Start of the current run of failures; reset after every session.
    let mut failing_since = Instant::now();

    loop {
        if *shutdown.borrow() {
            break;
        }
        if failing_since.elapsed() > Duration::from_secs(300) {
            let _ = state_tx.send(SessionState::Failed {
                reason: "5-minute reconnect window exceeded".into(),
            });
            return;
        }

        let _ = state_tx.send(SessionState::Connecting { attempt });

        match TcpStream::connect(&config.addr).await {
            Ok(stream) => {
                attempt = 0;
                let _ = state_tx.send(SessionState::Connected);
                info!(station = %config.station_id, "connected to {}", config.addr);
                run_session(stream, &config, &http, &frame_tx, &tle_tx, shutdown.clone()).await;
                if *shutdown.borrow() {
                    break;
                }
                failing_since = Instant::now();
                info!(station = %config.station_id, "session ended, will reconnect");
            }
            Err(e) => {
                warn!(station = %config.station_id, attempt, "connection failed: {e}");
            }
        }

        attempt += 1;
        let delay = Duration::from_secs((1u64 << attempt.min(5)).min(30));
        let _ = state_tx.send(SessionState::Reconnecting { attempt });
        info!(station = %config.station_id, "reconnecting in {delay:?}");
        // Back off, but wake early if the shutdown flag changes.
        tokio::select! {
            _ = sleep(delay) => {}
            Ok(()) = shutdown.changed() => {}
        }
    }

    let _ = state_tx.send(SessionState::Stopped);
    info!(station = %config.station_id, "client stopped");
}
```
Reflection
The ground station client built here is the connection layer between the raw TCP socket and the telemetry aggregator from Module 3. It combines this module's lessons: TcpStream from Lesson 1 carries the framed session protocol, and reqwest from Lesson 3 drives the TLE refresh background task within each session. UDP from Lesson 2 is not used here, but could be added for out-of-band sensor feeds from the same station.
The reconnection loop pattern — state machine published to a watch channel, exponential backoff, failure timeout — is universal. It applies equally to database connections, message broker connections, and any other persistent network resource that needs supervisory recovery behaviour.
Module 05 — Data-Oriented Design in Rust
Track: Foundation — Mission Control Platform
Position: Module 5 of 6
Source material: Rust for Rustaceans — Jon Gjengset, Chapters 2, 9
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — High-Throughput Telemetry Packet Processor
- Prerequisites
- What Comes Next
Mission Context
The Meridian telemetry processor runs at 62,000 frames per second. The conjunction avoidance pipeline requires 100,000. The gap is not a missing algorithm or a suboptimal data structure — it is allocator pressure and cache waste, both caused by data layout decisions made when defining types. Each frame allocates a Vec<u8> on the global heap. Each deduplication pass loads 2.4× more data than the deduplication logic uses.
Data-oriented design is a discipline for making data layout decisions that align with CPU hardware realities: cache lines are 64 bytes, cache misses cost 100–300 cycles, and SIMD instructions operate on contiguous uniform-type data. The three techniques in this module — cache-optimal struct layout, SoA field separation, and arena allocation — directly address the two profiling findings above.
What You Will Learn
By the end of this module you will be able to:
- Explain how field alignment and padding inflate struct sizes, use repr attributes to control layout, and write const assertions to lock in size expectations at compile time
- Identify false sharing between concurrent tasks, apply repr(align(64)) with padding to isolate per-thread data to separate cache lines, and separate hot fields from cold fields in structs used in high-volume collections
- Explain when SoA layout outperforms AoS (field-subset sequential operations) and when AoS outperforms SoA (per-entity random access), implement an OrbitalCatalog using field grouping, and transition from AoS to SoA incrementally without a full rewrite
- Implement a bump/arena allocator for same-lifetime batch allocations, contrast its allocation cost with the global allocator, use thread-local arenas for zero-contention concurrent allocation, and identify when arena allocation is inappropriate (mixed lifetimes, individual deallocation)
Lessons
Lesson 1 — Cache-Friendly Data Layouts: Struct Layout, Padding, and Cache Line Alignment
Covers alignment and padding mechanics, repr(C) vs repr(Rust) vs repr(packed) vs repr(align(n)), false sharing between concurrent tasks, repr(align(64)) for per-thread counter isolation, and hot/cold field separation. Grounded in Rust for Rustaceans, Chapter 2.
Key question this lesson answers: How does field order affect struct size, what causes false sharing between concurrent tasks, and how do you isolate hot data from cold data?
→ lesson-01-cache-friendly-layouts.md / lesson-01-quiz.toml
Lesson 2 — Struct-of-Arrays vs Array-of-Structs: When Each Wins
Covers the AoS and SoA layout patterns, the cache utilisation argument for each, the conditions that favour SoA (field-subset sequential scans, SIMD, large N), the conditions that favour AoS (per-entity random access, all-field operations), the hybrid field-grouping pattern, and incremental AoS-to-SoA transition via a companion index.
Key question this lesson answers: When does splitting fields into separate vectors improve performance, and when does it hurt?
→ lesson-02-soa-vs-aos.md / lesson-02-quiz.toml
Lesson 3 — Arena Allocation: Bump Allocators for High-Throughput Telemetry Processing
Covers the global allocator's cost for high-frequency short-lived allocations, the bump allocator pattern (O(1) alloc, O(1) epoch free), the lifetime constraint, thread-local arenas for zero-contention concurrent allocation, the bumpalo crate interface, and the workloads where arena allocation is inappropriate.
Key question this lesson answers: When is the global allocator the bottleneck, and how does a bump allocator eliminate that overhead for same-lifetime batch objects?
→ lesson-03-arena-allocation.md / lesson-03-quiz.toml
Capstone Project — High-Throughput Telemetry Packet Processor
Rebuild the Meridian telemetry processor core to achieve ≥100,000 frames/sec using all three techniques: a 24-byte FrameHeader with const size assertion, SoA separation of headers from arena-allocated payloads, bump arena for batch payload allocation with O(1) epoch reset, and SoA-based deduplication operating only on the hot header array.
Acceptance is against 7 verifiable criteria including compile-time size assertions, no per-frame heap allocations, correct arena reset, and measured throughput.
→ project-telemetry-processor.md
Prerequisites
Modules 1–4 must be complete. Module 2 (Concurrency Primitives) introduced atomic operations and the false sharing problem — Lesson 1 of this module extends that with the repr(align(64)) solution. Module 5's content stands alone otherwise; it does not build on the networking or message-passing material from Modules 3–4.
What Comes Next
Module 6 — Performance and Profiling gives you the measurement tools to validate the optimisations introduced here: criterion for reliable microbenchmarks, flamegraph and perf for identifying hot paths, and heap profiling for measuring allocator pressure. You will profile the processor built in this module's project and verify the improvement against the baseline.
Lesson 1 — Cache-Friendly Data Layouts: Struct Layout, Padding, and Cache Line Alignment
Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 1 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 2
Context
The Meridian telemetry processor receives 100,000 frames per second at peak load across 48 uplinks. Each frame passes through validation, deduplication, and routing — operations that read specific fields from a TelemetryFrame struct on every iteration. At that throughput, the cost of a CPU cache miss — roughly 100–300 clock cycles to fetch from RAM, compared to 4 cycles for an L1 cache hit — is the difference between keeping up and falling behind.
CPU cache performance is not something you can bolt on after profiling shows a bottleneck. It is determined by the decisions you make when you define your data types. How fields are ordered. How large structs are. Whether hot fields and cold fields share a cache line. These decisions are locked in by the struct definition, and changing them later requires touching every callsite that constructs or accesses the type.
This lesson covers the mechanics that determine how Rust lays out your types in memory, the repr attributes that control those mechanics, and how to make decisions that keep hot data in cache.
Source: Rust for Rustaceans, Chapter 2 (Gjengset)
Core Concepts
Alignment and Padding
Every type has an alignment requirement — the CPU needs its address to be a multiple of some power of two. A u8 needs 1-byte alignment. A u32 needs 4-byte alignment. A u64 needs 8-byte alignment.
When you put fields of different alignments in a struct, the compiler inserts padding bytes between fields to satisfy alignment requirements (Rust for Rustaceans, Ch. 2). Consider this struct with #[repr(C)] (which preserves field order):
```rust
#[repr(C)]
struct BadLayout {
    tiny: bool,   // 1 byte
                  // 3 bytes padding, to align `normal` to 4 bytes
    normal: u32,  // 4 bytes
    small: u8,    // 1 byte
                  // 7 bytes padding, to align `long` to 8 bytes
    long: u64,    // 8 bytes
    short: u16,   // 2 bytes
                  // 6 bytes padding, to make total size a multiple of alignment (8)
}
// Total: 32 bytes. Actual data: 16 bytes. Wasted: 16 bytes, so half the struct is padding.
```
With #[repr(Rust)] (the default), the compiler is free to reorder fields by size, descending — eliminating most padding:
```rust
// Default Rust layout: the compiler reorders fields for minimal padding.
// Effective order: long (8), normal (4), short (2), tiny (1), small (1)
struct GoodLayout {
    tiny: bool,
    normal: u32,
    small: u8,
    long: u64,
    short: u16,
}
// Total: 16 bytes. Same fields, no wasted padding.
```
The difference at scale: a Vec<BadLayout> of 1 million elements occupies 32 MB. A Vec<GoodLayout> with the same data occupies 16 MB. Each 64-byte cache line then holds four elements instead of two, so a sequential scan touches half as many cache lines and incurs roughly half as many cache misses.
You can verify sizes at compile time with std::mem::size_of:
```rust
#[repr(C)]
struct BadLayout { tiny: bool, normal: u32, small: u8, long: u64, short: u16 }

struct GoodLayout { tiny: bool, normal: u32, small: u8, long: u64, short: u16 }

fn main() {
    // Confirm the size difference at compile time.
    const _: () = assert!(std::mem::size_of::<BadLayout>() == 32);
    const _: () = assert!(std::mem::size_of::<GoodLayout>() == 16);

    println!("BadLayout:  {} bytes", std::mem::size_of::<BadLayout>());
    println!("GoodLayout: {} bytes", std::mem::size_of::<GoodLayout>());
}
```
Use const assertions as compile-time guards on struct sizes for types that appear in high-volume collections. When a future change accidentally adds padding, the assertion fails at compile time rather than silently degrading cache performance.
repr Attributes
repr(Rust): the default. The compiler may reorder fields for minimal padding and does not guarantee a specific layout. This is optimal for Rust-only code but incompatible with C interop.
repr(C): fields are laid out in declaration order, C-compatible. Required when passing structs across FFI boundaries. The cost is potentially more padding if fields are not ordered by descending alignment.
repr(packed): removes all padding, so fields may be misaligned. Misaligned access can be slower, and on architectures that require alignment a misaligned access through a raw pointer can fault. Rust also rejects taking a reference to a misaligned packed field, so such fields have to be copied out by value. Use only when minimizing memory footprint is more important than access speed, for example in serialized wire formats or extremely memory-constrained environments.
repr(align(n)): forces the struct to have at least n-byte alignment and rounds its size up to a multiple of n. The most common use in systems programming is cache line alignment for concurrent data structures:
```rust
use std::sync::atomic::AtomicU64;

// Each counter occupies a full 64-byte cache line.
// Without this: two counters from different threads share a cache line,
// causing false sharing. Each write by one thread invalidates the
// other thread's cache entry even though they touch different data.
#[repr(align(64))]
struct CacheAlignedCounter {
    value: AtomicU64,
    _pad: [u8; 56], // Explicit padding to fill the 64-byte cache line.
}
```
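The effect of each attribute can be checked directly with size_of and align_of. The sketch below reuses the BadLayout field set under three representations; DeclOrder, Packed, LineAligned, and read_long are hypothetical names introduced for illustration.

```rust
use std::mem::{align_of, size_of};

/// Declaration order preserved: padding between fields, 32 bytes total.
#[repr(C)]
pub struct DeclOrder { pub tiny: bool, pub normal: u32, pub small: u8, pub long: u64, pub short: u16 }

/// Same fields with all padding removed: 16 bytes, alignment 1.
#[repr(C, packed)]
pub struct Packed { pub tiny: bool, pub normal: u32, pub small: u8, pub long: u64, pub short: u16 }

/// 8 bytes of data; the alignment attribute rounds the size up to 64
/// without any explicit padding field.
#[repr(align(64))]
pub struct LineAligned { pub value: u64 }

const _: () = assert!(size_of::<DeclOrder>() == 32);
const _: () = assert!(size_of::<Packed>() == 16 && align_of::<Packed>() == 1);
const _: () = assert!(size_of::<LineAligned>() == 64 && align_of::<LineAligned>() == 64);

/// Packed fields are read by value. Writing `&p.long` here would not compile,
/// because the field may sit at a misaligned address.
pub fn read_long(p: &Packed) -> u64 {
    p.long
}
```

Note that LineAligned reaches 64 bytes on its own: a type's size is always a multiple of its alignment, so an explicit _pad field documents intent rather than changing the layout.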
Cache Lines and False Sharing
A CPU cache line is 64 bytes on x86-64. The CPU fetches and evicts cache lines as atomic units — not individual bytes or words. When two logical pieces of data share a cache line, any write to either one invalidates the entire line in every other core's cache.
False sharing occurs when two threads write to different variables that happen to occupy the same cache line (Rust for Rustaceans, Ch. 2). Each write by either thread causes the cache line to bounce between cores — effectively serializing what should be independent writes:
```rust
use std::sync::atomic::{AtomicU64, Ordering};

// BAD: both counters fit in one 16-byte struct, sharing a cache line.
// Thread A's writes to `a` invalidate thread B's cached copy of the line,
// which also contains `b`. Both threads contend on the same cache line.
struct SharedCounters {
    a: AtomicU64,
    b: AtomicU64,
}

// GOOD: each counter on its own cache line.
#[repr(align(64))]
struct IsolatedCounter {
    value: AtomicU64,
}

fn demonstrate_false_sharing() {
    // With SharedCounters: threads A and B writing independently
    // still cause cache coherence traffic because they share a line.
    // With two IsolatedCounter instances: threads write truly independently.
    let counter_a = IsolatedCounter { value: AtomicU64::new(0) };
    let counter_b = IsolatedCounter { value: AtomicU64::new(0) };

    // counter_a and counter_b now occupy separate cache lines.
    // Writes by one thread do not invalidate the other's cache entry.
    counter_a.value.fetch_add(1, Ordering::Relaxed);
    counter_b.value.fetch_add(1, Ordering::Relaxed);
}
```
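The listing above touches both counters from a single thread. A sketch that actually runs one writer thread per counter, using scoped threads from the standard library, shows the intended usage. PaddedCounter, hammer, and on_distinct_lines are hypothetical names, and the 64-byte line size is the x86-64 figure used throughout this lesson.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// One counter per cache line (64 bytes assumed).
#[repr(align(64))]
pub struct PaddedCounter {
    pub value: AtomicU64,
}

/// Spawn one thread per counter; each thread increments only its own counter.
pub fn hammer(counters: &[PaddedCounter], iterations: u64) {
    thread::scope(|s| {
        for c in counters {
            s.spawn(move || {
                for _ in 0..iterations {
                    c.value.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    }); // The scope joins every thread before returning.
}

/// True when no two adjacent counters start on the same 64-byte line.
pub fn on_distinct_lines(counters: &[PaddedCounter]) -> bool {
    counters.windows(2).all(|w| {
        let a = &w[0] as *const PaddedCounter as usize;
        let b = &w[1] as *const PaddedCounter as usize;
        a / 64 != b / 64
    })
}
```

To observe the cost of false sharing, time hammer against the same loop over a plain [AtomicU64; N]. The size of the gap depends on the machine, so treat any measurement as specific to the hardware it ran on.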
Hot Field / Cold Field Separation
Not all fields in a struct are accessed with equal frequency. For a TelemetryFrame, the routing fields (satellite_id, sequence) are read on every frame. The full payload is only read when forwarding downstream. Putting hot and cold data in the same struct means every cache miss for a hot field also loads the cold payload into cache — evicting other useful data.
The pattern: split the struct. Keep a hot "header" struct with frequently accessed fields, and access the cold data via an Arc<Vec<u8>> or a separate index:
```rust
use std::sync::Arc;

// Hot: accessed on every frame for routing decisions.
// 24 bytes: fits comfortably in cache alongside many sibling headers.
struct FrameHeader {
    satellite_id: u32, // 4 bytes
    sequence: u64,     // 8 bytes
    timestamp_ms: u64, // 8 bytes
    flags: u8,         // 1 byte
    _pad: [u8; 3],     // 3 bytes padding (explicit, documented)
}

// Cold: accessed only when forwarding to downstream consumers.
// Heap-allocated; not loaded until needed.
struct FrameBody {
    header: FrameHeader,
    payload: Arc<Vec<u8>>, // Heap allocation keeps cold data out of hot path.
}
```
A Vec<FrameHeader> for routing decisions keeps 24-byte hot entries packed. 64 bytes (one cache line) holds 2 full headers plus change — much better than loading 24 + payload.len() bytes per frame just to check a sequence number.
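As an illustration of the split, the sketch below keeps headers and payloads in two vectors that share an index, and runs a deduplication pass that reads headers only. FrameStore and unique_indices are hypothetical names, and plain Vec<u8> payloads stand in for whatever cold storage the real processor uses.

```rust
use std::collections::HashSet;

/// Hot routing fields only: 21 bytes of data, 24 bytes with alignment.
#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub sequence: u64,
    pub timestamp_ms: u64,
    pub satellite_id: u32,
    pub flags: u8,
}

const _: () = assert!(std::mem::size_of::<FrameHeader>() == 24);

/// Headers and payloads live in separate vectors that share an index.
#[derive(Default)]
pub struct FrameStore {
    pub headers: Vec<FrameHeader>, // Hot: scanned on every pass.
    pub payloads: Vec<Vec<u8>>,    // Cold: touched only when forwarding.
}

impl FrameStore {
    pub fn push(&mut self, header: FrameHeader, payload: Vec<u8>) {
        self.headers.push(header);
        self.payloads.push(payload);
    }

    /// Indices of the first occurrence of each (satellite_id, sequence) pair.
    /// The scan reads `headers` only; no payload bytes are loaded.
    pub fn unique_indices(&self) -> Vec<usize> {
        let mut seen = HashSet::new();
        self.headers
            .iter()
            .enumerate()
            .filter(|(_, h)| seen.insert((h.satellite_id, h.sequence)))
            .map(|(i, _)| i)
            .collect()
    }
}
```

The returned indices are then used to fetch payloads for forwarding, so cold data is touched once per surviving frame instead of once per scanned frame.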
Code Examples
Verifying Layout Decisions at Compile Time
Use constant assertions to lock in size expectations for hot types. This catches accidental regressions — adding a field that introduces padding shows up as a compile error immediately.
```rust
use std::mem::{align_of, size_of};

/// A telemetry frame header optimized for sequential scanning.
/// Fields ordered by alignment (descending) to minimize padding.
#[derive(Debug, Clone, Copy)]
pub struct TelemetryHeader {
    pub timestamp_ms: u64,   // 8 bytes, largest alignment first
    pub sequence: u64,       // 8 bytes
    pub satellite_id: u32,   // 4 bytes
    pub byte_count: u32,     // 4 bytes
    pub flags: u8,           // 1 byte
    pub station_id: u8,      // 1 byte
    pub _reserved: [u8; 6],  // 6 bytes explicit pad: 8+8+4+4+1+1+6 = 32, no hidden padding
}

// Lock in the expected size at compile time.
// If a future change causes unexpected padding, this fails to compile.
const _SIZE_CHECK: () = assert!(size_of::<TelemetryHeader>() == 32);
const _ALIGN_CHECK: () = assert!(align_of::<TelemetryHeader>() == 8);

/// A per-uplink session counter, cache-line aligned to prevent false sharing.
/// 48 sessions each updating their own counter never contend on a shared line.
#[repr(align(64))]
pub struct SessionCounter {
    pub frames_received: u64,
    pub bytes_received: u64,
    pub frames_dropped: u64,
    _pad: [u8; 40], // Pad to fill 64-byte cache line: 3×8 + 40 = 64.
}

const _COUNTER_ALIGN: () = assert!(align_of::<SessionCounter>() == 64);
const _COUNTER_SIZE: () = assert!(size_of::<SessionCounter>() == 64);

fn main() {
    println!("TelemetryHeader: {} bytes", size_of::<TelemetryHeader>());
    println!("SessionCounter: {} bytes (cache-line aligned)", size_of::<SessionCounter>());

    // Verify that an array of counters places each on its own cache line.
    let counters: Vec<SessionCounter> = (0..4)
        .map(|_| SessionCounter {
            frames_received: 0,
            bytes_received: 0,
            frames_dropped: 0,
            _pad: [0; 40],
        })
        .collect();

    // Each counter is 64 bytes and 64-byte aligned, so there is no false sharing.
    for (i, c) in counters.iter().enumerate() {
        let addr = c as *const _ as usize;
        println!("counter[{i}] at 0x{addr:x} (aligned: {})", addr % 64 == 0);
    }
}
```
Key Takeaways
- The compiler inserts padding between fields to satisfy alignment requirements. Under repr(C), field order determines how much padding is inserted, and ordering fields by decreasing alignment minimizes it. With the default repr(Rust), the compiler performs that reordering for you.
- repr(Rust) (default) allows the compiler to reorder fields, which is usually optimal. repr(C) preserves field order for FFI compatibility at the potential cost of more padding. repr(packed) removes padding but risks misaligned access penalties.
- repr(align(n)) forces a minimum alignment. Use it to ensure hot atomic counters occupy separate cache lines when accessed from multiple threads concurrently, preventing false sharing.
- False sharing occurs when two threads write to different variables that share a 64-byte cache line. The fix is repr(align(64)), which also rounds the struct's size up to a multiple of 64 bytes; explicit padding documents the intent but is not required.
- Separate hot fields (read on every iteration) from cold fields (read rarely). A struct that bundles both forces the CPU to load cold data into cache on every hot access, evicting more useful data. Use a header struct for hot fields and heap-allocated or indexed access for cold data.
- Use const assertions on size_of and align_of for types in high-volume collections. They turn accidental layout regressions into compile errors rather than silent performance degradation.
Lesson 2 — Struct-of-Arrays vs Array-of-Structs: When Each Wins
Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 2 of 3
Source: Synthesized from training knowledge. Concepts would benefit from verification against: Mike Acton's CppCon 2014 "Data-Oriented Design and C++" and Chandler Carruth's related talks.
Context
The Meridian conjunction screening pass processes 50,000 orbital elements every 10 minutes. Each screening step reads the altitude and inclination of every object in the catalog. It does not read the object name, the launch date, the operator contact, or any other administrative metadata. Those fields exist in the catalog, but the screening loop does not touch them.
With a conventional struct design (one OrbitalObject struct with all fields), the screening loop loads each full struct into cache on every iteration. The fields it actually uses are 16 bytes; the fields it ignores are perhaps 120 bytes. Only about 12% of the data brought into cache (16 of 136 bytes) is useful. The rest is wasted memory bandwidth.
This is the core insight behind struct-of-arrays (SoA) layout: if an operation only accesses a subset of fields, store those fields contiguously rather than interleaved with irrelevant fields. Processing altitudes[0..50000] accesses only altitude data; there is no orbital metadata in the working set, no cache line waste.
This lesson covers when SoA beats AoS, when AoS beats SoA, and how to implement the transition idiomatically in Rust.
Core Concepts
Array-of-Structs (AoS): The Default
The conventional layout: a Vec<T> where T is a struct containing all fields for one entity.
```rust
// Array-of-Structs: all fields for one object are contiguous.
#[derive(Debug, Clone)]
struct OrbitalObject {
    norad_id: u32,        // 4 bytes
    altitude_km: f64,     // 8 bytes, used in conjunction screening
    inclination_deg: f64, // 8 bytes, used in conjunction screening
    raan_deg: f64,        // 8 bytes, used in conjunction screening
    name: [u8; 24],       // 24 bytes, NOT used in conjunction screening
    launch_year: u16,     // 2 bytes, NOT used in conjunction screening
    _pad: [u8; 6],        // 6 bytes explicit padding
}
// One OrbitalObject: 64 bytes (60 bytes of fields, rounded up to 8-byte
// alignment). One cache line.
// Full screening uses 28 bytes (norad_id + 3 doubles).
// 36 bytes of every cache line are irrelevant to screening.

fn screen_conjunctions_aos(objects: &[OrbitalObject], threshold_km: f64) -> Vec<u32> {
    let mut alerts = Vec::new();
    for (i, a) in objects.iter().enumerate() {
        for b in &objects[i + 1..] {
            // Each iteration loads a full 64-byte OrbitalObject into cache.
            // This simplified check reads only altitude_km (and norad_id on a hit).
            let delta = (a.altitude_km - b.altitude_km).abs();
            if delta < threshold_km {
                alerts.push(a.norad_id);
            }
        }
    }
    alerts
}
```
For a 50,000-object catalog, AoS loads 50,000 × 64 bytes = 3.2 MB per pass, even though only 28 bytes per object matter. With one object per 64-byte cache line, 56% of the cache bandwidth (36 of 64 bytes) is spent on unused fields.
Struct-of-Arrays (SoA): Fields in Separate Vectors
SoA inverts the layout: instead of one Vec<Object>, maintain separate Vec<field_type> for each field. Objects are indexed by position across all vectors.
```rust
// Struct-of-Arrays: each field is a contiguous array.
// Access patterns that touch only a few fields see only those fields in cache.
struct OrbitalCatalog {
    // "Hot" fields: accessed every screening pass.
    norad_ids: Vec<u32>,
    altitudes_km: Vec<f64>,
    inclinations_deg: Vec<f64>,
    raans_deg: Vec<f64>,
    // "Cold" fields: accessed only for display / export.
    names: Vec<[u8; 24]>,
    launch_years: Vec<u16>,
}

impl OrbitalCatalog {
    fn len(&self) -> usize {
        self.norad_ids.len()
    }

    fn push(&mut self, id: u32, alt: f64, inc: f64, raan: f64, name: [u8; 24], launch: u16) {
        self.norad_ids.push(id);
        self.altitudes_km.push(alt);
        self.inclinations_deg.push(inc);
        self.raans_deg.push(raan);
        self.names.push(name);
        self.launch_years.push(launch);
    }
}

fn screen_conjunctions_soa(catalog: &OrbitalCatalog, threshold_km: f64) -> Vec<u32> {
    let alts = &catalog.altitudes_km;
    let ids = &catalog.norad_ids;
    let mut alerts = Vec::new();

    for i in 0..catalog.len() {
        for j in i + 1..catalog.len() {
            // Only altitudes_km is touched here: 8 bytes per element.
            // 8 f64s fit in one cache line.
            // For 50k objects, working set for altitudes_km = 400KB (fits in L2).
            let delta = (alts[i] - alts[j]).abs();
            if delta < threshold_km {
                alerts.push(ids[i]);
            }
        }
    }
    alerts
}
```
The screening loop now accesses only altitudes_km — 50,000 × 8 bytes = 400 KB, which fits in a typical L2 cache (512KB–2MB). The names, launch years, and RAAN values are never loaded. Cache utilisation is near 100%.
When SoA Wins
SoA is most effective when:
- Operations access a small subset of fields. The conjunction screening loop uses 3 of 6 fields. SIMD vectorization of altitudes_km - alts[j] operates on 4 doubles per instruction with AVX2 (8 with AVX-512).
- Processing is sequential over all objects. Iterating altitudes_km[0..50000] is a linear scan, so the hardware prefetcher predicts the access pattern and pre-fetches cache lines ahead of the loop.
- Field values have uniform types amenable to SIMD. A Vec<f64> can be processed with f64x4 or f64x8 SIMD vectors. An AoS loop cannot be auto-vectorized as efficiently because the fields are interleaved.
- Objects are added and removed infrequently. SoA requires synchronized insertion and removal across all vectors. Random insertion in the middle is O(n) for every field vector simultaneously.
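The synchronized-removal cost in the last point can be reduced when element order does not matter: swapping the last element into the vacated slot makes removal O(1) per column. The sketch below shows that variation on a three-column catalog; Catalog and its methods are hypothetical names, not the lesson's OrbitalCatalog.

```rust
/// A small SoA catalog: one vector per field, joined by index.
#[derive(Default)]
pub struct Catalog {
    pub norad_ids: Vec<u32>,
    pub altitudes_km: Vec<f64>,
    pub names: Vec<String>,
}

impl Catalog {
    pub fn push(&mut self, id: u32, altitude_km: f64, name: &str) {
        self.norad_ids.push(id);
        self.altitudes_km.push(altitude_km);
        self.names.push(name.to_string());
    }

    /// Remove the object at `index` from every column in O(1) by moving the
    /// last element into its place. Element order is not preserved, but the
    /// columns stay aligned with each other. Panics if `index` is out of range.
    pub fn swap_remove(&mut self, index: usize) -> (u32, f64, String) {
        (
            self.norad_ids.swap_remove(index),
            self.altitudes_km.swap_remove(index),
            self.names.swap_remove(index),
        )
    }

    /// Invariant check: every column has the same length.
    pub fn is_consistent(&self) -> bool {
        self.norad_ids.len() == self.altitudes_km.len()
            && self.norad_ids.len() == self.names.len()
    }
}
```

Routing every mutation through methods like these, rather than exposing the vectors for direct modification, is what keeps the columns from drifting out of sync.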
When AoS Wins
AoS is more appropriate when:
- Operations access all or most fields of one object at a time. Constructing a display record or serializing one object reads all fields, and SoA forces jumping across multiple vectors.
- Access is random by index. Looking up object ID 25544 in SoA requires one indexed read per field vector, each landing on a different cache line. AoS keeps all of 25544's data in one cache line, so the lookup costs one miss.
- Objects are frequently inserted, removed, or moved. AoS insertion is a single push. SoA insertion is a push across all field vectors, which means more work and more cache lines touched.
- The struct has few fields or all fields are typically accessed together. If the struct is small (≤ 32 bytes) and all fields are used in every operation, SoA provides no benefit and complicates the API.
Hybrid: AoS with Field Grouping
The practical approach is not a binary AoS vs SoA choice — it is grouping fields by access pattern:
/// Hot group: fields accessed in every pass of the screening loop.
#[derive(Debug, Clone, Copy)]
struct ObjectHot {
    altitude_km: f64,
    inclination_deg: f64,
    raan_deg: f64,
    eccentricity: f64,
}

/// Cold group: fields accessed for display, export, and audit only.
#[derive(Debug, Clone)]
struct ObjectCold {
    norad_id: u32,
    launch_year: u16,
    name: String,
    operator: String,
}

/// The catalog splits hot and cold data into separate vectors.
/// The index is the common key between them.
struct OrbitalCatalog {
    hot: Vec<ObjectHot>,   // Dense; accessed every screening pass.
    cold: Vec<ObjectCold>, // Sparse access; not in screening hot path.
}
The screening pass operates only on hot — a Vec<ObjectHot> of 50,000 × 32 bytes = 1.6 MB, fitting in L3 cache. cold is loaded only for display or export, where its access pattern (one object at a time by index) makes AoS natural.
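The working-set arithmetic above can be checked directly with size_of. This is a small sketch; `working_set_bytes` is a hypothetical helper, and the 50,000-object count is the figure used in this lesson.

```rust
use std::mem::size_of;

/// Same hot group as above: four f64 fields.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct ObjectHot {
    altitude_km: f64,
    inclination_deg: f64,
    raan_deg: f64,
    eccentricity: f64,
}

// Four 8-byte fields with 8-byte alignment leave no room for padding.
const _: () = assert!(size_of::<ObjectHot>() == 32);

/// Bytes touched by a linear scan over `count` values of type T.
fn working_set_bytes<T>(count: usize) -> usize {
    count * size_of::<T>()
}

fn main() {
    let hot = working_set_bytes::<ObjectHot>(50_000);
    let altitude_only = working_set_bytes::<f64>(50_000);
    println!("hot group scan:     {} bytes", hot);
    println!("altitude-only scan: {} bytes", altitude_only);
}
```

Comparing these numbers against the cache sizes of the target machine is a quick way to decide whether a full SoA split is worth the extra API complexity over field grouping.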
Code Examples
Parallel Altitude Screening with Rayon and SoA
With altitudes stored in a contiguous Vec<f64>, the screening loop is naturally parallelisable — each parallel chunk accesses an independent range of the altitude slice:
// Note: rayon is not available in the Playground, so this version is sequential.
// It shows the SoA layout that makes the scan easy to parallelise. In production,
// add rayon = "1" to Cargo.toml and use its parallel iterators in place of iter().

/// Simplified O(n) altitude band screening (not the full O(n²) conjunction check).
/// Finds objects in a dangerous altitude band. SoA makes this trivially parallel
/// and cache-friendly: the working set is just Vec<f64>.
fn screen_altitude_band(
    altitudes_km: &[f64],
    norad_ids: &[u32],
    min_km: f64,
    max_km: f64,
) -> Vec<u32> {
    assert_eq!(altitudes_km.len(), norad_ids.len());
    // Sequential: all altitudes fit in one contiguous slice.
    // Hardware prefetcher maximises cache utilisation.
    altitudes_km
        .iter()
        .zip(norad_ids.iter())
        .filter_map(|(&alt, &id)| {
            if alt >= min_km && alt <= max_km { Some(id) } else { None }
        })
        .collect()
}

fn main() {
    // Simulate a 10,000-object catalog.
    let altitudes_km: Vec<f64> = (0..10_000u32)
        .map(|i| 350.0 + (i as f64) * 0.05)
        .collect();
    let norad_ids: Vec<u32> = (0..10_000u32).collect();

    // Screen for objects in the 400–450 km band (high debris density).
    let alerts = screen_altitude_band(&altitudes_km, &norad_ids, 400.0, 450.0);
    println!("{} objects in 400–450km band", alerts.len());

    // Verify the working set is contiguous and predictable:
    let working_set_bytes = altitudes_km.len() * std::mem::size_of::<f64>();
    println!("working set: {} KB", working_set_bytes / 1024);
}
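To make the chunked-parallel idea concrete without an external crate, the sketch below splits the same two slices into independent ranges and screens each range on its own thread. It uses std::thread::scope as a stand-in for rayon, not rayon itself; the function name and the `workers` parameter are assumptions for illustration.

```rust
use std::thread;

/// Band screening over independent chunks, one scoped thread per chunk.
fn screen_altitude_band_parallel(
    altitudes_km: &[f64],
    norad_ids: &[u32],
    min_km: f64,
    max_km: f64,
    workers: usize,
) -> Vec<u32> {
    assert_eq!(altitudes_km.len(), norad_ids.len());
    let workers = workers.max(1);
    // Ceiling division so every element lands in a chunk; chunks() rejects 0.
    let chunk_len = ((altitudes_km.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = altitudes_km
            .chunks(chunk_len)
            .zip(norad_ids.chunks(chunk_len))
            .map(|(alts, ids)| {
                scope.spawn(move || {
                    alts.iter()
                        .zip(ids.iter())
                        .filter_map(|(&alt, &id)| {
                            if alt >= min_km && alt <= max_km { Some(id) } else { None }
                        })
                        .collect::<Vec<u32>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the alerts in catalog order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<u32>>()
    })
}

fn main() {
    let altitudes_km: Vec<f64> = (0..1_000u32).map(|i| i as f64).collect();
    let norad_ids: Vec<u32> = (0..1_000u32).collect();
    let alerts = screen_altitude_band_parallel(&altitudes_km, &norad_ids, 100.0, 199.0, 4);
    println!("{} objects in band", alerts.len());
}
```

Each worker reads a disjoint, contiguous range of both slices, so there is no shared mutable state and no locking. Spawning a thread per call is only reasonable for large catalogs; a thread pool such as rayon's amortises that cost.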
Transposing AoS to SoA Incrementally
Transitioning an existing AoS codebase to SoA does not require a full rewrite. Extract the hot fields into a companion SoA structure and index both by the same position:
// Existing AoS type: not changed, other code still uses it.
#[derive(Debug, Clone)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence: u64,
    timestamp_ms: u64,
    station_id: u8,
    payload: Vec<u8>,
}

// New SoA hot path for bulk sequence-number deduplication.
// Built from the AoS data; kept in sync on insert.
struct FrameSequenceIndex {
    satellite_ids: Vec<u32>,
    sequences: Vec<u64>,
}

impl FrameSequenceIndex {
    fn from_frames(frames: &[TelemetryFrame]) -> Self {
        Self {
            satellite_ids: frames.iter().map(|f| f.satellite_id).collect(),
            sequences: frames.iter().map(|f| f.sequence).collect(),
        }
    }

    /// Find all duplicate (satellite_id, sequence) pairs with one O(n) scan,
    /// cache-friendly because both vecs are small and contiguous.
    /// Returns the index of every frame whose pair was already seen.
    fn find_duplicates(&self) -> Vec<usize> {
        let mut seen = std::collections::HashSet::new();
        self.satellite_ids
            .iter()
            .zip(self.sequences.iter())
            .enumerate()
            .filter_map(|(i, (&sat, &seq))| {
                if !seen.insert((sat, seq)) { Some(i) } else { None }
            })
            .collect()
    }
}

fn main() {
    let frames: Vec<TelemetryFrame> = (0..5u32)
        .map(|i| TelemetryFrame {
            satellite_id: i % 3,
            sequence: (i / 3) as u64,
            timestamp_ms: 1_700_000_000 + i as u64,
            station_id: 1,
            payload: vec![i as u8; 128],
        })
        .collect();

    let index = FrameSequenceIndex::from_frames(&frames);
    let dups = index.find_duplicates();
    // No (satellite_id, sequence) pair repeats in this sample, so this reports 0.
    println!("{} duplicate frames found", dups.len());
}
The FrameSequenceIndex co-exists with the original Vec<TelemetryFrame>. Hot operations use the index; display and forwarding use the original frames. The transition is incremental — no global refactor required.
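Keeping the index in sync on insert is easiest when a single owner performs both writes. A minimal sketch, assuming a hypothetical `FrameStore` wrapper (not an existing Meridian type) and a trimmed-down frame:

```rust
#[derive(Debug, Clone)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence: u64,
    payload: Vec<u8>,
}

#[derive(Default)]
struct FrameSequenceIndex {
    satellite_ids: Vec<u32>,
    sequences: Vec<u64>,
}

/// Hypothetical owner of both representations. All inserts go through
/// push(), so frames[i] and the index entries at i always describe the
/// same frame.
#[derive(Default)]
struct FrameStore {
    frames: Vec<TelemetryFrame>,
    index: FrameSequenceIndex,
}

impl FrameStore {
    /// Append a frame to the AoS vector and mirror its hot fields into the
    /// SoA index. Returns the shared position of the new frame.
    fn push(&mut self, frame: TelemetryFrame) -> usize {
        self.index.satellite_ids.push(frame.satellite_id);
        self.index.sequences.push(frame.sequence);
        self.frames.push(frame);
        self.frames.len() - 1
    }
}

fn main() {
    let mut store = FrameStore::default();
    let pos = store.push(TelemetryFrame { satellite_id: 7, sequence: 42, payload: vec![0; 16] });
    println!(
        "frame {} indexed as ({}, {}), payload {} bytes",
        pos,
        store.index.satellite_ids[pos],
        store.index.sequences[pos],
        store.frames[pos].payload.len()
    );
}
```

Because the fields are private to the wrapper's module, no caller can push to one vector without the other, which is the invariant the companion index depends on.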
Key Takeaways
- AoS is natural when operations access all or most fields of one entity at a time, or when random access by index is common. SoA is natural when operations process all entities but only a few fields, which is the common case in simulation and batch processing.
- SoA improves cache utilisation for field-subset operations because the working set is smaller: processing altitudes_km[0..n] loads only altitude data, not names, metadata, or other cold fields.
- Sequential access of a contiguous Vec<f64> is maximally cache-friendly and SIMD-friendly. The hardware prefetcher predicts linear access patterns; SIMD intrinsics or auto-vectorisation require uniformly-typed contiguous data.
- The practical pattern is field grouping: split hot fields (accessed in every loop) from cold fields (accessed occasionally), and store them in separate vecs. This is a hybrid AoS/SoA approach.
- Transitioning from AoS to SoA does not require a full rewrite. Extract hot fields into a companion SoA index, keep both in sync on insert, and route hot-path operations through the index.
Lesson 3 — Arena Allocation: Bump Allocators for High-Throughput Telemetry Processing
Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 3 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 9 (GlobalAlloc). SoA and arena patterns synthesized from training knowledge.
Source note: The GlobalAlloc trait and its safety requirements are covered in Rust for Rustaceans, Ch. 9. The bump allocator pattern and its application to telemetry pipelines are synthesized from training knowledge. Recommended further reading: the bumpalo crate documentation and Alexis Beingessner's "The Allocator API, Bump Allocation, and You" (Gankra.github.io).
Context
The Meridian telemetry processor allocates a Vec<u8> for every telemetry frame payload. At 100,000 frames per second, that is 100,000 malloc/free pairs per second hitting the global allocator. The global allocator (the system allocator by default, or jemalloc if the binary opts in) is designed for general-purpose allocation: it handles arbitrary sizes, arbitrary lifetimes, and thread-safe concurrent access. This generality has a cost: each allocation may take an internal lock or perform atomic operations, has to find a free block of an appropriate size, and updates allocator metadata.
For short-lived objects that all die together (a batch of frames processed in one scheduling quantum, all freed at the end), a bump allocator eliminates all of that overhead. A bump allocator maintains a pointer into a pre-allocated slab of memory. Each allocation is one pointer addition. Deallocation is a no-op: the entire slab is reclaimed at once when the allocation epoch ends. For the right workload this can be many times faster than the global allocator; the benchmark later in this lesson estimates 5–20× for small, short-lived allocations.
This lesson covers how bump allocators work, when they are appropriate, and how to implement and use them in Rust for high-throughput frame processing.
Core Concepts
The Global Allocator and Its Overhead
Every heap allocation in Rust, whether it comes from Box::new(x), a Vec that grows, or a String that grows, goes through the global allocator via the GlobalAlloc trait (Rust for Rustaceans, Ch. 9). Vec::new() and String::new() do not allocate by themselves; the first allocation happens when an element is pushed or capacity is reserved.
// The GlobalAlloc trait (simplified from std):
pub unsafe trait GlobalAlloc {
    unsafe fn alloc(&self, layout: std::alloc::Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: std::alloc::Layout);
}
The default allocator (the platform's system allocator since Rust 1.32; jemalloc is opt-in through a crate such as tikv-jemallocator) handles:
- Thread safety (internal locks or lock-free data structures)
- Size classes and free lists for different allocation sizes
- Fragmentation management
- Returning memory to the OS when freed
For long-lived, variously-sized allocations with arbitrary lifetimes, this is correct and often fast. For thousands of small, short-lived allocations that all have the same lifetime, it is expensive overkill.
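To see the trait in use, the sketch below wraps the system allocator and counts calls to alloc. `CountingAlloc`, `ALLOCS`, and `allocs()` are illustrative names; the wrapper only forwards to std::alloc::System and adds a relaxed atomic counter.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative wrapper: forwards every request to the system allocator
/// and counts how many times alloc() is called.
struct CountingAlloc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        // Safety: the caller upholds GlobalAlloc's contract for `layout`.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Safety: `ptr` was returned by System.alloc with the same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of alloc() calls observed so far, process-wide.
fn allocs() -> usize {
    ALLOCS.load(Ordering::Relaxed)
}

fn main() {
    let before = allocs();
    let empty: Vec<u8> = black_box(Vec::new()); // no capacity requested yet
    let after_new = allocs();
    let boxed = black_box(Box::new(42u64)); // one heap allocation
    let after_box = allocs();

    println!("Vec::new():  {} allocation(s)", after_new - before);
    println!("Box::new(): {} allocation(s)", after_box - after_new);
    drop((empty, boxed));
}
```

The counter is process-wide, so in a multi-threaded program the differences include allocations made by other threads in the same window. It is a measurement aid, not a per-call-site profiler.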
Bump Allocation: The Pattern
A bump allocator owns a contiguous slab of memory. Each allocation is a pointer increment:
[slab start] → [offset] → [slab end]
↑
pointer bumps forward on each allocation
Freeing individual allocations is not supported. The entire slab is reset in one operation when all allocations are no longer needed — the "epoch" ends and the offset pointer resets to zero.
Properties:
- Allocation: O(1), typically one integer addition and a bounds check.
- Deallocation: O(1) total for all allocations from one epoch — reset the offset.
- Thread safety: A single-threaded bump allocator has no synchronisation overhead. A thread-local bump allocator gives each thread its own slab with no contention.
- Fragmentation: None within the epoch. Memory is never reused for a different allocation during the epoch — no fragmentation.
- Limitation: Cannot free individual allocations. All allocations from one bump allocator share the same lifetime.
Using bumpalo for Safe Bump Allocation
The bumpalo crate provides a production-quality bump allocator with a safe Rust interface. The manual arena below illustrates the underlying pattern:
// bumpalo is not available in the Playground, so the pattern is illustrated
// here with a manual arena. To use the real crate, add to Cargo.toml:
//   bumpalo = { version = "3", features = ["collections"] }
// use bumpalo::Bump;
// use bumpalo::collections::Vec as BumpVec;

struct BumpArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    /// Allocate `size` bytes at an offset aligned to `align` (a power of two).
    /// Returns None if the slab is exhausted.
    /// Alignment is relative to the start of the slab. The slab is a Vec<u8>,
    /// so its base address is only guaranteed to be 1-byte aligned.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        debug_assert!(align.is_power_of_two());
        // Align the current offset up.
        let aligned = (self.offset + align - 1) & !(align - 1);
        let end = aligned.checked_add(size)?;
        if end > self.slab.len() {
            return None; // Out of space.
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    /// Reset the arena: all previous allocations are invalidated.
    fn reset(&mut self) {
        self.offset = 0;
    }

    fn used(&self) -> usize {
        self.offset
    }

    fn capacity(&self) -> usize {
        self.slab.len()
    }
}

fn main() {
    let mut arena = BumpArena::new(4096);

    // Allocate space for 10 u64 values.
    let buf = arena.alloc(10 * 8, 8).expect("arena exhausted");
    println!("allocated {} bytes, used {}/{}", buf.len(), arena.used(), arena.capacity());

    // Reset: all allocations invalidated, slab reused.
    arena.reset();
    println!("after reset: used {}", arena.used());
}
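The align-up expression in alloc is the part most often written incorrectly, so it is worth isolating. A small worked sketch, where `align_up` is a helper name introduced here for illustration:

```rust
/// Round `offset` up to the next multiple of `align`.
/// `align` must be a power of two, because the mask trick relies on
/// `align - 1` having all low bits set.
fn align_up(offset: usize, align: usize) -> usize {
    assert!(align.is_power_of_two(), "align must be a power of two");
    (offset + align - 1) & !(align - 1)
}

fn main() {
    for (offset, align) in [(0, 8), (1, 8), (8, 8), (13, 4), (13, 16)] {
        println!("align_up({offset}, {align}) = {}", align_up(offset, align));
    }
}
```

Adding align - 1 pushes any unaligned offset past the next boundary, and the mask then clears the low bits to land exactly on it. An offset that is already aligned is left unchanged.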
Thread-Local Arenas for Concurrent Processing
For a multi-threaded pipeline where each worker thread processes its own batch of frames, a thread-local arena eliminates all lock contention:
use std::cell::RefCell;

const ARENA_CAPACITY: usize = 16 * 1024 * 1024; // 16 MB per thread

thread_local! {
    // Each worker thread has its own private arena.
    // No synchronisation: no atomic operations, no locks.
    static FRAME_ARENA: RefCell<Vec<u8>> = RefCell::new(vec![0u8; ARENA_CAPACITY]);
    static ARENA_OFFSET: RefCell<usize> = const { RefCell::new(0) };
}

/// Returns a raw pointer to `size` bytes inside this thread's arena.
/// The pointer is only meaningful until the next reset_thread_arena() call
/// on this thread; the caller must not access more than `size` bytes.
fn alloc_frame_buffer(size: usize) -> *mut u8 {
    FRAME_ARENA.with(|arena| {
        ARENA_OFFSET.with(|offset| {
            let mut off = offset.borrow_mut();
            let aligned = (*off + 7) & !7; // 8-byte alignment of the offset
            let end = aligned + size;
            let mut arena = arena.borrow_mut();
            if end > arena.len() {
                panic!("thread-local arena exhausted: increase ARENA_CAPACITY or reduce batch size");
            }
            *off = end;
            // Point at this allocation's slot, not at the start of the slab.
            arena[aligned..end].as_mut_ptr()
        })
    })
}

fn reset_thread_arena() {
    ARENA_OFFSET.with(|offset| *offset.borrow_mut() = 0);
}
In practice, use bumpalo::Bump in a thread_local! instead of building the unsafe version above. bumpalo handles alignment, growth, and lifetime correctly with a safe interface.
Epoch-Based Processing: The Right Workload
The bump allocator pattern maps naturally onto batch processing where all objects in a batch share the same lifetime:
use std::time::Instant;

/// Simulates a frame batch processor using a bump-style pre-allocated pool.
/// Each frame's payload is drawn from the batch buffer.
/// When the batch is complete, the buffer is reset: no individual frees.
struct FrameBatchProcessor {
    /// Pre-allocated buffer for all frame payloads in one batch.
    payload_pool: Vec<u8>,
    pool_offset: usize,
    batch_size: usize,
    frames_in_batch: usize,
}

impl FrameBatchProcessor {
    fn new(batch_size: usize, max_payload_per_frame: usize) -> Self {
        Self {
            payload_pool: vec![0u8; batch_size * max_payload_per_frame],
            pool_offset: 0,
            batch_size,
            frames_in_batch: 0,
        }
    }

    /// Claim space for a frame payload from the pool.
    /// Returns a mutable slice for the caller to fill.
    fn claim_payload_slot(&mut self, size: usize) -> Option<&mut [u8]> {
        if self.frames_in_batch >= self.batch_size {
            return None; // Batch full.
        }
        let end = self.pool_offset + size;
        if end > self.payload_pool.len() {
            return None; // Pool exhausted.
        }
        let slot = &mut self.payload_pool[self.pool_offset..end];
        self.pool_offset = end;
        self.frames_in_batch += 1;
        Some(slot)
    }

    /// Process the current batch and reset for the next one.
    /// All payload slots are implicitly freed: no individual deallocation.
    fn flush_and_reset(&mut self) -> usize {
        let processed = self.frames_in_batch;
        self.pool_offset = 0;
        self.frames_in_batch = 0;
        processed
    }
}

fn main() {
    let mut processor = FrameBatchProcessor::new(1000, 1024);
    let start = Instant::now();

    // Simulate processing 100,000 frames in batches of 1,000.
    let mut total = 0;
    for _batch in 0..100 {
        for _frame in 0..1000 {
            // Claim a 256-byte payload slot: no malloc.
            if let Some(slot) = processor.claim_payload_slot(256) {
                slot[0] = 0xAA; // Simulate writing frame data.
            }
        }
        total += processor.flush_and_reset();
    }

    let elapsed = start.elapsed();
    println!("processed {total} frames in {:?}", elapsed);
    println!("~{:.0} frames/sec", total as f64 / elapsed.as_secs_f64());
}
When Not to Use Bump Allocation
Bump allocators are not appropriate when:
- Lifetimes are mixed. If some objects from a batch need to outlive the batch (e.g., forwarding a specific frame to a slow downstream consumer while releasing the rest), a bump allocator cannot express this. The solution is to copy out the long-lived objects to global-allocator memory before resetting.
- Individual deallocation is required. A bump allocator cannot free one allocation while keeping others alive. Use a pool allocator (fixed-size slots with a free list) if individual deallocation of same-sized objects is needed.
- Batches are unpredictably sized. If you cannot bound the total allocation size of a batch, the arena may exhaust. Size the arena conservatively, or use bumpalo, which supports growth by chaining multiple slabs.
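The copy-out rule from the first point can be sketched as follows. The arena here hands out byte ranges instead of slices to keep the example short; `RangeArena` and `copy_out` are hypothetical names, not Meridian APIs.

```rust
use std::ops::Range;

/// Minimal bump arena that hands out byte ranges into its slab.
struct RangeArena {
    slab: Vec<u8>,
    offset: usize,
}

impl RangeArena {
    fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    /// Returns the range of the new slot, or None if the slab is full.
    fn alloc(&mut self, size: usize) -> Option<Range<usize>> {
        let end = self.offset.checked_add(size)?;
        if end > self.slab.len() {
            return None;
        }
        let range = self.offset..end;
        self.offset = end;
        Some(range)
    }

    fn reset(&mut self) {
        self.offset = 0;
    }
}

/// Copy one slot to global-allocator memory so it can outlive the batch.
fn copy_out(arena: &RangeArena, slot: Range<usize>) -> Vec<u8> {
    arena.slab[slot].to_vec()
}

fn main() {
    let mut arena = RangeArena::new(64);
    let fast = arena.alloc(4).expect("arena exhausted");
    arena.slab[fast].copy_from_slice(&[1, 2, 3, 4]);
    let slow = arena.alloc(4).expect("arena exhausted");
    arena.slab[slow.clone()].copy_from_slice(&[5, 6, 7, 8]);

    // The `slow` frame goes to a slow consumer: copy it out before the reset.
    let retained = copy_out(&arena, slow);
    arena.reset();

    // The next batch overwrites the slab; the retained copy is unaffected.
    let next = arena.alloc(8).expect("arena exhausted");
    arena.slab[next].fill(0xFF);
    println!("retained frame: {:?}", retained);
}
```

The copy costs one global allocation per retained frame, which is acceptable when retained frames are the exception. If most frames outlive the batch, the workload is not a good fit for a bump arena.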
Code Examples
Comparing Global vs Arena Allocation for Frame Batches
This benchmark illustrates the overhead difference. The figure that follows is an estimate, not a measurement: for small short-lived allocations, expect a speedup of roughly 5–20× in favour of the arena, and confirm it on your own hardware.
use std::hint::black_box;
use std::time::{Duration, Instant};

const FRAMES: usize = 100_000;
const PAYLOAD_SIZE: usize = 256;

fn bench_global_alloc() -> Duration {
    let start = Instant::now();
    for _ in 0..FRAMES {
        // Vec::with_capacity performs one malloc per frame; the pushes then
        // fill the buffer without reallocating.
        let mut v = Vec::with_capacity(PAYLOAD_SIZE);
        for i in 0..PAYLOAD_SIZE {
            v.push(i as u8);
        }
        // Keep the optimiser from eliding the allocation and the writes.
        black_box(&v);
        // v is dropped at the end of each iteration: free() runs 100,000 times.
    }
    start.elapsed()
}

fn bench_arena_alloc() -> Duration {
    // Pre-allocate a slab for the entire batch.
    let mut slab = vec![0u8; FRAMES * PAYLOAD_SIZE];
    let start = Instant::now();
    let mut offset = 0;
    for frame_idx in 0..FRAMES {
        let start_byte = offset;
        let end_byte = offset + PAYLOAD_SIZE;
        for (i, byte) in slab[start_byte..end_byte].iter_mut().enumerate() {
            *byte = (frame_idx ^ i) as u8;
        }
        offset = end_byte;
    }
    // Keep the optimiser from discarding the writes.
    black_box(&slab);
    // All frames are "freed" by resetting offset to 0: one operation.
    offset = 0;
    black_box(offset);
    start.elapsed()
}

fn main() {
    // Warm up to avoid measurement noise from cold caches.
    let _ = bench_global_alloc();
    let _ = bench_arena_alloc();

    let global_time = bench_global_alloc();
    let arena_time = bench_arena_alloc();
    println!("global alloc: {:?}", global_time);
    println!("arena alloc: {:?}", arena_time);

    let speedup = global_time.as_nanos() as f64 / arena_time.as_nanos() as f64;
    println!("arena speedup: {speedup:.1}×");
}
The arena's advantage grows with allocation count. At 100,000 256-byte frames, the arena avoids 100,000 malloc/free round-trips. The global allocator also has to find and merge free blocks over time as the heap fragments; the arena has zero fragmentation overhead.
Key Takeaways
- The global allocator (malloc/free) is general-purpose: thread-safe, handles arbitrary sizes and lifetimes, manages fragmentation. Its generality has overhead: internal synchronisation, free list management, metadata updates.
- A bump allocator eliminates this overhead for objects with a shared lifetime. Allocation is one integer addition. Deallocation is resetting one offset, which frees all objects from one epoch simultaneously.
- The lifetime constraint is the critical requirement. If any object from a bump-allocated batch must outlive the batch, copy it out to the global allocator before resetting. Do not try to mix lifetimes within one arena.
- Thread-local arenas eliminate all cross-thread contention. Each worker thread gets its own slab; no lock, no atomic operation, no cache line bounce for allocation.
- Use bumpalo in production. It handles alignment, growth via chained slabs, and safe lifetimes. Implement your own bump allocator only for educational purposes or in no_std environments where crate dependencies are restricted.
- Profile before optimising. The global allocator is fast for typical workloads. Bump allocation is a targeted optimisation for high-frequency, same-lifetime allocation patterns, not a universal replacement for Vec or Box.
Project — High-Throughput Telemetry Packet Processor
Module: Foundation — M05: Data-Oriented Design in Rust
Prerequisite: All three module quizzes passed (≥70%)
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0055 — Telemetry Packet Processor Performance Target
The current Rust-language telemetry processor runs at 62,000 frames per second under sustained load. The conjunction avoidance pipeline requires 100,000 frames per second to maintain sub-10-second delivery windows during peak orbital density events. The gap is 38%. Profiling shows two root causes:
- Allocator pressure. The processor allocates a Vec<u8> per frame payload on the global heap. At 100k fps, this is 100k malloc/free round-trips per second, accounting for 18% of CPU time.
- Cache waste. The TelemetryFrame struct packs hot routing fields with cold payload data. Sequential scan of 100k frames for deduplication loads 2.4× more data than the deduplication logic uses.
Your task is to rebuild the processor core using the three techniques from this module: cache-optimal struct layout, SoA separation of hot and cold data, and arena allocation for frame payloads.
System Specification
Frame Structure
/// Hot fields: accessed in every pass (routing, deduplication, sorting).
/// Must fit in ≤ 32 bytes and be ordered by descending alignment.
#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub timestamp_ms: u64,
    pub sequence: u64,
    pub satellite_id: u32,
    pub byte_count: u16,
    pub station_id: u8,
    pub flags: u8,
}

/// Cold data: accessed only when forwarding to downstream consumers.
/// Held as a reference into the batch arena; lifetime is one processing epoch.
pub struct FramePayload<'arena> {
    pub data: &'arena [u8],
}
Processing Pipeline
The processor receives frames in batches of up to 10,000. For each batch:
- Claim payload space from the batch arena for each frame.
- Validate each frame's CRC (simulated: check that flags & 0x80 == 0).
- Deduplicate by (satellite_id, sequence): discard duplicates using a SoA scan over hot headers.
- Sort the batch by timestamp_ms ascending, sorting only the header array, not the payloads.
- Forward unique sorted frames to a tokio::sync::mpsc::Sender<ForwardedFrame>.
- Reset the arena, which frees all payload allocations simultaneously.
Performance Target
- Process 100,000 frames per second sustained across a benchmark of 10,000 batches × 1,000 frames.
- Arena allocation must be used for frame payloads: no Vec<u8> per payload.
- Hot field access (deduplication and sort) must operate on the header array, not the full frame struct.
- Struct size assertions must compile: size_of::<FrameHeader>() == 24.
Output
A binary crate that:
- Generates synthetic frame batches
- Runs the full pipeline (validate → deduplicate → sort → forward) for 10,000 batches
- Reports frames per second, percentage of duplicates discarded, and arena reset count
- Confirms no per-frame heap allocations occur in the hot path (verified by measuring allocator calls)
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | size_of::<FrameHeader>() == 24 — const assertion in source | Yes — compile-time |
| 2 | Frame payloads allocated from batch arena, not global heap | Yes — code review: no Vec::new() or Box::new() in hot path |
| 3 | Deduplication operates on &[FrameHeader] — no full struct access | Yes — code review |
| 4 | Sort operates on the header array by index — payloads not moved | Yes — code review |
| 5 | Arena resets after each batch — used bytes reset to 0 | Yes — assert in batch loop |
| 6 | Benchmark reports ≥ 100,000 frames/sec on a modern laptop | Yes — timing output |
| 7 | Duplicate detection uses a HashSet<(u32, u64)> on header fields only | Yes — code review |
Hints
Hint 1 — FrameHeader size assertion
const _: () = assert!(
    std::mem::size_of::<FrameHeader>() == 24,
    "FrameHeader must be 24 bytes: check field order and padding"
);
Field order for 24 bytes with no padding:
timestamp_ms: u64 (8) + sequence: u64 (8) + satellite_id: u32 (4) + byte_count: u16 (2) + station_id: u8 (1) + flags: u8 (1) = 24 bytes, alignment = 8, no padding.
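The effect of field order is easiest to see under #[repr(C)], where the declared order is the layout order. With the default Rust representation the compiler is free to reorder fields, so the sketch below uses #[repr(C)] on two hypothetical structs holding the same six fields in different orders.

```rust
use std::mem::{align_of, size_of};

/// Same six fields, declared in an order that forces padding.
#[repr(C)]
#[allow(dead_code)]
struct HeaderBadOrder {
    flags: u8,         // offset 0, then 7 bytes of padding
    timestamp_ms: u64, // offset 8
    station_id: u8,    // offset 16, then 7 bytes of padding
    sequence: u64,     // offset 24
    byte_count: u16,   // offset 32, then 2 bytes of padding
    satellite_id: u32, // offset 36, struct ends at 40
}

/// Fields in descending alignment order: no padding is needed.
#[repr(C)]
#[allow(dead_code)]
struct HeaderGoodOrder {
    timestamp_ms: u64, // offset 0
    sequence: u64,     // offset 8
    satellite_id: u32, // offset 16
    byte_count: u16,   // offset 20
    station_id: u8,    // offset 22
    flags: u8,         // offset 23, struct ends at 24
}

fn main() {
    println!("bad order:  {} bytes", size_of::<HeaderBadOrder>());
    println!("good order: {} bytes", size_of::<HeaderGoodOrder>());
    println!("alignment:  {} bytes", align_of::<HeaderGoodOrder>());
}
```

For 1,000 headers per batch, the difference between the two layouts is 16 KB of pure padding in the working set.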
Hint 2 — Batch arena design
Pre-allocate the slab once per processor lifetime. Reset between batches:
pub struct BatchArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BatchArena {
    pub fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    /// Allocate `size` bytes; returns a mutable slice into the slab.
    pub fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (self.offset + 7) & !7; // 8-byte alignment
        let end = aligned + size;
        if end > self.slab.len() {
            return None;
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    /// Reset: all previous allocations implicitly freed.
    pub fn reset(&mut self) {
        self.offset = 0;
    }

    pub fn used(&self) -> usize {
        self.offset
    }
}
Size the arena for worst-case batch: max_batch_size * max_payload_size.
Hint 3 — SoA deduplication
Maintain a Vec<FrameHeader> (hot, dense) separate from payloads:
use std::collections::HashSet;

fn deduplicate(headers: &[FrameHeader]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, h)| {
            if seen.insert((h.satellite_id, h.sequence)) {
                Some(i) // Unique: keep this index.
            } else {
                None // Duplicate: discard.
            }
        })
        .collect()
}
The deduplication loop touches only satellite_id and sequence from FrameHeader — 12 bytes of the 24-byte struct. With 1,000 headers per batch at 24 bytes each, the working set is 24 KB — fits in L1 cache.
Hint 4 — Sort headers by timestamp without moving payloads
Sort an index array by headers[i].timestamp_ms, not the headers themselves. This avoids any payload movement:
fn sort_by_timestamp(indices: &mut Vec<usize>, headers: &[FrameHeader]) {
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
}
Iterating indices in sorted order gives frames in timestamp order without copying or moving any data.
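A sketch of that iteration step, assuming payloads are tracked as (start, len) pairs into the arena slab in the same order as the headers. The trimmed `FrameHeader` and the function name are illustrative, not the project's required API.

```rust
/// Trimmed header with only the fields this sketch needs.
#[derive(Debug, Clone, Copy)]
struct FrameHeader {
    timestamp_ms: u64,
    satellite_id: u32,
}

/// Yield (satellite_id, payload) pairs in timestamp order.
/// Only the index vector is sorted; headers and payload bytes stay in place.
fn frames_in_timestamp_order<'a>(
    headers: &[FrameHeader],
    payload_ranges: &[(usize, usize)], // (start, len) into `slab`, parallel to `headers`
    slab: &'a [u8],
) -> Vec<(u32, &'a [u8])> {
    assert_eq!(headers.len(), payload_ranges.len());
    let mut indices: Vec<usize> = (0..headers.len()).collect();
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
    indices
        .into_iter()
        .map(|i| {
            let (start, len) = payload_ranges[i];
            (headers[i].satellite_id, &slab[start..start + len])
        })
        .collect()
}

fn main() {
    let headers = [
        FrameHeader { timestamp_ms: 30, satellite_id: 1 },
        FrameHeader { timestamp_ms: 10, satellite_id: 2 },
        FrameHeader { timestamp_ms: 20, satellite_id: 3 },
    ];
    let slab = [0xA0, 0xA1, 0xB0, 0xB1, 0xC0, 0xC1];
    let ranges = [(0, 2), (2, 2), (4, 2)];
    for (sat, payload) in frames_in_timestamp_order(&headers, &ranges, &slab) {
        println!("satellite {sat}: {payload:?}");
    }
}
```

The returned slices borrow from the slab, so the borrow checker prevents the arena from being reset while forwarded frames are still in use.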
Reference Implementation
use std::collections::HashSet;
use std::time::Instant;

// --- FrameHeader ---
#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub timestamp_ms: u64,
    pub sequence: u64,
    pub satellite_id: u32,
    pub byte_count: u16,
    pub station_id: u8,
    pub flags: u8,
}

const _SIZE: () = assert!(std::mem::size_of::<FrameHeader>() == 24);
const _ALIGN: () = assert!(std::mem::align_of::<FrameHeader>() == 8);

// --- BatchArena ---
pub struct BatchArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BatchArena {
    pub fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    pub fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (self.offset + 7) & !7;
        let end = aligned + size;
        if end > self.slab.len() {
            return None;
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    pub fn reset(&mut self) {
        self.offset = 0;
    }

    pub fn used(&self) -> usize {
        self.offset
    }
}

// --- Pipeline ---
fn validate(header: &FrameHeader) -> bool {
    header.flags & 0x80 == 0 // Simulated CRC: high bit = error flag.
}

fn deduplicate_indices(headers: &[FrameHeader]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, h)| {
            if seen.insert((h.satellite_id, h.sequence)) { Some(i) } else { None }
        })
        .collect()
}

fn sort_indices_by_timestamp(indices: &mut Vec<usize>, headers: &[FrameHeader]) {
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
}

fn process_batch(
    arena: &mut BatchArena,
    // (ts, seq, sat, bytes, stn, flags, raw_payload)
    batch: &[(u64, u64, u32, u16, u8, u8, Vec<u8>)],
) -> (usize, usize) {
    // 1. Fill header array and claim arena slots for payloads.
    let mut headers: Vec<FrameHeader> = Vec::with_capacity(batch.len());
    let mut payload_offsets: Vec<(usize, usize)> = Vec::with_capacity(batch.len()); // (start, len)

    for (ts, seq, sat, bytes, stn, flags, payload) in batch {
        let header = FrameHeader {
            timestamp_ms: *ts,
            sequence: *seq,
            satellite_id: *sat,
            byte_count: *bytes,
            station_id: *stn,
            flags: *flags,
        };
        // Validate before claiming arena space.
        if !validate(&header) {
            continue;
        }
        let slot = match arena.alloc(payload.len()) {
            Some(s) => s,
            None => break, // Arena full: drop remaining frames.
        };
        slot.copy_from_slice(payload);
        let start = arena.used() - payload.len();
        payload_offsets.push((start, payload.len()));
        headers.push(header);
    }

    // 2. Deduplicate on the hot header array (the SoA benefit).
    let mut unique_indices = deduplicate_indices(&headers);
    let duplicates = headers.len() - unique_indices.len();

    // 3. Sort by timestamp: header indices only, no payload movement.
    sort_indices_by_timestamp(&mut unique_indices, &headers);

    // Forwarding is simulated here by counting the unique frames.
    let forwarded = unique_indices.len();

    // 4. Reset arena: all payload slots freed in O(1).
    arena.reset();

    (forwarded, duplicates)
}

fn main() {
    const BATCH_SIZE: usize = 1_000;
    const NUM_BATCHES: usize = 10_000;
    const MAX_PAYLOAD: usize = 256;

    let mut arena = BatchArena::new(BATCH_SIZE * MAX_PAYLOAD + 64);

    // Generate synthetic batch data.
    let batch: Vec<(u64, u64, u32, u16, u8, u8, Vec<u8>)> = (0..BATCH_SIZE)
        .map(|i| {
            // Every third frame repeats the (satellite_id, sequence) key of the
            // frame before it, which gives ~33% duplicates.
            let key = if i % 3 == 2 { i - 1 } else { i };
            (
                1_700_000_000 + i as u64,
                key as u64,
                (key % 48) as u32,
                MAX_PAYLOAD as u16,
                (i % 12) as u8,
                0u8,
                vec![(i & 0xFF) as u8; MAX_PAYLOAD],
            )
        })
        .collect();

    let start = Instant::now();
    let mut total_forwarded = 0usize;
    let mut total_duplicates = 0usize;
    let mut resets = 0usize;

    for _ in 0..NUM_BATCHES {
        let (fwd, dups) = process_batch(&mut arena, &batch);
        total_forwarded += fwd;
        total_duplicates += dups;
        resets += 1;
        assert_eq!(arena.used(), 0, "arena must be reset after each batch");
    }

    let elapsed = start.elapsed();
    let total_frames = BATCH_SIZE * NUM_BATCHES;
    let fps = total_frames as f64 / elapsed.as_secs_f64();

    println!("--- Telemetry Processor Benchmark ---");
    println!("frames: {}", total_frames);
    println!("forwarded: {}", total_forwarded);
    println!("duplicates: {} ({:.1}%)", total_duplicates, 100.0 * total_duplicates as f64 / total_frames as f64);
    println!("resets: {}", resets);
    println!("elapsed: {:.2?}", elapsed);
    println!("throughput: {:.0} frames/sec", fps);
    println!("FrameHeader size: {} bytes", std::mem::size_of::<FrameHeader>());
}
Reflection
The three optimisations in this project compose:
- Struct layout ensures the FrameHeader array is compact (24 bytes/entry, no wasted padding). At 24 KB for 1,000 headers, it fits in L1 cache and is fully available during the deduplication and sort passes.
- SoA separation means deduplication and sorting never touch payload data: the payload arena is not in the working set during hot-path operations.
- Arena allocation eliminates 100,000 per-second malloc/free round-trips. All payloads for one batch are freed in a single pointer reset.
Each optimisation is independently valuable. Together, they target the three most common sources of throughput bottlenecks in high-frequency data pipelines: allocator pressure, memory bandwidth waste, and cache thrashing.
The benchmarking mindset from Module 6 (Performance and Profiling) will give you the tools to measure these improvements precisely — comparing before and after with criterion, identifying the limiting factor with perf, and validating that the improvements hold under realistic workload conditions.
Module 06 — Performance & Profiling
Track: Foundation — Mission Control Platform
Position: Module 6 of 6 (Foundation track complete)
Source material: Rust for Rustaceans — Jon Gjengset, Chapter 6; criterion, cargo-flamegraph, perf, dhat documentation
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Meridian Control Plane Performance Audit
- Prerequisites
- Foundation Track Complete
Mission Context
The Module 5 telemetry processor achieves 100,000 frames per second in isolation. The integrated control plane pipeline runs at 71,000. The 29% gap is not in the algorithm — it is in measurement blind spots: unmeasured allocations, unverified assumptions about what the compiler optimises away, and code paths that look fast but are not.
Performance engineering without measurement is optimism. This module provides the measurement toolkit: criterion for reliable benchmarks, flamegraph and perf for identifying hot paths, and allocation counting for detecting hidden heap overhead. The project combines all three into a structured audit that turns a performance gap into a documented, measured, verified improvement.
What You Will Learn
By the end of this module you will be able to:
- Identify the three failure modes of naive Instant::now() benchmarks: dead-code elimination, constant folding, and I/O overhead masking the function under test
- Apply std::hint::black_box correctly to both inputs and outputs to prevent compiler optimisations from invalidating benchmark results
- Write criterion benchmarks with proper setup/measurement separation, interpret confidence intervals and p-values, and run parameterised benchmarks across input sizes
- Configure the release profile with debug symbols for profiling, generate flamegraphs with cargo flamegraph, and identify hot paths from flamegraph visual patterns
- Read perf stat output to diagnose whether a workload is compute-bound, memory-bound, or branch-prediction-bound before generating a flamegraph
- Use a #[global_allocator] counting wrapper to count allocations in a specific code path, embed zero-allocation assertions in CI, and eliminate common hidden allocation sources (HashMap::new(), Vec::collect(), format!())
Lessons
Lesson 1 — Benchmarking with criterion: Writing Reliable Microbenchmarks
Covers the three failure modes of naive timing loops, std::hint::black_box placement for both input and output, criterion API and confidence interval interpretation, setup/measurement separation, benchmarking at realistic input sizes, and reading the statistical significance output.
Key question this lesson answers: How do you know your benchmark is measuring what you think it is, and how do you distinguish a real performance change from measurement noise?
→ lesson-01-benchmarking.md / lesson-01-quiz.toml
Lesson 2 — CPU Profiling with flamegraph and perf: Finding Hot Paths
Covers the sampling profiler model, configuring release builds with debug symbols for profiling, perf stat hardware counter diagnosis (IPC, cache miss rate, branch miss rate), cargo flamegraph workflow, reading flamegraph visual patterns (wide flat bars, deep towers, distributed overhead), and #[inline(never)] for profiling visibility.
Key question this lesson answers: Which function is consuming the most CPU time, and how do you distinguish a compute-bound bottleneck from a memory-bound one?
→ lesson-02-flamegraph.md / lesson-02-quiz.toml
Lesson 3 — Memory Profiling: Heap Allocation Tracking and Reducing Allocator Pressure
Covers the allocation cost model, #[global_allocator] counting wrappers for exact per-path allocation counts, HashMap::with_capacity and Vec::with_capacity pre-allocation, clear() for buffer reuse across batches, dhat for call-site-attributed heap profiling, and CI-embedded zero-allocation assertions.
Key question this lesson answers: How many allocations happen in the hot path, which call sites are responsible, and how do you make that a CI assertion rather than a one-time finding?
→ lesson-03-memory-profiling.md / lesson-03-quiz.toml
Capstone Project — Meridian Control Plane Performance Audit
Apply the full three-phase audit workflow to the integrated telemetry pipeline: establish a criterion baseline, generate a flamegraph to identify the hot path, use a counting allocator to quantify per-stage allocation overhead, implement the highest-impact fix, and verify the improvement is statistically significant (p < 0.05). Document findings in audit.md.
Acceptance is against 7 verifiable criteria including correct criterion usage, flamegraph generation, per-stage allocation counts, a documented fix, and a p < 0.05 improvement.
→ project-performance-audit.md
Prerequisites
Modules 1–5 must be complete. Module 5 (Data-Oriented Design) established the optimisations being measured here — this module gives you the tools to verify that those optimisations actually work and to prevent future regressions. Module 2 (Concurrency Primitives) introduced atomic operations, which are used by the counting allocator in Lesson 3.
Foundation Track Complete
With Module 6 complete, the Foundation track is done. The six modules cover the complete toolset for building Meridian's control plane in Rust: async task scheduling, concurrency primitives, message-passing architectures, network I/O, data-oriented design, and performance measurement. The four specialisation tracks — Database Internals, Data Pipelines, Data Lakes, and Distributed Systems — are now unlocked and can be taken in any order.
Lesson 1 — Benchmarking with criterion: Writing Reliable Microbenchmarks
Module: Foundation — M06: Performance & Profiling
Position: Lesson 1 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 6
Context
The Module 5 project processor targets 100,000 frames per second. You have a number — but how confident are you in it? The benchmark loop used Instant::now() / elapsed() around a single iteration. That measurement is subject to three failure modes documented in Rust for Rustaceans Ch. 6: performance variance between runs (caused by CPU temperature, OS scheduler interrupts, memory layout), compiler optimisation eliminating the code under test entirely, and I/O overhead masking the actual function cost. A timing loop that contains a println! is usually measuring the speed of terminal output, not your function.
The criterion crate addresses all three. It runs each benchmark hundreds of times, applies statistical analysis to separate signal from noise, detects and reports outliers, and generates comparison reports that tell you whether a change is a real regression or measurement noise. When Meridian's CI pipeline regresses the frame processor throughput by 15%, criterion is how you prove the regression is real, quantify its size, and track it to the specific commit.
Source: Rust for Rustaceans, Chapter 6 (Gjengset)
Core Concepts
Why Instant::now() Loops Are Not Enough
Consider this naive benchmark:
fn my_function(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();
    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        let _ = my_function(&data);
    }
    println!("took {:?}", start.elapsed());
}
Two problems. First, the compiler may eliminate my_function entirely — the result _ is discarded, so nothing in the code requires the computation to happen (Rust for Rustaceans, Ch. 6). In release mode, the loop body may compile to nothing. Second, a single run on a loaded machine is noise, not signal. CPU clock scaling, branch predictor warmup, and OS scheduler preemption all add variance. A function that takes 50µs may measure anywhere from 40µs to 200µs depending on external conditions.
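The second problem can be made visible with standard-library code alone. The sketch below is a hypothetical helper (sample_durations and summarize are illustrative names, not part of any crate): it times the same function many times and reports the minimum, median, and maximum, so the run-to-run spread is explicit instead of hidden inside a single number. It is not a substitute for criterion's statistics.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn my_function(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

/// Time `f` once per sample and collect the raw durations.
fn sample_durations<F: FnMut()>(mut f: F, samples: usize) -> Vec<Duration> {
    (0..samples)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}

/// Return (min, median, max). Sorts the slice in place; for an even number
/// of samples the upper middle element is used. Panics on an empty slice.
fn summarize(samples: &mut [Duration]) -> (Duration, Duration, Duration) {
    samples.sort_unstable();
    let min = samples[0];
    let median = samples[samples.len() / 2];
    let max = samples[samples.len() - 1];
    (min, median, max)
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();
    let mut samples = sample_durations(
        || {
            black_box(my_function(black_box(&data)));
        },
        101,
    );
    let (min, median, max) = summarize(&mut samples);
    println!("min {:?}  median {:?}  max {:?}", min, median, max);
}
```

On a loaded machine the maximum is typically several times the median, which is the variance a single Instant::now() measurement cannot show.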
criterion Basics
Add criterion to Cargo.toml:
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
[[bench]]
name = "frame_processor"
harness = false
A criterion benchmark in benches/frame_processor.rs:
// Note: criterion is a dev-dependency and is not available in the Playground.
// This demonstrates the API. In a real project, add criterion = "0.5" under
// [dev-dependencies] in Cargo.toml.
//
// use criterion::{black_box, criterion_group, criterion_main, Criterion};
//
// fn bench_deduplication(c: &mut Criterion) {
//     let headers: Vec<u64> = (0..1000).collect();
//     c.bench_function("dedup_1000", |b| {
//         b.iter(|| {
//             // black_box prevents the compiler from optimising away the input
//             // or treating the result as dead code.
//             black_box(deduplicate(black_box(&headers)))
//         })
//     });
// }
//
// criterion_group!(benches, bench_deduplication);
// criterion_main!(benches);

// Illustrating the structure with std::hint::black_box instead:
fn deduplicate(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn main() {
    let headers: Vec<u64> = (0..1000).collect();

    // Warm up
    for _ in 0..100 {
        std::hint::black_box(deduplicate(std::hint::black_box(&headers)));
    }

    // Measure
    let iterations = 10_000;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        std::hint::black_box(deduplicate(std::hint::black_box(&headers)));
    }
    let elapsed = start.elapsed();
    println!("deduplicate(1000): {:.2?} per iteration", elapsed / iterations);
}
black_box: Preventing Dead-Code Elimination
std::hint::black_box (or criterion::black_box) is the key primitive for correct benchmarks. It is an identity function that tells the compiler: assume this value is used in some arbitrary way (Rust for Rustaceans, Ch. 6). The standard library documents it as best-effort, so treat it as a strong hint rather than a hard guarantee. It guards against two failure modes:
Eliminating dead computation: if the benchmark discards the result with let _ = expensive(), the compiler may eliminate the call. black_box(expensive()) forces the computation to occur because the compiler must assume black_box uses its argument.
Constant-folding inputs: if the input to a benchmark is a compile-time constant, the compiler may pre-compute the result at compile time. black_box(input) forces the compiler to treat the input as runtime-unknown.
fn sum_slice(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();

    // WRONG: compiler may eliminate the call (result discarded, input known).
    {
        let start = std::time::Instant::now();
        for _ in 0..10_000 {
            let _ = sum_slice(&data);
        }
        println!("(likely wrong) took {:?}", start.elapsed());
    }

    // CORRECT: black_box prevents dead-code elimination and constant folding.
    {
        let start = std::time::Instant::now();
        for _ in 0..10_000 {
            std::hint::black_box(sum_slice(std::hint::black_box(&data)));
        }
        println!("(correct) took {:?}", start.elapsed());
    }
}
Note the placement: black_box on the input prevents constant folding (the compiler must treat the slice as runtime-unknown). black_box on the output prevents dead-code elimination (the compiler must assume the result is used).
Benchmark Structure Best Practices
Separate setup from measurement. Criterion's b.iter(|| { ... }) closure is the measured unit. Anything outside it is setup and runs once. Constructing test data inside the measured closure inflates the result with allocation cost.
// Illustrating the pattern with manual timing:
fn build_test_headers(n: usize) -> Vec<u64> {
    // This is setup, not what we want to measure.
    (0..n as u64).collect()
}

fn deduplicate_headers(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn bench_with_setup() {
    // Build test data ONCE, not inside the measured loop.
    let headers = build_test_headers(1000);
    let iterations = 100_000u32;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        // Only the function under test is measured.
        std::hint::black_box(deduplicate_headers(std::hint::black_box(&headers)));
    }
    let elapsed = start.elapsed();
    println!("deduplicate(1000): {:.2?}/iter", elapsed / iterations);
}

fn main() {
    bench_with_setup();
}
Benchmark at realistic input sizes. A function that is O(n log n) may be compute-bound at n=100, where the working set fits in L1 cache, and memory-bound at n=100,000, where it does not. Benchmark at the sizes you actually use in production. For Meridian's conjunction screen, that is 50,000 objects, not 100.
Use criterion's input size parameter for scaling analysis. BenchmarkGroup lets you benchmark the same function at multiple input sizes and plot throughput vs. size. The slope of that plot tells you whether your function is cache-bound (throughput drops sharply above L2 size) or compute-bound (throughput scales smoothly).
Interpreting Criterion Output
cargo bench produces output like:
dedup_1000 time: [12.453 µs 12.501 µs 12.554 µs]
change: [-2.1431% -1.6789% -1.1920%] (p=0.00 < 0.05)
Performance has improved.
The three numbers are the lower bound, estimate, and upper bound of the 95% confidence interval. If you see a wide interval (e.g., [5 µs, 50 µs, 200 µs]), measurement variance is high: run on a quieter machine, or collect more data with --sample-size or --measurement-time. (--profile-time is a different mode: it runs the benchmark for a fixed duration with analysis disabled, for use under a profiler.)
The change line compares against the previous run (stored in target/criterion). A p-value below 0.05 means a difference this large would be unlikely if performance had not actually changed, so criterion reports the change as statistically significant. Changes with p > 0.05 are likely noise.
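To judge whether an interval is "narrow", it helps to express its width relative to the estimate. The sketch below uses a hypothetical helper, relative_spread, applied to the three numbers from the sample output above; the 5% cut-off is this lesson's rule of thumb, not something criterion enforces.

```rust
/// Width of the confidence interval relative to the point estimate.
fn relative_spread(lower: f64, estimate: f64, upper: f64) -> f64 {
    (upper - lower) / estimate
}

fn main() {
    // Bounds copied from the sample output above, in microseconds.
    let spread = relative_spread(12.453, 12.501, 12.554);
    println!("interval width: {:.2}% of the estimate", spread * 100.0);

    if spread < 0.05 {
        println!("narrow interval: the measurement is usable");
    } else {
        println!("wide interval: reduce noise or collect more samples");
    }
}
```

For the sample output the width is under 1% of the estimate, so the measurement is tight enough to compare against.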
Code Examples
Benchmarking the Telemetry Processor with Parameterised Input Sizes
This example uses black_box correctly and varies input size to understand the performance profile across the range of realistic batch sizes.
use std::collections::HashSet;
use std::hint::black_box;
use std::time::{Duration, Instant};

fn deduplicate(headers: &[(u32, u64)]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, &(sat, seq))| {
            if seen.insert((sat, seq)) { Some(i) } else { None }
        })
        .collect()
}

fn sort_by_timestamp(indices: &mut Vec<usize>, timestamps: &[u64]) {
    indices.sort_unstable_by_key(|&i| timestamps[i]);
}

/// Run `iterations` iterations and return the mean per-iteration duration.
fn time_fn<F: Fn()>(f: F, iterations: u32) -> Duration {
    // Warm up: let the branch predictor and instruction cache settle.
    for _ in 0..10 {
        f();
    }
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed() / iterations
}

fn main() {
    println!(
        "{:<10} {:>15} {:>15} {:>15}",
        "n_frames", "dedup (µs)", "sort (µs)", "total (µs)"
    );
    println!("{}", "-".repeat(55));

    for &n in &[100usize, 500, 1_000, 5_000, 10_000] {
        // Build test data once, outside the measured loop.
        // Every third frame repeats the previous key: ~33% duplicates.
        let headers: Vec<(u32, u64)> = (0..n)
            .map(|i| {
                let src = if i % 3 == 2 { i - 1 } else { i };
                ((src % 48) as u32, src as u64)
            })
            .collect();
        let timestamps: Vec<u64> = (0..n).map(|i| (n - i) as u64).collect();

        let dedup_time = time_fn(
            || {
                black_box(deduplicate(black_box(&headers)));
            },
            10_000,
        );

        let sort_time = time_fn(
            || {
                // Sorting mutates in place, so each iteration rebuilds `indices`;
                // this figure therefore includes one Vec allocation per iteration.
                let mut indices = (0..n).collect::<Vec<_>>();
                sort_by_timestamp(black_box(&mut indices), black_box(&timestamps));
                black_box(indices);
            },
            10_000,
        );

        println!(
            "{:<10} {:>15.2} {:>15.2} {:>15.2}",
            n,
            dedup_time.as_secs_f64() * 1e6,
            sort_time.as_secs_f64() * 1e6,
            (dedup_time + sort_time).as_secs_f64() * 1e6,
        );
    }
}
The slope of the dedup time as n grows reveals whether the function is O(n) (a straight line on a linear plot) or showing cache effects (a steeper slope beyond a threshold). If dedup time grows faster than linearly above n=1000, the likely cause is that the HashSet working set no longer fits in L1 cache and you are paying for L2/L3 misses. Confirm with hardware counters before acting on that hypothesis.
Key Takeaways
- Instant::now() around a single-pass loop is not a reliable benchmark. Performance variance between runs, compiler dead-code elimination, and I/O in the loop can all produce completely wrong numbers (Rust for Rustaceans, Ch. 6).
- std::hint::black_box (or criterion::black_box) prevents the compiler from eliminating benchmark code as dead. Apply it to both the input (prevent constant folding) and the output (prevent dead-code elimination).
- criterion runs each benchmark the statistically appropriate number of times, computes confidence intervals, detects outliers, and reports whether changes between runs are statistically significant. Use p < 0.05 as the threshold for treating a change as real.
- Separate setup from measurement. Build test data outside the measured closure. Benchmark at realistic production input sizes, not toy sizes. Use parameterised input to understand performance scaling behaviour.
- A 95% confidence interval that is narrow (< 5% spread) indicates a reliable measurement. A wide interval indicates high variance: run on a quieter machine or use cargo bench -- --sample-size 200 for more samples.
Lesson 2 — CPU Profiling with flamegraph and perf: Finding Hot Paths
Module: Foundation — M06: Performance & Profiling
Position: Lesson 2 of 3
Source: Synthesized from training knowledge (cargo flamegraph, perf, pprof documentation)
Source note: This lesson synthesizes from cargo-flamegraph, Linux perf, and pprof documentation. Verify specific CLI flags against your installed version of perf; options vary between kernel versions.
Context
criterion tells you that the frame deduplication function takes 12.5µs. It does not tell you why. Is it the HashSet insertions? The iterator chain? A memory allocation path? To answer that question, you need a CPU profiler — a tool that samples the program's call stack at regular intervals and shows you where time is being spent.
The flamegraph is the standard visualisation for this: a call tree where width encodes time and the call stack grows upward. The widest frames at the top are where your program actually spends its time. A deep narrow tower is a deep but fast call chain. A wide flat bar is a hot leaf function. Reading flamegraphs is a skill that takes a few profiling sessions to develop, but the insight-to-effort ratio is very high.
This lesson covers the two-tool profiling workflow for Rust on Linux: perf to collect samples and cargo flamegraph to generate the visualisation.
Core Concepts
The Profiling Workflow
CPU profiling works by sampling: the OS timer fires at regular intervals (typically 99 Hz or 999 Hz), captures the current call stack, and records the sample. After the program finishes, the accumulated stack samples are folded into a call tree and rendered as a flamegraph. Functions that appear in more samples are proportionally wider in the graph.
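The folding step can be sketched in a few lines. The example below is a simplified illustration, not how perf or cargo flamegraph are implemented: it takes hypothetical stack samples (each a list of function names from root to leaf), collapses identical stacks into counts using the "root;...;leaf count" text form, and computes the share of samples in which a function appears, which is what the width of a frame encodes.

```rust
use std::collections::BTreeMap;

/// Collapse identical stacks into "root;...;leaf" -> sample count.
fn fold_stacks(samples: &[Vec<&str>]) -> BTreeMap<String, u64> {
    let mut folded = BTreeMap::new();
    for stack in samples {
        *folded.entry(stack.join(";")).or_insert(0) += 1;
    }
    folded
}

/// Fraction of samples whose stack contains `function` at any depth.
fn frame_share(samples: &[Vec<&str>], function: &str) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let hits = samples
        .iter()
        .filter(|stack| stack.iter().any(|frame| *frame == function))
        .count();
    hits as f64 / samples.len() as f64
}

fn main() {
    // Hypothetical samples; a real profiler records thousands of these.
    let samples = vec![
        vec!["main", "process_batch", "dedup_pass"],
        vec!["main", "process_batch", "dedup_pass"],
        vec!["main", "process_batch", "sort_pass"],
        vec!["main"],
    ];

    for (stack, count) in &fold_stacks(&samples) {
        println!("{stack} {count}");
    }
    println!(
        "dedup_pass share: {:.0}%",
        frame_share(&samples, "dedup_pass") * 100.0
    );
}
```

A function that appears in half of the samples occupies half the width of the graph, regardless of how many times it was called.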
The standard workflow:
# 1. Build with debug info (but optimisations enabled — profile release code).
# debug = true in [profile.release] preserves symbols without losing optimisations.
# Add to Cargo.toml:
# [profile.release]
# debug = true
# 2. Install cargo-flamegraph (wraps perf or dtrace depending on platform).
cargo install flamegraph
# 3. Profile the binary.
cargo flamegraph --bin meridian-processor -- --frames 100000
# 4. Open the generated flamegraph.svg in a browser.
# Click any frame to zoom in. Search by function name.
On Linux, cargo flamegraph uses perf record under the hood. On macOS, it uses dtrace. The output is always a flamegraph.svg.
Building for Profiling: Debug Symbols in Release Mode
Profiling a debug build measures the wrong thing — debug code contains bounds checks, non-inlined functions, and other overhead that does not exist in production. Profile release builds.
But release builds omit debug info by default. Without it, the profiler cannot attribute samples to inlined functions or source lines, and stacks may be truncated or show raw addresses instead of function names. The fix: add debug info to the release profile without disabling optimisations:
# Cargo.toml
[profile.release]
debug = true # Include debug symbols (DWARF info).
opt-level = 3 # Keep full optimisation.
# Note: debug = true increases binary size (~3-10×) but has negligible
# runtime overhead. Strip the binary before deploying to production.
Alternatively, use the profiling profile convention:
[profile.profiling]
inherits = "release"
debug = true
Then cargo build --profile profiling && cargo flamegraph --profile profiling.
Reading a Flamegraph
A flamegraph stacks call frames vertically — the root (main) at the bottom, callees above. Width is proportional to the percentage of samples that included that frame in the call stack. The top-most wide frames are the actual hot spots.
Patterns to recognise:
Wide flat bar at the top — a leaf function consuming significant CPU. Investigate whether it can be optimised directly (algorithm, data structure choice) or eliminated (caching, avoiding the call).
Wide bar with many narrow children — a function that spends time distributed across many callees. No single child is dominant; the function itself may be doing overhead work.
Deep narrow tower — a long call chain that is individually fast. Usually indicates overhead from indirection (dynamic dispatch, many small function calls). #[inline] or refactoring may help.
[unknown] frames: samples the profiler could not symbolise or unwind (runtime, system libraries, code built without frame pointers). Usually not actionable. Can often be reduced by recording with DWARF-based stack unwinding (perf record --call-graph dwarf).
perf stat: Hardware Counter Snapshot
Before generating a flamegraph, perf stat gives a quick diagnostic of what kind of bottleneck you have:
perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses \
./target/release/meridian-processor --frames 100000
Output:
Performance counter stats for './target/release/meridian-processor':
4,521,847,032 cycles
6,234,891,045 instructions # 1.38 insn per cycle
12,847,334 cache-misses # 8.23% of all cache refs
156,234,123 cache-references
2,341,234 branch-misses # 0.21% of all branches
Instructions per cycle (IPC): 1.38 is on the low side. Modern CPUs can sustain 3–4 IPC on well-behaved code. Low IPC (< 1.5) suggests the processor is stalling, often on memory latency (cache misses) or branch mispredictions.
Cache miss rate: 8.23% is high. Typically < 1% is good. High cache miss rates point to the data layout problems covered in Module 5 — large structs, poor locality, random access patterns.
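Branch miss rate: 0.21% is normal. A rate above 5% suggests unpredictable branches; sorting the input or using branchless comparisons may help. perf prints this percentage only when the branches event is counted in the same run, so add branches to the -e list if the annotation is missing.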
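The ratios perf prints in the right-hand column are simple divisions of the raw counters. The sketch below recomputes them from the sample output above and applies the rule-of-thumb thresholds from this lesson; the Counters struct and triage function are hypothetical names for illustration, and the thresholds are starting points rather than hard limits.

```rust
/// Raw counters as printed by `perf stat`.
struct Counters {
    cycles: u64,
    instructions: u64,
    cache_misses: u64,
    cache_references: u64,
}

impl Counters {
    /// Instructions retired per CPU cycle.
    fn ipc(&self) -> f64 {
        self.instructions as f64 / self.cycles as f64
    }

    /// Cache misses as a fraction of cache references.
    fn cache_miss_rate(&self) -> f64 {
        self.cache_misses as f64 / self.cache_references as f64
    }
}

/// First-pass triage using this lesson's rule-of-thumb thresholds.
fn triage(c: &Counters) -> &'static str {
    if c.cache_miss_rate() > 0.05 {
        "likely memory-bound: look at data layout and access patterns"
    } else if c.ipc() < 1.5 {
        "stalling: check branch misses and dependency chains"
    } else {
        "likely compute-bound: go to the flamegraph"
    }
}

fn main() {
    // Values copied from the sample perf stat output above.
    let c = Counters {
        cycles: 4_521_847_032,
        instructions: 6_234_891_045,
        cache_misses: 12_847_334,
        cache_references: 156_234_123,
    };
    println!("IPC: {:.2}", c.ipc());
    println!("cache miss rate: {:.2}%", c.cache_miss_rate() * 100.0);
    println!("{}", triage(&c));
}
```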
cargo-flamegraph in Practice
// Example: a function with a deliberately inefficient hot path
// to demonstrate profiling workflow.
fn find_conjunctions_naive(
    altitudes: &[f64],
    norad_ids: &[u32],
    threshold_km: f64,
) -> Vec<(u32, u32)> {
    let mut alerts = Vec::new();
    let n = altitudes.len();
    for i in 0..n {
        for j in (i + 1)..n {
            // This inner loop is O(n²) and will show as wide in a flamegraph.
            // abs() is normally inlined in release builds, so its cost is
            // attributed to this function's frame rather than a child frame.
            if (altitudes[i] - altitudes[j]).abs() < threshold_km {
                alerts.push((norad_ids[i], norad_ids[j]));
            }
        }
    }
    alerts
}

fn main() {
    // Simulate workload for profiling.
    let n = 5_000;
    let altitudes: Vec<f64> = (0..n).map(|i| 400.0 + (i as f64) * 0.1).collect();
    let norad_ids: Vec<u32> = (0..n as u32).collect();
    let alerts = find_conjunctions_naive(&altitudes, &norad_ids, 2.0);
    println!("{} conjunction alerts", alerts.len());
}
In a flamegraph of this code, find_conjunctions_naive will be wide at the top (O(n²) iterations). In a release build the subtraction, abs(), and bounds checks are inlined into it, so their cost appears as part of that frame rather than as separate children. Vec::push may appear as a narrow child when the alerts vector grows.
The flamegraph makes the hot path immediately obvious: nearly all samples land in the nested loop. The profile tells you where to look; the fix (a sort followed by a linear scan instead of the O(n²) comparison) comes from reading that loop.
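One way the sort-and-scan fix could look is sketched below. This is an illustrative version, not Meridian's production conjunction screen: find_conjunctions_sorted and normalise are hypothetical names, inputs are assumed to be finite (no NaN), and pairs come out in altitude order rather than index order, so results are normalised before being compared with the naive version.

```rust
/// The O(n²) reference implementation from the lesson.
fn find_conjunctions_naive(alt: &[f64], ids: &[u32], threshold_km: f64) -> Vec<(u32, u32)> {
    let mut alerts = Vec::new();
    for i in 0..alt.len() {
        for j in (i + 1)..alt.len() {
            if (alt[i] - alt[j]).abs() < threshold_km {
                alerts.push((ids[i], ids[j]));
            }
        }
    }
    alerts
}

/// Sort-and-scan: O(n log n) for the sort plus O(n + k) for the scan,
/// where k is the number of alerts produced.
fn find_conjunctions_sorted(alt: &[f64], ids: &[u32], threshold_km: f64) -> Vec<(u32, u32)> {
    // Sort indices by altitude so that close altitudes become adjacent.
    let mut order: Vec<usize> = (0..alt.len()).collect();
    order.sort_unstable_by(|&a, &b| alt[a].total_cmp(&alt[b]));

    let mut alerts = Vec::new();
    for (pos, &i) in order.iter().enumerate() {
        // Scan forward only while the altitude gap stays under the threshold.
        for &j in &order[pos + 1..] {
            if alt[j] - alt[i] >= threshold_km {
                break;
            }
            alerts.push((ids[i], ids[j]));
        }
    }
    alerts
}

/// Put each pair in (low id, high id) order and sort, for comparison.
fn normalise(pairs: &mut Vec<(u32, u32)>) {
    for p in pairs.iter_mut() {
        if p.0 > p.1 {
            *p = (p.1, p.0);
        }
    }
    pairs.sort_unstable();
}

fn main() {
    let n = 2_000;
    let alt: Vec<f64> = (0..n).map(|i| 400.0 + ((i * 7919) % n) as f64 * 0.1).collect();
    let ids: Vec<u32> = (0..n as u32).collect();

    let mut naive = find_conjunctions_naive(&alt, &ids, 2.0);
    let mut sorted = find_conjunctions_sorted(&alt, &ids, 2.0);
    normalise(&mut naive);
    normalise(&mut sorted);
    assert_eq!(naive, sorted);
    println!("{} conjunction alerts from both versions", sorted.len());
}
```

The scan is only fast when the threshold is small relative to the altitude spread; if most objects sit within the threshold of each other, k approaches n² and no reordering helps.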
Annotating Hot Functions with #[inline(never)]
By default, the compiler inlines small functions, which is good for performance but bad for profiling — inlined calls disappear into their callers in the flamegraph. For functions you specifically want to measure in isolation:
// Prevents inlining: this function will appear as a distinct frame in the flamegraph.
// Remove before production use if inlining is desired for performance.
#[inline(never)]
fn compute_altitude_delta(a: f64, b: f64) -> f64 {
    (a - b).abs()
}

fn main() {
    // In a flamegraph, compute_altitude_delta will appear as its own frame,
    // making it easy to see exactly how much time the subtraction + abs costs.
    let result = compute_altitude_delta(410.0, 408.5);
    println!("{}", result);
}
Use #[inline(never)] temporarily during profiling investigations. Remove it afterward — the compiler's inlining decisions are generally correct for production code.
Code Examples
A Profiling-Instrumented Processor Binary
The entry point for profiling runs a realistic workload of sufficient duration for the sampler to collect meaningful data. Too short (< 1 second) and there are too few samples for a reliable flamegraph.
use std::collections::HashSet;
use std::hint::black_box;
use std::time::Instant;

fn build_test_data(n: usize) -> (Vec<u64>, Vec<(u32, u64)>) {
    let timestamps: Vec<u64> = (0..n as u64).rev().collect();
    let headers: Vec<(u32, u64)> = (0..n)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();
    (timestamps, headers)
}

#[inline(never)] // Visible as its own frame in flamegraph
fn dedup_pass(headers: &[(u32, u64)]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, &(sat, seq))| {
            if seen.insert((sat, seq)) { Some(i) } else { None }
        })
        .collect()
}

#[inline(never)] // Visible as its own frame in flamegraph
fn sort_pass(indices: &mut Vec<usize>, timestamps: &[u64]) {
    indices.sort_unstable_by_key(|&i| timestamps[i]);
}

fn process_batch(timestamps: &[u64], headers: &[(u32, u64)]) -> usize {
    let mut indices = dedup_pass(headers);
    sort_pass(&mut indices, timestamps);
    indices.len()
}

fn main() {
    // Run enough iterations for perf to collect ~1000+ samples.
    // At 99 Hz sampling, we need ~10 seconds of CPU time.
    let (timestamps, headers) = build_test_data(10_000);
    let batches = 50_000;
    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..batches {
        total += black_box(process_batch(
            black_box(&timestamps),
            black_box(&headers),
        ));
    }
    let elapsed = start.elapsed();
    println!("processed {} batches, {} unique frames", batches, total);
    println!("throughput: {:.0} batches/sec", batches as f64 / elapsed.as_secs_f64());
    println!("wall time: {:.2?}", elapsed);
}
The #[inline(never)] attributes on dedup_pass and sort_pass ensure they appear as distinct frames in the flamegraph. The black_box calls prevent dead-code elimination from interfering with the profiling workload. The loop runs long enough to collect statistically meaningful samples.
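How long is "long enough" follows from the sampling rate. The sketch below works through the arithmetic with two hypothetical helpers; the 200 µs per-batch cost in main is an assumed figure for illustration, so substitute the value from your own criterion run.

```rust
/// CPU seconds needed to collect `target_samples` at `sample_hz`.
fn required_cpu_seconds(target_samples: u32, sample_hz: u32) -> f64 {
    f64::from(target_samples) / f64::from(sample_hz)
}

/// Batches to run for a given CPU-time budget, rounded up.
fn batches_needed(cpu_seconds: f64, seconds_per_batch: f64) -> u64 {
    (cpu_seconds / seconds_per_batch).ceil() as u64
}

fn main() {
    // 1000 samples at 99 Hz: a little over 10 seconds of CPU time.
    let secs = required_cpu_seconds(1_000, 99);
    println!("need {:.1} s of CPU time", secs);

    // Assumed per-batch cost of 200 µs; measure your own before relying on it.
    let batches = batches_needed(secs, 200e-6);
    println!("run about {} batches", batches);
}
```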
Key Takeaways
- Profile release builds with debug symbols (debug = true in [profile.release]). Profiling debug builds measures overhead that does not exist in production.
- perf stat provides a hardware counter snapshot before you generate a flamegraph. High cache miss rate (> 5%) points to data layout issues; low IPC (< 1.5) suggests the processor is stalling on memory; high branch miss rate suggests unpredictable conditionals.
- In a flamegraph, width encodes time. Wide frames at the top are hot leaf functions, the actual bottleneck. Wide frames with narrow children indicate distributed overhead. Deep narrow towers indicate fast call chains, not hot spots.
- #[inline(never)] temporarily prevents a function from being inlined so it appears as a distinct frame in the profiler. Remove it after the investigation; inlining is correct for production code.
- A profiling session should last at least 5–10 seconds of CPU time for reliable sample counts at 99 Hz. Use a workload that resembles production access patterns at production input sizes.
Lesson 3 — Memory Profiling: Heap Allocation Tracking and Reducing Allocator Pressure
Module: Foundation — M06: Performance & Profiling
Position: Lesson 3 of 3
Source: Synthesized from training knowledge (dhat, heaptrack, jemalloc statistics, custom allocator wrappers)
Source note: This lesson synthesizes from dhat (Valgrind/DHAT profiler), heaptrack documentation, and allocator counting patterns. Verify the dhat-rs API against the current crate version.
Context
The CPU flamegraph from Lesson 2 shows the telemetry processor spending 18% of time in malloc and free. The criterion benchmark from Lesson 1 confirms: 12.5µs per 1000-frame batch, 2.3µs of which is allocator overhead. The fix from Module 5 — arena allocation — eliminates this. But before implementing it, you need to know: exactly how many allocations happen per batch? Which call sites are responsible? Are there unexpected allocations from library code that you assumed was allocation-free?
Memory profiling answers these questions. Unlike CPU profiling (which samples stochastically), allocation profiling intercepts every alloc and dealloc call — giving you exact counts, sizes, and call sites. The tools: dhat for lightweight in-process counting, heaptrack for comprehensive heap timeline recording, and a custom counting allocator for targeted measurements in CI.
Core Concepts
The Allocation Cost Model
Every Box::new(), String::from() on non-empty text, Vec::with_capacity(n), and collection growth hits the global allocator. (Vec::new() and String::new() do not allocate until the first element is added.) The actual cost depends on the allocator (jemalloc is often faster than the system allocator for concurrent workloads), the allocation size (small allocations have higher per-byte overhead), and contention (threads allocating concurrently can contend on shared allocator state).
Profiling allocation patterns reveals three categories of allocatable objects:
Long-lived allocations — startup config, connection state, per-session data structures. These are unavoidable and not a throughput problem.
Per-batch allocations — temporary buffers, work vectors, accumulators that are created and freed within one processing epoch. These are the target of arena allocation — eliminate them with pre-allocation.
Unexpected allocations: library calls that allocate internally even though the API looks allocation-free. Examples are format!(), the first insert into a map created with HashMap::new(), and collect() into a Vec when the iterator cannot report its size. These show up in memory profiles and are often surprising.
Counting Allocations: The Simplest Approach
Before reaching for a full memory profiler, a counting allocator wrapper tells you exactly how many allocations occur in a specific code path. This works in any environment and imposes very low overhead:
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Wraps the system allocator and counts every alloc/dealloc.
struct CountingAllocator;

static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
static DEALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
static ALLOC_BYTES: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

fn snapshot() -> (u64, u64, u64) {
    (
        ALLOC_COUNT.load(Ordering::Relaxed),
        DEALLOC_COUNT.load(Ordering::Relaxed),
        ALLOC_BYTES.load(Ordering::Relaxed),
    )
}

fn reset_counters() {
    ALLOC_COUNT.store(0, Ordering::Relaxed);
    DEALLOC_COUNT.store(0, Ordering::Relaxed);
    ALLOC_BYTES.store(0, Ordering::Relaxed);
}

// --- Application code under test ---

fn process_frames(frames: &[Vec<u8>]) -> usize {
    // Allocates a HashSet internally.
    let mut seen = std::collections::HashSet::new();
    frames.iter().filter(|f| seen.insert(f.as_ptr())).count()
}

fn main() {
    let frames: Vec<Vec<u8>> = (0..100).map(|i| vec![i as u8; 256]).collect();

    // Reset: we only want to count allocations from process_frames.
    reset_counters();
    let result = process_frames(&frames);
    let (allocs, deallocs, bytes) = snapshot();

    println!("process_frames({}) result: {}", frames.len(), result);
    println!("  allocations: {allocs}");
    println!("  deallocations: {deallocs}");
    println!("  bytes: {bytes}");
}
The output reveals exactly how many times the global allocator was called inside process_frames. If the count is non-zero when it should be zero (the function is supposed to be allocation-free), you have a hidden allocation to hunt down.
Common Hidden Allocation Sources
HashSet::new() and HashMap::new(): these start empty (no allocation) but allocate on the first insert and again each time the table outgrows its capacity and rehashes. HashSet::with_capacity(n) makes one up-front allocation sized for at least n elements, so inserting up to n elements triggers no further growth.
collect() without a useful size hint: collect() sizes the Vec from the iterator's size_hint(). Adapters such as filter and filter_map report a lower bound of 0, so the Vec starts with a small capacity and reallocates as elements arrive. In a hot path, prefer Vec::with_capacity(n) followed by extend(), or reuse an existing buffer.
format!() and string operations — every format! call allocates a String. In hot paths, prefer writing to a pre-allocated String with write! or push_str, or avoid String entirely in favour of a stack buffer.
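Arc::new() allocates; Arc::clone() does not: cloning an Arc only increments an atomic reference count, but Arc::new() heap-allocates the value together with its counters. In a hot path, create the Arc once at batch setup time and clone it per frame rather than constructing a new Arc per frame.
Iterator adapters that buffer: .sorted() from itertools collects into a Vec. .flat_map() does not allocate by itself, but the pipeline does if the closure returns an owned collection such as a Vec for each element. Check whether the adapter is allocation-free before using it in a hot path.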
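Two of the fixes above can be sketched with the standard library alone. The example below is illustrative (evens_preallocated and write_label are hypothetical helpers): it shows the size hint that filter reports, pre-allocates the output Vec and fills it with extend(), and formats into a reused String with write! instead of calling format!() per frame.

```rust
use std::fmt::Write; // lets write! target a String

/// Collect even values into a pre-sized Vec instead of using collect().
fn evens_preallocated(input: &[u32]) -> Vec<u32> {
    // Over-allocates to the input length so extend() never has to grow.
    let mut out = Vec::with_capacity(input.len());
    out.extend(input.iter().copied().filter(|x| x % 2 == 0));
    out
}

/// Format a frame label into a reusable buffer instead of calling format!().
fn write_label(buf: &mut String, sat: u32, seq: u64) {
    buf.clear(); // keeps the existing capacity
    // Writing to a String cannot fail, so the Result is safe to ignore.
    let _ = write!(buf, "sat{sat}-seq{seq}");
}

fn main() {
    let input: Vec<u32> = (0..1_000).collect();

    // filter() cannot know how many items will pass: the lower bound is 0.
    let filtered = input.iter().filter(|x| **x % 2 == 0);
    println!("size_hint of filter: {:?}", filtered.size_hint());

    let evens = evens_preallocated(&input);
    println!("{} evens, capacity {}", evens.len(), evens.capacity());

    // One allocation up front, reused for every label.
    let mut label = String::with_capacity(32);
    for seq in 0..3 {
        write_label(&mut label, 7, seq);
        println!("{label}");
    }
}
```

The trade-off in evens_preallocated is memory for predictability: the Vec is sized for the worst case, which is usually acceptable for per-batch buffers that are reused.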
dhat: In-Process Heap Profiling
dhat (from Valgrind, with a Rust port via the dhat crate) instruments every allocation with a call-site stack trace. It produces a profile that shows, for each allocation site, the total bytes allocated, the peak live bytes, and the number of calls:
# Cargo.toml
[dependencies]
dhat = { version = "0.3", optional = true }
[features]
dhat-heap = ["dhat"]
// In main.rs: only active when the dhat-heap feature is enabled.
// cfg gate prevents any overhead in production builds.
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // ... run workload ...

    println!("dhat profile written on drop of _profiler");
}
Run with: cargo run --features dhat-heap. At program exit, dhat writes dhat-heap.json. View it at https://nnethercote.github.io/dh_view/dh_view.html.
The profile shows total bytes allocated per call site — letting you immediately identify which function is responsible for most allocations, even if that function is inside a library you did not write.
Reducing Allocator Pressure: Patterns
Pre-allocate with with_capacity:
fn process_batch_optimised(n: usize) -> Vec<usize> {
    // Pre-allocate with known capacity: no reallocation on push.
    let mut result = Vec::with_capacity(n);
    let mut seen = std::collections::HashSet::with_capacity(n);
    for i in 0..n {
        if seen.insert(i % (n / 2)) {
            // ~50% are unique
            result.push(i);
        }
    }
    result
}

fn main() {
    let batch = process_batch_optimised(10_000);
    println!("{} unique items", batch.len());
}
Reuse allocations across calls with clear() instead of dropping and reallocating:
struct FrameProcessor {
    // Persistent buffers: allocated once, reused every batch.
    seen: std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}
impl FrameProcessor {
    fn new(expected_batch_size: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(expected_batch_size),
            indices: Vec::with_capacity(expected_batch_size),
        }
    }
    fn process(&mut self, headers: &[(u32, u64)]) -> &[usize] {
        // clear() retains the allocation, so there is no new malloc per batch.
        self.seen.clear();
        self.indices.clear();
        for (i, &(sat, seq)) in headers.iter().enumerate() {
            if self.seen.insert((sat, seq)) {
                self.indices.push(i);
            }
        }
        &self.indices
    }
}
fn main() {
    let mut processor = FrameProcessor::new(1000);
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();
    for batch_num in 0..5 {
        let unique = processor.process(&headers);
        println!("batch {batch_num}: {} unique frames", unique.len());
    }
}
The FrameProcessor struct holds the HashSet and Vec across batch calls. Each batch calls clear() — which sets the length to zero but retains the allocated capacity. After the first batch warms up the allocation, subsequent batches make zero allocator calls for these data structures.
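The capacity-retention claim can be checked directly. The sketch below is illustrative only (the function name is made up for this example): it fills a `Vec` and a `HashSet`, records their capacities, calls `clear()`, and records them again.

```rust
use std::collections::HashSet;

/// Fills a Vec and a HashSet with `n` items, then clears them.
/// Returns (vec_capacity, set_capacity) before and after `clear()`.
fn capacity_before_and_after_clear(n: usize) -> ((usize, usize), (usize, usize)) {
    let mut indices: Vec<usize> = Vec::with_capacity(n);
    let mut seen: HashSet<usize> = HashSet::with_capacity(n);
    for i in 0..n {
        indices.push(i);
        seen.insert(i);
    }
    let before = (indices.capacity(), seen.capacity());

    // clear() drops the elements but keeps the backing storage.
    indices.clear();
    seen.clear();
    let after = (indices.capacity(), seen.capacity());

    (before, after)
}

fn main() {
    let (before, after) = capacity_before_and_after_clear(1_000);
    println!("before clear: {before:?}, after clear: {after:?}");
    assert_eq!(before, after);
}
```

Capacity is a proxy here, not a proof: the counting allocator in the next section is what actually confirms that no allocator calls happen.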
Code Examples
Measuring Allocations Per Batch in CI
Embedding an allocation count assertion in CI ensures that future refactors do not accidentally reintroduce per-frame allocations:
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
struct CountingAllocator;
static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}
#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;
// --- Frame processor under test ---
struct Processor {
    seen: std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}
impl Processor {
    fn new(cap: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(cap),
            indices: Vec::with_capacity(cap),
        }
    }
    fn process_batch(&mut self, headers: &[(u32, u64)]) -> usize {
        self.seen.clear();
        self.indices.clear();
        for (i, &key) in headers.iter().enumerate() {
            if self.seen.insert(key) {
                self.indices.push(i);
            }
        }
        self.indices.len()
    }
}
fn main() {
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();
    let mut processor = Processor::new(1000);
    // Warm up: the first batch may allocate as the HashSet grows.
    processor.process_batch(&headers);
    // Reset counter: subsequent batches should be allocation-free.
    ALLOC_COUNT.store(0, Relaxed);
    // Run 100 batches.
    for _ in 0..100 {
        std::hint::black_box(processor.process_batch(std::hint::black_box(&headers)));
    }
    let allocs = ALLOC_COUNT.load(Relaxed);
    println!("allocations across 100 batches: {allocs}");
    // In CI: assert!(allocs == 0, "unexpected allocations in hot path: {allocs}");
    if allocs == 0 {
        println!("PASS: hot path is allocation-free after warm-up");
    } else {
        println!("WARN: {allocs} unexpected allocations detected");
    }
}
The pattern: warm up once (let pre-allocated capacity fill), reset the counter, then assert zero allocations across subsequent batches. This assertion in CI will fail the build if any refactor introduces a hidden allocation.
Key Takeaways
- Memory profiling reveals the call sites responsible for allocations, total bytes allocated per site, and peak live bytes. dhat (via the dhat crate) provides this with minimal production overhead when gated behind a feature flag.
- A counting allocator wrapper (#[global_allocator] with atomic counters) is the fastest way to count allocations in a specific code path. Use it to establish a baseline, then assert zero allocations in CI for hot paths.
- HashSet::with_capacity(n) and Vec::with_capacity(n) pre-allocate to avoid grow-and-rehash allocations. If you know the expected size, always use with_capacity.
- clear() retains the underlying allocation. Use it to reuse Vec and HashMap buffers across batches rather than dropping and reallocating each time.
- Common hidden allocation sources: format!(), HashMap::new() without capacity, Vec::collect() on unsized iterators, iterator adapters that buffer internally (.sorted(), .chunks() on non-slice iterators), and Arc::new() in a per-frame code path.
- Profile allocations before optimising. The counting allocator tells you how many allocations happen. The flamegraph from Lesson 2 tells you where time is spent. Together they give a complete picture: is the bottleneck the allocation count, the allocator latency, or the subsequent memory access pattern?
Project — Meridian Control Plane Performance Audit
Module: Foundation - M06: Performance & Profiling
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- Pipeline Under Audit
- Audit Procedure
- Expected Output
- Acceptance Criteria
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0058 - Control Plane Performance Audit and Remediation
The telemetry processor built in Module 5 achieves 100,000 frames per second in isolation. When integrated with the full control plane pipeline — ground station TCP ingress, deduplication, sort, downstream forwarding — the integrated system runs at 71,000 frames per second, 29% below target.
Your task is to conduct a structured performance audit of the integrated pipeline, identify the bottleneck using the tools from this module, implement a targeted fix with measurable improvement, and document the result.
Pipeline Under Audit
The pipeline processes frames through four stages:
[TCP Ingress] → [Validator] → [Deduplicator] → [Forwarder]
Each stage has a measurable input and output rate. Profiling tools tell you which stage is the bottleneck and which specific function within that stage consumes the most CPU.
Audit Procedure
Phase 1: Establish a Baseline with criterion
Write a criterion benchmark for the full pipeline (not just the processor). Measure:
- Frames per second through the complete pipeline
- Per-stage latency breakdown (validator, deduplicator, forwarder separately)
- Memory allocation count per batch (using a counting allocator)
The baseline establishes the starting point. Every fix must demonstrate measurable improvement against this baseline — not just "it felt faster".
Phase 2: CPU Profile with flamegraph
Run cargo flamegraph on the pipeline binary for 30 seconds under sustained load. Identify:
- Which stage occupies the most flamegraph width
- Which function within that stage is the hot leaf
- Whether the flamegraph shows malloc/free as significant contributors
Phase 3: Memory Profile with a Counting Allocator
Integrate the counting allocator from Lesson 3. For each batch of 1,000 frames:
- Count total allocations per batch
- Count allocations per stage (reset/snapshot around each stage)
- Identify which stage is responsible for the most allocations
Phase 4: Implement and Measure a Fix
Based on the profiling findings, implement the highest-impact fix. Typical candidates:
- Replace Vec::new() in the deduplicator with a reused buffer (clear() pattern)
- Replace HashMap::new() with HashMap::with_capacity(batch_size)
- Replace format!() in the validator with a pre-allocated error buffer
- Apply arena allocation for payloads that were missed in Module 5
Re-run the criterion benchmark. Document the before/after comparison.
Expected Output
A workspace with:
- A meridian-pipeline binary crate implementing the four-stage pipeline
- A benches/pipeline.rs criterion benchmark measuring the full pipeline and each stage
- An audit.md document recording:
  - Baseline criterion output (copy from terminal)
  - Flamegraph findings (which function was the hot path)
  - Allocation counts per stage per batch (from counting allocator)
  - The fix implemented
  - Post-fix criterion output showing improvement
  - criterion's statistical significance output (p-value)
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | criterion benchmark runs and produces confidence intervals for the full pipeline | Yes — cargo bench output |
| 2 | black_box applied correctly — input and output both wrapped | Yes — code review |
| 3 | Test data built outside the criterion closure, not inside | Yes — code review |
| 4 | Flamegraph generated for a ≥ 30-second profiling run | Yes — flamegraph.svg present |
| 5 | Allocation counts per stage documented in audit.md | Yes — numbers in the document |
| 6 | At least one measurable fix implemented and documented with before/after timing | Yes — audit.md |
| 7 | criterion reports p < 0.05 for the improvement (statistically significant) | Yes — criterion output in audit.md |
Hints
Hint 1 — Criterion benchmark structure
// benches/pipeline.rs
// build_test_headers and run_pipeline are the project's own functions.
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
fn bench_pipeline(c: &mut Criterion) {
    let mut group = c.benchmark_group("pipeline");
    for batch_size in [100, 500, 1000, 5000].iter() {
        let headers = build_test_headers(*batch_size);
        group.bench_with_input(
            BenchmarkId::new("full", batch_size),
            batch_size,
            |b, _| {
                b.iter(|| {
                    black_box(run_pipeline(black_box(&headers)))
                })
            },
        );
    }
    group.finish();
}
criterion_group!(benches, bench_pipeline);
criterion_main!(benches);
Hint 2 — Per-stage allocation counting
// Reset counter, run stage, snapshot:
ALLOC_COUNT.store(0, Ordering::Relaxed);
let result = run_validator(black_box(&frames));
let validator_allocs = ALLOC_COUNT.load(Ordering::Relaxed);
ALLOC_COUNT.store(0, Ordering::Relaxed);
let deduped = run_deduplicator(black_box(&result));
let dedup_allocs = ALLOC_COUNT.load(Ordering::Relaxed);
println!("validator: {validator_allocs} allocs/batch");
println!("deduplicator: {dedup_allocs} allocs/batch");
Hint 3 — Reusing buffers between batches
If the deduplicator creates a new HashSet each batch, convert it to a persistent struct:
pub struct Deduplicator {
    seen: std::collections::HashSet<(u32, u64)>,
    unique_indices: Vec<usize>,
}
impl Deduplicator {
    pub fn new(expected_batch: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(expected_batch),
            unique_indices: Vec::with_capacity(expected_batch),
        }
    }
    pub fn process(&mut self, headers: &[(u32, u64)]) -> &[usize] {
        self.seen.clear(); // Retains allocation.
        self.unique_indices.clear(); // Retains allocation.
        for (i, &key) in headers.iter().enumerate() {
            if self.seen.insert(key) {
                self.unique_indices.push(i);
            }
        }
        &self.unique_indices
    }
}
Hint 4 — Flamegraph build configuration
Add to Cargo.toml:
[profile.release]
debug = true
[profile.profiling]
inherits = "release"
debug = true
Build and profile:
cargo build --profile profiling
cargo flamegraph --profile profiling --bin meridian-pipeline -- \
--duration 30 --batch-size 1000
If cargo flamegraph is not installed: cargo install flamegraph. Requires perf on Linux or Xcode instruments on macOS.
Reference Implementation
// src/main.rs: pipeline implementation for profiling
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::hint::black_box;
use std::time::Instant;
// --- Counting allocator ---
struct CountingAllocator;
static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}
#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;
// --- Pipeline stages ---
#[inline(never)]
fn validate(headers: &[(u32, u64, u8)]) -> Vec<(u32, u64)> {
    headers.iter()
        .filter(|&&(_, _, flags)| flags & 0x80 == 0)
        .map(|&(sat, seq, _)| (sat, seq))
        .collect()
}
pub struct Deduplicator {
    seen: std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}
impl Deduplicator {
    pub fn new(cap: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(cap),
            indices: Vec::with_capacity(cap),
        }
    }
    #[inline(never)]
    pub fn process(&mut self, valid: &[(u32, u64)]) -> &[usize] {
        self.seen.clear();
        self.indices.clear();
        for (i, &key) in valid.iter().enumerate() {
            if self.seen.insert(key) {
                self.indices.push(i);
            }
        }
        &self.indices
    }
}
#[inline(never)]
fn forward(valid: &[(u32, u64)], unique: &[usize]) -> usize {
    unique.iter().map(|&i| valid[i].0 as usize).sum()
}
fn run_pipeline(
    headers: &[(u32, u64, u8)],
    dedup: &mut Deduplicator,
) -> usize {
    let valid = validate(headers);
    let unique = dedup.process(&valid).to_vec();
    forward(&valid, &unique)
}
fn main() {
    let batch_size = 1_000usize;
    let headers: Vec<(u32, u64, u8)> = (0..batch_size)
        .map(|i| ((i % 48) as u32, (i / 3) as u64, 0u8))
        .collect();
    let mut dedup = Deduplicator::new(batch_size);
    // Warm up.
    for _ in 0..10 {
        run_pipeline(&headers, &mut dedup);
    }
    // Measure allocations per batch.
    ALLOC_COUNT.store(0, Relaxed);
    for _ in 0..1000 {
        black_box(run_pipeline(black_box(&headers), &mut dedup));
    }
    let allocs = ALLOC_COUNT.load(Relaxed);
    println!("allocs across 1000 batches: {allocs}");
    println!("allocs per batch: {:.1}", allocs as f64 / 1000.0);
    // Throughput measurement.
    let batches = 100_000u32;
    let start = Instant::now();
    for _ in 0..batches {
        black_box(run_pipeline(black_box(&headers), &mut dedup));
    }
    let elapsed = start.elapsed();
    let fps = (batches as usize * batch_size) as f64 / elapsed.as_secs_f64();
    println!("throughput: {:.0} frames/sec", fps);
    println!("elapsed: {:.2?}", elapsed);
}
Reflection
The audit methodology in this project — baseline, profile, identify, fix, verify — is the standard performance engineering workflow. The workflow is the skill, not the specific tools. perf and flamegraph will be replaced by better tools; the habit of measuring before and after, asserting statistical significance, and documenting findings will not.
The counting allocator CI assertion from Lesson 3 is the instrument that keeps the improvements from this module from being silently regressed six months from now. Every performance optimisation needs a regression test. For throughput, that test is a criterion baseline stored in target/criterion. For allocation-freedom, it is an assert_eq!(allocs, 0) assertion in the CI pipeline.
With Module 6 complete, the full Foundation track is done. Every capability the control plane relies on — async scheduling, concurrency primitives, message passing, networking, data layout, and performance measurement — is now in your toolkit. The track-specific modules (Database Internals, Data Pipelines, Data Lakes, Distributed Systems) build directly on this foundation.
Module 01 — Storage Engine Fundamentals
Track: Database Internals — Orbital Object Registry
Position: Module 1 of 6
Source material: Database Internals — Alex Petrov, Chapters 1–4; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0041
Classification: OPERATIONAL DEFICIENCY
Subject: TLE index query latency exceeding conjunction avoidance SLA
ESA's Space Surveillance and Tracking (SST) division has notified Meridian Space Systems that our Two-Line Element (TLE) index cannot scale past 100,000 tracked orbital objects. Current architecture stores TLE records as serialized JSON blobs in PostgreSQL, so every conjunction query triggers a full table scan. With the post-fragmentation debris field projected to add 12,000 new objects this quarter, the system will exceed the 500ms conjunction query SLA within 60 days.
Directive: Build a purpose-built storage engine for the Orbital Object Registry. Start with the lowest layer — how bytes hit disk and come back.
This module establishes the foundational layer of the Orbital Object Registry storage engine. Before you can index, query, or recover data, you need a reliable on-disk format and an efficient way to move pages between disk and memory. Every decision made here — page size, record layout, eviction policy — propagates upward through the entire engine.
Learning Outcomes
After completing this module, you will be able to:
- Design a fixed-size page format with headers, magic bytes, and checksums for integrity verification
- Implement a buffer pool that caches hot pages in memory and evicts cold pages using LRU or CLOCK policies
- Explain why random I/O is the dominant cost in storage engines and how page-aligned access patterns reduce it
- Implement a slotted page layout that supports variable-length records with in-page compaction
- Reason about the tradeoffs between page size, I/O amplification, and internal fragmentation
- Map TLE records to a binary page format suitable for the Orbital Object Registry
Lesson Summary
Lesson 1 — File Formats and Page Layout
How storage engines organize bytes on disk. Fixed-size pages, headers, magic bytes, and the page abstraction that separates logical records from physical storage. Why 4KB or 8KB pages align with OS and hardware boundaries.
Key question: Why do storage engines use fixed-size pages instead of variable-length records written sequentially?
Lesson 2 — Buffer Pool Management
The page cache that sits between the storage engine and the OS. LRU and CLOCK eviction policies, page pinning, dirty page tracking, and the flush protocol. Why the buffer pool exists even though the OS has its own page cache.
Key question: When should a storage engine bypass the OS page cache and manage its own buffer pool?
Lesson 3 — Slotted Pages
How to store variable-length records within a fixed-size page. The slot array, free space pointer, and in-page compaction. How deletions create fragmentation and how the engine reclaims space without rewriting the entire page.
Key question: How does a slotted page maintain stable record identifiers when records are moved during compaction?
Capstone Project — TLE Record Page Manager
Build a page manager that reads and writes orbital TLE records to a custom binary page format backed by a simple buffer pool. The page manager must support insert, lookup by slot, delete, and page-level compaction. Acceptance criteria and the full project brief are in project-tle-page-manager.md.
File Index
module-01-storage-engine-fundamentals/
├── README.md ← this file
├── lesson-01-page-layout.md ← File formats and page layout
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-buffer-pool.md ← Buffer pool management
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-slotted-pages.md ← Slotted pages
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-tle-page-manager.md ← Capstone project brief
Prerequisites
- Foundation Track completed (all 6 modules)
- Familiarity with std::fs::File, Read, Write, Seek traits
- Basic understanding of how operating systems manage file I/O
What Comes Next
Module 2 (B-Tree Index Structures) builds on the page abstraction from this module. The B-tree nodes you implement in Module 2 are stored in the pages you design here. The buffer pool you build here is the same buffer pool that serves page requests for the B-tree and, later, the LSM engine.
Lesson 1 — File Formats and Page Layout
Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 1–3; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: exact page header field sizes in Petrov Ch. 3, the magic byte conventions across SQLite/InnoDB/RocksDB, and Petrov's specific framing of the page abstraction layer.
Context
Every storage engine eventually answers the same question: how do bytes get from memory to disk and back? The answer is almost never "write them sequentially and hope for the best." Sequential writes are fast, but random reads against an unstructured file are catastrophic: seeking to an arbitrary byte offset in a 10GB file on a spinning disk costs 5–10ms per seek. Even on SSDs, a random 512-byte read typically saves nothing over an aligned 4KB read, because the device and the kernel page cache both transfer at least a full block.
The solution, used by virtually every production storage engine from SQLite to RocksDB to PostgreSQL, is the page abstraction: divide the file into fixed-size blocks (pages), give each page a numeric identifier, and build all higher-level structures — indices, records, free lists — on top of this uniform unit. Pages align with the OS virtual memory system (typically 4KB) and the storage device's block size, which means reads and writes hit the hardware at its natural granularity.
For the Orbital Object Registry, each TLE record is approximately 140 bytes (two 69-character lines plus metadata). A single 4KB page can hold roughly 25–28 TLE records. With 100,000 tracked objects, the entire catalog fits in approximately 4,000 pages — about 16MB. The page format you design in this lesson is the physical foundation that every subsequent module builds on.
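The sizing estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch under stated assumptions: the 17-byte header is the one defined later in this lesson, and the 4-byte per-record overhead is a hypothetical allowance for slot bookkeeping, not a figure from the source.

```rust
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 17; // page header defined later in this lesson

/// Records that fit in one page, given the record size and an assumed
/// per-record bookkeeping overhead (for example a slot entry).
fn records_per_page(record_size: usize, per_record_overhead: usize) -> usize {
    (PAGE_SIZE - HEADER_SIZE) / (record_size + per_record_overhead)
}

/// Pages required to hold `total_records`, rounding up.
fn pages_needed(total_records: usize, per_page: usize) -> usize {
    (total_records + per_page - 1) / per_page
}

fn main() {
    let per_page = records_per_page(140, 4); // 4 bytes is an assumption
    let pages = pages_needed(100_000, per_page);
    let bytes = pages * PAGE_SIZE;
    println!("{per_page} records/page, {pages} pages, {bytes} bytes");
}
```

With these assumptions a page holds 28 records. The more conservative figure of 25 records per page gives the 4,000-page estimate quoted above.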
Core Concepts
The Page Abstraction
A page is a fixed-size block of bytes — the atomic unit of I/O in a storage engine. The engine never reads or writes less than one page. This constraint seems wasteful (reading 4KB to retrieve a 140-byte TLE record), but it aligns with how hardware and operating systems actually work:
- Disk drives read and write in sectors (512 bytes or 4KB for modern drives). Reading 1 byte costs the same as reading 4KB — the drive fetches the entire sector regardless.
- The OS page cache manages memory in 4KB pages. A storage engine that uses the same page size gets free alignment with the kernel's caching layer.
- mmap and direct I/O both operate on page-aligned boundaries. Misaligned reads require the kernel to fetch extra pages and copy the relevant bytes, which is unnecessary overhead.
Common page sizes: 4KB (SQLite default), 8KB (PostgreSQL default), 16KB (InnoDB default), 64KB (some OLAP systems). Larger pages reduce the number of I/O operations for sequential scans but increase I/O amplification for point lookups (you read 64KB to get 140 bytes). The Orbital Object Registry uses 4KB pages because the catalog is small enough that point lookup amplification matters more than scan throughput.
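To put numbers on the amplification tradeoff, the sketch below computes bytes read per useful byte for a single-record point lookup. The function name is made up for this example, and it assumes each lookup reads exactly one page with nothing cached.

```rust
/// Bytes read from disk per byte of useful record data, for a point
/// lookup that fetches one whole page to return one record.
fn read_amplification(page_size: usize, record_size: usize) -> f64 {
    page_size as f64 / record_size as f64
}

fn main() {
    let tle_record = 140; // approximate TLE record size from this lesson
    for page_size in [4096, 8192, 16384, 65536] {
        println!(
            "{:>6}-byte page: {:.1}x read amplification",
            page_size,
            read_amplification(page_size, tle_record)
        );
    }
}
```

For a 140-byte record this works out to roughly 29x at 4KB and roughly 468x at 64KB, which is why the smaller page suits a lookup-heavy catalog.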
Page Layout
Every page begins with a header that identifies the page and describes its contents. The header is the first thing the engine reads after loading a page from disk, and it must contain enough information to interpret the rest of the page without external context.
A minimal page header contains:
| Field | Size | Purpose |
|---|---|---|
| Magic bytes | 4 bytes | Identifies this as a valid OOR page (e.g., 0x4F4F5231 = "OOR1") |
| Page ID | 4 bytes | Unique identifier for this page within the file |
| Page type | 1 byte | Discriminant: data page, index page, overflow page, free page |
| Record count | 2 bytes | Number of active records in this page |
| Free space offset | 2 bytes | Byte offset where free space begins |
| Checksum | 4 bytes | CRC32 of the page body for integrity verification |
Total header: 17 bytes. The remaining 4,079 bytes (in a 4KB page) are available for records.
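One way to keep the table and the code in step is to derive every offset from the field before it and let the compiler check the total. The constant names below are illustrative assumptions, while the field sizes are the ones in the table.

```rust
// Each offset is the previous field's offset plus its length.
const MAGIC_OFFSET: usize = 0;
const MAGIC_LEN: usize = 4;
const PAGE_ID_OFFSET: usize = MAGIC_OFFSET + MAGIC_LEN;
const PAGE_ID_LEN: usize = 4;
const PAGE_TYPE_OFFSET: usize = PAGE_ID_OFFSET + PAGE_ID_LEN;
const PAGE_TYPE_LEN: usize = 1;
const RECORD_COUNT_OFFSET: usize = PAGE_TYPE_OFFSET + PAGE_TYPE_LEN;
const RECORD_COUNT_LEN: usize = 2;
const FREE_SPACE_OFFSET: usize = RECORD_COUNT_OFFSET + RECORD_COUNT_LEN;
const FREE_SPACE_LEN: usize = 2;
const CHECKSUM_OFFSET: usize = FREE_SPACE_OFFSET + FREE_SPACE_LEN;
const CHECKSUM_LEN: usize = 4;

const HEADER_SIZE: usize = CHECKSUM_OFFSET + CHECKSUM_LEN;
const PAGE_SIZE: usize = 4096;
const BODY_SIZE: usize = PAGE_SIZE - HEADER_SIZE;

// Compile-time check: fails the build if a field size changes
// without the documented total being updated.
const _: () = assert!(HEADER_SIZE == 17);

fn main() {
    println!("checksum at bytes {CHECKSUM_OFFSET}..{HEADER_SIZE}");
    println!("body is {BODY_SIZE} bytes");
}
```

The same offsets appear as literal ranges in the serialization code later in this lesson. Named constants are a stylistic alternative, not a requirement.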
Magic bytes serve two purposes: they let the engine detect corrupted or misidentified files on open (if the first 4 bytes of the file aren't OOR1, this isn't an OOR database), and they enable file-level identification by external tools (file command, hex editors). Production systems like SQLite use SQLite format 3\000 as the first 16 bytes of the file header.
Checksums detect bit rot and partial writes. A page whose checksum doesn't match its body was either corrupted on disk or partially written during a crash. The engine must reject it and attempt recovery from the WAL (Module 4). CRC32 is standard; some engines use xxHash for speed or SHA-256 for cryptographic integrity.
File Organization
The database file is a contiguous sequence of pages. Page 0 is typically a file header page that stores metadata: database version, page size, total page count, pointer to the free list head, and engine configuration. Pages 1 through N hold data.
┌──────────┬──────────┬──────────┬──────────┬─────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (header) │ (data) │ (data) │ (free) │ │
└──────────┴──────────┴──────────┴──────────┴─────┘
↑
File header: version, page size, page count,
free list head → Page 3
Addressing: Given a page ID and the page size, the byte offset in the file is page_id * page_size. This arithmetic is the reason pages must be fixed-size — variable-size pages would require a separate index to locate each page, adding a layer of indirection to every I/O operation.
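The addressing rule and its inverse fit in two small functions. This is a sketch with assumed function names. The only design point worth noting is widening to `u64` before multiplying, so that large page IDs cannot overflow 32-bit arithmetic.

```rust
/// Byte offset of the start of a page within the database file.
fn page_offset(page_id: u32, page_size: u32) -> u64 {
    // Widen before multiplying: u32 * u32 can exceed u32::MAX.
    page_id as u64 * page_size as u64
}

/// Inverse mapping: which page contains a byte offset, and where in it.
fn locate(offset: u64, page_size: u32) -> (u32, u32) {
    let page_id = (offset / page_size as u64) as u32;
    let within = (offset % page_size as u64) as u32;
    (page_id, within)
}

fn main() {
    let offset = page_offset(3, 4096);
    println!("page 3 starts at byte {offset}");
    let (page_id, within) = locate(12_300, 4096);
    println!("byte 12300 is in page {page_id} at offset {within}");
}
```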
Free list management: When a page is deallocated (all records deleted, or a B-tree node is merged), it goes on the free list rather than being returned to the OS. The next allocation request takes a page from the free list before extending the file. This avoids filesystem fragmentation and keeps the file size stable under delete-heavy workloads.
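The allocation policy described above (reuse freed pages first, extend the file only when none are free) can be sketched with an in-memory structure. This is an illustration under assumptions: the `PageAllocator` type is hypothetical, and a real engine would persist the free list on disk, reachable from the pointer in page 0, rather than hold it in a `Vec`.

```rust
/// Hypothetical in-memory page allocator illustrating free-list reuse.
struct PageAllocator {
    /// ID the next brand-new page would receive (equals pages in file).
    next_page_id: u32,
    /// Previously freed pages, available for reuse.
    free_list: Vec<u32>,
}

impl PageAllocator {
    fn new() -> Self {
        // Page 0 is reserved for the file header, so data starts at 1.
        Self { next_page_id: 1, free_list: Vec::new() }
    }

    /// Take a page from the free list if possible, else extend the file.
    fn allocate(&mut self) -> u32 {
        if let Some(page_id) = self.free_list.pop() {
            return page_id;
        }
        let page_id = self.next_page_id;
        self.next_page_id += 1;
        page_id
    }

    /// Return a page to the free list. The file does not shrink.
    fn free(&mut self, page_id: u32) {
        self.free_list.push(page_id);
    }

    fn file_pages(&self) -> u32 {
        self.next_page_id
    }
}

fn main() {
    let mut alloc = PageAllocator::new();
    let (a, b, c) = (alloc.allocate(), alloc.allocate(), alloc.allocate());
    alloc.free(b);
    let reused = alloc.allocate();
    println!("allocated {a}, {b}, {c}; reused {reused}; file has {} pages", alloc.file_pages());
}
```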
Alignment and Direct I/O
When a storage engine bypasses the OS page cache (using O_DIRECT on Linux), all reads and writes must be aligned to the device's block size — typically 512 bytes or 4KB. Misaligned I/O fails with EINVAL. Even when not using direct I/O, aligned access avoids read-modify-write cycles in the kernel's page cache.
In Rust, allocating page-aligned buffers requires care. Vec<u8> does not guarantee alignment beyond the default allocator's alignment (typically 8 or 16 bytes). For direct I/O, you need explicit alignment:
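use std::alloc::{alloc, handle_alloc_error, Layout};
/// Allocate a page-aligned buffer for direct I/O.
/// Panics if `page_size` is not a power of two. The caller owns the
/// buffer and must release it with `dealloc` using the same layout.
fn alloc_aligned_page(page_size: usize) -> *mut u8 {
    let layout = Layout::from_size_align(page_size, page_size)
        .expect("page_size must be a power of two");
    // Safety: layout has non-zero size (a power of two is at least 1).
    let ptr = unsafe { alloc(layout) };
    if ptr.is_null() {
        // Allocation failure is reported as a null pointer, not a panic.
        handle_alloc_error(layout);
    }
    ptr
}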
Production engines wrap this in a PageBuf type that handles allocation, deallocation, and provides safe access to the underlying bytes.
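A minimal sketch of such a wrapper follows. The lesson names the type `PageBuf` but does not define it, so the fields and methods here are assumptions. The sketch uses only `std::alloc` and zeroes the buffer so that handing out a byte slice is sound.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

/// Owns one page-aligned, zero-initialised buffer and frees it on drop.
struct PageBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl PageBuf {
    fn new(page_size: usize) -> Self {
        assert!(page_size.is_power_of_two(), "page_size must be a power of two");
        let layout = Layout::from_size_align(page_size, page_size)
            .expect("page_size too large for a valid layout");
        // SAFETY: the layout has non-zero size (a power of two is >= 1).
        let raw = unsafe { alloc_zeroed(layout) };
        let ptr = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        Self { ptr, layout }
    }

    fn as_slice(&self) -> &[u8] {
        // SAFETY: ptr points to layout.size() initialised bytes that we own.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: as above, and &mut self guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for PageBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}

fn main() {
    let mut page = PageBuf::new(4096);
    page.as_mut_slice()[0..4].copy_from_slice(b"OOR1");
    let addr = page.as_slice().as_ptr() as usize;
    println!("page at {addr:#x}, aligned: {}", addr % 4096 == 0);
}
```

Storing the `Layout` alongside the pointer is the key design choice: `dealloc` must receive the same layout that was used to allocate, and keeping it in the struct makes that impossible to get wrong.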
Code Examples
Defining the Page Format for Orbital TLE Records
The Orbital Object Registry needs a page format that can store TLE records with their associated metadata. This example defines the page header, serialization, and deserialization logic.
use std::io::{self, Read, Write, Seek, SeekFrom};
use std::fs::File;
const PAGE_SIZE: usize = 4096;
const MAGIC: [u8; 4] = [0x4F, 0x4F, 0x52, 0x31]; // "OOR1"
const HEADER_SIZE: usize = 17;
/// Page types in the Orbital Object Registry.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageType {
FileHeader = 0,
Data = 1,
Index = 2,
Free = 3,
Overflow = 4,
}
/// Fixed-size page header. Sits at byte 0 of every page.
#[derive(Debug)]
struct PageHeader {
magic: [u8; 4],
page_id: u32,
page_type: PageType,
record_count: u16,
free_space_offset: u16,
checksum: u32,
}
impl PageHeader {
fn new(page_id: u32, page_type: PageType) -> Self {
Self {
magic: MAGIC,
page_id,
page_type,
record_count: 0,
// Free space starts immediately after the header
free_space_offset: HEADER_SIZE as u16,
checksum: 0,
}
}
fn serialize(&self, buf: &mut [u8]) {
buf[0..4].copy_from_slice(&self.magic);
buf[4..8].copy_from_slice(&self.page_id.to_le_bytes());
buf[8] = self.page_type as u8;
buf[9..11].copy_from_slice(&self.record_count.to_le_bytes());
buf[11..13].copy_from_slice(&self.free_space_offset.to_le_bytes());
buf[13..17].copy_from_slice(&self.checksum.to_le_bytes());
}
fn deserialize(buf: &[u8]) -> io::Result<Self> {
if buf[0..4] != MAGIC {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"invalid page magic bytes — not an OOR page",
));
}
Ok(Self {
magic: MAGIC,
page_id: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
page_type: match buf[8] {
0 => PageType::FileHeader,
1 => PageType::Data,
2 => PageType::Index,
3 => PageType::Free,
4 => PageType::Overflow,
_ => return Err(io::Error::new(
io::ErrorKind::InvalidData,
"unknown page type discriminant",
)),
},
record_count: u16::from_le_bytes(buf[9..11].try_into().unwrap()),
free_space_offset: u16::from_le_bytes(buf[11..13].try_into().unwrap()),
checksum: u32::from_le_bytes(buf[13..17].try_into().unwrap()),
})
}
}
Notice that all multi-byte integers use little-endian encoding (to_le_bytes/from_le_bytes). This is a deliberate choice — the engine should produce the same on-disk format regardless of the host architecture. Big-endian is equally valid (and simplifies key comparison in B-trees, as we'll see in Module 2), but you must pick one and enforce it everywhere. Mixing endianness across page types is a subtle bug that survives unit tests and explodes in production.
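The difference between the two encodings, and the reason big-endian helps with key comparison, can be seen in a few lines. The helper names are made up for this sketch. They compare the encoded bytes lexicographically and check whether that matches numeric ordering.

```rust
use std::cmp::Ordering;

/// Compare two integers by their big-endian byte encodings.
fn cmp_be_bytes(a: u32, b: u32) -> Ordering {
    a.to_be_bytes().cmp(&b.to_be_bytes())
}

/// Compare two integers by their little-endian byte encodings.
fn cmp_le_bytes(a: u32, b: u32) -> Ordering {
    a.to_le_bytes().cmp(&b.to_le_bytes())
}

fn main() {
    let page_id: u32 = 0x0102_0304;
    // Same value, opposite byte order on disk.
    assert_eq!(page_id.to_le_bytes(), [0x04, 0x03, 0x02, 0x01]);
    assert_eq!(page_id.to_be_bytes(), [0x01, 0x02, 0x03, 0x04]);
    // Round trip is exact as long as reader and writer agree.
    assert_eq!(u32::from_le_bytes(page_id.to_le_bytes()), page_id);

    // 1 < 256 numerically. Big-endian bytes sort the same way,
    // little-endian bytes do not.
    assert_eq!(cmp_be_bytes(1, 256), Ordering::Less);
    assert_eq!(cmp_le_bytes(1, 256), Ordering::Greater);
    println!("byte-wise comparison matches numeric order only for big-endian");
}
```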
Reading and Writing Pages to Disk
The page I/O layer translates between page IDs and file offsets. This is the lowest layer of the storage engine — everything above it thinks in pages, not bytes.
/// Low-level page I/O against the database file.
struct PageFile {
file: File,
page_size: usize,
}
impl PageFile {
fn open(path: &str, page_size: usize) -> io::Result<Self> {
let file = File::options()
.read(true)
.write(true)
.create(true)
.open(path)?;
Ok(Self { file, page_size })
}
/// Read a page from disk into the provided buffer.
/// The buffer must be exactly `page_size` bytes.
fn read_page(&mut self, page_id: u32, buf: &mut [u8]) -> io::Result<()> {
assert_eq!(buf.len(), self.page_size);
let offset = page_id as u64 * self.page_size as u64;
self.file.seek(SeekFrom::Start(offset))?;
self.file.read_exact(buf)?;
Ok(())
}
/// Write a page buffer to disk at the correct offset.
fn write_page(&mut self, page_id: u32, buf: &[u8]) -> io::Result<()> {
assert_eq!(buf.len(), self.page_size);
let offset = page_id as u64 * self.page_size as u64;
self.file.seek(SeekFrom::Start(offset))?;
self.file.write_all(buf)?;
// Note: we do NOT fsync here. Durability is the WAL's job (Module 4).
// Calling fsync on every page write would destroy throughput —
// a single fsync costs 1-10ms on SSD, 10-30ms on spinning disk.
Ok(())
}
/// Allocate a new page at the end of the file. Returns the new page ID.
fn allocate_page(&mut self) -> io::Result<u32> {
let file_len = self.file.seek(SeekFrom::End(0))?;
let page_id = (file_len / self.page_size as u64) as u32;
let zeroed = vec![0u8; self.page_size];
self.file.write_all(&zeroed)?;
Ok(page_id)
}
}
Two things to notice: first, read_page uses read_exact, not read. A short read (fewer bytes than page_size) means the file is truncated or corrupted — the engine must not silently accept a partial page. Second, write_page does not call fsync. This is intentional. The WAL (Module 4) provides durability guarantees; the page file relies on the WAL for crash recovery. Calling fsync on every page write would reduce throughput from thousands of pages/second to fewer than 100 on spinning disk.
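The first point is easy to demonstrate without touching the filesystem. The sketch below uses an in-memory `Cursor` as a stand-in for a truncated file (an assumption made so the example is self-contained) and contrasts `read` with `read_exact`.

```rust
use std::io::{self, Cursor, ErrorKind, Read};

const PAGE_SIZE: usize = 4096;

/// Reads exactly one page or fails. Never returns a partial page.
fn read_full_page<R: Read>(src: &mut R) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; PAGE_SIZE];
    src.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    // A "file" that holds only 100 bytes of what should be a 4096-byte page.
    let mut truncated = Cursor::new(vec![0xABu8; 100]);
    let err = read_full_page(&mut truncated).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::UnexpectedEof);

    // Plain read() succeeds and reports a short count, which is easy
    // to overlook and would leave the rest of the buffer as stale zeros.
    let mut truncated = Cursor::new(vec![0xABu8; 100]);
    let mut buf = vec![0u8; PAGE_SIZE];
    let n = truncated.read(&mut buf).unwrap();
    assert_eq!(n, 100);
    println!("read() returned {n} bytes; read_exact() returned an error");
}
```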
Computing and Verifying Page Checksums
Every page is checksummed before being written to disk. On read, the checksum is verified before the page contents are trusted. This catches bit rot, partial writes, and storage firmware bugs.
/// CRC32 checksum of the page body (everything after the 17-byte header).
/// The header, including the checksum field itself, is not part of the
/// input, so the result does not depend on any previously stored checksum.
fn compute_checksum(page_buf: &[u8]) -> u32 {
// Checksum covers bytes 17..PAGE_SIZE (the body).
// The header's checksum field (bytes 13..17) is excluded from the
// computation — it stores the result.
let body = &page_buf[HEADER_SIZE..];
crc32fast::hash(body)
}
fn write_page_with_checksum(
page_file: &mut PageFile,
page_id: u32,
buf: &mut [u8],
) -> io::Result<()> {
let checksum = compute_checksum(buf);
buf[13..17].copy_from_slice(&checksum.to_le_bytes());
page_file.write_page(page_id, buf)
}
fn read_and_verify_page(
page_file: &mut PageFile,
page_id: u32,
buf: &mut [u8],
) -> io::Result<()> {
page_file.read_page(page_id, buf)?;
let stored = u32::from_le_bytes(buf[13..17].try_into().unwrap());
let computed = compute_checksum(buf);
if stored != computed {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!(
"page {} checksum mismatch: stored={:#010x}, computed={:#010x}",
page_id, stored, computed
),
));
}
Ok(())
}
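The checksum covers only the page body (bytes 17 onward). The entire header, including the checksum field itself, is excluded. Excluding the checksum field avoids a circular dependency: you can't include the checksum in the data being checksummed. The cost of excluding the rest of the header is that corruption in the magic bytes, page ID, or type discriminant is not caught by the CRC. Some engines (e.g., PostgreSQL) checksum the entire page with the checksum field zeroed before computation. Both approaches work, but you must document which one you use.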
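The whole-page alternative can be sketched as follows. To stay free of external crates, this sketch uses a small bitwise CRC-32 (`crc32_ieee`, a helper written for this example) instead of `crc32fast`. The names `checksum_whole_page`, `seal`, and `verify` are hypothetical, and the 13..17 checksum offset is taken from the header layout used in this lesson.

```rust
const PAGE_SIZE: usize = 4096;
// Position of the checksum field, per this lesson's header layout.
const CHECKSUM_RANGE: std::ops::Range<usize> = 13..17;

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Slow but dependency-free; illustration only.
fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, all zeros otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Checksum the whole page with the checksum field treated as zero.
fn checksum_whole_page(page: &[u8; PAGE_SIZE]) -> u32 {
    let mut scratch = *page; // work on a copy so the caller's page is untouched
    scratch[CHECKSUM_RANGE].fill(0);
    crc32_ieee(&scratch)
}

/// Compute the checksum and store it in the header field.
fn seal(page: &mut [u8; PAGE_SIZE]) {
    let sum = checksum_whole_page(page);
    page[CHECKSUM_RANGE].copy_from_slice(&sum.to_le_bytes());
}

/// Recompute the checksum and compare it with the stored value.
fn verify(page: &[u8; PAGE_SIZE]) -> bool {
    let stored = u32::from_le_bytes([page[13], page[14], page[15], page[16]]);
    stored == checksum_whole_page(page)
}
```

Because the header bytes are inside the checksummed range here, a flipped bit in the page type or page ID fails verification, which the body-only scheme would not notice.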
Key Takeaways
- The page is the atomic unit of I/O in a storage engine. All reads and writes operate on full pages, never partial pages. This aligns with hardware block sizes and OS page cache granularity.
- Page size is a tradeoff: larger pages reduce I/O count for scans but increase amplification for point lookups. 4KB is the default for OLTP-style workloads like the Orbital Object Registry.
- Every page starts with a header containing magic bytes, page ID, type discriminant, and a checksum. The header must be self-describing — the engine should be able to interpret any page without external context.
- Byte order must be fixed across the entire on-disk format. Pick little-endian or big-endian and enforce it everywhere. Never rely on native endianness.
- Page writes do not call fsync. Durability is provided by the write-ahead log, not by synchronous page flushes. This is a fundamental architectural decision that separates high-throughput engines from naive implementations.
Lesson 2 — Buffer Pool Management
Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 5; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific CLOCK algorithm variant, his framing of the buffer pool vs. OS page cache tradeoff, and the exact dirty page flush protocols described in Chapter 5.
Context
The page I/O layer from Lesson 1 reads and writes pages directly to disk. Every read_page call triggers a system call, a disk seek (on HDD) or a flash translation layer lookup (on SSD), and a DMA transfer. For the Orbital Object Registry's conjunction query workload — which repeatedly accesses the same hot set of TLE records during a pass window — going to disk for every page request is unacceptable. A conjunction check against 100 objects would issue 100+ page reads, taking 50–100ms on SSD and over a second on spinning disk.
The buffer pool solves this by caching recently-accessed pages in memory. It sits between the storage engine's upper layers (B-tree, LSM, query processor) and the page I/O layer. When a page is requested, the buffer pool checks whether it's already in memory. If so, it returns a pointer to the cached copy — no disk I/O required. If not, it evicts a cold page to make room, reads the requested page from disk, caches it, and returns it. For a well-tuned buffer pool with a hot working set that fits in memory, the hit rate exceeds 99%, and the storage engine operates almost entirely from RAM.
The buffer pool exists even though the OS already has a page cache. The difference is control: the OS page cache uses a generic LRU policy that doesn't know about access patterns specific to the storage engine (sequential scan flooding, index traversal locality). A purpose-built buffer pool can use workload-aware eviction, pin pages during multi-step operations, and track dirty pages for coordinated flushing.
Core Concepts
Buffer Pool Architecture
The buffer pool is a fixed-size array of frames — each frame holds one page-sized buffer plus metadata. The metadata tracks:
- Page ID — which on-disk page is currently loaded in this frame
- Pin count — how many active references exist to this frame. A pinned page cannot be evicted.
- Dirty flag — whether the page has been modified since it was loaded from disk. Dirty pages must be written back before eviction.
- Reference bit (for CLOCK) — whether the page has been accessed recently
The buffer pool also maintains a page table — a hash map from page ID to frame index — for O(1) lookups. When the engine requests page 42, the buffer pool checks the page table. If page 42 maps to frame 7, the engine gets a reference to frame 7's buffer. If page 42 is not in the page table, the buffer pool must evict a page and load 42 from disk.
Page Table (HashMap<PageId, FrameId>)
┌─────────┬─────────┐
│ Page 42 │ Frame 7 │
│ Page 13 │ Frame 2 │
│ Page 99 │ Frame 0 │
│ ... │ ... │
└─────────┴─────────┘
Frame Array
┌─────────┬─────────┬─────────┬─────────┐
│ Frame 0 │ Frame 1 │ Frame 2 │ Frame 3 │ ...
│ pg=99 │ (empty) │ pg=13 │ (empty) │
│ pin=1 │ │ pin=0 │ │
│ dirty=N │ │ dirty=Y │ │
└─────────┴─────────┴─────────┴─────────┘
Eviction Policies: LRU
Least Recently Used (LRU) evicts the page that hasn't been accessed for the longest time. The intuition: if a page hasn't been needed recently, it's unlikely to be needed soon. LRU is implemented with a doubly-linked list — on every access, the page is moved to the head of the list. On eviction, the tail page is removed.
LRU's weakness is scan flooding: a single sequential scan over the entire database evicts every hot page from the buffer pool, even if those pages are accessed hundreds of times per second by other queries. After the scan completes, every subsequent request misses the buffer pool and goes to disk. This is catastrophic for the OOR — a full catalog export scan would evict the conjunction query hot set.
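Scan flooding is easy to reproduce in a toy model. The sketch below tracks page IDs only, with no page data or pinning; `LruSim` and `hot_hits_after_scan` are hypothetical names for this illustration and are not part of the OOR code.

```rust
use std::collections::VecDeque;

/// Toy LRU cache that tracks page IDs only.
struct LruSim {
    capacity: usize,
    order: VecDeque<u32>, // front = most recently used
}

impl LruSim {
    fn new(capacity: usize) -> Self {
        LruSim { capacity, order: VecDeque::new() }
    }

    /// Touch a page. Returns true on a hit, false on a miss.
    fn access(&mut self, page_id: u32) -> bool {
        let hit = if let Some(pos) = self.order.iter().position(|&p| p == page_id) {
            self.order.remove(pos);
            true
        } else {
            if self.order.len() == self.capacity {
                self.order.pop_back(); // evict the least recently used page
            }
            false
        };
        self.order.push_front(page_id);
        hit
    }
}

/// Warm a hot set, run a one-off sequential scan over unrelated pages,
/// then count how many hot pages are still cached.
fn hot_hits_after_scan(capacity: usize, hot_pages: u32, scan_pages: u32) -> u32 {
    let mut cache = LruSim::new(capacity);
    for p in 0..hot_pages {
        cache.access(p);
    }
    for p in 1000..1000 + scan_pages {
        cache.access(p); // scan pages are touched once and never again
    }
    let mut hits = 0;
    for p in 0..hot_pages {
        if cache.access(p) {
            hits += 1;
        }
    }
    hits
}
```

With an 8-frame cache and a 4-page hot set, a scan of 8 or more pages leaves none of the hot pages cached, while a scan of 4 pages or fewer leaves all of them in place.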
Mitigations: LRU-K (track the K-th most recent access, not just the most recent), 2Q (separate queues for first-access and re-access pages), or ARC (adaptive replacement cache). PostgreSQL uses a clock-sweep approximation. MySQL/InnoDB uses a two-list LRU with a "young" and "old" sublist.
Eviction Policies: CLOCK
CLOCK approximates LRU without the overhead of maintaining a linked list. Each frame has a single reference bit. When a page is accessed, its reference bit is set to 1. When the buffer pool needs to evict, it sweeps through the frames in circular order (like a clock hand):
- If the current frame's reference bit is 1, clear it to 0 and advance the hand.
- If the current frame's reference bit is 0 and the page is not pinned, evict this page, flushing it first if it is dirty. Pinned frames are never evicted; the hand moves past them.
CLOCK is cheaper per operation than LRU (no linked list manipulation on every access — just set a bit) and provides comparable hit rates for most workloads. It's the default in many systems.
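The weakness is the same as LRU: a full scan sets every reference bit to 1, so the clock hand must sweep the entire pool, clearing bits as it goes, before any page can be evicted. Scan-resistant variants mitigate this by keeping scan traffic from marking pages as hot, for example by not setting the reference bit for pages accessed during a sequential scan. PostgreSQL takes a related approach and confines large sequential scans to a small ring of buffers, so the scan recycles its own frames instead of displacing the rest of the pool.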
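The sweep itself is short. Here is a minimal sketch of victim selection, assuming per-frame state reduced to a reference bit and a pin count; `ClockFrame`, `ClockSweep`, and `pick_victim` are hypothetical names, and dirty-page flushing is left to the caller.

```rust
/// Per-frame state for the sweep. Page data is omitted in this sketch.
#[derive(Clone, Copy)]
struct ClockFrame {
    referenced: bool,
    pin_count: u32,
}

struct ClockSweep {
    frames: Vec<ClockFrame>,
    hand: usize,
}

impl ClockSweep {
    fn new(num_frames: usize) -> Self {
        ClockSweep {
            frames: vec![ClockFrame { referenced: false, pin_count: 0 }; num_frames],
            hand: 0,
        }
    }

    /// Record an access: just set the reference bit.
    fn touch(&mut self, idx: usize) {
        self.frames[idx].referenced = true;
    }

    /// Pick a victim frame, or None if every frame is pinned.
    fn pick_victim(&mut self) -> Option<usize> {
        let n = self.frames.len();
        // Two full sweeps are enough: the first may only clear reference
        // bits, the second then finds a frame whose bit is already clear.
        for _ in 0..2 * n {
            let idx = self.hand;
            self.hand = (self.hand + 1) % n;
            let frame = &mut self.frames[idx];
            if frame.pin_count > 0 {
                continue; // pinned frames are never evicted
            }
            if frame.referenced {
                frame.referenced = false; // second chance
                continue;
            }
            return Some(idx);
        }
        None
    }
}
```

Compared with the LRU list, an access costs one bit write, and all the bookkeeping happens on the eviction path.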
Page Pinning
A page is pinned when the engine is actively using it and it must not be evicted. The pin count tracks how many concurrent users hold a reference to the page. A page is evictable only when pin_count == 0.
Pinning is critical for correctness: if the engine is in the middle of reading a B-tree node and the buffer pool evicts that page, the engine reads garbage. The protocol:
- Fetch a page: buffer pool loads or finds it, increments pin count, returns a handle.
- Use the page: engine reads or writes the page data.
- Unpin the page: engine decrements pin count when done. If the engine modified the page, it marks it dirty.
Failing to unpin a page is a resource leak — the page can never be evicted, and eventually the buffer pool fills with pinned pages and all fetch requests fail. In Rust, RAII handles this naturally: the page handle decrements the pin count in its Drop implementation.
Dirty Page Flushing
A dirty page has been modified in memory but not yet written back to disk. The buffer pool tracks dirty pages and flushes them in two scenarios:
- Eviction flush: when a dirty page is selected for eviction, it must be written to disk before the frame can be reused.
- Background flush: a periodic background thread scans for dirty pages and writes them to disk proactively, reducing the chance that an eviction will stall on a synchronous write.
The buffer pool does not call fsync after every flush. Durability is the WAL's responsibility (Module 4). The buffer pool's flush is an optimization — it keeps the page file reasonably up-to-date so that crash recovery doesn't have to replay the entire WAL.
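A background flush pass can be sketched as a loop over the frames. This is an illustration under simplifying assumptions: `CachedPage` and `flush_dirty` are hypothetical names, the write target is passed in as a closure instead of the Lesson 1 `PageFile`, and no `fsync` is issued, matching the division of labor described above.

```rust
use std::io;

/// Minimal stand-in for a buffer pool frame.
struct CachedPage {
    page_id: u32,
    data: Vec<u8>,
    is_dirty: bool,
}

/// Write every dirty page through `write_page` and clear its dirty flag.
/// Returns the number of pages flushed. Stops at the first I/O error,
/// leaving the failed page (and any later ones) marked dirty.
fn flush_dirty<W>(frames: &mut [CachedPage], mut write_page: W) -> io::Result<usize>
where
    W: FnMut(u32, &[u8]) -> io::Result<()>,
{
    let mut flushed = 0;
    for frame in frames.iter_mut().filter(|f| f.is_dirty) {
        write_page(frame.page_id, &frame.data)?;
        // Clear the flag only after the write succeeded.
        frame.is_dirty = false;
        flushed += 1;
    }
    Ok(flushed)
}
```

Clearing the flag after the write, not before, means a failed write is retried on the next pass instead of being silently dropped.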
Code Examples
A Simple LRU Buffer Pool for the Orbital Object Registry
This buffer pool caches OOR pages in memory and evicts the least recently used unpinned page when the pool is full.
use std::collections::{HashMap, VecDeque};
use std::io;
const PAGE_SIZE: usize = 4096;
/// Metadata for a single buffer pool frame.
struct Frame {
page_id: Option<u32>,
data: [u8; PAGE_SIZE],
pin_count: u32,
is_dirty: bool,
}
impl Frame {
fn new() -> Self {
Self {
page_id: None,
data: [0u8; PAGE_SIZE],
pin_count: 0,
is_dirty: false,
}
}
}
/// LRU buffer pool backed by the OOR page file.
struct BufferPool {
frames: Vec<Frame>,
page_table: HashMap<u32, usize>, // page_id -> frame_index
/// LRU order: front = most recently used, back = least recently used.
/// Contains every frame index; pinned frames stay in the list and are
/// skipped by find_evict_target.
lru_list: VecDeque<usize>,
page_file: PageFile, // From Lesson 1
}
impl BufferPool {
fn new(num_frames: usize, page_file: PageFile) -> Self {
let mut frames = Vec::with_capacity(num_frames);
let mut lru_list = VecDeque::with_capacity(num_frames);
for i in 0..num_frames {
frames.push(Frame::new());
lru_list.push_back(i); // All frames start as free (evictable)
}
Self {
frames,
page_table: HashMap::new(),
lru_list,
page_file,
}
}
/// Fetch a page into the buffer pool. Returns the frame index.
/// The page is pinned — caller MUST call `unpin` when done.
fn fetch_page(&mut self, page_id: u32) -> io::Result<usize> {
// Fast path: page is already in the pool
if let Some(&frame_idx) = self.page_table.get(&page_id) {
self.frames[frame_idx].pin_count += 1;
self.move_to_front(frame_idx);
return Ok(frame_idx);
}
// Slow path: need to load from disk. Find an evictable frame.
let frame_idx = self.find_evict_target()?;
// If the frame holds a dirty page, flush it before reuse
if let Some(old_page_id) = self.frames[frame_idx].page_id {
if self.frames[frame_idx].is_dirty {
self.page_file.write_page(
old_page_id,
&self.frames[frame_idx].data,
)?;
}
self.page_table.remove(&old_page_id);
}
// Load the requested page into the frame
self.page_file.read_page(page_id, &mut self.frames[frame_idx].data)?;
self.frames[frame_idx].page_id = Some(page_id);
self.frames[frame_idx].pin_count = 1;
self.frames[frame_idx].is_dirty = false;
self.page_table.insert(page_id, frame_idx);
self.move_to_front(frame_idx);
Ok(frame_idx)
}
/// Unpin a page. Caller must indicate whether the page was modified.
fn unpin(&mut self, frame_idx: usize, is_dirty: bool) {
let frame = &mut self.frames[frame_idx];
assert!(frame.pin_count > 0, "unpin called on unpinned frame");
frame.pin_count -= 1;
if is_dirty {
frame.is_dirty = true;
}
}
/// Find the least recently used unpinned frame.
fn find_evict_target(&self) -> io::Result<usize> {
// Scan from the back (LRU end) for an unpinned frame
for &frame_idx in self.lru_list.iter().rev() {
if self.frames[frame_idx].pin_count == 0 {
return Ok(frame_idx);
}
}
Err(io::Error::new(
io::ErrorKind::Other,
"buffer pool exhausted: all frames are pinned",
))
}
/// Move a frame to the front of the LRU list (most recently used).
fn move_to_front(&mut self, frame_idx: usize) {
self.lru_list.retain(|&idx| idx != frame_idx);
self.lru_list.push_front(frame_idx);
}
}
The move_to_front implementation is O(n) because VecDeque::retain scans the entire list. A production buffer pool uses an intrusive doubly-linked list for O(1) LRU updates — Rust crates like intrusive-collections provide this. The O(n) approach is correct and sufficient for understanding the algorithm; the optimization matters only when the buffer pool has thousands of frames and fetch rates exceed 100k/sec.
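Notice the pin-count assert in unpin: a double-unpin is a logic bug that must crash immediately in development. In production, this would typically become a debug_assert! paired with an explicit check for a zero pin count. The check matters because debug_assert! compiles away in release builds, where an unchecked decrement from zero wraps by default and leaves the frame permanently pinned.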
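A checked decrement is one way to surface a double-unpin as an error instead of a panic or a wrapped counter. In this sketch, `FrameMeta`, `UnpinError`, and `unpin_checked` are hypothetical names, and the frame is reduced to the two fields that unpinning touches.

```rust
#[derive(Debug, PartialEq)]
enum UnpinError {
    /// The frame's pin count was already zero.
    NotPinned,
}

/// Only the metadata that unpinning touches.
struct FrameMeta {
    pin_count: u32,
    is_dirty: bool,
}

/// Decrement the pin count, reporting a double-unpin to the caller.
fn unpin_checked(frame: &mut FrameMeta, modified: bool) -> Result<(), UnpinError> {
    // checked_sub returns None instead of wrapping when pin_count is 0,
    // so the counter is left untouched on the error path.
    frame.pin_count = frame
        .pin_count
        .checked_sub(1)
        .ok_or(UnpinError::NotPinned)?;
    frame.is_dirty |= modified;
    Ok(())
}
```

The caller can then log the error and keep serving requests, which is usually preferable to aborting a query path over a bookkeeping bug.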
RAII Page Handle for Automatic Unpinning
Rust's ownership system prevents the "forgot to unpin" bug class entirely. A page handle unpins automatically when it goes out of scope.
/// RAII handle to a pinned buffer pool page.
/// Automatically unpins the page when dropped.
struct PageHandle<'a> {
pool: &'a mut BufferPool,
frame_idx: usize,
dirty: bool,
}
impl<'a> PageHandle<'a> {
fn data(&self) -> &[u8; PAGE_SIZE] {
&self.pool.frames[self.frame_idx].data
}
fn data_mut(&mut self) -> &mut [u8; PAGE_SIZE] {
self.dirty = true;
&mut self.pool.frames[self.frame_idx].data
}
}
impl<'a> Drop for PageHandle<'a> {
fn drop(&mut self) {
self.pool.unpin(self.frame_idx, self.dirty);
}
}
This is one of the places where Rust's borrow checker provides a genuine advantage over C/C++ buffer pool implementations. In C, every code path that fetches a page must remember to unpin it — including error paths, early returns, and exception-like longjmp flows. In Rust, the Drop implementation runs unconditionally when the handle leaves scope. The borrow checker also prevents holding a &mut reference to the page data after the handle is dropped, which would alias freed memory in C.
The tradeoff: the &'a mut BufferPool borrow means you can only hold one PageHandle at a time with this design. A production buffer pool uses Arc<Mutex<...>> or unsafe interior mutability to allow multiple concurrent page handles — we'll revisit this pattern when we implement B-tree traversal in Module 2.
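For a single-threaded preview of the interior-mutability route, the sketch below lets several handles coexist by pinning through a shared reference. It is an illustration only, not the Module 2 design: `SharedPool`, `SharedFrame`, and `SharedHandle` are hypothetical names, and `Cell`/`RefCell` are not thread-safe, so a concurrent pool would need atomics and locks instead.

```rust
use std::cell::{Cell, Ref, RefCell, RefMut};

const PAGE_SIZE: usize = 4096;

struct SharedFrame {
    data: RefCell<[u8; PAGE_SIZE]>,
    pin_count: Cell<u32>,
    is_dirty: Cell<bool>,
}

struct SharedPool {
    frames: Vec<SharedFrame>,
}

/// RAII handle that borrows the pool immutably, so many can coexist.
struct SharedHandle<'a> {
    frame: &'a SharedFrame,
}

impl SharedPool {
    fn new(num_frames: usize) -> Self {
        let frames = (0..num_frames)
            .map(|_| SharedFrame {
                data: RefCell::new([0u8; PAGE_SIZE]),
                pin_count: Cell::new(0),
                is_dirty: Cell::new(false),
            })
            .collect();
        SharedPool { frames }
    }

    /// Pin a frame through a shared reference to the pool.
    fn pin<'a>(&'a self, idx: usize) -> SharedHandle<'a> {
        let frame = &self.frames[idx];
        frame.pin_count.set(frame.pin_count.get() + 1);
        SharedHandle { frame }
    }
}

impl<'a> SharedHandle<'a> {
    fn read(&self) -> Ref<'a, [u8; PAGE_SIZE]> {
        self.frame.data.borrow()
    }

    /// Mutable access; RefCell panics at runtime on overlapping borrows.
    fn write(&self) -> RefMut<'a, [u8; PAGE_SIZE]> {
        self.frame.is_dirty.set(true);
        self.frame.data.borrow_mut()
    }
}

impl<'a> Drop for SharedHandle<'a> {
    fn drop(&mut self) {
        self.frame.pin_count.set(self.frame.pin_count.get() - 1);
    }
}
```

The price is that aliasing rules for page data move from compile time to run time: two overlapping `write` borrows on the same frame compile fine and panic when executed.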
Key Takeaways
- The buffer pool is a fixed-size array of page-sized frames with a hash map for O(1) page-to-frame lookup. It eliminates disk I/O for hot pages and is the single largest performance lever in any storage engine.
- LRU eviction is simple but vulnerable to scan flooding. CLOCK approximates LRU at lower cost per operation. Production engines use hybrid policies (LRU-K, 2Q, ARC) to resist scan-induced cache pollution.
- Page pinning prevents eviction during active use. In Rust, RAII handles make pin leaks impossible: the Drop implementation guarantees unpinning on all code paths, including panics.
- Dirty pages are flushed on eviction and by background threads. The buffer pool does not call fsync; durability is the WAL's job.
- The "all frames pinned" error means the buffer pool is undersized for the workload's concurrency level. In the OOR, this can happen during peak conjunction checking if every active query holds a page pin simultaneously.
Lesson 3 — Slotted Pages
Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 3; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific slotted page layout, his terminology for the slot directory vs. cell pointer array, and the compaction algorithm described in Chapter 3.
Context
The page format from Lesson 1 stores records at fixed offsets. This works for fixed-size records — and if every TLE record were exactly 140 bytes, that would be sufficient. But real TLE data is messier: newer objects have additional metadata fields (drag coefficients, maneuver flags, covariance matrices), older legacy records omit optional fields, and record sizes will grow as ESA adds new conjunction assessment data. A format that requires all records to be the same size either wastes space (padding every record to the maximum) or breaks when schema evolves.
The slotted page layout solves this by decoupling record identity from record position. Records are addressed by slot numbers, and a slot directory at the beginning of the page maps each slot to the record's actual byte offset and length within the page. Records grow from the end of the page toward the front, while the slot directory grows from the front toward the end. They meet in the middle — and when they collide, the page is full.
This layout is the standard for every major relational database (PostgreSQL, MySQL/InnoDB, SQLite) and many key-value stores. Understanding it is prerequisite for everything that follows: B-tree nodes (Module 2) are slotted pages, WAL log records reference slot IDs (Module 4), and MVCC version chains (Module 5) track records by their page-and-slot address.
Core Concepts
Slotted Page Layout
A slotted page has three regions:
┌──────────────────────────────────────────────┐
│ Page Header (17 bytes — from Lesson 1) │
├──────────────────────────────────────────────┤
│ Slot Directory (grows →) │
│ [slot 0: offset, len] [slot 1: offset, len] │
├──────────────────────────────────────────────┤
│ │
│ Free Space │
│ │
├──────────────────────────────────────────────┤
│ Records (grow ←) │
│ [record 1 data] [record 0 data] │
└──────────────────────────────────────────────┘
Slot directory: An array of (offset, length) pairs, one per record. Slot 0 is the first entry. Each entry is 4 bytes (2 bytes offset + 2 bytes length), supporting records up to 65,535 bytes and offsets within a 64KB page. For 4KB pages, this is more than sufficient.
Records: Stored at the end of the page, growing backward (toward lower offsets). The first record inserted goes at the very end of the page; subsequent records are placed just before the previous one.
Free space: The gap between the end of the slot directory and the start of the record region. As records are inserted, the free space shrinks from both sides. The page is full when slot_directory_end >= record_region_start.
Record Addressing: (PageId, SlotId)
Higher layers of the storage engine refer to records by a Record ID (RID): a (page_id, slot_id) pair. This identifier is stable — it doesn't change when records are moved within the page during compaction, because the slot directory is updated to reflect the new offset. External references (B-tree leaf pointers, index entries) store RIDs, not raw byte offsets.
This indirection is what makes the slotted page powerful: the engine can rearrange the physical layout of records within a page (to reclaim fragmented space) without invalidating any external references. The slot ID stays the same; only the offset in the slot directory changes.
When a record is deleted, its slot entry is marked as tombstoned (offset set to a sentinel like 0xFFFF) but not removed from the directory. Removing it would shift all subsequent slot IDs by one, invalidating every external reference to those slots. Tombstoned slots can be reused for future inserts.
Insertion
To insert a record of N bytes:
- Check if there is enough free space: free_space >= N + 4 (4 bytes for the new slot entry).
- Find the next available slot. If there's a tombstoned slot, reuse it. Otherwise, append a new entry to the directory.
- Write the record at record_region_start - N.
- Update the slot entry with (offset, N).
- Update the page header: increment record count, adjust free_space_offset.
If the free space check fails, the page is full. The caller must either split the page (in a B-tree) or allocate a new page (in a heap file).
Deletion and Fragmentation
Deleting a record tombstones its slot entry and marks the record's bytes as reclaimable. But it doesn't shift other records — doing so would change their offsets and require updating every other slot entry that points past the deleted record.
This creates internal fragmentation: there are N bytes of garbage between valid records. Over time, a page can have plenty of total free space but no contiguous block large enough for a new record.
Before delete:
[Header][Slot 0][Slot 1][Slot 2] [free] [Rec 2][Rec 1][Rec 0]
After deleting record 1:
[Header][Slot 0][TOMB ][Slot 2] [free] [Rec 2][DEAD ][Rec 0]
← gap here cannot be used
unless compacted →
Page Compaction
Compaction reclaims fragmented space by sliding all live records to the end of the page (closing the gaps left by deleted records) and updating their slot directory entries to reflect the new offsets. After compaction, all free space is contiguous.
The algorithm:
- Collect all live records (slot entries that are not tombstoned), sorted by their current offset in descending order.
- Starting from the end of the page, write each record contiguously.
- Update each slot entry with the new offset.
- Update the page header's free_space_offset.
Compaction is an in-page operation — it never spills to disk or affects other pages. It runs when an insert fails due to fragmentation (total free space is sufficient, but contiguous free space is not). Some engines compact proactively during quiet periods to avoid stalling inserts.
Overflow Pages
A single record might exceed the page's usable space (4,079 bytes in a 4KB page). This shouldn't happen for TLE records (140 bytes), but the engine must handle it for forward compatibility — covariance matrices and conjunction assessment reports can be kilobytes.
The solution: when a record is too large, store the first portion in the primary page and the remainder in one or more overflow pages. The slot entry points to the in-page portion, which contains a pointer to the overflow chain. PostgreSQL addresses the same problem with TOAST (The Oversized-Attribute Storage Technique), which moves large values out of line into a separate table instead of chaining overflow pages.
For the Orbital Object Registry, overflow pages are unlikely but should be supported. The implementation can be deferred until the schema actually requires it.
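Even with the implementation deferred, the splitting logic is small enough to sketch. The functions below are hypothetical (`split_for_overflow`, `reassemble`) and cover only how the bytes are partitioned. The pointer from the inline portion to the overflow chain, and the allocation of overflow pages, are left out.

```rust
/// Split a record into the portion stored in the primary page and the
/// chunks destined for overflow pages.
///
/// `inline_cap` is how many record bytes the primary page will hold and
/// `overflow_cap` is the payload capacity of one overflow page. Both are
/// assumptions chosen by the caller, not values fixed by this lesson.
fn split_for_overflow(
    record: &[u8],
    inline_cap: usize,
    overflow_cap: usize,
) -> (Vec<u8>, Vec<Vec<u8>>) {
    assert!(overflow_cap > 0, "overflow pages must hold at least one byte");
    let inline_len = record.len().min(inline_cap);
    let (head, tail) = record.split_at(inline_len);
    // An empty tail produces no chunks, so small records never overflow.
    let chunks = tail.chunks(overflow_cap).map(|c| c.to_vec()).collect();
    (head.to_vec(), chunks)
}

/// Rebuild the original record from its inline portion and overflow chunks.
fn reassemble(inline: &[u8], chunks: &[Vec<u8>]) -> Vec<u8> {
    let mut out = inline.to_vec();
    for chunk in chunks {
        out.extend_from_slice(chunk);
    }
    out
}
```

A 10,000-byte record with a 1,000-byte inline portion and 4,000-byte overflow payloads splits into three overflow chunks, while an 82-byte TLE record stays entirely inline.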
Code Examples
A Slotted Page Implementation for TLE Records
This implements the core slotted page logic: insert, lookup, delete, and compaction.
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 17;
const SLOT_SIZE: usize = 4; // 2 bytes offset + 2 bytes length
const TOMBSTONE: u16 = 0xFFFF;
/// A slotted page that stores variable-length records.
struct SlottedPage {
data: [u8; PAGE_SIZE],
}
impl SlottedPage {
fn new(page_id: u32) -> Self {
let mut page = Self {
data: [0u8; PAGE_SIZE],
};
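// Initialize header (simplified: reuse PageHeader from Lesson 1)
let mut header = PageHeader::new(page_id, PageType::Data);
header.serialize(&mut page.data);
// An empty page has no records, so the record region starts at the page end.
// Without this, a zeroed offset would make free_space() report 0.
page.set_record_region_start(PAGE_SIZE as u16);
page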
}
/// Number of slots (including tombstoned ones).
fn slot_count(&self) -> u16 {
u16::from_le_bytes(self.data[9..11].try_into().unwrap())
}
fn set_slot_count(&mut self, count: u16) {
self.data[9..11].copy_from_slice(&count.to_le_bytes());
}
/// Byte offset where the record region begins (records grow downward).
fn record_region_start(&self) -> u16 {
u16::from_le_bytes(self.data[11..13].try_into().unwrap())
}
fn set_record_region_start(&mut self, offset: u16) {
self.data[11..13].copy_from_slice(&offset.to_le_bytes());
}
/// Read a slot directory entry.
fn read_slot(&self, slot_id: u16) -> (u16, u16) {
let base = HEADER_SIZE + (slot_id as usize) * SLOT_SIZE;
let offset = u16::from_le_bytes(
self.data[base..base + 2].try_into().unwrap()
);
let length = u16::from_le_bytes(
self.data[base + 2..base + 4].try_into().unwrap()
);
(offset, length)
}
fn write_slot(&mut self, slot_id: u16, offset: u16, length: u16) {
let base = HEADER_SIZE + (slot_id as usize) * SLOT_SIZE;
self.data[base..base + 2].copy_from_slice(&offset.to_le_bytes());
self.data[base + 2..base + 4].copy_from_slice(&length.to_le_bytes());
}
/// Free space available for new records + slot entries.
fn free_space(&self) -> usize {
let slot_dir_end = HEADER_SIZE + (self.slot_count() as usize) * SLOT_SIZE;
let rec_start = self.record_region_start() as usize;
if rec_start > slot_dir_end {
rec_start - slot_dir_end
} else {
0
}
}
/// Insert a record. Returns the slot ID on success.
fn insert(&mut self, record: &[u8]) -> Option<u16> {
let needed = record.len() + SLOT_SIZE; // record bytes + new slot entry
if self.free_space() < needed {
return None; // Page full — caller should try compaction or new page
}
// Find a tombstoned slot to reuse, or allocate a new one
let slot_id = self.find_free_slot();
// Place the record at the end of the record region
let new_offset = self.record_region_start() - record.len() as u16;
let start = new_offset as usize;
let end = start + record.len();
self.data[start..end].copy_from_slice(record);
// Update the slot directory
self.write_slot(slot_id, new_offset, record.len() as u16);
self.set_record_region_start(new_offset);
Some(slot_id)
}
/// Look up a record by slot ID. Returns None if the slot is
/// tombstoned or out of range.
fn get(&self, slot_id: u16) -> Option<&[u8]> {
if slot_id >= self.slot_count() {
return None;
}
let (offset, length) = self.read_slot(slot_id);
if offset == TOMBSTONE {
return None; // Deleted record
}
let start = offset as usize;
let end = start + length as usize;
Some(&self.data[start..end])
}
/// Delete a record by tombstoning its slot entry.
fn delete(&mut self, slot_id: u16) -> bool {
if slot_id >= self.slot_count() {
return false;
}
let (offset, _) = self.read_slot(slot_id);
if offset == TOMBSTONE {
return false; // Already deleted
}
self.write_slot(slot_id, TOMBSTONE, 0);
true
}
/// Find a tombstoned slot to reuse, or allocate a new one.
fn find_free_slot(&mut self) -> u16 {
let count = self.slot_count();
for i in 0..count {
let (offset, _) = self.read_slot(i);
if offset == TOMBSTONE {
return i;
}
}
// No tombstoned slots — extend the directory
self.set_slot_count(count + 1);
count
}
}
Key design decisions: the slot directory grows forward from the header, records grow backward from the end of the page, and the two regions meet in the middle. This maximizes usable space — there's no fixed boundary between "slot space" and "record space." A page with few large records uses most of its space for record data; a page with many small records uses more for the slot directory.
The insert method does not attempt compaction automatically. The caller is responsible for detecting "free space exists but is fragmented" and calling compact() before retrying. This keeps the insert path simple and predictable.
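The caller-side decision can be sketched as a pure function over the slot directory. It assumes the same constants and slot encoding as the page above; `InsertPlan` and `plan_insert` are hypothetical names, and like `insert` it conservatively reserves a new slot entry even when a tombstoned slot could be reused.

```rust
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 17;
const SLOT_SIZE: usize = 4;
const TOMBSTONE: u16 = 0xFFFF;

#[derive(Debug, PartialEq)]
enum InsertPlan {
    /// Enough contiguous free space right now.
    Fits,
    /// Not enough contiguous space, but compaction would make room.
    CompactFirst,
    /// Even a fully compacted page could not hold the record.
    PageFull,
}

/// Decide how to handle an insert, given the slot directory entries as
/// (offset, length) pairs and the current start of the record region.
fn plan_insert(
    slots: &[(u16, u16)],
    record_region_start: usize,
    record_len: usize,
) -> InsertPlan {
    let needed = record_len + SLOT_SIZE;
    let dir_end = HEADER_SIZE + slots.len() * SLOT_SIZE;

    // Contiguous gap between the slot directory and the record region.
    let contiguous = record_region_start.saturating_sub(dir_end);
    if contiguous >= needed {
        return InsertPlan::Fits;
    }

    // Bytes still occupied by live records; everything else is reclaimable.
    let live: usize = slots
        .iter()
        .filter(|s| s.0 != TOMBSTONE)
        .map(|s| s.1 as usize)
        .sum();
    let after_compaction = PAGE_SIZE.saturating_sub(dir_end).saturating_sub(live);

    if after_compaction >= needed {
        InsertPlan::CompactFirst
    } else {
        InsertPlan::PageFull
    }
}
```

Only the `CompactFirst` case pays for compaction, so a page that is full stays cheap to reject.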
Page Compaction: Defragmenting Live Records
When deletes have fragmented the record region, compaction slides all live records together and reclaims the gaps.
impl SlottedPage {
/// Compact the page: slide all live records to the end,
/// eliminating gaps from deleted records.
fn compact(&mut self) {
let slot_count = self.slot_count();
// Collect live records: (slot_id, data_copy)
let mut live_records: Vec<(u16, Vec<u8>)> = Vec::new();
for i in 0..slot_count {
let (offset, length) = self.read_slot(i);
if offset != TOMBSTONE {
let start = offset as usize;
let end = start + length as usize;
live_records.push((i, self.data[start..end].to_vec()));
}
}
// Rewrite records contiguously from the end of the page
let mut cursor = PAGE_SIZE as u16;
for (slot_id, record) in &live_records {
cursor -= record.len() as u16;
let start = cursor as usize;
let end = start + record.len();
self.data[start..end].copy_from_slice(record);
self.write_slot(*slot_id, cursor, record.len() as u16);
}
self.set_record_region_start(cursor);
}
}
This implementation copies live records into a temporary Vec and writes them back. A more memory-efficient approach would sort records by offset and slide them in-place, but the copy approach is correct, simple, and fast enough for 4KB pages. The total data moved is at most 4,079 bytes — negligible compared to the cost of a single disk I/O.
After compaction, the page's free space is contiguous. An insert that failed before compaction (due to fragmentation) will succeed after it — assuming the total free space is sufficient.
Key Takeaways
- Slotted pages decouple record identity (slot ID) from physical position (byte offset). Records can be moved within the page without invalidating external references.
- The (page_id, slot_id) Record ID is the stable address used by B-tree leaf nodes, index entries, and MVCC version chains. Every higher layer depends on this abstraction.
- Deletions create internal fragmentation. Compaction reclaims fragmented space by sliding live records together, an in-page operation that touches no other pages.
- Tombstoning (not removing) deleted slot entries preserves slot ID stability. A removed slot would shift all subsequent IDs, breaking every external reference.
- The "free space" calculation must account for both record bytes and slot directory growth. An insert that appears to fit by record size alone may fail because there's no room for the new slot entry.
Project — TLE Record Page Manager
Module: Database Internals — M01: Storage Engine Fundamentals
Track: Orbital Object Registry
Estimated effort: 4–6 hours
SDA Incident Report — OOR-2026-0042
Classification: ENGINEERING DIRECTIVE
Subject: Prototype page manager for the Orbital Object Registry
Ref: OOR-2026-0041 (TLE index latency deficiency)
The first deliverable in the OOR storage engine build is a page manager capable of reading and writing TLE records to a custom binary page format. This component sits at the bottom of the storage stack — every subsequent module builds on it. The page manager must demonstrate correct page layout, buffer pool caching, slotted page record management, and integrity verification via checksums.
- Objective
- TLE Record Format
- Acceptance Criteria
- Starter Structure
- Hints
- Reference Implementation
- What Comes Next
Objective
Build a PageManager that:
- Manages a database file composed of fixed-size 4KB pages
- Implements a buffer pool with LRU or CLOCK eviction
- Uses slotted pages for variable-length TLE record storage
- Verifies page integrity with CRC32 checksums on every read
- Supports insert, lookup by Record ID (page_id, slot_id), delete, and page compaction
TLE Record Format
For this project, a TLE record is a byte blob with the following structure:
/// A Two-Line Element record for a tracked orbital object.
struct TleRecord {
/// NORAD catalog number (unique object ID, e.g., 25544 for ISS)
norad_id: u32,
/// International designator (e.g., "98067A")
intl_designator: [u8; 8],
/// Epoch year + fractional day (e.g., 24045.5 = Feb 14 2024, 12:00 UTC)
epoch: f64,
/// Mean motion (revolutions per day)
mean_motion: f64,
/// Eccentricity (dimensionless, 0–1)
eccentricity: f64,
/// Inclination (degrees)
inclination: f64,
/// Right ascension of ascending node (degrees)
raan: f64,
/// Argument of perigee (degrees)
arg_perigee: f64,
/// Mean anomaly (degrees)
mean_anomaly: f64,
/// Drag term (B* coefficient)
bstar: f64,
/// Element set number (for provenance tracking)
element_set: u16,
/// Revolution number at epoch
rev_number: u32,
}
Serialized size: 4 + 8 + (8 × 8) + 2 + 4 = 82 bytes. Use little-endian encoding for all fields. You may add a 2-byte record-length prefix if your slotted page implementation requires it.
Acceptance Criteria
- Page I/O correctness. Pages are written to and read from a file at the correct offsets. A page written at page_id * 4096 is read back identically.
- Checksum verification. Every read_page call computes a CRC32 over the page body and compares it to the stored checksum. A tampered page (any bit flipped in the body) is detected and returns an error.
- Buffer pool hit rate. Insert 200 TLE records across multiple pages, then read them back in the same order. The buffer pool (configured with 8 frames) should achieve a hit rate above 90% on the read pass. Print the hit/miss counts.
- Slotted page insert and lookup. Insert 40 records into a single page. Look up each by its (page_id, slot_id) and verify the data matches.
- Delete and compaction. Delete every other record (slots 0, 2, 4, ...). Verify that lookups to deleted slots return None. Compact the page and verify that all remaining records are still accessible by their original slot IDs.
- Page full handling. Insert records until a page reports full. Verify that the failure is detected before corrupting any data. Allocate a new page and continue inserting.
- Deterministic output. The program runs without external dependencies beyond std and crc32fast. Output includes the buffer pool hit/miss stats and a summary of records inserted/read/deleted.
Starter Structure
tle-page-manager/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs the acceptance criteria
│ ├── page.rs # PageHeader, SlottedPage, checksums
│ ├── buffer_pool.rs # BufferPool, Frame, eviction policy
│ ├── page_file.rs # PageFile: raw I/O to the database file
│ └── tle.rs # TleRecord serialization/deserialization
Hints
Hint 1 — Serializing TLE records
Use to_le_bytes() for each field and concatenate them into a Vec<u8>. For deserialization, slice the byte buffer at the known offsets and use from_le_bytes(). Do not use serde or bincode — the point of this project is to understand raw binary layout.
impl TleRecord {
fn serialize(&self) -> Vec<u8> {
let mut buf = Vec::with_capacity(82);
buf.extend_from_slice(&self.norad_id.to_le_bytes());
buf.extend_from_slice(&self.intl_designator);
buf.extend_from_slice(&self.epoch.to_le_bytes());
// ... remaining fields
buf
}
}
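The reverse direction can be sketched with two small helpers that read a fixed-width little-endian field at a known offset. The helper names (read_u32_le, read_f64_le, decode_prefix) and the offset constants are illustrative assumptions, not part of the starter code; the offsets follow from the field order above (norad_id at byte 0, intl_designator at byte 4, epoch at byte 12).

```rust
use std::convert::TryInto;

// Byte offsets of the first three fields, derived from the field order in
// TleRecord. Hypothetical constant names, for illustration only.
const NORAD_ID_OFFSET: usize = 0;
const INTL_DESIGNATOR_OFFSET: usize = 4;
const EPOCH_OFFSET: usize = 12;

/// Read a little-endian u32 at `offset`, or None if the buffer is too short.
fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let bytes: [u8; 4] = buf.get(offset..offset + 4)?.try_into().ok()?;
    Some(u32::from_le_bytes(bytes))
}

/// Read a little-endian f64 at `offset`, or None if the buffer is too short.
fn read_f64_le(buf: &[u8], offset: usize) -> Option<f64> {
    let bytes: [u8; 8] = buf.get(offset..offset + 8)?.try_into().ok()?;
    Some(f64::from_le_bytes(bytes))
}

/// Decode only the leading fields (norad_id, intl_designator, epoch).
/// The remaining fields follow the same pattern at later offsets.
fn decode_prefix(buf: &[u8]) -> Option<(u32, [u8; 8], f64)> {
    let norad_id = read_u32_le(buf, NORAD_ID_OFFSET)?;
    let designator: [u8; 8] = buf
        .get(INTL_DESIGNATOR_OFFSET..INTL_DESIGNATOR_OFFSET + 8)?
        .try_into()
        .ok()?;
    let epoch = read_f64_le(buf, EPOCH_OFFSET)?;
    Some((norad_id, designator, epoch))
}
```

Returning Option instead of panicking on a short buffer means a truncated record surfaces as a decode failure rather than an out-of-bounds slice.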
Hint 2 — Buffer pool sizing
With 200 TLE records at 82 bytes each and ~49 records per page (4,079 usable bytes / 82 bytes ≈ 49, minus slot overhead), you need approximately 5 pages. An 8-frame buffer pool can hold the entire working set — but only if pages aren't evicted prematurely. Make sure your LRU implementation correctly promotes re-accessed pages.
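To sanity-check the sizing argument before writing the real buffer pool, the access pattern can be replayed against a toy page-level LRU model. This is a sketch under stated assumptions: LruSim and read_pass_stats are hypothetical names, the model tracks page IDs only (no frames, pins, or I/O), and the records-per-page figure is passed in by the caller.

```rust
use std::collections::VecDeque;

/// Minimal page-level LRU model. Front of the queue = least recently used.
struct LruSim {
    frames: VecDeque<u32>,
    capacity: usize,
    hits: u64,
    misses: u64,
}

impl LruSim {
    fn new(capacity: usize) -> Self {
        LruSim { frames: VecDeque::new(), capacity, hits: 0, misses: 0 }
    }

    fn access(&mut self, page_id: u32) {
        if let Some(pos) = self.frames.iter().position(|&p| p == page_id) {
            // Hit: remove the page so it can be re-added as most recently used.
            let _ = self.frames.remove(pos);
            self.hits += 1;
        } else {
            self.misses += 1;
            if self.frames.len() == self.capacity {
                self.frames.pop_front(); // evict the least recently used page
            }
        }
        self.frames.push_back(page_id);
    }
}

/// Replay an insert pass followed by a read pass over the same records.
/// Returns (hits, misses) for the read pass only.
fn read_pass_stats(records: u32, records_per_page: u32, frames: usize) -> (u64, u64) {
    let mut pool = LruSim::new(frames);
    for i in 0..records {
        pool.access(i / records_per_page);
    }
    pool.hits = 0;
    pool.misses = 0;
    for i in 0..records {
        pool.access(i / records_per_page);
    }
    (pool.hits, pool.misses)
}
```

With 200 records, an assumed 47 records per page, and 8 frames, all 5 pages stay resident and the read pass has no misses. Shrinking the pool to 4 frames makes each page miss once on the read pass, which is a quick way to confirm that the eviction order is what you expect.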
Hint 3 — Compaction correctness check
After compacting, iterate all slot IDs and verify:
- Live records return the same data as before compaction
- Tombstoned slots still return None
- The page's total free space increased (fragmentation reclaimed)
- The page's contiguous free space equals total free space (no more gaps)
Hint 4 — Checksum verification testing
To test checksum detection, write a valid page to disk, then flip a single bit in the page body using raw file I/O. Read the page back through the buffer pool and verify that it returns a checksum error, not corrupted data.
// Flip bit 0 of byte 20 in page 1
let offset = 1 * PAGE_SIZE + 20;
file.seek(SeekFrom::Start(offset as u64))?;
let mut byte = [0u8; 1];
file.read_exact(&mut byte)?;
byte[0] ^= 0x01; // flip lowest bit
file.seek(SeekFrom::Start(offset as u64))?;
file.write_all(&byte)?;
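The same check can be exercised entirely in memory before any file I/O exists. The sketch below makes two assumptions that may not match your Module 1 page header: the checksum is stored in the first 4 bytes of the page, and it covers every byte after that. The hand-rolled crc32 function (the standard reflected CRC-32 with polynomial 0xEDB88320) is only a stand-in to keep the example dependency-free; the project itself should call crc32fast.

```rust
const PAGE_SIZE: usize = 4096;
const CHECKSUM_LEN: usize = 4; // assumed: checksum occupies the first 4 bytes

/// Bitwise CRC-32 (reflected, polynomial 0xEDB88320). Slow but dependency-free.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            if crc & 1 != 0 {
                crc = (crc >> 1) ^ 0xEDB8_8320;
            } else {
                crc >>= 1;
            }
        }
    }
    !crc
}

/// Compute the checksum over the page body and store it in the header slot.
fn seal_page(page: &mut [u8; PAGE_SIZE]) {
    let sum = crc32(&page[CHECKSUM_LEN..]);
    page[..CHECKSUM_LEN].copy_from_slice(&sum.to_le_bytes());
}

/// Recompute the checksum and compare it with the stored value.
fn verify_page(page: &[u8; PAGE_SIZE]) -> Result<(), String> {
    let stored = u32::from_le_bytes([page[0], page[1], page[2], page[3]]);
    let actual = crc32(&page[CHECKSUM_LEN..]);
    if stored == actual {
        Ok(())
    } else {
        Err(format!(
            "checksum mismatch: stored {:#010x}, computed {:#010x}",
            stored, actual
        ))
    }
}
```

Sealing a page, flipping one bit in the body, and calling verify_page should produce the mismatch error; that is the in-memory equivalent of the file-level test above.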
Reference Implementation
The reference implementation is intentionally omitted for this project. The three lessons provide all the code building blocks — your job is to integrate them into a working system. If you get stuck:
- Start with page_file.rs: get raw page I/O working first
- Add page.rs: implement PageHeader and SlottedPage from Lessons 1 and 3
- Add buffer_pool.rs: wrap the page file with caching from Lesson 2
- Add tle.rs: serialization is straightforward byte manipulation
- Wire them together in main.rs: run each acceptance criterion sequentially
What Comes Next
The page manager you build here is used directly by Module 2. B-tree nodes are stored as slotted pages in the buffer pool. The (page_id, slot_id) Record ID becomes the leaf-node pointer format in the B+ tree index.
Module 02 — B-Tree Index Structures
Track: Database Internals — Orbital Object Registry
Position: Module 2 of 6
Source material: Database Internals — Alex Petrov, Chapters 2, 4–6; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0043
Classification: PERFORMANCE DEFICIENCY
Subject: NORAD catalog ID lookups require full page scans
The page manager from Module 1 stores TLE records but provides no way to locate a specific record without scanning every page. A conjunction query for NORAD ID 25544 (ISS) currently reads all data pages sequentially, which is O(N) in the number of pages. With 100,000 tracked objects across ~2,000 data pages, a single point lookup takes 2–5ms. During a pass window with 500 conjunction checks per second, this saturates the I/O subsystem.
Directive: Build a B+ tree index over NORAD catalog IDs. Point lookups must be O(log N) in the number of records. Range scans over contiguous NORAD ID ranges must traverse only the relevant leaf pages.
The B-tree is the most widely deployed index structure in database systems. PostgreSQL, MySQL/InnoDB, SQLite, and most file systems use B-tree variants for ordered key lookups. This module covers the structure, invariants, and maintenance operations (splits and merges) that keep the tree balanced under insert and delete workloads.
Learning Outcomes
After completing this module, you will be able to:
- Describe the B-tree invariants (minimum fill factor, sorted keys, balanced height) and explain why they guarantee O(log N) lookups
- Implement node split and merge operations that maintain B-tree balance on insert and delete
- Distinguish between B-trees and B+ trees, and explain why B+ trees are preferred for range scans and disk-based storage
- Implement a B+ tree leaf-level linked list for efficient range scans over NORAD ID ranges
- Analyze the I/O cost of B-tree operations in terms of tree height and page size
- Integrate a B+ tree index with the page manager from Module 1
Lesson Summary
Lesson 1 — B-Tree Structure: Keys, Pointers, and Invariants
The B-tree data structure: internal nodes, leaf nodes, the fill factor invariant, and the guarantee of O(log N) height. How keys and child pointers are arranged within a node, and how the tree is traversed for point lookups.
Key question: What is the maximum height of a B-tree indexing 100,000 NORAD IDs with a branching factor of 200?
Lesson 2 — Node Splits and Merges
Maintaining B-tree balance under writes. How inserts cause node splits (bottom-up), how deletes cause node merges or redistributions, and how these operations propagate up the tree. The difference between eager and lazy merge strategies.
Key question: Can a single insert into a B-tree with height H cause more than H page writes?
Lesson 3 — B+ Trees and Range Scans
The B+ tree variant: all data in leaf nodes, internal nodes hold only separator keys, and leaf nodes are linked for sequential access. Why this layout is optimal for disk-based storage engines that need both point lookups and range scans.
Key question: Why do B+ trees outperform B-trees for range scans even when both have the same height?
Capstone Project — B+ Tree TLE Index Engine
Build a B+ tree index over NORAD catalog IDs that supports point lookups, range scans, inserts, and deletes. The index is backed by the page manager from Module 1 — each B+ tree node is a slotted page in the buffer pool. Full project brief in project-btree-index.md.
File Index
module-02-btree-index-structures/
├── README.md ← this file
├── lesson-01-btree-structure.md ← B-tree structure and invariants
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-splits-merges.md ← Node splits and merges
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-bplus-trees.md ← B+ trees and range scans
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-btree-index.md ← Capstone project brief
Prerequisites
- Module 1 (Storage Engine Fundamentals) completed
- Understanding of slotted pages and the buffer pool
What Comes Next
Module 3 (LSM Trees & Compaction) introduces a fundamentally different index structure optimized for write-heavy workloads. The B+ tree you build here is read-optimized — every insert modifies pages in-place, which is expensive for high write throughput. The LSM tree takes the opposite approach: batch writes in memory and flush them to immutable files. Understanding both structures and their tradeoffs is essential for choosing the right approach for the OOR's workload.
Lesson 1 — B-Tree Structure: Keys, Pointers, and Invariants
Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 2 and 4
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific notation for B-tree order vs. branching factor, and his framing of the fill factor invariant.
Context
A heap file of slotted pages provides O(1) access by Record ID (page_id, slot_id), but O(N) access by key: finding NORAD ID 25544 requires scanning every data page. For the Orbital Object Registry, this is the difference between a 0.05ms indexed lookup and a 5ms full scan. At 500 conjunction checks per second, indexed lookups consume 25ms of I/O per second. Full scans would need 2,500ms of I/O for every second of wall-clock time, so a scan-based system can never keep up with the query rate.
The B-tree is a balanced, sorted, multi-way tree optimized for disk-based storage. Each node occupies one page, and the tree's branching factor (number of children per node) is determined by how many keys fit in a page. A B-tree with a branching factor of 200 and 100,000 records has a height of 3 — any record can be found in at most 3 page reads. Compare this to a binary search tree, which would have height ~17 and require 17 page reads.
B-trees were invented in 1970 by Bayer and McCreight specifically for disk-based access patterns. Every modern relational database and most file systems use B-tree variants as their primary index structure.
Core Concepts
Tree Structure
A B-tree of order m has the following properties:
- Every internal node has at most m children and at most m - 1 keys.
- Every internal node (except the root) has at least ⌈m/2⌉ children.
- The root has at least 2 children (unless it is a leaf).
- All leaves are at the same depth.
- Keys within each node are sorted in ascending order.
The keys in an internal node serve as separators — they direct the search to the correct child. For a node with keys [K₁, K₂, K₃] and children [C₀, C₁, C₂, C₃]:
- All keys in subtree C₀ are < K₁
- All keys in subtree C₁ are ≥ K₁ and < K₂
- All keys in subtree C₂ are ≥ K₂ and < K₃
- All keys in subtree C₃ are ≥ K₃
                    [30 | 60]
                  /     |     \
          [10|20]    [40|50]    [70|80|90]
          /  |  \    /  |  \    /  |  |  \
        ... ... ... ... ... ... ... ... ... ...
Branching Factor and Tree Height
The branching factor determines how wide the tree is — and therefore how shallow. For the Orbital Object Registry:
- Page size: 4KB
- Key size: 4 bytes (NORAD ID as u32)
- Child pointer size: 4 bytes (page ID as u32)
- Node header overhead: ~20 bytes
Usable space per node: 4096 - 20 = 4076 bytes. Each key-pointer pair: 4 + 4 = 8 bytes. Maximum keys per node: 4076 / 8 ≈ 509. So the branching factor is approximately 510.
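Tree height for N records with branching factor B: h = ⌈log_B(N)⌉ levels, counting from root to leaf inclusive and assuming a leaf holds about as many entries as an internal node has children. Equivalently, h is the smallest value for which B^h ≥ N.
| Records | Height (B=510) | Page Reads per Lookup |
|---|---|---|
| 100,000 | 2 | 2 |
| 1,000,000 | 3 | 3 |
| 100,000,000 | 3 | 3 |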
With branching factor 510, the entire 100,000-record OOR catalog is reachable in 2 page reads (root + leaf). The root node is almost always cached in the buffer pool, so in practice most lookups require only 1 disk read (the leaf node).
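The height figures can be checked with integer arithmetic instead of floating-point logarithms. The function below is a sketch, with levels_needed as a hypothetical name: it counts the smallest number of levels h such that branching^h covers the record count, and assumes (as a simplification) that a leaf holds about as many entries as an internal node has children.

```rust
/// Smallest number of levels h such that branching^h >= records.
/// Levels are counted from root to leaf, inclusive.
fn levels_needed(records: u64, branching: u64) -> u32 {
    assert!(branching >= 2, "a B-tree node needs at least two children");
    let mut levels = 1;
    // Number of records reachable with `levels` levels.
    let mut capacity = branching;
    while capacity < records {
        capacity = capacity.saturating_mul(branching);
        levels += 1;
    }
    levels
}
```

levels_needed(100_000, 510) gives 2 and levels_needed(100_000, 200) gives 3, matching the figures quoted in this lesson; a binary tree (branching factor 2) needs 17 levels for the same catalog.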
Point Lookup Algorithm
To find key K:
- Start at the root node.
- Binary search the node's keys to find the correct child pointer.
- Follow the child pointer to the next level.
- Repeat until a leaf node is reached.
- Binary search the leaf node for K.
Each level requires one page read and one binary search. Binary search within a node is O(log m) comparisons — negligible compared to the page read cost.
Node Layout on Disk
Each B-tree node is stored as a page in the page file. The node layout within a page:
┌────────────────────────────────────────────┐
│ Node Header │
│ - node_type: u8 (Internal=0, Leaf=1) │
│ - key_count: u16 │
│ - parent_page_id: u32 (for split propagation) │
├────────────────────────────────────────────┤
│ Key-Pointer Pairs (for internal nodes): │
│ [child_0] [key_0] [child_1] [key_1] ... │
│ │
│ Key-Value Pairs (for leaf nodes): │
│ [key_0] [rid_0] [key_1] [rid_1] ... │
│ (RID = page_id + slot_id of data record) │
└────────────────────────────────────────────┘
Internal nodes store (child_page_id, key) pairs. Leaf nodes store (key, record_id) pairs where the record ID points to the actual TLE record in a data page (from Module 1).
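As a concrete illustration of the leaf layout, the sketch below encodes and decodes a leaf node into a 4096-byte page. Several details are assumptions rather than the course's reference format: the function names (encode_leaf, decode_leaf), little-endian encoding for all fields, a 7-byte node header packed exactly as listed in the diagram, and no Module 1 page header or checksum in front of it.

```rust
use std::convert::TryInto;

const PAGE_SIZE: usize = 4096;
const HEADER_LEN: usize = 7; // node_type (1) + key_count (2) + parent_page_id (4)
const LEAF_ENTRY_LEN: usize = 10; // key (4) + rid.page_id (4) + rid.slot_id (2)
const NODE_TYPE_LEAF: u8 = 1;

#[derive(Debug, Clone, Copy, PartialEq)]
struct RecordId {
    page_id: u32,
    slot_id: u16,
}

/// Encode a leaf node into one page. Returns None if the entries do not fit.
fn encode_leaf(parent_page_id: u32, entries: &[(u32, RecordId)]) -> Option<Vec<u8>> {
    if HEADER_LEN + entries.len() * LEAF_ENTRY_LEN > PAGE_SIZE {
        return None;
    }
    let mut page = vec![0u8; PAGE_SIZE];
    page[0] = NODE_TYPE_LEAF;
    page[1..3].copy_from_slice(&(entries.len() as u16).to_le_bytes());
    page[3..7].copy_from_slice(&parent_page_id.to_le_bytes());
    for (i, (key, rid)) in entries.iter().enumerate() {
        let at = HEADER_LEN + i * LEAF_ENTRY_LEN;
        page[at..at + 4].copy_from_slice(&key.to_le_bytes());
        page[at + 4..at + 8].copy_from_slice(&rid.page_id.to_le_bytes());
        page[at + 8..at + 10].copy_from_slice(&rid.slot_id.to_le_bytes());
    }
    Some(page)
}

/// Decode a leaf page back into (parent_page_id, entries).
/// Returns None for a wrong-sized page, a non-leaf page, or a bad key count.
fn decode_leaf(page: &[u8]) -> Option<(u32, Vec<(u32, RecordId)>)> {
    if page.len() != PAGE_SIZE || page[0] != NODE_TYPE_LEAF {
        return None;
    }
    let count = u16::from_le_bytes(page[1..3].try_into().ok()?) as usize;
    let parent = u32::from_le_bytes(page[3..7].try_into().ok()?);
    if HEADER_LEN + count * LEAF_ENTRY_LEN > PAGE_SIZE {
        return None;
    }
    let mut entries = Vec::with_capacity(count);
    for i in 0..count {
        let at = HEADER_LEN + i * LEAF_ENTRY_LEN;
        let key = u32::from_le_bytes(page[at..at + 4].try_into().ok()?);
        let page_id = u32::from_le_bytes(page[at + 4..at + 8].try_into().ok()?);
        let slot_id = u16::from_le_bytes(page[at + 8..at + 10].try_into().ok()?);
        entries.push((key, RecordId { page_id, slot_id }));
    }
    Some((parent, entries))
}
```

Under these assumptions a leaf holds at most (4096 - 7) / 10 = 408 entries, fewer than an internal node, because each leaf entry carries a 6-byte Record ID instead of a 4-byte child pointer.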
The Fill Factor Invariant
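The minimum fill requirement (at least ⌈m/2⌉ children per internal node) is what keeps the tree shallow. Splits and merges keep every leaf at the same depth; the fill requirement bounds that depth, because a tree whose internal nodes each have at least ⌈m/2⌉ children needs at most about log_⌈m/2⌉(N) levels. Without it, deletions could leave chains of nearly empty nodes, and the height would no longer be guaranteed to be O(log N).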
The fill factor also ensures space efficiency — every node is at least half full, so the tree uses at most 2x the minimum space needed. In practice, B-trees maintain an average fill factor of ~67% (between the 50% minimum and 100% maximum), and bulk-loaded trees can achieve >90%.
Code Examples
B-Tree Node Representation
This defines the on-disk layout for B-tree nodes in the Orbital Object Registry, where keys are NORAD catalog IDs (u32) and values are Record IDs.
/// Record ID: a pointer to a TLE record in a data page.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RecordId {
page_id: u32,
slot_id: u16,
}
/// A B-tree node stored in a single page.
#[derive(Debug)]
enum BTreeNode {
Internal(InternalNode),
Leaf(LeafNode),
}
#[derive(Debug)]
struct InternalNode {
page_id: u32,
/// Separator keys. `keys[i]` is the boundary between children[i] and children[i+1].
keys: Vec<u32>,
/// Child page IDs. `children.len() == keys.len() + 1`.
children: Vec<u32>,
}
#[derive(Debug)]
struct LeafNode {
page_id: u32,
/// Sorted key-value pairs. Keys are NORAD IDs, values are Record IDs.
keys: Vec<u32>,
values: Vec<RecordId>,
}
impl InternalNode {
/// Find the child page that could contain the given key.
fn find_child(&self, key: u32) -> u32 {
// Binary search for the first separator key > search key.
// The child to the left of that separator is the correct subtree.
let pos = self.keys.partition_point(|&k| k <= key);
self.children[pos]
}
}
impl LeafNode {
/// Point lookup: find the Record ID for a given NORAD ID.
fn find(&self, key: u32) -> Option<RecordId> {
match self.keys.binary_search(&key) {
Ok(idx) => Some(self.values[idx]),
Err(_) => None,
}
}
}
The partition_point method is a good fit for internal node search: it returns the index of the first separator greater than the search key, which is exactly the index of the child that owns the key's range. binary_search is more awkward here. It returns Ok(i) on an exact match and Err(i) otherwise, and the two cases map to different children (i + 1 and i), so both arms need handling. If duplicate separators were ever allowed, it would also return an unspecified one of the matches.
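A small standalone comparison makes the child-selection rule concrete. Both function names below are hypothetical; the first mirrors find_child but returns the child index instead of the page ID, and the second shows what the same rule looks like when expressed through binary_search. For unique separators, as NORAD IDs are, the two agree once both result arms are handled.

```rust
/// Index of the child to follow for `key`, given the node's separator keys.
/// Returns the number of separators that are <= key.
fn child_index(separators: &[u32], key: u32) -> usize {
    separators.partition_point(|&k| k <= key)
}

/// The same rule via binary_search. Both arms must be handled explicitly.
fn child_index_via_binary_search(separators: &[u32], key: u32) -> usize {
    match separators.binary_search(&key) {
        // Key equals separator i: it lives in the child to the separator's right.
        Ok(i) => i + 1,
        // Key would be inserted before separator i: follow child i.
        Err(i) => i,
    }
}
```

For separators [30, 60], keys 10, 30, 45, 60, and 99 map to children 0, 1, 1, 2, and 2 respectively. The partition_point form reaches that answer without any case analysis.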
Traversal: Root-to-Leaf Lookup
A complete point lookup traverses from the root to a leaf, reading one page per level.
/// Look up a NORAD ID in the B-tree. Returns the Record ID if found.
fn btree_lookup(
root_page_id: u32,
key: u32,
buffer_pool: &mut BufferPool,
) -> io::Result<Option<RecordId>> {
let mut current_page_id = root_page_id;
loop {
let frame_idx = buffer_pool.fetch_page(current_page_id)?;
let page_data = buffer_pool.frame_data(frame_idx);
let node = deserialize_node(page_data)?;
// Unpin immediately — we've extracted the data we need.
// In a real implementation, we'd hold the pin during the
// search for concurrency safety (see Module 5).
buffer_pool.unpin(frame_idx, false);
match node {
BTreeNode::Internal(internal) => {
current_page_id = internal.find_child(key);
// Continue traversal — follow the child pointer
}
BTreeNode::Leaf(leaf) => {
return Ok(leaf.find(key));
}
}
}
}
This reads at most h pages where h is the tree height. For the OOR (100k records, branching factor ~510), h = 2. The root page is almost always cached in the buffer pool (it's accessed by every lookup), so the typical cost is 1 disk read — just the leaf page.
The comment about unpinning immediately is important: in a concurrent engine (Module 5), you'd hold the pin while searching to prevent the page from being evicted mid-traversal. For single-threaded Module 2, immediate unpin is safe and keeps the buffer pool available.
Key Takeaways
- A B-tree with branching factor B and N records has height O(log_B(N)). With B ≈ 500 (common for 4KB pages with small keys), a tree indexing 100 million records is only 3 levels deep.
- The fill factor invariant (nodes at least half full) guarantees balanced height and prevents degenerate trees. Splits and merges (Lesson 2) maintain this invariant.
- Internal nodes contain separator keys and child pointers. Leaf nodes contain the actual key-to-record-ID mapping. The search algorithm binary-searches within each node and follows pointers down the tree.
- The branching factor is determined by page size and key/pointer sizes. Larger pages or smaller keys mean a wider tree and fewer I/O operations per lookup.
- Root and upper-level internal nodes are almost always cached in the buffer pool, so the practical I/O cost of a lookup is usually just 1 page read (the leaf).
Lesson 2 — Node Splits and Merges
Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapters 4–5
Source note: This lesson was synthesized from training knowledge. Verify Petrov's specific split/merge algorithm variants and his treatment of lazy vs. eager rebalancing against Chapters 4–5.
Context
A B-tree that only grows (inserts, never deletes) would eventually have every node at 100% capacity. The next insert into a full node would fail — unless the tree can restructure itself. Node splitting is the mechanism: when a node overflows, it divides into two half-full nodes and promotes a separator key to the parent. This maintains the B-tree invariant (all leaves at the same depth) and keeps every node between half and completely full.
The reverse operation — node merging — handles deletions. When a node drops below the minimum fill factor (half full), it either borrows keys from a sibling or merges with a sibling. Without merging, a delete-heavy workload could leave the tree full of nearly-empty nodes, wasting space and degrading scan performance.
For the OOR, inserts happen when new objects are cataloged or existing TLEs are updated (≈1,000/day for routine operations, burst to 10,000+ during fragmentation events). Deletes happen when objects re-enter the atmosphere or are reclassified. The split/merge machinery ensures the index stays balanced through both workload patterns.
Core Concepts
Leaf Node Split
When an insert arrives at a full leaf node:
- Allocate a new leaf page from the page manager.
- Move the upper half of the keys (and their record IDs) to the new node.
- The median key becomes the separator — it is promoted to the parent internal node.
- The parent inserts the separator key with a pointer to the new node.
Before split (leaf full, max 4 keys):
Parent: [...| 50 |...]
        |
Leaf: [10, 20, 30, 40] ← inserting 25
After split:
Parent: [...| 25 | 50 |...]
        |    |
Left: [10, 20] Right: [25, 30, 40]
The choice of median matters: promoting the middle key keeps both new nodes as close to half-full as possible, maximizing the number of inserts before the next split. In a leaf split the promoted key is a copy of the first key of the new right node, so it also stays in the leaf. Splitting at a point other than the middle is equally valid but leaves the two nodes unevenly filled.
Internal Node Split
If the parent is also full when receiving the promoted separator, the parent itself must split. This propagation continues upward until either a non-full ancestor is found or the root splits. A root split is the only operation that increases the tree's height:
- Split the root into two children.
- Create a new root with one separator key pointing to the two children.
- Tree height increases by 1.
Root splits are rare. For a B-tree with branching factor 510, the root doesn't split until it already holds 509 keys (510 children), so a height-2 tree covers roughly 130,000 records with every leaf half full, and up to about 260,000 with every leaf full, before it needs height 3.
The Cost of Splits
A single insert can trigger a cascade of splits from leaf to root. In the worst case (every ancestor is full), inserting one key causes h splits — one per level. Each split writes 2 pages (the original node and the new node) plus modifies the parent, so the worst-case write amplification is 2h + 1 page writes for one insert.
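In practice, cascading splits are rare. The average cost of an insert is only slightly more than 1 page write: one for the leaf update, plus an amortized split cost. A leaf splits roughly once per B/2 inserts and adds about 2 extra page writes (the new sibling and the parent), so with B ≈ 510 the split overhead is on the order of 0.01 page writes per insert.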
Deletion and Underflow
When a key is deleted from a leaf:
- Remove the key and its record ID from the leaf.
- If the leaf is still at least half full, done.
- If the leaf is below half full (underflow), rebalance.
Rebalancing options, tried in order:
- Redistribute from a sibling: If an adjacent sibling has more than the minimum number of keys, transfer one key from the sibling through the parent. This keeps both nodes at valid fill levels.
- Merge with a sibling: If both the underflowing node and its sibling are at minimum, merge them into one node and remove the separator from the parent.
Merge reduces the parent's key count by one. If the parent then underflows, the same process propagates upward. A merge at the root level (when the root has only one child) reduces the tree height by 1.
Redistribution vs. Merge
Redistribution (sibling has spare keys):
Parent: [...| 25 |...]              Parent: [...| 30 |...]
        |    |               →              |    |
Left: [10] Right: [25,30,40]        Left: [10,25] Right: [30,40]
Merge (both at minimum):
Parent: [...| 30 |...]              Parent: [...|...]
        |    |               →              |
Left: [10] Right: [30]              Merged: [10,30]
Redistribution is preferred because it doesn't change the tree structure — no nodes are created or destroyed, no parent keys are removed. Merge is the fallback when redistribution isn't possible.
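The decision between the two options can be sketched on keys-only leaves. Everything here is a simplification for illustration: the Rebalance enum and rebalance_leaves are hypothetical names, Record IDs are omitted, and only the case where the underflowing node is the left one of the pair is handled (borrowing from a left sibling is symmetric).

```rust
#[derive(Debug, PartialEq)]
enum Rebalance {
    /// One key moved from the right sibling into the left node.
    /// The parent's separator between the two must become `new_separator`.
    Redistributed { new_separator: u32 },
    /// The right sibling was emptied into the left node. The parent must
    /// remove the separator and the pointer to the right sibling.
    Merged,
}

/// Repair an underflowing `left` leaf using its `right` sibling.
/// `min_keys` is the minimum number of keys a leaf may hold.
fn rebalance_leaves(left: &mut Vec<u32>, right: &mut Vec<u32>, min_keys: usize) -> Rebalance {
    assert!(min_keys >= 1, "minimum fill must be at least one key");
    if right.len() > min_keys {
        // The sibling has a spare key: borrow its smallest one.
        let moved = right.remove(0);
        left.push(moved);
        // The separator must equal the right node's new smallest key.
        Rebalance::Redistributed { new_separator: right[0] }
    } else {
        // The sibling is at minimum too: fold it into the left node.
        left.append(right);
        Rebalance::Merged
    }
}
```

With min_keys = 2, a left leaf [5] next to a right leaf [12, 18, 21] redistributes to [5, 12] and [18, 21] with 18 as the new separator. If the right leaf were [12, 18] instead, the two merge into [5, 12, 18] and the parent loses one key.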
Lazy vs. Eager Rebalancing
Not all implementations rebalance immediately on underflow. Lazy rebalancing tolerates slightly-underfull nodes, deferring merges until a periodic compaction pass or until the node becomes completely empty. This reduces write amplification at the cost of slightly lower space efficiency and slightly higher scan costs (more nodes to traverse).
PostgreSQL's B-tree implementation, for example, does not merge on every delete — it marks deleted entries as "dead" and reclaims space during VACUUM. This is partly because merge operations require exclusive locks on multiple nodes, which would block concurrent readers.
For the OOR, lazy rebalancing is the pragmatic choice: the delete rate (~100/day for atmospheric re-entries) is low enough that occasional underfull nodes have negligible impact on scan performance.
Code Examples
Leaf Node Split
Splitting a full leaf node during insert, promoting the median key to the parent.
impl LeafNode {
    /// Split this leaf and return (median_key, new_right_leaf).
    /// After split, `self` retains the lower half of the keys.
    ///
    /// Assumes `LeafNode` has gained a `next_leaf: Option<u32>` field;
    /// the Lesson 1 definition has none, and Lesson 3 adds the sibling links.
    fn split(&mut self, new_page_id: u32) -> (u32, LeafNode) {
        let mid = self.keys.len() / 2;
        // The median key is promoted to the parent as a separator
        let median_key = self.keys[mid];
        // Right half moves to the new node
        let right_keys = self.keys.split_off(mid);
        let right_values = self.values.split_off(mid);
        let right = LeafNode {
            page_id: new_page_id,
            keys: right_keys,
            values: right_values,
            // In a B+ tree, link the leaves (see Lesson 3)
            next_leaf: self.next_leaf,
        };
        // Update the left node's forward pointer to the new right sibling
        self.next_leaf = Some(new_page_id);
        (median_key, right)
    }
}
split_off(mid) is the right tool here: it moves the elements from index mid to the end into a newly allocated Vec and truncates the original, which costs one copy of the moved tail (O(n − mid)). The left node retains elements [0..mid) and the right node gets [mid..]. The median key is promoted to the parent but also remains in the right leaf: in a B+ tree, leaf nodes hold all keys, and internal nodes hold copies as separators.
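The split behavior is easy to verify in isolation on a keys-only leaf. The sketch below is not the course's insert_into_leaf helper; the Leaf struct and insert_and_maybe_split are hypothetical stand-ins that drop Record IDs, page IDs, and sibling links, and silently ignore a key that is already present.

```rust
/// Keys-only stand-in for a leaf node (values omitted to keep the sketch short).
struct Leaf {
    keys: Vec<u32>,
}

/// Insert `key` in sorted position. If the leaf then holds more than
/// `max_keys` keys, split it and return (separator, right_leaf).
fn insert_and_maybe_split(leaf: &mut Leaf, key: u32, max_keys: usize) -> Option<(u32, Leaf)> {
    // First position whose key is >= the new key.
    let pos = leaf.keys.partition_point(|&k| k < key);
    if leaf.keys.get(pos) == Some(&key) {
        return None; // already present; a real index would update the value
    }
    leaf.keys.insert(pos, key);
    if leaf.keys.len() <= max_keys {
        return None; // no overflow, nothing to promote
    }
    let mid = leaf.keys.len() / 2;
    // The tail [mid..] moves into a new Vec; `leaf` keeps [0..mid).
    let right_keys = leaf.keys.split_off(mid);
    // The separator is a copy of the right leaf's first key.
    let separator = right_keys[0];
    Some((separator, Leaf { keys: right_keys }))
}
```

Inserting 25 into a full leaf [10, 20, 30, 40] with max_keys = 4 leaves [10, 20] on the left, produces [25, 30, 40] on the right, and returns 25 as the separator for the parent.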
Inserting into a B-Tree with Split Propagation
A top-level insert that handles splits propagating up the tree.
/// Insert a key-value pair into the B+ tree.
/// If the root splits, creates a new root and increases tree height.
fn btree_insert(
root_page_id: &mut u32,
key: u32,
rid: RecordId,
buffer_pool: &mut BufferPool,
) -> io::Result<()> {
// Traverse to the leaf, collecting the path of ancestors
let (leaf_page_id, ancestors) = find_leaf_with_path(
*root_page_id, key, buffer_pool
)?;
// Attempt to insert into the leaf
let overflow = insert_into_leaf(leaf_page_id, key, rid, buffer_pool)?;
if let Some((promoted_key, new_page_id)) = overflow {
// Leaf split occurred — propagate up
propagate_split(
root_page_id, promoted_key, new_page_id,
&ancestors, buffer_pool
)?;
}
Ok(())
}
/// Propagate a split upward through the ancestors.
fn propagate_split(
root_page_id: &mut u32,
mut promoted_key: u32,
mut new_child_page_id: u32,
ancestors: &[u32], // page IDs from root to parent-of-leaf
buffer_pool: &mut BufferPool,
) -> io::Result<()> {
// Walk ancestors from bottom (parent of leaf) to top (root)
for &ancestor_page_id in ancestors.iter().rev() {
let overflow = insert_into_internal(
ancestor_page_id, promoted_key, new_child_page_id,
buffer_pool,
)?;
match overflow {
None => return Ok(()), // Ancestor had room — done
Some((key, page_id)) => {
promoted_key = key;
new_child_page_id = page_id;
// Continue propagating
}
}
}
// If we reach here, the root itself split.
// Create a new root pointing to the old root and the new child.
let new_root_page = buffer_pool.allocate_page()?;
let new_root = InternalNode {
page_id: new_root_page,
keys: vec![promoted_key],
children: vec![*root_page_id, new_child_page_id],
};
serialize_and_write_node(&new_root, buffer_pool)?;
*root_page_id = new_root_page;
Ok(())
}
The ancestor path is collected during the initial traversal. This avoids re-traversing the tree during split propagation, which would be both slower and incorrect under concurrent modifications (a problem we'll address in Module 5).
The root_page_id is passed as &mut u32 because a root split changes it — the old root becomes a child of the new root. In a production engine, the root page ID is stored in the file header page and updated atomically with WAL protection.
Key Takeaways
- Node splits maintain the B-tree invariant by dividing overfull nodes and promoting a separator key to the parent. Splits propagate upward; root splits are the only operation that increases tree height.
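- The amortized cost of an insert is only slightly more than 1 page write, because a leaf splits roughly once per B/2 inserts. The worst case (full cascade) is 2h+1 writes and is rarer still, since every ancestor on the path must be full at the same time.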
- Deletions trigger rebalancing when a node drops below half full. Redistribution (borrowing from a sibling) is preferred; merging is the fallback. Merges propagate upward like splits.
- Lazy rebalancing (deferring merges) reduces write amplification and lock contention at the cost of slightly underfull nodes. Most production B-tree implementations use some form of lazy deletion.
- The ancestor path must be collected during traversal for split propagation. Re-traversing the tree after a split is both slower and unsafe under concurrent access.
Lesson 3 — B+ Trees and Range Scans
Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapters 5–6; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. Verify Petrov's treatment of B+ tree leaf linking, his comparison of B-tree vs. B+ tree I/O cost for range scans, and his coverage of prefix compression in Chapter 6.
Context
A classic B-tree stores key-value pairs in both internal and leaf nodes. (The node structs in Lessons 1 and 2 already keep Record IDs only in the leaves, anticipating this lesson.) The classic layout is correct and complete for point lookups, but it has a significant limitation for range scans: the data is distributed across all levels of the tree. Scanning NORAD IDs 40000–40500 requires traversing internal nodes to find the start, then potentially bouncing between internal and leaf nodes to collect all matching records.
The B+ tree variant solves this by separating concerns: internal nodes contain only separator keys and child pointers (they are a pure navigation structure), and all data records live in leaf nodes. Leaf nodes are linked into a list in key order (forward links always, backward links optionally), so a range scan only needs one tree traversal (to find the starting leaf) followed by a sequential walk along the leaf chain.
This is why virtually every production relational database uses B+ trees, not plain B-trees, as their primary index structure. The leaf-level linked list turns range scans from O(K log N) (re-traversing from the root for each key) into O(log N + K), where K is the number of matching keys. That is a massive improvement for the OOR's conjunction query workload, which frequently scans orbital parameter ranges.
Core Concepts
B+ Tree vs. B-Tree
| Property | B-Tree | B+ Tree |
|---|---|---|
| Data location | All nodes (internal + leaf) | Leaf nodes only |
| Internal node contents | Keys + values + child pointers | Keys + child pointers only |
| Range scan support | Must re-traverse or backtrack | Sequential leaf walk |
| Branching factor | Lower (values consume space in internal nodes) | Higher (internal nodes hold more keys) |
| Point lookup I/O | Potentially fewer reads (data can be in internal nodes) | Always traverses to leaf |
| Disk space for keys | Each key stored once | Separator keys duplicated in internal nodes |
For the OOR, the B+ tree's higher branching factor (more keys per internal node) and efficient range scans outweigh the slight overhead of always traversing to leaf nodes. The conjunction query workload is dominated by range scans over orbital parameter ranges, not single-key lookups.
Leaf-Level Linked List
B+ tree leaf nodes are connected in a linked list sorted by key order. Each leaf stores a pointer to its right sibling (and optionally its left sibling for reverse scans):
Root:            [30 | 60]
               /     |      \
              /      |       \
        [10,20] → [30,40,50] → [60,70,80,90]
        (leaf 1)   (leaf 2)     (leaf 3)
A range scan for keys 25–55:
- Tree traversal: the start key 25 is less than the first separator (30), so the descent goes root → leaf 1. Leaf 1 holds no key ≥ 25.
- Sequential scan: follow the next pointer to leaf 2 (keys 30, 40, 50 all match), then to leaf 3 (key 60 > 55, stop).
- Total I/O: 2 page reads (root + leaf 1) + 2 page reads (leaves 2 and 3) = 4 pages. The root is cached, so practical I/O is 3 page reads.
Without the linked list, the engine would need to return to the root for each key, or implement a complex in-order traversal using a stack of parent pointers.
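The leaf walk itself is short once the starting leaf is known. The sketch below models the leaf chain in memory: Leaf and range_scan are hypothetical names, indices into a slice stand in for page IDs, and there is no buffer pool. The starting leaf would come from the normal root-to-leaf descent for the start key.

```rust
/// In-memory stand-in for a B+ tree leaf. `next` is the index of the
/// right sibling in the `leaves` slice (a page ID in the real engine).
struct Leaf {
    keys: Vec<u32>,
    next: Option<usize>,
}

/// Collect all keys in [start, end], beginning at `start_leaf` and
/// following `next` links until a key beyond `end` is seen.
fn range_scan(leaves: &[Leaf], start_leaf: usize, start: u32, end: u32) -> Vec<u32> {
    let mut out = Vec::new();
    let mut current = Some(start_leaf);
    while let Some(idx) = current {
        let leaf = &leaves[idx];
        // Skip keys below the range; this only matters in the first leaf.
        let from = leaf.keys.partition_point(|&k| k < start);
        for &k in &leaf.keys[from..] {
            if k > end {
                return out; // keys are sorted, so nothing later can match
            }
            out.push(k);
        }
        current = leaf.next;
    }
    out
}
```

For the three leaves in the diagram, scanning [25, 55] returns 30, 40, and 50 and stops as soon as it sees 60 in the third leaf.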
Separator Keys in Internal Nodes
In a B+ tree, internal node keys are separators: they do not need to be exact copies of leaf keys. They only need to correctly direct traffic. If the left child's maximum key is 25 and the right child's minimum key is 30, any value in the range (25, 30] works as a separator. Some implementations use abbreviated separators (the shortest key that correctly divides the two children) to fit more entries per internal node.
This means a delete from a leaf does not necessarily require updating the parent's separator. If you delete key 30 from a leaf, the separator in the parent can remain 30 — it still correctly directs traffic because all keys in the right child are ≥ 30 (the next key might be 31). The separator only needs updating if the split boundary itself changes.
Prefix and Suffix Compression
Leaf nodes in a B+ tree often contain keys with significant commonality. For example, NORAD IDs in the range 40000–40499, written as decimal strings, share the prefix "40". Prefix compression stores the common prefix once and encodes each key as the remainder after that prefix. Suffix truncation removes key suffixes that are unnecessary for correct comparison within the node.
For the OOR's u32 NORAD IDs, these optimizations provide modest savings (4-byte keys have limited prefix sharing). They become critical for composite keys or string keys — a B+ tree indexing international designators like "2024-001A" through "2024-999Z" would benefit substantially.
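A minimal sketch of prefix compression over byte-string keys, using international designators as the example. The function names are hypothetical, and the output here is a plain (prefix, suffixes) pair rather than any particular on-page encoding.

```rust
/// Length of the longest byte prefix shared by every key.
fn common_prefix_len(keys: &[&[u8]]) -> usize {
    let first = match keys.first() {
        Some(k) => *k,
        None => return 0,
    };
    let mut len = first.len();
    for key in &keys[1..] {
        // Count leading bytes that match the first key.
        let shared = first
            .iter()
            .zip(key.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(shared);
    }
    len
}

/// Split keys into the shared prefix (stored once) and per-key suffixes.
fn compress(keys: &[&[u8]]) -> (Vec<u8>, Vec<Vec<u8>>) {
    let n = common_prefix_len(keys);
    let prefix = keys.first().map(|k| k[..n].to_vec()).unwrap_or_default();
    let suffixes = keys.iter().map(|k| k[n..].to_vec()).collect();
    (prefix, suffixes)
}
```

For the keys "2024-001A", "2024-001B", and "2024-017C", the shared prefix is "2024-0" and the stored suffixes are "01A", "01B", and "17C": 6 bytes stored once instead of three times.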
Bulk Loading
Building a B+ tree by inserting records one at a time produces nodes at ~50-67% average fill. Bulk loading builds the tree bottom-up from sorted data:
- Sort all records by key.
- Pack records into leaf nodes at near-100% fill.
- Build internal nodes bottom-up by taking separator keys from each pair of adjacent leaves.
- Repeat upward until the root is created.
Once the records are sorted, bulk loading is O(N) in the number of records and produces an optimally packed tree. For the OOR's initial catalog load (100,000 records), bulk loading fills ~200 leaf pages at 100% fill. One-by-one inserts would produce ~300-400 pages at 50-67% fill.
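The bottom-up construction can be sketched in memory as follows. This is an illustration under simplifying assumptions: nodes are plain vectors rather than pages, `leaf_cap` and `fanout` are hypothetical parameters, and a trailing group with a single child becomes an internal node with no separators, which a real loader would rebalance.

```rust
/// Result of a bulk load: packed leaves plus internal levels, bottom-up.
/// Each internal node is represented only by its separator keys.
struct BulkLoaded {
    leaves: Vec<Vec<u32>>,
    levels: Vec<Vec<Vec<u32>>>,
}

fn bulk_load(sorted: &[u32], leaf_cap: usize, fanout: usize) -> BulkLoaded {
    assert!(leaf_cap > 0 && fanout > 1);
    // Pack keys into leaves at 100% fill (only the last leaf may be partial).
    let leaves: Vec<Vec<u32>> = sorted.chunks(leaf_cap).map(|c| c.to_vec()).collect();
    // Smallest key beneath each node on the level currently being grouped.
    let mut mins: Vec<u32> = leaves.iter().map(|leaf| leaf[0]).collect();
    let mut levels = Vec::new();
    while mins.len() > 1 {
        let mut nodes = Vec::new();
        let mut next_mins = Vec::new();
        for group in mins.chunks(fanout) {
            next_mins.push(group[0]);
            // N children need N - 1 separators: the minimum key of every
            // child except the first.
            nodes.push(group[1..].to_vec());
        }
        levels.push(nodes);
        mins = next_mins;
    }
    BulkLoaded { leaves, levels }
}
```

Every key is copied into a leaf once and each level above is a fraction of the size of the one below, so the total work stays linear in the number of records.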
Code Examples
B+ Tree Leaf Node with Sibling Links
Extending the leaf node from Lesson 1 with linked-list pointers for range scan support.
/// B+ tree leaf node with sibling links for range scans.
/// `Clone` lets a scan own a snapshot of its starting leaf.
#[derive(Debug, Clone)]
struct BPlusLeafNode {
    page_id: u32,
    keys: Vec<u32>,
    values: Vec<RecordId>,
    /// Forward pointer to the next leaf (right sibling)
    next_leaf: Option<u32>,
    /// Backward pointer for reverse scans (optional)
    prev_leaf: Option<u32>,
}

impl BPlusLeafNode {
    /// Range scan: iterate all keys in [start, end] starting from this leaf.
    /// Returns a lazy iterator that follows the leaf chain.
    fn range_scan_from<'a>(
        &self,
        start: u32,
        end: u32,
        buffer_pool: &'a mut BufferPool,
    ) -> BPlusRangeScanIterator<'a> {
        // Find the first key >= start in this leaf
        let start_idx = self.keys.partition_point(|&k| k < start);
        // Start at that position; the iterator follows `next_leaf` from there
        BPlusRangeScanIterator {
            current_leaf: Some(self.clone()),
            idx: start_idx,
            end_key: end,
            buffer_pool,
            done: false,
        }
    }
}
The range scan iterator is lazy — it reads the next leaf page only when the current leaf's keys are exhausted. This avoids reading leaf pages beyond what the caller actually consumes (important if the caller stops early after finding a match).
Range Scan Iterator
The iterator follows the leaf chain, reading one page at a time.
/// Iterator over a range of B+ tree leaf entries.
/// Follows the leaf-level linked list until the end key is exceeded.
struct BPlusRangeScanIterator<'a> {
current_leaf: Option<BPlusLeafNode>,
idx: usize,
end_key: u32,
buffer_pool: &'a mut BufferPool,
done: bool,
}
impl<'a> BPlusRangeScanIterator<'a> {
fn next_entry(&mut self) -> Option<io::Result<(u32, RecordId)>> {
loop {
if self.done {
return None;
}
let leaf = self.current_leaf.as_ref()?;
// Check if there are more entries in the current leaf
if self.idx < leaf.keys.len() {
let key = leaf.keys[self.idx];
if key > self.end_key {
self.done = true;
return None;
}
let rid = leaf.values[self.idx];
self.idx += 1;
return Some(Ok((key, rid)));
}
// Current leaf exhausted — follow the chain
match leaf.next_leaf {
None => {
self.done = true;
return None;
}
Some(next_page_id) => {
match self.load_leaf(next_page_id) {
Ok(next_leaf) => {
self.current_leaf = Some(next_leaf);
self.idx = 0;
// Loop back to yield from the new leaf
}
Err(e) => {
self.done = true;
return Some(Err(e));
}
}
}
}
}
}
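/// Load the leaf stored on `page_id`. The frame is unpinned even when
/// deserialization fails, so an error cannot leak a pinned page.
fn load_leaf(&mut self, page_id: u32) -> io::Result<BPlusLeafNode> {
let frame_idx = self.buffer_pool.fetch_page(page_id)?;
let result = deserialize_leaf_node(self.buffer_pool.frame_data(frame_idx));
self.buffer_pool.unpin(frame_idx, false);
Ok(result?)
}
}
/// Expose the scan through the standard `Iterator` trait so callers can use
/// `for` loops and iterator adapters.
impl<'a> Iterator for BPlusRangeScanIterator<'a> {
type Item = io::Result<(u32, RecordId)>;
fn next(&mut self) -> Option<Self::Item> {
self.next_entry()
}
}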
This pattern, a struct that holds iteration state and lazily loads pages, is the volcano iterator model that we'll formalize in Module 6 (Query Processing). B+ tree range scans in most database engines follow the same shape: traverse to the starting leaf, then pull one record at a time from the leaf chain, fetching the next page only when the current one is exhausted.
Key Takeaways
- B+ trees store all data in leaf nodes and use internal nodes purely for navigation. This maximizes the internal node branching factor and enables efficient range scans via the leaf-level linked list.
- Range scans are O(log N + K): one tree traversal to the starting leaf, then a sequential walk along the leaf chain over the K matching entries. This is the primary advantage over plain B-trees and hash indices.
- Separator keys in internal nodes don't need to be exact copies of leaf keys — any value that correctly directs traffic works. This enables prefix compression and abbreviated separators.
- Bulk loading produces a B+ tree at near-100% fill factor in O(N) time once the input is sorted, compared to O(N log N) for one-by-one inserts at ~50-67% fill. Always bulk-load when building an index from scratch.
- The leaf-level scan iterator is the first instance of the volcano iterator pattern — a pull-based interface that lazily fetches pages on demand. This pattern recurs throughout the query processing stack.
Project — B+ Tree TLE Index Engine
Module: Database Internals — M02: B-Tree Index Structures
Track: Orbital Object Registry
Estimated effort: 6–8 hours
- SDA Incident Report — OOR-2026-0043
- Objective
- Acceptance Criteria
- Starter Structure
- Hints
- What Comes Next
SDA Incident Report — OOR-2026-0043
Classification: ENGINEERING DIRECTIVE
Subject: Build ordered index for NORAD catalog ID lookups and range scans
The page manager from Module 1 stores TLE records but requires full scans for key-based access. Build a B+ tree index over NORAD catalog IDs that provides O(log N) point lookups and efficient range scans via a leaf-level linked list. The index must integrate with the existing page manager and buffer pool.
Objective
Build a B+ tree index that:
- Uses the page manager and buffer pool from Module 1 — each B+ tree node is stored as a page
- Supports point lookups by NORAD catalog ID in O(log N) page reads
- Supports range scans over NORAD ID ranges using the leaf-level linked list
- Handles inserts with automatic node splitting and split propagation
- Handles deletes with tombstoning (lazy rebalancing is acceptable)
- Provides a bulk-load operation for initial catalog construction
Acceptance Criteria
- Point lookup correctness. Insert 10,000 TLE records with random NORAD IDs. Look up each by ID and verify the returned Record ID matches.
- Range scan correctness. Insert NORAD IDs 1–10,000. Scan range [5000, 5100] and verify exactly 101 records returned in sorted order.
- Split handling. Insert records until at least 3 leaf splits occur. Verify the tree remains balanced (all leaves at same depth) and all records retrievable.
- Bulk-load efficiency. Bulk-load 100,000 sorted records. Verify leaf fill factor above 95%. Compare leaf count to one-by-one insertion.
- Delete correctness. Delete 1,000 records by NORAD ID. Verify lookups return None for deleted keys and remaining records are unaffected.
- Integration with buffer pool. Run full test suite with only 16 buffer pool frames. Verify correctness under eviction pressure.
- Deterministic output. Print tree height, leaf count, fill factor, and buffer pool hit/miss stats after each test phase.
Starter Structure
btree-index/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs acceptance criteria
│ ├── btree.rs # BPlusTree: insert, lookup, range_scan, delete, bulk_load
│ ├── node.rs # InternalNode, LeafNode: serialization, split, merge
│ ├── page.rs # Reuse from Module 1
│ ├── buffer_pool.rs # Reuse from Module 1
│ ├── page_file.rs # Reuse from Module 1
│ └── tle.rs # Reuse from Module 1
Hints
Hint 1 — Node serialization format
Use a 1-byte discriminant at the start of each node page to distinguish internal from leaf. Internal: [type=0][key_count][child_0][key_0][child_1].... Leaf: [type=1][key_count][next_leaf][prev_leaf][key_0][rid_0]....
Hint 2 — Ancestor path for split propagation
During root-to-leaf traversal, push each internal node's page ID onto a Vec<u32>. After a leaf split, pop ancestors one at a time to insert the promoted key. If the ancestor splits too, continue popping. Empty stack = create new root.
Hint 3 — Bulk-load algorithm
1. Sort all records by NORAD ID.
2. Pack keys into leaf nodes at capacity, write each, record its page ID and last key.
3. Build internal nodes bottom-up: group separators into internal node pages.
4. Repeat step 3 until one root remains.
5. Link leaves into a doubly-linked list.
Hint 4 — Buffer pool pressure during splits
Split propagation can pin the split node, the new node, and the parent simultaneously (3 frames). With a 16-frame pool and a 2-level tree, this is safe. But unpin nodes as soon as you've serialized them back — don't hold all three longer than necessary.
What Comes Next
Module 3 introduces LSM trees — a fundamentally different approach. Where B+ trees update pages in-place, LSM trees batch writes in memory and flush immutable files. You'll understand when each is appropriate for the OOR workload.
Module 03 — LSM Trees & Compaction
Track: Database Internals — Orbital Object Registry
Position: Module 3 of 6
Source material: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Course — Alex Chi Z
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0044
Classification: PERFORMANCE DEFICIENCY
Subject: B+ tree write throughput insufficient for fragmentation event ingestion
During the Cosmos-2251 debris cascade simulation, the OOR must ingest 12,000 new TLE records in under 60 seconds. The B+ tree index from Module 2 achieves ~200 inserts/second: each insert requires a root-to-leaf traversal and a potential node split, generating 2-4 random page writes. At this rate, ingesting 12,000 records takes 60 seconds (12,000 ÷ 200), consuming the entire conjunction window.
Directive: Evaluate and implement an LSM-tree-based storage architecture. LSM trees batch writes in memory and flush them as immutable sorted files, converting random writes to sequential writes. The tradeoff: reads become more expensive (must check multiple files), but write throughput increases by 10–100x.
The LSM tree (Log-Structured Merge Tree) is the dominant architecture for write-heavy storage engines. RocksDB, LevelDB, Cassandra, HBase, CockroachDB, and TiKV all use LSM-tree variants. Where the B+ tree is optimized for read-heavy workloads with moderate writes, the LSM tree is optimized for write-heavy workloads where reads can tolerate checking multiple sorted structures.
This module covers the full LSM architecture: memtables, sorted string tables (SSTs), the write and read paths, compaction strategies, and read optimizations (bloom filters, block cache). It draws on the mini-lsm course structure and the LSM coverage in Database Internals Chapter 7.
Learning Outcomes
After completing this module, you will be able to:
- Describe the LSM write path (memtable → immutable memtable → SST flush) and explain why it converts random writes to sequential I/O
- Implement a memtable backed by a sorted data structure (e.g., BTreeMap) and flush it to an immutable SST file
- Design an SST file format with data blocks, index blocks, and metadata blocks
- Explain the three amplification factors (read, write, space) and how compaction strategies trade between them
- Compare leveled, tiered, and FIFO compaction strategies and choose the appropriate strategy for a given workload
- Implement bloom filters and a block cache to reduce read amplification in an LSM engine
Lesson Summary
Lesson 1 — Memtables and Sorted String Tables
The LSM write path: how writes are batched in an in-memory memtable, frozen into immutable memtables, and flushed to sorted string table (SST) files on disk. The SST file format: data blocks, index blocks, bloom filter blocks, and footer. How the read path probes the memtable first, then SSTs from newest to oldest.
Key question: Why does the LSM tree maintain both a mutable and one or more immutable memtables instead of writing directly from the mutable memtable to disk?
Lesson 2 — Compaction Strategies
The core problem: as SSTs accumulate, reads slow down (more files to check) and space grows (deleted keys aren't reclaimed until compacted). Compaction merges SSTs to reduce read amplification and reclaim space, at the cost of write amplification. Leveled compaction, tiered (universal) compaction, and FIFO compaction — each trades differently between read, write, and space amplification.
Key question: Can you design a compaction strategy that minimizes all three amplification factors simultaneously?
Lesson 3 — Bloom Filters, Block Cache, and Read Optimization
LSM reads are expensive — they must check the memtable plus potentially every SST level. Bloom filters let the engine skip SSTs that definitely don't contain the target key. The block cache keeps hot SST data blocks in memory. Together, they reduce the effective read amplification from O(levels) to near O(1) for point lookups.
Key question: A bloom filter with a 1% false positive rate eliminates 99% of unnecessary SST reads. What is the cost of tightening the false positive rate to 0.1%?
Capstone Project — LSM-Backed TLE Storage Engine
Build an LSM storage engine for the Orbital Object Registry that supports put, get, delete, and scan operations. The engine must implement memtable→SST flush, a basic leveled compaction strategy, and bloom filters for point lookup optimization. Full project brief in project-lsm-engine.md.
File Index
module-03-lsm-trees-compaction/
├── README.md ← this file
├── lesson-01-memtables-ssts.md ← Memtables and sorted string tables
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-compaction.md ← Compaction strategies
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-read-optimization.md ← Bloom filters, block cache, read path
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-lsm-engine.md ← Capstone project brief
Prerequisites
- Module 1 (Storage Engine Fundamentals) — page I/O concepts
- Module 2 (B-Tree Index Structures) — understanding of B+ tree tradeoffs (to compare against)
What Comes Next
Module 4 (Write-Ahead Logging & Recovery) adds durability to the LSM engine. Currently, a crash loses all data in the memtable (which is in-memory only). The WAL ensures that every write is persisted before being acknowledged, and the recovery process replays the WAL to rebuild the memtable after a crash.
Lesson 1 — Memtables and Sorted String Tables
Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Week 1
Source note: This lesson was synthesized from training knowledge and the Mini-LSM course structure. Verify Petrov's specific SSTable format description and Kleppmann's LSM compaction cost analysis against the source texts.
Context
The B+ tree from Module 2 provides O(log N) reads but suffers under write-heavy workloads: every insert modifies a leaf page in place, and node splits amplify writes further. For the Orbital Object Registry's burst ingestion scenario (8,000 new objects in 45 minutes), the B+ tree's write amplification of 10–20x is the bottleneck.
The Log-Structured Merge Tree (LSM) eliminates in-place updates entirely. All writes go to an in-memory sorted structure called a memtable. When the memtable reaches a size threshold, it is frozen (becomes immutable) and flushed to disk as a Sorted String Table (SSTable) — a file of sorted key-value pairs that is never modified after creation. Reads probe the memtable first, then search SSTables from newest to oldest.
This architecture makes writes trivially fast: inserting a key-value pair is a single in-memory operation (an insert into a skip list or B-tree in RAM). The cost is shifted to reads, which must check multiple SSTables, and to background compaction, which merges SSTables to keep read amplification bounded. The entire LSM design is a bet that write throughput matters more than read latency for many workloads — and for TLE burst ingestion, it does.
Core Concepts
The Memtable
The memtable is a mutable, in-memory sorted data structure that buffers incoming writes. Common implementations:
- Skip list (used by LevelDB, RocksDB): O(log N) expected insert and lookup, with concurrent reads that need no locks. The standard choice.
- Red-black tree or B-tree in memory: O(log N) operations, but harder to make lock-free.
- Sorted vector: O(N) insert (shift), O(log N) lookup. Only viable for very small memtables.
For the OOR, a BTreeMap<Vec<u8>, Vec<u8>> is the simplest correct implementation. Production engines use skip lists for concurrent access, but the algorithm is the same: maintain sorted order in memory, flush to disk when full.
Key operations:
- Put(key, value): Insert or update a key in the memtable. O(log N).
- Delete(key): Insert a tombstone — a special marker that indicates the key has been deleted. The tombstone must persist through SSTables so that older versions of the key are masked.
- Get(key): Look up a key. Returns the value, or the tombstone if deleted, or None if the key is not in the memtable.
The tombstone design is critical: a delete cannot simply remove the key from the memtable, because older SSTables on disk may still contain the key. Without a tombstone, a read would miss the memtable (key not present), then find the old value in an SSTable and incorrectly return it.
Memtable Freeze and Flush
When the memtable reaches its size threshold (typically 4–64MB), the engine:
- Freezes the current memtable — it becomes immutable (no more writes).
- Creates a new active memtable for incoming writes.
- In the background, flushes the immutable memtable to disk as an SSTable.
The freeze-then-flush pattern keeps SSTable I/O off the write path: the only latency-sensitive operation is the in-memory insert. Multiple immutable memtables can exist simultaneously (queued for flush), but each consumes memory, so the engine must flush faster than new memtables are created or it will eventually have to stall writers.
Write path:
Put(key, value)
│
▼
Active Memtable (mutable, in-memory)
│ (size threshold reached)
▼
Immutable Memtable (frozen, in-memory)
│ (background flush)
▼
SSTable on disk (sorted, immutable)
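The freeze step in the write path above can be sketched as a swap of the active map. This is a single-threaded illustration; `WriteBuffer`, `freeze`, and `next_to_flush` are hypothetical names, and a real engine would keep a frozen memtable readable until its SSTable is fully written rather than handing it over immediately.

```rust
use std::collections::{BTreeMap, VecDeque};
use std::mem;

/// Sorted in-memory table; a `None` value is a tombstone.
type Table = BTreeMap<Vec<u8>, Option<Vec<u8>>>;

struct WriteBuffer {
    active: Table,
    active_bytes: usize,
    /// Frozen memtables waiting for a background flush, oldest first.
    immutable: VecDeque<Table>,
    freeze_threshold: usize,
}

impl WriteBuffer {
    fn new(freeze_threshold: usize) -> Self {
        Self {
            active: Table::new(),
            active_bytes: 0,
            immutable: VecDeque::new(),
            freeze_threshold,
        }
    }

    /// Writes only ever touch the active memtable.
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.active_bytes += key.len() + value.len();
        self.active.insert(key, Some(value));
        if self.active_bytes >= self.freeze_threshold {
            self.freeze();
        }
    }

    /// Swap in an empty memtable; the old one is queued and never written again.
    fn freeze(&mut self) {
        let frozen = mem::take(&mut self.active);
        self.active_bytes = 0;
        self.immutable.push_back(frozen);
    }

    /// What a background flusher would call: take the oldest frozen memtable.
    fn next_to_flush(&mut self) -> Option<Table> {
        self.immutable.pop_front()
    }
}
```

Because `freeze` is just a pointer swap, the writer that triggers it pays no disk cost; the length of the `immutable` queue is the backlog the flusher has to drain.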
SSTable Format
An SSTable is a file containing sorted key-value pairs organized into data blocks. The standard layout:
┌─────────────────────────────────────────────┐
│ Data Block 0: [k0:v0, k1:v1, k2:v2, ...] │
│ Data Block 1: [k3:v3, k4:v4, k5:v5, ...] │
│ ... │
│ Data Block N: [kM:vM, ...] │
├─────────────────────────────────────────────┤
│ Meta Block: Bloom filter (optional) │
├─────────────────────────────────────────────┤
│ Index Block: [block_0_last_key → offset, │
│ block_1_last_key → offset, │
│ ... ] │
├─────────────────────────────────────────────┤
│ Footer: index_offset, meta_offset, magic │
└─────────────────────────────────────────────┘
Data blocks (typically 4KB each) contain sorted key-value pairs. Keys within a block can use prefix compression — store the shared prefix once and encode each key as a delta.
Index block maps the last key of each data block to the block's file offset. A point lookup binary-searches the index block to find which data block might contain the key, then searches within that block.
Meta block stores auxiliary data — most importantly, a bloom filter for fast negative lookups (Lesson 3).
Footer is the last few bytes of the file, containing offsets to the index and meta blocks. The reader starts by reading the footer, then uses its offsets to locate everything else.
SSTables are immutable — once written, they are never modified. Updates and deletes are handled by writing new SSTables that supersede older ones. This immutability is the source of LSM's concurrency simplicity: readers can access any SSTable without locks (the file doesn't change under them), and the only coordination needed is between the flush/compaction writers and the metadata that tracks which SSTables are active.
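As a sketch of the index-block lookup described above: given index entries of (last key in block, block offset), a binary search finds the one data block that could hold the key. The function name `find_block` is illustrative.

```rust
/// Index entries are (last_key_in_block, block_offset), sorted by key.
/// Returns the offset of the only block that can contain `key`, or `None`
/// if `key` is greater than every key in the SSTable.
fn find_block(index: &[(Vec<u8>, u64)], key: &[u8]) -> Option<u64> {
    // First block whose last key is >= the search key.
    let i = index.partition_point(|(last_key, _)| last_key.as_slice() < key);
    index.get(i).map(|(_, offset)| *offset)
}
```

A `Some` result only means the key falls inside that block's range; the reader still has to search the block itself to confirm the key is present.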
The Read Path
To read a key from an LSM engine:
- Check the active memtable. If found, return immediately.
- Check each immutable memtable from newest to oldest.
- Check each SSTable from newest to oldest (L0, then L1, then L2, ...).
- If the key is not found anywhere, it does not exist.
At each level, finding a tombstone means the key was deleted — stop searching and return "not found." This is why tombstones must be ordered correctly: a tombstone in a newer SSTable masks the key's value in all older SSTables.
The worst case is a negative lookup (key doesn't exist): the engine must check every memtable and every SSTable before concluding the key is absent. This is where bloom filters (Lesson 3) provide the biggest win — they let the engine skip entire SSTables in O(1) per filter check.
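The lookup order and tombstone rule can be sketched with every source modeled as an in-memory sorted map. This is an assumption made for brevity: real SSTables would be probed through their index blocks, and the `Source` alias is hypothetical.

```rust
use std::collections::BTreeMap;

/// One source in the read path; a `None` value is a tombstone.
type Source = BTreeMap<Vec<u8>, Option<Vec<u8>>>;

/// `sources` is ordered newest first: active memtable, immutable
/// memtables, then SSTables.
fn get<'a>(sources: &'a [Source], key: &[u8]) -> Option<&'a [u8]> {
    for source in sources {
        match source.get(key) {
            // Newest live value wins.
            Some(Some(value)) => return Some(value.as_slice()),
            // Tombstone: the key was deleted, so do not consult older sources.
            Some(None) => return None,
            // This source knows nothing about the key; keep searching.
            None => continue,
        }
    }
    None
}
```

A key that exists nowhere falls through every iteration of the loop, which is the negative-lookup worst case described above.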
The Merge Iterator
Range scans (and compaction) require merging sorted streams from multiple sources — the memtable and several SSTables. The merge iterator (also called a multi-way merge) takes N sorted iterators and produces a single sorted stream:
- Maintain a min-heap of the current key from each source.
- Pop the smallest key. If multiple sources have the same key, take the one from the newest source (the memtable or the most recent SSTable).
- Advance the source that produced the popped key.
The newest-wins rule is what makes updates and deletes work correctly: a newer value for the same key supersedes the older one, and a newer tombstone masks the older value.
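The three steps above can be sketched with `std::collections::BinaryHeap`. To stay short, this version merges in-memory vectors and collects the result eagerly instead of exposing a lazy iterator; the `Entry` alias and the rule that `sources[0]` is the newest are assumptions of the sketch.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// (key, value); a `None` value is a tombstone.
type Entry = (Vec<u8>, Option<Vec<u8>>);

/// Merge sorted sources into one sorted stream. `sources[0]` is the newest.
fn merge(sources: &[Vec<Entry>]) -> Vec<Entry> {
    // Heap items are (key, source index, position). `Reverse` turns the
    // max-heap into a min-heap; on equal keys the lower source index
    // (the newer source) pops first.
    let mut heap = BinaryHeap::new();
    for (src, entries) in sources.iter().enumerate() {
        if let Some((key, _)) = entries.first() {
            heap.push(Reverse((key.as_slice(), src, 0usize)));
        }
    }
    let mut out: Vec<Entry> = Vec::new();
    while let Some(Reverse((key, src, pos))) = heap.pop() {
        // A key equal to the last emitted one is an older, shadowed version.
        if out.last().map_or(true, |(last, _)| last.as_slice() != key) {
            out.push(sources[src][pos].clone());
        }
        // Advance the source that produced the popped key.
        if let Some((next_key, _)) = sources[src].get(pos + 1) {
            heap.push(Reverse((next_key.as_slice(), src, pos + 1)));
        }
    }
    out
}
```

Tombstones are kept in the output on purpose: whether one can be dropped depends on what older data exists, which is a compaction decision rather than a merge decision.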
Code Examples
A Simple Memtable Backed by BTreeMap
The memtable stores key-value pairs in sorted order. Tombstones are represented as None values.
use std::collections::BTreeMap;
/// Memtable: in-memory sorted store for LSM writes.
/// Keys are byte slices. Values are `Option<Vec<u8>>` where
/// `None` represents a tombstone (deleted key).
struct MemTable {
map: BTreeMap<Vec<u8>, Option<Vec<u8>>>,
size_bytes: usize,
size_limit: usize,
}
impl MemTable {
fn new(size_limit: usize) -> Self {
Self {
map: BTreeMap::new(),
size_bytes: 0,
size_limit,
}
}
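/// Insert or update a key-value pair.
fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
let key_len = key.len();
self.size_bytes += key_len + value.len();
// Overwriting drops the previous entry, so stop counting its bytes
if let Some(old) = self.map.insert(key, Some(value)) {
self.size_bytes -= key_len + old.map_or(0, |v| v.len());
}
}
/// Mark a key as deleted (insert a tombstone).
fn delete(&mut self, key: Vec<u8>) {
let key_len = key.len();
self.size_bytes += key_len;
// None = tombstone; it replaces any value buffered for this key
if let Some(old) = self.map.insert(key, None) {
self.size_bytes -= key_len + old.map_or(0, |v| v.len());
}
}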
/// Look up a key. Returns:
/// - Some(Some(value)) if the key exists
/// - Some(None) if the key is tombstoned (deleted)
/// - None if the key is not in this memtable
fn get(&self, key: &[u8]) -> Option<&Option<Vec<u8>>> {
self.map.get(key)
}
/// True if the memtable has reached its size limit and should be frozen.
fn should_flush(&self) -> bool {
self.size_bytes >= self.size_limit
}
/// Iterate all entries in sorted order for flushing to an SSTable.
fn iter(&self) -> impl Iterator<Item = (&Vec<u8>, &Option<Vec<u8>>)> {
self.map.iter()
}
}
The three-valued return from get is essential: None means "this memtable has no information about this key — keep searching older sources." Some(None) means "this key was deleted — stop searching." Some(Some(value)) means "here's the value." Collapsing the first two cases would make deleted keys reappear from older SSTables.
SSTable Builder: Flushing the Memtable to Disk
When the memtable is frozen, its entries are written to an SSTable file in sorted order.
use std::io::{self, Write, BufWriter};
use std::fs::File;
const BLOCK_SIZE: usize = 4096;
/// Builds an SSTable file from sorted key-value pairs.
struct SsTableBuilder {
    writer: BufWriter<File>,
    /// Index entries: (last_key_in_block, block_offset)
    index: Vec<(Vec<u8>, u64)>,
    current_block: Vec<u8>,
    current_block_offset: u64,
    /// Most recent key added to `current_block`; becomes its index key on flush
    last_key: Vec<u8>,
    entry_count: usize,
}

impl SsTableBuilder {
    fn new(path: &str) -> io::Result<Self> {
        let file = File::create(path)?;
        Ok(Self {
            writer: BufWriter::new(file),
            index: Vec::new(),
            current_block: Vec::new(),
            current_block_offset: 0,
            last_key: Vec::new(),
            entry_count: 0,
        })
    }

    /// Add a key-value pair. Keys must be added in sorted order.
    fn add(&mut self, key: &[u8], value: Option<&[u8]>) -> io::Result<()> {
        // Encode the entry: [key_len: u16][key][is_tombstone: u8][value_len: u16][value]
        let is_tombstone = value.is_none();
        let val = value.unwrap_or(&[]);
        // Lengths are stored as u16, so reject anything a cast would silently truncate
        if key.len() > u16::MAX as usize || val.len() > u16::MAX as usize {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "key or value longer than u16::MAX bytes",
            ));
        }
        self.current_block.extend_from_slice(&(key.len() as u16).to_le_bytes());
        self.current_block.extend_from_slice(key);
        self.current_block.push(if is_tombstone { 1 } else { 0 });
        self.current_block.extend_from_slice(&(val.len() as u16).to_le_bytes());
        self.current_block.extend_from_slice(val);
        self.last_key.clear();
        self.last_key.extend_from_slice(key);
        self.entry_count += 1;
        // If the block is full, flush it
        if self.current_block.len() >= BLOCK_SIZE {
            self.flush_block()?;
        }
        Ok(())
    }

    /// Write the current block and index it under the last key it contains.
    fn flush_block(&mut self) -> io::Result<()> {
        let offset = self.current_block_offset;
        self.writer.write_all(&self.current_block)?;
        self.index.push((self.last_key.clone(), offset));
        self.current_block_offset += self.current_block.len() as u64;
        self.current_block.clear();
        Ok(())
    }

    /// Finalize the SSTable: write the index block and footer.
    fn finish(mut self) -> io::Result<()> {
        // Flush any remaining data; `last_key` is the final key of this partial block
        if !self.current_block.is_empty() {
            self.flush_block()?;
        }
        // Write the index block
        let index_offset = self.current_block_offset;
        for (key, offset) in &self.index {
            self.writer.write_all(&(key.len() as u16).to_le_bytes())?;
            self.writer.write_all(key)?;
            self.writer.write_all(&offset.to_le_bytes())?;
        }
        // Write the footer: index_offset + magic
        self.writer.write_all(&index_offset.to_le_bytes())?;
        self.writer.write_all(b"OORSST01")?; // 8-byte magic
        self.writer.flush()?;
        Ok(())
    }
}
The builder packs entries into blocks of roughly BLOCK_SIZE bytes (a block is closed once it reaches that size, so blocks are not exactly fixed-size) and records the last key and offset of each block in the index. The footer at the end of the file lets the reader locate the index without scanning the entire file. LevelDB and RocksDB's table formats follow the same general layout, with compression and filter blocks added in production.
Notice that add requires keys in sorted order — the caller (the memtable flush code) is responsible for iterating the memtable in order. Violating this invariant produces a corrupt SSTable where binary search returns wrong results.
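For the reading side, here is a hedged sketch of how a reader could recover the index from the layout the builder writes: a 16-byte footer (u64 little-endian index offset followed by the 8-byte magic) and index entries of `[key_len: u16][key][offset: u64]`. It parses a file already loaded into memory; `read_index` and its helpers are illustrative names, not part of the builder above.

```rust
use std::io;

const FOOTER_LEN: usize = 16; // u64 index offset + 8-byte magic
const MAGIC: &[u8] = b"OORSST01";

fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Decode a little-endian u64 from the first 8 bytes of `bytes`.
fn read_u64_le(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Parse the footer, then the index block it points to.
fn read_index(file: &[u8]) -> io::Result<Vec<(Vec<u8>, u64)>> {
    if file.len() < FOOTER_LEN || !file.ends_with(MAGIC) {
        return Err(invalid("missing or corrupt footer"));
    }
    let footer_start = file.len() - FOOTER_LEN;
    let index_offset = read_u64_le(&file[footer_start..]);
    if index_offset > footer_start as u64 {
        return Err(invalid("index offset points past the footer"));
    }
    let mut pos = index_offset as usize;
    let mut index = Vec::new();
    // The index block occupies everything between index_offset and the footer.
    while pos < footer_start {
        if footer_start - pos < 2 {
            return Err(invalid("truncated index entry"));
        }
        let key_len = u16::from_le_bytes([file[pos], file[pos + 1]]) as usize;
        pos += 2;
        if footer_start - pos < key_len + 8 {
            return Err(invalid("truncated index entry"));
        }
        let key = file[pos..pos + key_len].to_vec();
        pos += key_len;
        index.push((key, read_u64_le(&file[pos..])));
        pos += 8;
    }
    Ok(index)
}
```

Starting from the end of the file is the point of the footer: the reader needs one small read to learn where the index lives, regardless of how many data blocks precede it.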
Key Takeaways
- The LSM write path is: active memtable → freeze → immutable memtable → background flush → SSTable on disk. Writes do not wait on SSTable I/O; they complete as soon as the in-memory insert finishes, provided flushes keep up with incoming memtables.
- Deletes are tombstones, not removals. A tombstone must persist through SSTables to mask older versions of the key. Compaction eventually garbage-collects tombstones once no older version exists.
- SSTables are immutable sorted files partitioned into data blocks with an index block for O(log B) block lookup (where B is the number of blocks). Immutability enables lock-free concurrent reads.
- The read path checks memtable first, then SSTables from newest to oldest. Negative lookups (key doesn't exist) are the worst case — they must check every source. Bloom filters (Lesson 3) mitigate this.
- The merge iterator produces a single sorted stream from multiple sources, with newest-wins semantics for duplicate keys. This is the core data structure for both reads and compaction.
Lesson 2 — Compaction Strategies
Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Week 2
Source note: This lesson was synthesized from training knowledge and the Mini-LSM course compaction chapters. Verify the specific amplification formulas against Petrov Chapter 7 and the RocksDB Tuning Guide.
Context
Without compaction, the LSM engine accumulates SSTables indefinitely. Every flush creates a new SSTable in Level 0 (L0). After 100 flushes, there are 100 L0 SSTables — and a point lookup must check all of them. Read amplification grows linearly with the number of SSTables. Space amplification grows too: deleted keys still consume space in older SSTables, and updated keys have multiple versions.
Compaction is the background process that merges SSTables to reduce read and space amplification. It reads multiple SSTables, merge-sorts their entries (applying tombstones and keeping only the newest version of each key), and writes the result as fewer, larger SSTables. The merged input files are then deleted.
The compaction strategy determines which SSTables to merge and when. Different strategies make different tradeoffs between three amplification factors:
- Read amplification (RA): How many SSTables must be checked for a single read. Lower RA = faster reads.
- Write amplification (WA): How many times each byte of user data is written to disk over its lifetime. Lower WA = faster ingestion.
- Space amplification (SA): How much extra disk space is used beyond the logical data size. Lower SA = less storage cost.
No strategy minimizes all three simultaneously — this is the fundamental tradeoff of LSM design. Choosing a strategy means choosing which factor to sacrifice for the workload at hand.
Core Concepts
Level 0 and the Flush Problem
When a memtable is flushed, it becomes an L0 SSTable. L0 is special: its SSTables have overlapping key ranges (each SSTable contains whatever keys were in the memtable at freeze time, which can be any subset of the key space). This means a point lookup at L0 must check every L0 SSTable — there's no way to narrow the search by key range.
All compaction strategies share a common first step: merge L0 SSTables into L1, where SSTables have non-overlapping key ranges. In L1 and below, a point lookup can determine which SSTable(s) to check based on key range alone (binary search on SSTable boundaries), reducing read amplification.
L0: [a-z] [b-y] [c-x] ← overlapping, must check all
L1: [a-f] [g-m] [n-z] ← non-overlapping, check at most 1
L2: [a-c] [d-f] [g-i] ...← non-overlapping, check at most 1
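The difference between the two lookups can be sketched as follows, assuming each SSTable exposes its first and last key as metadata. `SstMeta` and both function names are hypothetical.

```rust
/// Inclusive key range covered by one SSTable (hypothetical metadata).
struct SstMeta {
    first_key: Vec<u8>,
    last_key: Vec<u8>,
}

impl SstMeta {
    fn new(first: &str, last: &str) -> Self {
        Self {
            first_key: first.as_bytes().to_vec(),
            last_key: last.as_bytes().to_vec(),
        }
    }

    fn covers(&self, key: &[u8]) -> bool {
        self.first_key.as_slice() <= key && key <= self.last_key.as_slice()
    }
}

/// L0: ranges overlap, so every file whose range covers the key is a candidate.
fn l0_candidates(files: &[SstMeta], key: &[u8]) -> Vec<usize> {
    (0..files.len()).filter(|&i| files[i].covers(key)).collect()
}

/// L1 and below: files are sorted and disjoint, so a binary search on the
/// file boundaries yields at most one candidate.
fn sorted_level_candidate(files: &[SstMeta], key: &[u8]) -> Option<usize> {
    let i = files.partition_point(|f| f.last_key.as_slice() < key);
    files.get(i).filter(|f| f.covers(key)).map(|_| i)
}
```

With the ranges from the diagram, a lookup for "m" has three L0 candidates but exactly one in L1.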
Leveled Compaction
Leveled compaction (default in RocksDB, used by LevelDB) organizes SSTables into levels with exponentially increasing size targets. Each level's total size is a fixed multiple (the size ratio, typically 10) of the level above:
- L1 target: 256MB
- L2 target: 2.56GB (10 × L1)
- L3 target: 25.6GB (10 × L2)
When a level exceeds its target size, the engine picks one SSTable from that level and merges it with all overlapping SSTables in the next level.
Key property: Within each level (L1+), SSTables have non-overlapping key ranges. This means at most one SSTable per level needs to be checked for a point lookup.
Amplification characteristics:
- Read amplification: O(L) where L is the number of levels. With size ratio 10, a 1TB database has ~4 levels → ~4 SSTable reads per lookup. Excellent.
- Write amplification: High. Each time data moves down a level it is merged with the overlapping data already there, which at size ratio 10 is up to ~10 times larger, so each byte is rewritten ~10 times per level hop. With size ratio 10 and 4 levels, write amplification is approximately 10 × L = 40x in the worst case.
- Space amplification: Low. Each key has at most one copy per level, and compaction removes obsolete versions. Typically 1.1–1.2x.
Leveled compaction is optimal for read-heavy workloads with limited tolerance for space overhead — exactly the OOR's conjunction query workload after the initial bulk ingestion.
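Two pieces of the leveled scheme are easy to sketch: the per-level size target and the selection of overlapping files in the next level. The sketch keys SSTables by inclusive NORAD ID range; `SstRange`, `level_target`, and `overlapping` are hypothetical names, and real engines add file-picking heuristics that are omitted here.

```rust
/// Inclusive NORAD ID range covered by one SSTable (hypothetical metadata).
#[derive(Debug, Clone, Copy, PartialEq)]
struct SstRange {
    first: u32,
    last: u32,
}

/// Target size of level `n` (n >= 1): the L1 target times ratio^(n - 1).
fn level_target(l1_target: u64, ratio: u64, n: u32) -> u64 {
    assert!(n >= 1, "L0 has no size target in this sketch");
    l1_target * ratio.pow(n - 1)
}

/// Indices of the files in the next level whose key ranges overlap `picked`.
/// These are the files that must be rewritten together with `picked`.
fn overlapping(next_level: &[SstRange], picked: SstRange) -> Vec<usize> {
    next_level
        .iter()
        .enumerate()
        .filter(|(_, f)| f.first <= picked.last && picked.first <= f.last)
        .map(|(i, _)| i)
        .collect()
}
```

The number of files `overlapping` returns is what drives write amplification: the wider the picked file's range relative to the files below it, the more data each compaction rewrites.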
Tiered (Universal) Compaction
Tiered compaction (RocksDB's "Universal" mode, used by Cassandra) groups SSTables into tiers (sorted runs) of similar size. When the number of tiers at a size level reaches a threshold, they are merged into a single larger tier.
Key property: Each tier is a sorted run (non-overlapping internally), but different tiers can overlap with each other. A read must check one SSTable per tier.
Amplification characteristics:
- Read amplification: O(T) where T is the number of tiers. Worse than leveled because tiers overlap.
- Write amplification: Low — each compaction merges tiers of similar size, so each byte is rewritten fewer times. Typically 2–5x for well-configured tiering.
- Space amplification: High — during compaction, both the input tiers and the output tier exist simultaneously, requiring up to 2x the logical data size in temporary space. Permanent space amplification is also higher because multiple tiers may hold different versions of the same key.
Tiered compaction is optimal for write-heavy workloads where ingestion throughput matters more than read latency — exactly the OOR's burst ingestion during a fragmentation event.
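The trigger described above can be sketched as a small picker. This is a simplified illustration, not RocksDB's or Cassandra's actual selection logic: `SortedRun`, `size_class`, and the `base_bytes`, `growth`, and `min_runs` parameters are all hypothetical names.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct SortedRun {
    id: u64,
    size_bytes: u64,
}

/// Size class of a run: 0 for runs up to `base_bytes`, 1 for runs up to
/// `base_bytes * growth`, and so on.
fn size_class(size_bytes: u64, base_bytes: u64, growth: u64) -> u32 {
    assert!(base_bytes > 0 && growth > 1, "need base > 0 and growth > 1");
    let mut class = 0;
    let mut limit = base_bytes;
    while size_bytes > limit {
        limit = limit.saturating_mul(growth);
        class += 1;
    }
    class
}

/// Pick the runs to merge: the smallest size class that holds at least
/// `min_runs` runs. Returns None when no class has reached the threshold.
fn pick_tiered_compaction(
    runs: &[SortedRun],
    base_bytes: u64,
    growth: u64,
    min_runs: usize,
) -> Option<Vec<SortedRun>> {
    let mut by_class: BTreeMap<u32, Vec<SortedRun>> = BTreeMap::new();
    for run in runs {
        by_class
            .entry(size_class(run.size_bytes, base_bytes, growth))
            .or_default()
            .push(run.clone());
    }
    // BTreeMap iterates in ascending key order, so small classes come first.
    by_class.into_values().find(|group| group.len() >= min_runs)
}
```

Merging similar-sized runs is what keeps write amplification low: each byte is rewritten roughly once per size class rather than ~10 times per level.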
FIFO Compaction
FIFO compaction simply deletes the oldest SSTable when total storage exceeds a threshold. No merging occurs. This is appropriate for time-series data with a natural expiration window — if the OOR only needs TLE records from the last 7 days, FIFO compaction automatically ages out older data.
Amplification characteristics:
- Read amplification: High — all SSTables persist until aged out.
- Write amplification: 1x — data is written once (the initial flush) and never rewritten.
- Space amplification: Bounded by the retention window.
FIFO is unsuitable for the OOR's catalog (which must retain all active objects indefinitely) but useful for the telemetry stream (which has natural time-based expiration).
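A minimal sketch of the FIFO policy, assuming SSTable ids increase monotonically so that a lower id means an older file. `SstFile` and `fifo_pick_deletions` are illustrative names, not part of the OOR codebase.

```rust
#[derive(Debug, Clone, PartialEq)]
struct SstFile {
    id: u64, // monotonically increasing: lower id = older file
    size_bytes: u64,
}

/// Return the ids of the oldest SSTables to delete so that the total size
/// drops to at most `max_total_bytes`. Nothing is merged or rewritten.
fn fifo_pick_deletions(files: &[SstFile], max_total_bytes: u64) -> Vec<u64> {
    let mut by_age: Vec<&SstFile> = files.iter().collect();
    by_age.sort_by_key(|f| f.id); // oldest first
    let mut total: u64 = files.iter().map(|f| f.size_bytes).sum();
    let mut to_delete = Vec::new();
    for file in by_age {
        if total <= max_total_bytes {
            break;
        }
        total -= file.size_bytes;
        to_delete.push(file.id);
    }
    to_delete
}
```

Because the only operation is a file delete, write amplification stays at 1x; the price is that every surviving SSTable remains a candidate on each read.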
Amplification Tradeoff Summary
| Strategy | Read Amp | Write Amp | Space Amp | Best For |
|---|---|---|---|---|
| Leveled | Low (O(L)) | High (~10L) | Low (~1.1x) | Read-heavy, space-sensitive |
| Tiered | Medium (O(T)) | Low (~2-5x) | High (~2x) | Write-heavy, burst ingestion |
| FIFO | High | Minimal (1x) | Time-bounded | TTL data, append-only streams |
The RocksDB wiki summarizes the tradeoff space clearly: it is generally impossible to minimize all three amplification factors simultaneously. The compaction strategy is a knob that slides between them.
Compaction Mechanics: The Merge Process
Regardless of strategy, the actual compaction operation is the same:
1. Select input SSTables (determined by the strategy).
2. Create a merge iterator over all input SSTables.
3. For each key in sorted order:
   - If multiple versions exist, keep only the newest.
   - If the newest version is a tombstone and there are no older SSTables that might contain the key, drop the tombstone (garbage collection).
   - Otherwise, write the entry to the output SSTable(s).
4. Split output into new SSTables when they reach the target file size.
5. Atomically update the LSM metadata to swap old SSTables for new ones.
6. Delete the old input SSTables.
Step 5 is critical for crash safety — if the engine crashes during compaction, it must not lose data. Either the old SSTables or the new ones should be the active set, never a mix. This is typically handled by writing a manifest (metadata log) that records which SSTables are active, and updating it atomically (via rename or WAL).
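The merge at the heart of this process (keep the newest version, drop tombstones only when nothing older can exist) can be sketched over in-memory runs. The `Entry` alias, the `merge_runs` name, and the newest-first ordering of `runs` are assumptions for illustration; file splitting and the manifest swap are omitted.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One entry in a sorted run; a `None` value is a tombstone.
type Entry = (Vec<u8>, Option<Vec<u8>>);

/// Merge sorted runs into one sorted output. `runs[0]` is the newest run.
/// When `bottommost` is true, no older data can exist below the output, so
/// tombstones are dropped instead of being carried forward.
fn merge_runs(runs: &[Vec<Entry>], bottommost: bool) -> Vec<Entry> {
    // Heap items are (key, run index, position). Reverse turns the max-heap
    // into a min-heap; on equal keys the lower run index (newer) pops first.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some((key, _)) = run.first() {
            heap.push(Reverse((key.clone(), run_idx, 0usize)));
        }
    }
    let mut out: Vec<Entry> = Vec::new();
    let mut last_key: Option<Vec<u8>> = None;
    while let Some(Reverse((key, run_idx, pos))) = heap.pop() {
        // Advance the run this entry came from.
        if let Some((next_key, _)) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next_key.clone(), run_idx, pos + 1)));
        }
        // An older version of a key already handled: superseded, skip it.
        if last_key.as_ref() == Some(&key) {
            continue;
        }
        let value = runs[run_idx][pos].1.clone();
        last_key = Some(key.clone());
        match value {
            None if bottommost => {} // garbage-collect the tombstone
            v => out.push((key, v)),
        }
    }
    out
}
```

Note that a tombstone still shadows older versions of its key even when it is dropped: `last_key` is updated before the tombstone check, so the superseded value never reaches the output.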
Compaction Scheduling
Compaction runs in background threads and must be scheduled carefully:
- Too little compaction: Read amplification grows as uncompacted SSTables accumulate.
- Too much compaction: Write bandwidth is consumed by compaction, starving foreground writes.
- Compaction during peak load: The background I/O from compaction interferes with foreground query latency.
Production engines use rate limiting (RocksDB's rate_limiter) to cap compaction I/O bandwidth, and priority scheduling to defer compaction during high-load periods. The SILK paper (USENIX ATC '19) formalized this as a latency-aware compaction scheduler.
For the OOR, the practical guideline: run compaction aggressively during quiet periods (between pass windows) and throttle during conjunction query bursts.
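One way to express "throttle during bursts, run freely when quiet" is a token bucket whose refill rate the scheduler adjusts. The sketch below is a hypothetical `CompactionThrottle`, not RocksDB's rate_limiter; time is passed in explicitly (in milliseconds) so the behavior is deterministic.

```rust
/// Token-bucket limiter for compaction I/O.
struct CompactionThrottle {
    bytes_per_sec: u64,
    burst_bytes: u64,
    available: u64,
    last_refill_ms: u64,
}

impl CompactionThrottle {
    fn new(bytes_per_sec: u64, burst_bytes: u64) -> Self {
        Self {
            bytes_per_sec,
            burst_bytes,
            available: burst_bytes,
            last_refill_ms: 0,
        }
    }

    /// Lower the rate during conjunction query bursts, raise it between
    /// pass windows.
    fn set_rate(&mut self, bytes_per_sec: u64) {
        self.bytes_per_sec = bytes_per_sec;
    }

    /// Try to reserve `bytes` of compaction I/O at time `now_ms`.
    /// Returns false when the compaction thread should wait and retry.
    fn try_acquire(&mut self, bytes: u64, now_ms: u64) -> bool {
        let elapsed_ms = now_ms.saturating_sub(self.last_refill_ms);
        let refill = self.bytes_per_sec.saturating_mul(elapsed_ms) / 1000;
        // Only advance the refill clock when whole bytes were credited, so
        // low rates polled at short intervals still accumulate tokens.
        if refill > 0 {
            self.available = self.available.saturating_add(refill).min(self.burst_bytes);
            self.last_refill_ms = now_ms;
        }
        if bytes <= self.available {
            self.available -= bytes;
            true
        } else {
            false
        }
    }
}
```

The compaction loop would call `try_acquire` before each block write and sleep on `false`; the load monitor calls `set_rate` as foreground traffic changes.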
Code Examples
A Simple Leveled Compaction Controller
This determines which SSTables to compact and when, based on level size targets.
/// Metadata for an SSTable in the LSM state.
#[derive(Debug, Clone)]
struct SstMeta {
    id: u64,
    level: usize,
    size_bytes: u64,
    min_key: Vec<u8>,
    max_key: Vec<u8>,
}

/// LSM state: tracks all active SSTables by level.
struct LsmState {
    /// L0 SSTables (overlapping key ranges, newest first)
    l0_sstables: Vec<SstMeta>,
    /// L1+ levels: each level is a vec of non-overlapping SSTables sorted by key range
    levels: Vec<Vec<SstMeta>>,
    /// Size ratio between adjacent levels (typically 10)
    size_ratio: u64,
    /// L1 target size in bytes
    l1_target_bytes: u64,
}

/// A compaction task: which SSTables to merge and where to put the output.
struct CompactionTask {
    input_ssts: Vec<SstMeta>,
    output_level: usize,
}

impl LsmState {
    /// Determine if compaction is needed and generate a task.
    fn generate_compaction_task(&self) -> Option<CompactionTask> {
        // Priority 1: Too many L0 SSTables (merge all L0 into L1)
        if self.l0_sstables.len() >= 4 {
            let mut inputs: Vec<SstMeta> = self.l0_sstables.clone();
            // Include all L1 SSTables that overlap with any L0 SSTable.
            // The unwraps are safe: the branch guarantees at least 4 inputs.
            let l0_min = inputs.iter().map(|s| &s.min_key).min().unwrap().clone();
            let l0_max = inputs.iter().map(|s| &s.max_key).max().unwrap().clone();
            if let Some(l1) = self.levels.get(0) {
                for sst in l1 {
                    if sst.max_key >= l0_min && sst.min_key <= l0_max {
                        inputs.push(sst.clone());
                    }
                }
            }
            return Some(CompactionTask {
                input_ssts: inputs,
                output_level: 1,
            });
        }
        // Priority 2: A level exceeds its target size
        for (i, level) in self.levels.iter().enumerate() {
            let level_num = i + 1; // levels[0] = L1
            let target = self.l1_target_bytes * self.size_ratio.pow(i as u32);
            let actual: u64 = level.iter().map(|s| s.size_bytes).sum();
            if actual > target {
                // Production engines choose the SSTable with a priority
                // heuristic, such as the least overlap with the next level
                // (simplified here: pick the first SSTable)
                if let Some(sst) = level.first() {
                    let mut inputs = vec![sst.clone()];
                    // Add overlapping SSTables from the next level
                    if let Some(next_level) = self.levels.get(i + 1) {
                        for next_sst in next_level {
                            if next_sst.max_key >= sst.min_key
                                && next_sst.min_key <= sst.max_key
                            {
                                inputs.push(next_sst.clone());
                            }
                        }
                    }
                    return Some(CompactionTask {
                        input_ssts: inputs,
                        output_level: level_num + 1,
                    });
                }
            }
        }
        None // No compaction needed
    }
}
The L0-to-L1 compaction merges all L0 SSTables with the overlapping portion of L1. This is the most expensive compaction operation (L0 SSTables overlap each other, so the entire key range may be involved), but it's necessary to establish the non-overlapping property at L1.
The level-to-level compaction picks a single SSTable from the overfull level and merges it with the overlapping SSTables in the next level. Production implementations do not simply take the first file: LevelDB, for example, cycles through each level's key range so that compaction is spread uniformly across the key space, preventing hot spots.
Key Takeaways
- Compaction is the LSM engine's background maintenance process — it merges SSTables to reduce read amplification and space amplification at the cost of write amplification.
- The three amplification factors (read, write, space) are fundamentally in tension. No compaction strategy minimizes all three. Leveled compaction favors reads; tiered favors writes; FIFO favors write throughput for TTL data.
- Leveled compaction organizes SSTables into levels with non-overlapping key ranges and exponentially increasing size targets. Write amplification of ~10x per level is the cost of O(L) read amplification.
- Tiered compaction groups SSTables into sorted runs of similar size. Write amplification of 2-5x is the reward for tolerating higher read and space amplification.
- L0 is special: SSTables have overlapping key ranges and must all be checked on every read. Flushing L0 to L1 is the highest-priority compaction task.
- Compaction scheduling must balance background I/O against foreground query latency. Throttle during peak load, compact aggressively during quiet periods.
Lesson 3 — Bloom Filters, Block Cache, and Read Optimization
Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapters 7–8; Mini-LSM Week 1 Day 7
Source note: This lesson was synthesized from training knowledge. Verify the bloom filter bits-per-key formula and Petrov's block cache eviction policies against the source texts.
Context
The LSM read path from Lesson 1 checks every SSTable from newest to oldest. For a database with 50 SSTables, a negative lookup (key doesn't exist) requires 50 SSTable probes — each involving reading an index block and potentially a data block from disk. Even with leveled compaction limiting the effective probe count to one SSTable per level, a 4-level LSM still reads 4 SSTables per negative lookup.
Two optimizations close the read performance gap between LSM trees and B+ trees:
- Bloom filters: a probabilistic data structure attached to each SSTable that answers "is this key possibly in this SSTable?" in O(1) with no disk I/O. A "no" answer is definitive (the key is definitely not there), so the engine skips the entire SSTable. With a 1% false positive rate, 99% of unnecessary SSTable probes are eliminated.
- Block cache: an in-memory cache of recently-read SSTable data blocks and index blocks. Hot blocks stay in memory, eliminating disk reads for frequently-accessed keys. Combined with index block pinning (keeping all index blocks in memory permanently), this makes most SSTable lookups a single data block read.
Together, these two techniques reduce the practical I/O cost of an LSM point lookup to approximately 1 disk read — competitive with the B+ tree.
Core Concepts
Bloom Filters
A bloom filter is a bit array of m bits with k independent hash functions. To add a key, hash it with all k functions and set the corresponding bits. To check a key, hash it and verify all k bits are set. If any bit is 0, the key is definitely not in the set. If all bits are 1, the key is probably in the set (but might be a false positive).
The false positive probability for a bloom filter with m bits, k hash functions, and n inserted keys is approximately:
FPR ≈ (1 - e^(-kn/m))^k
The optimal number of hash functions for a given m/n (bits per key) ratio is k = (m/n) × ln(2).
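Both formulas are easy to evaluate directly. The sketch below assumes the hash count is rounded to the nearest integer (a real filter needs an integer k); `bloom_fpr` and `optimal_k` are illustrative names.

```rust
/// Approximate false positive rate for `bits_per_key` (m/n) and `k` hashes:
/// FPR ≈ (1 - e^(-k / bits_per_key))^k
fn bloom_fpr(bits_per_key: f64, k: u32) -> f64 {
    (1.0 - (-(k as f64) / bits_per_key).exp()).powi(k as i32)
}

/// Optimal hash count k = (m/n) × ln(2), rounded to the nearest integer
/// and never less than 1.
fn optimal_k(bits_per_key: f64) -> u32 {
    ((bits_per_key * std::f64::consts::LN_2).round() as u32).max(1)
}
```

For 10 bits per key this gives k = 7 and a false positive rate of about 0.82%; for 5 bits per key, k = 3 and about 9.2%.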
Practical sizing for the OOR:
| Bits per key | False positive rate | Memory per 10,000 keys |
|---|---|---|
| 5 | 9.2% | 6.1 KB |
| 10 | 0.82% | 12.2 KB |
| 14 | 0.12% | 17.1 KB |
| 20 | 0.0067% | 24.4 KB |
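10 bits per key is the conventional choice (it is the value RocksDB's documentation uses when configuring a bloom filter policy). It provides a ~1% false positive rate at modest memory cost. For the OOR's 100,000 keys, that is ~122 KB of filter data in total, spread across however many SSTables hold those keys, which is trivially small compared to the SSTable data itself.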
A bloom filter is stored in the SSTable's meta block and loaded into memory when the SSTable is opened. It is never written to disk again — it's read-only once created. This means the filter is computed during the SSTable build (compaction or flush) and persists for the SSTable's lifetime.
Bloom Filter in the Read Path
When the engine needs to check an SSTable for a key:
- Consult the in-memory bloom filter for that SSTable.
- If the filter says "no" → skip the SSTable entirely. Zero disk I/O.
- If the filter says "maybe" → read the index block, find the candidate data block, read the data block, and search for the key.
For a negative lookup across 50 SSTables with a 1% FPR:
- Without bloom filters: 50 SSTable probes (50 index reads + up to 50 data reads).
- With bloom filters: ~0.5 SSTable probes on average (50 × 0.01 = 0.5 false positives).
This transforms the LSM's worst-case operation into a near-best-case: a negative lookup that previously cost one probe per SSTable (50 here) now costs about 0.5 probes on average, each probe being at most one index read plus one data read.
Block Cache
The block cache is an LRU cache (similar to the buffer pool from Module 1) that stores recently-accessed SSTable blocks in memory. Unlike the buffer pool, which caches full pages, the block cache stores individual SSTable blocks (typically 4KB) keyed by (sst_id, block_offset).
Two categories of blocks are cached:
- Data blocks — the actual key-value data. Cached on demand (when a read hits the block).
- Index blocks — the SSTable's internal index mapping last-key-per-block to block offset. Frequently accessed (every point lookup into the SSTable reads the index block first).
Many engines pin index blocks and filter blocks in the cache — they are loaded when the SSTable is opened and never evicted. This guarantees that every SSTable lookup requires at most one disk read (for the data block), because the filter and index are always in memory.
The block cache sits above the OS page cache and provides the engine with workload-aware caching. Like the buffer pool, it exists because the OS page cache doesn't understand SSTable access patterns — it can't distinguish between a hot data block and a cold compaction input.
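A minimal sketch of such a cache, keyed by (sst_id, block_offset). This is an illustration under simplifying assumptions: recency is a logical clock, eviction scans all entries (O(n)), and the `load` closure stands in for a disk read. A production cache would use a linked-list or sharded LRU, and the `BlockCache` used elsewhere in this module may have a different interface.

```rust
use std::collections::HashMap;

/// Cache key: (sst_id, block_offset).
type BlockKey = (u64, u64);

struct BlockCache {
    capacity: usize,
    clock: u64,
    blocks: HashMap<BlockKey, (u64, Vec<u8>)>, // key -> (last used, bytes)
    misses: u64,
}

impl BlockCache {
    fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1),
            clock: 0,
            blocks: HashMap::new(),
            misses: 0,
        }
    }

    /// Return the cached block, or call `load` and cache the result,
    /// evicting the least recently used block if the cache is full.
    fn get_or_load<F>(&mut self, key: BlockKey, load: F) -> &[u8]
    where
        F: FnOnce() -> Vec<u8>,
    {
        self.clock += 1;
        let now = self.clock;
        if !self.blocks.contains_key(&key) {
            self.misses += 1;
            if self.blocks.len() >= self.capacity {
                let victim = self
                    .blocks
                    .iter()
                    .min_by_key(|(_, entry)| entry.0)
                    .map(|(k, _)| *k);
                if let Some(k) = victim {
                    self.blocks.remove(&k);
                }
            }
            self.blocks.insert(key, (now, load()));
        }
        let entry = self
            .blocks
            .get_mut(&key)
            .expect("block is present: found or just inserted");
        entry.0 = now; // mark as most recently used
        &entry.1
    }
}
```

Pinning index and filter blocks would amount to keeping them in a separate map that the eviction scan never visits.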
Prefix Bloom Filters
Standard bloom filters answer "is this exact key in the SSTable?" For range scans, you need a different question: "does this SSTable contain any keys with this prefix?" A prefix bloom filter hashes key prefixes instead of full keys, enabling prefix-based filtering.
For the OOR, a prefix bloom on the first 3 bytes of the NORAD ID would let range scans skip SSTables that don't contain any keys in the target range. The false positive rate is higher (more keys share a prefix than match exactly), but the I/O savings for range scans are significant.
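The idea can be sketched with a small filter that hashes only the key prefix. Everything here is illustrative: `PrefixBloom` is a hypothetical type, keys are assumed to begin with the NORAD ID as ASCII digits, and a `Vec<bool>` replaces the packed bit array for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PREFIX_LEN: usize = 3; // assumption: the first 3 key bytes form the prefix

struct PrefixBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl PrefixBloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self {
            bits: vec![false; num_bits.max(1)],
            num_hashes: num_hashes.max(1),
        }
    }

    fn prefix(key: &[u8]) -> &[u8] {
        &key[..key.len().min(PREFIX_LEN)]
    }

    fn positions(&self, prefix: &[u8]) -> Vec<usize> {
        (0..self.num_hashes)
            .map(|seed| {
                let mut h = DefaultHasher::new();
                (seed, prefix).hash(&mut h);
                (h.finish() % self.bits.len() as u64) as usize
            })
            .collect()
    }

    /// Called for every key while the SSTable is built.
    fn insert_key(&mut self, key: &[u8]) {
        for pos in self.positions(Self::prefix(key)) {
            self.bits[pos] = true;
        }
    }

    /// false → no key with this prefix is in the SSTable, so a scan can
    /// skip it. Queries shorter than PREFIX_LEN cannot be filtered and
    /// conservatively return true.
    fn may_contain_prefix(&self, prefix: &[u8]) -> bool {
        if prefix.len() < PREFIX_LEN {
            return true;
        }
        self.positions(Self::prefix(prefix))
            .into_iter()
            .all(|pos| self.bits[pos])
    }
}
```

The conservative branch for short queries matters: a scan over a prefix shorter than the one that was hashed would otherwise see false negatives, which a bloom filter must never produce.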
Combining Optimizations: End-to-End Read Path
A fully optimized LSM point lookup:
- Check active memtable (in-memory, O(log N)).
- Check immutable memtables (in-memory, O(log N) each).
- For each SSTable, newest to oldest:
  a. Check the bloom filter (in-memory, O(k) hash operations). If negative → skip.
  b. Read the index block (in block cache → 0 disk I/O if pinned).
  c. Binary search the index block for the target data block.
  d. Read the data block (in block cache → 0 I/O if hot; 1 disk read if cold).
  e. Search the data block for the key.
- First match (value or tombstone) terminates the search.
For a positive lookup on a hot key: 0 disk reads (everything in cache). For a negative lookup: 0 disk reads in the common case (bloom filters reject every SSTable unless one returns a false positive). For a cold positive lookup: 1 disk read (the data block; index and filter are pinned).
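The lookup order and the "first match wins" rule can be sketched with in-memory stand-ins. `MockSst`, `ReadStats`, and `get` are hypothetical; the filter is an exact `HashSet`, so unlike a real bloom filter it never returns a false positive, and block-level I/O is not modeled.

```rust
use std::collections::{BTreeMap, HashSet};

/// A `None` value is a tombstone.
type Value = Option<Vec<u8>>;

/// In-memory stand-in for an SSTable with its filter.
struct MockSst {
    filter: HashSet<Vec<u8>>,
    data: BTreeMap<Vec<u8>, Value>,
}

impl MockSst {
    fn from_entries(entries: Vec<(Vec<u8>, Value)>) -> Self {
        let filter = entries.iter().map(|(k, _)| k.clone()).collect();
        Self {
            filter,
            data: entries.into_iter().collect(),
        }
    }
}

struct ReadStats {
    sst_probes: usize,
    filter_skips: usize,
}

/// Point lookup: active memtable, then immutable memtables, then SSTables
/// newest first. The first match wins; a tombstone match means "deleted".
fn get(
    key: &[u8],
    memtable: &BTreeMap<Vec<u8>, Value>,
    immutables: &[BTreeMap<Vec<u8>, Value>],
    ssts: &[MockSst],
    stats: &mut ReadStats,
) -> Option<Vec<u8>> {
    if let Some(v) = memtable.get(key) {
        return v.clone();
    }
    for imm in immutables {
        if let Some(v) = imm.get(key) {
            return v.clone();
        }
    }
    for sst in ssts {
        if !sst.filter.contains(key) {
            stats.filter_skips += 1; // filter said "no": zero I/O
            continue;
        }
        stats.sst_probes += 1; // index + data block lookup would happen here
        if let Some(v) = sst.data.get(key) {
            return v.clone();
        }
    }
    None
}
```

Returning on a tombstone is what keeps deleted keys deleted: without it, the search would continue into an older SSTable and resurrect the stale value.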
Code Examples
A Simple Bloom Filter for SSTable Key Filtering
This implements the core bloom filter operations — build during SSTable creation, query during reads.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A bloom filter for probabilistic key membership testing.
struct BloomFilter {
    bits: Vec<u8>,
    num_bits: usize,
    num_hashes: u32,
}

impl BloomFilter {
    /// Create a bloom filter sized for `num_keys` with the given
    /// bits-per-key ratio. Optimal hash count is computed automatically.
    fn new(num_keys: usize, bits_per_key: usize) -> Self {
        // Floor of 64 bits so an empty SSTable never causes `% 0` below.
        let num_bits = (num_keys * bits_per_key).max(64);
        let num_bytes = (num_bits + 7) / 8;
        // Optimal k = bits_per_key * ln(2) ≈ bits_per_key * 0.693
        let num_hashes = ((bits_per_key as f64) * 0.693).ceil() as u32;
        let num_hashes = num_hashes.clamp(1, 30);
        Self {
            bits: vec![0u8; num_bytes],
            num_bits,
            num_hashes,
        }
    }

    /// Add a key to the bloom filter.
    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.num_hashes {
            let bit_pos = self.hash(key, i) % self.num_bits;
            self.bits[bit_pos / 8] |= 1 << (bit_pos % 8);
        }
    }

    /// Check if a key might be in the set.
    /// Returns false → definitely not in the set.
    /// Returns true → possibly in the set (check the SSTable).
    fn may_contain(&self, key: &[u8]) -> bool {
        for i in 0..self.num_hashes {
            let bit_pos = self.hash(key, i) % self.num_bits;
            if self.bits[bit_pos / 8] & (1 << (bit_pos % 8)) == 0 {
                return false; // Definitive: key is NOT in the set
            }
        }
        true // All bits set: key is PROBABLY in the set
    }

    /// Generate the i-th hash of a key using double hashing.
    /// h(i) = h1 + i * h2, where h1 and h2 are independent hashes.
    fn hash(&self, key: &[u8], i: u32) -> usize {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let hash1 = h1.finish();
        let mut h2 = DefaultHasher::new();
        // Mix in a constant to get an independent second hash
        (key, 0xDEADBEEFu32).hash(&mut h2);
        let hash2 = h2.finish();
        (hash1.wrapping_add((i as u64).wrapping_mul(hash2))) as usize
    }

    /// Serialize the bloom filter for storage in the SSTable meta block.
    /// `num_bits` is stored as a u32, so this format cannot represent
    /// filters larger than 2^32 bits.
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = Vec::with_capacity(4 + 4 + self.bits.len());
        buf.extend_from_slice(&(self.num_bits as u32).to_le_bytes());
        buf.extend_from_slice(&self.num_hashes.to_le_bytes());
        buf.extend_from_slice(&self.bits);
        buf
    }
}
The double hashing technique generates k hash values from just two base hashes: h(i) = h1 + i × h2. Kirsch and Mitzenmacher (2006) showed that this achieves the same asymptotic false positive rate as k independent hash functions, and it is much cheaper to compute.
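The DefaultHasher currently uses SipHash, which is well-distributed but not the fastest, and the standard library does not guarantee that its algorithm stays the same across Rust releases. A filter that is persisted inside an SSTable therefore needs a hash function with a fixed, specified output. Production bloom filters use xxHash or wyhash, which are also faster. The bloom filter algorithm is the same regardless of hash function.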
Integrating the Bloom Filter into SSTable Reads
Modifying the LSM read path to consult bloom filters before reading SSTable blocks.
/// Check an SSTable for a key, using the bloom filter to skip if possible.
/// Outer `None`: the key is not in this SSTable, keep searching older ones.
/// `Some(None)`: a tombstone was found. `Some(Some(v))`: the value was found.
fn check_sstable(
    sst: &SsTableReader,
    key: &[u8],
    block_cache: &mut BlockCache,
) -> io::Result<Option<Option<Vec<u8>>>> {
    // Step 1: Bloom filter check (in-memory, zero I/O)
    if !sst.bloom_filter.may_contain(key) {
        return Ok(None); // Definitely not in this SSTable
    }
    // Step 2: Index block lookup (cached or pinned, usually zero I/O)
    let block_handle = sst.find_block_for_key(key, block_cache)?;
    // Step 3: Data block read (1 disk read if not cached)
    let data_block = block_cache.get_or_load(
        sst.id,
        block_handle.offset,
        block_handle.size,
        &sst.file,
    )?;
    // Step 4: Search the data block for the key
    match data_block.find(key) {
        Some(entry) => Ok(Some(entry.value.clone())), // Found (value or tombstone)
        None => Ok(None), // Key not in this block (bloom filter false positive)
    }
}
When the bloom filter returns false (Step 1), the entire SSTable is skipped: no index read, no data read, no disk I/O. This is the single biggest read-path optimization in the LSM architecture.
When the bloom filter returns true but the key isn't actually in the SSTable (false positive), the engine performs an unnecessary index + data block read. At 1% FPR, this happens once per 100 negative probes per SSTable — rare enough to be negligible.
Key Takeaways
- Bloom filters eliminate 99% of unnecessary SSTable probes for negative lookups at 10 bits per key. This transforms the LSM's worst case (negative lookups checking every SSTable) into a near-zero-I/O operation.
- The false positive rate is tunable via bits-per-key: 10 bits → ~1% FPR, 14 bits → ~0.1% FPR. The OOR should use 10 bits per key as the default, the value conventionally used in RocksDB configurations.
- Block cache stores recently-accessed SSTable blocks in memory. Pinning index and filter blocks guarantees that every SSTable lookup costs at most 1 disk read (for the data block).
- The fully optimized LSM read path: bloom filter (in-memory) → index block (pinned in cache) → data block (cached or 1 disk read). For hot keys, this is 0 disk reads — competitive with B+ tree performance.
- Prefix bloom filters extend filtering to range scans by hashing key prefixes instead of full keys. Higher false positive rate but significant I/O savings for prefix-based range queries.
Project — LSM-Backed TLE Storage Engine
Module: Database Internals — M03: LSM Trees & Compaction
Track: Orbital Object Registry
Estimated effort: 8–10 hours
- SDA Incident Report — OOR-2026-0044
- Objective
- Acceptance Criteria
- Starter Structure
- Hints
- What Comes Next
SDA Incident Report — OOR-2026-0044
Classification: ENGINEERING DIRECTIVE
Subject: Build LSM storage engine prototype for write-optimized TLE ingestion
The B+ tree index cannot sustain burst ingestion rates during fragmentation events. Build an LSM-tree-based storage engine that batches writes in a memtable, flushes to immutable SSTables, and uses leveled compaction to bound read amplification. Bloom filters on each SSTable must reduce negative lookup cost to near-zero.
Objective
Build a complete LSM storage engine that:
- Accepts put(key, value) and delete(key) into an in-memory memtable
- Freezes and flushes the memtable to SSTable files when a size threshold is reached
- Supports get(key) by probing the memtable, then SSTables from newest to oldest
- Implements a simple leveled compaction (merge all L0 into L1) triggered by SSTable count
- Attaches a bloom filter to each SSTable for fast negative lookups
- Supports scan(start_key, end_key) via a merge iterator over all sources
Acceptance Criteria
- Write throughput. Insert 100,000 TLE records with a 4MB memtable limit. Measure and print the total time and records/second. Target: >50,000 records/second (in-memory memtable inserts).
- Memtable flush. Verify that SSTables are created on disk after the memtable reaches the size threshold. Print the number of SSTables after all inserts.
- Point lookup correctness. After all inserts, look up 1,000 random NORAD IDs and verify each returns the correct TLE record. Look up 1,000 non-existent IDs and verify each returns None.
- Bloom filter effectiveness. Report bloom filter hit/miss stats: how many SSTable probes were skipped by the bloom filter during the 1,000 negative lookups. Target: >95% skip rate.
- Delete correctness. Delete 1,000 records. Verify get returns None for deleted keys. Verify that non-deleted keys adjacent to deleted keys still return correct values.
- Compaction. Trigger compaction (merge L0 SSTables into L1). Verify that the number of SSTables decreases. Verify all keys are still accessible after compaction. Verify that deleted keys (tombstones) are garbage-collected if compaction output is the bottommost level.
- Range scan. Scan NORAD IDs [40000, 40500]. Verify the results are in sorted order and include exactly the expected keys.
Starter Structure
lsm-storage/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs acceptance criteria
│ ├── memtable.rs # MemTable: BTreeMap-backed sorted store
│ ├── sstable.rs # SsTableBuilder, SsTableReader
│ ├── bloom.rs # BloomFilter: insert, may_contain, serialize
│ ├── merge_iter.rs # MergeIterator: multi-way merge over sorted sources
│ ├── compaction.rs # Compaction controller and merge logic
│ └── lsm.rs # LsmEngine: top-level API (put, get, delete, scan)
Hints
Hint 1 — SSTable file naming
Name SSTable files with a monotonically increasing ID: sst_000001.dat, sst_000002.dat, etc. Higher IDs are newer. The LSM state (which SSTables are active) can be tracked in memory as a Vec<SstMeta> per level. Persist the LSM state to a manifest file for crash recovery (or defer this to Module 4).
Hint 2 — Simple L0→L1 compaction trigger
The simplest compaction trigger: when the number of L0 SSTables reaches 4, merge all L0 SSTables into a single sorted L1 SSTable. This eliminates L0's overlapping key ranges. If L1 already has SSTables, include them in the merge to maintain the non-overlapping invariant.
Hint 3 — Merge iterator design
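Use a BinaryHeap<Reverse<(key, source_id, value)>> as the min-heap. The source_id breaks ties. It is the source's rank in the merge, not the SSTable file ID: number the sources newest-first (memtable = 0, then SSTables from newest to oldest) so that a lower source_id means a newer source. When popping an entry, skip all entries with the same key from older sources (they are superseded).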
Hint 4 — Atomicity of SSTable swap
During compaction, write all output SSTables before modifying the LSM state. Then atomically update the state (swap old inputs for new outputs). If the engine crashes mid-compaction, the old SSTables are still valid — the output SSTables are orphaned files that can be cleaned up. This is the simplest crash-safe compaction strategy without a full WAL/manifest (which Module 4 adds).
What Comes Next
Module 4 (WAL & Recovery) adds durability. The memtable is volatile — if the process crashes, unflushed writes are lost. The WAL logs every write before it enters the memtable, enabling recovery. The manifest log tracks which SSTables are active, enabling crash-safe compaction.
Module 04 — Write-Ahead Logging & Recovery
Track: Database Internals — Orbital Object Registry
Position: Module 4 of 6
Source material: Database Internals — Alex Petrov, Chapters 9–10; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0045
Classification: DATA LOSS INCIDENT
Subject: 2,400 TLE records lost after unplanned power failure
At 03:17 UTC, a PDU failure at Ground Station Bravo caused an unclean shutdown of the OOR storage engine. The active memtable contained approximately 2,400 TLE updates from the preceding 12-minute pass window. Because the memtable is a volatile in-memory structure, all 2,400 records were lost. The LSM engine restarted with only the previously flushed SSTables, leaving the catalog 12 minutes stale. Two conjunction alerts were delayed because the missing TLEs contained the most recent orbital elements for objects in a close-approach trajectory.
Directive: Implement a write-ahead log. Every mutation must be logged to durable storage before it is applied to the memtable. On crash recovery, replay the WAL to reconstruct the memtable to its pre-crash state.
Learning Outcomes
After completing this module, you will be able to:
- Explain the write-ahead rule and why it is the foundation of crash recovery
- Implement a WAL that logs key-value operations to an append-only file with checksummed records
- Describe the ARIES recovery protocol — analysis, redo, and undo phases
- Implement crash recovery by replaying the WAL to reconstruct the memtable
- Design a checkpointing strategy that limits WAL replay time after a crash
- Reason about the tradeoff between fsync frequency and durability guarantees
Lesson Summary
Lesson 1 — WAL Fundamentals
The write-ahead rule, log record format, LSN ordering, and the WAL's role in the LSM write path. Why fsync is the only guarantee of durability, and the latency cost of calling it.
Key question: What is the maximum data loss window under group commit with 50ms batch intervals?
Lesson 2 — Crash Recovery
The ARIES recovery protocol adapted for the LSM engine. Analysis phase (determine which WAL records need replay), redo phase (replay committed operations into the memtable), and how the manifest tracks SSTable state for consistent restart.
Key question: If the engine crashes during a compaction, does it use the old or new SSTables on recovery?
Lesson 3 — Checkpointing
Fuzzy checkpoints that snapshot the LSM state without blocking writes. How checkpoints bound WAL replay time and enable WAL truncation. The tradeoff between checkpoint frequency and recovery time.
Key question: What is the maximum WAL replay time for the OOR with 60-second checkpoint intervals?
Capstone Project — Durable TLE Update Pipeline
Add WAL-based durability to the Module 3 LSM engine. Every write is logged before entering the memtable. On simulated crash, the engine recovers to a consistent state by replaying the WAL. Full brief in project-durable-pipeline.md.
File Index
module-04-wal-recovery/
├── README.md
├── lesson-01-wal-fundamentals.md
├── lesson-01-quiz.toml
├── lesson-02-crash-recovery.md
├── lesson-02-quiz.toml
├── lesson-03-checkpointing.md
├── lesson-03-quiz.toml
└── project-durable-pipeline.md
Prerequisites
- Module 3 (LSM Trees & Compaction) completed
What Comes Next
Module 5 (Transactions & Isolation) adds concurrent read/write support with MVCC snapshot isolation, building on the durable foundation established here.
Lesson 1 — WAL Fundamentals
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 9–10; Mini-LSM Week 2 Day 6
Source note: This lesson was synthesized from training knowledge. Verify Petrov's WAL record format, LSN semantics, and fsync discussion against Chapters 9–10.
Context
The LSM engine from Module 3 achieves high write throughput by buffering writes in a memtable and flushing them to SSTables in the background. But the memtable is volatile — it lives in process memory. If the process crashes, the OS kills the process, or power fails, the memtable's contents are lost. For the OOR, this means losing every TLE update since the last flush — potentially minutes of orbital data during an active pass window.
The Write-Ahead Log (WAL) is the solution: an append-only file on durable storage that records every mutation before it is applied to the memtable. The key insight is the write-ahead rule: no modification to the in-memory state is visible until its corresponding log record has been durably written to the WAL. If the engine crashes after logging but before flushing, the WAL contains enough information to reconstruct the memtable by replaying the logged operations.
This changes the durability guarantee from "data is safe after SSTable flush" (every 5–60 seconds) to "data is safe after WAL write" (every operation, or every batch). The cost is one sequential disk write per operation (or per batch) — but sequential appends are cheap, especially on SSDs.
Core Concepts
The Write-Ahead Rule
The rule is simple and inviolable: log first, then mutate. The LSM write path becomes:
1. Serialize the operation (put or delete) into a WAL log record.
2. Append the record to the WAL file.
3. Call fsync on the WAL file (or batch fsync for a group of records).
4. Apply the operation to the memtable.
5. Return success to the caller.
If the engine crashes after step 3 but before step 4, the WAL contains the operation but the memtable does not, and recovery will replay it. If it crashes between steps 2 and 3, the record may or may not have reached disk; the caller never received a success response, so either outcome is correct. If the engine crashes before step 2, the operation was never logged and, for the same reason, is not considered committed.
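The ordering can be made concrete with a toy engine that is generic over its log device. This is a sketch, not the module's WalWriter: `Engine` and `BrokenLog` are hypothetical, the record format is a bare length-prefixed pair, and `flush` stands in for flush plus fsync on a real file.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};

struct Engine<W: Write> {
    wal: W,
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl<W: Write> Engine<W> {
    /// Log first, then mutate. If the log write fails, the memtable is not
    /// touched and the caller sees the error, so nothing unlogged is visible.
    fn put(&mut self, key: &[u8], value: &[u8]) -> io::Result<()> {
        // Steps 1-2: serialize and append (toy length-prefixed format).
        self.wal.write_all(&(key.len() as u32).to_le_bytes())?;
        self.wal.write_all(&(value.len() as u32).to_le_bytes())?;
        self.wal.write_all(key)?;
        self.wal.write_all(value)?;
        // Step 3: make it durable (flush + sync_all for a real File).
        self.wal.flush()?;
        // Step 4: only now apply the operation to the memtable.
        self.memtable.insert(key.to_vec(), value.to_vec());
        // Step 5: report success.
        Ok(())
    }
}

/// A log device that always fails, to demonstrate the ordering guarantee.
struct BrokenLog;

impl Write for BrokenLog {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "disk unavailable"))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

With `BrokenLog` as the device, `put` returns an error and the memtable stays empty, which is exactly the property the rule exists to guarantee.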
WAL Record Format
Each WAL record is a self-contained, checksummed unit:
┌──────────────────────────────────────────┐
│ Record Header │
│ - LSN: u64 (log sequence number) │
│ - record_type: u8 (Put=1, Delete=2) │
│ - key_len: u16 │
│ - value_len: u16 │
│ - checksum: u32 (CRC32 of key+value) │
├──────────────────────────────────────────┤
│ Key bytes (key_len bytes) │
├──────────────────────────────────────────┤
│ Value bytes (value_len bytes, if Put) │
└──────────────────────────────────────────┘
The Log Sequence Number (LSN) is a monotonically increasing identifier assigned to each record. LSNs establish a total order over all operations — recovery replays records in LSN order to reconstruct the exact pre-crash state. The LSN also correlates WAL records with SSTable flushes: when the memtable is flushed, the engine records the highest LSN contained in that flush. On recovery, only WAL records with LSNs greater than the last flushed LSN need to be replayed.
The checksum protects against partial writes: if the engine crashes mid-write, the incomplete record will have an invalid checksum and is discarded during recovery.
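A round trip through this layout shows how the checksum catches a torn write. The sketch is dependency-free, so it uses a slow bitwise CRC-32 instead of a crate; `encode_record` and `decode_record` are illustrative names, and as in the diagram the checksum covers only key + value, not the header.

```rust
use std::convert::{TryFrom, TryInto};

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(chunks: &[&[u8]]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for chunk in chunks {
        for &byte in *chunk {
            crc ^= u32::from(byte);
            for _ in 0..8 {
                crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
            }
        }
    }
    !crc
}

const HEADER_LEN: usize = 17; // LSN(8) + type(1) + key_len(2) + value_len(2) + checksum(4)

/// Returns None if the key or value does not fit the u16 length fields.
fn encode_record(lsn: u64, rec_type: u8, key: &[u8], value: &[u8]) -> Option<Vec<u8>> {
    let key_len = u16::try_from(key.len()).ok()?;
    let value_len = u16::try_from(value.len()).ok()?;
    let mut buf = Vec::with_capacity(HEADER_LEN + key.len() + value.len());
    buf.extend_from_slice(&lsn.to_le_bytes());
    buf.push(rec_type);
    buf.extend_from_slice(&key_len.to_le_bytes());
    buf.extend_from_slice(&value_len.to_le_bytes());
    buf.extend_from_slice(&crc32(&[key, value]).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(value);
    Some(buf)
}

/// Decode one record from the front of `buf`. Returns None for a truncated
/// record or a checksum mismatch, which recovery treats as the end of the log.
fn decode_record(buf: &[u8]) -> Option<(u64, u8, Vec<u8>, Vec<u8>)> {
    let header = buf.get(..HEADER_LEN)?;
    let lsn = u64::from_le_bytes(header[0..8].try_into().ok()?);
    let rec_type = header[8];
    let key_len = u16::from_le_bytes(header[9..11].try_into().ok()?) as usize;
    let value_len = u16::from_le_bytes(header[11..13].try_into().ok()?) as usize;
    let stored = u32::from_le_bytes(header[13..17].try_into().ok()?);
    let key = buf.get(HEADER_LEN..HEADER_LEN + key_len)?;
    let value = buf.get(HEADER_LEN + key_len..HEADER_LEN + key_len + value_len)?;
    if crc32(&[key, value]) != stored {
        return None;
    }
    Some((lsn, rec_type, key.to_vec(), value.to_vec()))
}
```

Because the header is outside the checksum, a corrupted length field is only caught indirectly (the payload read comes up short or the CRC fails on the wrong bytes). Extending the checksum to cover the header is a common hardening step.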
fsync Strategies
fsync is expensive: 0.1–1ms on SSD, 5–30ms on spinning disk. Three strategies for trading durability against throughput:
Per-operation fsync: Maximum durability. No acknowledged operation is lost; at most the one in-flight, unacknowledged operation is. Throughput is limited by fsync latency (~1,000–10,000 ops/sec on SSD).
Group commit (batch fsync): Buffer multiple WAL writes, then fsync the batch. If the batch covers 100 operations and fsync takes 0.5ms, the amortized cost is 5µs per operation. Standard approach — used by RocksDB, PostgreSQL, MySQL.
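No fsync: Maximum throughput, minimum durability. Written data sits in the OS page cache until the kernel writes it back, so a power failure or kernel crash can lose everything since the last writeback (up to roughly 30 seconds with default Linux settings). Acceptable for caches, not for the OOR.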
For the OOR, group commit is the correct choice: batch TLE updates per pass window, fsync once per batch.
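The group-commit arithmetic is simple enough to keep as a helper when sizing batches. The function names below are illustrative, and the throughput bound assumes fsync latency is the only cost.

```rust
/// Amortized fsync cost per operation, in microseconds.
fn amortized_fsync_us(fsync_ms: f64, ops_per_batch: u32) -> f64 {
    fsync_ms * 1000.0 / f64::from(ops_per_batch)
}

/// Upper bound on throughput when fsync latency is the only cost.
fn max_ops_per_sec(fsync_ms: f64, ops_per_batch: u32) -> f64 {
    f64::from(ops_per_batch) * 1000.0 / fsync_ms
}
```

A 0.5ms fsync shared by 100 operations costs 5µs each. With one operation per fsync, a 1ms fsync caps throughput at 1,000 ops/sec and a 0.1ms fsync at 10,000 ops/sec, matching the per-operation range above.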
WAL in the LSM Write Path
put(key, value)
│
▼
WAL append (serialize + write + optional fsync)
│
▼
Memtable insert (in-memory, fast)
│
▼
Return success
When the memtable is flushed to an SSTable, the engine records the flushed LSN. WAL records at or below this LSN are no longer needed for recovery, enabling WAL truncation.
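The flushed LSN acts as a filter during replay. The sketch below works on already-decoded records; `Op` and `replay` are hypothetical names, and `flushed_lsn` is an `Option` because LSNs in this lesson start at 0, so "nothing flushed yet" needs its own representation.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum Op {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

/// Rebuild the memtable from WAL records given in LSN order. Records at or
/// below `flushed_lsn` are already in an SSTable and are skipped. Deletes
/// become tombstones (None) so they keep shadowing older SSTable data.
fn replay(
    records: &[(u64, Op)],
    flushed_lsn: Option<u64>,
) -> BTreeMap<Vec<u8>, Option<Vec<u8>>> {
    let mut memtable = BTreeMap::new();
    for (lsn, op) in records {
        if flushed_lsn.map_or(false, |flushed| *lsn <= flushed) {
            continue;
        }
        match op {
            Op::Put(k, v) => {
                memtable.insert(k.clone(), Some(v.clone()));
            }
            Op::Delete(k) => {
                memtable.insert(k.clone(), None);
            }
        }
    }
    memtable
}
```

The same threshold tells the engine which WAL segments are safe to truncate: any segment whose highest LSN is at or below the flushed LSN contributes nothing to replay.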
Code Examples
WAL Writer: Appending Checksummed Records
use std::io::{self, Write, BufWriter};
use std::fs::{File, OpenOptions};
#[repr(u8)]
#[derive(Clone, Copy)]
enum WalRecordType {
Put = 1,
Delete = 2,
}
struct WalWriter {
writer: BufWriter<File>,
next_lsn: u64,
}
impl WalWriter {
fn open(path: &str) -> io::Result<Self> {
let file = OpenOptions::new()
.create(true)
.append(true)
.open(path)?;
Ok(Self {
writer: BufWriter::new(file),
next_lsn: 0,
})
}
fn log_put(&mut self, key: &[u8], value: &[u8]) -> io::Result<u64> {
self.write_record(WalRecordType::Put, key, Some(value))
}
fn log_delete(&mut self, key: &[u8]) -> io::Result<u64> {
self.write_record(WalRecordType::Delete, key, None)
}
fn write_record(
&mut self,
rec_type: WalRecordType,
key: &[u8],
value: Option<&[u8]>,
) -> io::Result<u64> {
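let val = value.unwrap_or(&[]);
// The header stores both lengths as u16. Reject oversized fields up front
// rather than letting `as u16` truncate them silently.
if key.len() > u16::MAX as usize || val.len() > u16::MAX as usize {
return Err(io::Error::new(
io::ErrorKind::InvalidInput, "WAL key or value exceeds u16::MAX bytes",
));
}
let lsn = self.next_lsn;
self.next_lsn += 1;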
// Checksum covers key + value
let mut hasher = crc32fast::Hasher::new();
hasher.update(key);
hasher.update(val);
let checksum = hasher.finalize();
// Header: LSN(8) + type(1) + key_len(2) + val_len(2) + checksum(4) = 17 bytes
self.writer.write_all(&lsn.to_le_bytes())?;
self.writer.write_all(&[rec_type as u8])?;
self.writer.write_all(&(key.len() as u16).to_le_bytes())?;
self.writer.write_all(&(val.len() as u16).to_le_bytes())?;
self.writer.write_all(&checksum.to_le_bytes())?;
self.writer.write_all(key)?;
self.writer.write_all(val)?;
Ok(lsn)
}
/// Flush to durable storage. Call after a batch for group commit.
fn sync(&mut self) -> io::Result<()> {
self.writer.flush()?;
self.writer.get_ref().sync_all()
}
}
WAL Reader: Replaying Records for Recovery
struct WalRecord {
lsn: u64,
rec_type: WalRecordType,
key: Vec<u8>,
value: Vec<u8>,
}
struct WalReader {
reader: std::io::BufReader<File>,
}
impl WalReader {
fn next_record(&mut self) -> io::Result<Option<WalRecord>> {
let mut header = [0u8; 17];
match self.reader.read_exact(&mut header) {
Ok(()) => {}
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e),
}
let lsn = u64::from_le_bytes(header[0..8].try_into().unwrap());
let rec_type = match header[8] {
1 => WalRecordType::Put,
2 => WalRecordType::Delete,
_ => return Err(io::Error::new(
io::ErrorKind::InvalidData, "invalid WAL record type",
)),
};
let key_len = u16::from_le_bytes(header[9..11].try_into().unwrap()) as usize;
let val_len = u16::from_le_bytes(header[11..13].try_into().unwrap()) as usize;
let stored_checksum = u32::from_le_bytes(header[13..17].try_into().unwrap());
let mut key = vec![0u8; key_len];
let mut value = vec![0u8; val_len];
self.reader.read_exact(&mut key)?;
self.reader.read_exact(&mut value)?;
// Verify checksum — detects partial writes from crashes
let mut hasher = crc32fast::Hasher::new();
hasher.update(&key);
hasher.update(&value);
if hasher.finalize() != stored_checksum {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!("WAL checksum mismatch at LSN {} — partial write detected", lsn),
));
}
Ok(Some(WalRecord { lsn, rec_type, key, value }))
}
}
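The checksum verification is the corruption boundary: a failed checksum is treated as evidence that the engine crashed mid-write. All prior records are valid; this record and everything after it are discarded. A tail record cut off inside its key or value surfaces as an UnexpectedEof error from read_exact rather than a checksum mismatch, and recovery handles both the same way.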
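The "valid prefix" rule can be demonstrated without touching disk. The sketch below uses a simplified frame (length, CRC32, payload) rather than the 17-byte header above, and a bitwise CRC-32 written out by hand so the example has no crate dependency. The function names are hypothetical.

```rust
/// Bitwise CRC-32 (IEEE polynomial, reflected), dependency-free for illustration.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Simplified frame: len(2) + checksum(4) + payload.
fn encode(payload: &[u8]) -> Vec<u8> {
    let mut rec = Vec::with_capacity(6 + payload.len());
    rec.extend_from_slice(&(payload.len() as u16).to_le_bytes());
    rec.extend_from_slice(&crc32(payload).to_le_bytes());
    rec.extend_from_slice(payload);
    rec
}

/// Decode records until the first truncated or corrupt one, then stop.
fn decode_valid_prefix(mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    while buf.len() >= 6 {
        let len = u16::from_le_bytes([buf[0], buf[1]]) as usize;
        let stored = u32::from_le_bytes([buf[2], buf[3], buf[4], buf[5]]);
        if buf.len() < 6 + len {
            break; // truncated tail: the write never completed
        }
        let payload = &buf[6..6 + len];
        if crc32(payload) != stored {
            break; // corruption boundary: discard this record and the rest
        }
        out.push(payload.to_vec());
        buf = &buf[6 + len..];
    }
    out
}
```

Truncating the last record leaves the earlier ones readable. Corrupting a record in the middle hides every record after it, which is why recovery must stop at the boundary rather than skip ahead.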
Key Takeaways
- The write-ahead rule is absolute: log the operation to durable storage before applying it to the memtable. This guarantees that committed operations survive crashes.
- fsync is the durability boundary. Group commit amortizes the cost across many operations and is the standard approach for production engines.
- Each WAL record is self-contained with a CRC32 checksum. Partial writes are detected and discarded during recovery.
- The LSN orders all operations and correlates WAL records with SSTable flushes. WAL records at or below the flushed LSN are safe to truncate.
- WAL writes are sequential appends — the cheapest form of disk I/O.
Lesson 2 — Crash Recovery
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 10
Source note: This lesson was synthesized from training knowledge. Verify Petrov's ARIES protocol adaptation for LSM engines against Chapter 10.
Context
The WAL ensures every committed operation is recorded on durable storage. After a crash, the engine must use that log to return to a consistent state. This is the job of the crash recovery protocol — a deterministic procedure that reads the WAL, determines what was lost, and reconstructs the in-memory state.
For the OOR's LSM engine, recovery is simpler than for a traditional B-tree database because SSTables are immutable. There are no dirty pages to redo or uncommitted transactions to undo — the only volatile state is the memtable. Recovery reconstructs the memtable by replaying WAL records that were not yet flushed to an SSTable.
The classical protocol is ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). ARIES has three phases — analysis, redo, and undo. For the LSM engine, we adapt it: the analysis phase determines the recovery starting point, the redo phase replays the WAL into the memtable, and the undo phase is unnecessary (no uncommitted transactions to roll back at this stage).
Core Concepts
The Manifest
The manifest is a metadata file that records the LSM engine's durable state: which SSTables are active, what level each belongs to, and the highest flushed LSN. On recovery, the manifest is the starting point.
The manifest is itself an append-only log. Each entry records a state change:
[1] AddSSTable { id: 1, level: 0, min_key: "a", max_key: "z", flushed_lsn: 500 }
[2] AddSSTable { id: 2, level: 0, min_key: "b", max_key: "y", flushed_lsn: 1000 }
[3] Compaction { removed: [1, 2], added: [3], output_level: 1, flushed_lsn: 1000 }
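Replaying these entries is a fold over the log. The sketch below assumes a hypothetical ManifestEntry enum mirroring the fields shown above (key ranges omitted) and returns the active SSTable set with the highest flushed LSN.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory form of the manifest entries shown above.
enum ManifestEntry {
    AddSsTable { id: u64, level: u32, flushed_lsn: u64 },
    Compaction { removed: Vec<u64>, added: Vec<u64>, output_level: u32, flushed_lsn: u64 },
}

/// Replay the manifest in order.
/// Returns (active SSTable id -> level, highest flushed LSN).
fn replay_manifest(entries: &[ManifestEntry]) -> (BTreeMap<u64, u32>, u64) {
    let mut active: BTreeMap<u64, u32> = BTreeMap::new();
    let mut flushed: u64 = 0;
    for entry in entries {
        match entry {
            ManifestEntry::AddSsTable { id, level, flushed_lsn } => {
                active.insert(*id, *level);
                flushed = flushed.max(*flushed_lsn);
            }
            ManifestEntry::Compaction { removed, added, output_level, flushed_lsn } => {
                // A compaction swaps its inputs for its outputs in one entry.
                for id in removed {
                    active.remove(id);
                }
                for id in added {
                    active.insert(*id, *output_level);
                }
                flushed = flushed.max(*flushed_lsn);
            }
        }
    }
    (active, flushed)
}
```

For the three entries above, replay ends with SSTable 3 at level 1 as the only active table and a flushed LSN of 1000.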
LSM Recovery Protocol
- Read the manifest. Reconstruct the active SSTable set and determine the highest flushed LSN.
- Open the WAL. Seek to the first record with LSN > flushed_lsn.
- Replay WAL records. For each valid record, apply it to a fresh memtable:
  - Put → insert the key-value pair
  - Delete → insert a tombstone
- Stop at corruption. If a record's checksum fails, stop replaying. That record and everything after it are discarded.
- Resume normal operation. The memtable now contains all committed-but-unflushed operations.
Recovery timeline:
[SSTables on disk] [WAL on disk]
├─ flushed to LSN 1000 ─┤─ LSN 1001..1234 valid ─┤─ LSN 1235 corrupt ─┤
│ │
Replay into memtable Stop here
Crash During Compaction
If the engine crashes mid-compaction:
- Before manifest update: Old SSTables are still listed as active. Partially-written new SSTables are orphaned files. Recovery uses the old SSTables. Orphans are cleaned up.
- After manifest update: New SSTables are active. Old SSTables are marked for deletion. Recovery uses the new SSTables.
The manifest update is the atomicity boundary. The pattern: write data first, then atomically update metadata.
Crash During Flush
If the engine crashes mid-flush:
- Before manifest update: The partial SSTable is orphaned. The WAL still contains all records. Recovery replays the WAL.
- After manifest update: The SSTable is complete and active. WAL records up to the flushed LSN are redundant.
Orphan Cleanup
On startup, the engine scans the data directory for SSTable files not referenced by the manifest. These are orphans from interrupted compactions or flushes. They are deleted before normal operation begins.
Code Examples
LSM Engine Recovery
impl LsmEngine {
fn recover(db_path: &str) -> io::Result<Self> {
// Phase 1: Read manifest
let (sst_state, flushed_lsn) = ManifestReader::replay(
&format!("{}/MANIFEST", db_path)
)?;
eprintln!("Recovery: {} SSTables, flushed LSN = {}", sst_state.total_count(), flushed_lsn);
// Phase 2: Replay WAL
let mut memtable = MemTable::new(16 * 1024 * 1024);
let mut replayed = 0u64;
let mut next_lsn = flushed_lsn + 1;
if let Ok(mut reader) = WalReader::open(&format!("{}/WAL", db_path)) {
loop {
match reader.next_record() {
Ok(Some(record)) => {
if record.lsn <= flushed_lsn { continue; }
match record.rec_type {
WalRecordType::Put => memtable.put(record.key, record.value),
WalRecordType::Delete => memtable.delete(record.key),
}
next_lsn = record.lsn + 1;
replayed += 1;
}
Ok(None) => break,
Err(e) => {
eprintln!("Recovery: corrupt record ({}), {} replayed", e, replayed);
break;
}
}
}
}
eprintln!("Recovery: replayed {} WAL records", replayed);
// Phase 3: Clean up orphaned SSTables
cleanup_orphans(db_path, &sst_state)?;
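// Phase 4: Reopen the WAL and manifest for new writes (the WAL is opened in append mode)
let mut wal = WalWriter::open(&format!("{}/WAL", db_path))?;
// WalWriter::open starts its counter at 0. Continue the sequence instead, so
// new records get LSNs above everything already replayed or flushed.
wal.next_lsn = next_lsn;
let manifest = ManifestWriter::open(&format!("{}/MANIFEST", db_path))?;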
Ok(Self { active_memtable: memtable, immutable_memtables: Vec::new(),
sst_state, wal, manifest, next_lsn })
}
}
fn cleanup_orphans(db_path: &str, sst_state: &LsmState) -> io::Result<()> {
let active_ids: std::collections::HashSet<u64> = sst_state.all_sst_ids().collect();
for entry in std::fs::read_dir(db_path)? {
let entry = entry?;
let name = entry.file_name().to_string_lossy().to_string();
if name.starts_with("sst_") && name.ends_with(".dat") {
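// Skip files whose ID does not parse. Falling back to a sentinel ID would
// make them look like orphans and get them deleted.
let Ok(id) = name[4..name.len() - 4].parse::<u64>() else { continue };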
if !active_ids.contains(&id) {
eprintln!("Recovery: deleting orphaned SSTable {}", name);
std::fs::remove_file(entry.path())?;
}
}
}
Ok(())
}
The recovery procedure is deterministic: given the same manifest and WAL files, it always produces the same memtable state.
Key Takeaways
- LSM crash recovery is simpler than B-tree recovery because SSTables are immutable. The only volatile state to reconstruct is the memtable.
- The manifest tracks active SSTables and the flushed LSN. It is the recovery starting point and the atomicity boundary for compaction and flush operations.
- Recovery replays WAL records with LSN > flushed_lsn into a fresh memtable. Partial writes are detected by checksum failure and discarded.
- The pattern "write data, then atomically update metadata" applies to both flushes and compactions. The manifest update is the commit point.
- Orphan cleanup on startup removes SSTable files from interrupted operations that were never recorded in the manifest.
Lesson 3 — Checkpointing
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 10
Source note: This lesson was synthesized from training knowledge. Verify Petrov's fuzzy checkpoint algorithm and his WAL truncation semantics against Chapter 10.
Context
Without checkpointing, the WAL grows indefinitely. If the engine has been running for 24 hours with 100,000 TLE updates, the WAL contains 100,000 records — all of which must be scanned during recovery to find those with LSN > flushed_lsn. Recovery time grows linearly with WAL size. For a system that must return to service within seconds after a crash (the OOR's conjunction avoidance SLA requires <30s recovery), unbounded WAL growth is unacceptable.
A checkpoint snapshots the current LSM state and records the position where recovery should start. After a checkpoint, WAL records before the checkpoint position can be safely deleted, bounding both WAL size and recovery time.
Core Concepts
What a Checkpoint Records
A checkpoint writes the following to the manifest:
- Checkpoint LSN — the highest LSN that is fully durable (either in an SSTable or committed to the WAL and fsync'd at checkpoint time).
- Active SSTable list — every SSTable currently in the LSM state, with level assignments.
- Memtable status — the LSN range of the current active memtable (not yet flushed).
After the checkpoint, the WAL can be truncated up to the minimum recovery LSN: the smallest LSN that might still need replay. This is the lower bound of the active memtable's LSN range at checkpoint time.
Fuzzy Checkpoints
A sharp checkpoint freezes all writes, flushes the memtable, records the state, and then resumes. This guarantees that the checkpoint LSN is fully consistent — but it blocks writes for the duration of the flush (potentially seconds).
A fuzzy checkpoint avoids blocking writes:
- Record the current memtable's LSN range and the active SSTable list.
- Write this snapshot to the manifest.
- Continue accepting writes — the memtable keeps growing.
The tradeoff: recovery after a fuzzy checkpoint must replay WAL records from the memtable's start LSN (not the checkpoint LSN), because the memtable was not flushed at checkpoint time. Fuzzy checkpoints are faster (no flush) but result in slightly longer recovery (more WAL records to replay).
In the LSM architecture, fuzzy checkpoints are natural: the memtable flush is a form of checkpointing. Every time a memtable is flushed to an SSTable, the flushed LSN advances, and older WAL records become eligible for truncation. Explicit checkpoints are only needed if flush intervals are very long.
WAL Truncation
After a checkpoint (or flush), WAL records below the minimum recovery LSN can be deleted. Two strategies:
Segment-based: The WAL is split into fixed-size segments (e.g., 64MB files). A segment can be deleted when all its records have LSN < the minimum recovery LSN; a record at exactly the minimum recovery LSN may still need replay. Simple and efficient, because the filesystem handles cleanup.
Single-file with logical truncation: The WAL is one file. A "truncation point" is maintained in the manifest. On recovery, records before this point are skipped. The file is physically truncated (or rewritten) during periodic maintenance.
Segment-based is the standard approach (used by RocksDB, Kafka, PostgreSQL's WAL segments). It avoids the complexity of in-place truncation and enables simple space reclamation.
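A sketch of the eligibility check, assuming each segment's LSN range is tracked in memory. Segment and deletable_segments are hypothetical names, and excluding the active segment is an added safety assumption rather than something stated above.

```rust
/// Hypothetical per-segment metadata: the LSN range the segment file covers.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    id: u64,
    min_lsn: u64,
    max_lsn: u64,
}

/// IDs of segments that can be deleted: every record is strictly below the
/// minimum recovery LSN, and the segment is not the one being written.
fn deletable_segments(segments: &[Segment], min_recovery_lsn: u64, active_id: u64) -> Vec<u64> {
    segments
        .iter()
        .filter(|s| s.id != active_id && s.max_lsn < min_recovery_lsn)
        .map(|s| s.id)
        .collect()
}
```

A segment whose last record sits exactly at the minimum recovery LSN is retained, because that record may still need replay.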
Recovery Time Analysis
Recovery time = time to read manifest + time to replay WAL records.
Manifest replay is fast (typically <100 entries). WAL replay dominates: each record requires deserialization and a memtable insert. At ~1µs per memtable insert and 100,000 records to replay, recovery takes ~100ms for the replay phase.
Checkpointing bounds this: with checkpoints every 60 seconds and 3,000 writes/sec, the maximum WAL replay is ~180,000 records = ~180ms. Well within the OOR's 30-second recovery SLA.
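The estimate above is simple enough to encode and re-run with different inputs. The function names are hypothetical, and the 1µs per-insert figure is the assumption from the text, not a measurement.

```rust
use std::time::Duration;

/// Worst-case number of WAL records written between two checkpoints.
fn max_replay_records(checkpoint_interval: Duration, writes_per_sec: u64) -> u64 {
    checkpoint_interval.as_secs() * writes_per_sec
}

/// Replay time if every record costs `per_record` (deserialize + memtable insert).
fn replay_time(records: u64, per_record: Duration) -> Duration {
    let nanos = per_record.as_nanos() * u128::from(records);
    Duration::from_nanos(nanos as u64)
}
```

With a 60-second interval and 3,000 writes/sec this gives 180,000 records and 180ms of replay. The model ignores the time to read the WAL from disk, so measured recovery times will be somewhat higher.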
Coordinating Checkpoints with Compaction
Checkpoints and compaction both modify the manifest. To prevent conflicts:
- Acquire a manifest lock before writing a checkpoint or compaction result.
- Write the manifest entry.
- Release the lock.
The lock is held briefly (one file write + fsync), so contention is low. The manifest itself is append-only, so there are no conflicting edits — only the ordering of entries matters.
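The locking discipline above might look like the following sketch, which assumes the manifest is shared between a checkpoint thread and a compaction thread behind a Mutex. Manifest, ManifestRecord, and run_concurrent_writers are hypothetical, and the append is an in-memory push standing in for a file write plus fsync.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, Clone, PartialEq)]
enum ManifestRecord {
    Checkpoint { lsn: u64 },
    Compaction { output_sst: u64 },
}

#[derive(Default)]
struct Manifest {
    entries: Vec<ManifestRecord>,
}

impl Manifest {
    /// Append-only: a real implementation would write the entry and fsync here.
    fn append(&mut self, entry: ManifestRecord) {
        self.entries.push(entry);
    }
}

/// Two writers share one manifest; the lock is held only for each append.
fn run_concurrent_writers() -> Vec<ManifestRecord> {
    let manifest = Arc::new(Mutex::new(Manifest::default()));

    let m = Arc::clone(&manifest);
    let checkpointer = thread::spawn(move || {
        for lsn in [100, 200, 300] {
            m.lock().unwrap().append(ManifestRecord::Checkpoint { lsn });
        }
    });

    let m = Arc::clone(&manifest);
    let compactor = thread::spawn(move || {
        for output_sst in [7, 8] {
            m.lock().unwrap().append(ManifestRecord::Compaction { output_sst });
        }
    });

    checkpointer.join().unwrap();
    compactor.join().unwrap();

    let entries = manifest.lock().unwrap().entries.clone();
    entries
}
```

The interleaving of checkpoint and compaction entries varies from run to run, but each writer's own entries always appear in the order it wrote them, and no entry is lost or torn.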
Code Examples
Checkpoint Implementation
impl LsmEngine {
/// Write a fuzzy checkpoint to the manifest.
/// This records the current LSM state without blocking writes.
fn checkpoint(&mut self) -> io::Result<()> {
// Snapshot the current state
let active_ssts: Vec<SstMeta> = self.sst_state.all_sstables().cloned().collect();
let memtable_min_lsn = self.active_memtable_min_lsn();
// saturating_sub avoids an underflow panic if no record has been written yet
let checkpoint_lsn = self.next_lsn.saturating_sub(1);
// Write checkpoint to manifest
self.manifest.write_checkpoint(CheckpointEntry {
checkpoint_lsn,
memtable_min_lsn,
active_sstables: active_ssts.iter().map(|s| s.id).collect(),
})?;
self.manifest.sync()?;
eprintln!(
"Checkpoint at LSN {}: {} active SSTables, \
WAL replay starts at LSN {}",
checkpoint_lsn, active_ssts.len(), memtable_min_lsn,
);
// Truncate WAL segments that are fully below memtable_min_lsn
self.wal.truncate_before(memtable_min_lsn)?;
Ok(())
}
fn active_memtable_min_lsn(&self) -> u64 {
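// The earliest LSN in the active memtable is the minimum
// recovery point: WAL records before this are redundant.
// If the memtable is empty, use the flushed LSN.
// This assumes no immutable memtables are waiting to be flushed;
// if any are, the minimum must also cover their LSN ranges.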
self.active_memtable
.min_lsn()
.unwrap_or(self.sst_state.flushed_lsn())
}
}
The checkpoint does not flush the memtable — it records where the memtable starts (min LSN) so recovery knows where to begin WAL replay. This is the "fuzzy" part: writes continue during and after the checkpoint, but the WAL truncation point is safely advanced.
WAL Segment Manager
/// Manages WAL as a series of fixed-size segments for clean truncation.
struct WalSegmentManager {
dir: String,
segment_size: usize,
active_segment: WalWriter,
active_segment_id: u64,
}
impl WalSegmentManager {
/// Truncate (delete) all WAL segments whose max LSN is below the given LSN.
fn truncate_before(&mut self, min_recovery_lsn: u64) -> io::Result<()> {
let entries = std::fs::read_dir(&self.dir)?;
for entry in entries {
let entry = entry?;
let name = entry.file_name().to_string_lossy().to_string();
if name.starts_with("wal_") && name.ends_with(".log") {
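// Parse segment ID from filename: wal_000042.log → 42.
// Skip names that do not parse. A u64::MAX fallback would overflow in
// segment_max_lsn and could cause the file to be deleted.
let Ok(seg_id) = name[4..name.len() - 4].parse::<u64>() else { continue };
// Never delete the segment that is still being written.
if seg_id >= self.active_segment_id { continue; }
// Each segment covers a known LSN range.
// Conservative: only delete if segment_max_lsn < min_recovery_lsn
if self.segment_max_lsn(seg_id) < min_recovery_lsn {
std::fs::remove_file(entry.path())?;
eprintln!("WAL: deleted segment {}", name);
}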
}
}
Ok(())
}
fn segment_max_lsn(&self, segment_id: u64) -> u64 {
// In practice, track this in memory or in the segment header.
// Simplified: assume segments hold a known max number of records.
(segment_id + 1) * (self.segment_size as u64 / 103) // ~records per segment
}
}
Segment-based truncation is simple: delete files whose records are all below the recovery starting point. No in-place file modification, no complex bookkeeping. The filesystem handles space reclamation.
Key Takeaways
- Checkpoints bound WAL size and recovery time by recording a recovery starting point. Without checkpoints, the WAL grows indefinitely and recovery scans the entire log.
- Fuzzy checkpoints avoid blocking writes — they snapshot the LSM state without flushing the memtable. The tradeoff is slightly longer recovery (WAL replay from the memtable's start LSN, not the checkpoint LSN).
- In an LSM engine, every memtable flush is implicitly a checkpoint — it advances the flushed LSN and makes earlier WAL records eligible for truncation.
- WAL segments (fixed-size log files) enable clean truncation by deleting entire segment files. This is simpler and more efficient than truncating a single growing file.
- Recovery time for the OOR: ~180ms worst case with 60-second checkpoint intervals and 3,000 writes/sec. Well within the 30-second conjunction avoidance SLA.
Project — Durable TLE Update Pipeline
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Track: Orbital Object Registry
Estimated effort: 6–8 hours
SDA Incident Report — OOR-2026-0045
Classification: ENGINEERING DIRECTIVE
Subject: Add WAL-based durability to the LSM storage engine
Ref: OOR-2026-0045 (data loss incident after PDU failure)
Integrate a write-ahead log into the Module 3 LSM engine. Every mutation must be logged before it enters the memtable. The engine must recover to a consistent state after simulated crashes.
Acceptance Criteria
- WAL write path. Every put and delete call appends a checksummed record to the WAL before modifying the memtable. Verify by inspecting the WAL file after 1,000 inserts.
- Clean recovery. Insert 5,000 records, gracefully shut down, then recover. All 5,000 records must be accessible after recovery.
- Crash recovery. Insert 5,000 records. Simulate a crash by calling std::process::abort() (or simply skipping the shutdown routine). Restart and recover. Records up to the last fsync'd batch must be accessible. Report how many records were recovered vs. the expected count.
- Crash during flush. Insert records until a memtable flush is triggered. Simulate a crash mid-flush (after writing the SSTable but before updating the manifest). Recover and verify all data is intact: the orphaned SSTable is ignored, and the WAL is replayed to reconstruct the memtable.
- WAL truncation. After recovery, trigger a flush and checkpoint. Verify the WAL is truncated: old segments are deleted, and the remaining WAL contains only records above the flushed LSN.
- Recovery time. Measure recovery time for WAL sizes of 10,000, 50,000, and 100,000 records. Report the time for each. Target: recovery < 500ms for 100,000 records.
- Manifest correctness. After multiple flush/compaction/checkpoint cycles, recover the engine and verify the manifest correctly reports the active SSTable set and flushed LSN.
Starter Structure
durable-pipeline/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs acceptance criteria
│ ├── wal.rs # WalWriter, WalReader, WalSegmentManager
│ ├── manifest.rs # ManifestWriter, ManifestReader, checkpoint
│ ├── lsm.rs # LsmEngine with WAL integration and recovery
│ ├── memtable.rs # Reuse from Module 3
│ ├── sstable.rs # Reuse from Module 3
│ ├── bloom.rs # Reuse from Module 3
│ └── compaction.rs # Reuse from Module 3
Hints
Hint 1 — Simulating a crash
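The simplest crash simulation: after writing N records, drop the LsmEngine without calling any shutdown method, then construct a new engine with LsmEngine::recover(). Note that BufWriter flushes its buffer when dropped, so a plain drop still hands unsynced records to the OS; to lose them as a real crash would, call std::process::abort() or std::mem::forget the engine. Alternatively, write to a temporary directory, copy/rename files to simulate partial state, and then recover from the copy.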
Hint 2 — Manifest format
Keep the manifest simple: a sequence of newline-delimited JSON records. Each record is either {"type": "add_sst", "id": 42, "level": 1, "flushed_lsn": 5000} or {"type": "remove_sst", "ids": [31, 32, 33]} or {"type": "checkpoint", "lsn": 10000, "active_ssts": [42, 43]}. Parse with serde_json or manual string parsing.
Hint 3 — Crash-during-flush simulation
Write the SSTable file, then abort before writing the manifest entry. On recovery, the manifest doesn't list the SSTable. Scan the data directory for SSTable files not in the manifest and delete them (orphan cleanup). Replay the WAL to reconstruct the memtable.
What Comes Next
Module 5 (Transactions & Isolation) adds MVCC support — concurrent readers see consistent snapshots while writers continue modifying the database. The WAL and manifest from this module provide the durability foundation that MVCC transactions depend on.
Module 05 — Transactions & Isolation
Track: Database Internals — Orbital Object Registry
Position: Module 5 of 6
Source material: Database Internals — Alex Petrov, Chapters 12–13; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7; Mini-LSM Week 3
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0046
Classification: DATA ANOMALY
Subject: Conjunction query returned stale TLE data during concurrent catalog update
At 14:22 UTC, a conjunction assessment for NORAD 43013 used TLE epoch 2026-084.2 while a concurrent bulk update was writing epoch 2026-084.7 for the same object. The assessment computed a miss distance of 2.3km using the stale epoch. The updated epoch would have yielded 0.8km, below the avoidance maneuver threshold. The conjunction alert was delayed by 4 minutes until the next assessment cycle picked up the updated TLE.
Root cause: The LSM engine provides no isolation between concurrent readers and writers. A long-running conjunction query can read a mix of old and new TLE versions, producing inconsistent results.
Directive: Implement multi-version concurrency control (MVCC) with snapshot isolation. Every conjunction query must see a consistent snapshot of the catalog — either entirely before or entirely after any concurrent update.
Learning Outcomes
After completing this module, you will be able to:
- Explain the ACID properties and which guarantees are provided by each isolation level
- Implement two-phase locking (2PL) and explain why it prevents all anomalies but limits concurrency
- Implement MVCC snapshot isolation in an LSM engine using timestamped keys
- Explain write skew and why snapshot isolation does not prevent it
- Design a garbage collection strategy for old MVCC versions
- Reason about the tradeoff between isolation level and concurrent throughput
Lesson Summary
Lesson 1 — ACID Properties and Isolation Levels
What Atomicity, Consistency, Isolation, and Durability mean concretely. The isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) and which anomalies each prevents.
Lesson 2 — Two-Phase Locking (2PL)
Lock-based concurrency control. Shared and exclusive locks, the growing and shrinking phases, strict 2PL, and deadlock detection. Why 2PL is correct but limits throughput.
Lesson 3 — MVCC and Snapshot Isolation
Multi-version concurrency control: storing multiple versions of each key with timestamps. Snapshot reads, write conflicts, and garbage collection. Adapted for the LSM architecture using timestamped keys (following Mini-LSM Week 3).
Capstone Project — Conjunction Query Engine with MVCC Snapshot Reads
Add MVCC snapshot isolation to the LSM engine. Concurrent conjunction queries see consistent catalog snapshots. Concurrent writers do not block readers. Full brief in project-conjunction-engine.md.
File Index
module-05-transactions-isolation/
├── README.md
├── lesson-01-acid-isolation.md
├── lesson-01-quiz.toml
├── lesson-02-two-phase-locking.md
├── lesson-02-quiz.toml
├── lesson-03-mvcc-snapshots.md
├── lesson-03-quiz.toml
└── project-conjunction-engine.md
Prerequisites
- Module 4 (WAL & Recovery) completed
What Comes Next
Module 6 (Query Processing) adds structured query execution on top of the transactional storage engine — the volcano iterator model, vectorized execution, and join algorithms.
Lesson 1 — ACID Properties and Isolation Levels
Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 12; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Source note: This lesson was synthesized from training knowledge. Verify Kleppmann's isolation level taxonomy and anomaly definitions against Chapter 7.
Context
The OOR's LSM engine from Modules 3–4 provides durable, crash-recoverable storage for TLE records. But it offers no guarantees about what happens when multiple operations execute concurrently. A conjunction query reading NORAD 43013's TLE while a bulk update is overwriting it can see a partially-updated record, or a mix of old and new versions across different objects. The result is a conjunction assessment computed against a catalog state that never actually existed.
Transactions are the abstraction that prevents this. A transaction groups multiple operations into a single atomic, isolated unit. The ACID properties define what "correct" means for transactions, and the isolation level determines how strictly concurrent transactions are separated.
Core Concepts
ACID Properties
Atomicity: All operations in a transaction succeed or all fail. If a bulk TLE update covers 500 objects and fails on object 347, the first 346 updates are rolled back. The catalog is never left in a partially-updated state.
Consistency: The database moves from one valid state to another. Application-level invariants (e.g., every NORAD ID is unique, every TLE has a valid epoch) are preserved across transactions. Consistency is primarily the application's responsibility: the database can enforce only the invariants it has been told about, in the form of constraints.
Isolation: Concurrent transactions appear to execute serially. A conjunction query running alongside a bulk update sees either the entirely pre-update or entirely post-update catalog — never a mix. The isolation level determines how strictly this is enforced.
Durability: Once a transaction commits, its effects survive crashes. This is the WAL's job (Module 4).
Isolation Levels and Anomalies
Each isolation level prevents a specific set of anomalies — situations where concurrent execution produces results that no serial execution could produce.
Read Uncommitted: No isolation. A transaction can read another transaction's uncommitted writes. Vulnerable to dirty reads (reading data that may be rolled back).
Read Committed: A transaction only sees committed data. Prevents dirty reads. Still vulnerable to non-repeatable reads (reading the same key twice and getting different values because another transaction committed between the two reads).
Repeatable Read / Snapshot Isolation: A transaction sees a consistent snapshot taken at transaction start. Prevents dirty reads and non-repeatable reads. Still vulnerable to write skew (two transactions read overlapping data, make disjoint writes, and produce a state that neither would have produced alone).
Serializable: Full isolation — concurrent transactions produce results equivalent to some serial ordering. Prevents all anomalies including write skew. Most expensive to enforce.
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
|---|---|---|---|---|
| Read Uncommitted | ✗ | ✗ | ✗ | ✗ |
| Read Committed | ✓ | ✗ | ✗ | ✗ |
| Repeatable Read | ✓ | ✓ | ✗/✓ | ✗ |
| Snapshot Isolation | ✓ | ✓ | ✓ | ✗ |
| Serializable | ✓ | ✓ | ✓ | ✓ |
✓ = prevented, ✗ = possible
For the OOR, snapshot isolation is the practical target: conjunction queries need a consistent view of the catalog (preventing dirty reads, non-repeatable reads, and phantoms), but full serializability's overhead is unnecessary for a read-dominated workload.
Write Skew: The Anomaly Snapshot Isolation Misses
Two conjunction analysts each read that the other is on duty. Both decide to go off-duty simultaneously, leaving no one on watch. Each transaction's writes are consistent with its own read snapshot, but the combined result violates the invariant "at least one analyst on duty."
In the OOR context: two concurrent TLE update transactions each read that a different ground station is providing TLE data for NORAD 25544. Each decides to delete the other station's TLE (deduplication). Result: both TLEs are deleted, and the object has no TLE data. Each transaction saw a valid state, but the combined result is invalid.
Snapshot isolation does not prevent this because it detects only write-write conflicts, and the two transactions write disjoint keys. Each one invalidates something the other read, but neither overwrites the other's write. The conflict is at the application invariant level, not the write-set level. Preventing write skew requires serializable isolation (2PL or SSI).
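The on-duty example can be reproduced with a toy model of snapshot isolation. Everything here is hypothetical: Txn copies the whole catalog as its snapshot, and commit_all applies a first-committer-wins check on write sets only, which is the check snapshot isolation performs.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Catalog = BTreeMap<&'static str, bool>; // analyst -> on duty?

struct Txn {
    snapshot: Catalog,
    writes: BTreeMap<&'static str, bool>,
}

impl Txn {
    fn begin(db: &Catalog) -> Self {
        Txn { snapshot: db.clone(), writes: BTreeMap::new() }
    }

    /// Go off duty only if the snapshot shows someone else still on duty.
    fn go_off_duty(&mut self, who: &'static str) {
        let others_on_duty = self
            .snapshot
            .iter()
            .filter(|(name, on)| **name != who && **on)
            .count();
        if others_on_duty >= 1 {
            self.writes.insert(who, false);
        }
    }
}

/// Commit in order; abort a transaction only on a write-write conflict.
/// Returns the number of transactions that committed.
fn commit_all(db: &mut Catalog, txns: Vec<Txn>) -> usize {
    let mut written: BTreeSet<&'static str> = BTreeSet::new();
    let mut committed = 0;
    for txn in txns {
        if txn.writes.keys().any(|k| written.contains(k)) {
            continue; // overlapping write sets: abort
        }
        for (key, value) in txn.writes {
            written.insert(key);
            db.insert(key, value);
        }
        committed += 1;
    }
    committed
}
```

Both transactions commit because their write sets are disjoint, and the catalog ends with nobody on duty. A serializable scheduler would abort one of them, since each read a value the other changed.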
Code Examples
Transaction Interface for the OOR
/// A transaction handle that provides snapshot isolation.
struct Transaction {
/// Snapshot timestamp — all reads see data as of this moment.
read_ts: u64,
/// Commit timestamp — assigned at commit time.
write_ts: Option<u64>,
/// Buffered writes — applied to the engine only on commit.
write_set: Vec<(Vec<u8>, Option<Vec<u8>>)>,
}
impl Transaction {
fn begin(engine: &LsmEngine) -> Self {
Self {
read_ts: engine.current_timestamp(),
write_ts: None,
write_set: Vec::new(),
}
}
/// Read a key as of this transaction's snapshot timestamp.
fn get(&self, key: &[u8], engine: &LsmEngine) -> io::Result<Option<Vec<u8>>> {
// Read the version of the key that was committed at or before read_ts
engine.get_at_timestamp(key, self.read_ts)
}
/// Buffer a write (applied on commit).
fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
self.write_set.push((key, Some(value)));
}
fn delete(&mut self, key: Vec<u8>) {
self.write_set.push((key, None));
}
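/// Commit the transaction: apply all buffered writes under one commit timestamp.
/// This sketch omits the write-write conflict check that snapshot isolation
/// requires, and it does not undo writes already applied if a later one fails.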
fn commit(mut self, engine: &mut LsmEngine) -> io::Result<()> {
let write_ts = engine.next_timestamp();
self.write_ts = Some(write_ts);
// Apply all writes with the commit timestamp
for (key, value) in self.write_set {
match value {
Some(val) => engine.put_with_ts(&key, &val, write_ts)?,
None => engine.delete_with_ts(&key, write_ts)?,
}
}
Ok(())
}
}
The key insight: reads use the read_ts (taken at transaction start), so they always see a consistent snapshot. Writes are buffered and applied atomically with a write_ts (taken at commit time). Other transactions that started before write_ts will not see these writes — they read at their own read_ts.
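To make the read path concrete, here is a sketch of what get_at_timestamp could do over an in-memory map keyed by (key, commit timestamp). VersionedStore is a hypothetical stand-in for the engine, not the LSM implementation; it only illustrates the "latest version at or before read_ts" rule.

```rust
use std::collections::BTreeMap;

/// Hypothetical multi-version store: one entry per (key, commit timestamp).
/// A value of None is a tombstone.
#[derive(Default)]
struct VersionedStore {
    versions: BTreeMap<(Vec<u8>, u64), Option<Vec<u8>>>,
    clock: u64,
}

impl VersionedStore {
    fn current_timestamp(&self) -> u64 {
        self.clock
    }

    /// Apply a write set under a single new commit timestamp.
    fn commit(&mut self, writes: Vec<(Vec<u8>, Option<Vec<u8>>)>) -> u64 {
        self.clock += 1;
        let ts = self.clock;
        for (key, value) in writes {
            self.versions.insert((key, ts), value);
        }
        ts
    }

    /// Latest version of `key` committed at or before `read_ts`.
    fn get_at_timestamp(&self, key: &[u8], read_ts: u64) -> Option<Vec<u8>> {
        let lo = (key.to_vec(), 0u64);
        let hi = (key.to_vec(), read_ts);
        self.versions
            .range(lo..=hi)
            .next_back() // highest timestamp within the range
            .and_then(|(_, value)| value.clone())
    }
}
```

A reader that captured its timestamp before a later commit keeps seeing the older version, while a reader that starts afterwards sees the new one. Neither blocks the writer.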
Key Takeaways
- ACID properties define transaction correctness. Atomicity (all-or-nothing), Isolation (concurrent transactions don't interfere), and Durability (committed data survives crashes) are the storage engine's responsibility. Consistency is primarily the application's.
- Snapshot isolation gives each transaction a consistent view of the database as of its start time. This prevents dirty reads, non-repeatable reads, and phantom reads — sufficient for the OOR's conjunction query workload.
- Write skew is the anomaly that snapshot isolation misses. It occurs when two transactions read overlapping data and write disjoint keys, producing a combined result that neither would have produced alone.
- The transaction interface separates the read path (snapshot at read_ts) from the write path (buffered, applied at write_ts). This is the foundation for MVCC (Lesson 3).
Lesson 2 — Two-Phase Locking (2PL)
Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 12; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Source note: This lesson was synthesized from training knowledge. Verify Petrov's 2PL description and deadlock detection algorithms against Chapter 12.
Context
Before MVCC became dominant, lock-based concurrency control was the standard approach to transaction isolation. Two-Phase Locking (2PL) is the classical protocol: transactions acquire locks before accessing data, and release them only after completing all operations. The "two phases" are the growing phase (acquiring locks) and the shrinking phase (releasing locks). A transaction never acquires a new lock after releasing any lock.
2PL provides full serializability — the strongest isolation level. But it comes at a cost: writers block readers, readers block writers, and concurrent throughput drops significantly under contention. For the OOR, where conjunction queries must not block TLE ingestion, 2PL's blocking behavior is problematic. Understanding 2PL is essential context for appreciating why MVCC (Lesson 3) is the preferred approach for read-heavy workloads.
Core Concepts
Lock Types
Shared lock (S): Allows the holder to read the locked resource. Multiple transactions can hold shared locks on the same resource simultaneously. Shared locks prevent writes but allow concurrent reads.
Exclusive lock (X): Allows the holder to read and write the locked resource. Only one transaction can hold an exclusive lock at a time. Exclusive locks block both reads and writes from other transactions.
Compatibility matrix:
| | S held | X held |
|---|---|---|
| S requested | ✓ grant | ✗ wait |
| X requested | ✗ wait | ✗ wait |
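The matrix can be written as a single function. This is a minimal sketch: `LockMode` mirrors the enum used in this lesson's lock manager, and `is_compatible` is a hypothetical helper name, not part of any existing Meridian API.

```rust
/// Lock modes from the compatibility matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LockMode {
    Shared,
    Exclusive,
}

/// Returns true if a lock in `requested` mode can be granted while another
/// transaction already holds the resource in `held` mode.
fn is_compatible(held: LockMode, requested: LockMode) -> bool {
    // Only shared-on-shared is compatible. Any pairing that involves an
    // exclusive lock forces the requester to wait.
    matches!((held, requested), (LockMode::Shared, LockMode::Shared))
}
```

A lock manager would call this once per current holder and grant the request only if every call returns true.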
The Two Phases
Growing phase: The transaction acquires locks as needed (shared for reads, exclusive for writes). It never releases any lock during this phase.
Shrinking phase: After the transaction decides to commit (or abort), it releases all locks. Once any lock is released, no new locks can be acquired.
Strict 2PL is the common variant: all locks are held until the transaction commits or aborts, then released together. The shrinking phase collapses to a single point at the end of the transaction. This prevents cascading aborts (where one transaction's abort forces other transactions that read its uncommitted data to also abort).
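The two-phase rule is easy to enforce with a small amount of per-transaction state. The sketch below is illustrative only: `TxnLocks`, `Phase`, and `release_all` are hypothetical names, and the actual lock acquisition against a lock manager is left out.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Phase {
    Growing,
    Shrinking,
}

/// Hypothetical per-transaction bookkeeping for the two-phase rule.
struct TxnLocks {
    phase: Phase,
    held: HashSet<Vec<u8>>,
}

impl TxnLocks {
    fn new() -> Self {
        TxnLocks { phase: Phase::Growing, held: HashSet::new() }
    }

    /// Record a newly acquired lock. Fails once the transaction has started
    /// releasing, because 2PL forbids acquiring after any release.
    fn acquire(&mut self, key: &[u8]) -> Result<(), &'static str> {
        if self.phase == Phase::Shrinking {
            return Err("2PL violation: acquire after release");
        }
        self.held.insert(key.to_vec());
        Ok(())
    }

    /// Strict 2PL: release everything at once at commit or abort.
    /// Returns the keys whose locks the caller should now release.
    fn release_all(&mut self) -> Vec<Vec<u8>> {
        self.phase = Phase::Shrinking;
        self.held.drain().collect()
    }
}
```

Because `release_all` is the only way to release, this sketch can only express the strict variant; a non-strict 2PL implementation would expose a per-key release that also flips the phase.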
Deadlocks
Two transactions can deadlock if each holds a lock the other needs:
- Transaction A holds an exclusive lock on NORAD 25544 and requests a shared lock on NORAD 43013.
- Transaction B holds an exclusive lock on NORAD 43013 and requests a shared lock on NORAD 25544.
- Neither can proceed. Both are stuck waiting.
Detection strategies:
- Timeout: If a lock wait exceeds a threshold, abort the transaction and retry. Simple but imprecise — the timeout may be too long (wasted time) or too short (false positives).
- Wait-for graph: Maintain a directed graph of which transactions are waiting for which. A cycle in the graph indicates a deadlock. Abort one transaction in the cycle (typically the youngest or the one with the least work done).
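Wait-for graph detection reduces to a cycle check. The following is a minimal sketch using an iterative depth-first search; `WaitForGraph`, `add_wait`, and `in_deadlock` are hypothetical names, and victim selection is not modeled.

```rust
use std::collections::{HashMap, HashSet};

/// Wait-for graph: an edge `a -> b` means transaction `a` is waiting for a
/// lock held by transaction `b`.
struct WaitForGraph {
    edges: HashMap<u64, Vec<u64>>,
}

impl WaitForGraph {
    fn new() -> Self {
        WaitForGraph { edges: HashMap::new() }
    }

    fn add_wait(&mut self, waiter: u64, holder: u64) {
        self.edges.entry(waiter).or_default().push(holder);
    }

    /// Returns true if `start` can reach itself by following wait edges,
    /// i.e. it is part of a cycle and therefore deadlocked.
    fn in_deadlock(&self, start: u64) -> bool {
        let mut stack = vec![start];
        let mut visited = HashSet::new();
        while let Some(txn) = stack.pop() {
            if let Some(holders) = self.edges.get(&txn) {
                for &next in holders {
                    if next == start {
                        return true; // Found a path back to the start
                    }
                    if visited.insert(next) {
                        stack.push(next);
                    }
                }
            }
        }
        false
    }
}
```

With the example above, adding the edges A → B and B → A makes `in_deadlock` return true for both transactions. A transaction that merely waits on a deadlocked one is not reported, since it is not itself on the cycle.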
2PL Performance Characteristics
Under low contention (few transactions accessing the same keys), 2PL performs well — most lock requests are granted immediately. Under high contention (many transactions accessing overlapping keys), performance degrades:
- Writers block readers: a bulk TLE update holding exclusive locks on 500 objects blocks all conjunction queries that need any of those objects.
- Lock overhead: acquiring and releasing locks, checking the wait-for graph, and managing the lock table all consume CPU.
- Deadlock aborts: wasted work when a deadlock victim is rolled back and retried.
For the OOR's workload — frequent long-running read transactions (conjunction queries) alongside burst write transactions (TLE ingestion) — 2PL would cause conjunction queries to stall during every ingestion burst.
Code Examples
A Simple Lock Manager
use std::collections::HashMap;
use std::sync::{Mutex, Condvar};
#[derive(Debug, Clone, Copy, PartialEq)]
enum LockMode { Shared, Exclusive }
struct LockEntry {
mode: LockMode,
holders: Vec<u64>, // Transaction IDs holding this lock
wait_queue: Vec<(u64, LockMode)>, // Transactions waiting for this lock
}
struct LockManager {
locks: Mutex<HashMap<Vec<u8>, LockEntry>>,
cond: Condvar,
}
impl LockManager {
fn acquire(&self, txn_id: u64, key: &[u8], mode: LockMode) -> bool {
let mut locks = self.locks.lock().unwrap();
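loop {
let entry = locks.entry(key.to_vec()).or_insert_with(|| LockEntry {
mode: LockMode::Shared,
holders: Vec::new(),
wait_queue: Vec::new(),
});
// release() may have handed us the lock while we were blocked.
// (Upgrading a held shared lock to exclusive is not handled here.)
let already_held = entry.holders.contains(&txn_id)
&& (entry.mode == LockMode::Exclusive || mode == LockMode::Shared);
if already_held {
return true;
}
let can_grant = match (mode, entry.holders.is_empty()) {
(_, true) => true, // No holders: any mode is fine
(LockMode::Shared, false) => entry.mode == LockMode::Shared,
(LockMode::Exclusive, false) => false,
};
if can_grant {
entry.mode = mode;
entry.holders.push(txn_id);
// Drop any stale queue entry left over from an earlier wait
entry.wait_queue.retain(|&(id, _)| id != txn_id);
return true;
}
// Cannot grant: enqueue once, not again on every wakeup
if !entry.wait_queue.iter().any(|&(id, _)| id == txn_id) {
entry.wait_queue.push((txn_id, mode));
}
// Block until notified (simplified; a real implementation checks for deadlock)
locks = self.cond.wait(locks).unwrap();
}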
}
fn release(&self, txn_id: u64, key: &[u8]) {
let mut locks = self.locks.lock().unwrap();
if let Some(entry) = locks.get_mut(key) {
entry.holders.retain(|&id| id != txn_id);
if entry.holders.is_empty() {
// Grant to the first waiter
if let Some((waiter_id, waiter_mode)) = entry.wait_queue.first().copied() {
entry.holders.push(waiter_id);
entry.mode = waiter_mode;
entry.wait_queue.remove(0);
}
}
self.cond.notify_all();
}
}
}
This simplified lock manager illustrates the core mechanics. Production lock managers use per-key condition variables (not a single global one), hash-based lock tables for O(1) lookup, and wait-for graph tracking for deadlock detection.
Key Takeaways
- Two-phase locking provides serializable isolation by ensuring transactions acquire all locks before releasing any. Strict 2PL holds all locks until commit.
- 2PL's blocking behavior is the fundamental problem: writers block readers and readers block writers. For read-heavy workloads like conjunction queries, this creates unacceptable stalls during concurrent writes.
- Deadlocks are an inherent risk of lock-based concurrency. Detection via wait-for graphs and resolution via transaction abort are standard but add overhead and wasted work.
- 2PL is still used in some systems (MySQL/InnoDB for certain isolation levels, distributed databases for coordination). Understanding it provides essential context for why MVCC is preferred.
Lesson 3 — MVCC and Snapshot Isolation
Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 13; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7; Mini-LSM Week 3
Source note: This lesson was synthesized from training knowledge and the Mini-LSM Week 3 MVCC chapters. Verify Petrov's MVCC version chain description and Kleppmann's snapshot isolation anomaly analysis.
Context
MVCC solves the fundamental problem of 2PL — writers blocking readers — by keeping multiple versions of each key. Instead of locking a key and making other transactions wait, the engine stores every version alongside a timestamp. Readers pick the version that matches their snapshot timestamp; writers create new versions without disturbing old ones. Readers never block writers. Writers never block readers. The only conflict is writer-writer on the same key.
For the LSM architecture, MVCC is a natural fit. The LSM already stores sorted key-value pairs — extending keys to include a timestamp is a straightforward encoding change. Mini-LSM's Week 3 implements exactly this: the key format changes from user_key to (user_key, timestamp), where newer timestamps sort first. A snapshot read at timestamp T scans for the first version of each key with timestamp ≤ T.
Core Concepts
Timestamped Keys in LSM
The MVCC key format encodes the user key and a commit timestamp into a single sortable byte string:
MVCC key = [user_key_bytes] [timestamp as big-endian u64, inverted]
The timestamp is stored as big-endian and bitwise inverted (XOR with u64::MAX) so that newer timestamps sort before older ones in the LSM's byte-order comparison. This means a scan for key "NORAD-25544" encounters the newest version first — exactly what a snapshot read needs.
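The inversion can be checked directly. This short sketch computes the 8-byte suffix for a timestamp and sorts timestamps by that suffix; `inverted_ts_bytes` and `sort_newest_first` are hypothetical helper names used only for illustration.

```rust
/// The 8-byte suffix appended to a user key: bitwise NOT of the timestamp,
/// in big-endian order so that byte-wise comparison matches numeric order.
fn inverted_ts_bytes(ts: u64) -> [u8; 8] {
    (!ts).to_be_bytes()
}

/// Sort timestamps the way the LSM would order their encoded suffixes.
fn sort_newest_first(timestamps: &mut Vec<u64>) {
    // Arrays compare lexicographically, which is exactly byte order.
    timestamps.sort_by_key(|&ts| inverted_ts_bytes(ts));
}
```

Sorting `[50, 110, 80]` this way yields `[110, 80, 50]`, and the suffix for `ts=110` ends in `0x91` because `!0x6E == 0x91` in the low byte.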
Example ordering for key "NORAD-25544":
"NORAD-25544" | ts=110 (inverted: 0xFFFFFFFFFFFFFF91) ← newest, sorts first
"NORAD-25544" | ts=80 (inverted: 0xFFFFFFFFFFFFFFAF)
"NORAD-25544" | ts=50 (inverted: 0xFFFFFFFFFFFFFFCD) ← oldest, sorts last
Snapshot Read
A transaction with read_ts = 100 reading key K:
- Seek to the first MVCC key with prefix K.
- Scan versions from newest to oldest.
- Return the first version with commit_ts ≤ read_ts.
- If the first matching version is a tombstone, the key is deleted in this snapshot: return None.
This is O(V) where V is the number of versions of the key. In practice V is small (1–5 for most keys) because compaction garbage-collects old versions.
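The read procedure can be tried out against an in-memory stand-in for the LSM. This is a sketch under stated assumptions: a `BTreeMap` plays the role of the merged memtable and SSTables, `VersionStore` and `snapshot_get` are hypothetical names, and the seek goes straight to `(K, read_ts)` rather than to the newest version, which skips the newer versions in one step.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the LSM: MVCC key -> value, None is a tombstone.
type VersionStore = BTreeMap<Vec<u8>, Option<Vec<u8>>>;

/// User key followed by the inverted big-endian timestamp.
fn encode_mvcc_key(user_key: &[u8], ts: u64) -> Vec<u8> {
    let mut key = user_key.to_vec();
    key.extend_from_slice(&(!ts).to_be_bytes());
    key
}

/// Return the value of `user_key` as seen by a snapshot at `read_ts`.
fn snapshot_get(store: &VersionStore, user_key: &[u8], read_ts: u64) -> Option<Vec<u8>> {
    // Every version with ts <= read_ts sorts at or after this key.
    let start = encode_mvcc_key(user_key, read_ts);
    // ts = 0 is the oldest possible version, so it bounds the scan.
    let end = encode_mvcc_key(user_key, 0);
    store
        .range(start..=end)
        // A longer user key sharing this prefix can fall inside the range;
        // only entries of exactly user_key.len() + 8 bytes belong to us.
        .find(|(k, _)| k.len() == user_key.len() + 8)
        // The first hit is the newest visible version. A tombstone (None)
        // flattens to None, meaning "deleted in this snapshot".
        .and_then(|(_, value)| value.clone())
}
```

With versions at ts=50, 80, and 110, a read at `read_ts = 100` returns the ts=80 value, and a read at `read_ts = 49` returns `None` because no version existed yet.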
Write Path with Timestamps
When a transaction commits with write_ts = 105:
- For each key in the write set, create an MVCC key (user_key, 105).
- Write all MVCC key-value pairs to the memtable (via the WAL, as in Module 4).
- These versions become visible to any transaction with read_ts ≥ 105.
Write-write conflicts: if two transactions both write the same user key, the later commit must detect the conflict. Simple approach: check if any version of the key with commit_ts > txn.read_ts exists at commit time. If so, another transaction has written this key after our snapshot — abort and retry.
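The commit-time check can be sketched with an in-memory index. The assumptions here are loud ones: `CommitIndex` and `Txn` are hypothetical types, only the newest commit timestamp per key is tracked (enough to answer "does any version newer than read_ts exist?"), and the whole commit runs single-threaded.

```rust
use std::collections::HashMap;

/// Hypothetical index: user key -> newest commit timestamp for that key.
#[derive(Default)]
struct CommitIndex {
    latest_commit: HashMap<Vec<u8>, u64>,
    clock: u64, // Last timestamp handed out
}

struct Txn {
    read_ts: u64,
    write_keys: Vec<Vec<u8>>,
}

impl CommitIndex {
    fn begin(&self) -> Txn {
        Txn { read_ts: self.clock, write_keys: Vec::new() }
    }

    /// First committer wins: abort if any key in the write set was
    /// committed by someone else after this transaction's snapshot.
    fn commit(&mut self, txn: Txn) -> Result<u64, String> {
        for key in &txn.write_keys {
            if let Some(&ts) = self.latest_commit.get(key) {
                if ts > txn.read_ts {
                    return Err(format!(
                        "write-write conflict: key committed at ts={} > read_ts={}",
                        ts, txn.read_ts
                    ));
                }
            }
        }
        self.clock += 1;
        let write_ts = self.clock;
        for key in txn.write_keys {
            self.latest_commit.insert(key, write_ts);
        }
        Ok(write_ts)
    }
}
```

Two transactions that begin at the same snapshot and both write NORAD 25544 illustrate the rule: the first commit succeeds, the second returns an error and must retry with a fresh snapshot.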
Watermark and Garbage Collection
Old versions accumulate. If every write creates a new version, the database grows without bound. Garbage collection removes versions that are no longer visible to any active transaction.
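The watermark is the minimum read_ts among all active transactions. A version is safe to garbage-collect once a newer version of the same key exists with commit_ts ≤ watermark: every active transaction then reads that newer version (or a later one), so no active transaction can ever read the older one.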
Active transactions: read_ts = [100, 150, 200]
Watermark = 100
Key "NORAD-25544" versions:
ts=200 (value_v3) ← keep (above watermark; read by the read_ts=200 transaction)
ts=110 (value_v2) ← keep (above watermark; read by the read_ts=150 transaction)
ts=80 (value_v1) ← keep (newest version at or below the watermark; read by the read_ts=100 transaction)
ts=30 (value_v0) ← GARBAGE: superseded by ts=80, which is itself ≤ watermark
Only a transaction with read_ts between 30 and 79 could read v0. A watermark of 100 guarantees
that no active transaction has read_ts < 100, and new transactions start at later timestamps,
so v0 is safe to collect.
Garbage collection happens during compaction: when the merge iterator encounters multiple versions of the same user key, it keeps every version above the watermark, plus the newest version at or below the watermark (the one a transaction reading at exactly the watermark would see). All older versions are dropped.
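The retention rule fits in a few lines. This is a sketch, assuming the compaction iterator presents the commit timestamps of one user key newest first; `retain_versions` is a hypothetical name, and tombstone handling at the bottom level is not modeled.

```rust
/// Given the commit timestamps of one user key, sorted newest first, return
/// the timestamps that must survive compaction at the given watermark.
fn retain_versions(versions_newest_first: &[u64], watermark: u64) -> Vec<u64> {
    let mut kept = Vec::new();
    let mut kept_at_or_below = false;
    for &ts in versions_newest_first {
        if ts > watermark {
            // An active or future snapshot above the watermark may need it.
            kept.push(ts);
        } else if !kept_at_or_below {
            // Newest version visible to a reader at exactly the watermark.
            kept.push(ts);
            kept_at_or_below = true;
        }
        // Anything older is shadowed for every active transaction: drop it.
    }
    kept
}
```

For the example above, `retain_versions(&[200, 110, 80, 30], 100)` keeps 200, 110, and 80 and drops 30.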
Write Batch Atomicity
MVCC writes must be atomic — all keys in a transaction get the same write_ts, and they all become visible at once. In the LSM engine, this means all keys in a write batch are logged to the WAL as a single unit and inserted into the memtable together. The write_ts is assigned from a global monotonic counter protected by a mutex (as in Mini-LSM's approach).
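A minimal sketch of the timestamp assignment follows. `TimestampOracle` and `stamp_batch` are hypothetical names; a real engine would hold a commit lock across the WAL append and memtable insert as well, which this sketch leaves out.

```rust
use std::sync::Mutex;

/// Hypothetical global source of commit timestamps.
struct TimestampOracle {
    last: Mutex<u64>,
}

impl TimestampOracle {
    fn new() -> Self {
        TimestampOracle { last: Mutex::new(0) }
    }

    /// Hand out the next timestamp. The mutex makes the counter monotonic
    /// even when several committers call this concurrently.
    fn next_timestamp(&self) -> u64 {
        let mut last = self.last.lock().unwrap();
        *last += 1;
        *last
    }
}

/// Stamp every write in a batch with one shared commit timestamp.
/// A value of None represents a delete (tombstone).
fn stamp_batch(
    oracle: &TimestampOracle,
    writes: Vec<(Vec<u8>, Option<Vec<u8>>)>,
) -> Vec<(Vec<u8>, u64, Option<Vec<u8>>)> {
    let write_ts = oracle.next_timestamp();
    writes.into_iter().map(|(key, value)| (key, write_ts, value)).collect()
}
```

Because the timestamp is drawn once per batch rather than once per key, a snapshot read either sees all of the batch (read_ts ≥ write_ts) or none of it.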
Code Examples
MVCC Key Encoding
/// Encode a user key and timestamp into an MVCC key.
/// Timestamps are inverted so newer versions sort first.
fn encode_mvcc_key(user_key: &[u8], timestamp: u64) -> Vec<u8> {
let mut key = Vec::with_capacity(user_key.len() + 8);
key.extend_from_slice(user_key);
// Invert timestamp: newer (larger) timestamps become smaller bytes,
// sorting first in ascending byte order.
key.extend_from_slice(&(!timestamp).to_be_bytes());
key
}
/// Decode an MVCC key back into user key and timestamp.
fn decode_mvcc_key(mvcc_key: &[u8]) -> (&[u8], u64) {
let ts_start = mvcc_key.len() - 8;
let user_key = &mvcc_key[..ts_start];
let inverted_ts = u64::from_be_bytes(
mvcc_key[ts_start..].try_into().unwrap()
);
(user_key, !inverted_ts)
}
Snapshot Read with MVCC
impl LsmEngine {
/// Read the value of a key at the given snapshot timestamp.
fn get_at_timestamp(
&self,
user_key: &[u8],
read_ts: u64,
) -> io::Result<Option<Vec<u8>>> {
// Seek to the newest version of this key
let seek_key = encode_mvcc_key(user_key, u64::MAX);
// Create a merge iterator over memtable + SSTables
let mut iter = self.create_merge_iterator(&seek_key)?;
while let Some((mvcc_key, value)) = iter.next()? {
let (key, ts) = decode_mvcc_key(&mvcc_key);
// Stop if we've moved past this user key
if key != user_key {
return Ok(None);
}
// Skip versions newer than our snapshot
if ts > read_ts {
continue;
}
// This is the newest version visible to us
return match value {
Some(val) => Ok(Some(val)),
None => Ok(None), // Tombstone — key is deleted in this snapshot
};
}
Ok(None) // Key not found in any source
}
}
The seek to (user_key, u64::MAX) positions the iterator at the newest possible version of the key: u64::MAX is the largest timestamp, so its inverted suffix is eight 0x00 bytes, the smallest possible. The iterator then advances in key order, which is newest to oldest, until it finds a version with ts ≤ read_ts.
Key Takeaways
- MVCC stores multiple versions of each key with commit timestamps. Readers select the version matching their snapshot timestamp. Writers create new versions without disturbing old ones.
- In the LSM architecture, MVCC keys are encoded as (user_key, inverted_timestamp) so that newer versions sort first in byte order. This makes snapshot reads efficient: the first matching version is the correct one.
- Readers never block writers, and writers never block readers. The only conflict is write-write on the same key, detected at commit time.
- Garbage collection removes old versions that are below the watermark (minimum active read_ts). This happens during compaction and is essential for bounding space amplification.
- Write skew remains possible under snapshot isolation. For the OOR, this is an acceptable tradeoff: conjunction queries need consistent snapshots, not full serializability.
Project — Conjunction Query Engine with MVCC Snapshot Reads
Module: Database Internals — M05: Transactions & Isolation
Track: Orbital Object Registry
Estimated effort: 8–10 hours
SDA Incident Report — OOR-2026-0046
Classification: ENGINEERING DIRECTIVE
Subject: Add MVCC snapshot isolation to the OOR storage engine
Ref: OOR-2026-0046 (stale TLE data in conjunction assessment)
Extend the LSM engine with MVCC support. Conjunction queries must see consistent catalog snapshots. Concurrent TLE updates must not block or corrupt reads.
Acceptance Criteria
- MVCC key encoding. Encode user keys with inverted big-endian timestamps. Verify that newer versions sort before older versions in byte order.
- Snapshot read correctness. Insert key "NORAD-25544" at timestamps 50, 80, and 110. Read at timestamps 60, 90, and 120. Verify each read returns the correct version (ts=50, ts=80, ts=110 respectively).
- Tombstone visibility. Insert key "NORAD-99999" at ts=50, delete at ts=80. Read at ts=60 → value. Read at ts=90 → None.
- Concurrent reads and writes. Spawn two threads: one performs 10,000 reads at a fixed snapshot, the other performs 1,000 writes with incrementing timestamps. Verify all reads return consistent results (no torn reads, no version mixing). Writers must not block readers.
- Write conflict detection. Start two transactions with overlapping read timestamps. Both write the same key. The first to commit succeeds; the second detects the conflict and is aborted.
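- Garbage collection. Set watermark to 100. Insert versions at ts=30, 70, 90, 120 for a key. Run compaction. Verify that ts=30 and ts=70 are garbage-collected (both are superseded by ts=90, which is at or below the watermark), ts=90 is retained (newest version visible at the watermark), and ts=120 is retained (above the watermark).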
- Conjunction simulation. Load 10,000 TLE records. Start a conjunction query (snapshot read over 100 objects). While the query is running, update 50 of those objects. Verify the query sees only the pre-update versions.
Starter Structure
conjunction-engine/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point
│ ├── mvcc.rs # MVCC key encoding, Transaction, conflict detection
│ ├── lsm.rs # Extended with timestamp-aware get/put/scan
│ ├── compaction.rs # Extended with watermark-aware GC
│ └── (reuse remaining modules from Modules 3–4)
Hints
Hint 1 — Timestamp encoding
Use !timestamp (bitwise NOT) converted to big-endian bytes, appended to the user key. This makes newer timestamps sort first without modifying the LSM's comparator.
Hint 2 — Write conflict detection
At commit time, scan the memtable and SSTables for any version of the key with commit_ts > txn.read_ts. If found, another transaction wrote this key after our snapshot — abort.
Hint 3 — Watermark computation
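Maintain a BTreeSet<u64> of all active transaction read timestamps. The watermark is the minimum value in the set. When a transaction commits or aborts, remove its read_ts. Use a mutex to protect the set. If two transactions can share a read_ts, keep a count per timestamp (for example a BTreeMap<u64, usize>) so that finishing one transaction does not drop the other's entry.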
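One possible shape for the tracker is sketched below. It uses a `BTreeMap<u64, usize>` count per timestamp so that several transactions can share a read_ts; `Watermark`, `add_reader`, and `remove_reader` are hypothetical names, and the surrounding mutex is left to the caller.

```rust
use std::collections::BTreeMap;

/// Tracks the read timestamps of active transactions, with a count per
/// timestamp because several transactions may start at the same read_ts.
#[derive(Default)]
struct Watermark {
    active: BTreeMap<u64, usize>,
}

impl Watermark {
    /// Call when a transaction begins.
    fn add_reader(&mut self, read_ts: u64) {
        *self.active.entry(read_ts).or_insert(0) += 1;
    }

    /// Call when a transaction commits or aborts.
    fn remove_reader(&mut self, read_ts: u64) {
        let now_unused = match self.active.get_mut(&read_ts) {
            Some(count) => {
                *count -= 1;
                *count == 0
            }
            None => false,
        };
        if now_unused {
            self.active.remove(&read_ts);
        }
    }

    /// Minimum active read_ts, or None when no transaction is active.
    fn watermark(&self) -> Option<u64> {
        // BTreeMap iterates keys in ascending order, so the first is the minimum.
        self.active.keys().next().copied()
    }
}
```

When `watermark()` returns `None`, compaction can treat the latest commit timestamp as the watermark, since no reader is pinned to an older snapshot.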
What Comes Next
Module 6 (Query Processing) builds structured query execution on top of the MVCC storage engine — scan operators, join algorithms, and the volcano iterator model for composable query plans.
Module 06 — Query Processing
Track: Database Internals — Orbital Object Registry
Position: Module 6 of 6
Source material: Database Internals — Alex Petrov, Chapters 14–15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0047
Classification: PERFORMANCE DEFICIENCY
Subject: Multi-source TLE merge exceeds conjunction window deadline
The OOR ingests TLE data from 5 independent sources (18th SDS, ESA SST, LeoLabs, ExoAnalytic, Numerica). When multiple sources provide TLEs for the same object, the engine must merge them: selecting the most recent epoch, resolving conflicts, and joining against the master catalog. Currently this is done in application code with ad-hoc nested loops. A full catalog merge of 100,000 objects from 5 sources takes 45 seconds. The conjunction pipeline requires merge results within 10 seconds.
Directive: Implement a structured query processing layer: scan operators, join algorithms, and a composable execution model that can be optimized for the catalog merge workload.
Learning Outcomes
After completing this module, you will be able to:
- Implement the volcano (iterator) model for pull-based query execution with composable operators
- Explain vectorized execution and why processing column batches outperforms row-at-a-time for analytical queries
- Implement nested-loop, hash, and sort-merge join algorithms and determine which is optimal for a given workload
- Compose scan, filter, projection, and join operators into a query execution plan
- Analyze the I/O and memory costs of different join strategies for the OOR catalog merge workload
Lesson Summary
Lesson 1 — The Volcano (Iterator) Model
Pull-based query execution. Each operator (scan, filter, join) implements next() → Option<Row>. Operators compose like iterator chains. Pipelining and its limitations.
Lesson 2 — Vectorized Execution
Processing batches of rows (column vectors) instead of one row at a time. Cache efficiency, SIMD potential, and why OLAP engines (DuckDB, Velox, DataFusion) use vectorized execution.
Lesson 3 — Join Algorithms
Nested-loop join, hash join, and sort-merge join. Cost models, memory requirements, and when each algorithm is optimal. Application to the OOR multi-source TLE merge.
Capstone Project — Orbital Catalog Merge System
Build a query execution engine that merges TLE data from 5 sources using composable operators. The merge pipeline uses scan, filter, sort, and join operators composed in the volcano model. Full brief in project-catalog-merge.md.
File Index
module-06-query-processing/
├── README.md
├── lesson-01-volcano-model.md
├── lesson-01-quiz.toml
├── lesson-02-vectorized-execution.md
├── lesson-02-quiz.toml
├── lesson-03-join-algorithms.md
├── lesson-03-quiz.toml
└── project-catalog-merge.md
Prerequisites
- Module 5 (Transactions & Isolation) completed
- All previous modules in the Database Internals track
Track Complete
This is the final module of the Database Internals track. After completing it, you will have built a storage engine from the ground up: page layout (M1) → B-tree indexing (M2) → LSM write-optimized storage (M3) → crash recovery (M4) → MVCC concurrency (M5) → query processing (M6).
Lesson 1 — The Volcano (Iterator) Model
Module: Database Internals — M06: Query Processing
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 14
Source note: This lesson was synthesized from training knowledge. Verify Petrov's volcano model description and pipelining analysis against Chapter 14.
Context
The B+ tree range scan iterator from Module 2 and the LSM merge iterator from Module 3 are both instances of a general pattern: pull-based iteration. A consumer calls next(), the producer returns the next item or signals completion. The volcano model (also called the iterator model, introduced by Goetz Graefe in 1990) generalizes this into a complete query execution framework.
Every query operator — scan, filter, project, join, sort, aggregate — implements the same interface: open(), next(), close(). Operators compose by nesting: a filter's next() calls its child scan's next() and applies the predicate. A join's next() calls both children's next() methods to find matching rows. The entire query plan is a tree of iterators, and execution proceeds one row at a time from the root.
This model is simple, composable, and universal — it's used by PostgreSQL, MySQL, SQLite, and most traditional database engines.
Core Concepts
The Operator Interface
Every query operator implements:
trait Operator {
/// Initialize the operator (open files, allocate buffers).
fn open(&mut self) -> io::Result<()>;
/// Return the next row, or None if exhausted.
fn next(&mut self) -> io::Result<Option<Row>>;
/// Release resources.
fn close(&mut self) -> io::Result<()>;
}
Row is a tuple of typed values — for the OOR, it's a TLE record or a subset of its fields.
Operator Composition
A query plan is a tree of operators. The root operator's next() pulls data through the entire tree:
ProjectOperator (select norad_id, epoch, mean_motion)
│
FilterOperator (where inclination > 80.0)
│
SeqScanOperator (scan TLE table)
Execution: Project.next() calls Filter.next(), which calls SeqScan.next() repeatedly until finding a row with inclination > 80.0, then returns it to Project, which extracts the requested columns.
Pipelining
In the volcano model, rows pipeline through operators: each row flows from leaf to root without being materialized in an intermediate buffer. SeqScan produces row 1, Filter checks it and passes it to Project, which emits it. Then row 2, and so on.
Pipelining is memory-efficient — only one row is in flight at a time (per operator). But some operators break the pipeline:
- Sort: Must consume all input rows before producing any output (can't emit the smallest row until it has seen all rows).
- Hash join (build phase): Must consume the entire build side before probing with the probe side.
- Aggregate: Must consume all input to compute aggregates (e.g., COUNT, AVG).
These are pipeline breakers (also called blocking operators). They require materializing intermediate results, consuming memory proportional to the input size.
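A sort operator makes the blocking behavior concrete. The sketch below is self-contained and deliberately simplified: `Row` is a two-field tuple standing in for a TLE record, and `VecScan` and `SortByEpoch` are hypothetical operator names. All of the child's rows are drained in `open()` before `next()` can return anything.

```rust
use std::io;

/// (norad_id, epoch): a simplified stand-in for a TLE row.
type Row = (u32, f64);

trait Operator {
    fn open(&mut self) -> io::Result<()>;
    fn next(&mut self) -> io::Result<Option<Row>>;
    fn close(&mut self) -> io::Result<()>;
}

/// Leaf operator: scans an in-memory vector.
struct VecScan {
    rows: Vec<Row>,
    cursor: usize,
}

impl Operator for VecScan {
    fn open(&mut self) -> io::Result<()> {
        self.cursor = 0;
        Ok(())
    }
    fn next(&mut self) -> io::Result<Option<Row>> {
        let row = self.rows.get(self.cursor).copied();
        if row.is_some() {
            self.cursor += 1;
        }
        Ok(row)
    }
    fn close(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Pipeline breaker: buffers the entire input, then emits it in epoch order.
struct SortByEpoch {
    child: Box<dyn Operator>,
    buffer: Vec<Row>,
    cursor: usize,
}

impl Operator for SortByEpoch {
    fn open(&mut self) -> io::Result<()> {
        self.child.open()?;
        self.buffer.clear();
        self.cursor = 0;
        // Memory use is proportional to the input: every row is held here.
        while let Some(row) = self.child.next()? {
            self.buffer.push(row);
        }
        // total_cmp gives f64 a total order, so NaN cannot break the sort.
        self.buffer.sort_by(|a, b| a.1.total_cmp(&b.1));
        Ok(())
    }
    fn next(&mut self) -> io::Result<Option<Row>> {
        let row = self.buffer.get(self.cursor).copied();
        if row.is_some() {
            self.cursor += 1;
        }
        Ok(row)
    }
    fn close(&mut self) -> io::Result<()> {
        self.buffer.clear();
        self.child.close()
    }
}
```

Draining in `open()` rather than lazily on the first `next()` is a design choice for brevity; either placement blocks the pipeline at the same point.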
Volcano Model Limitations
CPU overhead: Each next() call involves a virtual method dispatch (or trait object call in Rust), a function call per operator per row. For 100,000 rows through 5 operators, that's 500,000 function calls. On modern CPUs, this overhead is small for I/O-bound queries but significant for CPU-bound analytical queries.
Cache inefficiency: Processing one row at a time means the CPU repeatedly jumps between different operator code paths. Each operator's instructions and working data can be evicted before the next row reaches it, so the instruction cache and branch predictor rarely stay warm. Vectorized execution (Lesson 2) addresses this by processing batches.
No SIMD opportunity: Single-row processing cannot exploit SIMD instructions that process multiple values simultaneously.
For the OOR's I/O-bound conjunction queries (dominated by SSTable reads), the volcano model's CPU overhead is acceptable. For the CPU-bound catalog merge (5 sources × 100k objects × comparison logic), vectorized execution provides a significant speedup.
Code Examples
Operator Trait and Basic Operators
use std::io;
/// A row in the OOR catalog: simplified for the query layer.
#[derive(Debug, Clone)]
struct TleRow {
norad_id: u32,
epoch: f64,
inclination: f64,
mean_motion: f64,
source: String,
}
/// Pull-based query operator.
trait Operator {
fn open(&mut self) -> io::Result<()>;
fn next(&mut self) -> io::Result<Option<TleRow>>;
fn close(&mut self) -> io::Result<()>;
}
/// Sequential scan over an in-memory vector of TLE records.
struct SeqScan {
data: Vec<TleRow>,
cursor: usize,
}
impl Operator for SeqScan {
fn open(&mut self) -> io::Result<()> {
self.cursor = 0;
Ok(())
}
fn next(&mut self) -> io::Result<Option<TleRow>> {
if self.cursor < self.data.len() {
let row = self.data[self.cursor].clone();
self.cursor += 1;
Ok(Some(row))
} else {
Ok(None)
}
}
fn close(&mut self) -> io::Result<()> { Ok(()) }
}
/// Filter operator: passes through rows that match a predicate.
struct Filter {
child: Box<dyn Operator>,
predicate: Box<dyn Fn(&TleRow) -> bool>,
}
impl Operator for Filter {
fn open(&mut self) -> io::Result<()> { self.child.open() }
fn next(&mut self) -> io::Result<Option<TleRow>> {
loop {
match self.child.next()? {
Some(row) if (self.predicate)(&row) => return Ok(Some(row)),
Some(_) => continue, // Row doesn't match — pull next
None => return Ok(None),
}
}
}
fn close(&mut self) -> io::Result<()> { self.child.close() }
}
/// Projection operator: transforms rows.
struct Projection {
child: Box<dyn Operator>,
project_fn: Box<dyn Fn(TleRow) -> TleRow>,
}
impl Operator for Projection {
fn open(&mut self) -> io::Result<()> { self.child.open() }
fn next(&mut self) -> io::Result<Option<TleRow>> {
match self.child.next()? {
Some(row) => Ok(Some((self.project_fn)(row))),
None => Ok(None),
}
}
fn close(&mut self) -> io::Result<()> { self.child.close() }
}
Composing a Query Plan
fn build_high_inclination_query(tle_data: Vec<TleRow>) -> Box<dyn Operator> {
let scan = Box::new(SeqScan { data: tle_data, cursor: 0 });
let filter = Box::new(Filter {
child: scan,
predicate: Box::new(|row: &TleRow| row.inclination > 80.0),
});
let project = Box::new(Projection {
child: filter,
project_fn: Box::new(|mut row: TleRow| {
// Strip fields we don't need downstream
row.source = String::new();
row
}),
});
project
}
fn execute_query(mut plan: Box<dyn Operator>) -> io::Result<Vec<TleRow>> {
plan.open()?;
let mut results = Vec::new();
while let Some(row) = plan.next()? {
results.push(row);
}
plan.close()?;
Ok(results)
}
The query plan is built bottom-up (scan → filter → project) and executed top-down (project pulls from filter pulls from scan). This separation between plan construction and execution is what makes the volcano model so composable — you can swap operators, add layers, or optimize the plan without changing the execution engine.
Key Takeaways
- The volcano model defines a universal operator interface: open(), next(), close(). Every query operator (scan, filter, join, sort) implements this interface and composes with any other operator.
- Pipelining passes rows through operators one at a time without intermediate materialization. Pipeline breakers (sort, hash build, aggregate) must buffer all input before producing output.
- The model's simplicity is its strength: it's easy to implement, test, and extend. Its weakness is per-row overhead (function calls, cache misses) that hurts CPU-bound analytical queries.
- The B+ tree range scan (Module 2) and LSM merge iterator (Module 3) are both volcano-model operators. The query layer from this module composes on top of them.
Lesson 2 — Vectorized Execution
Module: Database Internals — M06: Query Processing
Position: Lesson 2 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Source note: This lesson was synthesized from training knowledge. Verify Kleppmann's columnar processing analysis against Chapter 6. Additional references: MonetDB/X100 paper (Boncz et al., 2005), DuckDB architecture documentation.
Context
The volcano model processes one row at a time. For the OOR's catalog merge — 500,000 rows from 5 sources, with comparison logic on 4 floating-point fields — the per-row function call overhead and cache inefficiency dominate CPU time. Vectorized execution addresses this by changing the unit of work from a single row to a batch (or vector) of rows.
Instead of next() → Option<Row>, vectorized operators return next_batch() → Option<ColumnBatch>, where a ColumnBatch contains 256–2048 rows stored in columnar format (one array per column). Operators process entire columns at once — a filter evaluates a predicate on an f64 array of 1024 inclination values in a tight loop, producing a selection bitmap. This tight loop is cache-friendly (one contiguous array), branch-predictor-friendly (same instruction repeated), and SIMD-exploitable (process 4 or 8 values per instruction).
Core Concepts
Row-at-a-Time vs. Batch-at-a-Time
Row-at-a-time (volcano): Each operator call processes 1 row. For N rows through K operators: N × K function calls, each touching a different memory region.
Batch-at-a-time (vectorized): Each operator call processes B rows. For N rows through K operators: (N/B) × K function calls. Each call processes a contiguous array, keeping the CPU cache warm and enabling auto-vectorization (SIMD).
The batch size B is typically 1024 or 2048 — large enough to amortize per-call overhead, small enough to fit in L1/L2 cache.
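The call-count arithmetic is worth running once with the OOR's numbers. This sketch only counts operator invocations and says nothing about the work done inside each call; `volcano_calls` and `vectorized_calls` are hypothetical helper names.

```rust
/// Operator calls needed to push `rows` rows through `operators` operators
/// one row at a time: N × K.
fn volcano_calls(rows: u64, operators: u64) -> u64 {
    rows * operators
}

/// Operator calls needed with batches of `batch_size` rows: ceil(N / B) × K.
fn vectorized_calls(rows: u64, operators: u64, batch_size: u64) -> u64 {
    // Round up so a final partial batch still counts as one call.
    let batches = (rows + batch_size - 1) / batch_size;
    batches * operators
}
```

For the catalog merge (500,000 rows, 5 operators, B = 1024), `volcano_calls` gives 2,500,000 and `vectorized_calls` gives 2,445.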
Columnar Batch Format
A vectorized batch stores data in columnar layout — one array per column:
Row-oriented (volcano):
Row 0: { norad_id: 25544, epoch: 84.7, inclination: 51.6, ... }
Row 1: { norad_id: 43013, epoch: 84.2, inclination: 97.4, ... }
Columnar (vectorized):
norad_id: [25544, 43013, ...] ← contiguous u32 array
epoch: [84.7, 84.2, ...] ← contiguous f64 array
inclination: [51.6, 97.4, ...] ← contiguous f64 array
A filter on inclination > 80.0 processes the inclination array without touching norad_id or epoch — only the relevant column is loaded into cache. The Rust compiler can auto-vectorize the tight comparison loop into SIMD instructions (e.g., _mm256_cmp_pd comparing 4 f64 values per instruction).
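The layout change can be sketched as a transposition followed by a single-column filter. `TleRow` here is a trimmed three-field version of the catalog row, and `TleColumns`, `to_columns`, and `high_inclination_mask` are hypothetical names for illustration.

```rust
/// Row-oriented record (trimmed to three fields for the example).
struct TleRow {
    norad_id: u32,
    epoch: f64,
    inclination: f64,
}

/// Columnar layout: one contiguous array per field.
#[derive(Default)]
struct TleColumns {
    norad_ids: Vec<u32>,
    epochs: Vec<f64>,
    inclinations: Vec<f64>,
}

/// Transpose rows into columns.
fn to_columns(rows: &[TleRow]) -> TleColumns {
    let mut cols = TleColumns::default();
    for row in rows {
        cols.norad_ids.push(row.norad_id);
        cols.epochs.push(row.epoch);
        cols.inclinations.push(row.inclination);
    }
    cols
}

/// Evaluate the predicate over the inclination column only. The loop never
/// touches norad_ids or epochs, so those arrays stay out of cache.
fn high_inclination_mask(inclinations: &[f64], threshold: f64) -> Vec<bool> {
    inclinations.iter().map(|&inc| inc > threshold).collect()
}
```

Whether the compiler actually emits SIMD for the comparison loop depends on the target and optimization level, so treat vectorization as likely rather than guaranteed.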
Selection Vectors
Instead of copying matching rows to a new batch (expensive), vectorized engines use a selection vector — an array of indices into the batch that identifies which rows passed the filter:
// Filter: inclination > 80.0
let inclinations: &[f64] = &batch.inclination;
let mut selection: Vec<u32> = Vec::new();
for (i, &inc) in inclinations.iter().enumerate() {
if inc > 80.0 {
selection.push(i as u32);
}
}
// selection = [1, 3, 7, ...] ← indices of matching rows
Downstream operators use the selection vector to skip non-matching rows without copying data. This avoids the memory allocation and copy cost of materializing filtered batches.
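A downstream operator consuming a selection vector might look like the following sketch. It computes the mean of one column over the selected rows only; `select_gt` and `mean_selected` are hypothetical helper names, and the columns are plain slices rather than any particular batch type.

```rust
/// Build a selection vector: indices of values greater than `threshold`.
fn select_gt(values: &[f64], threshold: f64) -> Vec<u32> {
    values
        .iter()
        .enumerate()
        .filter(|(_, &v)| v > threshold)
        .map(|(i, _)| i as u32)
        .collect()
}

/// Mean of `values` over the selected rows. The column itself is never
/// copied or compacted; the selection vector is the only extra allocation.
fn mean_selected(values: &[f64], selection: &[u32]) -> Option<f64> {
    if selection.is_empty() {
        return None;
    }
    let sum: f64 = selection.iter().map(|&i| values[i as usize]).sum();
    Some(sum / selection.len() as f64)
}
```

A typical use is to build the selection from the inclination column and then apply it to the mean_motion column of the same batch, which works because both columns share row indices.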
Apache Arrow as the Columnar Format
Apache Arrow defines a standardized in-memory columnar format used by DataFusion, DuckDB (internally similar), Polars, and many data processing engines. Key features:
- Zero-copy sharing between operators — no serialization/deserialization between pipeline stages.
- Validity bitmaps for null handling — one bit per value indicating null/non-null.
- Dictionary encoding for low-cardinality string columns — stores unique values once and references them by index.
For the OOR, using Arrow arrays for TLE column batches enables integration with the broader Rust data ecosystem (the arrow crate).
Code Examples
Vectorized Filter Operator
use std::io;
const BATCH_SIZE: usize = 1024;
/// A batch of TLE records in columnar format.
struct TleBatch {
    norad_ids: Vec<u32>,
    epochs: Vec<f64>,
    inclinations: Vec<f64>,
    mean_motions: Vec<f64>,
    len: usize,
}
/// Vectorized operator interface.
trait VecOperator {
    fn open(&mut self) -> io::Result<()>;
    fn next_batch(&mut self) -> io::Result<Option<TleBatch>>;
    fn close(&mut self) -> io::Result<()>;
}
/// Vectorized filter: evaluates predicate on entire columns at once.
struct VecFilter {
    child: Box<dyn VecOperator>,
    /// Returns a boolean mask: true for rows that pass the filter.
    predicate: Box<dyn Fn(&TleBatch) -> Vec<bool>>,
}
impl VecOperator for VecFilter {
    fn open(&mut self) -> io::Result<()> { self.child.open() }
    fn next_batch(&mut self) -> io::Result<Option<TleBatch>> {
        loop {
            match self.child.next_batch()? {
                None => return Ok(None),
                Some(batch) => {
                    let mask = (self.predicate)(&batch);
                    let filtered = apply_mask(&batch, &mask);
                    if filtered.len > 0 {
                        return Ok(Some(filtered));
                    }
                    // Entire batch filtered out: pull the next one
                }
            }
        }
    }
    fn close(&mut self) -> io::Result<()> { self.child.close() }
}
/// Apply a boolean mask to a batch, keeping only rows where mask[i] is true.
fn apply_mask(batch: &TleBatch, mask: &[bool]) -> TleBatch {
    let mut out = TleBatch {
        norad_ids: Vec::new(), epochs: Vec::new(),
        inclinations: Vec::new(), mean_motions: Vec::new(), len: 0,
    };
    for (i, &keep) in mask.iter().enumerate() {
        if keep {
            out.norad_ids.push(batch.norad_ids[i]);
            out.epochs.push(batch.epochs[i]);
            out.inclinations.push(batch.inclinations[i]);
            out.mean_motions.push(batch.mean_motions[i]);
            out.len += 1;
        }
    }
    out
}
The predicate function operates on entire columns: |batch| batch.inclinations.iter().map(|&inc| inc > 80.0).collect(). This tight loop over a contiguous f64 array is exactly the pattern the compiler auto-vectorizes into SIMD instructions. A production implementation would use selection vectors instead of copying rows.
Key Takeaways
- Vectorized execution processes batches of rows (typically 1024) instead of single rows, reducing per-row overhead by 100–1000x for CPU-bound operations.
- Columnar layout stores each column as a contiguous array, enabling cache-efficient processing and SIMD auto-vectorization. A filter on one column never touches other columns.
- Selection vectors track which rows pass a filter without copying data. This avoids materialization cost and keeps downstream operators working on the original arrays.
- Apache Arrow provides a standardized columnar format for zero-copy interop between operators and libraries. The arrow Rust crate is the foundation for DataFusion.
- Vectorized execution is most impactful for CPU-bound analytical queries (aggregations, joins, comparisons). For I/O-bound point lookups, the volcano model is sufficient.
Lesson 3 — Join Algorithms
Module: Database Internals — M06: Query Processing
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Source note: This lesson was synthesized from training knowledge. Verify Petrov's join algorithm cost models and Kleppmann's distributed join discussion against the source chapters.
Context
The OOR's catalog merge problem is fundamentally a join: match TLE records from 5 sources on NORAD catalog ID, then select the best TLE for each object (most recent epoch, highest source priority). In SQL terms: SELECT * FROM source_a AS a JOIN source_b AS b ON a.norad_id = b.norad_id.
The choice of join algorithm determines whether this merge takes 45 seconds (nested-loop) or under 1 second (hash join). This lesson covers the three fundamental join algorithms, their cost models, and when each is optimal.
Core Concepts
Nested-Loop Join
The simplest join: for each row in the outer table, scan the entire inner table for matches.
for each row_a in source_a:                  # |A| iterations
    for each row_b in source_b:              # |B| iterations per outer row
        if row_a.norad_id == row_b.norad_id:
            emit (row_a, row_b)
Cost: O(|A| × |B|) comparisons. For two 100k-row sources: 10 billion comparisons. Completely impractical for the OOR catalog merge.
When to use: Only when one side is very small (< 100 rows) or when no better algorithm is available (no index, insufficient memory for a hash table). Also useful for non-equi joins (e.g., a.epoch > b.epoch) where hash join doesn't apply.
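Block nested-loop improves this by loading a block of the outer table into memory and scanning the inner table once per block rather than once per outer row. With M outer rows per block, the number of inner scans drops from |A| to ⌈|A|/M⌉. The comparison count is unchanged at |A| × |B|; the saving is in I/O.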
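The block variant can be sketched as follows. This is an in-memory illustration with bare u32 keys standing in for rows, and `block_rows` is a name chosen here for the number of outer rows per block. It returns the number of passes over the inner input alongside the matches so the I/O reduction is visible.

```rust
/// Block nested-loop join on a u32 key. `block_rows` is the number of outer
/// rows held in memory at once; the inner input is scanned once per block.
/// Returns the matching pairs and the number of inner scans performed.
fn block_nested_loop_join(
    outer: &[u32],
    inner: &[u32],
    block_rows: usize,
) -> (Vec<(u32, u32)>, usize) {
    assert!(block_rows > 0, "block_rows must be at least 1");
    let mut matches = Vec::new();
    let mut inner_scans = 0;
    for block in outer.chunks(block_rows) {
        inner_scans += 1; // one pass over the inner input per outer block
        for &b in inner {
            for &a in block {
                if a == b {
                    matches.push((a, b));
                }
            }
        }
    }
    (matches, inner_scans)
}
```

With `block_rows = 1` this degenerates to the plain nested-loop join: one inner scan per outer row.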
Hash Join
Build a hash table on the smaller input (the build side), then probe it with the larger input (the probe side).
Build phase: Scan the build side and insert each row into a hash table keyed by the join column.
Probe phase: Scan the probe side. For each row, hash the join column and look up matching rows in the hash table.
Build:
    hash_table = {}
    for each row_b in source_b:
        hash_table[row_b.norad_id].append(row_b)
Probe:
    for each row_a in source_a:
        for each row_b in hash_table[row_a.norad_id]:
            emit (row_a, row_b)
Cost: O(|A| + |B|) — one scan of each input. Hash table operations are O(1) amortized.
Memory: The hash table must fit in memory. Size ≈ |build_side| × (key_size + row_size + overhead). For 100k TLE records at ~100 bytes each: ~10MB. Trivially fits in memory.
When to use: Equi-joins (join on equality) where the build side fits in memory. This is the default join algorithm in most query engines for good reason — it's optimal for the vast majority of join workloads.
For the OOR: hash join merges two 100k-row sources in ~200k operations. Five sources require 4 sequential hash joins (or a multi-way hash join), all completing in under 100ms.
Sort-Merge Join
Sort both inputs on the join column, then merge them in a single pass (like the LSM merge iterator from Module 3).
Sort phase: Sort both inputs by the join key. O(|A| log |A| + |B| log |B|).
Merge phase: Advance two cursors through the sorted inputs, matching on the join key. O(|A| + |B|).
Sort source_a by norad_id
Sort source_b by norad_id
cursor_a = 0, cursor_b = 0
while cursor_a < |A| and cursor_b < |B|:
    if a[cursor_a].norad_id == b[cursor_b].norad_id:
        emit (a[cursor_a], b[cursor_b])
        advance both cursors (handling duplicates)
    elif a[cursor_a].norad_id < b[cursor_b].norad_id:
        cursor_a += 1
    else:
        cursor_b += 1
Cost: O(|A| log |A| + |B| log |B|) for the sort phases, O(|A| + |B|) for the merge. Dominated by the sort.
When to use: When inputs are already sorted (e.g., from an index scan or a preceding sort operator), the sort phase is free and the total cost is O(|A| + |B|) — optimal. Also useful when the join result must be sorted (the output is already in join-key order). Handles non-memory-fitting inputs gracefully via external sort.
For the OOR: if TLE sources are pre-sorted by NORAD ID (which they often are, since NORAD IDs are sequential), sort-merge join is optimal — the sort phase costs nothing, and the merge is a single linear pass.
Cost Comparison
| Algorithm | Time | Memory | Pre-sorted Input |
|---|---|---|---|
| Nested-loop | O(A × B) | O(1) | No benefit |
| Hash join | O(A + B) | O(min(A,B)) | No benefit |
| Sort-merge | O(A log A + B log B) | O(A + B) for sort | O(A + B) if pre-sorted |
Multi-Way Join for 5 Sources
The OOR catalog merge joins 5 sources. Strategies:
Sequential pairwise: Join source 1 with 2, then result with 3, then with 4, then with 5. Four hash joins. Total cost: O(5 × N) where N is the source size. Simple and effective.
Multi-way sort-merge: Sort all 5 sources by NORAD ID, then merge all 5 simultaneously using a priority queue (exactly the merge iterator from Module 3). One pass through all data. Optimal if sources are pre-sorted.
For the OOR, the multi-way sort-merge is the better choice: TLE sources arrive pre-sorted by NORAD ID, and the merge iterator is already implemented.
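A minimal sketch of the multi-way merge, assuming each source has already been reduced to a sorted list of NORAD IDs. The function name `k_way_merge` is illustrative; the Module 3 merge iterator may be structured differently. `BinaryHeap` is a max-heap, so entries are wrapped in `Reverse` to pop the smallest key first.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge k inputs, each sorted by NORAD ID, into one sorted stream of
/// (norad_id, source_index) pairs using a min-heap of cursor heads.
fn k_way_merge(sources: &[Vec<u32>]) -> Vec<(u32, usize)> {
    // Heap entries: (key, source index, position within that source).
    let mut heap = BinaryHeap::new();
    for (src, rows) in sources.iter().enumerate() {
        if let Some(&id) = rows.first() {
            heap.push(Reverse((id, src, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(sources.iter().map(Vec::len).sum());
    while let Some(Reverse((id, src, pos))) = heap.pop() {
        out.push((id, src));
        // Refill the heap from the source that was just consumed.
        if let Some(&next) = sources[src].get(pos + 1) {
            heap.push(Reverse((next, src, pos + 1)));
        }
    }
    out
}
```

The heap never holds more than k entries, so the merge costs O(N log k) for N total rows. Rows with the same NORAD ID come out adjacent, ordered by source index, which is the grouping the conflict-resolution step needs.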
Code Examples
Hash Join Implementation
use std::collections::HashMap;
/// Hash join: match TLE records from two sources on NORAD ID.
fn hash_join(
    build_side: &[TleRow], // Smaller source
    probe_side: &[TleRow], // Larger source
) -> Vec<(TleRow, TleRow)> {
    // Build phase: index the build side by NORAD ID
    let mut hash_table: HashMap<u32, Vec<&TleRow>> = HashMap::new();
    for row in build_side {
        hash_table.entry(row.norad_id).or_default().push(row);
    }
    // Probe phase: look up each probe-side row in the hash table
    let mut results = Vec::new();
    for probe_row in probe_side {
        if let Some(matches) = hash_table.get(&probe_row.norad_id) {
            for &build_row in matches {
                results.push((build_row.clone(), probe_row.clone()));
            }
        }
    }
    results
}
Sort-Merge Join for Pre-Sorted Sources
/// Sort-merge join on pre-sorted inputs. Both inputs must be sorted by norad_id.
fn sort_merge_join(
    left: &[TleRow],
    right: &[TleRow],
) -> Vec<(TleRow, TleRow)> {
    let mut results = Vec::new();
    let mut li = 0;
    let mut ri = 0;
    while li < left.len() && ri < right.len() {
        match left[li].norad_id.cmp(&right[ri].norad_id) {
            std::cmp::Ordering::Equal => {
                // Collect all rows with this key from both sides
                let key = left[li].norad_id;
                let l_start = li;
                while li < left.len() && left[li].norad_id == key { li += 1; }
                let r_start = ri;
                while ri < right.len() && right[ri].norad_id == key { ri += 1; }
                // Cross product of matching rows (for equi-join)
                for l in &left[l_start..li] {
                    for r in &right[r_start..ri] {
                        results.push((l.clone(), r.clone()));
                    }
                }
            }
            std::cmp::Ordering::Less => li += 1,
            std::cmp::Ordering::Greater => ri += 1,
        }
    }
    results
}
The sort-merge join's merge phase is identical to the LSM merge iterator logic. For the OOR's unique NORAD IDs (no duplicates within a source), the cross-product in the equal case always produces exactly one match — the merge is linear.
Key Takeaways
- Nested-loop join is O(A × B) — only viable for very small inputs. Hash join is O(A + B) with O(min(A,B)) memory. Sort-merge join is O(A + B) if inputs are pre-sorted.
- Hash join is the default for equi-joins in most query engines. It requires the build side to fit in memory, which is almost always true for the OOR's workload sizes.
- Sort-merge join is optimal when inputs are pre-sorted (the sort phase is free). The LSM merge iterator from Module 3 is already a sort-merge join — the same algorithm applies here.
- The OOR catalog merge (5 pre-sorted sources × 100k objects) is best served by a multi-way sort-merge: one linear pass through all sources using a merge iterator with a priority queue.
- Join algorithm selection is a query optimization decision. The execution engine should support all three algorithms and choose based on input sizes, sort order, and available memory.
Project — Orbital Catalog Merge System
Module: Database Internals — M06: Query Processing
Track: Orbital Object Registry
Estimated effort: 6–8 hours
- SDA Incident Report — OOR-2026-0047
- Acceptance Criteria
- Starter Structure
- Test Data Generation
- Hints
- Track Complete
SDA Incident Report — OOR-2026-0047
Classification: ENGINEERING DIRECTIVE
Subject: Build a structured query engine for multi-source TLE catalog merging
Replace the ad-hoc nested-loop catalog merge with a composable query execution engine. The engine must support scan, filter, sort, and join operators, and merge TLE data from 5 sources within the conjunction pipeline's 10-second deadline.
Acceptance Criteria
- Volcano operators. Implement SeqScan, Filter, Projection, and Sort operators using the Operator trait. Compose them into a plan that scans 100,000 TLE records, filters by inclination > 80°, and projects to (norad_id, epoch).
- Hash join. Implement a hash join operator. Join two 100k-row sources on NORAD ID. Verify the output contains exactly the matching pairs.
- Sort-merge join. Implement a sort-merge join for pre-sorted inputs. Join two 100k-row sources (pre-sorted by NORAD ID). Verify output matches the hash join result.
- Multi-way merge. Implement a 5-way merge join using a min-heap (reuse the merge iterator pattern from Module 3). Merge 5 sources of 100k records each, all sorted by NORAD ID. Verify the merged output is sorted and contains all matching records.
- Performance target. The 5-way merge of 500,000 total records must complete in under 2 seconds. Print elapsed time. Compare against a naive nested-loop join on a subset (1,000 records per source) and report the speedup.
- Conflict resolution. When multiple sources provide TLEs for the same NORAD ID, select the TLE with the most recent epoch. Print the number of conflicts resolved and the winning source for 10 sample objects.
- Vectorized filter (bonus). Implement a vectorized filter that processes batches of 1024 rows in columnar format. Compare its throughput to the row-at-a-time volcano filter on 100,000 rows.
Starter Structure
catalog-merge/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point
│ ├── operator.rs # Operator trait, SeqScan, Filter, Projection, Sort
│ ├── hash_join.rs # HashJoinOperator
│ ├── sort_merge.rs # SortMergeJoinOperator
│ ├── merge_iter.rs # MultiWayMerge (reuse from Module 3)
│ ├── vectorized.rs # VecFilter (bonus)
│ └── tle.rs # TleRow, test data generation
Test Data Generation
Generate synthetic TLE data for 5 sources:
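use rand::{rngs::StdRng, Rng, SeedableRng};
fn generate_source(source_name: &str, num_objects: usize) -> Vec<TleRow> {
    // Deterministic seed per source. One possible scheme (an assumption, not a
    // requirement): fold the name's bytes into a u64. Assumes the rand 0.8 API.
    let seed = source_name
        .bytes()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(u64::from(b)));
    let mut rng = StdRng::seed_from_u64(seed);
    (0..num_objects).map(|i| TleRow {
        norad_id: i as u32 + 1, // NORAD IDs 1..=num_objects
        epoch: 84.0 + rng.gen::<f64>() * 0.5, // Slight epoch variation per source
        inclination: rng.gen::<f64>() * 180.0,
        mean_motion: 14.0 + rng.gen::<f64>() * 2.0,
        source: source_name.to_string(),
    }).collect()
}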
Each source provides TLEs for the same 100k objects but with slightly different epochs and measurements. The merge resolves conflicts by picking the most recent epoch per NORAD ID.
Hints
Hint 1 — Hash join operator as a volcano operator
The hash join operator's open() consumes the entire build side (calling build_child.next() until None, building the hash table). Then next() probes one row at a time from the probe side. The build phase is a pipeline breaker; the probe phase pipelines.
Hint 2 — Conflict resolution as a post-merge step
After the 5-way merge produces groups of TLEs with the same NORAD ID, apply a "group-by" operator that collects all rows with the same key and emits the winner (most recent epoch). This is a simple reduce over each group.
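One way to express that reduce, as a sketch. `Candidate` is a stand-in for whatever row type your merge emits, and the tie-break (the first row seen wins when epochs are equal) is an assumption; substitute your own source-priority rule if you have one.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    norad_id: u32,
    epoch: f64,
    source: usize, // index of the source that supplied this TLE
}

/// Reduce a stream sorted by `norad_id` to one winner per object: the row
/// with the most recent epoch. Returns the winners and the number of groups
/// that held more than one candidate (conflicts resolved).
fn resolve_conflicts(sorted: &[Candidate]) -> (Vec<Candidate>, usize) {
    let mut winners = Vec::new();
    let mut conflicts = 0;
    let mut i = 0;
    while i < sorted.len() {
        let key = sorted[i].norad_id;
        let mut best = i;
        let mut j = i + 1;
        // Walk the run of rows sharing this key, tracking the latest epoch.
        while j < sorted.len() && sorted[j].norad_id == key {
            if sorted[j].epoch > sorted[best].epoch {
                best = j;
            }
            j += 1;
        }
        if j - i > 1 {
            conflicts += 1;
        }
        winners.push(sorted[best].clone());
        i = j;
    }
    (winners, conflicts)
}
```

Because the input is already grouped by key, the reduce needs no hash table and holds only one group's state at a time.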
Hint 3 — Performance measurement
Use std::time::Instant for timing. Measure the merge end-to-end (including any sort phases). For the nested-loop comparison, use a small subset (1,000 rows per source) to avoid waiting minutes.
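A small timing helper along these lines keeps the measurement code out of the operators. The names `timed` and `speedup` are illustrative.

```rust
use std::time::{Duration, Instant};

/// Run `f` once and return its result together with the wall-clock time taken.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}

/// Speedup of `fast` over `slow`, as a plain ratio of elapsed seconds.
/// Returns infinity if `fast` measured as zero.
fn speedup(slow: Duration, fast: Duration) -> f64 {
    slow.as_secs_f64() / fast.as_secs_f64()
}
```

Wrap each join in `timed(|| ...)` and print the duration with `{:?}`. Build with `--release` before comparing numbers; debug builds can be an order of magnitude slower and distort the ratio. Since the nested-loop run uses a 1,000-row subset, state the input sizes next to the speedup you report.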
Track Complete
Congratulations. You have built a storage engine from the ground up:
| Module | What You Built |
|---|---|
| M1: Storage Engine Fundamentals | Page layout, buffer pool, slotted pages |
| M2: B-Tree Index Structures | B+ tree with splits, merges, range scans |
| M3: LSM Trees & Compaction | Memtable, SSTables, leveled compaction, bloom filters |
| M4: WAL & Recovery | Write-ahead log, crash recovery, checkpointing |
| M5: Transactions & Isolation | MVCC snapshot isolation, write conflict detection, GC |
| M6: Query Processing | Volcano model, vectorized execution, hash/sort-merge joins |
The Orbital Object Registry is now a fully functional, crash-recoverable, transactional storage engine with indexed access and structured query execution. The ESA deadline has been met.
Module 01 — Stream Processing Foundations
Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 1 of 6
Source material: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11; Streaming Data — Andrew Psaltis, Chapters 1–3; Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7; Kafka: The Definitive Guide — Shapira et al., Chapter 1
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
OPS ALERT — SDA-2026-0118
Classification: PIPELINE STAND-UP
Subject: Heterogeneous sensor ingestion for SDA Fusion Service
Meridian's existing Space Domain Awareness pipeline is a Python script that polls each sensor source on a 30-second cron, batches the results into Parquet, and uploads to S3. End-to-end latency from observation to fused track is currently 4–7 minutes. After last quarter's Cosmos-1408 anti-satellite test, the post-event debris environment requires sub-30-second conjunction detection to maintain constellation safety.
Directive: Stand up the front door of the SDA Fusion Service. Three sensor source types (X-band radar arrays, optical telescopes, inter-satellite link feeds) must be ingested as a continuous stream, normalized into a common observation envelope, and forwarded to downstream fusion stages. No more cron, no more batching at the edge.
This module establishes the conceptual and physical foundations of stream processing — what a stream is, what it isn't, and how production systems are built around the source/sink boundary. Every architectural decision that follows in this track (orchestration, windowing, delivery semantics, observability) assumes you can reason fluently about the model introduced here.
The code you write in this module is the literal first stage of the SDA Fusion Service. It will be extended, not replaced, in every subsequent module.
Learning Outcomes
After completing this module, you will be able to:
- Define what makes a system a stream processor rather than a low-latency batch processor, and articulate the operational consequences of the difference
- Design a common observation envelope that unifies heterogeneous sensor wire formats without losing source-specific provenance
- Implement async source and sink abstractions in Rust using the dataflow model — operators that consume from upstream and produce to downstream
- Choose between push, pull, and poll ingestion patterns for a given source based on latency, control, and reliability requirements
- Reason about the bounded-vs-unbounded data distinction and its implications for memory, completeness, and correctness
Lesson Summary
Lesson 1 — Streams, Sources, and Sinks
The conceptual model. What distinguishes a stream from a queue, a log, or a continuously polled batch. The source/sink abstraction as the boundary of every streaming pipeline. Bounded vs unbounded data and why the distinction shapes the entire system. The observation envelope pattern for unifying heterogeneous sources.
Key question: If a sensor produces a fixed dataset of 10 million observations from a one-time fragmentation event, is processing that dataset a stream or a batch problem?
Lesson 2 — The Dataflow Model
The streaming abstraction in production systems. Operators as functions over streams: map, filter, fold, merge, partition. Why dataflow composition beats imperative loops for pipeline code. The graph topology: sources at one edge, sinks at the other, operators in between. State, statelessness, and where state lives in a streaming pipeline.
Key question: Why does the dataflow model treat the pipeline as a graph that runs continuously rather than as a function that is called with a batch of inputs?
Lesson 3 — Push, Pull, and Poll Semantics
The three patterns by which data enters and traverses a pipeline. Push (the source initiates and forwards), pull (the consumer requests and receives), poll (the consumer requests on a schedule). Why Kafka consumers poll rather than subscribe to a callback. Where each pattern fits in the SDA fusion topology — radar arrays push, optical archives pull, ISL beacons poll. The hidden cost of polling and the hidden risk of pushing.
Key question: The optical telescope archive exposes only an HTTP REST endpoint with no notification mechanism. How should the ingestion service interact with it, and what are the operational consequences?
Capstone Project — SDA Sensor Ingestion Service
Build the front door of the SDA Fusion Service. Three async source tasks (radar, optical, ISL) consume from their respective wire formats, normalize observations into a common Observation envelope, and forward them to a shared sink. The sink writes a structured event log that downstream stages will consume. Acceptance criteria, suggested architecture, and the full project brief are in project-sensor-ingestion.md.
This is the module where the SDA Fusion Service begins to exist. Every subsequent module's project extends what you build here.
File Index
module-01-stream-processing-foundations/
├── README.md ← this file
├── lesson-01-streams-sources-sinks.md ← Streams, sources, sinks
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-dataflow-model.md ← The dataflow model
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-push-pull-poll.md ← Push, pull, and poll semantics
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-sensor-ingestion.md ← Capstone project brief
Prerequisites
- Foundation Track completed (all 6 modules) — async Rust, channels, network programming, and data layout are assumed
- Familiarity with tokio, tokio::sync::mpsc, serde, and anyhow::Result
- Working understanding of TCP, UDP, and HTTP, the three transport types you'll deal with in the project
What Comes Next
Module 2 (Pipeline Orchestration Internals) takes the source-to-sink primitives you build here and composes them into a multi-stage DAG with a real task scheduler. The Observation envelope you define in this module is the data structure that flows through every subsequent stage of the pipeline.
Lesson 1 — Streams, Sources, and Sinks
Module: Data Pipelines — M01: Stream Processing Foundations
Position: Lesson 1 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Stream Processing); Streaming Data — Andrew Psaltis, Chapter 1 (Introducing Streaming Data); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Ingestion: Bounded vs Unbounded Data)
Context
Every streaming system answers one question before any other: where does data enter, and where does it leave? The pipeline between those two points can grow arbitrarily complex — windowing, joins, stateful aggregation, exactly-once delivery — but the entry and exit points are non-negotiable. They are the contract between the pipeline and the rest of the world. Get them wrong and no amount of clever processing recovers the system.
The mental shift from batch to streaming is harder than it sounds. A batch job is a function: you give it a finite input, it returns a finite output, it terminates. A streaming pipeline is a process: it runs forever, it consumes a potentially infinite input, it produces a potentially infinite output, and "completion" is a category error. Most production incidents in streaming systems trace back to engineers who built a batch job and called it a stream — fixed-size buffers that fill up, retries that assume idempotency the source doesn't provide, "is the job done yet?" health checks against a process that has no notion of done.
For the SDA Fusion Service, the source side is three heterogeneous feeds. Ground-based X-band radar arrays push detection records over UDP at 100–500 Hz per array. Optical telescopes expose a REST API that returns observation batches when polled. The constellation's inter-satellite link beacons emit position reports on a 1 Hz cadence over a custom binary protocol. The sink side is a single fusion stage that expects one normalized envelope format. This module's job is to define what that envelope is, what a source and sink mean in this system, and how to think about the data crossing the boundary.
Core Concepts
What Makes a Stream a Stream
The defining property of a stream is unboundedness: there is no expectation that the input will end. This is the property that forces every other architectural decision in a streaming system. Memory cannot grow without bound, so state must be either fixed-size or explicitly bounded by time or count. Completeness checks (COUNT(*) over a stream, is the result correct?) cannot terminate, so they must be redefined as point-in-time approximations. Failure recovery cannot rely on "rerun the job from the start" because there is no start.
The DDIA framing is precise: a stream is a sequence of events, each event a small, immutable record describing something that happened at a point in time. Events are produced by producers (also called sources, publishers, emitters) and consumed by consumers (also called sinks, subscribers, recipients). The stream itself is not the events — it is the abstraction over the path they take from producer to consumer.
A finite, replayable input is not a stream just because it is processed event-by-event. A 10 GB log file processed line-by-line in a Spark job is a batch job with a streaming-style execution model. The distinction matters because finite inputs admit completion: you can compute exact aggregates, you can verify correctness end-to-end, you can retry the whole computation. Unbounded inputs admit none of these. Bounded data running through streaming infrastructure (sometimes called "stream-with-known-end") is real but rare in practice; treating an unbounded source as if it were bounded is a common and expensive mistake.
Bounded vs Unbounded Data
The bounded/unbounded distinction is the most important typology in data engineering. Reis and Housley make the case in Fundamentals of Data Engineering Ch. 7: the boundary you draw at ingestion shapes every downstream system. If you treat data as bounded, you can use ETL, you can do exact joins, you can cleanly partition by date. If you treat it as unbounded, you must use windowing, you must accept approximation, you must deal with late arrivals.
Sensor data from the SDA Fusion Service is unambiguously unbounded — the radar arrays will emit observations as long as there are objects in the sky and the radar has power. The fusion service has no notion of "the last observation"; the input is a continuous flow that the pipeline drinks from for as long as it operates.
Some sources are bounded but feel unbounded. The optical telescope archive exposes the last 24 hours of observations on demand. From the pipeline's perspective, the source produces observations now, and the archive happens to also expose recent ones. Treating it as a bounded "give me the last 24 hours" source produces a different system than treating it as an unbounded "show me observations as they appear" source — even though the underlying data is the same. The latter requires the pipeline to track what it has already consumed (a watermark or offset) and to poll only for new observations since that point.
The architectural rule: bounded sources can be treated as unbounded (just process them and stop when they end), but unbounded sources cannot be treated as bounded without introducing artificial cutoffs. When in doubt, treat it as unbounded.
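The watermark idea for the optical archive can be sketched as below. The record shape (`ArchivedObs`, with a seconds-resolution timestamp) and the `Watermark` type are hypothetical; the real archive's response format is not specified here. The HTTP call itself is omitted, and `take_new` is what would run on each poll's response.

```rust
/// Hypothetical record returned by the archive: a comparable timestamp
/// (seconds since epoch) plus an opaque payload.
#[derive(Debug, Clone, PartialEq)]
struct ArchivedObs {
    observed_at_s: u64,
    payload: String,
}

/// Tracks the newest timestamp already forwarded downstream.
struct Watermark {
    last_seen_s: Option<u64>,
}

impl Watermark {
    fn new() -> Self {
        Watermark { last_seen_s: None }
    }

    /// Keep only observations newer than the watermark, then advance it.
    fn take_new(&mut self, mut batch: Vec<ArchivedObs>) -> Vec<ArchivedObs> {
        if let Some(w) = self.last_seen_s {
            batch.retain(|o| o.observed_at_s > w);
        }
        if let Some(max) = batch.iter().map(|o| o.observed_at_s).max() {
            self.last_seen_s = Some(max);
        }
        batch
    }
}
```

A timestamp-only watermark silently drops an observation that shares a timestamp with one already consumed but appears in a later poll. If the archive exposes a monotonic offset or per-record ID, track that instead.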
Sources and Sinks as a Boundary
A source is the component that produces events into the pipeline. It is the boundary between the pipeline and the upstream world — the radar firmware, the satellite telemetry stream, the upstream Kafka topic, the message queue. The source's responsibility is to deliver events into the pipeline's first stage in a format the pipeline understands.
A sink is the symmetric component at the output: the boundary between the pipeline and the downstream world. The sink writes events to wherever they need to go — another Kafka topic, an object store, a database, a subscriber callback, a downstream service.
The pivotal insight is that source and sink are positions, not types. A Kafka topic is a sink for the pipeline that produces into it and a source for the pipeline that consumes from it. This is why production streaming architectures look like graphs of pipelines connected by durable queues — each queue is the sink of its writer and the source of its readers, and the queue's durability decouples them in time and in failure modes.
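The "positions, not types" point shows up even in a toy two-stage pipeline. This sketch uses std::sync::mpsc and threads to stay self-contained; the project itself uses tokio channels, and an in-memory channel has none of a Kafka topic's durability or replay. The function name `run_two_stages` is illustrative.

```rust
use std::sync::mpsc;
use std::thread;

/// Run two pipeline stages connected by a channel. The channel's sender is
/// the first stage's sink; its receiver is the second stage's source.
fn run_two_stages(inputs: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<u32>();

    // Stage 1: writes into its sink (tx). Dropping tx when the thread ends
    // signals end of input to the reader.
    let producer = thread::spawn(move || {
        for x in inputs {
            tx.send(x * 10).expect("receiver dropped");
        }
    });

    // Stage 2: reads from its source (rx) until every sender is gone.
    let consumer = thread::spawn(move || rx.iter().map(|x| x + 1).collect::<Vec<u32>>());

    producer.join().expect("producer panicked");
    consumer.join().expect("consumer panicked")
}
```

The same channel object is named twice, once from each side. Replacing it with a durable queue changes the failure behaviour (the reader can restart without the writer noticing) but not the shape of either stage.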
Three properties of a source determine how the pipeline must interact with it:
- Replayability. Can the pipeline ask the source for events it has already consumed? Kafka can (configurable retention; consumers track offsets); a UDP radar feed cannot (packets arrive once and are gone if missed).
- Ordering guarantees. Does the source guarantee a total order, a per-partition order, or no order at all? UDP gives no order. Kafka guarantees per-partition order. ISL beacons typically guarantee per-satellite order but not cross-satellite order.
- Delivery guarantees. Does the source guarantee at-least-once delivery, at-most-once, or exactly-once? UDP is at-most-once. TCP-based sources are at-least-once if the application acks correctly. Kafka's idempotent producer suppresses duplicates caused by producer retries and, combined with transactions, supports exactly-once processing (covered in Module 5).
These three properties propagate. A pipeline cannot offer a stronger delivery guarantee than its weakest source unless it is willing to drop, deduplicate, or buffer to make up the difference.
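Deduplication is the usual way to tighten an at-least-once source. A minimal sketch, assuming each event carries a unique 128-bit ID (a u128 stands in for a UUID here) and that duplicates arrive within the last `capacity` events. `Deduplicator` is a hypothetical name, and the window size is a tuning assumption rather than a guarantee.

```rust
use std::collections::{HashSet, VecDeque};

/// Remembers the last `capacity` event IDs and rejects repeats.
/// State is bounded by count, so memory stays fixed on an unbounded stream.
struct Deduplicator {
    capacity: usize,
    seen: HashSet<u128>,
    order: VecDeque<u128>, // insertion order, oldest at the front
}

impl Deduplicator {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Deduplicator { capacity, seen: HashSet::new(), order: VecDeque::new() }
    }

    /// Returns true if `id` is new (forward it), false if it is a duplicate.
    fn admit(&mut self, id: u128) -> bool {
        if !self.seen.insert(id) {
            return false;
        }
        self.order.push_back(id);
        // Evict the oldest ID once the window is full.
        if self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        true
    }
}
```

The bounded window is the cost of running on an unbounded stream: a duplicate that arrives after its ID has been evicted is admitted again. Size the window from the source's worst-case redelivery delay.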
The Observation Envelope
When the pipeline accepts events from heterogeneous sources, the first stage's job is to normalize them into a single internal format. This is the envelope — a wrapper that preserves source-specific provenance while presenting a uniform interface to downstream stages.
For SDA fusion, every observation, regardless of source, has the same logical content: something was detected at some position at some time. The wire formats differ wildly — radar produces complex IQ samples reduced to range-rate pairs; optical produces angular measurements with timestamps; ISL produces full state vectors — but the downstream correlator does not need to know any of that. It needs position, time, source identifier, and uncertainty.
The envelope pattern:
struct Observation {
    // Identity and provenance
    source_id: SourceId,            // which sensor produced this
    source_kind: SourceKind,        // radar | optical | isl
    sensor_timestamp: SystemTime,   // when the sensor recorded it
    ingest_timestamp: SystemTime,   // when we received it
    // The actual observation payload
    target: ObservationTarget,      // what was observed (position, range-rate, etc.)
    uncertainty: Uncertainty,       // standard deviation or covariance
    // Routing and dedup
    observation_id: Uuid,           // unique per observation
}
The envelope is thin. It carries enough provenance to trace any observation back to its source and enough payload for the next stage to act on, but no more. Production envelopes drift toward fat over time as engineers add convenience fields; resist this. Every field in the envelope is paid for in CPU (deserialization), memory (buffering), and network (when stages run on different hosts).
A common mistake is to make the envelope carry the original wire format alongside the normalized fields. This produces an envelope twice the necessary size and tempts downstream stages to peek at the original wire format, breaking the abstraction. If the original wire format must be preserved (for replay, audit, or forensic analysis), write it to a parallel "raw observations" sink at ingestion time. Don't carry it through the pipeline.
Code Examples
Defining the Observation Envelope
The envelope is the type that flows through every stage of the SDA Fusion Service. The choice of representation here propagates everywhere — a poor envelope makes every downstream lesson harder.
use serde::{Deserialize, Serialize};
use std::time::SystemTime;
use uuid::Uuid;
/// The kind of sensor that produced an observation. This drives source-specific
/// handling at later stages (e.g., uncertainty models differ by sensor type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum SourceKind {
    /// Ground-based X-band radar: high-rate, range-rate measurements
    Radar,
    /// Ground-based optical telescope: angular measurements
    Optical,
    /// Inter-satellite link beacon: full state vector reports
    InterSatelliteLink,
}
/// Stable identifier for the specific sensor that produced this observation.
/// Distinct from SourceKind: there are 14 X-band radars in the network, and we
/// want to know *which* one detected this object.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct SourceId(pub String);
/// What the sensor observed. The variants reflect what each sensor type
/// actually measures; we do not pretend they all measure position vectors.
/// The correlator stage (Module 2) is responsible for fusing these into
/// position estimates.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ObservationTarget {
    /// Range, range-rate, azimuth, elevation from a radar
    RangeRate { range_m: f64, range_rate_m_s: f64, az_rad: f64, el_rad: f64 },
    /// Right ascension and declination from an optical telescope
    Angular { ra_rad: f64, dec_rad: f64 },
    /// ECI-frame state vector from an ISL beacon
    StateVector { position_m: [f64; 3], velocity_m_s: [f64; 3] },
}
/// Per-measurement uncertainty. Production code uses full covariance matrices;
/// this is a starting representation that we will refine in Module 3.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Uncertainty {
    pub sigma: f64,
}
/// The canonical envelope. Every observation in the pipeline has this shape.
/// Downstream stages do not look at wire formats, only at this struct.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Observation {
    pub observation_id: Uuid,
    pub source_id: SourceId,
    pub source_kind: SourceKind,
    pub sensor_timestamp: SystemTime,
    pub ingest_timestamp: SystemTime,
    pub target: ObservationTarget,
    pub uncertainty: Uncertainty,
}
Three things to notice. First, observation_id is a UUID, not a sequence number. UUIDs are generated at the source without coordination — sequence numbers would require a central allocator and would become a bottleneck under load. The tradeoff is that UUIDs are 16 bytes versus 8 for a u64; for SDA's volumes (hundreds of thousands of observations per minute), this is a non-issue. Second, the envelope holds both sensor_timestamp and ingest_timestamp. This distinction (event time vs processing time) becomes load-bearing in Module 3, but the data must be captured at ingestion or it is lost forever. Third, ObservationTarget is a sum type rather than a normalized "always position vector" representation. Forcing premature unification at ingestion discards information that the correlation stage needs. Normalize the envelope; preserve the measurement.
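The value of capturing both timestamps can be seen in a small helper. The following is a minimal sketch, not part of the service: `ingest_lag` is a hypothetical function name, and it assumes the two timestamps are the `SystemTime` values carried on the envelope.

```rust
use std::time::{Duration, SystemTime};

/// How far the ingest (processing-time) stamp trails the sensor (event-time)
/// stamp. Returns None when the sensor stamp is later than the ingest stamp,
/// which indicates clock skew between the sensor and the ingest host.
pub fn ingest_lag(
    sensor_timestamp: SystemTime,
    ingest_timestamp: SystemTime,
) -> Option<Duration> {
    // duration_since fails if `sensor_timestamp` is in the future relative
    // to `ingest_timestamp`; map that failure to None instead of panicking.
    ingest_timestamp.duration_since(sensor_timestamp).ok()
}
```

Neither quantity can be reconstructed later if only one timestamp was stored, which is why both are recorded at the boundary.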
A Source Trait
A source is a thing that produces a stream of observations. In Rust, the natural shape is an async trait that returns successive items:
use anyhow::Result;
use async_trait::async_trait;
/// A source produces observations from some upstream system. Implementations
/// hide the wire-format details behind this trait.
///
/// Cancellation: implementations should be cancel-safe at await points.
/// Dropping the future returned by `next` must not corrupt the source's state.
#[async_trait]
pub trait ObservationSource: Send {
/// Yield the next observation, or None if the source has terminated.
/// For genuinely unbounded sources (radar feeds), this never returns None.
async fn next(&mut self) -> Result<Option<Observation>>;
/// A stable identifier for logging and metrics. Not the same as
/// SourceId on the envelope — a single source instance may produce
/// observations from multiple SourceIds (e.g., one ISL listener
/// receives beacons from many satellites).
fn name(&self) -> &str;
}
Two design choices to flag. First, the trait returns Option<Observation> rather than just Observation because we want to signal graceful termination without using an error. Errors are reserved for actual failures (network drop, deserialization error, source-specific protocol violation). A radar source that never returns None is correct. An optical archive source that returns None when the archive has no new observations and the polling cadence has been satisfied is also correct. Second, the trait is not a Stream. We could have used futures::Stream<Item = Result<Observation>> and gained combinator support, but that buys us less than it costs: the explicit next method makes lifecycle management (logging, retries, source-specific timeouts) easier to compose. Modules 2 and 4 will build the orchestration layer around this trait.
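The three-way contract (item, graceful end, failure) is easiest to see from the caller's side. The sketch below is a synchronous, dependency-free analogue of the trait, written only to illustrate the return shape. `SyncSource`, `VecSource`, and `drain` are hypothetical names and are not part of the Meridian codebase; errors are plain strings here rather than anyhow errors.

```rust
/// Hypothetical synchronous analogue of ObservationSource, generic over the
/// item type so the sketch does not depend on the envelope definition.
pub trait SyncSource {
    type Item;
    /// Ok(Some(item)): the next item. Ok(None): graceful end. Err: a failure.
    fn next(&mut self) -> Result<Option<Self::Item>, String>;
}

/// A bounded test source: yields its scripted results in order, then ends.
pub struct VecSource<T> {
    items: std::vec::IntoIter<Result<T, String>>,
}

impl<T> VecSource<T> {
    pub fn new(items: Vec<Result<T, String>>) -> Self {
        Self { items: items.into_iter() }
    }
}

impl<T> SyncSource for VecSource<T> {
    type Item = T;
    fn next(&mut self) -> Result<Option<T>, String> {
        // Iterator::next yields Option<Result<T, E>>; transpose converts it
        // to Result<Option<T>, E>, which is the trait's return shape.
        self.items.next().transpose()
    }
}

/// Drain a source until it terminates. Returns the items seen, or the first
/// error encountered.
pub fn drain<S: SyncSource>(source: &mut S) -> Result<Vec<S::Item>, String> {
    let mut seen = Vec::new();
    while let Some(item) = source.next()? {
        seen.push(item);
    }
    Ok(seen)
}
```

The caller's loop distinguishes all three outcomes without inspecting error types: `?` handles failure, `while let Some` handles items, and falling out of the loop is graceful termination.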
A Minimal UDP Radar Source
This is the actual code that ingests from one of the X-band radar arrays. The radar firmware emits a fixed-size binary frame over UDP at 100–500 Hz; our job is to deserialize it and emit envelopes.
use anyhow::{Context, Result};
use async_trait::async_trait;
use std::time::{SystemTime, UNIX_EPOCH, Duration};
use tokio::net::UdpSocket;
use uuid::Uuid;
/// Wire format emitted by the X-band radar firmware. 64 bytes, packed.
/// Field layout is documented in Meridian-RF-2024-RADAR-WIRE-FORMAT.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct RadarFrame {
array_id: u32,
target_track_id: u32,
timestamp_ns: u64, // sensor-local clock since UNIX epoch
range_m: f64,
range_rate_m_s: f64,
azimuth_rad: f64,
elevation_rad: f64,
sigma_range_m: f32,
sigma_rate_m_s: f32,
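_reserved: [u8; 8], // pads the 56 bytes of fields above to the 64-byte frame
}
const RADAR_FRAME_SIZE: usize = std::mem::size_of::<RadarFrame>();
// Compile-time guard: the struct must match the documented 64-byte frame.
const _: () = assert!(RADAR_FRAME_SIZE == 64);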
pub struct UdpRadarSource {
socket: UdpSocket,
name: String,
buf: Box<[u8; 1500]>, // standard MTU; larger frames would arrive truncated
}
impl UdpRadarSource {
pub async fn bind(addr: &str, name: impl Into<String>) -> Result<Self> {
let socket = UdpSocket::bind(addr)
.await
.with_context(|| format!("binding radar source on {addr}"))?;
Ok(Self {
socket,
name: name.into(),
buf: Box::new([0u8; 1500]),
})
}
}
#[async_trait]
impl ObservationSource for UdpRadarSource {
async fn next(&mut self) -> Result<Option<Observation>> {
// recv_from is cancel-safe in tokio: a dropped future leaves no
// partially-consumed datagram. This matters for the orchestrator
// (Module 2), which cancels source tasks during shutdown.
let (n, _peer) = self.socket.recv_from(&mut self.buf[..]).await
.context("recv_from on radar UDP socket")?;
if n != RADAR_FRAME_SIZE {
// Truncated or oversized frame: report it as an error so the caller
// can log it and drop it. UDP gives no recovery; the radar will emit
// the next frame in ~2-10ms.
anyhow::bail!("radar frame size {n} != expected {RADAR_FRAME_SIZE}");
}
// SAFETY: we just verified the byte count matches the struct size,
// and RadarFrame is #[repr(C, packed)] of POD types. The radar firmware
// is documented to emit little-endian on the wire and our hosts are
// little-endian; if we ever deploy on big-endian hosts we add a swap.
let frame: RadarFrame = unsafe {
std::ptr::read_unaligned(self.buf.as_ptr() as *const RadarFrame)
};
let sensor_ts = UNIX_EPOCH + Duration::from_nanos(frame.timestamp_ns);
let array_id = frame.array_id; // copy out of packed struct for formatting
Ok(Some(Observation {
observation_id: Uuid::new_v4(),
source_id: SourceId(format!("radar-{}", array_id)),
source_kind: SourceKind::Radar,
sensor_timestamp: sensor_ts,
ingest_timestamp: SystemTime::now(),
target: ObservationTarget::RangeRate {
range_m: frame.range_m,
range_rate_m_s: frame.range_rate_m_s,
az_rad: frame.azimuth_rad,
el_rad: frame.elevation_rad,
},
uncertainty: Uncertainty {
sigma: frame.sigma_range_m as f64,
},
}))
}
fn name(&self) -> &str { &self.name }
}
A few points worth dwelling on. UDP gives no delivery guarantee — if a radar frame is lost in transit, it's gone. For a sensor producing 100–500 frames per second per array, this is acceptable; the consequence is slightly higher uncertainty in the fused track, not a missed conjunction. If we needed at-least-once delivery here, we would need a different transport (Kafka with a radar-side producer, for instance) and we would lose the simplicity of UDP. The choice of transport is a delivery-guarantee decision, not just a performance decision. We will return to this in Module 5. The unsafe block deserializing the frame is necessary because the wire format is #[repr(C, packed)] and UDP buffers have no alignment guarantee. Production systems use crates like zerocopy or bytemuck to make this safe; we use raw read_unaligned here for clarity. Either way, the cost of the deserialization is single-digit nanoseconds per frame, far below the per-frame budget.
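For comparison, the same frame can be decoded without unsafe by reading each field with `from_le_bytes`. This is a minimal std-only sketch, not the production decoder and not the zerocopy or bytemuck approach mentioned above. `DecodedFrame`, `decode_frame`, `take`, and `FIELD_BYTES` are hypothetical names, and the byte offsets are an assumption derived from the field order of the RadarFrame struct as shown, not from the wire-format document.

```rust
/// Fields decoded from one radar frame, in the order RadarFrame declares them.
#[derive(Debug, Clone, PartialEq)]
pub struct DecodedFrame {
    pub array_id: u32,
    pub target_track_id: u32,
    pub timestamp_ns: u64,
    pub range_m: f64,
    pub range_rate_m_s: f64,
    pub azimuth_rad: f64,
    pub elevation_rad: f64,
    pub sigma_range_m: f32,
    pub sigma_rate_m_s: f32,
}

/// Bytes occupied by the named fields (the reserved tail is excluded).
pub const FIELD_BYTES: usize = 56;

/// Copy N bytes starting at `at` into a fixed-size array.
/// The caller guarantees that `at + N` is within bounds.
fn take<const N: usize>(buf: &[u8], at: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[at..at + N]);
    out
}

/// Decode the named fields from a little-endian frame. Returns None if the
/// buffer is too short to hold them. Trailing reserved bytes are ignored.
pub fn decode_frame(buf: &[u8]) -> Option<DecodedFrame> {
    if buf.len() < FIELD_BYTES {
        return None;
    }
    Some(DecodedFrame {
        array_id: u32::from_le_bytes(take(buf, 0)),
        target_track_id: u32::from_le_bytes(take(buf, 4)),
        timestamp_ns: u64::from_le_bytes(take(buf, 8)),
        range_m: f64::from_le_bytes(take(buf, 16)),
        range_rate_m_s: f64::from_le_bytes(take(buf, 24)),
        azimuth_rad: f64::from_le_bytes(take(buf, 32)),
        elevation_rad: f64::from_le_bytes(take(buf, 40)),
        sigma_range_m: f32::from_le_bytes(take(buf, 48)),
        sigma_rate_m_s: f32::from_le_bytes(take(buf, 52)),
    })
}
```

Explicit little-endian decoding also removes the dependence on host byte order that the SAFETY comment in the unsafe version has to call out.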
A Minimal Sink
A sink consumes observations and writes them somewhere. The simplest possible sink is one that pushes envelopes onto an MPSC channel for the next stage to consume:
use anyhow::Result;
use tokio::sync::mpsc;
/// A sink consumes observations from a source and forwards them onward.
/// The simplest sink is a channel send: the next stage owns the receiver
/// and pulls from it.
pub struct ChannelSink {
tx: mpsc::Sender<Observation>,
name: String,
}
impl ChannelSink {
pub fn new(tx: mpsc::Sender<Observation>, name: impl Into<String>) -> Self {
Self { tx, name: name.into() }
}
/// Forward an observation to the downstream stage.
/// Returns Err if the receiver has been dropped (downstream is gone).
pub async fn write(&self, obs: Observation) -> Result<()> {
// .send().await applies backpressure: if the channel is full,
// this future does not resolve until there is capacity. The
// upstream source is forced to wait. This is the right behavior —
// we will analyze it in depth in Module 4.
self.tx.send(obs).await
.map_err(|_| anyhow::anyhow!("downstream receiver dropped"))?;
Ok(())
}
pub fn name(&self) -> &str { &self.name }
}
The single most important property of this sink is that write awaits. When the downstream channel is full (because the next stage is slow), the source's next().await plus the sink's write().await form a chain that propagates backpressure all the way back to the radar UDP socket — at which point we start dropping packets at the kernel level rather than building unbounded memory in the application. This is the foundation of the dataflow model we'll cover in Lesson 2 and the explicit subject of Module 4. A non-awaiting sink (one that internally buffered into an unbounded queue) would silently OOM the process during a fragmentation event surge. The choice between bounded and unbounded internal buffering is a load-bearing architectural decision masquerading as an implementation detail.
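The point at which a bounded channel pushes back can be observed directly. The sketch below substitutes the standard library's `sync_channel` for tokio's mpsc channel so that it runs without a runtime; the bounded-capacity behavior it illustrates is the same idea, but the function name `sends_accepted_before_full` is hypothetical and the code is not taken from the service.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Fill a bounded channel of `capacity` slots while nothing drains it, and
/// report how many sends were accepted before the channel pushed back.
pub fn sends_accepted_before_full(capacity: usize, attempts: usize) -> usize {
    // `_rx` stays alive for the whole function, so the channel is open but
    // never drained: this models a stalled downstream stage.
    let (tx, _rx) = sync_channel::<u64>(capacity);
    let mut accepted = 0;
    for i in 0..attempts {
        match tx.try_send(i as u64) {
            Ok(()) => accepted += 1,
            // Full: a blocking `send` would park the producer here until the
            // consumer made room. That stall is the backpressure signal.
            Err(TrySendError::Full(_)) => break,
            // Disconnected cannot happen while `_rx` is alive; handled for
            // completeness.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With a capacity of 2 and a stalled consumer, exactly two sends are accepted. An unbounded channel would accept every attempt, which is the memory growth the paragraph above warns about.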
Key Takeaways
- A stream is defined by unboundedness: the input has no expected end. This single property dictates that state must be bounded, completeness is point-in-time, and recovery cannot rerun from the start. Treating an unbounded source as bounded is a category error that produces predictable failures under load.
- Sources and sinks are boundary positions, not data types. The Kafka topic that is the sink of one pipeline is the source of the next. This composition is why production streaming architectures look like graphs connected by durable queues.
- A source is characterized by three properties — replayability, ordering, delivery guarantees — and the pipeline cannot offer stronger guarantees than its weakest source without compensating mechanisms.
- The observation envelope unifies heterogeneous wire formats behind a single internal type. Capture provenance (source ID, sensor timestamp, ingest timestamp) at the boundary; preserve the original measurement form rather than prematurely normalizing to a single shape.
- tokio::sync::mpsc::Sender::send().await is the foundation of backpressure. A source that awaits its sink and a sink that awaits its downstream channel form a chain that propagates pressure to the kernel. Internal unbounded buffering breaks this chain and produces silent OOMs.
Lesson 2 — The Dataflow Model
Module: Data Pipelines — M01: Stream Processing Foundations Position: Lesson 2 of 3 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Stream Processing, "Processing Streams" section); Kafka: The Definitive Guide — Shapira et al., Chapter 14 (Stream Processing: Topology, State); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 3 ("The Dataflow Model and Unified Batch and Streaming")
Context
The dataflow model is the conceptual frame that production stream processing systems are organized around. Kafka Streams, Apache Flink, Apache Beam, the internal architecture of Tokio's Stream combinators, every Rust pipeline you'll write in this track — they all express their work as a graph of operators that consume from upstream, transform, and produce downstream. Understanding the model gives you a vocabulary to reason about pipelines that scales from single-process Rust binaries to multi-cluster Beam jobs.
The shift from imperative to dataflow is the same shift you made when you moved from for loops to iterator combinators in Rust. An imperative pipeline says "loop over inputs, do step A, do step B, do step C." A dataflow pipeline says "compose operator A with operator B with operator C; the framework decides how to schedule the work, how to parallelize it, where to introduce buffering, when to checkpoint." The shift is more than aesthetic — the dataflow representation is what enables the framework to make decisions an imperative loop can't (parallel execution across operators, automatic backpressure, exactly-once via barrier markers, topology-aware optimization). Reis and Housley argue in Fundamentals of Data Engineering Ch. 3 that the dataflow model is what unifies batch and streaming computation: a batch is a stream that ends, and the same operator graph can run over either.
For the SDA Fusion Service, the operator graph is the architecture document. Sources at the edges (radar, optical, ISL), sinks at the other edges (the catalog store, the alert emitter), and a chain of operators in between (normalize, dedupe, correlate, filter, enrich). When the conjunction-alert latency SLA is at risk, the question is which operator in the graph is the bottleneck. When the pipeline is rebuilt for a new sensor, the question is where in the graph the new branch attaches. Without the graph, every conversation is about implementation; with the graph, the conversation can be about architecture.
Core Concepts
Operators as Functions Over Streams
In the dataflow model, an operator is a function that takes one or more input streams and produces one or more output streams. The function is total — it is defined for every possible input event — but the output is not constrained to be a one-for-one mapping. An operator may emit zero, one, or many output events for each input event, and the relationship between input rate and output rate is part of the operator's specification.
The five canonical operator shapes are:
- Map. One input event in, one output event out. Stateless. The dominant operator in normalization stages: converting a wire-format frame to an Observation envelope is a map.
- Filter. One input event in, zero or one output events out. Stateless. Used to drop observations that fail validation or fall outside the area of interest. Dropping duplicates has the same shape but needs memory of what has already been seen, so a deduplicating filter is stateful.
- FlatMap. One input event in, zero or more output events out. Stateless. A radar frame containing multiple detected targets becomes one event per target.
- Fold (Aggregate). One or more input events in, one output event out, with state accumulated across events. Stateful. Computing a running mean of range-rate per object is a fold.
- Window (Group). A grouping operator that collects events into bounded buckets — by time, by count, by session — and emits one output per bucket when the window closes. Stateful and time-aware. Conjunction risk computation in Module 3 is a windowed operator.
A sixth shape that doesn't fit neatly into the above is the join, which takes two input streams and produces a single output stream containing matched pairs. Joins are the most expensive operator class — they require state proportional to the unmatched-but-still-relevant events from both sides — and we cover them in detail in Module 3.
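The first four shapes have direct counterparts among Rust's iterator adapters, which makes them easy to try on a bounded input. The sketch below is illustrative only: `Detection` and `summarize` are hypothetical names, and the 10,000 m/s plausibility threshold is an arbitrary value chosen for the example.

```rust
/// One detection inside a radar frame (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub track: u32,
    pub range_rate_m_s: f64,
}

/// Runs flatmap, filter, map, and fold over a bounded batch of frames.
/// Returns the surviving track IDs and the mean absolute range rate.
pub fn summarize(frames: Vec<Vec<Detection>>) -> (Vec<u32>, f64) {
    let kept: Vec<Detection> = frames
        .into_iter()
        // FlatMap: one frame in, zero or more detections out.
        .flat_map(|frame| frame)
        // Filter: drop detections with an implausible range rate.
        .filter(|d| d.range_rate_m_s.abs() < 10_000.0)
        // Map: one detection in, one detection out (here: take the magnitude).
        .map(|d| Detection { range_rate_m_s: d.range_rate_m_s.abs(), ..d })
        .collect();

    let tracks: Vec<u32> = kept.iter().map(|d| d.track).collect();

    // Fold: the accumulator (count, sum) is the operator's state.
    let (count, sum) = kept
        .iter()
        .fold((0u32, 0.0f64), |(c, s), d| (c + 1, s + d.range_rate_m_s));
    let mean = if count == 0 { 0.0 } else { sum / count as f64 };
    (tracks, mean)
}
```

The difference in a streaming setting is that the input never ends, so the fold has to emit intermediate results instead of a single final value, and its accumulator becomes long-lived operator state.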
The DDIA framing is that operators are streaming versions of relational algebra. Map is projection, filter is selection, fold is aggregation, window is group-by, join is join. The same algebraic identities hold (you can push filters past maps, you can fuse adjacent maps), and the same costs apply (joins are the expensive operation, aggregations require state). If you have an intuition for how a SQL query gets optimized, you have the foundation to reason about a streaming topology.
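The filter-pushdown identity can be checked on a small example. The sketch below is an illustration under assumed names (`metres_to_km`, `in_range_km`, and the 2,000 km cutoff are all hypothetical): it applies a map followed by a filter, then the rewritten form in which the filter runs first with the predicate composed over the map.

```rust
fn metres_to_km(m: f64) -> f64 {
    m / 1000.0
}

fn in_range_km(km: f64) -> bool {
    km < 2_000.0
}

/// Original plan: convert every range, then discard the distant ones.
pub fn map_then_filter(ranges_m: &[f64]) -> Vec<f64> {
    ranges_m
        .iter()
        .copied()
        .map(metres_to_km)
        .filter(|km| in_range_km(*km))
        .collect()
}

/// Rewritten plan: the predicate is expressed over the map's input as
/// p(f(x)), so the filter can run first and the map sees fewer events.
pub fn filter_then_map(ranges_m: &[f64]) -> Vec<f64> {
    ranges_m
        .iter()
        .copied()
        .filter(|m| in_range_km(metres_to_km(*m)))
        .map(metres_to_km)
        .collect()
}
```

Both plans produce the same output for any input. Whether the rewrite is a win depends on how selective the filter is and how expensive the map is, which is the same cost reasoning a SQL optimizer applies.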
The Pipeline as a Topology
The full pipeline — sources to sinks — is a directed graph. Vertices are operators (including sources and sinks); edges are streams flowing between operators. Kafka Streams calls this a topology; Flink calls it a job graph; Beam calls it a pipeline. They are all the same object.
The topology has structural properties that matter:
- Linear vs branching. A linear topology has a single path from source to sink. A branching topology has fan-out (one operator feeds multiple downstreams) or fan-in (multiple operators feed one downstream). The SDA pipeline is both: three sources fan in to a single normalization stage, then fan out to a correlator and a raw archive sink in parallel.
- Acyclic vs cyclic. Almost all production topologies are acyclic. Cycles introduce hard problems: when does a fixed point exist? How is termination defined? How does backpressure traverse a cycle? Some engines support cycles for iterative algorithms (Flink's iteration constructs, for example), but only with explicit coordination semantics and at significant cost. Treat cycles as a smell.
- Stateless vs stateful operators. Stateless operators (map, filter, flatmap) can be parallelized trivially — replicate the operator N ways and load-balance events across the replicas. Stateful operators (fold, window, join) require partitioning — events that share a key must go to the same operator replica, because the state for that key is held there.
The topology view is what makes streaming pipelines explainable to other engineers and to operators. A diagram showing source-to-sink connectivity, with annotations for which operators are stateful and how the streams are partitioned, is more useful than any amount of source code for understanding why the pipeline behaves the way it does.
Statelessness, State, and Partitioning
The most consequential distinction among operators is whether they carry state. A stateless operator is a pure function — given the same input, it produces the same output every time. It can be torn down and rebuilt on a new host with no recovery; it can run with arbitrary parallelism. A stateful operator carries information between events: a counter, a window of recent values, a lookup table. State is the source of operational complexity in streaming systems. Every stateful operator is a question about checkpointing, recovery, exactly-once semantics, and partitioning.
Where does the state live? Three choices, in order of increasing operational cost:
- In-process state. A HashMap inside the operator. Fast, simple, lost on crash, doesn't survive rescaling. Acceptable for low-importance operators or for operators whose state can be reconstructed from the input stream by replaying recent events.
- Embedded persistent state. RocksDB or sled inside the operator process. Fast for local access, requires explicit checkpointing for recovery, requires partition-aware redistribution when scaling. Kafka Streams and Flink both offer RocksDB-backed state stores of this kind.
- External state. A separate database or cache (Redis, Cassandra, the OOR storage engine you built in the Database Internals track). Slow per access, easy to share across operator replicas, decouples scaling from state. Used when state must be queryable from outside the pipeline.
The choice of state backing is one of the most consequential decisions in pipeline architecture. We will not implement embedded persistent state in this track — it would consume the entire track on its own — but we will use in-process state in Modules 2 and 3 and discuss the implications of moving to embedded state in Module 5.
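As a concrete picture of in-process state, here is the running-mean fold mentioned earlier, with its state held in a HashMap keyed by object. This is a minimal sketch under assumptions: `RunningMean` is a hypothetical type, the object ID is taken to be a string, and no pruning is shown, so the map grows with the number of distinct keys.

```rust
use std::collections::HashMap;

/// Per-object running mean of range rate. All state lives in this struct,
/// so it is lost if the process restarts.
#[derive(Default)]
pub struct RunningMean {
    // object ID -> (number of samples seen, current mean)
    state: HashMap<String, (u64, f64)>,
}

impl RunningMean {
    /// Fold one sample into the state for `object_id` and return the
    /// updated mean for that object.
    pub fn update(&mut self, object_id: &str, range_rate_m_s: f64) -> f64 {
        let entry = self
            .state
            .entry(object_id.to_string())
            .or_insert((0, 0.0));
        entry.0 += 1;
        // Incremental mean: mean += (x - mean) / n. This avoids keeping a
        // running sum that could grow without bound.
        entry.1 += (range_rate_m_s - entry.1) / entry.0 as f64;
        entry.1
    }

    /// Number of objects currently holding state.
    pub fn tracked_objects(&self) -> usize {
        self.state.len()
    }
}
```

Everything that makes stateful operators operationally expensive is visible here: the map must be bounded somehow, it disappears on crash, and a second replica of this operator would hold a different map.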
Partitioning is the bridge between state and parallelism. A stateful operator is partitioned by a key — for the SDA correlator, the key is the orbital object identifier. All observations of object 2024-001A route to the same operator replica, where the state for that object lives. Partitioning is what allows stateful operators to scale: add more replicas, repartition the stream by key, and each replica owns a disjoint subset of keys. Partitioning is also where ordering guarantees come from in streaming systems: within a single partition, events are processed in order; across partitions, no order is guaranteed.
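Routing by key usually comes down to hashing the key and taking it modulo the replica count. The following is a sketch of that idea only; `replica_for` is a hypothetical name, and the use of the standard library's `DefaultHasher` is an assumption made to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Choose the operator replica that owns the state for `object_id`.
/// The same key always maps to the same replica for a fixed replica count.
pub fn replica_for(object_id: &str, replicas: usize) -> usize {
    assert!(replicas > 0, "need at least one replica");
    // DefaultHasher::new() is deterministic within one build of the program,
    // but its algorithm is not guaranteed stable across Rust releases. A
    // deployed system would pin an explicit hash function instead.
    let mut hasher = DefaultHasher::new();
    object_id.hash(&mut hasher);
    (hasher.finish() % replicas as u64) as usize
}
```

Note that changing the replica count changes the mapping for most keys, which is why rescaling a stateful operator implies moving state between replicas.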
Why Dataflow Beats Imperative Loops
You could write the SDA pipeline as a single async function that reads from sources, transforms in-line, and writes to sinks. Many systems start that way. Three things go wrong as the pipeline grows:
- Mixing concerns. The function ends up containing transport details (UDP receive logic), serialization (binary frame parsing), validation (drop frames with impossible range rates), business logic (correlation), and observability (metrics and lineage). Every change touches a function that touches everything else.
- Fixed parallelism. A monolithic loop runs at a single rate. If correlation is slow, ingestion is slow. If ingestion is slow, the radar UDP buffer overflows. The dataflow model lets each operator run at its own rate, with bounded buffers between them; slow operators can be replicated, fast operators can stay single-threaded.
- No structural visibility. When a metric goes wrong (P99 latency rises, throughput drops, errors spike), the only handle on the system is the call stack. The dataflow model gives every edge in the graph a name and lets you instrument each one independently. Per-stage lag, per-stage throughput, per-stage error rate become first-class observable properties.
The Kafka Streams architecture documentation makes this explicit: the topology is the artifact you reason about, debug against, and scale. The code that implements the topology is much shorter and changes much less often than the equivalent imperative code would.
Code Examples
Building Operators on tokio Channels
The simplest way to express a pipeline graph in Rust is one task per operator, connected by mpsc channels. Each operator is a long-running async function that owns one or more receivers and one or more senders.
use anyhow::Result;
use tokio::sync::mpsc;
/// A stateless map operator. Reads observations, applies a transformation,
/// forwards the result.
///
/// The signature is parameterized by the transform function for reusability;
/// in the SDA pipeline this is used for normalization, enrichment, and
/// schema migration.
pub async fn map_operator<F>(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<Observation>,
mut transform: F,
) -> Result<()>
where
F: FnMut(Observation) -> Observation + Send,
{
while let Some(obs) = input.recv().await {
let transformed = transform(obs);
// .send().await applies backpressure if downstream is full.
// If the receiver is dropped, the operator terminates cleanly —
// a downstream shutdown signal naturally propagates upstream.
if output.send(transformed).await.is_err() {
tracing::info!("map operator: downstream closed, shutting down");
return Ok(());
}
}
Ok(())
}
/// A stateless filter operator. Drops observations for which the predicate
/// returns false. Used in SDA for area-of-interest filtering and validation.
pub async fn filter_operator<F>(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<Observation>,
mut predicate: F,
) -> Result<()>
where
F: FnMut(&Observation) -> bool + Send,
{
while let Some(obs) = input.recv().await {
if predicate(&obs) {
if output.send(obs).await.is_err() {
return Ok(());
}
}
// Dropped observations: no send, no backpressure stall.
// The filter operator runs at the input rate.
}
Ok(())
}
Two design points. First, the operators take ownership of their Receiver but accept a Sender, which is cloneable; this is intentional. When a topology has fan-in (multiple operators feed one downstream), each upstream operator owns its own clone of the same Sender. Fan-out (one operator feeds multiple downstreams) is different: each downstream branch is a separate channel, so the operator holds one Sender per branch and sends to each. Second, the operators terminate cleanly when their output channel is closed. This produces a clean shutdown propagation: closing the final sink causes the last operator's next send to fail and the operator to return, which drops its receiver, which causes the operator before it to terminate, and so on back to the source. Shutdown also propagates in the forward direction: when every Sender for a channel has been dropped, recv() returns None and the downstream operator's loop ends. This is the streaming-system equivalent of unwinding a call stack, but it works across asynchronous task boundaries.
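Fan-out needs one sender per downstream branch, since each branch is its own channel. The sketch below shows that shape using the standard library's bounded `sync_channel` and blocking sends in place of tokio channels, so it runs without a runtime. `fan_out` is a hypothetical operator and is not part of the SDA pipeline code.

```rust
use std::sync::mpsc::{Receiver, SyncSender};

/// Fan-out: forward every input item to each downstream branch.
/// Each branch is a distinct bounded channel, so the operator holds one
/// sender per branch. Returns when the input closes or any branch is gone.
pub fn fan_out<T: Clone>(input: Receiver<T>, branches: Vec<SyncSender<T>>) {
    // Iterating a Receiver blocks for the next item and ends once every
    // upstream sender has been dropped.
    for item in input {
        for tx in &branches {
            // A blocking send on a full branch stalls the whole operator,
            // so the slowest branch sets the pace for all of them.
            if tx.send(item.clone()).is_err() {
                return;
            }
        }
    }
    // `branches` is dropped on return, which closes every downstream channel.
}
```

The design choice worth noticing is that a single slow branch applies backpressure to every other branch. Decoupling them requires larger per-branch buffers or a policy for dropping on the slow branch, both of which are explicit decisions.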
A Stateful Fold Operator
The first interesting operator in the SDA pipeline is a stateful one: the deduplicator. The same orbital object can be detected by multiple radar arrays during a single pass; the deduplicator collapses these into a single observation per (object, time-window) pair before forwarding to the correlator.
use anyhow::Result;
use std::collections::HashMap;
use std::time::{Duration, Instant};
use tokio::sync::mpsc;
/// Deduplicates observations within a sliding time window keyed on object ID.
/// State is held in-process; on crash, we lose the dedup state and may
/// briefly emit duplicates as the window refills. This is acceptable for
/// the SDA pipeline; alternative state backings are discussed in Module 5.
///
/// The window is *not* an event-time window — that comes in Module 3.
/// This is a processing-time approximation suitable for ingestion-time dedup.
pub async fn dedup_operator(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<Observation>,
window: Duration,
) -> Result<()> {
// State: dedup key -> time of the last emitted observation, measured on
// the monotonic clock (processing time, not event time).
// For SDA volumes (~5e4 active objects), this HashMap stays under 5MB.
// We periodically prune entries older than twice `window` to bound memory.
let mut last_seen: HashMap<String, Instant> = HashMap::new();
let mut last_pruned = Instant::now();
while let Some(obs) = input.recv().await {
// Placeholder key. The envelope does not carry a per-object track ID,
// so this key identifies only the emitting sensor, and the window
// therefore rate-limits each sensor rather than each object. Per-object
// dedup needs an object or track ID in the key, which in a real system
// requires a per-source mapping to a global ID (one of the things the
// correlator does).
let key = format!("{:?}:{}", obs.source_kind, obs.source_id.0);
let now = Instant::now();
let should_emit = match last_seen.get(&key) {
Some(&t) if now.duration_since(t) < window => false,
_ => true,
};
if should_emit {
last_seen.insert(key, now);
if output.send(obs).await.is_err() {
return Ok(());
}
}
// Periodic pruning to bound memory. Run every `window` interval.
if now.duration_since(last_pruned) > window {
last_seen.retain(|_, &mut t| now.duration_since(t) < window * 2);
last_pruned = now;
}
}
Ok(())
}
The state in this operator is a HashMap<String, Instant>. It survives across observations but does not survive a restart: if the process crashes, the next batch of observations will all appear novel and may be emitted twice. For the SDA pipeline this is an acceptable failure mode; the conjunction analysis tolerates a brief uptick in duplicate events after a restart, and we save the operational cost of a persistent state backend. This tradeoff (accepting transient correctness violations to avoid persistent state) is one of the most common in streaming systems and one that should always be made explicitly. We will revisit it in Module 5 when discussing exactly-once semantics. Note that the dedup window here is in processing time: it is measured with Instant from the moment the operator last emitted an observation for this key, not from the sensor timestamp. This is fine for an ingestion-time guard but is not the same thing as an event-time window, which we cover in Module 3 and which is necessary for any time-correctness guarantee.
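The dedup decision itself can be separated from the channel plumbing, which makes it testable with synthetic clock values. The sketch below is one way to do that and is not the pipeline's implementation: `Deduper`, `admit`, and `prune` are hypothetical names, and the key is assumed to be whatever per-object identifier the caller can supply.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Processing-time dedup state, independent of any channel or runtime.
pub struct Deduper {
    window: Duration,
    // key -> time at which an observation for that key was last forwarded
    last_emitted: HashMap<String, Instant>,
}

impl Deduper {
    pub fn new(window: Duration) -> Self {
        Self { window, last_emitted: HashMap::new() }
    }

    /// Returns true if an observation with this key arriving at `now`
    /// should be forwarded, and records the emission if so.
    pub fn admit(&mut self, key: &str, now: Instant) -> bool {
        match self.last_emitted.get(key) {
            // Still inside the window since the last emission: suppress.
            Some(&t) if now.saturating_duration_since(t) < self.window => false,
            _ => {
                self.last_emitted.insert(key.to_string(), now);
                true
            }
        }
    }

    /// Drop entries whose window has fully elapsed, to bound memory.
    pub fn prune(&mut self, now: Instant) {
        let window = self.window;
        self.last_emitted
            .retain(|_, t| now.saturating_duration_since(*t) < window);
    }

    /// Number of keys currently holding state.
    pub fn len(&self) -> usize {
        self.last_emitted.len()
    }
}
```

Passing `now` in as an argument keeps the logic deterministic under test, and the same structure carries over to event-time windows by passing the sensor timestamp instead.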
Composing the Topology
With operators defined, the topology is built by spawning each operator as a task and wiring the channels between them. The structure of the spawning code is the structure of the pipeline graph.
use anyhow::Result;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::task::JoinSet;
/// Spawns the M1 ingestion topology:
///
/// [radar_src]   ─┐
/// [optical_src] ─┼─→ [normalize] ─→ [dedup] ─→ [sink]
/// [isl_src]     ─┘
///
/// This is the operator graph for the SDA Sensor Ingestion Service.
/// Module 2 extends it with a real orchestrator; Module 3 replaces dedup
/// with a windowed correlator.
pub async fn spawn_ingestion_topology(
mut radar: UdpRadarSource,
mut optical: OpticalArchiveSource,
mut isl: IslBeaconSource,
final_sink: mpsc::Sender<Observation>,
) -> JoinSet<Result<()>> {
let mut tasks = JoinSet::new();
// Three source-to-normalize channels (fan-in).
// Buffer size 1024 is a starting point; Module 4 covers buffer sizing.
let (n_tx, n_rx) = mpsc::channel::<Observation>(1024);
// Source tasks: each pushes to the same n_tx, so we clone it per source.
let n_tx_radar = n_tx.clone();
tasks.spawn(async move {
loop {
match radar.next().await? {
Some(obs) => {
if n_tx_radar.send(obs).await.is_err() { break; }
}
None => break,
}
}
Ok(())
});
let n_tx_optical = n_tx.clone();
tasks.spawn(async move {
loop {
match optical.next().await? {
Some(obs) => {
if n_tx_optical.send(obs).await.is_err() { break; }
}
None => break,
}
}
Ok(())
});
let n_tx_isl = n_tx; // move the original (no clone) so no spare Sender keeps the channel open after the sources exit
tasks.spawn(async move {
loop {
match isl.next().await? {
Some(obs) => {
if n_tx_isl.send(obs).await.is_err() { break; }
}
None => break,
}
}
Ok(())
});
// Normalize -> dedup edge.
let (d_tx, d_rx) = mpsc::channel::<Observation>(1024);
tasks.spawn(map_operator(n_rx, d_tx, |obs| {
// Normalize timestamps: ensure ingest_timestamp is set.
// The actual normalization rules grow as the pipeline matures.
obs
}));
// Dedup -> final sink.
tasks.spawn(dedup_operator(d_rx, final_sink, Duration::from_millis(500)));
tasks
}
Notice that the function signature is the topology specification. The arguments name the inputs and outputs; the body lays out which operators connect to which channels. A reader unfamiliar with the codebase can understand the pipeline shape from this function alone — they don't need to chase through five files to understand what flows where. This is the dataflow model paying off: the structure of the code mirrors the structure of the data. Production systems with many operators eventually outgrow inline channel wiring and adopt a topology builder DSL (Kafka Streams' StreamsBuilder is the canonical example), but the principle is the same: declare the graph, let the framework spawn the tasks. We will build a small topology builder in Module 2.
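The same wiring pattern can be exercised without tokio by using one thread per operator and bounded standard-library channels. The sketch below is a dependency-free analogue meant only to show the shape: `map_stage`, `filter_stage`, and `run_pipeline` are hypothetical names, the items are plain integers, and the transforms are arbitrary.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

/// Map stage: one item in, one item out. Ends when input or output closes.
fn map_stage<T, U>(input: Receiver<T>, output: SyncSender<U>, f: impl Fn(T) -> U) {
    for item in input {
        if output.send(f(item)).is_err() {
            return;
        }
    }
}

/// Filter stage: forwards only the items for which `keep` returns true.
fn filter_stage<T>(input: Receiver<T>, output: SyncSender<T>, keep: impl Fn(&T) -> bool) {
    for item in input {
        if keep(&item) && output.send(item).is_err() {
            return;
        }
    }
}

/// Wires [source] -> [map] -> [filter] -> collector, one thread per stage.
/// The body of this function is the topology.
pub fn run_pipeline(inputs: Vec<i64>) -> Vec<i64> {
    // Bounded edges: a full channel blocks the upstream stage.
    let (src_tx, src_rx) = sync_channel::<i64>(2);
    let (map_tx, map_rx) = sync_channel::<i64>(2);
    let (out_tx, out_rx) = sync_channel::<i64>(2);

    let source = thread::spawn(move || {
        for x in inputs {
            if src_tx.send(x).is_err() {
                break;
            }
        }
        // src_tx is dropped here, which signals end-of-stream downstream.
    });
    let mapper = thread::spawn(move || map_stage(src_rx, map_tx, |x: i64| x * 2));
    let filterer =
        thread::spawn(move || filter_stage(map_rx, out_tx, |x: &i64| *x % 3 != 0));

    // Collect until the last stage drops its sender.
    let collected: Vec<i64> = out_rx.iter().collect();
    for handle in vec![source, mapper, filterer] {
        handle.join().expect("stage panicked");
    }
    collected
}
```

End-of-stream flows forward through dropped senders, so the collector finishes on its own once the source is exhausted. No explicit shutdown signal is needed for a bounded input.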
Source note: This lesson synthesizes the dataflow model from DDIA Ch. 11 (which discusses operators and stream processing without using a single canonical name for the framework), Kafka Ch. 14 (which uses the term "topology" extensively), and FDE Ch. 3 (which uses "Dataflow Model" to specifically reference the Apache Beam paper of that name). Engineers familiar with one framework's vocabulary will recognize the others — the concepts are stable across implementations.
Key Takeaways
- The pipeline is a graph of operators, not a sequence of function calls. Sources at the edges, sinks at the other edges, operators connected by streams in between. This representation is the artifact you architect, debug, and scale against.
- Operators come in five canonical shapes — map, filter, flatmap, fold, window — plus join. Stateless operators (map, filter, flatmap) parallelize trivially; stateful ones (fold, window, join) require partitioning by key.
- State is the source of operational cost in streaming systems. Every stateful operator forces decisions about checkpointing, recovery, and where state lives (in-process, embedded persistent, external). Choose deliberately and document the consequences.
- Backpressure propagates along the topology when channels are bounded and operators await on send. This is the inverse of the imperative-loop pitfall: in dataflow code, doing nothing (awaiting) is the correct behavior under load.
- The topology specification is architecture documentation. A function whose body wires up channels and spawns operators is the most direct expression of the pipeline's structure — clearer than any prose description. Production systems graduate to a topology DSL but the principle is unchanged.
Lesson 3 — Push, Pull, and Poll Semantics
Module: Data Pipelines — M01: Stream Processing Foundations Position: Lesson 3 of 3 Source: Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 ("Push Versus Pull Versus Poll Patterns" and "Consumer Pull and Push"); Streaming Data — Andrew Psaltis, Chapter 2 (Common Interaction Patterns); Kafka: The Definitive Guide — Shapira et al., Chapter 4 (Kafka Consumers — The Poll Loop)
Context
Three patterns govern how data crosses boundaries in a streaming pipeline. Push: the producer initiates contact and forwards data to the consumer. Pull: the consumer initiates contact and requests data from the producer. Poll: the consumer initiates contact on a schedule, regardless of whether new data exists. The choice among them is one of the most consequential and underrated decisions in streaming architecture. The wrong choice produces systems that look fine in development and fall over in production — push pipelines that overwhelm slow consumers, pull pipelines that introduce avoidable latency, poll pipelines that burn capacity asking sources that have nothing to say.
Reis and Housley make a strong case in Fundamentals of Data Engineering Chapter 7: every interaction between a source and the pipeline (and between every pair of stages within the pipeline) is one of these three patterns, and the choice should be deliberate. Most production confusion comes from systems that have grown organically into a mix of all three without anyone designing the mix on purpose. The Kafka consumer's poll loop, documented in detail in Shapira et al. Chapter 4, is the canonical example of a deliberately chosen pattern — Kafka could have been push-based, the designers chose poll-based, and the choice shapes the system's operational properties end-to-end.
For the SDA Fusion Service, the three sensor sources naturally land on different patterns. Radar arrays push UDP frames whether anyone is listening; optical archive servers expose a REST endpoint that must be pulled; ISL beacons emit on a fixed cadence that requires polling because there is no notification mechanism on the wire protocol. The pipeline cannot impose a single pattern on all three — it must adapt, and the adaptation logic lives at the boundary. Understanding what each pattern costs and what it guarantees is how you build that boundary correctly.
Core Concepts
Push Semantics
In push, the producer initiates the transfer. When new data exists, the producer sends it to the consumer immediately, without waiting for a request. The consumer is reactive: it receives data when the producer decides.
Push has two compelling properties. First, latency is minimal: data travels from producer to consumer in a single one-way network trip, with no request to wait for. Second, the producer needs no model of consumer demand: it produces at its natural rate and the consumer either keeps up or doesn't. For high-rate, latency-sensitive sources (radar at 200 Hz; market data; sensor telemetry), push is the natural fit.
The cost of push is that flow control is the consumer's responsibility, and getting it wrong is catastrophic. A push consumer that cannot keep up has only three options:
- Buffer. Store incoming data until processing catches up. If the producer's rate exceeds the consumer's processing rate persistently, the buffer grows without bound and the consumer OOMs.
- Drop. Discard data the consumer cannot process. This is the UDP model — packets in excess of the receive buffer are dropped at the kernel level. Acceptable for high-rate sensor data where individual events are low-value; unacceptable for events that must not be lost.
- Push back. Tell the producer to slow down. This requires a feedback channel that push semantics do not provide by default — implementing it is essentially layering pull semantics on top of push.
Push is the right choice when (a) latency matters more than reliability, (b) the producer's rate is bounded by something the consumer can rely on (sensor specifications, network bandwidth), and (c) the cost of a dropped event is low enough to absorb. Outside of those conditions, push is hazardous — it produces systems that work until they don't and fail badly when they do.
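The buffer and drop options above combine naturally into a bounded buffer that sheds load when full. The following is a minimal sketch using only the standard library; `PushStats` and `offer` are illustrative names, not part of the SDA codebase.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Counters for a push consumer that sheds load instead of growing its buffer.
#[derive(Debug, Default, PartialEq)]
pub struct PushStats {
    pub accepted: u64,
    pub dropped: u64,
}

/// Offer one pushed event to a bounded buffer without ever blocking.
/// A full buffer (or a consumer that has gone away) counts as a drop.
pub fn offer<T>(tx: &SyncSender<T>, event: T, stats: &mut PushStats) {
    match tx.try_send(event) {
        Ok(()) => stats.accepted += 1,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => stats.dropped += 1,
    }
}
```

With a channel created by `std::sync::mpsc::sync_channel(2)` and nobody reading, five calls to `offer` accept two events and drop three. Counting the drops is the important part: a drop policy that is not measured is indistinguishable from silent data loss.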
Pull Semantics
In pull, the consumer initiates the transfer. When the consumer is ready for more data, it sends a request; the producer responds with whatever is available. The consumer is in control: it consumes at its own rate.
Pull's cardinal property is demand-driven flow. The consumer asks only when it can process; the producer never sends faster than the consumer can absorb. This eliminates the consumer-side OOM mode of push and makes the producer-consumer relationship symmetric: both sides have a model of the other's pace.
Pull's cost is latency: every event waits at the producer until the next pull request arrives. For low-rate, latency-tolerant sources, this is negligible. For high-rate sources, it adds round-trip latency to every event and can become a bottleneck if the pull rate cannot keep up with production. Kafka mitigates this by allowing a single pull (poll) to return many events at once, capped by max.poll.records (500 by default, and configurable), so the round-trip cost is amortized across a batch. Network Programming with Rust covers similar batching strategies for non-Kafka pull-based protocols.
Pull is the right choice when (a) the consumer needs control over pace (because it is itself rate-limited downstream, or it processes batches), (b) the producer can hold data until requested (which means it has buffered or persistent storage), and (c) the additional round-trip latency is acceptable. Most production message queue clients are pull-based for exactly these reasons.
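Conditions (a) and (b) can be made concrete with a toy in-memory producer. This is a sketch only; `HoldingProducer` is a hypothetical type, and `max_records` merely plays the role that a batch-size limit plays in a real client.

```rust
use std::collections::VecDeque;

/// A producer that holds events until the consumer asks for them.
pub struct HoldingProducer<T> {
    pending: VecDeque<T>,
}

impl<T> HoldingProducer<T> {
    pub fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Producer side: enqueue an event. Nothing is sent anywhere yet.
    pub fn produce(&mut self, event: T) {
        self.pending.push_back(event);
    }

    /// Consumer side: one pull returns at most `max_records` events, oldest first.
    pub fn pull(&mut self, max_records: usize) -> Vec<T> {
        let n = max_records.min(self.pending.len());
        self.pending.drain(..n).collect()
    }
}
```

The consumer sets its own pace through how often it calls `pull` and how large a batch it asks for; the producer's only obligation is to be able to hold what has not been requested yet.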
Poll Semantics
Polling is pull on a fixed schedule. The consumer issues pull requests at regular intervals — every 30 seconds, every minute, every hour — regardless of whether the producer has new data. It is the simplest possible pull strategy and the easiest to reason about.
Polling has one major virtue: statelessness. The consumer needs no notification mechanism, no long-lived connection, no producer-side state about pending data. Every poll cycle is independent. This is why polling dominates in environments where stateful protocols are expensive: HTTP-based archive APIs, cron-driven batch ingest, IoT devices on intermittent connections, and the Meridian ISL beacon protocol — which has no notification support and exposes only a "give me the latest state" endpoint.
Polling has two costs. First, inherent latency: an event produced just after a poll cycle must wait until the next cycle. With a 30-second poll interval, average per-event latency is 15 seconds and worst-case is 30 seconds. Second, wasted requests: most polls return nothing new, especially when the producer is quiet. The poll-rate-vs-latency tradeoff is direct — halve the interval, double the request rate, halve the average latency. There is no free lunch.
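The arithmetic behind that tradeoff fits in a few lines. The sketch below assumes events arrive at times uniformly distributed within a poll cycle, which is what makes the average wait half the interval; `PollCost` and `poll_cost` are illustrative names.

```rust
use std::time::Duration;

/// Latency and request cost of fixed-interval polling.
#[derive(Debug, Clone)]
pub struct PollCost {
    pub average_wait: Duration,
    pub worst_case_wait: Duration,
    pub requests_per_hour: f64,
}

/// Assumes event arrival times are uniform within a poll cycle.
pub fn poll_cost(interval: Duration) -> PollCost {
    PollCost {
        average_wait: interval / 2,
        worst_case_wait: interval,
        requests_per_hour: 3600.0 / interval.as_secs_f64(),
    }
}
```

For a 30-second interval this gives a 15-second average wait at 120 requests per hour; halving the interval to 15 seconds gives 7.5 seconds at 240 requests per hour.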
The Kafka poll loop is a hybrid: it is poll-based from the consumer's perspective (the consumer calls poll(timeout) in a loop), but the broker uses long polling to avoid the wasted-request cost. The broker holds the request open until either new data arrives or the timeout expires; if new data arrives during that window, it is returned immediately. Long polling collapses poll's wasted-request cost while preserving the consumer-driven flow control. Kafka: The Definitive Guide Ch. 4 documents the relevant configurations: fetch.min.bytes (don't return until at least this much data is available, up to the timeout) and fetch.max.wait.ms (don't hold the request longer than this).
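The interplay of the two settings reduces to a single predicate. This is a deliberately simplified model of the hold-or-respond decision, not Kafka's actual broker logic (which also deals with partitions, quotas, and size limits); the function name and parameters are assumptions made for illustration.

```rust
use std::time::Duration;

/// Simplified model of when a long-polling server answers a fetch:
/// as soon as enough data is available, or once the wait budget is spent.
pub fn should_respond(
    available_bytes: usize,
    waited: Duration,
    fetch_min_bytes: usize,
    fetch_max_wait: Duration,
) -> bool {
    available_bytes >= fetch_min_bytes || waited >= fetch_max_wait
}
```

Raising the minimum byte threshold trades latency for fewer, fuller responses; the maximum wait bounds how much latency that trade can cost.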
Pattern Mismatch and Adaptation
The most common architecture problem in pipelines is pattern mismatch: the source uses one pattern and the pipeline expects another. The radar UDP source is push; the optical archive is pull; the ISL beacon must be polled. The pipeline's first stage cannot impose a single pattern on all three sources — it must adapt at the boundary, and the adaptation logic lives in the source implementation.
Three adaptation patterns:
- Push to pull. The source maintains an internal buffer; an external next() call pulls from the buffer. The radar source uses this pattern: recv_from() is push-based at the kernel level, but the source's next() method exposes a pull interface to the rest of the pipeline. The buffer (the kernel UDP receive buffer) is bounded; overflow drops at the kernel.
- Pull to poll. The source wraps a pull-based remote API in a polling loop. The optical archive source uses this pattern: it sleeps for the poll interval, makes a pull request to the REST endpoint, and yields the results. The poll interval is the dominant latency cost.
- Poll to pull. The source polls on its own schedule and exposes the most recently fetched data on next(). The ISL beacon source uses this when the beacon's protocol doesn't support pulls: a background task polls, and the foreground next() reads from a shared cell. The latency includes the polling interval plus the time to detect a change.
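The shared cell in the poll-to-pull adaptation can be sketched with a mutex and a version counter. This is an assumption about one reasonable shape for it, not the ISL source's actual implementation; `LatestCell`, `store`, and `read_if_newer` are hypothetical names.

```rust
use std::sync::{Arc, Mutex};

/// Shared cell holding the most recently polled value plus a version counter.
#[derive(Clone)]
pub struct LatestCell<T> {
    inner: Arc<Mutex<(u64, Option<T>)>>,
}

impl<T: Clone> LatestCell<T> {
    pub fn new() -> Self {
        Self { inner: Arc::new(Mutex::new((0, None))) }
    }

    /// Background poller: overwrite the cell after each poll cycle.
    pub fn store(&self, value: T) {
        // A poisoned lock is treated as fatal in this sketch.
        let mut guard = self.inner.lock().unwrap();
        guard.0 += 1;
        guard.1 = Some(value);
    }

    /// Foreground reader: return the value only if it changed since `last_seen`.
    pub fn read_if_newer(&self, last_seen: &mut u64) -> Option<T> {
        let guard = self.inner.lock().unwrap();
        if guard.0 > *last_seen {
            *last_seen = guard.0;
            guard.1.clone()
        } else {
            None
        }
    }
}
```

Note the semantics this implies: if the poller stores twice between reads, the reader sees only the later value. That matches a latest-state endpoint, but it would be the wrong structure for a source where every intermediate event matters.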
The pipeline interior, after the source layer, runs entirely on push semantics — tokio::sync::mpsc is push from the sender's perspective, with backpressure providing the rate-control loop that pure push lacks. This is a deliberate architecture choice: by adapting at the boundary, the rest of the pipeline gets uniform semantics and you don't have to reason about three different patterns at every operator.
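The effect of a bounded channel on a fast sender can be observed directly. The sketch below uses `std::sync::mpsc::sync_channel`, whose blocking `send` stands in for awaiting a send on a bounded tokio channel; `backpressure_demo` is an illustrative helper, and the 50 ms pause is an arbitrary stand-in for a slow consumer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs a fast producer against a consumer that starts late.
/// Returns (sends completed before the consumer read anything, items received).
pub fn backpressure_demo(capacity: usize, total: u32) -> (usize, Vec<u32>) {
    let (tx, rx) = sync_channel::<u32>(capacity);
    let sent = Arc::new(AtomicUsize::new(0));
    let sent_by_producer = Arc::clone(&sent);

    let producer = thread::spawn(move || {
        for i in 0..total {
            // Blocks while the channel is full: this is the backpressure.
            if tx.send(i).is_err() {
                return;
            }
            sent_by_producer.fetch_add(1, Ordering::SeqCst);
        }
        // `tx` is dropped when the closure ends, which closes the channel.
    });

    // The consumer is slow: it does not read at all for a while.
    thread::sleep(Duration::from_millis(50));
    let sent_before_reading = sent.load(Ordering::SeqCst);

    // Draining unblocks the producer; iteration ends once the sender is dropped.
    let received: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    (sent_before_reading, received)
}
```

Calling `backpressure_demo(2, 5)` never reports more than two completed sends before the consumer starts reading, yet all five items arrive in order afterwards. The producer was paused, not dropped.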
Choosing a Pattern
The decision matrix is small:
| Source property | Push | Pull | Poll |
|---|---|---|---|
| Latency-sensitive (<100 ms) | ✓ | only with batching | ✗ |
| Producer rate exceeds consumer rate | risky | ✓ | ✓ |
| Producer cannot buffer | ✓ | ✗ | ✗ |
| Consumer wants to control pace | ✗ | ✓ | ✓ |
| No notification mechanism on wire | adapt to push at source | adapt to pull at source | ✓ |
| Network is intermittent | ✗ | risky | ✓ |
| Source is bursty | only with consumer-side buffer | ✓ | poor: wastes polls between bursts |
The Reis and Housley framing in FDE Ch. 7 is that push is appropriate for systems where latency dominates, pull for systems where consumer control dominates, and poll for systems where simplicity dominates. None is universally correct; the right answer is whichever matches the source's wire-protocol capabilities and the pipeline's downstream needs. The wrong answer is whichever the developer was most familiar with from a prior project.
Code Examples
Adapting an HTTP Pull Source: The Optical Archive
The optical telescope archive exposes a REST endpoint at https://optical-archive.meridian.internal/observations?since={timestamp}. The endpoint returns a JSON array of observations. The source's job is to wrap this into a pull-based ObservationSource.
use anyhow::{Context, Result};
use async_trait::async_trait;
use reqwest::Client;
use serde::Deserialize;
use std::collections::VecDeque;
use std::time::{Duration, SystemTime};
use tokio::time::sleep;
#[derive(Deserialize)]
struct OpticalRecord {
obs_id: String,
timestamp_ns: u64,
ra_rad: f64,
dec_rad: f64,
sigma_arcsec: f64,
site_id: String,
}
pub struct OpticalArchiveSource {
client: Client,
endpoint: String,
poll_interval: Duration,
/// High-water mark: timestamp of the most recent observation we have
/// already consumed. Sent as the `since` parameter to avoid reprocessing.
/// Persisting this across restarts is a Module 5 concern; for now we
/// start fresh and may briefly reprocess on restart.
high_water_mark_ns: u64,
/// Local buffer of fetched-but-not-yet-yielded observations. Smooths
/// the burstiness of a single fetch returning many records.
buffer: VecDeque<Observation>,
name: String,
}
impl OpticalArchiveSource {
pub fn new(endpoint: impl Into<String>, poll_interval: Duration) -> Self {
Self {
client: Client::new(),
endpoint: endpoint.into(),
poll_interval,
high_water_mark_ns: 0,
buffer: VecDeque::new(),
name: "optical-archive".into(),
}
}
/// Fetch the next batch of observations from the archive. Returns the
/// number of new observations buffered.
async fn fetch_batch(&mut self) -> Result<usize> {
let url = format!("{}?since={}", self.endpoint, self.high_water_mark_ns);
let records: Vec<OpticalRecord> = self.client
.get(&url)
.timeout(Duration::from_secs(10))
.send()
.await
.context("optical archive HTTP GET")?
.error_for_status()?
.json()
.await
.context("optical archive JSON parse")?;
let count = records.len();
for r in records {
// Advance the high-water mark as we consume the batch.
// The archive returns records in timestamp order, so the last
// record sets the new mark.
self.high_water_mark_ns = self.high_water_mark_ns.max(r.timestamp_ns);
self.buffer.push_back(Observation {
observation_id: uuid::Uuid::new_v4(),
source_id: SourceId(format!("optical-{}", r.site_id)),
source_kind: SourceKind::Optical,
sensor_timestamp: SystemTime::UNIX_EPOCH
+ Duration::from_nanos(r.timestamp_ns),
ingest_timestamp: SystemTime::now(),
target: ObservationTarget::Angular {
ra_rad: r.ra_rad,
dec_rad: r.dec_rad,
},
uncertainty: Uncertainty {
// Convert arcseconds to radians for downstream uniformity.
sigma: r.sigma_arcsec * std::f64::consts::PI / (180.0 * 3600.0),
},
});
}
Ok(count)
}
}
#[async_trait]
impl ObservationSource for OpticalArchiveSource {
async fn next(&mut self) -> Result<Option<Observation>> {
loop {
// Buffered observations from a prior fetch take priority —
// we drain them before issuing a new pull.
if let Some(obs) = self.buffer.pop_front() {
return Ok(Some(obs));
}
// Buffer empty: fetch a new batch. If the archive returns
// nothing, sleep for the poll interval and try again.
// This is the poll part of the pattern.
match self.fetch_batch().await {
Ok(0) => {
sleep(self.poll_interval).await;
// Loop back to fetch again.
}
Ok(_) => {
// Buffer now has records; loop back to pop_front.
}
Err(e) => {
// Transient error: log and retry after the poll interval.
// Module 5 covers proper retry/backoff strategies; this
// is the minimum viable behavior.
tracing::warn!("optical archive fetch failed: {e:#}");
sleep(self.poll_interval).await;
}
}
}
}
fn name(&self) -> &str { &self.name }
}
This source is the pull-to-poll adaptation in code. The wire protocol is HTTP (a pull primitive), and the consumer's next() is also pull (the pipeline asks when it wants more). The polling layer is internal: when the buffer empties, the source issues another HTTP request immediately, and sleeps for the poll interval only when that request fails or returns nothing new. The since query parameter is the watermark mechanism. Without it, every poll would return all observations and the source would either drown in duplicates or have to deduplicate downstream. The watermark approach is the standard way to convert a non-incremental REST API into an incremental stream.
Note the operational consequence of the poll interval: at 5 seconds, the average optical observation waits 2.5 seconds at the archive before reaching the pipeline. For SDA's conjunction-detection SLA (sub-30-second end-to-end), this is comfortable. If the SLA tightened to 5 seconds, we would either need to push the optical archive team to add a notification mechanism (push) or shorten the poll interval significantly (which costs them server capacity). This is the conversation the pattern choice forces.
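Whether the since parameter is inclusive or exclusive is not stated here, and it matters: with an inclusive since, the newest record would come back on every poll. A defensive client-side filter removes the ambiguity. The sketch below is one way to write it; `Rec` and `apply_watermark` are hypothetical, and the strict comparison assumes no two distinct observations share a timestamp (if they can, the filter needs an ID as a tie-breaker).

```rust
/// Minimal stand-in for an archive record; only the timestamp matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Rec {
    pub id: u32,
    pub timestamp_ns: u64,
}

/// Keep only records strictly newer than the watermark, then advance it.
/// `None` means nothing has been consumed yet.
pub fn apply_watermark(watermark: Option<u64>, batch: Vec<Rec>) -> (Option<u64>, Vec<Rec>) {
    let fresh: Vec<Rec> = batch
        .into_iter()
        .filter(|r| watermark.map_or(true, |w| r.timestamp_ns > w))
        .collect();
    let newest = fresh.iter().map(|r| r.timestamp_ns).max();
    // Option<u64> orders None below Some(_), so max() never moves the mark backwards.
    (watermark.max(newest), fresh)
}
```

Feeding it a batch that repeats the last record of the previous batch yields only the genuinely new records, and an empty batch leaves the watermark where it was.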
Long-Polling: The Kafka Pattern Applied
For sources where latency matters but the producer can hold open requests, long polling combines the simplicity of polling with the latency of push. The pipeline's poll request stays open at the producer until either new data arrives or the timeout expires.
/// A long-polling source. The remote endpoint supports a `wait_ms` parameter:
/// the request blocks at the server until either a new observation is
/// available or `wait_ms` elapses, whichever comes first. This pattern is
/// borrowed directly from how Kafka's poll() works at the broker.
pub struct LongPollSource {
client: Client,
endpoint: String,
/// How long the server holds the request before returning empty.
/// Trading off: longer = lower request rate but slower shutdown response.
wait_ms: u32,
high_water_mark_ns: u64,
name: String,
}
#[async_trait]
impl ObservationSource for LongPollSource {
async fn next(&mut self) -> Result<Option<Observation>> {
loop {
let url = format!(
"{}?since={}&wait_ms={}",
self.endpoint, self.high_water_mark_ns, self.wait_ms,
);
// Note the request timeout exceeds wait_ms — the server will
// reliably return within wait_ms; we add a small grace period
// to tolerate network jitter without aborting valid requests.
let resp = self.client
.get(&url)
.timeout(Duration::from_millis(self.wait_ms as u64 + 2_000))
.send()
.await?
.error_for_status()?
.json::<Vec<OpticalRecord>>()
.await?;
if resp.is_empty() {
// Server returned no new data within wait_ms.
// Loop immediately to issue another long poll — no sleep.
continue;
}
// Server returned records: convert and yield the first one.
// (Production code would buffer the rest like OpticalArchiveSource.)
let r = &resp[0];
self.high_water_mark_ns = self.high_water_mark_ns.max(r.timestamp_ns);
return Ok(Some(/* convert r to Observation, see prior example */
Observation {
observation_id: uuid::Uuid::new_v4(),
source_id: SourceId(format!("optical-{}", r.site_id)),
source_kind: SourceKind::Optical,
sensor_timestamp: SystemTime::UNIX_EPOCH
+ Duration::from_nanos(r.timestamp_ns),
ingest_timestamp: SystemTime::now(),
target: ObservationTarget::Angular {
ra_rad: r.ra_rad,
dec_rad: r.dec_rad,
},
uncertainty: Uncertainty {
sigma: r.sigma_arcsec * std::f64::consts::PI / (180.0 * 3600.0),
},
}
));
}
}
fn name(&self) -> &str { &self.name }
}
The latency profile of long polling is the producer's data-arrival latency plus one network round trip: essentially the same as push, with no producer-side notification mechanism required. The cost is slightly more server capacity (each consumer holds an open connection) and a deliberate choice of wait_ms that balances request volume against shutdown responsiveness. Kafka's fetch.max.wait.ms defaults to 500 ms, which is the right order of magnitude for most systems. Note that long polling shifts complexity to the server: the server must support holding requests open, which not all REST APIs do. When it does, long polling is almost always preferable to fixed-interval polling.
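What "holding the request open" means on the server side can be sketched with a condition variable. This is an in-process analogue under stated assumptions, not how the optical archive or a Kafka broker is implemented; `LongPollQueue`, `publish`, and `long_poll` are illustrative names.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// In-process analogue of a long-polling endpoint.
pub struct LongPollQueue<T> {
    items: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> LongPollQueue<T> {
    pub fn new() -> Self {
        Self { items: Mutex::new(VecDeque::new()), ready: Condvar::new() }
    }

    /// Producer side: store the item and wake any waiting poller.
    pub fn publish(&self, item: T) {
        // A poisoned lock is treated as fatal in this sketch.
        self.items.lock().unwrap().push_back(item);
        self.ready.notify_all();
    }

    /// Consumer side: block until data exists or `wait` elapses, then
    /// return whatever is available (possibly nothing).
    pub fn long_poll(&self, wait: Duration) -> Vec<T> {
        let guard = self.items.lock().unwrap();
        let (mut guard, _timeout) = self
            .ready
            .wait_timeout_while(guard, wait, |queue| queue.is_empty())
            .unwrap();
        guard.drain(..).collect()
    }
}
```

A poll against an empty queue returns an empty batch after roughly `wait`; a poll that is already waiting when `publish` is called returns as soon as it is woken, which is where the near-push latency comes from.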
A Push-Source-with-Backpressure-Adaptation
Sometimes a push source needs to be slowed down. The radar UDP source can't be slowed down (the radar emits whether anyone is listening), but a TCP-based push source can be, simply by not reading from the socket. TCP flow control carries backpressure all the way to the producer once the consumer's receive buffer fills.
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;
/// A push-based source over TCP. The producer streams length-prefixed binary
/// frames. The transport-level backpressure (TCP windowing) automatically
/// slows the producer when we stop reading — but only if our application-level
/// reads are themselves controlled by backpressure from the downstream sink.
pub struct TcpPushSource {
stream: TcpStream,
buf: Vec<u8>,
name: String,
}
impl TcpPushSource {
pub async fn connect(addr: &str, name: impl Into<String>) -> Result<Self> {
let stream = TcpStream::connect(addr).await?;
Ok(Self {
stream,
buf: vec![0u8; 65_536],
name: name.into(),
})
}
}
#[async_trait]
impl ObservationSource for TcpPushSource {
async fn next(&mut self) -> Result<Option<Observation>> {
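// Read a 4-byte length prefix. Caveat: tokio's read_exact is not
// cancel-safe. If this future is dropped mid-read, bytes already taken
// from the socket are lost and the stream loses frame alignment.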
let mut len_buf = [0u8; 4];
self.stream.read_exact(&mut len_buf).await?;
let frame_len = u32::from_le_bytes(len_buf) as usize;
if frame_len > self.buf.len() {
anyhow::bail!("push source: frame size {frame_len} exceeds buffer {}", self.buf.len());
}
// Read exactly frame_len bytes.
self.stream.read_exact(&mut self.buf[..frame_len]).await?;
// Parse and return as Observation. Implementation omitted for brevity;
// see the radar source for the deserialization pattern.
Ok(Some(parse_isl_frame(&self.buf[..frame_len])?))
}
fn name(&self) -> &str { &self.name }
}
The interesting thing about this source is what happens when the pipeline downstream is slow. The orchestrator calls next() less often. The kernel-level receive buffer for the TCP stream fills up. The receive window the kernel advertises shrinks toward zero. The remote producer can no longer send, so its own send buffer fills and its writes block (or report would-block on a non-blocking socket). The producer is slowed down automatically, by the operating system, with no application-level coordination required. This is the hidden virtue of TCP-based push: backpressure traverses the network for free, as long as the application never reads faster than it can process. A push source that internally buffered into an unbounded queue would defeat this. Awaiting read_exact inline inside next(), with no background reader task, preserves it. The Network Programming with Rust text covers TCP windowing and its interaction with application-level I/O in detail.
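The framing logic can be exercised without a socket. The sketch below is a synchronous analogue over `std::io::Read`, not the tokio source itself; `read_frame` is a hypothetical helper. It keeps the little-endian 4-byte prefix from the example above and adds one distinction that example does not make: an EOF before any prefix byte is reported as a clean end of stream, while an EOF inside a frame is an error.

```rust
use std::io::{self, ErrorKind, Read};

/// Read one length-prefixed frame. Returns Ok(None) on EOF at a frame boundary.
pub fn read_frame<R: Read>(reader: &mut R, max_len: usize) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    let mut filled = 0;
    while filled < len_buf.len() {
        match reader.read(&mut len_buf[filled..]) {
            // EOF before any prefix byte: the producer closed cleanly.
            Ok(0) if filled == 0 => return Ok(None),
            Ok(0) => {
                return Err(io::Error::new(ErrorKind::UnexpectedEof, "EOF inside length prefix"))
            }
            Ok(n) => filled += n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    let frame_len = u32::from_le_bytes(len_buf) as usize;
    if frame_len > max_len {
        return Err(io::Error::new(
            ErrorKind::InvalidData,
            format!("frame size {frame_len} exceeds limit {max_len}"),
        ));
    }
    let mut frame = vec![0u8; frame_len];
    reader.read_exact(&mut frame)?;
    Ok(Some(frame))
}
```

Because the function reads only when called, it has the same backpressure property as the async version: if the caller stops asking for frames, nothing is pulled off the stream. A `std::io::Cursor` over a byte vector is enough to test the happy path, the oversize-frame rejection, and the truncated-frame error.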
Source note: This lesson synthesizes pattern-choice guidance from FDE Ch. 7 ("Push Versus Pull Versus Poll Patterns"), Streaming Data Ch. 2 ("Common Interaction Patterns"), and Kafka Ch. 4 ("The Poll Loop"). The long-polling pattern as described matches Kafka's broker-side fetch.max.wait.ms mechanism; the SDA pipeline applies the same pattern to a custom REST API. The TCP-windowing claim about backpressure-for-free is well-established in network-programming texts (Stevens, TCP/IP Illustrated) but worth verifying against the production behavior of the specific TCP stack and kernel you deploy on.
Key Takeaways
- Push, pull, and poll are not implementation details — they are architectural choices that determine latency, flow control, and failure modes for every interaction in the pipeline. Choose deliberately, document the choice, and review it when requirements change.
- Push minimizes latency but offloads flow control to the consumer. When push is appropriate (high-rate, low-event-cost sources where drop-on-overload is acceptable), it is the best choice. When it isn't, push systems fail badly under load.
- Pull gives the consumer rate control. Most production message queue clients are pull-based; the round-trip cost is amortized by batching multiple events per pull request. Kafka's max.poll.records and fetch.min.bytes are the canonical knobs.
- Long polling is the practical compromise. It collapses the wasted-request cost of fixed-interval polling while preserving consumer-driven flow control. When the producer supports it, long polling is almost always preferable to fixed-interval polling.
- TCP windowing provides free backpressure for push-over-TCP sources, as long as the application never reads faster than it can process. Internal unbounded application buffering breaks the chain. Stick to bounded buffers and synchronous read-then-process loops to preserve transport-level backpressure end-to-end.
Capstone Project — SDA Sensor Ingestion Service
Module: Data Pipelines — M01: Stream Processing Foundations Estimated effort: 1–2 weeks of focused work Prerequisites: All three lessons in this module passed at ≥70%
Mission Brief
OPS DIRECTIVE — SDA-2026-0118 / Phase 1 Implementation Classification: PIPELINE STAND-UP — INGESTION TIER
Stand up the front door of the Space Domain Awareness Fusion Service. Three sensor source types (X-band radar arrays over UDP, optical telescope archive over HTTP, ISL beacon network over TCP) must be unified into a single stream of normalized observation envelopes ready for fusion in downstream stages. The legacy Python ingestion script will be retired when this service reaches feature parity.
Success criteria for Phase 1: the service ingests from all three source types simultaneously, normalizes to the canonical Observation envelope, and writes to a structured event log that downstream stages can consume. Sustained throughput of 10,000 observations per second with end-to-end ingest-to-log latency under 250 ms at the 99th percentile.
What You're Building
A standalone Rust binary, sda-ingest, that:
- Connects to three configured source types — UDP radar, HTTP optical archive, TCP ISL beacon listener
- Wraps each source in an ObservationSource implementation that normalizes wire formats into the canonical Observation envelope
- Composes the three sources into a single fan-in topology that feeds a downstream sink
- Writes the normalized stream to a structured JSONL event log on local disk (one observation per line, atomically rotated by size)
- Exposes a small HTTP control plane for health checks and basic metrics (per-source ingest rate, channel occupancy, error counters)
The service must run as a long-lived process and gracefully shut down on SIGTERM — flushing the event log, closing source connections, and exiting cleanly. No data should be lost on a clean shutdown; some data may be in flight on a hard kill, and that is acceptable for this module (Module 5 covers durability).
The deliverable is the binary, the test suite, and a 1-page operational README documenting how to run it, what configuration it expects, and what its observable behavior looks like under load.
Suggested Architecture
┌────────────────────┐
│ UDP Radar Source │──┐
│ (1-3 arrays) │ │
└────────────────────┘ │
│
┌────────────────────┐ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ HTTP Optical │──┼──→│ Normalize │──→│ Validate │──→│ JSONL Sink │
│ Archive Source │ │ │ (map) │ │ (filter) │ │ │
└────────────────────┘ │ └──────────────┘ └──────────────┘ └──────────────┘
│
┌────────────────────┐ │
│ TCP ISL Beacon │──┘
│ Listener │
└────────────────────┘
Each source runs in its own tokio task. All three feed a shared mpsc::Sender<Observation> (cloned three ways) that drains into the normalize operator. Normalize feeds validate; validate feeds the JSONL sink. The HTTP control-plane runs on a separate task; it shares an Arc<Metrics> with the data-plane tasks for read access.
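The fan-in shape and its shutdown behavior can be tried out with standard-library threads before bringing tokio in. In this sketch, `fan_in` is a hypothetical helper and plain `u32` values stand in for observations; the point is the sender-cloning pattern, which carries over to `tokio::sync::mpsc` unchanged.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Merge several sources into one bounded channel and collect the result.
pub fn fan_in(sources: Vec<Vec<u32>>, capacity: usize) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(capacity);
    let mut handles = Vec::new();
    for items in sources {
        let tx = tx.clone(); // one sender clone per source
        handles.push(thread::spawn(move || {
            for item in items {
                if tx.send(item).is_err() {
                    break; // receiver is gone; stop producing
                }
            }
        }));
    }
    // Drop the original sender so the channel closes once every clone is dropped.
    drop(tx);
    let merged: Vec<u32> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    merged
}
```

Forgetting the `drop(tx)` is the classic mistake with this pattern: the receiver then never sees the channel close, and the drain loop waits forever. Interleaving across sources is not deterministic, so tests should compare sorted output or check per-source ordering only.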
You may diverge from this architecture if you have a defensible reason. Document it in the operational README.
Acceptance Criteria
Functional Requirements
- The Observation envelope is defined as in Lesson 1 and used unchanged across all three sources
- UdpRadarSource consumes the documented wire format (length, layout per Lesson 1) and produces valid Observation records
- OpticalArchiveSource polls the HTTP endpoint with a since watermark, buffers multi-record responses, and produces valid Observation records (mock the HTTP endpoint for testing; a small mockito or wiremock setup is fine)
- IslBeaconSource accepts incoming TCP connections, deserializes the wire format, and produces valid Observation records (mock the TCP producer for testing; a tokio::net::TcpListener paired with a producer task is fine)
- All three sources implement the same ObservationSource trait
- The topology fan-in correctly merges all three streams into the normalize operator
- The JSONL sink writes one observation per line, with atomic rotation when the current file exceeds 64 MiB
- On SIGTERM, the service stops accepting new observations from sources, drains the in-flight pipeline, flushes the JSONL sink, and exits within 5 seconds
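One possible starting point for the rotation requirement is sketched below, using blocking standard-library I/O. Everything here is an assumption made for illustration: the type name, the `current.jsonl` and `segment-NNNNNN.jsonl` file names, and the reliance on `rename` being atomic, which holds on POSIX filesystems when source and destination are in the same directory.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::PathBuf;

/// Size-rotated JSONL writer: active file is `current.jsonl`,
/// sealed files are `segment-NNNNNN.jsonl`.
pub struct RotatingJsonlSink {
    dir: PathBuf,
    max_bytes: u64,
    written: u64,
    seq: u64,
    out: BufWriter<File>,
}

impl RotatingJsonlSink {
    pub fn create(dir: impl Into<PathBuf>, max_bytes: u64) -> io::Result<Self> {
        let dir = dir.into();
        fs::create_dir_all(&dir)?;
        let out = BufWriter::new(File::create(dir.join("current.jsonl"))?);
        Ok(Self { dir, max_bytes, written: 0, seq: 0, out })
    }

    /// Append one already-serialized JSON object as a single line.
    pub fn write_line(&mut self, json: &str) -> io::Result<()> {
        let len = json.len() as u64 + 1; // +1 for the newline
        // Rotate before the write that would push the file past the limit.
        if self.written > 0 && self.written + len > self.max_bytes {
            self.rotate()?;
        }
        self.out.write_all(json.as_bytes())?;
        self.out.write_all(b"\n")?;
        self.written += len;
        Ok(())
    }

    /// Seal the active file under its final name and start a fresh one.
    fn rotate(&mut self) -> io::Result<()> {
        self.out.flush()?;
        let sealed = self.dir.join(format!("segment-{:06}.jsonl", self.seq));
        fs::rename(self.dir.join("current.jsonl"), sealed)?;
        self.seq += 1;
        self.out = BufWriter::new(File::create(self.dir.join("current.jsonl"))?);
        self.written = 0;
        Ok(())
    }

    pub fn flush(&mut self) -> io::Result<()> {
        self.out.flush()
    }
}
```

Because rotation happens on line boundaries, no sealed segment ever ends in a partial record. In the async service this blocking writer would need to run on a dedicated thread or inside `spawn_blocking`, and the shutdown path must call `flush` before exit.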
Quality Requirements
- Every source handles a deserialization error by logging it and continuing — one bad frame must not stop the source
- Every channel in the topology is bounded; buffer sizes are chosen and documented (a comment per channel is sufficient)
- All await points in the data plane are cancel-safe; a dropped task does not corrupt source state
- No .unwrap() or .expect() in non-startup code paths (initialization may panic on misconfiguration)
- At least one integration test exercises the full pipeline end-to-end with all three (mocked) sources running concurrently
Operational Requirements
- HTTP control plane exposes GET /health returning HTTP 200 when all source tasks are alive, 503 if any source has terminated unexpectedly
- HTTP control plane exposes GET /metrics returning a JSON object with at minimum:
  - per-source ingest rate (observations per second, EWMA over 30s)
  - per-channel occupancy (current count and capacity)
  - per-source deserialization error count (lifetime of the process)
  - JSONL sink bytes written (lifetime of the process)
- The service logs structured events (tracing + tracing-subscriber JSON formatter): one log line per significant lifecycle event, not one per observation
- A 1-page README.md in the project root documents: build, run, configure, expected metrics under steady-state load, known failure modes
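The ingest-rate metric leaves room for interpretation. The sketch below reads "EWMA over 30s" as an exponential average with a 30-second time constant, which is an assumption; `EwmaRate` is a hypothetical name. Deriving the smoothing factor from the elapsed time keeps the estimate correct even when updates arrive at irregular intervals.

```rust
/// Exponentially weighted moving average of an event rate.
pub struct EwmaRate {
    tau_secs: f64,
    rate_per_sec: f64,
}

impl EwmaRate {
    /// `tau_secs` is the time constant (30.0 under the assumption above).
    pub fn new(tau_secs: f64) -> Self {
        Self { tau_secs, rate_per_sec: 0.0 }
    }

    /// Fold in `count` events observed over the last `dt_secs` seconds.
    pub fn observe(&mut self, count: u64, dt_secs: f64) {
        if dt_secs <= 0.0 {
            return; // nothing sensible to update with
        }
        let instantaneous = count as f64 / dt_secs;
        // Weight of the new sample grows with the time it covers.
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        self.rate_per_sec += alpha * (instantaneous - self.rate_per_sec);
    }

    pub fn rate_per_sec(&self) -> f64 {
        self.rate_per_sec
    }
}
```

Starting from zero and fed a steady 100 events per second, the estimate reaches about 63% of the true rate after one time constant and is within a fraction of a percent after ten. That warm-up lag is worth documenting in the README so a freshly started instance is not mistaken for an underperforming one.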
Self-Assessed Stretch Goals
- (self-assessed) Throughput sustained at 10,000 obs/sec with P99 ingest-to-log latency under 250 ms on a developer laptop. Provide a criterion benchmark and a flamegraph profile showing where time is spent.
- (self-assessed) The optical source supports both fixed-interval polling and long-polling modes, configurable. Document the latency tradeoff in the README.
- (self-assessed) The pipeline gracefully handles a "fragmentation event" simulation: drive the radar source at 10x normal rate for 60 seconds and observe that no observations are silently dropped at the application layer (UDP-level kernel drops are acceptable; document them).
Hints
How should I model the configuration?
A small TOML file is the path of least resistance:
[radar]
bind_addrs = ["0.0.0.0:7001", "0.0.0.0:7002"]
[optical]
endpoint = "https://optical-archive.example/observations"
poll_interval_ms = 5000
[isl]
listen_addr = "0.0.0.0:7100"
[sink]
output_dir = "/var/log/sda/ingest"
rotation_bytes = 67108864 # 64 MiB
[control_plane]
bind_addr = "127.0.0.1:9100"
serde + toml makes this trivial. figment if you want layered config
(file + env vars). Don't over-engineer; you can add a config crate later.
What buffer size should the channels use?
The general rule from Module 4: buffers are sized to absorb expected short-term burstiness, not to be a primary backpressure mechanism.
For ingest-to-normalize, a buffer of 1024–4096 is reasonable for SDA's volumes. The dominant cost of an oversized buffer is increased latency under load — every observation in the buffer is one in front of yours. The dominant risk of an undersized buffer is unnecessary backpressure oscillation if downstream is bursty.
Pick a number, document why, and revisit once you have load-test data. You will revisit this in Module 4 with much more rigor.
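The latency cost of a buffer is easy to estimate before any load test. The helper below is a back-of-the-envelope sketch with a hypothetical name; it assumes the downstream stage drains at a steady rate and that the channel is completely full.

```rust
/// Time for the consumer to drain a full channel: the extra latency a newly
/// enqueued item sees when the buffer is at capacity.
pub fn full_buffer_delay_secs(capacity: usize, drain_rate_per_sec: f64) -> f64 {
    capacity as f64 / drain_rate_per_sec
}
```

At a drain rate of 10,000 observations per second, a full 4096-slot channel adds about 410 ms of queueing delay and a full 1024-slot channel about 102 ms. Against a 250 ms P99 target, that suggests the upper end of the range is only safe if the channel rarely fills, which is exactly what the load test should confirm.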
How should I test the UDP radar source?
Spin up a tokio::net::UdpSocket in the test that sends the wire format
to the source's bind address. The source thinks it's reading from a real
radar; the test constructs the bytes and emits them. This pattern works
for any push-over-UDP source.
use std::time::Duration;
use tokio::net::UdpSocket;
use tokio::time::timeout;

#[tokio::test]
async fn radar_source_decodes_valid_frame() {
    let bind = "127.0.0.1:0"; // OS picks port
    let mut source = UdpRadarSource::bind(bind, "test-radar").await.unwrap();
    let source_addr = source.local_addr().unwrap();
    // The producer side: encode a known frame and send it.
    let producer = UdpSocket::bind("127.0.0.1:0").await.unwrap();
    let frame = encode_test_radar_frame(/* ... */);
    producer.send_to(&frame, source_addr).await.unwrap();
    // Bound the wait so a lost datagram fails the test instead of hanging it.
    let obs = timeout(Duration::from_secs(1), source.next())
        .await
        .expect("timed out waiting for the radar frame")
        .unwrap()
        .unwrap();
    assert_eq!(obs.source_kind, SourceKind::Radar);
    // ... assert other fields
}
You'll need to expose local_addr() on the source, or have the test know
the bind address ahead of time (less robust because of port races).
What's a clean way to handle SIGTERM?
tokio::signal::ctrl_c() for SIGINT, tokio::signal::unix::signal(SignalKind::terminate())
for SIGTERM. Combine them with a tokio::select! against the main service
loop; on signal, drop the source senders (which closes the channels), and
let the normalize → validate → sink chain drain naturally.
use tokio::signal::unix::{signal, SignalKind};

let mut sigterm = signal(SignalKind::terminate())?;
tokio::select! {
    _ = sigterm.recv() => {
        tracing::info!("SIGTERM received; initiating graceful shutdown");
    }
    _ = tokio::signal::ctrl_c() => {
        tracing::info!("SIGINT received; initiating graceful shutdown");
    }
    res = run_service(&mut topology) => {
        // Service exited on its own; this usually means a source error propagated.
        return res;
    }
}
// Drop sources to close their channel senders; downstream drains.
topology.shutdown().await;
The topology.shutdown() method is yours to design — typically it joins
all tasks with a deadline and force-aborts any that don't finish in time.
How verbose should the metrics be?
Resist the urge to add a metric per operation. The four metric families required by the acceptance criteria are sufficient for Module 1.
You will revisit metrics with rigor in Module 6, where Reis and Housley's DataOps framing and Kafka's monitoring chapter (Shapira et al. Ch. 13) provide the proper foundation. For now, four metrics that you understand and that work correctly are far better than twenty that you cargo-culted from a Kafka dashboard.
Getting Started
Recommended order:
- Define the envelope. Get Observation and the trait in place; write a unit test that round-trips it through serde JSON.
- Implement the simplest source. The UDP radar source is the most self-contained: no HTTP, no TCP listener, no state. Start there. Get it tested end-to-end with a UDP producer in the test harness.
- Implement the JSONL sink. Get observations flowing source → channel → sink to disk before adding the other sources.
- Add the optical source. This is the most complex one because of the HTTP polling and watermark management. Mock the HTTP server in tests.
- Add the ISL TCP source. Apply what you learned from radar plus what you learned from the optical-source error handling.
- Wire the topology together. Compose the three sources into a fan-in; spawn each as a task; verify the integration test.
- Add the control plane. Health and metrics last; they are the cherry on top, not the foundation.
Aim for a working end-to-end pipeline by day 4 even if everything in it is minimal. Optimize and harden after that. Premature optimization (specifically, premature buffer-size tuning) is a common time-sink in this project.
What This Module Sets Up
In Module 2 you will replace this hand-spawned topology with a real orchestrator: a Rust task DAG executor that schedules operators, manages retries, and propagates idempotency keys across stage boundaries. The Observation envelope and the source/sink traits you define here will not change. The topology composition will become declarative.
In Module 3 you will replace the JSONL sink with an event-time correlation operator that windows observations by sensor timestamp and computes conjunction risk. The watermark logic in your optical source becomes the foundation for event-time watermark propagation across the pipeline.
You are not building a throwaway. You are building the first stage of a system that grows for five more modules.
Module 02 — Pipeline Orchestration Internals
Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 2 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, the chapters on tokio::task, JoinSet, cancellation, and structured shutdown; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (operator-graph execution and failure handling); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapters 2–3 (Orchestration as an Undercurrent; Plan for Failure)
Quiz pass threshold: 70% on all four lessons to unlock the project
Mission Context
OPS ALERT — SDA-2026-0119 Classification: ORCHESTRATION TIER STAND-UP Subject: Replace hand-spawned topology with declarative orchestrator
Phase 1 ingestion (sda-ingest) is in production with three sensor sources, a normalize stage, a validate stage, and a JSONL sink. The next-quarter roadmap adds five more sources, a windowed dedup, a cross-sensor correlator, an alert emitter, and an audit sink. The hand-spawned topology in main.rs has reached the practical limit of what one engineer can hold in their head. Two operational gaps from last quarter's incidents are also unresolved: a panicking task is silently torn down with no recovery hook (SDA-2026-0094 postmortem), and the optical poller's retry logic is unjittered fixed-delay (SDA-2026-0103 postmortem: the 90-minute outage extension after a 30-second partner blip).
Directive: Build the orchestrator. A declarative DAG of operators, supervised by a single supervisor, with retry and circuit-breaker primitives that respect downstream characteristics, and runtime-level bulkheading for the new orbital propagator. The Phase 1 binary is refactored to use the orchestrator with no behavioral regression beyond the documented failure-handling improvements.
This module is the connective tissue of the rest of the track. The orchestrator built here is what every subsequent module's project hangs on. Module 3 changes one operator (the dedup → windowed correlator); Module 4 hardens channel boundaries against burst load; Module 5 makes operator state crash-safe via checkpointing; Module 6 wraps the assembled system in observability. None of those changes alter the orchestrator's API. The shape established here is load-bearing for the next four modules.
The mental model the module installs: an operator is a Task, a topology is an OperatorGraph, a running pipeline is a BuiltGraph driven by a Supervisor. Failures are dispatched on a four-case TaskExit enum. Retries use decorrelated jitter. Resource-isolation needs are met with separate runtimes. None of these patterns are SDA-specific; the orchestrator crate is meant to be reusable across any pipeline that fits the dataflow model.
Learning Outcomes
After completing this module, you will be able to:
- Articulate why a pipeline is naturally expressed as a graph of supervised tasks rather than a single async function, and explain the operational properties of each shape
- Distinguish CPU-bound from IO-bound operators and place each on the correct part of the Tokio runtime (spawn vs spawn_blocking)
- Reason about cancel-safety as a per-await-point property and identify operator implementations that violate it
- Build a declarative OperatorGraph with build-time validation of cycles and disconnected operators
- Implement a supervisor with bounded restart budgets that distinguishes panics from errors from clean exits
- Design retry policies that classify errors correctly, use decorrelated-jitter backoff, and compose with idempotency to produce effective exactly-once processing
- Apply circuit breakers and runtime-level bulkheading where retries alone are insufficient
Lesson Summary
Lesson 1 — The Task Model
What tokio::spawn actually does, why CPU-bound work belongs on spawn_blocking, what JoinHandle::abort actually means (cooperative, observed at the next await point), and what cancel-safety means as a per-await-point property. Closes with the Task wrapper struct that gives the orchestrator a uniform handle type for every operator and the TaskExit enum that distinguishes the four operationally meaningful exit cases.
Key question: If JoinHandle::abort() is called and the handle resolves to Err(JoinError) with is_cancelled() == true, does that mean the task has actually stopped running?
Lesson 2 — DAG Scheduling
The OperatorGraph builder, Kahn's algorithm for topological sort with cycle detection, the four-pass build() (per-role validation, topo sort, channel allocation, spawn), and JoinSet for managing N operator handles with whichever-finishes-first semantics. Why the bounded-channel-per-edge invariant is what makes it tractable to reason about backpressure as it travels through the DAG, and why fan-in and fan-out are expressed as explicit router operators rather than multi-edge vertices.
Key question: The pipeline has three sources fanning into a single normalize operator. What does the topo-sorted spawn order look like, and which property of the order is what makes the channel-wiring code work?
Lesson 3 — Retries and Idempotency
Three pieces of retry discipline: classifying transient vs permanent vs discardable errors (and why the classification is the operator's responsibility, not the framework's), exponential backoff with decorrelated jitter (and why fixed-delay retries amplify outages), idempotency as a property of the operation that lets at-least-once delivery compose into effective exactly-once. The with_retry wrapper, the RetryDisposition enum, and the dedup sink with sliding-window seen-set bounded by both time and count.
Key question: A hundred operator instances all hit the same downstream failure at the same instant. With fixed-delay retries, what happens to the downstream during recovery, and why?
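As a preview of the backoff half of that lesson, here is a minimal sketch of decorrelated jitter, where each delay is drawn uniformly between the base delay and three times the previous delay, then capped. The type names are hypothetical, and the tiny xorshift generator is an assumption that keeps the sketch dependency-free; a real implementation would more likely use the rand crate.

```rust
use std::time::Duration;

/// Minimal xorshift64 generator so the sketch has no external dependencies.
/// An assumption for illustration only.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        Self(seed.max(1)) // xorshift state must be non-zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in [lo, hi]; modulo bias is ignored here.
    fn range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo + 1)
    }
}

pub struct DecorrelatedJitter {
    base: Duration,
    cap: Duration,
    prev: Duration,
}

impl DecorrelatedJitter {
    pub fn new(base: Duration, cap: Duration) -> Self {
        Self { base, cap, prev: base }
    }

    /// next = min(cap, uniform(base, prev * 3))
    pub fn next_delay(&mut self, rng: &mut XorShift64) -> Duration {
        let lo = self.base.as_millis() as u64;
        let hi = (self.prev.as_millis() as u64).saturating_mul(3).max(lo);
        let next = Duration::from_millis(rng.range(lo, hi)).min(self.cap);
        self.prev = next;
        next
    }
}
```

Because each instance draws from its own random sequence, a hundred operators that fail at the same instant spread their retries out instead of returning in lockstep, which is the property fixed-delay retries lack.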
Lesson 4 — Failure Modes
What retry cannot address: panics, cascading slowdowns, resource exhaustion in shared pools. The supervisor pattern with JoinSet::join_next and TaskExit dispatch. Bounded restart budgets and why "always restart" is dangerous. Three levels of bulkheading (channel, runtime, process). Cascading failures and the discipline of addressing the cause rather than the symptom. Circuit breakers as the fail-fast complement to retries.
Key question: The validate operator panics on a bad input. The pipeline has no supervisor. What happens to the pipeline's apparent behavior, and what changes when the supervisor pattern is wired in?
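The circuit breaker mentioned above can be sketched as a three-state machine. This is a minimal illustration under assumed names (CircuitBreaker, allow, record_success, record_failure) with the clock passed in explicitly so the behavior is easy to test; it is not the lesson's implementation.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Closed { consecutive_failures: u32 },
    Open { since: Instant },
    HalfOpen,
}

pub struct CircuitBreaker {
    state: State,
    failure_threshold: u32,
    cool_down: Duration,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, cool_down: Duration) -> Self {
        Self {
            state: State::Closed { consecutive_failures: 0 },
            failure_threshold,
            cool_down,
        }
    }

    /// Should the caller attempt the downstream call at time `now`?
    pub fn allow(&mut self, now: Instant) -> bool {
        let state = self.state;
        match state {
            State::Closed { .. } | State::HalfOpen => true,
            State::Open { since } if now.duration_since(since) >= self.cool_down => {
                // Cool-down elapsed: let probe traffic through until a
                // success or failure is recorded.
                self.state = State::HalfOpen;
                true
            }
            State::Open { .. } => false, // fail fast, do not touch downstream
        }
    }

    pub fn record_success(&mut self) {
        self.state = State::Closed { consecutive_failures: 0 };
    }

    pub fn record_failure(&mut self, now: Instant) {
        self.state = match self.state {
            State::Closed { consecutive_failures }
                if consecutive_failures + 1 < self.failure_threshold =>
            {
                State::Closed { consecutive_failures: consecutive_failures + 1 }
            }
            // Threshold reached, or the half-open probe failed: (re)open.
            _ => State::Open { since: now },
        };
    }
}
```

Where retries keep trying a struggling downstream, the open state stops calling it altogether for the cool-down period, which is what makes the breaker a complement to retries rather than a replacement.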
Capstone Project — Fusion Pipeline Orchestrator
Build the sda-orchestrator library: declarative OperatorGraph, supervised JoinSet-driven Task lifecycle, retry wrapper with decorrelated jitter, circuit breaker, and runtime-level bulkheading via a dedicated tokio::runtime::Runtime for the orbital propagator. The Phase 1 sda-ingest binary from Module 1 is refactored to use the new orchestrator with no behavioral regression beyond the documented failure-handling improvements. Acceptance criteria, suggested architecture, deterministic-timing test patterns, and the full project brief are in project-fusion-orchestrator.md.
The orchestrator is not a throwaway. The interface stays stable through Modules 3, 4, 5, and 6.
File Index
module-02-pipeline-orchestration-internals/
├── README.md ← this file
├── lesson-01-task-model.md ← The task model
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-dag-scheduling.md ← DAG scheduling
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-retries-idempotency.md ← Retries and idempotency
├── lesson-03-quiz.toml ← Quiz (5 questions)
├── lesson-04-failure-modes.md ← Failure modes
├── lesson-04-quiz.toml ← Quiz (5 questions)
└── project-fusion-orchestrator.md ← Capstone project brief
Prerequisites
- Module 1 (Stream Processing Foundations) completed: the Observation envelope, the ObservationSource trait, the ChannelSink pattern, and the bounded-channel backpressure model are assumed
- Foundation Track completed: async Rust, channels, network programming
- Familiarity with tokio::task::JoinSet, tokio::sync::oneshot, tokio_util::sync::CancellationToken, and anyhow::Result
- Working comfort reading and writing structured logs with tracing
What Comes Next
Module 3 (Event Time and Watermarks) replaces the processing-time dedup operator from Module 1 with a windowed event-time correlator that computes conjunction risk from observations of the same orbital event arriving from multiple sensors. The orchestrator interface stays the same — only one operator's implementation changes. Watermark propagation becomes a property of the graph's edges, which the orchestrator's channel structure already accommodates.
Lesson 1 — The Task Model
Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 1 of 4
Source: Async Rust — Maxwell Flitton & Caroline Morton, the Tokio runtime and tasks chapters (tokio::spawn, JoinHandle, spawn_blocking, cancellation via select!); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 ("The Data Engineering Lifecycle: Orchestration as an Undercurrent")
Context
Module 1 stood up a working pipeline by spawning tasks ad hoc — one for each radar source, one for each optical poller, one normalize stage, one channel sink. The wiring code was a single function in main.rs that grew thirty lines longer every time a new operator was added. That approach scales until it doesn't, and it has stopped scaling. The next quarter's roadmap adds five more sensor sources, a cross-sensor correlator, a windowed dedup stage, an alert emitter, and an audit sink. Hand-spawning that topology produces a function nobody wants to touch. We need an orchestrator: a layer that accepts a description of the pipeline and runs it, deciding what to spawn where, when to restart, and how to shut down cleanly.
Before we can build the orchestrator, we need a precise account of what it is orchestrating. The unit of orchestration is the task — an async function spawned onto the Tokio runtime, paired with a JoinHandle for lifecycle reference. Most engineers writing async Rust have spawned tasks; far fewer have a working mental model of what the runtime guarantees, where the cooperative-scheduling contract breaks, and what cancel-safety actually means in practice. This lesson establishes that vocabulary. The next three lessons (DAG scheduling, retries and idempotency, failure modes) each build on it.
The Reis & Housley framing is useful here: orchestration is one of the "undercurrents" of the data engineering lifecycle — present everywhere, easy to ignore until it breaks. Fundamentals of Data Engineering makes the point that orchestration is not a service you call out to; it is the connective tissue of the whole system. That framing matters because it shapes what the orchestrator is responsible for. Not "scheduling jobs" in the Airflow sense — the SDA Fusion Service is one long-running pipeline, not a graph of nightly batch jobs. Orchestration here means: own the lifecycle of every operator in the pipeline, supervise it, and present a single coherent abstraction to the rest of the program.
Core Concepts
Tasks as the Unit of Work
A task is an async function that has been handed to the Tokio runtime to drive. The runtime owns scheduling — picking which task to poll next, which worker thread to poll it on, and when to suspend a task that is awaiting an unready future. The application owns lifecycle reference through the JoinHandle returned by tokio::spawn. This split is the thing that makes async Rust productive at scale: the runtime is in charge of when work happens; the application is in charge of what work happens.
A task is much cheaper than an OS thread. A thread carries its own stack (a Linux pthread typically reserves 8 MiB of virtual address space by default, committed lazily; Rust's std::thread defaults to 2 MiB), a kernel-scheduled execution context, and the per-thread bookkeeping the kernel maintains. A task is a single heap allocation containing the future plus a small header, often under a kilobyte for a simple operator, though the exact size depends on what the future holds across its await points. Spawning a million tasks in a single process is routine; spawning a million OS threads is not. This cheapness is why streaming pipelines are naturally expressed as tasks-per-operator: we can afford one task per source, one per stage, one per partition replica, without thinking hard about the resource cost.
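The size claim is easy to check for any particular future. The sketch below measures the state machine the compiler generates for a small async function; tiny_operator is a hypothetical stand-in for an operator, and the number printed will vary by compiler version, so no specific size is claimed here.

```rust
use std::future::Future;
use std::mem::size_of_val;

/// Hypothetical stand-in for a small operator: it keeps a counter and a
/// 64-byte scratch buffer alive across one await point.
async fn tiny_operator(step: impl Future<Output = ()>) -> u64 {
    let scratch = [0u8; 64];
    let mut seen = 0u64;
    step.await; // anything used after this await is stored in the future
    seen += scratch.len() as u64;
    seen
}

/// Size in bytes of the future for `tiny_operator`. Tokio adds its own
/// task header on top of this when the future is spawned.
pub fn tiny_operator_size() -> usize {
    let fut = tiny_operator(std::future::ready(()));
    size_of_val(&fut)
}
```

Calling tiny_operator_size() on a real operator's future is a quick way to spot one that accidentally holds a large buffer across an await.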
A pipeline operator is one task per stage instance. The radar source task reads from its UDP socket and forwards observations to the next channel. The normalize task reads from that channel and forwards. Every stage is independently schedulable; the runtime interleaves them across worker threads as their inputs become available. Module 1's spawn_ingestion_topology already worked this way; the orchestrator we will build in Lesson 2 makes the pattern explicit and composable.
CPU-Bound vs IO-Bound Tasks
Tokio is a cooperative scheduler. A task progresses by being polled; it yields control by hitting an .await that returns Pending. Between awaits, the task holds the worker thread it is running on. If a task spends 200 ms on a CPU-bound computation between awaits, that worker thread polls nothing else for 200 ms. Work stealing lets idle workers take tasks from its queue, but when the other workers are busy or stuck the same way, those tasks simply wait. This is the cooperative-scheduling contract, and violating it does not produce a clean error. It produces tail-latency spikes that look like "the runtime is mysteriously slow."
The rule: async tasks must yield frequently. A commonly quoted rule of thumb is 10 to 100 microseconds between awaits; this lesson takes the conservative end, 10 microseconds, as the budget per uninterrupted run. That is small enough that the runtime stays responsive at hundreds of thousands of task polls per second, and large enough that you are not yielding inside tight inner loops. When the work between awaits exceeds 10 µs by more than a small factor, let alone 10 ms or 200 ms, it belongs on a thread, not a task. The mechanism is tokio::task::spawn_blocking, which dispatches the closure to a separate thread pool meant for blocking work (capped at 512 threads by default, configurable through the runtime builder's max_blocking_threads). The call returns a JoinHandle<T> you can .await from your task, but the work itself does not block the async worker threads.
For SDA, the placement decisions are concrete. UDP frame deserialization in the radar source is sub-microsecond — stays in async. JSON deserialization of an optical archive response is single-digit microseconds — stays in async. The orbital propagation library's propagate(state_vector, dt) -> state_vector call runs in 1–5 ms per object — spawn_blocking. The cross-sensor correlator's covariance update is in the hundreds of microseconds — borderline; benchmark before deciding. When in doubt, measure. When unable to measure, prefer spawn_blocking for any computation that touches floating-point matrix algebra or calls into a non-async C library.
Lifecycle and JoinHandle
tokio::spawn(future) returns a JoinHandle<T> where T is the future's output. The handle is your sole reference to the running task. It carries three operations worth understanding precisely:
- .await on the handle waits for the task to finish and yields its Result<T, JoinError>. The error case captures task panics and aborts; the success case is the future's actual return value.
- .abort() signals the task to stop. This is cooperative: the runtime marks the task cancelled and drops its future the next time the task yields at an await point. A task in a tight CPU loop with no awaits ignores .abort() indefinitely. This is the same problem that puts CPU-bound work on spawn_blocking, and it has the same solution, with one caveat: a spawn_blocking closure that has already started cannot be aborted at all, so it must be short or check a shutdown flag itself.
- Drop of the handle does not cancel the task. By default, dropping a JoinHandle simply detaches the task; it continues running, and you have lost your reference to it. Detached tasks are a frequent source of operational problems: they outlive the code that spawned them, they accumulate without being supervised, and they hold resources that nobody else can release. The orchestrator must never detach an operator task.
For long-running pipeline operators (the kind that drain a channel and forward to the next) the handle never resolves under normal operation. The task runs as long as the upstream channel produces. "Completion" for an operator means the upstream closed (the source ran out, or the orchestrator triggered shutdown). For an operator returning Result<()>, the handle resolving with Ok(Ok(())) is the shutdown-success signal; Ok(Err(_)) is an error the operator returned itself; Err(JoinError) means the task panicked or was aborted; never resolving (until aborted) is the steady state.
Cancel-Safety
A future is cancel-safe if dropping it at any await point leaves no observable side effects partially applied. Equivalently: the future has either completed an operation or not started it; there is no third state where a row was half-inserted, a transaction was started but not committed, or a kernel buffer was half-read. Cancel-safety is a property of an individual await point, not of a function as a whole. A function with one cancel-safe await and one non-cancel-safe await is non-cancel-safe.
The reason this matters for the orchestrator is that shutdown is implemented by aborting tasks, and aborting a task is effectively dropping its current future. If an operator's inner loop is non-cancel-safe at the await point that gets aborted, the operator leaves the system in an inconsistent state on shutdown. This is the underlying mechanism for "the system runs fine until we deploy a new version and then conjunction alerts go missing for ninety seconds" — the alerts that were in flight got aborted at a non-cancel-safe await.
The Tokio primitives are largely cancel-safe by design: mpsc::Receiver::recv, Notify::notified, time::sleep, select! with cancel-safe arms, UdpSocket::recv_from. Many third-party crates are not. Database drivers that hold a transaction across an await are usually non-cancel-safe (the transaction stays open with no committer if the future is dropped). HTTP clients that stream a response body are non-cancel-safe at the body-read await. The discipline: when you write or use a non-cancel-safe future, wrap it in select! against the cancellation signal and explicitly handle the cancel branch — close the transaction, drop the response, release the lock — before the task exits.
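The "third state" can be reproduced without a runtime by polling a future once and then dropping it, which is what an abort or a losing select! arm does. Everything in this sketch is hypothetical scaffolding: YieldOnce stands in for any await point that returns Pending, and transfer stands in for a two-step operation.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Returns Pending on the first poll and Ready on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Two-step operation: debit, await, credit.
async fn transfer(debited: Rc<Cell<bool>>, credited: Rc<Cell<bool>>) {
    debited.set(true);
    YieldOnce(false).await; // a drop here leaves the debit applied alone
    credited.set(true);
}

/// Polls `transfer` exactly once, drops it, and reports (debited, credited).
pub fn drop_after_first_poll() -> (bool, bool) {
    let debited = Rc::new(Cell::new(false));
    let credited = Rc::new(Cell::new(false));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    let mut fut = Box::pin(transfer(debited.clone(), credited.clone()));
    let first = fut.as_mut().poll(&mut cx);
    assert!(first.is_pending());
    drop(fut); // the future never reaches the second step

    (debited.get(), credited.get())
}
```

The function reports the debit as applied and the credit as not applied: the half-finished state that makes the await inside transfer non-cancel-safe.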
A Task Abstraction for the Orchestrator
The plain JoinHandle<Result<()>> is enough to spawn and abort a task, but the orchestrator needs more. It needs a name (for logs and metrics), a restart policy (Lesson 4 builds this out), and the ability to query whether the task is alive and what failed if it isn't. We wrap the handle:
pub struct Task {
name: String,
handle: JoinHandle<Result<()>>,
restart_policy: RestartPolicy,
}
This is the type that Lesson 2's DAG scheduler operates over and Lesson 4's supervisor restarts. The wrapper costs almost nothing at runtime (a name and a policy stored next to an already-existing handle), but it gives the orchestrator a uniform handle type for every operator in the topology. Heterogeneous return types (some operators emit Result<Vec<Observation>>, some Result<()>) are not a concern in practice: every operator the orchestrator manages returns Result<()> because it runs forever until shut down, and a non-() return value would be discarded anyway. The standardization is on purpose.
Code Examples
Spawning a CPU-Bound Operator with spawn_blocking
The orbital propagator is the canonical CPU-bound operator in the SDA pipeline. Given a state vector and a propagation interval, it integrates the equations of motion forward in time — heavy floating-point work, no I/O, single-digit milliseconds per call. Putting that work in a normal async task is the standard mistake.
use anyhow::Result;
use tokio::sync::mpsc;
use tokio::task;
/// The wrong way: a CPU-bound operator in an async task.
/// This blocks the worker thread it runs on for the full duration of
/// `propagate`, starving every other task on that worker.
pub async fn propagate_inline(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<PropagatedObservation>,
) -> Result<()> {
while let Some(obs) = input.recv().await {
// propagate is 1-5 ms of pure CPU. The Tokio worker thread
// running this task is unavailable to anything else for the
// duration. With 8 worker threads, eight slow propagations
// in flight stall the entire runtime until they complete.
// propagate is fallible here, matching the offloaded version below.
let propagated = orbital::propagate(obs.target, obs.sensor_timestamp)?;
output.send(PropagatedObservation { obs, propagated }).await?;
}
Ok(())
}
/// The right way: hand the CPU work to the blocking pool, await its
/// JoinHandle from inside the async context. The async worker thread
/// remains free to poll other tasks while the propagation runs.
pub async fn propagate_offloaded(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<PropagatedObservation>,
) -> Result<()> {
while let Some(obs) = input.recv().await {
// spawn_blocking returns a JoinHandle<T>. Awaiting it suspends
// *this* task without blocking the worker thread; the runtime
// is free to poll other operators in the meantime. The closure
// runs on the blocking pool (default 512 threads).
let propagated = task::spawn_blocking(move || {
orbital::propagate(obs.target, obs.sensor_timestamp)
})
.await??; // outer ? for JoinError, inner ? for the Result inside
output.send(PropagatedObservation { obs, propagated }).await?;
}
Ok(())
}
The 10-microsecond budget is a rule for when to yield, not for when to offload. The practical threshold for spawn_blocking is higher, closer to a millisecond, because the dispatch itself (allocating the closure, signaling the blocking pool, handing the result back across threads) costs single-digit microseconds. For work not much longer than that, the offload costs more than it saves. For millisecond-scale work like propagate, leaving it inline lets a handful of in-flight calls occupy every worker and stall unrelated operators. The two-version comparison above makes the difference visible in flame graphs: propagate_inline shows continuous on-CPU time with no idle worker threads even when input arrives in bursts; propagate_offloaded shows CPU on the blocking pool and idle async workers ready to handle the next operator's work.
A Cancel-Unsafe Operator and How to Fix It
The orchestrator triggers shutdown by aborting operator tasks. If an operator is holding a resource across an await when the abort lands, its future is dropped at that await and whatever the resource was in the middle of is abandoned. The example below is a sink that writes observations to an embedded SQLite database, a common pattern for the audit sink, together with the bug it has and the fix that resolves it.
use anyhow::Result;
use rusqlite::{params, Connection};
use tokio::select;
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;
/// Cancel-unsafe: the transaction is opened, observations are written
/// inside it, and the commit happens only after up to 1000 awaits on
/// the channel recv. If the task is aborted at one of those awaits,
/// the future is dropped, rusqlite rolls the open transaction back,
/// and every observation already taken off the channel in this batch
/// is lost. Once the first insert has run, the open transaction also
/// holds SQLite's write lock for as long as the channel stays idle.
/// (rusqlite's Transaction is not Send, so a future like this runs on
/// a LocalSet or a dedicated thread rather than under tokio::spawn.)
pub async fn audit_sink_unsafe(
mut input: mpsc::Receiver<Observation>,
mut conn: Connection,
) -> Result<()> {
loop {
let txn = conn.transaction()?; // begin
for _ in 0..1000 {
let obs = match input.recv().await { // ← abort point inside txn
Some(o) => o,
None => return Ok(()),
};
txn.execute(INSERT_OBS_SQL, params![&obs.observation_id])?;
}
txn.commit()?;
}
}
/// Cancel-safe: select! between the channel recv and an explicit
/// cancellation token. The cancel branch closes the transaction
/// before the task exits. Aborts of the task itself become
/// rare — shutdown comes through the token.
pub async fn audit_sink_safe(
mut input: mpsc::Receiver<Observation>,
mut conn: Connection,
shutdown: CancellationToken,
) -> Result<()> {
loop {
let txn = conn.transaction()?;
for _ in 0..1000 {
select! {
recv = input.recv() => match recv {
Some(obs) => {
txn.execute(INSERT_OBS_SQL, params![&obs.observation_id])?;
}
None => {
txn.commit()?;
return Ok(());
}
},
_ = shutdown.cancelled() => {
// Explicit unwind: commit the partial batch so that
// observations already taken off the channel are not
// lost. A plain drop(txn) here would roll them back,
// which is exactly what an abort does.
txn.commit()?;
return Ok(());
}
}
}
txn.commit()?;
}
}
Two design points are worth lingering on. First, the fix is not "make the SQLite call cancel-safe" — that is not a property the underlying library can offer. The fix is to wrap the non-cancel-safe await in a select! whose other arm is a cancellation signal that the orchestrator controls. The task is now the one that decides how to unwind, on its own terms. Second, the cooperative shutdown via CancellationToken makes JoinHandle::abort a fallback rather than the primary mechanism. Production orchestrators emit shutdown.cancel() first, give every operator a grace window (typically 5–10 seconds) to drain, and only fall back to .abort() for operators that have not exited. Lesson 4 returns to this pattern.
The Task Wrapper
This is the type the rest of the orchestrator works with. Heterogeneous operator implementations — a UDP source, an HTTP poller, a windowed correlator — all become Task values once spawned, with a uniform interface for the scheduler and supervisor.
use anyhow::Result;
use std::time::{Duration, Instant};
use tokio::task::JoinHandle;
/// What the supervisor should do when this task exits.
#[derive(Debug, Clone, Copy)]
pub enum RestartPolicy {
/// Restart the operator on any non-graceful exit.
Always,
/// Restart up to N times within the given window.
Bounded { max_restarts: u32, window: Duration },
/// Never restart; failure of this operator should fail the pipeline.
/// Reserved for operators where data integrity is at stake (e.g.,
/// a sink whose retry would produce double-writes).
Never,
}
/// Spawned operator handle. The orchestrator stores one of these per
/// operator in the running topology.
pub struct Task {
name: String,
handle: JoinHandle<Result<()>>,
restart_policy: RestartPolicy,
spawned_at: Instant,
}
impl Task {
/// Spawn an operator and wrap its JoinHandle.
pub fn spawn(
name: impl Into<String>,
restart_policy: RestartPolicy,
future: impl std::future::Future<Output = Result<()>> + Send + 'static,
) -> Self {
Self {
name: name.into(),
handle: tokio::spawn(future),
restart_policy,
spawned_at: Instant::now(),
}
}
pub fn name(&self) -> &str { &self.name }
pub fn restart_policy(&self) -> RestartPolicy { self.restart_policy }
pub fn uptime(&self) -> Duration { self.spawned_at.elapsed() }
/// True if the task has not yet completed. Cheap; useful in
/// the supervisor's poll loop.
pub fn is_alive(&self) -> bool { !self.handle.is_finished() }
/// Wait for the task to exit. Distinguishes operator-returned
/// errors from runtime panics so the supervisor can react
/// differently to each (Lesson 4).
pub async fn join(self) -> TaskExit {
match self.handle.await {
Ok(Ok(())) => TaskExit::Ok,
Ok(Err(e)) => TaskExit::Errored(e),
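Err(join_err) if join_err.is_panic() => {
// The payload is a Box<dyn Any + Send>; Debug-formatting it
// prints only "Any { .. }". Recover the message when the
// payload is a &str or String, as it is for panic!("...").
let payload = join_err.into_panic();
let msg = payload
.downcast_ref::<&str>()
.map(|s| s.to_string())
.or_else(|| payload.downcast_ref::<String>().cloned())
.unwrap_or_else(|| "non-string panic payload".to_string());
TaskExit::Panicked(msg)
}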
Err(_) => TaskExit::Aborted,
}
}
/// Cooperative shutdown via the orchestrator's cancellation
/// token would normally have run first; this is the fallback
/// for operators that did not honor the token in time.
pub fn abort(&self) { self.handle.abort(); }
}
#[derive(Debug)]
pub enum TaskExit {
Ok,
Errored(anyhow::Error),
Panicked(String),
Aborted,
}
The join method is the single most important part of this type. It collapses Tokio's two-level error reporting (Result<Result<()>, JoinError>) into a flat TaskExit enum that distinguishes the four operationally meaningful cases. A panicked operator is a bug we want to alert on. An errored operator returned Err(_) from its future — it is an expected failure mode (a network partition, a transient API failure) and should be retried per its policy. An aborted operator was shut down by the orchestrator and is not a failure. An Ok exit means the operator's input ran out — the source closed cleanly — which is normal at end-of-stream but unusual in steady state. The supervisor in Lesson 4 dispatches on TaskExit directly; everything past this lesson assumes the wrapper exists.
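To make the dispatch concrete, here is a minimal sketch of how a supervisor might turn a TaskExit and a RestartPolicy into a decision, including the sliding-window budget for the Bounded case. The enums are simplified mirrors of the ones above so the block compiles on its own; Action, RestartHistory, and decide are hypothetical names, and Lesson 4's supervisor may be structured differently.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Simplified mirrors of the lesson's types so the sketch stands alone.
#[derive(Debug, Clone, Copy)]
pub enum RestartPolicy {
    Always,
    Bounded { max_restarts: u32, window: Duration },
    Never,
}

#[derive(Debug)]
pub enum TaskExit {
    Ok,
    Errored(String), // the lesson's version carries anyhow::Error
    Panicked(String),
    Aborted,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Restart,
    LeaveStopped,
    FailPipeline,
}

/// Per-operator record of recent restart times (hypothetical).
#[derive(Default)]
pub struct RestartHistory {
    recent: VecDeque<Instant>,
}

pub fn decide(
    exit: &TaskExit,
    policy: RestartPolicy,
    history: &mut RestartHistory,
    now: Instant,
) -> Action {
    match exit {
        // Clean end-of-stream and orchestrator-initiated aborts are not failures.
        TaskExit::Ok | TaskExit::Aborted => Action::LeaveStopped,
        TaskExit::Errored(_) | TaskExit::Panicked(_) => match policy {
            RestartPolicy::Never => Action::FailPipeline,
            RestartPolicy::Always => Action::Restart,
            RestartPolicy::Bounded { max_restarts, window } => {
                // Forget restarts that have fallen out of the sliding window.
                while matches!(history.recent.front(), Some(t) if now.duration_since(*t) > window) {
                    history.recent.pop_front();
                }
                if (history.recent.len() as u32) < max_restarts {
                    history.recent.push_back(now);
                    Action::Restart
                } else {
                    Action::FailPipeline
                }
            }
        },
    }
}
```

Keeping the decision a pure function of the exit, the policy, and a clock value makes the restart budget testable without spawning a single task.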
Key Takeaways
- A task is the unit of work the runtime schedules; a JoinHandle is your application's sole reference to it. Drop of the handle detaches rather than cancels, and detached tasks accumulate silently, so the orchestrator must own every operator's handle.
- The cooperative-scheduling contract demands tasks yield at sub-millisecond granularity. CPU-bound work belongs on spawn_blocking, not on tokio::spawn. Failure to honor this rule produces tail-latency spikes that look like runtime instability rather than the application bug they are.
- JoinHandle::abort is cooperative: it sets a flag observed at the next await point. CPU-bound tasks ignore aborts until they yield. Production shutdown is a two-step protocol: signal a CancellationToken for cooperative drain, then fall back to .abort() for stragglers.
- Cancel-safety is a per-await-point property. Non-cancel-safe awaits (database transactions, streaming HTTP bodies, custom locks) must be wrapped in select! against an explicit cancellation signal so the operator unwinds on its own terms. Aborting a non-cancel-safe future abandons whatever work it held in flight.
- The Task wrapper turns Tokio's two-level error reporting into a four-case TaskExit enum (Ok, Errored, Panicked, Aborted) that the supervisor in Lesson 4 dispatches on. Standardize on Result<()> for every operator; standardize on Task for every spawned handle.
Lesson 2 — DAG Scheduling
Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 2 of 4
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (the operator-graph execution model in stream processing); Async Rust — Maxwell Flitton & Caroline Morton, sections on tokio::task::JoinSet and tokio::select!; Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 (Orchestration as the connective tissue of the lifecycle)
Source note: The DDIA discussion of operator-graph execution is paraphrased from the printed text rather than quoted; the core model (vertices as operators, edges as channels, scheduling via topological order) is well established and unchanged across editions. Specific algorithmic choices (Kahn's algorithm vs DFS-postorder) are decisions made for this curriculum, not direct citations.
Context
Lesson 1 established the task as the unit of orchestration and the Task wrapper as the orchestrator's handle on each one. We now need a way to describe an entire pipeline's worth of operators — radar source, ISL listener, optical poller, three normalizers, a windowed dedup, a correlator, a conjunction emitter, an audit sink — and turn that description into running tasks with their channels correctly wired. Module 1's spawn_ingestion_topology did this by hard-coding the topology in a long function. That technique is fine for three operators and fragile for ten. The orchestrator replaces it with a declarative graph: the application says what the pipeline looks like, and the orchestrator figures out what to spawn and in what order.
The data structure that captures "what the pipeline looks like" is a directed acyclic graph. Vertices are operators; directed edges are typed channels carrying observations (or, later, watermarks and barriers) from one operator to the next. The acyclic constraint is operationally critical: a pipeline with a cycle is a pipeline that can deadlock under backpressure, and a pipeline that can deadlock under backpressure is one we want to refuse to start rather than start and hope. We will spend most of this lesson building the graph type and its scheduler. The remaining time is on the operational properties the graph buys us: clean shutdown via JoinSet, end-to-end backpressure preservation, and detection of unreachable or orphaned operators at build time rather than at 3 AM.
The forward references matter. Module 3 adds watermark propagation to the graph's edges; the graph type designed here must accommodate that addition without a rewrite. Module 4 deepens the backpressure analysis; the bounded-channel-per-edge invariant established here is what makes it tractable. Module 5 adds checkpoint barriers as graph-level events; same invariant applies. The shape of OperatorGraph is load-bearing for the rest of the track.
Core Concepts
The Pipeline as a DAG
A pipeline's logical structure is a graph. Vertices are operators — the named pieces of work the pipeline performs (a source, a normalizer, a windowed dedup, a sink). Edges are channels — the typed conduits along which data flows from one operator to the next. The graph is directed because data flows in one direction along each edge. The graph is acyclic because allowing cycles introduces deadlock conditions that are tedious to reason about and unnecessary for the SDA pipeline's needs.
The vertices carry the operator's identity and the means to spawn it: a name, the async function (or closure) that runs the operator, the channel ends it expects to receive, and its restart policy from Lesson 1. The edges carry the channel itself plus its capacity. Edges between specific operators are typed — a channel from radar to normalize carries Observation, a channel from windowed-dedup to correlator might carry DedupedObservation — but the graph as a whole is heterogeneously typed and represented internally with type erasure (the operator function is boxed; the channels are stored in a typed map keyed on edge ID). This is a deliberate tradeoff that we will spell out in the code: the static-typing alternative produces a graph type whose generic parameters explode with the number of edges, and the orchestrator becomes harder to use than the manual spawn it replaces.
Forbidding cycles is a real constraint, not just a convention. A cyclic pipeline has at least one operator whose downstream depends on its own upstream; under backpressure, the upstream blocks waiting for the downstream to consume, and the downstream blocks waiting for the upstream to produce. Without a tie-breaking mechanism (an explicit budget on the cycle, an unbounded buffer, a priority order on which side blocks first), the cycle deadlocks. SDA's pipelines do not need cycles — every legitimate use case (a "rerun the failed observations" loop, a "feed audit results back to the source for sampling weight adjustment" path) can be expressed as a separate downstream operator that re-emits into a new pipeline. The orchestrator refuses to build a graph that contains a cycle, and produces an error message naming the cycle.
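The deadlock can be reproduced in a few lines. The following is a minimal sketch, assuming std's bounded sync_channel as a stand-in for the Tokio channels the orchestrator uses; the function name ring_is_stuck and the capacity of 1 are illustrative choices, not part of the orchestrator.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Wires two operators A and B in a ring with one bounded channel per
/// direction, fills both buffers, and reports whether each side's next
/// send would block. Returns (a_blocked, b_blocked).
fn ring_is_stuck() -> (bool, bool) {
    // The receivers are kept alive but never read: each operator is busy
    // trying to send, so it never gets around to receiving.
    let (a_to_b_tx, _a_to_b_rx) = sync_channel::<u32>(1);
    let (b_to_a_tx, _b_to_a_rx) = sync_channel::<u32>(1);

    // Both buffers fill while each side is producing.
    a_to_b_tx.try_send(1).unwrap();
    b_to_a_tx.try_send(1).unwrap();

    // A's next send needs B to receive first; B's next send needs A to
    // receive first. Neither can make progress.
    let a_blocked = matches!(a_to_b_tx.try_send(2), Err(TrySendError::Full(_)));
    let b_blocked = matches!(b_to_a_tx.try_send(2), Err(TrySendError::Full(_)));
    (a_blocked, b_blocked)
}
```

Calling ring_is_stuck() returns (true, true): with blocking sends in place of try_send, both operators would wait on each other indefinitely.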
Topological Sort and Channel Wiring
Spawning the operators in arbitrary order does not work. The downstream operator's task must be created with the receiver end of the channel that connects it to the upstream operator. That receiver does not exist until the channel itself is created, which (in the natural construction order) happens when the upstream is being prepared. The natural flow is therefore: walk the graph in topological order (every operator visited before any of its descendants), create each operator's outgoing channels first, hand the receiver halves to the descendants when their turn comes.
Two algorithms produce a topological order: Kahn's algorithm (repeatedly pick a vertex with no remaining incoming edges and remove it) and DFS post-order (depth-first traversal, emit each vertex after all its descendants). Both are O(V + E). Kahn's is operationally preferred for pipelines because the order it produces is level-by-level — sources first, then their immediate consumers, then theirs, and so on — which corresponds to how an operator engineer thinks about the pipeline. DFS post-order can produce a less intuitive order in which a deep chain is finished before a shallow sibling is started; the spawned tasks are then more interleaved than the engineer expected.
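The difference in visiting order is easy to see on a small graph. The sketch below is a std-only illustration; the adjacency-list representation and the names kahn_order and dfs_order are assumptions made for this example, not the orchestrator's types. The DFS variant reverses its post-order so that both functions return sources first.

```rust
use std::collections::VecDeque;

/// Kahn's algorithm over an adjacency list (adj[v] = vertices v feeds).
/// Returns None if the graph contains a cycle.
fn kahn_order(adj: &[Vec<usize>]) -> Option<Vec<usize>> {
    let n = adj.len();
    let mut in_degree = vec![0usize; n];
    for targets in adj {
        for &t in targets {
            in_degree[t] += 1;
        }
    }
    // Start from every vertex with no upstream (the sources).
    let mut ready: VecDeque<usize> = (0..n).filter(|&v| in_degree[v] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(v) = ready.pop_front() {
        order.push(v);
        for &t in &adj[v] {
            in_degree[t] -= 1;
            if in_degree[t] == 0 {
                ready.push_back(t);
            }
        }
    }
    // Any vertex left unvisited sits on a cycle.
    (order.len() == n).then_some(order)
}

/// Reversed DFS post-order. Assumes the input is already known to be acyclic.
fn dfs_order(adj: &[Vec<usize>]) -> Vec<usize> {
    fn visit(v: usize, adj: &[Vec<usize>], seen: &mut [bool], out: &mut Vec<usize>) {
        if seen[v] {
            return;
        }
        seen[v] = true;
        for &t in &adj[v] {
            visit(t, adj, seen, out);
        }
        out.push(v); // emitted only after all descendants
    }
    let mut seen = vec![false; adj.len()];
    let mut out = Vec::new();
    for v in 0..adj.len() {
        visit(v, adj, &mut seen, &mut out);
    }
    out.reverse();
    out
}
```

For two sources (0, 1) feeding two normalizers (2, 3) that share a sink (4), the adjacency list is [[2], [3], [4], [4], []]. kahn_order yields [0, 1, 2, 3, 4], level by level, while dfs_order yields [1, 3, 0, 2, 4], finishing one source's chain before starting the other.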
The wiring problem has a two-pass structure that the code below makes explicit: the first pass over the graph creates every channel (sender + receiver pair) and stores each pair in a map keyed on the edge ID; the second pass spawns each operator with the receiver end of its incoming edge and the sender end of its outgoing edge. This separation is what keeps wiring independent of construction order: every channel end already exists when an operator's turn comes, and a receiver is handed out only when the second pass reaches the consumer of that edge. Single-pass approaches that try to allocate channels lazily as operators are walked tend to produce subtle bugs: operators spawned with placeholder channels that have to be patched, or scheduling orders that look topological but skip an edge.
JoinSet for Lifecycle Management
A pipeline with N operators produces N JoinHandles. The orchestrator's supervisor (Lesson 4) needs to react when any one of them completes — usually because of a panic or an error, sometimes because of a clean shutdown signal. Iterating over a Vec<JoinHandle> and calling .await on each is the wrong pattern: it serializes on the slowest operator's completion and offers no way to react to whichever finishes first. The right primitive is tokio::task::JoinSet.
A JoinSet<Result<()>> is a bag of in-flight tasks with a join_next() -> Future<Output = Option<Result<Result<()>, JoinError>>> method. The future resolves whenever any task in the set finishes, returning that task's result. The supervisor awaits join_next in a loop, dispatching on each result as it arrives. When the supervisor decides the pipeline should shut down, it calls JoinSet::abort_all(), which sends an abort signal to every task and lets the supervisor drain the resulting JoinErrors through the same join_next loop. The structural property: every operator's lifecycle ends through the supervisor's mouth, never through a detached return.
JoinSet is also where the uniform-return-type discipline from Lesson 1 pays off. Every operator returns Result<()>; that uniformity is what lets JoinSet be a single typed structure for the entire pipeline. The standardization paid for itself the moment we needed a single supervisor.
Backpressure Through the DAG
Module 1 established mpsc::Sender::send().await as the foundation of backpressure: a full channel suspends the upstream until the downstream catches up. The DAG inherits that property as long as every edge in the DAG is a bounded channel. The orchestrator enforces this with a graph-level invariant: there are no unbounded_channel calls anywhere in the graph builder. Edges are sized at construction time; sizes are documented per-edge with a comment about expected burst behavior; Module 4 develops the sizing discipline.
The DAG-level corollary is that backpressure traverses the entire graph. A slow sink causes its upstream's channel to fill, suspends that upstream, which causes its upstream's channel to fill, and so on back to the sources. The sources themselves either suspend on their own producing primitive (the UDP socket recv, the HTTP poll) or, for sources that cannot suspend (a radar feed that produces whether we are listening or not), drop at the kernel level. The pipeline never grows unbounded internal buffers under load. This is the property the audit script in Lesson 3 verifies and the property the Module 4 burst test exercises.
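The traversal can be modelled without a runtime. The sketch below assumes std's bounded sync_channel with try_send as a single-threaded stand-in for Tokio's send().await; the function name and the three-operator chain (source, one stage, a sink that never reads) are illustrative assumptions.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Counts how many items a source can push into a two-edge chain
/// (source -> stage -> sink) when the sink never reads. `cap` must be >= 1.
fn items_accepted_with_stalled_sink(cap: usize) -> usize {
    let (src_tx, stage_rx) = sync_channel::<u32>(cap); // edge 1
    let (stage_tx, _sink_rx) = sync_channel::<u32>(cap); // edge 2, sink stalled

    let mut accepted = 0;
    let mut in_hand: Option<u32> = None; // item the stage is trying to forward
    for item in 0..1_000u32 {
        // The source offers an item to edge 1.
        match src_tx.try_send(item) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => break, // backpressure reached the source
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver is alive"),
        }
        // The stage pulls one item if its hands are free, then tries to forward.
        if in_hand.is_none() {
            in_hand = stage_rx.try_recv().ok();
        }
        if let Some(v) = in_hand.take() {
            if let Err(TrySendError::Full(v)) = stage_tx.try_send(v) {
                in_hand = Some(v); // edge 2 is full: the stage is suspended
            }
        }
    }
    accepted
}
```

With a capacity of 2 per edge the source is stopped after 5 items: two buffered on each edge plus one held by the suspended stage. In-flight data is bounded by the sum of the edge capacities plus one item per operator, regardless of how long the sink stalls.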
The DAG is also the natural place to check cycles. We forbid them because a cyclic pipeline cannot have this backpressure-traversal property — a cycle has no "back" to propagate to. If a future requirement legitimately needs cyclic data flow (a feedback path for late-arriving observations, a "retry the failed conjunction analysis" loop), it must be implemented as a feed-through to a new pipeline, not as a back-edge in the existing one. The constraint is a feature.
Unreachable Tasks and Orphaned Channels
The first non-trivial bug a pipeline-graph type catches is the disconnection bug. An engineer adds a new operator, registers it with the graph, but forgets to call connect(upstream, new_operator) or connect(new_operator, downstream). The graph builds. The operator spawns. It either reads from an empty channel forever (its upstream was never wired) or writes into a channel nobody reads (its downstream was never wired) — and the former scenario has no symptoms until somebody notices the operator is silent, while the latter scenario fills its channel and applies false backpressure to its upstream. Both bugs cost real time when they happen in production.
The graph builder catches both cases at build() time. An operator that has no incoming edges is either a registered source (legitimate) or a misconfigured operator (illegal). An operator that has no outgoing edges is either a registered sink (legitimate) or a misconfigured operator (illegal). The builder's role enumeration distinguishes these cases; an operator declared as add_operator (interior) is required to have both an incoming and an outgoing edge, while add_source and add_sink adjust the constraint accordingly. The error message names the offending operator and the missing edge direction. This kind of build-time validation is what gives the declarative orchestrator its primary advantage over hand-spawned topology: bugs that would require runtime instrumentation to detect become compile-time-style errors at startup.
Code Examples
The OperatorGraph Builder
The graph is constructed by a builder that collects vertices and edges and validates them at build() time. Operators are stored as type-erased boxed closures so the graph itself is not generic over each operator's type signature.
use anyhow::{bail, Result};
use std::future::Future;
use std::pin::Pin;
use tokio::sync::mpsc;

/// What an operator does once spawned. Erased to a boxed future so the
/// graph stores a heterogeneous collection.
pub type OperatorFuture = Pin<Box<dyn Future<Output = Result<()>> + Send + 'static>>;

/// An operator's role in the topology. The builder uses this to validate
/// that operators have the right edges connected.
#[derive(Debug, Clone, Copy)]
pub enum Role { Source, Operator, Sink }

struct VertexSpec {
    name: String,
    role: Role,
    /// Constructed once the channels are wired in build(). Closure takes
    /// the operator's incoming and outgoing channel-end handles.
    factory: Box<dyn FnOnce(WiredEnds) -> OperatorFuture + Send>,
    incoming: Option<EdgeId>,
    outgoing: Option<EdgeId>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct EdgeId(u32);

/// The handles passed into the operator's factory closure when build()
/// wires the topology.
pub struct WiredEnds {
    pub rx: Option<mpsc::Receiver<Observation>>,
    pub tx: Option<mpsc::Sender<Observation>>,
}

pub struct OperatorGraph {
    vertices: Vec<VertexSpec>,
    edges: Vec<EdgeSpec>,
}

struct EdgeSpec {
    id: EdgeId,
    from_idx: usize,
    to_idx: usize,
    capacity: usize,
}

impl OperatorGraph {
    pub fn new() -> Self {
        Self { vertices: Vec::new(), edges: Vec::new() }
    }

    pub fn add_source(&mut self, name: impl Into<String>,
        factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
        -> usize
    {
        self.push_vertex(name.into(), Role::Source, factory)
    }

    pub fn add_operator(&mut self, name: impl Into<String>,
        factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
        -> usize
    {
        self.push_vertex(name.into(), Role::Operator, factory)
    }

    pub fn add_sink(&mut self, name: impl Into<String>,
        factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
        -> usize
    {
        self.push_vertex(name.into(), Role::Sink, factory)
    }

    fn push_vertex(&mut self, name: String, role: Role,
        factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
        -> usize
    {
        let idx = self.vertices.len();
        self.vertices.push(VertexSpec {
            name,
            role,
            factory: Box::new(factory),
            incoming: None,
            outgoing: None,
        });
        idx
    }

    /// Adds a bounded edge. Each operator may have at most one incoming
    /// and one outgoing edge.
    pub fn connect(&mut self, from: usize, to: usize, capacity: usize) -> Result<EdgeId> {
        if self.vertices[from].outgoing.is_some() {
            bail!("operator {} already has an outgoing edge", self.vertices[from].name);
        }
        if self.vertices[to].incoming.is_some() {
            bail!("operator {} already has an incoming edge", self.vertices[to].name);
        }
        let id = EdgeId(self.edges.len() as u32);
        self.edges.push(EdgeSpec { id, from_idx: from, to_idx: to, capacity });
        self.vertices[from].outgoing = Some(id);
        self.vertices[to].incoming = Some(id);
        Ok(id)
    }
}
The shape is more verbose than a typical graph type because it carries the operator-spawn closure inline. The closure captures whatever the operator needs (config, source addresses, etc.) and is only invoked once build() has wired the channels. This delays the construction of channel-touching state until we actually have the channels, which is exactly the two-pass structure the lesson described. The single-incoming and single-outgoing restriction (one edge per direction per operator) is a simplification that the SDA pipeline does not need to violate; fan-in and fan-out are handled by intermediate router operators rather than by multi-edge vertices. This keeps the topo-sort and the wiring logic simple at modest cost in expressiveness, and the cost is recoverable when needed by introducing explicit fan-in/fan-out operators with their own typed semantics.
Topo-Sort, Cycle Detection, and Build
The build() step is where validation happens. It runs Kahn's algorithm to produce a topological order and to detect cycles, validates per-role edge requirements, allocates the channels, and spawns each operator with its wired channel ends.
use std::collections::{HashMap, VecDeque};

pub struct BuiltGraph {
    /// Spawned tasks, in topological order. The supervisor takes ownership.
    pub tasks: Vec<Task>,
}

impl OperatorGraph {
    pub fn build(self) -> Result<BuiltGraph> {
        // Pass 1: validate role constraints.
        for v in &self.vertices {
            match v.role {
                Role::Source if v.incoming.is_some() =>
                    bail!("source {} has an incoming edge; sources have no upstream", v.name),
                Role::Sink if v.outgoing.is_some() =>
                    bail!("sink {} has an outgoing edge; sinks have no downstream", v.name),
                Role::Operator if v.incoming.is_none() =>
                    bail!("operator {} has no incoming edge; did you forget to connect()?", v.name),
                Role::Operator if v.outgoing.is_none() =>
                    bail!("operator {} has no outgoing edge; did you forget to connect()?", v.name),
                Role::Source if v.outgoing.is_none() =>
                    bail!("source {} has no outgoing edge; did you forget to connect()?", v.name),
                Role::Sink if v.incoming.is_none() =>
                    bail!("sink {} has no incoming edge; did you forget to connect()?", v.name),
                _ => {}
            }
        }

        // Pass 2: Kahn's algorithm for topo sort + cycle detection.
        let n = self.vertices.len();
        let mut in_degree = vec![0usize; n];
        for e in &self.edges { in_degree[e.to_idx] += 1; }
        let mut ready: VecDeque<usize> = (0..n).filter(|&i| in_degree[i] == 0).collect();
        let mut order: Vec<usize> = Vec::with_capacity(n);
        while let Some(idx) = ready.pop_front() {
            order.push(idx);
            for e in &self.edges {
                if e.from_idx == idx {
                    in_degree[e.to_idx] -= 1;
                    if in_degree[e.to_idx] == 0 { ready.push_back(e.to_idx); }
                }
            }
        }
        if order.len() != n {
            // The remaining vertices form one or more cycles.
            let cycle_members: Vec<&str> = (0..n)
                .filter(|i| !order.contains(i))
                .map(|i| self.vertices[i].name.as_str())
                .collect();
            bail!("pipeline graph has a cycle involving operators: {:?}", cycle_members);
        }

        // Pass 3: allocate channels (sender + receiver pair per edge).
        let mut chan_tx: HashMap<EdgeId, mpsc::Sender<Observation>> = HashMap::new();
        let mut chan_rx: HashMap<EdgeId, mpsc::Receiver<Observation>> = HashMap::new();
        for e in &self.edges {
            let (tx, rx) = mpsc::channel(e.capacity);
            chan_tx.insert(e.id, tx);
            chan_rx.insert(e.id, rx);
        }

        // Pass 4: walk in topo order, spawning each operator with its wired ends.
        // The factory is FnOnce, so each vertex must be moved out of the graph
        // exactly once; wrapping the vertices in Option lets us take() by index.
        let mut vertices: Vec<Option<VertexSpec>> =
            self.vertices.into_iter().map(Some).collect();
        let mut tasks: Vec<Task> = Vec::with_capacity(n);
        for idx in order {
            let v = vertices[idx].take().expect("topological order visits each vertex once");
            let rx = v.incoming.and_then(|e| chan_rx.remove(&e));
            let tx = v.outgoing.and_then(|e| chan_tx.remove(&e));
            // ... factory invocation and Task::spawn elided for brevity;
            // spawn_via_factory stands in for that helper.
            tasks.push(spawn_via_factory(&v.name, v.role, v.factory, WiredEnds { rx, tx }));
        }
        Ok(BuiltGraph { tasks })
    }
}
Three things to notice. The validation in pass 1 rejects misconfigured graphs before any channels are allocated, which gives the engineer the cleanest possible error message — naming the operator that is missing its edge — rather than a downstream "channel is empty forever" symptom at runtime. The Kahn's-algorithm pass in pass 2 doubles as both topological sort and cycle detection: if any vertex has nonzero in-degree at the end, it is part of a cycle, and the unreached vertices name the cycle's members in the error message. The channel allocation in pass 3 is a single pass that creates every channel before any operator spawns; pass 4 then walks the topological order and hands each operator its channel ends, removing them from the maps as it goes. The remove rather than get is intentional: each receiver is owned by exactly one operator, and the map's emptiness at the end is itself a sanity check.
Cycle Detection in Practice
Here is what the cycle error looks like to the engineer who wrote the offending pipeline. The example below intentionally builds a cycle (dedup → normalize → dedup) to illustrate the diagnostic. Because connect() allows one edge per direction per operator, a back-edge into an operator that already has an upstream is rejected by connect() itself; a cycle can only reach build() as a closed ring that no source feeds. The example therefore wires the source straight to the sink and leaves normalize and dedup pointing at each other.
fn build_buggy_pipeline() -> Result<BuiltGraph> {
    let mut g = OperatorGraph::new();
    let radar = g.add_source("radar", radar_factory());
    let normalize = g.add_operator("normalize", normalize_factory());
    let dedup = g.add_operator("dedup", dedup_factory());
    let sink = g.add_sink("audit", audit_factory());
    g.connect(radar, sink, 64)?;         // mistake: the source bypasses both operators
    g.connect(normalize, dedup, 1024)?;
    g.connect(dedup, normalize, 1024)?;  // accidental back-edge closes the ring
    g.build() // returns Err with the cycle named
}
// The error returned by g.build():
// Error: pipeline graph has a cycle involving operators: ["normalize", "dedup"]
The diagnostic is not perfect — Kahn's algorithm reports the set of vertices in cycles rather than the specific edges that form them — but it is enough to direct the engineer to the right neighborhood. Production graph types augment this with a second pass that finds the strongly connected components and reports each cycle's edge sequence; the SDA orchestrator's diagnostic is sufficient for the topology sizes we expect (tens of operators per pipeline) and we leave the SCC enhancement for when it is needed. The point of the example is that a misconfigured pipeline fails at build time with an actionable message, not at runtime with a deadlocked operator.
Key Takeaways
- The pipeline is a directed acyclic graph of operators connected by typed bounded channels. Vertices are operators with a name, role (source / operator / sink), and spawn factory; edges are channels with a documented capacity. The acyclic constraint is a deadlock-prevention requirement, not a stylistic choice.
- OperatorGraph::build() runs four passes: per-role edge validation, Kahn's topological sort with cycle detection, channel allocation in a single pass, then operator spawning in topological order. Each pass produces an actionable error message at the earliest possible point.
- tokio::task::JoinSet<Result<()>> is the right primitive for owning N operator handles. It supports join_next for whichever-finishes-first reaction and abort_all for clean shutdown. Standardize every operator on Result<()> and the JoinSet is homogeneously typed.
- Bounded channels everywhere preserves the end-to-end backpressure property the rest of the track depends on. The graph builder makes "no unbounded_channel" a structural invariant, not a code-review rule.
- Build-time validation catches disconnection bugs (unwired operators) and cycle bugs (back-edges). Both produce readable startup-time errors rather than runtime symptoms that require instrumentation to diagnose.
Lesson 3 — Retries and Idempotency
Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 3 of 4
Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery — Send Acknowledgments, Configuring Producer Retries, Idempotent Producer); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (failure handling in stream processing)
Context
Network calls in a streaming pipeline fail. The optical-archive REST endpoint goes down for thirty seconds during a partner deploy. The Kafka broker the alert sink is producing into has a leader election. The conjunction-emitter HTTP subscriber returns 503 because its database is being patched. None of these are "the pipeline is broken" — they are normal transient conditions in a system whose dependencies have their own operational lifecycle. The pipeline must keep running through them, which means it must retry.
Naive retries are themselves a source of incidents. A pipeline that retries every failure with no backoff turns a one-second downstream blip into a thundering-herd amplification of a hundred operator instances all reconnecting at once, each triggering more downstream work, each failing again. A pipeline that retries permanent errors (a 4xx, a deserialization failure, a schema-mismatch exception) loops forever on a poison pill. A pipeline that retries idempotent operations gets the right answer; one that retries non-idempotent operations produces duplicates whose downstream cost is sometimes invisible (a duplicate row in an analytics table) and sometimes catastrophic (a duplicate "fire thrusters" command to a satellite).
This lesson assembles three pieces of discipline. What to retry — distinguishing transient from permanent errors so retries help rather than hurt. How to retry — exponential backoff with jitter, so a hundred instances retrying after the same failure do not all retry at the same instant. How to make retries safe — idempotency, the property that lets at-least-once delivery (the default Kafka producer guarantee) compose into effective exactly-once processing (covered fully in Module 5; previewed here as the tooling we install now). The mission framing is concrete: an SDA-2026-NNNN incident two months ago saw a junior engineer add naive retries to the optical poller, which during a 30-second archive outage hammered the recovering archive with five thousand reconnect attempts per second and extended the outage to ninety minutes. The architectural fix from that postmortem is what this lesson encodes.
Core Concepts
Transient vs Permanent Errors
The first decision in a retry policy is whether a given error is a thing retrying might fix. Transient errors are conditions that are likely to resolve on their own within a useful timescale: timeouts, 5xx responses, connection-refused, broker-not-available, leader-election-in-progress. Permanent errors are conditions that will not resolve no matter how many times you ask: 4xx responses (the request is malformed), deserialization failures (the bytes are not what you expected), schema-mismatch errors (the producer and consumer disagree about the type), validation failures (the observation references a satellite that has been decommissioned). Discardable errors are a third category — invariant violations that should drop the event entirely without retry or DLQ, like a malformed UUID in a field that is required to be a UUID and is generated at the source.
The classification logic lives in the operator, not in the framework. A general retry wrapper that retries every error is the wrong shape. A wrapper that takes a RetryDisposition::Retry | Permanent | Discard and expects the operator to classify is the right shape. Lesson 4 introduces the dead-letter-queue as the destination for Permanent errors (not retried; not discarded; routed somewhere the engineer can examine them), so for now we focus on the retry path itself.
A useful default for unknown errors is Permanent. If you do not know whether retrying helps, assume it does not — better to surface the unknown error to the operational dashboard than to loop on a poison pill. Engineers add explicit Retry disposition for errors they have classified; everything else falls through to Permanent and gets attention.
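A classifier with that default can be very small. The sketch below is illustrative only: the Failure enum, its variants, and the classify function are hypothetical names invented for this example, and a real operator would classify its own error types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Disposition {
    Retry,
    Permanent,
    Discard,
}

/// Hypothetical failure kinds an operator might observe.
#[derive(Debug)]
enum Failure {
    Timeout,
    ConnectionRefused,
    HttpStatus(u16),
    Deserialize,
    MalformedSourceId,
    Other,
}

fn classify(failure: &Failure) -> Disposition {
    match failure {
        // Explicitly classified as transient.
        Failure::Timeout | Failure::ConnectionRefused => Disposition::Retry,
        Failure::HttpStatus(status) if (500..=599).contains(status) => Disposition::Retry,
        // Invariant violation generated at the source: drop it.
        Failure::MalformedSourceId => Disposition::Discard,
        // 4xx, deserialization failures, and anything not yet classified
        // fall through to Permanent so they surface instead of looping.
        _ => Disposition::Permanent,
    }
}
```

The wildcard arm is the important line: adding a new Failure variant without classifying it yields Permanent, never an accidental retry loop.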
Exponential Backoff with Jitter
When a transient error occurs, retrying instantly is rarely right. The downstream that is failing is usually doing so because it is overloaded, recovering from a fault, or being deployed; an instant retry adds load to a system that needs the opposite. The basic shape of a sensible retry policy is exponential backoff: wait some initial delay, double it on each retry, cap at some maximum. The exponential growth means an outage of any duration eventually backs off to a low retry rate; the cap prevents runaway delays past the point where recovery is plausible.
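As a minimal sketch of that schedule (the function name backoff_delay and the 0-based attempt numbering are assumptions for this example):

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): initial * 2^attempt,
/// capped at `cap`. Overflow at large attempt counts saturates to the cap.
fn backoff_delay(initial: Duration, cap: Duration, attempt: u32) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    initial.checked_mul(factor).unwrap_or(cap).min(cap)
}
```

With initial = 100 ms and cap = 1 s, attempts 0 through 5 wait 100, 200, 400, 800, 1000, and 1000 ms: the delay doubles until it reaches the cap and stays there.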
Exponential backoff alone has a subtle problem at scale. A hundred operator instances all hit the same downstream failure at the same instant. They all back off the same delay. They all retry at the same instant. The downstream is still recovering, fails again, and the cycle repeats. The retry traffic looks like a square wave. The fix is jitter: a random component added to (or replacing) the deterministic delay. With jitter, the hundred instances retry at randomly distributed times within a window, smoothing the load.
The two jitter formulas worth knowing are full jitter and decorrelated jitter. Full jitter: delay = random_uniform(0, min(cap, initial * 2^attempt)). The exponential schedule survives only as the upper bound of the random draw. Decorrelated jitter: delay = min(cap, random_uniform(initial, prev_delay * 3)). The previous attempt's delay anchors the upper bound, initial is the lower bound, and the cap limits the result. Both variants are described in the AWS Architecture Blog's article on exponential backoff and jitter; decorrelated jitter is the one we use here. Because its lower bound is initial rather than zero, it never retries almost immediately, which puts less retry pressure on a slow-recovering downstream.
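Both formulas fit in a few lines each. The sketch below uses full jitter in its commonly given form (a uniform draw between zero and the capped exponential backoff) and decorrelated jitter as stated above. To stay dependency-free it uses a toy xorshift generator; XorShift64 and both function names are assumptions for this example, and production code would use a real RNG crate.

```rust
use std::time::Duration;

/// Toy xorshift64 generator so the sketch needs no external crate.
/// The seed must be nonzero. For illustration only.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Integer in [lo, hi], assuming lo <= hi (modulo bias ignored here).
    fn in_range(&mut self, lo: u64, hi: u64) -> u64 {
        let span = hi.saturating_sub(lo).saturating_add(1);
        lo + self.next_u64() % span
    }
}

/// Full jitter: uniform in [0, min(cap, initial * 2^attempt)].
fn full_jitter(rng: &mut XorShift64, initial: Duration, cap: Duration, attempt: u32) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let ceiling = initial.checked_mul(factor).unwrap_or(cap).min(cap);
    Duration::from_millis(rng.in_range(0, ceiling.as_millis() as u64))
}

/// Decorrelated jitter: min(cap, uniform in [initial, prev * 3]).
fn decorrelated_jitter(rng: &mut XorShift64, initial: Duration, cap: Duration, prev: Duration) -> Duration {
    let lo = initial.as_millis() as u64;
    let hi = (prev.as_millis() as u64).saturating_mul(3).max(lo);
    Duration::from_millis(rng.in_range(lo, hi)).min(cap)
}
```

The caller of decorrelated_jitter feeds each returned delay back in as prev on the next attempt, starting with prev = initial. Every delay then stays within [initial, cap], assuming cap is at least initial.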
Idempotency Keys
Retries can produce duplicates. A producer that retries after a partial failure may have actually succeeded on the original attempt, with only the success ack lost; the retry is then a duplicate. A consumer that processes-then-commits may crash between the process and the commit; on restart it reads the message again and processes it twice. At-least-once delivery, the natural guarantee of any retry-capable system, admits duplicates by definition. To compose it into something stronger, we need the operations downstream of delivery to be idempotent: applying them twice with the same input produces the same result as applying them once.
Idempotency is a property of the operation, not the framework. Setting a record's value to a specific number is idempotent; incrementing a counter is not. Inserting a row keyed on observation_id with ON CONFLICT DO NOTHING is idempotent; a plain INSERT is not. A POST request with an Idempotency-Key header that the server respects is idempotent; the same POST without the header is not. The pipeline's operators must each be designed with the question "what does this operation do if I call it twice with the same input?" answered.
The natural idempotency key for SDA is the envelope's observation_id — a UUID generated at the source, carried through every stage, present on every observation. For derived events (a ConjunctionRisk produced by the correlator from two observations), the natural key is a hash of the inputs' observation_ids plus the window ID; deterministic, content-derived, identical across retries of the same input set. The tooling installed here will be reused in Module 5 when we discuss exactly-once delivery in depth.
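A derived key of that kind might be computed as follows. This is a sketch under stated assumptions: observation IDs are represented as u128 (the integer form of a UUID), the hash is a hand-rolled 64-bit FNV-1a chosen only because it is deterministic and needs no crate, and derived_key is a hypothetical name. A production key would more plausibly use a wider cryptographic hash.

```rust
/// FNV-1a 64-bit offset basis.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
/// FNV-1a 64-bit prime.
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Folds `bytes` into an FNV-1a state.
fn fnv1a(mut state: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        state ^= u64::from(b);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Key for a derived event: hash of the sorted input observation IDs
/// followed by the window ID.
fn derived_key(observation_ids: &[u128], window_id: u64) -> u64 {
    let mut ids = observation_ids.to_vec();
    ids.sort_unstable(); // arrival order of the inputs must not change the key
    let mut state = FNV_OFFSET;
    for id in ids {
        state = fnv1a(state, &id.to_be_bytes());
    }
    fnv1a(state, &window_id.to_be_bytes())
}
```

Sorting the inputs first is the design choice that matters: a retry that happens to see the same two observations in the opposite order still produces the same key, while a different window or a different input set produces a different one.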
Where to Carry the Key
The envelope carries the key. The downstream sink uses the key for dedup. The middle operators do not need the key for their own correctness (a stateless map operator is idempotent regardless), but they must propagate it to the downstream. The key must not change as the envelope passes through stages — a stage that "enriches" the observation by attaching a catalog entry must not regenerate observation_id; it must preserve it. This is the rule that makes the rest of the system composable: the producer-side at-least-once guarantee plus the sink-side dedup, with the same key visible at both ends, gives the pipeline effective exactly-once.
External system boundaries also need the key. An HTTP request to a downstream service includes the observation_id as the Idempotency-Key header; the downstream service uses it to dedup retries server-side. A Kafka producer with enable.idempotence=true (the underlying mechanism is producer ID + sequence number, but the conceptual model is the same) ensures the broker drops duplicate messages. A database write uses INSERT ... ON CONFLICT (observation_id) DO NOTHING. The pattern is the same in every case: the key crosses the boundary, the downstream uses it.
At-Least-Once + Idempotent = Effective Exactly-Once
This is the conceptual frame the rest of the track depends on. True exactly-once delivery — the network actually delivering each message exactly once — is impossible without either coordination (two-phase commit, transactional Kafka producers across topics) or strong assumptions about the network. Pragmatic exactly-once is achieved by combining two things that are individually achievable: at-least-once delivery at the transport layer (achievable with retries) and idempotent processing at the application layer (achievable with operation design). The two together produce a system in which every event is processed as if it were delivered exactly once, even though under the covers the transport layer may have delivered some events many times.
This frame is foundational for Module 5, where checkpointing, dead-letter queues, and exactly-once Kafka producers each get full treatment. It is mentioned here so that every retry decision in this lesson is made with that downstream landscape in mind: we are not trying to avoid duplicates; we are trying to make sure duplicates are safe.
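The composition can be seen in miniature by replaying a delivery log that contains duplicates into two sinks. The sketch below is an in-memory illustration; IdempotentSink and AppendSink are hypothetical types standing in for a keyed upsert and a plain insert, and u128 again stands in for the UUID observation_id.

```rust
use std::collections::HashMap;

/// Sink that applies each observation at most once, keyed on observation_id.
/// Analogous to INSERT ... ON CONFLICT (observation_id) DO NOTHING.
#[derive(Default)]
struct IdempotentSink {
    rows: HashMap<u128, f64>,
    duplicates_dropped: u64,
}

impl IdempotentSink {
    fn apply(&mut self, observation_id: u128, value: f64) {
        if self.rows.contains_key(&observation_id) {
            self.duplicates_dropped += 1; // redelivery: safe to ignore
        } else {
            self.rows.insert(observation_id, value);
        }
    }
}

/// Sink that appends blindly, like a plain INSERT. Not idempotent.
#[derive(Default)]
struct AppendSink {
    rows: Vec<(u128, f64)>,
}

impl AppendSink {
    fn apply(&mut self, observation_id: u128, value: f64) {
        self.rows.push((observation_id, value));
    }
}

/// Replays an at-least-once delivery log into both sinks and returns
/// (idempotent row count, duplicates dropped, append row count).
fn replay(deliveries: &[(u128, f64)]) -> (usize, u64, usize) {
    let mut idempotent = IdempotentSink::default();
    let mut append = AppendSink::default();
    for &(id, value) in deliveries {
        idempotent.apply(id, value);
        append.apply(id, value);
    }
    (idempotent.rows.len(), idempotent.duplicates_dropped, append.rows.len())
}
```

For a log of five deliveries covering three distinct observations, replay returns (3, 2, 5): the idempotent sink ends in the same state exactly-once delivery would have produced, while the append sink carries two duplicate rows.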
Code Examples
A Retry Wrapper with Decorrelated-Jitter Backoff
The wrapper takes a closure that performs the operation, plus a policy struct. It loops, dispatching on the operator's RetryDisposition. The backoff is computed per attempt with decorrelated jitter using the previous delay as the anchor.
use anyhow::Result;
use rand::Rng;
use std::time::Duration;
use tokio::time::sleep;
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
pub initial: Duration,
pub cap: Duration,
pub max_attempts: u32,
}
#[derive(Debug)]
pub enum RetryDisposition<T> {
/// Operation succeeded; return the value.
Ok(T),
/// Transient error; the wrapper should retry per the policy.
Retry(anyhow::Error),
/// Permanent error; the wrapper should not retry. Caller decides
/// whether to DLQ (Lesson 4) or propagate.
Permanent(anyhow::Error),
/// Discard with no retry, no DLQ. The event is invariant-violating
/// in a way that does not warrant operational attention.
Discard,
}
/// Retry the given operation per the policy. Returns Ok(Some(v)) on
/// success, Ok(None) on Discard, and Err on permanent failure or
/// attempt-budget exhaustion.
pub async fn with_retry<T, F, Fut>(policy: RetryPolicy, mut op: F) -> Result<Option<T>>
where
F: FnMut() -> Fut,
Fut: std::future::Future<Output = RetryDisposition<T>>,
{
let mut attempt: u32 = 0;
let mut prev_delay = policy.initial;
loop {
attempt += 1;
match op().await {
RetryDisposition::Ok(v) => return Ok(Some(v)),
RetryDisposition::Discard => return Ok(None),
RetryDisposition::Permanent(e) => {
return Err(e.context(format!("permanent failure on attempt {attempt}")));
}
RetryDisposition::Retry(e) if attempt >= policy.max_attempts => {
return Err(e.context(format!(
"exhausted retry budget after {attempt} attempts"
)));
}
RetryDisposition::Retry(_) => {
// Decorrelated jitter: random_uniform(initial, prev_delay * 3),
// capped at policy.cap. This produces a per-instance schedule
// that is uncorrelated with other instances retrying the same
// downstream — no thundering herd.
let upper_bound = (prev_delay.as_millis() as u64).saturating_mul(3);
let upper = upper_bound.max(policy.initial.as_millis() as u64);
let delay_ms = rand::thread_rng()
.gen_range(policy.initial.as_millis() as u64..=upper);
let delay = Duration::from_millis(delay_ms).min(policy.cap);
prev_delay = delay;
sleep(delay).await;
}
}
}
}
Two things are worth noting. First, the wrapper accepts a FnMut() -> Future rather than a single future, because each retry needs to be a fresh operation. A future is single-use: awaiting it consumes it, and a completed future cannot be awaited again. The closure's job is to construct a fresh future on each invocation. Second, the Discard arm returns Ok(None) rather than Err(_). This distinguishes the "this event was an invariant violation we chose to drop" case from the "this operation failed" case at the type level. The caller can dispatch on the Option and update a discards_total metric without using errors as control flow.
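To look at the backoff schedule in isolation, the sketch below reproduces the delay computation from the wrapper without the async machinery. It is an illustration under stated assumptions: XorShift64 is a tiny deterministic generator introduced here so the example needs no external crates, and the schedule function is hypothetical. The wrapper itself uses the rand crate.

```rust
use std::time::Duration;

/// Tiny deterministic PRNG (xorshift64), used only to keep this sketch
/// dependency-free and reproducible.
pub struct XorShift64(u64);

impl XorShift64 {
    /// The state must be nonzero, so a zero seed is bumped to 1.
    pub fn new(seed: u64) -> Self {
        Self(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform integer in the inclusive range [lo, hi].
    fn in_range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo + 1)
    }
}

/// Computes the first `n` delays of a decorrelated-jitter schedule:
/// each delay is drawn from [initial, prev * 3] and then capped.
pub fn schedule(
    initial: Duration,
    cap: Duration,
    n: usize,
    rng: &mut XorShift64,
) -> Vec<Duration> {
    let init_ms = initial.as_millis() as u64;
    let mut prev_ms = init_ms;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let upper = prev_ms.saturating_mul(3).max(init_ms);
        let delay = Duration::from_millis(rng.in_range(init_ms, upper)).min(cap);
        prev_ms = delay.as_millis() as u64;
        out.push(delay);
    }
    out
}
```

Two instances seeded differently produce different schedules while both stay inside [initial, cap], which is the property that keeps a fleet of retrying operators from synchronizing.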
Classifying Errors at the Boundary
The HTTP-source-side operator is the canonical place where transient and permanent errors arrive interleaved. The classification logic looks at the response status code and the error variant; it returns the right RetryDisposition for the wrapper to act on.
use reqwest::{Client, Error as ReqwestError, StatusCode};
async fn poll_optical_archive(
client: &Client,
endpoint: &str,
since: chrono::DateTime<chrono::Utc>,
) -> RetryDisposition<Vec<RawObservation>> {
let resp = match client.get(endpoint).query(&[("since", since.to_rfc3339())]).send().await {
Ok(r) => r,
Err(e) if e.is_timeout() || e.is_connect() => {
return RetryDisposition::Retry(e.into());
}
Err(e) => {
// Other reqwest errors (URL parse, header mismatch) are
// configuration bugs — permanent, not transient.
return RetryDisposition::Permanent(e.into());
}
};
match resp.status() {
s if s.is_success() => match resp.json::<Vec<RawObservation>>().await {
Ok(v) => RetryDisposition::Ok(v),
Err(e) => {
// Body is malformed JSON or has the wrong schema. Retrying
// does not help; the next response will be the same shape.
RetryDisposition::Permanent(e.into())
}
},
s if s.is_server_error() => {
// 500-class: transient. Retry.
RetryDisposition::Retry(anyhow::anyhow!("optical archive {s}"))
}
StatusCode::TOO_MANY_REQUESTS => {
// 429: explicitly transient, with the server asking us to slow.
// Real code would honor a Retry-After header here.
RetryDisposition::Retry(anyhow::anyhow!("optical archive 429"))
}
s if s.is_client_error() => {
// 400-class: permanent. The request is malformed or unauthorized.
// Retrying produces the same error.
RetryDisposition::Permanent(anyhow::anyhow!("optical archive {s}"))
}
s => RetryDisposition::Permanent(anyhow::anyhow!("optical archive unexpected {s}")),
}
}
The matching is exhaustive on the categories that matter — transient timeouts and connect failures, transient 5xx and 429, permanent 4xx, permanent everything-else. The Permanent for malformed JSON is a real consideration: if a partner's API change rolls out without coordination, the field they renamed produces a deserialization error on every request, and the pipeline starts pumping DLQ entries. That is the right behavior — alerting on DLQ growth is how we discover the partner's breaking change. Retrying the deserialization in a tight loop instead would mask the partner's bug and produce the same effect as a self-inflicted DDoS.
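The 429 arm notes that real code would honor a Retry-After header. One possible shape for that is sketched below. The function names are hypothetical, and the sketch handles only the delta-seconds form of the header; the HTTP-date form yields None so the caller falls back to its own backoff.

```rust
use std::time::Duration;

/// Parses the delta-seconds form of a Retry-After value (e.g. "120").
/// The HTTP-date form is deliberately not handled and yields None.
pub fn parse_retry_after(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

/// Chooses the delay before the next attempt: the server's hint when
/// one is present (never shorter than our own jittered backoff), and
/// in all cases no longer than the policy cap.
pub fn next_delay(backoff: Duration, retry_after: Option<&str>, cap: Duration) -> Duration {
    let hinted = retry_after.and_then(parse_retry_after);
    hinted.map_or(backoff, |h| h.max(backoff)).min(cap)
}
```

Capping the server's hint is a policy choice, not a requirement: a deployment that would rather wait the full hinted interval than risk another 429 could drop the final min(cap) and bound total wait time through max_attempts instead.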
Idempotent Sink-Side Write
The dedup sink keeps a sliding-window set of recently-seen observation IDs and writes downstream only on novel observations. This is the application-layer half of the at-least-once-plus-idempotent composition.
use anyhow::Result;
use std::collections::{BTreeSet, VecDeque};
use std::time::{Duration, SystemTime};
use tokio::sync::mpsc;
use uuid::Uuid;
/// A sink that drops duplicate observations within a rolling window.
/// The window's capacity bounds memory; observations seen outside the
/// window may be re-emitted (acceptable for the SDA pipeline's downstream
/// alert subscriber, which itself is idempotent on alert ID).
pub struct DedupSink {
seen: BTreeSet<Uuid>,
/// Insertion order, used for FIFO eviction.
order: VecDeque<(Uuid, SystemTime)>,
window: Duration,
capacity: usize,
downstream: mpsc::Sender<Observation>,
}
impl DedupSink {
pub fn new(window: Duration, capacity: usize, downstream: mpsc::Sender<Observation>) -> Self {
Self { seen: BTreeSet::new(), order: VecDeque::new(), window, capacity, downstream }
}
pub async fn write(&mut self, obs: Observation) -> Result<()> {
let now = SystemTime::now();
// Evict from the front: entries older than the window, and the
// oldest entries while the set is at capacity.
while let Some(&(id, ts)) = self.order.front() {
if now.duration_since(ts).unwrap_or_default() > self.window
|| self.order.len() >= self.capacity
{
self.order.pop_front();
self.seen.remove(&id);
} else {
break;
}
}
if self.seen.contains(&obs.observation_id) {
// Duplicate; drop silently. (Production: increment a metric.)
return Ok(());
}
self.seen.insert(obs.observation_id);
self.order.push_back((obs.observation_id, now));
self.downstream.send(obs).await
.map_err(|_| anyhow::anyhow!("dedup sink downstream dropped"))
}
}
The dedup window is bounded both by time and by count — the size cap is the safety valve in case the time-based eviction gets behind during a burst. A real production sink would also persist the seen set across restarts (via a small embedded store) so that a process restart does not produce a duplicate-observation surge while the seen set rebuilds; Module 5's checkpointing lesson supplies that machinery. Until then, a process restart causes a one-window worth of potential duplicates downstream — acceptable for SDA's alert subscriber, which has its own idempotency on alert ID, and worth flagging as a deliberate cost. The pattern at every layer is the same: choose a key, choose a bound, dedup against the bound.
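The eviction and cold-start behavior are easier to test when the clock is injected. The sketch below is a simplified variant under stated assumptions, not the production sink: DedupWindow is a hypothetical type, IDs are plain u64 values rather than Uuid, and the caller passes `now` explicitly. It also performs count-based eviction only after the duplicate check, so a duplicate arriving at capacity does not push out an older entry.

```rust
use std::collections::{HashSet, VecDeque};
use std::time::{Duration, Instant};

/// Sliding dedup window keyed on a hypothetical u64 observation ID.
pub struct DedupWindow {
    seen: HashSet<u64>,
    order: VecDeque<(u64, Instant)>,
    window: Duration,
    capacity: usize,
}

impl DedupWindow {
    pub fn new(window: Duration, capacity: usize) -> Self {
        Self { seen: HashSet::new(), order: VecDeque::new(), window, capacity }
    }

    /// Returns true if `id` is novel (and records it), false if it is a
    /// duplicate still inside the window.
    pub fn admit(&mut self, id: u64, now: Instant) -> bool {
        // Time-based eviction: drop entries older than the window.
        while let Some(&(old, ts)) = self.order.front() {
            if now.saturating_duration_since(ts) > self.window {
                self.order.pop_front();
                self.seen.remove(&old);
            } else {
                break;
            }
        }
        if self.seen.contains(&id) {
            return false;
        }
        // Count-based eviction, only when a novel ID needs room.
        while self.order.len() >= self.capacity {
            match self.order.pop_front() {
                Some((old, _)) => {
                    self.seen.remove(&old);
                }
                None => break,
            }
        }
        self.seen.insert(id);
        self.order.push_back((id, now));
        true
    }
}
```

A freshly constructed window admits every ID, including ones the previous process had already seen. That is the restart cost described above, and it is the gap a persisted seen set would close.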
Key Takeaways
- Errors are either transient (retry helps), permanent (retry makes it worse), or discardable (drop without operational attention). The classification is the operator's responsibility, not the framework's. Default unknown errors to permanent; that surfaces them rather than masking them.
- Exponential backoff with jitter is the right retry shape. Decorrelated jitter (random_uniform(initial, prev_delay * 3), capped) keeps a hundred operator instances from synchronizing their retries during a downstream outage. Naive fixed-delay retries amplify outages into self-inflicted DDoS events.
- Idempotency is a property of the operation, not of the framework. The pipeline composes at-least-once delivery (achievable with retries) plus idempotent operations (achievable by design) into effective exactly-once processing. Module 5 develops the tooling for this frame in full.
- The idempotency key is the envelope's observation_id for SDA, propagated unchanged through every stage. External boundaries (Kafka producers, HTTP requests, database writes) each have their own way of consuming the key (idempotent producer, Idempotency-Key header, ON CONFLICT DO NOTHING); the pattern is the same at every boundary.
- The dedup sink is the canonical exactly-once-effective endpoint: a sliding-window set keyed on observation_id, bounded by both time and count, with documented behavior on cold start and known cost on duplicate-burst conditions.
Lesson 4 — Failure Modes
Module: Data Pipelines — M02: Pipeline Orchestration Internals Position: Lesson 4 of 4 Source: Async Rust — Maxwell Flitton & Caroline Morton, sections on panics in async tasks and structured shutdown; Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 3 ("Plan for Failure," "Build Loosely Coupled Systems")
Source note: The supervisor pattern as presented here draws on the Erlang/OTP design that influenced subsequent supervised-task systems and is well-documented across many sources beyond the cited texts. The Erlang one-for-one and rest-for-one strategies are not directly cited from a single source; they are general knowledge in the field and adapted here to the SDA orchestrator.
Context
Lesson 1 established the task; Lesson 2 wired tasks into a graph; Lesson 3 made each task's network calls survive transient failures. The pipeline now tolerates the failures that retry can address. This lesson handles the three classes of failure that retry cannot. Panics, where an operator hits a panic! or an unwrap() on None and the task is torn down by the runtime. Cascading slowdowns, where one operator's degradation propagates through the topology in ways that look like other operators failing. Resource exhaustion in shared pools, where one misbehaving operator starves the rest of the pipeline of file descriptors, connections, or thread time.
The mission framing is concrete. Two months ago, a release rolled out with a new validate operator that called unwrap() on the lookup result of a thread-local catalog cache. Some keys had been evicted from the cache during a startup race; the unwrap panicked. The pipeline did not crash; the panicking task was simply removed from the runtime, its channels orphaned, its upstream blocked on a full channel and its downstream starved of input. For the next four minutes, conjunction alerts were emitted, but they were emitted on the post-validate stream from observations that had skipped the validate stage. Two of those alerts later turned out to be false positives that should have been filtered. The postmortem traced the problem to a missing supervisor — there was nothing watching the validate task and restarting it.
The discipline this lesson installs is explicit failure-mode management. The supervisor pattern detects an operator's exit, classifies it, and applies a policy (restart, escalate, or ignore). Bulkheading separates resources so one operator's failure cannot starve the others. Circuit breakers provide local fail-fast behavior for repeatedly-failing downstreams. Together, these patterns turn the pipeline from "runs until something panics" into "runs through panics with documented recovery semantics." Reis & Housley's framing is direct: planning for failure is what distinguishes a system that operates from a system that runs.
Core Concepts
Panic vs Error in Tokio Tasks
A future returns errors via Result. A future panics when its execution hits a runtime panic — an unwrap on a None, an out-of-bounds index, a division by zero, an explicit panic! macro. Tokio captures the panic at the task boundary and surfaces it through the task's JoinHandle: awaiting the handle returns Err(JoinError) where is_panic() is true. The runtime continues running; only the panicked task is torn down. The application's view of the panic depends entirely on whether anyone is awaiting the handle. If the handle was detached (dropped), the panic is silent — logged at most, and the application has no recovery hook.
The corollary is structural: the orchestrator must own every operator's JoinHandle and join on it. Detached operators that panic disappear with no signal beyond a log line. The Task wrapper from Lesson 1 keeps the handle owned; the supervisor in this lesson awaits each handle through JoinSet::join_next and dispatches on the four-case TaskExit enum (Ok, Errored, Panicked, Aborted). The four cases call for different actions. A panic is a programming bug that should be alerted on, not silently retried — the same panic at the same code path will recur on restart, and an unbounded restart loop on a panic burns CPU without making progress. An error is a runtime condition (a Permanent retry-disposition that propagated past the wrapper) that may or may not warrant restart depending on policy. An abort is a deliberate shutdown signal from the orchestrator. An ok-exit means the operator's input ended cleanly, which is normal at end-of-stream.
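The same join-and-classify structure exists for standard-library threads, which makes it easy to demonstrate without a runtime. The sketch below is an analogue under stated assumptions, not the Lesson 1 code: the Exit enum is a hypothetical stand-in for TaskExit (without the Aborted case, since std threads cannot be aborted), and std::thread's JoinHandle stands in for Tokio's. The payload handling carries over, because Tokio's JoinError::into_panic returns the same kind of boxed payload.

```rust
use std::any::Any;
use std::thread;

/// Hypothetical stand-in for Lesson 1's TaskExit.
#[derive(Debug, PartialEq)]
pub enum Exit {
    Ok,
    Errored(String),
    Panicked(String),
}

/// Extracts a readable message from a panic payload. Literal panics
/// carry a &'static str; formatted panics carry a String.
fn panic_message(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

/// Runs `op` on its own thread and classifies how it ended. The handle
/// is always joined, so a panic cannot disappear silently.
pub fn run_and_classify<F>(op: F) -> Exit
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    match thread::spawn(op).join() {
        Ok(Ok(())) => Exit::Ok,
        Ok(Err(e)) => Exit::Errored(e),
        Err(payload) => Exit::Panicked(panic_message(payload)),
    }
}
```

The point of the sketch is the shape of the match: a returned Err and a panic arrive through different layers of the join result, and only an owner that actually joins the handle sees either of them.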
The default Tokio behavior on a panic in a task is to contain it: the runtime captures the panic, marks the handle as failed, and continues. Tokio does have an option to shut the runtime down on any unhandled task panic (Builder::unhandled_panic with UnhandledPanic::ShutdownRuntime), but at the time of writing it is gated behind the tokio_unstable cfg flag and supported only on the current-thread runtime; we do not enable it. A pipeline that crashes on any operator panic is more brittle than one that supervises them. The supervisor decides the response; the runtime's job is to surface the panic, not to act on it.
The Supervisor Pattern
A supervisor is a parent task that owns the JoinHandles of its children, watches for any of them to exit, and applies a policy on exit. The pattern is the same one that made Erlang's OTP framework famous: structured concurrency with explicit failure dispatch. The SDA orchestrator's supervisor is a loop over JoinSet::join_next with a match on the resulting TaskExit.
The policy decides the response. Restart spawns a new instance of the operator (using its registered factory closure), inserting the new Task into the JoinSet. Escalate propagates the failure upward — in our single-supervisor design, this means tearing down the entire pipeline. Ignore lets the operator stay dead and the pipeline continues with the missing stage; rarely the right answer for SDA but useful for purely-observational sidecars (a metrics exporter, a debug logger). The policy is per-operator and registered when the operator is added to the graph.
A naive "always restart" policy is dangerous because it does not bound the restart rate. An operator that panics on its first input panics again on the same input after restart, and again, and again — a tight restart loop that burns CPU and produces a flood of metrics noise without ever making progress. The right shape is a bounded restart budget: at most N restarts within a time window W. Past the budget, the supervisor escalates. Erlang's documented rule of thumb is "5 restarts in 60 seconds before escalation"; the SDA orchestrator uses the same default, configurable per-operator. The policy is in RestartPolicy::Bounded { max_restarts, window } from Lesson 1's Task wrapper.
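The bounded budget is small enough to sketch on its own. The following is a minimal illustration, assuming a hypothetical RestartBudget type that is separate from Lesson 1's RestartPolicy; the clock is injected so the behavior can be checked deterministically.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window restart budget: at most `max_restarts` within `window`.
pub struct RestartBudget {
    max_restarts: usize,
    window: Duration,
    history: VecDeque<Instant>,
}

impl RestartBudget {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, history: VecDeque::new() }
    }

    /// Records a restart at `now` and returns true if the budget allows
    /// it. Returns false, recording nothing, when the budget is spent
    /// and the supervisor should escalate.
    pub fn try_restart(&mut self, now: Instant) -> bool {
        // Forget restarts that have aged out of the window.
        while let Some(&oldest) = self.history.front() {
            if now.saturating_duration_since(oldest) > self.window {
                self.history.pop_front();
            } else {
                break;
            }
        }
        if self.history.len() >= self.max_restarts {
            return false;
        }
        self.history.push_back(now);
        true
    }
}
```

With the "5 in 60 seconds" default, five restarts in quick succession are allowed, the sixth is refused, and the budget recovers one slot at a time as old restarts age out of the window.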
Bulkheading
A bulkhead is a physical partition that prevents flooding from spreading between compartments of a ship; in software, it is a separation of resources so that a failure or resource exhaustion in one part of the system cannot affect other parts. Three levels are useful for the SDA orchestrator.
Channel-level bulkheading is what we already have: every operator pair is connected by its own bounded channel, so a slow downstream applies backpressure only to its direct upstream, not to the entire pipeline. A misbehaving correlator does not block the audit sink; the audit sink reads from a different channel that is unaffected.
Runtime-level bulkheading separates the Tokio runtime workers used by different sets of operators. The default Tokio runtime has a shared worker pool; a CPU-bound operator (placed on spawn_blocking, not spawn) holds a blocking-pool thread while it runs. The blocking pool's default size of 512 is sized for the assumption that blocking work is incidental, not the steady-state workload. For SDA, the orbital propagator runs a steady stream of CPU-bound calls; if it shares the blocking pool with file I/O on the audit sink, the propagator can starve the audit sink of pool slots during a fragmentation-event surge. The fix is to give the propagator its own runtime (a dedicated tokio::runtime::Runtime with a separate blocking pool) and have the propagator operator submit to that runtime instead. This is more complex than channel-level bulkheading and reserved for operators where the resource isolation is needed.
Process-level bulkheading is out of scope for this track but worth naming. Production deployments of the SDA pipeline run different stages in different processes (or even different hosts) so that an OOM in the correlator does not kill the radar source. The separation is bought via Kafka topics between stages, which Module 5 covers in depth.
Cascading Failures
A cascading failure is when one operator's degradation produces failure-like symptoms in unrelated operators that are not themselves degrading. Three common shapes.
Backpressure-induced timeouts: a slow correlator's full channel suspends the upstream normalize, which in turn suspends its upstream radar source's send. The normalize and the source are not failing, but their per-event latency goes up. If a downstream operator (an alert emitter that consumes from after the correlator) has a per-event deadline, the added latency can push events past it; the emitter times out, and the timeout looks like the alert emitter is broken even though the cause is the correlator's slowness. The correct response is not to retry the alert emitter (which makes the situation worse by adding more requests to a slow downstream) but to address the correlator.
Resource-exhaustion cascades: an operator with a leaked file descriptor hits the process's FD limit. Subsequent operators that try to open files (the audit sink writing to disk; the optical poller opening TCP connections) fail with EMFILE. The fault propagates with no obvious connection to its source.
Retry-storm cascades: an operator's retry policy (Lesson 3) is misconfigured with no jitter. A downstream blip triggers synchronized retries across a fleet of operator instances, which prevent the downstream from recovering, which extends the blip into an outage, which extends the retry storm. The original blip was 200 ms; the cascade is hours.
The architectural response to cascades is to address the cause, not the symptom. Backpressure-induced timeouts are diagnosed by tracing per-stage occupancy and processing latency together; a full channel upstream of the timing-out operator is the smoking gun. Resource exhaustion is diagnosed by tracking FD counts, connection counts, blocking-pool slot counts as Prometheus gauges. Retry storms are prevented by jitter (already done in Lesson 3) and detected by retry-rate metrics. Module 6 builds this observability tooling out fully.
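As a rough illustration of the occupancy-based diagnosis, the sketch below picks the first operator to inspect from a snapshot of channel occupancies in a linear pipeline. Everything in it is an assumption made for the example: the EdgeSnapshot type, the 90% saturation threshold, and the operator names. It is a heuristic for where to start looking, not a root-cause proof.

```rust
/// Snapshot of one edge in a linear pipeline. `consumer` is the
/// operator that reads from the channel. All names are illustrative.
#[derive(Debug, Clone)]
pub struct EdgeSnapshot {
    pub consumer: &'static str,
    pub occupancy: usize,
    pub capacity: usize,
}

impl EdgeSnapshot {
    /// Treats a channel as saturated at or above 90% occupancy
    /// (an arbitrary threshold chosen for this sketch).
    fn saturated(&self) -> bool {
        self.capacity > 0 && self.occupancy * 10 >= self.capacity * 9
    }
}

/// Given edges in upstream-to-downstream order, returns the consumer of
/// the most-downstream saturated channel. Operators upstream of it are
/// likely just backed up behind it.
pub fn backpressure_suspect(edges: &[EdgeSnapshot]) -> Option<&'static str> {
    edges.iter().rev().find(|e| e.saturated()).map(|e| e.consumer)
}
```

In the slow-correlator scenario, the channels into normalize and the correlator are full while the channel into the alert emitter is nearly empty, so the function points at the correlator and not at the emitter that is timing out.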
Circuit Breakers
A circuit breaker is a local pattern for protecting a downstream from a fleet of operator instances all hammering it during a failure. The breaker has three states. Closed: the breaker passes calls through normally and tracks the failure rate. Open: the breaker has observed too many failures recently and rejects calls immediately without making the downstream call at all. Half-open: after a cooldown, the breaker lets a single call through to test if the downstream has recovered; on success, it closes; on failure, it opens again.
The breaker complements the retry wrapper from Lesson 3. Retry handles individual call failures; the breaker handles patterns of failure. Together they produce a system that handles brief blips with a few retries, sustained outages with the breaker opening to spare the downstream from amplification, and recovery with the breaker probing to detect the downstream's return without flooding it.
The breaker's tuning is operational. The trip threshold (failures-per-window before opening) is set high enough that brief blips do not trip but low enough that genuine outages do. The cooldown (time spent open before going to half-open) is set to the longest expected recovery time; too short causes thrash, too long delays recovery detection. The half-open success criterion is usually a single call; some implementations use a small batch with a quorum requirement. The SDA orchestrator's default is a 50% failure rate over a 30-second window to trip, a 30-second cooldown, and a single-call probe, sized to match the downstream's typical recovery characteristics and adjustable per-downstream.
Code Examples
A Supervisor with Bounded Restart Budget
The supervisor loops over JoinSet::join_next, dispatching on the TaskExit enum from Lesson 1, applying the per-operator restart policy, and escalating when the budget is exhausted.
use anyhow::Result;
use std::collections::HashMap;
use std::time::Instant;
use tokio::task::JoinSet;
pub struct Supervisor {
/// In-flight operator tasks by name. JoinSet drives the watch loop.
set: JoinSet<(String, TaskExit)>,
/// Per-operator restart history, used for budget enforcement.
/// Each entry is the timestamps of recent restarts.
restart_history: HashMap<String, Vec<Instant>>,
/// Per-operator factories so the supervisor can respawn.
factories: HashMap<String, OperatorFactory>,
policies: HashMap<String, RestartPolicy>,
}
pub type OperatorFactory = Box<dyn FnMut() -> OperatorFuture + Send>;
#[derive(Debug)]
pub enum SupervisorEvent {
/// An operator panicked; not retried. Pipeline should be torn down.
Panicked { name: String, message: String },
/// An operator exhausted its restart budget; pipeline should be torn down.
BudgetExhausted { name: String },
/// All operators exited cleanly; pipeline shut down normally.
AllOk,
}
impl Supervisor {
/// Run the supervisor loop. Returns when all operators have exited
/// or when one fails in a non-recoverable way.
pub async fn run(&mut self) -> SupervisorEvent {
loop {
match self.set.join_next().await {
None => return SupervisorEvent::AllOk,
Some(Ok((name, TaskExit::Ok))) => {
// End-of-stream from one operator. Other operators may
// still be running; let them finish naturally.
tracing::info!(operator = %name, "operator exited cleanly");
}
Some(Ok((name, TaskExit::Panicked(msg)))) => {
// Panic is a programming bug. Do not restart; escalate.
return SupervisorEvent::Panicked { name, message: msg };
}
Some(Ok((name, TaskExit::Errored(e)))) => {
// An operator returned Err. Apply its restart policy.
let policy = self.policies.get(&name).copied()
.unwrap_or(RestartPolicy::Never);
if !self.try_restart(&name, policy) {
return SupervisorEvent::BudgetExhausted { name };
}
}
Some(Ok((name, TaskExit::Aborted))) => {
// Aborted by the orchestrator. Not a failure.
tracing::info!(operator = %name, "operator aborted by orchestrator");
}
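Some(Err(join_err)) if join_err.is_panic() => {
// The task panicked without a wrapper converting the panic to
// TaskExit::Panicked (tasks spawned by respawn take this path).
// The operator name is not recoverable from the JoinError, so
// escalate under a placeholder name rather than continue.
return SupervisorEvent::Panicked {
name: "<unknown>".to_string(),
message: join_err.to_string(),
};
}
Some(Err(join_err)) => {
// Cancelled outside the orchestrator's abort path; abnormal.
tracing::error!(?join_err, "join_next produced unexpected JoinError");
}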
}
}
}
fn try_restart(&mut self, name: &str, policy: RestartPolicy) -> bool {
match policy {
RestartPolicy::Never => false,
RestartPolicy::Always => {
self.respawn(name);
true
}
RestartPolicy::Bounded { max_restarts, window } => {
let history = self.restart_history.entry(name.to_string()).or_default();
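// Compare ages instead of computing `now - window`, which can
// panic if the window reaches back before the clock's origin.
let now = Instant::now();
history.retain(|ts| now.duration_since(*ts) <= window);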
if history.len() >= max_restarts as usize {
tracing::error!(operator = %name, "restart budget exhausted");
return false;
}
history.push(Instant::now());
self.respawn(name);
true
}
}
}
fn respawn(&mut self, name: &str) {
if let Some(factory) = self.factories.get_mut(name) {
let future = factory();
let name_owned = name.to_string();
self.set.spawn(async move {
let result = future.await;
let exit = match result {
Ok(()) => TaskExit::Ok,
Err(e) => TaskExit::Errored(e),
};
(name_owned, exit)
});
tracing::info!(operator = %name, "operator restarted");
}
}
}
The supervisor is the type that the rest of the orchestrator hangs on. Its key property is structural: every spawned task's exit funnels through join_next, so every panic, error, abort, and clean exit is observed and dispatched on. There are no detached tasks in the system the supervisor knows about; if one is added by accident (a tokio::spawn somewhere outside the supervisor's view), it is a structural bug. The restart budget enforcement is straightforward: keep a sliding window of restart timestamps per operator, evict expired entries on each restart attempt, escalate when the budget is exhausted. The escalation just exits the supervisor's run loop with BudgetExhausted; the caller is the orchestrator's top-level entrypoint, which decides whether to abort the rest of the pipeline or whether to retry the supervisor itself with a longer cooldown — usually the former.
A Circuit Breaker for the Optical Archive
The breaker wraps calls to the flaky downstream. It tracks recent failures and trips when the failure rate exceeds a threshold; while open, calls return Err immediately without making the downstream call. After the cooldown, it lets a single probe through.
use std::sync::Mutex;
use std::time::{Duration, Instant};
#[derive(Debug, Clone, Copy)]
enum BreakerState {
Closed,
Open { opened_at: Instant },
HalfOpen,
}
pub struct CircuitBreaker {
state: Mutex<BreakerState>,
/// Window of (timestamp, was_failure) tuples for failure-rate calc.
history: Mutex<Vec<(Instant, bool)>>,
threshold: f32, // e.g., 0.5 = trip on 50% failures
window: Duration,
cooldown: Duration,
}
impl CircuitBreaker {
pub fn new(threshold: f32, window: Duration, cooldown: Duration) -> Self {
Self {
state: Mutex::new(BreakerState::Closed),
history: Mutex::new(Vec::new()),
threshold,
window,
cooldown,
}
}
/// Returns true if the call should be allowed through. False means
/// the breaker is open and the caller should fail-fast without
/// touching the downstream.
pub fn allow(&self) -> bool {
let mut state = self.state.lock().unwrap();
match *state {
BreakerState::Closed => true,
BreakerState::HalfOpen => {
// Already testing; do not allow concurrent probes.
false
}
BreakerState::Open { opened_at } => {
if opened_at.elapsed() >= self.cooldown {
*state = BreakerState::HalfOpen;
true
} else {
false
}
}
}
}
pub fn record_outcome(&self, was_failure: bool) {
let now = Instant::now();
let mut history = self.history.lock().unwrap();
history.push((now, was_failure));
// Drop entries outside the window. Comparing ages avoids the
// `now - window` subtraction, which can panic on underflow.
history.retain(|(ts, _)| now.duration_since(*ts) <= self.window);
let total = history.len();
let failures = history.iter().filter(|(_, f)| *f).count();
let rate = failures as f32 / total.max(1) as f32;
let mut state = self.state.lock().unwrap();
match *state {
BreakerState::HalfOpen => {
if was_failure {
*state = BreakerState::Open { opened_at: now };
} else {
*state = BreakerState::Closed;
history.clear();
}
}
BreakerState::Closed => {
if total >= 5 && rate >= self.threshold {
*state = BreakerState::Open { opened_at: now };
}
}
BreakerState::Open { .. } => {
// Outcome from a stale in-flight call; ignore.
}
}
}
}
The breaker integrates with the retry wrapper from Lesson 3 by gating the retry attempt: if breaker.allow() returns false, the retry attempt is short-circuited and the wrapper continues to the next backoff. The combination produces the layered response the lesson promises: individual transient failures are retried; sustained failure patterns trip the breaker, sparing the downstream from a fleet of synchronized retries; recovery is detected by the half-open probe without flooding the downstream. The minimum-call threshold (total >= 5) prevents the breaker from tripping on a single early failure during operator startup, when the failure rate is mathematically high but the sample is too small to be meaningful.
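The gating described above can be sketched as a small helper. This is an illustration under stated assumptions, not the orchestrator's API: the Gate trait, the Attempt enum, and the guarded function are hypothetical names, and TripAfter is a test double that stands in for the real breaker.

```rust
use std::cell::Cell;

/// Minimal interface the sketch needs from a breaker. The two methods
/// mirror the shapes of allow and record_outcome on CircuitBreaker.
pub trait Gate {
    fn allow(&self) -> bool;
    fn record_outcome(&self, was_failure: bool);
}

/// Result of one gated attempt.
#[derive(Debug, PartialEq)]
pub enum Attempt<T, E> {
    /// The breaker was open; the downstream was not called.
    Rejected,
    /// The call ran and its outcome was reported to the breaker.
    Ran(Result<T, E>),
}

/// Runs `op` only if the gate allows it, and always reports the outcome
/// of an allowed call, so a half-open probe cannot go unrecorded.
pub fn guarded<T, E>(gate: &impl Gate, op: impl FnOnce() -> Result<T, E>) -> Attempt<T, E> {
    if !gate.allow() {
        return Attempt::Rejected;
    }
    let result = op();
    gate.record_outcome(result.is_err());
    Attempt::Ran(result)
}

/// Test double: opens permanently after `limit` recorded failures.
pub struct TripAfter {
    limit: u32,
    failures: Cell<u32>,
}

impl TripAfter {
    pub fn new(limit: u32) -> Self {
        Self { limit, failures: Cell::new(0) }
    }
}

impl Gate for TripAfter {
    fn allow(&self) -> bool {
        self.failures.get() < self.limit
    }
    fn record_outcome(&self, was_failure: bool) {
        if was_failure {
            self.failures.set(self.failures.get() + 1);
        }
    }
}
```

Inside a retry closure, one reasonable mapping is to turn Attempt::Rejected into RetryDisposition::Retry so the wrapper backs off without touching the downstream. Pairing every allowed call with a recorded outcome matters most in the half-open state, where a probe that never reports back would leave the breaker waiting indefinitely.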
Bulkheading the CPU-Bound Propagator
A dedicated runtime for the propagator isolates it from the main async runtime's blocking pool. The propagator submits to its own pool; the rest of the pipeline submits to the default. A starvation in one does not affect the other.
use anyhow::Result;
use std::sync::Arc;
use tokio::runtime::Runtime;
pub struct PropagatorPool {
runtime: Arc<Runtime>,
}
impl PropagatorPool {
/// Build a dedicated runtime for CPU-bound propagation. The pool size
/// is documented per-deployment based on expected propagation rate.
pub fn new(blocking_threads: usize) -> Self {
let runtime = tokio::runtime::Builder::new_multi_thread()
.worker_threads(2) // small async pool for handle wiring
.max_blocking_threads(blocking_threads)
.thread_name("propagator")
.enable_all()
.build()
.expect("propagator runtime build");
Self { runtime: Arc::new(runtime) }
}
/// Submit propagation work. The closure runs on the propagator's
/// blocking pool, completely isolated from the main runtime's pool.
pub async fn propagate(&self, obs: Observation) -> Result<PropagatedObservation> {
let runtime = self.runtime.clone();
let handle = runtime.spawn_blocking(move || {
orbital::propagate(obs.target, obs.sensor_timestamp)
});
let propagated = handle.await??;
Ok(PropagatedObservation { obs, propagated })
}
}
The cost is that the propagator now lives behind an Arc<Runtime> rather than directly in the main runtime, and the pipeline graph has to pass the PropagatorPool to operators that need it. The benefit is operational isolation: a propagator surge that blocks every blocking thread it has does not touch the audit sink's blocking-pool needs. Channel-level bulkheading would not have addressed this; the audit sink and propagator share a different resource (blocking pool slots) than the channels that connect them, and the channel between them is irrelevant to the starvation. This is the case the lesson called out for runtime-level bulkheading. One operational caveat: Tokio panics if a Runtime is dropped from within an asynchronous context, so the last reference to the PropagatorPool should be released from synchronous code at shutdown, or the runtime should be stopped with Runtime::shutdown_background.
Key Takeaways
- Panics surface through JoinHandle as Err(JoinError) with is_panic() true. The orchestrator must own every operator's handle; detached tasks that panic disappear silently. The Task wrapper from Lesson 1 plus JoinSet::join_next is the structural mechanism.
- The supervisor pattern dispatches on TaskExit. Restart on errors per the operator's RestartPolicy; escalate on panics (do not retry programming bugs); ignore aborts (deliberate shutdown). Always use a bounded restart budget (Erlang's "5 in 60 seconds" is a sensible default) and escalate on budget exhaustion.
- Channel-level bulkheading is already provided by per-operator bounded channels. Runtime-level bulkheading separates blocking pools for resource-isolated operators (the propagator with its own Runtime). Process-level bulkheading is for Module 5 and beyond.
- Cascading failures are diagnosed by recognizing that the failing-looking operator is downstream of the actual cause. Address the cause, not the symptom: retrying a timed-out alert emitter does not fix a slow correlator. Module 6's observability tooling makes the cause visible.
- Circuit breakers complement retries: retries handle individual failures, breakers handle failure patterns. Closed → Open on threshold breach → Half-Open after cooldown → probe → Closed or back to Open. The combination prevents synchronized-retry amplification of downstream outages.
Capstone Project — Fusion Pipeline Orchestrator
Module: Data Pipelines — M02: Pipeline Orchestration Internals Estimated effort: 1–2 weeks of focused work Prerequisites: All four lessons in this module passed at ≥70%
Mission Brief
OPS DIRECTIVE — SDA-2026-0119 / Phase 2 Implementation Classification: ORCHESTRATION TIER STAND-UP
The Phase 1 ingestion service from Module 1 (sda-ingest) is in production and stable, but the next quarter's roadmap adds five new sensor sources, a windowed dedup stage, a cross-sensor correlator, an alert emitter, and an audit sink. The current hand-spawned topology in main.rs is at the practical limit of what one engineer can hold in their head. The Phase 1 postmortem also flagged two operational gaps: a panicking task is silently torn down with no recovery hook, and the retry policy on the optical poller is unjittered fixed-delay (the 90-minute outage extension last quarter is the canonical incident).
Phase 2 builds the orchestrator that addresses both gaps. The deliverable is a Rust library that accepts a declarative DAG of operators, spawns them with their channels correctly wired, supervises their lifecycles, and applies retry policy with jitter. The Phase 1 binary is refactored to use the orchestrator with no behavioral regression beyond the documented failure-handling improvements.
Success criteria for Phase 2: the orchestrator handles every Phase 1 source plus a synthetic 5-source future-load profile; a panic in any operator surfaces to operations rather than disappearing; transient downstream failures recover via backoff without thundering-herd behavior. Failure-isolated subsystems (the new orbital propagator) run on a dedicated runtime to avoid blocking-pool contention with the rest of the pipeline.
What You're Building
A Rust library crate, sda-orchestrator, that exposes:
- An OperatorGraph builder with add_source, add_operator, add_sink, and connect methods (Lesson 2)
- A BuiltGraph::run(supervisor_policy) -> Future that spawns the topology with the supervisor (Lesson 4) wired in
- The Task wrapper (Lesson 1), RetryPolicy and with_retry (Lesson 3), and CircuitBreaker (Lesson 4) types as public API
- Cycle detection and per-role edge validation in OperatorGraph::build() with named-operator error messages
- Bounded-restart-budget supervision with structured logging on every supervisor decision
Plus the refactor of the Phase 1 sda-ingest binary:
- The Module 1 binary is refactored to declare its topology as an OperatorGraph rather than hand-spawn it; behavior is preserved end-to-end
- The optical-archive HTTP poller wraps its requests in with_retry using decorrelated-jitter backoff
- The (new for this module) orbital propagator runs on a dedicated tokio::runtime::Runtime for blocking-pool isolation
The deliverable is the library, the refactored binary, the test suite (including a deterministic supervisor-restart test using tokio::time::pause), and a 1-page operational README documenting the orchestrator's API and its failure-handling guarantees.
Suggested Architecture
                   OperatorGraph (declarative)
                                │
                                │ build()
                                ▼
┌───────────────────────────────────────────────────────────────┐
│  BuiltGraph: per-edge channels allocated, operators spawned   │
│  in topological order with their wired ends.                  │
└───────────────────────────────┬───────────────────────────────┘
                                │ run(policy)
                                ▼
                      ┌───────────────────┐
                      │    Supervisor     │
                      │  (JoinSet loop)   │
                      └─────────┬─────────┘
                                │
   ┌────────────────────────────┴────────────────────────────┐
   │                                                         │
   ▼                                                         ▼
┌───────┐ ┌────────────┐  ┌──────────┐ ┌──────────────┐ ┌──────┐
│ radar │→│ normalize  │→ │  dedup   │→│  correlator  │→│ sink │
│  src  │ └────────────┘  └──────────┘ │ (Propagator- │ │      │
└───────┘     ↑   ↑                    │ Pool runtime)│ └──────┘
┌───────┐     │   │                    └──────────────┘
│optical│─────┘   │
│  src  │         │   Each edge: bounded mpsc::channel
└───────┘         │   Each operator: Task wrapped, supervised
┌───────┐         │   Each network call: with_retry + breaker
│  ISL  │─────────┘
│  src  │
└───────┘
The orchestrator does not know about Observation specifically; it operates on type-erased operator factories that consume and produce arbitrary types. The library is generic in the sense that the application (the SDA binary) wires up its own operator types and topology. Resist the temptation to bake SDA-specific assumptions into the orchestrator crate; it is meant to be reusable across pipelines.
Acceptance Criteria
Functional Requirements
- OperatorGraph exposes new, add_source, add_operator, add_sink, connect, build matching the Lesson 2 signatures
- OperatorGraph::build() runs all four passes (per-role validation, Kahn's topo sort with cycle detection, channel allocation, spawn) and produces actionable error messages on validation failures
- BuiltGraph::run(policy) drives the supervisor loop and returns a SupervisorEvent that distinguishes clean shutdown from panic from budget exhaustion
- Task::join() collapses Tokio's two-level error reporting into a TaskExit enum (Ok, Errored, Panicked, Aborted)
- RestartPolicy::{Never, Always, Bounded { max_restarts, window }} is honored by the supervisor
- with_retry(policy, op) retries RetryDisposition::Retry results with decorrelated-jitter backoff, propagates Permanent immediately, and discards Discard cleanly
- CircuitBreaker implements Closed → Open → HalfOpen transitions with the threshold, window, and cooldown the lesson described; integration with with_retry is documented in the API
- The sda-ingest binary is refactored to use OperatorGraph declaratively; the topology fits in a single build_topology() function under 80 lines
Quality Requirements
- DAG cycle test: a unit test attempts to build a graph with a cycle and asserts the error message names the cycle's vertices
- Disconnection test: a unit test attempts to build a graph with an unconnected operator and asserts the per-role validation error names the operator and the missing direction
- Supervisor restart test: a unit test injects an error from a fake operator, asserts the supervisor restarts it within the budget, then asserts budget exhaustion when the error rate exceeds the policy. Use tokio::time::pause() and advance for deterministic timing; no flaky sleep in tests.
- Supervisor panic test: a unit test injects a panic from a fake operator and asserts the supervisor returns SupervisorEvent::Panicked without restart attempts
- Decorrelated-jitter math test: a unit test fixes the RNG and asserts the per-attempt delays match the documented schedule
- No .unwrap() or .expect() in non-startup code paths
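The decorrelated-jitter math test is easiest to write when the backoff step is a pure function of its random input. The sketch below assumes the widely used formulation next = min(cap, uniform(base, prev * 3)); the names next_delay and schedule and the unit parameter are illustrative and are not the Lesson 3 API, whose constants may differ.

```rust
use std::time::Duration;

/// One step of decorrelated-jitter backoff:
/// next = min(cap, uniform(base, prev * 3)).
///
/// `unit` is a sample in [0, 1] supplied by the caller, so a test can feed a
/// fixed sequence instead of seeding a real RNG. (Hypothetical signature.)
pub fn next_delay(prev: Duration, base: Duration, cap: Duration, unit: f64) -> Duration {
    let lo = base.as_secs_f64();
    // Never let the upper bound fall below the lower bound.
    let hi = (prev.as_secs_f64() * 3.0).max(lo);
    let sampled = lo + (hi - lo) * unit.clamp(0.0, 1.0);
    Duration::from_secs_f64(sampled).min(cap)
}

/// The delays for a whole retry sequence, starting from `prev = base`.
pub fn schedule(base: Duration, cap: Duration, units: &[f64]) -> Vec<Duration> {
    let mut prev = base;
    units
        .iter()
        .map(|&u| {
            prev = next_delay(prev, base, cap, u);
            prev
        })
        .collect()
}
```

In production the unit sample would come from whatever RNG the retry policy owns; the test passes a fixed slice, which keeps the expected schedule a matter of arithmetic rather than seed stability.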
Operational Requirements
- HTTP control plane (extending Module 1's): adds GET /metrics fields for operator_restart_total{operator}, operator_uptime_seconds{operator}, circuit_breaker_state{breaker} (encoded as 0/1/2 for Closed/Open/HalfOpen), and retry_attempts_total{operator}
- Structured log line on every supervisor decision: spawn, restart, budget-exhausted, escalate, clean-exit. JSON formatter, one event per decision (not one per observation).
- The operational README updated for Phase 2: documents the orchestrator API, the new metrics, and the new failure-handling semantics. One-page constraint preserved.
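The circuit_breaker_state gauge presumes a three-state breaker. The following is a minimal sketch of that state machine and its 0/1/2 encoding, under simplifying assumptions: it trips on consecutive failures rather than on a failure count inside a window, and it does not limit Half-Open to a single in-flight probe. The type and method names are hypothetical, not the Lesson 4 API.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BreakerState {
    Closed,
    Open,
    HalfOpen,
}

impl BreakerState {
    /// Gauge encoding from the operational requirements: 0/1/2.
    pub fn as_gauge(self) -> u8 {
        match self {
            BreakerState::Closed => 0,
            BreakerState::Open => 1,
            BreakerState::HalfOpen => 2,
        }
    }
}

pub struct Breaker {
    state: BreakerState,
    consecutive_failures: u32,
    threshold: u32,
    cooldown: Duration,
    opened_at: Option<Instant>,
}

impl Breaker {
    pub fn new(threshold: u32, cooldown: Duration) -> Self {
        Self {
            state: BreakerState::Closed,
            consecutive_failures: 0,
            threshold,
            cooldown,
            opened_at: None,
        }
    }

    /// State as of `now`; an Open breaker becomes HalfOpen once the cooldown elapses.
    pub fn state(&mut self, now: Instant) -> BreakerState {
        if self.state == BreakerState::Open {
            if let Some(opened) = self.opened_at {
                if now.saturating_duration_since(opened) >= self.cooldown {
                    self.state = BreakerState::HalfOpen;
                }
            }
        }
        self.state
    }

    /// Whether a call may proceed: rejected only while Open.
    pub fn allow(&mut self, now: Instant) -> bool {
        self.state(now) != BreakerState::Open
    }

    /// A success (including a successful probe) closes the breaker.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = BreakerState::Closed;
        self.opened_at = None;
    }

    /// A failed probe reopens immediately; otherwise open on threshold breach.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        let probe_failed = self.state == BreakerState::HalfOpen;
        if probe_failed || self.consecutive_failures >= self.threshold {
            self.state = BreakerState::Open;
            self.opened_at = Some(now);
        }
    }
}
```

Passing now in as a parameter, rather than reading a clock inside the breaker, keeps every transition testable without sleeping.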
Self-Assessed Stretch Goals
- (self-assessed) The optical source survives a 30-second downstream outage with no operator restarts and no observation drops at the application layer (the kernel may drop UDP frames during the outage; that is acceptable). Demonstrate via integration test using wiremock to simulate the outage.
- (self-assessed) OperatorGraph::build() for a 10-operator topology completes in under 100 ms (cold) and under 10 ms (warm). Provide a criterion benchmark.
- (self-assessed) The PropagatorPool (dedicated runtime for orbital propagation) is demonstrated to be isolated from the main runtime: an artificial 100x propagator load does not affect the main runtime's audit-sink P99 latency. Include the load-test harness.
Hints
How should I represent operators in the graph without making it generic over every operator type?
A boxed factory closure is the cleanest path. The graph stores Box<dyn FnOnce(WiredEnds) -> OperatorFuture + Send>, where OperatorFuture = Pin<Box<dyn Future<Output = Result<()>> + Send>>. The closure is called once build() has the channels; it captures whatever the operator needs (config, addresses, references to shared state).
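use std::{future::Future, pin::Pin};

use anyhow::Result; // assumed error type, matching the anyhow! calls elsewhere in this brief

type OperatorFuture = Pin<Box<dyn Future<Output = Result<()>> + Send>>;
type OperatorFactory = Box<dyn FnOnce(WiredEnds) -> OperatorFuture + Send>;

let radar_factory: OperatorFactory = Box::new(|ends| {
    Box::pin(async move {
        let radar = UdpRadarSource::bind("0.0.0.0:7001", "radar-01").await?;
        // Startup path: a source is always wired with an outgoing edge.
        let tx = ends.tx.expect("source has tx");
        run_source_loop(radar, tx).await
    })
});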
This keeps the graph type non-generic at the cost of a heap allocation per operator at build time — negligible for the topology sizes the SDA pipeline reaches.
How do I handle the channel-creation order problem in the topo-sort spawn pass?
The two-pass structure from the lesson:
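// Pass 3: allocate every channel up-front.
let mut chan_tx: HashMap<EdgeId, mpsc::Sender<Observation>> = HashMap::new();
let mut chan_rx: HashMap<EdgeId, mpsc::Receiver<Observation>> = HashMap::new();
for e in &self.edges {
    let (tx, rx) = mpsc::channel(e.capacity);
    chan_tx.insert(e.id, tx);
    chan_rx.insert(e.id, rx);
}

// Pass 4: walk in topo order, hand each operator its channel ends.
// The factory is FnOnce, so it cannot be called through a shared reference;
// each vertex is moved out of its slot first. `Vertex` is an assumed name for
// the element type of self.vertices, and build() must take `self` by value.
let mut vertices: Vec<Option<Vertex>> = self.vertices.into_iter().map(Some).collect();
for idx in topo_order {
    let Some(v) = vertices[idx].take() else { continue };
    let rx = v.incoming.and_then(|e| chan_rx.remove(&e));
    let tx = v.outgoing.and_then(|e| chan_tx.remove(&e));
    let future = (v.factory)(WiredEnds { rx, tx });
    let task = Task::spawn(&v.name, v.restart_policy, future);
    tasks.push(task);
}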
The receiver halves are removed (not get'd) because each receiver belongs to exactly one operator — moving rather than cloning. The map's emptiness at the end of pass 4 is itself a structural sanity check.
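The topo_order that pass 4 walks can come from Kahn's algorithm, which also yields cycle detection as a by-product. Below is a minimal sketch over plain vertex indices; the function signature is an assumption for illustration, and mapping the returned indices back to operator names for the error message is left to build().

```rust
use std::collections::VecDeque;

/// Kahn's algorithm over vertices `0..n` with directed `edges` (from, to).
/// Returns a topological order, or the indices of the vertices that could
/// not be ordered: every vertex on a cycle, plus anything downstream of one.
/// Assumes every edge endpoint is < n (validated by an earlier pass).
pub fn topo_order(n: usize, edges: &[(usize, usize)]) -> Result<Vec<usize>, Vec<usize>> {
    let mut indegree = vec![0usize; n];
    let mut successors = vec![Vec::new(); n];
    for &(from, to) in edges {
        successors[from].push(to);
        indegree[to] += 1;
    }

    // Start from every vertex with no incoming edge (the sources).
    let mut ready: VecDeque<usize> = (0..n).filter(|&v| indegree[v] == 0).collect();
    let mut order = Vec::with_capacity(n);

    while let Some(v) = ready.pop_front() {
        order.push(v);
        for &next in &successors[v] {
            indegree[next] -= 1;
            if indegree[next] == 0 {
                ready.push_back(next);
            }
        }
    }

    if order.len() == n {
        Ok(order)
    } else {
        // Anything with a nonzero remaining in-degree was never released.
        Err((0..n).filter(|&v| indegree[v] > 0).collect())
    }
}
```

For a fan-in such as two sources feeding one operator that feeds a sink, the order lists both sources before the operator. A back edge from the sink to the operator leaves both with nonzero in-degree, and those are the vertices the error should name.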
How do I test the supervisor's restart logic without flaky timing?
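tokio::time::pause() makes time deterministic in tests: tokio::time::sleep and tokio::time::Instant follow the runtime's virtual clock, which moves only when the test calls tokio::time::advance (or auto-advances when every task is idle), not with the wall clock. std::time::Instant is unaffected, so the supervisor's budget window must be measured with tokio::time::Instant. The supervisor's window-based budget can be tested by advancing time forward over the window and observing eviction.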
#[tokio::test(start_paused = true)]
async fn supervisor_restarts_within_budget() {
    let policy = RestartPolicy::Bounded {
        max_restarts: 3,
        window: Duration::from_secs(60),
    };
    let mut sup = Supervisor::for_test(policy, panic_factory());

    // First failure within budget: should restart.
    sup.simulate_exit(TaskExit::Errored(anyhow!("boom")));
    assert_eq!(sup.restart_count(), 1);

    // Three more failures exhaust the budget: two restarts, then exhaustion.
    for _ in 0..3 {
        sup.simulate_exit(TaskExit::Errored(anyhow!("boom")));
    }
    // `{ .. }` is a pattern, not an expression, so use matches! rather than assert_eq!.
    assert!(matches!(sup.event(), SupervisorEvent::BudgetExhausted { .. }));

    // Advance past the window; restart history evicts; new failures recover budget.
    tokio::time::advance(Duration::from_secs(61)).await;
    sup.simulate_exit(TaskExit::Errored(anyhow!("boom")));
    // Four restarts in total: the exit that exhausted the budget was not restarted.
    assert_eq!(sup.restart_count(), 4);
}
The Supervisor::for_test constructor and simulate_exit methods are testing-only API that you expose with #[cfg(test)] or behind a testing feature flag. The principle: the supervisor should be testable without spawning actual tasks.
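The budget decision itself can be isolated from task spawning entirely. The sketch below is one way to structure a sliding-window restart budget so that it takes the current instant as an argument; RestartBudget, Decision, and on_failure are hypothetical names, and the real supervisor would pass a timestamp from whichever clock it uses.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Restart,
    BudgetExhausted,
}

/// Sliding-window restart budget: at most `max_restarts` restarts within any
/// trailing `window`.
pub struct RestartBudget {
    max_restarts: usize,
    window: Duration,
    history: VecDeque<Instant>,
}

impl RestartBudget {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        Self {
            max_restarts,
            window,
            history: VecDeque::new(),
        }
    }

    /// Decide what to do about a failure observed at `now`.
    pub fn on_failure(&mut self, now: Instant) -> Decision {
        // Evict restarts that have aged out of the window.
        while let Some(&oldest) = self.history.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                self.history.pop_front();
            } else {
                break;
            }
        }
        if self.history.len() < self.max_restarts {
            self.history.push_back(now);
            Decision::Restart
        } else {
            Decision::BudgetExhausted
        }
    }

    /// Restarts currently counted against the budget.
    pub fn restarts_in_window(&self) -> usize {
        self.history.len()
    }
}
```

Only granted restarts are recorded, so an exit that exhausts the budget does not extend the window; whether that matches the Lesson 4 supervisor is worth checking against its restart-history definition.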
How do I refactor the M1 binary without breaking the integration tests?
Two-step refactor. First, build the orchestrator library with the test harness asserting the supervisor and graph behavior. Second, refactor the binary in-place, leaving its existing integration tests unmodified — they should pass against the refactored binary because the observable behavior is unchanged.
A common temptation is to build a parallel sda-ingest-v2 binary alongside the original. Resist this; it produces two binaries to maintain. The right approach is a single binary whose internals change. Keep the original integration test suite running on every commit during the refactor.
What restart policy should I default to per operator?
The defaults that match SDA's operational stance:
- Sources (radar, optical, ISL): Bounded { max_restarts: 5, window: Duration::from_secs(60) }. Sources are the most external part of the pipeline; they are most likely to encounter transient external failures (a network blip, a partner deploy). Restart-with-budget is the right shape.
- Stateless operators (normalize, validate): Bounded { max_restarts: 5, window: Duration::from_secs(60) }. Same defaults as sources: they have no state to corrupt and restart is cheap.
- Stateful operators (dedup, correlator): Bounded { max_restarts: 3, window: Duration::from_secs(60) }. Fewer restarts because state loss on each restart costs more, and a higher recurrence rate suggests a deeper problem.
- Sinks where data integrity is at stake (audit log, alert emitter): Never. A retry on these may produce duplicate emits to downstream subscribers; better to fail the pipeline loudly. Module 5's idempotent-sink machinery will let you change this default later.
Document the choice per-operator in the topology declaration with a one-line comment explaining the reasoning.
Getting Started
Recommended order:
- Task wrapper. Define Task::spawn, Task::join, TaskExit, RestartPolicy. Write unit tests that spawn synthetic tasks (success, error, panic) and assert the right TaskExit variant.
- OperatorGraph builder. Define the builder API and the per-role validation. Write tests for: well-formed graphs build, dangling edges are rejected, role mismatches are rejected.
- Topo sort + cycle detection. Implement Kahn's algorithm in build(). Write tests for: linear chains, fan-in, fan-out, and (failing) cycles.
- Channel allocation and spawn. The two-pass structure from the hint. Write an end-to-end test that builds a 3-operator graph and confirms data flows from source to sink.
- Supervisor. Wrap the JoinSet loop. Write the deterministic-timing tests using tokio::time::pause.
- with_retry and CircuitBreaker. These are independent of the orchestrator; they can be developed and tested in isolation, then wired into the optical source's polling code.
- PropagatorPool. The dedicated-runtime wrapper for the orbital propagator. The propagator itself is mocked for this project; the real propagator is from Meridian's orbital crate, which is out of scope. A mock_propagate(state, dt) -> state function that does a deterministic-but-CPU-bound computation is sufficient.
- Refactor sda-ingest. The topology declaration becomes a single build_topology() function. Keep the existing integration tests passing.
- Operational README. Document the orchestrator API and the new metrics. One page, terse.
Aim for a working orchestrator and a passing-tests refactor by day 7 even if the operational polish (control-plane metrics, README) is incomplete. The orchestrator's correctness is what matters; the polish is finishing work.
What This Module Sets Up
In Module 3 you will replace this module's processing-time dedup operator with an event-time windowed correlator. The orchestrator interface stays the same; only one operator's implementation changes. The watermark machinery you build there flows along the same channels the orchestrator wired up here.
In Module 4 you will harden the channel boundaries against burst-load failure modes. The bounded-channel-per-edge invariant the orchestrator enforces structurally is what makes that work tractable. You will revisit buffer sizing with rigor.
In Module 5 you will make the windowed operator's state crash-safe via checkpointing. The supervisor's restart machinery you built here is what the checkpoint recovery path hooks into. The Never restart policy on data-integrity-critical sinks gets revisited with idempotent-sink tooling that lets it become Bounded safely.
The orchestrator is not a throwaway. It is the connective tissue every subsequent module's project hangs on.
Module 03 — Event Time and Watermarks
Track: Data Pipelines — Space Domain Awareness Fusion Position: Module 3 of 6 Source material: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Reasoning About Time, Windowing, Knowing When You're Ready, Out-of-Order Events); Streaming Data — Andrew Psaltis, Chapter 4 (Analyzing Streaming Data, Windowing Patterns, Watermarks); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Stream Processing Concepts: Time, Out-of-Sequence Events); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Late-Arriving Data) Quiz pass threshold: 70% on all four lessons to unlock the project
Mission Context
OPS ALERT — SDA-2026-0142 Classification: CORRELATION TIER UPGRADE Subject: Replace processing-time dedup with event-time windowed correlation
The Phase 2 orchestrator from Module 2 is in production. Its dedup operator at the top of the correlation tier is processing-time-based: it buckets observations by arrival time, not by sensor_timestamp. Internal review of conjunction alerts from the past quarter found a 2.3% rate of cross-sensor mismatches traceable to optical-vs-radar arrival skew straddling the 5-second dedup window. Two of the missed correlations were later determined to be real conjunctions. The fix is structural: replace processing-time dedup with event-time windowed correlation that respects sensor_timestamp regardless of arrival skew, with watermarks driving window close and allowed-lateness handling the long tail of partner-API delays.
This module is the correctness foundation of the rest of the track. The orchestrator built in Module 2 stays unchanged — only one operator's implementation changes (dedup → correlator). What does change is the conceptual machinery: every event-time-aware operator from this point on consumes a watermark stream, maintains windowed state with explicit close triggers, and handles the rare late event explicitly rather than silently dropping it. The patterns this module installs — event-time-on-the-envelope, per-source max-lateness, min-of-inputs watermark propagation, allowed-lateness retention, retract-and-correct downstream emission — are the canonical streaming-system shape that Module 4 (backpressure and flow control), Module 5 (delivery guarantees and fault tolerance), and Module 6 (observability and lineage) all build on.
The mental model the module installs is the four-piece event-time discipline: (1) the envelope carries event time and ingest time as separate first-class fields, (2) operators bucket by event time using one of four canonical window shapes, (3) watermarks are guarantees that drive window close, propagated through fan-ins by the min rule, (4) late events past the watermark are handled explicitly per a per-output strategy. Every event-time pipeline that succeeds in production is some combination of these four; pipelines that skip any of them have correctness bugs that look like flakiness.
Learning Outcomes
After completing this module, you will be able to:
- Distinguish event time from ingest time on the observation envelope and choose the right time for any aggregation question
- Reason about per-source clock heterogeneity (GPS-locked, NTP-synced, drifting) and carry source-specific quality forward through the pipeline
- Choose among the four canonical window shapes (tumbling, hopping, sliding, session) based on the question being asked, not on the perceived complexity of each
- Implement per-event sliding windows with bounded memory and correct event-time eviction, plus session windows with the production-safety double-bound
- Define watermarks precisely as guarantees, generate them at sources from per-source max-lateness, and propagate them through fan-in operators via the min-of-inputs rule
- Handle late events with the appropriate strategy — drop, allowed-lateness, or retract — and recognize which strategy fits which output
- Compose the operator-level retract-and-correct pattern with M2's at-least-once-plus-idempotent delivery to produce effective exactly-once at the sink
Lesson Summary
Lesson 1 — Event Time vs Processing Time
The two operationally distinct timestamps every observation carries: sensor_timestamp (when the event happened in the world) and ingest_timestamp (when the pipeline received it). Sensor-clock heterogeneity carried forward as ClockQuality so per-source skew can widen the correlator's matching window. Out-of-order arrival as the rule. Lag (split into source lag and pipeline lag) as the master diagnostic for "is the problem ours or theirs."
Key question: A radar observation arrives 80 ms after its event time; an optical observation of the same event arrives 30 seconds after its event time. Should they be correlated, and which window-assignment strategy gets that right?
Lesson 2 — Windowing
The four canonical window shapes — tumbling, hopping, sliding, session — and the question shape each fits. The conjunction-risk question fits per-event sliding windows. BTreeMap-keyed-on-window-end as the data structure that makes close-up-to-watermark a cheap O(log N) prefix query. Session windows' production-safety double-bound (gap timeout AND max session duration) as the safety valve that prevents unbounded growth.
Key question: The correlator must answer "for each new observation, find others within W of its event_time." Which window shape does that question fit, and what is the cost profile?
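As a preview of the data-structure point above, here is a minimal sketch of windows keyed on their end time in a BTreeMap, with close-up-to-watermark implemented as a single split. It assumes event times are u64 milliseconds and that a window is closable once the watermark reaches its end; OpenWindows and its methods are illustrative names, not the lesson's types.

```rust
use std::collections::BTreeMap;

/// Open windows keyed by window end (event time, in milliseconds).
/// The value is whatever state a window accumulates; here, observation ids.
pub struct OpenWindows {
    by_end: BTreeMap<u64, Vec<u32>>,
}

impl OpenWindows {
    pub fn new() -> Self {
        Self {
            by_end: BTreeMap::new(),
        }
    }

    pub fn add(&mut self, window_end_ms: u64, observation_id: u32) {
        self.by_end
            .entry(window_end_ms)
            .or_default()
            .push(observation_id);
    }

    /// Remove and return every window whose end is at or before the watermark.
    pub fn close_up_to(&mut self, watermark_ms: u64) -> BTreeMap<u64, Vec<u32>> {
        // split_off returns the keys >= its argument and leaves the smaller
        // keys in place, so the returned half is the still-open set.
        let still_open = self.by_end.split_off(&watermark_ms.saturating_add(1));
        // Swap: keep the still-open windows, hand back the closed prefix.
        std::mem::replace(&mut self.by_end, still_open)
    }

    pub fn open_count(&self) -> usize {
        self.by_end.len()
    }
}
```

Because the map is ordered by window end, the closable windows are always a prefix, which is what keeps the close step independent of how many windows stay open.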
Lesson 3 — Watermarks
The watermark as a per-event-time guarantee, not an estimate. Heuristic watermarks computed as max_observed_event_time - max_lateness, with per-source documented bounds (radar 100ms, optical 30s, ISL 10s for SDA). The min-of-inputs rule for propagation through fan-in operators, and the operational consequence: the slowest source dominates the downstream watermark. Watermarks interleaved with data items on the same channel via enum SourceItem { Observation(_), Watermark(_) } to preserve their relationship to the data they bound.
Key question: Three sources at watermarks T-100ms, T-30s, T-10s. What is the downstream watermark, and what is the operational consequence of that answer?
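The heuristic source watermark and the min-of-inputs rule are both a few lines of arithmetic. The sketch below assumes event times as u64 milliseconds and a fixed number of inputs; source_watermark and FanInWatermark are hypothetical names used only for illustration.

```rust
/// Heuristic source watermark: max observed event time minus the source's
/// documented max lateness (both in milliseconds).
pub fn source_watermark(max_observed_event_time_ms: u64, max_lateness_ms: u64) -> u64 {
    max_observed_event_time_ms.saturating_sub(max_lateness_ms)
}

/// Tracks the latest watermark per input and exposes the fan-in watermark.
pub struct FanInWatermark {
    /// One slot per input; None until that input has sent a watermark.
    latest: Vec<Option<u64>>,
}

impl FanInWatermark {
    pub fn new(inputs: usize) -> Self {
        Self {
            latest: vec![None; inputs],
        }
    }

    /// Record a watermark from `input`. A watermark never moves backwards,
    /// so a smaller value than the one already held is ignored.
    pub fn observe(&mut self, input: usize, watermark_ms: u64) {
        let slot = &mut self.latest[input];
        *slot = Some(slot.map_or(watermark_ms, |prev| prev.max(watermark_ms)));
    }

    /// Min over all inputs; None until every input has reported at least once.
    pub fn current(&self) -> Option<u64> {
        let mut lowest: Option<u64> = None;
        for slot in &self.latest {
            let wm = (*slot)?; // a silent input holds the whole fan-in back
            lowest = Some(lowest.map_or(wm, |m| m.min(wm)));
        }
        lowest
    }
}
```

With the three SDA bounds (100 ms, 30 s, 10 s) and all sources having observed the same maximum event time T, the fan-in watermark is T minus 30 s: the optical source sets the pace for every downstream window close.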
Lesson 4 — Late Data
The three strategies for events that arrive after their watermark: drop (cheap, lossy), accumulate-with-allowed-lateness (medium cost, eventual completeness), retract-and-correct (highest cost, strongest correctness). Two-tier window state (active and retained). Retract-then-insert ordering with sequence numbers for downstream correction. Retract-aware sinks with strict-greater UPSERT semantics that absorb duplicates and out-of-order retransmits.
Key question: A late observation invalidates a previously-emitted conjunction alert. The alert subscriber has already triggered an avoidance maneuver. Which late-data strategy should the correlator use, and why?
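The strict-greater rule is what lets a sink absorb duplicates and reordered retransmits: an emission is applied only if its sequence number is strictly greater than the one already stored for that key. The sketch below uses an in-memory map as a stand-in for the SQLite sink, with hypothetical Emission and Sink types; retractions are kept as tombstones so a stale insert cannot resurrect a retracted row.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Emission {
    Insert { key: u64, seq: u64, risk: f64 },
    Retract { key: u64, seq: u64 },
}

struct Row {
    seq: u64,
    /// None marks a retracted row (tombstone).
    risk: Option<f64>,
}

pub struct Sink {
    rows: HashMap<u64, Row>,
}

impl Sink {
    pub fn new() -> Self {
        Self {
            rows: HashMap::new(),
        }
    }

    /// Apply an emission; returns false if it was a duplicate or stale.
    pub fn apply(&mut self, emission: Emission) -> bool {
        let (key, seq, risk) = match emission {
            Emission::Insert { key, seq, risk } => (key, seq, Some(risk)),
            Emission::Retract { key, seq } => (key, seq, None),
        };
        match self.rows.get(&key) {
            // Strict-greater: equal or older sequence numbers are ignored.
            Some(existing) if existing.seq >= seq => false,
            _ => {
                self.rows.insert(key, Row { seq, risk });
                true
            }
        }
    }

    /// Current value for a key, or None if absent or retracted.
    pub fn current(&self, key: u64) -> Option<f64> {
        self.rows.get(&key).and_then(|row| row.risk)
    }
}
```

Applying the same emission twice, or applying a retract after the insert that superseded it, leaves the sink unchanged, which is the property the retract-then-insert ordering relies on.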
Capstone Project — Conjunction Window Engine
Replace the M2 dedup operator with a windowed event-time correlator. Per-source watermarks, min-of-inputs fan-in propagation, per-key sliding windows of 30 seconds with 5 seconds of allowed lateness, retract-then-insert emission on late events, and a sequence-keyed retraction-aware SQLite sink. The replay-correctness test (byte-identical output under random arrival order) is the canary for every windowing bug. Acceptance criteria, suggested architecture, and the full project brief are in project-conjunction-window-engine.md.
The orchestrator from Module 2 is unchanged; only one operator's implementation changes. The patterns established here repeat in every subsequent module's project.
File Index
module-03-event-time-and-watermarks/
├── README.md ← this file
├── lesson-01-event-vs-processing-time.md ← Event time vs processing time
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-windowing.md ← Windowing
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-watermarks.md ← Watermarks
├── lesson-03-quiz.toml ← Quiz (5 questions)
├── lesson-04-late-data.md ← Late data
├── lesson-04-quiz.toml ← Quiz (5 questions)
└── project-conjunction-window-engine.md ← Capstone project brief
Prerequisites
- Module 1 (Stream Processing Foundations) and Module 2 (Pipeline Orchestration Internals) completed: the Observation envelope, the OperatorGraph builder, the supervisor pattern, and the at-least-once-plus-idempotent delivery frame are all assumed
- Foundation Track completed: async Rust, channels, BTreeMap and VecDeque algorithmic intuitions
- Familiarity with tokio::sync::mpsc, std::time::SystemTime, and serde for the watermark envelope extension
- Comfort with the tokio::select! pattern from Module 2's cancel-safety lesson
What Comes Next
Module 4 (Backpressure and Flow Control) hardens the channel boundaries against burst load. The bounded-channel-per-edge invariant from M2 plus the watermark machinery from M3 are both inputs to that work — burst load affects watermark advance, and watermark stall affects late-event handling. The flow-policy machinery developed in M4 plugs in upstream of the windowed correlator without changing its windowing logic.
Lesson 1 — Event Time vs Processing Time
Module: Data Pipelines — M03: Event Time and Watermarks Position: Lesson 1 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 ("Reasoning About Time" — the three different times in stream processing); Streaming Data — Andrew Psaltis, Chapter 4 (Analyzing Streaming Data); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Stream Processing Concepts: Time)
Context
Module 1's Observation envelope carried two timestamps, and Module 1 said the distinction would matter later. This is later. The dedup operator in M1 (and the orchestrator that wraps it in M2) assigns observations to windows by processing time — the wall-clock instant at which the pipeline received the observation. That works for ingestion-tier deduplication but it is wrong for correlation-tier reasoning. Two radars observe the same orbital event at the same instant. Their observations leave the radars within microseconds of each other. They arrive at the pipeline 100 milliseconds apart because one radar's link to the ground network has a long-haul fiber detour. A processing-time correlator concludes "two separate events." An event-time correlator concludes "one event, two views." The conjunction risk computation depends on the latter being right.
The mental shift this lesson installs is that every observation has multiple timestamps that are operationally non-interchangeable. Sensor timestamp is when the event happened in the world. Ingest timestamp is when the pipeline received it. The pipeline's wall clock is when now is. Each of these answers a different question. Throughput metrics (events per second across the pipeline) want processing time. SLO compliance (P99 ingest-to-emit latency) wants ingest time as the start. Conjunction-window assignment (which observations should be considered together for risk computation) wants sensor timestamp. Confusing them produces incorrect aggregates that are individually plausible but collectively inconsistent.
This module's job is to take the orchestrator from Module 2 and replace its processing-time dedup with an event-time windowed correlator. The lessons proceed in dependency order. This lesson establishes the time vocabulary. Lesson 2 builds windows on top of it. Lesson 3 introduces watermarks — the mechanism by which the pipeline decides "I have seen all events for window W with sufficient confidence to emit the window's result." Lesson 4 handles the late events that arrive after a watermark has already declared the window closed. The capstone project replaces M2's dedup operator with the result.
Core Concepts
The Three Times
DDIA Chapter 11 makes the case precisely: an event in a streaming system can carry up to three distinct timestamps, and a production pipeline that does not distinguish them will have correctness bugs that look like flakiness.
- Event time: when the event actually happened in the source system. For an SDA observation, this is the instant the sensor's hardware recorded the detection. The radar's GPS-disciplined clock captures this to within about 100 nanoseconds of UTC. The optical telescope's NTP-disciplined clock captures it to about ten milliseconds. The ISL beacon's onboard satellite clock captures it with drift up to seconds between syncs.
- Ingest time (sometimes called server time): when the pipeline received the event. This is what SystemTime::now() returns in the source operator's recv loop. It is monotonic-ish across observations from the same sensor but not across sensors.
- Processing time: the wall-clock instant at which a given operator processes the event. Different operators process the same event at different processing times because the event flows through them sequentially. For most aggregation purposes this and ingest time are interchangeable; the distinction matters when you are reasoning about a single operator's local clock.
For SDA's purposes, ingest time and processing time can be collapsed: every observation's ingest_timestamp is captured at the ingestion-tier source operator (Module 1), and downstream operators inherit that timestamp without modification. The two distinct times the pipeline carries forward are event time (sensor_timestamp) and ingest time (ingest_timestamp). The lessons that follow refer to these by their envelope field names.
Where Each Time Belongs
The decision is per-question, not per-pipeline. A useful rule of thumb is "what is the question being asked, and what time would make the answer right?"
| Question | The right time |
|---|---|
| How many events did we see this minute? | Processing/ingest time (operator's local view) |
| What is the P99 latency from sensor to alert? | Both — (emit_time - sensor_timestamp) for end-to-end, (emit_time - ingest_timestamp) for pipeline-only |
| Which observations are part of this orbital event? | Event time (sensor_timestamp) — we want all observations that physically co-occurred |
| Is the pipeline keeping up with real time? | The lag — (now - sensor_timestamp) summarized over the recent window |
| Did we receive any observation in the last second? | Ingest time |
| For a 5-second event-time window starting at T, when can we close it? | This is the watermark question — Lesson 3 |
The trap is using the wrong time for the question and getting an answer that looks plausible. A conjunction correlator that buckets by ingest time produces correct-looking output most of the time — the typical optical-vs-radar arrival skew is small enough that most observations of the same event do land in the same processing-time bucket. The bug shows up only when a sensor source has a delay that pushes an observation across a bucket boundary. That is when correlations are missed. The bug is rare and silent and load-dependent and exactly the kind of thing that gets diagnosed only after a high-profile incident.
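The bucket-boundary failure is easy to reproduce with integer arithmetic. The sketch below assumes millisecond timestamps and fixed-width tumbling buckets of nonzero width; Obs and the helper functions are illustrative names, not the envelope type from Module 1.

```rust
/// Tumbling-bucket index for a timestamp; both arguments in milliseconds.
/// Assumes `width_ms` is nonzero.
pub fn bucket(ts_ms: u64, width_ms: u64) -> u64 {
    ts_ms / width_ms
}

/// The two timestamps an observation carries (hypothetical minimal envelope).
pub struct Obs {
    pub sensor_ts_ms: u64,
    pub ingest_ts_ms: u64,
}

/// Would an event-time correlator consider these together?
pub fn same_bucket_by_event_time(a: &Obs, b: &Obs, width_ms: u64) -> bool {
    bucket(a.sensor_ts_ms, width_ms) == bucket(b.sensor_ts_ms, width_ms)
}

/// Would an ingest-time correlator consider these together?
pub fn same_bucket_by_ingest_time(a: &Obs, b: &Obs, width_ms: u64) -> bool {
    bucket(a.ingest_ts_ms, width_ms) == bucket(b.ingest_ts_ms, width_ms)
}
```

Take two radar views of one event at sensor time 9,950 ms, ingested at 9,960 ms and 10,060 ms, with 5-second buckets. Event time puts both in bucket 1; ingest time puts them in buckets 1 and 2, and the correlation is missed even though the skew is only 100 ms.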
Clock Skew and Source Quality
Every sensor's clock has its own accuracy story, and the pipeline must accept the heterogeneity rather than pretend it away.
- GPS-disciplined clocks (radar arrays). Accurate to about 100 nanoseconds against UTC. Locked to a constellation that disciplines drift continuously. The radar's sensor_timestamp is the most trustworthy event time in the system.
- NTP-disciplined clocks (ground-based optical telescopes). Accurate to about 10 milliseconds in steady state, occasionally worse when the NTP server is degraded. NTP self-reports its sync state, which is data the source operator can include in the envelope as a quality flag.
- Onboard satellite clocks for ISL beacons, disciplined by the satellite's own GPS or by occasional ground-loop syncs. Drift between syncs can grow to seconds. The beacon's envelope reports time_since_last_sync_s so the pipeline can estimate the drift and treat the timestamp accordingly.
The pipeline cannot fix bad clocks at the source, but it can carry the per-source quality forward so downstream operators can reason about it. A correlator that knows an ISL beacon's timestamp is uncertain to within ±5 seconds can either widen the matching window for that source or downweight its contribution; a correlator that treats every timestamp as ground truth produces incorrect correlations during sync drift.
Out-of-Order Arrival
Even with perfectly synchronized clocks, observations arrive at the pipeline out of event-time order. The optical archive's 30-second polling interval means optical observations can lag radar observations of the same event by up to 30 seconds of event time. The ISL beacon's 10-second buffering before downlink means beacon observations can lag both. A correlator that assumes events arrive in event-time order produces correctness bugs the first time a slow source's observation arrives after a faster source's observation that was generated later.
Out-of-order arrival is the rule, not the exception. The system must accept it. The mechanism is the watermark protocol covered in Lesson 3, which generalizes "wait for late events" into a tractable operator-level abstraction. The lesson here is conceptual: design every event-time operator under the assumption that observations arrive in arbitrary order with respect to event time, and verify the assumption with a replay test that injects deliberate out-of-order traffic.
Lag as the Master Diagnostic
The diagnostic metric that operations depends on most is lag: lag = ingest_timestamp - sensor_timestamp (or now - sensor_timestamp for currently-streaming events). Lag answers "how far behind real time is the pipeline?" — the single most operationally important question for an event-driven system.
Lag's two components separate cleanly. Source lag (ingest_timestamp - sensor_timestamp) is how long the pipeline took to receive the event after the sensor recorded it. Source lag changes when sensors get slower, when network paths degrade, when partner APIs back up. Pipeline lag (now - ingest_timestamp for a still-flowing event, or for an emitted event the difference between emit-time and ingest-time) is how long the pipeline itself takes once it has the event. Pipeline lag changes when operators slow down, when channels back up, when the pipeline is overloaded. Distinguishing the two is what lets ops answer "is the problem ours or theirs?" without playing detective.
A naive lag metric (just now - sensor_timestamp summarized as a histogram) is a useful single-number diagnostic and the right thing to put on the dashboard. A diagnostic-grade metric (split by source for source lag, by stage for pipeline lag) is the right thing to have available when you need it. Module 6 builds these out fully; for this module we instrument lag at one point — the sink — and use it as the SLO indicator.
Code Examples
Source-Side Event-Time Capture with Quality Flags
The radar source already populates sensor_timestamp from its frame data. We extend each source's emission with a quality flag describing how trustworthy the timestamp is. This is small, additive, and the foundation of every event-time decision downstream.
use std::time::{Duration, SystemTime};
/// How trustworthy is this observation's sensor_timestamp?
/// Carried alongside the timestamp so downstream operators can
/// widen matching windows or downweight observations from
/// sources with degraded clocks.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum ClockQuality {
/// GPS-disciplined; accurate to ~100ns against UTC.
GpsLocked,
/// NTP-disciplined; accurate to ~10ms in steady state.
NtpSynced { last_sync_age_s: u32 },
/// Onboard clock with measurable drift since last discipline.
OnboardDrift { time_since_sync_s: u32 },
/// Source could not provide quality; treat conservatively.
Unknown,
}
impl ClockQuality {
/// Worst-case event-time uncertainty for this source. Used by
/// the correlator to widen its matching window.
pub fn max_skew(&self) -> Duration {
match *self {
ClockQuality::GpsLocked => Duration::from_micros(1),
ClockQuality::NtpSynced { last_sync_age_s } => {
// NTP accuracy is assumed to degrade roughly linearly with sync age:
// the 10ms steady-state bound plus 1ms per minute since the last sync.
let extra_ms = (last_sync_age_s / 60) as u64;
Duration::from_millis(10 + extra_ms)
}
ClockQuality::OnboardDrift { time_since_sync_s } => {
// Conservative: 100us drift per second since last sync.
Duration::from_micros((time_since_sync_s as u64) * 100)
}
ClockQuality::Unknown => Duration::from_secs(5),
}
}
}
The max_skew method gives the correlator a single number to expand its event-time matching window for observations from this source. A radar observation gets a 1-microsecond skew (effectively zero); an ISL beacon two minutes past its last sync gets 12 milliseconds; a beacon at the bound of its sync interval (an hour without sync, hypothetically) gets 360 milliseconds. The correlator widens its window by the max-skew of any participating observation, ensuring legitimate correlations are not missed because of clock differences. The idea is loosely analogous to what Kafka Streams calls a grace period and Flink calls allowed lateness: both give a window extra tolerance, although those mechanisms bound how late a record may arrive rather than how uncertain its timestamp is.
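As a minimal sketch of how a correlator might apply that number: the function below takes the two timestamps, the two skews (standing in for the result of `ClockQuality::max_skew()` on each observation's source), and a base matching window. The name `could_match` and the choice to widen by the larger of the two skews are assumptions made for illustration; summing the skews would be the more conservative reading.

```rust
use std::time::{Duration, SystemTime};

/// Absolute difference between two event times.
fn abs_diff(a: SystemTime, b: SystemTime) -> Duration {
    match a.duration_since(b) {
        Ok(d) => d,
        // `b` is later than `a`; the error carries the positive gap.
        Err(e) => e.duration(),
    }
}

/// True if two observations could describe the same event once clock
/// uncertainty is taken into account. `skew_a` and `skew_b` stand in
/// for `ClockQuality::max_skew()` of each observation's source.
pub fn could_match(
    ts_a: SystemTime,
    skew_a: Duration,
    ts_b: SystemTime,
    skew_b: Duration,
    base_window: Duration,
) -> bool {
    // Widen by the larger of the two skews (an assumption; summing the
    // skews would be more conservative).
    let widened = base_window + skew_a.max(skew_b);
    abs_diff(ts_a, ts_b) <= widened
}
```

With a 50 ms base window, a radar observation (1 µs skew) and a beacon observation (12 ms skew) that are 55 ms apart match under the widened window; two radar observations 55 ms apart do not.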
Computing and Emitting Lag
The lag operator sits at the sink end of the pipeline (or near it). It computes the lag for every event flowing through and exports it as a histogram. The operator is stateless, and its per-event cost is small: one clock read, one label string, and two histogram records.
use std::time::{Duration, SystemTime};
/// Compute end-to-end lag for an observation emitted at the sink.
/// Returns (source_lag, pipeline_lag).
/// source_lag = ingest_timestamp - sensor_timestamp (how late the sensor's
/// event was when the pipeline first saw it)
/// pipeline_lag = now - ingest_timestamp (how long the pipeline itself
/// took to process the event)
pub fn compute_lag(obs: &Observation) -> (Duration, Duration) {
let now = SystemTime::now();
let source_lag = obs
.ingest_timestamp
.duration_since(obs.sensor_timestamp)
.unwrap_or_else(|_| {
// Negative source lag means the source's clock is ahead of
// ours — possible during deploy if the source is GPS-locked
// and our host's NTP is degraded. Surface as zero rather
// than silently subtracting.
Duration::ZERO
});
let pipeline_lag = now
.duration_since(obs.ingest_timestamp)
.unwrap_or(Duration::ZERO);
(source_lag, pipeline_lag)
}
/// Operator that observes lag at the sink and exports it.
/// The Prometheus histogram split by source kind makes it actionable:
/// a source-specific lag spike points at the source; a uniform spike
/// points at the pipeline.
pub fn observe_lag(obs: &Observation) {
let (source_lag, pipeline_lag) = compute_lag(obs);
let kind = format!("{:?}", obs.source_kind);
metrics::histogram!("source_lag_seconds", "source" => kind.clone())
.record(source_lag.as_secs_f64());
metrics::histogram!("pipeline_lag_seconds", "source" => kind)
.record(pipeline_lag.as_secs_f64());
}
The negative-lag handling is a real concern — the pipeline's host clock and a source's clock can disagree, and the difference can be in either direction. The right behavior is to surface the disagreement as a separate signal (a clock_skew_observed_total counter) rather than producing absurd lag values. The implementation collapses to zero for simplicity here; production code should split this out and alert on persistent negative lag. The Prometheus split by source is the operationally important part: when the lag dashboard shows the radar's source lag has doubled while the optical's is unchanged, you know what to investigate.
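One way to split the negative case out, sketched under the assumption that the caller owns the metrics: return an enum instead of collapsing to zero, and let the caller feed `Lag` into the histogram and `ClockAhead` into the skew counter. The names `SourceLag` and `classify_source_lag` are hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// Outcome of comparing ingest time with sensor time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceLag {
    /// Normal case: the sensor timestamp is at or before ingest.
    Lag(Duration),
    /// The sensor timestamp is ahead of ingest by this much. The two
    /// clocks disagree, so the value belongs in a skew counter rather
    /// than in the lag histogram.
    ClockAhead(Duration),
}

pub fn classify_source_lag(ingest: SystemTime, sensor: SystemTime) -> SourceLag {
    match ingest.duration_since(sensor) {
        Ok(lag) => SourceLag::Lag(lag),
        // SystemTimeError::duration() reports how far `sensor` is
        // ahead of `ingest`.
        Err(e) => SourceLag::ClockAhead(e.duration()),
    }
}
```

A caller would record `Lag(d)` as before and increment clock_skew_observed_total on `ClockAhead(d)`, optionally recording `d` so the size of the disagreement is visible.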
A Pitfall: Window Assignment Using Processing Time
The buggy operator below assigns observations to 5-second windows by calling Instant::now() at processing time. A teammate proposes it during code review with the rationale "this is simpler and the difference is small." It is wrong. The example shows the failure mode the lesson keeps warning about.
// BUG: assigns to windows by processing time, not by event time.
// Observations of the same event from different sources land in
// different windows because they arrive at slightly different times.
pub async fn buggy_window_assigner(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<(WindowId, Observation)>,
) -> Result<()> {
let start = Instant::now();
while let Some(obs) = input.recv().await {
let elapsed_secs = start.elapsed().as_secs();
let window_id = WindowId(elapsed_secs / 5); // 5-second windows
output.send((window_id, obs)).await?;
}
Ok(())
}
// CORRECT: assigns by sensor_timestamp. Observations of the same
// physical event land in the same window regardless of arrival skew.
pub async fn correct_window_assigner(
mut input: mpsc::Receiver<Observation>,
output: mpsc::Sender<(WindowId, Observation)>,
epoch: SystemTime,
) -> Result<()> {
while let Some(obs) = input.recv().await {
let event_offset = obs
.sensor_timestamp
.duration_since(epoch)
.unwrap_or_default();
let window_id = WindowId(event_offset.as_secs() / 5);
output.send((window_id, obs)).await?;
}
Ok(())
}
Two observations that physically co-occurred (same orbital event, same instant of detection) but arrive at the pipeline 100 milliseconds apart land in different processing-time windows whenever the 100-ms gap straddles a 5-second boundary. In the SDA pipeline's actual deployment, given the optical-radar arrival distribution, about 2% of correlated events would be mis-correlated by the buggy operator. That is enough to produce a phantom-conjunction rate that operations notices but cannot easily explain: every correlator failure produces a defensible-looking output, and only the aggregate statistics reveal the bias. The fix is exactly the operator above: bucket by sensor_timestamp, not by elapsed processing time. Lesson 2 develops this into a full windowing operator with bounded memory and explicit close conditions.
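The 2% figure can be sanity-checked with a small calculation. The sketch below assumes the first arrival is uniformly distributed within a window and that the second arrives a fixed gap later; the real arrival distribution is more complicated, so treat this as a back-of-the-envelope check rather than a model of the deployment. `straddle_fraction` is a hypothetical helper.

```rust
/// Fraction of first-arrival offsets (whole milliseconds within one
/// window) for which a second arrival `gap_ms` later lands in a
/// different processing-time window.
pub fn straddle_fraction(window_ms: u64, gap_ms: u64) -> f64 {
    let straddles = (0..window_ms)
        .filter(|&t| t / window_ms != (t + gap_ms) / window_ms)
        .count();
    straddles as f64 / window_ms as f64
}
```

For a 5000 ms window and a 100 ms gap, the first arrival must fall in the last 100 ms of the window for the pair to be split, which is 100 / 5000 = 2% of offsets.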
Key Takeaways
- Every observation carries two operationally distinct timestamps: sensor_timestamp (when the event happened in the world) and ingest_timestamp (when the pipeline received it). Processing time at any given operator can be collapsed to ingest time for SDA's purposes; the distinction that matters is event time vs ingest time.
- The right time depends on the question. Throughput and SLO metrics want ingest/processing time. Window assignment for correlation wants event time. Confusing them produces output that looks correct in aggregate but is silently wrong in ways that only the aggregate-of-aggregates statistics reveal.
- Sensor clocks are heterogeneous: GPS-locked (radar, ~100ns), NTP-synced (optical, ~10ms), onboard-with-drift (ISL, up to seconds). Carry clock quality forward in the envelope so downstream operators can widen matching windows or downweight observations from degraded sources.
- Out-of-order arrival is the rule in event-time pipelines. Late observations from slow sources arrive after observations of later events from faster sources. Every event-time operator must be designed with this assumption; the watermark protocol in Lesson 3 makes the rule operational.
- Lag (now - sensor_timestamp) is the master diagnostic for an event-driven pipeline. Split into source lag and pipeline lag to answer "is the problem ours or theirs?" without ambiguity. Module 6 builds the full observability stack on top of this primitive.
Lesson 2 — Windowing
Module: Data Pipelines — M03: Event Time and Watermarks Position: Lesson 2 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Windowing); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Time Windows in Stream Processing); Streaming Data — Andrew Psaltis, Chapter 4 (Windowing patterns and their implementation costs)
Context
Lesson 1 established that observations carry an event time and that correlation logic must bucket them by event time rather than by arrival time. This lesson is about how to do that bucketing efficiently and correctly. A window is a bounded slice of the event stream — defined by the rule "all events whose event-time falls within this range" — that the operator can hold in memory, compute over, and emit a result for. Windows are how unbounded streams admit aggregation: you cannot compute "all conjunction risks ever," but you can compute "all conjunction risks where the contributing observations fell within this 30-second event-time slice."
The choice of window shape is a correctness decision, not a performance optimization. The conjunction risk computation cares about pairs of orbital objects whose closest-approach time falls within a small window. If we use the wrong window shape — say, fixed 30-second buckets aligned to clock minutes — every legitimate conjunction whose closest approach happens to straddle a bucket boundary is silently missed. Choosing the right shape requires understanding the four canonical kinds (tumbling, hopping, sliding, session), the cost profile of each, and the question each is shaped to answer. This lesson takes them in turn and develops a sliding-window operator that the capstone correlator builds on.
The forward references stay concrete. The window operator built here holds events in memory until it can declare a window "closed" and emit its result. Lesson 3 supplies the close mechanism — watermarks. Lesson 4 handles late events that arrive after a window has been closed. The capstone project replaces M2's processing-time dedup with this lesson's sliding-window correlator. The pattern of "windowed accumulation, watermark-driven emit, allowed-lateness for stragglers" is the standard streaming-system shape; we are building the SDA-specific instance of it.
Core Concepts
Tumbling Windows
The simplest window shape. Fixed size, non-overlapping, every event lands in exactly one window. A 5-second tumbling window over an event-time stream produces one window for [0, 5), the next for [5, 10), the next for [10, 15), and so on. The boundaries are typically aligned to a stable epoch (00:00:00 UTC, or the pipeline's start time, or some other fixed reference) so that a given event_time always maps to the same window across replicas.
Tumbling windows are the right choice for aggregates over disjoint time slices: events-per-minute counts, hourly throughput summaries, "how many observations did each sensor produce this minute." They are the wrong choice for any question of the form "find pairs of events within a small time gap" — the gap can straddle a tumbling-window boundary, and events in adjacent windows are never seen together by the operator. The conjunction-risk computation is exactly that kind of question, which is why M3 needs sliding windows rather than tumbling.
Memory cost: bounded by the largest single window's events. Once a window closes (Lesson 3), its state is freed. State per active window scales with the window size and the per-key event rate.
Hopping Windows
A generalization of tumbling. Fixed size, fixed step (sometimes called advance or slide), step typically smaller than size — which produces overlap. A 30-second window with 5-second step produces windows for [0, 30), [5, 35), [10, 40), ..., overlapping by 25 seconds each. Every event lands in size / step windows simultaneously (30 / 5 = 6 here).
Hopping windows are useful for "rolling aggregates emitted on a regular cadence" — every 5 seconds, emit the count of events in the last 30 seconds. The emit cadence (step) is decoupled from the aggregation breadth (size). Some streaming systems call this a sliding window; the terminology is unfortunately not standardized. We use hopping for fixed-size-fixed-step and reserve sliding for the per-event-driven shape below.
Memory cost: scales with size / step. Each event is held in size / step windows simultaneously. For a 30-second window with 1-second step, every event is in 30 windows. The per-event memory is small (a reference, not a copy), but the constant factor is real and shows up at scale.
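The size / step fan-out is easy to see in code. The sketch below enumerates the start times of every hopping window that contains a given event, using whole seconds since the epoch; `hopping_window_starts` is a hypothetical helper, and it assumes step > 0 and step ≤ size.

```rust
/// Start times (seconds since epoch) of every hopping window that
/// contains `event_secs`, for windows [start, start + size) whose
/// starts are multiples of `step`. Assumes step > 0 and step <= size.
pub fn hopping_window_starts(event_secs: u64, size: u64, step: u64) -> Vec<u64> {
    // Latest window start at or before the event.
    let last_start = (event_secs / step) * step;
    let mut starts = Vec::new();
    let mut start = last_start;
    loop {
        // A window contains the event only while start + size > event_secs.
        if start + size <= event_secs {
            break;
        }
        starts.push(start);
        if start < step {
            break; // no windows start before the epoch
        }
        start -= step;
    }
    starts.reverse();
    starts
}
```

An event at second 32 with size 30 and step 5 belongs to the six windows starting at 5, 10, 15, 20, 25, and 30. With step equal to size the function degenerates to tumbling assignment and returns exactly one window.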
Sliding Windows (Per-Event)
Every event creates its own window of [event_time - W, event_time] for some window size W. Maximum overlap. For a 30-second sliding window, an event at time T anchors a window covering T-30 to T, and the next event at T+1 anchors a window covering T-29 to T+1. The "matching set" for any given event is whatever other events fall within W of its event_time.
This is the right shape for the conjunction-risk question. The natural framing of the problem is "for each new observation, what other observations are within 30 seconds of event_time?" — exactly what a per-event sliding window computes. Production systems implement sliding windows as a deque per key, with the front evicted as new events arrive that push the deque's tail past the window boundary. The capstone operator builds this.
Memory cost: bounded by the most-active key's event rate × W. For a 30-second window over a key that sees 100 events per second, the deque holds 3000 entries — trivial. For a key that sees 100,000 events per second the deque is much larger; mitigations are key-level rate limits, sampling, or shorter windows. Module 4's load-shedding work develops these mitigations.
Session Windows
Variable-length, defined by gap of inactivity. A session window collects events while they keep arriving within a gap timeout of each other; when the gap is exceeded, the window closes. The classic use is "user web session" — collect events while the user is active, close the session when they go idle for 5 minutes. For SDA, the natural use is "satellite pass" — a satellite's beacons during a single overhead pass form a session, with the gap between passes (when the satellite is below the horizon) closing the session.
Session windows are the only window shape with state proportional to the active session's duration, not to a fixed configuration. A source that stays continuously active for 90 minutes produces a 90-minute session; a brief pass might produce a 10-second one. The gap timeout is what bounds the session: without it, a continuous event stream produces an unbounded session that never closes.
Memory cost: the dominant risk. A misconfigured gap timeout (too long) or a continuous source (a beacon that emits without gaps) produces unbounded growth. Production code adds a hard maximum session duration alongside the gap timeout: even if events keep arriving, force-close after N minutes. The double-bound makes session windows safe to deploy.
Window State and the Close Trigger
For every window class, the operator maintains state per active window: the events that have been assigned to it and any aggregations being computed (running counts, partial join results, etc.). The state grows as events arrive and shrinks when windows close. The close trigger — the condition under which the operator declares "this window is done; emit its result and discard its state" — is the single most important correctness decision in a windowed operator.
The naive close trigger is wall-clock time: a 5-second tumbling window starting at T closes when wall-clock time reaches T+5. This is wrong in event-time semantics because late events with sensor_timestamp < T+5 may still arrive after wall-clock time T+5, and a tumbling window closed by wall clock would miss them. The correct close trigger is the watermark, Lesson 3's full topic. The intuition we install here is structural: the windowed operator does not decide on its own when to close a window; it consumes a watermark signal that tells it "no event with event_time < X will arrive after this point," and it closes any window whose end ≤ X.
Until Lesson 3, this lesson's operator implementations declare windows closed via a placeholder mechanism (an explicit close_windows_up_to(T) method). The capstone operator wires that placeholder to a real watermark stream.
Code Examples
Tumbling Window Operator
The simplest implementation. A BTreeMap keyed on each window's end time (a SystemTime), with the observations accumulated for that window as the value. Each event is bucketed by its sensor_timestamp; close-up-to-watermark walks the map's prefix and emits.
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;
/// Tumbling-window operator. Emits aggregated state for every window
/// whose end has been crossed by the watermark. Close-trigger is supplied
/// externally via close_up_to(); Lesson 3 wires this to a watermark.
pub struct TumblingWindow {
/// Windows by end time. BTreeMap gives O(log N) range-prefix iteration
/// in close_up_to(), which is the hot path on watermark advance.
windows: BTreeMap<SystemTime, Vec<Observation>>,
window_size: Duration,
epoch: SystemTime,
output: mpsc::Sender<WindowResult>,
}
impl TumblingWindow {
pub fn new(
window_size: Duration,
epoch: SystemTime,
output: mpsc::Sender<WindowResult>,
) -> Self {
Self {
windows: BTreeMap::new(),
window_size,
epoch,
output,
}
}
/// Assign an observation to its window and accumulate it.
pub fn ingest(&mut self, obs: Observation) {
let offset = obs.sensor_timestamp.duration_since(self.epoch).unwrap_or_default();
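// Bucket in nanoseconds so sub-second window sizes are handled correctly.
let size_ns = self.window_size.as_nanos().max(1);
let window_idx = offset.as_nanos() / size_ns;
let window_end = self.epoch + Duration::from_nanos(((window_idx + 1) * size_ns) as u64);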
self.windows.entry(window_end).or_default().push(obs);
}
/// Close every window whose end ≤ watermark. Emit result, free state.
pub async fn close_up_to(&mut self, watermark: SystemTime) -> Result<()> {
// BTreeMap::split_off gives us O(log N) access to the prefix
// we want to drain. The remainder stays in self.windows.
let still_open = self.windows.split_off(&(watermark + Duration::from_nanos(1)));
let to_close = std::mem::replace(&mut self.windows, still_open);
for (window_end, observations) in to_close {
let result = WindowResult { window_end, count: observations.len() };
self.output.send(result).await
.map_err(|_| anyhow::anyhow!("window result downstream dropped"))?;
}
Ok(())
}
/// For diagnostics: number of windows currently held in state.
pub fn pending_window_count(&self) -> usize { self.windows.len() }
}
#[derive(Debug, Clone)]
pub struct WindowResult {
pub window_end: SystemTime,
pub count: usize,
}
The BTreeMap-keyed-on-window-end choice is load-bearing for this operator. The hot path on watermark advance is "find every window whose end ≤ watermark," which is a prefix range query; BTreeMap does this in O(log N) via split_off. A HashMap-keyed-on-window-id would require a full scan on every watermark advance — fine for a few windows, untenable when the operator holds hundreds. The cost of BTreeMap inserts (O(log N) instead of HashMap's O(1) amortized) is paid back many times over by the cheap range query on close. The pattern generalizes: for any operator whose hot path is "find everything ≤ X," choose a data structure that gives you that operation cheaply. A naive HashMap.iter().filter() is O(N) and quietly catastrophic at scale.
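The split_off behavior is worth seeing in isolation, because the off-by-one at the boundary is easy to get wrong. The sketch below uses plain u64 seconds as keys and string slices as payloads to stay small; `drain_closed` is a hypothetical helper that mirrors what close_up_to does with SystemTime keys.

```rust
use std::collections::BTreeMap;

/// Remove and return every entry whose key (window end) is <= watermark.
/// Keys are plain u64 seconds here to keep the sketch small.
pub fn drain_closed(
    windows: &mut BTreeMap<u64, Vec<&'static str>>,
    watermark: u64,
) -> BTreeMap<u64, Vec<&'static str>> {
    // split_off(&k) returns the entries with key >= k and leaves the
    // rest in place, so split at watermark + 1 to keep a window whose
    // end == watermark on the "closed" side.
    let still_open = windows.split_off(&(watermark + 1));
    std::mem::replace(windows, still_open)
}
```

With windows ending at 5, 10, and 15 and a watermark of 10, the call returns the windows ending at 5 and 10 and leaves the one ending at 15 in place. Calling it again with the same watermark returns an empty map, so repeated watermarks are harmless.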
Sliding Window Operator (Per-Event)
The conjunction-risk operator's foundation. A VecDeque<Observation> per key. New events are appended to the back; on each new event, evict from the front any events whose sensor_timestamp is older than event_time - W. The deque always represents the current sliding window for that key.
use std::collections::{HashMap, VecDeque};
use std::time::Duration;
/// Per-key sliding window operator. Each new event for a key emits
/// the set of other events within W of its event_time as a candidate
/// match list. The conjunction correlator (capstone project) builds
/// on this primitive.
pub struct SlidingWindow {
deques: HashMap<ObjectId, VecDeque<Observation>>,
window: Duration,
}
impl SlidingWindow {
pub fn new(window: Duration) -> Self {
Self { deques: HashMap::new(), window }
}
/// Process an observation. Evict expired entries, append the new
/// observation, return the current contents of the window for
/// downstream matching.
pub fn step(&mut self, obs: Observation) -> &VecDeque<Observation> {
let key = obs.target_object_id.clone();
let deque = self.deques.entry(key.clone()).or_default();
let cutoff = obs.sensor_timestamp - self.window;
// Evict from front while the front's event_time is older than
// the cutoff. Front is the oldest; the deque is event-time
// ordered by construction (we only append to the back, and
// each appended event has a sensor_timestamp ≥ all existing
// ones, modulo out-of-order arrival within the window).
while let Some(front) = deque.front() {
if front.sensor_timestamp < cutoff {
deque.pop_front();
} else {
break;
}
}
deque.push_back(obs);
// Return a reference; caller decides what matching to compute.
&*deque
}
/// Drop a key's deque entirely — used when the watermark indicates
/// no more events will arrive for this key (e.g., a satellite
/// has decayed). Not used in the steady-state hot path.
pub fn close_key(&mut self, key: &ObjectId) -> Option<VecDeque<Observation>> {
self.deques.remove(key)
}
pub fn pending_keys(&self) -> usize { self.deques.len() }
}
The eviction-on-each-event pattern keeps the deque sized to the current window for that key, with no separate "garbage collect old entries" pass needed. Two design choices are subtle. First, the eviction cutoff is obs.sensor_timestamp - self.window, not now - self.window; we are doing event-time windowing, so the cutoff is in event time, not wall-clock. Second, the deque is only approximately event-time ordered: it relies on observations for a single key arriving in roughly event-time order, which is true for a single-source single-key stream and approximately true for a multi-source stream (out-of-order is possible within tens of milliseconds). Strict event-time ordering is not required; the eviction loop simply advances while the front is older than the cutoff and stops. Note what the code does with a late event: an observation whose sensor_timestamp is older than the deque's current contents computes an older cutoff, typically evicts nothing, and is appended to the back, so the returned window can contain entries more than W away from it and the caller must check timestamps before matching. Deciding whether to drop, hold, or reprocess such events is the late event problem that Lesson 4 covers in full.
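Because the deque is only approximately event-time ordered, a caller may want to filter the returned contents by distance from the new observation before computing matches. The sketch below does that over bare u64 event times (seconds); `candidates_within` is a hypothetical helper, and a real caller would apply the same test to Observation::sensor_timestamp.

```rust
use std::collections::VecDeque;

/// Event times from a window's deque that lie within `w` seconds of `t`
/// in either direction. The result includes `t` itself if it is in the
/// deque, which is the case right after step() appends it.
pub fn candidates_within(window: &VecDeque<u64>, t: u64, w: u64) -> Vec<u64> {
    window
        .iter()
        .copied()
        .filter(|&other| other.abs_diff(t) <= w)
        .collect()
}
```

The filter is O(n) in the deque length, which is acceptable at the per-key sizes discussed above; it also makes the matching step independent of whatever order the deque happens to be in.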
A Session Window with Hard-Cap Safety
The ISL beacon's per-satellite pass. Events arrive while the satellite is overhead; the session closes when the satellite goes below the horizon (gap exceeds threshold) or, as a safety valve, when the session has been open for longer than a configured maximum.
use std::time::{Duration, Instant};
pub struct SessionWindow {
session_start: Instant,
last_event: Instant,
gap_timeout: Duration,
max_session: Duration,
events: Vec<Observation>,
}
impl SessionWindow {
pub fn new(first: Observation, gap_timeout: Duration, max_session: Duration) -> Self {
let now = Instant::now();
Self {
session_start: now,
last_event: now,
gap_timeout,
max_session,
events: vec![first],
}
}
/// Add an event to the session if the gap is acceptable. Returns
/// Err with the new event if the session has timed out and the
/// caller should start a fresh session.
pub fn try_add(&mut self, obs: Observation) -> Result<(), Observation> {
let now = Instant::now();
let gap_ok = now.duration_since(self.last_event) < self.gap_timeout;
let max_ok = now.duration_since(self.session_start) < self.max_session;
if gap_ok && max_ok {
self.last_event = now;
self.events.push(obs);
Ok(())
} else {
Err(obs)
}
}
pub fn close(self) -> Vec<Observation> { self.events }
pub fn open_duration(&self) -> Duration { self.last_event.duration_since(self.session_start) }
}
The double-bound pattern (gap timeout AND max session duration) is the safety property that makes session windows production-safe. A gap-only design hits unbounded memory the first time a source emits without any gap (a stuck-open beacon, a misconfigured emitter); the max-session bound is the safety valve that always fires. The cost of the bound is occasional artificial session breaks for legitimately long sessions, which is acceptable for SDA's beacon-pass use case (passes are bounded by orbital mechanics) and adjustable per use case. Session is the fourth canonical shape; the SDA capstone uses sliding windows (the third), but the pattern generalizes.
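The SessionWindow above measures both bounds with Instant::now(), which is arrival time rather than event time. A variant that measures the gap and the session length against each observation's event time might look like the sketch below. It is generic over the payload and takes the event time as an explicit argument; the name `EventTimeSession` and that calling convention are assumptions made to keep the example self-contained.

```rust
use std::time::{Duration, SystemTime};

/// Session window keyed on event time. `T` is the payload type
/// (an observation, in the lesson's terms).
pub struct EventTimeSession<T> {
    session_start: SystemTime,
    last_event: SystemTime,
    gap_timeout: Duration,
    max_session: Duration,
    events: Vec<T>,
}

impl<T> EventTimeSession<T> {
    pub fn new(first: T, event_time: SystemTime, gap_timeout: Duration, max_session: Duration) -> Self {
        Self {
            session_start: event_time,
            last_event: event_time,
            gap_timeout,
            max_session,
            events: vec![first],
        }
    }

    /// Add an event if it falls inside both bounds; otherwise hand it
    /// back so the caller can start a new session with it.
    pub fn try_add(&mut self, item: T, event_time: SystemTime) -> Result<(), T> {
        // An event at or before the reference point counts as zero gap.
        let gap = event_time.duration_since(self.last_event).unwrap_or_default();
        let age = event_time.duration_since(self.session_start).unwrap_or_default();
        if gap < self.gap_timeout && age < self.max_session {
            // Track the latest event time seen, not the latest arrival.
            self.last_event = self.last_event.max(event_time);
            self.events.push(item);
            Ok(())
        } else {
            Err(item)
        }
    }

    pub fn close(self) -> Vec<T> {
        self.events
    }
}
```

With this shape a replayed stream produces the same sessions as the live stream did, because neither bound depends on when the replay runs. An out-of-order event with an earlier event time is accepted without moving last_event backwards.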
Key Takeaways
- Window shape is a correctness decision, not a performance optimization. The four canonical shapes — tumbling (disjoint, fixed-size), hopping (overlapping, fixed-step), sliding (per-event), session (gap-defined) — each fit different question shapes. The conjunction-risk question fits sliding windows; aggregations fit tumbling; emit-on-cadence fits hopping; satellite-pass fits session.
- The window operator does not decide its own close trigger. It consumes a watermark signal from upstream that says "no event with event_time < X will arrive" and closes any window whose end ≤ X. This decouples the window's accumulation semantics from the close logic; Lesson 3 supplies the watermark.
- BTreeMap keyed on window end gives the close-up-to-watermark hot path O(log N) prefix iteration. Choose data structures whose hot-path operations are cheap; a HashMap with O(N) iteration is quietly catastrophic at scale.
- Sliding windows are bounded by the most-active key's event rate × W. Per-key deques with on-each-event front eviction keep memory always sized to the current window; no separate GC pass needed.
- Session windows need a hard maximum alongside the gap timeout. A gap-only design hits unbounded memory the first time a source streams continuously; the max-session bound is the safety valve.
Lesson 3 — Watermarks
Module: Data Pipelines — M03: Event Time and Watermarks Position: Lesson 3 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 ("Knowing When You're Ready to Receive Events" — the watermark/punctuation discussion); Streaming Data — Andrew Psaltis, Chapter 4 (Out-of-Order Events and the Watermark Mechanism); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Out-of-Sequence Events)
Context
Lesson 2's windowed operator builds up state per active window and waits for a close trigger: the signal that says "this window is done; emit its result and discard its state." We deferred the close trigger to this lesson. The naive answer is wall-clock time: a 5-second window starting at T closes when wall-clock time exceeds T+5. This is wrong in event-time semantics. A late-arriving observation with sensor_timestamp < T+5 may still arrive after wall-clock time T+5 because the optical archive's polling cadence delayed it. Closing on wall-clock loses that observation; the window's emitted result is wrong. The right close trigger has to be a per-event-time signal, not a per-wall-clock signal.
The mechanism is the watermark: a declaration, made by a source or computed by an operator, that "no event with event_time less than X will arrive at this point in the pipeline after this watermark." A watermark is a guarantee, not an estimate. With a watermark, the windowed operator can close any window whose end ≤ the watermark's value and be confident no more relevant events will arrive. Without a watermark, the operator either holds windows forever (correct but useless, since no result is ever emitted) or closes them too early on a wall-clock bound (fast but wrong). The watermark is the necessary third piece of event-time semantics, alongside event-time-on-the-envelope (Lesson 1) and event-time-windowing (Lesson 2).
This lesson develops watermarks in three pieces: what a watermark is precisely (and the perfect-vs-heuristic distinction), how sources generate watermarks for their own observations, and how operators propagate watermarks through the pipeline (the min-of-inputs rule that the rest of the track depends on). The forward references stay tight. Lesson 4 handles late events that arrive after a watermark has already declared their window closed. The capstone wires watermarks from sources through normalize through the windowed correlator. Module 6 surfaces watermark progress as the master observability metric: "the pipeline is currently complete through event-time T."
Core Concepts
A Watermark, Defined Precisely
A watermark is a value of the form Watermark(t) whose meaning is: no event with event_time < t will arrive after this point. The watermark is a guarantee, not a hope or an estimate. When an operator receives a watermark, it can act on that guarantee — close windows, emit results, evict state — confident that the guarantee will hold.
A watermark's value monotonically advances: each new watermark from a given source has a value greater than or equal to the previous one. (Equality is permitted but uninteresting; in practice watermarks strictly advance.) A source that emits a non-monotonic watermark has violated the protocol; the operator may treat this as a bug and either ignore the regressed value or fail loudly. The orchestrator's structured-logging discipline applies — log a structured event for every regressed watermark, alert on persistent regression.
The watermark is a separate item from observations on the same channel. The convention this track uses is to interleave two kinds of items on the source-to-operator channels: data items (Observation) and control items (Watermark(SystemTime)). Operators consume both, processing data items as they arrive and updating their watermark state on watermark items. The alternative — a separate side channel for watermarks — exists in some streaming frameworks but introduces its own coordination problems (a fast data channel out-pacing a slow watermark channel produces apparent regressions). In-band interleaving keeps the ordering coherent.
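The monotonicity rule translates into a small piece of operator-side state. The sketch below takes the "ignore the regressed value" option and counts regressions so they can be logged and alerted on; `WatermarkState` and its method names are hypothetical.

```rust
use std::time::SystemTime;

/// Operator-side watermark state for a single input.
#[derive(Debug, Default)]
pub struct WatermarkState {
    current: Option<SystemTime>,
    regressions: u64,
}

impl WatermarkState {
    /// Apply an incoming watermark. Returns true if the watermark
    /// advanced or held steady. Returns false and counts a regression
    /// if it moved backwards, leaving the stored value unchanged.
    pub fn observe(&mut self, incoming: SystemTime) -> bool {
        match self.current {
            Some(prev) if incoming < prev => {
                self.regressions += 1;
                false
            }
            _ => {
                self.current = Some(incoming);
                true
            }
        }
    }

    pub fn current(&self) -> Option<SystemTime> {
        self.current
    }

    pub fn regressions(&self) -> u64 {
        self.regressions
    }
}
```

An operator would call observe() for each watermark item it reads from its input channel and emit a structured log event whenever the call returns false. Failing loudly instead is a one-line change: return an error in the regression arm.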
Perfect vs Heuristic Watermarks
A perfect watermark is one where the source can prove the guarantee — there is some monotonic property of the source that lets it declare with certainty when no earlier event will arrive. The clearest example is a single-stream source whose events arrive in event-time order: every event is a watermark, because the source can emit "watermark = this event's event_time" and be sure no earlier events will follow.
Most production sources cannot offer perfect watermarks. They offer heuristic watermarks: an estimate of the maximum lateness an event can have, used to compute a bound. The source picks a max-lateness estimate (call it M) and periodically emits a watermark that trails its current position in event time by M (Generation at Sources below gives the exact formula). The estimate is documented per-source based on the source's known properties. If M is too small, late events arrive past the watermark: the watermark's guarantee was wrong, and the late event must be dropped or held in allowed-lateness state (Lesson 4). If M is too large, the watermark advances too slowly and downstream windows close later than necessary, increasing pipeline latency.
For SDA, the per-source max-lateness values are:
| Source | Max lateness | Reasoning |
|---|---|---|
| Radar (UDP) | 100 ms | GPS-locked clocks; fiber path round-trip; no buffering |
| Optical (HTTP poll) | 30 s | Polling cadence is 30 s; an event recorded just after a poll waits one full cycle |
| ISL beacon (TCP) | 10 s | Onboard buffering before downlink; downlink-to-ground propagation |
The estimates are conservative: real lateness is typically much less, but the bound covers the tail of the lateness distribution. Lesson 4 covers what happens when the estimate is wrong.
Generation at Sources
A source emits watermarks alongside its observations. The pattern is the same for each source kind, parameterized by the source's max-lateness estimate.
pub enum SourceItem {
Observation(Observation),
Watermark(SystemTime),
}
Each source's emit loop interleaves Observation items with periodic Watermark items. The frequency of watermark emission is operationally important: too rarely, downstream windows are held longer than necessary because the operator does not know the watermark has advanced; too frequently, the watermark items add bandwidth overhead. A watermark every 1 second of wall-clock time is a reasonable starting point for SDA — well below the optical source's 30-second polling cadence, well above the per-event rate.
The source-side watermark value is max(observed event_times) - max_lateness. A source that has just produced an observation with sensor_timestamp = T emits the watermark T - M (where M is its max-lateness estimate). The watermark trails the source's most recent observed event-time by exactly M, which is the guarantee shape the watermark protocol expects.
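As a synchronous sketch of that computation, the following keeps the running maximum of observed event times and derives the watermark from it. `HeuristicWatermark` is a hypothetical helper name; the three constants are taken from the max-lateness table above.

```rust
use std::time::{Duration, SystemTime};

/// Max-lateness estimates from the SDA table.
pub const RADAR_MAX_LATENESS: Duration = Duration::from_millis(100);
pub const OPTICAL_MAX_LATENESS: Duration = Duration::from_secs(30);
pub const ISL_MAX_LATENESS: Duration = Duration::from_secs(10);

/// Hypothetical helper: tracks the maximum observed event time for one
/// source and derives the heuristic watermark from it.
pub struct HeuristicWatermark {
    max_lateness: Duration,
    max_observed: Option<SystemTime>,
}

impl HeuristicWatermark {
    pub fn new(max_lateness: Duration) -> Self {
        Self { max_lateness, max_observed: None }
    }

    /// Record an observation's event time. Out-of-order events never pull
    /// the maximum backwards, which keeps the watermark monotonic.
    pub fn observe(&mut self, event_time: SystemTime) {
        let newer = self.max_observed.map_or(true, |m| event_time > m);
        if newer {
            self.max_observed = Some(event_time);
        }
    }

    /// Watermark = max observed event time minus M. Returns None until the
    /// source has produced at least one observation.
    pub fn watermark(&self) -> Option<SystemTime> {
        self.max_observed.map(|m| {
            m.checked_sub(self.max_lateness)
                .unwrap_or(SystemTime::UNIX_EPOCH)
        })
    }
}
```

Returning `None` before the first observation is a design choice of this sketch; it avoids emitting a meaningless epoch-valued watermark at startup.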
Propagation Through Operators
When an operator has multiple inputs (a fan-in normalize, a join, a correlation), it must compute its output watermark from its input watermarks. The rule is the minimum: the output watermark is the minimum of the most recent watermark from each input. The reason for min, not max: we can only guarantee what the worst upstream guarantees. If the radar input has watermark T and the optical input has watermark T-30, we cannot guarantee that no event with event_time < T will arrive — because the optical input might still produce one. The strongest claim we can make is "no event with event_time < T-30 will arrive," so that is the output watermark.
The min-rule has a counterintuitive consequence: the slowest source dominates the downstream watermark. A pipeline with three sources at watermarks T, T+10, and T+20 has a downstream watermark of T — the slow source's value. Improving any of the faster sources does nothing for the downstream watermark; only improving the slowest source does. This is the operational property that makes per-source max-lateness estimates so important: tightening any one estimate lowers that source's watermark trail-time, which lowers the downstream watermark trail-time only if that source was the dominant one.
Implementation in code: the operator tracks the most recent watermark per input channel, recomputes the minimum on every watermark item, and emits a new output watermark when the minimum advances.
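The bookkeeping can be isolated from the channel plumbing as a small state machine. This is a sketch under assumptions: `FanInWatermark` is a hypothetical name, and ignoring a per-input regression (rather than failing) is one possible policy.

```rust
use std::time::SystemTime;

/// Hypothetical min-of-inputs tracker for a fan-in operator.
pub struct FanInWatermark {
    inputs: Vec<Option<SystemTime>>,
    last_emitted: Option<SystemTime>,
}

impl FanInWatermark {
    pub fn new(input_count: usize) -> Self {
        Self { inputs: vec![None; input_count], last_emitted: None }
    }

    /// Record a watermark from one input. Returns Some(w) only when the
    /// output watermark strictly advances to w; otherwise returns None.
    pub fn update(&mut self, input_idx: usize, wm: SystemTime) -> Option<SystemTime> {
        // An out-of-range index is treated as "nothing to emit".
        let slot = self.inputs.get_mut(input_idx)?;
        // Keep each input monotonic: a regressed value is ignored.
        if slot.map_or(true, |prev| wm > prev) {
            *slot = Some(wm);
        }
        // Collecting into Option<Vec<_>> yields None while any input has
        // not yet reported, so no output watermark is produced until then.
        let all: Option<Vec<SystemTime>> = self.inputs.iter().copied().collect();
        let min = all?.into_iter().min()?;
        if self.last_emitted.map_or(true, |prev| min > prev) {
            self.last_emitted = Some(min);
            Some(min)
        } else {
            None
        }
    }
}
```

With three inputs at T, T+10, and T+20, advancing the T+20 input returns `None`, while advancing the T input returns the new minimum: the slowest input is the only one that moves the output.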
The Aggressive-vs-Conservative Tradeoff
The watermark designer's main lever is the per-source max-lateness M. Aggressive M (small) → fast watermark → fast window close → low pipeline latency, but late events arriving past the watermark are dropped or pushed into allowed-lateness state. Conservative M (large) → slow watermark → slow window close → higher pipeline latency, but few late events are missed.
The right setting is operational, not theoretical. For the SDA pipeline's 30-second conjunction-detection SLA, the source-side max-lateness values above produce a downstream watermark that trails real time by ~30 seconds (dominated by the optical source). That gives windowed correlators 30 seconds to close before the SLA is at risk. The aggressive-vs-conservative tradeoff is per-pipeline; we set defaults that match SDA and document them in the per-source code.
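The "dominated by the optical source" arithmetic can be checked with a short sketch. It assumes, as a simplification, that every source has observed roughly the same event-time frontier T, so that min(T - M_i) equals T minus the largest M_i; emission cadence and network delay are ignored. `downstream_trail` is a hypothetical helper name.

```rust
use std::time::Duration;

/// Approximate steady-state trail of the fan-in watermark behind the newest
/// event time, assuming all sources share the same event-time frontier.
/// Under that assumption the trail is the largest per-source max-lateness.
pub fn downstream_trail(per_source_max_lateness: &[Duration]) -> Option<Duration> {
    per_source_max_lateness.iter().copied().max()
}
```

For the SDA values (100 ms, 30 s, 10 s) this gives 30 s. Tightening the radar or ISL estimate leaves the result unchanged; only tightening the optical estimate shortens the trail.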
Code Examples
A Source That Emits Both Observations and Watermarks
The radar UDP source from M1 emits only Observations. We extend it to interleave watermark items on a wall-clock cadence. The pattern is the same for every source; the per-source max-lateness M is the only parameter that changes.
use std::time::{Duration, Instant, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;
use tokio::time;
pub enum SourceItem {
Observation(Observation),
Watermark(SystemTime),
}
/// Wraps an existing source and interleaves periodic watermarks based
/// on observed event-time and the source's documented max-lateness.
pub async fn run_source_with_watermarks<S>(
mut source: S,
output: mpsc::Sender<SourceItem>,
max_lateness: Duration,
watermark_interval: Duration,
) -> Result<()>
where
S: ObservationSource,
{
let mut last_watermark_emit = Instant::now();
let mut max_observed_event_time = SystemTime::UNIX_EPOCH;
loop {
match source.next().await? {
Some(obs) => {
if obs.sensor_timestamp > max_observed_event_time {
max_observed_event_time = obs.sensor_timestamp;
}
output.send(SourceItem::Observation(obs)).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
// Emit a watermark once the wall-clock cadence has elapsed. Note that
// this check only runs after an observation; idle-gap emission is elided.
if last_watermark_emit.elapsed() >= watermark_interval {
let wm = max_observed_event_time
.checked_sub(max_lateness)
.unwrap_or(SystemTime::UNIX_EPOCH);
output.send(SourceItem::Watermark(wm)).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
last_watermark_emit = Instant::now();
}
}
None => return Ok(()),
}
}
}
The watermark value is max_observed_event_time - max_lateness: the most recent event-time the source has seen, minus the source's documented worst-case lateness. The watermark monotonically advances because max_observed_event_time does and max_lateness is constant. The cadence (watermark_interval) is measured in wall-clock time rather than per event, which bounds watermark traffic at high event rates. As written, however, the cadence check runs only when an observation arrives, so a silent source emits no watermarks at all. A real production source also emits watermarks during idle gaps via a tokio::time::interval, so that a downstream operator's windows do not sit idle waiting for events that are not coming; we elide that for clarity but the capstone implementation includes it.
A Fan-In Operator That Computes Min-of-Inputs
The normalize operator from M1 fanned three sources into one channel. With watermarks, the fan-in must compute its output watermark as the minimum of the most recent watermarks from each input. The implementation tracks per-input watermarks in a Vec<Option<SystemTime>> and recomputes the min on each watermark item.
use std::time::SystemTime;
use anyhow::Result;
use tokio::sync::mpsc;
/// Fan-in normalize operator that consumes from N upstream channels
/// (each carrying SourceItem) and emits a single SourceItem stream
/// downstream with a properly-propagated min-of-inputs watermark.
pub async fn normalize_fan_in(
mut inputs: Vec<mpsc::Receiver<SourceItem>>,
output: mpsc::Sender<SourceItem>,
) -> Result<()> {
use tokio::select;
let n = inputs.len();
let mut input_watermarks: Vec<Option<SystemTime>> = vec![None; n];
let mut last_emitted_watermark: Option<SystemTime> = None;
// Simplified select: real implementation uses select_all from
// futures::future for arbitrary N. Here we sketch the per-input
// handling for clarity.
for input_idx in 0..n {
// ... in a real implementation, all inputs are polled
// concurrently via select_all; this loop is illustrative.
while let Some(item) = inputs[input_idx].recv().await {
match item {
SourceItem::Observation(obs) => {
let normalized = normalize(obs);
output.send(SourceItem::Observation(normalized)).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
}
SourceItem::Watermark(wm) => {
input_watermarks[input_idx] = Some(wm);
// Compute min only when every input has reported at
// least one watermark. Until then, the operator's
// output watermark is undefined.
if input_watermarks.iter().all(|w| w.is_some()) {
let new_wm = input_watermarks
.iter()
.map(|w| w.unwrap())
.min()
.unwrap();
if Some(new_wm) > last_emitted_watermark {
output.send(SourceItem::Watermark(new_wm)).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
last_emitted_watermark = Some(new_wm);
}
}
}
}
}
}
Ok(())
}
Three subtle points. First, the output watermark is undefined until every input has reported at least one watermark. A fan-in with three sources where one source has not yet sent a watermark cannot propagate a min: there is no lower bound on what that source's watermark might be once it arrives, so any min computed without it is unsafe. The fix is structural: hold downstream emission until every input is heard from. Second, the operator emits a new output watermark only when the min strictly advances. Re-emitting the same watermark would be correct but wasteful; the strict-advance check throttles watermark traffic to what is operationally meaningful. Third, the per-input bookkeeping is intentionally simple: a Vec<Option<SystemTime>> indexed by input position, no fancier structure needed. Production code that joins many inputs uses the same pattern with Vec lengths in the dozens; the constant-factor cost of the .min() recomputation on each watermark item is negligible at any realistic input count.
Wiring Watermarks Into the Tumbling Window Operator
Lesson 2's TumblingWindow::close_up_to(watermark) becomes the consumer of watermark items. The operator no longer has its own close logic; it reacts to the watermark stream the upstream produced.
use std::time::SystemTime;
use anyhow::Result;
use tokio::sync::mpsc;
/// Drive a TumblingWindow operator from a single SourceItem stream
/// that interleaves Observations with Watermarks. Observations are
/// ingested into the window state; watermarks trigger close-up-to.
pub async fn run_tumbling_with_watermarks(
mut window_op: TumblingWindow,
mut input: mpsc::Receiver<SourceItem>,
) -> Result<()> {
while let Some(item) = input.recv().await {
match item {
SourceItem::Observation(obs) => {
window_op.ingest(obs);
}
SourceItem::Watermark(wm) => {
window_op.close_up_to(wm).await?;
}
}
}
Ok(())
}
The pattern is the structural property the lesson promised. The operator does not decide its own close; it consumes a watermark stream that supplies the close trigger. The same shape applies to sliding-window operators (which evict per-key state on watermark advance), session-window operators (which close sessions whose session_end + gap is past the watermark), and any other event-time-windowed operator. The watermark is the universal close trigger.
Key Takeaways
- A watermark is a per-event-time guarantee: Watermark(t) means "no event with event_time < t will arrive after this point." It is the only correct close trigger for event-time windows; wall-clock-based close drops late events whose event_time precedes the wall-clock cutoff.
- Heuristic watermarks (the production default) are computed as max_observed_event_time - max_lateness, where max_lateness is a per-source documented bound. Tighter max_lateness → faster watermark → lower pipeline latency, at the cost of more events arriving past the watermark.
- The min-of-inputs rule propagates watermarks through fan-in operators: the output watermark is the minimum of the most recent watermarks from each input. We can only guarantee what the slowest upstream guarantees. The slowest source dominates the downstream watermark.
- Watermarks are interleaved with data items on the same channel (enum SourceItem { Observation(_), Watermark(_) }). In-band ordering keeps the watermark's relationship to the data items it bounds coherent; a separate side-channel can produce apparent regressions.
- The windowed operator does not decide its own close. It consumes a watermark stream and closes any window whose end ≤ the watermark. This decoupling is what makes windowed operators composable across the pipeline and re-usable across window shapes.
Lesson 4 — Late Data
Module: Data Pipelines — M03: Event Time and Watermarks Position: Lesson 4 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Out-of-Order Events and corrections); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Late-Arriving Data); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Out-of-Sequence Events and Reprocessing)
Context
A watermark is a guarantee, but the guarantee is built on an estimate — the per-source max_lateness. The estimate is calibrated to cover the typical worst case, not every possible case. Eventually a source's actual lateness exceeds its estimate. The optical archive's polling cadence runs slow because the partner's API is degraded; an observation arrives 35 seconds after its event time when max_lateness was set to 30. The radar's fiber path takes a 200-ms detour through a ground-station relay; an observation arrives 250 ms after its event time when max_lateness was 100 ms. In every case, the observation arrives after the watermark for its window has already passed. The window has already been closed and emitted. The observation is late.
Three things can happen with a late event:
- Drop it. Cheap, lossy, the default in many systems.
- Accumulate it into a re-opened window. Hold the window's state past its watermark close, accept events into it for a bounded allowed lateness period, and re-emit the window's result on each accepted late event. Medium cost; requires the downstream to handle re-emissions.
- Retract and re-emit. Emit a "negation" of the previously-emitted result, then emit a corrected one. Most expensive; requires the downstream to handle retractions, and gives the strongest correctness guarantee.
The choice is operational. For SDA's conjunction-alert pipeline, the cost of a missed alert (a real conjunction not flagged) is collision risk; the cost of a phantom alert (a false conjunction emitted then never corrected) is a needless avoidance maneuver burning satellite fuel. The right choice depends on which cost is being weighed against what. For high-rate dashboards (events-per-second, throughput summaries), drop is fine — individual late events do not change the aggregate. For batched analytics where eventual completeness matters, accumulate-with-bound is the right shape. For human-actionable alerts, retract-and-correct is necessary because a phantom alert that is never withdrawn produces lasting downstream effects. This lesson covers all three, develops the implementation patterns, and closes the module by tying back to the orchestrator's at-least-once-plus-idempotency frame from Module 2.
Core Concepts
The Three Strategies
Drop. When a late event arrives — its sensor_timestamp falls within an already-closed window — discard it. The window's emitted result remains the canonical answer. The lost event contributes nothing. Cost: minimal. Cost paid: events that should have contributed to the result do not. Best for high-rate streams where individual events are statistically insignificant.
Accumulate (Allowed Lateness). Each window's state is held for a bounded allowed_lateness period past its watermark-driven close. Late events that arrive while the watermark is within [window_end, window_end + allowed_lateness] (measured in event time, not wall clock) are accepted into the window's state, and the window's result is re-emitted with the additional event included. After the allowed-lateness period expires (the watermark advances past window_end + allowed_lateness), the window's state is finally evicted; events arriving past that point are dropped. Cost: state held longer; downstream must handle re-emission of previously-emitted results. Best for analytics-style pipelines where some delay is acceptable.
Retract. When a late event arrives within a previously-emitted window's lateness period, emit two records: a retraction of the previously-emitted result, then an insertion of the corrected result. The downstream consumer treats retraction-then-insertion as an atomic correction. Cost: highest — requires retraction-aware downstream. Best for human-actionable outputs where a wrong result that is never corrected is worse than a delayed-but-correct one.
The three are a progression: drop is a special case of accumulate-with-zero-allowed-lateness; accumulate is a special case of retract that does not bother distinguishing the previous result from the next; retract is the most general. Most production pipelines use a mix: drop for non-critical metrics, accumulate for batched reporting, retract for human-actionable alerts.
Allowed Lateness, Concretely
The accumulate strategy parameterizes allowed_lateness per operator. A windowed correlator with allowed_lateness = 5s holds each closed window's state for 5 seconds (in event time, not wall clock) past its watermark close. In code, this means: when the watermark advances past window_end, the operator emits the window's result but does not free the window's state; the state stays in a retained state. Late events arriving while the watermark is in [window_end, window_end + allowed_lateness] are added to the retained state and trigger a re-emission. When the watermark advances past window_end + allowed_lateness, the state is finally evicted.
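The lifecycle described above reduces to a three-way classification of a window against the current watermark. The sketch below uses hypothetical names (`WindowPhase`, `window_phase`) and assumes the boundary convention "a window closes when window_end ≤ watermark, and is evicted when window_end + allowed_lateness ≤ watermark".

```rust
use std::time::{Duration, SystemTime};

/// Where a window sits in its lifecycle relative to the watermark.
#[derive(Debug, PartialEq, Eq)]
pub enum WindowPhase {
    /// Watermark has not reached window_end: still accumulating.
    Active,
    /// Result emitted once; state held so late events can trigger re-emission.
    Retained,
    /// Lateness budget exhausted: state freed, further late events dropped.
    Evicted,
}

/// Classify a window given its end, the operator's allowed lateness, and
/// the current watermark (all in event time).
pub fn window_phase(
    window_end: SystemTime,
    allowed_lateness: Duration,
    watermark: SystemTime,
) -> WindowPhase {
    if watermark < window_end {
        WindowPhase::Active
    } else if watermark < window_end + allowed_lateness {
        WindowPhase::Retained
    } else {
        WindowPhase::Evicted
    }
}
```

An ingest path could consult a check like this before creating state for an observation, so that events belonging to an already-evicted window are dropped rather than opening a fresh window.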
The memory cost of allowed lateness scales as (allowed_lateness / window_size) × window_state. A 30-second window with 5-second allowed lateness holds ~1.17x the steady-state window memory; a 30-second window with 30-second allowed lateness doubles it. The per-key cardinality multiplies through the same ratio. For SDA's correlator at typical scales (tens of thousands of orbital objects, 30-second windows, 5-second allowed lateness), the additional memory is bounded and tractable.
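The ratio can be computed directly. This sketch assumes window state accumulates roughly uniformly over event time, so that retained state adds allowed_lateness / window_size on top of one window's worth of steady-state memory; `retained_memory_factor` is a hypothetical helper name.

```rust
use std::time::Duration;

/// Total window memory as a multiple of the steady-state (no allowed
/// lateness) figure: 1 + allowed_lateness / window_size.
/// Returns None for a zero-length window, where the ratio is undefined.
pub fn retained_memory_factor(window_size: Duration, allowed_lateness: Duration) -> Option<f64> {
    if window_size.is_zero() {
        return None;
    }
    Some(1.0 + allowed_lateness.as_secs_f64() / window_size.as_secs_f64())
}
```

For a 30-second window, 5 seconds of allowed lateness gives 1 + 5/30 ≈ 1.17, and 30 seconds gives exactly 2.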
The operational tuning is per-pipeline. Aggressive allowed-lateness (small, e.g. 1s) → low memory cost, fast window finalization → late events past 1s are silently lost. Conservative allowed-lateness (large, e.g. 60s) → higher memory cost, slow finalization → almost all late events captured. The right choice depends on the lateness distribution of the slowest source. SDA's setting of 5s on top of optical's 30s max_lateness covers the long tail of optical archive delays without doubling memory.
Retractions
A retraction is a downstream-visible "undo." For a window that previously emitted result R1, the operator emits a retraction of R1 followed by an insertion of R2 (the corrected result). The downstream subscriber sees the pair as: invalidate the previously-stated R1; the correct value is now R2.
The implementation requires a sequence number on each emit so the downstream can match retractions to the correct previous emit. The convention this track uses: each window has a (window_id, sequence) pair on every emission, with sequence starting at 0 for the first emit and incrementing on each correction. The downstream uses (window_id, sequence) as the primary key with last-write-wins semantics — at any point in time, the latest sequence for a given window_id is the canonical answer.
The retraction emission shape:
pub enum WindowEmit {
Insert { window_id: WindowId, sequence: u32, result: WindowResult },
Retract { window_id: WindowId, sequence: u32 },
}
A retract-then-insert pair on window 17 looks like: Retract { window_id: 17, sequence: 1 } followed by Insert { window_id: 17, sequence: 2, result: corrected }. The downstream's stored state, keyed on window_id, is updated to sequence 2's result. The sequence number prevents a delayed retraction from being applied to a newer result emitted in the meantime: if the retraction is for sequence 1 but the downstream has already stored sequence 2, the retract is a no-op.
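An in-memory model of the downstream makes the sequence rule concrete. This is a sketch, not the track's sink: `Emit` and `DownstreamStore` are hypothetical stand-ins, with a `u64` window id and a `String` result in place of WindowId and WindowResult.

```rust
use std::collections::HashMap;

/// Simplified stand-in for WindowEmit (u64 id, String result).
#[derive(Debug, Clone, PartialEq)]
pub enum Emit {
    Insert { window_id: u64, sequence: u32, result: String },
    Retract { window_id: u64, sequence: u32 },
}

/// Hypothetical downstream state: one row per window_id, holding the
/// highest sequence seen and its result.
#[derive(Default)]
pub struct DownstreamStore {
    rows: HashMap<u64, (u32, String)>,
}

impl DownstreamStore {
    pub fn apply(&mut self, emit: Emit) {
        match emit {
            Emit::Insert { window_id, sequence, result } => {
                // Strictly-greater comparison: duplicates and stale
                // inserts leave the stored row untouched.
                let is_newer = self
                    .rows
                    .get(&window_id)
                    .map_or(true, |(stored, _)| sequence > *stored);
                if is_newer {
                    self.rows.insert(window_id, (sequence, result));
                }
            }
            Emit::Retract { window_id, sequence } => {
                // A retraction only applies to the exact sequence it names;
                // a delayed retraction of an older sequence is a no-op.
                let matches_stored = self
                    .rows
                    .get(&window_id)
                    .map_or(false, |(stored, _)| *stored == sequence);
                if matches_stored {
                    self.rows.remove(&window_id);
                }
            }
        }
    }

    pub fn get(&self, window_id: u64) -> Option<&(u32, String)> {
        self.rows.get(&window_id)
    }
}
```

Replaying the window 17 example, a redelivered `Retract` for sequence 1 arriving after the sequence 2 insert leaves the corrected result in place.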
The Idempotency Requirement Downstream
Retractions only work when the downstream is idempotent on (window_id, sequence). A SQL sink with ON CONFLICT (window_id) DO UPDATE SET ... WHERE EXCLUDED.sequence > stored.sequence is idempotent. A Kafka topic where the consumer dedups on (window_id, sequence) is idempotent. An HTTP webhook that does not respect any keying is not idempotent — duplicate retractions or out-of-order retract/insert pairs produce wrong final state at the subscriber.
The pattern composes with Module 2's at-least-once-plus-idempotency frame: the pipeline emits at least once (with retries on transient failures) and the downstream is idempotent (on (window_id, sequence)), giving effective exactly-once delivery of the corrected stream. Module 5 covers the full machinery for cross-pipeline exactly-once, including transactional Kafka producers; this lesson establishes the pattern at the operator level.
Choosing the Strategy
A decision table for SDA's pipeline:
| Output | Strategy | Reasoning |
|---|---|---|
| events_per_minute dashboard counter | Drop | Late events negligible at this granularity; dashboard precision is loose |
| conjunction_risk_summary analytics emit | Accumulate (5s) | Eventual completeness matters; some delay acceptable |
| conjunction_alert to subscriber | Retract | Phantom alerts cause real-world action (avoidance maneuvers); must be correctable |
| audit_log of every observation | Drop | The audit log is the input record, not a derived computation; latency is irrelevant |
The decision is per-output, not per-pipeline. The same windowed correlator can produce different outputs with different strategies — a retract-emitting alert stream alongside a drop-emitting metrics stream. The implementation factors out the strategy as a per-output configuration that the operator dispatches on.
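One way to express the per-output configuration is an enum the operator dispatches on when a late event arrives. Everything here is a hypothetical sketch: `LatenessStrategy`, `LateAction`, `on_late_event`, and `strategy_for` are illustrative names, and the 5-second allowed lateness on the alert stream is an assumption borrowed from the accumulate row of the table.

```rust
use std::time::Duration;

/// Per-output late-data strategy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LatenessStrategy {
    Drop,
    Accumulate { allowed_lateness: Duration },
    Retract { allowed_lateness: Duration },
}

/// What the operator does with one late event for one output.
#[derive(Debug, PartialEq, Eq)]
pub enum LateAction {
    Discard,
    ReEmit,
    RetractThenInsert,
}

/// `lateness` is how far the watermark has moved past the window's end
/// (in event time) when the late event arrives.
pub fn on_late_event(strategy: LatenessStrategy, lateness: Duration) -> LateAction {
    match strategy {
        LatenessStrategy::Accumulate { allowed_lateness } if lateness < allowed_lateness => {
            LateAction::ReEmit
        }
        LatenessStrategy::Retract { allowed_lateness } if lateness < allowed_lateness => {
            LateAction::RetractThenInsert
        }
        // Drop, or lateness budget exhausted under either other strategy.
        _ => LateAction::Discard,
    }
}

/// Strategy lookup mirroring the decision table; output names are the
/// table's, used here as plain string keys for illustration.
pub fn strategy_for(output: &str) -> Option<LatenessStrategy> {
    let five_s = Duration::from_secs(5);
    match output {
        "events_per_minute" | "audit_log" => Some(LatenessStrategy::Drop),
        "conjunction_risk_summary" => {
            Some(LatenessStrategy::Accumulate { allowed_lateness: five_s })
        }
        // Assumed 5s budget for alerts; the table does not state one.
        "conjunction_alert" => Some(LatenessStrategy::Retract { allowed_lateness: five_s }),
        _ => None,
    }
}
```

Because the strategy is data rather than code, the same correlator can drive an alert stream and a metrics stream from one window state and differ only in how each output reacts to a late event.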
Code Examples
Allowed-Lateness Window Operator
The L2 tumbling-window operator extended with allowed-lateness retention. Window state is held in two tiers: active (window not yet closed) and retained (window closed by watermark, held for late events for allowed_lateness).
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;
pub struct AllowedLatenessWindow {
/// Active windows: state still being accumulated, watermark hasn't passed.
active: BTreeMap<SystemTime, WindowState>,
/// Retained windows: emitted once, held for allowed_lateness in case
/// late events arrive. Will be re-emitted on each accepted late event.
retained: BTreeMap<SystemTime, WindowState>,
window_size: Duration,
allowed_lateness: Duration,
epoch: SystemTime,
output: mpsc::Sender<WindowEmit>,
}
#[derive(Debug, Clone)]
struct WindowState {
observations: Vec<Observation>,
sequence: u32,
}
impl AllowedLatenessWindow {
pub fn new(
window_size: Duration,
allowed_lateness: Duration,
epoch: SystemTime,
output: mpsc::Sender<WindowEmit>,
) -> Self {
Self {
active: BTreeMap::new(),
retained: BTreeMap::new(),
window_size,
allowed_lateness,
epoch,
output,
}
}
/// Ingest an observation, dispatching on whether it lands in an
/// active window or a retained (late) window.
pub async fn ingest(&mut self, obs: Observation) -> Result<()> {
let window_end = self.window_end_for(obs.sensor_timestamp);
if let Some(state) = self.active.get_mut(&window_end) {
state.observations.push(obs);
return Ok(());
}
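if let Some(state) = self.retained.get_mut(&window_end) {
// Late event into a retained window: accept and re-emit.
state.observations.push(obs);
state.sequence += 1;
// Snapshot the state before calling emit: `state` holds a mutable
// borrow of self.retained, and emit(&self) needs a shared borrow of
// self, so the two borrows must not overlap.
let snapshot = state.clone();
self.emit(window_end, snapshot, EmitKind::Insert).await?;
return Ok(());
}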
// Fresh window.
self.active.insert(
window_end,
WindowState { observations: vec![obs], sequence: 0 },
);
Ok(())
}
/// Watermark advance: close every active window whose end ≤ watermark
/// (move to retained); evict every retained window whose end +
/// allowed_lateness ≤ watermark (final eviction, no more late events
/// accepted for this window).
pub async fn on_watermark(&mut self, watermark: SystemTime) -> Result<()> {
// Move closed-by-watermark windows from active to retained, emitting
// the initial result on the way through.
let still_active = self.active.split_off(&(watermark + Duration::from_nanos(1)));
let to_close = std::mem::replace(&mut self.active, still_active);
for (window_end, state) in to_close {
self.emit(window_end, state.clone(), EmitKind::Insert).await?;
self.retained.insert(window_end, state);
}
// Evict retained windows whose lateness budget is exhausted.
let retain_cutoff = watermark.checked_sub(self.allowed_lateness)
.unwrap_or(SystemTime::UNIX_EPOCH);
let still_retained = self.retained.split_off(&(retain_cutoff + Duration::from_nanos(1)));
let to_evict = std::mem::replace(&mut self.retained, still_retained);
// Evicted windows are silently dropped; their state is gone.
// Lesson 4 also discusses the retraction strategy for cases where
// even-after-eviction late events need to update results.
drop(to_evict);
Ok(())
}
fn window_end_for(&self, ts: SystemTime) -> SystemTime {
let offset = ts.duration_since(self.epoch).unwrap_or_default();
let idx = offset.as_secs() / self.window_size.as_secs().max(1);
self.epoch + self.window_size * (idx + 1) as u32
}
async fn emit(&self, window_end: SystemTime, state: WindowState, _kind: EmitKind) -> Result<()> {
let result = WindowResult {
window_end,
count: state.observations.len(),
};
self.output.send(WindowEmit::Insert {
window_id: WindowId(window_end),
sequence: state.sequence,
result,
}).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))
}
}
enum EmitKind { Insert, Retract }
The two-tier state — active and retained — is the structural pattern. Active windows accumulate; the watermark advance moves them to retained and triggers their first emit; retained windows can still receive late events and re-emit; final eviction at watermark - allowed_lateness frees the state for good. The two-tier structure makes the lateness behavior explicit rather than implicit; an operator with a single tier and ad-hoc "late event" handling tends to grow correctness bugs as the pattern complicates. The cost of two tiers is a few extra BTreeMap operations per watermark advance — negligible at any realistic scale.
A Retracting Operator
The retract strategy emits explicit Retract records before each Insert of a corrected result. The downstream is responsible for processing the pair atomically. The operator's emit logic factors slightly differently than the accumulate-only version above.
async fn emit_retract_then_insert(
output: &mpsc::Sender<WindowEmit>,
window_end: SystemTime,
prev_sequence: u32,
new_state: &WindowState,
) -> Result<()> {
let window_id = WindowId(window_end);
// Retract the previously-emitted sequence.
output.send(WindowEmit::Retract {
window_id,
sequence: prev_sequence,
}).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
// Then the corrected result at the new sequence.
output.send(WindowEmit::Insert {
window_id,
sequence: new_state.sequence,
result: WindowResult {
window_end,
count: new_state.observations.len(),
},
}).await
.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
Ok(())
}
The retraction must be emitted before the corrected insert. Otherwise the downstream sees insert-then-retract — the corrected result lands first, gets retracted, and the downstream's stored state ends up empty. The two-step pattern depends on the channel preserving FIFO ordering, which mpsc::Sender::send does. The sequence on the retract is the previous sequence; the sequence on the insert is the new one. The downstream, keyed on (window_id, sequence), processes the two records in order and ends up with the corrected state. This is the pattern that makes retract-and-correct safe under at-least-once delivery: the downstream's last-write-wins semantics absorb duplicate or out-of-order retract/insert pairs as long as the sequence ordering is respected.
A Retraction-Aware Sink
The sink that consumes the retractor's output. It uses an embedded SQLite as its idempotent state store. The schema is (window_id, sequence, result_blob); UPSERT on conflict by window_id with the higher sequence winning.
use rusqlite::{params, Connection};
use std::time::SystemTime;
const UPSERT_SQL: &str = r#"
INSERT INTO window_results (window_id, sequence, result_blob)
VALUES (?1, ?2, ?3)
ON CONFLICT (window_id) DO UPDATE
SET sequence = excluded.sequence, result_blob = excluded.result_blob
WHERE excluded.sequence > window_results.sequence
"#;
const RETRACT_SQL: &str = r#"
DELETE FROM window_results
WHERE window_id = ?1 AND sequence = ?2
"#;
pub fn apply_emit(conn: &Connection, emit: WindowEmit) -> rusqlite::Result<()> {
match emit {
WindowEmit::Insert { window_id, sequence, result } => {
let blob = serde_json::to_vec(&result).unwrap();
conn.execute(UPSERT_SQL, params![
window_id.0.duration_since(SystemTime::UNIX_EPOCH).unwrap().as_nanos() as i64,
sequence,
blob,
])?;
}
WindowEmit::Retract { window_id, sequence } => {
conn.execute(RETRACT_SQL, params![
window_id.0.duration_since(SystemTime::UNIX_EPOCH).unwrap().as_nanos() as i64,
sequence,
])?;
}
}
Ok(())
}
The WHERE excluded.sequence > window_results.sequence clause is what makes the UPSERT idempotent: a duplicate insert at sequence N (delivered twice by at-least-once) produces no change because the comparison is strict. A retraction's WHERE clause matches on both window_id and sequence, so a stale retraction (window_id matches but sequence is below the current stored value) deletes nothing — exactly the right behavior. The composition with the at-least-once-with-retries delivery layer (Module 2's with_retry) gives the exactly-once-effective property the lesson promises.
Key Takeaways
- Late events arrive when a source's actual lateness exceeds its max_lateness estimate. The pipeline has three strategies: drop (cheap, lossy), accumulate-with-allowed-lateness (medium cost, eventual completeness), retract-and-correct (highest cost, strongest correctness guarantee). The choice is per-output, not per-pipeline.
- Allowed lateness holds window state for a bounded period past the watermark close. The state lives in a retained tier alongside the active tier; late events into retained windows trigger re-emit; final eviction at watermark - allowed_lateness frees the state. Memory scales as (allowed_lateness / window_size) × steady_state_memory.
- Retractions emit a Retract of the previously-emitted result before each corrected Insert. The downstream is keyed on (window_id, sequence) with last-write-wins semantics; sequence numbers prevent delayed retractions from clobbering newer results. Retract-then-insert ordering is FIFO-channel-dependent and must not be reordered.
- Retractions only work with idempotent downstream: SQL ON CONFLICT DO UPDATE keyed on window_id with strict-greater sequence comparison, Kafka consumers that dedup on (window_id, sequence), or any sink whose effect on the world is keyed on the same identifier the operator emits. Non-idempotent downstreams produce wrong final state under retry.
- The lateness machinery composes with Module 2's at-least-once-plus-idempotency frame: the operator emits at least once with retries on transient failures, the downstream is idempotent on (window_id, sequence), giving effective exactly-once delivery of the corrected stream. Module 5 generalizes this pattern to cross-pipeline boundaries.
Capstone Project — Conjunction Window Engine
Module: Data Pipelines — M03: Event Time and Watermarks Estimated effort: 1–2 weeks of focused work Prerequisites: All four lessons in this module passed at ≥70%
Mission Brief
OPS DIRECTIVE — SDA-2026-0142 / Phase 3 Implementation Classification: CORRELATION TIER UPGRADE
The Phase 2 orchestrator from Module 2 is in production and stable. The dedup operator at the top of the correlation tier is processing-time-based, which is correct for ingestion deduplication but wrong for cross-sensor correlation. Internal review of conjunction alerts from the past quarter found a 2.3% rate of cross-sensor mismatches traceable to optical-vs-radar arrival skew straddling the 5-second dedup window. Two of the missed correlations were later determined to be real conjunctions, surfaced only by post-hoc batch reprocessing. The fix is structural: replace processing-time dedup with event-time windowed correlation that respects sensor_timestamp regardless of arrival skew.
Success criteria for Phase 3: cross-sensor correlations are computed by event time using the watermark protocol; the per-source max-lateness values from the M3 lesson (radar 100ms, optical 30s, ISL 10s) are respected; allowed-lateness of 5s captures the long tail of optical archive delays; conjunction alerts emit a retraction when a late observation invalidates a previously-emitted alert.
What You're Building
Replace the dedup operator from M2's pipeline with a windowed event-time
correlator that emits ConjunctionRisk envelopes when two orbital
objects' observations within the same event-time window indicate a
close approach.
The deliverable is:
- A WatermarkSource trait extending the M1 ObservationSource to interleave watermarks with observations on the channel
- Wrapped versions of M1's three sources (radar, optical, ISL) that emit per-source watermarks per the lesson's max-lateness table
- A min-of-inputs watermark-propagating fan-in normalize operator
- A WindowedCorrelator operator that holds per-key sliding windows of observations, emits ConjunctionRisk envelopes when pairs cross a configured proximity threshold, and supports allowed-lateness retention
- A retraction-aware alert sink that emits WindowEmit::{Insert, Retract} records via a sequence-number-keyed downstream
- A test harness that drives the pipeline with synthetic out-of-order events and verifies correctness across replay
The orchestrator from M2 is unchanged; only one operator's implementation changes (dedup → correlator). The OperatorGraph declaration is updated to reflect the new operator. Refresh the operational README to document the new metrics (watermark trail per source, allowed-lateness eviction count, retractions emitted).
Suggested Architecture
                                                        ┌─────────────────┐
┌───────────────┐  SourceItem                           │ Alert Sink      │
│ Radar Source  │═══════════════╗                       │ (retract-aware) │
│ + watermarks  │               ║                       │                 │
└───────────────┘               ║                       └────────▲────────┘
                                ║                                │
┌───────────────┐  SourceItem   ▼                                │
│ Optical Src   │═══════>┌────────────┐  SourceItem      ┌──────────────┐
│ + watermarks  │═══════>│ Normalize  │═════════════════>│ Windowed     │
└───────────────┘        │ Fan-In     │                  │ Correlator   │
                         │ min-WM     │                  │ (sliding,    │
┌───────────────┐  ═════>└────────────┘                  │ allowed-     │
│ ISL Source    │                                        │ lateness)    │
│ + watermarks  │                                        └──────────────┘
└───────────────┘
Each source runs its own task (orchestrated by M2's OperatorGraph). Each source emits enum SourceItem { Observation(_), Watermark(_) } on its outgoing channel. The fan-in normalize operator consumes from all three and emits a single SourceItem stream with min-of-inputs watermark propagation. The windowed correlator consumes that stream, holds per-orbital-object sliding windows, computes pairwise close-approach proximity within each window, and emits to the alert sink. The orchestrator's restart, retry, and circuit-breaker machinery from M2 wraps all of this without modification.
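The min-of-inputs rule at the fan-in can be sketched without any async machinery. The following is a minimal illustration, not the lesson's implementation: `FanInWatermark`, `on_item`, and the `EventTimeMs` alias (event time as integer milliseconds) are assumptions introduced here, and the real operator would also forward the items on its output channel.

```rust
/// Event time in milliseconds since an arbitrary epoch (assumption for this sketch).
type EventTimeMs = u64;

/// Items on a source's outgoing channel: data interleaved with watermarks.
#[derive(Debug, Clone, PartialEq)]
enum SourceItem<T> {
    Observation(T),
    Watermark(EventTimeMs),
}

/// Hypothetical helper: tracks the latest watermark seen on each fan-in input.
struct FanInWatermark {
    per_input: Vec<Option<EventTimeMs>>,
    last_emitted: Option<EventTimeMs>,
}

impl FanInWatermark {
    fn new(inputs: usize) -> Self {
        Self { per_input: vec![None; inputs], last_emitted: None }
    }

    /// Feeds one item from `input`. Returns the new output watermark only when
    /// the min-of-inputs advances; stays None until every input has reported.
    fn on_item<T>(&mut self, input: usize, item: &SourceItem<T>) -> Option<EventTimeMs> {
        let wm = match item {
            SourceItem::Observation(_) => return None, // data does not move the watermark
            SourceItem::Watermark(wm) => *wm,
        };
        // Per-input watermarks are monotonic; a regression is ignored.
        let slot = &mut self.per_input[input];
        *slot = Some(slot.map_or(wm, |prev| prev.max(wm)));

        let mut min = EventTimeMs::MAX;
        for reported in &self.per_input {
            min = min.min((*reported)?); // any silent input holds the output
        }
        if self.last_emitted.map_or(true, |prev| min > prev) {
            self.last_emitted = Some(min);
            Some(min)
        } else {
            None
        }
    }
}
```

With three inputs reporting 100, 50, and 70 in turn, nothing is emitted until the third report, at which point the output watermark is 50. A later advance on the fastest input alone produces no output, because the slowest input still pins the minimum.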
Acceptance Criteria
Functional Requirements
- WatermarkSource trait with method next() -> Result<Option<SourceItem>>; the existing M1 ObservationSource is wrapped via the lesson's run_source_with_watermarks helper
- Each source's max_lateness matches the lesson's table: radar 100ms, optical 30s, ISL 10s
- Watermarks emitted at a wall-clock cadence (default 1 second) AND on-demand whenever the observed event_time advances; idle sources still advance their watermark
- Fan-in normalize operator computes min-of-inputs watermark; output watermark held until every input has reported at least once
- Windowed correlator uses sliding windows keyed on (object_a_id, object_b_id); window size 30s, allowed lateness 5s
- ConjunctionRisk emit triggered when two observations of distinct objects within the same window indicate proximity below threshold
- Late events into already-emitted windows trigger a retraction-then-insert pair on the alert channel; sequence numbers on every emit
- Alert sink uses an embedded SQLite (or equivalent) keyed on (window_id, sequence) with strict-greater UPSERT semantics
Quality Requirements
- Replay test for correctness: deterministic test injecting a fixed event sequence in event-time order, then re-running with the same events injected in random arrival order. The final state of the alert sink must be byte-identical between the two runs.
- Watermark progress test: a unit test feeds watermarks into the fan-in operator and asserts the output watermark advances per the min-of-inputs rule and not before all inputs have reported
- Allowed-lateness test: a unit test injects a late event into a retained window and asserts the retraction-then-insert pair is emitted; injects a late event after the lateness budget has expired and asserts it is dropped silently with the corresponding metric incremented
- Memory bound test: under sustained load, the correlator's per-key window state stays bounded by (window_size + allowed_lateness) × per_key_event_rate; an integration test asserts steady-state memory for a synthetic workload
Operational Requirements
- /metrics extends M2's: source_watermark_seconds{source} (gauge), pipeline_watermark_seconds (gauge, the min-of-inputs at fan-in), pending_windows{tier} (gauge, split between active and retained), late_events_dropped_total (counter), retractions_emitted_total (counter)
- Lag dashboard split into source lag and pipeline lag per the M3 L1 framing; the pipeline lag panel makes "is the problem ours or theirs" answerable in seconds
- Operational runbook section "Reading the Watermark Dashboard" documenting how to interpret a watermark stall (which source's max-lateness is dominant; what the per-source values are; what tightens what)
Self-Assessed Stretch Goals
- (self-assessed) Sustain 50K observations/sec input with a 30-second window, P99 emit latency < 1 second after watermark advance. Provide a criterion benchmark and a flame graph showing where the per-event cost lives.
- (self-assessed) Replay-correctness test runs against a corpus of 10 randomly-shuffled arrival orders and produces byte-identical final state in every case
- (self-assessed) The operational dashboard exposes a "watermark stall" alert that fires when the pipeline watermark fails to advance for > 60 seconds, distinguishing source-side stalls from pipeline-side stalls in the alert text
Hints
How do I generate watermarks for an event-driven UDP source like the radar where there is no natural "tick"?
Two interleaved triggers. On observation: each observation updates the source's max_observed_event_time and (every Nth observation, or whenever wall-clock has advanced past the watermark interval) emits a watermark of max_observed_event_time - max_lateness. On idle: a tokio::time::interval ticking at the watermark interval emits a watermark even when no observations have arrived recently — important during quiet periods so the downstream operator's windows do not stall waiting for the source to wake up.
use tokio::select;

let mut interval = tokio::time::interval(watermark_interval);
loop {
    select! {
        obs = source.next() => {
            // Emit Observation, update max_observed_event_time,
            // possibly emit Watermark.
        }
        _ = interval.tick() => {
            // Emit Watermark even if no observations arrived;
            // the source's clock is what advances the watermark
            // during idle periods.
        }
    }
}
The select-against-interval pattern is the same one Module 2 used for the source-side retry timer. Reusable.
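The bookkeeping inside the observation branch is small enough to isolate and unit-test on its own. The sketch below covers only the observation-driven half of the formula (watermark = max_observed_event_time - max_lateness); the idle-period advance driven by the source's clock is left out. `WatermarkTracker` and its method names are hypothetical, and event times are assumed to be integer milliseconds.

```rust
use std::time::Duration;

/// Hypothetical per-source tracker; event times are milliseconds (assumption).
struct WatermarkTracker {
    max_lateness_ms: u64,
    max_observed_event_time_ms: Option<u64>,
}

impl WatermarkTracker {
    fn new(max_lateness: Duration) -> Self {
        Self {
            max_lateness_ms: max_lateness.as_millis() as u64,
            max_observed_event_time_ms: None,
        }
    }

    /// Called on every observation; out-of-order arrivals never move the max back.
    fn observe(&mut self, event_time_ms: u64) {
        let max = self
            .max_observed_event_time_ms
            .map_or(event_time_ms, |m| m.max(event_time_ms));
        self.max_observed_event_time_ms = Some(max);
    }

    /// watermark = max_observed_event_time - max_lateness, clamped at zero.
    /// None until the first observation arrives.
    fn watermark_ms(&self) -> Option<u64> {
        self.max_observed_event_time_ms
            .map(|max| max.saturating_sub(self.max_lateness_ms))
    }
}
```

For the radar source (max-lateness 100 ms), observing event times 1000 and then 900 leaves the watermark at 900: the late arrival does not pull it backwards.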
How do I efficiently store retained windows alongside active windows?
Two BTreeMap<SystemTime, WindowState> — one for active, one for retained — and a small dispatch on ingest. The on_watermark step uses BTreeMap::split_off for both maps to do prefix range eviction in O(log N). The combined cost per watermark advance is two prefix splits + one drain over the closed-active set; for hundreds of windows, this is sub-millisecond.
A single BTreeMap with a per-window enum tag (Active or Retained) also works and saves one allocation; both designs are fine. The two-map design makes the operations more obvious in the code and is the one the lesson uses.
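One detail of the two-map design is easy to get backwards: `BTreeMap::split_off` returns the suffix (keys at or above the split point) and leaves the prefix in place, so prefix eviction needs a swap. A minimal sketch, assuming windows are keyed by their end time in integer milliseconds rather than SystemTime; `WindowStore`, `take_prefix`, and the field names are illustrative, not the lesson's types.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default, PartialEq)]
struct WindowState {
    observation_count: usize,
}

/// Both tiers are keyed by window end in event-time milliseconds (assumption).
struct WindowStore {
    active: BTreeMap<u64, WindowState>,
    retained: BTreeMap<u64, WindowState>,
    allowed_lateness_ms: u64,
}

/// Removes and returns every entry whose key is strictly below `cutoff`.
fn take_prefix(map: &mut BTreeMap<u64, WindowState>, cutoff: u64) -> BTreeMap<u64, WindowState> {
    // split_off returns the keys >= cutoff and leaves the prefix behind,
    // so swap the suffix back in and hand out the prefix.
    let suffix = map.split_off(&cutoff);
    std::mem::replace(map, suffix)
}

impl WindowStore {
    /// Returns the window ends that just closed and how many retained windows were evicted.
    fn on_watermark(&mut self, watermark_ms: u64) -> (Vec<u64>, usize) {
        // Active windows with window_end <= watermark close and move to the retained tier.
        let closed = take_prefix(&mut self.active, watermark_ms.saturating_add(1));
        let closed_ends: Vec<u64> = closed.keys().copied().collect();
        self.retained.extend(closed);

        // Retained windows with window_end + allowed_lateness <= watermark are evicted.
        let evicted = match watermark_ms.checked_sub(self.allowed_lateness_ms) {
            Some(limit) => take_prefix(&mut self.retained, limit.saturating_add(1)).len(),
            None => 0,
        };
        (closed_ends, evicted)
    }
}
```

The returned list of closed window ends is what the correlator would drain to produce its first emit for each window; the eviction count feeds the allowed-lateness eviction metric.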
How do I deterministically test replay correctness?
Two ingredients: a fixed input set of observations with known event_times, and a way to inject them in a specific arrival order. The test runs the same input set through the pipeline twice — once in event-time order, once in a deliberately scrambled order — and asserts the final state of the alert sink is identical between the two runs.
#[tokio::test]
async fn replay_correctness_under_scrambled_arrival() {
    let observations = build_test_observations(); // 1000 observations, event-time ordered
    let scrambled = {
        let mut o = observations.clone();
        // Shuffle deterministically with a fixed seed.
        use rand::seq::SliceRandom;
        use rand::rngs::StdRng;
        use rand::SeedableRng;
        o.shuffle(&mut StdRng::seed_from_u64(42));
        o
    };
    let result_ordered = run_pipeline_to_completion(&observations).await;
    let result_scrambled = run_pipeline_to_completion(&scrambled).await;
    assert_eq!(result_ordered, result_scrambled,
        "pipeline output should be invariant to arrival order");
}
The run_pipeline_to_completion helper drains the alert sink at the end and returns the stored state (the SQLite contents serialized to a comparable form). The assertion is the correctness property the watermark protocol is supposed to give you; if it fails, the bug is in the operator's allowed-lateness logic or the retract-then-insert ordering.
How small can the safety margin be on retained-window eviction?
A retained window is evicted once watermark - allowed_lateness passes its window_end. Because watermark = max_observed_event_time - max_lateness, eviction actually occurs when the latest observed event time has advanced past window_end + allowed_lateness + max_lateness. The real event-time-domain retention is therefore allowed_lateness + max_lateness, not just allowed_lateness. Pipeline lag adds a third term, so the total retention is allowed_lateness + max_lateness + pipeline_lag.
For SDA's defaults: 5 + 30 + 1 = 36 seconds of total state retention. A budget of ~40s for the correlator's worst-case-per-window memory budget is a reasonable plan-for-the-tail value. Operations should monitor the actual eviction lag (now - retained_window_evict_event_time) and alert when it grows past, say, 120 seconds — that signals either pipeline lag growth or a stuck source.
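The arithmetic is worth keeping next to the configuration so the budget is recomputed whenever a max-lateness value changes. A small sketch, assuming the governing max-lateness is the largest per-source value (the pipeline watermark is the min of inputs, so it trails by the slowest source); the function names are hypothetical.

```rust
use std::time::Duration;

/// Total state retention for one window: the three terms from the text above.
fn retention_budget(
    allowed_lateness: Duration,
    max_lateness: Duration,
    pipeline_lag: Duration,
) -> Duration {
    allowed_lateness + max_lateness + pipeline_lag
}

/// Worst case across sources: the largest per-source max-lateness dominates.
fn worst_case_retention(
    allowed_lateness: Duration,
    per_source_max_lateness: &[Duration],
    pipeline_lag: Duration,
) -> Duration {
    let worst = per_source_max_lateness
        .iter()
        .copied()
        .max()
        .unwrap_or(Duration::ZERO);
    retention_budget(allowed_lateness, worst, pipeline_lag)
}
```

Called with the radar, optical, and ISL values (100 ms, 30 s, 10 s), an allowed lateness of 5 s, and 1 s of pipeline lag, this yields the 36 seconds quoted above.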
How do I avoid duplicate retractions when the operator restarts?
The operator's emit history is in-memory; on restart, it has no idea what it already emitted. M2's at-least-once-plus-idempotent composition saves us here: a duplicate emit (the same window_id, sequence) is absorbed by the sink's strict-greater UPSERT comparison. The restart re-emits the same sequence numbers it emitted before; the sink sees that the incoming sequence is not strictly greater than the stored one and leaves the row as it was; no observable change. If a late event arrives after the restart for a window that was in retained state before the restart but is no longer held, it is silently dropped, exactly as if the operator had not restarted and the lateness budget had simply expired.
For the full crash-safety story (where the restart should resume from a checkpoint of in-flight window state), see Module 5. This module's pipeline survives restarts but loses some allowed-lateness windows during the restart window — acceptable for SDA's tolerance.
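The absorb-on-replay behavior can be checked against an in-memory stand-in before wiring up SQLite. The sketch below models the strict-greater rule with a HashMap keyed on window_id. It assumes every emit, retractions included, carries its own increasing sequence number (Insert 1, Retract 2, Insert 3), which is one reading of the retract-then-insert scheme; `AlertSink` and the field layout are illustrative only.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum WindowEmit {
    Insert { window_id: u64, sequence: u64, result: String },
    Retract { window_id: u64, sequence: u64 },
}

/// In-memory stand-in for the SQLite sink (assumption: one row per window_id).
#[derive(Default)]
struct AlertSink {
    // window_id -> (last applied sequence, current result; None means retracted)
    rows: HashMap<u64, (u64, Option<String>)>,
}

impl AlertSink {
    /// Applies the emit only if its sequence is strictly greater than the
    /// stored one. Returns true when the stored state changed.
    fn apply(&mut self, emit: WindowEmit) -> bool {
        let (window_id, sequence, value) = match emit {
            WindowEmit::Insert { window_id, sequence, result } => (window_id, sequence, Some(result)),
            WindowEmit::Retract { window_id, sequence } => (window_id, sequence, None),
        };
        let stale = matches!(self.rows.get(&window_id), Some((stored, _)) if *stored >= sequence);
        if stale {
            return false; // duplicate or delayed emit: absorbed without effect
        }
        self.rows.insert(window_id, (sequence, value));
        true
    }
}
```

Replaying the full emit history after a restart leaves the sink untouched: every replayed emit has a sequence at or below the stored one and is rejected.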
Getting Started
Recommended order:
- SourceItem enum and the watermark-emitting source wrapper. Implement once; reuse across all three sources. Unit-test by feeding fixed observations and asserting the watermarks emitted match the documented formula.
- Min-of-inputs fan-in. Implement against three synthetic input channels in tests; do the unit test for the "hold until all inputs report" case explicitly.
- Sliding-window state per key. Reuse the L2 SlidingWindow primitive; add the per-key dispatch in the correlator. Test eviction with manual event-time scenarios.
- Allowed-lateness retention. Add the retained tier; wire watermark advance to move active→retained and to evict expired retained.
- Conjunction-risk computation. The actual proximity logic: given two observations of distinct objects within the same window, decide whether they form a ConjunctionRisk. The math is out of scope for this curriculum; a stub compute_proximity(obs_a, obs_b) -> f64 returning a deterministic number based on inputs is sufficient for testing the pipeline structure.
- Retract-and-correct emit logic. When a late event lands in a retained window, emit Retract { sequence: N } followed by Insert { sequence: N+1, result: corrected }.
- Retraction-aware sink. Embedded SQLite with the UPSERT pattern from L4. Test replay-after-restart correctness by killing and restarting the pipeline mid-stream.
- Replay-correctness integration test. The byte-identical-output-across-arrival-orders test from the hint.
- Refresh the operational README and the dashboard.
Aim for the sliding-window correlator with min-of-inputs watermark propagation working end-to-end by day 7; allowed-lateness and retractions can land in days 8-10. The replay-correctness test is the canary that catches every windowing bug; if it fails, stop and diagnose before adding features.
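For the conjunction-risk step, a stub along the following lines is enough to exercise the pipeline structure. It is deliberately not orbital mechanics: the `Observation` shape here (an object id plus a position in kilometers) is a hypothetical minimum, the real M1 envelope carries more, and straight-line distance is used only because it is deterministic.

```rust
/// Hypothetical, minimal observation shape for this sketch; the real
/// Observation envelope from M1 carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct Observation {
    object_id: u32,
    position_km: [f64; 3],
}

/// Stub proximity: straight-line distance between the two reported positions.
fn compute_proximity(obs_a: &Observation, obs_b: &Observation) -> f64 {
    obs_a
        .position_km
        .iter()
        .zip(obs_b.position_km.iter())
        .map(|(a, b)| (a - b).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Only observations of distinct objects can form a ConjunctionRisk.
fn is_conjunction_risk(obs_a: &Observation, obs_b: &Observation, threshold_km: f64) -> bool {
    obs_a.object_id != obs_b.object_id && compute_proximity(obs_a, obs_b) < threshold_km
}
```

Because the stub is a pure function of its inputs, the replay-correctness test can compare sink contents across arrival orders without any tolerance for numeric drift.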
What This Module Sets Up
In Module 4 you will harden the channel boundaries against burst load. The bounded-channel-per-edge invariant from M2 plus the watermark machinery from M3 are both inputs to that work — burst load affects watermark advance, and watermark stall affects late-event handling. The flow-policy machinery developed in M4 plugs in upstream of the correlator without changing its windowing logic.
In Module 5 you will make the correlator's window state crash-safe. The two-tier (active/retained) state structure is exactly what the checkpoint code serializes. The sequence numbers on emits are exactly what the at-least-once-with-checkpoint recovery uses to replay safely. M3's correctness foundation is what M5's durability layer rests on.
In Module 6 you will surface the watermark progress as the master observability metric. The per-source watermark gauges, the pipeline watermark gauge, and the retained-window count are the diagnostic dashboard's first-row panels. The lag-distinct-from-watermark distinction is the framing that makes the dashboard usable.
The correlator built here is the canonical event-time-windowed operator. Every subsequent module's project either extends it directly or applies the same patterns to a different operator.
Module 04 — Backpressure and Flow Control
Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 4 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, the chapters on tokio::sync::mpsc semantics and channel patterns; Network Programming with Rust — Abhishek Chanda, sections on application-level flow control and TCP windowing as transport-level backpressure; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (Producer Configuration and max.in.flight.requests.per.connection); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Buffering and Pushback)
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
OPS ALERT — SDA-2026-0188
Classification: BURST-LOAD HARDENING
Subject: Replace uniform load shed with priority-aware FlowPolicy
Last week's anti-satellite test added 1,800 newly tracked debris objects to the catalog within 90 seconds. The Phase 3 pipeline survived but with twelve minutes of catch-up time and four dropped conjunction alerts during the spike. The postmortem traced the dropped alerts to a single edge — the alert-emitter's incoming channel — where the buffer was sized for nominal traffic, the upstream operator was IO-bound and could not slow further, and the downstream had no mechanism to triage which observations to drop. The pipeline's response to the burst was uniform load shed without policy. The four critical alerts that got dropped were no more important to the system than the four hundred non-critical observations dropped alongside them.
The pipeline at the start of this module is correct under steady-state load and tolerates transient downstream slowdowns. It is not yet correct under burst load. The mechanism for backpressure (send().await on bounded channels) propagates the slowdown but does not give the operator any policy choice about what to do when the channel fills up — the upstream just slows, regardless of what the data being slowed represents. For the post-Cosmos burst, that uniform behavior was exactly the wrong shape: the system needed to triage during the spike, preserving the high-priority alerts that could not afford the latency hit while shedding the low-priority redundant samples that contribute little under load.
This module installs the discipline. Per-edge FlowPolicy (Backpressure / Shed / Timed) makes the policy explicit at every edge. A priority classifier distinguishes high-priority from low-priority observations. The audit script becomes a CI test that fails the build on new unbounded channels or undocumented FlowPolicy choices. The burst simulator becomes the regression-detection canary. The patterns generalize to any streaming pipeline that must survive bursts; the module's specifics are where the techniques meet SDA's actual workload.
The mental model the module installs is the four-piece backpressure discipline: (1) every edge has an explicit per-edge FlowPolicy chosen for its operational role, (2) buffer sizing is documented per-edge with a BurstProfile rather than picked by reflex, (3) credit-based flow is reached for specifically when the receiver needs to pause without occupying a slot (checkpoint flushes, cross-runtime edges, graceful drains), (4) the pressure chain is continuously audited and the per-channel occupancy gradient is the first-look dashboard panel.
Learning Outcomes
After completing this module, you will be able to:
- Choose between send().await, try_send(), and send_timeout() based on the operator's load-shedding policy, and encode the choice as a per-edge FlowPolicy
- Size channel buffers using a documented BurstProfile rather than by reflex, with the math (rate gap × duration × safety factor) made explicit per edge
- Implement a load-shedding sink with priority-aware sub-channels and a biased select that gives high-priority data deterministic preference under shed conditions
- Reach for credit-based flow control specifically when the receiver needs to pause without occupying an in-flight slot: checkpoint flushes, cross-runtime edges, graceful drains
- Diagnose backpressure-chain breakage from the per-channel occupancy gradient: the rightmost persistently-full edge points at the slowest operator
- Recognize the canonical pressure-chain breakage patterns (per-event tokio::spawn, unbounded_channel, fire-and-forget logging, unbounded Vec::push accumulators, drop-and-recreate task patterns) and their structural fixes
- Reason about backpressure across boundary cases: TCP windowing as transport-level chain link; Kafka's deliberate decoupling that requires explicit consumer-lag monitoring as the substitute pressure signal
Lesson Summary
Lesson 1 — Bounded Channels
The three send semantics — send().await for backpressure, try_send for explicit load shed (always paired with a drop counter), send_timeout for SLO-bound edges where blocking past a deadline is worse than dropping — encoded as a per-edge FlowPolicy enum. Buffer sizing as documented BurstProfile rather than reflexive 1024; the math made explicit. Why unbounded_channel is a footgun for any data-path edge.
Key question: The conjunction-emitter has a 200ms SLO from observation-arrival to alert-emit, and the alert subscriber occasionally returns 503 during deploys. Which FlowPolicy is the right choice for the emitter's outgoing edge, and why?
Lesson 2 — Credit-Based Flow Control
Credit-based flow as the alternative to bounded-channel-plus-await for cases where the receiver needs to pause the upstream without occupying any in-flight slot. The structural difference (decoupled credit signal and data channel), the production protocols that use it (HTTP/2 windows, AMQP prefetch, Kafka's max.in.flight.requests.per.connection, which at a value of 1 is the single-credit degenerate case), the SDA cases (checkpoint flushes, cross-runtime edges, graceful drains), and the in-flight-bounded property that returning credit only AFTER processing preserves.
Key question: The CreditHandle has a Drop impl that warns when the handle is dropped without return_credit() being called. What canonical failure mode does that warning surface, and why does it matter operationally?
Lesson 3 — End-to-End Backpressure Propagation
The pressure chain as a contiguous sequence of bounded buffers from source to sink. The five canonical breakage patterns (per-event tokio::spawn, unbounded_channel, fire-and-forget unbounded logging, Vec::push accumulators, drop-and-recreate task patterns) and their structural fixes. Reading the per-channel occupancy gradient as first-look diagnostic. TCP windowing as transport-level chain link. Kafka's deliberate decoupling that requires explicit consumer-lag monitoring as the substitute pressure signal.
Key question: The dashboard shows three persistently-full edges across a six-edge topology. Which operator is the bottleneck, and why does the gradient-reading discipline give an unambiguous answer?
Capstone Project — Backpressure-Aware Fusion Broker
Harden the Phase 3 pipeline against the post-Cosmos-1408 burst failure mode. Every edge gets an explicit FlowPolicy; a priority classifier distinguishes High from Low observations; a PriorityShedSink with biased select gives High deterministic preference and drops Low first under shed conditions. The audit script becomes a cargo test that fails the build on new unbounded channels or undocumented policies. The BurstSimulator drives 10x rate for 5 simulated minutes and asserts zero High-priority drops. Acceptance criteria, suggested architecture, and the full project brief are in project-backpressure-broker.md.
The orchestrator from M2 and the windowed correlator from M3 are unchanged in structure. Only the edges' policies change, and a new operator (the priority classifier) sits between the normalize and the correlator.
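The shed-Low-first, drain-High-first behavior can be modeled synchronously before the async version is built. The sketch below is a stand-in model only: it uses two bounded VecDeques instead of tokio sub-channels and a biased select, and `PriorityShedQueue`, its methods, and the per-queue capacity are assumptions, not the project's PriorityShedSink.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Priority {
    High,
    Low,
}

/// Synchronous model of a priority-aware shed queue (illustrative only).
struct PriorityShedQueue<T> {
    high: VecDeque<T>,
    low: VecDeque<T>,
    capacity: usize, // bound applied to each sub-queue
    low_dropped: u64,
}

impl<T> PriorityShedQueue<T> {
    fn new(capacity: usize) -> Self {
        Self { high: VecDeque::new(), low: VecDeque::new(), capacity, low_dropped: 0 }
    }

    /// Low items are shed (and counted) when their sub-queue is full.
    /// High items are never shed: a full High queue hands the item back so
    /// the caller can apply backpressure instead.
    fn push(&mut self, priority: Priority, item: T) -> Result<(), T> {
        match priority {
            Priority::High if self.high.len() < self.capacity => {
                self.high.push_back(item);
                Ok(())
            }
            Priority::High => Err(item),
            Priority::Low if self.low.len() < self.capacity => {
                self.low.push_back(item);
                Ok(())
            }
            Priority::Low => {
                self.low_dropped += 1; // shed, but never silently
                Ok(())
            }
        }
    }

    /// Biased pop: always drain High before looking at Low.
    fn pop(&mut self) -> Option<T> {
        self.high.pop_front().or_else(|| self.low.pop_front())
    }
}
```

The property the BurstSimulator asserts (zero High-priority drops) corresponds here to `push` never discarding a High item: it either enqueues it or returns it to the caller.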
File Index
module-04-backpressure-and-flow-control/
├── README.md ← this file
├── lesson-01-bounded-channels.md ← Bounded channels and FlowPolicy
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-credit-based-flow.md ← Credit-based flow control
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-end-to-end-propagation.md ← End-to-end backpressure propagation
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-backpressure-broker.md ← Capstone project brief
Prerequisites
- Modules 1, 2, and 3 completed: the Observation envelope, the orchestrator's OperatorGraph and supervisor, and the watermark-aware windowed correlator are all assumed
- Foundation Track completed: async Rust, channels, scheduling intuitions
- Familiarity with tokio::sync::mpsc::{Sender, Receiver} semantics and tokio::time::{timeout, sleep, pause}
- Comfort reading Prometheus-style metric names (channel_occupancy{edge=...}) and reasoning about per-label counters and gauges
What Comes Next
Module 5 (Delivery Guarantees and Fault Tolerance) makes the windowed correlator's state crash-safe via checkpointing. The credit-based-flow primitive from this module's Lesson 2 is the mechanism that pauses the upstream during the checkpoint flush. The bounded-channel-everywhere invariant the audit enforces is what lets the checkpointed state size be bounded and predictable. The exactly-once-via-idempotency frame from M2 plus the retract-aware sink from M3 plus the priority shedding from M4 compose into the pipeline M5 will harden against process restarts.
Lesson 1 — Bounded Channels
Module: Data Pipelines — M04: Backpressure and Flow Control
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, the chapters on tokio::sync::mpsc semantics and channel patterns; Network Programming with Rust — Abhishek Chanda, sections on application-level flow control over TCP; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (Producer Configuration: buffer.memory, max.block.ms, acks)
Context
Module 1 introduced mpsc::Sender::send().await as the foundation of backpressure: a full bounded channel suspends the sending future until the consumer makes capacity, propagating the slowdown upstream. Module 2's orchestrator structurally enforced "every edge in the operator graph is a bounded channel." Module 3 added watermarks and windowing on top of that channel structure. The pipeline that exists at the start of this module is correct under steady-state load and tolerates transient downstream slowdowns.
It does not tolerate burst load. Last quarter's Cosmos-1408 anti-satellite test added 1,800 newly tracked debris objects to the catalog within 90 seconds. The pipeline survived in the sense that no operator panicked, but it needed twelve minutes of catch-up time and dropped four conjunction alerts during the spike. The postmortem traced the dropped alerts to a single channel where the buffer was sized for nominal traffic, the upstream operator was IO-bound and could not slow further, and the downstream operator had no mechanism to triage which observations to drop and which to preserve. The pipeline's response to the burst was uniform load shed without policy: the four critical alerts that got dropped were no more important to the system than the four hundred non-critical observations dropped alongside them.
This module is the response. The orchestrator's bounded-channel invariant is correct but not sufficient; what to do when a bounded channel fills up is a design decision that has been left implicit, and the explicit answer is a per-edge FlowPolicy with three named choices. This lesson establishes the choices, develops the channel semantics behind each, and develops the discipline of sizing channel buffers on purpose rather than by reflex. Lesson 2 covers credit-based flow control as the alternative shape for cases where awaiting is not enough. Lesson 3 audits the full pipeline for places where the backpressure chain breaks. The capstone hardens the M3 pipeline against a 10x burst simulation with explicit flow policies and load-shedding.
Core Concepts
Bounded vs Unbounded Channels
tokio::sync::mpsc::channel(N) produces a bounded channel: at most N items can be in flight between the sender and the receiver. Beyond N, the next send suspends until the receiver consumes. tokio::sync::mpsc::unbounded_channel() produces an unbounded channel: items are buffered without limit, growing memory as long as the sender produces faster than the receiver consumes.
Unbounded channels are the wrong default. The reason is structural: any unbounded buffer in the pipeline is a place where backpressure does not propagate. The upstream sender writes into the unbounded channel, never suspends, and never observes the downstream's slowness. The buffer grows. The process's resident memory grows. Either a higher-level resource limit (the OOM killer or a container's memory cap) eventually intervenes, or the process keeps growing until something else gives. None of these outcomes is graceful; all are surprising at runtime, and all are diagnosed only after the symptom shows up.
The legitimate use cases for unbounded channels are narrow. In-process notification of singleton events — the orchestrator's "shutdown signal" that fires once and is consumed once. Tests where the test harness controls both ends and the bound is implicit in the test's structure. Bounded-by-construction sources where the application can prove the channel cannot fill. None of these apply to the data path of a streaming pipeline. The orchestrator's audit script in Lesson 3 asserts that no unbounded_channel appears anywhere in the operator graph; this lesson is the conceptual justification for that assertion.
send().await Semantics
send(item).await is the default and the right choice for most operator-pair edges. The semantics: when the channel is full, the future returned by send does not resolve until there is capacity for the item; the calling task is suspended (cooperatively, in the M2 L1 sense) and another task on the worker thread can run. The send is cancel-safe: dropping the future at any point is well-defined. Either the item was inserted into the channel (Ok(()) returned) or it was not (the future was dropped before the channel had capacity). No third state. In the second case the item is dropped along with the future, so an operator that must not lose the item when a select! branch is cancelled should acquire a slot with reserve().await first and send through the returned permit.
The operational consequence is that the channel's capacity IS the backpressure mechanism. A channel of capacity 1024 is a slack budget: the upstream operator can produce 1024 items ahead of the downstream's processing before backpressure begins to apply. Within the budget, the upstream runs at its own rate; past the budget, the upstream's rate is capped at the downstream's rate. The relationship is exactly the dataflow-model contract: no operator runs faster than its slowest downstream.
The choice of send().await over try_send is the choice of "applied backpressure" over "load shed." Any pipeline whose default behavior under load should be "the upstream slows down" uses send().await. The radar source operator that calls send(observation).await on a full channel ends up suspended at that point; the next time it polls its UDP socket, the kernel buffer has had time to fill or drop frames at its layer. This is the right behavior for a UDP source: kernel-level drop preserves the pipeline's invariants while applying the right kind of pressure.
try_send() Semantics
try_send(item) returns immediately. On a full channel, it returns Err(TrySendError::Full(item)) — the item is handed back to the caller, untouched. The caller decides what to do: drop it, log it, route it to a DLQ, retry it later. The semantics are explicit load-shed: the channel's capacity is a hard limit, and an attempt to exceed it does not block.
The right use case is operator-side load shedding when the data being shed has lower marginal value than the work blocking the upstream. The metrics-export operator that emits sample metrics is the canonical case: a metric that did not get published is a small, recoverable loss; blocking the upstream operator on the metrics channel would impose a larger cost than the benefit of every metric reaching its destination. try_send with a counter!("metrics_drops_total") increment is the right shape.
The trap that the lesson called out at the top is try_send with no logging. An operator that calls try_send and discards the Err(Full) produces silent drops that are invisible until aggregate output deficits show up downstream. The discipline is uniform: every try_send site has a counter increment and a structured log entry on Full. The orchestrator's metrics endpoint exposes the counter; an alert fires when the drop rate per second exceeds a threshold. Without the counter, try_send is a footgun; with the counter, it is a load-shedding tool.
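The counter-on-Full discipline looks like this in miniature. To stay dependency-free the sketch uses `std::sync::mpsc::sync_channel` as a stand-in for tokio's bounded channel; both report a full channel as `TrySendError::Full(item)`, though tokio names the other variant `Closed` where std uses `Disconnected`. The helper name `shed_send`, the plain AtomicU64 counter, and the log line format are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{SyncSender, TrySendError};

/// Tries to enqueue without waiting. Returns true if the item was delivered
/// to the channel, false if it was shed or the receiver is gone.
fn shed_send<T>(tx: &SyncSender<T>, item: T, drops_total: &AtomicU64) -> bool {
    match tx.try_send(item) {
        Ok(()) => true,
        Err(TrySendError::Full(_shed)) => {
            // Every shed is counted and logged; a silent drop is the footgun.
            drops_total.fetch_add(1, Ordering::Relaxed);
            eprintln!("event=shed edge=metrics_export reason=channel_full");
            false
        }
        // Receiver dropped: nothing to shed toward; the caller should shut down.
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

With a channel of capacity 2 and no consumer draining it, the first two calls succeed and the third increments the counter, which is exactly the signal the drop-rate alert reads.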
send_timeout() Semantics
send_timeout(item, dur) is the third primitive. It suspends like send().await but resolves with Err(SendTimeoutError::Timeout(item)) after dur has elapsed without capacity becoming available. The caller decides what to do with the timed-out item: drop, DLQ, retry on a different channel.
This is the right primitive for operators with a real-time deadline. The conjunction-alert emitter has a 200 ms SLO from observation-arrival to alert-emit. If the alert subscriber's HTTP endpoint is too slow to drain the alert channel within 200 ms, send_timeout lets the operator make an explicit choice — drop the alert (with metrics and DLQ), route to a slow-path archive, or whatever the operational decision is — rather than blocking past the SLO. The deadline-bound choice fits naturally between send().await (no deadline, may block forever) and try_send (no wait at all).
The defaults across the SDA pipeline:
| Edge | Policy | Reasoning |
|---|---|---|
| Source → normalize | send().await | Apply backpressure to source; UDP drops at kernel are acceptable |
| Normalize → correlator | send().await | No deadline; correlator is the natural-rate consumer |
| Correlator → alert sink | send_timeout(200ms) | SLO-bound; timed-out alerts go to DLQ |
| Pipeline → metrics export | try_send + drop counter | Metrics are sheddable; observability of drops is what matters |
The discipline is per-edge, documented in the operator graph declaration with a brief comment about the choice. The orchestrator's structured log emits the policy at startup so runbooks can confirm what is configured.
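One way to picture the per-edge declaration and the startup log line is the sketch below. The edge names, the EdgeSpec and EdgePolicy types, and the log line format are assumptions made for illustration; only the policies and reasons come from the table above.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EdgePolicy {
    Backpressure,
    Shed,
    Timed(Duration),
}

impl EdgePolicy {
    /// Human-readable label for the startup log.
    pub fn label(&self) -> String {
        match self {
            EdgePolicy::Backpressure => "send().await".to_string(),
            EdgePolicy::Shed => "try_send+drop_counter".to_string(),
            EdgePolicy::Timed(d) => format!("send_timeout({}ms)", d.as_millis()),
        }
    }
}

/// One edge of the operator graph, with the reason recorded as data.
pub struct EdgeSpec {
    pub name: &'static str,
    pub policy: EdgePolicy,
    pub reason: &'static str,
}

pub const SDA_EDGES: [EdgeSpec; 4] = [
    EdgeSpec { name: "source->normalize", policy: EdgePolicy::Backpressure,
               reason: "Apply backpressure to source; UDP drops at kernel are acceptable" },
    EdgeSpec { name: "normalize->correlator", policy: EdgePolicy::Backpressure,
               reason: "No deadline; correlator is the natural-rate consumer" },
    EdgeSpec { name: "correlator->alert_sink", policy: EdgePolicy::Timed(Duration::from_millis(200)),
               reason: "SLO-bound; timed-out alerts go to DLQ" },
    EdgeSpec { name: "pipeline->metrics_export", policy: EdgePolicy::Shed,
               reason: "Metrics are sheddable; observability of drops is what matters" },
];

/// The line an orchestrator could emit per edge at startup.
pub fn startup_line(edge: &EdgeSpec) -> String {
    format!("edge={} policy={} reason={:?}", edge.name, edge.policy.label(), edge.reason)
}
```

Keeping the reason next to the policy means the runbook and the code cannot drift apart silently: changing one without the other shows up in review.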
Buffer Sizing
The capacity of a bounded channel is the slack budget between the producer and the consumer — how much burst the channel absorbs before backpressure begins to apply. The right sizing is operational, not magical. Three considerations.
Sustained rate vs burst rate: a channel sized for the sustained rate has effectively zero slack and applies backpressure constantly under nominal load, which is wrong. A channel sized for the burst rate has too much slack and adds latency under load, because every item already in the channel is processed ahead of a newly sent one. The right size is bounded by the expected burst duration × the rate gap between producer and consumer, with a 2x safety factor for headroom.
Per-item processing time at the consumer: a channel sized for 1024 items where each item takes 100 µs to process represents 100 ms of latency at the consumer side under steady-state full-channel conditions. If the operator's SLO budget is 200 ms, that 100 ms of channel-induced latency might be more than the budget allows, and the right answer is a smaller channel.
Memory cost per slot: each slot in the channel holds one Observation (or whatever the channel's item type is). For envelopes of a few kilobytes, a channel of 1024 is negligible. For envelopes that carry larger payloads, the per-slot memory matters and the channel should be sized accordingly.
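The second and third considerations are simple arithmetic, sketched below as three helper functions. The function names and the InlineEnvelope type are hypothetical; the memory figure counts only inline slot storage when the channel is full and ignores heap payloads an item may point to.

```rust
use std::mem::size_of;
use std::time::Duration;

/// Queueing delay seen by an item that enters a full channel: every
/// item ahead of it is processed first.
pub fn full_channel_delay(capacity: u32, per_item: Duration) -> Duration {
    per_item * capacity
}

/// Largest capacity whose full-channel delay stays within `budget`.
pub fn max_capacity_for(budget: Duration, per_item: Duration) -> usize {
    if per_item.is_zero() {
        return usize::MAX;
    }
    usize::try_from(budget.as_nanos() / per_item.as_nanos()).unwrap_or(usize::MAX)
}

/// Inline bytes held by `capacity` queued items of type `T`.
pub fn inline_bytes_when_full<T>(capacity: usize) -> usize {
    capacity.saturating_mul(size_of::<T>())
}

/// Illustrative envelope carrying a 2 KiB inline payload.
pub struct InlineEnvelope {
    pub payload: [u8; 2048],
}
```

For the numbers in the text, full_channel_delay(1024, 100 µs) is 102.4 ms, and 1024 slots of InlineEnvelope come to 2 MiB. If only 50 ms of a 200 ms SLO can be spent queueing, max_capacity_for caps the channel at 500 items.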
The default for SDA's pipeline is 1024 for source-to-normalize edges, 4096 for normalize-to-correlator (the correlator's per-event work is heavier and the slack absorbs more of the burst), and 256 for the alert-emit edge (low rate, tight latency). Each edge is documented with a comment in the graph declaration explaining the choice. Sizing by reflex (everything is 1024, "because that's what we used last time") is the pattern the post-Cosmos-1408 postmortem identified as having contributed to the dropped alerts.
Code Examples
Three Sinks with Three Policies
The same logical sink role with three different flow policies, each appropriate for a different edge in the topology.
```rust
use anyhow::Result;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::SendTimeoutError;

/// `BackpressureSink` applies upstream backpressure on a full channel.
/// The right choice for an edge where the upstream should slow down
/// rather than drop data.
pub struct BackpressureSink {
    tx: mpsc::Sender<Observation>,
}

impl BackpressureSink {
    pub async fn write(&self, obs: Observation) -> Result<()> {
        // send().await suspends until capacity. Cancel-safe.
        self.tx.send(obs).await
            .map_err(|_| anyhow::anyhow!("downstream receiver dropped"))?;
        Ok(())
    }
}

/// `SheddingSink` drops on a full channel rather than block. The right
/// choice for sheddable data; ALWAYS pair with a drop counter.
pub struct SheddingSink {
    tx: mpsc::Sender<Observation>,
    drops: metrics::Counter,
}

impl SheddingSink {
    pub fn new(tx: mpsc::Sender<Observation>) -> Self {
        Self {
            tx,
            drops: metrics::counter!("shedding_sink_drops_total"),
        }
    }

    pub fn write(&self, obs: Observation) -> Result<()> {
        match self.tx.try_send(obs) {
            Ok(()) => Ok(()),
            Err(mpsc::error::TrySendError::Full(_obs)) => {
                self.drops.increment(1);
                tracing::debug!("shedding sink dropped observation; channel full");
                Ok(())
            }
            Err(mpsc::error::TrySendError::Closed(_)) => {
                Err(anyhow::anyhow!("downstream receiver dropped"))
            }
        }
    }
}

/// `TimedSink` writes with a deadline. After the deadline, the item is
/// returned to the caller; production code routes it to a DLQ or
/// archive sink.
pub struct TimedSink {
    tx: mpsc::Sender<Observation>,
    deadline: Duration,
    dropped_with_deadline: metrics::Counter,
}

impl TimedSink {
    pub fn new(tx: mpsc::Sender<Observation>, deadline: Duration) -> Self {
        Self {
            tx,
            deadline,
            dropped_with_deadline: metrics::counter!("timed_sink_deadline_drops_total"),
        }
    }

    pub async fn write(&self, obs: Observation) -> Result<()> {
        // send_timeout hands the item back inside the error on timeout,
        // so we can decide what to do with it. Wrapping send() in
        // tokio::time::timeout would instead drop the item together
        // with the cancelled future.
        match self.tx.send_timeout(obs, self.deadline).await {
            Ok(()) => Ok(()),
            Err(SendTimeoutError::Closed(_)) => {
                Err(anyhow::anyhow!("downstream receiver dropped"))
            }
            Err(SendTimeoutError::Timeout(_obs)) => {
                self.dropped_with_deadline.increment(1);
                // Production: route `_obs` to the DLQ or slow-path
                // archive sink. Here: log and drop.
                tracing::warn!("timed sink missed deadline; dropping");
                Ok(())
            }
        }
    }
}
```
The three sinks are interchangeable in shape, with the same write(obs) -> Result<()> contract (SheddingSink's write is synchronous because try_send never waits), but operationally different. The orchestrator's edge wiring chooses one per edge; the operator that consumes the sink does not need to know which policy is in effect. The metrics surfaced by each (shedding_sink_drops_total, timed_sink_deadline_drops_total) are the operator-visibility property the lesson keeps flagging: silent drops are the bug, instrumented drops are the tool.
A FlowPolicy Enum for Per-Edge Configuration
The lesson's discipline is per-edge. Encoding the policy in an enum makes the choice visible in the operator graph declaration and lets a single sink implementation dispatch to the right semantics.
```rust
use anyhow::Result;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::{SendTimeoutError, TrySendError};

#[derive(Debug, Clone, Copy)]
pub enum FlowPolicy {
    /// Apply backpressure: the upstream slows down rather than items
    /// being dropped. The default for most pipeline edges.
    Backpressure,
    /// Drop items on a full channel; count via the `flow_drops_total`
    /// metric. For sheddable data: metrics export, optional logging, etc.
    Shed,
    /// Wait up to `deadline` for capacity; drop on timeout. For SLO-bound
    /// edges where blocking past the deadline is worse than dropping.
    Timed(Duration),
}

pub struct ConfigurableSink {
    tx: mpsc::Sender<Observation>,
    policy: FlowPolicy,
}

impl ConfigurableSink {
    pub fn new(tx: mpsc::Sender<Observation>, policy: FlowPolicy) -> Self {
        Self { tx, policy }
    }

    pub async fn write(&self, obs: Observation) -> Result<()> {
        match self.policy {
            FlowPolicy::Backpressure => self.tx.send(obs).await
                .map_err(|_| anyhow::anyhow!("receiver dropped")),
            FlowPolicy::Shed => match self.tx.try_send(obs) {
                Ok(()) => Ok(()),
                Err(TrySendError::Full(_)) => {
                    metrics::counter!("flow_drops_total", "policy" => "shed").increment(1);
                    Ok(())
                }
                Err(TrySendError::Closed(_)) =>
                    Err(anyhow::anyhow!("receiver dropped")),
            },
            // send_timeout returns the item in the error on timeout;
            // this sketch counts the drop and discards it.
            FlowPolicy::Timed(deadline) => match self.tx.send_timeout(obs, deadline).await {
                Ok(()) => Ok(()),
                Err(SendTimeoutError::Closed(_)) =>
                    Err(anyhow::anyhow!("receiver dropped")),
                Err(SendTimeoutError::Timeout(_)) => {
                    metrics::counter!("flow_drops_total", "policy" => "timed").increment(1);
                    Ok(())
                }
            },
        }
    }
}
```
Two design points. The metric label policy is what makes the metric useful operationally: a single counter with the policy label tells you which kind of drop is happening, and the dashboards filter by it. A separate counter per policy works too but produces dashboard duplication. Second, the dispatch on self.policy is per-call; the cost is a single match against a copy of an enum, which is sub-nanosecond in the hot path. The expressiveness gain over three separate sink types is worth that cost.
Buffer Sizing With Documented Reasoning
A small helper that captures the reasoning behind a buffer size as data the operator graph carries forward into structured logs and runbook references.
```rust
/// The expected burst characteristics of a channel and the resulting
/// recommended buffer size. Documented per-edge in the topology
/// declaration; the orchestrator emits a startup log with the values.
#[derive(Debug, Clone, Copy)]
pub struct BurstProfile {
    pub peak_rate_per_s: u64,
    pub peak_duration_s: u64,
    pub processing_rate_per_s: u64,
    pub safety_factor: f32,
}

impl BurstProfile {
    /// Recommended buffer size: how many items the channel needs to
    /// hold to absorb the burst without applying backpressure for its
    /// duration. Past the size, backpressure begins to apply normally.
    pub fn recommended_buffer(&self) -> usize {
        // Items accumulate at the gap between arrival and processing rate.
        let rate_gap = self.peak_rate_per_s.saturating_sub(self.processing_rate_per_s);
        let burst_items = rate_gap.saturating_mul(self.peak_duration_s);
        // Compute in f64 and round up so large counts keep their precision.
        (burst_items as f64 * f64::from(self.safety_factor)).ceil() as usize
    }
}

// Example: source→normalize edge for the radar source.
//   peak_rate_per_s:        5000 (during a fragmentation event)
//   peak_duration_s:        60   (the burst lasts about a minute)
//   processing_rate_per_s:  4500 (the normalizer's measured throughput)
//   safety_factor:          2.0
//   recommended_buffer() = (5000 - 4500) * 60 * 2.0 = 60,000 items
//
// 60,000 items at ~200 bytes per Observation envelope = 12 MB.
// Acceptable cost for the burst-absorption property; documented in
// the graph declaration with this comment.
const RADAR_SOURCE_BURST: BurstProfile = BurstProfile {
    peak_rate_per_s: 5000,
    peak_duration_s: 60,
    processing_rate_per_s: 4500,
    safety_factor: 2.0,
};
```
The math is not magic but it is also not "1024 because we always use 1024." Each per-edge buffer size has a BurstProfile constant declaring the assumptions, and the orchestrator's startup log emits the (edge_name, buffer_size, profile) triple so the runbook can reference it. The numbers are operational — they come from load tests and production observation, and they evolve as the workload changes. The discipline this lesson installs is making the assumptions visible rather than baked into magic numbers.
Key Takeaways
- Bounded channels are the default in any production pipeline. unbounded_channel is a footgun for data-path edges; use it only for singleton-event signals or test scaffolding. The orchestrator's audit asserts none appear in the operator graph.
- The three send semantics are operationally distinct. send().await applies backpressure (upstream slows). try_send load-sheds on full (always pair with a drop counter). send_timeout(dur) is for SLO-bound edges where blocking past a deadline is worse than dropping. Encode the choice as a per-edge FlowPolicy.
- The right primitive depends on the question being asked. Should the upstream slow down? → send().await. Is this data sheddable under load? → try_send + counter. Does this edge have a real-time deadline? → send_timeout + DLQ/archive.
- Buffer sizing is per-edge and documented. The math: (peak_rate - processing_rate) × peak_duration × safety_factor. The default of 1024 for everything is the pattern that the post-Cosmos-1408 postmortem identified as a contributing cause of dropped alerts.
- Silent drops are the bug; instrumented drops are the tool. Every load-shedding site has a counter, a label, and a dashboard panel. An undocumented try_send is a bug waiting to be diagnosed; an instrumented one is operationally legible.
Lesson 2 — Credit-Based Flow Control
Module: Data Pipelines — M04: Backpressure and Flow Control
Position: Lesson 2 of 3
Source: Network Programming with Rust — Abhishek Chanda, the chapter on application-level flow control over TCP and the AMQP/HTTP-2 patterns; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (max.in.flight.requests.per.connection as a degenerate single-credit scheme); Async Rust — Maxwell Flitton & Caroline Morton, advanced channel patterns
Context
Lesson 1 established send().await on a bounded channel as the foundation of backpressure. The mechanism is implicit credit: the channel's capacity is the credit pool; every successful send consumes one slot; the consumer's recv returns a slot to the pool. The sender does not manage credits explicitly; it just calls send and gets backpressure for free. This is the right shape for almost every operator-pair edge in the SDA pipeline and the right default for nearly all dataflow systems.
There is one shape it does not fit well: the receiver wants to pause ingestion entirely without holding the in-flight item in a buffer. The canonical case is a checkpoint flush. The receiver needs to durably persist its current state before accepting new events, and during that window it cannot afford to even buffer the next event because the in-flight buffer is exactly what the checkpoint is trying to capture. The bounded-channel-plus-await design has no clean way to express "stop sending, but don't fill the slot you'd have used." The receiver can just not call recv for a while, but the next item the sender produces fills the channel, occupies a slot, and is now part of the in-flight state the checkpoint must serialize.
Credit-based flow control is the alternative shape. The receiver issues credits explicitly; the sender consumes credits on each send and refuses to send when out of credits. The credit return path is a separate channel from the data path, which means the receiver can pause issuing credits without touching the data channel at all — the sender's credit counter goes to zero, the sender stops, no in-flight item is created. This decoupling is what makes credit-based flow control the right primitive for pause-without-buffer-fill operations and for pipelines where data and control are physically separated (HTTP/2 stream flow control; AMQP prefetch; Kafka's max.in.flight.requests.per.connection as a single-credit degenerate case).
This lesson develops credit-based flow control as the operational alternative to bounded-channel-plus-await, identifies precisely when each fits, and develops the implementation pattern as a wrapper around the M2 channel primitives. The capstone uses credit-based flow control upstream of the windowed correlator's checkpoint flush in Module 5; this lesson installs the machinery now so that work has the primitive available.
Core Concepts
Credits, Defined
A credit is permission to send one item. The receiver has a finite pool of credits at any moment; the sender consumes one credit per send; sending without a credit is forbidden. The receiver replenishes the pool by sending credit-return messages back to the sender, each indicating that some prior items have been consumed and N new credits are available. The sender's local credit counter starts at the receiver's initial grant, decrements on each send, and increments on each credit-return.
When credits = 0 the sender stops. It does not buffer. It does not block on a channel send. It returns to its caller (or sits at its own await point waiting for credit returns) without producing the next item. This is the mechanism that makes the receiver's pause real: by withholding credits, the receiver creates an upstream stop without occupying a single channel slot.
The shape resembles backpressure-via-await in the steady state — when credits flow normally, the sender produces at the receiver's rate, just like it would on a bounded channel — but differs in two structural ways. First, the credit signal is on a separate channel, decoupled from the data path. Second, the sender's behavior on "out of credits" is its own decision (block, drop, route elsewhere) rather than the channel's.
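The counter semantics described above fit in a few lines. The sketch below is only the bookkeeping, with no channels attached; CreditCounter and its method names are hypothetical and are not the CreditSender developed later in this lesson.

```rust
/// Sender-side credit bookkeeping: start at the initial grant,
/// decrement per send, increment per credit-return.
pub struct CreditCounter {
    available: u32,
}

impl CreditCounter {
    pub fn new(initial_grant: u32) -> Self {
        Self { available: initial_grant }
    }

    /// Take one credit if any is available. `false` means the sender
    /// must stop: it may not produce the next item.
    pub fn try_consume(&mut self) -> bool {
        if self.available == 0 {
            return false;
        }
        self.available -= 1;
        true
    }

    /// Apply a credit-return message carrying `n` credits.
    pub fn replenish(&mut self, n: u32) {
        self.available = self.available.saturating_add(n);
    }

    pub fn available(&self) -> u32 {
        self.available
    }
}
```

What the sender does when try_consume returns false (wait, drop, route elsewhere) is the sender's own decision, which is the second structural difference from a bounded channel.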
Credit-Based vs Backpressure-via-Await
Both produce upstream slowdowns under sustained downstream pressure. The differences are operational.
Backpressure-via-await binds the credit signal to the data channel — the channel's capacity IS the credit pool. To pause the sender, the receiver must not call recv, which means the channel fills, which means the next in-flight item occupies a slot. The receiver cannot pause without buffering at least one more item.
Credit-based decouples them. The receiver pauses by withholding credits on the side channel; the data channel remains empty (or at whatever steady state it was in). The sender's local credit counter goes to zero; the sender stops without producing the next item.
For most pipeline edges this difference does not matter. The data channel filling with one extra item before the sender suspends is not a problem; it does not change the pipeline's correctness. For the checkpoint case it does matter — the in-flight buffer being empty is the property the checkpoint depends on. Credit-based is the right primitive when the receiver needs that empty-buffer property.
There is a secondary difference. Credit-based supports batched grants: the receiver can issue 100 credits at once, and the sender can fire 100 sends without coordination. Backpressure-via-await supports the same pattern only via the channel's capacity, which couples the burst size to the persistent slack. Credits let burst size and steady-state slack be independent, which is occasionally useful (a receiver that wants 10 in-flight messages steady-state but allows occasional 100-message bursts).
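The pause behaviour of the two schemes can be compared with a small single-threaded model. It uses the standard library's bounded sync_channel as a stand-in for the data channel and a plain integer for the sender's credits; both function names are hypothetical. The receiver is "paused" simply by never receiving while the sender keeps attempting sends.

```rust
use std::sync::mpsc::{sync_channel, Receiver};

fn buffered(rx: &Receiver<u32>) -> usize {
    // Count what ended up sitting in the channel during the pause.
    rx.try_iter().count()
}

/// Bounded channel only: the channel's capacity is the credit pool.
pub fn buffered_during_pause_bounded(capacity: usize, attempts: u32) -> usize {
    let (tx, rx) = sync_channel::<u32>(capacity);
    for i in 0..attempts {
        let _ = tx.try_send(i); // fails with Full once capacity is used
    }
    buffered(&rx)
}

/// Credit scheme: the sender may only send while it holds credits.
pub fn buffered_during_pause_credits(
    sender_credits: u32,
    capacity: usize,
    attempts: u32,
) -> usize {
    let (tx, rx) = sync_channel::<u32>(capacity);
    let mut credits = sender_credits;
    for i in 0..attempts {
        if credits == 0 {
            continue; // out of credits: nothing enters the channel
        }
        credits -= 1;
        let _ = tx.try_send(i);
    }
    buffered(&rx)
}
```

With a capacity-1 bounded channel, one item is buffered during the pause. With credits, nothing is buffered if the sender holds zero credits when the pause begins, while a sender that already holds three credits still places three items in the channel. The empty-buffer property therefore depends on the receiver having withheld the credits before the pause, not only during it.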
The Credit Return Path
Two channels: data and credit-return. The data channel is the same mpsc::Sender<T> / mpsc::Receiver<T> pair we have used throughout. The credit-return channel is mpsc::Sender<u32> / mpsc::Receiver<u32>, where each message is a credit count being returned. The receiver's per-event behavior:
- recv an item from the data channel.
- Process the item.
- Send 1 (or some larger batch count) on the credit-return channel.
The sender's per-event behavior:
- Drain the credit-return channel of any pending returns; increment the local counter by the sum.
- If the local counter is 0, await a credit-return on the credit-return channel.
- Decrement the counter; send the item on the data channel.
The data channel itself does not need to be bounded — credits bound the in-flight count, which is what the bounded channel was for. In practice, the data channel is given a small bound (matching the maximum credit grant) to keep the implementation defensive against credit-counter bugs.
Where Production Uses Credits
HTTP/2 has per-stream and per-connection flow-control windows (RFC 7540 Section 5.2). The receiver's WINDOW_UPDATE frame is exactly a credit-return: it tells the sender how many more bytes it may send on a given stream. The use case is multiplexing many streams on a single TCP connection — each stream needs its own backpressure that does not affect the others, and credits on a side channel give that.
AMQP (RabbitMQ, ActiveMQ) uses per-channel prefetch limits. The consumer declares its prefetch count; the broker delivers up to that many unacknowledged messages; the consumer's ack returns a credit. The mechanism is identical to the lesson's pattern, with the broker in the sender role and the consumer in the receiver role.
Kafka's max.in.flight.requests.per.connection is a small fixed credit pool, and a single-credit scheme in the degenerate N=1 case. The producer can have at most N in-flight requests to a given broker; each completed request returns one credit. With N=1 (the strictest setting), the producer sends strictly one request at a time on each connection. With N=5 (the default), small bursts are allowed.
The pattern is widespread in production protocols. It is less commonly used in single-process pipelines because backpressure-via-await is sufficient most of the time; credit-based shows up specifically where the additional decoupling is needed.
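The HTTP/2 case maps onto the credit vocabulary directly: the window is the credit pool measured in bytes, and WINDOW_UPDATE is the credit return. The sketch below is a deliberately simplified model of one window, not an HTTP/2 implementation; SendWindow and its methods are hypothetical names. The 65,535-octet starting value is the initial window size given in RFC 7540.

```rust
/// Initial flow-control window size in RFC 7540, in octets.
pub const DEFAULT_INITIAL_WINDOW: i64 = 65_535;

/// Simplified sender-side view of one flow-control window.
pub struct SendWindow {
    available: i64,
}

impl SendWindow {
    pub fn new() -> Self {
        Self { available: DEFAULT_INITIAL_WINDOW }
    }

    /// Reserve up to `want` bytes; returns how many may be sent now.
    pub fn reserve(&mut self, want: u32) -> u32 {
        let grant = i64::from(want).min(self.available.max(0));
        self.available -= grant;
        grant as u32
    }

    /// Apply a WINDOW_UPDATE-style credit return of `increment` bytes.
    pub fn window_update(&mut self, increment: u32) {
        self.available += i64::from(increment);
    }

    pub fn available(&self) -> i64 {
        self.available
    }
}
```

A sender that wants 10,000 bytes but has 5,535 left sends 5,535 and then waits for the next update, which is the byte-granular version of "out of credits, stop".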
When to Reach for Credits in SDA
Three concrete cases.
Checkpoint flush (Module 5). The windowed correlator's state is being durably written to disk. During the write, the operator cannot afford to even buffer the next event. The flush operator pauses ingestion by withholding credits; the upstream correlator sender stops cleanly without occupying a slot.
Cross-runtime edges (advanced bulkheading from M2 L4). When an operator in one runtime sends to an operator in a different runtime (the propagator pool versus the main pipeline), the bounded channel between them does not propagate backpressure cleanly because the runtimes have independent schedulers. Credit-based flow gives an explicit mechanism that the receiving runtime controls.
Operator handoff during a graceful drain. During shutdown, the orchestrator wants the upstream to stop producing while in-flight items finish flowing. The orchestrator (or a control-plane operator) withholds credits on the affected edges; the upstream halts; downstream drains; the pipeline shuts down cleanly without losing in-flight items.
For everything else, prefer the simpler send().await pattern from Lesson 1. Credit-based flow is a heavier mechanism with more coordination overhead, and using it where it is not needed adds operational surface area.
Code Examples
A CreditChannel Wrapper
The wrapper is two channels and a small bookkeeping struct. The sender side checks credits before each send; the receiver side issues credit returns after each successful processing.
```rust
use anyhow::Result;
use tokio::sync::mpsc;

/// One end of a credit-based flow channel. The sender consumes credits
/// from its local counter on each send; when the counter is zero, it
/// awaits a credit return.
pub struct CreditSender<T> {
    data_tx: mpsc::Sender<T>,
    credit_rx: mpsc::Receiver<u32>,
    local_credits: u32,
}

impl<T> CreditSender<T> {
    pub async fn send(&mut self, item: T) -> Result<()> {
        // Drain any pending credit returns first.
        while let Ok(returned) = self.credit_rx.try_recv() {
            self.local_credits = self.local_credits.saturating_add(returned);
        }
        // If we are out of credits, block on a return.
        while self.local_credits == 0 {
            match self.credit_rx.recv().await {
                Some(returned) => {
                    self.local_credits = self.local_credits.saturating_add(returned);
                }
                None => return Err(anyhow::anyhow!("credit return channel closed")),
            }
        }
        self.local_credits -= 1;
        self.data_tx.send(item).await
            .map_err(|_| anyhow::anyhow!("data channel receiver dropped"))?;
        Ok(())
    }

    /// Current local credit count; useful for diagnostics.
    pub fn credits(&self) -> u32 { self.local_credits }
}

/// The receiver end. Reading an item produces a `CreditHandle` that
/// MUST be returned (via return_credit) after the item is processed.
/// Forgetting to return credits is the canonical credit-leak bug.
pub struct CreditReceiver<T> {
    data_rx: mpsc::Receiver<T>,
    credit_tx: mpsc::Sender<u32>,
}

pub struct CreditHandle<'a> {
    credit_tx: &'a mpsc::Sender<u32>,
    returned: bool,
}

impl<T> CreditReceiver<T> {
    pub async fn recv(&mut self) -> Option<(T, CreditHandle<'_>)> {
        let item = self.data_rx.recv().await?;
        let handle = CreditHandle {
            credit_tx: &self.credit_tx,
            returned: false,
        };
        Some((item, handle))
    }
}

impl<'a> CreditHandle<'a> {
    /// Return one credit to the sender. Should be called after the
    /// associated item has been processed.
    pub async fn return_credit(mut self) -> Result<()> {
        self.returned = true;
        self.credit_tx.send(1).await
            .map_err(|_| anyhow::anyhow!("credit return channel sender dropped"))?;
        Ok(())
    }
}

impl<'a> Drop for CreditHandle<'a> {
    fn drop(&mut self) {
        if !self.returned {
            // Forgotten credit return: log it. This is a programming bug,
            // not a recoverable condition; production code should alert.
            tracing::error!("CreditHandle dropped without returning credit; credit leaked");
        }
    }
}

/// Pair constructor: returns matched sender + receiver with the given
/// initial credit grant.
pub fn credit_channel<T: Send + 'static>(
    data_capacity: usize,
    initial_credits: u32,
) -> (CreditSender<T>, CreditReceiver<T>) {
    let (data_tx, data_rx) = mpsc::channel(data_capacity);
    // The credit-return channel is sized like the data channel: there
    // are never more outstanding returns than items in flight.
    let (credit_tx, credit_rx) = mpsc::channel(data_capacity);
    (
        CreditSender { data_tx, credit_rx, local_credits: initial_credits },
        CreditReceiver { data_rx, credit_tx },
    )
}
```
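The CreditHandle with its Drop-emits-error pattern is a useful safety net: forgetting to return a credit is the canonical bug in this pattern, and the error log at Drop makes the bug visible in logs rather than mysterious in production. Production code can tighten this further by marking the handle #[must_use] and denying the unused_must_use lint, which catches some discarded handles at compile time but not all of them (a handle bound to a variable and dropped later is not flagged), so the runtime log remains the backstop. For clarity, the example here uses only the runtime log. The data_capacity parameter sets the data channel's bound; it should equal or exceed the maximum credit grant ever issued so the data channel itself is never the limiting factor. The same value sizes the credit-return channel, and tokio's mpsc::channel panics on a capacity of 0, so data_capacity must be at least 1.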
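A different way to rule out the credit leak is to return the credit from Drop itself, so that every exit path returns it. The sketch below is an alternative design, not the wrapper shown above, and it is an assumption that this trade-off suits a given edge. It uses the standard library's unbounded channel because Drop cannot await; AutoCreditHandle is a hypothetical name.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical RAII handle: the credit goes back when the handle is
/// dropped, whether processing finished normally or bailed out early.
pub struct AutoCreditHandle {
    credit_tx: Sender<u32>,
}

impl AutoCreditHandle {
    pub fn new(credit_tx: Sender<u32>) -> Self {
        Self { credit_tx }
    }
}

impl Drop for AutoCreditHandle {
    fn drop(&mut self) {
        // If the sender side is gone, nobody is waiting for the credit,
        // so the error is deliberately ignored.
        let _ = self.credit_tx.send(1);
    }
}
```

The cost of this design is that an early drop returns the credit before processing is complete, which weakens the in-flight bound; the explicit return_credit call keeps the ordering visible in the code, which is why the log-on-Drop variant is the one used in this lesson.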
A Receiver That Pauses by Withholding Credits
The case the lesson called out as the primary use: pausing the upstream during a checkpoint flush without occupying any in-flight slots.
```rust
use anyhow::Result;
use std::time::{Duration, Instant};
use tokio::sync::mpsc;
use tokio::time::sleep;

/// Operator that periodically checkpoints. During the checkpoint
/// window it withholds credits, pausing the upstream sender cleanly.
pub async fn run_checkpointing_operator<T>(
    mut input: CreditReceiver<T>,
    output: mpsc::Sender<T>,
    checkpoint_interval: Duration,
) -> Result<()>
where
    T: Send + 'static,
{
    let mut last_checkpoint = Instant::now();
    loop {
        if last_checkpoint.elapsed() >= checkpoint_interval {
            // Time to checkpoint. The CRITICAL property: by NOT calling
            // input.recv() during the flush, we both stop accepting new
            // items AND return no credits to the upstream. The upstream
            // can spend only the credits it already holds; once its
            // local counter reaches zero it stops producing.
            tracing::info!("starting checkpoint flush");
            do_checkpoint_flush().await?;
            last_checkpoint = Instant::now();
            tracing::info!("checkpoint flush complete; resuming");
            // Credit returns resume with the next recv() iteration.
            continue;
        }
        match input.recv().await {
            Some((item, credit)) => {
                // Process the item.
                output.send(item).await
                    .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
                // Return the credit AFTER processing. Returning before
                // processing would defeat the bounded-in-flight property
                // (the upstream could fire another item while this one
                // is still in flight at the operator).
                credit.return_credit().await?;
            }
            None => return Ok(()),
        }
    }
}

async fn do_checkpoint_flush() -> Result<()> {
    // Module 5 develops checkpointing in full. Here: stand-in.
    sleep(Duration::from_millis(150)).await;
    Ok(())
}
```
Two design points worth dwelling on. The credit return happens after item processing, not before. Returning before processing breaks the in-flight-bounded property: the upstream sees the credit, fires the next item, and now there are two items in flight at this operator — one being processed, one in the data channel. With the AFTER ordering, the bound is exactly the initial credit grant: at most that many items are in flight at any moment.
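The second is structural: the operator's "pause during checkpoint" mechanism is simply not calling recv. There is no explicit "pause" or "resume" message; the credit-return mechanic falls out of the recv-loop's natural shape. When the operator is in the checkpoint branch, no recv happens, no credit is returned, and the upstream's counter drains. When the operator returns to the recv loop, credits flow again and the upstream resumes. One caveat: credits the upstream already holds when the flush starts can still be spent, so up to that many items can land in the data channel during the flush. A flush that needs a strictly empty data channel has to stop returning credits ahead of the flush and wait until it holds the full grant. The implementation is small precisely because the mechanism is doing the work.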
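Whether the data channel is actually empty at flush time depends on where the credits are when the pause begins. One way to make that explicit is receiver-side bookkeeping that withholds returns ahead of the flush and reports when every credit is back in the receiver's hands. The sketch below is a hypothetical helper, not part of the wrapper above, and it assumes a fixed total grant with no credits created or lost.

```rust
/// Receiver-side ledger for a fixed credit grant (illustrative only).
pub struct CreditLedger {
    grant: u32,    // total credits in circulation
    withheld: u32, // credits the receiver is currently holding back
    pausing: bool,
}

impl CreditLedger {
    pub fn new(grant: u32) -> Self {
        Self { grant, withheld: 0, pausing: false }
    }

    /// Start withholding credit returns ahead of a flush.
    pub fn begin_pause(&mut self) {
        self.pausing = true;
    }

    /// Call after processing one item. Returns how many credits to
    /// send back to the upstream right now.
    pub fn on_processed(&mut self) -> u32 {
        if self.pausing {
            self.withheld += 1;
            0
        } else {
            1
        }
    }

    /// True once the receiver holds the whole grant: the sender has no
    /// credits left and no item can be in the data channel.
    pub fn is_quiesced(&self) -> bool {
        self.withheld == self.grant
    }

    /// End the pause; returns the batch of credits to send back.
    pub fn resume(&mut self) -> u32 {
        self.pausing = false;
        std::mem::take(&mut self.withheld)
    }
}
```

The operator would call begin_pause, keep receiving and processing until is_quiesced reports true, run the flush, and then send the resume batch as one credit-return. An idle upstream that holds unspent credits never lets the ledger quiesce, so a real implementation would also need a timeout or a way to revoke credits; that part is left open here.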
Fairness Across Multiple Senders
When multiple senders share a credit pool with a single receiver, the issuance policy decides who gets what share. Two strategies, each with a different fairness profile.
```rust
use anyhow::Result;
use std::collections::VecDeque;
use tokio::sync::mpsc;

/// Round-robin credit issuance: cycle through senders, granting one
/// credit per sender per round. Fair share regardless of demand.
pub struct RoundRobinCreditIssuer {
    senders: Vec<mpsc::Sender<u32>>,
    cursor: usize,
}

impl RoundRobinCreditIssuer {
    pub async fn grant_one(&mut self) -> Result<()> {
        // Guard against an empty sender list (modulo by zero panics).
        if self.senders.is_empty() {
            return Err(anyhow::anyhow!("no senders registered"));
        }
        let target = self.cursor % self.senders.len();
        self.senders[target].send(1).await
            .map_err(|_| anyhow::anyhow!("sender's credit channel closed"))?;
        // Wrap the cursor so it stays within the sender list.
        self.cursor = (target + 1) % self.senders.len();
        Ok(())
    }
}

/// First-asker-wins issuance: senders queue up requests; the issuer
/// satisfies in arrival order. Greedy senders dominate.
pub struct FifoCreditIssuer {
    request_queue: VecDeque<usize>, // sender indices in arrival order
    senders: Vec<mpsc::Sender<u32>>,
}

impl FifoCreditIssuer {
    pub fn enqueue_request(&mut self, sender_idx: usize) {
        self.request_queue.push_back(sender_idx);
    }

    pub async fn grant_one(&mut self) -> Result<()> {
        if let Some(target) = self.request_queue.pop_front() {
            // Reject an out-of-range index instead of panicking.
            let sender = self.senders.get(target)
                .ok_or_else(|| anyhow::anyhow!("unknown sender index {}", target))?;
            sender.send(1).await
                .map_err(|_| anyhow::anyhow!("sender's credit channel closed"))?;
        }
        Ok(())
    }
}
```
Round-robin gives every sender a predictable share regardless of their per-sender rate. FIFO gives faster senders a larger share because they request more. The choice is operational. SDA's three sources have very different per-source rates (radar at thousands per second; ISL beacons at tens per second), and FIFO would let the radar dominate the credit pool. That is possibly correct if "throughput" is the optimization and definitely wrong if "fair representation" is. Round-robin is the SDA default. Production credit-issuance schemes can be more sophisticated still (weighted round-robin, deficit round-robin, weighted fair queueing); the framework above is the starting point that the rest builds on.
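The difference in shares is easy to see in a channel-free model that only counts who receives each credit. The two functions below are hypothetical helpers written for this comparison; they mirror the issuance order of the two issuers but do no I/O.

```rust
/// Credits per sender when `credits` are issued round-robin.
pub fn round_robin_grants(num_senders: usize, credits: u32) -> Vec<u32> {
    let mut grants = vec![0u32; num_senders];
    if num_senders == 0 {
        return grants;
    }
    for i in 0..credits as usize {
        grants[i % num_senders] += 1;
    }
    grants
}

/// Credits per sender when requests are served in arrival order.
/// `requests` lists sender indices in the order they asked.
pub fn fifo_grants(requests: &[usize], num_senders: usize, credits: u32) -> Vec<u32> {
    let mut grants = vec![0u32; num_senders];
    let valid = requests.iter().filter(|&&idx| idx < num_senders);
    for &idx in valid.take(credits as usize) {
        grants[idx] += 1;
    }
    grants
}
```

With ten credits and two senders, round-robin yields five each. If sender 0 asked nine times before sender 1 asked once, FIFO yields nine and one, which is the "greedy senders dominate" behaviour in numbers.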
Key Takeaways
- Credit-based flow control is the alternative to bounded-channel-plus-await for cases where the receiver needs to pause the upstream without occupying any in-flight slot. Credits flow on a separate channel from data; the receiver's pause is "stop returning credits."
- The structural difference vs send().await: backpressure-via-await binds the credit signal to the channel capacity; credit-based decouples them. Use credit-based when decoupling matters: checkpoint flushes, cross-runtime edges, graceful-drain pauses. Use bounded-channel-plus-await for everything else.
- Production protocols use credits widely. HTTP/2 flow control windows. AMQP prefetch. Kafka's max.in.flight.requests.per.connection as a single-credit degenerate case. The pattern is well-established in distributed systems; this lesson brings it into the single-process pipeline for the cases that benefit.
- The implementation is small: data channel + credit-return channel + per-sender credit counter. The credit return happens after item processing, not before; the AFTER ordering is what makes the in-flight bound exactly equal to the credit grant.
- Credit-issuance fairness is operational. Round-robin gives predictable per-sender share regardless of demand; FIFO lets greedy senders dominate. SDA defaults to round-robin; the heterogeneity of per-source rates would otherwise let one source starve the others.
Lesson 3 — End-to-End Backpressure Propagation
Module: Data Pipelines — M04: Backpressure and Flow Control
Position: Lesson 3 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Buffering and Pushback in stream processing); Network Programming with Rust — Abhishek Chanda, sections on TCP windowing as transport-level backpressure; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (Producer max.block.ms and the producer-side buffer)
Context
Module 1 introduced the bounded-channel-plus-await chain that propagates backpressure from sink to source. Module 2 wrapped that chain in an orchestrator. Lesson 1 of this module established per-edge FlowPolicy. Lesson 2 added credit-based flow as a sharper tool for cases where the bounded-channel pattern is not enough. The pipeline at this point is well-equipped to apply backpressure correctly if the chain is intact end-to-end.
The chain is rarely intact end-to-end. Engineers add small "convenience" pieces — a tokio::spawn for "fire-and-forget" logging, an unbounded_channel for "this side channel cannot fill," a Vec::push into an in-process collection — each one a place where the backpressure traversal stops. The post-Cosmos-1408 incident from this module's mission framing took two hours to diagnose because the pressure chain was structurally broken at a single tokio::spawn inside the correlator's per-event loop. The spawn looked harmless in code review, did not block, did not fill any visible buffer, and silently amplified any downstream slowdown into unbounded task accumulation. The fix was three lines; the diagnosis was the hard part.
This lesson is the audit. It identifies the canonical patterns that break the pressure chain, develops the diagnostic approach (read the channel-occupancy gradient: the slowest stage is the one whose incoming channel is full while its outgoing channel is not), and discusses the two boundary cases where backpressure does not propagate naturally: across a Kafka topic between two pipeline halves, and through retry/loop topologies. The capstone integrates the audit as a CI test against the operator graph and a BurstSimulator that drives the M3 pipeline at 10x normal rate to verify end-to-end propagation under burst load.
Core Concepts
The Pressure Chain
A single contiguous chain of bounded buffers from source to sink. Every adjacent pair of operators connected by a bounded channel; every operator's emit using send().await (or its FlowPolicy equivalent) on its outgoing channel; no detached tasks per item; no in-process unbounded buffers. With those conditions, a slowdown at the sink propagates: the sink's incoming channel fills, the upstream operator's send().await suspends, that upstream's incoming channel fills, that upstream's upstream suspends, all the way back to the source. The source either suspends on its own producing primitive (the UDP recv_from, the HTTP client.get) or, for sources that cannot suspend (a UDP feed that produces whether anyone is listening or not), the kernel-level buffer fills and the kernel drops at its layer.
The chain has a shape: operators are the links, channels are the connections. Breaking the chain means inserting something between two operators that does not propagate the suspend signal. The next subsection enumerates the canonical breakage shapes.
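The propagation described above can be sketched in a single thread. This is a minimal model, assuming std::sync::mpsc::sync_channel as a stand-in for Tokio's bounded channel so it runs without a runtime; the function name and the two-channel topology are illustrative and not taken from the Module 1 code. The sink never reads, and the source is refused once both channels and the stage's in-hand item are occupied.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Hypothetical model: source -> ch1 -> stage -> ch2 -> sink, where the sink
/// never reads. Returns how many items the source hands off before ch1
/// refuses the next one. Capacities are assumed to be at least 1.
pub fn accepted_before_stall(cap1: usize, cap2: usize, offered: usize) -> usize {
    let (tx1, rx1) = sync_channel::<u64>(cap1);
    // Keep the receiver alive but never read it: this is the stalled sink.
    let (tx2, _stalled_sink) = sync_channel::<u64>(cap2);
    let mut held: Option<u64> = None; // item the stage pulled but could not forward
    let mut accepted = 0;

    for i in 0..offered as u64 {
        // Stage: forward until the downstream channel is full or ch1 is empty.
        loop {
            let item = match held.take() {
                Some(item) => item,
                None => match rx1.try_recv() {
                    Ok(item) => item,
                    Err(_) => break, // nothing waiting upstream
                },
            };
            match tx2.try_send(item) {
                Ok(()) => {}
                Err(TrySendError::Full(item)) => {
                    held = Some(item); // the stage is now stuck on its send
                    break;
                }
                Err(TrySendError::Disconnected(_)) => return accepted,
            }
        }
        // Source: a refused send is where an async source would suspend.
        match tx1.try_send(i) {
            Ok(()) => accepted += 1,
            Err(_) => break,
        }
    }
    accepted
}
```

With capacities 2 and 3 the source is refused after 6 items: 3 sitting in ch2, 1 held by the stage, 2 sitting in ch1. Replacing either channel with an unbounded one removes the refusal, which is exactly the break the audit looks for.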
Where Pressure Breaks: Anti-Patterns
Five patterns the lesson identifies as the recurring pressure-chain breakers.
tokio::spawn per-event. An operator's hot loop does tokio::spawn(async move { handle_event(e).await }). The spawn returns immediately; the operator continues processing the next event without waiting for the spawned task. Under steady-state load the spawned tasks complete fast enough that the count stays bounded. Under any sustained slowdown, spawned tasks accumulate without limit. The operator's outgoing channel never fills (because the spawned tasks do the work asynchronously) and never propagates pressure. This is the M1 lesson 3 anti-pattern, and it is the single most common pressure-chain breaker because it looks harmless in code review.
mpsc::unbounded_channel. Lesson 1's footgun. Any unbounded buffer is a break in the pressure chain: the upstream's send always succeeds, so the upstream never observes the downstream's slowness. The buffer grows at the rate of the gap between producer and consumer throughput, for as long as that gap lasts.
Fire-and-forget logging via channels. A common pattern: emit a structured log via an mpsc::Sender to a separate logging task. If that channel is unbounded, log events pile up under load without anyone noticing; if the operator uses try_send and discards the Err, log events are lost without anyone noticing. The fix is not "make logging block on the hot path" (that has its own problems) but one of two options: log through the tracing crate's synchronous macros, whose per-event cost is set by the installed subscriber and is usually small, or use the try_send + counter pattern from Lesson 1 so that drops are at least counted.
Vec::push into ever-growing collections. An operator accumulates events into a Vec for a deferred batch operation. The accumulation has no bound; under load, the Vec grows without limit. The pattern is structurally identical to unbounded_channel and has the same fix: the operator must bound the accumulation by size or time, and apply backpressure or load-shed when the bound is reached.
Drop-and-recreate task patterns. A supervisor that, on every event, drops the current operator task and spawns a fresh one. The motivation is usually "stateless restart for cleanliness," but the effect is that the channel between this operator and its downstream is being reconstructed faster than it can drain — the new task starts with an empty channel, the old task's in-flight items are dropped or orphaned, pressure does not propagate because the channel does not persist.
The canonical fix in every case is structural: replace the breaking pattern with bounded-and-suspending equivalents. The tokio::spawn per-event becomes an inline await. The unbounded_channel becomes channel(N). The Vec::push accumulator becomes a sized VecDeque with explicit eviction or backpressure. The drop-and-recreate becomes a long-lived supervised operator (Module 2 L4).
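As one illustration of the "sized VecDeque" replacement for a Vec::push accumulator, here is a minimal sketch. The type name BoundedBatch and its methods are hypothetical, and only the size bound is shown; a time bound would sit alongside it.

```rust
use std::collections::VecDeque;

/// Hypothetical size-bounded accumulator; a stand-in for an unbounded
/// `Vec::push` batch buffer. A time bound is left out to keep it short.
pub struct BoundedBatch<T> {
    items: VecDeque<T>,
    max_items: usize,
}

impl<T> BoundedBatch<T> {
    pub fn new(max_items: usize) -> Self {
        Self { items: VecDeque::with_capacity(max_items), max_items }
    }

    /// Accepts the item if there is room. When full, the item is handed
    /// back so the caller has to choose: flush first (backpressure) or
    /// drop and count it (load shed).
    pub fn try_push(&mut self, item: T) -> Result<(), T> {
        if self.items.len() >= self.max_items {
            return Err(item);
        }
        self.items.push_back(item);
        Ok(())
    }

    /// Takes the whole batch in arrival order, leaving the buffer empty.
    pub fn drain(&mut self) -> Vec<T> {
        self.items.drain(..).collect()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```

Returning the rejected item instead of growing silently is the design point: the full condition becomes a value the operator must handle, the same way a full bounded channel does.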
Reading the Pressure Gradient
When a pipeline is correctly chained but slow somewhere, the channel-occupancy gradient identifies the slow stage. The slowest operator's incoming channel is consistently full. Upstream of that operator, channels are partially filled (the stages between the source and the slow operator are running at the pipeline's bottleneck rate, with channels carrying some slack). Downstream of the slow operator, channels are mostly empty (the downstream is faster than the upstream is producing).
The pattern looks like a step function in the per-channel occupancy gauges: at or near 100% on the bottleneck's incoming channel, dropping toward 0% on the channels downstream of it. The ops engineer reading the dashboard finds the bottleneck by looking for the rightmost 100% channel: the operator that consumes from that channel is the slowest stage, and everything downstream of that operator is starved.
The lesson develops the audit as a diagnostic operator that exports per-channel occupancy as a Prometheus gauge. Module 6 generalizes this into the operational dashboard's primary panel; this lesson installs the foundation.
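The gradient-reading rule can be written down as a small function. This is a sketch under assumptions: the helper name, the edge labels, and the convention that edges are listed in source-to-sink order are all illustrative.

```rust
/// Hypothetical helper: `edges` lists (edge label, occupancy in 0.0..=1.0)
/// in source-to-sink order. Returns the full edge closest to the sink; the
/// operator consuming from that edge is the suspected bottleneck.
pub fn rightmost_full_edge<'a>(edges: &[(&'a str, f64)], full_at: f64) -> Option<&'a str> {
    edges
        .iter()
        .rev() // scan from the sink back toward the source
        .find(|(_, occupancy)| *occupancy >= full_at)
        .map(|(label, _)| *label)
}
```

Scanning from the sink side matters when pressure has already backed up: several upstream channels may read full at once, and only the one closest to the sink points at the slow operator.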
TCP Windowing as Transport-Level Backpressure
Module 1 introduced TCP windowing as the kernel-level mechanism that propagates backpressure from a slow application back to the producer over the network. The receiver's TCP stack advertises a receive window — how many more bytes it can buffer; as the application reads, the window grows; when the application stops reading, the window shrinks. The sender's TCP stack respects the advertised window and pauses sending when the window is zero.
This works only if the application reads synchronously from the socket and processes each read before reading the next one. An application that reads as fast as possible into an in-process buffer and processes asynchronously breaks TCP-level backpressure: the application drains the kernel buffer as fast as the network can fill it, the receive window stays advertised at maximum, and the sender does not slow regardless of how slow the application's processing is. The backpressure chain ends at the in-process buffer, which is unbounded by definition.
The discipline is to drive the read loop and the channel send from the same task. The radar source from Module 1 has this shape: recv_from().await followed by sink.write(obs).await. When the sink's downstream channel is full, the write suspends, the next iteration of the loop is delayed, the next recv_from is delayed, and the kernel's socket buffer fills. For a TCP-fed source with the same loop shape, the full kernel buffer shrinks the advertised receive window and the sender's TCP stack slows. For a UDP feed such as the radar's there is no window to shrink, and the kernel drops datagrams once its buffer is full. Either way, every link in the chain (application, channel, kernel, network) has to propagate the pressure. Breaking any one link breaks the whole.
Backpressure Across Kafka
The pipeline often has a Kafka topic between two halves: the ingestion half writes to a topic, a consumer half reads from it. Backpressure across the topic boundary does not work the same way as within a single process.
The producer's view: a Kafka producer maintains a producer-side buffer (buffer.memory, default 32 MB). When the brokers acknowledge slowly, the producer-side buffer fills. With max.block.ms at a non-zero value (default 60 seconds), the producer's send blocks waiting for buffer space, which propagates backpressure into the producer-side application; if no space frees up within that time, the send fails. With max.block.ms = 0, send fails immediately on a full buffer and the application decides whether to drop or retry. Consumer speed does not enter into this: a slow consumer never fills the producer's buffer. The producer-side configuration determines the boundary behavior.
The consumer's view: the consumer's lag (the gap between the topic's high-watermark and the consumer's committed offset) grows when the consumer is slow. Backpressure does not propagate to the producer, because Kafka decouples the two halves intentionally. The producer keeps writing regardless of consumer lag; the consumer falls behind silently until lag is observed via metrics, and if the lag outlasts the topic's retention, unread records are deleted before the consumer reaches them. The pipeline operator's responsibility is to monitor consumer lag explicitly and alert when it grows past a threshold. The implicit pressure chain that works within a single process becomes an explicit observability discipline at the Kafka boundary.
For the SDA pipeline, this is mostly future work — Module 5 introduces Kafka as a checkpoint persistence layer and Module 6 develops the lag monitoring discipline. This lesson surfaces the boundary so the audit script does not flag Kafka producer/consumer pairs as pressure-chain breaks (they ARE breaks within a single process; they ARE intended at the cross-pipeline boundary; the monitoring is what restores the missing signal).
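The lag arithmetic behind that monitoring discipline is small enough to sketch. The struct and function names below are illustrative, not a Kafka client library's types, and the threshold is whatever the operator picks.

```rust
/// Offsets for one partition as a lag monitor might see them.
/// Field names are illustrative, not a client library's types.
#[derive(Debug, Clone, Copy)]
pub struct PartitionOffsets {
    pub high_watermark: u64,
    pub committed: u64,
}

/// Lag for one partition: records written but not yet committed as processed.
pub fn partition_lag(p: PartitionOffsets) -> u64 {
    // saturating_sub guards against a stale watermark reading.
    p.high_watermark.saturating_sub(p.committed)
}

/// Total lag across partitions, and whether it crosses the alert threshold.
pub fn lag_alert(partitions: &[PartitionOffsets], threshold: u64) -> (u64, bool) {
    let total: u64 = partitions.iter().copied().map(partition_lag).sum();
    (total, total > threshold)
}
```

A sustained rise in the total, more than any single reading, is the signal that plays the role a full channel plays inside one process.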
Code Examples
A Pressure-Chain Audit Script
The audit walks the operator graph from M2 and flags edges that are unbounded, operators that have detached tokio::spawn calls in their hot path (a heuristic check against the source code), and channels that lack a documented FlowPolicy. Failing edges produce a CI error.
use anyhow::{anyhow, Result};

// OperatorGraph and its edge accessors (edges, is_bounded, flow_policy,
// from_name, to_name) come from the Module 2 orchestrator.

#[derive(Debug)]
pub struct AuditFinding {
    pub edge_or_operator: String,
    pub category: AuditCategory,
    pub detail: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuditCategory {
    UnboundedChannel,
    NoFlowPolicy,
    // Heuristic: source-grep for tokio::spawn inside operator bodies.
    // The grep pass itself is not part of this listing.
    DetachedSpawnSuspected,
}

/// Audit an OperatorGraph for backpressure-chain integrity. Returns
/// the list of findings; an empty list means the audit passed.
pub fn audit(graph: &OperatorGraph) -> Vec<AuditFinding> {
    let mut findings = Vec::new();
    for edge in graph.edges() {
        let edge_label = format!("{} -> {}", edge.from_name, edge.to_name);
        if !edge.is_bounded() {
            findings.push(AuditFinding {
                edge_or_operator: edge_label.clone(),
                category: AuditCategory::UnboundedChannel,
                detail: "edge uses unbounded_channel; pressure does not propagate".into(),
            });
        }
        if edge.flow_policy().is_none() {
            findings.push(AuditFinding {
                edge_or_operator: edge_label,
                category: AuditCategory::NoFlowPolicy,
                detail: "edge has no documented FlowPolicy; default of Backpressure assumed but should be explicit".into(),
            });
        }
    }
    findings
}

/// CI helper: convert findings into a Result that fails the test.
pub fn audit_or_fail(graph: &OperatorGraph) -> Result<()> {
    let findings = audit(graph);
    if findings.is_empty() {
        return Ok(());
    }
    // One line per finding so the CI log is readable at a glance.
    let summary: Vec<String> = findings
        .iter()
        .map(|f| format!("[{:?}] {}: {}", f.category, f.edge_or_operator, f.detail))
        .collect();
    Err(anyhow!("backpressure audit failed:\n{}", summary.join("\n")))
}
The audit is intentionally conservative: an unflagged graph is probably correct but the audit cannot prove it. The DetachedSpawnSuspected heuristic is the weakest part — a source-grep for tokio::spawn inside operator bodies catches the obvious cases but misses cases where the spawn is hidden inside a helper function. Production audit tooling extends to AST-level inspection or annotation-based pattern matching; the lesson's version is sufficient as a CI canary that catches the regressions the postmortem identified.
A Per-Channel Occupancy Gauge
The diagnostic operator that exports the channel-occupancy gradient. Drops in transparently between any two operators by wrapping the channel.
use anyhow::Result;
use tokio::sync::mpsc;

/// A wrapper around mpsc::Sender that exports the channel's current
/// occupancy as a metric on every send. Used between operators where
/// occupancy needs to be observable for the pressure-gradient diagnostic.
pub struct InstrumentedSender<T> {
    inner: mpsc::Sender<T>,
    capacity: usize,
    edge_label: String,
}

impl<T> InstrumentedSender<T> {
    pub fn new(inner: mpsc::Sender<T>, capacity: usize, edge_label: impl Into<String>) -> Self {
        Self { inner, capacity, edge_label: edge_label.into() }
    }

    pub async fn send(&self, item: T) -> Result<()> {
        // Export occupancy as a Prometheus gauge labeled by edge.
        // The dashboard's primary panel filters this by edge_label
        // and shows the per-edge gradient.
        // Sender::capacity() is the number of free slots right now, so
        // this reading is taken just before the item is enqueued.
        let used = self.capacity.saturating_sub(self.inner.capacity());
        metrics::gauge!("channel_occupancy", "edge" => self.edge_label.clone())
            .set(used as f64);
        self.inner.send(item).await
            .map_err(|_| anyhow::anyhow!("downstream dropped"))
    }
}
The mpsc::Sender::capacity() method returns the remaining capacity (slots free), so used = total - remaining. The gauge update is per-send overhead, including a clone of the edge label for the metric's label set; at SDA's volumes (tens of thousands of sends per second) that cost is small next to the per-item work the operators do. For higher-throughput pipelines the gauge would be sampled less often (every Nth send) at the cost of dashboard responsiveness. Module 6 generalizes this pattern into a structured metric for every operator-pair edge.
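The occupancy arithmetic and the every-Nth-send sampling can be sketched without a runtime. Both names below (occupancy_fraction, EveryNth) are hypothetical helpers, not part of the lesson's crate.

```rust
/// Occupancy as a fraction of capacity, from the two numbers the sender
/// exposes: total capacity and remaining free slots.
pub fn occupancy_fraction(total: usize, remaining: usize) -> f64 {
    if total == 0 {
        return 0.0;
    }
    total.saturating_sub(remaining) as f64 / total as f64
}

/// Hypothetical sampler: says "export now" on the first send and then on
/// every `n`th send after it.
pub struct EveryNth {
    n: u64,
    seen: u64,
}

impl EveryNth {
    pub fn new(n: u64) -> Self {
        Self { n: n.max(1), seen: 0 }
    }

    pub fn should_sample(&mut self) -> bool {
        let sample = self.seen % self.n == 0;
        self.seen = self.seen.wrapping_add(1);
        sample
    }
}
```

A sender wrapper would call should_sample() before touching the gauge, so the metric update cost is paid on one send in N while the channel send itself is unchanged.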
A BurstSimulator for End-to-End Pressure Verification
The integration test that drives a synthetic 10x burst through the pipeline and asserts the per-channel occupancy gradient stabilizes at the expected bottleneck. The simulator's value is not the simulation itself but the assertion structure: under burst load, the slowest operator's incoming channel should be persistently full, every other channel should be measurably below full.
use anyhow::Result;
use std::time::Duration;
// Tokio's Instant follows tokio::time::pause()/advance(), so the same
// simulator can run against virtual time in tests.
use tokio::time::Instant;

// SyntheticSource, OccupancySample, and sample_all_edges() are defined
// elsewhere in the lesson's crate.

pub struct BurstSimulator {
    target_rate_per_s: u64,
    duration: Duration,
}

impl BurstSimulator {
    pub fn new(target_rate_per_s: u64, duration: Duration) -> Self {
        Self { target_rate_per_s, duration }
    }

    /// Drive `target_rate_per_s` synthetic observations into the
    /// pipeline's source for `duration`, sampling channel occupancy
    /// at 10 Hz. Returns the per-edge occupancy time series.
    pub async fn drive(&self, source: impl SyntheticSource) -> Result<OccupancyReport> {
        const SAMPLE_EVERY: Duration = Duration::from_millis(100);
        let start = Instant::now();
        // Nanosecond arithmetic: integer milliseconds would round the
        // interval to zero for any rate above 1000 per second.
        let interval = Duration::from_nanos(1_000_000_000 / self.target_rate_per_s.max(1));
        let mut next_emit = start;
        let mut sample_at = start + SAMPLE_EVERY;
        let mut samples: Vec<OccupancySample> = Vec::new();
        while start.elapsed() < self.duration {
            source.emit_observation().await?;
            // Pace against absolute deadlines so timer overshoot does not
            // accumulate; a deadline already in the past does not wait.
            next_emit += interval;
            tokio::time::sleep_until(next_emit).await;
            if Instant::now() >= sample_at {
                samples.push(sample_all_edges());
                sample_at = Instant::now() + SAMPLE_EVERY;
            }
        }
        Ok(OccupancyReport { samples })
    }
}

#[derive(Debug)]
pub struct OccupancyReport {
    pub samples: Vec<OccupancySample>,
}

impl OccupancyReport {
    /// Identify the persistently full edge (>= 95% mean occupancy in the
    /// final third of the simulation) closest to the sink. That edge's
    /// downstream operator is the bottleneck.
    pub fn identify_bottleneck(&self) -> Option<String> {
        let final_third_start = self.samples.len() * 2 / 3;
        let final_samples = &self.samples[final_third_start..];
        if final_samples.is_empty() {
            return None; // no samples, nothing to average
        }
        // edge_names() is assumed to list edges in source-to-sink order, so
        // scanning in reverse finds the rightmost full channel.
        self.edge_names().into_iter().rev().find(|edge_name| {
            let avg_occupancy = final_samples.iter()
                .map(|s| s.edge_occupancy(edge_name))
                .sum::<f64>() / final_samples.len() as f64;
            avg_occupancy >= 0.95
        })
    }

    fn edge_names(&self) -> Vec<String> { /* ... */ vec![] }
}
The simulator is more useful as a CI canary than as a load-testing tool — its value is the bottleneck-identification assertion, not the absolute throughput numbers. A regression that pushes the bottleneck from where ops expects it to be (the correlator) to somewhere else (a normalize that just got slower because of an unrelated change) is exactly the kind of thing the burst simulator catches before it becomes a production incident. The capstone wires this into CI.
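The assertion structure (one edge persistently full in the final third of the run, every other edge measurably below full) can be sketched over plain occupancy series. The function names and the edge labels are illustrative assumptions, independent of the OccupancyReport type.

```rust
/// Mean of the final third of an occupancy series, or None if the series
/// is too short to have a final third.
pub fn final_third_mean(series: &[f64]) -> Option<f64> {
    let tail = &series[series.len() * 2 / 3..];
    if tail.is_empty() {
        return None;
    }
    Some(tail.iter().sum::<f64>() / tail.len() as f64)
}

/// True when the expected bottleneck edge is persistently full and every
/// other edge stays below the `full_at` threshold.
pub fn gradient_matches(
    series_by_edge: &[(&str, Vec<f64>)],
    expected_bottleneck: &str,
    full_at: f64,
) -> bool {
    series_by_edge.iter().all(|(edge, series)| {
        match final_third_mean(series) {
            Some(mean) if *edge == expected_bottleneck => mean >= full_at,
            Some(mean) => mean < full_at,
            None => false, // an edge with no samples cannot pass
        }
    })
}
```

Averaging over the final third rather than the whole run keeps the warm-up ramp, where every channel is still filling, out of the verdict.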
Key Takeaways
- The pressure chain is a contiguous sequence of bounded buffers from source to sink with send().await (or FlowPolicy equivalent) on every edge. A slowdown at any operator propagates upstream all the way to the source's producing primitive. The chain breaks at any unbounded buffer or detached tokio::spawn per event.
- The canonical breakage patterns are: per-event tokio::spawn (M1's anti-pattern revisited), mpsc::unbounded_channel, fire-and-forget logging on unbounded channels, Vec::push into unbounded accumulators, and drop-and-recreate task patterns. The fix in every case is structural: replace with bounded-and-suspending equivalents.
- Reading the pressure gradient identifies the slowest stage. The slowest operator's incoming channel is persistently full; channels upstream are partially full; channels downstream are mostly empty. The dashboard panel for per-channel occupancy is the primary diagnostic.
- TCP windowing as backpressure works only when the application reads synchronously from the socket and processes inline. An async-buffer pattern that reads as fast as possible breaks the kernel-level chain just like an unbounded Vec breaks the application-level chain.
- Backpressure across Kafka does not propagate synchronously. Kafka decouples producer and consumer intentionally. The pipeline's discipline is to monitor consumer lag explicitly as the substitute for the within-process pressure signal. Module 5 develops Kafka as a checkpoint store; Module 6 develops the lag monitoring.
Capstone Project — Backpressure-Aware Fusion Broker
Module: Data Pipelines — M04: Backpressure and Flow Control Estimated effort: 1–2 weeks of focused work Prerequisites: All three lessons in this module passed at ≥70%
Mission Brief
OPS DIRECTIVE — SDA-2026-0188 / Phase 4 Implementation Classification: BURST-LOAD HARDENING
Last week's anti-satellite test added 1,800 newly tracked debris objects to the catalog within 90 seconds. The Phase 3 pipeline survived but with twelve minutes of catch-up time and four dropped conjunction alerts during the spike. The postmortem traced the dropped alerts to a single edge — the alert-emitter's incoming channel — where the buffer was sized for nominal traffic, the upstream correlator was IO-bound and could not slow further, and the alert-emitter had no mechanism to triage which observations to drop and which to preserve. The pipeline's response to the burst was uniform load shed without policy: the four critical alerts that got dropped were no more important to the system than the four hundred non-critical observations dropped alongside them.
Phase 4 hardens the pipeline against this failure mode. Every edge in the operator graph gets an explicit FlowPolicy. A priority classifier distinguishes high-priority observations (previously unseen objects, sustained-trajectory updates) from low-priority ones (redundant samples for objects already being tracked). Under shed conditions, low-priority observations are dropped before high-priority. The audit script becomes a CI test that fails the build if a new edge is added without a documented FlowPolicy. A burst simulator drives the pipeline at 10x normal rate for five minutes and asserts no high-priority alerts are dropped during the spike.
Success criteria for Phase 4: a deliberate 10x burst test passes with zero high-priority alerts dropped, with the pipeline memory bounded throughout the spike, and with the operational dashboard identifying the bottleneck operator within 30 seconds of the spike's onset.
What You're Building
Harden the Phase 3 pipeline (M3's conjunction window engine running on M2's orchestrator) against burst load. The deliverable is:
- Every channel in the topology has an explicit FlowPolicy set at construction time and documented with a one-line comment
- A priority classifier (fn classify_priority(obs: &Observation) -> Priority) that distinguishes High and Low based on whether the observation is a fresh sighting, a sustained-trajectory update, or a redundant sample
- A PriorityShedSink that maintains separate sub-channels per priority and drops Low-priority items first when the shared channel approaches capacity
- The audit script from L3 wired into the cargo test suite as a CI gate that fails on any new unbounded channel or undocumented FlowPolicy
- The BurstSimulator from L3 running as an integration test that drives 10x rate for 5 minutes and asserts the high-priority retention property
- Per-channel occupancy gauges (the InstrumentedSender from L3) on every edge, plus a flow_policy_drops_total{policy, priority} counter
- An updated operational README documenting how to read the channel-occupancy gradient, what each FlowPolicy means, and what the burst simulator's pass/fail criteria are
The orchestrator from M2 and the windowed correlator from M3 are unchanged in structure. Only the channels' policies change, and a new operator (the priority classifier) sits between the normalize and the correlator.
Suggested Architecture
┌───────┐ FP=Backpressure FP=Backpressure
│ Radar │═════════>┌────────┐ ┌──────────────┐
│ Src │═════════>│ Norm │ FP=Backpressure │ Windowed │
└───────┘ │ FanIn │═════>┌─────────────┐═════>│ Correlator │
┌───────┐ ═══════>│ │ │ Priority │═════>│ (M3) │
│ Optical│ └────────┘ │ Classifier │ └──────┬───────┘
│ Src │ │ + Shed │ │ FP=Timed(200ms)
└───────┘ └─────────────┘ ▼
┌───────┐ ┌──────────────┐
│ ISL │ │ Alert Sink │
│ Src │ │ (priority │
└───────┘ │ preserving) │
└──────────────┘
│
▼ shed_drop
┌──────────────┐
│ DLQ │
└──────────────┘
Plus a sidecar:
┌─────────────────────────────────────┐
│ /metrics endpoint exporting: │
│ - channel_occupancy{edge} │
│ - flow_policy_drops_total{policy} │
│ - per-priority counters │
└─────────────────────────────────────┘
The priority classifier operator sits between the fan-in normalize and the correlator. Its outgoing edge is a PriorityShedSink — a sink that holds two sub-channels (one per priority) and feeds the correlator from a select! that prefers High when both have items. Under shed conditions (the underlying correlator's incoming channel is filling), Low items are dropped first while High items still flow.
Acceptance Criteria
Functional Requirements
- Every channel in the operator graph has a FlowPolicy set and documented with a comment naming the choice and the reasoning
- No unbounded_channel anywhere in the data path; the audit script fails CI if one is introduced
- No tokio::spawn inside any operator's per-event hot path; the audit script's heuristic check fails CI on new occurrences
- The priority classifier returns Priority::{High, Low} based on the documented rules; the rules are encoded in code with a // Reason: ... comment per branch
- The PriorityShedSink maintains separate sub-channels and drops Low first when the shared underlying channel is over a configured high-water mark (e.g., 80% capacity)
- The alert-emitter uses FlowPolicy::Timed(200ms) and routes timed-out alerts to a DLQ rather than blocking past the SLO
Quality Requirements
- Audit script test: a unit test runs the audit on a synthetic graph that contains an unbounded channel, asserts the audit fails with a recognizable error message
- Burst simulator test: an integration test drives the pipeline at 10x normal rate for 5 minutes; asserts (a) zero High-priority alerts dropped, (b) memory plateau reached within 30 seconds (no unbounded growth), (c) flow_policy_drops_total{priority="low"} increments while flow_policy_drops_total{priority="high"} stays at zero
- Bottleneck identification test: a deterministic test injects a synthetic slow operator, runs the simulator, asserts the gradient-reading helper identifies the right operator as the bottleneck within 30 seconds
- All channel buffer sizes have a documented BurstProfile in the topology declaration; the orchestrator's startup log emits the (edge, buffer_size, profile) triple
Operational Requirements
- /metrics endpoint extends Phase 3's with: channel_occupancy{edge} (gauge per edge), flow_policy_drops_total{policy, priority} (counter), priority_classifier_decisions_total{priority} (counter)
- Operational runbook section "Diagnosing a Backpressure Incident" documenting the gradient-reading discipline, with a worked example using the burst simulator's output as the canonical case
- The audit script runs as part of cargo test --release; failing it fails the CI build
Self-Assessed Stretch Goals
- (self-assessed) The 10x burst test holds P99 high-priority alert latency under 5 seconds throughout the spike (the 30-second SLA is comfortably met). Provide a histogram from a real 5-minute run.
- (self-assessed) A circuit breaker (Module 2 L4) wraps the optical-archive call: if the archive's failure rate exceeds 50% over 30 seconds, the breaker opens for 30 seconds; the dashboard shows breaker state. Demonstrated with a wiremock simulating intermittent 503 responses.
- (self-assessed) The pipeline's checkpoint operator (foreshadowed for M5) uses the Lesson 2 CreditChannel to pause ingestion during the flush; the burst simulator runs through a checkpoint and continues without observable High-priority alert loss.
Hints
How should I encode the priority classification rules?
A small enum and a function with explicit branches. Each branch carries a comment naming the operational rationale.
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority { High, Low }

pub fn classify_priority(obs: &Observation, recent_objects: &RecentSet) -> Priority {
    // Reason: previously-unseen objects are time-critical; their first
    // observation defines the orbital track and a missed alert can mean
    // a missed conjunction.
    if !recent_objects.contains(&obs.target_object_id) {
        return Priority::High;
    }
    // Reason: sustained-trajectory updates from the high-cadence radar
    // refine existing tracks but redundant samples (>4 within 30s) add
    // little to track quality and are sheddable.
    if obs.source_kind == SourceKind::Radar
        && recent_objects.sample_count(&obs.target_object_id, Duration::from_secs(30)) > 4
    {
        return Priority::Low;
    }
    // Reason: anything not shown to be redundant keeps High priority.
    Priority::High
}
The RecentSet is a small auxiliary structure that the classifier owns; it tracks the last N seconds of (object_id, sensor_timestamp) pairs and supports contains and sample_count lookups. Bound it by both time and count (the L2 sliding-window pattern from M3 applies here).
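A minimal sketch of such a structure follows, bounded by both age and entry count. It makes several assumptions that the hint does not state: object ids are u64, sensor timestamps are Durations since an epoch and arrive in non-decreasing order, and sample_count takes the current time explicitly so the sketch stays deterministic.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical RecentSet: (object_id, sensor timestamp) pairs, oldest first.
pub struct RecentSet {
    entries: VecDeque<(u64, Duration)>,
    max_age: Duration,
    max_entries: usize,
}

impl RecentSet {
    pub fn new(max_age: Duration, max_entries: usize) -> Self {
        Self { entries: VecDeque::new(), max_age, max_entries }
    }

    /// Record a sighting, then evict by count and by age so the structure
    /// can never grow past either bound.
    pub fn record(&mut self, object_id: u64, at: Duration) {
        self.entries.push_back((object_id, at));
        while self.entries.len() > self.max_entries {
            self.entries.pop_front();
        }
        while let Some(&(_, oldest)) = self.entries.front() {
            if at.saturating_sub(oldest) > self.max_age {
                self.entries.pop_front();
            } else {
                break;
            }
        }
    }

    pub fn contains(&self, object_id: u64) -> bool {
        self.entries.iter().any(|(id, _)| *id == object_id)
    }

    /// Sightings of `object_id` no older than `window` relative to `now`.
    pub fn sample_count(&self, object_id: u64, now: Duration, window: Duration) -> usize {
        self.entries
            .iter()
            .filter(|(id, at)| *id == object_id && now.saturating_sub(*at) <= window)
            .count()
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The linear scans are acceptable for a small window; a per-object index would be the next step if the entry bound is raised.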
How should the PriorityShedSink select between sub-channels?
A tokio::select! over two recv futures, with a bias toward High when both are ready. The biased mode of select! gives deterministic precedence — without it, the macro picks randomly among ready arms.
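use anyhow::{anyhow, Result};
use tokio::sync::mpsc;

pub async fn run_priority_sink(
    mut high_rx: mpsc::Receiver<Observation>,
    mut low_rx: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
) -> Result<()> {
    // Track which inputs are still open. A closed receiver returns None
    // immediately on every poll, so its arm has to be disabled or the
    // loop spins; the sink exits once both inputs are closed and drained.
    let mut high_open = true;
    let mut low_open = true;
    while high_open || low_open {
        tokio::select! {
            // Poll arms top to bottom so High wins when both are ready.
            biased;
            recv = high_rx.recv(), if high_open => match recv {
                Some(obs) => output.send(obs).await
                    .map_err(|_| anyhow!("downstream dropped"))?,
                None => high_open = false,
            },
            recv = low_rx.recv(), if low_open => match recv {
                Some(obs) => output.send(obs).await
                    .map_err(|_| anyhow!("downstream dropped"))?,
                None => low_open = false,
            },
        }
    }
    Ok(())
}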
The biased; directive is what gives High priority deterministic preference; without it, ties (both channels ready) resolve randomly, which means the High channel only gets ~50% of the throughput. Documented in the operator's comment.
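The biased select decides ordering; the shed decision itself is a separate check against the high-water mark from the acceptance criteria. A minimal sketch of that check follows, with hypothetical names (Admission, admit) and a percentage threshold as assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority { High, Low }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Admission { Forward, Shed }

/// Hypothetical shed rule: once the shared channel is at or above the
/// high-water mark, Low items are shed; High items are always forwarded
/// and remain subject to ordinary backpressure on the channel.
pub fn admit(priority: Priority, used: usize, capacity: usize, high_water_pct: usize) -> Admission {
    // Integer form of used / capacity >= high_water_pct / 100.
    let over_mark = used * 100 >= capacity * high_water_pct;
    match (priority, over_mark) {
        (Priority::Low, true) => Admission::Shed,
        _ => Admission::Forward,
    }
}
```

Each Shed result is where the flow_policy_drops_total counter for the low priority would be incremented.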
How do I make the burst simulator deterministic?
Drive synthetic events with a fixed-seed RNG; sample channel occupancy at fixed intervals of simulated time; assert against the resulting time series rather than against transient peaks. The key insight: real-world burst behavior is not deterministic, but the simulator's value is the regression-detection property. The same inputs produce the same outputs, so a regression is visible as a change in the time series.
Use tokio::time::pause() and tokio::time::advance() to drive the simulation in fast-forward without real wall-clock waits. This only works if the simulator reads time through tokio::time (tokio::time::Instant, sleep, interval); std::time::Instant keeps following the real clock and is not affected by pause(). With that in place the simulator runs in milliseconds rather than minutes, fits in CI, and runs deterministically every time.
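For the fixed-seed part, any seeded generator works. The sketch below uses SplitMix64 only because it fits in a few lines of std; it is a stand-in, not a statement about which RNG crate the test suite uses, and synthetic_object_ids is a hypothetical helper.

```rust
/// SplitMix64, a small deterministic generator, used here as a stand-in
/// for whatever seeded RNG the test suite adopts.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical synthetic stream: `count` object ids in 0..catalog_size.
/// The same seed always yields the same sequence.
pub fn synthetic_object_ids(seed: u64, count: usize, catalog_size: u64) -> Vec<u64> {
    let mut rng = SplitMix64::new(seed);
    (0..count)
        .map(|_| rng.next_u64() % catalog_size.max(1))
        .collect()
}
```

Pinning the seed in the test makes a changed occupancy series attributable to a code change instead of to a different input mix.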
How do I size the High and Low sub-channels?
Total capacity should match the underlying channel's capacity — say, 4096 in total. Split: 75% High (3072), 25% Low (1024). The High side is sized to the steady-state High rate plus burst headroom; the Low side is sized small because it is the first to shed under load and additional capacity does not add value.
The split is documented in the BurstProfile for the edge and visible at startup in the orchestrator's structured log. Operational tuning revisits the split if dashboard data shows the High side hitting capacity under normal load (then it is undersized) or the Low side being persistently empty (then it is oversized and the capacity could be reallocated).
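The split arithmetic, as a small sketch; the function name is hypothetical and the percentage is the tunable the paragraph above describes.

```rust
/// Split a total channel capacity into (high, low) sub-channel capacities.
/// `high_pct` is clamped to 100; the remainder after integer division goes
/// to the low side so the two always sum to `total`.
pub fn split_capacity(total: usize, high_pct: usize) -> (usize, usize) {
    let high = total * high_pct.min(100) / 100;
    (high, total - high)
}
```

Computing the low side as the remainder keeps the invariant that the sub-channels together never exceed the underlying channel's capacity.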
How do I integrate the audit script with cargo test?
A #[test] function that builds the production graph and runs audit_or_fail on it. The test fails the CI build whenever a new edge or operator violates the audit rules.
#[test]
fn pipeline_passes_backpressure_audit() {
    let graph = build_production_pipeline_graph();
    let result = audit_or_fail(&graph);
    assert!(
        result.is_ok(),
        "backpressure audit failed:\n{}",
        result.unwrap_err()
    );
}
The build_production_pipeline_graph is the same function the binary calls for its actual topology — sharing the construction code between binary and test ensures the test exercises what production runs, not a stale fixture.
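The companion negative test from the quality requirements (a synthetic graph containing an unbounded channel must fail the audit) can be sketched against stand-in types. Edge, Finding, audit_edges, and the edge helper below are all illustrative; the real test would build an OperatorGraph and call audit.

```rust
/// Minimal stand-ins for the Module 2 graph types, for illustration only.
pub struct Edge {
    pub from_name: String,
    pub to_name: String,
    pub bounded: bool,
    pub has_flow_policy: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Finding {
    UnboundedChannel(String),
    NoFlowPolicy(String),
}

/// Same two checks as the lesson's audit, over the stand-in edge type.
pub fn audit_edges(edges: &[Edge]) -> Vec<Finding> {
    let mut findings = Vec::new();
    for edge in edges {
        let label = format!("{} -> {}", edge.from_name, edge.to_name);
        if !edge.bounded {
            findings.push(Finding::UnboundedChannel(label.clone()));
        }
        if !edge.has_flow_policy {
            findings.push(Finding::NoFlowPolicy(label));
        }
    }
    findings
}

/// Convenience constructor for synthetic test graphs.
pub fn edge(from: &str, to: &str, bounded: bool, has_flow_policy: bool) -> Edge {
    Edge {
        from_name: from.to_string(),
        to_name: to.to_string(),
        bounded,
        has_flow_policy,
    }
}
```

Asserting on the exact finding, not just on failure, keeps the test from passing for the wrong reason if a second rule starts flagging the same synthetic graph.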
Getting Started
Recommended order:
1. FlowPolicy enum and ConfigurableSink. Lesson 1's primitive; wire it through every existing edge in the topology with FlowPolicy::Backpressure as the explicit choice on most edges.
2. Audit script. Lesson 3's primitive; encode it as a cargo test function. It will pass at this stage; the value is preventing future regressions.
3. InstrumentedSender + per-channel occupancy gauge. Lesson 3's primitive; wrap every mpsc::Sender in the production graph. Verify the dashboard shows occupancy values at startup.
4. Priority classifier. Encode the classification rules; unit-test them against synthetic inputs.
5. PriorityShedSink with sub-channel split. The biased-select pattern; integration-test it with a synthetic load mix.
6. BurstSimulator integration test. Drive 10x rate for 5 simulated minutes; assert High-priority retention.
7. Operational runbook + structured log emit at startup. The "what each policy means" reference for ops.
8. CI integration: audit script in cargo test, burst simulator in nightly CI.
Aim for the 10x burst simulator passing with the priority classifier in place by day 7. The audit script and runbook are finishing work that pays back later.
What This Module Sets Up
In Module 5 you will make the windowed correlator's state crash-safe via checkpointing. The credit-based-flow primitive from this module's Lesson 2 is the mechanism that pauses the upstream during the checkpoint flush. The bounded-channel-everywhere invariant the audit enforces is what lets the checkpointed state size be bounded and predictable. The exactly-once-via-idempotency frame from M2 plus the retract-aware sink from M3 plus the priority shedding from M4 compose into a pipeline that is correct, hardenable under load, and crash-safe under restart — the full M5 deliverable.
In Module 6 you will surface the per-channel occupancy gradient and the per-priority drop counters as the operational dashboard's primary panels. The audit script becomes part of the deploy gate; the BurstSimulator becomes part of the SLO compliance test. The work in this module is the operational foundation for that observability stack.
The patterns the module installs — explicit per-edge FlowPolicy, priority-aware load shedding, audit-as-CI-test — generalize beyond SDA. They are the standard streaming-pipeline hardening techniques for any system that must survive bursts; the module's specifics are where the techniques meet a real workload.
Module 05 — Delivery Guarantees and Fault Tolerance
Track: Data Pipelines — Space Domain Awareness Fusion Position: Module 5 of 6 Source material: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery), Chapter 8 (Exactly-Once Semantics), Chapter 9 (Failure Handling and Reprocessing); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Fault Tolerance, Microbatching and Checkpointing); Database Internals — Alex Petrov, Chapter 5 (Checkpointing in Recovery); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Error Handling and Dead-Letter Queues, Late-Arriving Data) Quiz pass threshold: 70% on all four lessons to unlock the project
Mission Context
OPS ALERT — SDA-2026-0207 Classification: RESTART-SAFETY HARDENING Subject: Make the windowed correlator crash-safe and the alert path exactly-once-effective
Two months ago, a maintenance window required restarting the pipeline to apply a security patch. The orchestrator's graceful-drain logic worked correctly — every operator drained its incoming channel before exiting — but the alert subscriber had already received fourteen alerts that the new pipeline did not know about, and the new pipeline emitted six alerts that the subscriber had already acted on. Two false-positive collision-avoidance maneuvers were executed as a consequence. The postmortem identified two missing pieces: durable state on the producer side (so restart resumes from where the previous instance left off), and idempotent processing on the consumer side (so duplicate deliveries do not produce duplicate effects). This module installs both.
The pipeline at the start of this module handles steady-state load (M4), produces correct event-time results (M3), and is correctly orchestrated and supervised (M2). It has one structural blind spot: it loses data on a process restart. The windowed correlator's state is in process memory; the supervisor restarts a panicked operator with a fresh empty state; in-flight observations between operators are buffered in tokio channels that do not survive a process exit. Every restart loses some non-trivial amount of work, and the SDA-2026-0207 incident's failure mode is what happens when that lost work straddles a real-world action boundary like an alert subscriber.
Module 5 is the response. At-least-once delivery at the transport layer (Kafka producer with acks=all + retries; consumer with process-then-commit) gives the property "every observation reaches the consumer at least once, with duplicates as the operational cost." Idempotency at the application layer (sink-side dedup keyed on observation_id, idempotent SQL UPSERT, Kafka's idempotent producer) gives the property "duplicate deliveries produce identical effects on the world." The two together are effective exactly-once — the pipeline's net effect is exactly-once even though the underlying transport admits duplicates. Checkpointing captures the windowed correlator's state durably so restart resumes from a saved snapshot rather than rebuilding from scratch. Dead-letter queues route permanent errors to a separate sink with metadata so engineers can investigate and re-inject after fixes.
The mental model the module installs is the four-piece reliability discipline: (1) at-least-once at every transport boundary, (2) idempotency at every state-modifying boundary, (3) checkpointed state on every stateful operator, (4) DLQ for every permanent-error path with explicit re-processing tooling. Every streaming pipeline in production combines these four; the module's specifics are where the patterns meet SDA's actual workload.
Learning Outcomes
After completing this module, you will be able to:
- Distinguish at-most-once, at-least-once, and exactly-once delivery semantics rigorously, and configure the Kafka producer/consumer pair for at-least-once
- Compose at-least-once delivery with application-layer idempotency to produce effective exactly-once processing, and recognize where the guarantee holds versus where boundary owners must implement their own dedup
- Implement bounded sliding-window dedup sets keyed on natural or derived idempotency keys, with the production-safety double-bound (time AND count)
- Configure Kafka's idempotent producer (enable.idempotence=true) and reason about its partition-scoped guarantee versus transactional Kafka's cross-partition guarantee
- Implement checkpointing of stateful operators with the State+Offset recovery contract, atomic temp-file + rename writes, and the pause-snapshot-resume protocol via credit-withholding
- Choose between aligned (Flink-style barriers) and per-operator checkpoints based on the pipeline's idempotency machinery and operational tradeoffs
- Classify operator errors into transient/permanent/discardable and route each to the appropriate destination (retry/DLQ/drop with counter), with DLQ entries carrying schema-versioned metadata for re-processing tools
- Recognize the discard-bucket anti-pattern and apply the operational disciplines (alert on growth rate, periodic review, bounded retention) that prevent it
Lesson Summary
Lesson 1 — At-Least-Once Delivery
The three levels of delivery guarantee (at-most-once, at-least-once, exactly-once) and the operational tradeoffs each implies. Producer-side at-least-once via acks=all + retries + max.in.flight (with the ordering caveat). Consumer-side at-least-once via process-then-commit and enable.auto.commit=false. Three sources of duplicates (producer retries after partial success, consumer crashes between process and commit, rebalances during processing). The cost of duplicates is proportional to the duplicate rate × per-event downstream cost; for SDA's alert-triggering subscriber, this drives the discipline.
Key question: A consumer is configured with enable.auto.commit=true. The application crashes after auto-commit fired but before the application processed the messages whose offsets it committed. What happens, and what is the lesson's recommended fix?
Lesson 2 — Exactly-Once via Idempotency
Idempotency as the application-layer property that composes with at-least-once delivery into effective exactly-once. Natural keys (observation_id) versus derived keys (sorted content-addressable hash). Bounded dedup sets with the double-bound (time AND count) for production safety. Kafka's enable.idempotence=true as broker-side PID + sequence dedup, partition-scoped. Where the guarantee holds (within the pipeline, at idempotent SQL/Kafka/HTTP-with-key boundaries) versus where boundary owners must implement dedup themselves (alerts that trigger external actions).
Key question: The pipeline emits ConjunctionRisk events to a non-idempotent webhook that triggers an email to a human operator. Does the at-least-once + idempotency composition give effective exactly-once at the email side, and what is the framework for thinking about it?
Lesson 3 — Checkpointing
Checkpointing as the durable-state mechanism that lets a restarted operator resume from a saved State+Offset rather than rebuilding from scratch. The pause-snapshot-resume protocol via credit-withholding from M4 L2. Aligned (Flink barriers) versus per-operator checkpoints; SDA uses per-operator checkpoints with idempotency-driven recovery for global consistency. Storage tiers: fast local NVMe as the primary, remote object storage as the durable replica, and the hybrid of the two as the production default. Atomic temp-file + rename for all-or-nothing durability.
Key question: A teammate proposes storing only the operator's serialized state in the checkpoint, omitting the offset on the grounds that 'the consumer's committed offset is already durable.' Why does the lesson reject this and insist on the State+Offset pair?
Lesson 4 — Dead Letter Queues
The DLQ pattern. Three error categories (transient/retry, permanent/DLQ, discardable/drop). DLQ metadata as the debug-tool foundation: timestamp, operator, error_kind, error_message, retry_count, original_payload, schema_version. Poison pills as the canonical case the DLQ exists for. The DLQ as a re-processing source after underlying issues are fixed, absorbed by L2's idempotency machinery. Three operational disciplines that prevent the discard-bucket anti-pattern: alert on growth rate, periodic review cadence, bounded retention.
Key question: Operations sees a sudden 40x spike in dlq_entries_total{error_kind="Deserialization"} from a single operator. What does the DLQ's schema design support as a first-look diagnosis, and what is the typical resolution path?
Capstone Project — Exactly-Once Conjunction Alert Pipeline
Make the M4 hardened pipeline crash-safe and exactly-once-effective. The windowed correlator becomes a CheckpointingOperator with 30-second cadence, atomic writes, and local-first/remote-second/fresh-third recovery. The alert sink uses the L2 DedupSet keyed on alert_id. Kafka consumer reconfigures for process-then-commit. Permanent errors route to a DLQ with the L4 schema; an sda-reprocess CLI tool re-injects DLQ entries with filter and dry-run support. Three crash tests (mid-process, mid-checkpoint, mid-emit) assert post-restart correctness. Acceptance criteria, suggested architecture, deterministic crash-test patterns, and the full project brief are in project-exactly-once-alerts.md.
The orchestrator from M2, the windowed correlator from M3, and the priority-aware shedding from M4 are all unchanged in structure. The new operators wrap or extend the existing pieces; the operator graph grows by a few nodes (DLQ sink, alert subscriber boundary state).
File Index
module-05-delivery-guarantees-and-fault-tolerance/
├── README.md ← this file
├── lesson-01-at-least-once.md ← At-least-once delivery
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-exactly-once-idempotency.md ← Exactly-once via idempotency
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-checkpointing.md ← Checkpointing
├── lesson-03-quiz.toml ← Quiz (5 questions)
├── lesson-04-dead-letter-queues.md ← Dead letter queues
├── lesson-04-quiz.toml ← Quiz (5 questions)
└── project-exactly-once-alerts.md ← Capstone project brief
Prerequisites
- Modules 1 through 4 completed: the Observation envelope, the OperatorGraph with supervisor, the watermark-aware windowed correlator, and the per-edge FlowPolicy discipline are all assumed
- Foundation Track completed: async Rust, channels, runtime intuitions
- Familiarity with tokio::sync::mpsc, tokio::fs (for atomic-rename writes), serde (for state serialization), bincode (for compact binary checkpoint format), and the rdkafka Rust client's producer/consumer APIs
- Comfort reading Kafka's producer configuration documentation (acks, retries, enable.idempotence, max.in.flight.requests.per.connection) and consumer configuration (enable.auto.commit, commit modes)
What Comes Next
Module 6 (Observability and Lineage) makes the pipeline's correctness visible to operations. The new metrics from this module — checkpoint_age_seconds, dlq_entries_total, recovery_path_total — become panels on the resilience dashboard. The runbook discipline established here (per-error-kind playbooks, re-processing protocols) becomes part of the on-call rotation's standard procedure. The patterns this module installed — at-least-once + idempotent + checkpointed + DLQ'd — generalize beyond SDA to any streaming system that must survive restarts; M6 develops the observability stack that makes those patterns operationally legible.
The pipeline at the end of this module is correct under load, correct across restart, correct in event time, and correctly orchestrated. Module 6 turns "correct" into "correct AND visible," which is the difference between a system that works and a system that operations can trust.
Lesson 1 — At-Least-Once Delivery
Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance
Position: Lesson 1 of 4
Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery — acks, enable.idempotence, retries, in-flight requests, consumer commit ordering); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Fault tolerance in stream processing)
Context
Modules 1 through 4 produced a pipeline that handles steady-state load, propagates backpressure cleanly, computes correct event-time results, and survives most operational failures. It has one structural blind spot: it loses data on a process restart. The windowed correlator from Module 3 holds in-process state. The orchestrator's supervisor restarts a panicked operator with a fresh Task, which means a fresh empty state. In-flight observations between the source and the correlator are buffered in tokio channels that do not survive a process exit. When a deploy rolls a new pipeline binary, the previous binary's in-flight buffer is gone and the windows it had been accumulating are gone.
The mission framing for this module is the SDA-2026-0207 incident two months ago. A maintenance window required restarting the pipeline to apply a security patch. The orchestrator's graceful-drain logic worked correctly — every operator drained its incoming channel before exiting — but the alert subscriber had already received fourteen alerts that the new pipeline did not know about, and the new pipeline emitted six alerts that the subscriber had already acted on. Two false-positive collision-avoidance maneuvers were executed as a consequence. The postmortem identified two missing pieces: durable state on the producer side (so restart resumes from where the previous instance left off), and idempotent processing on the consumer side (so duplicate deliveries do not produce duplicate effects).
This module installs both pieces. Lesson 1 establishes the delivery-semantics vocabulary — the three levels of guarantee (at-most-once, at-least-once, exactly-once) — and the producer-and-consumer-side configuration choices that produce at-least-once. Lesson 2 develops idempotency as the application-layer property that composes with at-least-once delivery to produce effective exactly-once processing. Lesson 3 introduces checkpointing — the durable state mechanism that lets a restarted operator resume without losing window state. Lesson 4 covers dead-letter queues for events that cannot be processed regardless of how many times they are retried. The capstone wires all four into the M4 pipeline; by the end of M5, the alert pipeline is correct under load AND across restarts.
Core Concepts
The Three Levels
Every message-delivery system makes one of three guarantees about a given message.
At-most-once. The message is delivered zero or one times. Loss is possible (the message can be dropped); duplication is not. The simplest configuration: send and forget. UDP without retries falls in this category.
At-least-once. The message is delivered one or more times. Loss is not possible (every message reaches the consumer at least once); duplication is possible (the same message can be delivered multiple times under retry). The default for any system that retries on failure. Kafka's standard producer/consumer pair, configured with acks=all and retries, falls in this category.
Exactly-once. The message is delivered exactly one time. No loss, no duplication. The strongest guarantee, and the most expensive — true exactly-once delivery is impossible without coordination between producer and consumer (two-phase commit, transactional Kafka, or strong assumptions about the network). Most systems labeled "exactly-once" are actually "at-least-once delivery + idempotent processing = effective exactly-once at the application layer."
The choice is operational and determined by what the consumer's failure mode looks like under each. A throughput counter is fine with at-most-once — losing 0.1% of events does not change the per-minute aggregate. An audit log requires at-least-once — every observation must be recorded, even at the cost of duplicates that the audit reader can dedupe. A conjunction alert that triggers a real-world action requires effective exactly-once — neither a missed alert (collision risk) nor a phantom alert (unnecessary maneuver) is acceptable.
At-Least-Once on the Producer Side
The Kafka producer's reliability is configured by three settings working together.
acks. Controls when the producer considers a send "successful." acks=0 means "send and assume success" (at-most-once: a network drop is silently lost). acks=1 means "wait for the partition leader to acknowledge" (still loses if the leader dies before replicating). acks=all means "wait for the full in-sync replica set to acknowledge" (durable on the broker side; the message will not be lost barring catastrophic broker failures). For at-least-once, acks=all is required.
retries. How many times the producer retries a failed send. Failures here are transport-level: network errors, broker timeouts, leader-elections-in-progress. With retries=0 and acks=all you get at-most-once with strong durability when delivery succeeds; with retries > 0 you get at-least-once. Production setting is typically a high number (Integer.MAX_VALUE is common — the producer retries until either success or delivery.timeout.ms elapses). For at-least-once, retries must be enabled.
max.in.flight.requests.per.connection. How many sends the producer can have outstanding to a given broker at once. With idempotence disabled, values > 1 risk re-ordering on retry: send A is in flight, send B is in flight; A fails and is retried; B succeeds before A's retry; the broker sees B-then-A. For ordered at-least-once, set this to 1 (degenerate single-credit case from Module 4 L2). For unordered at-least-once, higher values give more throughput at the cost of order.
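The re-ordering hazard can be reproduced with a toy model. The sketch below is illustrative only: broker_order is a hypothetical helper, not an rdkafka API, and it assumes a simplified producer that sends in groups of max_in_flight and retries a failed send only after the rest of its group has landed.

```rust
/// Returns the order in which messages land in the broker log.
///
/// Simplifying assumptions (not real Kafka behavior in every detail):
/// - the producer sends in groups of `max_in_flight` messages;
/// - the message named `fails_once` fails its first attempt and is
///   retried after the other sends in its group have succeeded.
pub fn broker_order(messages: &[&str], max_in_flight: usize, fails_once: &str) -> Vec<String> {
    let mut log = Vec::new();
    // `.max(1)` guards against a zero group size, which `chunks` rejects.
    for group in messages.chunks(max_in_flight.max(1)) {
        let mut retry = Vec::new();
        for m in group {
            if *m == fails_once {
                // First attempt fails; queue the message for retry.
                retry.push(*m);
            } else {
                log.push(m.to_string());
            }
        }
        // Retries land after everything else in the group.
        log.extend(retry.into_iter().map(String::from));
    }
    log
}
```

With max_in_flight set to 2 and A failing once, the log for the input A, B, C comes out as B, A, C. With max_in_flight set to 1 the group is a single message, the retry has nothing to fall behind, and the order is preserved.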
The combination acks=all + retries > 0 produces at-least-once. Duplicates are possible because the producer might retry a send that actually succeeded but whose acknowledgment was lost in transit; the broker sees the same message twice. The application must tolerate that: the guarantee this lesson establishes is at-least-once, not exactly-once.
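The lost-acknowledgment case can be made concrete with a small in-memory model. This is a sketch under stated assumptions: Broker, append, and send_with_retries are hypothetical names, not part of rdkafka or the SDA codebase, and the ack loss is injected by a counter rather than by a real network.

```rust
/// Toy broker: an append-only log of message keys (hypothetical type).
#[derive(Default)]
pub struct Broker {
    pub log: Vec<String>,
}

impl Broker {
    /// Appends the message, then reports whether the ack reached the
    /// producer. `ack_lost = true` models a write that succeeded on the
    /// broker but whose acknowledgment was dropped in transit.
    pub fn append(&mut self, key: &str, ack_lost: bool) -> Result<usize, &'static str> {
        self.log.push(key.to_string());
        if ack_lost {
            Err("ack lost in transit")
        } else {
            Ok(self.log.len() - 1)
        }
    }
}

/// Sends `key`, retrying up to `retries` additional times.
/// The first `lost_acks` attempts have their acknowledgment dropped.
pub fn send_with_retries(
    broker: &mut Broker,
    key: &str,
    retries: u32,
    lost_acks: u32,
) -> Result<usize, &'static str> {
    let mut last_err = "no attempt made";
    for attempt in 0..=retries {
        match broker.append(key, attempt < lost_acks) {
            Ok(offset) => return Ok(offset),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

With one lost ack and retries enabled, the send succeeds from the producer's point of view and the log holds two copies of the message at different offsets. With retries set to 0 the producer reports failure even though the broker holds one copy, which is why neither setting alone gives exactly-once.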
At-Least-Once on the Consumer Side
The consumer's reliability is about the ordering of processing and committing. The consumer reads a batch of messages, processes them, then commits the offset back to the broker. If the consumer crashes after processing some or all of the batch but before committing, the next consumer instance starts at the previously-committed offset and re-reads (and re-processes) the uncommitted batch: duplicate processing, but no loss.
The critical ordering is process-then-commit. The consumer must finish processing a message before committing its offset. The wrong ordering (commit-then-process) loses messages: if the consumer commits and then crashes before processing, the next instance starts at the committed offset and never re-reads the messages that the first instance had committed but not yet processed. The messages are silently lost.
The Kafka consumer's enable.auto.commit=true configuration is the canonical version of the wrong ordering. Auto-commit fires on a timer, regardless of whether the application has actually processed the messages whose offsets it commits. For any reliability discipline beyond the loosest, enable.auto.commit=false plus an explicit synchronous commit after processing is required (commitSync in the Java client; commit_message with CommitMode::Sync in rdkafka).
The duplicate-on-restart property is acceptable because the application is idempotent on observation_id (the topic of Lesson 2). Messages re-read after a restart go through the dedup logic and are silently dropped. The combination of producer-side at-least-once + consumer-side process-then-commit + sink-side idempotency is the effective-exactly-once shape this module is building toward.
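The difference between the two orderings shows up clearly in a crash-and-restart simulation. The following is a minimal sketch, assuming a log addressed by integer offsets and a single committed-offset cell; CommitOrder, run_instance, and crash_and_restart are hypothetical names introduced for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CommitOrder {
    ProcessThenCommit,
    CommitThenProcess,
}

/// Runs one consumer instance from `*committed` to the end of the log.
/// If `crash_at` is `Some(i)`, the instance dies while handling offset
/// `i`, between the first and second step of the chosen ordering.
/// Returns the offsets this instance actually processed.
pub fn run_instance(
    log_len: usize,
    committed: &mut usize,
    order: CommitOrder,
    crash_at: Option<usize>,
) -> Vec<usize> {
    let mut processed = Vec::new();
    let mut offset = *committed;
    while offset < log_len {
        let crash_here = crash_at == Some(offset);
        match order {
            CommitOrder::ProcessThenCommit => {
                processed.push(offset); // step 1: process
                if crash_here {
                    return processed; // died before the commit
                }
                *committed = offset + 1; // step 2: commit
            }
            CommitOrder::CommitThenProcess => {
                *committed = offset + 1; // step 1: commit
                if crash_here {
                    return processed; // died before processing
                }
                processed.push(offset); // step 2: process
            }
        }
        offset += 1;
    }
    processed
}

/// One instance crashes at `crash_at`; a second instance resumes from
/// the committed offset. Returns every offset processed by either.
pub fn crash_and_restart(log_len: usize, order: CommitOrder, crash_at: usize) -> Vec<usize> {
    let mut committed = 0;
    let mut all = run_instance(log_len, &mut committed, order, Some(crash_at));
    all.extend(run_instance(log_len, &mut committed, order, None));
    all
}
```

For a five-message log with a crash at offset 2, process-then-commit processes offset 2 twice and loses nothing, while commit-then-process never processes offset 2 at all.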
Where Duplicates Come From
Three sources of duplication under at-least-once.
Producer-side retries after partial success. The producer sends; the broker writes the message to its log; the broker's acknowledgment is lost in transit; the producer's retry timer fires; the producer sends again; the broker writes the message again. The same logical message is now in the log twice with different offsets. Consumers see both copies. This is the duplicate that Kafka's idempotent producer (enable.idempotence=true, Lesson 2) is designed to prevent, using a producer ID and per-partition sequence numbers that the broker checks to discard the retried copy.
Consumer-side crashes between process and commit. The consumer processes a batch; the consumer crashes before committing; the next consumer instance reads the same batch and processes it again. The application's effect on the world has been applied twice. Idempotency on the application's effect (e.g., UPSERT-by-natural-key in the sink) is what absorbs this.
Rebalances during processing. A consumer group rebalance reassigns partitions among consumer instances. If a partition is reassigned mid-batch (the original consumer was processing but had not committed when the rebalance fired), the new consumer reads from the previously-committed offset and re-processes. Same shape as the crash case.
The lesson's framing: duplicates are not a bug, they are a configurable cost. Rare under good operational conditions, frequent during deploys or partition migrations, always possible. The application is responsible for handling them.
Operational Cost of Duplicates
The cost is proportional to the duplicate rate × the per-event cost of duplicate processing downstream.
For a sink that does an UPSERT keyed on observation_id with a strict-greater check (the M3 L4 retract-aware shape): the duplicate is absorbed by the comparison, the cost is one wasted SQL round-trip per duplicate. At a typical duplicate rate of 0.1% during a healthy steady-state, this is invisible at SDA's scale.
For a sink that increments a counter without an idempotency check: every duplicate is a miscount. The counter ends up high by the number of duplicate deliveries (duplicate rate × event count). The audit dashboard reports inflated numbers. This is the canonical "we have at-least-once + non-idempotent sink" failure mode and the reason Module 2 L3 made idempotency a first-class topic.
For a sink that triggers an external action (an alert subscriber that fires a satellite-avoidance maneuver): every duplicate is a real-world wrong action. The cost is real fuel burn or a real hardware adjustment. This is the case the SDA-2026-0207 incident reflected, and the case where exactly-once-effective via idempotency on alert ID is necessary, not optional.
The cost shape determines how aggressively the application tightens the at-least-once bound (idempotent producer, smaller in-flight, faster commit cadence) and how robust the sink's idempotency must be.
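The first two cost shapes can be compared side by side by feeding the same duplicated stream to both kinds of sink. This is an illustrative sketch, not the SDA sinks: CounterSink, UpsertSink, and deliveries are hypothetical, the in-memory map stands in for a SQL table, and the 1% duplicate rate in the usage note is chosen only to make the arithmetic visible.

```rust
use std::collections::HashMap;

/// Non-idempotent sink: every delivery increments the counter.
#[derive(Default)]
pub struct CounterSink {
    pub count: u64,
}

impl CounterSink {
    pub fn write(&mut self, _observation_id: u64) {
        self.count += 1;
    }
}

/// Idempotent sink: an upsert keyed on observation_id.
#[derive(Default)]
pub struct UpsertSink {
    pub rows: HashMap<u64, f64>,
    /// Writes that overwrote an existing row: the wasted round-trips.
    pub wasted_writes: u64,
}

impl UpsertSink {
    pub fn write(&mut self, observation_id: u64, value: f64) {
        if self.rows.insert(observation_id, value).is_some() {
            self.wasted_writes += 1;
        }
    }
}

/// Delivers ids `0..n` once each and re-delivers every id divisible
/// by `dup_every`. Assumes `dup_every > 0`.
pub fn deliveries(n: u64, dup_every: u64) -> Vec<u64> {
    let mut out = Vec::new();
    for id in 0..n {
        out.push(id);
        if id % dup_every == 0 {
            out.push(id); // the duplicate delivery
        }
    }
    out
}
```

With deliveries(1000, 100) the stream carries 1,010 deliveries for 1,000 distinct observations. The counter sink reports 1,010, a silent 1% overcount; the upsert sink holds 1,000 rows and records 10 wasted writes.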
Code Examples
A Kafka Producer Configured for At-Least-Once
The rdkafka crate's producer configuration. The settings encode the lesson's at-least-once shape exactly.
use anyhow::Result;
use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::time::Duration;
pub fn build_at_least_once_producer(brokers: &str) -> Result<FutureProducer> {
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        // acks=all: wait for the full in-sync replica set to ack.
        // The message is durable on the broker side before send returns.
        .set("acks", "all")
        // retries: keep retrying transient transport failures.
        // i32::MAX is the conventional "retry until delivery.timeout.ms".
        .set("retries", "2147483647")
        // delivery.timeout.ms: how long the producer keeps retrying
        // before giving up. 2 minutes is reasonable for the SDA pipeline;
        // longer encourages the producer to ride out longer broker
        // transient failures.
        .set("delivery.timeout.ms", "120000")
        // enable.idempotence is OFF deliberately for this lesson — we
        // want raw at-least-once. Lesson 2 turns it on for the
        // exactly-once-effective producer.
        .set("enable.idempotence", "false")
        // max.in.flight.requests: 1 forces the strongest ordering
        // guarantee at the cost of throughput. Production might use
        // 5 (the maximum allowed in idempotent mode) when ordering
        // can be reconstructed via observation_id at the consumer.
        .set("max.in.flight.requests.per.connection", "1")
        .create()?;
    Ok(producer)
}
pub async fn send_observation(
    producer: &FutureProducer,
    topic: &str,
    obs: &Observation,
) -> Result<()> {
    let payload = serde_json::to_vec(obs)?;
    let key = obs.observation_id.to_string();
    let record = FutureRecord::to(topic)
        .key(&key)
        .payload(&payload);
    // The future resolves when the full in-sync replica set has acked.
    // On any broker-acknowledged-but-network-lost case, the producer
    // retries automatically per the configuration above; the resolution
    // happens on the eventual successful retry.
    // The Duration is the queue timeout: how long send waits for room
    // in the producer's local queue. It is separate from
    // delivery.timeout.ms, which bounds the retries themselves.
    producer.send(record, Duration::from_secs(30)).await
        .map_err(|(e, _)| anyhow::anyhow!("send failed after retries: {e}"))?;
    Ok(())
}
The enable.idempotence=false is deliberate here — we are demonstrating raw at-least-once. The producer's retry-on-failure is what gives the at-least guarantee; duplicates are the cost. Lesson 2 turns idempotence on and explains the producer-ID + sequence-number mechanism that the broker uses to dedup at the producer side. This lesson's pipeline accepts the producer-side duplicates and absorbs them at the consumer-side sink.
A Process-Then-Commit Consumer
The Kafka consumer pattern that gives at-least-once on the consumer side. enable.auto.commit=false, explicit commit_message after each message is processed.
use rdkafka::consumer::{CommitMode, Consumer, StreamConsumer};
use rdkafka::message::{BorrowedMessage, Message};
pub fn build_at_least_once_consumer(brokers: &str, group: &str) -> Result<StreamConsumer> {
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("group.id", group)
        // The critical setting: commits must be explicit, not auto.
        // Auto-commit fires on a timer regardless of processing
        // progress, which is the canonical 'commit-then-process'
        // bug shape.
        .set("enable.auto.commit", "false")
        .set("auto.offset.reset", "earliest")
        .create()?;
    Ok(consumer)
}
pub async fn run_consumer_loop(
    consumer: &StreamConsumer,
    sink: &impl ObservationSink,
) -> Result<()> {
    use tokio_stream::StreamExt;
    let mut stream = consumer.stream();
    while let Some(result) = stream.next().await {
        let message: BorrowedMessage = result?;
        let payload = message.payload()
            .ok_or_else(|| anyhow::anyhow!("empty payload"))?;
        let obs: Observation = serde_json::from_slice(payload)?;
        // PROCESS first: feed the sink. The sink's idempotency
        // (Lesson 2) absorbs duplicates from rare retries.
        sink.write(obs).await?;
        // COMMIT only after process succeeds. A crash between process
        // and commit causes the next consumer instance to re-read
        // and re-process; the sink dedups.
        consumer.commit_message(&message, CommitMode::Sync)?;
    }
    Ok(())
}
The commit_message(..., CommitMode::Sync) blocks until the broker confirms the commit, so the next message is not read until the previous commit is durable. The async variant (CommitMode::Async) commits in the background, which is faster but introduces a window where a crash between the async commit's queue and its broker confirmation can lose the commit; for SDA's reliability budget we use sync commits. The duplicate window is exactly the time between sink.write returning and commit_message returning, which is roughly one commit round-trip to the broker.
A Crash-Injection Test Harness
The test harness that verifies the at-least-once guarantee under deliberate crash conditions. The same harness is used in Lesson 2 to verify exactly-once-effective when idempotency is added.
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
/// A test-only sink that counts writes and panics on the Nth call.
pub struct CrashingSink {
    writes: Arc<AtomicU32>,
    panic_at: u32,
}
impl CrashingSink {
    pub fn new(panic_at: u32) -> Self {
        Self {
            writes: Arc::new(AtomicU32::new(0)),
            panic_at,
        }
    }
    pub fn writes(&self) -> u32 { self.writes.load(Ordering::SeqCst) }
}
#[async_trait::async_trait]
impl ObservationSink for CrashingSink {
    async fn write(&self, _obs: Observation) -> Result<()> {
        // Writes are numbered from 1, so panic_at = 0 never fires.
        let n = self.writes.fetch_add(1, Ordering::SeqCst) + 1;
        if n == self.panic_at {
            // Simulate process crash mid-write. In a real test this
            // would be a tokio::process restart or similar; for
            // illustration, a panic captured by the orchestrator.
            panic!("simulated crash at write #{n}");
        }
        Ok(())
    }
}
#[tokio::test]
async fn crash_between_process_and_commit_replays_messages() {
    // Setup: Kafka producer feeds 10 observations to topic. Consumer
    // processes them with the CrashingSink that panics on write #5.
    let producer = build_at_least_once_producer("localhost:9092").unwrap();
    for i in 0..10 {
        send_observation(&producer, "test-topic", &test_obs(i)).await.unwrap();
    }
    // First consumer instance: panics at write 5. Writes 1-4 succeeded
    // and were committed (commit happens after each process); write 5
    // panicked before commit.
    let sink_a = CrashingSink::new(5);
    let _ = run_consumer_with_sink("test-group", &sink_a).await;
    assert_eq!(sink_a.writes(), 5); // 4 succeeded + the panicked one
    // Second consumer instance: starts from offset 4 (last committed).
    // Re-reads message 5 and processes it, plus 6-10. Total writes
    // for the system: 5 (from instance A) + 6 (from instance B) = 11.
    // The one duplicate write is the at-least-once cost.
    let sink_b = CrashingSink::new(0); // does not crash this time
    let _ = run_consumer_with_sink("test-group", &sink_b).await;
    assert_eq!(sink_b.writes(), 6); // 5 through 10
}
The test asserts the at-least-once contract: every observation reaches a sink at least once, and message 5 reaches a sink exactly twice. With idempotent processing in Lesson 2, the duplicate's effect on the world is unchanged from a single processing — but the raw write count at the sink is still 11. The lesson's framing: at-least-once delivery is the transport guarantee; idempotency at the application layer is what turns it into exactly-once-effective.
Key Takeaways
- The three levels of delivery guarantee are at-most-once, at-least-once, and exactly-once. True exactly-once delivery requires coordination beyond what most systems implement; pragmatic exactly-once is achieved by composing at-least-once delivery with idempotent processing at the application layer.
- At-least-once on the producer side requires acks=all (wait for the full in-sync replica set), retries > 0 (retry transient failures until delivery.timeout.ms elapses), and a deliberate choice on max.in.flight.requests.per.connection (1 for strict ordering; higher for throughput at the cost of order under retry).
- At-least-once on the consumer side requires process-then-commit ordering with enable.auto.commit=false. Commit-then-process loses messages on a crash between commit and process. Auto-commit's timer-based behavior is the canonical version of the wrong ordering.
- Three sources of duplicates under at-least-once: producer retries after partial success, consumer crashes between process and commit, rebalances during processing. All three are absorbed by sink-side idempotency (Lesson 2's topic).
- The cost of duplicates is proportional to duplicate rate × per-event cost of duplicate processing downstream. For UPSERT-by-natural-key sinks the cost is one wasted SQL round-trip; for counter-style sinks it is silent miscounting; for sinks that trigger external actions it is real-world wrong actions. The cost determines how tightly the at-least-once bound is held.
Lesson 2 — Exactly-Once via Idempotency
Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance
Position: Lesson 2 of 4
Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 8 (Exactly-Once Semantics — Idempotent Producer, Transactions); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Idempotent Operations and Atomicity)
Context
Lesson 1 established at-least-once delivery — every message reaches the consumer at least once, with duplicates as the operational cost. This lesson is the second half of the composition. The application makes its operations idempotent: processing the same message twice produces the same effect on the world as processing it once. The pair — at-least-once delivery + idempotent processing — gives effective exactly-once semantics. The pipeline's net effect is exactly-once even though the underlying transport admits duplicates. This is the standard production approach to exactly-once; true transport-level exactly-once requires coordination (transactional Kafka, two-phase commit) that costs more than the application-layer pattern in throughput, complexity, and failure modes.
The pattern is not new to this module. Module 2 L3 introduced idempotency keys carried on the envelope. Module 3 L4 added retract-aware sinks with strict-greater UPSERT semantics. This lesson develops the topic in depth: what makes an operation idempotent, where the natural keys come from, how to bound the dedup state, what Kafka's idempotent producer does at the broker level, and where the exactly-once guarantee holds versus where it doesn't. The capstone in this module composes all of it — at-least-once delivery from L1, idempotent processing from this lesson, checkpointing from L3, dead-letter queues from L4 — into a pipeline that survives the SDA-2026-0207 incident's failure mode without dropping or duplicating alerts.
Core Concepts
Idempotency, Defined
A function f is idempotent if f(f(x)) = f(x) — applying it twice with the same input produces the same result as applying it once. The classic example is setting a field to a specific value: setting x = 5 is idempotent; setting x += 1 is not. The streaming-system version is about operator effects on durable state: an operator that writes "the value of window 17 is result" is idempotent on the (window_id, result) pair; an operator that writes "increment the count of window 17" is not.
Idempotency is a property of the operation, not of the framework or the pipeline. The pipeline can be at-least-once at the transport layer; the application layer is what determines whether duplicates produce identical effects or accumulate. Each operator's effect on the world has its own idempotency story, and the system's overall behavior depends on every operator's individual choice.
The discipline this lesson installs: every operator that writes to durable state — a database, a topic, an external service — must declare in its design which operations are idempotent and on what key. The operator graph from M2 carries forward an explicit idempotency_key_field per stage, documented in the operator's specification. The capstone uses this metadata to assert at startup that every sink is configured idempotently against its expected duplicate-source.
Natural Keys and Derived Keys
The natural key for SDA observations is the envelope's observation_id: a UUID generated at the source, carried unchanged through every stage, and present on every observation. All operator-level dedup logic uses this key. The orchestrator from M2 enforces that operators preserve observation_id across transformations (a normalize stage that regenerates the UUID is a bug, and the supervisor's audit checks for it).
For derived events — the ConjunctionRisk that the M3 correlator emits from two observations — there is no natural ID. The standard derivation is a deterministic hash of the contributing inputs:
derived_id = hash(left_observation_id || right_observation_id || window_id)
Sorting the input IDs before hashing makes the result symmetric (same two observations produce the same derived_id regardless of order). The hash is content-addressable and reproducible: any two operator instances correlating the same pair of observations produce the same derived_id, which is what makes downstream dedup work.
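A minimal sketch of the sorted derivation follows. It is illustrative only: observation IDs are passed as u128 (the integer form of a UUID) to avoid external crates, and FNV-1a is a stand-in hash chosen because its definition is fixed and short. A production derived_id would more plausibly use a cryptographic hash or a name-based UUID; the function names here are assumptions, not the correlator's actual API.

```rust
/// FNV-1a, 64-bit. Stand-in hash with a fixed, published definition, so
/// the output is stable across processes and compiler versions.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Derive a deterministic ID for an event built from two observations.
pub fn derived_id(obs_a: u128, obs_b: u128, window_id: u64) -> u64 {
    // Sort the pair so that (a, b) and (b, a) hash identically.
    let (lo, hi) = if obs_a <= obs_b { (obs_a, obs_b) } else { (obs_b, obs_a) };

    // Fixed-width big-endian encoding keeps the concatenation unambiguous.
    let mut buf = Vec::with_capacity(16 + 16 + 8);
    buf.extend_from_slice(&lo.to_be_bytes());
    buf.extend_from_slice(&hi.to_be_bytes());
    buf.extend_from_slice(&window_id.to_be_bytes());
    fnv1a64(&buf)
}
```

The sketch avoids std's DefaultHasher on purpose: its algorithm is unspecified and may change between Rust releases, which would break reproducibility of IDs that are stored durably.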
For events derived from a sliding window over many inputs (analytics aggregates), the natural derived key is (window_id, sequence) — Module 3 L4's retract sequence numbering. The window is the deterministic identifier; the sequence number distinguishes the original emit from corrected emits.
Bounded Dedup State
The sink-side dedup is implemented by a seen-set: a record of recently-seen IDs that the sink consults before applying each operation. A new ID is processed; a previously-seen ID is silently dropped. The set must be bounded — an unbounded seen-set is the silent-OOM pattern Module 4's audit catches.
The bound is by time or by count, ideally by both. Time-based: keep IDs seen in the last N seconds; evict older entries on each insert. Count-based: keep at most M IDs; evict the oldest when at capacity. The double-bound is the production-safety pattern: time alone fails when a burst causes more entries to land in the window than memory permits; count alone fails when a slow stream has its old entries evicted before the duplicates that would be deduped against them.
The window size is operationally determined: it must be larger than the maximum re-delivery window (the longest gap between an original send and a duplicate retry). For Kafka with default settings, this is typically minutes; for SDA's pipeline with 30-second consumer commit cadence and bounded checkpoint duration, 5 minutes is comfortable. Set the window too narrow and duplicates leak through; set it too wide and memory grows. Module 4's BurstProfile pattern applies here — document the chosen window with the rationale.
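The count bound can be sanity-checked against the time bound with simple arithmetic. The sketch below is a back-of-the-envelope helper, not part of the pipeline; the function names are hypothetical, and the 16-bytes-per-ID figure counts only the raw UUID, not the set's bookkeeping.

```rust
use std::time::Duration;

/// Number of IDs that arrive within `window` at a sustained
/// `peak_ids_per_sec`. The count bound must be at least this large or
/// the count limit will evict entries that are still inside the window.
pub fn required_capacity(peak_ids_per_sec: u64, window: Duration) -> u64 {
    peak_ids_per_sec.saturating_mul(window.as_secs())
}

/// Lower bound on memory for the raw IDs alone (16 bytes per UUID).
/// Real usage is higher: the set stores ordering and timestamps too.
pub fn raw_id_bytes(capacity: u64) -> u64 {
    capacity.saturating_mul(16)
}
```

As an example under assumed numbers: a combined arrival rate of 5,150 IDs/sec over a 5-minute window gives required_capacity(5150, Duration::from_secs(300)) = 1,545,000 entries, or roughly 25 MB of raw IDs before overhead. Writing this calculation next to the chosen window is one way to record the rationale.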
Kafka's Idempotent Producer
Kafka offers a producer-side idempotency mechanism that prevents duplicates from producer retries. With enable.idempotence=true, the producer attaches a producer ID (PID) and a sequence number to every message. The broker tracks the highest sequence number seen per PID per partition; on a retry that arrives with a sequence number it has already accepted, the broker drops the duplicate silently. The producer continues to retry on transient errors, but the broker dedups before persisting.
The mechanism is single-partition, single-producer scoped — the dedup is per (PID, partition). Across partitions, the producer's idempotence does not provide a cross-partition consistency guarantee; that requires Kafka's transactional producer (a separate mechanism with transactional.id set). For SDA's pipeline, partition-scoped idempotence is sufficient because every observation is keyed on observation_id and routed to a partition by hash(observation_id), so duplicate retries land on the same partition where they get dedupped.
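The broker-side check can be modeled in a few lines. This is a deliberately simplified toy, not Kafka's implementation: real brokers also track producer epochs and a window of recent batches. ToyBroker and AppendOutcome are hypothetical names; the only point is that dedup state is keyed by (producer id, partition).

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum AppendOutcome {
    Appended,
    /// Retry of a sequence number that was already accepted: not re-appended.
    Duplicate,
    /// Gap in the sequence: rejected rather than appended out of order.
    OutOfOrder,
}

/// Toy model of broker dedup: last accepted sequence per (PID, partition).
#[derive(Default)]
pub struct ToyBroker {
    last_seq: HashMap<(u64, u32), u64>,
    log: Vec<(u32, Vec<u8>)>,
}

impl ToyBroker {
    pub fn append(&mut self, pid: u64, partition: u32, seq: u64, payload: &[u8]) -> AppendOutcome {
        let key = (pid, partition);
        // The next sequence this (PID, partition) pair is allowed to write.
        let expected = match self.last_seq.get(&key) {
            Some(&last) => last + 1,
            None => 0,
        };
        if seq < expected {
            return AppendOutcome::Duplicate;
        }
        if seq > expected {
            return AppendOutcome::OutOfOrder;
        }
        self.last_seq.insert(key, seq);
        self.log.push((partition, payload.to_vec()));
        AppendOutcome::Appended
    }

    pub fn log_len(&self) -> usize {
        self.log.len()
    }
}
```

Because the state is per (PID, partition), the same payload sent to a different partition is appended again. That is the scoping limit described above, and it is why routing by hash(observation_id) matters.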
The throughput impact is small (<5%) and the configuration is straightforward. Production pipelines should set enable.idempotence=true by default; Lesson 1's at-least-once example deliberately turned it off to demonstrate the bare semantics. Enabling it complements the application-layer dedup: broker-side dedup discards producer retries before they are appended to the partition log; application-layer dedup catches duplicates from other sources (consumer crashes, rebalances). The two layers are belt-and-suspenders, and both are worth having.
Where the Guarantee Holds
Effective exactly-once via at-least-once + idempotency is best examined at three boundaries: it holds at the first two and, without extra work, does not hold at the third.
Within the pipeline. Every operator that produces output to a downstream operator is implicitly part of the dedup chain because every operator preserves observation_id. The sink-side dedup at the end of the pipeline catches every duplicate that traveled from any source. This works as long as the chain is intact — no operator silently regenerates IDs, no operator emits a fresh ID for a copy of an existing observation.
At external sink boundaries. SQL writes use INSERT ... ON CONFLICT (observation_id) DO NOTHING (or DO UPDATE per Module 3's retract-aware shape). Kafka producers use enable.idempotence=true. HTTP requests carry an Idempotency-Key header (the standard HTTP convention) that the downstream service uses to dedup. Every external write has its own idempotency story, and the operator's responsibility is to know what it is and configure it.
Where the guarantee does NOT hold. Operations whose effect on the world is inherently non-idempotent — sending an email, charging a credit card, firing a satellite avoidance maneuver. For these, idempotency must be added at the boundary by storing recently-seen IDs and ignoring duplicates at the consumer side of the boundary, not just within the pipeline. The conjunction-alert subscriber from M5 L1's example is exactly this case; the capstone wires up the subscriber's seen-set as an integrated piece of the alert delivery path.
The lesson's framing is precise: the pipeline can guarantee effective exactly-once delivery TO a boundary; whether the action AT the boundary is itself exactly-once depends on the boundary's idempotency. Boundary owners must implement their own dedup; the pipeline's responsibility is to deliver the keys correctly and document the contract.
Code Examples
A Sliding-Window Dedup Set with Double-Bound Eviction
The sink-side primitive that absorbs duplicates from at-least-once delivery. Bounded by both time (seen IDs older than window are evicted) and count (no more than capacity IDs held at once).
use std::collections::{BTreeSet, VecDeque};
use std::time::{Duration, SystemTime};
use uuid::Uuid;

/// Bounded seen-set keyed on observation_id. Maintains:
/// - seen: a BTreeSet for O(log N) lookup of duplicate-or-not
/// - order: a VecDeque<(Uuid, SystemTime)> for FIFO eviction
///
/// Invariants:
/// - len(seen) == len(order)
/// - order is in insertion order (time-ordered as long as callers pass
///   non-decreasing `now` values)
/// - on every call, expired entries are evicted from order's front
/// - len(seen) <= capacity after every call (for capacity >= 1)
pub struct DedupSet {
    seen: BTreeSet<Uuid>,
    order: VecDeque<(Uuid, SystemTime)>,
    window: Duration,
    capacity: usize,
}

impl DedupSet {
    pub fn new(window: Duration, capacity: usize) -> Self {
        Self {
            seen: BTreeSet::new(),
            order: VecDeque::with_capacity(capacity),
            window,
            capacity,
        }
    }

    /// Record an observation; return true if the ID was novel (caller
    /// should process it), false if the ID is a duplicate (caller
    /// should drop).
    pub fn record(&mut self, id: Uuid, now: SystemTime) -> bool {
        // 1. Evict by time: drop entries older than the window.
        let cutoff = now.checked_sub(self.window).unwrap_or(SystemTime::UNIX_EPOCH);
        while let Some(&(front_id, front_ts)) = self.order.front() {
            if front_ts >= cutoff {
                break;
            }
            self.order.pop_front();
            self.seen.remove(&front_id);
        }

        // 2. Duplicate check, before any capacity eviction, so that a
        //    duplicate can never push out a live entry.
        if self.seen.contains(&id) {
            return false;
        }

        // 3. Make room by count, then insert.
        while self.order.len() >= self.capacity {
            match self.order.pop_front() {
                Some((old_id, _)) => {
                    self.seen.remove(&old_id);
                }
                None => break,
            }
        }
        self.seen.insert(id);
        self.order.push_back((id, now));
        true
    }

    pub fn len(&self) -> usize { self.seen.len() }
    pub fn is_empty(&self) -> bool { self.seen.is_empty() }
}
The record method is the only public mutation; the boolean return is the dispatch signal for the sink (true: write, false: skip). The order of operations matters: time eviction first, then the duplicate check, then capacity eviction, then the insert. Time eviction runs first so the duplicate check only consults entries that are still inside the window. Capacity eviction runs after the duplicate check because evicting by count first would let a duplicate push out a live entry, in the worst case its own original, and then be accepted as novel. Evicting by count immediately before the insert keeps the seen set's size bounded by capacity after every call, regardless of how the inserts arrive. The pattern is the ancestor of the Module 3 L4 retract-aware sink: same shape, same eviction discipline, different check direction (this one returns true once and false thereafter; that one writes once and then overwrites).
A SQL Sink with Idempotent UPSERT
The boundary-side idempotency. The sink writes observations to a Postgres table; the UPSERT with DO NOTHING is the idempotent operation.
use anyhow::Result;
use sqlx::{Pool, Postgres};
const UPSERT_OBSERVATION: &str = r#"
INSERT INTO observations (observation_id, source_kind, sensor_timestamp, payload)
VALUES ($1, $2, $3, $4)
ON CONFLICT (observation_id) DO NOTHING
"#;
pub struct PostgresSink {
pool: Pool<Postgres>,
}
impl PostgresSink {
pub async fn write(&self, obs: &Observation) -> Result<()> {
// The sql query is idempotent on observation_id (the primary key
// with the ON CONFLICT clause). A duplicate row at the same
// observation_id is silently ignored — the existing row is
// preserved unchanged. The sink never produces a duplicate
// effect on the world even under heavy at-least-once duplication.
let payload = serde_json::to_value(obs)?;
sqlx::query(UPSERT_OBSERVATION)
.bind(obs.observation_id)
.bind(format!("{:?}", obs.source_kind))
.bind(obs.sensor_timestamp)
.bind(payload)
.execute(&self.pool)
.await?;
Ok(())
}
}
ON CONFLICT DO NOTHING is PostgreSQL's idempotent insert (the clause is a PostgreSQL extension, not standard SQL). A duplicate row at the same observation_id is silently skipped; the sink's write returns Ok regardless of whether the row was new or pre-existing. The duplicate cost is one network round-trip plus a unique-index probe that finds the existing row, which is cheap and well within the at-least-once duplicate budget. Alternative shapes: DO UPDATE SET ... WHERE EXCLUDED.sequence > rows.sequence for the M3 L4 retract-aware case; DO UPDATE SET ... WHERE EXCLUDED.value > rows.value for greatest-value-wins on a different field (which is last-write-wins when that field is a timestamp or version). The pattern at every external sink: choose the operation, document the idempotency, configure the sink correctly.
A Non-Idempotent Operation Made Idempotent
Increment is the canonical non-idempotent operation. The pattern to make it idempotent is to wrap with a seen-set check and only increment on first sight.
use std::collections::HashSet;
use std::sync::Mutex;
use uuid::Uuid;
pub struct IdempotentCounter {
count: Mutex<u64>,
seen: Mutex<HashSet<Uuid>>,
}
impl IdempotentCounter {
pub fn new() -> Self {
Self {
count: Mutex::new(0),
seen: Mutex::new(HashSet::new()),
}
}
/// Increment if the observation_id has not been seen before;
/// no-op for duplicates. Returns the post-increment count.
pub fn record_unique(&self, id: Uuid) -> u64 {
let mut seen = self.seen.lock().unwrap();
if !seen.insert(id) {
// Already seen; return current count without incrementing.
return *self.count.lock().unwrap();
}
let mut count = self.count.lock().unwrap();
*count += 1;
*count
}
pub fn count(&self) -> u64 {
*self.count.lock().unwrap()
}
}
The record_unique pattern is the standard wrapper. seen.insert(id) returns false if the ID was already present, in which case the function returns the current count without incrementing; otherwise it increments. The two locks are always taken in the same order (seen before count), which prevents deadlock; in production code you would typically put both fields behind a single Mutex so the check and the increment form one critical section. The seen-set must be bounded (as in the DedupSet example above); an unbounded seen-set in a long-running counter eventually OOMs the process. The capstone's counter-style metrics use the bounded seen-set pattern at every sink that does increment-style operations; the pipeline as a whole is exactly-once-effective despite at-least-once delivery.
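One way to combine the two refinements (a single lock and a bounded seen-set) is sketched below. It is an illustration under simplifying assumptions, not the capstone's implementation: IDs are u128 rather than Uuid to avoid external crates, the bound is count-only, and BoundedIdempotentCounter is a hypothetical name.

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::Mutex;

struct Inner {
    count: u64,
    seen: HashSet<u128>,
    order: VecDeque<u128>, // insertion order, oldest at the front
}

/// Count-bounded idempotent counter guarded by one mutex, so the
/// duplicate check and the increment are a single critical section.
pub struct BoundedIdempotentCounter {
    inner: Mutex<Inner>,
    capacity: usize,
}

impl BoundedIdempotentCounter {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self {
            inner: Mutex::new(Inner {
                count: 0,
                seen: HashSet::new(),
                order: VecDeque::new(),
            }),
            capacity,
        }
    }

    /// Increment on first sight of `id`; no-op for remembered duplicates.
    /// Returns the count after the call.
    pub fn record_unique(&self, id: u128) -> u64 {
        let mut guard = self.inner.lock().unwrap();
        let inner = &mut *guard;
        if inner.seen.contains(&id) {
            return inner.count;
        }
        // Make room before inserting so the set never exceeds capacity.
        if inner.order.len() >= self.capacity {
            if let Some(oldest) = inner.order.pop_front() {
                inner.seen.remove(&oldest);
            }
        }
        inner.seen.insert(id);
        inner.order.push_back(id);
        inner.count += 1;
        inner.count
    }
}
```

The bound has a visible cost: once an ID has been evicted, a very late duplicate of it is counted again. That is the reason the capacity has to be sized against the maximum re-delivery window rather than picked arbitrarily.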
Key Takeaways
- Idempotency is a property of the operation, not the framework. Every operator that writes to durable state must declare which operations are idempotent and on what key. The orchestrator's metadata carries the per-stage idempotency_key_field; the audit asserts it at startup.
- Natural keys come from the envelope (the SDA pipeline uses observation_id end-to-end). Derived keys are deterministic hashes of contributing inputs (sorted, content-addressable). Every operator preserves the natural key and computes derived keys reproducibly.
- The bounded dedup set uses time AND count for production safety. Double-bound is the pattern: time alone fails under bursts, count alone fails under slow streams. Window size is operationally determined by maximum re-delivery window plus margin.
- Kafka's idempotent producer (enable.idempotence=true) catches duplicates from producer retries at the broker level via PID + sequence numbers. Partition-scoped, sub-5% throughput cost. Default-on for production pipelines.
- The at-least-once + idempotent = exactly-once-effective composition holds within the pipeline and at idempotent boundaries (SQL UPSERT, idempotent producers, HTTP Idempotency-Key headers). At non-idempotent boundaries (alerts that trigger external actions), the boundary's owner must implement dedup; the pipeline's responsibility is to deliver the keys correctly and document the contract.
Lesson 3 — Checkpointing
Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Position: Lesson 3 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Fault Tolerance — Microbatching and Checkpointing); Database Internals — Alex Petrov, Chapter 5 (Checkpointing in Recovery — the same conceptual machinery applied to streaming state)
Context
Lessons 1 and 2 produced a pipeline that delivers messages at-least-once and processes them with effective exactly-once semantics via idempotency. Both rely on durable state at the boundary — the SQL UPSERT, the Kafka idempotent producer's PID+sequence state, the consumer's committed offset. The state inside the pipeline — the windowed correlator's per-key sliding windows, the M3 retract-aware sink's in-flight retained windows, the M2 supervisor's restart history — lives in process memory and disappears on restart.
For most operators that does not matter. A stateless map operator such as normalize restarts cleanly and resumes by reading from the consumer offset. The orchestrator's supervisor restarts a panicked operator with a fresh Task, which means fresh, empty state, and that is exactly right for stateless operators because their state is empty by definition. For stateful operators (the windowed correlator is the canonical case) restart means losing the windows currently in flight. The pipeline restarts; the input replay from the consumer offset begins; the windows that had been accumulating before the crash are gone, and the replay rebuilds them from scratch.
The cost is real. A 30-second window's worth of in-flight events takes 30 seconds of replay to reconstruct. During that 30 seconds, the pipeline is producing alerts whose source data is incomplete — the correlator has not yet seen all the observations that should be in its window because they are in the past, before the consumer offset's restart point. Module 5's mission framing (the SDA-2026-0207 incident) had exactly this shape: the restart's replay window straddled an active conjunction event, the windows were rebuilt without the early observations of that event, and the resulting alerts were emitted with degraded confidence.
This lesson installs checkpointing — the durable-state mechanism that lets a restarted operator resume from a saved state plus the input offset that produced it, without replaying from scratch. Database engines use checkpointing in WAL recovery (Petrov Chapter 5); Spark and Flink use it for streaming-state recovery; the SDA pipeline uses it for the same purpose at the operator level. The pattern is the same in every system: pause the operator briefly, snapshot its state to durable storage, record the input offset that the snapshot reflects, resume. On restart, load the latest snapshot, set the input offset, resume processing. The window of data lost on restart shrinks from "since process start" to "since last checkpoint."
Core Concepts
State + Offset = Checkpoint
A checkpoint is the durable record of an operator's recoverable progress. It has two components.
State: the operator's in-memory state at the moment the checkpoint was taken. For the windowed correlator, this is the per-key sliding windows. For the retract-aware sink, this is the recently-emitted (window_id, sequence) records. For a stateful aggregator, this is the running aggregations. The state is whatever the operator needs to resume processing without losing data.
Offset: the position in the input stream that the checkpointed state reflects. For a Kafka consumer, this is the broker offset of the last fully-processed message. For a file-based source, this is the byte offset. The offset is the link between the state and the input — it answers "if I restart with this state, where do I start reading from?"
The two together are the recovery contract: load the state, set the offset, resume. Either one alone is insufficient. A state without an offset is a snapshot of indeterminate vintage; resuming from it produces incorrect output because the input position is unknown. An offset without a state is a position to resume reading from, but the operator's running aggregations are gone — the pipeline replays correctly only if the operator is stateless. The capstone enforces the State+Offset invariant structurally: every checkpoint write atomically includes both, and every checkpoint read returns both or fails.
Pause-Snapshot-Resume Protocol
The simplest checkpoint protocol: pause the operator's input for the duration of the checkpoint, write the state to durable storage along with the current offset, resume input. The pause is what guarantees the State+Offset pair is consistent — no new input arrives during the snapshot, so the state and the offset reflect the same point in the stream.
The credit-based-flow primitive from Module 4 L2 is the natural mechanism for the pause. The operator withholds credits from its upstream during the snapshot; the upstream's local credit counter drains; the upstream stops producing without occupying any in-flight slot. The snapshot completes; the operator returns credits; the upstream resumes. The duration of the pause is the snapshot's wall-clock cost, typically 50-200ms for SDA-scale operator state.
The cost of the pause is end-to-end latency. During the pause, no observations flow through the operator; downstream operators see a brief gap in their input. For SDA's 30-second alert SLO, a 200ms pause is 0.7% of the budget — acceptable. For tighter SLOs the pause must be tighter; production checkpoints write to local fast storage (NVMe, RAM disk) to keep the pause sub-millisecond. The discipline: choose checkpoint frequency and pause duration to fit the SLO budget.
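The budget check is simple enough to encode next to the checkpoint configuration. The helper below is a hypothetical sketch that reproduces the arithmetic above; it is not part of the orchestrator.

```rust
use std::time::Duration;

/// Fraction of the latency SLO consumed by one checkpoint pause.
pub fn pause_fraction(pause: Duration, slo: Duration) -> f64 {
    pause.as_secs_f64() / slo.as_secs_f64()
}

/// True if the pause fits within `max_fraction` of the SLO.
pub fn pause_fits_budget(pause: Duration, slo: Duration, max_fraction: f64) -> bool {
    pause_fraction(pause, slo) <= max_fraction
}
```

For a 200ms pause against a 30-second SLO, pause_fraction returns 0.2 / 30, roughly 0.0067, which is the 0.7% quoted above. The same pause against a 1-second SLO would consume 20% of the budget, which is the case where faster local storage becomes necessary.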
Aligned Checkpoints (Flink's Approach)
For a single-operator checkpoint the pause-snapshot-resume protocol is sufficient. For a multi-operator pipeline, the operators' checkpoints must be aligned — the saved state across operators must reflect the same point in the input stream — or the recovery produces inconsistent results (operator A resumes from input position X, operator B resumes from input position X+10, the pipeline's overall state is incoherent).
Flink's solution is barrier alignment. The orchestrator injects a special checkpoint barrier marker into the source streams. The barrier flows through the pipeline as a normal event; when an operator receives a barrier on every input, it snapshots its current state and forwards the barrier to its downstream. The barrier reaches the next operator, which waits until it has the barrier on every input, snapshots, forwards. The structure produces a consistent cut across the pipeline: every operator's snapshot reflects the same set of input events.
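The operator-side half of barrier alignment can be sketched as a small state machine. This is an illustrative model, not Flink's code: BarrierAligner and its methods are hypothetical names, inputs are identified by index, and only one checkpoint is assumed to be in flight at a time.

```rust
use std::collections::HashSet;

/// Tracks which inputs of one operator have delivered the current barrier.
pub struct BarrierAligner {
    num_inputs: usize,
    current: Option<u64>,    // checkpoint id being aligned, if any
    arrived: HashSet<usize>, // inputs that have delivered the barrier
}

impl BarrierAligner {
    pub fn new(num_inputs: usize) -> Self {
        Self { num_inputs, current: None, arrived: HashSet::new() }
    }

    /// Record a barrier from `input`. Returns true when the barrier has
    /// now arrived on every input: the caller should snapshot its state,
    /// forward the barrier downstream, and resume reading all inputs.
    pub fn on_barrier(&mut self, input: usize, cp_id: u64) -> bool {
        if self.current != Some(cp_id) {
            // First barrier of a new checkpoint: start a fresh alignment.
            self.current = Some(cp_id);
            self.arrived.clear();
        }
        self.arrived.insert(input);
        if self.arrived.len() == self.num_inputs {
            self.current = None;
            self.arrived.clear();
            true
        } else {
            false
        }
    }

    /// While aligning, an input that has already delivered the barrier
    /// must not be read: its next events belong to the next checkpoint.
    pub fn is_blocked(&self, input: usize) -> bool {
        self.arrived.contains(&input)
    }
}
```

The is_blocked check is where the stall comes from: fast inputs that deliver their barrier early sit blocked until the slowest input catches up.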
The cost is operationally real. The slowest input's barrier determines the alignment time: an operator waits until the barrier has arrived on every input, so a slow input stalls the entire checkpoint. For SDA's pipeline with three sources at very different rates (radar 5000/sec, ISL 100/sec, optical 50/sec) the alignment is dominated by the slow inputs. Flink also offers an unaligned mode that skips the wait by persisting in-flight records as part of the snapshot, trading larger checkpoints for speed. The SDA pipeline avoids barrier alignment altogether and uses the per-operator checkpoints described in the next section.
Per-Operator (Unaligned) Checkpoints
The simpler alternative: each operator checkpoints on its own schedule, with no cross-operator alignment. Each operator's checkpoint reflects only its own state and the position in its input stream. Recovery is then per-operator: each operator independently loads its latest checkpoint and resumes from its checkpointed offset.
The simplification has a real cost: cross-operator state is no longer consistent. If operator A checkpointed at position X and operator B (downstream of A) checkpointed at position Y > X, then after a restart A starts at X but B starts at Y. A re-emits the messages in (X, Y]; B has already processed them and drops them via idempotency. This works correctly under the at-least-once + idempotent composition from Lessons 1 and 2: duplicates are absorbed. The reverse case, Y < X, is safe only if B can re-read (Y, X] from a durable input instead of depending on A to re-emit it.
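The X/Y scenario can be walked through with a toy model. The sketch makes simplifying assumptions that are not the pipeline's actual design: messages are identified by a single stream position, and the downstream operator dedups with a position watermark restored from its checkpoint (the SDA pipeline dedups on observation_id). Downstream and replay_from are hypothetical names.

```rust
/// Downstream operator B: keeps a running total and remembers the highest
/// position already folded into it. Both fields come from B's checkpoint.
pub struct Downstream {
    pub applied_through: u64,
    pub total: u64,
}

impl Downstream {
    /// Apply one message. Replays at or below the watermark are dropped.
    /// Returns true if the message changed the state.
    pub fn apply(&mut self, position: u64, value: u64) -> bool {
        if position <= self.applied_through {
            return false;
        }
        self.applied_through = position;
        self.total += value;
        true
    }
}

/// Upstream operator A after restart: re-emit every (position, value)
/// pair after its own checkpointed position.
pub fn replay_from(checkpoint_pos: u64, log: &[(u64, u64)]) -> Vec<(u64, u64)> {
    log.iter()
        .copied()
        .filter(|&(pos, _)| pos > checkpoint_pos)
        .collect()
}
```

With six messages of value 10, A restored at X = 2 and B restored at Y = 4 (total 40), A replays positions 3 through 6. B drops 3 and 4, applies 5 and 6, and ends at total 60, the same result as a run with no crash.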
For SDA's pipeline, per-operator checkpointing is the right shape because the idempotency machinery is already in place. The pipeline does not need the strong consistency that aligned checkpoints provide; it needs each operator to recover its own state, and the global consistency is recovered via dedup at every boundary. The capstone uses per-operator checkpoints; the lesson covers aligned checkpoints for completeness because they show up in production streaming systems and the framing matters when those systems are introduced.
Checkpoint Storage
The state must go somewhere durable. Three tiers, each with its own cost profile.
Local disk (NVMe / SSD). Fastest write latency (~100µs for small writes). The natural choice for the checkpoint's hot-path destination. Limitation: a host failure loses the checkpoint along with the host. For pipeline restart-without-host-loss (the common case), local-disk checkpoints suffice.
Remote object storage (S3, GCS). Slower write latency (10-100ms). Survives host loss because the storage is redundant across availability zones. The natural choice for durable checkpoints that must survive worst-case failures.
Hybrid: local-then-async-replicated-to-remote. The standard production pattern. Write to local first (fast, on the hot path); replicate asynchronously to remote (durable, off the hot path). The pause-snapshot-resume cost is bounded by the local write; the remote-replication catches up in the background. On restart, prefer the local checkpoint if it exists (fast recovery); fall back to remote if it doesn't (host loss recovery).
The choice is operational. For SDA's pipeline, the hybrid approach with a 1-second local checkpoint cadence and 60-second remote replication is the production default — adjustable per operator based on the state size and the recovery time budget. The capstone exposes the cadence and replication parameters per operator; ops tuning happens in the orchestrator's startup configuration rather than in code.
Code Examples
A Periodic-Checkpoint Operator with Local Disk
The simplest implementation: every N seconds, pause the input via credit-withholding, serialize the state, write to disk, resume. On startup, look for the latest checkpoint and restore.
use anyhow::Result;
use serde::{Deserialize, Serialize};
use std::path::PathBuf;
use std::time::Duration;
use tokio::fs;
#[derive(Debug, Serialize, Deserialize)]
pub struct OperatorCheckpoint<S> {
/// The operator's serialized state at checkpoint time.
pub state: S,
/// The input-stream offset that the state reflects.
pub offset: u64,
/// When the checkpoint was taken (for diagnostics).
pub taken_at_unix_ms: u64,
}
/// A checkpointing wrapper around any stateful operator. The operator's
/// state type S implements serde's Serialize and Deserialize (plus
/// Default for the fresh-start case); the wrapper handles the snapshot
/// write and the restore-on-startup disk I/O. Pausing input around the
/// snapshot is the caller's job.
pub struct CheckpointingOperator<S> {
state: S,
last_offset: u64,
interval: Duration,
checkpoint_dir: PathBuf,
operator_name: String,
}
impl<S> CheckpointingOperator<S>
where
S: Serialize + for<'de> Deserialize<'de> + Default,
{
/// Build a fresh operator OR restore from the latest checkpoint
/// if one exists. The constructor does the recovery.
pub async fn new_or_restore(
operator_name: impl Into<String>,
interval: Duration,
checkpoint_dir: PathBuf,
) -> Result<Self> {
let operator_name = operator_name.into();
let path = checkpoint_dir.join(format!("{operator_name}.bin"));
let (state, last_offset) = if path.exists() {
let bytes = fs::read(&path).await?;
let cp: OperatorCheckpoint<S> = bincode::deserialize(&bytes)?;
tracing::info!(
operator = %operator_name,
offset = cp.offset,
"restored from checkpoint"
);
(cp.state, cp.offset)
} else {
tracing::info!(operator = %operator_name, "no checkpoint found; starting fresh");
(S::default(), 0)
};
Ok(Self {
state,
last_offset,
interval,
checkpoint_dir,
operator_name,
})
}
/// Take a checkpoint NOW. Caller is responsible for pausing input
/// (e.g., via credit-withholding) before invocation; this method
/// is just the snapshot and write.
pub async fn checkpoint(&self) -> Result<()> {
let path = self.checkpoint_dir.join(format!("{}.bin", self.operator_name));
let cp = OperatorCheckpoint {
state: &self.state,
offset: self.last_offset,
taken_at_unix_ms: std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.map(|d| d.as_millis() as u64)
.unwrap_or(0),
};
// Write to a temp file first, then atomically rename. The rename
// is the atomicity barrier: the file appears at its final path
// only as a complete write, so a process crash during the write
// leaves the previous (consistent) checkpoint intact.
let tmp_path = path.with_extension("bin.tmp");
let bytes = bincode::serialize(&cp)?;
fs::write(&tmp_path, bytes).await?;
fs::rename(&tmp_path, &path).await?;
Ok(())
}
}
The atomic-rename pattern is the consistency discipline. Writing directly to the final path leaves the file in a partial state if the process crashes mid-write; the next startup reads a corrupt file and either fails to deserialize or (worse) deserializes an inconsistent State+Offset pair. POSIX rename is atomic when source and destination are on the same filesystem (which a temp file in the same directory guarantees), so the file is either the previous good checkpoint or the new good checkpoint, never a partial state. Rename alone does not make the write durable: production code calls File::sync_all (fsync) on the temp file before the rename so the contents survive a power failure; we elide that for clarity.
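For reference, a minimal synchronous sketch of the write with the flush included is shown below. It uses std::fs rather than tokio::fs to stay dependency-free; write_atomic_durable and the ".tmp" suffix are hypothetical choices, not the capstone's code.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to `final_path` so that a reader sees either the old
/// file or the complete new one, and the new contents reach stable
/// storage before they become visible.
pub fn write_atomic_durable(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temp file next to the target, so both paths share a filesystem.
    let mut tmp_name = final_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp_path)?;
    file.write_all(bytes)?;
    file.sync_all()?; // fsync: flush data and metadata to the device
    drop(file);

    fs::rename(&tmp_path, final_path)?; // atomic switch to the new contents

    // On Unix, also fsync the parent directory so the rename itself
    // survives a power failure.
    #[cfg(unix)]
    {
        if let Some(dir) = final_path.parent().filter(|p| !p.as_os_str().is_empty()) {
            File::open(dir)?.sync_all()?;
        }
    }
    Ok(())
}
```

The extra sync_all calls are what move the cost from "small buffered write" toward the device's real flush latency, so they belong in the pause budget when sizing the checkpoint cadence.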
A Barrier-Based Coordinator (Aligned Checkpoint)
For aligned checkpoints, the orchestrator injects a barrier marker into source streams. Each operator forwards the barrier after snapshotting; the coordinator waits for all operators' barrier acknowledgments to declare the checkpoint complete.
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use anyhow::Result;
use tokio::sync::{mpsc, Notify};

/// A barrier marker that flows through the pipeline as a control
/// item. Operators forward it after taking their checkpoint.
#[derive(Debug, Clone, Copy)]
pub struct CheckpointBarrier(pub u64); // checkpoint id

// SourceItem is the pipeline's source-channel item type, defined
// elsewhere; this sketch relies only on its Barrier(u64) variant.

/// The orchestrator's coordinator. Tracks barrier acknowledgments from
/// every operator; the checkpoint is complete when every operator has
/// acked the barrier with the same checkpoint id.
pub struct CheckpointCoordinator {
    expected_acks: usize,
    received_acks: Arc<Mutex<HashMap<u64, HashSet<String>>>>,
    completion: Arc<Notify>,
}

impl CheckpointCoordinator {
    /// Inject a barrier into the source streams and wait for all
    /// operators to ack.
    pub async fn run_checkpoint(
        &self,
        cp_id: u64,
        sources: &[mpsc::Sender<SourceItem>],
    ) -> Result<()> {
        // Make sure an (initially empty) ack set exists for this checkpoint.
        self.received_acks.lock().unwrap().entry(cp_id).or_default();

        // Inject the barrier into every source stream.
        for src in sources {
            src.send(SourceItem::Barrier(cp_id)).await?;
        }

        // Wait for all operators to ack.
        loop {
            // Create the Notified future BEFORE checking the ack count.
            // notify_waiters() wakes every Notified future that already
            // exists, so an ack that lands between the check and the
            // await is not lost.
            let notified = self.completion.notified();
            {
                let mut acks = self.received_acks.lock().unwrap();
                let done = acks
                    .get(&cp_id)
                    .map_or(false, |set| set.len() >= self.expected_acks);
                if done {
                    acks.remove(&cp_id); // keep the ack map bounded
                    return Ok(());
                }
            } // lock guard dropped here, before awaiting
            notified.await;
        }
    }

    /// Operator-side ack: called when an operator has snapshotted in
    /// response to a barrier and is forwarding it downstream.
    pub fn ack(&self, cp_id: u64, operator_name: &str) {
        let mut acks = self.received_acks.lock().unwrap();
        acks.entry(cp_id).or_default().insert(operator_name.to_string());
        drop(acks);
        self.completion.notify_waiters();
    }
}
The barrier mechanism is more complex than the per-operator pattern but produces the consistent-cut property the framework guarantees. The cost is operational: the coordinator is a centralized component that the supervisor must keep alive, the barrier protocol must be implemented in every operator, and the slowest input stalls the entire checkpoint. Production systems that use aligned checkpointing (Flink's defaults) accept this complexity in exchange for cross-operator consistency. The lesson covers it for completeness; SDA's pipeline uses the simpler per-operator approach.
Recovery from a Checkpoint at Startup
The startup path that calls new_or_restore for every checkpointing operator. The orchestrator's bootstrap logic looks for local checkpoints first, falls back to remote, defaults to fresh state if neither exists.
pub async fn bootstrap_pipeline(
config: &PipelineConfig,
) -> Result<RunningPipeline> {
let local_dir = &config.checkpoint_dir;
let remote_dir = &config.remote_checkpoint_uri;
let mut operators: Vec<CheckpointingOperator<_>> = Vec::new();
for op_spec in &config.operators {
// Try local first. If nothing exists locally, try remote.
// If nothing exists in either, start fresh.
let local_path = local_dir.join(format!("{}.bin", op_spec.name));
let restored = if local_path.exists() {
CheckpointingOperator::new_or_restore(
&op_spec.name,
op_spec.checkpoint_interval,
local_dir.clone(),
).await?
} else if remote_exists(remote_dir, &op_spec.name).await? {
tracing::info!(operator = %op_spec.name, "local missing; pulling from remote");
pull_remote_to_local(remote_dir, local_dir, &op_spec.name).await?;
CheckpointingOperator::new_or_restore(
&op_spec.name,
op_spec.checkpoint_interval,
local_dir.clone(),
).await?
} else {
tracing::warn!(operator = %op_spec.name, "no checkpoint anywhere; starting fresh");
CheckpointingOperator::new_or_restore(
&op_spec.name,
op_spec.checkpoint_interval,
local_dir.clone(),
).await?
};
operators.push(restored);
}
Ok(RunningPipeline { operators, /* ... */ })
}
The local-then-remote-then-fresh hierarchy is what gives the pipeline different recovery profiles for different failure modes. A normal restart (process exit, redeploy) recovers from local and is fast (sub-second). A host-failure recovery (the host went down) recovers from remote and is slower (tens of seconds for the pull, plus the deserialize time). A first-time start has no checkpoint and starts fresh. Each path is logged at the structured-log level the orchestrator's monitoring expects; the recovery route taken is itself a metric (pipeline_recovery_path{path="local|remote|fresh"} counter) that surfaces interesting events for ops to diagnose.
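The three-way decision and its metric label can be isolated into a small pure function, which makes the hierarchy easy to unit-test separately from the I/O. RecoveryPath is a hypothetical type introduced for illustration; only the label values local, remote, and fresh come from the metric described above.

```rust
/// Which source a restarting operator recovers from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryPath {
    Local,
    Remote,
    Fresh,
}

impl RecoveryPath {
    /// Prefer local, fall back to remote, otherwise start fresh.
    pub fn choose(local_exists: bool, remote_exists: bool) -> Self {
        if local_exists {
            Self::Local
        } else if remote_exists {
            Self::Remote
        } else {
            Self::Fresh
        }
    }

    /// Value for the `path` label of the pipeline_recovery_path counter.
    pub fn label(self) -> &'static str {
        match self {
            Self::Local => "local",
            Self::Remote => "remote",
            Self::Fresh => "fresh",
        }
    }
}
```

Structured this way, the bootstrap loop would compute the path once, pull from remote only in the Remote case, and then make a single new_or_restore call, since that constructor already handles both the restore and the fresh-start branches.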
Key Takeaways
- A checkpoint is State + Offset: the operator's serialized state plus the input-stream offset that the state reflects. Either alone is insufficient; both together are the recovery contract. The atomic-rename pattern ensures the file on disk is either the previous good checkpoint or the new good checkpoint, never a partial state.
- The pause-snapshot-resume protocol is the simplest implementation: withhold credits from upstream, serialize, write to disk, return credits. The pause duration is end-to-end latency cost during the snapshot — typically 50-200ms for SDA-scale state, tunable via storage choice.
- Aligned checkpoints (Flink-style barriers) produce consistent cuts across operators at the cost of slowest-input stall time. Per-operator checkpoints trade the consistency property for simplicity and rely on at-least-once + idempotency to recover global consistency via dedup. SDA uses per-operator.
- Storage tiers are local disk (fast, host-bound), remote object storage (slower, durable across host loss), and the hybrid (local-first, async-replicated-to-remote) which is the standard production pattern. The hot-path cost is bounded by local; the durability comes from remote.
- Recovery hierarchy is local → remote → fresh. Normal restart recovers from local in sub-second; host failure pulls from remote in tens of seconds; cold start has no checkpoint and starts at offset 0. Each path is logged as a structured metric so ops can see the recovery profile of every restart.
Lesson 4 — Dead Letter Queues
Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Position: Lesson 4 of 4 Source: Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Error Handling and Dead-Letter Queues); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 9 (Failure Handling and Reprocessing)
Context
The retry policy from Module 2 L3 covers transient errors — the network blip, the broker leader election, the partner API hiccup. The idempotency machinery from M5 L2 absorbs the duplicates that retries inevitably produce. The checkpoint machinery from L3 lets the pipeline recover its state across restart. The combination handles every recoverable failure mode the SDA pipeline encounters.
It does not handle permanent failures. Some events cannot be processed no matter how many times they are retried. Three examples from SDA: a frame from the radar source whose binary payload cannot be deserialized because the wire format does not match the protocol the operator was built against, possibly because the radar's firmware was upgraded without coordination; an observation referencing a satellite catalog ID that has been decommissioned for a deorbit burn, so the record is internally consistent but references state that no longer exists; and an ISL beacon's state vector with a physically impossible value (radius below Earth's surface, velocity above c) that violates the operator's input invariants. Retrying these does not help: every retry produces the same error, and the operator's retry budget is eventually exhausted. Dropping them silently violates the audit requirement that every observation is accounted for.
The third path is the dead-letter queue (DLQ): a separate sink, distinct from the main pipeline, that receives events the operator cannot process. Each entry carries the original event plus enough metadata for an engineer to investigate: the error kind, the operator name, the timestamp, and the number of retry attempts made. The DLQ is not a discard bucket; it is a debug tool and a re-processing source. Engineers inspect it during incident response and re-inject from it once the underlying issue is fixed. The pattern is universal in production streaming systems; this lesson develops it for SDA's pipeline.
Core Concepts
The Retry-Disposition Decision Tree
Every error from an operator's hot path classifies into one of three buckets.
Transient (Retry). Network errors, 5xx responses, timeouts, broker leader-elections. Will resolve on retry given enough time. The retry wrapper from M2 L3 handles these with decorrelated-jitter backoff. The operator's retry budget bounds the time spent retrying.
Permanent (DLQ). Deserialization errors, schema-mismatch errors, validation failures, references to non-existent state. Will NOT resolve on retry — every attempt produces the same error. Routes to the DLQ for human-investigable handling. Does not consume retry budget.
Discardable (drop). Invariant violations that should be dropped without operational attention. The radar frame whose wire format declares a length larger than the maximum permitted is a clear bug at the source, not worth investigating and not worth re-processing. The drop is not fully silent: it increments a metric.
The classification is the operator's responsibility. M2 L3 introduced RetryDisposition::{Retry, Permanent, Discard}; this lesson extends Permanent to mean "DLQ": the operator hands the event to the DLQ sink rather than just propagating the error. The lesson's discipline has three parts: every operator's error path classifies explicitly, every classification is documented in code, and the default for unknown errors is Permanent (DLQ), so they surface for operational attention rather than being silently retried or dropped.
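As a rough illustration of the three buckets and the Permanent default, the sketch below classifies a made-up error type. IngestError and Disposition are hypothetical stand-ins invented for this example; the real operators classify into the RetryDisposition type from M2 L3, whose exact shape is not reproduced here.

```rust
/// Hypothetical operator error type, for illustration only.
#[derive(Debug)]
pub enum IngestError {
    Timeout,
    BrokerUnavailable,
    Deserialize(String),
    UnknownCatalogId(u32),
    LengthExceedsMax { declared: usize, max: usize },
    Other(String),
}

/// Simplified stand-in for the three buckets described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    Retry,
    Permanent,
    Discard,
}

/// Every variant is classified explicitly; nothing falls through a
/// wildcard arm, so adding a variant forces a classification decision.
pub fn classify(err: &IngestError) -> Disposition {
    match err {
        // Transient: expected to clear on retry.
        IngestError::Timeout | IngestError::BrokerUnavailable => Disposition::Retry,
        // Permanent: every attempt fails the same way, so route to the DLQ.
        IngestError::Deserialize(_) | IngestError::UnknownCatalogId(_) => {
            Disposition::Permanent
        }
        // Discardable: drop with a metric increment.
        IngestError::LengthExceedsMax { .. } => Disposition::Discard,
        // Unknown errors default to Permanent so they surface in the DLQ.
        IngestError::Other(_) => Disposition::Permanent,
    }
}
```

Matching without a wildcard arm is a deliberate choice in this sketch: the compiler then rejects any new error variant until someone decides which bucket it belongs to.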
DLQ Metadata
The DLQ entry is the event plus context. The context is what makes the DLQ a debug tool rather than a dump.
pub struct DlqEntry {
/// Wall-clock time when the error occurred.
pub timestamp: SystemTime,
/// The operator that produced the error.
pub operator: String,
/// The kind of error (deserialization, validation, processing exception).
pub error_kind: DlqErrorKind,
/// Free-form error message for human investigators.
pub error_message: String,
/// Number of retry attempts before giving up (typically 0 for
/// permanent errors, > 0 for transient errors that exceeded the
/// retry budget).
pub retry_count: u32,
/// The original event. Stored as the raw bytes (or the deserialized
/// envelope when available) so re-processing can replay it.
pub original_payload: Vec<u8>,
/// Schema version of the metadata format itself. Important for
/// re-processing tools that span DLQ entries from different
/// pipeline versions.
pub schema_version: u32,
}
pub enum DlqErrorKind {
Deserialization,
SchemaMismatch,
ValidationFailed,
ProcessingException,
RetryBudgetExhausted,
}
The schema_version is the often-overlooked field. The DLQ accumulates entries over a long time horizon; the metadata format will evolve as the pipeline evolves. A re-processing tool reading entries from six months ago needs to know what fields the entry carried when it was written. The version field is the migration mechanism — a future tool reads schema_version=1 entries with the v1 format and schema_version=2 entries with the v2 format. Without the version, future migrations require either guessing or losing entries.
Poison Pills
A poison pill is an event that fails on every attempt. The poison-pill scenario is what distinguishes DLQ-bearing pipelines from drop-only ones. Without a DLQ, a poison pill blocks the pipeline: the operator retries, fails, retries, fails, and exhausts its retry budget; the supervisor restarts the operator; the operator reads the same poison pill from the last committed consumer offset and fails again. The pipeline makes no progress past the poison pill.
With a DLQ, the poison pill is quarantined. The operator's first attempt classifies the error as Permanent, hands the event to the DLQ, and continues with the next event, without consuming retry budget. The pipeline stays healthy; the poison pill is in the DLQ for investigation. The operational discipline is a dlq_entries_total{error_kind} metric and an alert when its rate exceeds a threshold. A spike in DLQ entries is a signal: a partner's API change, a schema migration that did not roll out everywhere, or a bug in the operator's deserializer.
The threshold for the alert is tuned per error kind. A handful of Deserialization errors per day during steady-state is normal (occasional malformed wire packets). A spike to hundreds per minute signals a partner change or a rollback. RetryBudgetExhausted errors should be rare; a spike means a downstream is degraded longer than the retry budget covers, and operations should investigate.
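One way to express a per-kind threshold is a rule that fires only when the rate stays above a multiple of its baseline for a sustained period. The sketch below is an assumption about how such a rule could be evaluated in code; RateAlertRule and its fields are hypothetical, and in practice the rule would more likely live in the alerting system's own configuration.

```rust
/// Hypothetical per-error-kind alert rule: fire when the per-minute DLQ
/// count stays above `multiplier` x `baseline_per_min` for at least
/// `sustain_minutes` consecutive minutes.
pub struct RateAlertRule {
    pub baseline_per_min: f64,
    pub multiplier: f64,
    pub sustain_minutes: usize,
}

impl RateAlertRule {
    /// `per_minute_counts` is ordered oldest to newest; only the most
    /// recent `sustain_minutes` samples are considered.
    pub fn should_fire(&self, per_minute_counts: &[u64]) -> bool {
        if self.sustain_minutes == 0 || per_minute_counts.len() < self.sustain_minutes {
            return false;
        }
        let threshold = self.baseline_per_min * self.multiplier;
        let recent = &per_minute_counts[per_minute_counts.len() - self.sustain_minutes..];
        recent.iter().all(|&count| count as f64 > threshold)
    }
}
```

Requiring every sample in the recent window to exceed the threshold keeps a single noisy minute from paging on-call; the baseline and multiplier would be set separately for each error kind.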
The DLQ as Re-processing Source
The DLQ is also a stream. Once an underlying issue is fixed (a partner's API change is reverted, a schema migration completes, a bug fix deploys), the DLQ's entries can be re-processed through the pipeline. A re-processing tool reads from the DLQ, reconstructs the original events, and pushes them back into the pipeline's input topic. The operators process them as if they were freshly arrived, the dedup logic from L2 absorbs the duplicates from any prior partial processing, and the events end up in the right downstream state.
The re-processing tool is its own piece of code, separate from the live pipeline. It has access to the DLQ's stored entries, knows the schema_version migration story, can filter by error_kind or operator or time range. The tool is an operational lever: when an issue is fixed, ops runs the tool against the affected DLQ window to recover the affected events. Without the tool, the events are stuck in the DLQ forever; with the tool, the DLQ's role is "temporarily holding events while we figure out what to do," which is exactly the right framing.
The tool's correctness depends on the pipeline's idempotency. Re-injecting from the DLQ produces the same events the pipeline processed before; without idempotency, re-processing produces duplicate effects. The L2 machinery (sink-side dedup, idempotent UPSERT, Idempotency-Key headers) absorbs the re-injection; the re-processing is safe by construction. This is the L2-L4 composition the capstone exercises end-to-end.
The Discard-Bucket Anti-Pattern
A DLQ that nobody reads is worse than no DLQ. Entries pile up; ops loses track of what they mean; a real incident produces a small spike that gets lost in the noise of historical entries. This is the discard-bucket anti-pattern; preventing it is a matter of operational discipline.
Three operational practices distinguish a useful DLQ from a discard bucket.
Alert on growth rate. A DLQ growing at >10x its baseline rate for >5 minutes fires an alert. The alert text names the dominant error_kind in the recent window so on-call has an immediate hypothesis to investigate.
Periodic review. The DLQ has a weekly review cadence. Ops walks through new entries since last review, classifies the dominant patterns, decides on remediation per error_kind. Some patterns become permanent fixes (the operator's classifier is updated to handle the case). Some become re-processing tasks (the underlying issue was fixed; re-inject the affected window). Some become explicit non-events (the entry represents a known-and-accepted failure mode).
Bounded retention. The DLQ does not grow forever. Entries older than the retention window (typically 30 days for SDA) are evicted. The retention is a forcing function for operational discipline — the team cannot ignore the DLQ indefinitely; entries must be addressed before they age out. The retention is documented and the eviction is logged so the team knows what they're losing.
The DLQ's value is proportional to the discipline applied to it. A well-managed DLQ catches partner API changes within minutes; a discard bucket catches nothing in particular.
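The bounded-retention practice can be sketched as a pure function that splits entries by age and hands back the evicted ones so they can be logged. StoredEntry and apply_retention are hypothetical names for this example; the actual eviction mechanism (a Kafka topic's retention setting, or a compaction job over the JSON-Lines file) is not specified by the lesson.

```rust
/// Minimal stand-in for a DLQ entry; only the timestamp matters here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredEntry {
    pub timestamp_unix_ms: u64,
    pub operator: String,
}

/// Split entries into (kept, evicted) given the current time and a
/// retention window. Entries strictly older than the window are evicted;
/// the evicted list is returned so the caller can log what is being lost.
pub fn apply_retention(
    entries: Vec<StoredEntry>,
    now_unix_ms: u64,
    retention_ms: u64,
) -> (Vec<StoredEntry>, Vec<StoredEntry>) {
    let cutoff = now_unix_ms.saturating_sub(retention_ms);
    entries
        .into_iter()
        .partition(|entry| entry.timestamp_unix_ms >= cutoff)
}
```

Returning the evicted entries instead of silently dropping them is what lets the eviction itself be logged, as the retention practice requires.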
Code Examples
A DLQ Sink with Structured Metadata
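The DLQ sink writes JSON-Lines to disk in the SDA pipeline's local filesystem; production deployments write to a Kafka topic with longer retention so re-processing tools can read from there. The local-disk version is sufficient for the lesson's purposes and matches the M3 L4 retract-aware sink's storage choice. Note that the serialized entry below stores the timestamp as Unix milliseconds (timestamp_unix_ms) instead of the SystemTime field shown in the conceptual struct above.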
use anyhow::Result;
use serde::{Deserialize, Serialize};
use std::path::PathBuf;
use std::time::SystemTime;
use tokio::fs::OpenOptions;
use tokio::io::AsyncWriteExt;
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
pub enum DlqErrorKind {
Deserialization,
SchemaMismatch,
ValidationFailed,
ProcessingException,
RetryBudgetExhausted,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DlqEntry {
pub schema_version: u32,
pub timestamp_unix_ms: u64,
pub operator: String,
pub error_kind: DlqErrorKind,
pub error_message: String,
pub retry_count: u32,
pub original_payload: Vec<u8>,
}
pub struct DlqSink {
file_path: PathBuf,
operator_name: String,
}
impl DlqSink {
pub fn new(file_path: PathBuf, operator_name: impl Into<String>) -> Self {
Self { file_path, operator_name: operator_name.into() }
}
/// Write a DLQ entry. Each entry is one JSON-Lines record.
/// Append-only; never overwrites existing entries.
pub async fn write(
&self,
kind: DlqErrorKind,
error_message: impl Into<String>,
retry_count: u32,
original_payload: Vec<u8>,
) -> Result<()> {
let entry = DlqEntry {
schema_version: 1,
timestamp_unix_ms: SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.map(|d| d.as_millis() as u64)
.unwrap_or(0),
operator: self.operator_name.clone(),
error_kind: kind,
error_message: error_message.into(),
retry_count,
original_payload,
};
let mut line = serde_json::to_vec(&entry)?;
line.push(b'\n');
let mut file = OpenOptions::new()
.create(true)
.append(true)
.open(&self.file_path)
.await?;
file.write_all(&line).await?;
// For SDA's reliability: sync after each write. Production with
// higher DLQ rates batches multiple writes per fsync.
file.sync_data().await?;
Ok(())
}
}
The append-only + per-write fsync is the durability discipline. A DLQ entry that is buffered in the kernel's page cache and not yet on disk is at risk if the process crashes. For the SDA pipeline's DLQ rates (single-digit entries per minute under steady state) the per-write sync cost is negligible. For higher-rate DLQs the cost matters and batching is appropriate; the standard pattern is "sync at most every N entries or every M milliseconds, whichever first." JSON-Lines as the format makes the file human-inspectable (tail -f /path/to/dlq.jsonl | jq) and machine-readable in one pass — useful for both ad-hoc investigation and the re-processing tool.
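The "every N entries or every M milliseconds, whichever first" rule can be captured in a small policy object that the sink consults after each append. This is a sketch under stated assumptions: SyncPolicy is a hypothetical type, and the caller supplies the current Instant so the policy stays deterministic in tests.

```rust
use std::time::{Duration, Instant};

/// Hypothetical batching policy: sync when `max_entries` writes are
/// pending or `max_interval` has elapsed since the last sync, whichever
/// comes first.
pub struct SyncPolicy {
    max_entries: u32,
    max_interval: Duration,
    pending: u32,
    last_sync: Instant,
}

impl SyncPolicy {
    pub fn new(max_entries: u32, max_interval: Duration, now: Instant) -> Self {
        Self { max_entries, max_interval, pending: 0, last_sync: now }
    }

    /// Record one appended entry; returns true when the caller should
    /// fsync now. Resets the pending count and the clock when it does.
    pub fn record_write(&mut self, now: Instant) -> bool {
        self.pending += 1;
        let due = self.pending >= self.max_entries
            || now.duration_since(self.last_sync) >= self.max_interval;
        if due {
            self.pending = 0;
            self.last_sync = now;
        }
        due
    }
}
```

This sketch only evaluates the time bound when a write arrives. A sink built on it would also need a timer-driven flush so that a trailing entry followed by a quiet period does not sit unsynced in the page cache.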
An Operator That Routes to DLQ Based on RetryDisposition
The operator's error path. M2 L3's retry wrapper is extended: RetryDisposition::Permanent(e) triggers a DLQ write before the error propagates; Discard drops with a counter; Retry goes through the wrapper's backoff machinery as before.
use anyhow::Result;
pub async fn run_operator_with_dlq<F, Fut, T>(
mut op: F,
dlq: &DlqSink,
policy: RetryPolicy,
payload: Vec<u8>, // raw payload bytes for DLQ
) -> Result<Option<T>>
where
F: FnMut() -> Fut,
Fut: std::future::Future<Output = RetryDisposition<T>>,
{
use std::time::Duration;
use tokio::time::sleep;
let mut attempt = 0u32;
let mut prev_delay = policy.initial;
loop {
attempt += 1;
match op().await {
RetryDisposition::Ok(v) => return Ok(Some(v)),
RetryDisposition::Discard => {
metrics::counter!("operator_discards_total").increment(1);
return Ok(None);
}
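RetryDisposition::Permanent(e) => {
// Permanent: route to DLQ. The kind is hard-coded here for
// brevity; a real classifier would map the error to the
// matching DlqErrorKind variant.
dlq.write(
DlqErrorKind::ValidationFailed,
e.to_string(),
attempt - 1, // retries made before this attempt
payload,
).await?;
return Ok(None); // operator continues to next event
}
RetryDisposition::Retry(e) if attempt >= policy.max_attempts => {
// Retry budget exhausted: also DLQ.
dlq.write(
DlqErrorKind::RetryBudgetExhausted,
e.to_string(),
attempt - 1, // retries made (attempts minus the first)
payload,
).await?;
return Ok(None);
}
RetryDisposition::Retry(_) => {
// Backoff bounded as in decorrelated jitter (M2 L3).
// Decorrelated jitter draws uniformly from
// [initial, 3 * prev_delay] and then caps; this listing omits
// the random draw and sleeps for the capped upper bound.
let upper = (prev_delay.as_millis() as u64).saturating_mul(3)
.max(policy.initial.as_millis() as u64);
let delay = Duration::from_millis(upper).min(policy.cap);
prev_delay = delay;
sleep(delay).await;
}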
}
}
}
Two things to notice. The Permanent arm and the RetryBudgetExhausted arm both DLQ but with different error_kind labels — the DLQ entry distinguishes "this is a permanent error type" from "this was transient but we couldn't make it work." Operations dashboards split on the label to prioritize different remediation patterns. The operator's Result<Option<T>> return makes "go to next event" explicit at the type level: Ok(None) means "this event was discarded or DLQ'd, move on"; Ok(Some(v)) means "process this value"; Err(e) means "operator-level error, propagate to supervisor." The structure makes the operator's hot loop easy to read: match op_result { Some(v) => process(v), None => continue }.
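For reference, decorrelated jitter as it is usually described draws each delay uniformly from [initial, 3 × previous delay] and then applies the cap. The sketch below shows that draw in isolation. It is not the M2 L3 wrapper: next_delay is a hypothetical helper, and XorShift64 is a throwaway generator used only to keep the example free of external crates, where a real wrapper would more likely use the rand crate.

```rust
use std::time::Duration;

/// Tiny xorshift64 generator, used only to keep this sketch dependency-free.
/// It is NOT the RNG the pipeline uses.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift state must be non-zero.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Decorrelated jitter: draw uniformly from [initial, 3 * prev], then cap.
pub fn next_delay(
    rng: &mut XorShift64,
    initial: Duration,
    cap: Duration,
    prev: Duration,
) -> Duration {
    let lo = initial.as_millis() as u64;
    let hi = (prev.as_millis() as u64).saturating_mul(3).max(lo);
    // Modulo draw; the slight bias is acceptable for backoff purposes.
    let span = (hi - lo).saturating_add(1);
    let ms = lo + rng.next_u64() % span;
    Duration::from_millis(ms).min(cap)
}
```

Each call feeds the previous result back in as prev, so the delays wander between the initial value and the cap instead of climbing in lockstep across operators.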
A Re-Processing Tool
The CLI tool that reads from the DLQ and re-injects events into the pipeline's input topic. It is a separate binary from the pipeline itself, designed for operational use after an underlying issue has been fixed.
use anyhow::Result;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::path::PathBuf;
use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, BufReader};
pub async fn reprocess(
dlq_path: PathBuf,
pipeline_input_topic: &str,
producer: &FutureProducer,
filter: ReprocessFilter,
) -> Result<ReprocessReport> {
let mut report = ReprocessReport::default();
let file = File::open(&dlq_path).await?;
let reader = BufReader::new(file);
let mut lines = reader.lines();
while let Some(line) = lines.next_line().await? {
let entry: DlqEntry = serde_json::from_str(&line)?;
if !filter.matches(&entry) {
report.skipped += 1;
continue;
}
// Push the original payload back into the pipeline's input.
// The pipeline's idempotency machinery from L2 absorbs any
// duplicates from prior partial processing.
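producer.send(
// No message key is set, so the key type is spelled out as ()
// because it cannot be inferred from the builder chain.
FutureRecord::<(), _>::to(pipeline_input_topic)
.payload(&entry.original_payload),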
std::time::Duration::from_secs(30),
).await
.map_err(|(e, _)| anyhow::anyhow!("send failed: {e}"))?;
report.reprocessed += 1;
}
Ok(report)
}
#[derive(Debug)]
pub struct ReprocessFilter {
pub error_kinds: Vec<DlqErrorKind>,
pub operators: Vec<String>,
pub since_unix_ms: Option<u64>,
pub until_unix_ms: Option<u64>,
}
impl ReprocessFilter {
pub fn matches(&self, entry: &DlqEntry) -> bool {
// Filter on error_kind, operator, time range. Empty filters
// match everything.
// DlqErrorKind does not derive PartialEq, so compare variants by
// discriminant; this keeps working when new variants are added.
let kind_ok = self.error_kinds.is_empty() ||
self.error_kinds.iter().any(|k| {
std::mem::discriminant(k) == std::mem::discriminant(&entry.error_kind)
});
let op_ok = self.operators.is_empty() || self.operators.contains(&entry.operator);
let since_ok = self.since_unix_ms.map_or(true, |s| entry.timestamp_unix_ms >= s);
let until_ok = self.until_unix_ms.map_or(true, |u| entry.timestamp_unix_ms <= u);
kind_ok && op_ok && since_ok && until_ok
}
}
#[derive(Debug, Default)]
pub struct ReprocessReport {
pub reprocessed: usize,
pub skipped: usize,
}
The filter API matters operationally. A typical re-processing run targets "all Deserialization errors from operator radar-ingest between 14:00 and 15:30 yesterday" — the time when the partner's wire format change rolled out, and the time it was reverted. The filter narrows the re-processing to the affected window, avoiding re-injecting unrelated DLQ entries that might now succeed and produce unwanted side effects. The L2 idempotency is the safety net that makes the re-injection correct under any duplicate; the filter is the operational discipline that limits re-injection to what is intended.
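Because the filter takes its time bounds as Unix milliseconds, an operator-facing window such as "14:00 to 15:30" has to be converted before the run. The helper below is a small sketch of that arithmetic; at_utc and window are hypothetical names, and the caller is assumed to already know the Unix timestamp of midnight UTC for the day in question.

```rust
/// Convert an (hour, minute) pair on a given UTC day into Unix milliseconds.
/// `day_start_unix_ms` is midnight UTC of the day in question.
pub fn at_utc(day_start_unix_ms: u64, hour: u64, minute: u64) -> u64 {
    day_start_unix_ms + (hour * 60 + minute) * 60 * 1000
}

/// Inclusive (since, until) bounds in the Option<u64> shape used by the
/// filter's `since_unix_ms` and `until_unix_ms` fields.
pub fn window(
    day_start_unix_ms: u64,
    from: (u64, u64),
    to: (u64, u64),
) -> (Option<u64>, Option<u64>) {
    (
        Some(at_utc(day_start_unix_ms, from.0, from.1)),
        Some(at_utc(day_start_unix_ms, to.0, to.1)),
    )
}
```

Both bounds are inclusive, matching the >= and <= comparisons in ReprocessFilter::matches, so an entry stamped exactly at the window edge is re-processed.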
Key Takeaways
- Three error categories: transient (retry with backoff), permanent (DLQ), discardable (drop with metric). The classification is the operator's responsibility; the default for unknown errors is permanent (DLQ) so they surface for operational attention.
- DLQ entries carry metadata: timestamp, operator, error_kind, error_message, retry_count, original_payload, schema_version. The schema_version field is the migration mechanism for future re-processing tools that span DLQ entries from different pipeline versions.
- Poison pills are the case the DLQ exists for. Without a DLQ, a poison pill blocks the pipeline; the operator retries forever and makes no progress. With a DLQ, the pill is quarantined and the pipeline stays healthy.
- The DLQ is also a re-processing source. After an underlying issue is fixed, a re-processing tool reads from the DLQ and re-injects events into the pipeline. The L2 idempotency machinery absorbs any duplicates from prior partial processing.
- The discard-bucket anti-pattern is what makes a DLQ useless. The operational disciplines that prevent it: alert on growth rate, weekly review cadence, bounded retention. A well-managed DLQ catches partner API changes within minutes; a discard bucket catches nothing.
Capstone Project — Exactly-Once Conjunction Alert Pipeline
Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Estimated effort: 1–2 weeks of focused work Prerequisites: All four lessons in this module passed at ≥70%
Mission Brief
OPS DIRECTIVE — SDA-2026-0207 / Phase 5 Implementation Classification: RESTART-SAFETY HARDENING
Two months ago, a maintenance window required restarting the SDA pipeline to apply a security patch. The orchestrator's graceful-drain logic worked correctly — every operator drained its incoming channel before exiting — but the alert subscriber had already received fourteen alerts that the new pipeline did not know about, and the new pipeline emitted six alerts that the subscriber had already acted on. Two false-positive collision-avoidance maneuvers were executed as a consequence. The postmortem identified two missing pieces: durable state on the producer side (so restart resumes from where the previous instance left off), and idempotent processing on the consumer side (so duplicate deliveries do not produce duplicate effects).
Phase 5 installs both. The windowed correlator from M3 becomes crash-safe via periodic checkpoints. The alert-emit path becomes idempotent end-to-end via the M5 L2 dedup machinery extended to the subscriber boundary. Permanent errors route to a DLQ with metadata sufficient for re-processing after underlying issues are fixed.
Success criteria for Phase 5: a deliberate kill -9 of the pipeline at three different points (mid-process, mid-checkpoint, mid-emit) followed by restart produces an alert log with every alert exactly once. The 30-second alert SLO is held throughout the test. The DLQ captures permanent errors with full metadata; the re-processing tool re-injects DLQ entries without producing duplicate alerts.
What You're Building
Make the M4 hardened pipeline crash-safe and exactly-once-effective.
- The windowed correlator from M3 becomes a CheckpointingOperator (L3 pattern): periodic checkpoints write its sliding-window state plus the consumer offset to local NVMe, async-replicated to S3
- The alert sink uses the L2 DedupSet keyed on alert_id with a 5-minute window and 100K capacity bound
- The Kafka clients are configured for at-least-once per L1: producer acks=all and enable.idempotence=true; consumer enable.auto.commit=false with process-then-commit
- The alert subscriber boundary stores recently-seen alert_ids in a small embedded SQLite (durable across the subscriber's own restarts)
- Operators classify errors per L4: transient → retry, permanent → DLQ, invariant-violation → discard
- A DLQ sink writes JSON-Lines to local disk with the L4 metadata schema
- A re-processing CLI tool (sda-reprocess) reads from the DLQ and re-injects events into the pipeline's input topic
The orchestrator from M2, the windowed correlator from M3, and the priority-aware shedding from M4 are all unchanged in structure. The new operators wrap or extend the existing pieces; the operator graph declaration grows by a few nodes (DLQ sink, alert subscriber boundary state).
Suggested Architecture
┌─────────────────────┐
│ Kafka input topic │
│ (consumer offset │
│ committed │
│ process-then- │
│ commit) │
└──────────┬──────────┘
│
▼
┌──────────────────────────────┐
│ Source operators (radar, │
│ optical, ISL) wrapped with │
│ retry + DLQ classifier │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Normalize fan-in (M3 L1) │
└──────────────┬───────────────┘
│ ──── credit channel from L2
▼
┌──────────────────────────────┐
│ Windowed Correlator │
│ (M3) wrapped as │
│ CheckpointingOperator │
│ - state: sliding windows │
│ - offset: consumer commit │
│ - cadence: 30s │
│ - storage: local + S3 │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Alert Sink (M5 L2 dedup + │
│ M3 L4 retract-aware) │
│ + 5-min DedupSet 100K bound │
└──────────────┬───────────────┘
│ alerts
▼
┌──────────────────────────────┐
│ Alert Subscriber Boundary │
│ (embedded SQLite seen-set) │
└──────────────────────────────┘
Side paths: Out-of-band:
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ DLQ │◄───│ Operator │ │ sda-reprocess│
│ Sink │ │ classifier│ │ CLI tool │
└──────────┘ └──────────┘ └──────┬───────┘
│
▼
re-inject into Kafka
input topic
Acceptance Criteria
Functional Requirements
- Kafka clients configured per L1: consumer enable.auto.commit=false; producer acks=all, enable.idempotence=true, max.in.flight.requests.per.connection=5 (idempotent producer makes higher in-flight safe)
- Process-then-commit ordering with explicit commit_message(..., CommitMode::Sync) after each batch
- Source-side internal log: every observation is durably recorded in a per-source append-only file with its consumer offset before being emitted to downstream
- Sink-side DedupSet (5-minute time bound, 100K capacity bound) on alert_id; duplicates absorbed silently
- CheckpointingOperator wrapping the windowed correlator: 30-second cadence, atomic temp-file + rename writes, local NVMe primary + S3 async replicate
- On restart: load latest local checkpoint if present; fall back to S3 if local is missing; fall back to fresh state if neither
- DLQ sink with the L4 schema (schema_version=1); per-error-kind classification by every operator
- sda-reprocess CLI tool with filter args (--error-kind, --operator, --since, --until) and a dry-run mode that prints what would be re-injected without sending
Quality Requirements
- Three crash tests in the integration test suite, each with kill -9 at a different point: (a) mid-process (between consumer recv and sink write), (b) mid-checkpoint (during the checkpoint flush), (c) mid-emit (after sink write but before commit). Each test asserts the post-restart alert log contains every alert exactly once.
- Checkpoint pause duration measured per snapshot; the histogram's P99 is below 200ms (the SLO budget). Performance test asserts this on representative load.
- DLQ schema versioned at write; the re-processing tool dispatches on schema_version and supports the historical versions documented in the codebase
- No .unwrap() or .expect() in non-startup code paths; all errors propagate to the operator's classifier
Operational Requirements
- /metrics extends M4's with: checkpoint_age_seconds (gauge per operator), checkpoint_size_bytes (gauge), checkpoint_pause_duration_ms (histogram), dlq_entries_total{operator, error_kind} (counter), recovery_path_total{path} (counter for local/remote/fresh on each startup)
- Alert when checkpoint_age_seconds > 2 × cadence (a stalled checkpoint indicates a problem)
- Alert when dlq_entries_total{error_kind="Deserialization"} rate > 10× baseline for >5 minutes (partner schema change canary)
- DLQ runbook: per-error-kind playbook documenting investigation steps and remediation patterns (entry, hypothesis, validation, fix, re-processing decision)
Self-Assessed Stretch Goals
- (self-assessed) Recovery time from a 100MB checkpoint is under 5 seconds end-to-end (load + deserialize + resume + first event emitted)
- (self-assessed) The re-processing tool handles 10K events without producing any new alerts (per the L4 lesson's exemplary "zero new effects" outcome). Demonstrated via a synthetic DLQ window from the integration tests.
- (self-assessed) The pipeline's restart-recovery test runs as a chaos-engineering integration test in CI, killing the pipeline at random points across 100 iterations; asserts every iteration ends with a consistent alert log
Hints
How do I serialize the windowed correlator's per-key sliding windows efficiently?
The natural representation is a BTreeMap<KeyType, VecDeque<Observation>>. bincode::serialize on this produces a compact binary format suitable for the checkpoint write. Pre-allocate the temp-file with a reasonable size hint to avoid re-allocation during the write. For the 30-second window at SDA's load (~10K observations/sec), the serialized state is in the tens of megabytes — well within the 200ms pause budget when written to NVMe.
#[derive(Serialize, Deserialize)]
struct CorrelatorState {
windows: BTreeMap<ObjectIdPair, VecDeque<Observation>>,
last_offset: u64,
}
let bytes = bincode::serialize(&state)?; // typically <50MB
fs::write(&tmp_path, bytes).await?;
fs::rename(&tmp_path, &final_path).await?;
The serialization can be parallelized for very large states: split the BTreeMap into chunks, serialize each chunk in parallel via rayon, and concatenate. For SDA's scale this is unnecessary; single-threaded bincode serialization of tens of megabytes should fit inside the pause budget, but measure it on representative state rather than assuming.
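The temp-file + rename write is only crash-safe across host failure if the bytes are flushed before the rename becomes visible. The sketch below shows that ordering with the blocking standard library; write_checkpoint_atomic is a hypothetical helper, and an async pipeline would either use the tokio equivalents or run this inside a blocking task.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `final_path` via a temp file in the same directory,
/// fsync the temp file, then rename it over the destination.
pub fn write_checkpoint_atomic(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp_path = final_path.with_extension("tmp");
    {
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(bytes)?;
        // Flush file contents to stable storage before the rename makes
        // the new checkpoint visible.
        tmp.sync_all()?;
    }
    fs::rename(&tmp_path, final_path)?;
    // On Unix, syncing the parent directory persists the rename itself.
    #[cfg(unix)]
    {
        let parent = final_path
            .parent()
            .filter(|dir| !dir.as_os_str().is_empty());
        if let Some(dir) = parent {
            File::open(dir)?.sync_all()?;
        }
    }
    Ok(())
}
```

The temp file lives next to the destination so that the rename stays on one filesystem; a rename across filesystems is not atomic.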
How do I inject crash points deterministically in tests?
The L1 test harness pattern with the CrashingSink is the foundation. Extend it: the operator's hot loop has a #[cfg(test)]-gated crash_after: Option<u32> field that panics after N successful events. The integration test sets crash_after = Some(N), runs the pipeline, asserts the panic was caught by the supervisor, then runs a second instance from the same Kafka topic and asserts the recovery completed correctly.
async fn process_event(&mut self, ev: Event) -> Result<()> {
// Test-only crash injection. Only this block is gated, so the
// function itself still exists in non-test builds.
#[cfg(test)]
{
if self.crash_after == Some(self.events_processed) {
panic!("test-injected crash after {} events", self.events_processed);
}
}
// ... actual processing
self.events_processed += 1;
Ok(())
}
Combine with tokio::time::pause() (M2 L4 pattern) so the test runs in fast-forward without real wall-clock delays. The whole crash-recover-verify cycle should run in under a second for CI suitability.
How do I version the DLQ schema with serde tagged unions?
The simplest pattern is to use serde's #[serde(tag = "schema_version")] with a sum-type enum that wraps each version's struct. The reader dispatches on the tag automatically:
#[derive(Serialize, Deserialize)]
#[serde(tag = "schema_version")]
enum DlqEntryAnyVersion {
#[serde(rename = "1")]
V1(DlqEntryV1),
#[serde(rename = "2")]
V2(DlqEntryV2),
// ... future versions
}
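The writer always emits the latest version. The reader's serde_json::from_str::<DlqEntryAnyVersion>(line) dispatches on the schema_version field in the JSON. One caveat: serde's internally tagged representation writes and matches the tag as a string, so this pattern expects "schema_version": "1" on disk, not the numeric 1 that a u32 field serializes to. Either write the version as a string, or read each line into a serde_json::Value first, inspect schema_version, and then deserialize into the matching struct. New versions add a variant; old code that doesn't know about the new version returns an error on deserialization, which can be handled gracefully (skip with metric).
The DlqEntryV2 struct should evolve backward-compatibly: keep field names where possible and add new fields as Option<T> for graceful upgrade. Production tooling should provide a migrate-old-to-new tool that reads V1 entries and writes V2 entries; the re-processing tool reads either.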
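The upgrade step of such a migration tool can be sketched as a From conversion. Both struct shapes below are hypothetical and abbreviated: the lesson does not define a V2 format, and source_partition is a field invented purely to show how a new optional field is filled during upgrade.

```rust
/// Hypothetical v1 entry shape (fields abbreviated for the sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct DlqEntryV1 {
    pub timestamp_unix_ms: u64,
    pub operator: String,
    pub error_message: String,
    pub original_payload: Vec<u8>,
}

/// Hypothetical v2 shape: same field names, plus one new optional field.
/// `source_partition` is invented for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct DlqEntryV2 {
    pub timestamp_unix_ms: u64,
    pub operator: String,
    pub error_message: String,
    pub original_payload: Vec<u8>,
    pub source_partition: Option<i32>,
}

impl From<DlqEntryV1> for DlqEntryV2 {
    fn from(v1: DlqEntryV1) -> Self {
        Self {
            timestamp_unix_ms: v1.timestamp_unix_ms,
            operator: v1.operator,
            error_message: v1.error_message,
            original_payload: v1.original_payload,
            // v1 never recorded this, so the upgrade leaves it unset.
            source_partition: None,
        }
    }
}
```

With the conversion in place, a reader can normalize every entry to the latest shape immediately after parsing, and the rest of the re-processing tool only ever handles one struct.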
How do I size the dedup window and capacity for the alert sink?
The window must exceed the maximum re-delivery window. For SDA's pipeline, the dominant re-delivery source is checkpoint replay during restart: a checkpoint at cadence 30s means the pipeline can replay up to 30s of events (the time between the latest checkpoint and the crash). Add a safety factor — 5 minutes is comfortable. The capacity bound is the safety valve for bursts; during the 10x burst test from M4, the sink's incoming rate peaks at ~50K alerts/sec briefly, so a 100K capacity covers a 2-second peak comfortably.
The numbers are operational: tune based on actual observed re-delivery rates and burst characteristics. Document the chosen values with a BurstProfile-style comment in the topology declaration.
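The two sizing checks are simple enough to encode as assertions next to the topology declaration. The helpers below are a sketch with hypothetical names; the inputs in the usage note are the figures quoted in this hint, not measured values.

```rust
use std::time::Duration;

/// Seconds of peak burst that a capacity bound absorbs before it starts
/// evicting entries.
pub fn burst_seconds_covered(capacity: u64, peak_alerts_per_sec: u64) -> f64 {
    capacity as f64 / peak_alerts_per_sec as f64
}

/// True when the dedup window is at least `safety_factor` times the
/// worst-case replay window (one checkpoint interval).
pub fn window_covers_replay(
    window: Duration,
    checkpoint_cadence: Duration,
    safety_factor: u32,
) -> bool {
    window >= checkpoint_cadence * safety_factor
}
```

With the hint's numbers, burst_seconds_covered(100_000, 50_000) gives 2.0 seconds, and a 5-minute window against a 30-second cadence holds for a safety factor of up to 10.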
How do I test the re-processor without polluting the live pipeline?
Add a --target-topic argument (for example, --target-topic test-mode-input) that points the tool at a test topic instead of production. The integration test uses this flag; production runs use the default (the live input topic). The test asserts events landed in the test topic (via a test-mode consumer) without affecting the production state.
For dry-run validation in production, the --dry-run flag has the tool print what it would re-inject without actually sending. Operations uses dry-run before any production re-injection to confirm the filter is targeting the right window.
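The two flags can be sketched with nothing but the standard library. This is an illustrative parser, not the sda-reprocess implementation: the struct name, the function name, and the idea of passing the default topic in as a parameter are assumptions.

```rust
/// Hypothetical parsed arguments for the re-processing tool.
#[derive(Debug, PartialEq)]
pub struct ReprocessArgs {
    pub target_topic: String,
    pub dry_run: bool,
}

/// Parse `--target-topic <name>` and `--dry-run`. `default_topic` stands in
/// for the live input topic, whose real name is not given here.
pub fn parse_args<I>(args: I, default_topic: &str) -> Result<ReprocessArgs, String>
where
    I: IntoIterator<Item = String>,
{
    let mut out = ReprocessArgs {
        target_topic: default_topic.to_string(),
        dry_run: false,
    };
    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--dry-run" => out.dry_run = true,
            "--target-topic" => {
                out.target_topic = it.next().ok_or("--target-topic requires a value")?;
            }
            other => return Err(format!("unknown argument: {other}")),
        }
    }
    Ok(out)
}
```

A caller would typically pass `std::env::args().skip(1)` and branch on `dry_run` at the single point where the tool would otherwise send, so the filter logic exercised by a dry run is exactly the logic a real run uses.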
Getting Started
Recommended order:
- Source-side internal log. A per-source append-only file that records every observation with its Kafka offset before emitting downstream. The recovery story works only when this is durable.
- Sink-side DedupSet on alert_id. L2's pattern; double-bound (time + count). Verify with a test that injects the same alert_id twice and asserts only one downstream emit.
- Kafka consumer reconfiguration. Switch from auto-commit to process-then-commit. Verify with the L1 crash test: kill between process and commit, restart, observe the redelivery.
- CheckpointingOperator wrapping the windowed correlator. L3's pattern; atomic temp-file + rename; 30-second cadence to start.
- Recovery routine on startup. Local-first, S3-second, fresh-third hierarchy.
- DLQ sink + per-operator classifier. L4's pattern; classify each error explicitly; route to DLQ with metadata.
- sda-reprocess CLI tool. Read from DLQ, filter, re-inject. Test against a synthetic DLQ.
- Three crash tests in the integration suite. Mid-process, mid-checkpoint, mid-emit. Assert every alert lands exactly once.
Aim for the first crash test passing by day 5 (consumer reconfig + checkpoint + recovery). The DLQ and re-processor land in the second week along with the operational runbook. The chaos-engineering stretch goal is an end-of-week-2 polish if time permits.
What This Module Sets Up
In Module 6 you will surface the new metrics — checkpoint_age, dlq_entries_total, recovery_path — as the operational dashboard's resilience panels. The runbook discipline you establish here (per-error-kind playbooks, re-processing protocols) becomes part of the on-call rotation's standard procedure. The audit script from M4 extends to verify every operator has a documented retry-disposition classifier and DLQ wiring.
The pipeline at the end of this module is correct under load (M4), correct across restart (M5), correct in event time (M3), and correctly orchestrated (M2). It produces output that downstream subscribers can trust to be exactly-once-effective. M6's work is making that correctness visible to operations — the dashboards, lineage, distributed tracing, and SLO monitoring that turn a correct pipeline into an operationally-legible one.
This is the module where the SDA Fusion Service crosses from "works under happy paths" to "works through real production failure modes." The patterns generalize beyond SDA to any streaming system that must survive restarts: at-least-once + idempotent + checkpointed + DLQ'd is the canonical streaming-pipeline reliability stack.
Module 1: Columnar Storage Foundations
Lesson 1 — Parquet File Layout
Module: Data Lakes — M01: Columnar Storage Foundations
Position: Lesson 1 of 3
Source: In-Memory Analytics with Apache Arrow — Matthew Topol, Chapter 3 ("Format and Memory Handling"); Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 4 ("Storage and Retrieval — Column-Oriented Storage"); Apache Parquet specification (github.com/apache/parquet-format).
Context
The legacy Artemis cold archive stored every downlinked frame as compressed JSONL — one JSON object per row, gzipped per file. Writing was easy. Reading was not. An analyst asking "what was the panel-2 voltage across mission 2024-Q3?" had to wait for the archive reader to decompress every file in that mission's directory, parse every JSON object, and discard ninety-nine percent of the fields it had just parsed. The query pattern that mattered most — one column out of forty — was the pattern the storage format was worst at.
The replacement is Parquet. Parquet is a columnar on-disk format: values from the same column live together physically, so a reader that only needs panel_voltage reads only the bytes that hold panel_voltage values. Topol (Ch. 3) calls this the central tradeoff of binary columnar formats — they sacrifice human-readability and append-friendliness to make columnar reads cheap. Kleppmann (Ch. 4) makes the same point at the conceptual level: column storage is the answer to the "read few columns out of many" workload that dominates analytical querying.
This lesson develops Parquet's physical layout end to end. The file's overall structure, the row group as the unit of parallelism and memory budget, the column chunk and page levels that live inside row groups, the footer that holds the metadata and why it lives at the end of the file, and the design constraints that the layout imposes on writers. Subsequent lessons in this module add per-column encoding (Lesson 2) and the in-memory Arrow representation that the writer feeds from and the reader produces (Lesson 3). The capstone wires those together into the Artemis archive writer.
Core Concepts
The File, End to End
A Parquet file is a binary container with a fixed-size envelope and a variable-size body. The envelope is the four-byte magic number PAR1 at the very start of the file and the same four-byte magic number at the very end. Between them, in order, are row groups (the body), the footer metadata (a Thrift-serialized structure), and the four-byte little-endian length of the footer immediately preceding the trailing magic number. A reader opening an unknown Parquet file performs a deterministic sequence: seek to eight bytes before EOF, read those eight bytes, verify the magic, decode the four-byte footer length, seek backward by that length plus the eight-byte trailer, read the footer, and now the reader knows where every column chunk in the file lives.
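The read sequence described above can be sketched with the standard library alone. This is an illustration of the trailer logic, not how the parquet crate implements it; the function name and error messages are assumptions, and the Thrift decoding of the footer itself is left out.

```rust
use std::io::{self, Read, Seek, SeekFrom};

const MAGIC: &[u8; 4] = b"PAR1";

/// Locate the footer of a Parquet-shaped byte source.
/// Returns (footer_offset, footer_len).
pub fn locate_footer<R: Read + Seek>(src: &mut R) -> io::Result<(u64, u32)> {
    let invalid = |msg: &'static str| io::Error::new(io::ErrorKind::InvalidData, msg);

    let file_len = src.seek(SeekFrom::End(0))?;
    // Smallest possible file: leading magic + footer length + trailing magic.
    if file_len < 12 {
        return Err(invalid("file too short to be Parquet"));
    }
    // Seek to eight bytes before EOF and read the trailer.
    src.seek(SeekFrom::End(-8))?;
    let mut trailer = [0u8; 8];
    src.read_exact(&mut trailer)?;
    // Verify the trailing magic number.
    if trailer[4..] != MAGIC[..] {
        return Err(invalid("trailing magic is not PAR1"));
    }
    // Decode the four-byte little-endian footer length.
    let footer_len = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    // The footer starts footer_len + 8 bytes before EOF, and cannot overlap
    // the four-byte leading magic.
    let footer_offset = file_len
        .checked_sub(8 + u64::from(footer_len))
        .filter(|off| *off >= 4)
        .ok_or_else(|| invalid("footer length exceeds file size"))?;
    Ok((footer_offset, footer_len))
}
```

Because the function is generic over `Read + Seek`, it works on a `File` or on an in-memory `Cursor`, which is convenient for testing truncated and corrupt inputs.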
The trailer-driven layout has one operational consequence the engineer must internalize: a Parquet file cannot be read until it has been fully written. The footer holds the offset of every column chunk in the file; until the writer has emitted the footer and the trailing length, the reader has no way to find anything. A JSONL file can be appended to and read while it is still being written; a Parquet file cannot. That incompatibility with append-only streaming writes is the central reason why open table formats (Iceberg, Delta) exist as a layer above Parquet. We cover that in Module 2.
Row Groups: The Unit of Parallelism and Memory Budget
A row group is a horizontal slice of the table: some number of consecutive rows, with all of their column values laid out columnarly inside the slice. A Parquet file contains one or more row groups, written sequentially. Each row group is self-contained — its column chunks live entirely inside the row group's byte range, and the footer records the row group's offset and total byte length.
Row groups are the unit of three things at once. They are the unit of parallelism: a reader can hand each row group to a separate worker thread without coordination, because no row group's data depends on any other row group's data. They are the unit of the writer's memory budget: the writer must buffer an entire row group in memory before flushing it to disk, because the per-column statistics (min, max, null count) and encoding decisions (dictionary fits or doesn't) depend on seeing every value in the row group. They are also the unit of column statistics: the footer's per-column-chunk min/max values are computed over a row group, so partition pruning at query time operates at row-group granularity, not row granularity.
The row group size is therefore a four-way tradeoff among compression ratio, writer memory, statistics granularity, and parallelism. Bigger row groups produce better compression (more values to find patterns across) and fewer, larger sequential reads, but they require more writer memory, carry coarser statistics (wider min/max ranges, so more false-positive row groups during pruning), and give a reader fewer units to parallelize across within a file. Topol (Ch. 3) reports that Parquet defaults to row groups in the 64 MB to 1 GB range; Artemis archive workloads use 128 MB row groups, which balances writer memory pressure against the analyst query patterns we see.
Column Chunks and Pages
Inside a row group, each column's values live in a column chunk: a contiguous byte range holding every value of that column for the row group's rows. A row group with forty columns has forty column chunks. The column chunks are written sequentially within the row group's byte range — column 0's chunk, then column 1's chunk, then column 2's chunk, and so on. The footer records each column chunk's byte offset within the file and its total byte length.
A column chunk is further subdivided into pages. A page is the smallest unit the reader decompresses and decodes: to get at any value it must process the whole page that contains it. There are three page types: data pages (the encoded column values themselves), dictionary pages (the dictionary for dictionary-encoded columns; we cover encoding in Lesson 2), and index pages (rare in practice; newer versions of the spec add a column index and offset index). The default page size is 1 MB, but this is configurable and pages are not required to be uniform within a column chunk: a writer typically targets a page size and emits a page whenever the encoder has produced enough bytes.
The three-level structure (row groups, column chunks within row groups, pages within column chunks) is what makes Parquet's read path efficient. Consider a query that reads one column from one row group of a hundred-column file with twenty row groups. Within a row group it reads one column chunk out of a hundred, skipping ninety-nine percent of the chunks, and it reads only one row group of twenty. Assuming column chunks of roughly equal size, the reader touches 1/100 × 1/20 = 0.05% of the file's total bytes. That is the speedup over JSONL that the Artemis migration captures.
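The arithmetic is worth writing down once, because it is easy to misplace a decimal. The helper below is a back-of-the-envelope sketch that assumes every column chunk is the same size, which real files never quite satisfy; the function name is hypothetical.

```rust
/// Fraction of a file's bytes touched when reading `cols_read` of
/// `total_cols` columns from `rgs_read` of `total_rgs` row groups,
/// assuming uniformly sized column chunks.
pub fn fraction_of_file_read(
    cols_read: u32,
    total_cols: u32,
    rgs_read: u32,
    total_rgs: u32,
) -> f64 {
    let column_fraction = f64::from(cols_read) / f64::from(total_cols);
    let row_group_fraction = f64::from(rgs_read) / f64::from(total_rgs);
    column_fraction * row_group_fraction
}
```

For one column of a hundred and one row group of twenty this is 0.01 × 0.05 = 0.0005, that is, 0.05% of the file. In practice wide string columns and narrow flag columns skew the real figure in either direction.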
The Footer
The footer is a Thrift-serialized FileMetaData structure containing the file's schema (column names, types, and nullability), the row group descriptors (count of rows, total byte size, and a column chunk descriptor for each column), and file-level metadata (created-by string, key-value metadata, column sort orders). Each column chunk descriptor inside a row group descriptor records the column's encodings, compression codec, byte offset within the file, total compressed byte size, total uncompressed byte size, count of values, and the column statistics (min value, max value, null count, distinct count if cheap to compute).
The footer-at-end layout is what makes the schema, the row group offsets, and the per-chunk statistics available to the reader before any data pages are read. Critically, the column statistics are what let the query engine prune row groups without reading their data: a query for panel_voltage > 28.5 reads the footer, finds that row group 7's panel_voltage chunk has max = 27.4, and skips the entire row group. This is pruning at the row-group level, and it is the property that the table-format layer in Module 2 will lift up to the file level. The pruning power of the footer is inversely related to the size of the row group: small row groups have narrow min/max ranges that prune more aggressively; large row groups have wider ranges that prune less aggressively. The 128 MB row group target for Artemis is calibrated to keep pruning useful for the analyst query patterns.
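The pruning decision itself is a comparison against the per-chunk min and max. The sketch below uses a hypothetical ChunkStats struct in place of the footer's real statistics type and handles only a `column > threshold` predicate, so treat it as an illustration of the rule rather than a query planner.

```rust
/// Hypothetical per-row-group statistics for one Float64 column.
#[derive(Debug, Clone, Copy)]
pub struct ChunkStats {
    pub min: f64,
    pub max: f64,
}

/// Indices of the row groups that might contain a value greater than
/// `threshold`. A row group whose max is at or below the threshold cannot
/// match, so its data pages never need to be read.
pub fn row_groups_to_read(stats: &[ChunkStats], threshold: f64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max > threshold)
        .map(|(idx, _)| idx)
        .collect()
}
```

Pruning is conservative: a surviving row group may still contain no matching rows (a false positive), but a pruned row group is guaranteed to contain none. Narrower min/max ranges mean fewer false positives.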
Picking a Row Group Size
The row group size decision is the writer's most consequential lever. A default is a starting point rather than an answer: the Parquet 2.x default of 128 MB is reasonable for general analytics, but the right size depends on the actual workload shape. Three factors drive the choice.
Writer memory budget. The writer holds an entire row group in memory while encoding. A 128 MB row group with forty columns averages 3.2 MB per column chunk in memory; a 1 GB row group is 25 MB per column chunk. For the Artemis writer, which runs on the ground-segment ingestion node alongside other services, 128 MB is the upper bound that keeps the writer's resident set under 2 GB total.
Query pattern and pruning granularity. Smaller row groups produce tighter column statistics, which prune more aggressively. If the typical analyst query selects on mission_id or orbit_pass, and the partition layout (Module 3) does not already isolate these, then smaller row groups buy more pruning per file. If queries scan large ranges, the pruning value is lower and the row group can be larger.
Parallelism shape. A reader parallelizes across row groups. A file with one row group is read by one worker; a file with thirty-two is read by up to thirty-two workers. For Artemis files, which top out at 4 GB on disk, 128 MB row groups produce roughly thirty-two row groups per file — well-matched to typical query-engine worker pools.
The Artemis archive's standard is 128 MB row groups for downlinked telemetry. Production code documents the choice in the writer's config and revisits it whenever query patterns or worker pool sizes change materially.
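The sizing figures quoted in this section follow from two divisions. The helpers below are a sketch for checking such numbers when the inputs change; the function names are hypothetical, and the per-chunk figure is an average that ignores how unevenly real columns are sized.

```rust
pub const MIB: u64 = 1024 * 1024;

/// Average in-memory bytes per column chunk for a given row group size.
pub fn avg_chunk_bytes(row_group_bytes: u64, columns: u64) -> u64 {
    row_group_bytes / columns
}

/// Number of row groups needed to hold a file of `file_bytes`
/// (rounded up, since the last row group may be partial).
pub fn row_groups_per_file(file_bytes: u64, row_group_bytes: u64) -> u64 {
    (file_bytes + row_group_bytes - 1) / row_group_bytes
}
```

With 128 MB row groups and forty columns the average chunk is about 3.2 MB, and a 4 GB file holds thirty-two row groups, matching the figures above.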
Code Examples
Reading the Footer
Before processing any data, a Parquet reader fetches the file's footer and works out which byte ranges to read. The parquet crate exposes this through SerializedFileReader, which performs the footer read internally and gives access to the parsed metadata. The example below opens a file from the Artemis archive, parses the footer, and inspects what is in it.
use std::fs::File;
use anyhow::{Context, Result};
use parquet::file::metadata::ParquetMetaData;
use parquet::file::reader::{FileReader, SerializedFileReader};
fn inspect_parquet(path: &str) -> Result<()> {
    let file = File::open(path)
        .with_context(|| format!("opening {}", path))?;
    // Constructing the reader performs the trailer and footer read.
    let reader = SerializedFileReader::new(file)
        .context("parsing footer; file may be truncated or corrupt")?;
    let metadata: &ParquetMetaData = reader.metadata();
    let file_meta = metadata.file_metadata();
    println!("schema: {}", file_meta.schema_descr().root_schema().name());
    println!("num_rows: {}", file_meta.num_rows());
    println!("num_row_groups: {}", metadata.num_row_groups());
    println!("created_by: {:?}", file_meta.created_by());
    // Iterate the row group descriptors. Each one exposes the byte offsets
    // and statistics that let the reader prune without touching data pages.
    for (rg_idx, rg) in metadata.row_groups().iter().enumerate() {
        println!(
            "  rg {:>3}: rows={:>8} total_bytes={:>10}",
            rg_idx, rg.num_rows(), rg.total_byte_size()
        );
        for (col_idx, col) in rg.columns().iter().enumerate() {
            // byte_range() is the (start offset, length) of the column chunk
            // within the file; the reader uses it to issue a single bounded
            // read. It is preferred over file_offset(), which many writers
            // no longer populate reliably.
            let (offset, size) = col.byte_range();
            // Columns written without statistics are skipped here.
            if let Some(stats) = col.statistics() {
                println!(
                    "    col {:>2} ({}): offset={} size={} nulls={:?}",
                    col_idx,
                    col.column_path(),
                    offset,
                    size,
                    stats.null_count_opt(),
                );
            }
        }
    }
    Ok(())
}
The pattern to notice is that nothing here touches a data page. The footer parse gives the reader every byte offset, every compressed size, every per-chunk statistic — enough to plan exactly which byte ranges of the file to read for a given query. For a query like "give me panel_voltage from row group 7", the planner has the file offset (where to seek) and the compressed size (how many bytes to read) for that one column chunk. The one wrinkle in production code is that opening a File and reading the footer is a synchronous, blocking operation; the Artemis reader uses parquet::arrow::async_reader to do the equivalent over object_store for files in object storage, which is what the cold archive actually uses. The synchronous version here is for clarity.
Selectively Reading One Column
The footer told the reader where the column chunks live. The actual read pulls only the column chunks the query needs, decodes their pages, and emits values. The parquet::arrow integration produces Arrow record batches — covered in detail in Lesson 3 — but the projection mechanism is worth seeing in isolation here, because it is what turns the footer's per-chunk offsets into actual I/O savings.
use std::fs::File;
use anyhow::Result;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;
/// Read only the `panel_voltage` column from an Artemis Parquet file.
/// Demonstrates the projection-driven read path: column chunks for unselected
/// columns are never read from disk.
fn read_panel_voltage(path: &str) -> Result<Vec<f64>> {
    let file = File::open(path)?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    // Resolve the column name to its index in the Arrow schema. Production
    // code looks the column up by name rather than hard-coding indices.
    // For a flat schema (assumed here) the top-level field index equals the
    // Parquet leaf index that ProjectionMask::leaves expects; with nested
    // columns the two diverge and the leaf index must be resolved separately.
    let schema = builder.schema().clone();
    let panel_voltage_idx = schema
        .index_of("panel_voltage")
        .map_err(|e| anyhow::anyhow!("missing panel_voltage column: {e}"))?;
    let mask = ProjectionMask::leaves(
        builder.parquet_schema(),
        std::iter::once(panel_voltage_idx),
    );
    let reader = builder
        .with_projection(mask)
        .with_batch_size(8192)
        .build()?;
    let mut out = Vec::new();
    for batch in reader {
        let batch = batch?;
        // The projected batch has exactly one column; extract it as f64.
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<arrow::array::Float64Array>()
            .ok_or_else(|| anyhow::anyhow!("panel_voltage is not Float64"))?;
        // `flatten` drops null entries, so the output holds only present values.
        out.extend(col.iter().flatten());
    }
    Ok(out)
}
What to notice. The ProjectionMask::leaves call is what makes the read efficient: the builder consults the footer, identifies the column chunks that hold panel_voltage across all row groups, and issues reads for only those byte ranges. The other thirty-nine column chunks per row group are not touched. batch_size controls how many rows are decoded per emitted Arrow batch; 8192 is a common choice, and for a single Float64 column it yields 64 KB of decoded values per batch, which fits comfortably in L2 cache. Under load, two failure modes are worth knowing. First, if panel_voltage is dictionary-encoded (Lesson 2) and the dictionary page is large, the reader pays the dictionary-decode cost once per row group; this is typically negligible, but worth checking on hot columns with very large dictionaries. Second, if the file was written with no statistics on panel_voltage, predicate pushdown for downstream filters cannot prune row groups, and every row group's column chunk is read. The Artemis writer always enables statistics on numeric columns for this reason.
Writing With a Target Row Group Size
The writer side is where the row-group-size decision becomes a concrete number in code. The ArrowWriter accepts WriterProperties that bound the row group size. In the parquet crate that bound, set_max_row_group_size, is a row count rather than a byte count, so the 128 MB target has to be converted using an estimate of the encoded row width; once the writer has buffered that many rows, it flushes the row group and starts a new one.
use std::fs::File;
use std::sync::Arc;
use anyhow::Result;
use arrow::array::RecordBatch;
use arrow::datatypes::Schema;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::{EnabledStatistics, WriterProperties};
/// Byte target for one row group in the Artemis archive.
const TARGET_ROW_GROUP_BYTES: usize = 128 * 1024 * 1024;
/// Estimated average encoded bytes per row. This value is an assumption for
/// illustration; measure it on real telemetry before relying on it.
const EST_ROW_BYTES: usize = 256;
/// Build a Parquet writer configured for the Artemis archive: roughly 128 MB
/// row groups, ZSTD level 3, page-level statistics on every column.
fn artemis_writer(
    path: &str,
    schema: Arc<Schema>,
) -> Result<ArrowWriter<File>> {
    let file = File::create(path)?;
    let props = WriterProperties::builder()
        // set_max_row_group_size takes a maximum number of ROWS, not bytes.
        // Convert the byte target using the estimated row width; the actual
        // row group size is approximate because the estimate is approximate.
        .set_max_row_group_size(TARGET_ROW_GROUP_BYTES / EST_ROW_BYTES)
        // ZSTD at level 3 is the Artemis default: substantially better
        // ratios than SNAPPY on telemetry data with no meaningful CPU cost
        // at the read side. The level is set explicitly because
        // ZstdLevel::default() is not guaranteed to be 3.
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
        // Statistics are what enable row-group pruning downstream. Disabling
        // them on a column saves a small amount of write time and costs a
        // large amount of read-time pruning power; rarely the right tradeoff.
        .set_statistics_enabled(EnabledStatistics::Page)
        .build();
    let writer = ArrowWriter::try_new(file, schema, Some(props))?;
    Ok(writer)
}
/// Drain an iterator of record batches to a Parquet file, respecting the
/// configured row group size. The writer flushes row groups on its own as
/// the buffered row count reaches the configured maximum; the caller does
/// not need to manage row group boundaries explicitly.
fn write_batches<I>(
    path: &str,
    schema: Arc<Schema>,
    batches: I,
) -> Result<u64>
where
    I: IntoIterator<Item = Result<RecordBatch>>,
{
    let mut writer = artemis_writer(path, schema)?;
    let mut total_rows: u64 = 0;
    for batch in batches {
        let batch = batch?;
        total_rows += batch.num_rows() as u64;
        writer.write(&batch)?;
    }
    // `close` is what emits the footer and the trailing magic number. Until
    // this returns, the file is unreadable; see "The Footer" above.
    writer.close()?;
    Ok(total_rows)
}
The close() call is the critical line — it is what writes the footer and the trailing magic number that make the file readable. If the process crashes before close() returns, the partial file is unreadable; any recovery has to discard it and re-emit. The Artemis ingestion pipeline handles this by writing to a .parquet.inprogress filename and renaming to .parquet only after close() succeeds, which gives the readers a simple "if it exists with the final name, it is complete" invariant. The pattern generalizes: writers to object storage use the equivalent (S3 multipart complete, GCS compose, Azure commit-block-list) to make the visibility atomic with the file's validity.
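The stage-then-rename pattern can be sketched with the standard library. The closure below stands in for the ArrowWriter write and close() sequence; the function names are hypothetical, and the sketch assumes a local filesystem where rename within one directory is atomic.

```rust
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};

/// Staging name for a final path: the same path with ".inprogress" appended.
pub fn staging_path(final_path: &Path) -> PathBuf {
    let mut name = final_path.as_os_str().to_owned();
    name.push(".inprogress");
    PathBuf::from(name)
}

/// Write via `write_body` to the staging name, then rename to the final name.
/// Readers that only look at final names never observe a partial file.
pub fn write_then_publish<F>(final_path: &Path, write_body: F) -> io::Result<()>
where
    F: FnOnce(&mut File) -> io::Result<()>,
{
    let staging = staging_path(final_path);
    let mut file = File::create(&staging)?;
    // Stands in for writing every batch and calling close() on the writer.
    // On error we return early and the final name is never created.
    write_body(&mut file)?;
    // Flush file contents to disk before making the file visible.
    file.sync_all()?;
    drop(file);
    fs::rename(&staging, final_path)
}
```

A failed write leaves the .inprogress file behind; this sketch does not clean it up, so a startup sweep that deletes stale staging files is a reasonable companion. Fully durable publication on some filesystems also requires syncing the parent directory, which is omitted here.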
Key Takeaways
- A Parquet file has a fixed structure: PAR1 magic, row groups, Thrift-serialized footer, four-byte footer length, PAR1 magic. The reader works backward from EOF: read the trailer, decode the footer length, seek backward, read the footer, then plan its data reads.
- Row groups are the unit of parallelism, of writer memory, and of column statistics. Picking a row group size is a four-way tradeoff between writer memory budget, parallelism shape, statistics granularity, and compression ratio. Artemis defaults to 128 MB.
- Inside a row group, each column lives in a contiguous column chunk, and inside the column chunk values live in pages. The three-level hierarchy is what lets a reader fetch one column out of forty by reading bytes corresponding to one chunk per row group.
- The footer-at-end layout means a Parquet file is unreadable until its writer has closed it. This is the structural reason streaming writes need a layer above Parquet (Iceberg, Delta; see Module 2), not raw Parquet.
- Predicate pushdown operates at row group granularity using the per-chunk statistics in the footer. Always enable statistics on numeric columns; disabling them is rarely the right tradeoff.
- The Parquet write side's close() call is what makes the file valid. Production writers stage to a temporary filename and rename on success to give readers a clean "exists ⇒ complete" invariant.
Lesson 2 — Columnar Encodings
Module: Data Lakes — M01: Columnar Storage Foundations
Position: Lesson 2 of 3
Source: In-Memory Analytics with Apache Arrow — Matthew Topol, Chapter 1 ("Getting Started — Dictionary-Encoded and Run-End-Encoded Arrays") and Chapter 3 ("Long-Term Storage Formats"); Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 4 ("Column Compression"); Apache Parquet specification (github.com/apache/parquet-format/blob/master/Encodings.md).
Context
Lesson 1's writer flushed 128 MB row groups out to disk, one column chunk per column. What lives inside a column chunk is not the raw values — it is encoded bytes. Parquet's compression ratio is the product of two passes: an encoding pass that exploits the column's structure (repeated values, sortedness, small dynamic range), and a compression pass that hits the residual entropy with a general-purpose codec (ZSTD, SNAPPY, LZ4). For the Artemis archive the encoding pass is where most of the size savings live. The same telemetry data that was 8 GB as gzipped JSONL is 800 MB as Parquet with default encodings — and 480 MB with per-column encodings chosen deliberately. The difference between the default and the deliberate is one engineer-day of effort, and it pays back forever in storage cost and read I/O.
Encoding choice is also where the writer's per-column understanding of the data matters most. The dictionary encoding that wins on payload_id (8 distinct values across a row group) loses on sample_timestamp (no repeats; every value distinct). The delta encoding that wins on sample_timestamp (monotonically increasing, small differences) loses on panel_voltage (random float, no useful delta structure). The byte-stream-split encoding that wins on panel_voltage is a no-op on payload_id. There is no single best encoding — there are right answers per column, and the writer's job is to know which is which.
This lesson develops the encoding pipeline end to end. The encoding-then-compression order and why it is in that order. The four encodings that cover ninety-five percent of real workloads — dictionary, RLE, delta, byte-stream-split — and the access patterns each is built for. The interaction between encoding choice and Parquet's dictionary fallback behavior. And the practical question the writer faces on every column: what encoding to pick, and how to tell when the default was wrong.
Core Concepts
The Encoding–Compression Pipeline
A Parquet column chunk is produced in two passes. The encoding pass takes the column's values and emits a byte stream that exploits structure in those values — repeated values become indices into a dictionary, sorted integers become deltas from a base value, small integers become bit-packed. The compression pass takes the encoded byte stream and applies a general-purpose codec (ZSTD, SNAPPY, LZ4, GZIP) to squeeze the residual entropy. Encoding then compression — never the reverse. Compressing first would destroy the structure the encoder relies on; the encoder's win is over raw values, not over already-compressed bytes.
The order matters operationally. The encoder is the lever the writer controls most directly per column. The compressor is usually chosen once for the whole file. ZSTD-3 is the Artemis default because it gives compression ratios within five percent of GZIP at one-fifth the CPU cost, and substantially better ratios than SNAPPY at modest CPU cost. The compressor matters; the encoder matters more.
A second important property: decoding is a streaming, per-page operation. The reader does not decompress an entire column chunk in one go. It decompresses one page at a time, decodes its values, emits them to the consumer, and discards the decoded buffer. This is what keeps the reader's memory footprint bounded regardless of column chunk size.
Dictionary Encoding
Dictionary encoding is the encoding to know first because it is the encoding Parquet picks by default for most columns and the one that produces the largest single ratio improvement on real data.
The mechanic is direct. The encoder builds a dictionary of the distinct values seen in the column chunk (or row group; the level varies by Parquet version), assigns each distinct value an integer index, and writes the column's data pages as a sequence of integer indices into the dictionary. The dictionary itself lives in a separate dictionary page at the start of the column chunk. A column with eight distinct values across two million rows compresses to two million small integers plus eight values' worth of dictionary — roughly two orders of magnitude smaller than the raw values, before the compressor even runs.
The encoder's index stream is itself encoded: typically as bit-packed RLE (covered next), because the indices are by construction small integers. Topol (Ch. 1) makes the same observation about Arrow's DictionaryArray type — dictionary encoding is "an optional property on any array" and is the obvious choice "in the case of an array with a lot of repeated values."
Parquet's dictionary encoding has a built-in fallback rule the writer must understand. The encoder maintains a dictionary size cap (default 1 MB). If the dictionary fills during encoding, the writer falls back to plain encoding for the remainder of the column chunk. This means a column that is dictionary-friendly for the first million rows but suddenly explodes into high cardinality will produce a mixed-encoding chunk that is no better than plain. The Artemis writer monitors per-column dictionary-fallback events and surfaces them as a structured log; persistent fallbacks on a column are the signal that the column should switch to a different encoding or that the dictionary cap should be raised.
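The mechanic and the fallback rule can be sketched together. This is a simplified illustration, not the parquet crate's encoder: the Encoded enum and function name are hypothetical, the cap is measured in raw value bytes, and on overflow the sketch plain-encodes the whole chunk, whereas a real writer keeps the pages already written with the dictionary and plain-encodes only the remainder.

```rust
use std::collections::HashMap;

/// Hypothetical result of encoding one column chunk of strings.
#[derive(Debug, PartialEq)]
pub enum Encoded {
    Dictionary { dict: Vec<String>, indices: Vec<u32> },
    Plain(Vec<String>),
}

/// Dictionary-encode `values`, falling back to plain encoding if the
/// dictionary would exceed `max_dict_bytes`.
pub fn dictionary_encode(values: &[&str], max_dict_bytes: usize) -> Encoded {
    let mut dict: Vec<String> = Vec::new();
    let mut index_of: HashMap<&str, u32> = HashMap::new();
    let mut indices: Vec<u32> = Vec::with_capacity(values.len());
    let mut dict_bytes = 0usize;

    for &v in values {
        // Already in the dictionary: emit its index.
        if let Some(&idx) = index_of.get(v) {
            indices.push(idx);
            continue;
        }
        // New distinct value: check the size cap before admitting it.
        dict_bytes += v.len();
        if dict_bytes > max_dict_bytes {
            return Encoded::Plain(values.iter().map(|s| s.to_string()).collect());
        }
        let idx = dict.len() as u32;
        dict.push(v.to_string());
        index_of.insert(v, idx);
        indices.push(idx);
    }
    Encoded::Dictionary { dict, indices }
}
```

The indices vector is what the RLE and bit-packing stage covered next would then encode. Counting how often a column returns the Plain variant is the sketch-level analogue of the dictionary-fallback metric described above.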
The columns that dictionary-encode well, in the Artemis schema: mission_id (~40 distinct), payload_id (~8 distinct), sensor_kind (~12 distinct), quality_flag (~5 distinct). Together these four columns are sixty percent of the row width and produce ninety percent of dictionary's benefit.
Run-Length Encoding and Bit-Packing
Run-length encoding and bit-packing are a pair, and Parquet uses them together in a single hybrid encoding. The specification lists it as RLE (the run-length / bit-packing hybrid); this lesson calls it RLE_BITPACK. The hybrid is what encodes the integer indices produced by dictionary encoding, the definition and repetition levels used to represent nullability and nested data, and any small-integer column the writer specifies it for.
RLE on its own encodes runs of identical values as (run_length, value) pairs. A column with [7, 7, 7, 7, 7, 7, 7, 7] becomes (8, 7) — two integers instead of eight. RLE wins on data with long runs of the same value: dictionary indices on a sorted column, boolean flags that are mostly false, dictionary indices when the dictionary is small and adjacent rows tend to share values.
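A minimal sketch of plain run-length encoding, using (run_length, value) pairs as in the text. The function names are hypothetical, and Parquet's actual hybrid uses a more compact byte layout than a vector of pairs.

```rust
/// Encode a sequence as (run_length, value) pairs.
pub fn rle_encode(values: &[u32]) -> Vec<(u32, u32)> {
    let mut runs: Vec<(u32, u32)> = Vec::new();
    for &v in values {
        // Extend the current run if the value repeats.
        if let Some(last) = runs.last_mut() {
            if last.1 == v {
                last.0 += 1;
                continue;
            }
        }
        // Otherwise start a new run of length one.
        runs.push((1, v));
    }
    runs
}

/// Expand (run_length, value) pairs back into the original sequence.
pub fn rle_decode(runs: &[(u32, u32)]) -> Vec<u32> {
    runs.iter()
        .flat_map(|&(len, val)| std::iter::repeat(val).take(len as usize))
        .collect()
}
```

On input with no runs, every value becomes its own pair and the output is twice the size of the input, which is exactly why the hybrid falls back to bit-packing when runs are absent.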
Bit-packing encodes a sequence of small integers by storing each one in fewer bits than its nominal width. A u32 column whose values all fit in five bits is bit-packed into five-bit slots: twenty-seven bits saved per value, a 6.4x reduction. Bit-packing wins on small-dynamic-range integers: dictionary indices when the dictionary is dense and adjacent rows are unlike (no runs to exploit), short-count columns, and any small enum's underlying integer representation.
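A minimal sketch of bit-packing at a fixed width, packing least-significant bit first. The function names are hypothetical and the bit-at-a-time loop is written for clarity rather than speed; production encoders pack whole groups of values with word-level operations.

```rust
/// Pack each value into `width` bits, least-significant bit first.
/// Every value must fit in `width` bits.
pub fn bit_pack(values: &[u32], width: u32) -> Vec<u8> {
    assert!((1..=32).contains(&width), "width must be 1..=32");
    let mut out = vec![0u8; (values.len() * width as usize + 7) / 8];
    let mut bit = 0usize; // next free bit position in `out`
    for &v in values {
        debug_assert!(width == 32 || v < (1u32 << width));
        for i in 0..width {
            if (v >> i) & 1 == 1 {
                out[bit / 8] |= 1 << (bit % 8);
            }
            bit += 1;
        }
    }
    out
}

/// Inverse of `bit_pack`: read `count` values of `width` bits each.
pub fn bit_unpack(bytes: &[u8], width: u32, count: usize) -> Vec<u32> {
    let mut bit = 0usize;
    (0..count)
        .map(|_| {
            let mut v = 0u32;
            for i in 0..width {
                if (bytes[bit / 8] >> (bit % 8)) & 1 == 1 {
                    v |= 1 << i;
                }
                bit += 1;
            }
            v
        })
        .collect()
}
```

Eight five-bit values occupy forty bits, so they pack into five bytes instead of the thirty-two bytes that eight u32 values take unpacked.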
The hybrid switches between the two based on which is locally cheaper, in eight-value or thirty-two-value batches depending on the encoding variant. Kleppmann (Ch. 4) makes the same point about bitmap encoding's adaptive hybrid in the data warehouse world: "techniques such as roaring bitmaps switch between the two bitmap representations, using whichever is the most compact." Parquet's RLE_BITPACK is the same idea applied to integer streams.
The takeaway for the writer: RLE_BITPACK is not a knob to flip on or off. It is the encoding Parquet uses for dictionary indices and for the definition/repetition level streams that every nullable or nested column carries. The writer's lever is whether the values go through dictionary first (and then RLE_BITPACK on the indices) or go through some other encoding directly.
Delta Encoding
Delta encoding is the right answer for monotonically advancing integer columns. The most common cases in telemetry: sample_timestamp (monotonic per source, never repeats, small inter-sample gaps), sequence_number (strictly increasing per stream), tle_epoch (sortable timestamp). For these columns dictionary encoding is useless (there are no repeated values) and plain encoding gives no reduction before the compressor runs. Delta encoding does.
The mechanic: the encoder stores a base value (the first value) and a sequence of deltas (value[i] - value[i-1]). The deltas are typically tiny — for a 100 Hz sensor stream, the delta between consecutive samples is exactly 10 ms in nanoseconds, which is a constant. Parquet's DELTA_BINARY_PACKED encoding stores the deltas using a frame-of-reference plus bit-packing scheme: deltas are stored relative to a frame minimum, which keeps the per-delta bit width small even when the absolute delta values are large.
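The frame-of-reference idea can be sketched on a single block. This is a simplification of DELTA_BINARY_PACKED, which in the specification works in blocks and miniblocks with its own header format; the struct and function names here are hypothetical, and the residuals are kept as a plain vector rather than actually bit-packed.

```rust
/// Hypothetical single-block delta encoding of an i64 column.
#[derive(Debug, PartialEq)]
pub struct DeltaBlock {
    pub first: i64,          // base value
    pub min_delta: i64,      // frame of reference for the deltas
    pub bit_width: u32,      // bits needed per residual
    pub residuals: Vec<u64>, // delta - min_delta, always non-negative
}

pub fn delta_encode(values: &[i64]) -> Option<DeltaBlock> {
    let (&first, rest) = values.split_first()?;
    let mut prev = first;
    let deltas: Vec<i64> = rest
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect();
    let min_delta = deltas.iter().copied().min().unwrap_or(0);
    let residuals: Vec<u64> = deltas
        .iter()
        .map(|&d| d.wrapping_sub(min_delta) as u64)
        .collect();
    let max_residual = residuals.iter().copied().max().unwrap_or(0);
    // Number of bits needed to represent the largest residual.
    let bit_width = 64 - max_residual.leading_zeros();
    Some(DeltaBlock { first, min_delta, bit_width, residuals })
}

pub fn delta_decode(block: &DeltaBlock) -> Vec<i64> {
    let mut out = vec![block.first];
    let mut prev = block.first;
    for &r in &block.residuals {
        prev = prev.wrapping_add(block.min_delta).wrapping_add(r as i64);
        out.push(prev);
    }
    out
}
```

For a perfectly regular 100 Hz stream every delta equals the minimum, every residual is zero, and the bit width is zero. A few nanoseconds of jitter raises the width to a handful of bits, still far below the sixty-four bits of the raw timestamp.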
The wins are dramatic for the right columns. An 8-byte nanosecond timestamp column with 10 ms gaps between samples compresses to roughly 1 byte per value before the general-purpose compressor runs. After ZSTD-3 the same column is often under 0.5 bytes per value. Dictionary encoding on the same column produces zero benefit because every value is distinct.
Parquet also offers DELTA_LENGTH_BYTE_ARRAY and DELTA_BYTE_ARRAY for string columns. The former encodes the lengths of strings as deltas (useful when strings have similar lengths) while keeping the string bytes plain; the latter encodes lengths and shared prefixes (incremental encoding — store only the bytes that differ from the previous string). DELTA_BYTE_ARRAY is the right answer for sorted string columns like file paths, URLs, or hierarchical identifiers — a column of object-storage paths with shared prefixes can compress to 5–10% of its plain size.
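The incremental (prefix) encoding behind DELTA_BYTE_ARRAY can be sketched in a few lines of std-only Rust. This is an illustration under assumptions: PrefixEntry, prefix_encode, and prefix_decode are hypothetical names, the example paths are made up, and the real encoding stores the prefix lengths and suffixes as separate encoded streams rather than as a vector of structs.

```rust
/// One entry of a prefix-encoded column: how many leading bytes are
/// shared with the previous string, plus the bytes that differ.
#[derive(Debug, PartialEq)]
struct PrefixEntry {
    shared_prefix_len: usize,
    suffix: Vec<u8>,
}

/// Encode each string relative to the one before it.
fn prefix_encode(values: &[&str]) -> Vec<PrefixEntry> {
    let mut prev: &[u8] = &[];
    values
        .iter()
        .map(|&s| {
            let cur = s.as_bytes();
            // Count the leading bytes shared with the previous string.
            let shared = prev.iter().zip(cur).take_while(|(a, b)| a == b).count();
            prev = cur;
            PrefixEntry {
                shared_prefix_len: shared,
                suffix: cur[shared..].to_vec(),
            }
        })
        .collect()
}

/// Rebuild the strings by keeping the shared prefix of the previous
/// value and appending the stored suffix.
fn prefix_decode(entries: &[PrefixEntry]) -> Vec<String> {
    let mut prev: Vec<u8> = Vec::new();
    entries
        .iter()
        .map(|e| {
            prev.truncate(e.shared_prefix_len);
            prev.extend_from_slice(&e.suffix);
            String::from_utf8_lossy(&prev).into_owned()
        })
        .collect()
}
```

On sorted object paths most entries carry only a short suffix, which is where the size reduction comes from. On unsorted input the shared prefixes shrink toward zero and the encoding degenerates to plain bytes plus overhead.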
Byte Stream Split
Byte stream split is the encoding floating-point columns need. The problem: dictionary encoding fails on floats (every value is distinct), delta encoding fails on floats (the deltas are also floats, and float-to-float subtraction does not produce compressible structure), and plain encoding of a Float64 column is exactly eight bytes per value with no win available.
The byte-stream-split trick: take a Float64 value's eight bytes and split them across eight parallel streams — all the first bytes together, all the second bytes together, and so on. The high-order bytes of float values that share a magnitude (typical of physical measurements) cluster together; the resulting streams are highly compressible because adjacent values' corresponding bytes are often equal or nearly so. The encoded byte count is the same as plain, but the structure that emerges through the splitting compresses substantially better under the general-purpose codec that runs next.
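The split itself is a pure byte transposition, which a short std-only sketch makes concrete. The function names are hypothetical and this is not the parquet crate's code; it assumes the little-endian byte order that plain-encoded Parquet values use, so the last stream holds the sign and high exponent bits.

```rust
/// Scatter the eight bytes of every f64 into eight streams: stream k
/// holds byte k of every value. The output is the streams concatenated,
/// so it is exactly as long as the plain encoding.
fn byte_stream_split(values: &[f64]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 8];
    for (i, v) in values.iter().enumerate() {
        for (k, b) in v.to_le_bytes().iter().enumerate() {
            out[k * n + i] = *b;
        }
    }
    out
}

/// Gather the streams back into f64 values.
fn byte_stream_join(encoded: &[u8]) -> Vec<f64> {
    let n = encoded.len() / 8;
    (0..n)
        .map(|i| {
            let mut bytes = [0u8; 8];
            for (k, slot) in bytes.iter_mut().enumerate() {
                *slot = encoded[k * n + i];
            }
            f64::from_le_bytes(bytes)
        })
        .collect()
}
```

For voltages clustered around 28 V, the two highest-order streams come out as runs of a single repeated byte, which is the structure the general-purpose codec then exploits. The low-order mantissa streams stay noisy and compress poorly.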
For the Artemis schema, byte-stream-split is the right encoding on panel_voltage, temperature_c, pressure_kpa, and every other physical-measurement float column. The compression-ratio improvement over plain-encoded float, after ZSTD, is typically two to three times.
Picking an Encoding per Column
The writer's per-column decision is a short table of rules, derived from the column's value distribution and access pattern. The Artemis writer's per-column encoding map looks like this:
| Column kind | Example | Encoding | Why |
|---|---|---|---|
| Low-cardinality categorical | mission_id, payload_id, sensor_kind | Dictionary (default) | Repeats are the entire structure of the column. |
| Boolean or small enum | quality_flag, is_eclipsed | Dictionary + RLE_BITPACK on indices | Trivially small dictionary; long runs in time-ordered data. |
| Monotonic integer | sample_timestamp_ns, sequence_number | DELTA_BINARY_PACKED | Inter-sample deltas are tiny and regular. |
| Sorted string | object_path, archive_uri | DELTA_BYTE_ARRAY | Shared prefixes are the structure. |
| Physical-measurement float | panel_voltage, temperature_c | BYTE_STREAM_SPLIT | High-order-byte clustering after the split. |
| High-cardinality string with no structure | event_uuid, request_id | PLAIN | Nothing to exploit; just compress the bytes. |
The defaults the writer is not overriding — dictionary for categorical, dictionary fallback to plain for high-cardinality — handle the common cases. The overrides are for timestamps, sorted strings, and floats. Three rules of thumb, three explicit settings in the writer, and the writer is producing files that compress 50–80% better than defaults on Artemis data.
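The table can be read as a small lookup function. The sketch below is std-only and hypothetical: ColumnKind, EncodingChoice, choose_encoding, and explicit are illustrative names rather than the Artemis config module, and the encodings are carried as strings so the example does not depend on the parquet crate.

```rust
/// Column kinds, one per row of the encoding table (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnKind {
    LowCardinalityCategorical,
    BooleanOrSmallEnum,
    MonotonicInteger,
    SortedString,
    MeasurementFloat,
    UnstructuredString,
}

/// What the writer should configure for one column.
#[derive(Debug, PartialEq, Eq)]
struct EncodingChoice {
    /// Explicit encoding override, or None to keep the writer default.
    override_encoding: Option<&'static str>,
    /// Whether dictionary encoding stays enabled for the column.
    dictionary: bool,
}

/// Map a column kind to the table's encoding decision.
fn choose_encoding(kind: ColumnKind) -> EncodingChoice {
    use ColumnKind::*;
    match kind {
        // The defaults already do the right thing for these rows.
        LowCardinalityCategorical | BooleanOrSmallEnum => EncodingChoice {
            override_encoding: None,
            dictionary: true,
        },
        MonotonicInteger => explicit("DELTA_BINARY_PACKED"),
        SortedString => explicit("DELTA_BYTE_ARRAY"),
        MeasurementFloat => explicit("BYTE_STREAM_SPLIT"),
        UnstructuredString => explicit("PLAIN"),
    }
}

/// An explicit override always switches dictionary off for the column.
fn explicit(encoding: &'static str) -> EncodingChoice {
    EncodingChoice {
        override_encoding: Some(encoding),
        dictionary: false,
    }
}
```

Keeping the decision in one function makes the result of an encoding audit easy to apply: a column that should change encoding becomes a one-line change to the match.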
The rule of thumb the writer must never forget: measure. The right encoding for the data you assume you have may be the wrong encoding for the data you actually have. Sample a row group, write it with three candidate encodings, and pick by measured compressed size. The Artemis ingestion pipeline runs a weekly encoding-audit job that does exactly this on the latest week of downlinks and surfaces any column whose chosen encoding is no longer optimal.
Code Examples
Inspecting the Encodings a File Uses
Before tuning the writer, the engineer needs to know what encodings the current files use. The footer records the encodings per column chunk. The parquet crate exposes this through the column chunk metadata.
use std::fs::File;
use anyhow::Result;
use parquet::basic::Encoding;
use parquet::file::reader::{FileReader, SerializedFileReader};
/// Print the encoding(s) used by every column chunk in every row group.
/// A column chunk lists the encodings it actually uses — typically the
/// data-page encoding plus the dictionary-page encoding if dictionary
/// encoding was selected.
fn inspect_encodings(path: &str) -> Result<()> {
let file = File::open(path)?;
let reader = SerializedFileReader::new(file)?;
let metadata = reader.metadata();
for (rg_idx, rg) in metadata.row_groups().iter().enumerate() {
println!("row group {}:", rg_idx);
for col in rg.columns() {
let encodings: Vec<&Encoding> = col.encodings().iter().collect();
let dict_fallback = encodings.contains(&&Encoding::PLAIN)
&& (encodings.contains(&&Encoding::PLAIN_DICTIONARY)
|| encodings.contains(&&Encoding::RLE_DICTIONARY));
println!(
" {:<24} compressed={:>10} encodings={:?}{}",
col.column_path().string(),
col.compressed_size(),
encodings,
if dict_fallback { " [DICT-FALLBACK]" } else { "" },
);
}
}
Ok(())
}
The pattern to notice: a column chunk whose encodings() list contains both a dictionary encoding (PLAIN_DICTIONARY or RLE_DICTIONARY) and PLAIN is a candidate for dictionary fallback, meaning the chunk started dictionary-encoding and switched to plain partway through because the dictionary hit its size cap. Treat the flag as a hint rather than proof: some writers also record PLAIN for the dictionary page itself, so a chunk that never fell back can show the same pair, and the per-page encoding statistics in the footer (where the writer records them) are the way to confirm. Flagging these chunks is still the cheapest possible writer-side observability. If a column persistently falls back, either its cardinality is higher than the writer assumes (raise the cap or switch encoding) or the column has changed shape upstream. The Artemis writer emits a dict_fallback{column=$col} Prometheus counter from this path so the encoding-audit job can detect drift.
Setting Per-Column Encodings in the Writer
The WriterProperties builder accepts per-column-path overrides via set_column_encoding. Use this for the columns whose ideal encoding the writer knows in advance, and leave the default (PLAIN_DICTIONARY with fallback to PLAIN) for everything else.
use std::sync::Arc;
use parquet::basic::{Compression, Encoding};
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;
/// Build the Artemis writer properties with per-column encodings. The
/// columns listed here are the ones where the writer's understanding of
/// the data beats the default; every other column gets dictionary with
/// fallback to plain, which is correct for low-cardinality categoricals.
fn artemis_writer_properties() -> WriterProperties {
WriterProperties::builder()
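// set_max_row_group_size takes a row count, not a byte count. The
// 1M-row value is a placeholder; derive the real count from the
// schema's measured row width so row groups land near 128 MB.
.set_max_row_group_size(1024 * 1024)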
.set_compression(Compression::ZSTD(Default::default()))
// Monotonic timestamps — DELTA_BINARY_PACKED. The dictionary
// default fails because every timestamp is distinct.
.set_column_encoding(
ColumnPath::new(vec!["sample_timestamp_ns".to_string()]),
Encoding::DELTA_BINARY_PACKED,
)
.set_column_dictionary_enabled(
ColumnPath::new(vec!["sample_timestamp_ns".to_string()]),
false, // disable dictionary explicitly; delta is the encoding.
)
// Sorted strings with shared prefixes — DELTA_BYTE_ARRAY.
.set_column_encoding(
ColumnPath::new(vec!["archive_uri".to_string()]),
Encoding::DELTA_BYTE_ARRAY,
)
.set_column_dictionary_enabled(
ColumnPath::new(vec!["archive_uri".to_string()]),
false,
)
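// Physical-measurement floats: BYTE_STREAM_SPLIT. Dictionary is
// disabled explicitly; otherwise the writer tries dictionary first
// and uses BYTE_STREAM_SPLIT only as the fallback.
.set_column_encoding(
ColumnPath::new(vec!["panel_voltage".to_string()]),
Encoding::BYTE_STREAM_SPLIT,
)
.set_column_dictionary_enabled(
ColumnPath::new(vec!["panel_voltage".to_string()]),
false,
)
.set_column_encoding(
ColumnPath::new(vec!["temperature_c".to_string()]),
Encoding::BYTE_STREAM_SPLIT,
)
.set_column_dictionary_enabled(
ColumnPath::new(vec!["temperature_c".to_string()]),
false,
)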
// Everything else: dictionary (default) with statistics for
// row-group-level predicate pushdown downstream.
.set_statistics_enabled(
parquet::file::properties::EnabledStatistics::Page,
)
.build()
}
Two subtleties. First, set_column_encoding does not automatically disable dictionary encoding for that column — the writer would otherwise try dictionary first and only fall back to the configured encoding if the dictionary failed. The explicit set_column_dictionary_enabled(_, false) is required to skip the dictionary attempt entirely on columns where it would be wasted. Second, the column path is a Vec<String> to handle nested schemas — top-level columns are single-element paths, fields inside structs are multi-element paths. The Artemis writer's encoding map lives in a separate config module and is applied programmatically from the column metadata, not hardcoded as in the example here.
Measuring Encoding Choices on Real Data
The encoding-audit job is what closes the loop. It samples one row group of recent data, writes it with each candidate encoding for each column, and reports the compressed size per choice. The function below sketches the per-column part of the audit: write the same RecordBatch with two different encodings and report the byte count.
use std::sync::Arc;
use anyhow::Result;
use arrow::array::RecordBatch;
use arrow::datatypes::Schema;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, Encoding};
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;
/// Write a single record batch with the given encoding for the named
/// column and return the resulting file size in bytes. Used by the
/// encoding-audit job to compare candidate encodings on real data.
fn measure_encoding(
batch: &RecordBatch,
column_name: &str,
encoding: Encoding,
use_dictionary: bool,
) -> Result<u64> {
let path = format!("/tmp/audit-{column_name}-{encoding:?}.parquet");
let file = std::fs::File::create(&path)?;
let column_path = ColumnPath::new(vec![column_name.to_string()]);
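let mut builder = WriterProperties::builder()
.set_compression(Compression::ZSTD(Default::default()))
.set_column_dictionary_enabled(column_path.clone(), use_dictionary);
// Dictionary encoding is selected through the enabled flag above.
// set_column_encoding panics when handed a dictionary encoding, so
// only the non-dictionary candidates are passed to it.
if !matches!(encoding, Encoding::PLAIN_DICTIONARY | Encoding::RLE_DICTIONARY) {
builder = builder.set_column_encoding(column_path, encoding);
}
let props = builder.build();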
let mut writer = ArrowWriter::try_new(
file,
batch.schema(),
Some(props),
)?;
writer.write(batch)?;
writer.close()?;
let size = std::fs::metadata(&path)?.len();
std::fs::remove_file(&path)?;
Ok(size)
}
/// Run the audit for a column: try the candidate encodings and return
/// the one with the smallest compressed output.
fn audit_column(batch: &RecordBatch, column_name: &str) -> Result<Encoding> {
let candidates: &[(Encoding, bool)] = &[
(Encoding::PLAIN, false), // baseline
(Encoding::RLE_DICTIONARY, true), // dictionary attempt
(Encoding::DELTA_BINARY_PACKED, false), // monotonic-integer attempt
(Encoding::BYTE_STREAM_SPLIT, false), // float-clustering attempt
];
let mut best: Option<(Encoding, u64)> = None;
for &(encoding, use_dict) in candidates {
// Skip candidates that are not type-applicable; the writer would
// error otherwise. Production code checks the column's Arrow type
// against the encoding's compatibility table here.
match measure_encoding(batch, column_name, encoding, use_dict) {
Ok(size) => {
match best {
None => best = Some((encoding, size)),
Some((_, prev)) if size < prev => best = Some((encoding, size)),
_ => {}
}
}
Err(_) => continue, // not applicable to this column's type
}
}
Ok(best.expect("at least PLAIN should succeed").0)
}
The interesting part of this code is what is missing from the example. A real audit job runs the candidate encodings concurrently rather than sequentially (each write is independent), filters candidates by type compatibility before attempting them (BYTE_STREAM_SPLIT on a string column is an error, not a graceful no-op), and measures over a representative sample of row groups across the data corpus rather than one batch. The shape of the function is the right shape; the production version is a parallel job that writes its output to a Prometheus gauge and runs weekly. The discipline the audit enforces is the most important thing — measured encoding choices, not assumed encoding choices.
Key Takeaways
- A Parquet column chunk is produced by encoding then compression, in that order. The encoder exploits structure (repeats, sortedness, dynamic range); the compressor hits the residual entropy. Encoding choice is where most of the size savings live; ZSTD-3 over SNAPPY is the second-largest lever.
- Dictionary encoding is the default for most columns and the right answer for low-cardinality categoricals. Watch for dictionary-fallback events: a column listed with both RLE_DICTIONARY and PLAIN encodings has fallen back partway through and is no longer benefiting from dictionary.
- Monotonic integer columns (timestamps, sequence numbers) want DELTA_BINARY_PACKED, not dictionary. Dictionary on a column with no repeats is wasted work. Disable dictionary explicitly when overriding the encoding.
- Physical-measurement float columns want BYTE_STREAM_SPLIT. The trick exploits high-order-byte clustering across values of similar magnitude; the win over plain-encoded floats is typically 2–3x after ZSTD.
- Sorted string columns with shared prefixes want DELTA_BYTE_ARRAY. Object paths, hierarchical identifiers, and URLs are the typical winners.
- Measure, don't assume. Run an encoding-audit job over real data periodically. The right encoding shifts as data shape evolves; the writer's encoding map needs to evolve with it.
Lesson 3 — Apache Arrow and the In-Memory Side
Module: Data Lakes — M01: Columnar Storage Foundations
Position: Lesson 3 of 3
Source: In-Memory Analytics with Apache Arrow — Matthew Topol, Chapter 1 ("Getting Started with Apache Arrow"), Chapter 2 ("Working with Key Arrow Specifications"), and Chapter 3 ("Memory Mapping"); Apache Arrow specification (arrow.apache.org/docs/format/Columnar.html).
Context
Parquet is the on-disk format. Arrow is the in-memory format. They are different formats, designed for different goals, and the engineer who treats them as the same thing produces code that is either slow on reads or wrong on writes. Parquet optimizes for small on disk and read-once into memory: encoded, compressed, footer-at-end, page-decoded as you go. Arrow optimizes for random access in memory and zero-copy across process boundaries: uncompressed, fully materialized, structured as buffers the CPU can scan directly. The Artemis archive uses both. Cold-archive files are Parquet; the working sets analysts hold in memory are Arrow; the hot caches between cold storage and the compute layer are Arrow IPC files, memory-mapped from a local SSD tier.
Understanding the boundary between them is what makes the read path work. When a query engine reads panel_voltage from a Parquet file, the work is: locate the column chunk (Lesson 1), decompress its pages (page-by-page, streaming), decode the values out of their encoding (dictionary indices → values, deltas → values, byte-split → values; Lesson 2), and materialize them as an Arrow array in memory. That last step is the Parquet-to-Arrow boundary. The cost of that boundary is the cost the query engine pays per read; the Arrow format is what the cost is paid into, and what makes every subsequent operation on the data cheap.
This lesson develops the Arrow side end-to-end. Arrow's columnar in-memory layout and how it differs from Parquet's compressed-on-disk layout. The record batch as the unit of in-memory work, and the chunked array that represents a column across multiple record batches. The Parquet-Arrow boundary and what crossing it costs. And the IPC format that lets two processes share Arrow data without copying — the basis of the hot-cache tier between the Artemis cold archive and the analyst-facing compute layer.
Core Concepts
Arrow vs Parquet: Two Different Goals
The two formats answer different questions. Topol (Ch. 3) frames the distinction directly: Parquet is a "long-term storage format" optimized for size on disk and column-projected reads; Arrow is an "in-memory representation" optimized for random-access analytical processing. Three concrete differences capture the design split.
Compression. Parquet column chunks are encoded and compressed; the bytes on disk are not the bytes the CPU operates on. Arrow buffers are uncompressed and laid out for direct CPU access; the bytes in memory are the bytes the CPU's vector instructions consume. Decompressing on read is the cost; getting a memory layout the CPU loves is the benefit.
Random access. Parquet's encodings — delta, RLE, dictionary — break O(1) random access to individual values. Finding the value at row 12,345 of a delta-encoded column requires decoding from a frame-of-reference boundary, not seeking by index. Arrow's columnar layout preserves O(1) random access by row index for every primitive type: the i-th value of a Float64Array lives at byte offset i * 8 in the value buffer, full stop. The Arrow layout is the reason a compute kernel can vectorize over a column without conditionals or decoders in the inner loop.
Mutability and shareability. Arrow buffers are designed to be shared across processes and language runtimes without copying. The buffer layout is language-agnostic and stable across versions of the spec; a Python process, a Rust process, and a C++ process can all hold pointers to the same memory and treat it as the same array. Parquet does not have this property — Parquet is a file format, not a memory format, and inter-process sharing of Parquet bytes still requires each process to do its own decode.
The operational consequence: the Artemis read path is Parquet-on-cold-storage → Arrow-in-memory at the boundary, and Arrow-in-shared-memory → Arrow-in-memory for the hot path. Each leg of the pipeline uses the format suited to its constraints.
The Arrow Memory Layout
An Arrow array is a small number of contiguous buffers plus a small amount of metadata. The buffers for a primitive type are two: a validity bitmap (one bit per row, 1 = valid, 0 = null) and a value buffer (the raw values laid out end to end). For a Float64Array of length 8192, the validity bitmap is 1024 bytes (8192 / 8) and the value buffer is 65,536 bytes (8192 × 8). The reader of this array does pointer arithmetic on the value buffer; nullability is a separate, dense, branch-friendly check against the bitmap.
Variable-length types (strings, binary, lists) add a third buffer: an offset buffer of i32 or i64 values that record where each row's value starts in the value buffer. A StringArray of length 8192 has a 1024-byte validity bitmap, an offset buffer of 8193 × 4 bytes (one entry per row plus a terminator), and a value buffer holding the concatenated UTF-8 bytes of every string. Reading row 47's string is value_buffer[offsets[47]..offsets[48]]: two offset reads and a slice, no parsing.
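The buffer arithmetic can be checked with a std-only sketch that models the buffers as plain slices. This is an illustration, not the arrow crate's API: is_valid, f64_at, str_at, bitmap_bytes, and offset_bytes_i32 are hypothetical helpers, and the sketch assumes Arrow's least-significant-bit-first bitmap ordering and i32 offsets.

```rust
/// Validity check: bit i of the bitmap, least-significant bit first.
fn is_valid(validity: &[u8], i: usize) -> bool {
    (validity[i / 8] & (1 << (i % 8))) != 0
}

/// Fixed-width read: the i-th value sits at index i of the value buffer.
fn f64_at(values: &[f64], validity: &[u8], i: usize) -> Option<f64> {
    is_valid(validity, i).then(|| values[i])
}

/// Variable-width read: two offset reads bound the slice of the value
/// buffer that holds row i's bytes.
fn str_at<'a>(offsets: &[i32], data: &'a [u8], validity: &[u8], i: usize) -> Option<&'a str> {
    if !is_valid(validity, i) {
        return None;
    }
    let (start, end) = (offsets[i] as usize, offsets[i + 1] as usize);
    std::str::from_utf8(&data[start..end]).ok()
}

/// Bytes needed for a validity bitmap covering `len` rows.
fn bitmap_bytes(len: usize) -> usize {
    (len + 7) / 8
}

/// Bytes needed for an i32 offset buffer covering `len` rows
/// (one entry per row plus the terminator).
fn offset_bytes_i32(len: usize) -> usize {
    (len + 1) * 4
}
```

A null row still occupies an offsets entry; its start and end offsets are equal, so the value buffer holds no bytes for it.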
Nested types — structs, lists of structs, maps — compose the same primitives. A struct array is metadata pointing at child arrays of the field types; a list array is an offset buffer pointing at a child values array; a dictionary array is an index array pointing at a dictionary values array. Topol (Ch. 1) walks the layout for each type. The composition rule is the same for all of them: every Arrow array, however complex its type, decomposes into a small number of flat, contiguous buffers that the CPU can scan directly.
The buffer-based layout is what makes Arrow's claim to "zero-copy" credible. Sharing an Arrow array across a process boundary is sharing the pointers and lengths of its buffers — no serialization, no deserialization, no encoding decode. Crossing a process boundary is a few hundred bytes of metadata regardless of the array's data size.
Record Batches and Chunked Arrays
A record batch is a group of Arrow arrays, each the same length, with named fields — Arrow's equivalent of a chunk of rows from a table. A record batch of 8192 rows with forty columns is forty Arrow arrays of length 8192 plus a small schema descriptor. The record batch is Arrow's unit of in-memory work: compute kernels operate on a record batch's columns directly, and downstream operators consume and emit record batches.
The size of a record batch is the analog of Parquet's row group size, but the constraint is different. Parquet row groups are sized for compression ratio and parallelism (Lesson 1, 128 MB target). Arrow record batches are sized for cache friendliness — small enough that a single column's values fit comfortably in L2 cache while the kernel runs. 8192 rows is the conventional default and the right answer for most kernels; the Artemis compute layer uses 8192 across the board.
A chunked array is what you get when a column spans multiple record batches — a logical "the values of panel_voltage across this entire query result." Topol (Ch. 3) explains the relevant connection: "Since Parquet files can be split into multiple row groups, we can avoid copying data by using a chunked array to treat the collection of one or more discrete arrays as a single contiguous array." A reader that pulls one Parquet row group at a time and produces one record batch per row group naturally produces a chunked array per column. Compute kernels handle this directly; a kernel that operates on a ChunkedArray iterates the chunks in turn and produces a chunked result. No concatenation, no copy.
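A std-only sketch of the idea, assuming a hypothetical ChunkedF64 type rather than the arrow crate's own chunked-array support. It shows the two operations that matter: resolving a logical row index to a chunk and an offset, and running a kernel chunk by chunk without concatenating.

```rust
use std::sync::Arc;

/// A column that spans several record batches: a list of chunks treated
/// as one logical array without concatenating them.
struct ChunkedF64 {
    chunks: Vec<Arc<[f64]>>,
}

impl ChunkedF64 {
    /// Total logical length across all chunks.
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }

    /// Resolve a logical row index to a chunk and an offset, then read.
    fn get(&self, mut index: usize) -> Option<f64> {
        for chunk in &self.chunks {
            if index < chunk.len() {
                return Some(chunk[index]);
            }
            index -= chunk.len();
        }
        None
    }

    /// A kernel over a chunked column visits the chunks in turn; no
    /// chunk is copied or merged with its neighbours.
    fn sum(&self) -> f64 {
        self.chunks.iter().map(|c| c.iter().sum::<f64>()).sum()
    }
}
```

Random access by logical index costs a walk over the chunk list in this sketch, so chunked columns suit scans better than point lookups unless the chunk boundaries are indexed.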
The Parquet ↔ Arrow Boundary
The Parquet-to-Arrow boundary is where the encoded, compressed bytes on disk become the uncompressed, random-access buffers in memory. The cost of crossing it is the per-page work: decompress (ZSTD/SNAPPY) and decode (dictionary → values, delta → values, byte-stream-split → values). For a typical Artemis row group (128 MB on disk, 600 MB uncompressed) the boundary cost is roughly 100 ms of CPU on a modern core. The trade to keep in mind: that 100 ms produces an in-memory representation that subsequent kernels operate on at memory-bandwidth speeds, billions of values per second.
The other direction — Arrow to Parquet — is what the writer does. Arrow record batches go in, encoded and compressed bytes come out, plus a footer at the end. The boundary cost on the write side is the encoding-then-compression cost from Lesson 2.
The boundary has a few practical implications the engineer must know.
Schema mapping is not free. Arrow's type system and Parquet's type system overlap but are not identical. Parquet has logical types that map to Arrow types via a conversion table, and the conversion is sometimes lossy. Decimal precision, timestamp time zones, and nested null handling are the typical edge cases. Production code pins the schema explicitly on both sides rather than relying on inference.
Reading is streaming. The parquet::arrow reader (ParquetRecordBatchReader, built through ParquetRecordBatchReaderBuilder) does not decode an entire file into one giant Arrow table. It produces a stream of record batches, each corresponding to one Parquet row group or a configurable batch-size slice of it. Memory stays bounded; the consumer processes batches as they arrive and discards them.
Writing buffers a row group. The ArrowWriter accumulates incoming record batches until it has enough rows to flush a row group, then encodes-and-compresses the column chunks for the row group and writes them. During that accumulation, the writer holds the row group's worth of data uncompressed in memory — Lesson 1's writer-memory budget constraint comes from this.
IPC and Zero-Copy via Memory Mapping
Arrow IPC ("Inter-Process Communication") is the wire format and file format Arrow uses to share record batches between processes. The format is a sequence of FlatBuffers-framed messages: a schema message first, then one or more record-batch messages, optionally with dictionary-batch messages for shared dictionaries. The wire format is the streaming version; the file format is the same wire format with a magic-number envelope and a footer that records the byte offsets of every record batch in the file, enabling random access to any batch without reading the others.
The point of IPC is what Topol (Ch. 3) calls "no deserialization cost": the bytes on disk in an Arrow IPC file are the same bytes the CPU operates on in memory. There is no parse step, no decompression step, no decode step. A reader that memory-maps an Arrow IPC file gets pointers directly into the kernel's page cache; the values are addressable without any work beyond the page faults that bring them into RAM.
The Artemis hot cache is built on this property. The 50 GB working sets analysts query interactively are stored as Arrow IPC files on the local SSD tier (NVMe-backed). When an analyst's query touches a working set, the cache server memory-maps the relevant IPC files and hands the record batch pointers to the query engine. No copy, no decode, no allocation beyond a few hundred bytes of metadata. The query engine operates on the mmap'd pages; the kernel pages in only the bytes the kernel touches. Topol's memory-mapping discussion in Ch. 3 makes this concrete: a 1.6 GB file accessed via memory mapping reads "only the pages from the file that it needs when the corresponding virtual memory locations are accessed," with no allocation up front.
The tradeoff: Arrow IPC is much larger on disk than Parquet because it is not encoded or compressed. The same data that is 800 MB as a well-encoded Parquet file is 6 GB as an Arrow IPC file. The hot-cache tier is sized accordingly — it holds a working set, not the cold archive. The cold archive is Parquet; the IPC files are derived caches built from materialized Parquet reads.
Code Examples
Building an Arrow Record Batch From Rust Data
Most production code receives record batches from the Parquet reader rather than constructing them by hand, but the buffer-level construction is worth seeing once. It is what the writer pipeline produces as its last step before handing the batch to the Parquet writer.
use std::sync::Arc;
use anyhow::Result;
use arrow::array::{Float64Array, RecordBatch, TimestampNanosecondArray, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
/// Build a record batch holding one row group's worth of Artemis telemetry.
/// In production this is built incrementally by the ingestion loop, one
/// observation at a time, using `ArrayBuilder` types; the from-vec
/// construction here is for illustration.
fn build_telemetry_batch(
timestamps_ns: Vec<i64>,
panel_voltages: Vec<f64>,
payload_ids: Vec<u32>,
) -> Result<RecordBatch> {
assert_eq!(timestamps_ns.len(), panel_voltages.len());
assert_eq!(timestamps_ns.len(), payload_ids.len());
let schema = Arc::new(Schema::new(vec![
Field::new(
"sample_timestamp_ns",
DataType::Timestamp(TimeUnit::Nanosecond, None),
false, // non-nullable; timestamps are always present
),
Field::new("panel_voltage", DataType::Float64, false),
Field::new("payload_id", DataType::UInt32, false),
]));
let ts = TimestampNanosecondArray::from(timestamps_ns);
let voltage = Float64Array::from(panel_voltages);
let payload = UInt32Array::from(payload_ids);
RecordBatch::try_new(
schema,
vec![Arc::new(ts), Arc::new(voltage), Arc::new(payload)],
)
.map_err(Into::into)
}
What to notice. The RecordBatch::try_new call is cheap: it does not copy the underlying buffers, only validates the lengths match the schema and wraps the arrays in a struct. The Arc::new calls share ownership of the arrays without copying. Cloning the resulting batch downstream is similarly cheap — RecordBatch is Clone, and cloning increments the Arc refcounts on each column. This is the in-memory side of Arrow's zero-copy promise: passing record batches around the pipeline is a pointer-copy operation, not a data-copy operation.
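The pointer-copy behaviour can be demonstrated without the arrow crate. The sketch below uses a hypothetical MiniBatch type holding reference-counted columns; it is a stand-in for RecordBatch, not its implementation, and shares_buffers is an illustrative helper.

```rust
use std::sync::Arc;

/// A stand-in for a record batch: each column is a reference-counted,
/// immutable buffer. Cloning the batch clones the Arcs, not the data.
#[derive(Clone)]
struct MiniBatch {
    columns: Vec<Arc<[f64]>>,
}

/// True when both batches point at the very same column buffers.
fn shares_buffers(a: &MiniBatch, b: &MiniBatch) -> bool {
    a.columns.len() == b.columns.len()
        && a.columns
            .iter()
            .zip(&b.columns)
            .all(|(x, y)| Arc::ptr_eq(x, y))
}
```

After a clone, shares_buffers returns true and each column's strong count is 2: the work done is one reference-count increment per column, independent of how many rows the batch holds.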
The construction-from-Vec path used above is the slow path. Production ingestion uses ArrayBuilder types (Float64Builder, TimestampNanosecondBuilder) that accept values one at a time, manage their own buffer growth, and finish into an array at the end of a row group's worth of rows. The Artemis writer's ingestion loop is built around builders, one per column, finished and reset at each row group boundary.
Reading a Parquet File Into Arrow Record Batches
The parquet::arrow integration produces Arrow record batches from a Parquet file. The reader is an iterator: each .next() produces one record batch, and the iterator terminates when the file is fully read. Memory stays bounded regardless of file size.
use std::fs::File;
use std::sync::Arc;
use anyhow::Result;
use arrow::array::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
/// Read an entire Parquet file as a stream of Arrow record batches. The
/// configurable batch size controls how many rows are emitted per batch;
/// 8192 is the cache-friendly default the Artemis compute layer uses.
fn read_to_record_batches(path: &str) -> Result<Vec<RecordBatch>> {
let file = File::open(path)?;
let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
.with_batch_size(8192)
.build()?;
// The reader is an Iterator<Item = Result<RecordBatch>>. Each batch
// corresponds to an 8192-row slice; the boundary aligns with Parquet
// row groups when row groups are larger than the batch size.
let mut batches = Vec::new();
for batch in reader {
batches.push(batch?);
}
Ok(batches)
}
/// Same read, but with column projection. The Parquet reader only
/// decodes the column chunks for the projected columns; the others are
/// not read from disk. This is the I/O savings the columnar layout
/// promised in Lesson 1, expressed through the Arrow boundary.
fn read_projected(path: &str, columns: &[&str]) -> Result<Vec<RecordBatch>> {
use parquet::arrow::ProjectionMask;
let file = File::open(path)?;
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
let schema = builder.schema();
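// schema.index_of returns a top-level field index, so build the mask
// from root columns; leaf indices differ once the schema is nested.
let roots: Vec<usize> = columns
.iter()
.map(|name| schema.index_of(name))
.collect::<Result<_, _>>()
.map_err(|e| anyhow::anyhow!("missing column: {e}"))?;
let mask = ProjectionMask::roots(builder.parquet_schema(), roots);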
let reader = builder
.with_projection(mask)
.with_batch_size(8192)
.build()?;
reader.collect::<Result<Vec<_>, _>>().map_err(Into::into)
}
The boundary's cost is hidden inside the iteration. Each batch? pulls the next slice of up to 8192 rows: the reader fetches the needed pages of the projected column chunks, decompresses them, decodes them out of their encodings, and materializes them as Arrow arrays. The Arrow arrays the reader produces are the same Float64Array, TimestampNanosecondArray, and other types that the builder code produces. Once the boundary is crossed, the data is indistinguishable from data that was built in memory directly.
Two production touches that the example omits. First, the synchronous File reader is correct for local files but wrong for the Artemis cold archive, which lives in object storage; production reads use parquet::arrow::async_reader with object_store::ObjectStore to issue range reads against S3-compatible storage. Second, the Vec<RecordBatch> materialization at the end of the example defeats the streaming property of the reader — production code consumes the iterator lazily and never materializes the whole result.
Memory-Mapping an Arrow IPC File for Zero-Copy Reads
The Artemis hot-cache tier stores frequently-queried working sets as Arrow IPC files on a local NVMe SSD. The cache server memory-maps these files; the query engine operates on the mmap'd pages directly. No allocation, no decode, no copy.
use std::fs::File;
use std::sync::Arc;
use anyhow::Result;
use arrow::array::RecordBatch;
use arrow::ipc::reader::FileReader;
use memmap2::Mmap;
/// Open an Arrow IPC file in read-only memory-mapped mode and read its
/// record batches. File bytes are paged in on demand through the mmap,
/// but the default FileReader copies each batch's data buffers into
/// fresh allocations, so the batches returned here do not alias the
/// mmap'd region. See the closing comment for the zero-copy variant.
fn read_ipc_mmap(path: &str) -> Result<Vec<RecordBatch>> {
let file = File::open(path)?;
// Safety: the file is opened read-only and the mapping is only read
// inside this function. If the underlying file is truncated or
// replaced while the mmap is live, reads will SIGBUS; the Artemis
// cache server protects against this by never modifying a cache file
// in place. New versions are written to new paths and the index is
// updated atomically.
let mmap = unsafe { Mmap::map(&file)? };
// The arrow-ipc FileReader accepts any Read + Seek. A Cursor over
// the mmap slice gives it that; reads from the cursor are reads
// from the mmap'd memory, which the kernel pages in on demand.
let cursor = std::io::Cursor::new(&mmap[..]);
let reader = FileReader::try_new(cursor, None)?;
let mut batches = Vec::new();
for batch in reader {
batches.push(batch?);
}
// This example returns owned batches: the IPC reader's deserialization
// has copied their data out of the mmap, so dropping the mmap at the
// end of this function is safe. True zero-copy requires the batches'
// buffers to alias the mmap directly, and then the mmap must outlive
// every batch; the simplest discipline is to keep the mmap alive in
// an enclosing struct alongside the batches. The arrow-ipc reader can
// alias with the right buffer-allocator configuration; see the
// production cache server's implementation for the wiring.
Ok(batches)
}
The unsafe block is the irreducible cost of memory mapping: the Mmap type cannot enforce, at compile time, that no one truncates or replaces the underlying file. The Artemis cache server addresses this with a write-discipline invariant — cache files are immutable, new versions are written to new paths, and the cache index is updated atomically — so the unsafety is contained to a known boundary.
The example's last comment is the production caveat worth understanding. True zero-copy from an Arrow IPC file requires the IPC reader to construct Arrow array buffers that alias the mmap'd memory rather than copying out of it. The default FileReader reads the FlatBuffers metadata zero-copy but materializes the data buffers into freshly-allocated Buffer objects. The Artemis cache server uses a custom buffer allocator that returns slices of the mmap'd region instead — this is what makes the cache hit path allocation-free. The mechanics are out of scope here; the principle is that Arrow IPC's wire format is byte-identical to its in-memory format, and aliasing is technically possible whenever the file's alignment matches Arrow's alignment requirements (8-byte by default).
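The alignment condition in the last sentence can be made concrete with a std-only sketch. It is not the cache server's allocator and uses no arrow-rs API; `view_as_f64`, `region_from`, `region_bytes`, and `BUFFER_ALIGN` are hypothetical names, and a `Vec<u64>` stands in for the mmap'd region because it is guaranteed to start on an 8-byte boundary. The point it illustrates: a typed view can alias the bytes only when the start address and length line up; otherwise the reader has to copy.

```rust
use std::mem::size_of;

/// Minimum buffer alignment assumed for this sketch (the 8-byte figure
/// quoted above). Hypothetical constant, not an arrow-rs API.
const BUFFER_ALIGN: usize = 8;

/// Try to view a byte region as `&[f64]` without copying. Returns `None`
/// when the region is misaligned or not a whole number of values, in which
/// case a reader has to fall back to copying into an aligned allocation.
fn view_as_f64(bytes: &[u8]) -> Option<&[f64]> {
    let aligned = (bytes.as_ptr() as usize) % BUFFER_ALIGN == 0;
    let whole = bytes.len() % size_of::<f64>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // Safety: the pointer is aligned for f64 (checked above), the length
    // stays inside `bytes`, every bit pattern is a valid f64, and the
    // returned slice borrows `bytes`, so it cannot outlive the region.
    Some(unsafe {
        std::slice::from_raw_parts(
            bytes.as_ptr().cast::<f64>(),
            bytes.len() / size_of::<f64>(),
        )
    })
}

/// Stand-in for an mmap'd region. A `Vec<u64>` allocation always starts on
/// an 8-byte boundary, which a plain `Vec<u8>` does not guarantee.
fn region_from(values: &[f64]) -> Vec<u64> {
    values.iter().map(|v| v.to_bits()).collect()
}

/// Borrow the region as raw bytes, the way a reader sees a mapped file.
fn region_bytes(region: &[u64]) -> &[u8] {
    // Safety: u8 has alignment 1 and the length covers the same allocation.
    unsafe {
        std::slice::from_raw_parts(
            region.as_ptr().cast::<u8>(),
            region.len() * size_of::<u64>(),
        )
    }
}
```

A slice starting at byte offset 8 of the region yields a view whose pointer is the same address as the backing bytes; a slice starting at offset 4 yields `None`. The same check, applied per buffer, is what decides whether an IPC file can be aliased or must be copied.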
Key Takeaways
- Arrow and Parquet are different formats with different goals. Parquet is small-on-disk and read-once into memory; Arrow is random-access in memory and zero-copy across process boundaries. The Artemis read path crosses the boundary between them deliberately.
- An Arrow array is a small number of flat, contiguous buffers — validity bitmap plus value buffer for primitives, offset buffer added for variable-length types. The layout is what makes O(1) random access and vectorized kernels possible.
- A record batch is the in-memory unit of work — multiple equal-length arrays with a schema. Chunked arrays represent a column across multiple batches without concatenation or copy.
- The Parquet-Arrow boundary is where on-disk encoded bytes become in-memory random-access buffers. The crossing is per-page decompression and decoding; the result is data the CPU can scan at memory-bandwidth speeds.
- Arrow IPC files are memory-mappable for zero-copy reads. The bytes on disk are the bytes the CPU operates on. The hot-cache tier between the Artemis cold archive and the compute layer uses this property to serve queries against working sets at zero allocation cost per query.
Capstone — Artemis Archive Parquet Writer
Module: Data Lakes — M01: Columnar Storage Foundations
Estimated effort: 1–2 weeks of focused work
Prerequisite: All three lessons in this module completed; all three quizzes passed (≥ 70%).
Mission Briefing
From: Artemis Cold Archive Lead
ARCHIVE BRIEFING — RC-2026-04-DL-001
SUBJECT: Artemis cold-archive ingestion writer, replacement for legacy
JSONL pipeline.
PRIORITY: P1 — current pipeline is the bottleneck for analyst query SLA.
The legacy archive writes downlinked telemetry as gzipped JSONL into the cold-storage object bucket. Writes are fast; reads are the problem. The analyst query SLA is "ninety-fifth percentile query under thirty seconds"; we are currently at "fiftieth percentile query under fifteen minutes." The bottleneck is the row-oriented, parse-every-byte storage layout. Three weeks ago we approved the migration to Parquet for the cold archive, and this is the writer that produces the new files.
The writer ingests record batches from the ground-segment handoff (already Arrow-formatted upstream — your input is RecordBatch, not JSONL) and produces Parquet files into the cold-archive bucket. The files this writer produces are the files Module 2's table-format layer will wrap into Iceberg tables, Module 3's partition strategy will organize, and Module 5's query engine will read against the analyst workload. Your writer is the foundation of the rest of the track.
This is not a research project. The writer's design decisions — row group size, per-column encoding, compression codec — are operational decisions with measurable consequences. Document the decisions and the evidence behind them; the engineering review at the end of this project will read that documentation.
What You're Building
A Rust crate, artemis-archive-writer, that exposes:
- A Writer struct, constructed from an output path and an Arrow SchemaRef, that accepts RecordBatches via a write(&mut self, batch: &RecordBatch) -> Result<()> method and a close(self) -> Result<WriteSummary> method.
- A WriteSummary containing the output file path, total rows written, total compressed bytes, row group count, and per-column encoding choices.
- A CLI binary, artemis-write, that reads Arrow IPC files from stdin (or a path argument) and writes a Parquet file to a configurable output path, suitable for use in the ingestion pipeline.
- A configuration module that holds the per-column encoding map and is unit-tested against the canonical Artemis schema.
The writer must produce files that are readable by the standard parquet Rust crate (including its Arrow reader) and by pyarrow from Python; the cross-language readability is what makes the cold archive useful to the analyst tooling. Your tests must verify both.
Functional Requirements
- Streaming ingestion. The writer accepts record batches incrementally; the caller does not have to materialize all data before calling close. The writer's resident memory must stay bounded regardless of total rows written: bounded by the configured row group size, not the input size.
- Configured row group size. The writer flushes a row group when the buffered byte estimate exceeds the configured target (default 128 MB). The actual size is approximate because flushes happen at record-batch boundaries.
- Per-column encoding map. The writer applies the encoding map from Lesson 2: DELTA_BINARY_PACKED for sample_timestamp_ns and sequence_number, BYTE_STREAM_SPLIT for panel_voltage and temperature_c, DELTA_BYTE_ARRAY for archive_uri and object_path, dictionary (default) for mission_id, payload_id, sensor_kind, and quality_flag.
- ZSTD compression at level 3 as the file-level codec.
- Statistics enabled on numeric columns to support downstream predicate pushdown.
- Atomic visibility. The writer writes to <output_path>.inprogress and renames to <output_path> only on successful close(). A crashed writer produces a file readers never see.
- Observability hooks. The writer emits structured log events on each row-group flush (rg_idx, rows, compressed_bytes, elapsed_ms) and on dictionary-fallback detection (column, row_group). The format is a single JSON object per line, written to a configurable tracing subscriber.
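The row-group-size requirement says the actual size is approximate because flushes happen at record-batch boundaries. A minimal std-only model of that rule shows where the overshoot comes from. `FlushPolicy` and its methods are hypothetical names for illustration; in the real writer the byte estimate would come from the underlying Parquet writer, not from a counter like this.

```rust
/// Decides when to close a row group from a running byte estimate.
/// Illustrative only: not part of any crate's API.
struct FlushPolicy {
    target_bytes: u64,
    buffered_bytes: u64,
    /// Estimated size of each row group flushed so far.
    flushed_groups: Vec<u64>,
}

impl FlushPolicy {
    fn new(target_bytes: u64) -> Self {
        Self { target_bytes, buffered_bytes: 0, flushed_groups: Vec::new() }
    }

    /// Account for one incoming batch and report whether the caller should
    /// flush now. The check runs only after the whole batch is buffered, so
    /// a row group can overshoot the target by up to one batch.
    fn on_batch(&mut self, batch_bytes: u64) -> bool {
        self.buffered_bytes += batch_bytes;
        if self.buffered_bytes > self.target_bytes {
            self.flushed_groups.push(self.buffered_bytes);
            self.buffered_bytes = 0;
            true
        } else {
            false
        }
    }

    /// Called from close(): flush whatever is still buffered as a final,
    /// usually undersized, row group and return all group sizes.
    fn finish(mut self) -> Vec<u64> {
        if self.buffered_bytes > 0 {
            self.flushed_groups.push(self.buffered_bytes);
        }
        self.flushed_groups
    }
}
```

With a 128 MB target and 50 MB batches, every third batch triggers a flush and each full row group lands at 150 MB rather than 128 MB. Smaller input batches tighten the approximation; the memory bound in the acceptance criteria has to leave room for that overshoot.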
Acceptance Criteria
Verifiable (automated tests must demonstrate these)
- The writer accepts RecordBatches via write() and produces a valid Parquet file via close() against the canonical Artemis test schema (defined in tests/schema.rs).
- The output file is readable by parquet::arrow::ParquetRecordBatchReaderBuilder and round-trips every column's values bit-exact for primitive types and byte-exact for variable-length types.
- The output file is readable by pyarrow.parquet.ParquetFile (verified via a test harness that shells out to a small Python helper); the Python read produces the same row count and per-column null count as the Rust read.
- For an input of 10 million rows with the canonical schema, the output file has at least 3 row groups when configured for 128 MB target, demonstrating that row-group flushing works.
- The output file's footer reports DELTA_BINARY_PACKED for sample_timestamp_ns, BYTE_STREAM_SPLIT for panel_voltage, and a dictionary encoding for mission_id in every row group's column chunk metadata.
- An interrupted write (simulated by dropping the Writer without calling close) leaves no file at the final output path; the .inprogress file is present and is detectably invalid as Parquet (the standard reader rejects it).
- The compressed output file is at least 5× smaller than the same data written as gzipped JSONL via a baseline writer in tests/baseline_jsonl.rs.
- The writer's resident memory, measured by jemalloc-ctl::stats::resident at peak during a 10-million-row write, does not exceed 1.5× the configured row-group target.
Self-assessed (you write a short justification; reviewer checks it)
- (self-assessed) The per-column encoding map is justified in docs/encoding-decisions.md against measured compression ratios on the canonical test data. Each non-default encoding choice has a one-paragraph rationale and a measurement.
- (self-assessed) The row-group-size choice is justified in docs/row-group-decision.md against the writer-memory-budget constraint and a reference analyst query pattern.
- (self-assessed) The crash-handling discipline (.inprogress rename) is justified against the alternative of writing directly to the final path; the doc names at least one failure mode the rename pattern prevents.
- (self-assessed) The writer's API surface (write, close, WriteSummary) is documented with rustdoc comments that explain the memory and atomicity properties from the caller's perspective.
Architecture Notes
The writer is fundamentally a thin layer over parquet::arrow::ArrowWriter. The crate the engineer is building is not "a new Parquet writer from scratch" — that is a reasonable exercise but not this exercise. The work is in the configuration discipline (per-column encoding map driven by the schema), the atomic-rename discipline, and the observability hooks. The ArrowWriter does the actual Parquet bytes.
A reasonable module layout:
artemis-archive-writer/
├── src/
│ ├── lib.rs # public API: Writer, WriteSummary, error types
│ ├── config.rs # encoding map, row group size, codec, statistics
│ ├── writer.rs # the Writer impl wrapping ArrowWriter
│ ├── observability.rs # tracing-subscriber events for flush and fallback
│ └── bin/artemis-write.rs # CLI binary
├── tests/
│ ├── schema.rs # canonical Artemis test schema
│ ├── baseline_jsonl.rs # gzipped-JSONL baseline writer for size comparison
│ ├── roundtrip.rs # write-then-read tests
│ └── pyarrow_compat.rs # Python-helper shell-out test
└── docs/
├── encoding-decisions.md
└── row-group-decision.md
The canonical Artemis schema includes at least:
| Column | Arrow type | Encoding |
|---|---|---|
| sample_timestamp_ns | Timestamp(Nanosecond, None) | DELTA_BINARY_PACKED |
| mission_id | Utf8 | dictionary (default) |
| payload_id | UInt32 | dictionary (default) |
| sensor_kind | Utf8 | dictionary (default) |
| sequence_number | UInt64 | DELTA_BINARY_PACKED |
| panel_voltage | Float64 | BYTE_STREAM_SPLIT |
| temperature_c | Float64 | BYTE_STREAM_SPLIT |
| pressure_kpa | Float64 | BYTE_STREAM_SPLIT |
| quality_flag | UInt8 | dictionary (default) |
| archive_uri | Utf8 | DELTA_BYTE_ARRAY |
| object_path | Utf8 | DELTA_BYTE_ARRAY |
| notes | Utf8 (nullable) | dictionary (default; falls back to plain) |
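One possible shape for the configuration module is a pure function from column name to encoding choice, which makes the table above directly unit-testable. This is a sketch under assumptions: `ColumnEncoding`, `encoding_for`, and `bypasses_dictionary` are hypothetical names, the enum is a local stand-in rather than the parquet crate's encoding type, and unknown columns are assumed to take the dictionary default.

```rust
/// Encoding choices used by the canonical schema. Local stand-in type for
/// illustration; the real module would translate these into the Parquet
/// writer's own encoding values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnEncoding {
    DeltaBinaryPacked,
    ByteStreamSplit,
    DeltaByteArray,
    /// The writer default: dictionary, with fallback handled by the writer.
    Dictionary,
}

/// Look up the encoding for a top-level column of the canonical schema.
/// Columns not named in the table fall through to the dictionary default
/// (an assumption of this sketch).
fn encoding_for(column: &str) -> ColumnEncoding {
    match column {
        "sample_timestamp_ns" | "sequence_number" => ColumnEncoding::DeltaBinaryPacked,
        "panel_voltage" | "temperature_c" | "pressure_kpa" => ColumnEncoding::ByteStreamSplit,
        "archive_uri" | "object_path" => ColumnEncoding::DeltaByteArray,
        _ => ColumnEncoding::Dictionary,
    }
}

/// True for columns whose override should skip the dictionary attempt
/// entirely, which is every column with a non-default encoding.
fn bypasses_dictionary(column: &str) -> bool {
    encoding_for(column) != ColumnEncoding::Dictionary
}
```

Keeping the map as plain data with no writer dependency means the unit test against the canonical schema needs no Parquet file at all; the translation into writer properties is a separate, thin step.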
Hints
Hint 1 — Per-column encoding configuration in Rust
The WriterProperties builder method set_column_encoding takes a ColumnPath, which is constructed from a Vec<String>. For top-level columns the path has one element. Remember that overriding the encoding does not by itself disable Parquet's dictionary attempt — you need set_column_dictionary_enabled(path, false) for columns where the override should bypass dictionary entirely (delta-encoded timestamps, byte-stream-split floats).
Hint 2 — The atomic rename pattern
The standard library's std::fs::rename is atomic on the same filesystem on Linux and macOS. The writer should construct the output file with a .inprogress suffix appended to the configured path, write into it, call ArrowWriter::close, and then rename. The rename happens after the writer's Drop runs but before the caller's close() returns the WriteSummary. Be explicit about the ordering: close() returns Ok only after the rename succeeds.
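The ordering in this hint can be sketched with the standard library alone. The payload here is an opaque byte slice standing in for whatever the Parquet writer produced; `inprogress_path` and `write_atomically` are hypothetical names, and the `sync_all` call before the rename is an addition of this sketch (the hint does not mention it) so that the published name never points at bytes still sitting in the page cache.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Append ".inprogress" to the full file name, keeping the existing
/// extension: "part-000.parquet" becomes "part-000.parquet.inprogress".
fn inprogress_path(final_path: &Path) -> PathBuf {
    let mut name = final_path.as_os_str().to_owned();
    name.push(".inprogress");
    PathBuf::from(name)
}

/// Write `payload` so that `final_path` either does not exist or holds the
/// complete contents. Returns Ok only after the rename has succeeded.
fn write_atomically(final_path: &Path, payload: &[u8]) -> io::Result<()> {
    let tmp = inprogress_path(final_path);
    let mut file = File::create(&tmp)?;
    file.write_all(payload)?;
    // Flush file contents to stable storage before publishing the name.
    file.sync_all()?;
    drop(file);
    // Publish. If the process dies before this line, only the
    // .inprogress file exists and readers never see it.
    fs::rename(&tmp, final_path)?;
    Ok(())
}
```

A crash anywhere before the rename leaves only the `.inprogress` file behind, which is the state the interrupted-write acceptance criterion checks for.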
Hint 3 — Verifying the per-column encoding choice in tests
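After writing a test file, open it with SerializedFileReader and walk metadata.row_groups()[i].columns(). Each ColumnChunkMetaData has an encodings() method returning the encodings the chunk actually used. For columns with a non-default encoding, you should see the configured encoding and no PLAIN_DICTIONARY/RLE_DICTIONARY entry alongside it; an RLE entry may also appear, because the writer typically records the encoding used for definition and repetition levels in the same list. For columns left at default, expect a dictionary entry (RLE_DICTIONARY from recent versions of the parquet crate, PLAIN_DICTIONARY in older files), plus PLAIN if dictionary fallback occurred. Because the dictionary page itself may be reported as PLAIN, do not treat PLAIN alone as proof of fallback; confirm against the page-level metadata. The nullable test column with high-cardinality values is the case that exercises fallback detection.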
Hint 4 — Cross-language verification with pyarrow
The pyarrow_compat.rs test does not need a full Python integration. A simple std::process::Command::new("python3").arg("-c").arg(script) invocation where script is a one-liner that opens the file with pyarrow.parquet.ParquetFile and prints the row count and the per-column null counts is enough. Parse the stdout in Rust and compare to the Rust-side read. If python3 is not present on the test runner, #[cfg_attr(not(feature = "pyarrow-test"), ignore)] lets you keep the test in the suite without requiring Python in CI.
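A possible shape for the shell-out, assuming a helper script that prints the row count on the first line and one `column=null_count` line per column. The script text, the output format, and the names `PYARROW_SCRIPT`, `PyReadSummary`, `parse_summary`, and `pyarrow_command` are all hypothetical; only the parsing half is exercised without Python installed.

```rust
use std::collections::BTreeMap;
use std::process::Command;

/// Hypothetical helper: reads the file with pyarrow and prints the row
/// count, then one `name=null_count` line per column.
const PYARROW_SCRIPT: &str = r#"
import sys, pyarrow.parquet as pq
t = pq.ParquetFile(sys.argv[1]).read()
print(t.num_rows)
for name in t.column_names:
    print(f"{name}={t.column(name).null_count}")
"#;

/// What the Rust side compares against its own read of the same file.
#[derive(Debug, PartialEq)]
struct PyReadSummary {
    rows: u64,
    null_counts: BTreeMap<String, u64>,
}

/// Build the python3 invocation; the caller runs it and captures stdout.
fn pyarrow_command(parquet_path: &str) -> Command {
    let mut cmd = Command::new("python3");
    cmd.arg("-c").arg(PYARROW_SCRIPT).arg(parquet_path);
    cmd
}

/// Parse the helper's stdout. Returns None on any malformed line so a
/// broken helper fails the test loudly instead of comparing garbage.
fn parse_summary(stdout: &str) -> Option<PyReadSummary> {
    let mut lines = stdout.lines().map(str::trim).filter(|l| !l.is_empty());
    let rows: u64 = lines.next()?.parse().ok()?;
    let mut null_counts = BTreeMap::new();
    for line in lines {
        // Split on the last '=' so the count is always the final field.
        let (name, count) = line.rsplit_once('=')?;
        null_counts.insert(name.to_string(), count.parse().ok()?);
    }
    Some(PyReadSummary { rows, null_counts })
}
```

In the test, `pyarrow_command(path).output()` supplies the bytes, `String::from_utf8` turns them into text, and `parse_summary` produces the value to compare with the Rust-side row count and null counts.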
Hint 5 — Bounded writer memory under a long write
The ArrowWriter buffers an entire row group's worth of data in memory before flushing. Note that in current releases of the parquet crate, the max_row_group_size writer property is a row count, not a byte size, so a 128 MB target has to be enforced by your Writer: check ArrowWriter::in_progress_size() after each write() and call flush() once the estimate exceeds the target. With a 128 MB target, the writer's peak memory should be ~1.2× the row group size (the row group plus encoding scratch). If your test shows higher peak memory, check that the test isn't accidentally retaining the input Vec<RecordBatch>; a streaming iterator that yields each batch once and is consumed-then-dropped is what produces the bounded-memory shape. The jemalloc-ctl measurement is taken at the write() call boundary, not after holding all batches in a vector.
References
- In-Memory Analytics with Apache Arrow (Topol), Chapter 3 — "Format and Memory Handling"
- Apache Parquet specification, particularly the Encodings document (github.com/apache/parquet-format/blob/master/Encodings.md)
- parquet crate documentation (docs.rs/parquet): WriterProperties, ArrowWriter, ColumnPath
- arrow crate documentation (docs.rs/arrow): RecordBatch, Schema, primitive array builders
When You're Done
The crate is "done" when all eight verifiable acceptance criteria pass in CI and the four self-assessed docs are written. Open a draft PR against the meridian-data-systems/archive-writer repo with the implementation, the tests, and the docs. The review will read the docs first; the docs are how the next engineer who touches this code understands the decisions you made.
Module 2: Open Table Formats
Lesson 1 — The Lakehouse Problem
Module: Data Lakes — M02: Open Table Formats
Position: Lesson 1 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 7 ("Transactions — Single-Object Writes") and Chapter 8 ("The Trouble with Distributed Systems"). Apache Iceberg specification, "Goals" section.
Source note: Iceberg/Delta books are not in the project corpus. The transaction-theory framing in this lesson is grounded in DDIA; the format-specific assertions about table formats are synthesized from the Iceberg spec and would benefit from verification.
Context
Module 1 ended with a writer that produces well-formed Parquet files into the Artemis cold-archive object bucket. Each file is correct in isolation: footer at the end, encodings chosen, statistics computed, atomic rename on success. The cold archive now holds millions of these files across thousands of object paths. The question this lesson asks is: what is the table?
The naive answer is "the directory." A query against the cold archive lists the bucket, reads every Parquet file, and unions the results. This answer is wrong in five distinct ways that Module 2 will spend three lessons fixing. The list-and-read pattern is slow (every query pays the cost of listing every file). It has no transactional semantics (a writer dropping new files mid-query produces split-brain query results). It has no schema enforcement (a file with a stale schema is silently included). It has no version (the table cannot be queried as it was an hour ago). It has no atomic delete (removing a file requires hoping no reader is currently using it).
A table format solves these problems by adding a metadata layer between the catalog and the data files. The metadata is the source of truth for "what files are in the table at version N." Queries read the metadata first, learn which files to read, and read only those. Writes produce new metadata that atomically swaps in. The data files themselves do not change — they are the same Parquet files Module 1 produced. The table format is purely a metadata problem; the data layer is unchanged.
This lesson develops the lakehouse problem in five failure-mode-shaped pieces, each one a real outage type that has happened in production data warehouses without a table format. The remaining two lessons in the module build the metadata layer that fixes them.
Core Concepts
Listing Is Not Membership
The first failure mode is the cheapest to demonstrate. A query against the cold archive without a table format must answer "what files belong to this table?" by listing the object store. Object store listings are slow (paginated, eventually consistent on some backends, billed per request) and incomplete (an in-flight write may or may not be visible). A query against a 100k-file table issues a list operation that returns 100k object keys, filters the ones it wants, and reads the data. On S3-style APIs that return at most 1,000 keys per page, that is at least 100 sequential list requests, and the listing alone can take ten seconds before a single data byte is read.
Worse, the listing answer is not stable. A writer dropping a new file during the listing produces a nondeterministic result: the listing may or may not include the new file depending on where the listing's pagination cursor is when the file appears. The query that reads the listing now races the writer. Two consecutive queries can return different results not because the table changed but because the listing returned different subsets.
The fix the table format provides is to make membership an explicit, atomic data structure — a list of files written at a known version of the metadata. The list of files is the table's membership; the directory contents are storage-layer details the table format manages. Listing the object store is a maintenance operation, not a query operation.
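The distinction between membership and listing can be sketched in a few lines of std-only Rust. `TableVersion`, `plan_scan`, and `unreferenced` are hypothetical names; real table formats store membership in manifest files rather than an in-memory set, which the next lesson develops.

```rust
use std::collections::BTreeSet;

/// One committed version of the table: a version number plus the exact
/// set of data files that belong to it. Illustrative type only.
struct TableVersion {
    version: u64,
    files: BTreeSet<String>,
}

/// Query planning reads membership from metadata and never consults the
/// object-store listing, so the answer is stable for a given version.
fn plan_scan(meta: &TableVersion) -> Vec<&str> {
    meta.files.iter().map(String::as_str).collect()
}

/// Maintenance path: objects present in a listing but not referenced by
/// the metadata (in-flight writes, orphans from failed commits).
fn unreferenced<'a>(meta: &TableVersion, listing: &'a [String]) -> Vec<&'a str> {
    listing
        .iter()
        .filter(|key| !meta.files.contains(key.as_str()))
        .map(String::as_str)
        .collect()
}
```

A file that a writer has dropped into the bucket but not yet committed shows up in `unreferenced` and never in `plan_scan`, which is the behaviour the listing-based approach cannot provide.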
Atomic Visibility Across Multiple Files
A single Parquet file's atomic visibility is solved by Module 1's .inprogress rename pattern: a reader sees a file only after it is complete. But most table mutations produce multiple files at once. A daily Artemis batch ingest drops twelve new Parquet files for the day's downlinks. A correction job rewrites three files to fix a quality-flag bug. An optimization job consolidates fifty small files into five large ones.
Without a table format, "drop twelve files atomically" is not a primitive object storage exposes. The writer drops file 1 through file 11, crashes, and now the table contains eleven files of a twelve-file commit: partial data the reader sees as if it were complete. DDIA (Ch. 7) makes the same point in single-machine terms: atomicity at the multi-object level is what transactions provide and what storage engines alone do not. The same logic applies in the lakehouse case, scaled up: a multi-file commit is a multi-object transaction, and object stores do not provide them.
The table format makes multi-file commits atomic by indirecting through metadata. The writer produces the twelve new data files (each individually atomic via the rename pattern), then writes a new metadata snapshot listing the table's now-complete file set, then atomically swaps the catalog pointer to point at the new snapshot. Until the catalog pointer swaps, readers see the old table. After the swap, readers see the new. There is no in-between state any reader can observe.
Schema Enforcement and Schema Drift
A directory of Parquet files has no canonical schema. Every file has its own schema (in its footer), and the schemas may differ. A writer that adds a new column to its output produces files with one schema; old files in the same directory have the previous schema. A reader that unions these files must choose how to handle the inconsistency. The naive choices are bad: pick one file's schema and reject the rest; union all schemas and produce nulls for missing columns silently; fail. Each choice produces a different "the table" depending on which files were listed and which order.
This is schema drift, and it is the failure mode that makes ad-hoc data lakes unusable past the first year of operation. The Artemis legacy archive accumulated nine variants of the telemetry schema over six years, each one a writer that updated its output without coordinating with readers. Analysts learned to special-case ranges of dates against schema versions; the analyst tooling carried a schema-versioning table that no one was sure was complete.
The table format fixes schema drift by storing the table's schema in the metadata, not in the data files. New data files conform to the metadata's schema; schema changes are atomic operations on the metadata (Module 4 develops schema evolution mechanics). Readers consult the metadata schema, project data files against it, and reject files whose schema is incompatible. The metadata schema is what "the table" means; individual file schemas are storage details.
No Snapshot, No Time Travel
Without a table format, the table is whatever the directory contains right now. There is no version. There is no query that asks "what did this table look like an hour ago." The accident-investigation use case that motivates the Artemis cold archive — reconstructing the operational state of the satellite constellation as of a specific past minute — is fundamentally incompatible with a raw-directory lakehouse, because there is no representation of "the table as of past minute N."
DDIA (Ch. 7) makes the same observation about snapshots in transactional databases: snapshot isolation is a discipline that requires the storage engine to preserve old versions of data while new versions are being written. The lakehouse case has the same requirement and the same shape of solution: every mutation produces a new immutable version of the metadata; old versions remain on disk; queries against past timestamps read the version that was current at that timestamp. The data files themselves are content-addressed in effect — a data file is never modified once written, so all past snapshots can reference it directly.
Time travel is not the goal; immutable versioning is the goal. Time travel falls out of immutable versioning for free. Module 4 develops the time-travel query path explicitly.
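To see why time travel falls out of immutable versioning, consider a minimal log that maps commit time to snapshot ID. This is a sketch only: `SnapshotLog`, `record`, and `as_of` are hypothetical names, timestamps are plain milliseconds, and the query path Module 4 develops may look quite different.

```rust
use std::collections::BTreeMap;

/// Append-only record of which snapshot became current at which time.
/// Keys are commit timestamps in milliseconds, values are snapshot IDs.
#[derive(Default)]
struct SnapshotLog {
    by_commit_ms: BTreeMap<u64, u64>,
}

impl SnapshotLog {
    /// Record that `snapshot_id` became current at `committed_at_ms`.
    fn record(&mut self, committed_at_ms: u64, snapshot_id: u64) {
        self.by_commit_ms.insert(committed_at_ms, snapshot_id);
    }

    /// The snapshot that was current at `ts_ms`: the latest one committed
    /// at or before that instant, or None if the table did not exist yet.
    fn as_of(&self, ts_ms: u64) -> Option<u64> {
        self.by_commit_ms
            .range(..=ts_ms)
            .next_back()
            .map(|(_, id)| *id)
    }
}
```

Nothing here is specific to time travel: the log exists because every commit produces a new immutable snapshot, and an as-of query is a range lookup over it. The lookup only works for as long as the old snapshots and their files are retained.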
Atomic Deletes and the Concurrent-Reader Problem
The last failure mode is the one that bites operational engineers running a lakehouse for the first time. Files in a data lake are not append-only forever — they need to be deleted. Old data ages out per retention policy. Bad files from a corrupted ingest need to be removed. Compaction (Module 6) rewrites many small files into fewer large ones and then deletes the originals. The naive delete is object_store.delete(path).
The naive delete races concurrent readers. A query that is reading a file when the delete arrives sees its in-flight read fail with "object not found." The query crashes, the analyst retries, the next listing doesn't include the file, the retry succeeds. The right outcome — the query against the old version of the table sees the old files — requires the file to remain readable until every reader of the old snapshot has finished. The table format solves this by making file deletion a two-phase operation: the new metadata snapshot stops referencing the deleted files, and a separate maintenance job (snapshot expiration; Module 6) physically deletes the files only after a configurable retention window during which the old snapshot remains queryable.
The discipline is the same as the rename pattern at the file level: visibility is decoupled from the underlying storage operation. The metadata controls visibility; the storage operations are kept apart and run on schedules that respect reader requirements.
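The second phase of the delete can be sketched as a pure function over snapshots. The names `Snapshot` and `expire_snapshots` are hypothetical, and the sketch simplifies in one labeled way: it ages a snapshot from its commit time, whereas a production policy might age it from the moment it was superseded. The current (newest) snapshot is always retained.

```rust
use std::collections::BTreeSet;

/// A table snapshot reduced to what expiration needs. Illustrative only.
struct Snapshot {
    id: u64,
    committed_at_ms: u64,
    files: BTreeSet<String>,
}

/// Split snapshots into those still inside the retention window and those
/// that have expired, and return the files that are safe to physically
/// delete: referenced by an expired snapshot and by no retained one.
fn expire_snapshots(
    snapshots: Vec<Snapshot>,
    now_ms: u64,
    retention_ms: u64,
) -> (Vec<Snapshot>, BTreeSet<String>) {
    let cutoff = now_ms.saturating_sub(retention_ms);
    let newest = snapshots.iter().map(|s| s.id).max();
    let (retained, expired): (Vec<Snapshot>, Vec<Snapshot>) = snapshots
        .into_iter()
        .partition(|s| s.committed_at_ms >= cutoff || Some(s.id) == newest);

    // Every file a retained snapshot still references must stay readable.
    let live: BTreeSet<String> = retained
        .iter()
        .flat_map(|s| s.files.iter().cloned())
        .collect();

    let deletable: BTreeSet<String> = expired
        .iter()
        .flat_map(|s| s.files.iter())
        .filter(|f| !live.contains(f.as_str()))
        .cloned()
        .collect();

    (retained, deletable)
}
```

A file shared between an expired snapshot and a retained one is never returned as deletable, which is what lets a compaction drop its inputs from the new snapshot immediately while leaving the physical delete to run later.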
Code Examples
Demonstrating the Listing-Race Failure Mode
The minimal demonstration of why listing-is-not-membership matters: two writer tasks and one reader running concurrently, all treating the directory as the table.
use std::path::PathBuf;
use std::sync::Arc;
use std::time::Duration;
use anyhow::Result;
use tokio::task::JoinHandle;
/// Simulate the "list the directory and read everything" pattern that a
/// lakehouse without a table format produces. Two writers and one reader
/// run concurrently; the reader sees inconsistent counts depending on
/// where the listing's pagination falls relative to the writers.
async fn race_demo(dir: PathBuf) -> Result<()> {
let dir = Arc::new(dir);
// Two writers, each dropping new files at a steady cadence.
let w1 = spawn_writer(dir.clone(), "ingest-A", 100);
let w2 = spawn_writer(dir.clone(), "ingest-B", 80);
// Reader runs a query every 250ms by listing the directory and
// reading the files. Each query is independent; no coordination
// with the writers exists, so the result depends on what files
// the listing happens to return.
for _ in 0..10 {
let listed = tokio::fs::read_dir(&*dir).await?;
let count = count_complete_files(listed).await?;
// Two successive queries may return different counts not because
// the table semantics have changed, but because the writers
// dropped files between the calls. The reader cannot distinguish
// "the table grew" from "I read at a different moment in time."
println!("query saw {count} files");
tokio::time::sleep(Duration::from_millis(250)).await;
}
w1.abort();
w2.abort();
Ok(())
}
fn spawn_writer(
dir: Arc<PathBuf>,
name: &'static str,
rate_ms: u64,
) -> JoinHandle<Result<()>> {
tokio::spawn(async move {
let mut seq = 0u64;
loop {
let path = dir.join(format!("{name}-{seq:06}.parquet"));
// Write file with the .inprogress / rename pattern from M1.
// Each file is individually atomically visible, but the
// *set* of files visible to a reader is not stable.
write_one_parquet(&path).await?;
seq += 1;
tokio::time::sleep(Duration::from_millis(rate_ms)).await;
}
})
}
// Helpers `count_complete_files` and `write_one_parquet` elided — see the
// repository for the full demo. The point is structural: every reader
// gets a different table because the directory listing is the membership
// and the membership is racing with the writers.
What to notice. Each individual file is correctly atomic at the file level (Module 1's rename discipline holds). The bug is above the file level, at the table level, and no amount of per-file discipline fixes it. The race is structural: the table format is the only place where the fix can live, because the directory listing has no notion of a "table version."
The fix the next two lessons will develop: the table is not the directory. The table is a metadata pointer to a known-complete file set. Listing the directory becomes a maintenance operation (find orphaned files that the metadata no longer references; Module 6), not a query operation.
What an Atomic Multi-File Commit Looks Like
The contrast: a writer that produces three new data files and makes them visible atomically through a metadata snapshot. The data files are written first; the metadata pointer swap happens last and is the only operation that affects what readers see.
use anyhow::{Context, Result};
/// Sketch of an atomic multi-file commit. Steps 1–3 can fail or be
/// retried without affecting the table; step 4 is the atomic operation
/// that makes the commit visible. If step 4 fails, this sketch simply
/// returns the error: the data files from step 1 are left unreferenced,
/// and removing them is best-effort cleanup.
async fn atomic_commit(
table: &Table,
new_data_files: Vec<DataFile>,
) -> Result<SnapshotId> {
// Step 1: Write each new data file to a content-addressed object
// store path. Each write is individually atomic via the rename
// pattern from M1. If any of these fails, none of the files is
// referenced by table metadata yet, so the partial state is
// invisible to readers.
for file in &new_data_files {
write_parquet(&file.path, &file.batches)
.await
.with_context(|| format!("writing {}", file.path))?;
}
// Step 2: Read the table's current snapshot to learn what files
// are already in the table. This snapshot will be the *base* for
// the new snapshot — the new snapshot's file set is the base file
// set plus the new files.
let base = table.current_snapshot().await?;
// Step 3: Construct the new snapshot's metadata: a manifest listing
// the new data files, the existing manifests from the base
// snapshot, and the snapshot record itself.
let new_snapshot = build_snapshot(&base, &new_data_files).await?;
// Step 4: Atomic pointer swap. This is the only operation in the
// entire commit that has transactional semantics. Until it returns
// Ok, readers see the base snapshot. After it returns Ok, readers
// see the new snapshot. There is no in-between state.
table.compare_and_swap_snapshot(base.id, new_snapshot.id).await?;
Ok(new_snapshot.id)
}
The two structural properties that make this work. All the work is decoupled from visibility: writing the data files, building the manifests, constructing the snapshot — all of these can fail or be retried without affecting readers, because none of them changes the table's catalog pointer. The visibility change is one CAS: the only operation that has transactional semantics is the compare-and-swap on the catalog pointer, and the catalog pointer is small (typically a single object store key holding the current snapshot ID). The lakehouse problem reduces to "how does the catalog provide CAS?" — which is the subject of Lesson 3.
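The compare-and-swap itself can be modelled in memory to show what a losing committer observes. This is a stand-in under assumptions, not how any catalog implements it: `CatalogPointer` and `CommitConflict` are hypothetical names, snapshot IDs are plain integers, and an atomic integer plays the role of the catalog's pointer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Returned when the pointer moved between reading the base snapshot and
/// attempting the swap: another writer committed first.
#[derive(Debug, PartialEq)]
struct CommitConflict {
    expected: u64,
    actual: u64,
}

/// In-memory stand-in for the catalog pointer: holds the current
/// snapshot ID and nothing else.
struct CatalogPointer {
    current: AtomicU64,
}

impl CatalogPointer {
    fn new(initial_snapshot: u64) -> Self {
        Self { current: AtomicU64::new(initial_snapshot) }
    }

    /// What a reader or a committing writer sees as the current snapshot.
    fn current(&self) -> u64 {
        self.current.load(Ordering::SeqCst)
    }

    /// Swap to `new` only if the pointer still equals `expected`.
    fn compare_and_swap(&self, expected: u64, new: u64) -> Result<(), CommitConflict> {
        self.current
            .compare_exchange(expected, new, Ordering::SeqCst, Ordering::SeqCst)
            .map(|_| ())
            .map_err(|actual| CommitConflict { expected, actual })
    }
}
```

If two writers both read snapshot 1 as their base, the first swap succeeds and the second gets a `CommitConflict` reporting the snapshot that won. The loser's data files are still on storage and still valid; it re-reads the current snapshot, rebuilds its metadata on the new base, and retries the swap.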
A Schema Enforcement Check at Read Time
The schema-drift fix in code: the reader consults the table's metadata schema, not the data file's schema, and projects/casts data files against the metadata schema. Files whose schema is incompatible are rejected at read planning, not at row-decode time.
use arrow::datatypes::SchemaRef;
use anyhow::{anyhow, Result};
/// Decide whether a data file's schema is compatible with the table's
/// current schema, and produce a projection plan if so. Compatibility
/// is asymmetric: every table column that appears in the data file must
/// have a matching type there. Columns missing from the data file are
/// filled with nulls at read time; columns present in the data file but
/// absent from the table schema are dropped.
fn plan_file_read(
table_schema: &SchemaRef,
file_schema: &SchemaRef,
) -> Result<FileReadPlan> {
let mut projection = Vec::with_capacity(table_schema.fields().len());
for table_field in table_schema.fields() {
match file_schema.field_with_name(table_field.name()) {
Ok(file_field) => {
if file_field.data_type() != table_field.data_type() {
// Incompatible type. The right action is to flag the
// file as quarantined (Module 6's lineage system) and
// skip it; never silently coerce a mismatched type.
return Err(anyhow!(
"type mismatch on {}: table={:?} file={:?}",
table_field.name(),
table_field.data_type(),
file_field.data_type(),
));
}
projection.push(FileColumn::Existing(file_field.clone()));
}
Err(_) => {
// Column added to the table after this file was written.
// Synthesize a null column of the right type at read time.
projection.push(FileColumn::Null(table_field.clone()));
}
}
}
Ok(FileReadPlan { projection })
}
The discipline this enforces. The table schema is the source of truth. Files that conform are read; files that conflict are flagged. The reader never silently produces inconsistent results because of schema drift. This is what schema enforcement means in the lakehouse context: the table format records the schema in the metadata, and reads project against the metadata schema rather than against whatever the individual data files happen to contain.
Key Takeaways
- A directory of Parquet files is not a table. Five distinct failure modes emerge: slow/inconsistent listing-as-membership, no atomic multi-file visibility, no schema enforcement, no snapshot/time-travel, no safe deletes.
- The table format's job is to make membership an explicit, versioned data structure stored as metadata. The data files are unchanged; the metadata is the source of truth for "what files are in this table at version N."
- Atomicity at the multi-file level requires indirection through metadata. A multi-file commit produces all its data files, builds a new metadata snapshot, and atomically swaps the catalog pointer. The CAS on the catalog pointer is the only transactional operation; everything else is decoupled work.
- Schema enforcement lives in the table metadata, not in individual data files. Readers project file contents against the metadata schema, accept conforming files, flag non-conforming ones.
- File deletion is decoupled from visibility. New snapshots stop referencing old files; physical deletion runs later, on a retention window long enough that no reader still needs the old snapshot.
Lesson 2 — The Manifest Hierarchy and Snapshot Model
Module: Data Lakes — M02: Open Table Formats
Position: Lesson 2 of 3
Source: Apache Iceberg specification, "Table Spec" and "Manifests" sections (iceberg.apache.org/spec). Delta Lake protocol, "Actions" section, for the contrast where useful. Iceberg whitepaper (Ryan Blue, Netflix, 2018).
Source note: This lesson is synthesis-mode against the Iceberg specification. The four-level hierarchy and the snapshot-update mechanic are accurate to the spec; the per-format detail differences (Iceberg's V1 vs V2 spec, Delta's transaction log vs Iceberg's snapshot pointer) are noted where they illuminate a design decision. Verification against the current spec recommended.
Context
Lesson 1 established the table format's job: provide an atomic, versioned, schema-enforced view over a directory of immutable data files. This lesson develops the shape of the metadata that does the work. The shape is a four-level hierarchy: a catalog pointer to a snapshot, a snapshot pointer to a manifest list, a manifest list pointer to manifests, a manifest pointer to data files. Each level has a job that the level below cannot do. The hierarchy is what makes commits cheap, queries fast, and pruning effective.
The hierarchy is not Iceberg-specific. Delta Lake uses a transaction log instead of a snapshot pointer, but the log is functionally a sequence of snapshot deltas; reading it produces the same logical structure. Hudi uses a slightly different shape but the same essential layering. Once the engineer understands the four-level hierarchy, switching between formats is a question of vocabulary, not architecture. The Artemis archive uses Iceberg, so this lesson uses Iceberg's vocabulary throughout; the design considerations transfer.
The lesson develops each level in isolation, then traces a worked example through all four levels: a single commit's effect on the metadata, end to end. The capstone project implements this hierarchy in code; this lesson is the model the capstone is implementing.
Core Concepts
The Four-Level Hierarchy
The metadata hierarchy answers two questions: "what files are in the table at the current version?" and "what files were in the table at version N?". The hierarchy makes both questions cheap by trading a small amount of indirection cost for substantial reductions in the amount of metadata a reader must scan.
catalog pointer
│
▼
snapshot ← table version: schema, partitioning, statistics
│
▼
manifest list ← per-snapshot listing of manifests + per-manifest stats
│
├─▶ manifest ← group of data files, typically per-partition
├─▶ manifest
└─▶ manifest
│
├─▶ data file (Parquet)
├─▶ data file (Parquet)
└─▶ data file (Parquet)
The naming follows Iceberg. Catalog pointer: a small piece of state (a row in a database, an S3 object key, a ZooKeeper node) that holds "the current version of the table is snapshot S." Snapshot: a metadata file describing the table at one point in time: its schema, partition spec, and a pointer to the manifest list that enumerates the table's data files. Manifest list: a file listing the manifests that, together, enumerate every data file in the snapshot, plus per-manifest summary statistics. Manifest: a file listing some subset of the table's data files (typically grouped by partition), plus per-file statistics. Data file: a Parquet file as produced by Module 1's writer.
Each level fans out to the next. A catalog has one current snapshot per table. A snapshot's manifest list contains tens to around a thousand manifests. A manifest contains tens to thousands of data files. A 100k-data-file table has perhaps 1k manifests, one manifest list, one snapshot. Read planning starts at the snapshot and prunes aggressively at every level: most queries touch one snapshot, a small fraction of its manifests, and an even smaller fraction of those manifests' data files. The pruning power compounds.
Snapshots Are Immutable
The single most important property of the snapshot model: once written, a snapshot is never modified. A new commit produces a new snapshot file, leaving the old snapshot file untouched. The catalog pointer changes; the snapshots themselves are content-addressed (or near enough — Iceberg uses unique snapshot IDs and timestamp-prefixed paths).
Immutability is what makes time travel and concurrent reads cheap. A reader that started a query against snapshot S can finish the query against snapshot S even if a hundred concurrent writers committed new snapshots in the meantime — the snapshot S metadata is still on disk, unmodified, fully readable. The catalog pointer's current value is irrelevant to the in-flight reader; the reader holds the snapshot ID it started with and reads against that.
Immutability also makes commits cheap: a commit is "write some new metadata files, then swap the catalog pointer." None of the old metadata is touched. The old metadata's storage lives forever, in principle; in practice, the snapshot-expiration job (Module 6) deletes old snapshots after a configurable retention window, but the deletion is decoupled from any commit.
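The pinning behavior can be sketched with an in-memory stand-in. This is a minimal sketch, not the capstone's implementation: `TableMetadata`, `pin_current`, `commit_append`, and `files_at` are hypothetical names, and a `HashMap` stands in for snapshot files in the object store. For brevity it copies the full file list per snapshot, which the manifest hierarchy exists to avoid.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the snapshot files plus the
/// catalog pointer. File paths stand in for the full manifest hierarchy.
pub struct TableMetadata {
    /// Snapshots keyed by ID. Entries are only ever added, never edited.
    snapshots: HashMap<u64, Vec<String>>,
    /// The catalog pointer: the only value that changes on commit.
    current: u64,
}

impl TableMetadata {
    pub fn new(initial_id: u64, files: Vec<String>) -> Self {
        let mut snapshots = HashMap::new();
        snapshots.insert(initial_id, files);
        Self { snapshots, current: initial_id }
    }

    /// A reader pins the snapshot ID once, at query start.
    pub fn pin_current(&self) -> u64 {
        self.current
    }

    /// Commit by writing a new snapshot and moving the pointer. The
    /// parent's file list is copied; the parent entry is not touched.
    pub fn commit_append(&mut self, new_id: u64, added: Vec<String>) {
        let mut files = self.snapshots[&self.current].clone();
        files.extend(added);
        self.snapshots.insert(new_id, files);
        self.current = new_id;
    }

    /// Read the file set of any retained snapshot, current or historical.
    pub fn files_at(&self, snapshot_id: u64) -> Option<&[String]> {
        self.snapshots.get(&snapshot_id).map(|v| v.as_slice())
    }
}
```

A reader that pinned S101 before a commit still resolves S101's file set afterwards; only readers that pin after the commit see S102.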
Manifests: The Pruning Unit
A manifest is the smallest unit of "many data files described together." A manifest's payload is one record per data file: the file path, the partition tuple the file belongs to, the per-column statistics (min, max, null count, distinct count), the row count, the byte size. In Iceberg the manifest is an Avro file. Avro is row-oriented rather than columnar, which fits the access pattern: a planner that opens a manifest evaluates every entry in it, so the file is read sequentially in one pass rather than column by column.
The manifest list, one level up, holds one record per manifest with summary statistics across the manifest's files: the partition ranges spanned by the manifest, the count of files, the count of rows, the count of deleted records. The manifest list is the first pruning step at query time: a query against mission_id = 'apollo-7' reads the manifest list, finds the three manifests whose partition range includes apollo-7, and ignores the other ninety-seven manifests entirely.
The second pruning step is inside the chosen manifests: read each chosen manifest's data-file records, find the data files whose mission_id statistic contains apollo-7, and read only those. The third pruning step is inside each chosen data file: use the Parquet footer's per-column-chunk statistics (Module 1) to prune row groups. Pruning compounds across all three levels; the read fraction at each level is multiplied at the next.
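The file-level check reduces to a bounds-overlap test. The sketch below assumes a single numeric column and two predicate shapes. `ColumnBounds`, `Predicate`, `might_match`, and `prune` are hypothetical names, and real manifests store bounds as serialized bytes keyed by field ID rather than as `f64`.

```rust
/// Per-file statistics for one numeric column (simplified for the sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnBounds {
    pub lower: f64,
    pub upper: f64,
}

/// The two predicate shapes this sketch understands.
pub enum Predicate {
    GreaterThan(f64),
    Equals(f64),
}

/// Returns false only when the bounds prove that no row can match.
/// A true result means "might match"; rows are still filtered after reading.
pub fn might_match(bounds: &ColumnBounds, predicate: &Predicate) -> bool {
    match predicate {
        // value > t is impossible when the largest value is <= t.
        Predicate::GreaterThan(t) => bounds.upper > *t,
        // value == v is impossible when v lies outside [lower, upper].
        Predicate::Equals(v) => bounds.lower <= *v && *v <= bounds.upper,
    }
}

/// Keep only the files whose bounds overlap the predicate.
pub fn prune<'a>(
    files: &'a [(String, ColumnBounds)],
    predicate: &Predicate,
) -> Vec<&'a str> {
    files
        .iter()
        .filter(|(_, bounds)| might_match(bounds, predicate))
        .map(|(path, _)| path.as_str())
        .collect()
}
```

For panel_voltage > 28.5, a file whose upper bound is exactly 28.5 is dropped, and a file with bounds [27.0, 29.1] is kept even though most of its rows may fail the predicate.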
The grouping of data files into manifests is a design lever. Iceberg's default is one new manifest per commit, which keeps files written together in the same manifest but leaves a long tail of small manifests over time. The optimization process (Module 6's compaction) periodically rewrites manifests to consolidate small ones; this is a metadata-level compaction, complementary to the data-level compaction that consolidates small Parquet files.
The Catalog: External Atomicity
The catalog pointer is the part of the table format that lives outside the data lake's object store. Object stores have a weak primitive for atomicity: most support per-object atomic write (the rename pattern), but few support cross-object atomicity or transactional CAS in the general case. Iceberg's design pushes the CAS requirement to an external catalog — Hive Metastore, Project Nessie, AWS Glue, JDBC databases, a small DynamoDB row, ZooKeeper, or in the Artemis case a small Postgres row keyed on table name.
The catalog's job is exactly two operations: get_current_snapshot(table) and compare_and_swap_snapshot(table, expected_old, new). The CAS must be linearizable — concurrent writers must not both succeed in moving the pointer if they both based their new snapshot on the same old snapshot. Everything else the table format does (schema, manifests, data files) lives in the object store and inherits the catalog's CAS for atomicity.
The choice of catalog is operational, not architectural. Postgres is the obvious choice for systems that already run a transactional database; Nessie adds Git-like branching semantics on top; DynamoDB is the AWS-native option; Hive Metastore is legacy compatibility for systems migrating off Hadoop. The Artemis archive uses Postgres on the ground-segment ingestion node, which is already deployed for unrelated control-plane state. The single-row-per-table commit pattern is so cheap that even at very high commit rates the catalog is not the bottleneck.
The Delta Lake protocol takes a different approach: instead of an external catalog, Delta uses an append-only transaction log stored in the object store (_delta_log/00000000000000000000.json, _delta_log/00000000000000000001.json, …). The commit is "atomically create a new log file with the next sequential number." On filesystems that support atomic create-if-not-exists (HDFS, POSIX), this works directly. On S3, Delta uses a coordination service (DynamoDB, a small extra catalog) to provide the missing CAS. The two designs converge: both need external coordination for the commit primitive; the names and shapes differ.
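The log-file commit can be sketched in memory. This is an illustration under assumptions: `LogStore`, `put_if_absent`, and `next_version` are hypothetical names, a `BTreeMap` stands in for the object store, and versions are assumed to be contiguous from zero. Only the key format (twenty zero-padded digits) is taken from the file names above.

```rust
use std::collections::BTreeMap;

/// Format the log object key for a version: twenty zero-padded digits.
pub fn log_key(version: u64) -> String {
    format!("_delta_log/{:020}.json", version)
}

/// Hypothetical in-memory stand-in for a store that supports
/// create-if-not-exists.
#[derive(Default)]
pub struct LogStore {
    objects: BTreeMap<String, String>,
}

impl LogStore {
    /// Succeeds only if no object exists at `key`. This is the commit
    /// primitive: two writers racing for the same version cannot both win.
    pub fn put_if_absent(&mut self, key: String, body: String) -> Result<(), String> {
        if self.objects.contains_key(&key) {
            return Err(format!("conflict: {key} already exists"));
        }
        self.objects.insert(key, body);
        Ok(())
    }

    /// Next version to attempt, assuming versions are contiguous from 0.
    pub fn next_version(&self) -> u64 {
        self.objects.len() as u64
    }
}
```

A writer that loses the race re-reads the log, recomputes the next version, and tries again, which is the same retry shape as a failed catalog CAS.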
A Worked Example: One Commit, Four Levels
Trace the change to the metadata when one batch ingest adds twelve new data files to the Artemis telemetry table.
Before the commit. Table is at snapshot S101. Snapshot S101 references manifest list ML101. Manifest list ML101 references manifests M1, M2, …, M100. Each manifest references some data files. The catalog row says current_snapshot_id = S101.
During the commit. The writer:
1. Writes the twelve new data files to the object store, each one via the rename pattern. Files are in the object store but no metadata references them yet.
2. Writes one new manifest, M101, containing exactly the twelve new data files (one manifest per commit is Iceberg's default).
3. Writes a new manifest list, ML102, containing every manifest from ML101 plus the new M101. Note that ML101 is not modified; ML102 is a new file with 100 + 1 = 101 entries.
4. Writes a new snapshot, S102, with parent_snapshot_id = S101, the same schema as S101, and a pointer to ML102.
5. Issues compare_and_swap_snapshot(table, expected_old=S101, new=S102) against the catalog.
After the commit. The catalog row says current_snapshot_id = S102. Readers that started before step 5 see S101. Readers that start after step 5 see S102. There is no observable in-between state. The old metadata files (S101, ML101) remain in the object store, available for time-travel reads against the old version of the table.
The cost of the commit: one new manifest of twelve entries, one new manifest list of 101 entries, one new snapshot record, one catalog CAS. The cost is proportional to the size of the commit, not the size of the table. The hundred existing manifests are referenced by the new manifest list but not rewritten. This is what makes the format scale to millions of files — every commit is local to its own changes.
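The cost accounting for this commit can be checked with a small model. This is a sketch under simplifying assumptions: manifest names stand in for manifest files, and `CommitCost` and `append_commit` are hypothetical names introduced for illustration.

```rust
/// Simplified manifest list: names stand in for manifest files.
#[derive(Debug, Clone)]
pub struct ManifestList {
    pub manifests: Vec<String>,
}

/// What a single append commit writes, for cost accounting.
#[derive(Debug, PartialEq)]
pub struct CommitCost {
    pub new_manifests: usize,
    pub new_manifest_entries: usize,
    pub new_manifest_list_entries: usize,
    pub rewritten_manifests: usize,
}

/// Build the new manifest list for an append. The base list is borrowed
/// immutably, so this function cannot modify it.
pub fn append_commit(
    base: &ManifestList,
    new_manifest_name: &str,
    added_files: usize,
) -> (ManifestList, CommitCost) {
    // Carry every existing manifest reference forward, then add one.
    let mut manifests = base.manifests.clone();
    manifests.push(new_manifest_name.to_string());
    let cost = CommitCost {
        new_manifests: 1,
        new_manifest_entries: added_files,
        new_manifest_list_entries: manifests.len(),
        rewritten_manifests: 0,
    };
    (ManifestList { manifests }, cost)
}
```

With a base list of one hundred manifests and twelve added files, the new list has 101 entries and the base list still has 100. The only per-commit cost that grows with the table is the manifest list itself, which holds references rather than file records.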
Code Examples
Modeling the Hierarchy in Rust
The Artemis capstone implements the four levels as Rust types. The structure below is a sketch — the production version has more fields per the Iceberg spec, but the shape is correct.
use std::collections::HashMap;
use anyhow::Result;
use serde::{Deserialize, Serialize};
pub type SnapshotId = u64;
pub type ManifestPath = String;
pub type DataFilePath = String;
/// Catalog row — the only mutable state in the table format. Held in
/// a transactional catalog (Postgres in the Artemis case) that
/// provides linearizable CAS.
#[derive(Debug, Clone)]
pub struct CatalogEntry {
pub table_name: String,
pub current_snapshot_id: SnapshotId,
// The metadata path the catalog points at. Reading the table starts here.
pub metadata_path: String,
}
/// Snapshot metadata file. Immutable once written. References a manifest
/// list that, together with the schema and partition spec, fully
/// determines the table at this version.
#[derive(Debug, Serialize, Deserialize)]
pub struct Snapshot {
pub snapshot_id: SnapshotId,
pub parent_snapshot_id: Option<SnapshotId>,
pub timestamp_ms: i64,
pub schema_id: u32,
pub partition_spec_id: u32,
pub manifest_list_path: String,
/// Summary of what this snapshot did relative to its parent.
/// Used by maintenance and audit tooling.
pub summary: SnapshotSummary,
}
#[derive(Debug, Serialize, Deserialize)]
pub struct SnapshotSummary {
pub operation: SnapshotOp,
pub added_files: u32,
pub removed_files: u32,
pub added_rows: u64,
pub removed_rows: u64,
}
#[derive(Debug, Serialize, Deserialize)]
pub enum SnapshotOp {
Append,
Overwrite,
Replace,
Delete,
}
/// Manifest list: one entry per manifest, with per-manifest summary
/// statistics for the first pruning pass at query planning time.
#[derive(Debug, Serialize, Deserialize)]
pub struct ManifestList {
pub manifests: Vec<ManifestListEntry>,
}
// Clone is derived because a new manifest list copies the base list's
// entries forward (see build_append_snapshot).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ManifestListEntry {
pub manifest_path: ManifestPath,
pub added_data_files: u32,
pub existing_data_files: u32,
pub deleted_data_files: u32,
pub partition_summaries: Vec<PartitionSummary>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PartitionSummary {
pub partition_field: String,
pub lower_bound: Vec<u8>, // serialized partition value
pub upper_bound: Vec<u8>,
pub contains_null: bool,
}
/// Manifest: one entry per data file, with per-file statistics for the
/// second pruning pass. Data files are not modified; the manifest is
/// the level where add/remove decisions are recorded.
#[derive(Debug, Serialize, Deserialize)]
pub struct Manifest {
pub entries: Vec<ManifestEntry>,
}
#[derive(Debug, Serialize, Deserialize)]
pub struct ManifestEntry {
pub status: EntryStatus,
pub data_file: DataFile,
}
#[derive(Debug, Serialize, Deserialize)]
pub enum EntryStatus {
Existing,
Added,
Deleted,
}
// Clone is derived because the snapshot writer copies DataFile records
// into new manifest entries (see build_append_snapshot).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DataFile {
pub path: DataFilePath,
pub file_format: FileFormat,
pub partition: HashMap<String, Vec<u8>>,
pub record_count: u64,
pub file_size_bytes: u64,
pub column_sizes: HashMap<u32, u64>,
pub value_counts: HashMap<u32, u64>,
pub null_counts: HashMap<u32, u64>,
pub lower_bounds: HashMap<u32, Vec<u8>>,
pub upper_bounds: HashMap<u32, Vec<u8>>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum FileFormat {
Parquet,
Avro,
Orc,
}
The shape to notice. Every level holds enough information to make pruning decisions without consulting the level below. The manifest list's per-manifest summaries answer "does this manifest contain partition value X" without reading the manifest. The manifest's per-data-file statistics answer "does this data file contain column value X" without reading the file. Statistics propagate upward at write time so they are available for downward pruning at read time.
Read Planning: Three Pruning Passes
The reader's job is to turn a query — say, WHERE mission_id = 'apollo-7' AND panel_voltage > 28.5 — into a minimal set of data files to read. The metadata hierarchy makes this three sequential filters.
use anyhow::Result;
/// Plan the data files to read for a query against the current snapshot.
/// Returns the data files that *might* contain matching rows; the reader
/// still applies the predicate row-wise after reading, because the
/// statistics are bounds, not exact matches.
async fn plan_query(
catalog: &Catalog,
table: &str,
predicate: &Predicate,
) -> Result<Vec<DataFile>> {
// Step 0: Resolve the current snapshot. One catalog read, one
// metadata file fetch.
let entry = catalog.get_current(table).await?;
let snapshot: Snapshot = read_metadata_file(&entry.metadata_path).await?;
let manifest_list: ManifestList =
read_metadata_file(&snapshot.manifest_list_path).await?;
// Step 1: Prune at the manifest-list level. A manifest whose
// partition summary doesn't overlap the predicate's partition
// constraints is skipped entirely. For mission_id='apollo-7' this
// typically drops ~95% of manifests.
let candidate_manifests: Vec<_> = manifest_list
.manifests
.iter()
.filter(|m| partition_summary_overlaps(&m.partition_summaries, predicate))
.collect();
// Step 2: Prune at the manifest level. Open each candidate manifest
// and apply the predicate to per-data-file statistics. For
// panel_voltage > 28.5, drop files whose upper_bound for that column
// is <= 28.5.
let mut candidate_files: Vec<DataFile> = Vec::new();
for manifest_entry in &candidate_manifests {
let manifest: Manifest = read_metadata_file(&manifest_entry.manifest_path).await?;
for entry in manifest.entries {
if matches!(entry.status, EntryStatus::Existing | EntryStatus::Added) {
if data_file_might_match(&entry.data_file, predicate) {
candidate_files.push(entry.data_file);
}
}
}
}
// Step 3 (not shown): The reader will further prune row groups
// within each data file using the Parquet footer statistics from
// M1. That step happens at Parquet open time, not metadata
// planning time, so it's outside this function.
Ok(candidate_files)
}
The discipline this enforces. Each pruning step reads only what the previous step did not prune. The reader never lists the object store, never reads a data file it has already proven cannot contain matching rows, never opens a manifest it has already proven contains no matching files. The metadata cost is proportional to what survived pruning; the data cost is proportional to what survived further pruning. The compounding is what makes the format usable at the 100k-file scale.
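The planner above leaves partition_summary_overlaps undefined. One possible shape for an equality predicate is sketched below, under the assumption that partition values are UTF-8 strings serialized as raw bytes, so byte-wise comparison matches value order. `summary_may_contain` is a hypothetical name, and the struct is a local copy of PartitionSummary without the serde derives so that the sketch stands alone.

```rust
/// Local copy of the lesson's PartitionSummary, without serde derives.
pub struct PartitionSummary {
    pub partition_field: String,
    pub lower_bound: Vec<u8>,
    pub upper_bound: Vec<u8>,
    pub contains_null: bool,
}

/// Returns false only when a summary for `field` proves that `value`
/// lies outside the manifest's partition range. A manifest with no
/// summary for `field` cannot be pruned, so the answer is true.
pub fn summary_may_contain(
    summaries: &[PartitionSummary],
    field: &str,
    value: &[u8],
) -> bool {
    summaries
        .iter()
        .filter(|s| s.partition_field == field)
        // Slices compare lexicographically, byte by byte.
        .all(|s| s.lower_bound.as_slice() <= value && value <= s.upper_bound.as_slice())
}
```

The conservative default matters: a missing summary must never cause a manifest to be skipped, because skipping is only safe when the statistics prove the absence of matching rows.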
Writing a Snapshot
The write-side counterpart: producing a new snapshot from a base snapshot plus a set of file changes. The function is metadata-only — the data files are assumed already written.
use anyhow::Result;
use std::time::SystemTime;
/// Produce a new snapshot from the base snapshot plus a set of newly
/// added data files. The snapshot, manifest list, and new manifest are
/// written to the object store; the catalog CAS happens in the caller
/// (Lesson 3).
pub async fn build_append_snapshot(
base: &Snapshot,
base_manifest_list: &ManifestList,
new_data_files: Vec<DataFile>,
metadata_dir: &str,
) -> Result<(Snapshot, String)> {
let now_ms = SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)?
.as_millis() as i64;
let new_snapshot_id = next_snapshot_id();
// 1. Build the new manifest from the new data files. One manifest
// per commit is Iceberg's default; manifest compaction (Module 6)
// periodically consolidates small manifests.
let new_manifest = Manifest {
entries: new_data_files
.iter()
.cloned()
.map(|df| ManifestEntry {
status: EntryStatus::Added,
data_file: df,
})
.collect(),
};
let new_manifest_path = format!(
"{metadata_dir}/m{new_snapshot_id}.avro"
);
write_metadata_file(&new_manifest_path, &new_manifest).await?;
// 2. Build the new manifest list: the base list's entries plus an
// entry for the new manifest. The base list is not modified.
let new_manifest_list = ManifestList {
manifests: {
let mut all = base_manifest_list.manifests.clone();
all.push(ManifestListEntry {
manifest_path: new_manifest_path.clone(),
added_data_files: new_data_files.len() as u32,
existing_data_files: 0,
deleted_data_files: 0,
partition_summaries: summarize_partitions(&new_data_files),
});
all
},
};
let new_list_path = format!(
"{metadata_dir}/ml{new_snapshot_id}.avro"
);
write_metadata_file(&new_list_path, &new_manifest_list).await?;
// 3. Build the snapshot record.
let snapshot = Snapshot {
snapshot_id: new_snapshot_id,
parent_snapshot_id: Some(base.snapshot_id),
timestamp_ms: now_ms,
schema_id: base.schema_id,
partition_spec_id: base.partition_spec_id,
manifest_list_path: new_list_path,
summary: SnapshotSummary {
operation: SnapshotOp::Append,
added_files: new_data_files.len() as u32,
removed_files: 0,
added_rows: new_data_files.iter().map(|f| f.record_count).sum(),
removed_rows: 0,
},
};
let snapshot_path = format!(
"{metadata_dir}/s{new_snapshot_id}.json"
);
write_metadata_file(&snapshot_path, &snapshot).await?;
Ok((snapshot, snapshot_path))
}
What to notice. Three new files are written; nothing existing is modified. The new snapshot's parent_snapshot_id records the version this commit was based on — Lesson 3's CAS uses this to detect conflicts when two writers concurrently base their commits on the same parent. The function returns the snapshot and its path; the caller's job is to perform the catalog CAS that makes the snapshot visible.
Key Takeaways
- The table format's metadata is a four-level hierarchy: catalog pointer → snapshot → manifest list → manifests → data files. Each level fans out to the next; pruning at each level compounds.
- Snapshots are immutable. A commit produces new metadata files; old metadata files remain. This is what makes time travel and concurrent reads cheap, and what makes commits cost proportional to the change size, not the table size.
- Manifests are the pruning unit. Manifest-list summaries enable partition-level pruning without reading manifests; manifest entries enable file-level pruning without reading data files; Parquet footers enable row-group pruning without reading data pages. The three layers of statistics propagate up at write time so they can be used for pruning down at read time.
- The catalog provides external atomicity. Object stores don't offer cross-object CAS; the table format pushes the CAS requirement to a transactional catalog (Postgres, Hive Metastore, Nessie). The catalog's job is exactly get_current and compare_and_swap.
- A commit's cost is proportional to the change. Write the new data files, write one new manifest, write one new manifest list (referencing the base's manifests plus the new one), write the snapshot, CAS the catalog. The base manifests and old data files are not rewritten.
Lesson 3 — Atomic Commits via Optimistic Concurrency
Module: Data Lakes — M02: Open Table Formats
Position: Lesson 3 of 3
Source: Designing Data-Intensive Applications by Martin Kleppmann and Chris Riccomini, Chapter 7 ("Transactions: Pessimistic versus optimistic concurrency control", "Serializable Snapshot Isolation"). Apache Iceberg specification, "Commit Process". Database Internals by Alex Petrov, Chapter 5 ("Concurrency Control").
Source note: The transaction-theory framing is grounded in DDIA and Database Internals. The commit-protocol details follow the Iceberg spec.
Context
Lesson 2 ended with a writer that produced new metadata files for a commit and was about to ask the catalog for an atomic pointer swap. This lesson is about the pointer swap. The pointer swap is the single piece of the table format that needs transactional semantics; everything else is decoupled work that can be retried, abandoned, or run concurrently without affecting readers. Getting the pointer swap right is what makes the entire metadata hierarchy work; getting it wrong is how lakehouses lose data.
The right primitive is optimistic concurrency control. The writer reads the current pointer value, does its work assuming nothing else will change the pointer in the meantime, then attempts to swap the pointer from the value it read to its new value. If the swap succeeds, the commit is durable. If the swap fails — because another writer beat it to the punch — the writer aborts and retries from the start: read the new pointer, rebuild the commit on top of it, try again. DDIA (Ch. 7) introduces this as the optimistic alternative to pessimistic locking: "instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened; if so, the transaction is aborted and has to be retried."
The lesson develops the commit protocol end to end. The compare-and-swap primitive the catalog must provide. The retry loop that handles conflicts. The narrower class of conflicts that retries cannot fix safely (and what to do about them). The throughput characteristics of optimistic concurrency under contention. By the end of this lesson the engineer has a full mental model of the commit path and can implement it; the capstone project does.
Core Concepts
The Compare-and-Swap Primitive
The catalog must provide, atomically, the operation:
compare_and_swap(table, expected_old_snapshot_id, new_snapshot_id):
if catalog[table].current == expected_old_snapshot_id:
catalog[table].current = new_snapshot_id
return Ok
else:
return Err(CommitConflict { actual: catalog[table].current })
The "atomically" is doing all the work. DDIA (Ch. 7, "Single-object writes") calls this the conditional write or compare-and-set operation, and notes that it is "similar to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency." For our purposes, the property the writer relies on is that no two writers can both observe expected_old as the current value and both have their writes succeed. Exactly one writer's CAS succeeds; the other writer's CAS fails with the actual current value, which the failed writer can use to retry.
In practice the CAS is implemented in whatever transactional primitive the catalog exposes. The Artemis archive's Postgres-backed catalog uses a single SQL UPDATE ... WHERE current_snapshot_id = $expected_old; Postgres's row-level locking makes the statement atomic, and an update that matches zero rows signals a conflict. Hive Metastore uses its own transaction protocol. Nessie uses Git-like reference updates. DynamoDB uses a conditional UpdateItem. The mechanism differs; the contract is the same: linearizable CAS on a single small piece of state.
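The contract can be exercised with an in-memory stand-in. This is a sketch, not the Artemis catalog: `CatalogPointer` is a hypothetical single-table type, and the `Mutex` plays the role that the row lock plays in a transactional database.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub struct CommitConflict {
    /// The snapshot ID the catalog actually held when the CAS ran.
    pub actual: u64,
}

/// Hypothetical single-table catalog pointer.
pub struct CatalogPointer {
    current: Mutex<u64>,
}

impl CatalogPointer {
    pub fn new(initial: u64) -> Self {
        Self { current: Mutex::new(initial) }
    }

    pub fn get_current_snapshot(&self) -> u64 {
        *self.current.lock().unwrap()
    }

    /// The compare and the write happen under one lock acquisition, so
    /// no other writer can slip in between them.
    pub fn compare_and_swap_snapshot(
        &self,
        expected_old: u64,
        new: u64,
    ) -> Result<(), CommitConflict> {
        let mut current = self.current.lock().unwrap();
        if *current == expected_old {
            *current = new;
            Ok(())
        } else {
            Err(CommitConflict { actual: *current })
        }
    }
}
```

If several threads all attempt a CAS from the same expected snapshot, exactly one succeeds and every loser is told the winner's snapshot ID, which is the value it needs in order to rebase.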
The CAS is also where Iceberg's design and Delta's design converge in spirit and diverge in mechanism. Iceberg names the operation directly: catalog CAS, with the catalog providing the primitive. Delta's transaction log puts the equivalent operation at the storage layer: "atomically create a new log file with sequence number N+1," which only works on filesystems that provide atomic create-if-not-exists. On S3, which historically lacked that primitive (conditional writes are a recent addition), Delta layers a coordination service (DynamoDB) on top to provide it. The result is the same primitive in a different shape.
The Commit Loop
The full commit protocol, written as the writer sees it, is a loop:
loop:
1. Read the current snapshot from the catalog. This is the base.
2. Build the new snapshot on top of the base (Lesson 2).
3. Write the new metadata files (snapshot, manifest list, manifest).
4. CAS the catalog from base.id to new.id.
Success → commit is durable; return.
Conflict → another writer committed in the meantime.
Retry: go to step 1.
The retry path is what makes optimistic concurrency robust. When the CAS fails, the writer learns the actual current snapshot ID. The writer then re-reads the new current snapshot, re-bases the commit on it, writes new metadata files, and retries the CAS. The retry is not a re-run of the same commit — it is a fresh commit built on top of whatever the table is now. The new metadata files written in step 3 of the failed attempt are orphaned (left in the object store but referenced by nothing); the next maintenance run (Module 6's orphan-file cleanup) will delete them. The data files themselves are not orphaned if the same set of new rows is being added; the rebased commit references them again.
Two subtleties the production loop must handle. First, the data files written before the loop began should not be rewritten on each retry: they are valid Parquet, content-addressed by path, and remain valid regardless of how many CAS attempts the metadata commit takes. Only the metadata files need to be regenerated per attempt. Second, the retry loop must be bounded: an unbounded loop under high contention is a livelock waiting to happen. The Artemis writer caps retries at sixteen attempts with exponential backoff plus jitter; commits that exhaust the retry budget fail loudly to the caller, who decides whether to wait longer or escalate.
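A bounded version of the loop can be sketched with closures standing in for the catalog and the metadata writer. The names `commit_with_retry`, `backoff`, `BASE_DELAY`, and `MAX_DELAY` are hypothetical, as are the delay values. Jitter is omitted because the standard library has no random number generator; the `pause` callback is where a production writer would add it.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CommitError {
    RetriesExhausted { attempts: u32 },
}

const BASE_DELAY: Duration = Duration::from_millis(10);
const MAX_DELAY: Duration = Duration::from_secs(2);

/// Exponential backoff: base * 2^attempt, clamped to `cap`.
pub fn backoff(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}

/// Run the commit loop at most `max_attempts` times.
/// `read_base` returns the current snapshot ID (step 1).
/// `build_and_cas` rebuilds the metadata on that base and attempts the
/// CAS (steps 2 to 4), returning the new ID or the actual current ID.
pub fn commit_with_retry(
    max_attempts: u32,
    mut read_base: impl FnMut() -> u64,
    mut build_and_cas: impl FnMut(u64) -> Result<u64, u64>,
    mut pause: impl FnMut(Duration),
) -> Result<(u64, u32), CommitError> {
    for attempt in 0..max_attempts {
        let base = read_base();
        match build_and_cas(base) {
            Ok(new_id) => return Ok((new_id, attempt + 1)),
            // Back off only if another attempt will follow.
            Err(_actual) if attempt + 1 < max_attempts => {
                pause(backoff(attempt, BASE_DELAY, MAX_DELAY));
            }
            Err(_actual) => {}
        }
    }
    Err(CommitError::RetriesExhausted { attempts: max_attempts })
}
```

Each iteration re-reads the base before rebuilding, so a retry is a fresh commit on the table's current state rather than a replay of the failed one. Exhausting the budget returns an error instead of looping, which leaves the wait-or-escalate decision with the caller.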
Conflict Detection: Append vs Overwrite
Not all commit conflicts are equal. The CAS-based detection treats every concurrent commit as a conflict, but in the optimistic-concurrency literature there is a richer notion: did the two concurrent commits actually conflict on the data, or did they happen to commit at the same time without touching the same rows?
The richest version of the question is the same one DDIA (Ch. 7) develops for serializable snapshot isolation: did one transaction's read overlap another transaction's write? In the lakehouse case the question specializes. Append-only commits don't conflict with each other on data — two writers each adding new data files to disjoint partitions can in principle both succeed, because neither one invalidates the other's view of the table. Overwrite commits conflict with everything — a commit that rewrites file F invalidates any concurrent commit that was based on a snapshot containing F, because the rebased commit would need to be aware that F is gone.
Iceberg's commit protocol detects this distinction at retry time. When an append-only commit's CAS fails, the writer re-reads the new current snapshot, observes that the conflicting commit was append-only and added disjoint files, and constructs the rebased commit as "the previous new files plus the conflicting commit's new files." No data files need to be re-written. The retry is cheap and almost always succeeds.
When an overwrite commit's CAS fails — or when an append-only commit conflicts with an intervening overwrite — the rebase has to do real work. The writer must re-read the table's current file set, determine which files its overwrite would now affect (potentially a different set than originally planned), and produce a new commit. In the worst case, the overwrite is no longer applicable and the writer must reject the commit and surface the conflict to the application layer. This is the same shape as DDIA's "decisions based on an outdated premise" pattern: the writer's intent was framed in terms of a table state that no longer exists, and re-framing the intent in terms of the current state requires application-level judgment.
The Artemis ingestion writer is almost entirely append-only (new downlinks add new files; old files are not modified). Conflict resolution is straightforward and cheap. The correction-job writer that runs occasionally to fix bad ingests is overwrite-based; its conflict path requires re-evaluating which files to rewrite against the current snapshot. The two writers compose: an in-flight correction blocks ingest only briefly during the CAS, never during data file writing.
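The retry-time decision can be sketched as a classification over the two commits. This is an illustration, not Iceberg's validation logic: `Op`, `Change`, `Rebase`, `change`, and `classify` are hypothetical names, and the rejection rule (both commits removed the same file) is an assumption chosen to make the worst case concrete.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Append,
    Overwrite,
}

/// Hypothetical summary of a commit: its kind and the files it touches.
pub struct Change {
    pub op: Op,
    pub added: HashSet<String>,
    pub removed: HashSet<String>,
}

#[derive(Debug, PartialEq)]
pub enum Rebase {
    /// Re-reference the same data files on top of the new base.
    Cheap,
    /// Re-plan against the current file set before retrying.
    Revalidate,
    /// The commit's premise is gone; surface to the application.
    Reject,
}

/// Convenience constructor for building changes from string slices.
pub fn change(op: Op, added: &[&str], removed: &[&str]) -> Change {
    let to_set = |xs: &[&str]| xs.iter().map(|s| s.to_string()).collect::<HashSet<String>>();
    Change { op, added: to_set(added), removed: to_set(removed) }
}

/// Decide how `mine` should rebase after `theirs` won the CAS.
pub fn classify(mine: &Change, theirs: &Change) -> Rebase {
    if mine.op == Op::Append && theirs.op == Op::Append {
        return Rebase::Cheap;
    }
    // Assumption for this sketch: if the winning commit already removed
    // a file this commit planned to replace, the intent no longer applies.
    if !mine.removed.is_disjoint(&theirs.removed) {
        return Rebase::Reject;
    }
    Rebase::Revalidate
}
```

Two ingest appends rebase cheaply. An append that lost to a correction job, or a correction job that lost to an append, must re-plan. Two correction jobs that both rewrite the same file cannot both apply, so the loser is rejected.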
Throughput Under Contention
Optimistic concurrency's well-known weakness, as DDIA (Ch. 7) puts it, is that it "performs badly if there is high contention." When many writers attempt to commit concurrently against the same table, most CAS attempts fail and most writers retry, producing the classic optimistic livelock pattern: more retries → more contention → more retries.
For a typical lakehouse the contention rate is low — most tables have a single writer (the ingestion pipeline) and many concurrent readers. The Artemis cold archive has exactly one writer per table during steady-state operation; commit conflicts arise only during maintenance windows when multiple jobs run concurrently. Throughput is dictated by the catalog's CAS latency (a few milliseconds for Postgres on the same host as the writer) and the metadata-write cost (tens to hundreds of milliseconds depending on manifest list size). Commit rates of one per second are easy; tens per second are achievable; hundreds per second require care.
When contention does become a real bottleneck, the typical interventions are: (1) batch many ingest events into one commit (the Artemis writer accumulates downlinks for thirty seconds before committing, capping commit rate at ~2 per minute even under heavy downlink); (2) shard the table by some partition key so concurrent writers commit against different table identities (the Artemis archive has one Iceberg table per mission rather than one global table); (3) consolidate writers into a single committer service that orders incoming commits and applies them sequentially (the Delta Lake reference deployment uses this pattern). All three approaches reduce the per-table commit rate to something CAS-friendly. None of them works around the fundamental property that linearizable CAS bounds throughput; they shift the work onto a different scope.
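Intervention (1), batching, can be sketched as a small accumulator. This is an illustration under stated assumptions: `CommitBatcher` is a hypothetical name, files are plain path strings, and the caller passes in the current time so the logic stays deterministic.

```rust
use std::time::{Duration, Instant};

/// Accumulates data-file paths and releases them as one batch per window.
pub struct CommitBatcher {
    window: Duration,
    window_start: Option<Instant>,
    pending: Vec<String>,
}

impl CommitBatcher {
    pub fn new(window: Duration) -> Self {
        Self { window, window_start: None, pending: Vec::new() }
    }

    /// Record a newly written data file. Returns the whole batch once the
    /// window that began with the first pending file has elapsed.
    pub fn push(&mut self, file: String, now: Instant) -> Option<Vec<String>> {
        // The window opens when the first file of a batch arrives.
        let start = *self.window_start.get_or_insert(now);
        self.pending.push(file);
        if now.duration_since(start) >= self.window {
            self.window_start = None;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

With a thirty-second window the batcher emits at most one batch per thirty seconds, which matches the roughly two commits per minute quoted above. A production version would also need a timer-driven flush so that a batch is not stranded when downlinks stop arriving; the sketch omits that.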
Why Not Pessimistic Locking?
A natural alternative to optimistic CAS is pessimistic locking: the writer takes a table-level lock before reading the base snapshot, holds it while it writes its files and metadata, and releases it after the catalog update. This trades the retry cost of optimistic concurrency for the contention cost of locking, but the cost that matters in practice is the unbounded lock hold when a writer fails or stalls.
Lakehouse writes are slow in absolute terms (writing twelve Parquet files of 128 MB each takes seconds to tens of seconds). A pessimistic lock held for the duration of a commit is therefore held for tens of seconds at a time. A crashed writer holding such a lock requires a separate lock-revocation mechanism with its own correctness concerns: lock leases, fencing tokens, the whole machinery DDIA (Ch. 8) develops for the distributed-lock problem. Optimistic concurrency avoids this entirely by making the only contended primitive a millisecond-scale CAS; concurrent writers do their slow work in parallel and only coordinate at the very end. The cost is that some writers' work is thrown away on conflict.
For workloads that are predominantly append-only against tables with single writers, optimistic concurrency is strictly the right choice. For workloads with high overwrite contention, a single-writer-per-table discipline (intervention 3 above) eliminates contention without paying for distributed locking. The lakehouse community has converged on this pattern; locking-based table formats are rare.
Code Examples
A Postgres-Backed Catalog with CAS
The Artemis archive's catalog is a single Postgres row per table. The CAS is a single UPDATE, atomic against Postgres's per-row lock.
use anyhow::Result;
use sqlx::PgPool;

#[derive(Debug, Clone, Copy)]
pub struct CommitConflict {
    pub expected_old: SnapshotId,
    pub actual_current: SnapshotId,
}

#[derive(Debug)]
pub struct PostgresCatalog {
    pool: PgPool,
}

impl PostgresCatalog {
    pub async fn get_current(&self, table: &str) -> Result<CatalogEntry> {
        let row = sqlx::query!(
            "SELECT current_snapshot_id, metadata_path
             FROM iceberg_tables
             WHERE table_name = $1",
            table,
        )
        .fetch_one(&self.pool)
        .await?;
        Ok(CatalogEntry {
            table_name: table.to_string(),
            current_snapshot_id: row.current_snapshot_id as u64,
            metadata_path: row.metadata_path,
        })
    }

    /// Atomically swap the table's current_snapshot_id from
    /// `expected_old` to `new`. Returns Ok(()) on success;
    /// Err(CommitError::Conflict(..)) if the row's current value is not
    /// `expected_old` at the time the UPDATE runs.
    pub async fn compare_and_swap(
        &self,
        table: &str,
        expected_old: SnapshotId,
        new: SnapshotId,
        new_metadata_path: &str,
    ) -> Result<(), CommitError> {
        // The UPDATE is atomic on a single row. Under Postgres's default
        // READ COMMITTED isolation, a concurrent UPDATE against the same
        // row waits on the row lock, then re-checks its WHERE clause
        // against the committed row. The loser's predicate no longer
        // matches, so it sees rows_affected = 0.
        let result = sqlx::query!(
            "UPDATE iceberg_tables
             SET current_snapshot_id = $1, metadata_path = $2
             WHERE table_name = $3
               AND current_snapshot_id = $4",
            new as i64,
            new_metadata_path,
            table,
            expected_old as i64,
        )
        .execute(&self.pool)
        .await
        // sqlx::Error is not anyhow::Error; convert before wrapping.
        .map_err(|e| CommitError::Database(e.into()))?;

        if result.rows_affected() == 1 {
            return Ok(());
        }

        // CAS failed. Read the current value to give the caller a
        // useful error.
        let current = self
            .get_current(table)
            .await
            .map_err(CommitError::Database)?;
        Err(CommitError::Conflict(CommitConflict {
            expected_old,
            actual_current: current.current_snapshot_id,
        }))
    }
}

#[derive(Debug)]
pub enum CommitError {
    Conflict(CommitConflict),
    Database(anyhow::Error),
}
What to notice. The CAS is one SQL statement. Postgres's row-level locking provides the atomicity (under the default READ COMMITTED isolation, a blocked UPDATE re-checks its WHERE clause against the committed row); no application-level locking is needed. A successful CAS costs a single round-trip to the database, typically under 2 ms on the same network; a failed CAS costs a second round-trip for the follow-up read. The contention behavior is Postgres's normal row-contention behavior: concurrent writers against the same row queue briefly at the lock, and the UPDATE succeeds for one of them and matches zero rows for the others. rows_affected = 0 is the conflict signal; the follow-up read produces the actual current value for the retry path.
The Full Commit Retry Loop
The retry loop puts the pieces from the previous lessons together: build a commit, attempt the CAS, rebase on conflict, retry. The example uses a fixed retry budget with exponential backoff and jitter.
use anyhow::{Context, Result};
use rand::Rng;
use std::time::Duration;

const MAX_RETRIES: u32 = 16;
const INITIAL_BACKOFF_MS: u64 = 50;

/// Append `new_data_files` to `table` with optimistic-concurrency retries.
/// The data files have already been written to the object store; this
/// function manages only the metadata commit.
pub async fn commit_append(
    catalog: &PostgresCatalog,
    table: &str,
    metadata_dir: &str,
    new_data_files: &[DataFile],
) -> Result<SnapshotId> {
    for attempt in 0..MAX_RETRIES {
        // 1. Read the current state.
        let entry = catalog.get_current(table).await?;
        let base_snapshot: Snapshot = read_metadata_file(&entry.metadata_path).await?;
        let base_manifest_list: ManifestList =
            read_metadata_file(&base_snapshot.manifest_list_path).await?;

        // 2. Build the new snapshot on top of the current state.
        let (new_snapshot, new_snapshot_path) = build_append_snapshot(
            &base_snapshot,
            &base_manifest_list,
            new_data_files.to_vec(),
            metadata_dir,
        )
        .await
        .context("building append snapshot")?;

        // 3. Attempt the CAS.
        match catalog
            .compare_and_swap(
                table,
                base_snapshot.snapshot_id,
                new_snapshot.snapshot_id,
                &new_snapshot_path,
            )
            .await
        {
            Ok(()) => {
                // Commit durable. The metadata files we wrote in step 2
                // are now referenced by the catalog; no cleanup needed.
                return Ok(new_snapshot.snapshot_id);
            }
            Err(CommitError::Conflict(c)) => {
                // Another writer committed between our read of the base
                // and our CAS attempt. Log the conflict, back off, retry.
                tracing::warn!(
                    table = %table,
                    expected = c.expected_old,
                    actual = c.actual_current,
                    attempt,
                    "commit conflict; retrying"
                );
                // The metadata files we wrote in step 2 are orphans now.
                // Orphan-file cleanup (Module 6) handles them; we don't
                // need to delete them here. The data files are *not*
                // orphans: they'll be referenced by our next attempt's
                // commit.

                // Exponential backoff with jitter to avoid lockstep
                // retries when many writers are contending.
                let base_ms = INITIAL_BACKOFF_MS * 2u64.pow(attempt.min(8));
                let jitter_ms = rand::thread_rng().gen_range(0..=base_ms);
                tokio::time::sleep(Duration::from_millis(base_ms + jitter_ms)).await;
            }
            Err(CommitError::Database(e)) => {
                // Real database error, not a conflict. Surface it.
                return Err(e.context("catalog CAS"));
            }
        }
    }
    Err(anyhow::anyhow!(
        "commit failed after {} retries on table {}",
        MAX_RETRIES,
        table,
    ))
}
The discipline. Data files are written once, before the retry loop, outside this function. The retry loop's per-iteration work is just the metadata: read base, build new metadata, attempt CAS. A retry costs the metadata files (some tens of KB written), the catalog read (one DB round-trip), the catalog write (one DB round-trip), and the backoff sleep. A typical retry under low contention completes in under 200 ms. Bounded retry plus exponential-backoff-with-jitter keeps the system stable even under bursts of contention.
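The backoff schedule in the loop above is easy to tabulate. The following sketch reuses the same two constants and computes the sleep bounds per attempt plus the worst-case total; the function names are hypothetical and exist only for this illustration.

```rust
const INITIAL_BACKOFF_MS: u64 = 50;
const MAX_RETRIES: u32 = 16;

/// Inclusive (min, max) sleep in milliseconds after a conflict on `attempt`.
/// Mirrors the retry loop: base = 50 ms * 2^min(attempt, 8), plus jitter
/// drawn from 0..=base.
pub fn backoff_bounds_ms(attempt: u32) -> (u64, u64) {
    let base = INITIAL_BACKOFF_MS * 2u64.pow(attempt.min(8));
    (base, base * 2)
}

/// Worst-case total sleep if every one of the MAX_RETRIES attempts conflicts
/// and every jitter draw lands on its maximum.
pub fn worst_case_total_sleep_ms() -> u64 {
    (0..MAX_RETRIES).map(|a| backoff_bounds_ms(a).1).sum()
}
```

The first retry sleeps 50 to 100 ms, and the exponent cap holds later retries at 12.8 to 25.6 s. Summed over all sixteen attempts, the worst case is 230,300 ms, just under four minutes, which is the figure to weigh when documenting the retry bound against the thirty-second batching window.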
The structured-log line on conflict (tracing::warn!) is critical for operational visibility. Sustained conflict rates above a threshold are an indicator that two writers are racing against a table that should be partitioned; the Artemis observability stack alerts on commit_conflict_total rate.
A Test That Exercises the CAS Under Concurrent Writers
The integration test that validates the CAS does what it claims: two concurrent writers each attempt a commit; exactly one succeeds, the other observes the conflict and either retries or fails per its policy.
use anyhow::Result;
use std::sync::Arc;
use tokio::sync::Barrier;

/// Spawn two concurrent committers against the same table, both built on
/// the same base snapshot. Assert that exactly one CAS succeeds and the
/// other returns CommitError::Conflict. Retry-then-succeed behavior is
/// covered by a separate test of the full commit_append loop.
#[tokio::test]
async fn cas_serializes_concurrent_writers() -> Result<()> {
    let catalog = test_catalog().await?;
    let metadata_dir = test_metadata_dir();
    let table = "telemetry_test";
    setup_initial_table(&catalog, table, &metadata_dir).await?;

    // A barrier ensures both writers reach the CAS at the same time;
    // without it the test is racy in the wrong direction (timing-based
    // rather than CAS-based).
    let barrier = Arc::new(Barrier::new(2));
    let cat1 = catalog.clone();
    let cat2 = catalog.clone();
    let dir1 = metadata_dir.clone();
    let dir2 = metadata_dir.clone();
    let b1 = barrier.clone();
    let b2 = barrier.clone();

    let writer_a = tokio::spawn(async move {
        let files = make_test_files("A", 3).await?;
        // Read the base before the barrier so both writers have the same base.
        let entry = cat1.get_current("telemetry_test").await?;
        let base = read_snapshot(&entry.metadata_path).await?;
        let base_ml = read_manifest_list(&base.manifest_list_path).await?;
        let (new, path) = build_append_snapshot(&base, &base_ml, files, &dir1).await?;
        b1.wait().await; // line up at the CAS
        let cas = cat1
            .compare_and_swap("telemetry_test", base.snapshot_id, new.snapshot_id, &path)
            .await;
        // Setup errors propagate through `?`; the CAS outcome is returned
        // as a value so the test can inspect it.
        Ok::<_, anyhow::Error>(cas)
    });
    let writer_b = tokio::spawn(async move {
        let files = make_test_files("B", 3).await?;
        let entry = cat2.get_current("telemetry_test").await?;
        let base = read_snapshot(&entry.metadata_path).await?;
        let base_ml = read_manifest_list(&base.manifest_list_path).await?;
        let (new, path) = build_append_snapshot(&base, &base_ml, files, &dir2).await?;
        b2.wait().await;
        let cas = cat2
            .compare_and_swap("telemetry_test", base.snapshot_id, new.snapshot_id, &path)
            .await;
        Ok::<_, anyhow::Error>(cas)
    });

    // First `?` handles a panicked task, second handles a setup error.
    let result_a = writer_a.await??;
    let result_b = writer_b.await??;

    // Exactly one of the two should have succeeded; the other should
    // have observed a CommitConflict.
    let successes = [&result_a, &result_b].iter().filter(|r| r.is_ok()).count();
    let conflicts = [&result_a, &result_b]
        .iter()
        .filter(|r| matches!(r, Err(CommitError::Conflict(_))))
        .count();
    assert_eq!(successes, 1, "exactly one writer should succeed");
    assert_eq!(conflicts, 1, "exactly one writer should see a conflict");
    Ok(())
}
The test asserts the property that matters: exactly one writer's CAS succeeds, exactly one observes the conflict. The retry-then-succeed behavior is tested separately in a longer-running test that exercises the full commit_append retry loop under a steady stream of concurrent commits. Both tests are in the capstone's integration suite.
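The same property can be demonstrated without a database or an async runtime. The sketch below is a stand-in, not part of the capstone suite: an AtomicU64 plays the role of the catalog's snapshot pointer, std threads play the two writers, and `race_two_writers` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Race two writers from the same base snapshot ID. Returns how many CAS
/// attempts succeeded and the final value of the pointer.
pub fn race_two_writers() -> (usize, u64) {
    let pointer = Arc::new(AtomicU64::new(1)); // current snapshot ID = 1
    let barrier = Arc::new(Barrier::new(2));

    let handles: Vec<_> = vec![2u64, 3u64]
        .into_iter()
        .map(|new_id| {
            let pointer = Arc::clone(&pointer);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Both writers read the base before either attempts the CAS.
                let base = pointer.load(Ordering::SeqCst);
                barrier.wait();
                pointer
                    .compare_exchange(base, new_id, Ordering::SeqCst, Ordering::SeqCst)
                    .is_ok()
            })
        })
        .collect();

    let successes = handles
        .into_iter()
        .map(|h| h.join().expect("writer thread panicked"))
        .filter(|ok| *ok)
        .count();
    (successes, pointer.load(Ordering::SeqCst))
}
```

Because the barrier sits after the base read, neither CAS can run until both writers hold base = 1, so exactly one compare_exchange succeeds regardless of scheduling. Which writer wins is not determined.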
Key Takeaways
- The table format's only transactionally-significant operation is the catalog CAS that swaps the current snapshot pointer. Everything else (data files, manifests, snapshot metadata) is decoupled work that can be retried or abandoned without affecting readers.
- Optimistic concurrency control is the right primitive for lakehouse commits: writers do their slow work in parallel, coordinate only at the very end, and retry on conflict. The alternative — pessimistic locking — would hold locks for the duration of multi-file writes, which is incompatible with object-store write latencies and the failure modes of crashed lock holders.
- Conflicts are detected by CAS but resolved by rebase. A conflicting commit reads the new current snapshot, rebuilds its metadata on top of the new state, and retries. For append-only commits this is cheap and almost always succeeds. For overwrite commits the rebase may require re-evaluating which files to rewrite.
- Bounded retries with exponential backoff and jitter keep the system stable under bursts of contention. Sustained conflict rates indicate that two writers are racing against a table that should be split into smaller tables or fed through a single-writer coordinator.
- Throughput under contention is bounded by catalog CAS rate, not by data write rate. For workloads that need hundreds of commits per second per table, the typical fix is batching, table-level sharding, or a single-writer committer — not faster locking.
Capstone — Mission Archive Table
Module: Data Lakes — M02: Open Table Formats Estimated effort: 1–2 weeks of focused work Prerequisite: All three lessons in this module completed; all three quizzes passed (≥ 70%). The Module 1 capstone (Artemis archive Parquet writer) is the producer this table format wraps.
Mission Briefing
From: Cold Archive Platform Lead
ARCHIVE BRIEFING — RC-2026-04-DL-002
SUBJECT: Mission archive table format, Iceberg-shaped metadata layer
over the Parquet writer from M01.
PRIORITY: P1 — required to unblock concurrent ingest from the daily
downlink pipeline and the weekly correction job.
The Module 1 Parquet writer is in production for the new cold archive, but the storage layer is still a directory of files. We are seeing the failure modes Lesson 1 enumerated: concurrent runs of the daily ingest and the weekly correction job race; analyst queries against the archive return different result sets across consecutive runs; a corrupted ingest from last Tuesday left the directory in a state that took manual cleanup. The table format layer is the fix.
Your work is the minimal Iceberg-shaped table format that solves these problems. The goal is operational correctness over feature completeness — we are not building an Iceberg-spec-compliant implementation; we are building the metadata layer that makes the archive transactional. The capstone produces a Rust crate, artemis-table-format, that the ingestion pipeline and the correction job can both use without coordination.
Module 3 will add partition layout on top of this; Module 4 will add time-travel reads; Module 5 will plug a query engine in; Module 6 will add the maintenance jobs (compaction, snapshot expiration). This module's project is the foundation those modules build on. Get the metadata model and the commit protocol right.
What You're Building
A Rust crate, artemis-table-format, exposing:
- A Table struct holding a reference to the catalog and the current snapshot, with methods current_snapshot(), read_plan(predicate), commit_append(data_files), commit_overwrite(removed, added).
- A PostgresCatalog (or equivalent; SQLite is acceptable for the capstone) providing linearizable CAS on the per-table snapshot pointer.
- Concrete types for Snapshot, ManifestList, Manifest, DataFile matching the Iceberg-spec-aligned shape from Lesson 2.
- Serialization to and from a metadata directory in the same object store as the data files (local filesystem is acceptable for the capstone; the production version uses S3-compatible object_store).
- A CLI binary, artemis-table, with subcommands init, info, commit (consumes a list of data files from stdin), history (lists snapshots), inspect-manifest (dumps a manifest's contents).
The crate must support concurrent commits from at least two writers against the same table without data loss or visibility anomalies, demonstrated in the integration test suite.
Functional Requirements
- Snapshot immutability. Once written, a snapshot file is never modified. Commits write new snapshot files. The metadata directory's contents grow over time.
- Catalog CAS. The commit's only transactionally-significant operation is a single linearizable CAS on the per-table snapshot pointer. The CAS implementation can be any transactional backend (Postgres, SQLite with BEGIN IMMEDIATE, in-memory Mutex<HashMap> for unit tests).
- Optimistic retry loop. Failed CAS attempts trigger a rebase against the new current snapshot and a retry. The retry bound is configurable (default 16) with exponential backoff and jitter.
- Read planning. A read against a snapshot produces a list of data files using the first two of the three pruning passes (manifest list summaries, then manifest entries). The third pass, in-file row-group pruning, is the Parquet reader's job, not the table format's.
- Schema enforcement. Data files committed to the table must match the table's current schema, or the commit is rejected. The capstone supports schema-compatible additions (new columns, all-nullable) without requiring a schema-evolution commit.
- Append and overwrite commit types. Append commits add files without removing any. Overwrite commits remove a specified file set and add a new file set in one snapshot. Both must work under concurrent contention.
- History query. table.history() returns the sequence of snapshots from the current snapshot back to the table's creation, via the parent_snapshot_id chain.
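The schema-enforcement requirement can be sketched as a pure function. This is one reading of "schema-compatible additions", stated as an assumption: a data file may carry extra columns the table lacks as long as those columns are nullable, and every table column must appear in the file unchanged. The `Column` type and the function names are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    pub type_name: String,
    pub nullable: bool,
}

/// Convenience constructor for the examples.
pub fn col(name: &str, type_name: &str, nullable: bool) -> Column {
    Column { name: name.to_string(), type_name: type_name.to_string(), nullable }
}

/// True when a data file with columns `file` may be committed to a table
/// with columns `table`, under the assumption described above.
pub fn is_compatible(table: &[Column], file: &[Column]) -> bool {
    // Every table column must be present in the file with the same
    // name, type, and nullability.
    let table_cols_ok = table.iter().all(|t| file.iter().any(|f| f == t));
    // Any column the table does not know about must be nullable.
    let extras_ok = file
        .iter()
        .filter(|f| !table.iter().any(|t| t.name == f.name))
        .all(|f| f.nullable);
    table_cols_ok && extras_ok
}
```

Running the check before the retry loop keeps a schema mismatch from ever reaching the CAS, which is what the acceptance criterion about an unmodified catalog pointer relies on.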
Acceptance Criteria
Verifiable (automated tests must demonstrate these)
- table.commit_append(files) against a fresh table produces a new snapshot referencing the files; table.current_snapshot().manifest_list lists one manifest containing the files.
- After 10 sequential commit_append operations, table.history() returns 11 snapshots (1 initial + 10 commits) with the correct parent chain.
- Concurrent commit_append from two writers (orchestrated with a Barrier to line them up at the CAS) results in exactly one commit succeeding on first attempt and one observing a conflict; after the conflict-handler's retry, both commits are applied in sequence, and the final snapshot references all files from both writers.
- commit_overwrite removes the specified files from the snapshot's effective file set (the manifest entries' status changes from Existing to Deleted) and adds the new files, in one snapshot.
- A commit_append rejected for schema mismatch (data file's schema is incompatible with the table schema) does not modify the catalog pointer; subsequent reads return the previous snapshot's file set.
- read_plan(predicate) for a partition-equality predicate returns only data files whose partition statistics overlap the predicate value; manifests whose partition summaries do not overlap are not opened.
- After 100 random concurrent commits across two writers (operations interleaved arbitrarily, retries enabled), the final snapshot's effective file set equals the set of files that succeeded in either writer's view. No commits are silently lost; no commits are silently double-applied.
- An old snapshot remains readable after a sequence of subsequent commits. read_plan(snapshot_id = N) for an N from 50 commits ago returns the file set that was current at snapshot N.
Self-assessed (you write a short justification; reviewer checks it)
- (self-assessed) The catalog backend choice (Postgres / SQLite / in-memory) is documented in docs/catalog-choice.md with the linearizable-CAS argument: why the chosen primitive provides the required atomicity, and what fails if it does not.
- (self-assessed) The retry policy (bound, backoff, jitter) is documented in docs/retry-policy.md against the livelock failure mode: why the chosen bounds are sufficient for expected contention and what the failure surface looks like when they are not.
- (self-assessed) The metadata file layout (path conventions, file naming, directory structure) is documented in docs/metadata-layout.md against the future-modules concern: Module 6's snapshot expiration job needs to be able to enumerate snapshots and decide which to delete; your layout makes that operation efficient.
- (self-assessed) The read_plan pruning correctness is justified in docs/read-plan-correctness.md against the false-negative concern: pruning must never skip a file that could match the predicate. Your implementation's correctness argument is one paragraph plus a test that exercises the boundary cases.
Architecture Notes
A reasonable module layout:
artemis-table-format/
├── src/
│   ├── lib.rs                  # Table, Snapshot, Manifest, etc.
│   ├── metadata.rs             # serialization of snapshot/manifest-list/manifest
│   ├── catalog.rs              # Catalog trait, PostgresCatalog, InMemoryCatalog
│   ├── commit.rs               # commit_append, commit_overwrite, retry loop
│   ├── read.rs                 # read_plan and pruning
│   └── bin/artemis_table.rs
├── tests/
│   ├── single_writer.rs        # baseline correctness
│   ├── concurrent_writers.rs   # the Barrier-orchestrated concurrent tests
│   ├── history.rs              # snapshot chain queries
│   └── time_travel.rs          # reading old snapshots
└── docs/
    ├── catalog-choice.md
    ├── retry-policy.md
    ├── metadata-layout.md
    └── read-plan-correctness.md
The metadata directory layout the docs/metadata-layout.md should justify:
<table_location>/
├── data/                # written by the M01 Parquet writer
│   └── <data files>.parquet
└── metadata/
    ├── v0/              # snapshot 0 (initial empty table)
    │   ├── snap.json
    │   └── ml.avro
    ├── v1/
    │   ├── snap.json
    │   ├── ml.avro
    │   └── m0001.avro
    └── v2/
        ├── snap.json
        ├── ml.avro
        └── m0002.avro
The version-prefixed directory is one option; a flat directory with snapshot-ID-prefixed filenames is another. The Iceberg spec uses a flat layout with monotonically-incrementing version files; either is defensible if the doc explains the choice.
Hints
Hint 1 — A simple in-memory CAS for unit tests
The CAS abstraction is small enough that a Mutex<HashMap<String, CurrentSnapshot>> is enough for fast unit tests. Concurrent commit tests run against the in-memory implementation; integration tests run against SQLite or Postgres to verify the same protocol works against a real transactional backend. The trait sketch:
use anyhow::Result;
use async_trait::async_trait;

#[async_trait]
pub trait Catalog: Send + Sync {
    async fn get_current(&self, table: &str) -> Result<CatalogEntry>;

    async fn compare_and_swap(
        &self,
        table: &str,
        expected_old: SnapshotId,
        new: SnapshotId,
        new_metadata_path: &str,
    ) -> Result<(), CommitError>;
}
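A minimal in-memory backend along these lines might look as follows. It is a sketch under stated assumptions: the methods are synchronous so the block needs only the standard library (an async trait impl would wrap them), and `CasError`, `create_table`, and the simplified `CatalogEntry` are hypothetical shapes rather than the capstone's required API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

pub type SnapshotId = u64;

#[derive(Debug, Clone, PartialEq)]
pub struct CatalogEntry {
    pub current_snapshot_id: SnapshotId,
    pub metadata_path: String,
}

#[derive(Debug, PartialEq)]
pub enum CasError {
    NoSuchTable,
    Conflict { expected_old: SnapshotId, actual_current: SnapshotId },
}

#[derive(Default)]
pub struct InMemoryCatalog {
    tables: Mutex<HashMap<String, CatalogEntry>>,
}

impl InMemoryCatalog {
    pub fn create_table(&self, table: &str, snapshot_id: SnapshotId, metadata_path: &str) {
        let entry = CatalogEntry {
            current_snapshot_id: snapshot_id,
            metadata_path: metadata_path.to_string(),
        };
        self.tables.lock().expect("catalog mutex poisoned").insert(table.to_string(), entry);
    }

    pub fn get_current(&self, table: &str) -> Option<CatalogEntry> {
        self.tables.lock().expect("catalog mutex poisoned").get(table).cloned()
    }

    /// The compare and the swap happen under one mutex guard, which is what
    /// makes the operation atomic with respect to other callers.
    pub fn compare_and_swap(
        &self,
        table: &str,
        expected_old: SnapshotId,
        new: SnapshotId,
        new_metadata_path: &str,
    ) -> Result<(), CasError> {
        let mut tables = self.tables.lock().expect("catalog mutex poisoned");
        let entry = tables.get_mut(table).ok_or(CasError::NoSuchTable)?;
        if entry.current_snapshot_id != expected_old {
            return Err(CasError::Conflict {
                expected_old,
                actual_current: entry.current_snapshot_id,
            });
        }
        entry.current_snapshot_id = new;
        entry.metadata_path = new_metadata_path.to_string();
        Ok(())
    }
}
```

The design point is that the guard is held across both the comparison and the write. Releasing the lock between the two would reintroduce exactly the race the CAS exists to close.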
Hint 2 — Avoiding the wrong kind of retry test
Concurrent-writer tests must orchestrate the writers so they all reach the CAS at the same time; otherwise the test is testing timing, not the CAS protocol. Use tokio::sync::Barrier to line up both writers after they've read the same base snapshot. Without the barrier, one writer almost always finishes its work before the other starts, and the CAS never sees concurrent contention.
Hint 3 — Manifest pruning correctness boundary cases
The pruning correctness argument has to handle two edge cases: (1) partition statistics with contains_null = true against a predicate that null values can satisfy, such as IS NULL (the manifest might contain matching rows even though its min/max bounds say nothing about them), and (2) min_value == max_value for a single-value partition (the comparison must be <= and >=, not < and >, or the matching manifest gets pruned). The pruning test suite should include both cases. A pruning bug that produces false negatives (incorrectly skipping a matching file) is silent data loss from the analyst's perspective.
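A conservative pruning check that covers both boundary cases can be sketched as below. The types are hypothetical and restricted to integer partition values; the sketch also assumes SQL null semantics, where a null never satisfies an equality predicate, so contains_null matters only for an IS NULL predicate.

```rust
/// Partition summary for one field, as carried in a manifest-list entry.
#[derive(Debug, Clone)]
pub struct FieldSummary {
    /// None when the writer recorded no bound for this field.
    pub lower: Option<i64>,
    pub upper: Option<i64>,
    pub contains_null: bool,
}

pub enum Predicate {
    Eq(i64),
    IsNull,
}

/// Returns false only when no row in the manifest can satisfy the
/// predicate. Any uncertainty resolves to true (keep the manifest).
pub fn may_match(summary: &FieldSummary, predicate: &Predicate) -> bool {
    match predicate {
        // Bounds say nothing about nulls; only the flag does.
        Predicate::IsNull => summary.contains_null,
        Predicate::Eq(v) => match (summary.lower, summary.upper) {
            // Inclusive on both ends so that lower == upper == v matches.
            (Some(lo), Some(hi)) => lo <= *v && *v <= hi,
            // A missing bound gives nothing to prune on.
            _ => true,
        },
    }
}
```

The asymmetry is deliberate: a false positive costs one unnecessary manifest read, while a false negative drops rows from the result.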
Hint 4 — Reading the parent snapshot chain
table.history() walks the parent_snapshot_id chain backward from the current snapshot. Each step reads one metadata file. For tables with hundreds of snapshots, this is hundreds of file reads: acceptable for an info CLI but too slow for any per-query use. If you find yourself wanting to use history in the read path, that is the signal to add a "snapshot log" (the Iceberg snapshot-log field in the table metadata) that summarizes the chain in a single file. The capstone does not require this optimization; production deployments do.
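The chain walk itself is short. In this sketch a HashMap stands in for the metadata directory, so each lookup corresponds to one metadata-file read in a real table; the reduced `Snapshot` type and the `history` signature are assumptions made for the illustration.

```rust
use std::collections::HashMap;

pub type SnapshotId = u64;

#[derive(Debug, Clone)]
pub struct Snapshot {
    pub snapshot_id: SnapshotId,
    pub parent_snapshot_id: Option<SnapshotId>,
}

/// Walk parent links from `current` back to the table's first snapshot.
/// Returns snapshot IDs newest first, or None if a link points at a
/// snapshot that is missing from the store. Assumes the chain is acyclic,
/// which holds when every parent is an older snapshot.
pub fn history(
    store: &HashMap<SnapshotId, Snapshot>,
    current: SnapshotId,
) -> Option<Vec<SnapshotId>> {
    let mut chain = Vec::new();
    let mut next = Some(current);
    while let Some(id) = next {
        let snap = store.get(&id)?; // one metadata read per step
        chain.push(snap.snapshot_id);
        next = snap.parent_snapshot_id;
    }
    Some(chain)
}
```

Returning None on a broken link, rather than a truncated chain, keeps a missing metadata file from being mistaken for the table's creation point.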
Hint 5 — The orphan-file question
A failed CAS attempt leaves orphan metadata files in the metadata directory (the snapshot, manifest list, and manifest the writer produced but that no catalog pointer references). These are not a correctness concern — they cost only the storage to keep them. Module 6 will introduce the orphan-file cleanup job that finds and deletes them. Your capstone does not need to implement cleanup, but the documentation should note that orphans accumulate and reference Module 6 as the eventual fix.
References
- Designing Data-Intensive Applications (Kleppmann & Riccomini), Chapter 7: "Pessimistic vs Optimistic Concurrency Control"
- Apache Iceberg specification (iceberg.apache.org/spec), particularly "Table Spec V2" and "Commit Process"
- Iceberg whitepaper (Ryan Blue, Netflix, 2018): github.com/apache/iceberg/blob/main/docs/img/iceberg-paper.pdf
- sqlx documentation (docs.rs/sqlx): Postgres CAS pattern
When You're Done
The crate is "done" when all eight verifiable acceptance criteria pass in CI and the four self-assessed docs are written. The integration tests run against both the in-memory and the SQLite catalog backends to verify the protocol works regardless of the CAS implementation. The next module's project will build the partition strategy that this format's read_plan uses; your read_plan must support partition-equality predicates by the time Module 3 begins.
Module 3: Partitioning and Clustering
Lesson 1 — Partition Strategies and the Small-File Problem
Module: Data Lakes — M03: Partitioning and Clustering Position: Lesson 1 of 3 Source: Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 6 ("Partitioning"). Apache Iceberg specification, "Partitioning" and "Partition Transforms" sections.
Context
The Module 2 table format gave us a metadata-tracked file set with three pruning passes available. Pruning depends entirely on per-file statistics being tight along the query-relevant dimensions. If a file contains rows from every mission ever flown, the per-file min(mission_id) and max(mission_id) statistics span every mission, and a query for one specific mission cannot prune the file. Partitioning is the discipline that arranges rows into files so the statistics are useful.
The Artemis analyst workload tells us what the partition columns should be. Every query mentions a mission. Most queries mention a time window. Many queries mention a sensor or payload. A partition strategy that aligns the file layout with these dimensions makes the queries cheap. A strategy that doesn't is worse than no strategy at all — because partitioning has a real cost (more files, more metadata, more directory operations), and a wrongly-chosen partition strategy pays the cost without delivering the benefit.
This lesson develops partitioning bottom-up. The mechanics of what "partitioning" means at the table-format level. The two question types every partition decision answers: which columns to partition on, and what transform to apply to handle high-cardinality columns. The small-file failure mode that over-partitioning produces and how Iceberg-style hidden partitioning avoids it. By the end of the lesson the engineer can defend a partition-spec choice against a real workload; the capstone designs the spec for the SDA observation table.
Core Concepts
What Partitioning Is, Mechanically
In an Iceberg-shaped table format, partitioning is metadata, not directory structure. The table's partition spec is a list of partition fields, each one a (source_column, transform, partition_name) triple. When a new data file is committed, the writer computes the partition tuple for the file — typically by computing the transform on the min and max values of the source column and requiring them to be equal (the file is "in" exactly one partition value). The partition tuple is recorded in the data file's manifest entry. Read planning prunes by computing the partition tuple's overlap with the query predicate.
Two important consequences. First, partitioning does not require physical directory structure. The table format records the partition tuple in metadata; the data files can live anywhere in the object store. Hive's directory-based partitioning (/year=2024/month=03/file.parquet) is a convention, not a requirement; Iceberg supports it but doesn't require it. The Artemis archive uses content-addressed paths (<table>/data/<hash>.parquet) and recovers the partition value entirely from metadata.
Second, partitioning affects write-side file boundaries. The writer cannot put rows belonging to different partition tuples in the same file, because a file has exactly one partition tuple. A writer ingesting a record batch containing rows from twelve missions must produce at least twelve files — one per mission partition. The Module 1 writer's row-group-size discipline still applies within a partition; multiple files in the same partition are produced when the partition's row count exceeds the row group size target. The partition spec sets a minimum file count per ingest; the row group size sets the maximum row count per file.
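The write-side consequence can be sketched as a grouping step that runs before any file is written. The spec used here, (mission_id, day(sample_timestamp_ns)), is the one this lesson arrives at for the SDA observation table; the `Row` type and the function name are hypothetical.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub mission_id: String,
    pub sample_timestamp_ns: i64,
}

const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Group a record batch by partition tuple (mission_id, day). Each group
/// becomes at least one data file, because a file carries exactly one
/// partition tuple.
pub fn split_by_partition(rows: Vec<Row>) -> BTreeMap<(String, i64), Vec<Row>> {
    let mut groups: BTreeMap<(String, i64), Vec<Row>> = BTreeMap::new();
    for row in rows {
        // div_euclid floors toward negative infinity, so timestamps
        // before the epoch land in the correct day.
        let day = row.sample_timestamp_ns.div_euclid(NANOS_PER_DAY);
        groups.entry((row.mission_id.clone(), day)).or_default().push(row);
    }
    groups
}
```

The number of groups is the minimum file count for the ingest; the file-size target can only raise it.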
DDIA (Ch. 6) draws the same distinction between partitioning's logical purpose (limiting how much data each query touches) and its physical realization (which can vary across storage layers). In the lakehouse, the realization is the file-level partition tuple recorded in the manifest.
Identity vs Transformed Partitioning
The simplest partition transform is identity: partition by the column's value directly. partition by mission_id produces one partition per distinct mission ID. For a low-cardinality column like mission_id (~40 distinct values across the archive's history), this is fine. For high-cardinality columns it is not.
Consider partitioning by sample_timestamp_ns. The column's cardinality is enormous — every sample at every timestamp is a distinct value. Identity partitioning on sample_timestamp_ns would produce one partition per row, which produces one tiny file per row, which is catastrophic. The small-file problem at scale: a year of data at 100 Hz becomes three billion partitions, and the metadata cost is more than the data cost.
The fix is transformed partitioning: partition by some function of the column rather than the column itself. The standard Iceberg transforms:
- year(ts), month(ts), day(ts), hour(ts): extract a time grain from a timestamp. partition by day(sample_timestamp_ns) produces one partition per day, regardless of how many samples that day contains.
- bucket(N, col): hash the column value into one of N buckets. partition by bucket(16, payload_id) produces 16 partitions regardless of how many distinct payload IDs exist. Used for high-cardinality columns where the workload's queries are exact-match against the column.
- truncate(width, col): for strings, take the first width characters; for numbers, round down to a multiple of width. partition by truncate(8, mission_id) partitions by the first eight characters of the mission ID, collapsing similar IDs into the same partition.
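The transforms are small pure functions. The sketch below illustrates their shape, not the Iceberg implementation: in particular, `bucket` uses a hand-written FNV-1a hash as a stand-in, whereas the Iceberg spec defines its own hash for bucketing, so bucket numbers from this sketch will not match a real Iceberg table.

```rust
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// day(ts): whole days since the Unix epoch, flooring for negative values.
pub fn day(ts_ns: i64) -> i64 {
    ts_ns.div_euclid(NANOS_PER_DAY)
}

/// truncate(width, n) for integers: round down to a multiple of `width`.
/// Assumes width > 0.
pub fn truncate_int(width: i64, value: i64) -> i64 {
    value - value.rem_euclid(width)
}

/// truncate(width, s) for strings: keep the first `width` characters.
pub fn truncate_str(width: usize, value: &str) -> String {
    value.chars().take(width).collect()
}

/// bucket(n, s): hash into one of `n` buckets. FNV-1a is a stand-in hash
/// chosen because it is stable across runs and platforms. Assumes n > 0.
pub fn bucket(n: u32, value: &str) -> u32 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for byte in value.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    (hash % u64::from(n)) as u32
}
```

Whatever hash is used for bucketing has to be stable forever, because a value that hashes to a different bucket after an upgrade would be pruned away from the files that contain it.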
The transform is what makes the partition tractable. The right transform produces a partition count in the thousands to tens of thousands: enough granularity for pruning, few enough partitions that the metadata overhead is acceptable.
Hidden Partitioning: The Iceberg Innovation
The Hive-era partition pattern required the query to explicitly mention the partition column. A table partitioned by day(ts) needed queries that filtered on the partition column directly:
-- Hive-style: works
WHERE ts >= '2024-03-01' AND ts < '2024-03-02' AND day = '2024-03-01'
-- Hive-style: does NOT prune; full scan
WHERE ts >= '2024-03-01' AND ts < '2024-03-02'
The second query is logically equivalent to the first and produces the same rows, but Hive's planner could not derive the partition predicate from the timestamp predicate. Analysts had to know about the partition layout and write queries that explicitly referenced it. Layout changes (such as changing the partition transform) required all queries to be rewritten.
Iceberg's hidden partitioning moves the transform from "user-visible partition column" to "partition-spec metadata." The query writes the natural predicate on the source column; the planner derives the partition predicate from the partition spec. Both queries above prune identically against an Iceberg table partitioned by day(ts). The analyst doesn't need to know the partition strategy; the planner derives it.
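The derivation the planner performs for a day(ts) spec can be sketched in a few lines. The function names are hypothetical, and timestamps are nanoseconds since the epoch to match sample_timestamp_ns.

```rust
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Derive the inclusive range of day partitions that can hold rows with
/// start_ns <= ts < end_ns. Returns None for an empty time range.
pub fn day_partitions_for_range(start_ns: i64, end_ns: i64) -> Option<(i64, i64)> {
    if start_ns >= end_ns {
        return None;
    }
    let first = start_ns.div_euclid(NANOS_PER_DAY);
    // The upper bound is exclusive, so the last matching instant is end_ns - 1.
    let last = (end_ns - 1).div_euclid(NANOS_PER_DAY);
    Some((first, last))
}

/// Keep a data file only if its day partition falls inside the derived range.
pub fn file_may_match(file_day: i64, range: (i64, i64)) -> bool {
    range.0 <= file_day && file_day <= range.1
}
```

A query for exactly one day's half-open interval derives a single-day range, so files from the neighboring days are pruned without the query ever naming a partition column.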
Hidden partitioning is what makes partition-strategy changes safe over time. An Iceberg table partitioned by day(ts) can switch to month(ts) without rewriting queries (the Module 4 partition-evolution mechanic). The query layer is decoupled from the storage layout. This is impossible in Hive's directory-based scheme.
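The decoupling can be seen in a few lines: the same source-column range lifts to a different partition range depending on the grain the spec names, while the query text stays untouched. This is a minimal sketch, assuming nanosecond timestamps; `Grain` and `lift_range` are hypothetical names, and it uses day and hour grains because they reduce to fixed-width division (a month grain needs calendar arithmetic).

```rust
/// Time grains with a fixed width in nanoseconds (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub enum Grain {
    Hour,
    Day,
}

impl Grain {
    fn width_ns(self) -> i64 {
        match self {
            Grain::Hour => 3_600 * 1_000_000_000,
            Grain::Day => 86_400 * 1_000_000_000,
        }
    }
}

/// Lift a half-open source range [lo_ns, hi_ns) to an inclusive range of
/// partition ordinals (hours or days since the Unix epoch).
/// Floor division keeps pre-epoch timestamps in the partition that
/// actually contains them.
pub fn lift_range(grain: Grain, lo_ns: i64, hi_ns: i64) -> (i64, i64) {
    let w = grain.width_ns();
    // hi_ns is exclusive, so the last matching instant is hi_ns - 1.
    (lo_ns.div_euclid(w), (hi_ns - 1).div_euclid(w))
}
```

For the predicate `ts >= '2024-03-01' AND ts < '2024-03-02'` (day 19,783 since the epoch), a day-grain spec yields the single partition 19,783 and an hour-grain spec yields the 24 hour partitions 474,792 through 474,815. The analyst's query is identical in both cases; only the spec passed to the planner differs.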
The Small-File Problem
The naive partition strategy is "partition by everything that ever appears in a query predicate." This produces the small-file problem: a partition spec with (mission_id, day(ts), payload_id, sensor_kind) against an archive with 40 missions × 1000 days × 8 payloads × 12 sensors = 3.84M partitions. Each partition contains, on average, a vanishing fraction of the table's rows. The vast majority of partitions hold one file each, and most of those files are tiny — kilobytes, not megabytes. The metadata overhead overwhelms the data.
The downstream costs of small files compound. Read planning must enumerate every relevant partition; the manifest list grows with partition count. Object store listings during maintenance are slow and expensive when there are millions of small objects. Query planning sees many candidate files even after pruning; the per-file overhead (footer parse, statistics check) dominates the actual data read. Compaction (Module 6) becomes a continuous background workload to consolidate tiny files into larger ones.
DDIA (Ch. 6) calls out the same tradeoff in terms of partition count: too few partitions limits parallelism and pruning; too many partitions wastes coordination overhead and produces hot spots. The right partition count depends on the workload — the rule of thumb that has held up in practice is target file size 64-512 MB, target partition size 1-100 GB, target partition count in the tens of thousands or fewer. Outside these ranges, the operator should expect to fight maintenance overhead constantly.
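The Artemis SDA observation table sits inside the file-size and partition-count ranges with partition by (mission_id, day(sample_timestamp_ns)): 40 × 1000 ≈ 40,000 partitions and an average file size of around 128 MB. The average partition size of a few hundred MB is below the 1 GB guideline, which is tolerable here because each partition still holds a few full-size files rather than many tiny ones.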
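The arithmetic behind the two specs can be checked directly. The sketch below is illustrative only: the function names are hypothetical, and the table size of roughly 20 TB is an assumption implied by 40,000 partitions at ~500 MB each, not a measured figure.

```rust
/// Upper bound on the partitions a spec can produce: the product of the
/// distinct values each partition field can take.
pub fn partition_count(cardinalities: &[u64]) -> u64 {
    cardinalities.iter().product()
}

/// Average bytes per partition if the table's data were spread evenly.
pub fn avg_partition_bytes(table_bytes: u64, partitions: u64) -> u64 {
    table_bytes / partitions.max(1)
}
```

With the assumed 20 TB table, `partition_count(&[40, 1000])` gives 40,000 partitions averaging 500 MB, while `partition_count(&[40, 1000, 8, 12])` gives 3,840,000 partitions averaging about 5.2 MB, far below the 64 MB file-size floor before the data is even split into files.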
Choosing Partition Columns: The Workload Determines the Spec
The partition spec is determined by the workload. The decision procedure for the Artemis SDA observation table goes like this.
First, enumerate the typical query predicates. Sample the actual query log (or the query patterns described in the requirements). For the Artemis archive, the workload distribution is approximately:
- 100% of queries filter on mission_id
- 95% of queries filter on a time window
- 30% of queries filter on payload_id
- 15% of queries filter on sensor_kind
Second, estimate the cardinality and selectivity of each candidate. A predicate that prunes 99% of partitions is highly selective; one that prunes 50% is not. mission_id is highly selective (filter to 1 of 40 missions → 97.5% pruning). day(ts) for typical queries (one week to one month windows) is highly selective (7-30 days of 1000 → 97-99% pruning). payload_id is less selective (1 of 8 → 87.5% pruning) and only 30% of queries use it. sensor_kind is comparably selective (1 of 12 → about 92% pruning) but is used by only 15% of queries.
Third, pick the columns with high coverage × high selectivity. mission_id is in every query and is highly selective → include. day(ts) is in 95% of queries and is highly selective → include. payload_id and sensor_kind are less universal; including them would increase partition count by ×100 for relatively little additional pruning. Module 3 Lesson 2 develops clustering as the answer for these second-tier dimensions: arrange the data within each partition so that pruning on payload_id and sensor_kind still works at the file level, without partitioning by them.
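The final spec: partition by (identity(mission_id), day(sample_timestamp_ns)). 40 missions × ~1000 days ≈ 40,000 partitions. Average partition size of ~500 MB. Average file size of ~128 MB. Manifest count per snapshot in the low thousands. File size and partition count are inside the operational thresholds; partition size is under the 1 GB guideline, but at roughly four full-size files per partition the small-file problem does not arise.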
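The coverage × selectivity step can be written down as a small scoring function. This is a sketch under stated assumptions: the `Candidate` struct and the `score` formula are hypothetical illustrations of the procedure above, and the day(ts) entry uses the 30-day end of the typical window as its worst case.

```rust
/// A candidate partition column (hypothetical struct for illustration).
pub struct Candidate {
    pub name: &'static str,
    /// Fraction of queries that filter on this column.
    pub coverage: f64,
    /// Distinct partition values the column would produce.
    pub cardinality: f64,
    /// Distinct partition values a typical query selects.
    pub selected: f64,
}

impl Candidate {
    /// Fraction of partitions a typical predicate on this column prunes.
    pub fn selectivity(&self) -> f64 {
        1.0 - self.selected / self.cardinality
    }

    /// Expected pruning averaged over the whole workload.
    pub fn score(&self) -> f64 {
        self.coverage * self.selectivity()
    }
}

/// The four candidates from the workload sample, in no particular order.
pub fn artemis_candidates() -> Vec<Candidate> {
    vec![
        Candidate { name: "sensor_kind", coverage: 0.15, cardinality: 12.0, selected: 1.0 },
        Candidate { name: "day(ts)", coverage: 0.95, cardinality: 1000.0, selected: 30.0 },
        Candidate { name: "payload_id", coverage: 0.30, cardinality: 8.0, selected: 1.0 },
        Candidate { name: "mission_id", coverage: 1.00, cardinality: 40.0, selected: 1.0 },
    ]
}

/// Sort candidates by descending score.
pub fn rank(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by(|a, b| b.score().total_cmp(&a.score()));
    candidates
}
```

Ranking the candidates puts mission_id (0.975) and day(ts) (about 0.92) far ahead of payload_id (about 0.26) and sensor_kind (about 0.14). The gap between the second and third entries is where the spec stops and clustering takes over.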
Code Examples
Defining and Applying a Partition Spec
The partition spec is part of the table's snapshot metadata. Extending the Module 2 types:
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PartitionSpec {
pub spec_id: u32,
pub fields: Vec<PartitionField>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PartitionField {
/// Name of the source column in the table schema.
pub source_column: String,
/// Name the partition value gets in metadata (and any directory
/// layout if used).
pub partition_name: String,
/// Transform applied to the source column to derive the partition.
pub transform: Transform,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Transform {
/// Use the source value directly. Suitable for low-cardinality
/// columns: mission_id, sensor_kind, payload_id.
Identity,
/// Extract the year as YYYY from a timestamp. Coarsest time grain.
Year,
/// Extract YYYY-MM from a timestamp.
Month,
/// Extract YYYY-MM-DD from a timestamp. Most common for daily data.
Day,
/// Extract YYYY-MM-DD-HH from a timestamp. For very high-volume tables.
Hour,
/// Hash the source value into one of N buckets. For high-cardinality
/// columns with exact-match query patterns.
Bucket(u32),
/// Truncate strings to the first N characters / round numbers down to
/// a multiple of N. For string columns with hierarchical structure.
Truncate(u32),
}
The application (compute the partition value for a source value) has one match arm per transform:
use std::hash::{Hash, Hasher};
/// Compute the partition value for a single source value under a given
/// transform. The result is the value that gets recorded in the data
/// file's manifest entry and that pruning compares predicates against.
pub fn apply_transform(transform: &Transform, value: &Value) -> PartitionValue {
    match transform {
        Transform::Identity => PartitionValue::from_value(value),
        Transform::Day => match value {
            Value::TimestampNs(ns) => {
                // Floor division (div_euclid) so that pre-epoch timestamps
                // land in the day that contains them; `/` rounds toward zero.
                let secs = ns.div_euclid(1_000_000_000);
                let days = secs.div_euclid(86_400);
                PartitionValue::Date(days as i32)
            }
            _ => panic!("Day transform requires Timestamp source"),
        },
        Transform::Hour => match value {
            Value::TimestampNs(ns) => {
                let secs = ns.div_euclid(1_000_000_000);
                let hours = secs.div_euclid(3_600);
                PartitionValue::Hour(hours as i64)
            }
            _ => panic!("Hour transform requires Timestamp source"),
        },
        Transform::Bucket(n) => {
            // Iceberg specifies a particular hash function (Murmur3)
            // to keep partition values stable across implementations.
            // The production version uses fastmurmur3; this sketch uses
            // the default hasher for illustration. DefaultHasher output
            // is not guaranteed stable across Rust releases, so it must
            // never be used for persisted partition values.
            let mut hasher = std::collections::hash_map::DefaultHasher::new();
            value.hash(&mut hasher);
            let bucket = (hasher.finish() % (*n as u64)) as u32;
            PartitionValue::Bucket(bucket)
        }
        Transform::Truncate(width) => match value {
            Value::String(s) => {
                PartitionValue::String(s.chars().take(*width as usize).collect())
            }
            Value::Int64(n) => {
                // Round down to a multiple of the width. rem_euclid keeps
                // the remainder non-negative, so negative values round
                // toward negative infinity rather than toward zero.
                let w = *width as i64;
                PartitionValue::Int64(*n - n.rem_euclid(w))
            }
            _ => panic!("Truncate requires String or Int64 source"),
        },
        // Year and Month omitted for brevity; same shape as Day.
        _ => todo!(),
    }
}
The discipline: the transform is a pure function of the source value. Two writers computing the partition value for the same row produce the same result. The partition values are stable across implementations as long as the hash function and the time-grain boundaries match the spec.
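The Year and Month arms omitted above need calendar arithmetic rather than fixed-width division. The following is a minimal sketch, not the course's implementation: it assumes the partition value is an ordinal (months or years since 1970, in the same spirit as the Day arm returning days since the epoch), and the function names are hypothetical. The day-to-calendar conversion is the widely used "civil from days" algorithm for the proleptic Gregorian calendar.

```rust
const NS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Convert days since 1970-01-01 to (year, month), month in 1..=12.
fn year_month_from_days(days: i64) -> (i64, i64) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    // January and February belong to the following civil year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month)
}

/// Months since 1970-01 for a nanosecond timestamp.
pub fn month_ordinal(ts_ns: i64) -> i64 {
    let (year, month) = year_month_from_days(ts_ns.div_euclid(NS_PER_DAY));
    (year - 1970) * 12 + (month - 1)
}

/// Years since 1970 for a nanosecond timestamp.
pub fn year_ordinal(ts_ns: i64) -> i64 {
    year_month_from_days(ts_ns.div_euclid(NS_PER_DAY)).0 - 1970
}
```

Like Day and Hour, both functions are pure and monotonic in the timestamp, which is the property range lifting depends on. A timestamp on 2024-03-01 maps to month ordinal 650 and year ordinal 54.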
The Hidden-Partitioning Derivation
Read planning's job for a hidden partition: turn a predicate on the source column into a predicate on the partition value. For each partition transform, this is a small piece of arithmetic.
use std::ops::RangeInclusive;
#[derive(Debug, Clone)]
pub enum SourcePredicate {
    Equals(Value),
    Range(RangeInclusive<Value>),
    In(Vec<Value>),
}
/// Translate a predicate on a source column into a predicate on the
/// partition value for a given transform. The returned predicate is a
/// safe over-approximation: it may match more partitions than strictly
/// necessary, but never fewer. False negatives would be silent data
/// loss; false positives just mean reading extra files.
pub fn lift_predicate(
    transform: &Transform,
    source_pred: &SourcePredicate,
) -> PartitionPredicate {
    match (transform, source_pred) {
        (Transform::Identity, SourcePredicate::Equals(v)) => {
            PartitionPredicate::Equals(PartitionValue::from_value(v))
        }
        (Transform::Day, SourcePredicate::Range(r)) => {
            // sample_timestamp_ns range -> day range. The day transform
            // is monotonic, so lifting both endpoints is sufficient.
            let lo_day = ts_to_day(r.start());
            let hi_day = ts_to_day(r.end());
            PartitionPredicate::Range(lo_day..=hi_day)
        }
        (Transform::Bucket(_), SourcePredicate::Equals(v)) => {
            // Equality on the source becomes equality on the bucket.
            let bucket = apply_transform(transform, v);
            PartitionPredicate::Equals(bucket)
        }
        (Transform::Bucket(_), SourcePredicate::Range(_)) => {
            // Range queries cannot be lifted through bucket hashing;
            // the predicate could match any bucket. The planner falls
            // back to scanning all partitions; bucket-only partitioning
            // is wrong for range-queried columns.
            PartitionPredicate::AllPartitions
        }
        // Every combination not handled above falls back to no pruning,
        // which is always safe.
        _ => PartitionPredicate::AllPartitions,
    }
}
fn ts_to_day(ts: &Value) -> PartitionValue {
    match ts {
        // Floor division, matching apply_transform: truncating toward
        // zero would lift a pre-epoch lower bound to the wrong day and
        // could prune a partition that holds matching rows.
        Value::TimestampNs(ns) => {
            PartitionValue::Date(ns.div_euclid(1_000_000_000).div_euclid(86_400) as i32)
        }
        _ => panic!("ts_to_day requires Timestamp"),
    }
}
What to notice. The lifting is conservative — when the transform's algebra doesn't admit a tight derivation, the planner produces AllPartitions (no pruning). False negatives would silently drop data from query results; false positives just read extra files. The Bucket + range case is the textbook example: hashing destroys the order property that range queries need; bucket partitioning is only valid for equality predicates. A capstone test must exercise this case and verify the planner doesn't try to prune.
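The Bucket + range case can be demonstrated with a few lines of standalone code. This is a sketch, not the capstone test itself: `toy_hash` is a stand-in multiplicative hash chosen because it is deterministic and dependency-free, and it is explicitly not the Murmur3 function Iceberg specifies. The function names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Stand-in hash: multiply by a fixed odd 64-bit constant.
/// NOT the Murmur3 hash Iceberg specifies; for illustration only.
fn toy_hash(v: u64) -> u64 {
    v.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Map a value to one of `n` buckets using the high bits of the hash.
pub fn bucket(n: u32, v: u64) -> u32 {
    ((toy_hash(v) as u128 * n as u128) >> 64) as u32
}

/// Every bucket touched by the values in the inclusive range [lo, hi].
pub fn buckets_hit(n: u32, lo: u64, hi: u64) -> BTreeSet<u32> {
    (lo..=hi).map(|v| bucket(n, v)).collect()
}
```

Adjacent source values land in unrelated buckets (`bucket(16, 1)` and `bucket(16, 2)` differ), and a range of 1,000 consecutive IDs touches all 16 buckets. There is no contiguous bucket range the planner could derive from the source range, so AllPartitions is the only sound answer.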
The Per-Partition File-Boundary Discipline
The writer's contract: each data file has rows from exactly one partition tuple. Implementing this requires partitioning the incoming RecordBatch by the partition spec before handing slices to the Module 1 Parquet writer.
use std::collections::HashMap;
use arrow::array::RecordBatch;
/// Partition a record batch by the table's partition spec, returning
/// one sub-batch per partition tuple. Each sub-batch can then be
/// written to a separate Parquet file (or appended to the partition's
/// in-progress file).
pub fn split_by_partition(
    batch: &RecordBatch,
    spec: &PartitionSpec,
) -> Vec<(PartitionTuple, RecordBatch)> {
    // 1. For each row, compute its partition tuple by applying every
    //    transform in the spec to the row's source column values.
    let row_partitions: Vec<PartitionTuple> = (0..batch.num_rows())
        .map(|row| compute_partition_tuple(batch, row, spec))
        .collect();
    // 2. Group row indices by partition tuple. Consuming the vector
    //    moves each tuple into the map instead of cloning it.
    let mut groups: HashMap<PartitionTuple, Vec<usize>> = HashMap::new();
    for (row, pt) in row_partitions.into_iter().enumerate() {
        groups.entry(pt).or_default().push(row);
    }
    // 3. Produce a sub-batch per partition by gathering the rows.
    //    Arrow's `compute::take` does the per-column index gather in a
    //    single allocation per column.
    groups
        .into_iter()
        .map(|(pt, rows)| {
            let indices = arrow::array::UInt32Array::from(
                rows.iter().map(|&r| r as u32).collect::<Vec<u32>>()
            );
            // The indices come from this batch, so `take` and `try_new`
            // fail only on a logic error; the unwraps keep the sketch short.
            let sub_columns: Vec<_> = batch
                .columns()
                .iter()
                .map(|c| arrow::compute::take(c, &indices, None).unwrap())
                .collect();
            let sub_batch = RecordBatch::try_new(batch.schema(), sub_columns).unwrap();
            (pt, sub_batch)
        })
        .collect()
}
The cost of partition splitting is one take call per column per partition. Each row is gathered exactly once per column, so the data movement is O(rows × columns); the partition count adds only per-call overhead. For a record batch of 8192 rows × 40 columns with an average of 3-5 distinct partition tuples per batch, this is small next to the encoding and compression work that follows. The constant-factor cost is real but tractable. Production code that produces many partition tuples per batch (for instance, a backfill that spans many days) benefits from batching the incoming data into per-partition buffers and flushing partition-by-partition; the M01 writer already supports this via per-partition ArrayBuilder state.
Key Takeaways
- Partitioning is metadata, not necessarily directory structure. The partition spec is part of the table's snapshot; the partition value is recorded per data file in the manifest entry. Read planning compares predicates against the per-file partition value to prune.
- Identity partitioning works for low-cardinality columns (mission_id, sensor_kind). Transformed partitioning (day(ts), bucket(16, payload_id), truncate(8, name)) makes high-cardinality columns tractable by collapsing many values into fewer partitions.
- Hidden partitioning is the Iceberg-vs-Hive distinction worth knowing: the query writes the natural predicate on the source column; the planner derives the partition predicate from the spec. Queries are decoupled from the storage layout; partition strategy changes are safe.
- The small-file problem is the dominant failure mode of over-partitioning. Operational thresholds: target file size 64–512 MB, target partition size 1–100 GB, target partition count in the tens of thousands. Outside these, expect to fight maintenance overhead.
- The partition spec is determined by the workload. Pick columns with high query coverage × high selectivity. The Artemis SDA spec is partition by (mission_id, day(sample_timestamp_ns)): two columns, ~40,000 partitions, ~500 MB per partition. Secondary dimensions (payload_id, sensor_kind) get handled by clustering, not partitioning.
Lesson 2 — Multidimensional Clustering
Module: Data Lakes — M03: Partitioning and Clustering
Position: Lesson 2 of 3
Source: Apache Iceberg specification, "Sort Orders" and "Clustering" sections. Z-order from Morton (1966), "A computer oriented geodetic data base and a new technique in file sequencing." Hilbert curve from Hilbert (1891) with the practical engineering treatment in Faloutsos (1986). DuckDB blog post on Z-order indexing for an applied perspective.
Source note: Synthesis-mode from training knowledge and the public literature; the historical references are well-established. Applied details (which cluster ordering Iceberg implements, when Snowflake-style clustering applies) would benefit from verification against the relevant vendor documentation.
Context
Lesson 1's partition spec — (mission_id, day(ts)) — makes the queries that filter on mission and time fast. The 30% of queries that also filter on payload_id and the 15% that filter on sensor_kind still get partition-level pruning, but within each partition every payload and every sensor is randomly mixed across the files. The per-file statistics on payload_id span every payload; pruning by payload alone produces no benefit at the file level.
The fix is clustering: an arrangement of rows within a partition such that rows with similar values for the clustering columns end up in the same file. A partition that contains 30 GB of telemetry spanning 8 payloads, clustered on payload_id and written as ~128 MB files, becomes roughly 240 files where each file contains rows from one or two payloads. A query for one payload reads about 30 of those 240 files (one eighth, plus a boundary file at each end). The per-file payload_id statistic is now tight; pruning works.
The hard case is multidimensional clustering: clustering on payload_id and sensor_kind and region and quality_flag simultaneously. The data sits in a four-dimensional space; the file boundaries need to be arranged so that nearby points in any of the dimensions tend to share a file. The naive solution — sort by (payload_id, sensor_kind, region, quality_flag) — clusters perfectly on the first column but degrades to random on the others. Space-filling curves — Z-order, Hilbert — are the answer. They map the multidimensional space to a single dimension in a way that preserves locality across all dimensions simultaneously.
This lesson develops Z-order and Hilbert curves: what they are, why they work, the limits of the technique, and how to apply them in a lakehouse. The capstone uses Z-order on the SDA observation table to cluster on (payload_id, sensor_kind); this lesson is the design and intuition behind that choice.
Core Concepts
The Multidimensional Locality Problem
A 2D analog makes the problem visible. Imagine a square of points distributed uniformly in an (x, y) plane. The query workload picks rectangular regions of the square. The data is stored in files; the planner picks which files to read based on per-file min/max statistics for x and y. What arrangement of points into files minimizes the number of files a typical query reads?
The naive options:
- Sort by x. Each file has a tight x range (one strip of the square) and the full y range (the whole height). Queries with tight x bounds prune well; queries with tight y bounds prune nothing.
- Sort by y. Mirror image: queries with tight y bounds prune well; queries with tight x bounds prune nothing.
- Sort by (x, y) lex. Equivalent to sort by x for clustering purposes; the secondary sort by y happens within already-narrow x strips, which buys very little.
None of these clusters on both dimensions simultaneously. A file's bounding rectangle should be roughly square — narrow in both x and y — so that queries with bounds in either or both dimensions prune the file. The arrangement that produces square-ish files is a space-filling curve: a function from a 1D ordering to the 2D plane that preserves locality, so consecutive elements in the 1D ordering are nearby in the 2D plane.
Z-Order: Bit Interleaving
The simplest space-filling curve is Z-order, also called Morton order after the 1966 paper that introduced it. The construction is direct: take the binary representations of the column values, interleave the bits, and use the interleaved value as the sort key.
For 4-bit values x = 0110 and y = 1010, the Z-order key interleaves them bit by bit starting from the high bit:
x: 0 1 1 0
y: 1 0 1 0
z: 0 1 1 0 1 1 0 0 (read top-to-bottom, left-to-right)
The key property: in the binary representation of the Z-order key, the high bits determine the high-level position in the plane, the next bits refine within that, and so on. Two values with the same top eight bits of Z-order key live in the same coarse 2D region. Sorting points by Z-order key produces a 1D ordering where consecutive points are clustered in the 2D plane.
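The worked example can be reproduced with a bit-by-bit loop. A minimal sketch with a hypothetical function name; it follows the convention of the example above (high bit first, the x bit ahead of the y bit in each pair), which is one of two equally valid orderings.

```rust
/// Interleave the low `bits` bits of `x` and `y`, high bit first, with
/// the `x` bit ahead of the `y` bit in each pair. `bits` must be <= 32.
pub fn interleave_msb_first(x: u32, y: u32, bits: u32) -> u64 {
    let mut key = 0u64;
    for b in (0..bits).rev() {
        // Append one bit of x, then one bit of y, below what is there.
        key = (key << 1) | ((x >> b) & 1) as u64;
        key = (key << 1) | ((y >> b) & 1) as u64;
    }
    key
}
```

`interleave_msb_first(0b0110, 0b1010, 4)` returns `0b0110_1100`, matching the z row above. The neighbouring point (x = 0111, y = 1011) produces `0b0110_1111`: the top six bits agree, so the two points fall in the same 2×2 cell of the grid and sort next to each other.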
Visually, the Z-order curve traces a Z-shaped path through the plane: visit the bottom-left, then top-left, then bottom-right, then top-right (the four quadrants); then recursively within each quadrant trace another Z-shape. The pattern is fractal — same shape at every scale.
For N dimensions the construction generalizes: interleave the bits of N column values, one bit from each column in round-robin order. The resulting key has the same property: high bits determine coarse N-dimensional position. Iceberg's clustering support for Z-order takes a list of columns and produces this multidimensional ordering.
The cost: bit interleaving is cheap (a few CPU instructions per coordinate). The constraint: the columns must be normalized to the same numeric range and bit width — typically by computing the rank or quantile of each column's distribution and using the rank as the interleaving input. The Iceberg implementation does this normalization internally; the user supplies the column names.
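The difference between a lexicographic sort and a Z-order sort shows up on a toy grid. The sketch below is an illustration under simplifying assumptions, not a model of the Artemis writer: a 16×16 grid with exactly one point per cell, files of 16 rows each, and pruning driven only by per-file min/max. All names are hypothetical.

```rust
use std::ops::RangeInclusive;

/// Per-file (min, max) statistics for the two columns.
#[derive(Debug, Clone, Copy)]
pub struct FileStats {
    x: (u32, u32),
    y: (u32, u32),
}

/// Z-order key for 4-bit coordinates (x in the even bit positions).
pub fn z_key(x: u32, y: u32) -> u64 {
    let mut key = 0u64;
    for b in 0..4 {
        key |= (((x >> b) & 1) as u64) << (2 * b);
        key |= (((y >> b) & 1) as u64) << (2 * b + 1);
    }
    key
}

fn min_max(values: impl Iterator<Item = u32> + Clone) -> (u32, u32) {
    (values.clone().min().unwrap(), values.max().unwrap())
}

/// Sort every point of a 16x16 grid by `key`, cut the sorted run into
/// files of 16 rows, and record each file's min/max statistics.
pub fn layout(key: impl Fn(u32, u32) -> u64) -> Vec<FileStats> {
    let mut points: Vec<(u32, u32)> = (0..16u32)
        .flat_map(|x| (0..16u32).map(move |y| (x, y)))
        .collect();
    points.sort_by_key(|&(x, y)| key(x, y));
    points
        .chunks(16)
        .map(|rows| FileStats {
            x: min_max(rows.iter().map(|p| p.0)),
            y: min_max(rows.iter().map(|p| p.1)),
        })
        .collect()
}

/// Count files whose statistics overlap the query rectangle.
pub fn files_scanned(
    files: &[FileStats],
    x: RangeInclusive<u32>,
    y: RangeInclusive<u32>,
) -> usize {
    files
        .iter()
        .filter(|f| f.x.0 <= *x.end() && f.x.1 >= *x.start())
        .filter(|f| f.y.0 <= *y.end() && f.y.1 >= *y.start())
        .count()
}
```

With `layout(|x, y| ((x << 4) | y) as u64)` (sort by x, then y) each file is a vertical strip: a query on x in 0..=3 scans 4 of 16 files, but a query on y in 0..=3 scans all 16. With `layout(z_key)` each file is a 4×4 block, and both queries scan 4 files. The Z-order layout gives up nothing on x in this example and gains a 4× reduction on y.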
Z-Order's Limits: The Diagonal Discontinuity
Z-order has a known weakness. The Z-shape's recursive structure produces a "jump" at every level boundary: the curve goes from the top-right corner of one Z to the bottom-left corner of the next Z. Points adjacent along this jump in 1D space can be far apart in 2D space — a query that happens to span the discontinuity sees the file containing both endpoints listed as "potentially matching" even though the file actually contains widely-separated points.
The practical effect: Z-order is good but not perfect. Pruning effectiveness is typically 2-10× over a single-column sort for two-dimensional clustering, less as the dimension count grows. For most workloads this is enough; the diagonal-discontinuity loss is small relative to the win.
The fix is the Hilbert curve, a more sophisticated space-filling curve that avoids the discontinuity. The Hilbert curve traces a continuous path through the plane (no jumps), which produces tighter clustering and better pruning. The cost is computation: deriving the Hilbert key for a coordinate is several times slower than bit interleaving, and the implementation is famously fiddly. The benefit is typically a 1.5-2× improvement in clustering effectiveness over Z-order.
For the Artemis archive, Z-order is the right choice. The diagonal-discontinuity loss is dominated by the per-file row count (~1M rows per 128MB file); the clustering granularity is fine relative to the file size. The Hilbert curve would be marginally better and substantially harder to implement correctly. Production deployments that have measured both have generally landed on Z-order; Hilbert is most often seen in specialized geospatial or scientific contexts where the 1.5-2× matters enough to pay the complexity cost.
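The continuity difference is easy to check numerically. The sketch below walks both curves over a 16×16 grid and reports the largest step between consecutive points. It uses the standard iterative distance-to-coordinate construction for the Hilbert curve; the function names are hypothetical and nothing here is taken from the Artemis codebase.

```rust
/// Map a distance `d` along the Hilbert curve to (x, y) on an n-by-n
/// grid. `n` must be a power of two and `d < n * n`.
pub fn hilbert_d2xy(n: u32, d: u32) -> (u32, u32) {
    let (mut x, mut y, mut t) = (0u32, 0u32, d);
    let mut s = 1;
    while s < n {
        let rx = 1 & (t / 2);
        let ry = 1 & (t ^ rx);
        // Rotate or flip the quadrant so sub-curves join end to end.
        if ry == 0 {
            if rx == 1 {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        x += s * rx;
        y += s * ry;
        t /= 4;
        s *= 2;
    }
    (x, y)
}

/// Inverse of 2D bit interleaving: recover (x, y) from a Z-order key
/// with x in the even bit positions.
pub fn z_d2xy(bits: u32, d: u32) -> (u32, u32) {
    let (mut x, mut y) = (0u32, 0u32);
    for b in 0..bits {
        x |= ((d >> (2 * b)) & 1) << b;
        y |= ((d >> (2 * b + 1)) & 1) << b;
    }
    (x, y)
}

/// Largest Manhattan distance between consecutive points of a curve.
pub fn max_step(points: impl Iterator<Item = (u32, u32)>) -> u32 {
    let pts: Vec<(u32, u32)> = points.collect();
    pts.windows(2)
        .map(|w| w[0].0.abs_diff(w[1].0) + w[0].1.abs_diff(w[1].1))
        .max()
        .unwrap_or(0)
}
```

`max_step((0..256).map(|d| hilbert_d2xy(16, d)))` is 1: every consecutive pair of Hilbert positions is a grid neighbour. The same measurement over `z_d2xy(4, d)` is greater than 1, because the Z curve jumps across the grid at quadrant boundaries. Whether that gap matters depends on how many rows sit between file boundaries, which is the argument made above for staying with Z-order.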
Clustering vs Partitioning: When to Use Which
The two disciplines are complementary, not competing. The decision rule:
- Partition dimensions that appear in nearly every query and have manageable cardinality after transform. The partition makes pruning the cheapest possible operation (manifest-list summary lookup, no file open). The Artemis archive partitions on (mission_id, day(ts)) because these are in 95-100% of queries.
- Cluster dimensions that appear in many but not all queries, or dimensions where partitioning would produce the small-file problem. Clustering keeps the partition count manageable while preserving file-level pruning. The Artemis archive clusters on (payload_id, sensor_kind) inside each (mission_id, day) partition.
Cardinality is the other determinant. A dimension with 4 distinct values (quality_flag) gives essentially no clustering benefit — the four values are mixed into every file anyway. A dimension with 8-1000 distinct values is the sweet spot for clustering: enough cardinality to produce useful file-level statistics, low enough that the clustering keys don't drown in noise.
Three dimensions is the practical maximum for Z-order clustering. With four or more dimensions, the bit-interleaving structure starts producing keys where each individual dimension gets only 16 / N bits of resolution, and the per-dimension clustering degrades. The Artemis archive uses two-dimensional clustering (payload_id, sensor_kind); adding a third dimension is being evaluated for the Module 6 compaction redesign but the early measurements suggest the marginal benefit isn't worth the complexity.
Clustering Is a Write-Time Discipline
Crucially, clustering does not change the data file format. A clustered table is just a normal Parquet table where the rows happen to be sorted by the cluster key. The reader sees ordinary Parquet files with per-column statistics; the pruning that works is the same pruning Module 1 introduced (min/max on each column chunk). The cluster key never appears in the data — it is computed at write time, used to order the rows, and discarded.
This has a useful operational consequence: clustering can be added to an existing table without breaking readers. The Module 6 compaction job is what physically applies the clustering — it reads existing data files, sorts the rows by the cluster key, and rewrites them as new files. Readers see a slow improvement in pruning effectiveness as the compaction completes; no schema change, no query rewrite, no migration.
The Iceberg "sort order" metadata records the cluster key for the table so the writer can apply it on new ingests. Snapshot N+1 of a table can have a different sort order than snapshot N; existing files retain their old sort, new files use the new sort, compaction migrates old files when it runs. The discipline is the same as the partition-evolution discipline (Module 4): the table's logical contract is unchanged; the physical layout migrates over time.
Measuring Clustering Effectiveness
The clustering decision must be measured, not assumed. The metric is the average files-scanned ratio: for a workload of representative queries, the ratio of files actually opened to files that would be opened with no clustering at all. A clustering scheme that reduces the ratio from 1.0 (no pruning) to 0.1 (10× pruning improvement) is winning; a scheme that reduces it from 1.0 to 0.95 is doing nothing.
The Artemis archive runs a quarterly clustering-audit job against the analyst query log. The job samples 1000 recent queries, evaluates each against the current table layout and against the no-clustering baseline, and reports the average ratio. If the ratio drifts above 0.3, the clustering needs reconsideration — typically because the workload has shifted to query patterns the original spec didn't anticipate. The audit job is the feedback loop that keeps the clustering aligned with the actual workload.
Code Examples
Computing a 2D Z-Order Key
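The bit-interleaving function for two u32 values, producing a u64 Z-order key. Note the bit placement: this function puts each x bit below the matching y bit, the mirror image of the worked example in Core Concepts (which puts the x bit first). Either convention clusters equally well, provided every writer of a table uses the same one: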
/// Interleave the bits of two u32 values into a u64 Z-order key.
/// Bit 0 of `x` becomes bit 0 of the result; bit 0 of `y` becomes bit 1;
/// bit 1 of `x` becomes bit 2; bit 1 of `y` becomes bit 3; and so on.
/// Higher-order bits of x/y end up at higher-order positions in the
/// result, which means sorting by the result clusters in 2D.
pub fn zorder_2d(x: u32, y: u32) -> u64 {
    fn spread(v: u32) -> u64 {
        // Spread 32 bits across 64, leaving gaps for the other dimension.
        // The classic "magic number" technique; faster than the bit-by-bit
        // loop. Each line moves and masks half the bits.
        let mut v = v as u64;
        v = (v | (v << 16)) & 0x0000_FFFF_0000_FFFF;
        v = (v | (v << 8)) & 0x00FF_00FF_00FF_00FF;
        v = (v | (v << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
        v = (v | (v << 2)) & 0x3333_3333_3333_3333;
        v = (v | (v << 1)) & 0x5555_5555_5555_5555;
        v
    }
    spread(x) | (spread(y) << 1)
}
The "magic numbers" are a standard trick for parallel bit spreading. Each step halves the size of the bit groups and opens a gap between them, masking off the extras. After five steps, the 32 input bits occupy every other position of the 64-bit result (exactly half the positions), leaving the other half for the second dimension.
Computing an N-Dimensional Z-Order Key
The general case for an arbitrary number of u32 dimensions, producing a u128 key (or an arbitrary-precision integer for higher dimensions):
/// Compute an N-dimensional Z-order key by bit-interleaving the inputs.
/// For N=2 this gives a 64-bit key; for N=4 a 128-bit key; for higher
/// N the result type would be a u256 or similar.
pub fn zorder_nd(values: &[u32]) -> u128 {
    let n = values.len();
    // A hard assert rather than debug_assert: with more than four
    // dimensions the shift below would exceed 127 bits, and a release
    // build would silently produce a wrong key instead of panicking.
    assert!(n <= 4, "u128 result supports up to 4 u32 dimensions");
    let mut key: u128 = 0;
    // For each bit position in the source values, scatter one bit from
    // each dimension into N consecutive output positions.
    for bit in 0..32 {
        for (dim_idx, &value) in values.iter().enumerate() {
            let v_bit = ((value >> bit) & 1) as u128;
            let out_position = bit * n + dim_idx;
            key |= v_bit << out_position;
        }
    }
    key
}
The function above is the canonical reference implementation — correct, easy to verify, but slow. Production code uses the magic-number technique generalized to N dimensions, which is about 10× faster but harder to read. For batch processing at write time, the slow version is plenty fast; the writer computes O(rows) keys per file, which is small relative to the encoding-and-compression cost.
Applying Z-Order at Write Time
The clustering step in the write path: compute Z-order keys for the rows, sort the rows by the keys, hand the sorted rows to the Module 1 Parquet writer.
use arrow::array::{RecordBatch, UInt32Array};
/// Re-order the rows of a record batch by the Z-order of the given
/// cluster columns, in preparation for writing to Parquet. The cluster
/// columns must be u32-normalized (typically the rank or quantile of
/// the column's value distribution), since Z-order requires same-scale
/// inputs.
pub fn zorder_batch(
    batch: &RecordBatch,
    cluster_columns: &[&str],
) -> Result<RecordBatch, anyhow::Error> {
    // zorder_nd packs into a u128, which holds at most four u32 inputs.
    anyhow::ensure!(
        cluster_columns.len() <= 4,
        "zorder_batch supports at most 4 cluster columns"
    );
    // 1. Extract the cluster columns as u32 arrays.
    let cluster_arrays: Vec<&UInt32Array> = cluster_columns
        .iter()
        .map(|name| {
            let idx = batch.schema().index_of(name)?;
            batch.column(idx)
                .as_any()
                .downcast_ref::<UInt32Array>()
                .ok_or_else(|| anyhow::anyhow!("{name} is not UInt32"))
        })
        .collect::<Result<_, _>>()?;
    // 2. Compute Z-order keys for each row.
    let n_rows = batch.num_rows();
    let mut keys: Vec<u128> = Vec::with_capacity(n_rows);
    for row in 0..n_rows {
        let values: Vec<u32> = cluster_arrays.iter().map(|a| a.value(row)).collect();
        keys.push(zorder_nd(&values));
    }
    // 3. Sort row indices by Z-order key.
    let mut indices: Vec<u32> = (0..n_rows as u32).collect();
    indices.sort_unstable_by_key(|&i| keys[i as usize]);
    // 4. Gather the columns by the sorted indices.
    let idx_array = UInt32Array::from(indices);
    let sorted_columns: Vec<_> = batch.columns().iter()
        .map(|c| arrow::compute::take(c, &idx_array, None))
        .collect::<Result<_, _>>()?;
    Ok(RecordBatch::try_new(batch.schema(), sorted_columns)?)
}
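The pattern. Z-order is applied before the file is written, on a record batch large enough to span the file's worth of rows. For the Artemis writer, this is one row group (1M rows): compute Z-order keys for 1M rows, sort, hand to the writer. The cost is the sort, which is O(n log n) and dominated by the key comparison cost for u128 keys; it is measurable but not dominant against the encoding and compression cost. For very large row groups (10M+), production code uses partial-radix sort on the high bits of the Z-order key, which is faster but more complex. One caveat on scope: sorting a single file's worth of rows tightens the row-group and page statistics inside that file but leaves the file-level min/max unchanged, because the same rows end up in the file either way. File-level pruning comes from sorting a whole partition's rows before cutting the sorted run into files, which is the job of the Module 6 compaction pass.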
The "cluster columns must be u32-normalized" caveat is the production tax. The Iceberg implementation handles this internally by computing the rank of each column's value distribution at write time and using the rank as the interleaving input. The capstone may simplify by assuming the cluster columns are already u32 (true for payload_id and sensor_kind in the Artemis schema), with a note that production code generalizes.
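Rank normalization itself is short. The sketch below is one simple way to do it (a dense rank over the distinct values in a column), offered as an illustration of the idea rather than a description of what Iceberg or the Artemis writer actually does; `dense_rank` is a hypothetical name.

```rust
/// Replace each value with its dense rank among the distinct values in
/// the column (0 for the smallest). Ranks preserve order and put every
/// column on the same compact scale, which is what bit interleaving needs.
pub fn dense_rank<T: Ord + Clone>(column: &[T]) -> Vec<u32> {
    // Sorted, de-duplicated copy of the column's values.
    let mut distinct: Vec<T> = column.to_vec();
    distinct.sort();
    distinct.dedup();
    // A value's rank is its position in the distinct list.
    column
        .iter()
        .map(|v| {
            distinct
                .binary_search(v)
                .expect("every value came from this column") as u32
        })
        .collect()
}
```

For example, `dense_rank(&[50, 10, 50, 30])` returns `[2, 0, 2, 1]`. The function works on any ordered type, so string-valued columns can be normalized the same way before their ranks are passed to the interleaving step. Ranks computed per batch are only comparable within that batch; a production writer would need ranks (or quantile boundaries) that are stable across everything being sorted together.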
Measuring Pruning Effectiveness
The clustering-audit job, sketched. For each query in the sample, plan against the current table layout and against a hypothetical no-clustering baseline; compute the files-scanned ratio.
use anyhow::Result;
#[derive(Debug, Clone)]
pub struct PruningMeasurement {
    pub query_id: u64,
    pub files_with_clustering: usize,
    pub files_without_clustering: usize,
    pub ratio: f64,
}
pub async fn measure_pruning(
    table: &Table,
    queries: &[Query],
) -> Result<Vec<PruningMeasurement>> {
    let mut measurements = Vec::new();
    for query in queries {
        // Plan against the actual (clustered) table.
        let actual_files = table.read_plan(&query.predicate).await?;
        let clustered_count = actual_files.len();
        // Plan against a hypothetical table with the same partition spec
        // but no clustering. The simplest approximation: assume every
        // file in every relevant partition is matched (because the
        // unsorted files would have statistics spanning the full range
        // of the cluster columns). For the Artemis archive's
        // (mission, day) partitioning, this equals every file in
        // partitions matching the partition predicates of the query.
        let partition_files = table
            .read_plan_partition_only(&query.predicate)
            .await?
            .len();
        measurements.push(PruningMeasurement {
            query_id: query.id,
            files_with_clustering: clustered_count,
            files_without_clustering: partition_files,
            // max(1) guards against division by zero when no partition matches.
            ratio: clustered_count as f64 / partition_files.max(1) as f64,
        });
    }
    Ok(measurements)
}
The measurement is approximate: the "no-clustering baseline" is a counterfactual that the production code only ever computes synthetically. The Artemis observability stack ingests the measurements from each audit run and tracks the average ratio over time; sustained drift above 0.3 produces a paging alert that the clustering needs revisiting against the current workload.
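The aggregation behind the 0.3 threshold is a plain mean over the sampled queries. A minimal sketch with hypothetical function names, taking (files with clustering, files without clustering) pairs instead of the PruningMeasurement struct so that it stands alone:

```rust
/// Mean files-scanned ratio over a sample of queries. Each pair is
/// (files scanned with clustering, files scanned without clustering).
/// Returns None for an empty sample.
pub fn average_ratio(samples: &[(usize, usize)]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let sum: f64 = samples
        .iter()
        // max(1) guards against division by zero for empty baselines.
        .map(|&(with, without)| with as f64 / without.max(1) as f64)
        .sum();
    Some(sum / samples.len() as f64)
}

/// True when the average ratio has drifted above the review threshold.
/// An empty sample never triggers a review.
pub fn needs_review(samples: &[(usize, usize)], threshold: f64) -> bool {
    average_ratio(samples).map_or(false, |avg| avg > threshold)
}
```

A sample of two queries that scan 3 of 30 and 6 of 30 files averages 0.15 and passes; a sample that scans 20 of 30 and 15 of 30 averages about 0.58 and trips the 0.3 threshold. A mean treats every query equally; weighting by bytes scanned would be a reasonable alternative if a few heavy queries dominate the cost.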
Key Takeaways
- Clustering arranges rows within a partition so that file-level statistics are tight on multiple dimensions simultaneously. It is the answer for query dimensions that don't fit in the partition spec — appearing in many but not all queries, or having cardinalities that would produce the small-file problem.
- Z-order via bit interleaving is the standard multidimensional clustering technique. Sorting rows by their Z-order key clusters the data in N-dimensional space; the bit-interleaving construction is cheap and well-supported in the lakehouse ecosystem.
- Z-order has a known discontinuity loss at recursive-step boundaries; the Hilbert curve avoids it at substantial implementation cost. For most workloads Z-order is the right tradeoff; Hilbert appears in specialized geospatial and scientific deployments.
- Clustering is a write-time discipline, not a format change. Clustered tables are normal Parquet files with rows happening to be sorted by the cluster key. Readers see ordinary per-column statistics; pruning uses the same mechanisms as unclustered tables.
- Measure the clustering effectiveness. The files-scanned ratio against a no-clustering baseline is the metric. The Artemis archive's quarterly audit job tracks this against representative queries; drift above 0.3 means the clustering needs reconsideration against the actual workload.
Lesson 3 — Partition Pruning at Query Time
Module: Data Lakes — M03: Partitioning and Clustering Position: Lesson 3 of 3 Source: Apache Iceberg specification, "Scan Planning" section. Database Internals — Alex Petrov, Chapter 9 ("Query Processing"), for the predicate-as-tree machinery. Earlier modules (M01 row group statistics, M02 manifest hierarchy) for the substrate.
Source note: The pruning protocol's structural pieces are spec-driven (Iceberg scan planning, Parquet metadata) and well-established. The Artemis-specific tuning details are synthesis-mode against the workload patterns described in Lesson 1.
Context
Lessons 1 and 2 designed the partition spec and the cluster layout. This lesson is the read-side counterpart: given a query, how does the planner turn the query's predicate into a minimal set of files to scan? The answer is three sequential pruning passes, each one consuming the output of the previous one and applying tighter statistics. Pruning power compounds across the passes; getting any one of them wrong loses an order of magnitude of work.
The three passes are: at the manifest list level, prune manifests whose partition summaries don't overlap the predicate; at the manifest entry level, prune data files whose per-file statistics don't overlap the predicate; at the row group level (inside the data files), prune row groups whose per-column-chunk statistics don't overlap the predicate. The first two passes are the table format's responsibility (Module 2's metadata); the third is the Parquet reader's responsibility (Module 1's footer). All three use the same logical operation — check whether the predicate's value range intersects the chunk's value range — at different granularities.
This lesson develops the protocol end to end: the predicate-tree shape that the planner uses internally and how it composes with the statistics; the lifting from source-column predicates to partition-value predicates (Lesson 1's hidden partitioning, applied at read time); and the conservative-overshoot discipline that makes pruning correctness tractable. By the end the engineer can predict how many files a query will scan and explain why; the capstone's pruning-effectiveness measurements depend on exactly this skill.
Core Concepts
The Predicate Tree and Statistics Algebra
A query predicate is a tree of conjunctions, disjunctions, and leaves. Each leaf is a single-column comparison (mission_id = 'apollo-7', panel_voltage > 28.5, sample_timestamp_ns BETWEEN A AND B). The planner walks the tree and produces, for each node, a function (statistics) → maybe_matches that takes a statistics record and returns true if rows matching the predicate could exist in the data the statistics describe.
The crucial property of maybe_matches is that it is conservative: it must return true whenever rows could match. False positives are acceptable — the planner reads a file that happens not to contain matching rows. False negatives are silent data loss — the planner skips a file that does contain matching rows, and the query returns wrong results. Every pruning function preserves this asymmetry.
The leaf rules are direct. For a predicate column op value against statistics (min, max, null_count):
- column = value: min <= value <= max
- column < value: min < value
- column > value: max > value
- column IN (vlist): (min, max) overlaps [min(vlist), max(vlist)]
- column IS NULL: null_count > 0
- column IS NOT NULL: null_count < row_count
The composition rules follow basic predicate logic. A AND B matches if both A matches and B matches; A OR B matches if either does. The planner produces one boolean per file by walking the tree against the file's statistics record; files where the root produces false are pruned.
This is the same statistics algebra DDIA (Ch. 4) describes for column-store predicate pushdown, generalized to the table-format hierarchy. The mechanics don't change between Parquet row-group pruning and Iceberg manifest pruning — only the granularity of the statistics does.
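The composition rules can be sketched as a recursive walk. This is a simplified illustration under stated assumptions: columns are integer-valued, statistics are a plain (min, max) pair per column, and the names Pred, FileStats, and tree_might_match are hypothetical rather than the planner's types. Negation is deliberately absent, because "not might-match" is not a sound pruning rule.

```rust
use std::collections::HashMap;

/// A tiny predicate tree: single-column leaves plus AND / OR nodes.
pub enum Pred {
    Eq(&'static str, i64),
    Gt(&'static str, i64),
    Lt(&'static str, i64),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Per-file statistics: column name -> (min, max).
pub type FileStats = HashMap<&'static str, (i64, i64)>;

/// Returns false only when the statistics prove no row can match.
/// A column with no statistics yields true (no prune).
pub fn tree_might_match(pred: &Pred, stats: &FileStats) -> bool {
    match pred {
        Pred::Eq(col, v) => stats
            .get(col)
            .map_or(true, |&(min, max)| min <= *v && *v <= max),
        Pred::Gt(col, v) => stats.get(col).map_or(true, |&(_, max)| max > *v),
        Pred::Lt(col, v) => stats.get(col).map_or(true, |&(min, _)| min < *v),
        // AND survives only if both children might match.
        Pred::And(a, b) => tree_might_match(a, stats) && tree_might_match(b, stats),
        // OR survives if either child might match.
        Pred::Or(a, b) => tree_might_match(a, stats) || tree_might_match(b, stats),
    }
}
```

The asymmetry between the two node types is what makes OR predicates hard to prune: an AND node prunes a file as soon as one child rules it out, while an OR node prunes only when every child does.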
Pass 1: Manifest List Pruning
The first pruning pass operates on the manifest list. The manifest list holds one entry per manifest, summarizing the manifest's contents: which partitions it spans, the count of data files, the per-partition-field summary statistics (lower bound, upper bound, contains-null flag).
For a query against mission_id = 'apollo-7' AND sample_timestamp_ns >= '2024-03-01' AND sample_timestamp_ns < '2024-03-08', the manifest list pruning derives partition predicates for each partition field in the spec:
- mission_id partition (identity transform): partition.mission_id = 'apollo-7'
- day(sample_timestamp_ns) partition (day transform): partition.day BETWEEN '2024-03-01' AND '2024-03-07'
The planner then checks each manifest's partition summary against these predicates. A manifest with mission_id summary (min='apollo-3', max='apollo-3') and day summary (min='2024-01-01', max='2024-01-31') is pruned by both predicates — neither overlaps. A manifest with mission_id summary (min='apollo-7', max='apollo-7') and day summary (min='2024-03-01', max='2024-03-31') matches the first predicate exactly and overlaps the second; it survives the pass.
For the Artemis archive with 40,000 partitions distributed across roughly 4,000 manifests (one manifest per commit, with hundreds of commits per year per mission), the typical week-long-window query prunes to ~10 manifests out of ~4,000. The pass takes one manifest-list read (a few hundred KB, one S3 GET) and produces a 99.75% pruning ratio.
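The two example manifests above can be checked mechanically. A minimal sketch, assuming partition summaries carry string bounds for mission_id and ISO-8601 dates for the day field (fixed-width dates compare correctly as plain strings). The struct and function names are illustrative, not the Module 2 API.

```rust
/// Partition summary for one manifest, as (lower, upper) bounds per field.
pub struct ManifestSummary<'a> {
    pub mission_min: &'a str,
    pub mission_max: &'a str,
    pub day_min: &'a str, // ISO-8601 date, e.g. "2024-03-01"
    pub day_max: &'a str,
}

/// Lifted partition predicate: one mission and an inclusive day range.
pub struct PartitionPredicate<'a> {
    pub mission: &'a str,
    pub day_from: &'a str,
    pub day_to: &'a str,
}

/// True if the manifest could hold a partition matching the predicate.
/// Both fields must overlap because the predicate is a conjunction.
pub fn summary_might_match(m: &ManifestSummary, p: &PartitionPredicate) -> bool {
    let mission_ok = m.mission_min <= p.mission && p.mission <= m.mission_max;
    // Two closed intervals overlap when each starts before the other ends.
    let day_ok = m.day_min <= p.day_to && p.day_from <= m.day_max;
    mission_ok && day_ok
}
```

Applied to the example, the apollo-3 January manifest fails both checks and is pruned, while the apollo-7 March manifest overlaps on both fields and survives.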
Pass 2: Manifest Entry Pruning
The second pruning pass opens each surviving manifest and applies file-level statistics. The manifest entries hold one record per data file with the column-level lower bounds, upper bounds, and null counts. The planner applies the full predicate tree (not just partition predicates) against each file's column statistics.
For the example query, the planner now checks each file against:
- The partition predicates (same as Pass 1, redundant at this level but cheap).
- Any non-partition predicates. For instance, if the query also has WHERE panel_voltage > 28.5, the planner uses the file's panel_voltage upper-bound statistic. A file with panel_voltage (min=24.1, max=27.3) is pruned because max < 28.5.
- Any IN-list and BETWEEN predicates, applied with the intersection algebra above.
This is where clustering pays off. A clustered table has files whose payload_id and sensor_kind statistics are tight; a query for one payload reads 1-2 files out of 30 inside the matching partition. An unclustered table has files where every file's payload_id statistic spans every payload; the same query reads all 30 files. The 10×-15× pruning improvement from Lesson 2's Z-order clustering shows up entirely at this pass.
The cost of Pass 2 is one manifest read per surviving manifest. For 10 surviving manifests of 1 MB each, this is 10 MB of metadata I/O, on the order of tens of milliseconds against object storage. The output is typically tens to hundreds of data files to scan.
Pass 3: Row Group Pruning
The third pruning pass happens inside the data files, at the Parquet row group level. Module 1 introduced the Parquet footer's per-column-chunk statistics; Pass 3 applies the same statistics algebra at row-group granularity.
For a Parquet file with 8 row groups of 128 MB each, the planner reads the footer, evaluates the predicate against each row group's column statistics, and emits the byte ranges of the column chunks that need to be read. A file where 2 out of 8 row groups survive Pass 3 has 75% of its bytes pruned without being read. The Parquet reader's ProjectionMask (Lesson 1, Module 1) combines with the row group pruning to produce a minimal-bytes read plan.
Pass 3 is the Parquet reader's responsibility, not the table format's. The table format hands the Parquet reader the list of files; the reader opens each file's footer, applies row-group pruning, and emits Arrow batches for the surviving row groups. The pruning is invisible at the table-format API boundary — the planner sees a "scan the file" operation; the file's actual I/O depends on the row-group pruning the reader does internally.
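The row-group pass can be sketched for a single predicate, panel_voltage > threshold. The types here are hypothetical stand-ins for the Parquet footer metadata, and real readers evaluate the full predicate tree rather than one leaf.

```rust
/// Per-row-group statistics for one column, plus the group's byte size.
pub struct RowGroupStats {
    pub min: f64,
    pub max: f64,
    pub bytes: u64,
}

/// Indices of row groups that might hold a value > threshold.
/// A group is pruned only when its max proves no row can match.
pub fn surviving_row_groups(groups: &[RowGroupStats], threshold: f64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.max > threshold)
        .map(|(idx, _)| idx)
        .collect()
}

/// Fraction of the file's bytes that pruning avoids reading.
pub fn pruned_byte_fraction(groups: &[RowGroupStats], kept: &[usize]) -> f64 {
    let total: u64 = groups.iter().map(|g| g.bytes).sum();
    if total == 0 {
        return 0.0;
    }
    let kept_bytes: u64 = kept.iter().map(|&idx| groups[idx].bytes).sum();
    1.0 - kept_bytes as f64 / total as f64
}
```

With eight equally sized row groups of which two survive, the pruned fraction is 0.75, matching the example above.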
The three passes together produce the pruning pyramid: 4,000 manifests → 10 manifests → ~50 files → ~80 row groups → reads of column chunks for selected columns. The product is what makes the query fast.
Lifting Source Predicates Through Partition Transforms
Lesson 1 introduced hidden partitioning: the query writes the predicate on the source column, the planner derives the partition predicate from the partition spec. Pass 1 needs to do this derivation; the lifting must be conservative — return a partition predicate that matches at least every partition where the source predicate could match.
The leaf cases:
- Identity transform. Source column = v lifts to partition = v exactly. Source column IN [a, b, c] lifts to partition IN [a, b, c]. Source column > v lifts to partition > v. The identity transform preserves order and equality.
- Day, month, year transforms. Source ts >= '2024-03-01' AND ts < '2024-03-08' lifts to day_partition BETWEEN '2024-03-01' AND '2024-03-07'. Range predicates on the source translate to range predicates on the partition because day-extraction preserves order.
- Bucket transform. Source column = v lifts to partition = bucket(v) exactly. Source column IN [a, b, c] lifts to partition IN [bucket(a), bucket(b), bucket(c)]. Source column > v cannot be lifted, because bucket hashing destroys order. The planner falls back to "all partitions might match" for range predicates on bucket-transformed columns. This is the structural reason bucket partitioning is correct only when the column is queried with equality predicates.
- Truncate transform. Source column = v lifts to partition = truncate(v). Source column > v lifts to partition >= truncate(v), because truncation rounds down. Range predicates lift conservatively (some extra partitions match) but soundly.
The Iceberg spec records the transform per partition field. The planner uses the transform to pick the right lifting rule; the rule lives in the planner code, not in the table format itself. A new transform requires a new lifting rule; the table format's flexibility is bounded by the lifting rules the planner implements.
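The truncate rule above is easy to sanity-check by brute force. The sketch below assumes integer truncation is v - (v mod W) with a non-negative remainder, which is my reading of the Iceberg spec; the function names are hypothetical. It checks that every source value x > v lands in a partition whose value is >= truncate(v), so the lifted predicate never excludes a matching partition.

```rust
/// Integer truncate to a multiple of `width`, rounding toward negative
/// infinity. Assumes width > 0.
pub fn truncate_int(v: i64, width: i64) -> i64 {
    v - v.rem_euclid(width)
}

/// Exhaustively checks the lift `x > v  =>  truncate(x) >= truncate(v)`
/// for every pair (v, x) in lo..=hi. Returns false on the first
/// counterexample, which would be a false negative in the planner.
pub fn gt_lift_is_sound(width: i64, lo: i64, hi: i64) -> bool {
    for v in lo..=hi {
        for x in lo..=hi {
            if x > v && truncate_int(x, width) < truncate_int(v, width) {
                return false;
            }
        }
    }
    true
}
```

The lift is sound but loose: with width 10 and v = 17, the value 12 does not satisfy x > 17, yet its partition (10) equals truncate(17) and is still read. That one extra partition is the conservative cost the text describes.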
When Pruning Fails: Sparse and Non-Selective Predicates
Pruning works when the predicate is selective relative to the partition layout. When a file's or manifest's value range is tight and the predicate's range is narrow, most stored ranges fall outside the predicate and the pruning function returns false (prune). When the stored range is wide relative to the predicate, or the predicate covers most of the data, the ranges overlap almost everywhere and the function returns true (no prune). The failure modes are predicates where the algebra doesn't help:
- Predicates on unsupported transforms. A range predicate on a bucket-partitioned column is the canonical example: bucket hashing destroys order, no lifting is possible, every partition is "potentially matching." The fix is changing the partition strategy, not the query.
- Predicates on columns without statistics. Module 1's writer enables statistics on numeric columns by default but disables them on very wide string columns to save metadata space. A predicate on a non-statistics-bearing column gets no pruning at the file or row-group level. The fix is enabling statistics for the column (paying the metadata cost) or accepting the full scan.
- Predicates whose value range exceeds the partition value range. A query WHERE mission_id != 'apollo-3' (where apollo-3 is one of 40 missions) prunes 1/40th of the table, which is almost nothing. Inequality with a single excluded value is structurally hard to prune. The fix is rephrasing the query in terms of an IN-list (WHERE mission_id IN [apollo-1, apollo-2, ..., apollo-40] EXCEPT apollo-3), which is awkward, or accepting the broader scan.
- OR predicates that span many partition values. A query WHERE mission_id = 'apollo-3' OR panel_voltage > 30 cannot prune any file unless both sides prune that file. The pruning power is bounded by the less selective branch. The fix is restructuring the query as a UNION of two more selective queries, if the application allows.
Operationally, the planner emits a pruning summary for each query: how many manifests, files, and row groups survived each pass. The Artemis observability stack ingests these summaries; queries with low pruning ratios (high file counts relative to predicate selectivity) are surfaced for analyst review. A "this query reads 1.4 TB" alert is usually a sign that the query missed an opportunity to use a partition column — the analyst rephrases and the next run reads 14 GB.
Code Examples
A Statistics-Check Function for a Numeric Predicate
The atomic check at every pruning level: does a (min, max, null_count) triple admit a predicate of the form column op value?
#[derive(Debug, Clone)]
pub struct Statistics<T: PartialOrd + Clone> {
    pub min: Option<T>,
    pub max: Option<T>,
    pub null_count: u64,
    pub row_count: u64,
}

#[derive(Debug, Clone)]
pub enum LeafOp<T> {
    Eq(T),
    Ne(T),
    Lt(T),
    Le(T),
    Gt(T),
    Ge(T),
    In(Vec<T>),
    IsNull,
    IsNotNull,
}

/// Return true if the statistics admit *any* row matching the predicate.
/// Returns false only when the predicate is proven not to match, i.e.,
/// when pruning is safe. Conservative on uncertainty: missing or
/// ambiguous statistics return true (no prune).
pub fn might_match<T: PartialOrd + Clone>(
    stats: &Statistics<T>,
    op: &LeafOp<T>,
) -> bool {
    let (min, max) = match (&stats.min, &stats.max) {
        (Some(a), Some(b)) => (a, b),
        _ => return true, // no stats → can't prune
    };
    let has_non_null = stats.null_count < stats.row_count;
    match op {
        LeafOp::Eq(v) => has_non_null && min <= v && v <= max,
        LeafOp::Ne(v) => {
            // Pruned only if every row equals v; safe to claim when
            // null_count == 0 and min == max == v. Conservative
            // otherwise.
            !(stats.null_count == 0 && min == v && max == v)
        }
        LeafOp::Lt(v) => has_non_null && min < v,
        LeafOp::Le(v) => has_non_null && min <= v,
        LeafOp::Gt(v) => has_non_null && max > v,
        LeafOp::Ge(v) => has_non_null && max >= v,
        LeafOp::In(values) => {
            // Any listed value inside [min, max] keeps the file.
            has_non_null && values.iter().any(|v| min <= v && v <= max)
        }
        LeafOp::IsNull => stats.null_count > 0,
        LeafOp::IsNotNull => has_non_null,
    }
}
The pattern. Each LeafOp has a conservative answer derived from the bounds. Predicates that the bounds rule out return false (prune); everything else returns true (read). Note the special handling of Ne: an inequality is only prunable when the bounds prove every row equals the excluded value, which essentially never happens — Ne predicates are structurally hard to prune, as discussed above.
Lifting a Source Predicate to a Partition Predicate
The transform-aware lifting that Pass 1 uses. The function turns a predicate on the source column into a predicate on the partition value.
/// Lift a source-column predicate through a partition transform.
/// Returns Some(partition_predicate) if the transform admits a tight
/// lift, or None if the planner must treat all partitions as potentially
/// matching. The "None" return represents the bucket-with-range case
/// and any other unsupported combination.
pub fn lift(
    transform: &Transform,
    source_op: &LeafOp<Value>,
) -> Option<LeafOp<PartitionValue>> {
    match (transform, source_op) {
        // Identity preserves everything.
        (Transform::Identity, op) => Some(op.map(PartitionValue::from_value)),

        // Day transform on a timestamp: order-preserving.
        (Transform::Day, LeafOp::Eq(Value::TimestampNs(ns))) => {
            Some(LeafOp::Eq(PartitionValue::Date(ns_to_day(*ns))))
        }
        (Transform::Day, LeafOp::Ge(Value::TimestampNs(ns))) => {
            Some(LeafOp::Ge(PartitionValue::Date(ns_to_day(*ns))))
        }
        (Transform::Day, LeafOp::Lt(Value::TimestampNs(ns))) => {
            // For Lt, the day of the boundary is included if any
            // nanoseconds of that day are below the boundary. Conservative
            // lift: Lt becomes Le on the day.
            Some(LeafOp::Le(PartitionValue::Date(ns_to_day(*ns))))
        }
        // (Other Day cases: Gt → Ge, Le → Le, etc., omitted for brevity.)

        // Bucket transform: only Eq and In lift; range predicates cannot.
        (Transform::Bucket(_), LeafOp::Eq(v)) => {
            Some(LeafOp::Eq(apply_transform(transform, v)))
        }
        (Transform::Bucket(_), LeafOp::In(values)) => Some(LeafOp::In(
            values
                .iter()
                .map(|v| apply_transform(transform, v))
                .collect(),
        )),
        // Bucket + Lt/Gt/Le/Ge/Ne: no lifting possible.
        (Transform::Bucket(_), _) => None,

        // Truncate: range predicates lift conservatively.
        (Transform::Truncate(_), LeafOp::Eq(v)) => {
            Some(LeafOp::Eq(apply_transform(transform, v)))
        }
        (Transform::Truncate(_), LeafOp::Ge(v)) => {
            Some(LeafOp::Ge(apply_transform(transform, v)))
        }
        (Transform::Truncate(_), LeafOp::Lt(v)) => {
            // Truncation rounds down; the truncated value is <= the
            // source. A source < v could come from a partition with
            // truncated value < truncate(v) OR == truncate(v).
            Some(LeafOp::Le(apply_transform(transform, v)))
        }

        // Anything not listed above: no lift, all partitions might match.
        _ => None,
    }
}

fn ns_to_day(ns: i64) -> i32 {
    (ns / 1_000_000_000 / 86_400) as i32
}
The discipline this implements. Each (transform, leaf_op) pair has either a tight lift, a conservative lift, or no lift. The conservative lift is sometimes slightly loose (the example's Day + Lt becomes Le on the day, so one extra day's worth of partitions might be read) but never produces false negatives. The "no lift" case (bucket + range) returns None, which the planner translates to "all partitions are potentially matching" for this leaf, defeating pruning on this dimension. A real planner combines partial lifts: in a query with mission_id = 'apollo-7' AND panel_voltage > 28.5, the first leaf lifts and the second does not (panel_voltage is most likely not a partition column at all, and a range on a bucket-partitioned column would not lift either), yet the query still gets the partition-level pruning from mission_id.
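How partial lifts combine across a predicate tree can be sketched generically. The Tree type and lift_tree function are hypothetical, and the OR rule is my own inference rather than something the lesson spells out: dropping an unliftable conjunct from an AND only widens the result, so it is safe, whereas an unliftable branch under an OR could match anywhere, so the whole OR must fall back to "all partitions".

```rust
/// A predicate tree over leaves of type L.
#[derive(Debug, Clone, PartialEq)]
pub enum Tree<L> {
    Leaf(L),
    And(Vec<Tree<L>>),
    Or(Vec<Tree<L>>),
}

/// Lift a source-predicate tree to a partition-predicate tree.
/// `lift_leaf` returns None when a leaf has no sound lift.
/// A None result means "all partitions might match".
pub fn lift_tree<S, P>(
    tree: &Tree<S>,
    lift_leaf: &dyn Fn(&S) -> Option<P>,
) -> Option<Tree<P>> {
    match tree {
        Tree::Leaf(leaf) => lift_leaf(leaf).map(Tree::Leaf),
        Tree::And(children) => {
            // Keep the conjuncts that lift; dropping the rest is conservative.
            let lifted: Vec<Tree<P>> = children
                .iter()
                .filter_map(|child| lift_tree(child, lift_leaf))
                .collect();
            if lifted.is_empty() {
                None
            } else {
                Some(Tree::And(lifted))
            }
        }
        Tree::Or(children) => {
            // Every branch must lift; one unliftable branch defeats the OR.
            let lifted: Option<Vec<Tree<P>>> = children
                .iter()
                .map(|child| lift_tree(child, lift_leaf))
                .collect();
            lifted.map(Tree::Or)
        }
    }
}
```

For the example query, the AND keeps the lifted mission_id leaf and silently drops panel_voltage; the same two leaves under an OR produce None.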
A Complete Read Plan with All Three Passes
The end-to-end planner: query in, byte-ranges out. The function uses the Module 2 metadata types and the Module 1 Parquet reader.
use anyhow::Result;

pub struct ScanPlan {
    /// Files to open, with the row groups inside each that survive
    /// pass-3 pruning.
    pub files: Vec<FileScan>,
}

pub struct FileScan {
    pub path: String,
    pub row_groups: Vec<usize>,
}

pub async fn plan_scan(
    table: &Table,
    predicate: &Predicate,
) -> Result<ScanPlan> {
    let snapshot = table.current_snapshot().await?;
    let manifest_list = read_manifest_list(&snapshot.manifest_list_path).await?;
    let spec = &snapshot.partition_spec;

    // PASS 1: prune manifests by partition summary.
    let candidate_manifests: Vec<_> = manifest_list
        .manifests
        .iter()
        .filter(|m| manifest_might_match(m, predicate, spec))
        .collect();

    // PASS 2: open each surviving manifest, prune data files by per-file
    // statistics. The full predicate (not just partition predicates) is
    // applied here.
    let mut candidate_files: Vec<DataFile> = Vec::new();
    for manifest_entry in &candidate_manifests {
        let manifest = read_manifest(&manifest_entry.manifest_path).await?;
        for entry in manifest.entries {
            // Only live entries (existing or newly added) are candidates.
            if matches!(entry.status, EntryStatus::Existing | EntryStatus::Added) {
                if data_file_might_match(&entry.data_file, predicate) {
                    candidate_files.push(entry.data_file);
                }
            }
        }
    }

    // PASS 3: for each candidate file, open the Parquet footer and
    // prune row groups by per-column-chunk statistics. This is the
    // Parquet reader's responsibility; the table format hands it the
    // file list and the predicate.
    let mut file_scans = Vec::new();
    for data_file in candidate_files {
        let file_meta = open_parquet_metadata(&data_file.path).await?;
        let surviving_row_groups: Vec<usize> = (0..file_meta.num_row_groups())
            .filter(|&rg_idx| {
                row_group_might_match(file_meta.row_group(rg_idx), predicate)
            })
            .collect();

        // If no row groups survive, drop the file entirely; otherwise
        // emit a FileScan with the surviving row group indices.
        if !surviving_row_groups.is_empty() {
            file_scans.push(FileScan {
                path: data_file.path,
                row_groups: surviving_row_groups,
            });
        }
    }

    Ok(ScanPlan { files: file_scans })
}
The three passes are visible as three blocks. Each block consumes only the output of the previous; the metadata reads are bounded; the pruning compounds. Production code parallelizes the per-manifest reads in Pass 2 and the per-file footer reads in Pass 3: the Artemis planner uses a futures::stream::FuturesUnordered to issue concurrent S3 GETs, capped at a configurable concurrency level (typically 64) to avoid overwhelming the connection pool. The structure is the same; the I/O is parallel.
A production version also surfaces the pruning summary the operations team uses to spot ineffective queries: it returns a ScanPlan plus a ScanPlanSummary containing the counts at each pass (manifests_total, manifests_kept, files_total, files_kept, row_groups_total, row_groups_kept), the metric the audit job consumes.
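The summary's shape follows directly from the field names above. A minimal sketch: the six counters are the ones named in the text, while the ratio methods and the zero-total convention are assumptions added for illustration.

```rust
/// Counts at each pruning pass; field names follow the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct ScanPlanSummary {
    pub manifests_total: u64,
    pub manifests_kept: u64,
    pub files_total: u64,
    pub files_kept: u64,
    pub row_groups_total: u64,
    pub row_groups_kept: u64,
}

/// kept / total, defined as 0.0 when nothing was considered.
fn kept_ratio(kept: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        kept as f64 / total as f64
    }
}

impl ScanPlanSummary {
    /// Fraction of manifests that survived Pass 1.
    pub fn manifest_kept_ratio(&self) -> f64 {
        kept_ratio(self.manifests_kept, self.manifests_total)
    }

    /// Fraction of considered data files that survived Pass 2.
    pub fn file_kept_ratio(&self) -> f64 {
        kept_ratio(self.files_kept, self.files_total)
    }

    /// Fraction of considered row groups that survived Pass 3.
    pub fn row_group_kept_ratio(&self) -> f64 {
        kept_ratio(self.row_groups_kept, self.row_groups_total)
    }
}
```

For the week-long-window example, 10 manifests kept out of 4,000 gives a manifest kept ratio of 0.0025, the complement of the 99.75% pruning figure quoted in Pass 1.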
Key Takeaways
- Pruning compounds across three passes: manifest list → manifest entries → row groups. Each pass uses statistics at a different granularity but the same statistics algebra (min/max/null intersection). Getting any one pass wrong loses an order of magnitude of work.
- Pruning correctness is asymmetric. False positives (reading a file that doesn't match) are acceptable; false negatives (skipping a file that does match) are silent data loss. Every pruning function preserves the asymmetry by being conservative on uncertainty.
- Lifting source predicates through transforms is the planner's most error-prone job. Identity preserves everything; date transforms preserve order with slight conservatism; bucket destroys order so range predicates can't lift; truncate produces conservative but sound lifts. Unsupported transforms fall back to "all partitions might match" — bucket partitioning is only correct for equality-queried columns.
- Clustering pays off at Pass 2. A clustered file has tight per-column statistics on the cluster columns; an unclustered file's statistics span the entire dimension. The 10×-15× clustering improvement from Lesson 2 shows up entirely in Pass 2's manifest-entry filtering.
- The pruning summary is a first-class metric. Per-query kept ratios (manifests-kept / manifests-total, files-kept / files-total) are the lakehouse's QPS-and-latency equivalent. High kept ratios on frequent queries (that is, little pruning) are an architectural signal that the partition or clustering strategy is misaligned with the workload.
Capstone — SDA Observation Partition Layout
Module: Data Lakes — M03: Partitioning and Clustering Estimated effort: 1–2 weeks of focused work Prerequisite: All three lessons in this module completed; all three quizzes passed (≥ 70%). The Module 2 capstone (mission archive table) is the substrate this module extends.
Mission Briefing
From: SDA Platform Lead, Cold Archive
ARCHIVE BRIEFING — RC-2026-04-DL-003
SUBJECT: SDA observation table partition + clustering layout for the
analyst workload.
PRIORITY: P1 — direct dependency for Q3 SDA dashboard rollout.
The SDA observation table now holds 18 months of fused observation data — roughly 8 TB compressed, 40,000 partitions if we use the proposed (mission, day) layout, 250,000 data files at current row-group sizing. The analyst workload against this table has been measured for six weeks; we have a query log of 3,200 representative queries with their predicates and the columns they project.
The job: design the partition spec and clustering layout for the table, implement them on top of the Module 2 table format, demonstrate the pruning effectiveness on the measured workload, and produce a design document that the next operator can read and understand.
The design must hold up under the actual workload — not the workload we wish we had. The query log will tell you which predicates appear with what frequency. Your spec must serve those predicates well. Don't over-fit (the workload will shift), but don't ignore the data either.
Module 4 will add time-travel queries that work against this layout; Module 6's compaction will rewrite files into the chosen cluster order. Get the layout right; the rest of the track builds on it.
What You're Building
A Rust crate, artemis-partition-layout, that extends the Module 2 artemis-table-format with:
- A PartitionSpec type and the transforms (Identity, Day, Month, Year, Bucket, Truncate) as described in Lesson 1, implementing Iceberg-spec-compatible computation.
- A SortOrder type recording the cluster columns and ordering (NullsFirst/Last) for the table.
- A lift_predicate function (Lesson 3) that turns source-column predicates into partition predicates per the transform's lifting rules. The function must produce safe conservative outputs for all (transform, op) pairs.
- An extended writer that respects both the partition spec (file boundaries) and the sort order (row order within file) on commit.
- An extended read_plan (Module 2 carry-forward) implementing all three pruning passes.
- A benchmarking harness that takes a query log and produces a pruning-effectiveness report.
The deliverable includes the implementation, the integration tests, the benchmark against the measured workload, and the design document.
Functional Requirements
- Transform implementations. Identity, Day, Month, Year, Bucket(N), Truncate(width). Match the Iceberg spec for output values (Day = days since Unix epoch as int32; Bucket = (Murmur3_32(value) & Integer.MAX_VALUE) % N; etc.).
- Predicate lifting. For each (transform, leaf_op) pair, implement the lifting rule. The output must be safe-conservative: never produces a partition predicate that excludes a partition where the source predicate could match. Unsupported pairs return AllPartitions (no lifting).
- Write-side partition splitting. A RecordBatch is split into per-partition-tuple sub-batches before being handed to the Parquet writer. Each output Parquet file contains rows from exactly one partition tuple.
- Write-side sort order. Within each partition's data files, rows are sorted by the table's sort order. For Z-order clustering, the writer computes Z-order keys and sorts by them. For lexicographic sort orders (defined in SortOrder), the writer uses arrow::compute::lexsort.
- Read planning, all three passes. Pass 1 prunes manifests by partition summary; Pass 2 prunes files by column statistics; Pass 3 prunes row groups via Parquet footers. Pass 3 happens in the Parquet reader; the table format hands the reader the file list and the predicate.
- Pruning-summary instrumentation. The planner returns a ScanPlanSummary with counts at each pass (manifests_total, manifests_kept, files_total, files_kept, row_groups_total, row_groups_kept).
- Benchmark harness. A binary partition-bench that takes a query log (TSV: query_id, predicate JSON, projected columns) and produces a report with the pruning ratios for each query plus aggregate statistics.
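The partition-splitting requirement can be sketched without Arrow. The version below groups plain structs by their (mission_id, day) tuple; a real writer would do the same grouping over RecordBatch columns. The Row type and function name are hypothetical, the day is kept as i64 rather than int32 for brevity, and timestamps are assumed non-negative so the rounding convention for pre-epoch values does not come into play.

```rust
use std::collections::BTreeMap;

/// Nanoseconds in one UTC day.
pub const NS_PER_DAY: i64 = 86_400 * 1_000_000_000;

#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub mission_id: String,
    pub ts_ns: i64, // assumed >= 0 in this sketch
}

/// Split rows into one group per (mission_id, day) partition tuple.
/// Each group would become the input to one Parquet writer, so every
/// output file holds rows from exactly one tuple.
pub fn split_by_partition(rows: Vec<Row>) -> BTreeMap<(String, i64), Vec<Row>> {
    let mut groups: BTreeMap<(String, i64), Vec<Row>> = BTreeMap::new();
    for row in rows {
        let key = (row.mission_id.clone(), row.ts_ns / NS_PER_DAY);
        groups.entry(key).or_insert_with(Vec::new).push(row);
    }
    groups
}
```

A BTreeMap keeps the groups in partition-tuple order, which makes the commit deterministic; a HashMap would work equally well if ordering does not matter.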
Acceptance Criteria
Verifiable (automated tests must demonstrate these)
- Each transform's output matches the Iceberg spec's reference test vectors (provided in tests/iceberg_transform_vectors.toml).
- lift_predicate(Day, Range(t1..t2)) produces a Day range predicate covering exactly the days containing [t1, t2]. Boundary-day tests (t1 is the first nanosecond of a day, t2 is the last) pass without dropping or duplicating data.
- lift_predicate(Bucket(N), Range) returns AllPartitions. lift_predicate(Bucket(N), Eq) returns the singleton bucket-equality predicate.
- lift_predicate(Truncate(W), Ge(v)) returns Ge(truncate(v)), which is strictly conservative for the boundary case where v is not on a truncation boundary.
- After committing a 5-million-row batch with partition by (mission_id, day(ts)) and 3 distinct missions × 7 distinct days, the resulting snapshot has exactly 21 partition tuples in the manifest entries (or fewer if some tuples have no rows).
- After committing the same batch with Z-order clustering on (payload_id, sensor_kind), each output Parquet file's per-column-chunk statistics for both payload_id and sensor_kind are tighter than the corresponding unclustered baseline (measured in the test as max - min for each column, averaged across files).
- read_plan(WHERE mission_id = 'apollo-7' AND ts >= D1 AND ts < D7) against a table with 40 missions × 1000 days returns at most 6 manifests in ScanPlanSummary.manifests_kept, demonstrating Pass 1 pruning.
- read_plan(WHERE mission_id = 'apollo-7' AND ts BETWEEN D1 AND D7 AND payload_id = 5) against a Z-order-clustered table returns at most 30% of the files that the same plan against an unclustered baseline returns, demonstrating Pass 2 pruning improvement from clustering.
Self-assessed (you write a short justification; reviewer checks it)
- (self-assessed) The partition spec choice for the SDA observation table is justified in docs/partition-spec-rationale.md against the measured workload. The doc enumerates the candidate columns, their query coverage and selectivity, and explains why the chosen spec (likely (mission_id, day(ts))) wins over alternatives.
- (self-assessed) The clustering choice is justified in docs/clustering-rationale.md against the measured workload. The doc shows the files-scanned-ratio improvement on representative queries and explains why Z-order (over lex sort, no clustering, or Hilbert) is the right tradeoff.
- (self-assessed) The lifting rules table is documented in docs/lifting-rules.md with a row per (transform, op) pair. For each row, the lift is either tight, conservative-but-sound (with the conservative loss quantified), or no-lift (with the reason). The doc is the artifact a future engineer adding a new transform will consult.
- (self-assessed) The benchmark harness's correctness is justified in docs/benchmark-correctness.md. The doc describes what counterfactual the harness compares against (no-clustering baseline) and why the comparison is valid.
Architecture Notes
A reasonable module layout (extending the Module 2 crate or as a new crate that depends on it):
artemis-partition-layout/
├── src/
│ ├── lib.rs
│ ├── spec.rs # PartitionSpec, PartitionField, SortOrder
│ ├── transform.rs # Transform enum, apply_transform
│ ├── lift.rs # lift_predicate for each (transform, op)
│ ├── write.rs # partition splitting + sort + parquet write
│ ├── read.rs # extended read_plan with three-pass pruning
│ ├── zorder.rs # zorder_2d, zorder_nd, batch sort helpers
│ └── bin/partition_bench.rs
├── tests/
│ ├── iceberg_transform_vectors.toml
│ ├── lift_correctness.rs
│ ├── partition_split.rs
│ ├── zorder_clustering.rs
│ └── pruning_pyramid.rs
├── benches/
│ └── workload_bench.rs
└── docs/
├── partition-spec-rationale.md
├── clustering-rationale.md
├── lifting-rules.md
└── benchmark-correctness.md
The Iceberg transform reference vectors should be sourced from the actual Iceberg test suite (apache/iceberg/api/src/test/java/.../TransformTestUtils.java) and ported to TOML for use here. The bucket transform must use the Iceberg-specified Murmur3 variant; the fastmurmur3 crate is the recommended implementation.
The Z-order normalization is the production tax flagged in Lesson 2. For the capstone, the cluster columns can be treated as already-u32 (true for the Artemis schema's payload_id and sensor_kind after dictionary indexing); a production implementation computes per-column ranks via approximate-quantile sketches at write time. The doc should note the simplification.
Hints
Hint 1 — Iceberg's bucket transform's exact specification
The Iceberg bucket transform is (Murmur3_32(value) & Integer.MAX_VALUE) % N for integer N. The Murmur3 variant is specifically the 32-bit one with seed=0. The fastmurmur3::murmur3_32(&value_bytes, 0) call produces the right hash; AND-ing with i32::MAX (which is 0x7FFF_FFFF as u32) before the modulo is what produces non-negative bucket indices. Mismatches with the Iceberg reference are almost always sign-handling: the test vector expects bucket 7, and your code produces 7 when the hash's sign bit is clear but a different (possibly negative) result when the sign bit is set.
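The sign-handling point can be checked in isolation from the hash itself. The sketch below assumes the 32-bit hash has already been computed; the `hash` argument stands in for whatever the Murmur3 call returns. It compares the mask-then-modulo formula against a signed remainder. The names `bucket_index` and `bucket_index_signed_bug` are illustrative and not part of the capstone API.

```rust
/// Bucket index per the formula in this hint: clear the sign bit, then take
/// the remainder. `hash` is assumed to be the raw 32-bit Murmur3 output.
pub fn bucket_index(hash: u32, num_buckets: u32) -> u32 {
    (hash & 0x7FFF_FFFF) % num_buckets
}

/// A common mistake, shown for contrast: reinterpret the hash as a signed
/// integer and use `%`, which keeps the sign of the dividend in Rust.
pub fn bucket_index_signed_bug(hash: u32, num_buckets: u32) -> i32 {
    (hash as i32) % (num_buckets as i32)
}
```

For a hash with the sign bit clear the two functions agree. For 0xFFFF_FFF9 (which is -7 as an i32) with 16 buckets, the masked form yields bucket 9 while the signed remainder yields -7, which is not a valid bucket at all.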
Hint 2 — The Day transform's UTC convention
The Day transform produces "days since Unix epoch" as an int32. The conversion is ns_since_epoch / 1_000_000_000 / 86_400, treating the timestamp as UTC. Some test inputs are negative (pre-1970 timestamps); the division semantics for negative integers (round-toward-zero vs round-toward-negative-infinity) matter. The Iceberg spec uses Java's integer division, which is round-toward-zero. Rust's i64::div_euclid is round-toward-negative-infinity; plain / is round-toward-zero — match Java's behavior with plain /.
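The difference between the two division semantics only shows up for pre-1970 inputs, so it is worth seeing side by side. This is a minimal sketch with illustrative function names; it returns i64 for simplicity, and a real implementation would narrow the result to the int32 the transform specifies.

```rust
const NANOS_PER_SECOND: i64 = 1_000_000_000;
const SECONDS_PER_DAY: i64 = 86_400;

/// Day ordinal using plain `/`, which truncates toward zero (the behavior
/// this hint says to match).
pub fn day_truncating(ns_since_epoch: i64) -> i64 {
    ns_since_epoch / NANOS_PER_SECOND / SECONDS_PER_DAY
}

/// Day ordinal using `div_euclid`, which rounds toward negative infinity
/// when the divisor is positive.
pub fn day_flooring(ns_since_epoch: i64) -> i64 {
    ns_since_epoch
        .div_euclid(NANOS_PER_SECOND)
        .div_euclid(SECONDS_PER_DAY)
}
```

One second before the epoch (-1_000_000_000 ns) is day 0 under truncation and day -1 under flooring; for non-negative inputs the two agree. A test vector with a negative timestamp is the quickest way to find out which one your code uses.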
Hint 3 — The lifting-rule table as the test driver
Implement the lifting rules as a data-driven table: a TOML file with rows of (transform, source_op, partition_op) triples, and a test that exercises each row by constructing the source predicate, lifting it, and asserting the partition predicate matches the expected output. The TOML-driven approach makes the rules easy to extend (new transform → new TOML rows) and easy to audit (the table itself is the doc the lift-correctness doc points at).
Hint 4 — Measuring against a no-clustering baseline
The benchmark harness needs a "no-clustering" counterfactual to measure clustering effectiveness against. The simplest approximation: for each query, compute the set of partition tuples that the partition predicates select, then assume every file in those partitions would be read. This is what an unclustered table would do at Pass 2 (no per-file pruning beyond the partition's). The clustered table's files-scanned count divided by this counterfactual is the clustering ratio. The approximation slightly overestimates the unclustered baseline (real unclustered files still carry min/max statistics that prune a few files by chance), so the reported ratio flatters clustering a little; in exchange it gives a clean, reproducible number.
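The arithmetic behind the counterfactual is small enough to sketch. The example below assumes the harness already knows the file count per partition and the set of partitions a query selects; the string partition keys, the `files_per_partition` map, and both function names are hypothetical stand-ins for whatever the harness actually tracks.

```rust
use std::collections::HashMap;

/// Files-scanned count for the no-clustering counterfactual: every file in
/// every partition the query's partition predicates select.
/// `files_per_partition` maps a partition key to its file count.
pub fn baseline_files_scanned(
    files_per_partition: &HashMap<String, usize>,
    selected_partitions: &[&str],
) -> usize {
    selected_partitions
        .iter()
        .map(|p| files_per_partition.get(*p).copied().unwrap_or(0))
        .sum()
}

/// Clustering ratio: clustered files scanned divided by the baseline.
/// Returns None when the baseline is zero (the query selects no files).
pub fn clustering_ratio(clustered_files_scanned: usize, baseline: usize) -> Option<f64> {
    if baseline == 0 {
        None
    } else {
        Some(clustered_files_scanned as f64 / baseline as f64)
    }
}
```

A query that selects two partitions holding 40 and 60 files has a baseline of 100; if the clustered plan scans 25 files, the ratio is 0.25. Returning `None` for an empty baseline keeps zero-file queries from polluting the averaged report with a division by zero.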
Hint 5 — The query-log format
The supplied query log (in tests/queries.tsv) is TSV with columns query_id, predicate_json, projected_columns. The predicate_json is a JSON-encoded AST with op, column, value (or lo, hi) fields. A small serde-driven parser turns each line into a Predicate value the planner can consume. The harness does not need to actually execute queries — it only needs to plan them and measure the planning output. Plan-and-measure is far cheaper than plan-and-execute; the benchmark runs the entire 3,200-query log in seconds, not hours.
References
- Designing Data-Intensive Applications (Kleppmann & Riccomini), Chapter 6 — "Partitioning"
- Apache Iceberg specification, "Partition Transforms" and "Partition Evolution" sections
- Morton (1966), "A computer oriented geodetic data base and a new technique in file sequencing" — the Z-order paper
- DuckDB blog, "Z-Order Indexing for Multi-Dimensional Range Queries" — the applied perspective
When You're Done
The crate is "done" when all eight verifiable acceptance criteria pass in CI, the four self-assessed docs are written, and the benchmark report shows the pruning improvement on the 3,200-query workload. Module 4 begins with the assumption that this layout is in place; the time-travel mechanics will exercise it heavily.
Module 4: Time Travel and Schema Evolution
Lesson 1 — Snapshot Isolation on Object Storage
Module: Data Lakes — M04: Time Travel and Schema Evolution
Position: Lesson 1 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 7 ("Transactions — Snapshot Isolation and Repeatable Read", "Indexes and Snapshot Isolation"). Apache Iceberg specification, "Scan Planning" and "Snapshot Retention" sections.
Context
Module 2 introduced snapshots as the unit of immutable table-version metadata. Module 3 introduced read planning against the current snapshot. This lesson develops the isolation contract that snapshots provide to readers — what guarantees a query holds about the state of the table during its execution, and what the writer side has to do to preserve those guarantees.
The right frame is the one DDIA (Ch. 7) develops for traditional databases: snapshot isolation is the property that "each transaction reads from a consistent snapshot of the database — that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time." In the lakehouse the same property holds, structurally, because the storage substrate is immutable: every snapshot file remains on disk after subsequent commits, so a reader that pins a snapshot ID at query start continues to read against that snapshot for the query's full duration, regardless of how many other commits arrive in the meantime.
This lesson makes the structural property explicit: the read-side protocol that pins a snapshot, the implications for long-running queries, the interaction with snapshot expiration that determines how long a snapshot remains readable, and the specific guarantees and non-guarantees the model provides. The capstone's mission-replay engine depends on this protocol — replay queries are by definition long-running reads against past snapshots, and the snapshot-expiration retention window directly bounds how far back replay can reach.
Core Concepts
Snapshot Isolation as a Free Consequence of Immutability
DDIA (Ch. 7) describes snapshot isolation as a database engine's deliberate machinery — multiple object versions, garbage collection of old versions, careful read-time visibility rules. The lakehouse gets the same property essentially for free. Each commit produces new metadata files; old metadata files are not modified; the catalog pointer changes atomically. A reader that records the current snapshot ID at query start has a stable handle: the metadata files referenced by that snapshot ID remain on disk, unmodified, fully readable, regardless of how many commits arrive after.
The corollary that matters operationally: a long-running query does not block writers, and writers do not block readers. The Artemis ingestion pipeline can commit a new snapshot every thirty seconds while an analyst query that takes ten minutes runs against the snapshot from before the query started. The analyst sees consistent data; the writers make progress; no coordination is needed beyond the catalog CAS that the writers use among themselves.
DDIA (Ch. 7) calls out the same property for traditional MVCC: "Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It is very hard to reason about the meaning of a query if the data on which it operates is changing at the same time as the query is executing." The lakehouse case is identical except cheaper — the immutability is structural, not stored as a per-row version chain that the engine garbage-collects, so the per-row overhead is zero.
The Pin Protocol
A query against a lakehouse table pins a snapshot by capturing the snapshot ID at planning time. The protocol:
- The planner reads the catalog's current snapshot for the table. Call this S.
- The planner reads the snapshot metadata file for S from the object store.
- The planner reads S's manifest list, then its referenced manifests, then plans the data file reads (Module 3's three-pass pruning).
- The reader executes the plan against the data files. Every file read uses paths from S's manifest entries.
After step 1, the query never re-reads the catalog. The catalog can advance to S+1, S+2, … while the query runs; the query continues to use S as its snapshot. Other queries against the same table started after the catalog advances will read the new snapshot; long-running queries hold their original pin.
A subtlety: the pin must be explicit in the query's state. The Artemis read planner returns a ScanPlan that includes the pinned snapshot_id; the reader includes it in its log lines and observability. If a query takes longer than expected, the operations team can correlate the query's snapshot ID with the catalog history and determine exactly which version of the table the query is reading. This is the lakehouse equivalent of a database's "transaction start timestamp" — same diagnostic shape, same operational value.
Long-Running Queries and the Expiration Window
The pin protocol holds the snapshot's metadata files (snapshot, manifest list, manifests) and data files in their referenced state for the query's duration. The hidden requirement: those files must continue to exist on disk while the query runs. Snapshot expiration (Module 6) is the maintenance job that physically deletes snapshot files older than a retention window; a query that pins a snapshot that is expired during its execution sees its in-flight reads fail with "object not found."
The Artemis archive sets the retention window to 30 days. A query is safe as long as the pinned snapshot's age plus the query's runtime stays under 30 days: a query against a 29-day-old snapshot has about one day to finish before expiration can remove its files. Replay queries against snapshots older than 30 days are explicitly unsupported by the live read path; the data is available in the cold-archive backup (immutable snapshots replicated to an object-locked bucket) and accessed through a separate read path with longer SLAs.
The interaction worth understanding. The retention window is the lakehouse's analog of the database's MVCC garbage-collection horizon. DDIA (Ch. 7, "Indexes and snapshot isolation") makes the same point for traditional systems: long-running queries hold snapshots; the system cannot reclaim space until the queries finish; the operations team has to set a horizon past which it gives up on long-running queries to avoid running out of space. The Artemis archive's 30-day window is the equivalent setting; query timeouts are configured well within it (the default planner sets a 6-hour query timeout) to avoid the operational case where a forgotten query holds an old snapshot indefinitely.
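The relationship between snapshot age, query timeout, and retention window can be written as a one-line admission check. This is a sketch under the assumption that expiration is driven purely by snapshot age; `pin_is_safe` is a hypothetical helper, not a function in the Artemis planner.

```rust
use std::time::Duration;

/// Returns true when a query pinned to a snapshot of the given age can run
/// for up to `query_timeout` without the snapshot crossing the retention
/// horizon. Assumes expiration is driven purely by snapshot age.
pub fn pin_is_safe(snapshot_age: Duration, query_timeout: Duration, retention: Duration) -> bool {
    match snapshot_age.checked_add(query_timeout) {
        Some(age_at_deadline) => age_at_deadline < retention,
        // Overflow means the deadline is unrepresentably far out: not safe.
        None => false,
    }
}
```

With a 30-day retention and the 6-hour default timeout, a pin on a 29-day-old snapshot passes the check, while a pin on a snapshot that is 29 days and 23 hours old does not. A planner could run a check like this at pin time and reject the query up front instead of letting it fail mid-scan.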
What Snapshot Isolation Guarantees, and What It Does Not
The guarantee is precise: every row read by a query reflects the table's state as of the pinned snapshot's commit time. No row reflects a later commit; no row is missing because of a later commit. The query produces the result it would have produced if no writer had run during the query.
The non-guarantees, also precise:
- Snapshot isolation is not serializable isolation. Two read-only queries that pin different snapshots can produce results that no serial execution of all transactions could produce; the lakehouse does not order queries among themselves except by their pin. For typical analytical queries this is irrelevant — the queries are independent — but it matters for any workflow that runs two queries and relies on them seeing the same state. The fix is to share a snapshot ID across the related queries; the Artemis tooling supports a --pin-snapshot flag for this case.
- Snapshot isolation does not prevent write skew. DDIA (Ch. 7, "Write Skew and Phantoms") describes the anomaly: two writers each read a state, each decide to modify a different row, each commit, neither observes the other's changes. The lakehouse's optimistic concurrency control (Module 2) detects writer-writer conflicts at the CAS but does not detect read-modify-write conflicts where the read and the write target different rows. For append-only workloads this is irrelevant; for overwrite workloads the application layer must structure its commits to avoid the pattern.
- Snapshot isolation does not bound staleness across regions. If the object store and the catalog are geographically distributed, a query in region A may see a snapshot that lags behind region B's writers by the inter-region replication lag. The Artemis archive runs a single catalog and a single primary object store region, with backup replication to a second region — this avoids the cross-region staleness problem at the cost of higher write latency for that one region. Multi-region lakehouses generally accept staleness in trade for write availability.
The guarantee that is given is the one that matters for analytical workloads: a query operates on a consistent, point-in-time view of the table. This is what makes the lakehouse reliable for the long-running, read-only analyses that dominate its workload.
Implications for Operational Pipelines
The pin protocol has several operational implications worth knowing.
Reader retries are cheap. A query that fails partway through (network error, transient object-store fault) can be retried by replanning against the same pinned snapshot. The retry produces identical results because the snapshot is unchanged. The Artemis reader does this automatically — any query failure with a recognized-transient error code triggers a retry against the original pinned snapshot, up to a configurable retry budget.
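The retry behavior can be sketched as a loop that holds the pin fixed across attempts. Everything here is illustrative: `ReadError`, its two variants, the `u64` snapshot ID, and `run_with_retries` are assumptions made for the example, since the real reader's error codes and types are not shown in this lesson.

```rust
/// Error classification assumed for this sketch.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    Transient(String),
    Permanent(String),
}

/// Run `attempt` against the same pinned snapshot ID until it succeeds,
/// hits a permanent error, or exhausts `retry_budget` retries.
pub fn run_with_retries<T>(
    pinned_snapshot_id: u64,
    retry_budget: u32,
    mut attempt: impl FnMut(u64) -> Result<T, ReadError>,
) -> Result<T, ReadError> {
    let mut retries_used = 0;
    loop {
        match attempt(pinned_snapshot_id) {
            Ok(value) => return Ok(value),
            Err(ReadError::Transient(_)) if retries_used < retry_budget => {
                // The pin never changes between attempts, so a retry reads
                // exactly the same input as the failed attempt.
                retries_used += 1;
            }
            // Permanent errors, and transient errors past the budget.
            Err(err) => return Err(err),
        }
    }
}
```

The design point is that the snapshot ID is an argument to the loop, not something each attempt looks up: a retry that re-read the catalog could silently move to a newer snapshot and return a result that mixes two table versions across attempts.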
Tail-latency analysis is meaningful. A query that takes 10 minutes against a stable snapshot is a query against a known input. The same query at the same configuration reads the same input bytes the next time it runs (the snapshot is immutable). This makes performance regressions observable: a query that took 10 minutes last week and takes 20 minutes this week, against the same snapshot, has a real regression in either the planner or the storage layer — not a workload shift, because the workload (the snapshot) is unchanged.
The snapshot ID is the right caching key. Computed results that depend on a snapshot's data can be cached by snapshot ID; cache invalidation is trivial because a new snapshot has a new ID and a new cache entry. The Artemis dashboard caches frequently-computed aggregations this way; the cache hit rate at steady state is over 99% because snapshots typically change every few minutes while dashboard queries refresh every few seconds.
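A cache keyed this way needs no invalidation logic at all, which a short sketch makes concrete. The `SnapshotCache` type, the `u64` snapshot ID, and the string query fingerprint are assumptions for illustration; the dashboard's actual cache is not shown in this lesson.

```rust
use std::collections::HashMap;

/// Cache of computed results keyed by (snapshot ID, query fingerprint).
/// A new snapshot has a new ID, so a stale entry can never be returned for
/// it; old entries simply stop being requested.
pub struct SnapshotCache<V> {
    entries: HashMap<(u64, String), V>,
}

impl<V: Clone> SnapshotCache<V> {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached value for this snapshot and query, computing and
    /// storing it on a miss.
    pub fn get_or_compute(
        &mut self,
        snapshot_id: u64,
        query_fingerprint: &str,
        compute: impl FnOnce() -> V,
    ) -> V {
        self.entries
            .entry((snapshot_id, query_fingerprint.to_string()))
            .or_insert_with(compute)
            .clone()
    }
}
```

Two requests for the same aggregation against snapshot 4729 run the computation once; the first request after the table advances to a new snapshot misses and recomputes. A production cache would also evict entries for snapshots that are no longer current, which this sketch leaves out.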
Core Mechanics in Code
Pinning a Snapshot for a Query
The minimal read-side protocol: capture the snapshot once at the start of the query, use it for all subsequent reads.
use anyhow::Result;

pub struct PinnedQuery {
    pub table: String,
    pub snapshot_id: SnapshotId,
    pub snapshot_metadata_path: String,
}

/// Pin the current snapshot of the table. The returned PinnedQuery
/// captures the snapshot ID and metadata path; subsequent reads use
/// these directly without consulting the catalog again.
pub async fn pin_current(catalog: &PostgresCatalog, table: &str) -> Result<PinnedQuery> {
    let entry = catalog.get_current(table).await?;
    Ok(PinnedQuery {
        table: table.to_string(),
        snapshot_id: entry.current_snapshot_id,
        snapshot_metadata_path: entry.metadata_path,
    })
}

/// Read the table at a previously-pinned snapshot. This function does
/// not consult the catalog; it goes straight to the snapshot's metadata
/// path. If the snapshot has been expired (Module 6) and its metadata
/// removed, the metadata read fails with NotFound and the caller must
/// handle the case.
pub async fn read_pinned(
    query: &PinnedQuery,
    predicate: &Predicate,
) -> Result<ScanPlan> {
    let snapshot: Snapshot = read_metadata_file(&query.snapshot_metadata_path).await?;
    plan_scan_against_snapshot(&snapshot, predicate).await
}
The discipline. The catalog is consulted once, at pin time. After that, the query is independent of any subsequent commits. The query's results are determined entirely by the snapshot ID it captured; two queries with the same pin produce identical results.
Pinning a Past Snapshot
Time travel (Lesson 2 develops this in depth) is just pinning a snapshot other than the current one. The mechanics are the same.
use anyhow::Result;

/// Pin a specific past snapshot by ID. The snapshot history is recorded
/// in the table's snapshot-history metadata; this function consults the
/// history to find the metadata path for the requested snapshot ID.
pub async fn pin_by_id(
    catalog: &PostgresCatalog,
    table: &str,
    snapshot_id: SnapshotId,
) -> Result<PinnedQuery> {
    let history = read_snapshot_history(catalog, table).await?;
    // `history` is a Vec, so search through an iterator over its entries.
    let entry = history
        .iter()
        .find(|s| s.snapshot_id == snapshot_id)
        .ok_or_else(|| anyhow::anyhow!("snapshot {snapshot_id} not in history"))?;
    Ok(PinnedQuery {
        table: table.to_string(),
        snapshot_id,
        snapshot_metadata_path: entry.metadata_path.clone(),
    })
}

/// Pin the snapshot that was current at the given UTC timestamp. The
/// implementation walks the snapshot history backward from the current
/// snapshot until it finds one whose commit timestamp is <= the target.
pub async fn pin_at_time(
    catalog: &PostgresCatalog,
    table: &str,
    timestamp_ms: i64,
) -> Result<PinnedQuery> {
    let history = read_snapshot_history(catalog, table).await?;
    // Walk history backward; the first snapshot with timestamp <= target
    // is the one that was current at the target time.
    let entry = history
        .iter()
        .rev()
        .find(|s| s.timestamp_ms <= timestamp_ms)
        .ok_or_else(|| anyhow::anyhow!("no snapshot before {timestamp_ms}"))?;
    Ok(PinnedQuery {
        table: table.to_string(),
        snapshot_id: entry.snapshot_id,
        snapshot_metadata_path: entry.metadata_path.clone(),
    })
}
What to notice. Time travel does not require any new mechanism — it is the same pin protocol applied to a different snapshot ID. The retrieval mechanism (consult the snapshot-history metadata) is the only new piece; the read path after the pin is unchanged. The snapshot-history metadata is modeled on Iceberg's snapshot-log field (with the metadata path, which Iceberg tracks separately in its metadata-log, carried alongside): a small append-only list of (snapshot_id, commit_timestamp_ms, metadata_path) triples. Maintaining it is a per-commit responsibility of the writer; reading it is the only catalog dependency for time-travel queries.
Key Takeaways
- Snapshot isolation in a lakehouse is a structural consequence of the immutable-snapshot model, not a deliberate engine mechanism. Each commit produces new metadata; old metadata is never modified; readers pin a snapshot and use it for the query's duration regardless of subsequent commits.
- The pin protocol is one catalog read at query start, captured in the query's state, and used for every subsequent read. The catalog can advance freely during the query; the pin holds the query's view stable.
- Snapshot retention bounds the time-travel reach. The Module 6 snapshot-expiration job deletes old snapshot files after a configurable retention window; queries against expired snapshots fail. The Artemis archive uses a 30-day window; query timeouts are configured well inside it.
- Snapshot isolation guarantees per-query consistency, not cross-query serializability. Two related queries can pin different snapshots and see different states; the fix is to share a pin across them. Write skew is not prevented; the optimistic-CAS layer (Module 2) detects writer-writer conflicts but not read-modify-write conflicts on disjoint rows.
- The snapshot ID is the right cache key, the right log correlation ID, and the right unit for performance analysis. Snapshots are immutable inputs; two runs against the same snapshot are two runs against the same input. Performance regressions and result divergences are diagnosable in terms of snapshot IDs.
Lesson 2 — Time Travel Queries and Change Data Feed
Module: Data Lakes — M04: Time Travel and Schema Evolution
Position: Lesson 2 of 3
Source: Apache Iceberg specification, "Scan Planning — Time Travel" section. Delta Lake protocol, "Change Data Feed" section, for the diff-based pattern. Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 11 ("Stream Processing — Change Data Capture") for the CDC semantics.
Source note: The time-travel query mechanics are well-supported by the Iceberg spec. The change-data-feed pattern in this lesson is synthesized — the lakehouse formats implement CDF differently (Delta has a first-class feature; Iceberg supports it via snapshot diff inference); the lesson focuses on the structural pattern, with references to the format-specific implementations.
Context
Lesson 1 made snapshot isolation explicit: a query pins a snapshot, reads against it, and is unaffected by subsequent commits. The natural generalization is time travel — pinning a snapshot other than the current one. The mechanism is unchanged; what changes is which snapshot the query pins, and how the query's results are interpreted by the consumer.
Two query patterns dominate the time-travel workload. Point-in-time queries ask "what did the table look like at moment T?" — a historical reconstruction for accident investigation, audit reporting, or reproducing a prior analysis. Incremental queries ask "what changed in the table between moments T1 and T2?" — the change-data-feed (CDF) pattern that drives downstream pipelines, dashboards, and ML feature stores. Both are built on the same primitive: read the table at a specific snapshot.
The capstone — the Mission Replay Engine — implements point-in-time queries against the Artemis Orbital Object Registry. Replay is what enables the accident-investigation workflow that motivates the cold archive: when something fails on orbit, the analyst pins the snapshot from immediately before the anomaly and reconstructs the registry's state as the operators saw it at the time. The lesson develops both the point-in-time and the change-data-feed patterns; the capstone exercises the point-in-time path.
Core Concepts
Two Time-Travel Addressing Modes
A time-travel query specifies the snapshot to read in one of two ways: by snapshot ID or by timestamp. Both addressings ultimately resolve to a single snapshot ID; the difference is the lookup mechanism.
By snapshot ID. The query specifies the snapshot ID directly. The reader consults the snapshot history to find the snapshot's metadata path, then proceeds with the normal read protocol. This is the operationally simplest path and the one the audit-trail use case typically uses — an event recorded as "snapshot 4729 contained the configuration" can be replayed later by pinning snapshot 4729.
By timestamp. The query specifies a UTC timestamp. The reader walks the snapshot history backward to find the most recent snapshot whose commit timestamp is at or before the target. The pinned snapshot is the one that was current at the target time. This is the analyst-friendly path — humans think in timestamps, not snapshot IDs — and it is the path the Mission Replay Engine exposes.
Both addressings depend on the snapshot history metadata: a small append-only log of (snapshot_id, commit_timestamp_ms, metadata_path) triples maintained by every commit. Iceberg's table metadata splits this information across two fields: snapshot-log (timestamp and snapshot ID pairs) and metadata-log (timestamp and metadata file location pairs). The log is bounded by the retention window — only snapshots still on disk are reachable for time travel. Snapshots older than the retention horizon are absent from the log (Module 6's snapshot expiration removes them), and queries against them return a "snapshot not found" error.
A subtle case the timestamp-based lookup must handle: the target timestamp falls between two commits. The convention is "snapshot N was current from commit_timestamp_ms[N] to commit_timestamp_ms[N+1]." A query at a timestamp T finds the snapshot whose commit_timestamp_ms <= T and either there is no next snapshot (it's still current) or the next snapshot's timestamp is > T. The found snapshot is the one to pin. The convention puts every instant into exactly one snapshot's window — including the rare instant exactly at a commit timestamp, which is conventionally assigned to the new snapshot.
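The window convention, including the boundary case, can be pinned down with a few lines and a handful of assertions. The sketch below uses a stripped-down `HistoryEntry` with only the two fields the lookup needs, and it uses the standard library's `partition_point` on a history that is assumed to be sorted by ascending commit timestamp. Both the type and `snapshot_at` are illustrative names.

```rust
/// Minimal stand-in for a snapshot-history entry; the real entry carries
/// more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct HistoryEntry {
    pub snapshot_id: u64,
    pub commit_timestamp_ms: i64,
}

/// Resolve `target_ms` to the snapshot whose window contains it. `history`
/// must be sorted by ascending commit timestamp. Returns None when the
/// target precedes the oldest retained snapshot.
pub fn snapshot_at(history: &[HistoryEntry], target_ms: i64) -> Option<&HistoryEntry> {
    // Count the entries committed at or before the target; the last of
    // them is the snapshot that was current at `target_ms`.
    let committed = history.partition_point(|e| e.commit_timestamp_ms <= target_ms);
    committed.checked_sub(1).map(|i| &history[i])
}
```

With commits at 1000, 2000, and 3000 ms, a target of 1500 resolves to the first snapshot, a target of exactly 2000 resolves to the second (the new snapshot wins the boundary instant), and a target of 999 resolves to nothing. The `<=` in the predicate is what encodes the boundary rule; changing it to `<` would assign the commit instant to the prior snapshot.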
Schema-of-the-Time Projection
A time-travel query against snapshot S sees S's data, but the schema projection is the operator's choice. Two reasonable answers, both supported:
Current-schema projection. The query reads S's data files and projects them against the table's current schema. Columns added since S was committed appear as nulls (the data files don't have them); columns dropped since S are filtered out. This is the right choice when the consumer is current tooling that expects the current schema, for instance a dashboard built against a stable schema that also replays historical data.
Snapshot-time schema projection. The query reads S's data files and projects them against the schema that was current at S's commit time. Columns added later are not in the result; columns dropped later are present. This is the right choice for accident investigation and audit replay — the analyst wants to see the table as it looked then, not as it would look now.
The Mission Replay Engine exposes both via a query parameter (schema_mode = current | snapshot_time); the default is snapshot_time because the dominant use case is reconstructing operational state. The mechanics differ only in which schema_id the reader uses for projection; the data file reads are identical.
The schema-of-the-time projection requires the table's schema history. Iceberg records every schema the table has had via the schemas field in the table metadata, keyed by schema_id. Each snapshot records the schema_id that was current at its commit. The reader looks up the snapshot's schema_id, finds the schema in the schemas list, and uses it to project the data file reads.
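The two projection modes differ only in which schema is handed to the same projection step. The sketch below reduces that step to its essentials, assuming rows are addressed by column ID and a schema is just a list of (column ID, name) pairs. The `Schema` alias, `project_row`, and the column names in the usage note are hypothetical.

```rust
use std::collections::HashMap;

/// A schema reduced to (column ID, column name) pairs for this sketch.
pub type Schema = Vec<(u32, String)>;

/// Project one row, stored as column ID -> value, against `target`.
/// Columns in `target` but absent from the row come back as None (null);
/// columns in the row but absent from `target` are dropped.
pub fn project_row(
    row: &HashMap<u32, String>,
    target: &Schema,
) -> Vec<(String, Option<String>)> {
    target
        .iter()
        .map(|(id, name)| (name.clone(), row.get(id).cloned()))
        .collect()
}
```

Take a row written with columns 1 (object_id) and 2 (orbit_class). Projected against the snapshot-time schema it returns both values. Projected against a current schema that has dropped column 2 and added column 3 (operator), it returns object_id with its value and operator as null, and orbit_class disappears from the output.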
Change Data Feed: The Snapshot Diff Pattern
The change-data-feed (CDF) pattern reads the difference between two snapshots. Given snapshots S_old and S_new, CDF returns the rows that were added (in S_new but not S_old) and the rows that were removed (in S_old but not S_new). The pattern is the foundation of downstream pipelines that want to keep their derived data in sync with the table without recomputing from scratch.
The diff is computed at the manifest entry level, not by comparing row content. Each manifest entry has an Added, Existing, or Deleted status; the entry's status is what records the change relative to the prior snapshot. Diffing two snapshots produces:
- Added files: files whose manifest entries have status Added in any snapshot after S_old, up to and including S_new.
- Removed files: files whose manifest entries have status Deleted in any snapshot after S_old, up to and including S_new.
A consumer that wants the added rows reads the added files; a consumer that wants the removed rows reads the removed files (the data is still on disk because the files have not yet been physically deleted by snapshot expiration). The output is two streams of Arrow record batches: the inserts and the deletes.
The Delta Lake CDF feature provides this directly. Iceberg's snapshot.added_files() and snapshot.removed_files() provide the primitives; the Artemis archive's CDF tooling wraps these into a stream-of-batches API. Both formats produce the same result for the same diff; the format-specific details are around how the row-level data is materialized (Delta supports per-row CDC with _change_type columns; Iceberg infers the change types from the file-level diff).
The cost of CDF computation is proportional to the number of changes between the snapshots, not the table size. A consumer that polls for changes every 30 seconds sees only the files added/removed in the last 30 seconds — typically a handful of files, regardless of the table's total size. This is the property that makes CDF cheap enough to drive downstream pipelines without batch-job overhead.
Long-Running Replays and the Expiration Boundary
The replay workload has a subtlety the operational team must handle: replay queries are long-running by nature. A full-mission replay reads months of data; the query can run for hours. The pin protocol holds the snapshot for the query's duration, which means the snapshot's data files must remain on disk for at least that long.
The retention window is the operational lever. The Artemis archive's 30-day window allows replays of up to 30 days of history to complete in real time; replays against older history use the longer-retention cold-archive backup tier. Replay queries that exceed the retention window fail with a clear error code; the application either accepts the failure or escalates to the cold-archive tier.
A separate concern: replay queries are read-heavy and may starve concurrent ingest writers' bandwidth. The Artemis read path runs replays through a separate worker pool with rate-limiting on object-store GET requests, so the ingest writers' Parquet uploads are not delayed. This is operational shaping, not a property of the table format itself; the same pattern applies to any read-heavy workload sharing infrastructure with writers.
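One common way to rate-limit GET requests is a token bucket, sketched below. This is an assumption about technique, since the lesson does not say how the Artemis worker pool implements its limit; `GetRateLimiter` and its parameters are illustrative. Time is passed in explicitly, in milliseconds, so the behavior is deterministic and testable.

```rust
/// Token bucket for pacing object-store GET requests. Tokens are tracked
/// in thousandths so the arithmetic stays in integers.
pub struct GetRateLimiter {
    capacity_milli: u64,
    tokens_milli: u64,
    refill_per_second: u64,
    last_refill_ms: u64,
}

impl GetRateLimiter {
    /// Start with a full bucket of `capacity` tokens at time `now_ms`.
    pub fn new(capacity: u32, refill_per_second: u32, now_ms: u64) -> Self {
        Self {
            capacity_milli: capacity as u64 * 1000,
            tokens_milli: capacity as u64 * 1000,
            refill_per_second: refill_per_second as u64,
            last_refill_ms: now_ms,
        }
    }

    /// Try to take one token for a GET. Returns false when the bucket is
    /// empty, in which case the caller should wait and try again.
    pub fn try_acquire(&mut self, now_ms: u64) -> bool {
        let elapsed_ms = now_ms.saturating_sub(self.last_refill_ms);
        // r tokens per second equals r milli-tokens per millisecond.
        let refill = elapsed_ms.saturating_mul(self.refill_per_second);
        self.tokens_milli = self
            .tokens_milli
            .saturating_add(refill)
            .min(self.capacity_milli);
        self.last_refill_ms = self.last_refill_ms.max(now_ms);
        if self.tokens_milli >= 1000 {
            self.tokens_milli -= 1000;
            true
        } else {
            false
        }
    }
}
```

The capacity bounds how large a burst of GETs a replay worker can issue, and the refill rate bounds its sustained request rate. Giving the replay pool its own limiter, separate from the ingest path, is what keeps a multi-hour replay from competing with Parquet uploads.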
The Replay Workflow End to End
A canonical Mission Replay flow:
- The analyst specifies a target timestamp (typically the moment before an anomaly observed in orbit).
- The replay engine calls pin_at_time(catalog, "orbital_object_registry", target_timestamp_ms) to resolve the snapshot.
- The replay engine reads the snapshot's schema_id and looks up the corresponding schema from the table metadata's schemas field.
- The replay engine plans the scan against the pinned snapshot (Module 3's three-pass pruning, using the snapshot's partition spec).
- The replay engine reads the scan plan, producing Arrow batches projected against the snapshot-time schema.
- The engine emits the batches to the consumer (an analyst's notebook, a dashboard, or a downstream replay validator).
Each step is bounded by the snapshot's contents. The replay produces deterministic output: running the same query at a different time produces the same result, because the snapshot is unchanged. This determinism is what makes replay-based analyses scientifically valid; the analyst can reproduce another team's investigation by replaying against the same snapshot and seeing the same data.
Core Mechanics in Code
Reading the Snapshot History
The snapshot history is the source of truth for time-travel resolution. The function below is the minimal reader.
use anyhow::Result;
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SnapshotHistoryEntry {
    pub snapshot_id: SnapshotId,
    pub timestamp_ms: i64,
    pub metadata_path: String,
    pub schema_id: u32,
    pub partition_spec_id: u32,
}

/// Read the table's snapshot history. The history is stored in the table
/// metadata's snapshot_log (modeled on Iceberg's snapshot-log), which is
/// updated on every commit. Entries older than the retention window are
/// removed by snapshot expiration (Module 6).
pub async fn read_snapshot_history(
    catalog: &PostgresCatalog,
    table: &str,
) -> Result<Vec<SnapshotHistoryEntry>> {
    let entry = catalog.get_current(table).await?;
    let table_metadata: TableMetadata = read_metadata_file(&entry.metadata_path).await?;
    Ok(table_metadata.snapshot_log)
}

/// Resolve a target timestamp to the snapshot ID that was current at
/// that moment. The convention: snapshot N is current from
/// `timestamp_ms[N]` to `timestamp_ms[N+1]`. A target equal to a
/// commit timestamp resolves to the new snapshot (not the prior one).
pub fn resolve_timestamp(
    history: &[SnapshotHistoryEntry],
    target_ms: i64,
) -> Result<&SnapshotHistoryEntry> {
    // The history is in commit order (ascending timestamp). Find the
    // last entry with timestamp <= target.
    let mut found: Option<&SnapshotHistoryEntry> = None;
    for entry in history {
        if entry.timestamp_ms <= target_ms {
            found = Some(entry);
        } else {
            break;
        }
    }
    found.ok_or_else(|| anyhow::anyhow!(
        "no snapshot at or before {target_ms} (oldest is {})",
        history.first().map(|e| e.timestamp_ms).unwrap_or(0),
    ))
}
The pattern. The snapshot history is small (one entry per commit, dozens of bytes each) and bounded by the retention window. A typical table with hourly commits and a 30-day retention has 720 entries; reading the history is one small object-store GET. The linear search is fast enough at this scale; a sorted index would be unnecessary engineering.
Reading at a Snapshot
The actual time-travel read is the Module 2/3 read path applied to a pinned snapshot. The function below brings together the pieces.
use anyhow::Result;
use arrow::array::RecordBatch;
use arrow::datatypes::SchemaRef;
pub enum SchemaMode {
/// Project against the table's current schema. New columns appear
/// as nulls; dropped columns are filtered out.
Current,
/// Project against the schema that was current at the pinned
/// snapshot's commit time.
SnapshotTime,
}
/// Plan and execute a time-travel read at the given snapshot ID with
/// the specified schema-projection mode. Returns a stream of record
/// batches with the appropriate schema.
pub async fn read_at_snapshot(
catalog: &PostgresCatalog,
table: &str,
snapshot_id: SnapshotId,
predicate: &Predicate,
schema_mode: SchemaMode,
) -> Result<RecordBatchStream> {
let pinned = pin_by_id(catalog, table, snapshot_id).await?;
let snapshot: Snapshot = read_metadata_file(&pinned.snapshot_metadata_path).await?;
// Look up the schema to project against.
let table_metadata = read_table_metadata(catalog, table).await?;
let schema: SchemaRef = match schema_mode {
SchemaMode::Current => table_metadata.current_schema(),
SchemaMode::SnapshotTime => table_metadata.schema_by_id(snapshot.schema_id)?,
};
// Plan the scan with the snapshot's partition spec (which may
// differ from the current spec — Lesson 3 develops this).
let plan = plan_scan_against_snapshot(&snapshot, predicate).await?;
// Execute the plan, projecting each Parquet read against the
// chosen schema. The Parquet reader handles column-ID-based
// mapping (Lesson 3) to project files written with different
// schemas against the chosen schema.
let stream = execute_plan_with_schema(plan, schema).await?;
Ok(stream)
}
The structure. The read protocol is the same as for the current snapshot; the only changes are which snapshot is pinned and which schema is used for projection. The data file reads are unchanged. This is what makes time travel cheap to implement once the snapshot-isolation foundation is in place — it is the same read path with a different pin.
Computing a Snapshot Diff for CDF
The change-data-feed primitive: given two snapshot IDs, return the added and removed files.
use anyhow::Result;
pub struct SnapshotDiff {
pub added_files: Vec<DataFile>,
pub removed_files: Vec<DataFile>,
}
/// Compute the file-level diff between two snapshots. The diff is the
/// set of files added in any snapshot after old, up to and including
/// new (exclusive of old, inclusive of new), and the set of files
/// removed in the same range. The implementation walks the snapshot
/// chain from old+1 to new, accumulating Added and Deleted manifest
/// entries.
pub async fn snapshot_diff(
catalog: &PostgresCatalog,
table: &str,
old_snapshot_id: SnapshotId,
new_snapshot_id: SnapshotId,
) -> Result<SnapshotDiff> {
let history = read_snapshot_history(catalog, table).await?;
// Find the range of snapshots strictly after old, up to and
// including new.
let old_idx = history.iter().position(|h| h.snapshot_id == old_snapshot_id)
.ok_or_else(|| anyhow::anyhow!("old snapshot not in history"))?;
let new_idx = history.iter().position(|h| h.snapshot_id == new_snapshot_id)
.ok_or_else(|| anyhow::anyhow!("new snapshot not in history"))?;
if new_idx <= old_idx {
return Err(anyhow::anyhow!("new must be strictly after old"));
}
let range = &history[(old_idx + 1)..=new_idx];
// Accumulate file-level changes across the range. Each snapshot's
// manifests list one entry per file with status Added/Existing/Deleted;
// we collect the Added and Deleted ones across the range.
let mut added: Vec<DataFile> = Vec::new();
let mut removed: Vec<DataFile> = Vec::new();
for entry in range {
let snap: Snapshot = read_metadata_file(&entry.metadata_path).await?;
let manifest_list = read_manifest_list(&snap.manifest_list_path).await?;
for ml_entry in manifest_list.manifests {
// Only read manifests that have any added or deleted entries —
// existing-only manifests have nothing to contribute.
if ml_entry.added_data_files == 0 && ml_entry.deleted_data_files == 0 {
continue;
}
let manifest = read_manifest(&ml_entry.manifest_path).await?;
for me in manifest.entries {
match me.status {
EntryStatus::Added => added.push(me.data_file),
EntryStatus::Deleted => removed.push(me.data_file),
EntryStatus::Existing => {}
}
}
}
}
Ok(SnapshotDiff { added_files: added, removed_files: removed })
}
The cost. The function reads the snapshot history once, then O(range) snapshot metadata files and manifest lists and O(changed_manifests) manifest files; the data files themselves are not read at all for the diff computation. A typical CDF poll comparing two consecutive snapshots reads one snapshot file, its manifest list, and the one new manifest: single-digit GETs, well under a second. The CDF consumer reads the data files only after the diff has identified the relevant ones, so the data-file reads are bounded by the actual change volume.
The manifest_list.manifests filter that skips manifests with no Added/Deleted entries is an important optimization. Most commits in a long-running table touch only a small fraction of the manifests; skipping the rest is what keeps CDF cheap relative to the table's total size. The skipping is safe because a manifest that has no Added or Deleted entries contributes nothing to the diff.
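One consequence of accumulating across a multi-snapshot range is that a file added in one snapshot and removed in a later one (for example, by compaction) appears in both lists. Whether that is desirable depends on the consumer. If only the net change is wanted, a post-processing step along the following lines could cancel such pairs. This is a sketch under assumptions: files are identified by a `path` string, and `FileRef` and `net_diff` are hypothetical names that do not appear in the code above.

```rust
use std::collections::HashSet;

/// Hypothetical minimal file handle; the real DataFile carries more.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FileRef {
    pub path: String,
}

/// Cancel files that appear in both lists.
/// Returns (net_added, net_removed), preserving input order.
pub fn net_diff(
    added: Vec<FileRef>,
    removed: Vec<FileRef>,
) -> (Vec<FileRef>, Vec<FileRef>) {
    let added_paths: HashSet<String> =
        added.iter().map(|f| f.path.clone()).collect();
    let removed_paths: HashSet<String> =
        removed.iter().map(|f| f.path.clone()).collect();

    let net_added = added
        .into_iter()
        .filter(|f| !removed_paths.contains(&f.path))
        .collect();
    let net_removed = removed
        .into_iter()
        .filter(|f| !added_paths.contains(&f.path))
        .collect();
    (net_added, net_removed)
}
```

A consumer that needs every intermediate change, rather than the net effect, would skip this step and use the raw lists.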
Key Takeaways
- Time travel is the pin protocol applied to a non-current snapshot. The mechanism is unchanged from Lesson 1; the only new piece is resolving snapshot IDs (by-ID or by-timestamp) via the snapshot history metadata.
- Two schema-projection modes for time-travel reads: current-schema (back-fill nulls for columns added since; omit columns dropped since) and snapshot-time-schema (the table as it actually looked then). The Mission Replay Engine defaults to snapshot-time for the accident-investigation workload.
- Change Data Feed reads the difference between two snapshots by accumulating Added/Deleted manifest entries across the snapshot chain. The cost is proportional to the changes between the snapshots, not the table size — this is what makes CDF cheap enough to drive downstream pipelines without batch overhead.
- The retention window bounds time travel. Snapshots older than the window have been physically removed and cannot be replayed; the operational team sets the window based on the longest-supported replay duration plus a margin.
- Replay queries are deterministic and reproducible because the snapshot is immutable. Two analysts running the same time-travel query against the same snapshot ID get identical results; this is what makes replay-based investigations scientifically valid.
Lesson 3 — Schema and Partition Evolution
Module: Data Lakes — M04: Time Travel and Schema Evolution Position: Lesson 3 of 3 Source: Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 5 ("Encoding and Evolution"), especially "Schema Evolution" and "Forward and Backward Compatibility". Apache Iceberg specification, "Schema Evolution" and "Partition Evolution" sections.
Source note: The schema-evolution principles are well-grounded in DDIA. The format-specific shape (column IDs, per-snapshot schema IDs, partition spec IDs) is from the Iceberg spec; verification of the field-mapping rules against the current spec recommended.
Context
Long-lived tables outlive the schemas they started with. New columns get added as instrumentation grows; old columns get retired as features change; sometimes a column's type needs widening or its meaning needs renaming. In a row-store database these operations are routine — ALTER TABLE ADD COLUMN, DROP COLUMN, RENAME COLUMN — but in a lakehouse the discipline is different. The data files are immutable; you cannot rewrite a million Parquet files just because a column changed. The schema-evolution mechanism must work against the existing data, projecting old data files onto the new schema without rewriting them.
The mechanism Iceberg uses is column IDs. Every column in the table has a unique numeric ID, recorded in the schema and stamped into every data file's Parquet schema. Column names are display labels in the metadata; what identifies a column physically is its ID. Adding a column allocates a new ID; renaming a column changes the display label but preserves the ID; dropping a column marks the ID as retired. The data files are never touched. Readers project files against the current schema by matching column IDs; columns missing from a file (because they were added after the file was written) appear as nulls; columns present in the file but absent from the current schema (because they were dropped) are filtered out.
Partition specs evolve under the same discipline, with a separate partition spec ID that links each snapshot to the spec under which it was written. A table can change its partition spec without rewriting historical data; new commits use the new spec, old commits remain under their original spec, and the planner handles both during reads.
This lesson develops both evolution mechanisms. The capstone's Mission Replay Engine exercises schema evolution heavily — replays against snapshots from before a column was added must project against the snapshot-time schema, with the column absent.
Core Concepts
Column IDs: The Decoupling Principle
DDIA (Ch. 5, "Schema Evolution") draws the right framing for any schema-evolving system: physical storage must reference fields by something stable, not by display name. Protocol Buffers uses field tags; Avro uses position-plus-name lookup with explicit aliases. Iceberg uses column IDs.
The mechanism. The table's schema records, for each column, a numeric id, a name, a type, and a nullability flag. When the writer commits a data file, the Parquet schema in the file's footer records the column IDs as Parquet field metadata (typically via the field_id annotation). The file's column chunks are laid out in the order of the schema it was written with; readers locate a column by the field ID recorded in the footer, not by its name or position.
When the reader projects a data file against a different schema (a later schema after a rename, an earlier schema after time travel, the current schema after a column drop), the projection uses column IDs:
- A column in the projection's schema with ID K is mapped to the data file's column-with-ID-K, regardless of the column's name in either schema.
- A column in the projection's schema with no matching ID in the data file (e.g., added later) is filled with nulls.
- A column in the data file with no matching ID in the projection's schema (e.g., dropped) is ignored.
The result is that renaming a column is free at the data layer. The schema's display name changes; the column ID doesn't. Old data files continue to be read correctly because the column ID still matches. No data rewrite, no migration, no compatibility window — the change is a metadata commit.
Safe Schema Changes
The schema-evolution rules fall into three categories: trivially safe, safe with care, and structurally impossible.
Trivially safe — pure metadata changes that do not affect any existing data file's interpretation:
- Add a nullable column. Allocates a new column ID. Existing data files don't have it; reads back-fill nulls. New writes include it. The column's nullability is required because existing rows have no value for it.
- Drop a column. Marks the ID as retired in the schema. Existing data files retain the bytes; the reader filters them out. The bytes are reclaimed only by Module 6's compaction, which rewrites files under the new schema. A dropped column's ID is never reused — the retirement is permanent, so a future "re-add the same column" allocates a new ID rather than restoring the old one.
- Rename a column. Changes the display name; preserves the ID. Trivial.
- Reorder columns. Changes the display order in the schema; the data file's physical layout is unchanged. Trivial.
Safe with care — changes that affect interpretation but can be handled correctly with explicit rules:
- Widen a column's type. int32 → int64, float32 → float64, decimal(8, 2) → decimal(10, 2). The new type can represent every value the old type could, so existing data files (which contain the old type's bytes) can be read back and converted at projection time. The Iceberg spec enumerates the allowed widenings; narrowings (the reverse) are not safe and require explicit migration.
- Make a non-nullable column nullable. Adds null to the value set; existing data has no nulls but the schema now permits them. Safe.
- Add a struct field to a complex type. Same logic as adding a top-level nullable column.
Structurally impossible without data rewrite:
- Narrow a type. int64 → int32 requires checking every value in existing data; values that overflow the narrower type have no defined behavior. The Iceberg spec rejects type narrowings; if the application needs this, it must rewrite the data under a new schema.
- Change a column's type to an incompatible type. string → int32 is impossible without parsing every value; some values may not parse. Rejected.
- Make a nullable column non-nullable. Requires asserting no nulls exist; existing files may contain nulls that the new schema forbids. Rejected unless the application explicitly validates and is prepared to rewrite.
The Iceberg spec's tables of allowed and disallowed type changes are the canonical reference. The discipline the writer must enforce is: commit a schema change only if the change is in the allowed-without-rewrite set, or run an explicit migration that produces new data under the new schema before swapping it in.
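The writer-side check for type changes can be sketched as a small predicate. This is an illustration only: the `DataType` enum below is a hypothetical subset defined for the example, and the function covers just the widenings named above; the Iceberg spec remains the reference for the full list.

```rust
/// Hypothetical subset of the logical types, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int32,
    Int64,
    Float32,
    Float64,
    Decimal { precision: u8, scale: u8 },
    Utf8,
}

/// True if a column of type `from` can be changed to `to` without
/// rewriting data files. Covers only the widenings named in the text.
pub fn is_safe_type_change(from: DataType, to: DataType) -> bool {
    use DataType::*;
    match (from, to) {
        // No change at all is trivially safe.
        _ if from == to => true,
        (Int32, Int64) | (Float32, Float64) => true,
        // Decimal may gain precision but must keep its scale.
        (
            Decimal { precision: p1, scale: s1 },
            Decimal { precision: p2, scale: s2 },
        ) => s1 == s2 && p2 > p1,
        // Narrowings and cross-family changes need a data rewrite.
        _ => false,
    }
}
```

A schema-change API would call a check like this before building the new schema, so a disallowed change fails before any metadata is written.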
Schema History and Per-Snapshot Schema IDs
A table's schema can change many times over its lifetime. The Iceberg metadata records every schema the table has had, in the schemas field of the table metadata. Each schema has a schema_id; the current schema is identified by current_schema_id. Every snapshot records which schema_id was current at its commit time.
The shape of this metadata. The table metadata file (the file the catalog points at, distinct from the snapshot files) has fields:
- schemas (Vec<Schema>): every schema the table has ever had.
- current_schema_id (u32): the active schema.
- snapshots: indirectly via the snapshot files; each snapshot records its schema_id.
Time-travel reads use the snapshot's recorded schema_id to find the schema to project against (snapshot-time mode) or use the current_schema_id (current mode). Both lookups are O(1) against the in-memory schemas vector. The vector grows by one entry per schema change; for tables with stable schemas it stays small. For aggressively-evolving tables it grows linearly with time — production deployments compact the schema history periodically (Module 6) by removing schemas that no live snapshot references.
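The schema-history cleanup can be sketched as a filter over the schemas vector. The names here (`SchemaRecord`, `prune_schemas`) are hypothetical, and the sketch assumes the caller has already collected the schema_id recorded by each snapshot still inside the retention window.

```rust
use std::collections::HashSet;

/// Hypothetical pared-down schema record; the field list is omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct SchemaRecord {
    pub schema_id: u32,
}

/// Keep only schemas that are current or referenced by a live snapshot.
/// `live_snapshot_schema_ids` holds the schema_id recorded by each
/// snapshot still inside the retention window (duplicates are fine).
pub fn prune_schemas(
    schemas: Vec<SchemaRecord>,
    current_schema_id: u32,
    live_snapshot_schema_ids: &[u32],
) -> Vec<SchemaRecord> {
    let mut keep: HashSet<u32> =
        live_snapshot_schema_ids.iter().copied().collect();
    // The current schema must survive even if no snapshot uses it yet.
    keep.insert(current_schema_id);
    schemas
        .into_iter()
        .filter(|s| keep.contains(&s.schema_id))
        .collect()
}
```

One design point worth noting: once entries are removed, a schema_id no longer equals its position in the vector, so lookups have to match on the ID (a short scan or a map) rather than index directly.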
Partition Spec Evolution
The partition spec evolves under the same per-snapshot pattern. Each snapshot records the partition_spec_id that was current at its commit time; the table metadata maintains a partition_specs vector recording every spec the table has had.
The structural reason this needs to be per-snapshot is the same as for schema: data files are immutable, so a file written under spec A cannot be rewritten under spec B without explicit migration. Reading a table that has changed partition specs requires the planner to apply the original spec when reading files written under that spec, and the current spec when reading files written under it.
The mechanic. Each manifest is tagged with the partition spec ID under which its data files were committed. Manifest entries' partition field is interpreted under that spec. A query against a table whose history includes spec changes plans against multiple specs simultaneously:
- For each manifest, identify its spec.
- Lift the source-column predicates through that spec's transforms (Module 3's lifting rules).
- Apply the lifted predicate to the manifest's partition summary.
- Continue into the manifest if the summary matches.
The cost is bounded by the number of distinct partition specs in the table's history — typically one or two over a table's lifetime. The Artemis cold archive started with partition by (mission_id) for the first six months of its life, switched to partition by (mission_id, day(ts)) when daily query volume justified the finer granularity, and has been on the current spec for two years. The planner handles both specs cleanly; the spec-1-history reads use the coarse partitioning and the spec-2 reads use the fine partitioning. The two coexist in the same query plan without conflict.
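The per-manifest steps above can be sketched for the two Artemis specs. Everything in this block is a simplified assumption made for illustration: the `Spec` enum models only those two specs, the predicate is fixed to "mission_id = X and ts within a range", and the summary holds plain min/max bounds. The real lifting rules are the ones from Module 3.

```rust
use std::collections::HashMap;

pub const MS_PER_DAY: i64 = 86_400_000;

/// Hypothetical model of the two specs described in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Spec {
    /// partition by (mission_id)
    ByMission,
    /// partition by (mission_id, day(ts))
    ByMissionAndDay,
}

/// Source-column predicate: mission_id = X AND ts in [ts_min_ms, ts_max_ms].
pub struct Predicate {
    pub mission_id: u32,
    pub ts_min_ms: i64,
    pub ts_max_ms: i64,
}

/// Partition-value bounds recorded for one manifest.
pub struct ManifestSummary {
    pub spec_id: u32,
    pub mission_min: u32,
    pub mission_max: u32,
    /// Day-ordinal bounds; present only under ByMissionAndDay.
    pub day_range: Option<(i64, i64)>,
}

/// Decide whether a manifest may contain matching rows, interpreting
/// its summary under the spec it was written with.
pub fn manifest_may_match(
    specs: &HashMap<u32, Spec>,
    m: &ManifestSummary,
    p: &Predicate,
) -> bool {
    let mission_ok =
        m.mission_min <= p.mission_id && p.mission_id <= m.mission_max;
    match specs.get(&m.spec_id) {
        // Coarse spec: ts has no partition field, so only mission_id prunes.
        Some(Spec::ByMission) => mission_ok,
        // Fine spec: lift the ts range through day() and test overlap.
        Some(Spec::ByMissionAndDay) => {
            let lo = p.ts_min_ms.div_euclid(MS_PER_DAY);
            let hi = p.ts_max_ms.div_euclid(MS_PER_DAY);
            let day_ok = match m.day_range {
                Some((dmin, dmax)) => dmin <= hi && lo <= dmax,
                None => true, // no bounds recorded: cannot prune
            };
            mission_ok && day_ok
        }
        // Unknown spec: be conservative and read the manifest.
        None => true,
    }
}
```

The same predicate prunes less under the coarse spec than under the fine one, which is the expected outcome: correctness is identical, only the amount of skipped work differs.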
Migrating Data Under a New Spec
Partition spec evolution does not automatically rewrite old data — but eventually the operator may want to. The migration discipline is:
- Commit the new spec. The new partition_specs entry is added to the table metadata; default_spec_id is updated to point at it. From this point, new commits use the new spec.
- Run a compaction job that rewrites old data. The Module 6 compaction reads files written under old specs, repartitions them under the new spec, and writes them as new files. Each compaction commit removes the old files and adds the new ones (an overwrite commit; Module 2 covered the protocol).
- Eventually retire old specs. When no live snapshot references files written under the old spec, the spec can be removed from the table metadata's partition_specs vector. This is cosmetic (the spec being present costs nothing operationally) but keeps the metadata tidy.
The migration is a long-running background job. For the Artemis archive, the spec-1-to-spec-2 migration ran over six weeks, processing roughly 1% of the table per day to avoid impacting the analyst workload. During the migration, queries returned correct results against both specs; users were unaware of the in-progress migration. This is the same property that makes schema evolution safe: the table's logical contract is unchanged during migration, and clients see consistent results throughout.
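The throttling idea behind a slow background migration can be sketched as a batch selector that picks files still under an old spec up to a byte budget per run. The `FileInfo` type, the function name, and the byte-budget policy are assumptions made for this example; the actual Module 6 compaction scheduler may work differently.

```rust
/// Hypothetical view of a data file for migration planning.
#[derive(Debug, Clone, PartialEq)]
pub struct FileInfo {
    pub path: String,
    pub spec_id: u32,
    pub size_bytes: u64,
}

/// Pick the next batch of files still written under a non-default spec,
/// stopping before the batch exceeds `budget_bytes`. Always admits at
/// least one file so an oversized file cannot stall the migration.
pub fn next_migration_batch(
    files: &[FileInfo],
    default_spec_id: u32,
    budget_bytes: u64,
) -> Vec<&FileInfo> {
    let mut batch = Vec::new();
    let mut used: u64 = 0;
    for f in files.iter().filter(|f| f.spec_id != default_spec_id) {
        if !batch.is_empty() && used.saturating_add(f.size_bytes) > budget_bytes {
            break;
        }
        used = used.saturating_add(f.size_bytes);
        batch.push(f);
    }
    batch
}
```

An empty result means nothing is left under an old spec, which is the signal that the retire-old-specs step can be considered.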
Operational Discipline: Schema Changes Are Commits
A subtle but important property: a schema change is a commit, with the same CAS-and-retry semantics as a data commit (Module 2 Lesson 3). The change is recorded in a new snapshot; the snapshot is added via the optimistic-CAS protocol; conflicts are handled by retry. There is no separate "DDL transaction" or "schema lock" — schema changes are first-class commits.
This has two practical consequences. First, schema changes are observable in the snapshot history: every change has a timestamp, a commit author (if tracked), and a snapshot ID, so the audit trail is automatic. Second, schema changes can be rolled back by committing a new snapshot that reverts the change; the rollback is just another schema change. There is no "ALTER TABLE UNDO" primitive; the rollback is a forward commit that happens to undo a prior one. Both properties make schema evolution easier to reason about than in traditional databases; the lakehouse's transactional model applies uniformly.
The Artemis archive's schema-change discipline runs every change through a pull request that includes the new schema, the data-validation tests, and a rollback plan. The schema is committed to the table by the merge-to-main CI job; the change appears in the snapshot history with the commit hash recorded in the snapshot's metadata. Six months later, an investigator can identify exactly when a column was added and trace it to the originating PR.
Core Mechanics in Code
The Schema Type with Column IDs
The schema's shape, with column IDs as the physical-identity field:
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Schema {
pub schema_id: u32,
pub fields: Vec<Field>,
pub identifier_field_ids: Vec<u32>, // primary-key-like; for merge ops
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Field {
/// Stable column ID. The physical identity of this column.
/// Allocated at column-creation time; never reused after drop.
pub field_id: u32,
/// Display name. Can change via rename without affecting data.
pub name: String,
/// The column's logical type.
pub data_type: DataType,
/// Whether the column is required (non-nullable). `false` means the
/// column allows nulls.
pub required: bool,
}
Projecting a Data File Against a Different Schema
The reader's per-file projection: given the schema written into the data file and the schema we want to project against, produce a per-column mapping.
use std::collections::HashMap;
use anyhow::Result;
#[derive(Debug)]
pub enum ProjectedColumn {
/// The file has this column under a known internal name; read it
/// from there and present under the target schema's name.
FromFile { file_column_name: String, target_field: Field },
/// The file doesn't have this column (it was added after the file
/// was written); fill with nulls of the target type.
Null { target_field: Field },
}
pub fn build_projection(
file_schema: &Schema,
target_schema: &Schema,
) -> Vec<ProjectedColumn> {
// Index the file's columns by field_id for O(1) lookup.
let file_by_id: HashMap<u32, &Field> = file_schema
.fields
.iter()
.map(|f| (f.field_id, f))
.collect();
target_schema
.fields
.iter()
.map(|target| {
match file_by_id.get(&target.field_id) {
Some(file_field) => ProjectedColumn::FromFile {
file_column_name: file_field.name.clone(),
target_field: target.clone(),
},
None => ProjectedColumn::Null {
target_field: target.clone(),
},
}
})
.collect()
}
The pattern. The projection is keyed on field_id, never on name. Given a file written when the column was named voltage and a target schema in which the same column has been renamed to voltage_v, the result is a FromFile mapping as long as the field_id matches; the rename is transparent. A target column whose field_id is not in the file produces a Null mapping; the file simply doesn't have that column. The result drives the Parquet reader's column selection and the post-read column rename/null-fill.
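A worked example helps show all three cases (rename, drop, add) in one projection. To keep it self-contained, the block below uses pared-down stand-ins (`Col`, `Mapping`, `project`) rather than the `Field`, `ProjectedColumn`, and `build_projection` definitions above; the column names other than voltage and voltage_v are invented for the example.

```rust
use std::collections::HashMap;

/// Pared-down field: ID plus display name (type and nullability omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct Col {
    pub field_id: u32,
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub enum Mapping {
    /// Read `file_name` from the file, present it as `target_name`.
    FromFile { file_name: String, target_name: String },
    /// Not in the file: null-fill under `target_name`.
    Null { target_name: String },
}

/// Same ID-keyed logic as build_projection, on the simplified types.
pub fn project(file: &[Col], target: &[Col]) -> Vec<Mapping> {
    let by_id: HashMap<u32, &Col> =
        file.iter().map(|c| (c.field_id, c)).collect();
    target
        .iter()
        .map(|t| match by_id.get(&t.field_id) {
            Some(f) => Mapping::FromFile {
                file_name: f.name.clone(),
                target_name: t.name.clone(),
            },
            None => Mapping::Null { target_name: t.name.clone() },
        })
        .collect()
}

fn col(field_id: u32, name: &str) -> Col {
    Col { field_id, name: name.to_string() }
}

/// File written with (1: ts, 2: voltage, 3: legacy_flag). The target
/// schema has renamed column 2, dropped column 3, and added column 4.
pub fn example() -> Vec<Mapping> {
    let file = [col(1, "ts"), col(2, "voltage"), col(3, "legacy_flag")];
    let target = [col(1, "ts"), col(2, "voltage_v"), col(4, "panel_temp_c")];
    project(&file, &target)
}
```

The result has three entries: ts read as-is, voltage read and presented as voltage_v, and panel_temp_c null-filled. The dropped legacy_flag column never appears because nothing in the target schema asks for its ID.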
Committing a Schema Change
A schema-change commit produces a new schema, adds it to the table's schemas list, and commits a new snapshot that references the new schema.
use anyhow::Result;
/// Add a new nullable column to the table. The function:
/// 1. Reads the current table metadata.
/// 2. Builds a new schema with the column appended.
/// 3. Commits a new snapshot referencing the new schema.
/// The data files are untouched; this is a pure metadata change.
pub async fn add_nullable_column(
catalog: &PostgresCatalog,
table: &str,
column_name: String,
data_type: DataType,
) -> Result<SnapshotId> {
// Standard retry loop (Module 2 Lesson 3).
for _ in 0..16 {
let entry = catalog.get_current(table).await?;
let mut metadata = read_table_metadata_at(&entry.metadata_path).await?;
// Allocate a new column ID. The convention is "last_column_id + 1";
// the table metadata tracks this monotonically.
let new_field_id = metadata.last_column_id + 1;
let mut new_schema = metadata.current_schema().clone();
new_schema.schema_id = metadata.last_schema_id + 1;
new_schema.fields.push(Field {
field_id: new_field_id,
name: column_name.clone(),
data_type: data_type.clone(),
required: false, // nullable; existing rows have no value
});
// Update the table metadata: add the new schema, advance the
// current_schema_id, advance the column counter, leave the
// current snapshot (data state) alone.
metadata.schemas.push(new_schema.clone());
metadata.current_schema_id = new_schema.schema_id;
metadata.last_column_id = new_field_id;
metadata.last_schema_id = new_schema.schema_id;
// Commit the metadata change as a new snapshot. The snapshot's
// file set is identical to the parent's; only the schema_id
// differs. (Iceberg supports this as a "metadata-only" snapshot.)
let new_snapshot = build_metadata_only_snapshot(
&metadata,
new_schema.schema_id,
"add_column",
)?;
let new_metadata_path = write_table_metadata(&metadata, &new_snapshot).await?;
// CAS the catalog as in any commit.
match catalog.compare_and_swap(
table,
entry.current_snapshot_id,
new_snapshot.snapshot_id,
&new_metadata_path,
).await {
Ok(()) => return Ok(new_snapshot.snapshot_id),
Err(CommitError::Conflict(_)) => continue,
Err(e) => return Err(e.into()),
}
}
Err(anyhow::anyhow!("schema change failed after retries"))
}
What to notice. The schema change is a snapshot with no data changes: the file set is identical to the parent. The CAS protocol from Module 2 is unchanged; schema changes are commits like any other. The new field_id is monotonically allocated from the table metadata; it is never reused even if columns are dropped. The validation discipline (is the change in the allowed-without-rewrite set?) is not shown here but is the writer's responsibility before calling the function. Adding a non-nullable column, for instance, should error out at the API surface rather than committing a schema that existing data files cannot satisfy.
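The validation step that the function leaves out could look something like the following. This is a sketch under assumptions: `SchemaChangeError` and `validate_add_column` are hypothetical names, the duplicate-name check is an extra rule added for the example, and a real validator would take the `Schema` itself rather than a slice of names.

```rust
/// Hypothetical error type for rejected schema changes.
#[derive(Debug, PartialEq)]
pub enum SchemaChangeError {
    /// A required column cannot be added: existing rows have no value.
    RequiredColumn { name: String },
    /// A column with this display name already exists.
    DuplicateName { name: String },
}

/// Check an add-column request against the current schema's column
/// names before any metadata is written.
pub fn validate_add_column(
    existing_names: &[&str],
    new_name: &str,
    required: bool,
) -> Result<(), SchemaChangeError> {
    if required {
        return Err(SchemaChangeError::RequiredColumn {
            name: new_name.to_string(),
        });
    }
    if existing_names.iter().any(|n| *n == new_name) {
        return Err(SchemaChangeError::DuplicateName {
            name: new_name.to_string(),
        });
    }
    Ok(())
}
```

Running the check before entering the retry loop keeps rejected requests from ever touching the catalog.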
Key Takeaways
- Column IDs are the physical-identity field; column names are display labels. Renaming a column is free at the data layer because the ID is unchanged; old data files continue to be read correctly via ID matching.
- Adding a nullable column, dropping a column, renaming a column, and widening a type are safe schema changes — pure metadata operations with no data rewrite. Type narrowings, nullability tightenings, and incompatible type changes are structurally impossible without explicit migration.
- The schema history is recorded in the table metadata; each snapshot records the schema_id that was current at its commit. Time-travel reads project against the snapshot-time schema by default; the current-schema option back-fills nulls and drops removed columns.
- Partition specs evolve under the same per-snapshot pattern. A table's history can include multiple specs; the planner applies each manifest's spec independently during reads. Migration to a new spec is a long-running background job (Module 6 compaction) that rewrites old data into the new spec over time.
- Schema and partition changes are commits, not separate DDL operations. They go through the same CAS-and-retry protocol; they appear in the snapshot history; they are observable and rollback-able by the standard commit machinery.
Capstone — Mission Replay Engine
Module: Data Lakes — M04: Time Travel and Schema Evolution Estimated effort: 1–2 weeks of focused work Prerequisite: All three lessons in this module completed; all three quizzes passed (≥ 70%). The Module 2 (table format) and Module 3 (partition + clustering) capstones are the substrate.
Mission Briefing
From: SDA Investigations Lead
ARCHIVE BRIEFING — RC-2026-04-DL-004
SUBJECT: Mission Replay Engine — read-only query service for
reconstructing operational state at past timestamps.
PRIORITY: P1 — required for Q3 accident-review process.
When something goes wrong on orbit, the investigation team needs to see the table as the operators saw it then — not as we see it now. The current archive lets us pin a snapshot, but we don't have a unified entry point that takes a timestamp, resolves to a snapshot, projects against the right schema, and returns the data. That's the engine you're building.
The replay engine is read-only. It does not commit; it does not modify; it does not affect the cold archive's data path. It is a service that accepts a (table, target_timestamp_ms, predicate) query and returns Arrow record batches. The full-mission replay (one mission of data over six months) must complete in under ten minutes against the live cold archive on the standard worker pool.
The engine is the user-facing tool for the accident-investigation workflow. Get it right and the investigators have what they need; get it wrong and they fall back to manual file-grepping against the data lake, which they have been doing for six years and have promised to stop.
What You're Building
A Rust crate, artemis-mission-replay, exposing:
- A ReplayEngine struct constructed from a catalog reference and a table name, holding a ReplaySession per active query.
- A replay_at(timestamp_ms, predicate) -> RecordBatchStream method that resolves the timestamp to a snapshot, projects against the snapshot-time schema, applies the predicate, and streams batches.
- A replay_at_snapshot(snapshot_id, predicate) -> RecordBatchStream method for snapshot-ID-addressed replays.
- A replay_diff(t_old, t_new) -> ChangeStream method that emits the (added, removed) rows between the two times.
- A CLI binary, artemis-replay, with subcommands query (point-in-time read), diff (CDF between two times), and inspect (snapshot history, table metadata).
- Observability: every replay records the pinned snapshot ID, the schema ID used, the planning summary (manifests/files/row groups scanned vs total), and the elapsed time. The structured logs are consumed by the existing Artemis observability stack.
The engine handles schema changes that occurred between the replay target and the present: a query for data from six months ago, when the table had three fewer columns, returns the three-column-fewer schema (snapshot-time mode) rather than back-filling nulls (current mode).
Functional Requirements
- Timestamp-to-snapshot resolution. Implements the resolution rule from Lesson 2: the most recent snapshot whose commit_timestamp_ms <= target_ms. Resolution against a timestamp outside the retention window returns a clear "snapshot expired" error.
- Snapshot-time schema projection. The default mode reads files projected against the schema that was current at the pinned snapshot's commit time. The mode = current parameter switches to projection against the current schema.
- Per-snapshot partition spec. The planner uses the partition spec recorded in the pinned snapshot, not the table's current spec. Reads against pre-spec-change snapshots use the old spec; reads against post-spec-change snapshots use the new.
- Predicate lifting per snapshot's spec. The predicate is lifted through the pinned snapshot's partition spec (Module 3 Lesson 3's lifting rules). Different spec → different lifting → potentially different pruning result.
- Streaming reads. The result is a stream of RecordBatches, not a materialized vector. Long replays must not blow up memory; the consumer pulls batches as it processes them.
- Schema-aware projection. A data file written before a column was added is read with the missing column null-filled; a file written before a column was renamed is read with the rename transparently applied via field-ID matching.
- Change Data Feed. replay_diff(t_old, t_new) returns two streams (added, removed) of RecordBatches, computed by walking the snapshot chain between the two times and collecting Added/Deleted manifest entries.
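The first requirement's error path can be sketched with a structured error type. The names below (`ReplayError`, `resolve_or_expired`) and the tuple-based history are assumptions made to keep the example dependency-free; the sketch also treats any target earlier than the oldest retained snapshot as expired, which cannot distinguish "expired" from "before the table existed".

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum ReplayError {
    /// Target predates the oldest retained snapshot.
    SnapshotExpired { target_ms: i64, oldest_available_ms: i64 },
    /// The table has no snapshots at all.
    EmptyHistory,
}

impl fmt::Display for ReplayError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReplayError::SnapshotExpired { target_ms, oldest_available_ms } => write!(
                f,
                "snapshot expired: target {} ms predates oldest available snapshot at {} ms",
                target_ms, oldest_available_ms
            ),
            ReplayError::EmptyHistory => write!(f, "table has no snapshots"),
        }
    }
}

impl std::error::Error for ReplayError {}

/// `history` holds (snapshot_id, commit_timestamp_ms) pairs, ascending
/// by timestamp. Returns the ID of the snapshot current at `target_ms`.
pub fn resolve_or_expired(
    history: &[(u64, i64)],
    target_ms: i64,
) -> Result<u64, ReplayError> {
    let &(_, oldest_ms) = history.first().ok_or(ReplayError::EmptyHistory)?;
    if target_ms < oldest_ms {
        return Err(ReplayError::SnapshotExpired {
            target_ms,
            oldest_available_ms: oldest_ms,
        });
    }
    // The check above guarantees that at least one entry qualifies.
    history
        .iter()
        .rev()
        .find(|e| e.1 <= target_ms)
        .map(|e| e.0)
        .ok_or(ReplayError::EmptyHistory)
}
```

Carrying the oldest available timestamp in the error lets the CLI tell the investigator exactly how far back the live archive can go before they escalate.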
Acceptance Criteria
Verifiable (automated tests must demonstrate these)
- A replay at target_ms = T against a table with snapshots committed at T-5min, T-1min, T+1min pins the snapshot committed at T-1min.
- A replay at a target_ms older than the retention window returns a structured SnapshotExpired error containing the oldest available snapshot's timestamp.
- A replay at T against a table whose schema changed at T+1day (a column added after the replay target) produces a record batch stream with the schema-at-T (no new column). The same replay with mode = current produces a stream with the new column null-filled.
- A replay at T against a table whose partition spec changed at T+1day uses the partition spec that was current at T for predicate lifting. The pruning result is deterministically reproducible.
- A replay of a column-renamed table (voltage → panel_voltage at T+1day) reads the column correctly under either projection mode: at T with mode = snapshot_time the column appears as voltage; with mode = current it appears as panel_voltage with the same values (field-ID matching makes the rename transparent).
- A full-mission replay over six months of data (~2 TB compressed) completes in under 10 minutes on the standard worker pool, measured against the integration test bench.
- replay_diff(t_old, t_new) against two snapshots that differ by one append commit returns the appended files' rows in the added stream and an empty removed stream.
- A long-running replay's memory consumption (resident set, measured via jemalloc_ctl::stats::resident) stays under 2 GB regardless of the result size, demonstrating streaming behavior.
Self-assessed (you write a short justification; reviewer checks it)
- (self-assessed) The retention-window error handling is documented in docs/expired-snapshot.md. The doc describes what the engine returns when the target is expired, how the investigator escalates to the long-retention archive tier, and why the engine does not auto-fall-through.
- (self-assessed) The projection-mode choice is documented in docs/projection-modes.md. The doc justifies snapshot-time as the default for accident investigation and explains when current mode is the right choice.
- (self-assessed) The streaming-memory discipline is documented in docs/streaming-memory.md. The doc explains how the engine bounds memory regardless of result size and what the failure mode is if the consumer stalls.
- (self-assessed) The observability output is documented in docs/observability.md with the schema of the structured log lines and the metric names the engine emits.
Architecture Notes
A reasonable module layout, building on the Module 2 and Module 3 crates:
artemis-mission-replay/
├── src/
│ ├── lib.rs # ReplayEngine, ReplaySession
│ ├── resolve.rs # timestamp -> snapshot resolution
│ ├── projection.rs # snapshot-time vs current schema projection
│ ├── stream.rs # streaming RecordBatch reader
│ ├── diff.rs # snapshot-diff CDF
│ └── bin/artemis_replay.rs
├── tests/
│ ├── resolve.rs
│ ├── projection.rs
│ ├── schema_evolution.rs
│ ├── partition_evolution.rs
│ ├── streaming_memory.rs
│ └── full_mission_bench.rs # ignored by default; the perf bench
└── docs/
    ├── expired-snapshot.md
    ├── projection-modes.md
    ├── streaming-memory.md
    └── observability.md
The engine is fundamentally a thin layer over the Module 2/3 read path. The new work is:
- Resolving timestamps to snapshots via the snapshot history.
- Applying the right schema to projections.
- Wrapping the result as a Stream<Item = Result<RecordBatch>> (an async_stream macro such as try_stream! is the cleanest pattern).
- Computing snapshot diffs via the manifest-entry-status walk.
The CLI is a small clap-driven binary; the heavy lifting is in the library.
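A minimal sketch of the timestamp-to-snapshot resolution step, assuming the snapshot history is available as a sorted slice of commit timestamps. `ResolveError` and this `resolve_timestamp` signature are hypothetical, and "older than the retention window" is modelled here simply as "older than the oldest retained snapshot".

```rust
/// Hypothetical error type; the capstone's real error enum may carry more.
#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    /// The target predates the oldest retained snapshot.
    SnapshotExpired { oldest_available_ms: i64 },
    /// The table has no snapshots at all.
    NoSnapshots,
}

/// Return the commit timestamp of the latest snapshot committed at or
/// before `target_ms`. `history_ms` must be sorted ascending.
pub fn resolve_timestamp(history_ms: &[i64], target_ms: i64) -> Result<i64, ResolveError> {
    let oldest = *history_ms.first().ok_or(ResolveError::NoSnapshots)?;
    if target_ms < oldest {
        return Err(ResolveError::SnapshotExpired {
            oldest_available_ms: oldest,
        });
    }
    // Count of snapshots committed at or before the target; at least 1 here.
    let idx = history_ms.partition_point(|&ts| ts <= target_ms);
    Ok(history_ms[idx - 1])
}
```

With snapshots at T-5min, T-1min, and T+1min, a target of T resolves to the T-1min snapshot, which is the behaviour the first acceptance criterion asks for.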
Hints
Hint 1 — Streaming with `async_stream`
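The async_stream crate lets you yield batches lazily from an async block. Because the body below uses ? on fallible calls, it needs the try_stream! macro rather than stream!: try_stream! wraps each yielded value in Ok and turns a ? failure into a final Err item that ends the stream. The structure:
use async_stream::try_stream;
use futures::Stream;

pub fn replay_at(
    &self,
    target_ms: i64,
    predicate: Predicate,
) -> impl Stream<Item = Result<RecordBatch>> + '_ {
    // The returned stream borrows `self`, hence the `+ '_` bound.
    try_stream! {
        let pinned = self.resolve_timestamp(target_ms).await?;
        let plan = self.plan_against(&pinned, &predicate).await?;
        for file_scan in plan.files {
            // Files are opened one at a time, only when the consumer polls.
            for batch in self.read_file(&file_scan).await? {
                // try_stream! wraps the yielded value in Ok.
                yield batch?;
            }
        }
    }
}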
The consumer pulls batches as it processes them; the engine reads files as needed. Memory stays bounded by the current batch size.
Hint 2 — Field-ID-based projection in practice
The Parquet reader's ProjectionMask projects by Parquet column index, but the mapping you need is by field ID, which lives in the Parquet schema itself (the field_id annotation Iceberg writes on each column). The parquet crate exposes it through the schema types: SchemaDescriptor::column(i) returns the column descriptor, and the column's basic type info (self_type().get_basic_info()) reports has_id() and id(). The pattern: walk the target schema, for each field look up the column in the Parquet file whose field_id matches, and build a ProjectionMask from those indices. The mask only selects columns; the reader returns them in the file's order, so reorder (and rename) the output columns to match the target schema afterwards.
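The matching step itself is independent of the Parquet API and can be sketched with plain data. This is an illustration under assumptions: `Field` and `resolve_projection` are hypothetical names, and schemas are reduced to flat lists of (field_id, name) pairs.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema field: an Iceberg field ID and a name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub field_id: i32,
    pub name: String,
}

/// For each target field, the index of the file column carrying the same
/// field_id, or None when the file predates the column.
pub fn resolve_projection(target: &[Field], file: &[Field]) -> Vec<Option<usize>> {
    let by_id: HashMap<i32, usize> = file
        .iter()
        .enumerate()
        .map(|(i, f)| (f.field_id, i))
        .collect();
    target
        .iter()
        .map(|f| by_id.get(&f.field_id).copied())
        .collect()
}
```

Names never enter the lookup, which is why a rename (voltage to panel_voltage under the same field ID) resolves to the same file column.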
Hint 3 — Schema-time projection's null-fill
Columns in the target schema with no matching field_id in the data file need to be null-filled. The cleanest way: after the Parquet reader produces a batch with only the present columns, post-process the batch to add null columns for the missing target fields. arrow::array::new_null_array(data_type, batch.num_rows()) produces a null array of the right type and length. Insert these into the batch at the correct schema positions.
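A minimal sketch of the null-fill step using plain vectors in place of Arrow arrays. `Column`, `null_fill`, and the `mapping` argument are hypothetical; `mapping[i]` is assumed to hold the file-column index backing target field i, or None when the file has no such column.

```rust
/// A toy nullable column: one Option per row. Stands in for an Arrow array.
pub type Column = Vec<Option<f64>>;

/// Assemble output columns in target-schema order, substituting an
/// all-null column of the right length wherever the file has no match.
pub fn null_fill(
    file_columns: &[Column],
    mapping: &[Option<usize>],
    num_rows: usize,
) -> Vec<Column> {
    mapping
        .iter()
        .map(|slot| match slot {
            Some(idx) => file_columns[*idx].clone(),
            None => vec![None; num_rows],
        })
        .collect()
}
```

In the real engine the None branch is where arrow::array::new_null_array would be called with the target field's data type.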
Hint 4 — The full-mission benchmark
The 10-minute SLA for a full-mission replay (~2 TB) requires the engine to parallelize file reads. A futures::stream::FuturesUnordered with a concurrency cap of ~64 (configurable) issues that many concurrent S3 GETs at a time. Each GET is a file read; the file's batches are decoded and yielded into the result stream. The order of batches in the output may not match the order in the metadata, so the consumer must not assume ordering. The Mission Replay Engine documents this; the analyst tooling that consumes it does its own ordering when needed (typically by sample_timestamp_ns).
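The capped, unordered fan-out can be illustrated without an async runtime. The sketch below swaps in a different technique from the one the hint names: a fixed pool of std threads pulling from a shared queue instead of FuturesUnordered. `read_all` and its `read_file` callback are hypothetical; the callback stands in for an S3 GET plus Parquet decode and returns a row count.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Read `files` with at most `max_in_flight` reads running at once and
/// return (path, rows) pairs in completion order, not input order.
pub fn read_all<F>(
    files: Vec<String>,
    max_in_flight: usize,
    read_file: F,
) -> Vec<(String, usize)>
where
    F: Fn(&str) -> usize + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(files.into_iter()));
    let read_file = Arc::new(read_file);
    let (tx, rx) = mpsc::channel();
    let mut workers = Vec::new();
    for _ in 0..max_in_flight.max(1) {
        let queue = Arc::clone(&queue);
        let read_file = Arc::clone(&read_file);
        let tx = tx.clone();
        workers.push(thread::spawn(move || loop {
            // Hold the lock only long enough to take the next file.
            let next = queue.lock().unwrap().next();
            match next {
                Some(path) => {
                    let rows = (*read_file)(&path);
                    tx.send((path, rows)).unwrap();
                }
                None => break,
            }
        }));
    }
    // Drop the original sender so the receiver ends when the workers do.
    drop(tx);
    let results: Vec<(String, usize)> = rx.iter().collect();
    for w in workers {
        w.join().unwrap();
    }
    results
}
```

The worker count plays the role of the concurrency cap. As with the async version, results arrive in whatever order reads complete, so a caller that needs ordering has to sort.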
Hint 5 — CDF and the schema-change subtlety
A snapshot diff that spans a schema change has an interesting edge case: the Added files use the new schema; the Removed files use the old schema. The consumer of the CDF may see "added rows under schema B, removed rows under schema A." The engine's CDF API documents this and projects both streams against a user-specified target schema (snapshot-time-of-old, snapshot-time-of-new, or current), with the same field-ID-matching machinery the time-travel read uses. The tests should exercise the case where a column was added between the two diff endpoints.
References
- Designing Data-Intensive Applications (Kleppmann & Riccomini), Chapter 5 — "Encoding and Evolution"; Chapter 7 — "Snapshot Isolation"
- Apache Iceberg specification — "Scan Planning" and "Schema Evolution"
- async_stream crate documentation
- parquet crate's field-ID/Iceberg-compatibility documentation
When You're Done
The crate is "done" when all eight verifiable acceptance criteria pass in CI, the four self-assessed docs are written, and the full-mission replay benchmark hits the 10-minute SLA. The Module 5 capstone will plug a query engine on top of the read path; your ReplayEngine is the substrate that engine consumes for time-travel queries.
Module 5: Query Engine Integration
Lesson 1 — Predicate Pushdown and the Pruning Pyramid
Module: Data Lakes - M05: Query Engine Integration
Position: Lesson 1 of 3
Source: Designing Data-Intensive Applications, Martin Kleppmann and Chris Riccomini, Chapter 4 ("Storage and Retrieval: Column-Oriented Storage"). Database Internals, Alex Petrov, Chapter 9 ("Query Processing"). Apache DataFusion documentation, "TableProvider" interface.
Source note: The pushdown protocol's structure is well-established across query engines (DataFusion, Spark, Trino, DuckDB). The lesson grounds the framing in DDIA's column-store treatment; the specific API surface details are DataFusion-flavored but the contract is universal.
Context
Module 3 developed pruning at the storage layer: given a predicate, the table format returns the minimal set of files (and row groups inside them) to read. But the predicate comes from somewhere — specifically, from the query engine that parsed the analyst's SQL. The contract that lets the engine hand the predicate down to the storage layer is predicate pushdown. This lesson develops that contract end to end.
The right framing is the one DDIA (Ch. 4) uses for column-store optimization: "If only relevant rows can be loaded from disk, the query is much faster." Predicate pushdown is the discipline that makes "only relevant rows" achievable across the engine-storage boundary. The engine knows the query's predicate; the storage knows the per-file statistics; pushdown lets them collaborate. The engine pushes the predicate down; the storage returns a pruned plan; the engine reads against the pruned plan and applies any residual filtering on the rows the storage returns.
The contract has subtle asymmetries. Not every SQL predicate is pushable (function calls on columns, predicates involving multiple tables, predicates over computed columns). Pushable predicates may be pushed partially (the engine pushes the part the storage can evaluate; the engine evaluates the rest). The storage may not prune as aggressively as the predicate would allow (conservative pruning is the right default; Module 3 Lesson 3 covered this). All of these complications come from the same correctness asymmetry: false-positives are acceptable; false-negatives are silent data loss. The pushdown contract is written to honor the asymmetry at every layer.
This lesson develops the contract's mechanics: what the engine hands down, what the storage returns, how partial pushdown works, and how the residual predicate is reapplied after the storage layer has done its part. The capstone implements a DataFusion TableProvider that does exactly this for the Artemis archive.
Core Concepts
The Pushdown Contract
The interface between the engine and the storage layer for predicate pushdown is a single method, in spirit:
push_predicates(engine_predicates: Vec<Expr>) → PushdownResult {
pushed: Vec<Expr>, // storage handled these completely
inexact: Vec<Expr>, // storage handled approximately; reapply
unhandled: Vec<Expr>, // storage cannot evaluate; engine must
}
The three-way classification is essential. Pushed predicates are evaluated exactly by the storage; the engine does not reapply them. Inexact predicates are evaluated approximately — the storage prunes some rows but the survivors may not all match; the engine must reapply the predicate. Unhandled predicates the storage couldn't evaluate at all; every row must be presented to the engine for evaluation.
The discipline this classification enforces:
- Conservative pruning is honest. When the storage layer's pruning rule is approximate (a manifest's partition summary admits the predicate but the manifest may also contain non-matching rows), the storage returns the predicate in the inexact bucket; the engine knows to reapply.
- The engine retains responsibility for correctness. The engine never assumes the storage's pruning was exact; it consults the classification and reapplies as needed.
The result is that pushdown is a pure optimization: the engine's answer is correct regardless of how much pushdown happened.
DataFusion's TableProvider::supports_filters_pushdown method returns this classification per predicate as TableProviderFilterPushDown::Exact, Inexact, or Unsupported. The naming is DataFusion-specific; the three-way structure is universal.
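How the three buckets translate into "what the storage sees" and "what the engine reapplies" can be sketched with predicates reduced to strings. This is an illustration only: `split_filters` is a hypothetical helper, and the `Pushdown` enum simply mirrors the three-way classification described above.

```rust
/// Mirrors the three-way classification; not DataFusion's own enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pushdown {
    Exact,
    Inexact,
    Unsupported,
}

/// Given per-predicate classifications, return
/// (predicates handed to storage, predicates the engine reapplies).
pub fn split_filters(filters: &[(&str, Pushdown)]) -> (Vec<String>, Vec<String>) {
    let mut to_storage = Vec::new();
    let mut residual = Vec::new();
    for (pred, class) in filters {
        match class {
            // Storage evaluates it completely; no residual needed.
            Pushdown::Exact => to_storage.push(pred.to_string()),
            // Storage prunes with it, and the engine reapplies it.
            Pushdown::Inexact => {
                to_storage.push(pred.to_string());
                residual.push(pred.to_string());
            }
            // Storage never sees it; the engine evaluates it alone.
            Pushdown::Unsupported => residual.push(pred.to_string()),
        }
    }
    (to_storage, residual)
}
```

An inexact predicate is the only one that lands in both lists, which is the whole point of the middle bucket.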
Which Predicates Are Pushable
Not every predicate the engine sees can be pushed to the storage layer. The pushable subset is determined by what the storage knows how to evaluate. Module 3 Lesson 3 developed the leaf-predicate algebra for column statistics; pushdown is the same algebra applied at the engine-storage boundary.
Pushable, often as exact:
- column op constant where op is =, <, <=, >, >=, != and the column is statistics-bearing.
- column IN (constant_list).
- column IS NULL, column IS NOT NULL.
- Conjunctions and disjunctions of pushable predicates.
Pushable, typically as inexact (the storage pruning is approximate; engine reapplies):
- Predicates on partition columns under transforms that don't admit tight lifting (Module 3's Bucket + range case).
- Predicates on columns where statistics exist but are coarse (every file's min and max span the predicate's value range, so pruning happens at most at the manifest-list level).
Not pushable:
- Function calls on columns (UPPER(name) = 'APOLLO-7'). The storage's statistics are on the column's values, not on UPPER(value); without an index over the function output, the storage can't prune.
- Predicates involving multiple tables (a.mission_id = b.mission_id). Pushdown is per-table; cross-table predicates are evaluated by the engine.
- Predicates on derived columns or aggregates (SUM(voltage) > 1000).
- Predicates with non-trivial logic the storage doesn't implement (regex, JSON path expressions, geospatial predicates without spatial index support).
The capstone's TableProvider classifies incoming predicates by walking the predicate tree, identifying the leaves, and matching them against the table format's capabilities. The Module 2-3 capstones support exact pushdown for =, <=, >=, IS NULL, IS NOT NULL, and IN against statistics-bearing columns; inexact pushdown for any predicate that includes a bucket-partitioned column with a non-equality op; unsupported pushdown for everything else. The classification is per-leaf; the engine handles the conjunction/disjunction wiring.
The Projection-Pushdown Companion
Predicate pushdown has a partner: projection pushdown. The engine tells the storage which columns it actually needs, and the storage reads only those columns. The mechanism is structural at the Parquet level — the column-store format reads columns independently, and skipping columns means skipping their bytes entirely. Module 1 Lesson 3 introduced this as the Parquet reader's ProjectionMask; this lesson is what makes it cross the engine-storage boundary.
Projection pushdown is unambiguously simpler than predicate pushdown: a list of column names, the storage selects them, the result has exactly those columns. There is no "inexact projection": either the column is in the result or it isn't. DataFusion's TableProvider::scan takes a projection: Option<&Vec<usize>> parameter that the engine sets to the indices of the columns the query actually uses (None means every column); the storage uses these indices to construct the Parquet ProjectionMask.
The two pushdowns compose orthogonally. A query SELECT mission_id, AVG(voltage) FROM t WHERE day = '2024-03-15' produces a scan call with projection = [mission_id, voltage] (the columns used in the projection and the aggregate) and predicates = [day = '2024-03-15'] (the WHERE clause). The storage prunes files by day, reads only mission_id and voltage from the surviving files, and returns the result. Without one of the pushdowns the query would either read every file (no predicate pushdown) or read every column from every file (no projection pushdown); with both, the read is minimal.
The Residual Filter Pattern
When the storage classifies a predicate as inexact, the engine must reapply it after reading. The residual filter is exactly the predicate the engine pushed, evaluated against each row in the returned batches. The mechanics are direct: the engine wraps the storage's RecordBatchStream in a filter operator that evaluates the residual predicates and drops rows that don't match.
The pattern is straightforward in DataFusion:
storage_stream → FilterExec(residual_predicates) → upstream_operators
The cost depends on how much the storage's pruning helped. A perfectly-pushable predicate (day = '2024-03-15') sees the storage's pruning eliminate most files, and the residual filter is unnecessary — the engine knows the storage was exact. An inexact predicate (payload_id > 'p-005' on a bucket-partitioned column) sees the storage return many candidate files; the residual filter eliminates the non-matching rows.
The residual-filter pattern's elegance is that it never produces wrong results. The storage's job is to narrow the search; the engine's filter is what makes the result exact. Even if the storage returns nothing useful (every predicate is Unsupported), the engine still produces the right answer — just at the cost of a full table scan plus engine-side filtering. The pushdown contract is a pure optimization; correctness is unaffected by how aggressive the pushdown is.
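The "pure optimization" claim can be checked on a toy table: scan once with file pruning and once without, apply the same residual filter both times, and compare. The sketch assumes a hypothetical `DataFile` with min/max statistics on an integer `payload_id` and the predicate `payload_id > threshold`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub payload_id: u32,
    pub voltage: f64,
}

/// A data file with min/max statistics on payload_id (hypothetical layout).
pub struct DataFile {
    pub min_id: u32,
    pub max_id: u32,
    pub rows: Vec<Row>,
}

/// Scan for `payload_id > threshold`. With `prune` set, files whose max
/// cannot satisfy the predicate are skipped. The residual filter always runs.
/// Returns (files actually read, matching rows).
pub fn scan(files: &[DataFile], threshold: u32, prune: bool) -> (usize, Vec<Row>) {
    let mut files_read = 0;
    let mut out = Vec::new();
    for f in files {
        if prune && f.max_id <= threshold {
            continue; // no row in this file can match
        }
        files_read += 1;
        // Residual filter: reapply the predicate row by row.
        out.extend(f.rows.iter().filter(|r| r.payload_id > threshold).cloned());
    }
    (files_read, out)
}
```

Pruning changes how many files are read and nothing else; the rows returned are identical either way.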
Pushdown of Logical Operators
Predicates with AND and OR compose under pushdown. The rules:
Conjunction (A AND B):
- If both A and B are pushable exact, the conjunction is exact.
- If one is exact and one is inexact, the conjunction is inexact (the engine must reapply the inexact part).
- If one is unsupported, the conjunction is pushed with the unsupported branch dropped: the engine pushes A AND B as A (the supported part) and applies the full A AND B as residual. This is the conservative-pruning principle at the conjunction level.
Disjunction (A OR B):
- Disjunction pushes only if every branch is pushable. If any branch is unsupported, the disjunction is unsupported (the engine can't drop a branch from a disjunction; the dropped branch's potentially-matching rows would be lost).
- If every branch is exact, the disjunction is exact.
- If any branch is inexact, the disjunction is inexact.
The conjunction-vs-disjunction asymmetry is the key insight. A conjunction can drop branches the storage doesn't support: dropping a branch is the same as replacing it with TRUE, which can only widen the candidate set, and the engine reapplies the dropped branch as residual. A disjunction cannot drop branches: dropping a branch is the same as replacing it with FALSE, which removes rows that match only that branch, and the engine cannot reapply a predicate to rows that were never returned. The pushdown logic encodes this directly.
The capstone's predicate-classification function applies these rules during the tree walk. The result is a per-tree classification that DataFusion consumes; DataFusion's FilterExec wraps the storage stream with the residual predicates the storage didn't handle exactly.
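The asymmetry is easiest to see in a function that extracts the part of a predicate that is safe to hand to storage. This is a sketch over a toy tree, not DataFusion's Expr: `Pred` and `pushable` are hypothetical, and each leaf carries a flag saying whether the storage can evaluate it.

```rust
/// A toy predicate tree. `supported` is a stand-in for a real
/// capability check against the table format.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Pred {
    Leaf { name: &'static str, supported: bool },
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// The part of `p` that is safe to hand to storage, or None if nothing is.
/// The returned predicate admits a superset of the rows `p` admits.
pub fn pushable(p: &Pred) -> Option<Pred> {
    match p {
        Pred::Leaf { supported, .. } => supported.then(|| p.clone()),
        // AND: dropping a branch only widens the candidate set.
        Pred::And(l, r) => match (pushable(l), pushable(r)) {
            (Some(a), Some(b)) => Some(Pred::And(Box::new(a), Box::new(b))),
            (Some(a), None) | (None, Some(a)) => Some(a),
            (None, None) => None,
        },
        // OR: dropping a branch would lose that branch's rows.
        Pred::Or(l, r) => match (pushable(l), pushable(r)) {
            (Some(a), Some(b)) => Some(Pred::Or(Box::new(a), Box::new(b))),
            _ => None,
        },
    }
}
```

Whenever the result differs from the input, the engine still has to apply the full original predicate as residual.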
Statistics Quality: The Hidden Variable
The pushdown's effectiveness depends on the quality of the storage's statistics. Three failure modes are worth knowing.
Coarse statistics. A file whose payload_id ranges from 'p-001' to 'p-200' cannot be pruned by a predicate payload_id = 'p-042'; the file might match. If every file in the table has the full range, no pruning happens at the file level. The clustering work from Module 3 Lesson 2 is what tightens statistics — clustered files have narrow ranges per cluster column, so predicates on cluster columns prune effectively.
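The effect of range width on pruning can be shown with a few lines of min/max arithmetic. `ColumnStats`, `may_contain`, and `files_to_read` are hypothetical names for this sketch, and string comparison stands in for the column's real ordering.

```rust
/// Per-file min/max statistics for one string column (hypothetical shape).
pub struct ColumnStats {
    pub min: String,
    pub max: String,
}

/// Conservative test for `column = value`: returns false only when the
/// file provably contains no matching row.
pub fn may_contain(stats: &ColumnStats, value: &str) -> bool {
    stats.min.as_str() <= value && value <= stats.max.as_str()
}

/// Number of files that survive pruning for `column = value`.
pub fn files_to_read(files: &[ColumnStats], value: &str) -> usize {
    files.iter().filter(|s| may_contain(s, value)).count()
}
```

Four unclustered files that each span 'p-001' to 'p-200' all survive a lookup for 'p-042'; four clustered files covering 50 IDs each leave exactly one to read.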
Missing statistics. Module 1's writer can be configured to omit statistics for very wide columns (long strings, complex nested types) to save metadata space. Predicates on those columns push to the storage but the storage can't evaluate them; classification is Unsupported, and the engine handles the full predicate. The fix is to enable statistics on the column, paying the metadata cost; this is a workload-tuning decision.
Stale statistics. The statistics in a manifest entry are the writer's view at commit time; they don't update if the data file changes. In Iceberg, data files are immutable, so this doesn't happen: every commit produces new files with fresh statistics. In systems that apply updates without rewriting the base file (Hudi's merge-on-read tables are one example), base-file statistics can lag the table's logical contents, and the reader has to prune more conservatively to stay correct. The lakehouse formats covered in this track all use immutable data files, so stale statistics are not a failure mode.
The Artemis observability stack tracks per-query pruning effectiveness as a leading indicator of statistics quality. A query whose predicate is theoretically selective but produces low pruning ratios is a signal that the statistics are coarser than the predicate needs — typically because the table needs better clustering, or because the predicate column is missing from the table's clustering spec.
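One plausible definition of the pruning ratio, as a sketch: the fraction of candidate files the planner skipped. The function name and the choice of files (rather than bytes or row groups) as the unit are assumptions; the actual metric the observability stack emits may be defined differently.

```rust
/// Fraction of candidate files the planner skipped, in [0.0, 1.0].
/// Returns 0.0 when there were no candidate files.
pub fn pruning_ratio(files_total: u64, files_read: u64) -> f64 {
    if files_total == 0 {
        return 0.0;
    }
    1.0 - (files_read as f64 / files_total as f64)
}
```

A selective predicate that reads one file out of four scores 0.75; the same predicate reading all four scores 0.0, which is the signal worth alerting on.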
Code Examples
A Predicate-Classification Walk
The capstone's classifier: walk a predicate tree, return the three-way classification for each node, compose the results.
use datafusion::logical_expr::{BinaryExpr, Expr, Operator};
#[derive(Debug, Clone, Copy)]
pub enum Pushdown {
Exact,
Inexact,
Unsupported,
}
/// Classify a predicate against the table format's capabilities.
/// Returns the strongest level the storage can handle; the engine
/// reapplies whatever the storage didn't handle exactly.
pub fn classify(expr: &Expr, caps: &TableCapabilities) -> Pushdown {
match expr {
        // Conjunction: Exact only if both branches are Exact. If one
        // branch is Unsupported, the storage prunes on the other branch
        // alone, so the conjunction as a whole is Inexact and the engine
        // reapplies all of it. Unsupported only if both branches are.
        // The AND and OR arms must come before the leaf arm, which
        // would otherwise match every BinaryExpr.
        Expr::BinaryExpr(b) if b.op == Operator::And => {
            let l = classify(&b.left, caps);
            let r = classify(&b.right, caps);
            match (l, r) {
                (Pushdown::Exact, Pushdown::Exact) => Pushdown::Exact,
                (Pushdown::Unsupported, Pushdown::Unsupported) => Pushdown::Unsupported,
                _ => Pushdown::Inexact,
            }
        }
        // Disjunction: every branch must be at least supported.
        // If any branch is Unsupported, the disjunction is Unsupported.
        Expr::BinaryExpr(b) if b.op == Operator::Or => {
            let l = classify(&b.left, caps);
            let r = classify(&b.right, caps);
            match (l, r) {
                (Pushdown::Unsupported, _) | (_, Pushdown::Unsupported) => Pushdown::Unsupported,
                (Pushdown::Exact, Pushdown::Exact) => Pushdown::Exact,
                _ => Pushdown::Inexact,
            }
        }
        // Leaf: column op constant. The storage's leaf rule (Module 3 L3)
        // determines whether the result is Exact, Inexact, or Unsupported.
        Expr::BinaryExpr(b) => classify_binary(b, caps),
// Negation: classified by the inner expression's classification.
// Note: NOT over Inexact remains Inexact (the storage's pruning
// doesn't necessarily map cleanly through negation).
Expr::Not(inner) => classify(inner, caps),
// Function calls on columns: typically unsupported at the storage
// layer unless a function-specific index exists.
Expr::ScalarFunction(_) => Pushdown::Unsupported,
// Anything else: be conservative.
_ => Pushdown::Unsupported,
}
}
fn classify_binary(b: &BinaryExpr, caps: &TableCapabilities) -> Pushdown {
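    // Identify the column reference and require a literal on the other
    // side. Column-to-column comparisons and nested expressions are
    // unsupported at the leaf level.
    let (col, op, _val) = match (&*b.left, &*b.right) {
        (Expr::Column(c), Expr::Literal(..)) => (c, b.op, &b.right),
        // Constant on the left: flip the operator so the column leads.
        (Expr::Literal(..), Expr::Column(c)) => (c, swap_op(b.op), &b.left),
        _ => return Pushdown::Unsupported,
    };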
if !caps.has_statistics(&col.name) {
return Pushdown::Unsupported;
}
if caps.is_partition_column(&col.name) {
// Partition column + transform: check whether the lifting rule
// for this (transform, op) is tight, conservative, or unsupported.
match caps.lifting_rule(&col.name, op) {
Some(LiftingRule::Tight) => Pushdown::Exact,
Some(LiftingRule::Conservative) => Pushdown::Inexact,
None => Pushdown::Unsupported,
}
} else {
// Non-partition column with statistics: the storage prunes files
// via per-file stats. Pruning is exact for the predicate's effect
// on file selection (the file is read or not), but residual rows
// in the read file may not match. Classification depends on
// whether the per-file stats are tight enough; we treat all
// non-partition pushdowns as Inexact to be safe.
Pushdown::Inexact
}
}
The structure. The classifier walks the predicate tree and applies the composition rules. Leaves classify based on the storage's lifting and statistics support; conjunctions and disjunctions combine per the rules in the previous section. The result tells the engine exactly which predicates can be dropped from the residual filter and which must be reapplied.
A TableProvider Skeleton
The DataFusion TableProvider shape, with the predicate-pushdown and projection-pushdown plumbing wired in:
use std::sync::Arc;
use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::catalog::TableProvider;
use datafusion::execution::SessionState;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};
use datafusion::physical_plan::ExecutionPlan;
use datafusion::error::Result;
pub struct ArtemisTableProvider {
table: ArtemisTable,
schema: SchemaRef,
}
#[async_trait]
impl TableProvider for ArtemisTableProvider {
fn as_any(&self) -> &dyn std::any::Any {
self
}
fn schema(&self) -> SchemaRef {
self.schema.clone()
}
fn table_type(&self) -> datafusion::logical_expr::TableType {
datafusion::logical_expr::TableType::Base
}
/// DataFusion calls this for each predicate; the return value
/// determines whether DataFusion includes the predicate in the
/// scan call and whether it adds a residual FilterExec.
fn supports_filters_pushdown(
&self,
filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
let caps = self.table.capabilities();
Ok(filters
.iter()
.map(|e| match classify(e, &caps) {
Pushdown::Exact => TableProviderFilterPushDown::Exact,
Pushdown::Inexact => TableProviderFilterPushDown::Inexact,
Pushdown::Unsupported => TableProviderFilterPushDown::Unsupported,
})
.collect())
}
/// DataFusion calls this to build the physical scan. The
/// `projection` and `filters` arguments are the ones DataFusion
/// decided to push down based on supports_filters_pushdown.
async fn scan(
&self,
_state: &dyn datafusion::catalog::Session,
projection: Option<&Vec<usize>>,
filters: &[Expr],
_limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>> {
// Translate DataFusion's Expr predicates to the table format's
// Predicate type, plan the scan against the table format, and
// wrap the result in a DataFusion ExecutionPlan that streams
// record batches.
let table_predicate = exprs_to_predicate(filters)?;
let plan = self.table.plan_scan(&table_predicate).await?;
let exec_plan = ArtemisExec::new(plan, projection.cloned(), self.schema.clone());
Ok(Arc::new(exec_plan))
}
}
What to notice. The TableProvider is two methods: classification (returns the per-predicate pushdown level) and scan (returns the physical execution plan). DataFusion handles the rest — the SQL parser, the logical planner, the optimizer, the projection-and-filter wiring on top of the scan, the parallel execution of the scan stream. The TableProvider is the integration shim; DataFusion is the engine.
The ArtemisExec is the physical plan that produces the actual RecordBatchStream. Module 5 Lesson 2 develops the streaming side; the relevant DataFusion trait is ExecutionPlan with its execute method returning a SendableRecordBatchStream.
Key Takeaways
- Predicate pushdown is a three-way contract: predicates are pushed exact, pushed inexact, or unsupported. The storage handles exact predicates completely; the engine reapplies inexact predicates as a residual filter; the engine handles unsupported predicates entirely.
- The classification per predicate depends on the storage's capabilities (which columns have statistics, which partition transforms admit tight lifting). The composition rules for AND (drop unsupported branches) and OR (any unsupported branch makes the whole disjunction unsupported) follow from the false-positive/false-negative asymmetry.
- Projection pushdown is the unambiguous partner. The engine tells the storage which columns it actually needs; the storage reads only those columns. Composition with predicate pushdown is orthogonal; both pushdowns together minimize the read.
- The residual filter pattern is what makes pushdown a pure optimization. The engine never assumes the storage's pruning was exact; the engine reapplies inexact predicates after reading. Correctness is preserved regardless of how much pushdown happens.
- Statistics quality determines pushdown effectiveness. Clustered data, partition-column predicates, and tight per-file ranges produce strong pruning; unclustered data, missing statistics, and inexact transforms produce weak pruning. Pushdown effectiveness is monitored as a leading indicator of statistics-quality drift.
Lesson 2 — Vectorized Reads via Arrow IPC
Module: Data Lakes — M05: Query Engine Integration
Position: Lesson 2 of 3
Source: In-Memory Analytics with Apache Arrow — Matthew Topol, Chapter 6 ("Arrow IPC and the Flight Protocol"). Apache Arrow specification, "IPC Format" section. Apache DataFusion RecordBatchStream documentation.
Source note: The Arrow IPC format is well-specified and the Topol book is the authoritative reference; this lesson grounds the framing there and applies it to the lakehouse-to-engine connection.
Context
Module 1 produced Arrow record batches as the in-memory shape for columnar data; Modules 2-4 produced a table format that returns Arrow batches as the result of read plans. Lesson 1 introduced the engine-storage boundary through the predicate-pushdown contract; this lesson is the data side of the same boundary. How do batches get from the storage layer to the engine? In the simplest case (storage and engine in the same process), they are passed by Arc. In the cross-process case (storage as a separate service, engine as a client), they cross the boundary via Arrow IPC.
The Arrow IPC format exists for exactly this: a binary wire format for record batches that preserves Arrow's in-memory layout. The decoder doesn't reconstruct rows or parse field-by-field — it maps the wire bytes directly into the buffer layout that the in-memory RecordBatch expects. The transport cost is memcpy; the decode cost is constant per batch. This is what makes the lakehouse architecture economical at any scale: the same Arrow buffers that came out of the Parquet decoder go straight onto the wire and into the engine's execution path, with no per-row materialization or per-column serialization step in between.
This lesson develops the IPC format's structure (schema message then batch messages), the streaming protocol (an open-ended sequence of batch messages with an EOS marker), and the protocol's properties that make it the right transport for the lakehouse. The capstone itself stays in-process: DataFusion holds the table provider's stream directly, with no wire format in between. The cross-process pattern is structurally identical and worth understanding.
Core Concepts
Why Arrow IPC, Not Some Other Format
A query engine consuming the table format's output has a choice of wire format. The lakehouse community has converged on Arrow IPC for reasons that come straight out of In-Memory Analytics with Apache Arrow (Topol, Ch. 6). The framing the book develops, and the relevant comparisons:
- Row-oriented formats (JSON, MessagePack, Protobuf-with-row-encoding) require row reconstruction on the receiver's side. Each row's values are decoded into per-row structures, then the receiver typically rebuilds a columnar layout for processing. This is wasted work; the lakehouse's output is already columnar, and the engine wants columnar input.
- Bespoke columnar formats (custom column-by-column protocols) require per-column type-specific decoders. Adding a new type means adding a new decoder. Versioning is fragile.
- Parquet over the wire uses the file format as the transport. The bytes have to be re-encoded for the engine's in-memory format (Parquet's encodings — RLE, dictionary, BYTE_STREAM_SPLIT — don't match Arrow's flat layout). Decode is per-column-chunk, not constant.
- Arrow IPC is the in-memory Arrow layout serialized as-is. The receiver reads the schema message, allocates the right number of buffers, and memcpys the wire bytes into them. The result is an Arrow RecordBatch ready to use; no per-row, per-column, or per-type work.
The cost basis matters at scale. A query reading 100 GB of data through a row-oriented format has to decode all 100 GB value by value on the CPU; through Arrow IPC it pays roughly the network/disk I/O cost plus a few hundred microseconds per batch in IPC overhead. The 10×-100× advantage compounds; lakehouse query engines have all converged on Arrow IPC or a close relative for the same reason.
The IPC Streaming Protocol
The IPC streaming protocol is a sequence of messages, each one a small header followed by an optional binary payload. The message types relevant to the read path are:
- Schema message. Sent once at the start of the stream. Describes the schema of every batch that will follow. Includes column names, types, nullability, and any nested-type metadata. After this message, every batch in the stream conforms to this schema.
- RecordBatch message. Sent repeatedly. Each message describes one record batch: the row count, the buffer offsets and lengths within the payload, and any dictionary references. The payload is the buffers themselves — null bitmaps, offset arrays, value arrays, all concatenated.
- DictionaryBatch message. Sent when dictionary-encoded columns appear. Each dictionary is sent once (the first time a column uses it); subsequent batches reference the dictionary by ID. This is the wire-format counterpart to Module 1's dictionary encoding.
- EndOfStream marker. Sent at the end. Tells the receiver no more batches are coming.
The protocol has one important property: the schema is sent once, batches are sent many times. The receiver allocates the type-specific decoder state once, then processes batches in a tight loop. The per-batch overhead is the message header (a few hundred bytes) plus the payload memcpy. For batches of typical size (8K-64K rows), the per-batch overhead is dominated by the payload, not the framing.
The streaming property is what makes the protocol work over network connections. The sender can produce batches as it reads data files; the receiver can process batches as they arrive; neither side needs to buffer the entire result set. This is the same property the RecordBatchStream type provides in-process — the protocol generalizes it across process boundaries.
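The ordering contract (one schema, then batches, then end-of-stream) can be modelled as a small state machine. This is a toy: `Message`, `StreamError`, and `consume` are hypothetical, the messages carry no buffers, dictionary batches are left out, and nothing here is the Arrow wire encoding. The sketch also treats a stream that ends without the marker as truncated, which is a simplification.

```rust
/// Toy messages mirroring the stream's sequence, not its encoding.
#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    Schema(Vec<String>),
    RecordBatch { num_rows: usize },
    EndOfStream,
}

#[derive(Debug, PartialEq)]
pub enum StreamError {
    MissingSchema,
    DuplicateSchema,
    MessageAfterEnd,
    Truncated,
}

/// Consume messages one at a time and return the total row count.
/// Only one message is held at any moment.
pub fn consume(messages: impl IntoIterator<Item = Message>) -> Result<usize, StreamError> {
    let mut schema_seen = false;
    let mut ended = false;
    let mut rows = 0;
    for msg in messages {
        if ended {
            return Err(StreamError::MessageAfterEnd);
        }
        match msg {
            Message::Schema(_) if schema_seen => return Err(StreamError::DuplicateSchema),
            Message::Schema(_) => schema_seen = true,
            Message::RecordBatch { .. } if !schema_seen => {
                return Err(StreamError::MissingSchema)
            }
            Message::RecordBatch { num_rows } => rows += num_rows,
            Message::EndOfStream => ended = true,
        }
    }
    if ended {
        Ok(rows)
    } else {
        Err(StreamError::Truncated)
    }
}
```

Because `consume` takes an iterator, the sender can be producing later batches while earlier ones are being counted, which is the streaming property in miniature.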
Schema Negotiation and Forward Compatibility
The schema message at the start of the stream is the receiver's contract about what types to expect. The schema is fully self-describing: column names, Arrow types (with parameters: timestamp units, decimal precision, list inner types, struct fields), and metadata (the key_value_metadata field for arbitrary annotations).
The Iceberg field-ID annotation lives in the schema metadata. When a table format produces an IPC stream of a time-traveled snapshot, the schema message includes the field IDs in the per-column metadata; the engine can use them for projection if it needs to map the stream's schema against a different target schema. This is the wire-format counterpart to Module 4 Lesson 3's column-ID-based projection — same mechanism, applied to the IPC layer.
The schema is sent before any batches, which gives the receiver a chance to reject incompatible streams. If the receiver requires a specific schema (the consumer is a specific query plan that knows what columns it needs), it can compare the incoming schema against its expectation and abort if they disagree. The discipline is the same as Module 2 Lesson 1's schema-enforcement check, applied at the wire layer.
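A minimal sketch of the check-before-consuming discipline, combined with field-ID-based resolution. It is std-only and does not use the arrow crate's Schema type: the Field struct, the string-typed data_type, and the resolve_projection function are assumptions made for illustration. The receiver matches columns by field ID (so renames are tolerated) and rejects the stream if a required field is missing or its type differs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Field {
    field_id: i32,     // Iceberg-style field ID from per-column metadata
    name: String,
    data_type: String, // type name as a string, a simplification
}

#[derive(Debug, PartialEq)]
enum SchemaMismatch {
    MissingField { field_id: i32 },
    TypeChanged { field_id: i32, expected: String, found: String },
}

fn field(field_id: i32, name: &str, data_type: &str) -> Field {
    Field { field_id, name: name.to_string(), data_type: data_type.to_string() }
}

/// For each expected field, find the incoming column index carrying the
/// same field ID. Names may differ; types must match exactly.
fn resolve_projection(
    expected: &[Field],
    incoming: &[Field],
) -> Result<Vec<usize>, SchemaMismatch> {
    let by_id: HashMap<i32, (usize, &Field)> = incoming
        .iter()
        .enumerate()
        .map(|(idx, f)| (f.field_id, (idx, f)))
        .collect();
    let mut projection = Vec::with_capacity(expected.len());
    for want in expected {
        let (idx, got) = by_id
            .get(&want.field_id)
            .ok_or(SchemaMismatch::MissingField { field_id: want.field_id })?;
        if got.data_type != want.data_type {
            return Err(SchemaMismatch::TypeChanged {
                field_id: want.field_id,
                expected: want.data_type.clone(),
                found: got.data_type.clone(),
            });
        }
        projection.push(*idx);
    }
    Ok(projection)
}
```

The returned index list is what a consumer would apply to every subsequent batch; because the schema arrives first, the whole check runs before a single batch is decoded.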
Bounded Memory Through Streaming
The streaming protocol bounds memory because batches arrive one at a time. The receiver's memory at any moment is one batch in flight (typically a few MB to a few tens of MB, depending on batch size and column count) plus whatever the downstream consumer holds onto. A consumer that processes each batch and immediately drops it (the typical filter-and-aggregate pattern) sees its memory stay constant regardless of how many batches the stream produces; a consumer that buffers batches (collecting all results into a vector) sees its memory grow linearly with the result size.
The Artemis Mission Replay Engine (Module 4's capstone) exploits this directly. A full-mission replay returns terabytes of result data; the engine's memory stays under 2 GB because each batch is processed and dropped immediately. The downstream consumer (an analyst's tooling) batches results into its own bounded queue with an explicit flow-control mechanism — if the analyst's tool is slow, the engine's stream backpressures and pauses producing new batches.
The DataFusion SendableRecordBatchStream type implements this exactly: a Stream<Item = Result<RecordBatch>> with the Send bound. The stream produces one batch at a time; the consumer pulls; the producer reads the next file's batches when the consumer is ready. The flow-control is implicit in the stream's poll-based semantics — the producer doesn't run until the consumer wants more data.
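The pull-driven behavior can be seen in a synchronous analog. The sketch below uses a plain Iterator in place of an async Stream, and a Vec<u64> in place of a RecordBatch; LazyBatches, its produced counter, and sum_streaming are hypothetical names introduced only for this illustration. The counter makes the laziness observable: nothing is produced until the consumer asks for it.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// A pull-based batch source. Each `next` call "reads" one batch;
/// nothing is produced until the consumer asks.
struct LazyBatches {
    remaining: usize,
    rows_per_batch: usize,
    produced: Rc<Cell<usize>>,
}

impl Iterator for LazyBatches {
    type Item = Vec<u64>;

    fn next(&mut self) -> Option<Vec<u64>> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        self.produced.set(self.produced.get() + 1);
        Some(vec![1; self.rows_per_batch])
    }
}

/// Process-and-drop consumer: holds at most one batch at a time, so
/// its memory does not grow with the number of batches.
fn sum_streaming(batches: impl Iterator<Item = Vec<u64>>) -> u64 {
    let mut total = 0;
    for batch in batches {
        total += batch.iter().sum::<u64>();
        // `batch` is dropped here, before the next one is produced.
    }
    total
}
```

A consumer that instead collected every batch into a Vec would hold all of them at once, which is the linear-growth case described above.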
The RecordBatchStream Trait
DataFusion's primary abstraction for batch streams is RecordBatchStream:
trait RecordBatchStream: Stream<Item = Result<RecordBatch>> {
    fn schema(&self) -> SchemaRef;
}
The trait extends futures::Stream with a schema accessor. The schema is the same one every batch the stream produces conforms to; the consumer reads it once and uses it for all downstream operations. This matches the IPC schema-once-then-batches pattern; the in-process and cross-process cases share the same abstraction.
A TableProvider::scan returns an Arc<dyn ExecutionPlan>; the ExecutionPlan::execute returns a SendableRecordBatchStream (the Send-bound variant for parallel execution). The data flow is:
ArtemisTable → ScanPlan → FileReader → RecordBatchStream → DataFusion FilterExec → ...
The stream is built up lazily. The scan call constructs the ScanPlan (using Module 3's three-pass pruning) but doesn't read any data. The execute call returns a stream that, when polled, reads the next file's batches. The consumer's pulling drives the storage's reading; the storage's reading drives the network/object-store I/O.
IPC at the Storage-Engine Boundary
The lesson has so far considered the streaming-protocol shape and the in-process consumption pattern. The cross-process case — storage as a service, engine as a client — uses the same protocol over a network connection. The standard transports:
- Arrow Flight is the gRPC-based protocol layered on Arrow IPC. The client issues a query (a FlightDescriptor); the server returns an IPC stream over gRPC. Used by datawarehouse-as-a-service offerings; the Artemis archive uses Flight for inter-region replay queries where the analyst tooling lives in a different region from the catalog.
- Direct IPC over TCP is the lightest-weight transport. No gRPC framing, no service abstraction, just the IPC stream sent over a socket. Used in tightly-coupled deployments where the engine and the storage share infrastructure.
- HTTP with IPC streaming is the cloud-native variant. The client GETs a URL; the server streams the IPC bytes as the response body. The HTTP/2 streaming primitives carry the flow control; Arrow IPC handles the data shape. Used by some data-warehouse REST APIs; the Artemis read service's external API uses this.
The choice of transport is operational, not architectural. The lakehouse's storage layer produces Arrow IPC; the transport wraps it. Changing transports doesn't change the storage code, the engine code, or the data flow shape. This is the same architectural property as the catalog backend choice (Lesson 3): the table format defines the contract; the implementations swap underneath.
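The "transport wraps the stream" property falls out of writing the producer against a byte-sink abstraction. The sketch below is std-only and uses a toy framing (a 4-byte little-endian length prefix, with a zero length as the end marker) as a stand-in for the real Arrow IPC encoding; write_frames and read_frames are hypothetical names. The same producer code works unchanged whether the sink is a Vec<u8>, a file, or a TCP socket.

```rust
use std::io::{self, Read, Write};

/// Write length-prefixed frames to any byte sink. The framing is a
/// stand-in for illustration, not the Arrow IPC encoding. Empty frames
/// are not representable because a zero length marks end-of-stream.
fn write_frames<W: Write>(mut sink: W, frames: &[Vec<u8>]) -> io::Result<()> {
    for frame in frames {
        sink.write_all(&(frame.len() as u32).to_le_bytes())?;
        sink.write_all(frame)?;
    }
    // Zero-length frame as the end-of-stream marker.
    sink.write_all(&0u32.to_le_bytes())?;
    sink.flush()
}

/// Read frames from any byte source until the end-of-stream marker.
fn read_frames<R: Read>(mut source: R) -> io::Result<Vec<Vec<u8>>> {
    let mut frames = Vec::new();
    loop {
        let mut len = [0u8; 4];
        source.read_exact(&mut len)?;
        let n = u32::from_le_bytes(len) as usize;
        if n == 0 {
            return Ok(frames);
        }
        let mut buf = vec![0u8; n];
        source.read_exact(&mut buf)?;
        frames.push(buf);
    }
}
```

Swapping the in-memory buffer for std::net::TcpStream or std::fs::File changes only the call site, since both implement Read and Write; the producer and consumer functions stay the same.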
Core Mechanics in Code
Writing an IPC Stream from a RecordBatchStream
The producer side: take a stream of record batches, write an IPC stream to a writer (a TCP socket, an HTTP response body, a file).
use anyhow::Result;
use arrow::datatypes::SchemaRef;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;
use futures::StreamExt;
/// Stream-encode a sequence of record batches as an Arrow IPC stream.
/// Writes one Schema message, then a RecordBatch message per batch,
/// then an EndOfStream marker. The writer is a generic `std::io::Write`
/// or a Tokio-style AsyncWrite wrapped via an adapter.
pub async fn write_ipc_stream<S>(
    mut stream: impl futures::Stream<Item = Result<RecordBatch>> + Unpin,
    schema: SchemaRef,
    writer: S,
) -> Result<()>
where
    S: std::io::Write,
{
    // The IPC StreamWriter handles all the framing: the schema header,
    // the per-batch headers, the dictionary tracking, and the
    // end-of-stream marker (emitted by the explicit finish call below).
    let mut ipc_writer = StreamWriter::try_new(writer, &schema)?;
    while let Some(batch_result) = stream.next().await {
        let batch = batch_result?;
        // The write call serializes the batch's buffers to the
        // wire format. The serialization is essentially a memcpy of the
        // Arrow buffers plus a small header; no per-row or per-column
        // work beyond the framing.
        ipc_writer.write(&batch)?;
    }
    // The finish call emits the EndOfStream marker and flushes the writer.
    ipc_writer.finish()?;
    Ok(())
}
The pattern. The StreamWriter from the arrow-ipc crate handles all the framing details — the schema flatbuffer, the per-batch message envelopes, the dictionary tracking, the end-of-stream marker. The user code is just a loop over the stream, calling write for each batch. The cost per batch is the framing (a few hundred bytes) plus the memcpy of the batch's buffers.
Reading an IPC Stream into a RecordBatchStream
The consumer side: read the IPC bytes from a reader, produce a stream of record batches.
use anyhow::Result;
use arrow::datatypes::SchemaRef;
use arrow::ipc::reader::StreamReader;
use arrow::record_batch::RecordBatch;
use std::io::Read;
/// Decode an Arrow IPC stream from a reader. Reads the schema header,
/// then yields one RecordBatch per batch message. The schema is
/// available via the StreamReader's accessor after construction.
pub fn read_ipc_stream<R: Read>(
    reader: R,
) -> Result<(SchemaRef, impl Iterator<Item = Result<RecordBatch>>)> {
    let stream_reader = StreamReader::try_new(reader, None)?;
    let schema = stream_reader.schema();
    // The StreamReader implements Iterator<Item = Result<RecordBatch>>,
    // yielding one batch per .next() call until EndOfStream.
    let iter = stream_reader.map(|r| r.map_err(Into::into));
    Ok((schema, iter))
}
What to notice. The reader returns the schema separately from the batch iterator. The schema is the contract for every batch the iterator yields; the consumer can validate it against its expectation before processing any batches. The iterator is a normal Iterator<Item = Result<RecordBatch>>; each .next() call reads one message from the underlying reader, decodes it, and returns the batch.
The synchronous Iterator shape works for the in-process and file-backed cases. The async cross-network case uses the tokio-flavored arrow-flight crate, which exposes an async Stream of batches; the shape is the same with async polling instead of synchronous iteration.
A RecordBatchStream from the Module 3 Read Plan
Wiring the Module 3 scan plan into the DataFusion RecordBatchStream shape. The function below is the heart of the capstone's ArtemisExec:
use std::sync::Arc;
use anyhow::Result;
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use async_stream::try_stream;
use futures::{Stream, StreamExt};
use object_store::ObjectStore;
// ScanPlan (Module 3) and open_parquet_with_projections are defined
// elsewhere and are not shown here.
/// Produce a stream of record batches from a scan plan. The stream
/// reads files in plan order, decoding each file's row groups into
/// batches and yielding them lazily. Memory stays bounded by the
/// current batch size; the consumer's pulling drives the I/O.
pub fn execute_plan(
    plan: ScanPlan,
    schema: SchemaRef,
    object_store: Arc<dyn ObjectStore>,
) -> impl Stream<Item = Result<RecordBatch>> {
    // try_stream! (rather than stream!) is what permits `?` in the body:
    // an error is yielded as the stream's final item.
    try_stream! {
        for file_scan in plan.files {
            // Open the file's Parquet reader with row-group projection
            // and column projection from the scan plan.
            let parquet_reader = open_parquet_with_projections(
                &object_store,
                &file_scan.path,
                &file_scan.row_groups,
                &schema,
            ).await?;
            // Yield each batch as it's decoded. The Parquet reader's
            // own internal buffering means batches arrive in row-group
            // order; the consumer can process and drop them
            // immediately.
            let mut batch_iter = parquet_reader.read_batches();
            while let Some(batch_result) = batch_iter.next().await {
                yield batch_result?;
            }
        }
        // The stream ends when all files have been read; the consumer's
        // .next() returns None.
    }
}
The structure. The stream reads one file at a time, one row group at a time, one batch at a time. The Parquet reader's row-group projection (from Module 3 Pass 3) controls which row groups are read; the column projection (from Module 1 Lesson 3's ProjectionMask) controls which columns are read; the scan plan's file list controls which files are read. Every layer's pruning composes; the I/O is minimal.
Production code parallelizes by opening multiple files concurrently. The futures::stream::FuturesUnordered pattern works directly: spawn one task per file, push the tasks onto an unordered stream with a concurrency cap, and yield batches as they arrive. The order of batches is no longer file-order, which is fine for query engines that don't depend on input order (DataFusion's scan operators don't); for operators that do, the engine adds an explicit sort.
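The capped, unordered fan-out can be sketched with threads and a bounded channel. This is a std-only analog of the async pattern, not the production implementation: read_file is a hypothetical stand-in for decoding one file, a "batch" is just a (file index, batch index) pair, and scan_unordered is an illustrative name. Workers pull file indices from a shared queue, so at most max_concurrency files are open at once, and the bounded channel makes workers wait when the consumer falls behind.

```rust
use std::collections::VecDeque;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for decoding one file: returns that file's batches.
fn read_file(file: usize, batches_per_file: usize) -> Vec<(usize, usize)> {
    (0..batches_per_file).map(|b| (file, b)).collect()
}

/// Read `file_count` files with at most `max_concurrency` workers.
/// Batches arrive in completion order, not file order.
fn scan_unordered(
    file_count: usize,
    batches_per_file: usize,
    max_concurrency: usize,
) -> Vec<(usize, usize)> {
    let queue = Arc::new(Mutex::new((0..file_count).collect::<VecDeque<_>>()));
    // Bounded channel: workers block when the consumer falls behind.
    let (tx, rx) = mpsc::sync_channel(max_concurrency);
    let mut workers = Vec::new();
    for _ in 0..max_concurrency.max(1) {
        let queue = Arc::clone(&queue);
        let tx = tx.clone();
        workers.push(thread::spawn(move || loop {
            // The lock is released at the end of this statement, so it
            // is never held while reading a file or sending a batch.
            let next = queue.lock().unwrap().pop_front();
            let Some(file) = next else { break };
            for batch in read_file(file, batches_per_file) {
                if tx.send(batch).is_err() {
                    return; // the consumer went away; stop early
                }
            }
        }));
    }
    drop(tx); // the channel closes once every worker has finished
    let received: Vec<_> = rx.iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }
    received
}
```

The result contains every batch exactly once, but interleaved across files; a caller that needs file order has to sort, which mirrors the explicit-sort point above.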
Key Takeaways
- Arrow IPC is the right wire format for engine-storage data transport because it preserves Arrow's in-memory layout. Decode is essentially a memcpy: no row reconstruction and no per-column type-specific work.
- The streaming protocol is schema-then-batches-then-EOS. The schema is sent once at the start; batches stream open-endedly; an EOS marker ends the stream. The receiver allocates decoder state once, then processes batches in a tight loop.
- Streaming bounds memory. A consumer that processes-and-drops each batch sees constant memory regardless of result size. The flow-control is implicit in the poll-based stream semantics — the producer pauses when the consumer isn't pulling.
- Schema metadata carries field IDs and other annotations for column-ID-based projection at the wire layer. The Iceberg field_id is preserved through IPC; the engine can use it to map streams against different target schemas.
- The choice of transport (Flight, raw TCP, HTTP-with-IPC) is operational, not architectural. The storage produces IPC; the transport wraps it. Changing transports doesn't change the storage or the engine.
Lesson 3 — Catalog Architecture
Module: Data Lakes — M05: Query Engine Integration Position: Lesson 3 of 3 Source: Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 9 ("Replication and Consistency") for the consistency framing. Apache Iceberg REST catalog specification. Project Nessie documentation. Hive Metastore protocol documentation.
Source note: Catalog implementations evolve rapidly; the four backends compared here are the well-known ones as of the curriculum's writing. Specific feature matrices should be verified against current vendor documentation.
Context
Module 2 introduced the catalog as the source of the per-table snapshot pointer. The catalog's job at the transaction layer is simple: linearizable compare-and-swap on a single piece of state per table. But the catalog has another job at the query-engine layer: table discovery. A query engine connecting to a lakehouse needs to know which tables exist, what their current snapshots are, and how to read their metadata. The catalog is the discovery layer.
The lakehouse community has not converged on a single catalog implementation; the spec layer is settled (catalogs implement a small CAS-and-list API) but the implementations vary widely in their operational properties. The four common backends — REST catalog, Hive Metastore, Project Nessie, Postgres-JDBC — have substantially different operational behaviors at scale, and the choice has long-running consequences for how the lakehouse runs.
This lesson develops the catalog's role from the query engine's perspective, then compares the four backends on the dimensions that matter operationally: consistency, branching, multi-region behavior, operational complexity, and the API surface the query engine sees. The capstone uses the Postgres catalog from Module 2 — appropriate for the Artemis archive's single-region, single-writer deployment — but the engineer should be able to defend the choice against the alternatives.
Core Concepts
The Catalog's Two Jobs
The catalog answers two questions for the query engine. Table discovery: which tables exist, and what are their current snapshots? Snapshot retrieval: given a table identity, what is its current snapshot's metadata path? The first is a list_tables(namespace) call; the second is get_current_snapshot(table). Both are read-only; the catalog's CAS responsibility is on the writer side and not directly exposed to query engines.
The discovery API is small. Iceberg's catalog spec defines:
- list_namespaces() -> Vec<Namespace>
- list_tables(namespace) -> Vec<TableIdentifier>
- load_table(table_id) -> TableMetadata
- register_table, create_table, update_table, drop_table: the writer-side mutation operations
The query engine uses the first three. list_namespaces for hierarchical organization (a namespace per mission, per team, per project); list_tables for the tables-in-namespace browse; load_table for the actual table-metadata read that drives the read planning.
The Iceberg REST catalog API standardizes this surface as HTTP endpoints. The other backends implement the same logical API through their native protocols (Hive Metastore over Thrift, Nessie over a Git-flavored REST API, JDBC over SQL). The query engine code that consumes the catalog is identical across backends; the transport differs.
REST Catalog: The HTTP-Native Choice
The Iceberg REST catalog is an HTTP API specification: a small set of REST endpoints that an implementation exposes. The spec is open; multiple implementations exist (Tabular, Apache Polaris (originally from Snowflake), in-house implementations). The properties:
- Stateless API. The REST endpoints are stateless; clients can connect to any instance of the catalog service. Horizontal scaling is straightforward.
- HTTP/2 with token auth. Authentication is standard OAuth2 or token-based; client libraries (PyIceberg, iceberg-rust) handle the auth handshake transparently.
- Multi-tenant by design. Namespaces are first-class; per-namespace access control is the standard pattern.
- Operational shape: a stateless service. The service itself is stateless; the backing store (typically a SQL database) holds the per-table state. Failover is conventional service-failover; consistency depends on the backing store.
The REST catalog's main advantage is its client-side simplicity. Any HTTP client can talk to it; the network protocol is standard; debugging is straightforward (a browser can hit the endpoints). For deployments that need to expose the lakehouse to multiple clients in multiple languages, REST is the lowest-friction choice.
The disadvantage is that the REST service is an extra deployment unit. The lakehouse needs the catalog database plus the REST service plus the auth provider. For small deployments this overhead is significant; for large deployments the REST service is a benefit (it centralizes auth and access control).
Hive Metastore: The Legacy-Compat Choice
The Hive Metastore is the catalog that emerged from the Hadoop era. It is a Java service backed by a SQL database, exposing a Thrift API. Iceberg supports it as a catalog backend for compatibility with existing Hadoop-shaped deployments.
The relevant properties:
- Thrift API. Older than REST; well-supported in Java/Scala ecosystems; less convenient outside them.
- Wire-level compatibility with Hive tooling. Catalog metadata for Iceberg tables is stored in Hive's existing tables, sharing the database with Hive's own table metadata. A deployment migrating from Hive to Iceberg can use the same Metastore; client tooling that was already pointing at Hive Metastore continues to work.
- Single-process service. The Metastore runs as one Java service; horizontal scaling is via load balancing in front of multiple instances against a shared database.
- No multi-version branching. Hive Metastore's model is a single linear history per table; the branching support that Nessie provides (below) is not available.
Hive Metastore is the right choice in exactly one scenario: a deployment that already runs a Hive Metastore and is incrementally migrating to Iceberg. New deployments rarely choose it; the operational overhead of running a JVM service plus the SQL database is significant relative to the simpler alternatives, and the Hive-specific tooling integrations are useful only when Hive itself is also in the picture.
Nessie: The Git-for-Data Choice
Project Nessie is the most architecturally distinctive option. Nessie adds Git-like branching and merging semantics to the catalog: tables can have branches (main, experiment, recovery-2024-03), commits to branches, merges between branches, and tags pinning specific snapshots.
The properties:
- Branching as a first-class concept. A query engine can read from main, from a branch, or from a tag. The same table identity exists under multiple branches with potentially different states.
- Merge semantics across multiple tables. A commit to Nessie can update multiple tables atomically; the commit is a multi-table transaction in a way the other catalog backends don't support. This is the catalog's analog of a multi-statement transaction in a row-store database.
- Audit-trail by construction. Every commit has an author, a message, and a timestamp; the history is a directed acyclic graph in the Git sense, navigable like any Git repository.
- REST-style HTTP API. The protocol is similar to Iceberg's REST catalog; multi-tenant; horizontally scalable.
Nessie is the right choice for workflows that need branching. The two patterns the Artemis team evaluated:
- Experimental work on production data. An analyst wants to run a transformation that modifies a table; they branch from main, commit the transformation, validate the result, and merge back. The branch is isolated from main during the work; other analysts continue to see the unmodified main.
- Multi-table atomic schema migration. A migration that needs to update three related tables consistently can commit all three changes in one Nessie commit; readers see either the pre-migration state or the post-migration state, never an in-progress mix.
The disadvantage is operational complexity. Nessie is a separate service with its own backing store; it has its own merge-conflict semantics that the writer code must handle; the Git-flavored mental model is unfamiliar to teams used to traditional databases. The Artemis team chose not to use Nessie for the cold archive specifically because the workload (single-writer ingest, append-only data) has no need for branching; for a more dynamic workload with multiple writers experimenting on shared data, Nessie would have been the right choice.
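The branch-isolate-merge workflow can be sketched with a toy in-memory model. This is not Nessie's API or data model: BranchingCatalog and its methods are hypothetical, snapshots are bare integers, and the merge simply copies pointers with no conflict detection, whereas real Nessie has merge-conflict semantics the writer must handle.

```rust
use std::collections::HashMap;

type SnapshotId = u64;

/// Toy branch-aware catalog: each branch maps table names to snapshots.
struct BranchingCatalog {
    branches: HashMap<String, HashMap<String, SnapshotId>>,
}

impl BranchingCatalog {
    fn new() -> Self {
        let mut branches = HashMap::new();
        branches.insert("main".to_string(), HashMap::new());
        Self { branches }
    }

    /// Create `name` as a copy of `from`'s current table pointers.
    fn create_branch(&mut self, name: &str, from: &str) -> Result<(), String> {
        let base = self
            .branches
            .get(from)
            .ok_or_else(|| format!("unknown branch: {}", from))?
            .clone();
        self.branches.insert(name.to_string(), base);
        Ok(())
    }

    /// Apply every table update in one step, so a reader of the branch
    /// sees all of them or none of them.
    fn commit(&mut self, branch: &str, updates: &[(&str, SnapshotId)]) -> Result<(), String> {
        let state = self
            .branches
            .get_mut(branch)
            .ok_or_else(|| format!("unknown branch: {}", branch))?;
        for (table, snapshot) in updates {
            state.insert(table.to_string(), *snapshot);
        }
        Ok(())
    }

    fn read(&self, branch: &str, table: &str) -> Option<SnapshotId> {
        self.branches.get(branch)?.get(table).copied()
    }

    /// Copy `source`'s pointers onto `target`. No conflict detection.
    fn merge(&mut self, source: &str, target: &str) -> Result<(), String> {
        let src = self
            .branches
            .get(source)
            .ok_or_else(|| format!("unknown branch: {}", source))?
            .clone();
        let dst = self
            .branches
            .get_mut(target)
            .ok_or_else(|| format!("unknown branch: {}", target))?;
        dst.extend(src);
        Ok(())
    }
}
```

Readers of main keep seeing the old snapshots for both tables until the merge, at which point both pointers move together; that is the isolation and multi-table atomicity the two evaluated patterns rely on.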
Postgres-JDBC: The Pragmatic Choice
The simplest catalog backend is a JDBC connection to a transactional database — typically Postgres, occasionally MySQL or SQLite. The catalog state is a small set of tables; the CAS is a row-level update; the discovery API is straightforward SQL.
The properties:
- No new infrastructure. A team that already runs Postgres for application data adds the catalog tables to it. No new service to deploy, no new auth scheme, no new monitoring.
- Transactional semantics inherited from Postgres. The CAS is an UPDATE ... WHERE against a single row; multi-row updates are SQL transactions. The catalog's consistency guarantees are exactly Postgres's, which are well-understood.
- No native branching. Like Hive Metastore, the JDBC catalog has a linear per-table history. Branching is achievable via convention (one table per branch) but not as a first-class concept.
- Single-region typically. A multi-region deployment requires replicating Postgres across regions, which is operationally non-trivial and changes the consistency model. Most JDBC-backed lakehouses are single-region.
The Artemis cold archive uses Postgres-JDBC because the operational profile matches: single-region, single-writer per table, an existing Postgres instance available on the ground-segment infrastructure. The catalog's overhead is essentially zero — adding a few tables to an existing database. For a small or medium deployment with no need for branching or multi-region, this is the obvious choice.
The disadvantage is scale. A deployment with thousands of tables across hundreds of teams typically outgrows a single Postgres instance; the JDBC catalog doesn't scale horizontally without sharding the catalog tables, which adds operational complexity. The REST catalog (which can run multiple instances against the same database, or against a horizontally-scalable database like CockroachDB) handles this case better.
Comparison Matrix
A summary of the four backends on the dimensions that matter operationally:
| Dimension | REST | Hive Metastore | Nessie | Postgres-JDBC |
|---|---|---|---|---|
| Setup overhead | Medium (service + DB + auth) | High (JVM service + DB + auth) | Medium-high (service + DB) | Low (just DB) |
| Horizontal scaling | Yes (stateless service) | Yes (load balancer) | Yes (stateless service) | Limited (single DB) |
| Branching/merging | No | No | First-class | No (convention only) |
| Multi-table txn | No | No | Yes | Via SQL transaction |
| Multi-region | Yes with replicated DB | Yes with replicated DB | Yes | Limited |
| Client ergonomics | Excellent (HTTP) | Mediocre (Thrift) | Good (REST) | Excellent (SQL) |
| Audit trail | Per backend | Per backend | First-class | Via SQL history |
| Best for | Multi-tenant prod | Hadoop migration | Branching workloads | Small/medium single-region |
The "Best for" row is the decision shortcut. The Artemis archive fits "small/medium single-region" cleanly; Postgres-JDBC was the obvious choice. A growing-multi-team deployment with multiple language clients would land on REST. A team with active experimentation on shared production data would land on Nessie. A team migrating from Hadoop would inherit Hive Metastore.
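The decision shortcut can be written down as a small function. This is a sketch of the "Best for" row only: the Requirements fields are hypothetical, and the priority order among the checks (branching first, then an existing Metastore, then multi-tenant access) is an assumption made for illustration, not a rule the matrix states.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Rest,
    HiveMetastore,
    Nessie,
    PostgresJdbc,
}

/// Deployment traits that drive the choice. All default to false.
#[derive(Default)]
struct Requirements {
    needs_branching: bool,
    existing_hive_metastore: bool,
    multi_tenant_or_multi_language: bool,
}

/// Map requirements to the matrix's "Best for" row. The ordering of
/// the checks is an assumed tie-break, not part of the matrix.
fn recommend(req: &Requirements) -> Backend {
    if req.needs_branching {
        Backend::Nessie
    } else if req.existing_hive_metastore {
        Backend::HiveMetastore
    } else if req.multi_tenant_or_multi_language {
        Backend::Rest
    } else {
        // Small or medium, single-region, nothing special required.
        Backend::PostgresJdbc
    }
}
```

With every field left at its default, which is the Artemis archive's profile as described, the function lands on Postgres-JDBC.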
Consistency Across Backends
DDIA (Ch. 9) develops the consistency framing that applies here. The catalog's CAS must be linearizable for the table format's atomicity guarantees to hold — concurrent writers cannot both successfully advance the same pointer. The four backends provide this differently:
- Postgres-JDBC: linearizability via Postgres's per-row locking under the default READ COMMITTED isolation. The CAS is one UPDATE against one row; Postgres serializes concurrent updates against the same row.
- Hive Metastore: linearizability via the underlying SQL database (same as JDBC). The Thrift layer adds no consistency mechanism of its own.
- REST catalog: depends on the backing store. Most REST catalogs use a SQL database backend and inherit its linearizability; cloud-native variants on DynamoDB inherit DynamoDB's strong-consistency option (must be explicitly enabled).
- Nessie: linearizability via its own commit-log mechanism (single-writer at the global commit-log head; commits are serialized through it).
The dimension where the backends diverge is stale-read behavior. Postgres replicas can lag behind the primary by up to the replication latency; a read from a replica may see a stale snapshot pointer. Multi-region Nessie has explicit propagation latency; a cross-region read may miss commits that have not yet propagated to the local region. The lakehouse format itself does not handle these stale-read cases; clients must be aware of them and pin snapshots explicitly when consistency-across-reads matters (Module 4 Lesson 1's --pin-snapshot flag).
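The linearizability requirement is easy to demonstrate with an in-memory stand-in. The sketch below replaces the catalog row with a Mutex-guarded integer; Pointer, Conflict, and race are hypothetical names, and the mutex plays the role that row locking or a commit log plays in a real backend. Several writers start from the same expected snapshot and exactly one of them commits.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

type SnapshotId = u64;

#[derive(Debug, PartialEq)]
struct Conflict {
    expected: SnapshotId,
    actual: SnapshotId,
}

/// In-memory stand-in for a catalog's per-table snapshot pointer.
struct Pointer {
    current: Mutex<SnapshotId>,
}

impl Pointer {
    /// Advance the pointer only if it still equals `expected`.
    fn compare_and_swap(&self, expected: SnapshotId, new: SnapshotId) -> Result<(), Conflict> {
        let mut current = self.current.lock().unwrap();
        if *current == expected {
            *current = new;
            Ok(())
        } else {
            Err(Conflict { expected, actual: *current })
        }
    }

    fn load(&self) -> SnapshotId {
        *self.current.lock().unwrap()
    }
}

/// Race `writers` threads that all try to advance from snapshot 1.
/// Returns the number of successful commits and the final pointer.
fn race(writers: u64) -> (usize, SnapshotId) {
    let pointer = Arc::new(Pointer { current: Mutex::new(1) });
    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let pointer = Arc::clone(&pointer);
            thread::spawn(move || pointer.compare_and_swap(1, 100 + i).is_ok())
        })
        .collect();
    let wins = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|ok| *ok)
        .count();
    (wins, pointer.load())
}
```

The losing writers get a Conflict carrying the snapshot that actually won, which is the information a writer needs to rebase and retry.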
Code Examples
A Catalog Trait
The integration abstraction across backends: a single trait the query engine consumes, with one impl per backend.
use async_trait::async_trait;
use anyhow::Result;
#[async_trait]
pub trait Catalog: Send + Sync {
    /// Discovery: list namespaces (typically hierarchical, e.g.
    /// ["artemis", "sda", "observations"]).
    async fn list_namespaces(&self) -> Result<Vec<Vec<String>>>;
    /// List tables in a namespace.
    async fn list_tables(&self, namespace: &[String]) -> Result<Vec<TableIdentifier>>;
    /// Load a table's metadata: the current snapshot, the schema
    /// history, the partition specs.
    async fn load_table(&self, id: &TableIdentifier) -> Result<TableMetadata>;
    /// Writer-side: CAS the table's current snapshot.
    async fn compare_and_swap(
        &self,
        id: &TableIdentifier,
        expected: SnapshotId,
        new: SnapshotId,
        new_metadata_path: &str,
    ) -> Result<(), CommitError>;
    /// Writer-side: create a new table.
    async fn create_table(
        &self,
        id: &TableIdentifier,
        initial_metadata: &TableMetadata,
    ) -> Result<()>;
}
#[derive(Debug, Clone)]
pub struct TableIdentifier {
    pub namespace: Vec<String>,
    pub name: String,
}
The discipline. The trait is small. The query engine consumes it through list_namespaces, list_tables, load_table; the writer code uses it through compare_and_swap and create_table. Each backend implements the trait once; swapping backends is a single dependency-injection point.
A Postgres-JDBC Implementation Sketch
The Artemis archive's catalog, distilled to the trait:
use sqlx::PgPool;
pub struct PostgresCatalog {
    pool: PgPool,
}
#[async_trait]
impl Catalog for PostgresCatalog {
    async fn list_namespaces(&self) -> Result<Vec<Vec<String>>> {
        let rows = sqlx::query!(
            "SELECT DISTINCT namespace_path FROM iceberg_tables ORDER BY namespace_path"
        )
        .fetch_all(&self.pool)
        .await?;
        Ok(rows.into_iter()
            .map(|r| r.namespace_path.split('.').map(String::from).collect())
            .collect())
    }
    async fn list_tables(&self, namespace: &[String]) -> Result<Vec<TableIdentifier>> {
        let ns_path = namespace.join(".");
        let rows = sqlx::query!(
            "SELECT table_name FROM iceberg_tables WHERE namespace_path = $1",
            ns_path,
        )
        .fetch_all(&self.pool)
        .await?;
        Ok(rows.into_iter()
            .map(|r| TableIdentifier {
                namespace: namespace.to_vec(),
                name: r.table_name,
            })
            .collect())
    }
    async fn load_table(&self, id: &TableIdentifier) -> Result<TableMetadata> {
        let ns_path = id.namespace.join(".");
        let row = sqlx::query!(
            "SELECT metadata_path FROM iceberg_tables
             WHERE namespace_path = $1 AND table_name = $2",
            ns_path,
            id.name,
        )
        .fetch_one(&self.pool)
        .await?;
        read_metadata_file(&row.metadata_path).await
    }
    async fn compare_and_swap(
        &self,
        id: &TableIdentifier,
        expected: SnapshotId,
        new: SnapshotId,
        new_metadata_path: &str,
    ) -> Result<(), CommitError> {
        let ns_path = id.namespace.join(".");
        let result = sqlx::query!(
            "UPDATE iceberg_tables
             SET current_snapshot_id = $1, metadata_path = $2
             WHERE namespace_path = $3 AND table_name = $4 AND current_snapshot_id = $5",
            new as i64, new_metadata_path, ns_path, id.name, expected as i64,
        )
        .execute(&self.pool)
        .await
        .map_err(|e| CommitError::Database(e.into()))?;
        if result.rows_affected() == 1 {
            Ok(())
        } else {
            // Read current state to give the caller a useful error.
            let row = sqlx::query!(
                "SELECT current_snapshot_id FROM iceberg_tables
                 WHERE namespace_path = $1 AND table_name = $2",
                ns_path, id.name,
            )
            .fetch_one(&self.pool)
            .await
            .map_err(|e| CommitError::Database(e.into()))?;
            Err(CommitError::Conflict(CommitConflict {
                expected_old: expected,
                actual_current: row.current_snapshot_id as u64,
            }))
        }
    }
    // create_table omitted for brevity; standard INSERT.
}
The mechanics. The CAS is one SQL UPDATE; the discovery operations are straightforward selects. The Postgres row-locking provides linearizability for the CAS at no extra cost. The implementation is small because Postgres is doing most of the work; the catalog logic is mostly translation between the trait and the database.
A REST Catalog Client Sketch
The REST catalog client side, showing how the same trait swaps backends:
pub struct RestCatalog {
    base_url: String,
    client: reqwest::Client,
}
#[async_trait]
impl Catalog for RestCatalog {
    async fn list_namespaces(&self) -> Result<Vec<Vec<String>>> {
        let resp: NamespaceList = self.client
            .get(format!("{}/v1/namespaces", self.base_url))
            .send().await?
            .json().await?;
        Ok(resp.namespaces)
    }
    async fn list_tables(&self, namespace: &[String]) -> Result<Vec<TableIdentifier>> {
        // Simplification: the Iceberg REST spec separates multi-level
        // namespace parts with the 0x1F unit separator (URL-encoded as
        // %1F), not '.'; joining with '.' keeps the sketch readable.
        let ns_path = namespace.join(".");
        let resp: TableList = self.client
            .get(format!("{}/v1/namespaces/{}/tables", self.base_url, ns_path))
            .send().await?
            .json().await?;
        Ok(resp.identifiers)
    }
    async fn load_table(&self, id: &TableIdentifier) -> Result<TableMetadata> {
        let ns_path = id.namespace.join(".");
        let resp: LoadTableResult = self.client
            .get(format!("{}/v1/namespaces/{}/tables/{}", self.base_url, ns_path, id.name))
            .send().await?
            .json().await?;
        Ok(resp.metadata)
    }
    // CAS and create_table omitted; both are POSTs to the equivalent endpoints.
}
The contrast. The same trait, the same operations, different protocols underneath. The query engine code that consumes the trait is identical; choosing the backend is a startup-time decision. This is the architectural property that makes the catalog choice operational rather than structural — the lakehouse's correctness does not depend on which backend is in use.
Key Takeaways
- The catalog has two jobs: discovery (list namespaces, list tables, load metadata) for query engines, and CAS for writers. The discovery API is small and well-defined; backends differ in implementation, not in interface.
- Four common backends with substantially different operational properties: REST (HTTP-native, multi-tenant, horizontally scalable), Hive Metastore (legacy-Hadoop-compat), Nessie (Git-like branching), Postgres-JDBC (minimal overhead, single-region).
- The choice is operational, not architectural. The catalog interface is consistent across backends; swapping backends is a dependency-injection decision; lakehouse correctness is unaffected by the choice.
- Linearizability of the CAS is the central correctness requirement across all backends. Each backend provides it differently (per-row locking for SQL, commit-log for Nessie); the lakehouse format inherits whatever the chosen backend offers.
- Branching is Nessie's exclusive territory. Workloads that need experimental branches off production data, or multi-table atomic transactions, justify the Nessie operational overhead. Single-writer append-only workloads don't.
Capstone — Cross-Mission Analytics Engine
Module: Data Lakes — M05: Query Engine Integration Estimated effort: 1–2 weeks of focused work Prerequisite: All three lessons in this module completed; all three quizzes passed (≥ 70%). The Module 2 (table format), Module 3 (partition + clustering), and Module 4 (replay) capstones are the substrate.
Mission Briefing
From: Cold Archive Platform Lead
ARCHIVE BRIEFING — RC-2026-04-DL-005
SUBJECT: Cross-Mission Analytics Engine — SQL surface over the
Artemis cold archive for analyst-driven exploration.
PRIORITY: P1 — required for Q3 analytics-portal rollout.
The cold archive has a transactional table format (M2), good partitioning and clustering (M3), and a replay engine for time-travel reads (M4). What it does not have is a SQL surface analysts can use directly. The Replay Engine answers specific point-in-time queries; it does not aggregate, join, or filter beyond the table format's pushdown.
The job: build a thin SQL layer over the Module 2-4 substrate so analysts can write SELECT mission_id, AVG(panel_voltage) FROM sda_observations WHERE day BETWEEN '2024-03-01' AND '2024-03-08' GROUP BY mission_id and get results. Use Apache DataFusion as the engine — it handles SQL parsing, logical planning, optimization, and physical execution. Your job is the integration shim: a custom TableProvider that wires the table format's planning and reading into DataFusion's execution path.
The engine must support the predicate-pushdown contract (Lesson 1's three-way classification), projection pushdown (Lesson 2's column subset selection), and time-travel queries via SQL syntax (AS OF TIMESTAMP '...'). It must not blow up memory on large result sets — the streaming discipline from Module 4 carries forward. The capstone is the user-facing tool the analytics portal will call.
What You're Building
A Rust crate, artemis-analytics, exposing:
- An ArtemisCatalog (implements DataFusion's CatalogProvider) that surfaces the Module 5 Lesson 3 catalog's namespaces and tables to DataFusion.
- An ArtemisTableProvider (implements DataFusion's TableProvider) that wires the Module 2-4 table format into DataFusion's scan path, with predicate pushdown classification (Exact/Inexact/Unsupported) and projection pushdown.
- An ArtemisExec (implements DataFusion's ExecutionPlan) that returns a SendableRecordBatchStream over the scan plan's files, decoding Parquet via Module 1's reader and streaming Arrow batches.
- Time-travel SQL: SELECT ... FROM table AS OF TIMESTAMP '2024-03-15T12:00:00Z' and SELECT ... FROM table AS OF SNAPSHOT 4729 parse to a pinned-snapshot read. Implementation routes through DataFusion's table-function machinery or a query-rewrite step depending on what your DataFusion version supports natively.
- A CLI binary, artemis-sql, with a REPL for interactive SQL exploration and a --query flag for single-shot queries.
- A small HTTP service exposing the SQL API over Arrow Flight (the Lesson 2 cross-process transport), so the analytics portal can consume results without a Rust client.
The engine must complete a representative analytics workload (the 3,200-query log from Module 3's capstone, projected through a SQL GROUP BY step) in under twice the time the Module 3 scan-only benchmark took for the same queries. The 2× overhead budget covers DataFusion's aggregation, materialization, and result-shipping work; if the engine exceeds it, the integration shim has a bug.
Functional Requirements
- CatalogProvider integration. DataFusion's SessionContext::register_catalog accepts the ArtemisCatalog; SHOW SCHEMAS, SHOW TABLES, and SELECT * FROM information_schema.tables work without further setup.
- TableProvider::supports_filters_pushdown classifies each filter per Lesson 1's rules. Identity-partition + equality → Exact. Day-partition + range → Exact. Non-partition column with statistics + comparison → Inexact. Function calls on columns → Unsupported. Conjunction and disjunction compose per the rules.
- TableProvider::scan translates DataFusion's Expr filters into the Module 2-4 Predicate shape, calls the table format's plan_scan, and returns an ArtemisExec wrapping the plan.
- ArtemisExec::execute returns a SendableRecordBatchStream that reads the scan plan's files lazily. The stream applies projection pushdown via the Parquet reader's ProjectionMask. Files are read concurrently with a configurable parallelism cap (default 64); batches arrive in unspecified order.
- Time-travel via SQL. AS OF TIMESTAMP '<rfc3339>' and AS OF SNAPSHOT <id> route the table provider to pin the corresponding snapshot for the duration of the query. The pinned snapshot ID appears in the query's structured logs (Module 4 Lesson 1's discipline).
- Snapshot-time vs current schema projection. Time-travel queries default to snapshot-time projection (Module 4 Lesson 3). A SQL session-level option (SET artemis.projection_mode = 'current') switches the default.
- Arrow Flight server. The artemis-flight binary exposes the same SQL surface over the Flight gRPC API. The Flight DoGet endpoint streams the query result as an IPC stream; the client sees the same data the in-process API produces.
- Streaming memory bound. Across all query types (point reads, full scans, aggregations), the engine's resident set stays under 4 GB. Aggregations spill to disk via DataFusion's built-in spill mechanism when memory pressure exceeds the threshold.
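The classification rules in the supports_filters_pushdown requirement can be sketched as a small pure function. This is a minimal sketch, not DataFusion's API: ColumnKind, Op, Filter, and Pushdown are hypothetical stand-ins (Pushdown mirrors the three variants of DataFusion's TableProviderFilterPushDown). Two things here are assumptions rather than Lesson 1's rules: combinations the requirement does not list are treated as Unsupported, and AND/OR take the weakest classification of their children, which is conservative but always safe. Day-partition equality is treated like a one-day range, since the acceptance criteria expect day = '...' to be pushed exactly.

```rust
/// Stand-in for DataFusion's three pushdown outcomes. Variants are declared
/// weakest first so that `min` picks the weaker of two classifications.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Pushdown {
    Unsupported,
    Inexact,
    Exact,
}

/// Hypothetical description of the column a comparison touches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    IdentityPartition,
    DayPartition,
    /// Non-partition column with min/max statistics.
    WithStats,
    /// Non-partition column without usable statistics.
    NoStats,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Eq,
    Range,
}

/// Hypothetical, heavily simplified filter tree.
#[derive(Debug, Clone)]
pub enum Filter {
    Compare { column: ColumnKind, op: Op },
    /// Any function call wrapped around a column, e.g. UPPER(notes).
    FunctionCall,
    And(Box<Filter>, Box<Filter>),
    Or(Box<Filter>, Box<Filter>),
}

pub fn classify(filter: &Filter) -> Pushdown {
    match filter {
        Filter::Compare { column, op } => match (*column, *op) {
            (ColumnKind::IdentityPartition, Op::Eq) => Pushdown::Exact,
            // Equality on a day partition is a one-day range.
            (ColumnKind::DayPartition, _) => Pushdown::Exact,
            (ColumnKind::WithStats, _) => Pushdown::Inexact,
            // Assumption: anything the requirement does not list is not pushed.
            (ColumnKind::IdentityPartition, Op::Range) | (ColumnKind::NoStats, _) => {
                Pushdown::Unsupported
            }
        },
        Filter::FunctionCall => Pushdown::Unsupported,
        // Assumption: a compound filter is only as strong as its weakest child.
        Filter::And(l, r) | Filter::Or(l, r) => classify(l).min(classify(r)),
    }
}
```

A table like the one docs/pushdown-rules.md asks for falls out of enumerating the Compare arm; the composition rule is the part to check against Lesson 1 before relying on it.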
Acceptance Criteria
Verifiable (automated tests must demonstrate these)
- SHOW TABLES IN artemis.sda returns the table identifiers registered in the catalog.
- SELECT COUNT(*) FROM artemis.sda.observations WHERE mission_id = 'apollo-7' AND day = '2024-03-15' produces the correct count and the query plan (via EXPLAIN) shows the predicate pushed exactly to the table provider, with no residual FilterExec for these predicates.
- SELECT COUNT(*) FROM artemis.sda.observations WHERE UPPER(notes) = 'X' produces a correct count and the query plan shows a residual FilterExec applied to the table provider's output, with the UPPER predicate not pushed.
- SELECT * FROM artemis.sda.observations AS OF TIMESTAMP '2024-03-01T00:00:00Z' reads against the snapshot that was current at that timestamp; the result schema matches the schema-as-of-that-timestamp (snapshot-time projection).
- The same query with SET artemis.projection_mode = 'current' returns the result with the current schema, columns added since back-filled as nulls.
- A query SELECT mission_id, COUNT(*) FROM artemis.sda.observations WHERE day BETWEEN '2024-03-01' AND '2024-03-08' GROUP BY mission_id completes in under 2× the Module 3 scan-only baseline for the same data range.
- An aggregation query that produces results larger than the configured memory bound spills to disk and completes successfully; the spill is visible in the query's metrics.
- The Flight server returns identical results to the in-process API for a representative query suite (the same 50 queries run via both paths produce byte-identical Arrow IPC streams).
Self-assessed (you write a short justification; reviewer checks it)
- (self-assessed) The pushdown classification is documented in docs/pushdown-rules.md with a row per (column-kind, op, transform) triple and the corresponding classification. The doc is the artifact a future engineer extending the engine's pushdown will consult.
- (self-assessed) The time-travel SQL extension is documented in docs/time-travel-sql.md. The doc describes the syntax accepted, how it parses, how it routes to the pinned snapshot, and the interaction with retention-window expiration.
- (self-assessed) The Flight server's deployment is documented in docs/flight-deployment.md. The doc covers the authentication scheme (token-based; how tokens are issued and validated), the rate-limiting strategy (per-token request budget), and the failure modes the operator should monitor.
- (self-assessed) The memory-spill threshold is documented in docs/spill-tuning.md against the workload. The doc describes how the threshold was chosen, what queries exercise the spill path, and the performance characteristics with and without spill.
Architecture Notes
A reasonable module layout:
artemis-analytics/
├── src/
│ ├── lib.rs
│ ├── catalog.rs # ArtemisCatalog (CatalogProvider impl)
│ ├── provider.rs # ArtemisTableProvider (TableProvider impl)
│ ├── exec.rs # ArtemisExec (ExecutionPlan impl)
│ ├── pushdown.rs # classify(), exprs_to_predicate()
│ ├── time_travel.rs # AS OF TIMESTAMP / AS OF SNAPSHOT parsing
│ ├── stream.rs # file-list -> SendableRecordBatchStream
│ ├── bin/artemis_sql.rs # REPL + single-query CLI
│ └── bin/artemis_flight.rs # Flight server
├── tests/
│ ├── show_tables.rs
│ ├── pushdown_classification.rs
│ ├── time_travel_sql.rs
│ ├── streaming_memory.rs
│ ├── aggregation_spill.rs
│ ├── flight_parity.rs
│ └── workload_bench.rs # ignored by default; the 2x SLA bench
└── docs/
├── pushdown-rules.md
├── time-travel-sql.md
├── flight-deployment.md
└── spill-tuning.md
The DataFusion version against which to develop is the one currently deployed in the Artemis analytics portal — pin it in Cargo.toml rather than tracking head. The DataFusion TableProvider trait has evolved across recent versions; the integration code is sensitive to which signatures are current.
The arrow-flight crate provides the Flight server's gRPC machinery; the heavy lifting is in the IPC stream production, which reuses the same RecordBatchStream the in-process API produces. The Flight handler's job is mostly transport: receive a Ticket (which can carry the SQL query directly; see Hint 5), execute it, return the result as an IPC stream over gRPC.
Hints
Hint 1 — Translating DataFusion `Expr` to the table format `Predicate`
DataFusion's filter expressions are Expr trees rooted at BinaryExpr, ScalarFunction, Column, etc. The table format's Predicate is a smaller AST built around LeafOp and conjunctions/disjunctions. Write a recursive translator that walks the Expr tree, matches on the node types, and produces the corresponding Predicate node — or returns None for nodes the table format doesn't support. The translator's failure modes feed directly into the pushdown classification (untranslatable → Unsupported).
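The recursive translator described in this hint might look roughly like the following. It is a sketch under stated assumptions: the Expr enum here is a hypothetical, much smaller stand-in for DataFusion's Expr, and Predicate and LeafOp are guesses at the shape of the Module 2-4 types, not their real definitions. Literals are kept as strings to keep the example short.

```rust
/// Hypothetical stand-in for DataFusion's expression tree.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(String),
    Binary { left: Box<Expr>, op: BinOp, right: Box<Expr> },
    ScalarFunction { name: String, args: Vec<Expr> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BinOp {
    Eq,
    Lt,
    Gt,
    And,
    Or,
}

/// Assumed shape of the table format's leaf comparison operators.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LeafOp {
    Eq,
    Lt,
    Gt,
}

/// Assumed shape of the table format's predicate AST.
#[derive(Debug, Clone, PartialEq)]
pub enum Predicate {
    Leaf { column: String, op: LeafOp, value: String },
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

fn leaf_op(op: BinOp) -> Option<LeafOp> {
    match op {
        BinOp::Eq => Some(LeafOp::Eq),
        BinOp::Lt => Some(LeafOp::Lt),
        BinOp::Gt => Some(LeafOp::Gt),
        BinOp::And | BinOp::Or => None,
    }
}

/// Translate an expression into a predicate, or return None when any part
/// of it has no counterpart in the table format.
pub fn translate(expr: &Expr) -> Option<Predicate> {
    match expr {
        Expr::Binary { left, op: BinOp::And, right } => Some(Predicate::And(
            Box::new(translate(left)?),
            Box::new(translate(right)?),
        )),
        Expr::Binary { left, op: BinOp::Or, right } => Some(Predicate::Or(
            Box::new(translate(left)?),
            Box::new(translate(right)?),
        )),
        // Only `column <op> literal` is a translatable leaf in this sketch.
        Expr::Binary { left, op, right } => match (left.as_ref(), right.as_ref()) {
            (Expr::Column(column), Expr::Literal(value)) => Some(Predicate::Leaf {
                column: column.clone(),
                op: leaf_op(*op)?,
                value: value.clone(),
            }),
            _ => None,
        },
        Expr::Column(_) | Expr::Literal(_) | Expr::ScalarFunction { .. } => None,
    }
}
```

With this shape, UPPER(notes) = 'X' translates to None because its left side is a function call, which is the signal the classification step can map to Unsupported. A fuller translator could also keep the translatable half of an AND instead of rejecting the whole conjunction.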
Hint 2 — DataFusion's `ExecutionPlan` boilerplate
ExecutionPlan requires a fair amount of boilerplate: schema(), output_partitioning(), children(), with_new_children(), execute(), plus metrics and properties. Most of it is trivial for a leaf scan operator; copy the structure from DataFusion's built-in ParquetExec and adapt. The non-trivial method is execute, which is where your stream is built. Setting output_partitioning to Partitioning::UnknownPartitioning(n) where n is the number of file-reading tasks is the simplest correct choice; DataFusion's optimizer handles the rest.
Hint 3 — Parsing `AS OF TIMESTAMP` without forking DataFusion
If your DataFusion version doesn't natively parse AS OF TIMESTAMP, the cleanest workaround is a query-rewrite step: intercept the SQL string before parsing, extract any AS OF clauses with a regex, replace them with a table-function syntax DataFusion does parse (e.g., table_at_time('artemis.sda.observations', '2024-03-15T...')), and register the table function to return an ArtemisTableProvider pinned to the resolved snapshot. The user-visible syntax is AS OF; the internal mechanism is a table function.
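As an illustration of the rewrite step, here is a minimal sketch that handles a single AS OF TIMESTAMP clause. To stay dependency-free it uses plain string scanning rather than the regex the hint suggests. The function name rewrite_as_of is hypothetical; table_at_time is the table-function name used in the hint. The sketch assumes uppercase keywords separated by single spaces and does not guard against the marker appearing inside a string literal, so a production version would want a tokenizer-aware pass.

```rust
/// Rewrite `<table> AS OF TIMESTAMP '<ts>'` into
/// `table_at_time('<table>', '<ts>')`.
///
/// Returns the input unchanged when there is no AS OF TIMESTAMP clause and
/// None when the clause is present but malformed.
pub fn rewrite_as_of(sql: &str) -> Option<String> {
    const MARKER: &str = " AS OF TIMESTAMP '";
    let Some(marker_start) = sql.find(MARKER) else {
        return Some(sql.to_string());
    };

    // The table name is the whitespace-delimited token right before the marker.
    let head = &sql[..marker_start];
    let table = head.rsplit(char::is_whitespace).next().unwrap_or("");
    if table.is_empty() {
        return None;
    }
    let table_start = head.len() - table.len();

    // The timestamp literal runs up to the closing single quote.
    let ts_start = marker_start + MARKER.len();
    let ts_len = sql[ts_start..].find('\'')?;
    let ts = &sql[ts_start..ts_start + ts_len];
    let tail = &sql[ts_start + ts_len + 1..];

    Some(format!(
        "{}table_at_time('{}', '{}'){}",
        &sql[..table_start],
        table,
        ts,
        tail
    ))
}
```

The rewritten string is what gets handed to DataFusion's parser; the registered table function then resolves the timestamp to a snapshot and returns a pinned provider.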
Hint 4 — The 2× SLA benchmark structure
The benchmark runs the 3,200-query log from Module 3 in two configurations: scan-only (the M3 capstone's read_plan + raw Parquet read) and SQL (the M5 capstone's full DataFusion pipeline projecting to a GROUP BY aggregation). Both configurations read the same data; the difference is the DataFusion overhead. The 2× budget is generous because DataFusion's aggregation is well-optimized; if your benchmark exceeds 2×, the typical culprits are (a) repeated metadata reads per query when one cache would suffice, (b) inefficient Expr translation, (c) projection-pushdown failing silently and the scan returning all columns.
Hint 5 — Flight server's `Ticket` encoding
The Flight Ticket is an opaque byte string the server interprets. The simplest encoding: a JSON object with {"sql": "SELECT ..."}. The server deserializes the ticket, runs the SQL through the in-process API, returns the result. More sophisticated deployments encode pre-planned queries or include auth claims; for the capstone the simple JSON pass-through is sufficient. The Flight GetFlightInfo endpoint can return a single endpoint pointing at the server itself (no parallelism needed for the capstone's scale).
References
- Apache DataFusion documentation: TableProvider and ExecutionPlan traits
- arrow-flight crate documentation: server and client patterns
- Apache Iceberg Java reference implementation: for the canonical pushdown classification rules to cross-check against
When You're Done
The crate is "done" when all eight verifiable acceptance criteria pass in CI, the four self-assessed docs are written, and the workload benchmark hits the 2× SLA. Module 6 begins with the assumption that this SQL surface is in place; the lifecycle operations Module 6 develops (compaction, snapshot expiration, orphan cleanup) will need their own SQL hooks for operator-driven invocation.
Module 6: Compaction, Lineage, and Lifecycle
Lesson 1 — Compaction Strategies
Module: Data Lakes — M06: Compaction, Lineage, and Lifecycle Position: Lesson 1 of 3 Source: Database Internals — Alex Petrov, Chapter 7 ("Log-Structured Storage — Compaction") for the analogous LSM compaction discipline. Apache Iceberg specification, "Rewriting Data Files" section. Delta Lake protocol's "OPTIMIZE" command documentation for contrast.
Source note: The compaction strategies and tradeoffs are drawn from the lakehouse community's accumulated practice; the LSM analogy is grounded in Database Internals. Specific implementation details should be verified against the current Iceberg maintenance APIs.
Context
The Artemis cold archive has been ingesting data for two years. The ingestion pipeline commits a new snapshot every 30 seconds with the previous 30 seconds' downlinks; a typical commit produces 8-15 data files. After two years that's roughly 2 million data files, distributed across 40,000 partitions, with an average file size of around 18 MB — substantially smaller than the 128 MB target Module 1 set. The metadata is bloated; the planner reads thousands of manifests per query; per-file overhead dominates the actual data read; the analyst queries that used to complete in 30 seconds now take 5 minutes.
The fix is compaction: rewriting many small files into fewer large files. The discipline is mechanically straightforward (read the small files, write a large file with the same rows, atomically swap the file set in a single commit) but operationally subtle. Compaction competes with the ingest workload for object-store bandwidth, with the query workload for compute, and with the storage-tier infrastructure for IOPS. A naive compaction that runs full-speed continuously degrades the live system; a too-conservative compaction never catches up with the small-file accumulation.
This lesson develops the compaction discipline. The strategies — bin-packing, sort-based, Z-order rewriting — that handle different small-file patterns. The safety properties that make compaction non-disruptive (atomic file-set swap via overwrite commit; concurrent readers unaffected; rollback on failure). The operational tuning that keeps compaction running at the right rate against the right partitions. The capstone's lifecycle worker runs compaction continuously against the production archive; this lesson is the design.
Core Concepts
The Small-File Problem at Scale
The small-file problem comes from a structural mismatch between ingest cadence and target file size. Ingest commits arrive at the cadence the workload requires: for the Artemis archive, every 30 seconds, because the downlink path produces data at that rate and the analyst workload wants the data to be queryable within a minute. The 30-second window contains a finite amount of data. For a typical mission with 8 active payloads producing telemetry at 10 Hz, that's 300 samples per payload, or 2,400 rows across the 8 payloads. At a little under 2 KB per row, that comes to roughly 4 MB of uncompressed data per partition per commit. After ZSTD-3 compression, that's around 1-2 MB per data file per commit.
The result: small files. Each commit produces one file per active partition, each one about 1 MB, well below the 128 MB target the Module 1 writer aims for. Over time, partitions accumulate hundreds or thousands of small files. The Module 3 read planning consults all of them. The per-file overhead (footer parse, statistics evaluation, range request setup) is a fixed cost; for 1-MB files, the overhead dominates the data read. The read latency grows with the file count, not the data size.
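The claim that latency tracks file count rather than data size can be made concrete with a back-of-envelope model. This is an illustrative sketch only: the function name is hypothetical and the constants used in the usage note (20 ms of fixed overhead per file, 200 MB/s of read throughput) are assumptions, not measured Artemis numbers.

```rust
/// Estimated time to read one partition, in milliseconds: a fixed cost per
/// file (footer parse, statistics evaluation, range request setup) plus a
/// throughput-bound cost for the bytes themselves.
pub fn read_time_ms(
    file_count: u64,
    total_bytes: u64,
    per_file_overhead_ms: f64,
    throughput_mb_per_s: f64,
) -> f64 {
    let overhead_ms = file_count as f64 * per_file_overhead_ms;
    let megabytes = total_bytes as f64 / (1024.0 * 1024.0);
    let transfer_ms = megabytes / throughput_mb_per_s * 1000.0;
    overhead_ms + transfer_ms
}
```

Under those assumed constants, 1 GiB spread over 1,024 files of 1 MB comes to about 25.6 s, while the same 1 GiB in 8 files of 128 MB comes to about 5.3 s. The transfer term is identical in both cases; only the per-file term moves.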
The framing is the same one Database Internals (Petrov, Ch. 7) develops for LSM trees: small files are the inevitable consequence of a system that commits more often than its target file size justifies. LSM compaction is the directly analogous discipline: merge small SSTables into larger SSTables, keep the file count bounded by an upper limit per level, keep the per-query overhead constant. The lakehouse's compaction does the same thing at the table-format layer.
The thresholds that define "small": the lakehouse community's accumulated practice is that files below 32 MB are "small" (they impose unacceptable per-file overhead at query time) and files above 1 GB are "too large" (they don't fit in a row-group at the target row count and they impose long re-decode times when a row group's worth of data is read for projection). The Module 1 target of 128 MB sits in the middle; compaction's job is to consolidate small files toward this size without overshooting into too-large.
Compaction as an Overwrite Commit
The mechanics: compaction is an overwrite commit (Module 2 Lesson 3). The compactor reads the source files, writes the consolidated output file, then commits a snapshot that adds the new file and removes the source files — both changes atomically applied via the CAS protocol. The compaction is invisible to readers because the catalog pointer advances atomically; readers pinned to the pre-compaction snapshot continue to see the source files (the snapshot's metadata still references them); readers starting after the compaction see the consolidated file.
Before compaction (snapshot S):
/data/abc.parquet (1 MB, 1k rows)
/data/def.parquet (1 MB, 1k rows)
/data/ghi.parquet (2 MB, 2k rows)
/data/jkl.parquet (1 MB, 1k rows)
/data/mno.parquet (1 MB, 1k rows)
/data/pqr.parquet (2 MB, 2k rows)
/data/stu.parquet (1 MB, 1k rows)
/data/vwx.parquet (1 MB, 1k rows)
Total: 8 files, 10 MB, 10k rows
After compaction (snapshot S+1):
/data/{compacted}.parquet (10 MB, 10k rows)
Total: 1 file, 10 MB, 10k rows
Catalog pointer swap S → S+1 is atomic.
Readers pinned to S still see 8 files (their snapshot is unchanged).
Readers starting at S+1 see 1 file.
The 8 source files remain on disk until snapshot expiration (Lesson 2) removes them.
Two properties this gives. Compaction is safe under concurrent reads. A query that started before compaction proceeds against its pinned snapshot; the data files are unchanged. A query that started after compaction sees the consolidated layout. Compaction is safe under concurrent writes. The compaction commit uses the same CAS protocol as any other commit; concurrent ingest commits and compaction commits race at the CAS, with the loser rebasing and retrying. For append-only ingest, the rebase is cheap (Module 2 Lesson 3); for compaction it's more expensive because the compactor must re-evaluate which source files still exist, but the cost is bounded by the compaction's scope, not the table size.
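Both properties can be seen in a toy in-memory model. The sketch below is not the Module 2 table format: Table, Snapshot, and Conflict are hypothetical types, the catalog pointer is a mutex-guarded Arc, and the CAS is reduced to comparing snapshot IDs. It only illustrates why a pinned reader is unaffected and why a stale base loses the race.

```rust
use std::sync::{Arc, Mutex};

/// An immutable snapshot: an ID plus the data files it references.
#[derive(Debug)]
pub struct Snapshot {
    pub id: u64,
    pub files: Vec<String>,
}

/// Toy table whose "catalog pointer" is a mutex-guarded Arc.
pub struct Table {
    current: Mutex<Arc<Snapshot>>,
}

/// Returned when the base snapshot is no longer current.
#[derive(Debug, PartialEq)]
pub struct Conflict;

impl Table {
    pub fn new(files: Vec<String>) -> Self {
        Table { current: Mutex::new(Arc::new(Snapshot { id: 0, files })) }
    }

    /// Pin the current snapshot. The returned Arc keeps that file list
    /// readable no matter what is committed afterwards.
    pub fn pin(&self) -> Arc<Snapshot> {
        Arc::clone(&self.current.lock().unwrap())
    }

    /// Overwrite commit: remove `removed`, add `added`, but only if
    /// `base_id` is still the current snapshot (the CAS condition).
    pub fn commit_overwrite(
        &self,
        base_id: u64,
        removed: &[String],
        added: Vec<String>,
    ) -> Result<u64, Conflict> {
        let mut current = self.current.lock().unwrap();
        if current.id != base_id {
            return Err(Conflict);
        }
        let mut files: Vec<String> = current
            .files
            .iter()
            .filter(|f| !removed.contains(f))
            .cloned()
            .collect();
        files.extend(added);
        let id = base_id + 1;
        *current = Arc::new(Snapshot { id, files });
        Ok(id)
    }
}
```

A reader that called pin() before the commit still sees the eight source files through its Arc; a reader that pins afterwards sees only the compacted file. A second committer still holding the old base ID gets Conflict and has to rebase.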
Bin-Packing: The Simplest Strategy
The simplest compaction strategy is bin-packing: find small files, group them into bins targeting the 128 MB output size, write one consolidated file per bin. The strategy is content-agnostic — it doesn't sort, it doesn't cluster, it just concatenates.
The procedure:
- List the data files in a target partition. (The partition scope matters; compacting across partitions would violate the file-per-partition-tuple discipline from Module 3 Lesson 1.)
- Filter to files below the small-file threshold (e.g., 32 MB).
- Group the small files into bins of roughly 128 MB total — typically by sorting by size descending and greedily packing into bins.
- For each bin, read all source files, concatenate the row batches, write one output file with the consolidated rows.
- Commit a single overwrite snapshot that removes all the source files and adds all the output files for the partition.
The strategy's strength is its simplicity and its low per-row cost. Reading and rewriting rows without sorting is O(rows) with a small constant; the output file's compression is similar to the inputs' (the row distribution doesn't change). The strategy's weakness is that it preserves whatever row ordering the inputs had — typically arrival order, which is poorly clustered for analytical queries. If the source files were unclustered, the output is also unclustered; bin-packing doesn't improve clustering, it just reduces file count.
For workloads where clustering is not a concern (typical when the partition spec already captures the dominant query dimensions), bin-packing is the right strategy. It's the cheapest compaction; it solves the file-count problem; it doesn't try to do more. The Artemis archive uses bin-packing for partitions where queries always specify both partition columns (mission_id and day), because at that granularity additional clustering buys little.
Sort-Based Compaction
When clustering matters — Module 3 Lesson 2's (payload_id, sensor_kind) clustering for the Artemis archive — bin-packing alone isn't enough. The compaction must produce files with tight per-column statistics for the cluster columns, which requires sorting the rows before writing.
Sort-based compaction runs bin-packing's bin-grouping step but, instead of concatenating the source files' batches in arrival order, sorts the rows in each bin by the sort order before writing. For a linear sort order (ORDER BY payload_id, sensor_kind), this is a multi-way merge over the source files; each source file is already sorted within itself (if produced by a sort-aware writer), so the merge is O(rows × log(files)) — sub-quadratic and tractable for typical compaction bin sizes.
The output files have tight per-column statistics on the sort columns. A bin containing 64 source files, each averaging 8 distinct payload_id values mixed randomly, produces compacted files where each file has 1-2 distinct payload_id values. The per-file payload_id statistic is tight; the query-time pruning from Module 3 Lesson 3 prunes hundreds of compacted files to single-digit files for any specific payload_id predicate.
The cost of sort-based compaction is the sort: an extra O(log n) factor per row plus the memory to hold the merge state. For typical Artemis bin sizes (128 MB compacted output, around 1M rows), the sort fits in memory comfortably and adds maybe 10-15% to the bin-packing cost. The tradeoff is favorable for any workload where clustering pays off.
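The multi-way merge at the heart of sort-based compaction can be sketched with a binary heap. This is a minimal sketch under simplifying assumptions: each row is reduced to its sort key, a (payload_id, sensor_kind) pair of u32 values chosen for illustration, and every input run is assumed to be already sorted, as the lesson notes is the case for files produced by a sort-aware writer. The function name is hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort key for the linear sort order: (payload_id, sensor_kind).
pub type Key = (u32, u32);

/// Merge runs that are each already sorted by `Key` into one sorted run.
/// The heap holds at most one entry per run, so each output row costs
/// O(log(number of runs)).
pub fn merge_sorted_runs(runs: Vec<Vec<Key>>) -> Vec<Key> {
    let total: usize = runs.iter().map(|run| run.len()).sum();
    let mut out = Vec::with_capacity(total);
    let mut cursors: Vec<std::vec::IntoIter<Key>> =
        runs.into_iter().map(|run| run.into_iter()).collect();

    // Min-heap of (next key, run index). BinaryHeap is a max-heap, so the
    // entries are wrapped in Reverse.
    let mut heap = BinaryHeap::new();
    for (run_idx, cursor) in cursors.iter_mut().enumerate() {
        if let Some(key) = cursor.next() {
            heap.push(Reverse((key, run_idx)));
        }
    }

    while let Some(Reverse((key, run_idx))) = heap.pop() {
        out.push(key);
        // Refill from the run that just produced the smallest key.
        if let Some(next) = cursors[run_idx].next() {
            heap.push(Reverse((next, run_idx)));
        }
    }
    out
}
```

In a real compactor the cursors would be streaming Parquet readers and the output would be flushed in batches rather than collected into one vector, but the heap discipline is the same.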
Z-Order Compaction
The most aggressive strategy is Z-order compaction (Module 3 Lesson 2's space-filling curve applied at compaction time). The compactor reads source files, computes Z-order keys for every row across the sort columns, sorts globally by Z-order key, and writes files in Z-order. The output files cluster on multiple dimensions simultaneously — both payload_id and sensor_kind, in the Artemis case.
The cost is the Z-order key computation plus the sort. Z-order computation is cheap (a few cycles per row; the key is a u64 or u128). The sort is the same as sort-based compaction's. The total overhead over bin-packing is around 20%; the clustering improvement is the 5-10× pruning win Module 3 Lesson 2 measured.
The Artemis archive uses Z-order compaction on the SDA observation table. The capstone's lifecycle worker computes Z-order keys at compaction time using the column-rank normalization from Module 3 (the per-column ranks are computed once per partition per compaction; ranks are cheap given Arrow's compute::rank function).
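For two sort columns, the Z-order key is a bit interleaving of the per-column ranks. The sketch below assumes each rank has already been normalized to a u32 (the column-rank step referenced above) and produces a u64 key; the function names are hypothetical. More than two columns, or wider ranks, would need a u128 or a different interleaving.

```rust
/// Spread the 32 bits of `v` across the even bit positions of a u64,
/// leaving the odd positions zero.
fn spread_bits(v: u32) -> u64 {
    let mut x = v as u64;
    x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
    x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
    x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    x = (x | (x << 2)) & 0x3333_3333_3333_3333;
    x = (x | (x << 1)) & 0x5555_5555_5555_5555;
    x
}

/// Interleave two column ranks into one Z-order key. Bits of `a_rank`
/// land on even positions and bits of `b_rank` on odd positions, so rows
/// that are close in both dimensions get nearby keys.
pub fn z_order_key(a_rank: u32, b_rank: u32) -> u64 {
    spread_bits(a_rank) | (spread_bits(b_rank) << 1)
}
```

Sorting rows by this key and then cutting the sorted run into target-size files is what gives each output file a narrow range on both columns at once.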
Concurrent Compaction: The Conflict Pattern
Compaction is an overwrite commit; concurrent ingest commits are append-only commits. The two commit types can run concurrently; their conflict behavior at the CAS is asymmetric.
Compaction loses conflicts more often than it wins. An ingest commit that arrives after the compaction started its work has typically advanced the table's snapshot pointer by the time the compaction commits its CAS. The compaction's CAS fails; the compactor must rebase. The rebase requires re-evaluating which source files still exist (the ingest commit didn't touch them, so all source files are still present), then producing a new commit on top of the new base. The metadata work is cheap; the data write work (the consolidated output file) is not repeated — it's already written to the object store.
Long-running compactions amplify the conflict rate. A compaction that takes 5 minutes against a table with ingest every 30 seconds sees roughly 10 ingest commits arrive during its work; each one is a conflict at the CAS. The retry-loop handles this; the rebase is cheap; eventually the compaction wins. But the win is delayed by the conflict count, and the CAS retries consume catalog throughput unnecessarily.
The operational fix the Artemis archive uses: batch compactions per partition with a soft serialization at the partition level. The lifecycle worker runs one compaction job per partition at a time; many partitions can compact in parallel, but a single partition is never the target of two concurrent compactions. The ingest writer is single-writer per table anyway, so the only concurrency for compaction's CAS is the ingest writer for the same partition, and the conflict rate stays low.
A second pattern, more sophisticated: window-based compaction that compacts only sealed time windows. A partition like day('2024-03-15') is compacted only after the day has ended and ingest for that day has stopped. The compaction has no concurrent ingest to race; CAS conflicts are zero. For time-partitioned tables this works cleanly; for tables partitioned on dimensions where the window is harder to define (mission_id, payload_id), the per-partition serialization above is the fallback.
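The sealed-window check itself is small. The sketch below is one way to express it, with assumptions labeled: day partitions are identified by days since the Unix epoch in UTC, and a grace period after midnight stands in for "ingest for that day has stopped". The grace period and both function names are hypothetical; the real signal might instead come from the ingest pipeline.

```rust
const SECONDS_PER_DAY: u64 = 86_400;

/// True when the day partition has ended and the grace period for
/// late-arriving data has passed. `partition_day` is days since the
/// Unix epoch (UTC).
pub fn is_sealed(partition_day: u64, now_unix_secs: u64, grace_secs: u64) -> bool {
    let day_end = (partition_day + 1) * SECONDS_PER_DAY;
    now_unix_secs >= day_end + grace_secs
}

/// Filter a list of day partitions down to the ones that are safe to
/// compact without racing ingest.
pub fn sealed_partitions(days: &[u64], now_unix_secs: u64, grace_secs: u64) -> Vec<u64> {
    days.iter()
        .copied()
        .filter(|day| is_sealed(*day, now_unix_secs, grace_secs))
        .collect()
}
```

The lifecycle worker would run this filter first and fall back to per-partition serialization for anything that is not sealed.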
Resource Budget and Pacing
A naive compaction runs full-speed. Full-speed compaction reads gigabytes from the object store, writes gigabytes back, and uses available CPU for sorting and Z-order computation. On a shared system this competes with the analyst query workload for the same resources. The result is observable query slowdowns, which the operations team will rightly view as a regression.
The discipline is rate-limited compaction with explicit bandwidth budgets. The Artemis lifecycle worker:
- Consumes at most 30% of available object-store read bandwidth (configured as IOPS and MB/s limits per worker).
- Runs at most 4 parallel compaction tasks across the cluster, each with bounded local memory.
- Pauses compaction during the daily analyst peak window (09:00-17:00 UTC for the SDA team's analytical traffic).
- Backs off automatically when query-latency metrics exceed thresholds (Module 5's observability picks this up).
The result is that compaction makes steady progress without ever degrading the analyst-visible system. The compaction rate is approximately matched to the file-creation rate; the table's small-file count stays bounded.
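The pacing rules above amount to a gate the worker consults before starting each compaction task. The following is a rough sketch of such a gate, not the Artemis lifecycle worker's code: all type and field names are hypothetical, the latency threshold is a placeholder the lesson does not specify, and the bandwidth budget is left out because it would be enforced in the I/O layer rather than at task start. Treating the peak window as start-inclusive and end-exclusive is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Start,
    PausedForPeakWindow,
    BackingOff,
    AtParallelismCap,
}

/// Hypothetical pacing configuration.
pub struct PacingConfig {
    pub max_parallel_tasks: usize, // 4 in the lesson
    pub peak_start_hour_utc: u8,   // 9 in the lesson, inclusive
    pub peak_end_hour_utc: u8,     // 17 in the lesson, exclusive
    pub latency_backoff_ms: f64,   // placeholder threshold
}

/// Hypothetical view of the live system at decision time.
pub struct Observed {
    pub hour_utc: u8,
    pub running_tasks: usize,
    pub query_latency_ms: f64,
}

/// Decide whether the worker may start another compaction task now.
pub fn decide(cfg: &PacingConfig, now: &Observed) -> Decision {
    let in_peak =
        now.hour_utc >= cfg.peak_start_hour_utc && now.hour_utc < cfg.peak_end_hour_utc;
    if in_peak {
        return Decision::PausedForPeakWindow;
    }
    if now.query_latency_ms > cfg.latency_backoff_ms {
        return Decision::BackingOff;
    }
    if now.running_tasks >= cfg.max_parallel_tasks {
        return Decision::AtParallelismCap;
    }
    Decision::Start
}
```

Returning the reason rather than a bare boolean makes it easy to log why compaction is idle, which matters when the small-file count starts creeping up.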
Compacting Manifests
A separate compaction subject worth noting: manifest compaction. The Module 2 metadata model produces one manifest per commit. Over time the manifest list grows linearly with commit count; for a table with 2 million commits, the manifest list is 2 million entries. Reading the manifest list for query planning becomes a non-trivial cost — even though pruning Pass 1 (Module 3 Lesson 3) eliminates most manifests, the manifest list itself is large.
Manifest compaction rewrites many small manifests into fewer large manifests. It's structurally similar to data file compaction but operates on metadata: read the per-commit manifests, group them, write consolidated manifests with the same entries. The new snapshot references the consolidated manifests; the old per-commit manifests are deleted by orphan cleanup. The cost is proportional to manifest count, not data size — typically a few tens of MB of I/O for a billion-row table.
The Artemis archive runs manifest compaction weekly. The manifest list goes from ~100k entries at week-end to ~5k entries at week-start, and query-planning latency on cold cache drops from ~200ms to ~20ms. The compaction commit is a metadata-only overwrite — the data files are unchanged; only the manifest references move.
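The regrouping step of manifest compaction can be sketched in a few lines. This is a deliberately simplified illustration: Manifest is a hypothetical type whose entries are just data-file paths, and the grouping is a flat re-chunking by entry count. A real implementation would more likely group entries by partition so that manifest-level pruning stays effective.

```rust
/// Hypothetical per-commit manifest: a path plus the data files it lists.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Manifest {
    pub path: String,
    pub entries: Vec<String>,
}

/// Regroup the entries of many small manifests into consolidated entry
/// lists of at most `target_entries` each. Every entry is preserved, in
/// its original order; only the grouping changes.
pub fn consolidate_manifests(manifests: &[Manifest], target_entries: usize) -> Vec<Vec<String>> {
    let all: Vec<String> = manifests
        .iter()
        .flat_map(|m| m.entries.iter().cloned())
        .collect();
    // Guard against a zero target, which `chunks` would reject.
    all.chunks(target_entries.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Each returned list would be written out as one new manifest, and the commit would swap the manifest list to reference those instead of the per-commit originals.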
Core Mechanics in Code
A Bin-Packing Compactor
The minimal compaction implementation: identify small files in a partition, group them into bins, rewrite them.
use anyhow::Result;
const SMALL_FILE_THRESHOLD_BYTES: u64 = 32 * 1024 * 1024; // 32 MB
const TARGET_OUTPUT_BYTES: u64 = 128 * 1024 * 1024; // 128 MB
pub struct CompactionPlan {
pub partition: PartitionTuple,
pub bins: Vec<Vec<DataFile>>,
}
/// Plan a bin-packing compaction for a single partition. The plan
/// is a vec of bins; each bin is a vec of source files whose total
/// size is near the target output size. Files larger than the small-
/// file threshold are not compacted.
pub fn plan_compaction(
partition_files: Vec<DataFile>,
partition: PartitionTuple,
) -> CompactionPlan {
// 1. Filter to files below the small-file threshold.
let mut small_files: Vec<DataFile> = partition_files
.into_iter()
.filter(|f| f.file_size_bytes < SMALL_FILE_THRESHOLD_BYTES)
.collect();
// 2. Sort by size descending. The greedy pass below is next-fit:
// it appends files in that order and closes the current bin as
// soon as the next file would push it past the target. No bin
// exceeds the target, since every input is under 32 MB.
small_files.sort_by(|a, b| b.file_size_bytes.cmp(&a.file_size_bytes));
// 3. Greedy bin packing.
let mut bins: Vec<Vec<DataFile>> = Vec::new();
let mut current_bin: Vec<DataFile> = Vec::new();
let mut current_size: u64 = 0;
for file in small_files {
if current_size + file.file_size_bytes > TARGET_OUTPUT_BYTES && !current_bin.is_empty() {
bins.push(std::mem::take(&mut current_bin));
current_size = 0;
}
current_size += file.file_size_bytes;
current_bin.push(file);
}
if !current_bin.is_empty() {
bins.push(current_bin);
}
// 4. Drop bins of size 1 — a single small file doesn't benefit
// from compaction (the output would be one small file, same as
// the input). Compaction only fires when there's something to
// consolidate.
bins.retain(|b| b.len() > 1);
CompactionPlan { partition, bins }
}
/// Execute a single bin: read the source files, write a consolidated
/// output file, return the new DataFile metadata. The output file
/// inherits the partition tuple of the source files.
pub async fn compact_bin(
bin: Vec<DataFile>,
partition: &PartitionTuple,
object_store: &dyn ObjectStore,
) -> Result<DataFile> {
let output_path = generate_compacted_path(partition);
let mut writer = ParquetWriter::new(&output_path).await?;
for source in &bin {
let reader = ParquetReader::open(&source.path, object_store).await?;
for batch_result in reader.read_batches() {
let batch = batch_result?;
writer.write_batch(&batch).await?;
}
}
writer.finalize().await?;
let new_meta = read_file_metadata(&output_path, object_store).await?;
Ok(new_meta)
}
The discipline. The bin-packing plan is the cheap, deterministic part. The execution reads the sources and writes the consolidated output — this is the expensive part, dominated by I/O. The output's DataFile metadata (the per-file statistics for Pass 2 pruning) is computed at write time by the Parquet writer, the same way it was in Module 1's initial writer; nothing about compaction-produced files is different from ingest-produced files at the read layer.
Committing the Compaction
The CAS-protected swap that makes the compaction visible:
use anyhow::Result;
/// Commit a compaction: remove the source files from the table, add
/// the new compacted file, in one snapshot. Uses the optimistic-CAS
/// retry pattern from Module 2 Lesson 3.
pub async fn commit_compaction(
table: &Table,
bin: Vec<DataFile>, // the source files (to remove)
new_file: DataFile, // the compacted output (to add)
) -> Result<SnapshotId> {
let source_paths: Vec<String> = bin.iter().map(|f| f.path.clone()).collect();
for attempt in 0..16 {
let base = table.current_snapshot().await?;
// Sanity check: every source file must still be referenced by
// the current snapshot. If a concurrent commit removed any of
// them, this compaction is no longer applicable and must abort.
let current_paths = base.list_data_file_paths().await?;
for src in &source_paths {
if !current_paths.contains(src) {
return Err(anyhow::anyhow!(
"source file {} no longer in snapshot; compaction obsolete",
src,
));
}
}
// Build the overwrite commit: removed = source paths, added = new file.
let new_snapshot = build_overwrite_snapshot(
&base,
&source_paths,
vec![new_file.clone()],
"compaction",
).await?;
match table.catalog
.compare_and_swap(
&table.id,
base.snapshot_id,
new_snapshot.snapshot_id,
&new_snapshot.metadata_path,
)
.await
{
Ok(()) => return Ok(new_snapshot.snapshot_id),
Err(CommitError::Conflict(_)) => {
// Rebase and retry. The new compacted file is unchanged
// — it's already in the object store — so the next
// attempt reuses it.
tracing::warn!("compaction commit conflict; retrying attempt {}", attempt);
continue;
}
Err(e) => return Err(e.into()),
}
}
Err(anyhow::anyhow!("compaction commit failed after retries"))
}
What to notice. The retry loop is the same shape as Module 2 Lesson 3's ingest commit. The difference is the source-file sanity check before each attempt: if any source file has been removed by a concurrent commit (perhaps a previous compaction touched the same files), the current compaction is no longer applicable and must abort. The new file isn't deleted — it's left in the object store as an orphan, to be cleaned up by Lesson 3's orphan-cleanup job. This is the expected cost of optimistic concurrency in the overwrite case; the rebase cost is bounded, the abort case is rare.
Sort-Based Compaction (Sketch)
The sort-based variant extends compact_bin by sorting the rows before writing:
use anyhow::Result;
use arrow::array::RecordBatch;
use arrow::compute::{concat_batches, lexsort_to_indices, SortColumn};
/// Compact a bin with rows sorted by the given sort order. The merge
/// is done in memory: read the source files, concat, sort, write.
/// For Artemis bin sizes (~128 MB) the in-memory approach
/// is sufficient; larger bins require an external merge sort.
pub async fn compact_bin_sorted(
bin: Vec<DataFile>,
partition: &PartitionTuple,
sort_columns: &[String],
object_store: &dyn ObjectStore,
) -> Result<DataFile> {
// 1. Read all source files' batches into a single vec.
let mut all_batches: Vec<RecordBatch> = Vec::new();
let schema = read_table_schema(partition).await?;
for source in &bin {
let reader = ParquetReader::open(&source.path, object_store).await?;
for batch_result in reader.read_batches() {
all_batches.push(batch_result?);
}
}
// 2. Concatenate into one batch for sorting.
let combined = concat_batches(&schema, &all_batches)?;
// 3. Compute the sort indices.
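let sort_keys: Vec<SortColumn> = sort_columns
.iter()
.map(|name| -> Result<SortColumn> {
// Propagate an unknown column name as an error instead of panicking.
let idx = schema.index_of(name)?;
Ok(SortColumn {
values: combined.column(idx).clone(),
options: None,
})
})
.collect::<Result<_>>()?;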
let indices = lexsort_to_indices(&sort_keys, None)?;
// 4. Gather rows in sort order.
let sorted_columns: Vec<_> = combined
.columns()
.iter()
.map(|c| arrow::compute::take(c, &indices, None))
.collect::<Result<_, _>>()?;
let sorted_batch = RecordBatch::try_new(schema.clone(), sorted_columns)?;
// 5. Write the sorted batch as the consolidated output.
let output_path = generate_compacted_path(partition);
let mut writer = ParquetWriter::new(&output_path).await?;
writer.write_batch(&sorted_batch).await?;
writer.finalize().await?;
Ok(read_file_metadata(&output_path, object_store).await?)
}
The trade against bin-packing: this version requires holding the entire bin in memory at once (~128 MB on disk, more once the Parquet data is decoded into Arrow arrays). For Artemis-typical bin sizes that's fine; for larger bins, replace the concat-and-sort in steps 2 through 4 with a streaming k-way merge over already-sorted source files. The Z-order variant differs only in how the sort keys are computed: replace step 3's lexsort_to_indices with a Z-order-key computation followed by an integer sort.
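The Z-order key computation can be sketched for the two-column case as plain bit interleaving. This is a minimal illustration, assuming both sort columns have already been mapped to u32 values (for example by rank or by quantization); the function names are hypothetical and not part of the Artemis codebase.

```rust
/// Interleave the bits of `x` and `y` into one 64-bit Z-order (Morton)
/// key: bit i of `x` lands at position 2i, bit i of `y` at 2i + 1.
pub fn z_order_key(x: u32, y: u32) -> u64 {
    let mut key = 0u64;
    for bit in 0..32 {
        key |= (((x >> bit) & 1) as u64) << (2 * bit);
        key |= (((y >> bit) & 1) as u64) << (2 * bit + 1);
    }
    key
}

/// Row indices ordered by Z-order key. This plays the role that
/// lexsort_to_indices plays in the sorted variant: the result is the
/// gather order for the rows.
pub fn z_order_sort_indices(points: &[(u32, u32)]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..points.len()).collect();
    indices.sort_by_key(|&i| z_order_key(points[i].0, points[i].1));
    indices
}
```

Rows that are close in both dimensions tend to receive nearby keys, which is what lets per-file min/max statistics prune on either column after the integer sort.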
Key Takeaways
- The small-file problem is structural: ingest commits at a cadence faster than the target file size justifies produce small files; over time the file count grows linearly with commit count and per-file overhead dominates query latency.
- Compaction is an overwrite commit that uses the Module 2 CAS protocol. It is safe under concurrent reads (snapshots are immutable; readers pinned to the pre-compaction snapshot see the old files) and safe under concurrent writes (the CAS protocol orders them).
- Three compaction strategies with different tradeoffs: bin-packing (cheap, content-agnostic; doesn't improve clustering), sort-based (adds a sort; improves clustering on a single linear order), Z-order (adds Z-order key computation plus sort; improves multidimensional clustering at modest extra cost).
- Resource pacing matters. Naive compaction degrades the live workload. Production compactors run with explicit bandwidth budgets, parallelism caps, and quiet-window scheduling.
- Manifest compaction is the structural analog at the metadata layer. Both data file and manifest compaction follow the same overwrite-commit pattern; manifest compaction is cheaper because the metadata is much smaller than the data.
Lesson 2 — Snapshot Expiration and Storage Tiering
Module: Data Lakes — M06: Compaction, Lineage, and Lifecycle Position: Lesson 2 of 3 Source: Apache Iceberg specification, "Snapshot Expiration" section. Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 7 ("Snapshot Isolation — Indexes and Snapshot Isolation") for the analogous MVCC-garbage-collection framing.
Source note: The snapshot-expiration protocol is well-specified in Iceberg; the storage-tiering pattern is operational practice in the Artemis cold archive and may differ at other deployments.
Context
Every commit produces new metadata files. Every overwrite or compaction commit produces metadata files that reference fewer data files than the previous snapshot. Module 2's design choices — immutable snapshots, append-only metadata, no in-place updates — mean that nothing is ever deleted by the commit path itself. The storage grows monotonically. After two years of ingest plus daily compaction, the Artemis archive has roughly 200 GB of metadata files (snapshots, manifest lists, manifests) and 50 TB of data files; perhaps 5 TB of the data files are unreferenced by the current snapshot (compaction sources, replaced files). Without a cleanup discipline, this only grows.
Snapshot expiration is the discipline that reclaims storage by removing the metadata and data files no longer reachable from any kept snapshot. The job is conceptually simple (find what's unreferenced, delete it), but the safety properties are subtle. The expiration must coexist with concurrent readers (pinned snapshots from Module 4 Lesson 1); with the retention window contract (queries against snapshots within the window must work); with storage-tier handoffs (snapshots aged beyond the live window may be archived to a longer-retention tier rather than deleted outright); and with the audit trail requirement (some snapshots may need to be preserved permanently for regulatory reasons).
This lesson develops the expiration protocol end to end. The reachability calculation that identifies what's safe to delete. The retention guards that keep live queries working. The storage-tier handoff pattern that moves data to a cheaper tier rather than deleting it. The capstone's lifecycle worker runs expiration on a schedule; the lesson is the design.
Core Concepts
The Reachability Calculation
A snapshot is reachable if some currently-supported read path can still see it. The Artemis archive's reachability rules:
- The current snapshot is reachable.
- Any snapshot within the retention window (30 days for the Artemis archive) is reachable. Time-travel queries with timestamps in this window must succeed.
- Any snapshot explicitly tagged for permanent retention is reachable. The audit-trail discipline (Lesson 3) preserves certain snapshots beyond the standard window.
- All other snapshots are expired and eligible for deletion.
The retention window is the dominant criterion in normal operation. The Artemis archive's 30-day window is set against the longest-supported live replay duration; queries against snapshots older than 30 days are routed to the cold-archive backup tier through a separate read path (the tier handoff described below).
A data file is reachable if any reachable snapshot's manifests reference it. A data file may be referenced by multiple snapshots — typically the case during the retention window, when older snapshots still reference data files that have since been replaced by compaction. The data file is reachable as long as any reachable snapshot still references it.
The expiration computation: walk every reachable snapshot, collect the set of data files referenced. The union is the reachable file set. Every data file not in this set is unreachable and eligible for deletion. The same logic applies to manifests and manifest lists.
Reachable set:
- Current snapshot: 1 entry
- Snapshots within retention window: hundreds to millions of entries (about 2.4 million for the Artemis archive)
- Tagged-for-retention snapshots: a few entries
For each reachable snapshot, walk:
- Snapshot metadata: 1 file
- Manifest list: 1 file (referenced by the snapshot)
- Manifests: thousands of files (referenced by the manifest list)
- Data files: millions of paths (referenced by manifest entries)
Union of all data file paths across all reachable snapshots = reachable file set.
The Artemis archive's reachable file set as of a typical day: 30 days × ~80k commits per day = ~2.4 million snapshots, but most of them reference common data files (long-tail compaction sources). The deduplicated reachable file set is around 1.8 million data files. The unreachable set is around 200k files (compaction sources from outside the window, replaced files, and so on). The expiration job's task is to delete those 200k files plus the metadata that referenced them.
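The union-then-complement step can be shown on a toy table. The sketch below is illustrative only: it assumes the per-snapshot file lists are already in memory as a map, whereas the real job reads them from manifests, and the function name and file names are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Union the data files referenced by every reachable snapshot, then
/// return the files in the store listing that none of them reference.
pub fn unreachable_files(
    snapshot_files: &HashMap<u64, Vec<String>>, // snapshot id -> referenced files
    reachable_snapshots: &[u64],
    store_listing: &[String],
) -> BTreeSet<String> {
    let mut reachable: BTreeSet<&str> = BTreeSet::new();
    for id in reachable_snapshots {
        if let Some(files) = snapshot_files.get(id) {
            reachable.extend(files.iter().map(String::as_str));
        }
    }
    store_listing
        .iter()
        .filter(|path| !reachable.contains(path.as_str()))
        .cloned()
        .collect()
}
```

Suppose snapshot 2 references a, b, and c, and snapshot 3 (after compacting a and b into ab) references ab and c. While snapshot 2 is inside the retention window, a and b stay reachable even though the current snapshot no longer lists them; once only snapshot 3 is reachable, a and b fall into the unreachable set.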
The Two-Phase Deletion
A delete that immediately removes the file races against in-flight queries (Module 4 Lesson 1). The lakehouse discipline is two-phase deletion: mark a file as eligible for deletion at the metadata layer (it's no longer in any reachable snapshot's manifests), then physically delete it after a grace period long enough that no in-flight query can still reference it.
The two phases:
Phase 1: Metadata-level decommissioning. The compaction commit (Lesson 1) or any overwrite commit removes the file from the new snapshot. The file's EntryStatus in the most recent commit's manifest is Deleted. Reachable snapshots from before the commit still reference the file via their own manifests. The file remains on disk; readers pinned to older snapshots continue to read it.
Phase 2: Physical deletion. Some time later — when the retention window has advanced past every snapshot that referenced the file — the expiration job runs the reachability calculation. The file is no longer in any reachable snapshot's manifest set. The physical deletion is safe: no reader can pin a snapshot that references the file because every such snapshot has been expired.
The time between Phase 1 and Phase 2 is the retention window plus the expiration scheduling delay. For the Artemis archive with a 30-day window and daily expiration, the delay is 30-31 days. The expiration's safety margin is intentionally large; the cost of an extra few days of unreachable-file storage is small compared to the cost of breaking an in-flight query.
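The 30-31 day figure falls out of simple arithmetic, sketched below. The scheduling model is an assumption made for illustration: expiration runs fire at exact multiples of the run interval, and the Phase 1 commit time is used as a conservative stand-in for the timestamp of the last snapshot that referenced the file. The function names are hypothetical.

```rust
pub const DAY_MS: u64 = 24 * 60 * 60 * 1000;

/// Time of the first expiration run at which a file decommissioned at
/// `decommissioned_at_ms` is no longer covered by the retention window.
/// Assumes runs fire at exact multiples of `run_interval_ms`.
pub fn first_eligible_run_ms(
    decommissioned_at_ms: u64,
    retention_window_ms: u64,
    run_interval_ms: u64,
) -> u64 {
    let earliest = decommissioned_at_ms + retention_window_ms;
    // Round up to the next scheduled run.
    let runs = (earliest + run_interval_ms - 1) / run_interval_ms;
    runs * run_interval_ms
}

/// Delay between Phase 1 (metadata decommissioning) and Phase 2
/// (physical deletion) under the same assumptions.
pub fn deletion_delay_ms(
    decommissioned_at_ms: u64,
    retention_window_ms: u64,
    run_interval_ms: u64,
) -> u64 {
    first_eligible_run_ms(decommissioned_at_ms, retention_window_ms, run_interval_ms)
        - decommissioned_at_ms
}
```

A file decommissioned one hour after a run waits 30 days for the window to pass and then 23 more hours for the next run, so the delay lands just under 31 days; a file decommissioned exactly at a run boundary waits exactly 30.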
This is the analog of MVCC garbage collection in row-store databases. DDIA (Ch. 7, "Indexes and snapshot isolation"): "When a transaction commits, the database engine cannot immediately delete old versions, because they may still be needed by another transaction." The lakehouse case is the same; the units are bigger (whole files instead of row versions); the safety properties are the same.
The Expiration Job Structure
The expiration job runs on a schedule (daily for the Artemis archive). The job is read-mostly against the metadata, write-mostly against the deletion stage. The structure:
1. Snapshot the catalog. Read the table's current snapshot and the snapshot history. The expiration runs against this point-in-time view; concurrent commits during the job are handled by the CAS at the end.
2. Compute the reachable snapshot set. Apply the reachability rules from above: current snapshot, snapshots within retention window, tagged snapshots. The result is a vec of snapshot IDs to keep.
3. Identify expired snapshots. Every snapshot in the history but not in the reachable set is expired.
4. Compute the reachable file set. For each reachable snapshot, read its manifest list, read each manifest, collect the data file paths. Union across reachable snapshots. (Optimization: cache the per-snapshot file sets so re-running this job doesn't re-read every manifest. The Artemis worker caches in Redis with a TTL bounded by the retention window.)
5. Compute the unreachable file set. Compare the reachable file set against the object store's actual contents. The difference is the unreachable files. (Important: the comparison must use a snapshot of the object store contents from before step 2, not after, to avoid race conditions with concurrent ingest writes.)
6. Schedule deletions. Each unreachable file is queued for deletion. The deletion is batched and rate-limited to avoid impacting concurrent reads' bandwidth.
7. Commit the expiration. A single metadata commit that records the expiration: the snapshots removed from the snapshot history, the data files queued for deletion. The commit is informational; the actual deletions proceed in the background.
The job is safe to re-run; if it fails partway through, the next run picks up where the previous left off. The metadata commits at the end are idempotent — committing an expiration that removes already-removed snapshots is a no-op.
Snapshot Expiration and the Retention Window
The retention window is the operational lever that bounds time-travel reach. The choice is a tradeoff:
- Long window (say 90 days): supports longer time-travel queries; stores more historical metadata and replaced data files; expiration runs less aggressively; storage costs are higher.
- Short window (say 3 days): supports only recent time-travel queries; stores minimal historical metadata; expiration runs aggressively; storage costs are lower.
The Artemis archive's 30-day window is the result of measurement against the actual replay-query workload. The investigation team's typical replay span is 7-14 days post-anomaly; the 30-day window covers this with margin. Replays older than 30 days are rare enough that the cold-archive backup tier is the right answer for them.
The window is per-table; not every table needs the same window. The orbital-object-registry table has a 30-day window. The ground-station-telemetry table has a 7-day window (replays don't matter; only the current state is queried). The mission-archive-config table has a 365-day window (used by the audit team).
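A per-table window is naturally a small lookup with a default. The sketch below uses the three table names and windows from the paragraph above; the struct, its method names, and the 30-day fallback for unlisted tables are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::time::Duration;

pub const DAY: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical per-table retention configuration.
pub struct RetentionConfig {
    default_window: Duration,
    per_table: HashMap<String, Duration>,
}

impl RetentionConfig {
    /// The three windows named in the text, plus an assumed 30-day default.
    pub fn artemis_example() -> Self {
        let mut per_table = HashMap::new();
        per_table.insert("orbital-object-registry".to_string(), DAY * 30);
        per_table.insert("ground-station-telemetry".to_string(), DAY * 7);
        per_table.insert("mission-archive-config".to_string(), DAY * 365);
        Self { default_window: DAY * 30, per_table }
    }

    pub fn window_for(&self, table: &str) -> Duration {
        self.per_table.get(table).copied().unwrap_or(self.default_window)
    }

    /// Snapshots with timestamp_ms greater than this cutoff are inside
    /// the window and therefore reachable.
    pub fn cutoff_ms(&self, table: &str, now_ms: i64) -> i64 {
        now_ms - self.window_for(table).as_millis() as i64
    }
}
```

The value returned by window_for, converted to milliseconds, is what a caller would pass as retention_window_ms when computing the reachable set for that table.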
Storage-Tier Handoff
A snapshot aging out of the live retention window doesn't have to be deleted outright. The lakehouse community's pattern for long-retention archives is storage-tier handoff: copy the snapshot's metadata and data files to a cheaper storage tier before deleting from the live tier. Queries against the long-retention tier go through a separate read path with longer SLAs.
The Artemis archive's tier structure:
- Live tier: AWS S3 Standard. 30-day retention. Sub-second access latency. Used by all online queries.
- Cold tier: AWS S3 Glacier Instant Retrieval. 7-year retention. Sub-second access latency at a lower cost per GB-month than Standard, offset by per-GB retrieval charges. Used by audit and accident-investigation queries against older history.
- Deep archive: AWS S3 Glacier Deep Archive. Permanent retention. Hours-to-days access latency, ~1/8 the cost of Standard. Used for legal compliance retention; queried only through the formal investigation process.
The handoff job runs as a background task between Phase 1 and Phase 2 of the expiration. Before a snapshot's data files are deleted from the live tier, they are copied to the cold tier. The cold tier maintains its own catalog (a separate Iceberg table whose snapshots are the live tier's expired snapshots); analyst queries against old history go through the cold-tier catalog and read from the cold-tier object store.
The complexity worth understanding: the cold tier's catalog is read-only from the query side. The handoff job is its only writer and only appends expired snapshots; once a snapshot lands in the cold tier it is immutable and no later commit modifies it, so the catalog's CAS protocol has no competing writers to arbitrate. The cold-tier read path is simpler than the live-tier read path because there's no concurrent writer to worry about. The audit-trail use case fits this perfectly: the data is what it was at the time it aged out of the live tier; nothing modifies it later.
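The read-path split between tiers reduces to a comparison on snapshot age. The sketch below is an assumption-laden illustration: the enum, the constant names, and the rule that anything older than the cold tier's 7-year retention resolves to the deep archive are hypothetical, and a deep-archive read in practice goes through the formal investigation process rather than an online query.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Live,
    Cold,
    DeepArchive,
}

pub const LIVE_RETENTION_DAYS: u32 = 30;
/// Seven years, ignoring leap days (an approximation for the sketch).
pub const COLD_RETENTION_DAYS: u32 = 7 * 365;

/// Pick the tier whose catalog should serve a time-travel read against
/// a snapshot of the given age.
pub fn tier_for_snapshot_age(age_days: u32) -> Tier {
    if age_days < LIVE_RETENTION_DAYS {
        Tier::Live
    } else if age_days < COLD_RETENTION_DAYS {
        Tier::Cold
    } else {
        Tier::DeepArchive
    }
}
```

The strict less-than comparison matches the reachability check in the expiration code, where a snapshot is inside the window only while its age is below the window length.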
Coordinating with the Live Workload
The expiration job's work — reachability calculation, file listing, metadata commits — competes with the live query workload for object-store bandwidth and the catalog's read budget. The discipline matches Lesson 1's compaction pacing:
- Read budget capped at 20% of available object-store IOPS.
- Expiration runs during the daily quiet window (overnight UTC for the Artemis archive's analyst-team time zone).
- Catalog reads use a separate connection pool from the ingest writers to avoid lock contention.
- The metadata-commit at the end of the job retries on CAS conflict with the ingest commits, using the same retry-with-jitter pattern as the ingest path.
The result is that expiration completes within the daily quiet window without impacting analyst-visible performance. The work fits comfortably; the live tier's data file count stays bounded; storage growth is dominated by genuine new ingest, not by unreclaimed history.
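The retry-with-jitter delay can be sketched as capped exponential backoff scaled by a random factor. The lesson does not spell out which jitter variant the ingest path uses, so the "full jitter" form below is an assumption, as are the function names and parameters. The random sample is passed in by the caller so the arithmetic stays deterministic and testable.

```rust
/// Upper bound on the delay before retry `attempt` (0-based): the base
/// delay doubled per attempt, capped at `cap_ms`. Saturates instead of
/// overflowing for large attempt numbers.
pub fn backoff_ceiling_ms(attempt: u32, base_ms: u64, cap_ms: u64) -> u64 {
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    base_ms.saturating_mul(factor).min(cap_ms)
}

/// Full jitter: scale the ceiling by a sample in [0, 1]. The sample is
/// supplied by the caller (for example from the `rand` crate) and is
/// clamped defensively.
pub fn jittered_delay_ms(attempt: u32, base_ms: u64, cap_ms: u64, unit_sample: f64) -> u64 {
    let unit = unit_sample.clamp(0.0, 1.0);
    (backoff_ceiling_ms(attempt, base_ms, cap_ms) as f64 * unit) as u64
}
```

With a 50 ms base and a 5 s cap, attempt 3 has a 400 ms ceiling, and a sample of 0.5 yields a 200 ms sleep. Spreading retries across the whole interval keeps the expiration commit and the ingest writers from colliding again in lockstep.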
Tagged Retention: The Audit Exception
Most snapshots expire on the standard schedule. Some don't, by explicit operator decision. The Artemis archive supports tagged retention through the Iceberg tag mechanism:
-- Operator-side: tag a snapshot for permanent retention.
TAG snapshot 4729 AS 'incident-2024-03-15-conjunction-alert' WITH RETENTION FOREVER;
Tagged snapshots are exempt from the retention-window cutoff. The expiration job treats them as reachable; their data files and metadata stay on disk indefinitely (or until the tag is explicitly removed). Tags also serve as named bookmarks for queries: SELECT ... FROM table FOR TAG 'incident-2024-03-15-conjunction-alert' reads against the tagged snapshot directly without timestamp resolution.
The tagged-retention pattern handles the audit and regulatory cases without complicating the standard expiration logic. The retention exception is one entry in the reachable set; everything else operates as if it weren't there. The Artemis archive has roughly 50 tags at any time — one per significant orbital event over the past two years — adding a few GB of preserved metadata and around 200 GB of preserved data files. Manageable cost; full forensic reach.
Core Mechanics in Code
The Reachability Walk
The core of the expiration job: walk the reachable snapshots, collect every file path they reference.
use anyhow::Result;
use std::collections::HashSet;
pub struct ReachableSet {
pub snapshot_ids: HashSet<SnapshotId>,
pub data_file_paths: HashSet<String>,
pub manifest_paths: HashSet<String>,
pub manifest_list_paths: HashSet<String>,
}
/// Walk the reachable snapshots and collect every file they reference.
/// The set returned is the complement of the deletable files: anything
/// in the object store not in this set is unreachable.
pub async fn compute_reachable_set(
catalog: &PostgresCatalog,
table: &str,
retention_window_ms: i64,
tagged_snapshot_ids: &HashSet<SnapshotId>,
) -> Result<ReachableSet> {
let history = read_snapshot_history(catalog, table).await?;
let now_ms = current_unix_ms();
// 1. Identify reachable snapshots:
// - the current snapshot (the last entry in history)
// - any snapshot within the retention window
// - any tagged snapshot
let mut reachable_snapshots: HashSet<SnapshotId> = HashSet::new();
if let Some(current) = history.last() {
reachable_snapshots.insert(current.snapshot_id);
}
for entry in &history {
if now_ms - entry.timestamp_ms < retention_window_ms {
reachable_snapshots.insert(entry.snapshot_id);
}
if tagged_snapshot_ids.contains(&entry.snapshot_id) {
reachable_snapshots.insert(entry.snapshot_id);
}
}
// 2. For each reachable snapshot, collect referenced files.
let mut data_paths: HashSet<String> = HashSet::new();
let mut manifest_paths: HashSet<String> = HashSet::new();
let mut manifest_list_paths: HashSet<String> = HashSet::new();
for entry in &history {
if !reachable_snapshots.contains(&entry.snapshot_id) {
continue;
}
let snapshot: Snapshot = read_metadata_file(&entry.metadata_path).await?;
manifest_list_paths.insert(snapshot.manifest_list_path.clone());
let manifest_list = read_manifest_list(&snapshot.manifest_list_path).await?;
for ml_entry in &manifest_list.manifests {
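// Manifests are immutable and shared across snapshots, so read
// each one only once; `insert` returns false for a repeat.
if !manifest_paths.insert(ml_entry.manifest_path.clone()) {
continue;
}
let manifest = read_manifest(&ml_entry.manifest_path).await?;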
for me in manifest.entries {
if matches!(me.status, EntryStatus::Existing | EntryStatus::Added) {
data_paths.insert(me.data_file.path);
}
}
}
}
Ok(ReachableSet {
snapshot_ids: reachable_snapshots,
data_file_paths: data_paths,
manifest_paths,
manifest_list_paths,
})
}
The walk is the expensive part. For the Artemis archive with ~80k commits per day and a 30-day window, the walk reads ~2.4M snapshot metadata files (small; tens of KB each), ~2.4M manifest lists (the same), and several million distinct manifests (one per commit, but deduplicated across snapshots since each manifest is committed in exactly one snapshot). The data file paths are collected as the manifest read proceeds. The total work is a few hours of mostly-parallel I/O against the metadata store; the lifecycle worker parallelizes the per-snapshot walks at a configurable concurrency.
The Expiration Decision
Given the reachable set, identify what to delete:
use anyhow::Result;
use std::collections::HashSet;
pub struct ExpirationPlan {
pub expired_snapshots: Vec<SnapshotId>,
pub deletable_data_files: Vec<String>,
pub deletable_manifests: Vec<String>,
pub deletable_manifest_lists: Vec<String>,
}
/// Plan the expiration: identify snapshots to remove from history and
/// files (data, manifest, manifest list) to mark for physical deletion.
pub async fn plan_expiration(
catalog: &PostgresCatalog,
table: &str,
reachable: &ReachableSet,
object_store: &dyn ObjectStore,
) -> Result<ExpirationPlan> {
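// NOTE: this must be the same point-in-time history that
// compute_reachable_set walked. A snapshot committed between the two
// reads is absent from reachable.snapshot_ids and would be
// misclassified as expired.
let history = read_snapshot_history(catalog, table).await?;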
// Snapshots to remove from history.
let expired_snapshots: Vec<SnapshotId> = history
.iter()
.filter(|entry| !reachable.snapshot_ids.contains(&entry.snapshot_id))
.map(|entry| entry.snapshot_id)
.collect();
// Enumerate the data and metadata directories and identify files
// not in the reachable set.
let all_data_files = list_object_store_dir(object_store, "data/").await?;
let all_manifests = list_object_store_dir(object_store, "metadata/m").await?;
let all_manifest_lists = list_object_store_dir(object_store, "metadata/ml").await?;
let deletable_data_files: Vec<String> = all_data_files
.into_iter()
.filter(|path| !reachable.data_file_paths.contains(path))
.collect();
let deletable_manifests: Vec<String> = all_manifests
.into_iter()
.filter(|path| !reachable.manifest_paths.contains(path))
.collect();
let deletable_manifest_lists: Vec<String> = all_manifest_lists
.into_iter()
.filter(|path| !reachable.manifest_list_paths.contains(path))
.collect();
Ok(ExpirationPlan {
expired_snapshots,
deletable_data_files,
deletable_manifests,
deletable_manifest_lists,
})
}
The pattern. The plan is the set of files to delete and the set of snapshots to remove from history. The plan is committed first via a metadata commit that updates the snapshot history; the physical deletes proceed in the background after the commit. If the worker crashes between the commit and the deletions, the next run re-plans and resumes — the metadata is the source of truth and the deletes are idempotent.
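The crash-and-resume property rests on deletes being idempotent. A minimal sketch, using a hypothetical in-memory stand-in for the object store (the type and function names are illustrative, not the Artemis API): deleting a path that is already gone counts as success, so replaying a partially applied plan is harmless.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory stand-in for the object store.
pub struct InMemoryStore {
    pub objects: BTreeSet<String>,
}

impl InMemoryStore {
    /// Delete one batch. A missing object is not an error; the return
    /// value is the number of objects actually removed.
    pub fn delete_batch(&mut self, batch: &[String]) -> usize {
        batch
            .iter()
            .filter(|path| self.objects.remove(path.as_str()))
            .count()
    }
}

/// Apply a deletion plan in fixed-size batches. Batching bounds the size
/// of each request; pacing between batches is omitted from the sketch.
pub fn apply_deletes(store: &mut InMemoryStore, paths: &[String], batch_size: usize) -> usize {
    paths
        .chunks(batch_size.max(1))
        .map(|batch| store.delete_batch(batch))
        .sum()
}
```

Running the same plan twice removes the files on the first pass and nothing on the second, which is the behavior the re-plan-and-resume path depends on.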
The Storage-Tier Handoff (Sketch)
Before deletion from the live tier, copy data files to the cold tier:
use anyhow::Result;
pub async fn handoff_to_cold_tier(
plan: &ExpirationPlan,
live_store: &dyn ObjectStore,
cold_store: &dyn ObjectStore,
cold_catalog: &PostgresCatalog,
) -> Result<()> {
// 1. Copy each data file from live to cold. The copy is streamed
// to bound memory; the destination path mirrors the source path
// (the cold tier uses the same path layout for simplicity).
for src in &plan.deletable_data_files {
let dst = src.clone();
copy_streamed(live_store, src, cold_store, &dst).await?;
}
// 2. Copy the expired snapshots' metadata to the cold tier.
// The cold tier catalog will reference these by path.
for snapshot_id in &plan.expired_snapshots {
let entry = find_history_entry(snapshot_id)?;
let snapshot: Snapshot = read_metadata_file(&entry.metadata_path).await?;
// Copy snapshot metadata, manifest list, and manifests.
copy_streamed(live_store, &entry.metadata_path, cold_store, &entry.metadata_path).await?;
copy_streamed(live_store, &snapshot.manifest_list_path,
cold_store, &snapshot.manifest_list_path).await?;
let manifest_list = read_manifest_list(&snapshot.manifest_list_path).await?;
for ml_entry in manifest_list.manifests {
copy_streamed(live_store, &ml_entry.manifest_path,
cold_store, &ml_entry.manifest_path).await?;
}
}
// 3. Commit a new snapshot to the cold-tier catalog referencing
// the just-copied snapshots. The cold tier is append-only at this
// point; old cold-tier snapshots are preserved.
cold_catalog.append_snapshots_to_history(plan.expired_snapshots.clone()).await?;
Ok(())
}
The pattern. The cold tier accumulates the live tier's expired snapshots. Audit and accident-investigation queries route to the cold tier through a separate read path; the cold tier's data is immutable, the catalog read-only, the operational profile much simpler than the live tier. The 7-year retention budget is the cost of keeping the cold tier; the Artemis archive's actual cold-tier size after two years of operation is around 35 TB, which is manageable and smaller than the live tier's 50 TB working set.
Key Takeaways
- Snapshot expiration reclaims storage by removing metadata and data files that no reachable snapshot references. The job is essential — without it the storage grows monotonically — but it must coexist safely with concurrent reads.
- Two-phase deletion decouples visibility from physical deletion. Phase 1 (metadata commit) removes the file from the new snapshot's manifest; Phase 2 (physical delete) happens later, after the retention window guarantees no reader can still need the file.
- The reachability calculation is the heart of expiration. Walk every reachable snapshot (current + within retention window + tagged), union the file references, compare against the object store contents. The complement is the deletable set.
- Storage-tier handoff moves aged-out data to a cheaper tier rather than deleting outright. The Artemis archive's three-tier structure (S3 Standard / Glacier IR / Deep Archive) balances access latency against cost; audit and investigation use cases against old history go through a separate read path against the cheaper tiers.
- Tagged retention is the audit exception. Specific snapshots can be preserved indefinitely via tags; the reachability calculation treats them as reachable; the expiration logic stays clean.
Lesson 3 — Lineage, Audit Trails, and Orphan Cleanup
Module: Data Lakes — M06: Compaction, Lineage, and Lifecycle Position: Lesson 3 of 3 Source: Designing Data-Intensive Applications — Martin Kleppmann and Chris Riccomini, Chapter 11 ("Stream Processing — Event Sourcing"). Apache Iceberg specification, "Snapshot Properties" and "Maintenance" sections.
Source note: Lineage practice varies widely across deployments; the Artemis discipline described here is one specific pattern. The orphan-cleanup mechanic is well-specified by Iceberg's RemoveOrphanFiles action; verification against the current version is recommended.
Context
The previous two lessons handled the what of lifecycle work: consolidating small files and expiring old snapshots. This lesson handles the why, and the cleanup of the debris the other jobs leave behind. Lineage is the audit trail that records what produced each commit: which job, which input snapshots, which transformation, which operator triggered it. Orphan cleanup is the reconciliation pass that finds files in the object store that no metadata references, such as debris from failed commits, interrupted compactions, or aborted maintenance jobs.
The two subjects pair because they share a common discipline: the metadata is the source of truth. Lineage records the metadata's history; orphan cleanup reconciles the object store against the metadata. Both jobs sit downstream of the commit path and serve the operations team rather than the analyst workload. A lakehouse without lineage is one no one can audit; a lakehouse without orphan cleanup is one whose object-store costs grow with the failure rate of every maintenance job.
This lesson develops both. The lineage schema that records what produced each commit with enough fidelity for forensic analysis. The schema-history compaction pattern that keeps the lineage's own metadata bounded. The orphan-detection scan that walks the object store and compares against metadata. The safety properties that prevent orphan cleanup from accidentally deleting an in-flight commit. The capstone's lifecycle worker runs both as part of its daily maintenance pass.
Core Concepts
What the Lineage Records
A commit's lineage answers questions like: which writer made this commit? Which input data did it transform? Which downstream consumers depend on it? At what timestamp did it happen? Was this a routine ingest, a backfill, a compaction, or an incident response? The lineage's job is to record enough context that the operations team can answer these questions months or years later.
The Artemis archive's commit-lineage schema, attached to each snapshot via Iceberg's snapshot.summary field (string key-value metadata; beyond the required operation key, the Iceberg spec leaves additional keys to the operator):
snapshot.summary:
operation: "append" | "overwrite" | "compaction" | "expiration"
writer.id: "ingest-pipeline-prod-1" | "compaction-worker-2" | ...
writer.host: "ground-station-3.artemis.internal"
writer.commit_hash: "a1b2c3d..." // git SHA of the writer binary
writer.invocation_id: "run-2024-03-15-1027" // unique per writer invocation
input.snapshot_ids: [4727, 4728] // for compaction/replay commits
input.row_count: 12000 // rows ingested or processed
input.bytes_compressed: 24831044
trigger.type: "schedule" | "manual" | "incident-response"
trigger.operator: "j.smith" // if trigger.type = manual
trigger.ticket: "ARTEMIS-2024-03-15-3" // if incident-response
The fields are operational. writer.id identifies the producing service. commit_hash ties the data to a specific binary version, which the audit team can use to read the transformation logic at the time. invocation_id distinguishes between multiple runs of the same writer (e.g., two backfill jobs). input.snapshot_ids records the upstream data — for compaction, the source snapshot whose files were consolidated; for replay-derived data, the snapshot the replay was run against. trigger.type records why the commit happened — a routine schedule run, an operator-initiated change, an incident-response action. trigger.ticket ties incident-response commits to the originating issue.
The discipline. Every commit, no exceptions, writes these fields. The writer code is the only place that knows what it's doing; the lineage must be recorded at commit time, not reconstructed later. The Artemis archive's ingest writer, compaction worker, and replay tooling all populate the lineage fields as part of their normal commit path; a commit that omits required lineage fields is rejected by a pre-commit validation hook (added to the table format library; it inspects snapshot.summary against the table's required-keys configuration).
DDIA (Ch. 11, "Event Sourcing") makes the same point in the stream-processing context: "Event sourcing has the property that the application state is determined by the sequence of events. If you keep the entire history of events, you have an audit log of everything that has ever happened." The lakehouse's snapshot history is exactly this event log; the lineage metadata is what makes the events meaningful for audit.
The Audit Use Cases
The Artemis archive's audit team has three recurring queries against the lineage. Each one motivates a specific aspect of the schema.
"When was this data ingested, and by what version of the ingest pipeline?" This is the post-mortem query when an analyst discovers a data quality issue. The answer requires writer.id, writer.commit_hash, and timestamp_ms (recorded by the snapshot itself). The audit team correlates commit_hash against the ingest-pipeline source repository to see exactly what transformation produced the suspect data.
"What snapshots are downstream of this incident?" When an incident reveals that a specific input was bad (a sensor mis-calibrated, a ground-station feed corrupted), the audit team needs to know every snapshot that consumed the bad input. The lineage's input.snapshot_ids field is what enables this: the audit team finds the snapshot that ingested the bad input, then queries for every later snapshot that listed that one as an input (recursively). The result is the bounded blast radius of the incident.
"Who triggered this overwrite commit?" The Artemis archive's overwrite path is restricted — only the correction-job tooling produces overwrites, and only operators with specific permissions can trigger it. The lineage's trigger.type, trigger.operator, and trigger.ticket fields are the audit trail for this: every overwrite is traced to a specific human operator and a specific incident ticket. A commit without these fields fails pre-commit validation; the discipline is enforced at the commit boundary.
All three queries are answered by reading the snapshot history and inspecting the summary fields. The cost is one metadata read per snapshot in the audit window; for the typical audit query against ~1000 snapshots, this is a few seconds of wall-clock time.
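The second query (everything downstream of a bad snapshot) is a graph traversal over `input.snapshot_ids`. The sketch below assumes the audit tool has already loaded each snapshot's inputs into an in-memory map; `downstream_of` is a hypothetical name, and snapshot IDs are taken to be `u64` for illustration.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Given each snapshot's input.snapshot_ids, return every snapshot that
/// transitively consumed `bad`. `bad` itself is not included.
pub fn downstream_of(inputs: &HashMap<u64, Vec<u64>>, bad: u64) -> HashSet<u64> {
    // Invert the edges: input snapshot -> snapshots that list it as an input.
    let mut consumers: HashMap<u64, Vec<u64>> = HashMap::new();
    for (&snapshot, ins) in inputs {
        for &input in ins {
            consumers.entry(input).or_default().push(snapshot);
        }
    }

    // Breadth-first walk from the bad snapshot along consumer edges.
    let mut affected: HashSet<u64> = HashSet::new();
    let mut queue: VecDeque<u64> = VecDeque::from([bad]);
    while let Some(current) = queue.pop_front() {
        if let Some(nexts) = consumers.get(&current) {
            for &next in nexts {
                // insert() returns false if already seen, so each
                // snapshot is enqueued at most once.
                if affected.insert(next) {
                    queue.push_back(next);
                }
            }
        }
    }
    affected
}
```

The returned set is the blast radius: the snapshots whose data must be re-derived or flagged.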
Schema-History Compaction
The lineage data is stored in the snapshot summaries. Over time the snapshot history accumulates; for tables with hourly commits over a 30-day window, the history has ~720 entries, each with the summary metadata. The history grows linearly with the number of commits. For the Artemis archive's millions-of-commits-per-mission scale, the schema history is a non-trivial metadata payload.
The accumulation isn't a correctness problem — every entry is independently useful for audit. But the metadata read cost is real: reading the snapshot-history-log file to plan a time-travel query touches every entry. For tables with many millions of snapshots in history, the planning latency degrades.
The mitigation is schema-history compaction: a separate metadata commit that rewrites the snapshot-history-log to a compacted form. Old entries (typically those outside the active time-travel window) are summarized rather than enumerated: a single entry says "snapshots 4000-50000 occurred between T1 and T2, all routine ingest, all from writer ingest-pipeline-prod-1" in place of roughly 46,000 individual entries. The full per-snapshot detail remains in the underlying snapshot metadata files (which are themselves preserved or archived by Lesson 2's expiration); the compacted log is just the index.
The compaction runs monthly. The size of the snapshot-history-log file drops from megabytes to kilobytes; metadata-read latency drops by orders of magnitude. Audit queries that need per-snapshot detail still work — they read the snapshot file directly through the compacted log's pointer.
The Iceberg specification's metadata-log-compaction support is the primitive; the Artemis archive runs it on the monthly schedule described above.
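The summarization step can be sketched as a single pass over the log. This is an illustrative in-memory model, not the table format's on-disk representation: `LogEntry`, `Compacted`, and `compact_history` are hypothetical names, and the input is assumed to be sorted by timestamp.

```rust
/// One snapshot-history-log entry, reduced to the fields the summary needs.
#[derive(Debug, Clone, PartialEq)]
pub struct LogEntry {
    pub snapshot_id: u64,
    pub timestamp_ms: i64,
    pub operation: String,
    pub writer_id: String,
}

/// An entry in the compacted log: either a full entry or a range summary.
#[derive(Debug, Clone, PartialEq)]
pub enum Compacted {
    Single(LogEntry),
    Range {
        first_id: u64,
        last_id: u64,
        start_ms: i64,
        end_ms: i64,
        operation: String,
        writer_id: String,
        count: usize,
    },
}

/// Entries at or after `window_start_ms` are kept verbatim. Older entries
/// are merged into a range while operation and writer stay the same.
pub fn compact_history(entries: &[LogEntry], window_start_ms: i64) -> Vec<Compacted> {
    let mut out: Vec<Compacted> = Vec::new();
    for e in entries {
        if e.timestamp_ms >= window_start_ms {
            out.push(Compacted::Single(e.clone()));
            continue;
        }
        // Try to extend the range that ends the output so far.
        if let Some(Compacted::Range { last_id, end_ms, operation, writer_id, count, .. }) =
            out.last_mut()
        {
            if *operation == e.operation && *writer_id == e.writer_id {
                *last_id = e.snapshot_id;
                *end_ms = e.timestamp_ms;
                *count += 1;
                continue;
            }
        }
        // Otherwise start a new range of length one.
        out.push(Compacted::Range {
            first_id: e.snapshot_id,
            last_id: e.snapshot_id,
            start_ms: e.timestamp_ms,
            end_ms: e.timestamp_ms,
            operation: e.operation.clone(),
            writer_id: e.writer_id.clone(),
            count: 1,
        });
    }
    out
}
```

A long run of routine ingest commits from one writer collapses to a single `Range`; any change of writer or operation starts a new range, so the summary never hides a non-routine commit inside a routine run.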
Orphan Files: The Failure Debris
An orphan file is an object-store file that exists physically but is not referenced by any reachable snapshot's metadata. Orphans come from a few sources:
- Failed commits. Module 2 Lesson 3's retry loop writes new metadata files speculatively; on conflict, the writer rebases and retries, leaving the previous attempt's metadata files behind. These are orphan manifests and orphan snapshot files. The associated data files are not orphans because the next attempt's commit references them.
- Aborted compactions. A compaction that wrote its consolidated output file but failed to commit (perhaps because the source files were no longer reachable; Lesson 1's sanity check) leaves the consolidated file behind. It is referenced by nothing.
- Interrupted writes. A writer that crashed partway through a multi-file commit may have written some data files before the crash. If the crash prevented the commit from completing, the partial data files are orphans.
- Manual operations. Sometimes operators copy files around for testing or debugging; those files are typically not in any commit's manifests and become orphans by definition.
Orphans aren't a correctness issue — they don't affect query results. They are a cost issue: every orphan is occupied storage paid for with no operational benefit. A poorly-maintained lakehouse can accumulate substantial orphan debris; the Artemis archive's first month of operation, before orphan-cleanup was in place, accumulated ~5% of its total storage in orphans. After the cleanup discipline was operational, the orphan rate dropped to a steady ~0.1% (mostly in-flight writes during the cleanup scan, cleaned up on the next pass).
Orphan Detection: The Scan Discipline
Orphan detection is the inverse of the reachability calculation from Lesson 2. The reachability calculation produces the reachable file set. Orphan detection produces everything else in the object store, minus any file that a writer may still be in the middle of writing.
The discipline that's important for safety: a file that is currently being written must not be classified as an orphan, because deleting an in-flight file breaks the writer that's writing it. The fix is a minimum age threshold: only files older than some safety window are eligible for orphan classification. The Artemis archive uses a 7-day minimum age. A file younger than 7 days is assumed to be possibly in-flight; the next cleanup pass picks it up if it's still orphaned then.
The detection algorithm:
- Snapshot the reachable file set (Lesson 2's calculation; cached or re-run).
- List the object store's data and metadata directories.
- For each listed file, check three conditions: (a) it is not in the reachable set, (b) it is older than the minimum-age threshold, (c) it is not claimed by a currently active writer (the Artemis writers heartbeat to the catalog; the cleanup job consults the heartbeats to identify which writers are active and what files they could be writing).
- Files meeting all three conditions are orphans. Delete them.
The third condition is the safety guard that distinguishes in-flight from genuinely-orphan files. A writer that started a multi-file commit 3 hours ago hasn't yet finished; its in-progress data files are not yet in any manifest but they will be soon. The minimum-age threshold gives the writer time to finish; the active-writer heartbeat gives additional defense in case the writer is slow.
Idempotency and Resumability
Both lineage compaction and orphan cleanup are batch jobs with potentially-large work. A crashed job must resume cleanly on the next run. The discipline is idempotency: every step is safe to repeat.
For lineage compaction: the compacted snapshot-history-log is written atomically (the rename pattern; an in-progress write produces a .tmp file that doesn't replace the current log until the rename). A crash mid-write leaves the in-progress file behind (an orphan, cleaned up by the next orphan-cleanup pass); the current log is unchanged.
For orphan cleanup: each file's deletion is independent. The job logs deletions to a journal as it runs; on resume, it reads the journal, skips files already deleted (a tombstone in the journal), continues from where it left off. The journal is itself a small file in the metadata directory; the lifecycle worker maintains it as part of its working state.
Both jobs commit no shared state until they complete. The CAS-on-the-catalog protocol used by ingest and compaction commits is not used by maintenance jobs — they don't change the table's data state, only the metadata structure. The result is that maintenance failures don't interfere with the live workload; failed maintenance is just incomplete maintenance, retryable next time.
The Lifecycle Worker's Scheduling
The capstone's lifecycle worker runs all four jobs on a continuous schedule. The interactions matter; getting the ordering wrong produces wasted work or cascading conflicts.
The Artemis worker's daily schedule:
- 00:00 UTC: Snapshot expiration (Lesson 2). Reachable-set calculation, file enumeration, deletion scheduling.
- 02:00 UTC: Storage-tier handoff (Lesson 2). Aged-out snapshots copied to the cold tier.
- 04:00 UTC: Physical deletion (Lesson 2 Phase 2). The expiration plan's deletable files are removed from the live tier.
- 06:00 UTC: Orphan cleanup (Lesson 3). The post-expiration scan picks up debris from the day's expiration plus any other accumulated orphans.
- Throughout the day, with bandwidth budget: Compaction (Lesson 1). Runs continuously, target-sized to consume at most 30% of available bandwidth.
- Weekly (Sunday 02:00 UTC): Manifest compaction (Lesson 1). Consolidates the week's per-commit manifests.
- Monthly (1st of month): Schema-history compaction (Lesson 3). Compacts the snapshot-history-log.
The ordering matters. Expiration runs before orphan cleanup because expiration's Phase 2 deletes produce a clean state for orphan cleanup to operate against. Compaction runs continuously rather than in a window because it benefits from spreading the work; expiration runs in a single window because its reachability calculation is most efficient when run all at once. Schema-history compaction runs least often because its benefit is amortized over many time-travel queries.
The worker's metrics — files compacted, snapshots expired, orphans cleaned, lineage compactions per period — are reported to the observability stack alongside the live query metrics. The operations team sees the lifecycle health as a first-class observable, not as background noise.
Core Mechanics in Code
The Lineage Validation Hook
Every commit validates its lineage fields before the CAS:
use anyhow::{anyhow, Result};
use std::collections::HashMap;

pub struct LineageRequirements {
    pub required_keys: Vec<String>,
    pub allowed_operations: Vec<String>,
    pub trigger_validators: HashMap<String, Box<dyn Fn(&HashMap<String, String>) -> Result<()>>>,
}

/// Validate the lineage fields in a snapshot's summary against the
/// table's lineage requirements. Returns Err if any required field
/// is missing or invalid.
pub fn validate_lineage(
    summary: &HashMap<String, String>,
    requirements: &LineageRequirements,
) -> Result<()> {
    // 1. Every required key must be present.
    for key in &requirements.required_keys {
        if !summary.contains_key(key) {
            return Err(anyhow!("required lineage key missing: {}", key));
        }
    }

    // 2. The operation must be one of the allowed values.
    let operation = summary
        .get("operation")
        .ok_or_else(|| anyhow!("operation key missing"))?;
    if !requirements.allowed_operations.contains(operation) {
        return Err(anyhow!("operation '{}' not allowed", operation));
    }

    // 3. Trigger-specific validation: if trigger.type = manual, then
    //    trigger.operator must be present; if trigger.type = incident-
    //    response, then trigger.ticket must be present.
    if let Some(trigger_type) = summary.get("trigger.type") {
        if let Some(validator) = requirements.trigger_validators.get(trigger_type) {
            validator(summary)?;
        }
    }

    Ok(())
}
The pattern. The validation is per-commit; the requirements are per-table. Tables with different audit needs configure different required keys. The validation runs synchronously in the commit path — before the CAS — and rejects malformed commits before they advance the snapshot pointer. The cost is microseconds; the safety benefit is significant.
The Orphan Detection Scan
The reconciliation pass that finds orphan files:
use anyhow::Result;
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

// Table, ReachableSet, and ObjectStore are assumed to be in scope from the
// rest of the crate; they are not defined in this listing.

const MIN_ORPHAN_AGE: Duration = Duration::from_secs(7 * 24 * 3600); // 7 days

pub struct OrphanScanResult {
    pub orphan_data_files: Vec<String>,
    pub orphan_manifests: Vec<String>,
    pub skipped_in_flight: u32,
}

/// Scan the object store for files unreferenced by any reachable snapshot.
/// Filters out files younger than the min-age threshold (potentially in
/// flight) and files associated with currently-active writers.
pub async fn detect_orphans(
    table: &Table,
    reachable: &ReachableSet,
    object_store: &dyn ObjectStore,
    active_writer_paths: &HashSet<String>,
) -> Result<OrphanScanResult> {
    let now = SystemTime::now();
    let mut orphan_data: Vec<String> = Vec::new();
    let mut orphan_manifests: Vec<String> = Vec::new();
    let mut skipped: u32 = 0;

    // Scan the data directory.
    let data_files = object_store.list_dir("data/").await?;
    for file in data_files {
        if reachable.data_file_paths.contains(&file.path) {
            continue;
        }
        if active_writer_paths.contains(&file.path) {
            skipped += 1;
            continue;
        }
        // A last_modified in the future (clock skew) yields age zero,
        // which errs on the side of skipping the file.
        let age = now.duration_since(file.last_modified).unwrap_or(Duration::ZERO);
        if age < MIN_ORPHAN_AGE {
            skipped += 1;
            continue;
        }
        orphan_data.push(file.path);
    }

    // Same for the metadata directory. Note that anything under metadata/
    // that the reachable set does not track is treated as an orphan here.
    let manifest_files = object_store.list_dir("metadata/").await?;
    for file in manifest_files {
        if reachable.manifest_paths.contains(&file.path)
            || reachable.manifest_list_paths.contains(&file.path)
        {
            continue;
        }
        // Apply the active-writer guard to metadata files as well, so both
        // guards cover both directories.
        if active_writer_paths.contains(&file.path) {
            skipped += 1;
            continue;
        }
        let age = now.duration_since(file.last_modified).unwrap_or(Duration::ZERO);
        if age < MIN_ORPHAN_AGE {
            skipped += 1;
            continue;
        }
        orphan_manifests.push(file.path);
    }

    Ok(OrphanScanResult {
        orphan_data_files: orphan_data,
        orphan_manifests,
        skipped_in_flight: skipped,
    })
}
What to notice. The age threshold is the primary safety guard. The active-writer-paths set is the secondary guard for writers whose work pre-dates the threshold (e.g., a backfill that has been running for two weeks). The two together ensure no in-flight write is misclassified as an orphan. The skipped_in_flight count is reported as a metric; sustained nonzero values indicate either many slow writers or a misconfiguration of the active-writer tracking — both worth surfacing to the operations team.
The Deletion Loop with Journaling
The deletion side, with a journal that makes the job resumable:
use anyhow::Result;
use std::path::Path;
use tokio::fs::OpenOptions;
use tokio::io::AsyncWriteExt;

// read_journal_entries is assumed to return the set of paths already
// recorded in the journal (one path per line).

pub async fn delete_orphans(
    scan_result: &OrphanScanResult,
    object_store: &dyn ObjectStore,
    journal_path: &Path,
) -> Result<u32> {
    // Open the journal in append mode. The journal records every
    // successful deletion; on a restart, the worker reads the journal
    // to skip files already deleted.
    let mut journal = OpenOptions::new()
        .create(true)
        .append(true)
        .open(journal_path)
        .await?;
    let already_deleted = read_journal_entries(journal_path).await?;

    let mut deleted_count: u32 = 0;
    for path in scan_result
        .orphan_data_files
        .iter()
        .chain(scan_result.orphan_manifests.iter())
    {
        if already_deleted.contains(path) {
            continue;
        }
        match object_store.delete(path).await {
            Ok(()) => {
                // Journal only after the delete succeeded. A crash between
                // these two steps causes one repeated delete on resume.
                journal.write_all(format!("{path}\n").as_bytes()).await?;
                deleted_count += 1;
            }
            Err(e) => {
                tracing::warn!(path = %path, error = %e, "failed to delete orphan; will retry next pass");
                // Don't journal the failure; the next pass picks it up.
            }
        }
    }

    journal.flush().await?;
    Ok(deleted_count)
}
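The discipline. A journal entry is appended only after the object store confirms the deletion, so the journal never lists a file that still exists. On restart the journal is read and already-deleted paths are skipped. Failed deletions are not journaled; the next pass retries them. A crash between a successful delete and its journal append means that one file is deleted again on resume, which is harmless as long as the object store treats deleting an already-absent object as a no-op. Strictly speaking this is a completion log rather than the write-ahead log from the Database Internals track: the record follows the action instead of preceding it, and idempotent deletes are what make that ordering safe.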
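The listing calls `read_journal_entries` without defining it. A minimal sketch of the parsing side follows, assuming the one-path-per-line format that `delete_orphans` writes. It uses blocking `std::fs` so that it stands alone; an async variant would read the file with `tokio::fs` and call the same parser. The names `parse_journal` and `read_journal_entries_blocking` are hypothetical.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Parse journal contents: one deleted path per line.
pub fn parse_journal(contents: &str) -> HashSet<String> {
    // A crash mid-append can leave a final line without its trailing
    // newline. Only newline-terminated lines are trusted; a torn last
    // line is ignored, so that file is simply deleted again on resume.
    let complete = match contents.rfind('\n') {
        Some(idx) => &contents[..idx],
        None => "",
    };
    complete
        .lines()
        .filter(|line| !line.is_empty())
        .map(str::to_string)
        .collect()
}

/// Read the journal from disk. A missing journal means nothing has been
/// deleted yet, which is not an error.
pub fn read_journal_entries_blocking(journal_path: &Path) -> io::Result<HashSet<String>> {
    match fs::read_to_string(journal_path) {
        Ok(contents) => Ok(parse_journal(&contents)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(HashSet::new()),
        Err(e) => Err(e),
    }
}
```

Ignoring a torn final line is the conservative choice: the worst case is one redundant delete, whereas trusting a truncated path could cause the worker to skip a file it never removed.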
Key Takeaways
- Lineage is the audit trail that records what produced each commit. The schema must include enough fields (writer ID, commit hash, input snapshots, trigger metadata) to answer audit questions months or years later.
- Lineage validation runs at commit time. The writer is the only place that knows what it's doing; the commit-path validation hook rejects malformed commits before they advance the snapshot pointer.
- Orphan files come from failure modes of the commit and maintenance paths. They are a cost issue rather than a correctness issue, but accumulated debris can be substantial (~5% of storage in the Artemis archive's first month); the cleanup discipline brought that down to a steady ~0.1%.
- Orphan detection requires age-based safety guards. Files younger than the minimum-age threshold are assumed to be possibly in flight and are skipped. Active-writer tracking handles writers whose work pre-dates the threshold.
- The lifecycle worker schedules all four jobs (compaction, expiration, orphan cleanup, lineage compaction) with explicit ordering and bandwidth budgets. The ordering avoids cascading conflicts; the bandwidth budget keeps maintenance from impacting live queries.
Capstone — Artemis Archive Lifecycle Worker
Module: Data Lakes — M06: Compaction, Lineage, and Lifecycle
Estimated effort: 1–2 weeks of focused work
Prerequisite: All three lessons in this module completed; all three quizzes passed (≥ 70%). The Module 2-5 capstones are the substrate.
Mission Briefing
From: Cold Archive Operations Lead
ARCHIVE BRIEFING — RC-2026-04-DL-006
SUBJECT: Archive Lifecycle Worker — background maintenance service
for the Artemis cold archive.
PRIORITY: P1 — required to bring the archive into long-term
operational steady state.
The cold archive's read and write paths work. The analyst portal is in production. What we don't have is the maintenance discipline that keeps the archive working over years. We've been deferring the small-file problem for six months; query latencies are creeping up. We've never run snapshot expiration, so the metadata footprint keeps growing. Orphan files from failed commits accumulate at maybe 0.5% per week — we estimate 200-300 GB of orphans in the archive already. The audit team has started asking for lineage queries we can't currently answer because the lineage discipline isn't enforced uniformly.
The job: build the Archive Lifecycle Worker. It's a Rust service that runs in the cold-archive infrastructure, performs the four maintenance jobs (compaction, snapshot expiration, orphan cleanup, lineage compaction) on a continuous schedule, reports metrics to the observability stack, and respects the live workload's bandwidth budget so analyst queries are not affected.
The worker doesn't need to be flashy. It needs to be correct, observable, and safe. Production workloads will run against this for years; getting the maintenance discipline right now is what makes the archive durable over the long term.
What You're Building
A Rust crate, artemis-lifecycle-worker, exposing a long-running binary that runs the four maintenance jobs continuously. Components:
- A scheduler that runs each job on its configured schedule (daily, weekly, monthly, or continuous) with respect to bandwidth budgets and quiet-window policies.
- A compaction engine implementing bin-packing, sort-based, and Z-order rewrite strategies (Lesson 1), choosing per-partition based on the table's clustering spec.
- A snapshot expiration engine implementing the two-phase deletion (Lesson 2): reachability calculation, expiration plan, storage-tier handoff to S3 Glacier IR, Phase-2 physical delete.
- An orphan cleanup engine implementing the age- and active-writer-guarded reconciliation scan (Lesson 3).
- A lineage validation hook that the Module 2 commit path calls before the CAS to reject commits missing required lineage fields.
- A manifest-compaction job (Lesson 1) that runs weekly to consolidate per-commit manifests.
- A schema-history-compaction job (Lesson 3) that runs monthly.
- A structured-logging integration that emits metrics for every job (files compacted, snapshots expired, orphans cleaned, lineage validations) to the observability stack via the tracing crate.
- A small operator CLI, artemis-lifecycle-cli, for manual job invocation (artemis-lifecycle-cli compact --table sda_observations --partition 'mission_id=apollo-7,day=2024-03-15') and for inspecting the worker's state.
The worker must run for at least 30 days against the production archive without producing analyst-visible query-latency degradation, measured by the existing observability stack.
Functional Requirements
- Compaction strategy selection. For each partition selected for compaction, the engine consults the table's sort-order metadata: tables with no sort order use bin-packing; tables with a linear sort order use sort-based compaction; tables with a Z-order cluster spec use Z-order compaction.
- Compaction scope. The engine compacts partitions whose small-file count exceeds a threshold (default 50 files below 32 MB). Partitions below the threshold are not touched.
- Compaction throughput limit. The engine consumes at most 30% of the configured object-store bandwidth budget. The bandwidth limit is enforced at the read level (using a token bucket on the Parquet reader) and at the write level.
- Quiet-window scheduling. Compaction is paused between 09:00 and 17:00 UTC daily; expiration runs only between 00:00 and 06:00 UTC.
- Two-phase deletion. Expiration's Phase 1 (metadata commit) runs in a daily batch; Phase 2 (physical delete) runs 24+ hours later. The two phases are decoupled; a Phase 2 run uses the journal of files queued by Phase 1.
- Storage-tier handoff. Before Phase 2 deletes, expired snapshots' data and metadata files are copied to the cold-tier object store. The cold tier's catalog is updated to reference them.
- Orphan detection with safety guards. Files younger than 7 days OR in the active-writer set are not classified as orphans. The active-writer set is read from the catalog's writer heartbeat table.
- Lineage validation. The Module 2 commit path calls the validation hook before its CAS. Commits without the required fields (operation, writer.id, writer.commit_hash, writer.invocation_id) are rejected with a clear error message.
- Operator manual triggers. The CLI supports manual invocation of any job, scoped to a specific table or partition where applicable. Manual invocations bypass the quiet-window schedule but still respect bandwidth limits.
- Resumability. Every job is resumable from its last consistent state. A worker crash mid-job is recoverable by re-running the same job, which picks up where the crash left off (compaction reads its plan from the catalog; orphan cleanup reads its journal; etc.).
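The quiet-window and manual-trigger requirements above reduce to a small predicate. The sketch below assumes half-open windows (09:00 inclusive to 17:00 exclusive, 00:00 inclusive to 06:00 exclusive); the requirements do not state the boundary behavior, so that choice is an assumption. `Job`, `Invocation`, and `may_run` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Job {
    Compaction,
    Expiration,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Invocation {
    Scheduled,
    Manual,
}

/// Decide whether a job may start at the given UTC hour (0..=23).
/// Bandwidth limits are enforced separately and apply to both
/// scheduled and manual invocations.
pub fn may_run(job: Job, invocation: Invocation, hour_utc: u8) -> bool {
    // Manual invocations bypass the quiet-window schedule.
    if invocation == Invocation::Manual {
        return true;
    }
    match job {
        // Compaction is paused during the analyst working window.
        Job::Compaction => !(9..17).contains(&hour_utc),
        // Expiration runs only in the overnight window.
        Job::Expiration => (0..6).contains(&hour_utc),
    }
}
```

Keeping the predicate pure (hour in, bool out) makes the schedule trivially testable without mocking the clock.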
Acceptance Criteria
Verifiable (automated tests must demonstrate these)
- A bin-packing compaction reduces a partition's data-file count from N small files (each 1-10 MB) to ceil(N × avg_size / 128 MB) compacted files, in a single commit recorded with operation "compaction". The compacted partition has fewer total files; the data row count is preserved exactly.
- A Z-order compaction against a partition with random payload_id and sensor_kind ordering produces output files whose per-column statistics for both columns are tight (max-min span less than 30% of the column's full range, averaged across files).
- A snapshot expiration against a table with 1000 snapshots and a 30-day retention window classifies snapshots older than 30 days as expired (modulo tagged snapshots) and removes their metadata from the snapshot-history-log.
- An orphan cleanup against an object store containing 100 files of mixed ages and reachability classifications correctly identifies and deletes only files that are (a) not reachable, (b) older than 7 days, and (c) not in the active-writer set.
- An attempted commit with snapshot.summary missing the required lineage fields is rejected at pre-commit validation; the catalog pointer is not advanced; the writer receives a structured error.
- A worker process killed mid-compaction with SIGKILL and then restarted resumes correctly: the next run reads the catalog to identify which compaction was in flight (no commit yet) and re-plans. No data is lost; no double-deletions occur.
- A 24-hour soak test against a synthetic high-load environment (1 writer committing every 30s; 50 analyst queries/minute; lifecycle worker running all jobs) shows no analyst-query-latency degradation beyond the workload's own variation (measured as p95/p99 latency change < 10%).
- The worker's metrics-exposed endpoints (Prometheus-format /metrics) include files_compacted_total, snapshots_expired_total, orphans_cleaned_total, lineage_validation_failures_total, with per-table labels.
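For the bin-packing criterion, the expected output count ceil(N × avg_size / 128 MB) is the total input size divided by the target size, rounded up. A test can compute it directly from the input file sizes. The sketch assumes "MB" means MiB (1024 × 1024 bytes); `expected_output_files` is a hypothetical helper.

```rust
/// Target size of a compacted file. Assumption: 128 MB is read as 128 MiB.
const TARGET_FILE_BYTES: u64 = 128 * 1024 * 1024;

/// Expected number of compacted files for a partition, given the sizes
/// of its small input files in bytes.
pub fn expected_output_files(small_file_sizes: &[u64]) -> u64 {
    // N × avg_size is simply the sum of the input sizes.
    let total: u64 = small_file_sizes.iter().sum();
    // Integer ceiling division.
    (total + TARGET_FILE_BYTES - 1) / TARGET_FILE_BYTES
}
```

For example, 100 input files of 5 MiB each total 500 MiB, and 500 / 128 rounds up to 4 output files.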
Self-assessed (you write a short justification; reviewer checks it)
- (self-assessed) The bandwidth-budget tuning is documented in docs/bandwidth-tuning.md. The doc describes how the 30% budget was chosen against the live workload, the failure modes if it is set too low (compaction falls behind) or too high (analyst impact), and the observable signals that indicate the budget needs adjusting.
- (self-assessed) The retention-window choice is documented in docs/retention-tuning.md per table. The doc lists each table, its retention window, and the analytic argument that justifies the choice.
- (self-assessed) The schedule (compaction continuous, expiration 00:00-06:00 UTC, etc.) is documented in docs/schedule.md with the analyst-team time zone constraint that drives the choice and the alternative schedules that were considered.
- (self-assessed) The resumability properties are documented in docs/resumability.md. The doc enumerates each job's crash-recovery behavior and the catalog state used to resume each one.
Architecture Notes
A reasonable module layout:
artemis-lifecycle-worker/
├── src/
│ ├── lib.rs
│ ├── scheduler.rs # Schedule { Daily, Weekly, Monthly, Continuous }
│ ├── bandwidth.rs # Token-bucket rate limiting for IO
│ ├── compaction.rs # bin-packing, sort-based, Z-order rewrite
│ ├── expiration.rs # reachability calc, two-phase deletion
│ ├── tier_handoff.rs # S3 Glacier IR copy + cold-tier catalog
│ ├── orphan_cleanup.rs # detection scan + deletion with journal
│ ├── lineage.rs # validation hook + schema-history compaction
│ ├── metrics.rs # Prometheus metrics
│ ├── bin/artemis_lifecycle_worker.rs
│ └── bin/artemis_lifecycle_cli.rs
├── tests/
│ ├── compaction.rs
│ ├── expiration_two_phase.rs
│ ├── orphan_detection.rs
│ ├── lineage_validation.rs
│ ├── resumability.rs
│ └── soak.rs # ignored by default; the 24-hour soak
└── docs/
  ├── bandwidth-tuning.md
  ├── retention-tuning.md
  ├── schedule.md
  └── resumability.md
The token-bucket rate-limiter for bandwidth budgeting is a standard pattern; the governor crate or a hand-rolled tokio::sync::Semaphore-based variant both work. The metrics-exposing endpoint uses prometheus or metrics-exporter-prometheus integrated into a small axum HTTP server.
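For readers who have not implemented one, a hand-rolled token bucket can be sketched in a few lines. This version is deliberately synchronous and takes the current time as an argument so that it is deterministic; it is not the `governor` API, and `TokenBucket` and its methods are hypothetical names. A production variant would wrap it in an async wait loop.

```rust
/// Token bucket measured in bytes. Time is passed in explicitly
/// (seconds since an arbitrary origin) to keep the logic testable.
pub struct TokenBucket {
    capacity: f64,       // maximum burst, in bytes
    tokens: f64,         // bytes currently available
    refill_per_sec: f64, // sustained budget, in bytes per second
    last_refill: f64,    // time of the last refill, in seconds
}

impl TokenBucket {
    /// `rate_bytes_per_sec` would be 30% of the object-store budget for
    /// the compaction engine. The bucket starts full.
    pub fn new(rate_bytes_per_sec: f64, burst_bytes: f64) -> Self {
        Self {
            capacity: burst_bytes,
            tokens: burst_bytes,
            refill_per_sec: rate_bytes_per_sec,
            last_refill: 0.0,
        }
    }

    /// Try to spend `bytes` at time `now_secs`. Returns false when the
    /// caller must wait before reading or writing that much.
    pub fn try_acquire(&mut self, bytes: f64, now_secs: f64) -> bool {
        // Refill for the elapsed time, capped at the burst capacity.
        let elapsed = (now_secs - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = self.last_refill.max(now_secs);

        if bytes <= self.tokens {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}
```

One design point worth noting: a request larger than the burst capacity can never succeed, so the reader should request tokens per chunk rather than per file.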
The active-writer set is read from a Postgres table that the M2 commit code maintains as a heartbeat (writer ID, host, started_at, last_heartbeat_at, in_progress_paths). The lifecycle worker reads this table on every orphan-cleanup scan. The Module 2 capstone's writer code may need a small extension to populate the heartbeat table; this is expected work.
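Turning heartbeat rows into the `active_writer_paths` set that the orphan scan consumes might look like the sketch below. The struct mirrors the columns listed above, with timestamps simplified to epoch milliseconds. The staleness cutoff is an assumption introduced here (the text does not say when a silent writer stops counting as active), and the names are hypothetical.

```rust
use std::collections::HashSet;

/// One row of the writer heartbeat table, already loaded from Postgres.
pub struct WriterHeartbeat {
    pub writer_id: String,
    pub host: String,
    pub started_at_ms: i64,
    pub last_heartbeat_at_ms: i64,
    pub in_progress_paths: Vec<String>,
}

/// Collect every path claimed by a writer whose last heartbeat is recent.
/// Assumption: a writer silent for longer than `stale_after_ms` is treated
/// as dead, and its files fall back to the minimum-age guard alone.
pub fn active_writer_paths(
    rows: &[WriterHeartbeat],
    now_ms: i64,
    stale_after_ms: i64,
) -> HashSet<String> {
    rows.iter()
        .filter(|row| now_ms - row.last_heartbeat_at_ms <= stale_after_ms)
        .flat_map(|row| row.in_progress_paths.iter().cloned())
        .collect()
}
```

Choosing the staleness cutoff is a safety trade-off: too short and a briefly stalled writer loses its protection, too long and a crashed writer's debris is shielded from cleanup.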
Hints
Hint 1 — Compaction's per-partition serialization
The capstone's compaction engine must avoid compacting the same partition from two workers simultaneously. The discipline: take a per-partition lock from the catalog (a compaction_in_progress table keyed by (table, partition_tuple), with a started_at timestamp). The lock has a TTL (e.g., 1 hour); a worker that crashes leaves the lock; the next worker observes the expired lock and reclaims it. This is the table-format-layer analog of the lease pattern; it's simpler than implementing it in code because Postgres already supports the primitive.
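The lock semantics in this hint can be modeled in memory before committing to a table design. The sketch below is an illustration of the acquire-or-reclaim rule only; in the capstone the same rule would be enforced by a single atomic statement against the `compaction_in_progress` table. `LeaseTable` and its methods are hypothetical names.

```rust
use std::collections::HashMap;

/// In-memory model of per-partition compaction leases with a TTL.
pub struct LeaseTable {
    ttl_ms: i64,
    // (table, partition_tuple) -> (owner, started_at_ms)
    leases: HashMap<(String, String), (String, i64)>,
}

impl LeaseTable {
    pub fn new(ttl_ms: i64) -> Self {
        Self { ttl_ms, leases: HashMap::new() }
    }

    /// Acquire the lease if it is free or if the current holder's lease
    /// has outlived the TTL (the holder is presumed crashed).
    pub fn try_acquire(&mut self, table: &str, partition: &str, worker: &str, now_ms: i64) -> bool {
        let key = (table.to_string(), partition.to_string());
        let held = self
            .leases
            .get(&key)
            .map_or(false, |(_, started_at)| now_ms - *started_at < self.ttl_ms);
        if held {
            return false;
        }
        self.leases.insert(key, (worker.to_string(), now_ms));
        true
    }

    /// Release the lease, but only if `worker` still owns it. A worker
    /// whose lease was reclaimed must not release the new owner's lease.
    pub fn release(&mut self, table: &str, partition: &str, worker: &str) {
        let key = (table.to_string(), partition.to_string());
        if self.leases.get(&key).map_or(false, |(owner, _)| owner == worker) {
            self.leases.remove(&key);
        }
    }
}
```

The owner check in `release` matters: without it, a slow worker that finishes after its lease expired would free a lock that another worker now holds.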
Hint 2 — The reachable-set cache
The reachable-set calculation (Lesson 2) is expensive for tables with many snapshots — it reads every snapshot's metadata. The capstone should cache the result with a TTL of a few hours; the cache is invalidated by any new commit (the lifecycle worker reads the catalog to detect this). Caching reduces the daily expiration job from hours to minutes.
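A sketch of the cache described in this hint, keyed on the table's current snapshot ID so that any new commit invalidates it. The reachable set is reduced to a set of path strings for illustration, and `ReachableCache` and `get_or_compute` are hypothetical names.

```rust
use std::collections::HashSet;

struct CacheEntry {
    snapshot_id: u64,
    computed_at_ms: i64,
    paths: HashSet<String>,
}

/// Caches one reachable set per table, with a TTL and commit invalidation.
pub struct ReachableCache {
    ttl_ms: i64,
    entry: Option<CacheEntry>,
}

impl ReachableCache {
    pub fn new(ttl_ms: i64) -> Self {
        Self { ttl_ms, entry: None }
    }

    /// Return the cached set if it was computed for the current snapshot
    /// and is younger than the TTL; otherwise run `compute` and cache it.
    pub fn get_or_compute<F>(
        &mut self,
        current_snapshot_id: u64,
        now_ms: i64,
        compute: F,
    ) -> &HashSet<String>
    where
        F: FnOnce() -> HashSet<String>,
    {
        let fresh = match &self.entry {
            Some(e) => {
                e.snapshot_id == current_snapshot_id
                    && now_ms - e.computed_at_ms < self.ttl_ms
            }
            None => false,
        };
        if !fresh {
            self.entry = Some(CacheEntry {
                snapshot_id: current_snapshot_id,
                computed_at_ms: now_ms,
                paths: compute(),
            });
        }
        &self.entry.as_ref().expect("entry populated above").paths
    }
}
```

On a table that receives commits every few seconds, snapshot-keyed invalidation fires constantly, so the cache mainly helps tables with quiet periods.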
Hint 3 — The Z-order rank-normalization
Z-order compaction requires the cluster columns to be normalized to the same u32 scale (Module 3 Lesson 2). The standard approach: compute the rank of each column's value distribution within the partition being compacted (use arrow::compute::rank or a per-column quantile sketch), and use the rank as the Z-order input. The ranks are stable within a single compaction run; different runs may produce different ranks for the same values (depending on data distribution shifts) — this is fine because clustering is per-file, not global.
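The two steps in this hint (rank, then interleave) can be sketched with the standard library alone. This is an illustration rather than the `arrow` kernel: `dense_rank` and `z_value` are hypothetical names, and the dense-rank variant (ties share a rank) is one reasonable choice among several. Scaling the ranks so that columns of very different cardinality span a comparable range is left out.

```rust
/// Dense rank of each value within the slice: the smallest value gets 0,
/// and equal values share a rank.
pub fn dense_rank<T: Ord>(values: &[T]) -> Vec<u32> {
    // Sort indices by value, then walk them assigning ranks.
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].cmp(&values[b]));

    let mut ranks = vec![0u32; values.len()];
    let mut rank = 0u32;
    for (i, &idx) in order.iter().enumerate() {
        if i > 0 && values[idx] != values[order[i - 1]] {
            rank += 1;
        }
        ranks[idx] = rank;
    }
    ranks
}

/// Interleave the bits of two 32-bit ranks into one 64-bit Z-value:
/// bits of `x` land in even positions, bits of `y` in odd positions.
pub fn z_value(x: u32, y: u32) -> u64 {
    let mut z = 0u64;
    for bit in 0..32u32 {
        z |= (((x >> bit) & 1) as u64) << (2 * bit);
        z |= (((y >> bit) & 1) as u64) << (2 * bit + 1);
    }
    z
}
```

Sorting rows by `z_value(rank_a, rank_b)` before writing output files is what keeps the per-file min/max statistics tight on both columns.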
Hint 4 — Resumability via the catalog
Resumability means every job's working state lives in the catalog (or another durable store), not in the worker's memory. Compaction writes its plan to the catalog before executing it; expiration writes its plan; orphan cleanup writes its journal. A restarted worker reads the persistent state and continues. The discipline is "no in-memory state lives across restarts"; if you find yourself wanting to remember something between job invocations, write it to a Postgres table.
Hint 5 — The soak-test environment
The 24-hour soak test (acceptance criterion) requires a synthetic environment that produces the right load shape. A reasonable setup: a writer that commits every 30s with mock data; 50 query workers that each issue a typical analyst query every minute against random table partitions; the lifecycle worker running all four jobs at production schedule. The soak test runs in CI on a dedicated test cluster; failure means analyst-latency regression beyond 10%. The Artemis platform has this test as part of the canary deployment pipeline for any lifecycle-worker change.
References
- Apache Iceberg specification — "Snapshot Properties", "Maintenance", "Tagging"
- governor crate documentation — token-bucket rate limiting
- prometheus crate documentation — metrics export
- AWS S3 Glacier Instant Retrieval documentation — for the cold-tier handoff
When You're Done
The crate is "done" when all eight verifiable acceptance criteria pass in CI, the four self-assessed docs are written, and the 24-hour soak test passes consistently. The Data Lakes track is complete at this point — the cold archive has a transactional table format, good partitioning and clustering, time travel, a SQL surface, and the operational discipline to keep all of it working over years. The next track (Distributed Systems) builds on this foundation for the constellation-scale workloads that span multiple ground stations and orbital assets.
Module 01 — Distributed Systems Fundamentals
"Two relay satellites just lost contact simultaneously over Antarctica. The grid needs to elect a new coordinator before the next pass window closes."
Mission Context
This module is the prerequisite for everything else in the Distributed Systems track. Before you can reason about replication, consensus, fault tolerance, or coordination, you have to internalize three foundational truths about the systems you are about to build:
- The network is unreliable, and its failures are indistinguishable from the outside. A timeout does not tell you what failed. Asynchronous packet networks deliver some messages zero times, some once, and some many times, and you cannot tell which.
- Time is unreliable. Wall clocks disagree across machines, NTP gives you approximate synchronization at best, and any algorithm that depends on tight clock agreement is one bad sync away from misbehaving.
- You cannot have everything. CAP and PACELC formalize the tradeoffs every distributed data system makes, both during partitions and in normal operation. The right question is not "is this system consistent?" but "in which cell of the PACELC matrix does this system live, and is that the cell we want?"
The opening incidents — the MSS-23 telemetry timeout (Lesson 1), the MSS-17 attitude-update overwrite (Lesson 2), and the Antarctic partition (Lesson 3) — are not failure stories from elsewhere. They are the kind of incidents the Constellation Network will produce next quarter if these concepts are not understood by the engineers building it.
Lessons
| # | Title | Source |
|---|---|---|
| 1 | The Unreliable Network | DDIA Ch. 9 |
| 2 | Clocks, Ordering, and Causality | DDIA Ch. 9 + Lamport 1978 |
| 3 | CAP, PACELC, and the Consistency Spectrum | DDIA Ch. 10 + Abadi 2012 |
Project
Constellation Clock Sync — a simulated 4-node satellite cluster with injected partitions, demonstrating that correct distributed ordering does not require NTP. Implements Lamport and vector clocks, a partition-aware message bus, and a test harness that verifies causal ordering under adversarial conditions.
Position
Module 1 of 6 in the Distributed Systems track.
What You Should Be Able to Do After This Module
- Read code that talks to other nodes and identify, by inspection, which of the eight fallacies of distributed computing it implicitly relies on.
- Choose deliberately between SystemTime and Instant based on whether the operation requires absolute time or elapsed time, and articulate the failure mode of each.
- Implement and apply Lamport and vector clocks to order distributed events without depending on physical clock agreement.
- Place a candidate data system on the PACELC matrix and explain, in one sentence, what behavior the system exhibits during a partition and what behavior it exhibits during normal operation.
- Distinguish linearizability from serializability and from "strong consistency" claims in vendor documentation, and ask the right follow-up questions when a system claims a consistency model.
Source Materials
- DDIA 2nd Edition (Kleppmann & Riccomini, 2026) — Chapter 9 ("The Trouble with Distributed Systems") is the primary source for Lessons 1 and 2. Chapter 10 ("Consistency and Consensus") opens the framing for Lesson 3.
- Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" (Communications of the ACM, July 1978) — the canonical reference for logical clocks. Strongly recommended supplemental reading.
- Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design" (IEEE Computer, February 2012) — the original PACELC paper.
- Gilbert & Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services" (SIGACT News, June 2002) — the formal proof of CAP.
Source notes on individual lessons flag where content has been synthesized beyond the available source material and should be verified before publication.
Lesson 1: The Unreliable Network
Context
At 03:47Z, the Pacific ground station reported that satellite MSS-23 had stopped acknowledging telemetry pulls. The on-call engineer paged the constellation team, who attempted a manual command session and got no response for forty-eight seconds — then the satellite returned to operation as if nothing had happened. There was no failure log on the satellite, no dropped link on the ground side, and no recoverable trace on the relay. By the time the incident report was filed, three different teams had arrived at three different conclusions about what had failed.
This is the operating reality of the Constellation Network. Forty-eight LEO satellites communicate with twelve ground stations through asynchronous, lossy packet networks. From any single node's perspective — satellite or ground — a missing acknowledgment is fundamentally ambiguous: the request may have been lost, the reply may have been lost, the remote node may be down, the remote node may be processing slowly, or the remote node may have processed successfully and crashed before replying. The network gives you no way to tell these apart from the outside. This is what DDIA calls the partial failure model, and it is the foundational truth that every other lesson in this track builds on.
Before you can build consensus, replication, or fault tolerance, you have to internalize what the network actually guarantees — which is almost nothing. This lesson is the foundation. The eight fallacies of distributed computing are not historical curiosities; they describe the specific ways production engineers continue to write code that breaks under partial failure. By the end of this lesson, you should be able to read a piece of Rust code that talks to another node over the network and identify, by inspection, what it assumes that the network does not actually guarantee.
Core Concepts
Partial Failure and the Asymmetry of Knowledge
On a single machine, computation is deterministic in failure: either the program runs or it crashes outright. There is no in-between state where some of the registers have updated and others have not. Operating systems and hardware go to significant lengths to maintain this illusion — a CPU prefers to halt with a kernel panic rather than return a wrong result.
Distributed systems shatter this illusion. A "node" in your system is not the satellite's CPU — it is the satellite plus the network path that connects it to whoever is asking. That composite system can be in states the local CPU cannot: the satellite can be alive and well, processing your command, but the response packet can be sitting in a router queue that has just exceeded its memory limit and is silently dropping packets. The local satellite did its job. The local CPU is fine. The system as a whole has failed in a way that has no analog in single-node programming.
The consequence is an asymmetry of knowledge: a node always knows more about its own state than any other node can ever know about it. From the outside, you observe only what arrives over the network within some window of time. Everything else — what the remote node was doing, whether it received your message, whether it is still running — is an inference, never an observation.
The Timeout Is Your Only Failure Detector
If you cannot directly observe a remote node's state, how do you decide whether to give up on it? The standard answer in practice — and the only one that is actually implementable on an asynchronous network — is the timeout. You send a request, you wait some duration, and if no reply arrives in that window, you treat the request as failed.
This sounds simple but encodes a deep tradeoff. Choose the timeout too short and you will falsely declare healthy nodes dead during normal queueing delays, causing spurious failovers, duplicate work, and cascading retries that can amplify the original load spike. Choose the timeout too long and you will leave clients hanging through real failures, blocking ground station operators and missing pass windows. There is no universally correct value. The right timeout depends on the network's tail latency distribution, the cost of a false-positive failure declaration, and the cost of a delayed failure declaration — all of which vary by workload.
Critically, a timeout does not tell you that the remote node failed. It tells you that you stopped waiting. The remote may complete the request five milliseconds after you give up, send a reply you never see, and continue operating in a state inconsistent with what you now believe. This gap — between "I declared you dead" and "you actually are dead" — is the source of an enormous class of distributed systems bugs, and we will return to it repeatedly in this track when we cover fencing tokens, leader leases, and split-brain scenarios.
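One way to make "calibrate against tail latency" concrete is sketched below. It assumes you keep a window of recent round-trip samples and derive the timeout as a high percentile multiplied by a safety factor. The function name, the nearest-rank percentile, and the multiplier are all illustrative assumptions, not Meridian policy.

```rust
use std::time::Duration;

/// Hypothetical helper: derive a timeout from observed round-trip times.
/// Returns `None` when there are no samples to calibrate against.
pub fn calibrated_timeout(
    samples: &[Duration],
    percentile: f64,
    safety_factor: u32,
) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` of all samples are less than or equal to it.
    let p = percentile.clamp(0.0, 1.0);
    let rank = ((p * sorted.len() as f64).ceil() as usize).max(1);
    // The safety factor leaves headroom above the observed tail so that
    // ordinary queueing delay does not trip the failure detector.
    Some(sorted[rank - 1] * safety_factor)
}
```

A short percentile with a small factor biases toward fast failure detection and more false positives; a high percentile with a large factor does the opposite. The function only makes that choice explicit; it does not make it for you.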
The Eight Fallacies of Distributed Computing
The list was compiled at Sun Microsystems in the 1990s, but every fallacy still appears in production code today. They are the implicit assumptions that programmers make when they treat a network call as if it were a local function call:
1. The network is reliable. Packets are dropped, links flap, NICs corrupt frames, and a backhoe will eventually sever your fiber. RPC frameworks that retry transparently mask this, at the cost of duplicate operations.
2. Latency is zero. A round trip across the constellation can be hundreds of milliseconds. Code that issues N sequential calls instead of one batched call will be N× slower.
3. Bandwidth is infinite. Satellite uplinks are bandwidth-constrained; pushing a megabyte of debugging output through a kilobit channel will starve the actual mission data.
4. The network is secure. Adversaries can replay packets, observe traffic, and inject frames. Anything that matters needs authentication and integrity, not just transport.
5. Topology doesn't change. Satellites enter and leave coverage; ground stations rotate; relay paths shift. Any static configuration of node addresses will be wrong within hours.
6. There is one administrator. Different ground stations are operated by different agencies. There is no global authority who can fix anything you need fixed.
7. Transport cost is zero. Each round trip costs CPU on both ends, plus serialization, plus the wire itself. Naïve serialization formats (JSON for high-throughput telemetry) will dominate CPU usage.
8. The network is homogeneous. Some links are 10 Gbps fiber; others are 9.6 kbps S-band over polar regions with 600 ms of latency.
When you read code, look for these assumptions as implicit invariants the author is relying on. A .unwrap() after a send() assumes (1). A request handler that processes N sub-requests in serial assumes (2). A configuration file that lists peer addresses assumes (5).
Indistinguishable Failures: Why You Cannot Tell What Broke
When a request times out, the actual cause could be any of these, and you cannot distinguish them from outside:
1. The request was lost in flight. The remote node never saw it. Retrying is safe (with caveats around idempotency).
2. The remote node received and processed the request, but the response was lost. Retrying will perform the operation a second time.
3. The remote node received the request but crashed before processing. Retrying is safe.
4. The remote node received the request, processed it, and then crashed before responding. Like case 2, retrying will duplicate the operation.
5. The remote node received the request and is still processing it, just slowly (long GC pause, swapping, IRQ storm). Retrying may cause two concurrent executions of the same operation on the remote node.
6. The remote node is partitioned from you but not from other peers. It is doing useful work for others, and may have a quorum that does not include you. Retrying from your side will fail; the operation may still complete via another path.
Notice that GC pauses sit in this list. A process pause of several seconds — common with stop-the-world JVM collectors, but also possible in Rust if you call into a library that takes a global mutex in a long-running thread — is indistinguishable from the process being dead. This is why Cassandra's failure detector treats unresponsiveness, not termination signals, as the operational notion of failure.
The practical implication is that you cannot write "retry only if the remote node died" code, because you cannot know whether it died. You can only write "retry if I didn't hear back in time, and design every operation so that double-execution is harmless." This is idempotency, and it is non-negotiable in distributed systems.
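The sketch below restates the case list as types, to show why no caller-side branch can recover the cause. Every variant maps to the same observation, while the property you actually care about (whether the remote may have executed the operation) differs between them. The enum and function names are hypothetical and exist only for this illustration.

```rust
/// The six causes from the list above, as seen by an omniscient observer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Cause {
    RequestLost,
    ResponseLost,
    CrashedBeforeProcessing,
    CrashedAfterProcessing,
    SlowRemote,
    Partitioned,
}

/// What the caller can actually observe. There is only one variant,
/// which is the whole point.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Observation {
    NoReplyBeforeTimeout,
}

/// The caller's view: the cause is erased.
pub fn what_the_caller_sees(_cause: Cause) -> Observation {
    Observation::NoReplyBeforeTimeout
}

/// The omniscient view: could a blind retry run the operation twice?
pub fn may_have_executed(cause: Cause) -> bool {
    matches!(
        cause,
        Cause::ResponseLost
            | Cause::CrashedAfterProcessing
            | Cause::SlowRemote
            | Cause::Partitioned
    )
}
```

Because `what_the_caller_sees` discards its argument, any retry decision written on the caller's side has to be correct for all six causes at once.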
Code Examples
A Telemetry Client That Assumes Too Much
A first-pass Rust port of the Meridian control plane's legacy Python client looked roughly like this. It is a sketch, but the structure is real:

```rust
// SCENARIO: First-pass port of the legacy Python ground station client.
// This is wrong in multiple ways. We'll dissect it.
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

pub async fn fetch_telemetry(satellite_id: u32) -> Vec<u8> {
    let mut stream = TcpStream::connect("relay.meridian.internal:7400")
        .await
        .unwrap(); // Fallacy 1: assumes the network is reliable.

    let request = build_request(satellite_id);
    stream.write_all(&request).await.unwrap(); // Fallacy 1 again.

    let mut response = Vec::new();
    stream.read_to_end(&mut response).await.unwrap();
    // No timeout. If the relay hangs, this future never completes.
    response
}

fn build_request(_satellite_id: u32) -> Vec<u8> {
    // Encoding omitted; stubbed so the example compiles.
    Vec::new()
}
```
Three failures hide in nine lines. The connect call has no timeout: a connection attempt whose SYN goes unanswered can sit in SYN_SENT for minutes on Linux defaults before the kernel gives up, during which your task is wedged. The write_all call has no timeout and .unwrap()s on error, propagating the implicit assumption that writes always succeed. The read_to_end call has no timeout and no application-level framing. It relies on the remote closing the connection to terminate the read, which means a remote that processes the request but then hangs will block this caller indefinitely.
In production, this exact shape is what produces the symptom that an alert dashboard summarizes as "telemetry latency went to infinity at 03:47". The underlying node didn't crash. The relay was healthy. A single TCP connection got stuck in a state the client could not detect, and the client's task never returned to the runtime.
A Telemetry Client That Acknowledges Reality
```rust
// PRODUCTION: same fetch, but explicit about every failure mode it can see.
use std::time::Duration;

use anyhow::{Context, Result};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;
use tokio::time::timeout;

const CONNECT_TIMEOUT: Duration = Duration::from_secs(2);
const REQUEST_TIMEOUT: Duration = Duration::from_secs(5);
// Upper bound on a reply frame. The 16 MiB value is an illustrative
// assumption; the real limit belongs to the protocol definition.
const MAX_RESPONSE_LEN: usize = 16 * 1024 * 1024;

pub async fn fetch_telemetry(satellite_id: u32) -> Result<Vec<u8>> {
    // Each I/O operation has its own timeout. The connect timeout is shorter
    // than the request timeout: failing to establish TCP at all is a stronger
    // signal of failure than a slow response.
    let mut stream = timeout(
        CONNECT_TIMEOUT,
        TcpStream::connect("relay.meridian.internal:7400"),
    )
    .await
    .context("connect timed out")?
    .context("connect failed")?;

    // We wrap the request/response cycle in a single timeout because partial
    // progress (writing the request but never receiving a reply) is still a
    // failure from the caller's perspective. The caller does not care which
    // half got stuck; they care that the operation did not complete.
    let response = timeout(REQUEST_TIMEOUT, async {
        let request = build_request(satellite_id);
        stream.write_all(&request).await?;

        // Length-prefixed framing: the protocol declares the size of the reply
        // before sending it, so we know exactly how many bytes to read instead
        // of waiting for the connection to close.
        let mut len_buf = [0u8; 4];
        stream.read_exact(&mut len_buf).await?;
        let len = u32::from_be_bytes(len_buf) as usize;

        // The length came off the wire, so bound it before allocating.
        anyhow::ensure!(
            len <= MAX_RESPONSE_LEN,
            "reply of {} bytes exceeds frame limit",
            len
        );

        let mut response = vec![0u8; len];
        stream.read_exact(&mut response).await?;
        anyhow::Ok(response)
    })
    .await
    .context("request timed out")??;

    Ok(response)
}

fn build_request(_satellite_id: u32) -> Vec<u8> {
    // Encoding omitted; in production this uses a versioned binary protocol.
    Vec::new()
}
```
Notice what this version still doesn't do: it doesn't retry. That's deliberate. Retry policy is a higher-level concern than this function — it depends on whether the operation is idempotent, what the caller's deadline budget is, and whether duplicate execution on the remote node is harmful. A reusable transport function should surface the failure cleanly and let the policy layer above it decide. We will return to retry policy in the Fault Tolerance module; for now, the win here is that every observable failure mode produces a distinct, actionable error rather than an infinite wait.
The Indistinguishability Problem in Code
To make the indistinguishability concrete: here is a snippet that times out, retries, and in doing so may re-send a command the remote had already executed.

```rust
// PROBLEM: at-least-once delivery without idempotency.
// What happens if the original request *did* succeed on the remote?
use std::time::Duration;

use anyhow::Result;

pub struct Command;

pub async fn enqueue_command(cmd: Command) -> Result<()> {
    for attempt in 0..3 {
        match send_with_timeout(&cmd, Duration::from_secs(2)).await {
            Ok(()) => return Ok(()),
            Err(_) if attempt < 2 => {
                // The remote may have processed `cmd`. Or not. We can't tell.
                // Retrying enqueues it AGAIN if it did succeed, creating a
                // duplicate. This is a bug if `cmd` is "fire thruster for 3s".
                tokio::time::sleep(Duration::from_millis(200)).await;
                continue;
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!()
}

async fn send_with_timeout(_c: &Command, _d: Duration) -> Result<()> {
    // Stub so the example compiles; the real version performs network I/O.
    Ok(())
}
```
The standard fix is idempotency: every command carries a unique command_id, and the remote node tracks which IDs it has already executed. Retrying the same command_id is a no-op on the remote. This shifts the problem from "did the network deliver my message" — which is unanswerable — to "has the remote seen this ID before" — which is local and tractable.
This is the pattern that allows the rest of the system to work despite the network's failures. Once you accept that the network will deliver some messages zero times, some once, and some many times, you stop trying to control delivery and start designing operations whose semantics are independent of how many times they execute.
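A minimal sketch of the receiver side of that pattern follows. It assumes the sender picks a `command_id` once and reuses it unchanged on every retry. The `Command` fields, the `Receiver` type, and the in-memory set are hypothetical; a real receiver would need the seen-ID table to be durable and bounded.

```rust
use std::collections::HashSet;

/// Hypothetical command envelope. `command_id` is chosen once by the
/// sender and reused unchanged on every retry of the same command.
#[derive(Debug, Clone)]
pub struct Command {
    pub command_id: u64,
    pub burn_seconds: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Executed,
    Duplicate,
}

/// Receiver-side state. The set of seen IDs is in memory here purely for
/// illustration; it would not survive a restart.
#[derive(Default)]
pub struct Receiver {
    seen: HashSet<u64>,
    pub total_burn_seconds: u32,
}

impl Receiver {
    /// Executes `cmd` at most once per `command_id`.
    pub fn handle(&mut self, cmd: &Command) -> Outcome {
        // `insert` returns false when the ID was already present, so the
        // check and the record happen in one step.
        if !self.seen.insert(cmd.command_id) {
            return Outcome::Duplicate;
        }
        self.total_burn_seconds += cmd.burn_seconds;
        Outcome::Executed
    }
}
```

With this in place, the retry loop in `enqueue_command` becomes safe: a second delivery of the same "fire thruster for 3s" command is acknowledged but not executed again.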
Key Takeaways
- A node has direct knowledge only of itself; everything else is an inference from messages observed over a lossy network with unbounded delay. Code that conflates "remote node alive" with "I received a recent message from it" will be wrong during every transient delay.
- The timeout is the only practical failure detector on an asynchronous network. A timeout firing does not mean the remote node failed — it means you decided to stop waiting. Choose timeouts deliberately, calibrate them against your tail latency distribution, and design for the case where the remote completes the operation after you give up.
- All of (a) lost request, (b) lost response, (c) crashed remote, (d) slow remote, (e) partitioned remote produce the same observable symptom: no reply. You cannot distinguish them from outside. Stop writing code that tries to.
- Idempotency is not a nice-to-have. It is the mechanism by which at-least-once message delivery (which is what retrying over an unreliable network gives you) becomes safe to use. Every command in the constellation control plane should carry a unique ID, and every receiver should be safe under double-delivery.
- The eight fallacies are not a historical artifact. They describe the implicit assumptions in production code right now. When you review a colleague's PR that does anything across the network, look for which fallacies it has not yet stopped believing.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 9, "The Trouble with Distributed Systems" — specifically the sections "Faults and Partial Failures" and "Unreliable Networks." The eight fallacies of distributed computing originate with Peter Deutsch and James Gosling at Sun Microsystems (c. 1994); this lesson restates them but the framing is synthesized from training knowledge. Any specific historical claims about Sun, Deutsch, or Gosling should be verified before publication.
Lesson 2: Clocks, Ordering, and Causality
Context
At 04:12Z, the Constellation Operations dashboard showed two conflicting attitude updates for MSS-17 arriving at the catalog within 80 milliseconds of each other. One was tagged with timestamp 04:11:59.428Z from the Pacific ground station; the other was tagged 04:12:00.012Z from the Indian Ocean station. The Pacific update — the "older" one by wall-clock — overwrote the Indian Ocean update because it arrived second. Three orbits later, an operator noticed that the satellite was responding to commands as if its attitude was still in the earlier state. The catalog had silently lost the update.
The root cause was not a network failure. Both messages were delivered intact. The root cause was that the Pacific ground station's NTP daemon had drifted forward by 600 milliseconds, and the catalog's last-write-wins resolver — which trusted wall-clock timestamps to order writes — had been handed a lie that it had no way to detect. The two writes were concurrent in the causal sense: neither had observed the other before being submitted. But the system flattened them into a total order on the basis of a clock reading that did not correspond to any real ordering of events.
This is the second foundational truth of distributed systems: time itself is unreliable. Wall clocks on different machines disagree, and they disagree silently. There is no global "now." Once you accept this, the question becomes: how do you order events at all? The answer — and it is a deep one — is that you don't order them by time. You order them by causality, using mechanisms that do not depend on physical clocks. This lesson introduces the three flavors of clock available to a distributed system, explains when each is safe and when each lies, and walks through the logical-clock algorithms (Lamport, vector) that make causal ordering possible without trusting wall time.
Core Concepts
Time-of-Day Clocks vs Monotonic Clocks
Every modern operating system exposes two distinct clock APIs, and they serve different purposes. Confusing them is the source of an enormous amount of distributed-systems pain.
A time-of-day clock (std::time::SystemTime in Rust, clock_gettime(CLOCK_REALTIME) on Linux) returns the current wall-clock time, synchronized to UTC via NTP, PTP, or some other time source. It is useful for displaying timestamps to humans and for cross-machine comparison, provided you accept that the comparison is approximate. The catch is that this clock can jump backward: if NTP determines that the local clock is too far ahead of the reference, it can step the clock back rather than slewing it gradually. A naïve now() - earlier_now() calculation on a wall clock can therefore come out negative; in Rust, SystemTime::duration_since returns an Err in that case.
A monotonic clock — std::time::Instant in Rust, clock_gettime(CLOCK_MONOTONIC) — is guaranteed to never go backward. It has no defined relationship to UTC; the value is meaningless across processes or machines. But within a single process, the difference between two Instant values is a reliable measure of elapsed time. This is the clock to use for timeouts, durations, and rate limiting.
The rule is simple: wall clocks for "when," monotonic clocks for "how long." A timeout written with SystemTime can fire instantly or never if NTP adjusts the clock during the wait. The Rust standard library reinforces this by making Instant::elapsed() infallible while making SystemTime::elapsed() return a Result.
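The difference shows up directly in the standard library's types. The sketch below uses only `std::time`; the helper names are hypothetical. Note that the wall-clock helper is forced to return an `Option`, while the monotonic helpers are not.

```rust
use std::time::{Duration, Instant, SystemTime};

/// Elapsed time on the monotonic clock. Infallible: an `Instant` taken
/// earlier in this process is never later than "now".
pub fn elapsed_monotonic(start: Instant) -> Duration {
    start.elapsed()
}

/// Elapsed time between two wall-clock readings. Returns `None` when
/// `now` is earlier than `earlier`, which is what a backward clock step
/// looks like to the caller.
pub fn elapsed_wall(earlier: SystemTime, now: SystemTime) -> Option<Duration> {
    now.duration_since(earlier).ok()
}

/// Deadline check for a timeout, written against the monotonic clock so
/// that a wall-clock adjustment cannot make it fire early or never.
pub fn deadline_passed(deadline: Instant) -> bool {
    Instant::now() >= deadline
}
```

If a function signature in a timeout path mentions `SystemTime`, that is usually worth a review comment.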
Why NTP Cannot Save You
NTP is the protocol that synchronizes wall clocks across the network. It works by sampling reference servers, estimating round-trip time, and adjusting the local clock to compensate. Under good conditions on a stable network, NTP can hold machines within a few milliseconds of each other.
"Good conditions" is doing a lot of work in that sentence. NTP synchronization can fail or drift for many reasons: the network path can be congested (which adds skew because RTT estimation is wrong); the reference server can be wrong (which has happened — leap second bugs have taken down entire fleets); the local clock crystal can drift faster than NTP can compensate (typical drift is around 10 ppm, or about 1 second per day); a virtualized clock in a paused VM can lag behind reality and then jump forward when the VM resumes.
DDIA documents specific cases: Google's MapReduce engineers found that clocks could be off by tens of milliseconds during normal operation and seconds during synchronization storms. Cloudflare experienced an outage at the start of 2017 when the inserted leap second made an elapsed-time calculation come out negative in code that assumed wall-clock time never goes backward. The takeaway is not that NTP is bad; it is that NTP gives you approximate synchronization, never guarantees, and any algorithm whose correctness depends on tight clock synchronization is one bad NTP day away from misbehaving.
The cleanest summary, from DDIA: a clock reading is best thought of as a value with a confidence interval, not a point. Google's Spanner database makes this explicit through its TrueTime API, which returns TT.now() = (earliest, latest). Algorithms that need to compare timestamps across machines wait out the uncertainty interval before committing — they trade latency for correctness. Most systems do not have a TrueTime equivalent, and they pay for the missing uncertainty in subtler ways, like the lost MSS-17 attitude update.
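To see what "a value with a confidence interval" buys you, here is a minimal sketch of an interval timestamp applied to the MSS-17 numbers. It is modeled loosely on the (earliest, latest) shape described above and is not Spanner's API. The type, the microsecond unit, and the choice of error bound are assumptions made for illustration.

```rust
/// Hypothetical timestamp with explicit uncertainty, in microseconds
/// since some shared epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeInterval {
    pub earliest: u64,
    pub latest: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntervalOrder {
    DefinitelyBefore,
    DefinitelyAfter,
    Uncertain,
}

impl TimeInterval {
    /// Widen a single clock reading by the assumed maximum clock error.
    pub fn from_reading(reading_us: u64, max_error_us: u64) -> Self {
        Self {
            earliest: reading_us.saturating_sub(max_error_us),
            latest: reading_us.saturating_add(max_error_us),
        }
    }

    /// Two readings are ordered only if their intervals do not overlap.
    pub fn compare(&self, other: &Self) -> IntervalOrder {
        if self.latest < other.earliest {
            IntervalOrder::DefinitelyBefore
        } else if other.latest < self.earliest {
            IntervalOrder::DefinitelyAfter
        } else {
            IntervalOrder::Uncertain
        }
    }
}
```

The two MSS-17 timestamps were 584 ms apart. With an assumed error bound of 600 ms per clock the intervals overlap, so the comparison reports `Uncertain` instead of silently declaring the Pacific write older.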
Lamport Logical Clocks: Counting Events, Not Seconds
If wall clocks are untrustworthy, can we order events without consulting them at all? Yes — and Leslie Lamport showed how in 1978. The construction is so simple it looks like a trick.
Each node maintains a single counter, initially zero. The counter increments on every local event. On sending a message, the sender attaches the current counter value. On receiving a message, the receiver sets its counter to max(local_counter, received_counter) + 1. That's it.
The result is that for any two events a and b where a causally precedes b (the happens-before relation, written a → b), the Lamport timestamp of a is strictly less than the Lamport timestamp of b. The clock captures causality precisely enough to guarantee that if a → b, then L(a) < L(b). It does not give you the converse: L(a) < L(b) does not imply a → b, because two concurrent events can happen to have a particular ordering of timestamps.
Lamport clocks give you a total order on events (every pair of events has a defined ordering, with ties broken by node ID) that respects causality. They cannot tell you which pairs are concurrent. For many systems this is enough: a Lamport timestamp is sufficient to deterministically resolve "last write wins" in a way that is independent of wall clocks. Contrast this with Cassandra, whose last-write-wins resolution compares wall-clock timestamps supplied by the client or coordinator and is therefore exposed to exactly the kind of skew that lost the MSS-17 update.
Vector Clocks: Detecting Concurrency
When you do need to detect concurrency — when "these two events are unrelated and a human needs to make a decision" is a different outcome than "this one happened first" — you need a vector clock. Instead of a single counter per node, each node maintains a vector of counters, one per node in the system.
The update rule extends Lamport's: on a local event, the node increments its own slot in the vector. On sending, it attaches the entire vector. On receiving, the receiver does an element-wise max with the received vector and then increments its own slot.
Vector clocks support a richer comparison: given two vector clocks V1 and V2, V1 < V2 means that every component of V1 is less than or equal to the corresponding component of V2 and at least one is strictly less, in which case V1 happened-before V2. If the clocks are not equal and neither V1 < V2 nor V2 < V1 holds, the events are concurrent: they have no causal relationship, and the system must either surface the conflict to the application or pick a deterministic tiebreaker that is independent of wall clocks.
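The update rule and the comparison fit naturally onto Rust's `PartialOrd`, where `None` is exactly "concurrent". The sketch below assumes a fixed cluster of four nodes addressed by index, matching the size of the module project; the `VClock` type and its method names are hypothetical.

```rust
use std::cmp::Ordering;

/// Cluster size for this sketch (an assumption: a fixed 4-node cluster).
pub const NODES: usize = 4;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct VClock(pub [u64; NODES]);

impl VClock {
    /// Local event (including a send) on node `me`.
    pub fn tick(&mut self, me: usize) {
        self.0[me] += 1;
    }

    /// Receive on node `me`: element-wise max with the sender's vector,
    /// then increment our own slot.
    pub fn receive(&mut self, me: usize, sender: &VClock) {
        for (mine, theirs) in self.0.iter_mut().zip(sender.0.iter()) {
            *mine = (*mine).max(*theirs);
        }
        self.tick(me);
    }
}

impl PartialOrd for VClock {
    /// `Some(Less)` means happened-before; `None` means concurrent.
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        let mut less = false;
        let mut greater = false;
        for (a, b) in self.0.iter().zip(other.0.iter()) {
            if a < b {
                less = true;
            }
            if a > b {
                greater = true;
            }
        }
        match (less, greater) {
            (false, false) => Some(Ordering::Equal),
            (true, false) => Some(Ordering::Less),
            (false, true) => Some(Ordering::Greater),
            (true, true) => None,
        }
    }
}
```

One consequence of using `PartialOrd`: for concurrent clocks both `a < b` and `b < a` are false, so code that treats "not less" as "greater or equal" is wrong. Match on `partial_cmp` when concurrency matters.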
The cost is space: vector clocks grow linearly with the number of nodes in the system. For a 48-satellite constellation plus 12 ground stations plus dozens of services, this is tractable. For a system with millions of clients, it isn't, and there are alternatives that bound the size: dotted version vectors keep one entry per server replica rather than per client, and hybrid logical clocks are constant-size but, like Lamport clocks, cannot detect concurrency. We will meet hybrid logical clocks in Module 3 when we look at how Spanner-style systems combine physical and logical time.
Knowing What You Cannot Know: The Truth About Distributed State
A useful framing from DDIA: in a distributed system, the notion of "the current state" is itself suspect. You do not directly observe state; you observe messages that purport to describe state, with some unknown delay, from a sender whose own view of state is itself derived from messages, and so on. There is no privileged observer.
This is why the rest of the track is structured the way it is. Consensus algorithms (Module 3) exist because no single node's view of the world is authoritative. Coordination primitives like leases (Module 5) exist because two nodes can simultaneously believe they hold an exclusive resource. Failure detection (Module 4) is fundamentally probabilistic because "this node has not responded" is the only signal available, and it does not distinguish "dead" from "slow."
For now, the takeaway is concrete: wall clocks cannot order events across machines, NTP does not fix this, and you have a choice between logical clocks (which order causally but don't relate to physical time) or accepting that some pairs of events are inherently unordered and surfacing concurrency to the caller. Choose deliberately. The MSS-17 catalog made the choice silently, and it cost an attitude update.
Code Examples
A Lamport Clock for the Catalog
The catalog needs to order updates from multiple ground stations without trusting their wall clocks. A Lamport clock is sufficient if the only requirement is "deterministic last-write-wins across the cluster."
```rust
use std::cmp::max;
use std::sync::atomic::{AtomicU64, Ordering};

pub struct LamportClock {
    // AtomicU64 gives us lock-free updates from multiple tasks. The value is
    // the local notion of logical time: it advances on local events and on
    // observing higher timestamps from incoming messages.
    counter: AtomicU64,
}

impl LamportClock {
    pub fn new() -> Self {
        Self { counter: AtomicU64::new(0) }
    }

    /// Called for any local event that should advance logical time. Returns
    /// the timestamp that should be attached to the resulting message or
    /// catalog write.
    pub fn tick(&self) -> u64 {
        // fetch_add returns the *previous* value; we want the new one.
        self.counter.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Called when receiving a message tagged with `incoming`. Advances the
    /// local clock to max(local, incoming) + 1, which is the Lamport update
    /// rule. Returns the new local value.
    pub fn observe(&self, incoming: u64) -> u64 {
        // We use a CAS loop rather than fetch_max because we need to add 1
        // after taking the max. fetch_max alone would not advance past
        // 'incoming' itself when local < incoming.
        loop {
            let current = self.counter.load(Ordering::SeqCst);
            let new = max(current, incoming) + 1;
            if self
                .counter
                .compare_exchange(current, new, Ordering::SeqCst, Ordering::SeqCst)
                .is_ok()
            {
                return new;
            }
            // Lost the race; retry. This is fine for low contention; if many
            // tasks observe simultaneously we'd switch to a more sophisticated
            // structure, but for ground-station message rates this is plenty.
        }
    }
}

fn main() {
    let clock = LamportClock::new();
    let t1 = clock.tick(); // local event
    let t2 = clock.observe(42); // received message with ts=42
    let t3 = clock.tick(); // another local event
    println!("{t1} {t2} {t3}"); // prints: 1 43 44
}
```
Two things to notice. First, the timestamp 42 from the incoming message forced our local clock forward, even though we had only emitted timestamp 1 locally. This is the mechanism that makes causality work: if a remote saw something we haven't seen yet, our clock advances past it. Second, the clock returns only the counter. To get a total order, pair each timestamp with the ID of the node that issued it and compare (timestamp, node_id) tuples: ties are broken by node ID, which guarantees a deterministic global ordering even when two events get the same logical time.
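A minimal sketch of that (timestamp, node_id) ordering, and of last-write-wins built on it, is below. The `Stamp` type, the `u16` node ID, and the `lww` helper are hypothetical names for illustration; how node IDs are assigned is outside the scope of the sketch.

```rust
/// Hypothetical stamped write: a Lamport counter paired with the ID of the
/// node that produced it. Field order matters: the derived `Ord` compares
/// `counter` first and consults `node_id` only to break ties.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Stamp {
    pub counter: u64,
    pub node_id: u16,
}

/// Last-write-wins over stamps. The larger stamp wins on every replica,
/// regardless of the order in which the two writes arrived.
pub fn lww<'a, T>(a: (Stamp, &'a T), b: (Stamp, &'a T)) -> &'a T {
    if a.0 >= b.0 {
        a.1
    } else {
        b.1
    }
}
```

Because the result depends only on the stamps, swapping the arguments (that is, swapping the arrival order) does not change the winner. The choice between two writes with equal counters is arbitrary but identical everywhere, which is all last-write-wins promises.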
A Vector Clock for Conflict Detection
When concurrency is a meaningful outcome — "these two ground stations issued attitude commands that haven't seen each other; the operator needs to choose" — a Lamport clock will silently pick one. A vector clock surfaces the conflict.
use std::cmp::max;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VectorClock {
    // Map from node_id to that node's counter as last observed by us.
    // Missing entries are treated as zero, which lets new nodes join without
    // requiring a global preallocation.
    counters: HashMap<String, u64>,
}

#[derive(Debug, PartialEq)]
pub enum Ordering3 {
    Before,     // self < other (self happened before other)
    After,      // self > other (other happened before self)
    Equal,      // identical
    Concurrent, // neither precedes the other - true concurrency
}

impl VectorClock {
    pub fn new() -> Self {
        Self { counters: HashMap::new() }
    }

    /// Local event on node `id`. Increments only this node's slot.
    pub fn tick(&mut self, id: &str) {
        *self.counters.entry(id.to_string()).or_insert(0) += 1;
    }

    /// Merge another node's clock into ours (called on message receive).
    /// Element-wise max, then we'd typically call tick() afterward.
    pub fn merge(&mut self, other: &VectorClock) {
        for (node, &count) in &other.counters {
            let entry = self.counters.entry(node.clone()).or_insert(0);
            *entry = max(*entry, count);
        }
    }

    /// Compare two clocks. Returns Concurrent if neither dominates - the
    /// caller now knows this is a real conflict, not just a stale reading.
    pub fn compare(&self, other: &VectorClock) -> Ordering3 {
        let mut self_greater_anywhere = false;
        let mut other_greater_anywhere = false;
        let all_nodes: std::collections::HashSet<&String> =
            self.counters.keys().chain(other.counters.keys()).collect();
        for node in all_nodes {
            let s = self.counters.get(node).copied().unwrap_or(0);
            let o = other.counters.get(node).copied().unwrap_or(0);
            if s > o { self_greater_anywhere = true; }
            if o > s { other_greater_anywhere = true; }
        }
        match (self_greater_anywhere, other_greater_anywhere) {
            (false, false) => Ordering3::Equal,
            (true, false) => Ordering3::After,
            (false, true) => Ordering3::Before,
            (true, true) => Ordering3::Concurrent,
        }
    }
}

fn main() {
    // MSS-17 attitude updates from Pacific and Indian Ocean stations.
    // Neither has observed the other's update before submitting; both
    // start from the same baseline (catalog version).
    let mut pacific = VectorClock::new();
    pacific.tick("catalog"); // baseline, version 1
    let mut indian_ocean = pacific.clone(); // both fork from version 1

    pacific.tick("pacific"); // local update from Pacific
    indian_ocean.tick("indian_ocean"); // concurrent local update

    match pacific.compare(&indian_ocean) {
        Ordering3::Concurrent => {
            // The catalog should surface this to the operator rather than
            // silently picking a winner based on wall-clock arrival order.
            println!("CONCURRENT - operator decision required");
        }
        other => println!("ordered: {:?}", other),
    }
}
The Pacific clock is {catalog: 1, pacific: 1} and the Indian Ocean clock is {catalog: 1, indian_ocean: 1}. Neither dominates — Pacific has a pacific count that Indian Ocean doesn't, and vice versa. compare returns Concurrent. In the real catalog, this is the signal that the system must store both updates as siblings (the way Riak does) or escalate to a human, rather than picking one based on the lie of a wall clock.
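The opposite case, where one update has actually seen the other, is worth seeing next to the concurrent one. The sketch below uses a compact stand-in type (`Vc`, a hypothetical name, backed by a `BTreeMap` for stable debug output) and a `receive` helper that applies the merge-then-tick rule described in the `merge` doc comment. It is an illustration of the rule, not the catalog's implementation.

```rust
use std::collections::BTreeMap;

/// Compact stand-in for the lesson's VectorClock (hypothetical name).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Vc(BTreeMap<String, u64>);

impl Vc {
    /// Local event on node `id`.
    pub fn tick(&mut self, id: &str) {
        *self.0.entry(id.to_string()).or_insert(0) += 1;
    }

    /// Element-wise max with another clock.
    pub fn merge(&mut self, other: &Vc) {
        for (node, &count) in &other.0 {
            let slot = self.0.entry(node.clone()).or_insert(0);
            *slot = (*slot).max(count);
        }
    }

    /// Receive rule: merge the sender's clock, then record the receive
    /// itself as a local event on `id`.
    pub fn receive(&mut self, id: &str, sender: &Vc) {
        self.merge(sender);
        self.tick(id);
    }

    fn get(&self, node: &str) -> u64 {
        self.0.get(node).copied().unwrap_or(0)
    }

    /// True if no slot of `self` exceeds the matching slot of `other` and at
    /// least one slot of `other` is strictly larger.
    pub fn happened_before(&self, other: &Vc) -> bool {
        let none_greater = self.0.iter().all(|(k, &v)| v <= other.get(k));
        let some_smaller = other.0.iter().any(|(k, &v)| v > self.get(k));
        none_greater && some_smaller
    }
}

fn main() {
    let mut pacific = Vc::default();
    pacific.tick("pacific"); // {pacific: 1}
    let sent = pacific.clone(); // clock attached to the outgoing message

    let mut indian = Vc::default();
    indian.tick("indian_ocean"); // {indian_ocean: 1}

    // Before delivery, neither clock dominates: the events are concurrent.
    assert!(!sent.happened_before(&indian) && !indian.happened_before(&sent));

    // After delivery, the receiver's clock dominates the sender's snapshot.
    indian.receive("indian_ocean", &sent); // {indian_ocean: 2, pacific: 1}
    assert!(sent.happened_before(&indian));
    println!("{:?}", indian);
}
```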
Why You Cannot Use SystemTime for This
For completeness, here is the version that fails:
// BROKEN: ordering catalog writes by wall-clock timestamp.
use std::time::SystemTime;
struct Update { ts: SystemTime, value: u32 }
fn resolve(a: Update, b: Update) -> Update {
    if a.ts > b.ts { a } else { b }
}
This is the MSS-17 bug. Two writes from machines with differently-drifted NTP can produce a.ts > b.ts even when b causally happened-after a, or when they are genuinely concurrent and the choice between them should not be silent. The function compiles, passes unit tests on a single machine, and silently corrupts state in production. It is the canonical failure mode of treating physical time as ordering truth in a distributed system.
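The failure can be reproduced deterministically on one machine by simulating the skew. In the sketch below, `skewed_now` is a hypothetical helper that models what a node's wall clock would report; the 600 ms forward skew matches the figure in the incident ticket, while the 1000 ms and 1200 ms true times are made up for illustration.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

#[derive(Debug, Clone, PartialEq)]
struct Update {
    ts: SystemTime,
    value: u32,
}

// Same "larger wall-clock timestamp wins" rule as the broken resolver.
fn resolve(a: Update, b: Update) -> Update {
    if a.ts > b.ts { a } else { b }
}

/// What a node's wall clock reports when true time is `true_ms` after the
/// epoch and the node's clock runs `skew_ms` fast. Illustrative only.
fn skewed_now(true_ms: u64, skew_ms: u64) -> SystemTime {
    UNIX_EPOCH + Duration::from_millis(true_ms + skew_ms)
}

fn main() {
    // Pacific writes first (true time 1000 ms), but its clock is 600 ms fast.
    let stale = Update { ts: skewed_now(1_000, 600), value: 1 };
    // Indian Ocean writes 200 ms later with an accurate clock.
    let newer = Update { ts: skewed_now(1_200, 0), value: 2 };

    // The earlier write wins because its claimed time (1600 ms) is larger
    // than the later write's claimed time (1200 ms).
    let winner = resolve(stale, newer);
    println!("winner value = {}", winner.value); // prints: winner value = 1
}
```

A unit test that builds both timestamps from the same clock never exercises this path, which is why the bug survives single-machine testing.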
Key Takeaways
- A wall clock (SystemTime) can jump backward and disagrees silently across machines; a monotonic clock (Instant) never goes backward but is only meaningful within a single process. Use the first for "when," the second for "how long," and never confuse them.
- NTP synchronization is approximate, not guaranteed. A clock reading is best understood as a value with a confidence interval. Algorithms that compare timestamps across machines without accounting for uncertainty will misbehave during NTP storms, leap seconds, and VM pauses.
- A Lamport clock gives you a total order on events that respects causality, using a single integer counter per node. It is sufficient when "last write wins" is acceptable and concurrency does not need to be detected. It cannot tell you whether two events are concurrent.
- A vector clock can detect concurrency, at the cost of O(n) space per timestamp where n is the number of nodes. Use vector clocks when "these two updates have no causal relationship" is operationally meaningful — for example, when concurrent writes should be surfaced as conflicts rather than silently merged.
- "Order by wall-clock timestamp" is one of the most common implicit assumptions in distributed code, and one of the highest-impact bugs when it fails. If you see if a.ts > b.ts in code that processes data from multiple machines, treat it as a defect until proven otherwise.
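The "value with a confidence interval" takeaway can be made concrete with a small sketch. The `Reading` type, the `compare` function, and the uncertainty figures are illustrative assumptions; nothing here is an API the catalog exposes. The idea is that two readings are ordered only when their intervals do not overlap.

```rust
/// A clock reading with an error bound: true time is assumed to lie in
/// [earliest_ms, latest_ms].
#[derive(Debug, Clone, Copy)]
pub struct Reading {
    pub earliest_ms: u64,
    pub latest_ms: u64,
}

impl Reading {
    pub fn new(reported_ms: u64, uncertainty_ms: u64) -> Self {
        Self {
            earliest_ms: reported_ms.saturating_sub(uncertainty_ms),
            latest_ms: reported_ms.saturating_add(uncertainty_ms),
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Order {
    Before,
    After,
    Uncertain,
}

/// Order two readings only if their intervals are disjoint.
pub fn compare(a: Reading, b: Reading) -> Order {
    if a.latest_ms < b.earliest_ms {
        Order::Before
    } else if b.latest_ms < a.earliest_ms {
        Order::After
    } else {
        Order::Uncertain
    }
}

fn main() {
    // Tight bounds: [950, 1050] and [1150, 1250] do not overlap.
    println!("{:?}", compare(Reading::new(1_000, 50), Reading::new(1_200, 50)));
    // Loose bounds: [700, 1300] and [900, 1500] overlap, so no order is claimed.
    println!("{:?}", compare(Reading::new(1_000, 300), Reading::new(1_200, 300)));
}
```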
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 9, "Unreliable Clocks" and "Knowledge, Truth, and Lies." The Lamport clock construction is from Leslie Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" (Communications of the ACM, 1978) — the algorithm summary is synthesized; the original paper is the canonical reference. The TrueTime API description draws on the Spanner OSDI 2012 paper; specific numbers (Spanner's clock skew bound, NTP drift rates) are illustrative and should be verified against current published values before publication.
Lesson 3: CAP, PACELC, and the Consistency Spectrum
Context
When the Antarctic relay path went down for nine minutes during a Southern Ocean storm, the Constellation Operations team had to make a decision they had not deliberately planned for: should the catalog accept writes from ground stations that could no longer reach a quorum of replicas? The on-call answer was yes — the alternative was rejecting telemetry for nine minutes — but no one had documented what "accept writes during a partition" actually meant for the system's guarantees. By the end of the incident, three ground stations had accepted updates against two divergent replica sets, and reconciliation took another hour of manual work to complete.
This was not a failure of the catalog. It was a failure of the team to have an explicit position on the CAP theorem: the impossibility result that says when a distributed system is partitioned, it must give up either consistency (in a specific technical sense) or availability. CAP is one of the most misunderstood theorems in distributed systems, partly because the standard "CAP triangle" framing — "pick two of three" — is misleading. Partitions are not optional; they are a fact of life on every real network. The actual choice is between consistency and availability during a partition. PACELC, the refinement of CAP introduced by Daniel Abadi in 2010, extends the framing to cover normal operation as well: when there is no partition, you still trade latency against consistency.
This lesson teaches you to characterize a system by its consistency model rather than by its product category. By the end, you should be able to look at a sentence like "our catalog is eventually consistent" and ask the right follow-up questions: eventually consistent under what failure model, with what staleness bound, exhibiting which anomalies under which workloads. The vocabulary in this lesson is what allows the next two modules — Replication and Consensus — to talk precisely about which guarantees a given algorithm provides.
Core Concepts
What CAP Actually Says (and Doesn't Say)
The CAP theorem, formalized by Seth Gilbert and Nancy Lynch in 2002 from Eric Brewer's earlier conjecture, states that a distributed data system cannot simultaneously provide all three of:
- Consistency — in CAP, this specifically means linearizability: every read sees the result of the most recent write that completed before the read began, as if the system were a single non-distributed register.
- Availability — every non-failing node returns a non-error response to every request in a bounded time.
- Partition tolerance — the system continues to operate when the network drops or delays arbitrary messages between nodes.
The "two of three" framing is misleading because partition tolerance is not optional. If you build a real distributed system, the network will partition you sooner or later. The actual theorem says: in the presence of a partition, you must choose between linearizability and availability. You cannot have both, because honoring a write on the unreachable side of a partition either requires waiting for that side to come back (giving up availability) or accepting writes that conflict with the other side (giving up linearizability).
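The decision a replica faces during a partition can be sketched as a single function. The names (`PartitionPolicy`, `WriteOutcome`, `handle_write`) are hypothetical, and the sketch assumes a simple majority quorum in which `reachable` counts the replica itself.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PartitionPolicy {
    PreferConsistency,
    PreferAvailability,
}

#[derive(Debug, PartialEq)]
pub enum WriteOutcome {
    /// Acknowledged by a majority; linearizability is preserved.
    Committed,
    /// Acknowledged without a majority; may conflict with the other side.
    AcceptedLocally,
    /// Refused in order to protect linearizability.
    Rejected,
}

/// Decide what to do with a write given how many replicas (including this
/// one) are currently reachable.
pub fn handle_write(
    policy: PartitionPolicy,
    reachable: usize,
    cluster_size: usize,
) -> WriteOutcome {
    let majority = cluster_size / 2 + 1;
    if reachable >= majority {
        return WriteOutcome::Committed;
    }
    // Minority side of a partition: this is the only place the policy matters.
    match policy {
        PartitionPolicy::PreferConsistency => WriteOutcome::Rejected,
        PartitionPolicy::PreferAvailability => WriteOutcome::AcceptedLocally,
    }
}

fn main() {
    // Five replicas, only two reachable: we are on the minority side.
    println!("{:?}", handle_write(PartitionPolicy::PreferConsistency, 2, 5));
    println!("{:?}", handle_write(PartitionPolicy::PreferAvailability, 2, 5));
}
```

Note that the policy is consulted only on the minority side. With a majority reachable, both policies behave identically, which is why the choice tends to go unnoticed until a partition happens.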
Two more clarifications matter. First, "consistency" in CAP is not the C in ACID. ACID consistency means transactions preserve declared invariants; CAP consistency means a specific linearizability property. The overloading of the word causes endless confusion. Second, CAP is binary in the formal statement, but in practice systems offer a spectrum of consistency models — linearizability is the strongest, but there are many weaker models (sequential, causal, eventual) that are still useful and that don't fall off the CAP cliff in the same way.
PACELC: The Half of the Story CAP Doesn't Tell
CAP only describes behavior during a partition. But in practice, partitions are relatively rare events on a well-engineered network. The system spends most of its time not partitioned, and during that time it still makes tradeoffs — between consistency and latency.
PACELC, proposed by Daniel Abadi, captures this:
If there is a Partition, the system chooses between Availability and Consistency; Else (in normal operation), the system chooses between Latency and Consistency.
The P/A/C side is just CAP. The E/L/C side is new: even with no partition, achieving linearizability requires coordination (a write must be acknowledged by enough replicas to ensure subsequent reads will see it), and coordination costs latency. A system that trades coordination for low latency is fast in the normal case but can return stale reads; a system that pays the coordination tax has consistent reads but slower writes.
PACELC produces a four-cell taxonomy, and real systems map to it cleanly:
| System | Partition behavior | Normal-operation tradeoff | PACELC |
|---|---|---|---|
| etcd, ZooKeeper, Spanner | Reject writes to maintain linearizability | Pay latency for strong consistency | PC/EC |
| Cassandra (QUORUM reads and writes) | Sloppy quorums, hinted handoff | Pay latency for tunable consistency | PA/EC (tunable) |
| DynamoDB (default), Cassandra (ONE) | Accept writes anywhere | Fast reads possible from any replica | PA/EL |
| MongoDB (default, before 4.x) | Continue accepting from primary | Read from primary by default | PA/EC (historical; Abadi's classification) |
The cleanest way to characterize any production data system is to place it on this matrix and ask: do you know which cell it is in? If the team cannot answer, that is itself a finding — the system is making the tradeoff implicitly, and the choice will assert itself during an incident.
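One way to make the placement explicit is to record it as a type in the service's own code or configuration. The sketch below is an assumption about how that might look; `Pacelc`, `OnPartition`, `Otherwise`, and `describe` are hypothetical names. An undocumented position is modeled as `None` so that it cannot be mistaken for a decision.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OnPartition {
    Availability,
    Consistency,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Otherwise {
    Latency,
    Consistency,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Pacelc {
    pub on_partition: OnPartition,
    pub otherwise: Otherwise,
}

impl fmt::Display for Pacelc {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let p = match self.on_partition {
            OnPartition::Availability => 'A',
            OnPartition::Consistency => 'C',
        };
        let e = match self.otherwise {
            Otherwise::Latency => 'L',
            Otherwise::Consistency => 'C',
        };
        write!(f, "P{p}/E{e}")
    }
}

/// Render a system's documented position, or flag that none exists.
pub fn describe(cell: Option<Pacelc>) -> String {
    match cell {
        Some(c) => c.to_string(),
        None => "undocumented: the tradeoff is being made implicitly".to_string(),
    }
}

fn main() {
    let coordination_store = Pacelc {
        on_partition: OnPartition::Consistency,
        otherwise: Otherwise::Consistency,
    };
    println!("{}", describe(Some(coordination_store))); // prints: PC/EC
    println!("{}", describe(None));
}
```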
The Consistency Spectrum
Linearizability is the strongest single-object consistency model in practice. Below it lies a spectrum:
Linearizability — There exists a single total order of operations consistent with real-time. Every read sees the latest committed write. This is the model CAP refers to. Implementation requires coordination on every write and typically every read.
Sequential consistency — There exists a single total order consistent with each node's program order, but not necessarily consistent with real-time. A read can see a slightly stale value as long as no node observes operations in conflicting orders. Cheaper to implement; rarely the default in production data systems because the staleness bound is undefined.
Causal consistency — Operations that are causally related (per the happens-before relation from Lesson 2) are seen in the same order by all nodes. Concurrent operations may be observed in different orders. Strong enough to prevent counterintuitive anomalies like "I see B's reply to my message before I see my own message." Achievable without a single coordinator; Riak and COPS implement variants.
Read-your-writes consistency — A particular client always sees its own writes in subsequent reads. Easier to provide than causal because it only constrains one client; typically implemented with session tokens.
Eventual consistency — If writes stop, all replicas will eventually converge to the same value. Says nothing about staleness, ordering, or anomalies during convergence. Useful as a baseline; insufficient as a guarantee on its own without additional model specification (e.g., "eventually consistent with bounded staleness of N seconds and monotonic-read session guarantees").
The catalog incident in the opening context was specifically an eventual-consistency outcome dressed up as something stronger. The team thought "eventually consistent" was a property of the system; it was actually a placeholder for a real specification they hadn't written. Concurrent writes during the partition diverged with no detection mechanism (the system had no vector clocks), and the convergence was manual rather than automatic.
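Of the models above, read-your-writes is the easiest to sketch because it constrains a single client. The example below assumes each write is assigned a monotonically increasing version and that replicas expose the highest version they have applied; `Session`, `Replica`, and `ReadResult` are hypothetical names used only for illustration.

```rust
/// A replica that has applied writes up to `applied_version`.
pub struct Replica {
    pub applied_version: u64,
    pub value: String,
}

/// Session token: the highest version this client has written.
#[derive(Default)]
pub struct Session {
    last_written: u64,
}

#[derive(Debug, PartialEq)]
pub enum ReadResult {
    Value(String),
    /// The replica has not yet applied this client's latest write; the
    /// caller should retry on another replica or wait.
    ReplicaTooStale,
}

impl Session {
    /// Record the version assigned to a write this client just made.
    pub fn record_write(&mut self, version: u64) {
        self.last_written = self.last_written.max(version);
    }

    /// Serve the read only from a replica that is at least as fresh as the
    /// client's own last write.
    pub fn read(&self, replica: &Replica) -> ReadResult {
        if replica.applied_version >= self.last_written {
            ReadResult::Value(replica.value.clone())
        } else {
            ReadResult::ReplicaTooStale
        }
    }
}

fn main() {
    let mut session = Session::default();
    session.record_write(7);

    let lagging = Replica { applied_version: 5, value: "old".into() };
    let caught_up = Replica { applied_version: 7, value: "new".into() };

    println!("{:?}", session.read(&lagging));   // prints: ReplicaTooStale
    println!("{:?}", session.read(&caught_up)); // prints: Value("new")
}
```

The guarantee is per session: a different client with no token can still read "old" from the lagging replica, which is exactly the gap between read-your-writes and the stronger models.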
Linearizability vs Serializability
These two words are often confused, including in textbooks, and the confusion has consequences.
Linearizability is a single-object property: each register or object in the system behaves as if every operation on it happens atomically at some point between its invocation and its response, in an order consistent with real-time.
Serializability is a multi-object property: a set of transactions executes in some order that is equivalent to some serial schedule. Serializability says nothing about real-time ordering; two transactions with no observable dependency can be reordered however the scheduler chooses.
You can have one without the other. Serializable snapshot isolation (SSI) is serializable, because an equivalent serial schedule exists, but it is not linearizable, because reads come from a consistent snapshot rather than the latest committed value. (Plain snapshot isolation is weaker still: it permits write skew and is not serializable.) Strict serializability is the combination of both, and it is the gold standard offered by systems like Spanner.
For the constellation, the practical implication is: a "serializable" catalog can still return stale reads. If you need "every read sees the most recent write," you need linearizability — and you need to be prepared to pay the coordination cost.
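The real-time requirement is what a stale-read check looks for in a recorded history. The sketch below is a necessary-condition check for a single register, not a full linearizability checker. It assumes each write carries a version that increases in commit order and that timestamps come from one reference clock; `Op`, `Kind`, and `find_stale_read` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
pub enum Kind {
    Write,
    Read,
}

/// One completed operation on a single register. For a write, `version` is
/// the version written; for a read, it is the version returned.
#[derive(Debug, Clone, Copy)]
pub struct Op {
    pub kind: Kind,
    pub start: u64,
    pub end: u64,
    pub version: u64,
}

/// Index of the first read that returned a version older than some write
/// which had already completed before the read began. Such a read violates
/// linearizability; a history with no such read may still violate it in
/// other ways.
pub fn find_stale_read(history: &[Op]) -> Option<usize> {
    history.iter().position(|r| {
        matches!(r.kind, Kind::Read)
            && history.iter().any(|w| {
                matches!(w.kind, Kind::Write) && w.end < r.start && w.version > r.version
            })
    })
}

fn main() {
    let history = [
        Op { kind: Kind::Write, start: 0, end: 10, version: 1 },
        Op { kind: Kind::Write, start: 20, end: 30, version: 2 },
        // Overlaps the second write, so returning version 1 is allowed.
        Op { kind: Kind::Read, start: 25, end: 35, version: 1 },
        // Begins after the second write completed, so version 1 is stale.
        Op { kind: Kind::Read, start: 40, end: 45, version: 1 },
    ];
    println!("{:?}", find_stale_read(&history)); // prints: Some(3)
}
```

A serializable system can produce the fourth operation without breaking its contract; a linearizable one cannot.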
Reading a System's Consistency Claim Skeptically
When a system claims a consistency model, three follow-up questions tell you whether the claim is operationally meaningful:
- Under what failure model does the claim hold? Many systems advertise strong consistency in the absence of failures and degrade silently when nodes fail. Spanner's strict serializability holds under failures; DynamoDB's strong consistency mode does, too. Some early NoSQL systems claimed "strong consistency" but degraded to last-write-wins under partition.
- What is the staleness bound? "Eventually consistent" is not a bound. "Read-your-writes within 100ms under network conditions X" is. If the documentation does not give you numbers, the bound is "indefinite."
- What anomalies are observable to a client? The CAP and PACELC frameworks classify systems but do not enumerate the specific bugs each model permits. Consult Kyle Kingsbury's Jepsen reports for empirical evidence; their analyses of specific products are the most rigorous public source on what consistency models actually deliver under stress.
The catalog choice in the opening context — "accept writes during the partition" — was a defensible PA/EL position. The failure was not the choice; it was that the team made the choice in the middle of an incident, without knowing it was a choice, and without having pre-defined the conflict resolution that PA/EL requires.
Code Examples
A Linearizable Register (Pseudocode for a Single-Leader Implementation)
The simplest way to provide linearizability is a single leader that serializes all operations. Every write goes through the leader; every read either goes through the leader or waits for confirmation that the local replica is caught up.
// CONCEPTUAL: a linearizable single-register store.
// Production implementations replicate the log via Raft (Module 3); this
// pseudocode shows the single-leader version for clarity.
use std::sync::Mutex;

#[derive(Debug)]
pub enum Error {
    NotLeader,
}

pub struct LinearizableRegister<T> {
    leader: bool,
    state: Mutex<T>,
}

impl<T: Clone + Send> LinearizableRegister<T> {
    pub async fn write(&self, value: T) -> Result<(), Error> {
        if !self.leader {
            // Followers reject writes rather than applying them locally; the
            // caller must retry against the leader. This is what gives up
            // Availability during a partition: if we can't reach the leader,
            // we can't write.
            return Err(Error::NotLeader);
        }
        // Replicate to a majority before acknowledging. This is the cost of
        // linearizability: every write pays the round-trip to a quorum of
        // followers. During a partition where we lose the quorum, this call
        // blocks or fails - we sacrifice availability to preserve consistency.
        self.replicate_to_majority(&value).await?;
        *self.state.lock().unwrap() = value;
        Ok(())
    }

    pub async fn read(&self) -> Result<T, Error> {
        if self.leader {
            // Even reads need a coordination step (a 'read index' or 'lease
            // read') to ensure we are still the leader and our state isn't
            // stale relative to a newer leader. Otherwise we could be a
            // deposed leader returning stale data, violating linearizability.
            self.confirm_still_leader().await?;
            Ok(self.state.lock().unwrap().clone())
        } else {
            // Followers must contact the leader to get a consistent read, or
            // wait until they are caught up to a known committed index.
            self.read_from_leader().await
        }
    }

    // Placeholders for the replication layer covered in Module 3.
    async fn replicate_to_majority(&self, _v: &T) -> Result<(), Error> { Ok(()) }
    async fn confirm_still_leader(&self) -> Result<(), Error> { Ok(()) }
    async fn read_from_leader(&self) -> Result<T, Error> { unimplemented!() }
}
The fact that reads also require coordination is what catches most people off guard. Linearizability is not "writes are durable"; it is "reads observe a real-time-consistent order." A leader that has been silently superseded and continues serving local reads is the classic linearizability violation. Module 3 will cover read-index and lease-read techniques in detail.
A PA/EL Store with Last-Write-Wins (Conflict Anti-Pattern)
This is the catalog's original implementation, captured for posterity:
// ANTI-PATTERN: last-write-wins using wall-clock timestamps under PA/EL.
// This is what produced the MSS-17 incident in Lesson 2.
use std::collections::HashMap;
use std::time::SystemTime;

pub struct PAELStore {
    // Each entry stores the value along with the timestamp claimed by the
    // writer. There's no coordination - writes succeed on whatever replica
    // receives them and propagate asynchronously.
    state: HashMap<String, (SystemTime, Vec<u8>)>,
}

impl PAELStore {
    pub fn put(&mut self, key: String, value: Vec<u8>, claimed_ts: SystemTime) {
        match self.state.get(&key) {
            // 'Newer' is defined as 'larger SystemTime' - a lie if the writer's
            // NTP is off. Concurrent writes from machines with different drift
            // will silently pick a winner that has no real causal precedence.
            Some((existing_ts, _)) if *existing_ts >= claimed_ts => return,
            _ => {
                self.state.insert(key, (claimed_ts, value));
            }
        }
    }

    pub fn get(&self, key: &str) -> Option<&[u8]> {
        // No quorum read, no anti-entropy check - we just return whatever
        // this replica happens to have. Two clients reading the same key on
        // different replicas can see different values for an unbounded time.
        self.state.get(key).map(|(_, v)| v.as_slice())
    }
}
The fixes are not subtle. Replace SystemTime with a Lamport or hybrid logical clock (Lesson 2). Replace "if newer, win" with vector-clock-based conflict detection that surfaces concurrent writes to the application. Add anti-entropy (Merkle trees, read repair) so divergent replicas converge automatically. Each of these is a real implementation step — the point is that the PA/EL position is not "we don't care about consistency"; it is "we make consistency a property of the conflict resolution mechanism, not the write path."
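As a rough illustration of the first fix, here is a minimal hybrid logical clock sketch. It follows the commonly described HLC update rules (a physical component that never moves backward plus a logical counter for ties) and takes the wall-clock reading as a parameter so the behavior is deterministic. `Hlc` and `HlcClock` are hypothetical names, and the millisecond values are made up; Lesson 2 remains the reference for the construction.

```rust
/// Hybrid logical timestamp. The derived `Ord` compares `l` first, then `c`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub l: u64, // largest physical time seen so far, in ms
    pub c: u32, // logical counter, breaks ties within the same `l`
}

#[derive(Debug, Default)]
pub struct HlcClock {
    l: u64,
    c: u32,
}

impl HlcClock {
    /// Local or send event. `wall_ms` is this node's wall-clock reading.
    pub fn now(&mut self, wall_ms: u64) -> Hlc {
        if wall_ms > self.l {
            self.l = wall_ms;
            self.c = 0;
        } else {
            // Wall clock has not advanced past what we already know.
            self.c += 1;
        }
        Hlc { l: self.l, c: self.c }
    }

    /// Receive event carrying timestamp `m` from another node.
    pub fn observe(&mut self, m: Hlc, wall_ms: u64) -> Hlc {
        let l_new = self.l.max(m.l).max(wall_ms);
        self.c = if l_new == self.l && l_new == m.l {
            self.c.max(m.c) + 1
        } else if l_new == self.l {
            self.c + 1
        } else if l_new == m.l {
            m.c + 1
        } else {
            0
        };
        self.l = l_new;
        Hlc { l: self.l, c: self.c }
    }
}

fn main() {
    let mut pacific = HlcClock::default();
    let mut indian = HlcClock::default();

    // Pacific's wall clock is fast: it reads 1600 ms.
    let first = pacific.now(1_600);
    // Indian Ocean receives that write while its own wall clock reads 1200 ms,
    // then issues a causally later write at wall time 1250 ms.
    indian.observe(first, 1_200);
    let second = indian.now(1_250);

    // The causally later write orders after the earlier one despite the skew.
    println!("{:?} < {:?} is {}", first, second, first < second);
}
```

An HLC by itself still cannot detect concurrency; it only guarantees that a causally later write never sorts before the write it observed.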
Detecting Concurrency in the Catalog (Vector Clocks, Revisited)
Tying Lesson 2 to PACELC: the right resolution for a PA/EL catalog is to use vector clocks to detect concurrency and either store siblings (Riak's approach) or apply a deterministic, causal-aware merge.
use std::collections::HashMap;

#[derive(Clone)]
pub struct Versioned<T> {
    pub value: T,
    pub clock: HashMap<String, u64>,
}

pub enum Resolution<T> {
    Single(Versioned<T>),
    Siblings(Vec<Versioned<T>>),
}

/// Compare two versioned values; return whichever causally dominates, or
/// both as siblings if they are concurrent. The application then decides
/// how to merge - last-write-wins is sometimes acceptable here, but the key
/// is that we *know* we are doing it, and we know on which axis.
pub fn reconcile<T: Clone>(a: Versioned<T>, b: Versioned<T>) -> Resolution<T> {
    let mut a_greater = false;
    let mut b_greater = false;
    let keys: std::collections::HashSet<&String> =
        a.clock.keys().chain(b.clock.keys()).collect();
    for k in keys {
        let av = a.clock.get(k).copied().unwrap_or(0);
        let bv = b.clock.get(k).copied().unwrap_or(0);
        if av > bv { a_greater = true; }
        if bv > av { b_greater = true; }
    }
    match (a_greater, b_greater) {
        (true, false) => Resolution::Single(a),           // a dominates
        (false, true) => Resolution::Single(b),           // b dominates
        (false, false) => Resolution::Single(a),          // equal, pick either
        (true, true) => Resolution::Siblings(vec![a, b]), // concurrent
    }
}

fn main() {
    let pacific = Versioned {
        value: "attitude_X".to_string(),
        clock: HashMap::from([("catalog".into(), 1), ("pacific".into(), 1)]),
    };
    let indian = Versioned {
        value: "attitude_Y".to_string(),
        clock: HashMap::from([("catalog".into(), 1), ("indian_ocean".into(), 1)]),
    };
    match reconcile(pacific, indian) {
        Resolution::Siblings(_) => println!("conflict surfaced - operator decides"),
        Resolution::Single(v) => println!("merged to {}", v.value),
    }
}
The output is conflict surfaced - operator decides, which is the correct behavior. The PA/EL system is now honest about what it doesn't know: two concurrent updates exist, the system cannot decide which one is "right," and a human (or an application-level rule) needs to resolve it. This is the operational discipline that makes eventual consistency safe to deploy.
Key Takeaways
- The CAP theorem says that during a partition, a system must give up either linearizability (CAP-style consistency) or availability. Partition tolerance is not an option you opt into; it is a fact of any real network. The choice is about partition behavior, not partition occurrence.
- PACELC extends CAP with a second axis: during normal (non-partitioned) operation, the system trades latency for consistency. Real systems map to one of four cells (PA/EL, PA/EC, PC/EL, PC/EC) — and the team should know which cell its stack is in, in advance.
- "Consistency" is overloaded. The C in CAP is linearizability. The C in ACID is invariant preservation. They are different concepts. Be explicit about which you mean.
- Linearizability and serializability are not the same thing. Linearizability is a single-object real-time property; serializability is a multi-transaction equivalence property. A serializable system can still return stale reads.
- "Eventually consistent" by itself is not a specification. The useful version names the staleness bound, the failure model, the conflict resolution mechanism, and the observable anomalies. If the documentation doesn't, the operations team will discover those properties during an incident.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 10, "Linearizability" and the chapter introduction. CAP's formal statement is from Gilbert & Lynch (2002); the original conjecture is Brewer (PODC 2000). PACELC is from Daniel Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design" (IEEE Computer, 2012). The system placements on the PACELC matrix reflect documented defaults as of training cutoff and should be verified against current vendor documentation — particularly for DynamoDB and MongoDB, whose default consistency modes have evolved across releases. The Jepsen reference is to Kyle Kingsbury's analyses at jepsen.io; specific findings should be cited to specific reports.
Module 01 Project — Constellation Clock Sync
Mission Brief
Incident ticket CN-2604-001
Severity: P2
Reporter: Constellation Operations, Pacific Watch
Status: Open
The MSS-17 attitude-update incident (Lesson 2, opening context) has been triaged. Root cause is confirmed as wall-clock-based ordering in the catalog write path: ground stations submit updates tagged with SystemTime::now(), and the catalog resolves conflicts by picking the larger timestamp. Pacific NTP drift introduced a 600 ms forward skew that caused a stale Pacific update to clobber a newer Indian Ocean update.
The fix is a clock-synchronization layer that does not depend on wall-clock agreement across ground stations. You are building it.
The deliverable is a Rust library, constellation_clock, that simulates a four-node cluster (Pacific, Indian Ocean, Atlantic, Antarctic) exchanging messages under injected partitions and delivers a working implementation of:
- A Lamport logical clock for total ordering of events.
- A vector clock for concurrency detection.
- A simulated message bus with controllable partitions and message reordering.
- A test harness that verifies causal ordering is preserved under adversarial network conditions.
This project does not implement physical clock synchronization (PTP, NTP). The point is to show that you can build correct distributed ordering without physical clock agreement.
Repository Layout
constellation-clock/
├── Cargo.toml
├── src/
│ ├── lib.rs # Public API
│ ├── lamport.rs # LamportClock type
│ ├── vector.rs # VectorClock type
│ ├── bus.rs # Simulated network with partition injection
│ └── node.rs # ClusterNode that uses both clocks
├── tests/
│ ├── lamport_ordering.rs
│ ├── vector_concurrency.rs
│ └── partition_recovery.rs
└── README.md
Required API
// lamport.rs
pub struct LamportClock { /* ... */ }
impl LamportClock {
    pub fn new() -> Self;
    pub fn tick(&self) -> u64;
    pub fn observe(&self, incoming: u64) -> u64;
    pub fn current(&self) -> u64;
}

// vector.rs
pub struct VectorClock { /* ... */ }
impl VectorClock {
    pub fn new(node_id: &str) -> Self;
    pub fn tick(&mut self);
    pub fn merge(&mut self, other: &VectorClock);
    pub fn compare(&self, other: &VectorClock) -> Ordering;
    pub fn snapshot(&self) -> HashMap<String, u64>;
}
pub enum Ordering { Before, After, Equal, Concurrent }

// bus.rs
pub struct MessageBus { /* ... */ }
impl MessageBus {
    pub fn new(node_ids: Vec<String>) -> Self;
    pub fn partition(&mut self, group_a: &[&str], group_b: &[&str]);
    pub fn heal(&mut self);
    pub async fn send(&self, from: &str, to: &str, payload: Vec<u8>) -> Result<()>;
    pub async fn recv(&self, node: &str) -> Option<Message>;
}

// node.rs
pub struct ClusterNode { /* ... */ }
impl ClusterNode {
    pub fn new(id: String, bus: Arc<MessageBus>) -> Self;
    pub async fn submit(&mut self, payload: Vec<u8>) -> Event;
    pub async fn run(&mut self); // background task processing incoming messages
}
pub struct Event {
    pub origin: String,
    pub lamport: u64,
    pub vector: HashMap<String, u64>,
    pub payload: Vec<u8>,
}
Acceptance Criteria
The project is complete when the following checks pass. Verifiable criteria are checked by the test harness; self-assessed items require the engineer's judgment because they involve subjective design decisions.
- cargo build --release completes without warnings under #![deny(warnings)].
- cargo test passes all integration tests with zero flakes across 10 consecutive runs.
- cargo clippy -- -D warnings produces no lints.
- The Lamport clock implementation passes the monotonic causality test: for any sequence of tick() and observe() calls, the returned values strictly increase, and observe(n) always returns a value strictly greater than both the current local value and n.
- The vector clock implementation passes the concurrency detection test: given two clocks A and B produced by independent tick sequences from a shared starting state, compare(A, B) returns Concurrent.
- The vector clock implementation passes the causal precedence test: if node X sends to node Y and Y receives, then Y's clock after merge dominates the snapshot of X's clock at send time.
- The simulated message bus correctly partitions: after partition(group_a, group_b), messages sent from a node in group_a to a node in group_b are buffered until heal() is called.
- Under a 4-node test where Pacific and Indian Ocean issue concurrent updates during an Antarctic-relay partition, the catalog (modeled as the receiving node's reconcile logic) correctly identifies the two updates as concurrent rather than picking one based on arrival order.
- The recovery test passes: after a 5-second partition with 20 events on each side, healing the partition causes all nodes to converge to the same set of events (no message loss).
- (self-assessed) The code is structured so that swapping the clock implementation (e.g., to a hybrid logical clock) would not require changes to ClusterNode or MessageBus. The clock is an injected dependency, not a hardcoded type.
- (self-assessed) The README explains the model assumptions clearly enough that another engineer joining the team could read it once and explain to a third party why the system does not depend on NTP.
- (self-assessed) The test for concurrent updates fails informatively if a bug is introduced; that is, the test output explains which two events were expected to be concurrent and what the implementation reported instead.
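The first self-assessed criterion, the clock as an injected dependency, can be approached with a trait and a generic node. The sketch below is one possible shape, not the required design: `LogicalClock`, `Lamport`, and `Node` are hypothetical names, and the real `ClusterNode` also owns a bus handle that is omitted here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical abstraction: anything that can stamp local events and
/// absorb stamps carried by incoming messages.
pub trait LogicalClock {
    type Stamp: Clone;
    fn tick(&self) -> Self::Stamp;
    fn observe(&self, incoming: &Self::Stamp) -> Self::Stamp;
}

pub struct Lamport(AtomicU64);

impl LogicalClock for Lamport {
    type Stamp = u64;

    fn tick(&self) -> u64 {
        self.0.fetch_add(1, Ordering::SeqCst) + 1
    }

    fn observe(&self, incoming: &u64) -> u64 {
        // fetch_update retries the closure until its compare-exchange
        // succeeds, then returns the value that was replaced.
        let prev = self
            .0
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| {
                Some(cur.max(*incoming) + 1)
            })
            .expect("closure always returns Some");
        prev.max(*incoming) + 1
    }
}

/// A node generic over its clock. Swapping in another clock changes only
/// the type parameter, not the node's methods.
pub struct Node<C: LogicalClock> {
    pub id: String,
    clock: C,
}

impl<C: LogicalClock> Node<C> {
    pub fn new(id: &str, clock: C) -> Self {
        Self { id: id.to_string(), clock }
    }

    pub fn submit(&self) -> C::Stamp {
        self.clock.tick()
    }

    pub fn receive(&self, stamp: &C::Stamp) -> C::Stamp {
        self.clock.observe(stamp)
    }
}

fn main() {
    let node = Node::new("pacific", Lamport(AtomicU64::new(0)));
    let t1 = node.submit();
    let t2 = node.receive(&42);
    let t3 = node.submit();
    println!("{t1} {t2} {t3}"); // prints: 1 43 44
}
```

A generic parameter keeps dispatch static; a `Box<dyn ...>` field would also satisfy the criterion, at the cost of fixing the stamp type.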
Expected Output
When the test harness runs the partition simulation (cargo test --release partition_concurrent_updates -- --nocapture), the output should look approximately like:
[t=0.000s] Bus: 4 nodes online (pacific, indian_ocean, atlantic, antarctic)
[t=0.100s] Bus: partition initiated (group A: pacific, atlantic; group B: indian_ocean, antarctic)
[t=0.150s] pacific: submit(attitude_X) -> lamport=1, vector={pacific:1}
[t=0.200s] indian_ocean: submit(attitude_Y) -> lamport=1, vector={indian_ocean:1}
[t=2.000s] Bus: partition healed
[t=2.100s] pacific: received from indian_ocean (lamport=1, vector={indian_ocean:1})
[t=2.105s] pacific: reconcile(attitude_X, attitude_Y) -> CONCURRENT
[t=2.110s] pacific: surfaced conflict to application layer
[t=2.150s] indian_ocean: received from pacific (lamport=1, vector={pacific:1})
[t=2.155s] indian_ocean: reconcile(attitude_Y, attitude_X) -> CONCURRENT
PASS: both nodes identified the updates as concurrent
PASS: no silent winner was selected
The exact timestamps and formatting are not graded; what is graded is that the test successfully detects concurrency in both directions and that no node silently picks a winner.
Hints
1. Where to start
Begin with lamport.rs. It is the smallest correct piece of the system. Write the type, write its three methods, and write the monotonic-causality unit test before anything else. Once the Lamport clock is solid, the vector clock is structurally similar with a wider state shape.
2. AtomicU64 vs Mutex for the Lamport counter
The Lamport clock can use AtomicU64 for the local counter, but the observe() operation needs max(local, incoming) + 1, which is not a single atomic op. You have two choices: a CAS loop on AtomicU64 (lock-free, slightly more complex) or a Mutex<u64> (simpler, contended under load). For this project, either is acceptable; pick the one whose tradeoffs you can articulate in a code-review conversation.
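For comparison with the CAS-loop version in Lesson 2, a Mutex-based variant might look like the sketch below. It matches the required `lamport.rs` signatures; the `Default` impl is an addition intended to keep clippy's `new_without_default` lint quiet.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

pub struct LamportClock {
    counter: Mutex<u64>,
}

impl Default for LamportClock {
    fn default() -> Self {
        Self::new()
    }
}

impl LamportClock {
    pub fn new() -> Self {
        Self { counter: Mutex::new(0) }
    }

    /// Local event: increment and return the new value.
    pub fn tick(&self) -> u64 {
        let mut c = self.counter.lock().unwrap();
        *c += 1;
        *c
    }

    /// Receive event: max(local, incoming) + 1, done under one lock so the
    /// read-modify-write cannot interleave with another task.
    pub fn observe(&self, incoming: u64) -> u64 {
        let mut c = self.counter.lock().unwrap();
        *c = (*c).max(incoming) + 1;
        *c
    }

    pub fn current(&self) -> u64 {
        *self.counter.lock().unwrap()
    }
}

fn main() {
    let clock = Arc::new(LamportClock::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&clock);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    c.tick();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Four threads times 1000 ticks, with no lost updates.
    println!("{}", clock.current()); // prints: 4000
}
```

The lock makes `observe` trivially correct, which is the argument for it in review; the argument against is that every `tick` now serializes on the same mutex.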
3. Designing the MessageBus partition model
The bus needs to track which nodes can deliver to which other nodes. A HashMap<(String, String), VecDeque<Message>> per-pair queue model works well: when partitioned, the (from, to) pair's queue continues to receive sends but is not drained until heal. An alternative is a single global queue with a per-message "deliverable_at" timestamp, but the per-pair model maps more directly to the test scenarios.
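A minimal sketch of the per-pair queue model follows, assuming a synchronous in-memory bus. The `MessageBus`, `Message`, `partition`, `heal`, and `deliver` names are hypothetical, and a real harness would likely add async delivery and simulated time.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type NodeId = String;

#[derive(Debug, Clone, PartialEq)]
struct Message {
    from: NodeId,
    to: NodeId,
    payload: String,
}

/// Hypothetical bus: one FIFO queue per (from, to) pair plus a set of
/// pairs that are currently cut off from each other.
#[derive(Default)]
struct MessageBus {
    queues: HashMap<(NodeId, NodeId), VecDeque<Message>>,
    blocked: HashSet<(NodeId, NodeId)>,
}

impl MessageBus {
    /// Sends always enqueue, even while the pair is partitioned.
    fn send(&mut self, from: &str, to: &str, payload: &str) {
        let key = (from.to_string(), to.to_string());
        self.queues.entry(key).or_default().push_back(Message {
            from: from.to_string(),
            to: to.to_string(),
            payload: payload.to_string(),
        });
    }

    /// Block delivery in both directions between the two groups.
    fn partition(&mut self, group_a: &[&str], group_b: &[&str]) {
        for a in group_a {
            for b in group_b {
                self.blocked.insert((a.to_string(), b.to_string()));
                self.blocked.insert((b.to_string(), a.to_string()));
            }
        }
    }

    fn heal(&mut self) {
        self.blocked.clear();
    }

    /// Drain every queue addressed to `to` whose pair is not blocked.
    /// Order is FIFO per sender; order across senders is unspecified.
    fn deliver(&mut self, to: &str) -> Vec<Message> {
        let mut out = Vec::new();
        for ((from, dest), queue) in self.queues.iter_mut() {
            if dest == to && !self.blocked.contains(&(from.clone(), dest.clone())) {
                out.extend(queue.drain(..));
            }
        }
        out
    }
}

fn main() {
    let mut bus = MessageBus::default();
    bus.partition(&["pacific", "atlantic"], &["indian_ocean", "antarctic"]);
    bus.send("pacific", "indian_ocean", "attitude_X");
    println!("during partition: {:?}", bus.deliver("indian_ocean"));
    bus.heal();
    println!("after heal: {:?}", bus.deliver("indian_ocean"));
}
```

Because queued messages survive the partition and are released by `heal`, the same structure serves both the "delivered after heal" test and the concurrency test.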
4. Concurrent updates in tests
To force concurrency reliably in tests, you cannot rely on real time. Instead, use the bus's partition feature: partition before both nodes submit their events, then heal afterward. Each node's local clock advances independently during the partition, producing vector clocks that the other node has not observed — which is the definition of concurrency.
5. Why test ordering instead of message content
The acceptance criterion is that concurrent events are identified as concurrent, not that any particular winner is picked. Your test should assert that compare() returns Concurrent, not that a specific value won. Asserting on the winner couples the test to a tiebreaker policy that the project is explicitly trying to make pluggable.
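One way such an assertion could look, assuming a vector clock represented as a map from node name to counter. The `Causality` enum, the `compare` signature, and the `clock` helper are illustrative assumptions; the project's actual types may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Causality {
    Before,
    After,
    Equal,
    Concurrent,
}

type VectorClock = HashMap<String, u64>;

/// Compare two vector clocks entry by entry; a missing entry counts as 0.
fn compare(a: &VectorClock, b: &VectorClock) -> Causality {
    let mut a_ahead = false;
    let mut b_ahead = false;
    for node in a.keys().chain(b.keys()) {
        let av = a.get(node).copied().unwrap_or(0);
        let bv = b.get(node).copied().unwrap_or(0);
        if av > bv {
            a_ahead = true;
        }
        if bv > av {
            b_ahead = true;
        }
    }
    match (a_ahead, b_ahead) {
        (false, false) => Causality::Equal,
        (true, false) => Causality::After,
        (false, true) => Causality::Before,
        (true, true) => Causality::Concurrent,
    }
}

/// Test helper: build a clock from (node, counter) pairs.
fn clock(entries: &[(&str, u64)]) -> VectorClock {
    entries.iter().map(|(n, c)| (n.to_string(), *c)).collect()
}

fn main() {
    let attitude_x = clock(&[("pacific", 1)]);
    let attitude_y = clock(&[("indian_ocean", 1)]);

    // Assert on the ordering relation, in both directions, and name the
    // two events in the failure message so a regression is diagnosable.
    let forward = compare(&attitude_x, &attitude_y);
    assert_eq!(
        forward,
        Causality::Concurrent,
        "attitude_X {:?} vs attitude_Y {:?}: expected Concurrent, got {:?}",
        attitude_x, attitude_y, forward
    );
    assert_eq!(compare(&attitude_y, &attitude_x), Causality::Concurrent);
    println!("both directions report Concurrent");
}
```

Nothing in the assertions mentions which value would win, so a later change of tiebreaker policy leaves the test untouched.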
6. Cleanly testing partition recovery
Recovery has two failure modes worth testing separately: (a) messages sent during the partition are eventually delivered after heal; (b) the vector clocks correctly identify which post-heal merges represent concurrency vs causal precedence. Write one test for each. A single test that asserts everything is harder to debug when one assertion fails.
Source Anchors
- DDIA 2nd Edition, Chapter 9 — "Unreliable Clocks," "Knowledge, Truth, and Lies"
- Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" (CACM, 1978) — the canonical reference for the logical-clock construction; recommended supplemental reading
- DDIA 2nd Edition, Chapter 10 — used briefly for the CAP/PACELC framing of why this matters
Module 02 — Replication Strategies
"The Constellation Catalog serves twelve ground stations across five continents. Forcing every TLE write through a single Virginia primary is no longer a viable design."
Mission Context
Module 1 established that the network is unreliable and time is unreliable. Module 2 introduces the first family of mechanisms for building reliable systems on top of those unreliable substrates: replication. The catalog you operate cannot survive a datacenter failure with a single instance; it cannot serve global reads with acceptable latency from a single region; and it cannot accept writes during inter-region partitions without giving up some property the team must choose deliberately.
The three lessons in this module cover the replication shapes that real systems use and the discipline required to read from them safely. Single-leader is the default: one writer, many readers, well-understood failure modes, and the model behind most production relational databases. Multi-leader and leaderless trade that simplicity for write availability across regions and partitions, at the cost of conflict resolution. Read consistency under lag is the operational discipline that makes any of these models usable at scale: the menu of session guarantees that bound what a user can observe.
The opening incidents — the catalog's earlier failovers (Lesson 1), the MSS-17 attitude overwrite extended into the multi-leader regime (Lesson 2), and the three categories of read anomaly that appear after enabling follower-served reads (Lesson 3) — are not edge cases. They are the standard operational landscape of any system that has decided to replicate.
Lessons
| # | Title | Source |
|---|---|---|
| 1 | Single-Leader Replication | DDIA Ch. 6 |
| 2 | Multi-Leader and Leaderless Replication | DDIA Ch. 6 + Dynamo paper |
| 3 | Read Consistency and Replication Lag | DDIA Ch. 6 + Bayou 1994 |
Project
Replicated Telemetry Store — a Rust crate implementing single-leader replication with synchronous and asynchronous followers, a pluggable read router supporting three modes (Leader / AnyFollower / SessionConsistent), and a test suite that demonstrates each replication-lag anomaly and shows the session guarantees that eliminate them. Includes a partition-durability test that promotes the synchronous follower after a leader crash with no lost acknowledged writes.
Position
Module 2 of 6 in the Distributed Systems track.
What You Should Be Able to Do After This Module
- Read a system's replication configuration (sync vs async, leader topology, conflict policy) and predict its behavior under follower crash, leader crash, and partition.
- Choose between WAL-shipping, statement-based, and logical replication for a specific operational scenario, articulating the upgrade-path tradeoffs of each.
- Diagnose a multi-leader or leaderless system's conflict-resolution policy by inspecting the data structures it uses to detect concurrency (vector clocks, last-write-wins timestamps, CRDTs).
- Map a workload to the right read consistency mode: linearizable (leader), eventually consistent (any follower), session-consistent (LSN tokens), or bounded staleness (lag-aware routing).
- Implement read-after-write and monotonic-reads session guarantees on top of a leader/follower replication system without modifying the underlying storage engine.
Source Materials
- DDIA 2nd Edition (Kleppmann & Riccomini, 2026) — Chapter 6 ("Replication") is the primary source for all three lessons. The chapter's treatment of single-leader, multi-leader, leaderless, and replication-lag anomalies is the most rigorous public reference.
- DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007) — the foundational paper for the leaderless model and the source of the N/W/R quorum framework. Strongly recommended supplemental reading for Lesson 2.
- Shapiro et al., "Conflict-Free Replicated Data Types" (SSS 2011) — the formal treatment of CRDTs. Lesson 2 introduces the G-Set as an example; this paper is the reference for the full taxonomy.
- Terry et al., "Session Guarantees for Weakly Consistent Replicated Data" (PDIS 1994) — the Bayou paper that formalized the four session guarantees (read-your-writes, monotonic reads, monotonic writes, writes-follow-reads). The canonical reference for Lesson 3.
Source notes on individual lessons flag where content has been synthesized beyond the available source material.
Lesson 1: Single-Leader Replication
Context
The Constellation Catalog is the source of truth for every orbital object Meridian tracks — fifty thousand TLEs, attitude states, link budgets, conjunction predictions. When the catalog was first deployed it ran as a single PostgreSQL instance, and reads were fast and writes were obvious. Three years later, ground stations on five continents need to read the catalog with bounded latency, and the catalog needs to survive a datacenter failure without losing acknowledged writes. The single-instance design no longer fits. Replication is how you get from "one machine holds the truth" to "many machines hold copies of the truth, and the copies eventually agree."
The simplest, most widely deployed replication model is single-leader replication: one designated node accepts all writes, applies them to its local state, and propagates them to a set of followers. A read can be served by any replica whose staleness the workload can tolerate. PostgreSQL streaming replication, MySQL primary-replica, MongoDB replica sets, and the leader tier of every Raft-based system all use this shape. The shape is so prevalent that most engineers have used it without naming it, and many of the most familiar production database outages (failover storms, replication-lag-induced read anomalies, lost writes after primary failure) are failure modes specific to this shape.
This lesson treats single-leader replication as the baseline that the next two lessons (multi-leader/leaderless, read consistency) will refine and complicate. By the end, you should be able to read a system's replication configuration and predict its behavior under three specific failure modes: a follower crash, a leader crash, and a network partition between the leader and a majority of followers.
Core Concepts
The Replication Log: Synchronous vs Asynchronous
The mechanism that keeps followers in sync with the leader is the replication log: an ordered stream of write operations that the leader emits and each follower applies in order. The interesting question is when the leader considers a write committed: after sending it to the followers, or after the followers acknowledge receipt and durability?
In synchronous replication, the leader waits for a designated number of followers to confirm a write before reporting success to the client. The strict version requires every follower to acknowledge; this is unusable in practice because one slow or unreachable follower blocks all writes. The realistic version requires a quorum: for example, "one synchronous follower plus the leader's local commit," which PostgreSQL expresses through the synchronous_standby_names setting (an ANY 1 (...) list makes the leader wait for whichever listed standby confirms first). Synchronous replication gives you durability across machines for every acknowledged write, at the cost of adding a follower round trip to every write: the leader waits for the slowest acknowledgment it requires.
In asynchronous replication, the leader acknowledges the write as soon as its local commit succeeds and propagates to followers in the background. Writes are durable on the leader but not yet on followers when acknowledged. If the leader fails before propagation completes and a follower is promoted, those acknowledged writes are lost. Asynchronous replication is the default in most systems because it is faster and tolerates slow followers. The price is a weaker durability guarantee, which manifests as the "lost writes after failover" failure mode.
The catalog's policy should be explicit: the team chose one synchronous follower for the catalog primary, on the basis that "we will accept slightly higher write latency in exchange for never losing an acknowledged orbital element." This is a judgment call. The Mission Control telemetry pipeline made the opposite call — asynchronous replication, fast writes, accept that some recent telemetry samples may be lost if the primary fails — because telemetry is high-volume and individual samples are not individually critical. The same database engine supports both policies; the configuration is the contract.
Replication Log Implementations
Three implementation strategies appear in practice, and each has tradeoffs that matter when you operate the system.
Statement-based replication ships the SQL (or equivalent) command and re-executes it on followers. It is conceptually simple and compact, but it breaks for any non-deterministic operation: NOW(), RANDOM(), sequences, triggers with side effects, or any function whose result depends on local state. MySQL used statement-based replication by default for years and accumulated a long list of incompatible features. Modern systems generally avoid this.
Write-ahead log (WAL) shipping ships the byte-level WAL of the storage engine. PostgreSQL streaming replication does this. It is exactly correct because the follower applies the same bytes the leader did, but it tightly couples the leader and follower to the same storage engine version. A WAL produced by PG 17 cannot be replayed on PG 16, or the other way around. This rules out cross-version replication, and with it the zero-downtime upgrade path of upgrading the followers first and then promoting one: major-version upgrades under WAL shipping need downtime or a separate logical-replication path.
Logical (row-based) replication ships logical change records: "row with primary key X now has columns set to Y." This decouples the replication format from the storage engine, allowing replicas to run different versions or even different database systems. It is the basis for change data capture (CDC) systems like Debezium and the mechanism behind many cross-cloud replication setups. The cost is that logical replication is per-row, not per-page; the volume of replication traffic can be much higher than WAL shipping for bulk operations.
The catalog uses PostgreSQL with logical replication to a separate analytics replica, plus WAL streaming to a hot standby. The two replicas serve different purposes and use different mechanisms; the choice is intentional.
Setting Up New Followers Without Stopping the Leader
Adding a follower to a running cluster is one of the operations that distinguishes a real production system from a toy. The naive answer — "take a snapshot, copy it, start replaying logs from there" — has a subtle correctness issue: between the snapshot and the start of log replay, the leader has accepted writes. You need a consistent point that the snapshot represents and a known position in the log from which to start replay.
DDIA describes the standard sequence: (1) take a snapshot of the leader's database at a moment when the WAL position is known, ideally without taking a lock that blocks writes; (2) copy the snapshot to the new follower; (3) start the follower's WAL replay from the known position; (4) wait for the follower to catch up. PostgreSQL's pg_basebackup implements the snapshot-and-copy steps, after which the standby replays WAL from the recorded position; MongoDB's initial sync follows the same shape. The follower is considered "caught up" when its replay lag is below some threshold, after which it can serve reads.
This procedure is also the basis for "follower restore" — recovering a follower that has fallen behind and cannot catch up from the log because the leader has discarded the older log segments. The same snapshot-and-replay process applies. The operational cost is bandwidth and time; for a large database, a new follower can take hours to bootstrap, during which the cluster is running with reduced redundancy.
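The snapshot-plus-log-position sequence can be sketched in a few lines, assuming an in-memory leader whose log is a `Vec` and whose sequence numbers start at 1. All type and method names here (`Snapshot`, `entries_after`, `catch_up`) are hypothetical. The point is that the snapshot records the log position it covers, so writes accepted while the snapshot is being copied are neither lost nor applied twice.

```rust
#[derive(Clone, Debug)]
struct LogEntry {
    seq: u64, // 1-based position in the leader's log
    delta: i64,
}

struct Leader {
    state: i64,
    log: Vec<LogEntry>,
}

/// A snapshot is only useful together with the log position it covers.
struct Snapshot {
    state: i64,
    last_seq: u64,
}

impl Leader {
    fn apply(&mut self, delta: i64) {
        let seq = self.log.len() as u64 + 1;
        self.state += delta;
        self.log.push(LogEntry { seq, delta });
    }

    /// Step 1: capture state and log position at the same moment.
    fn snapshot(&self) -> Snapshot {
        Snapshot { state: self.state, last_seq: self.log.len() as u64 }
    }

    /// Entries strictly after `seq` (entry with seq k lives at index k - 1).
    fn entries_after(&self, seq: u64) -> &[LogEntry] {
        &self.log[seq as usize..]
    }
}

struct Follower {
    state: i64,
    last_applied: u64,
}

impl Follower {
    /// Steps 2 and 3: load the copied snapshot, remember where it ends.
    fn from_snapshot(snapshot: &Snapshot) -> Self {
        Self { state: snapshot.state, last_applied: snapshot.last_seq }
    }

    /// Step 4: replay everything the leader logged after the snapshot.
    fn catch_up(&mut self, leader: &Leader) {
        for entry in leader.entries_after(self.last_applied) {
            self.state += entry.delta;
            self.last_applied = entry.seq;
        }
    }
}

fn main() {
    let mut leader = Leader { state: 0, log: Vec::new() };
    leader.apply(10);
    leader.apply(5);
    let snapshot = leader.snapshot(); // covers seq 1..=2
    leader.apply(-3); // write accepted while the snapshot is being copied

    let mut follower = Follower::from_snapshot(&snapshot);
    follower.catch_up(&leader);
    println!("leader: {}, follower: {}", leader.state, follower.state);
}
```

The "follower restore" case is the same code path: if the leader has already discarded entries at or before the follower's `last_applied`, the follower cannot replay and must start again from a fresh snapshot.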
Failover: The Hardest Operation in the System
When the leader fails, the cluster needs a new leader. The mechanics — pick a follower, promote it, redirect writes — sound simple, and they are not. Failover is the operation that loses the most data, breaks the most invariants, and produces the most outages in single-leader systems.
DDIA enumerates the failover hazards in detail; the highlights:
- Lost writes from asynchronous replication. Any writes the old leader had acknowledged but not yet propagated to the new leader are lost. The catalog's "one synchronous follower" policy is designed precisely to ensure that the synchronous follower is the candidate for promotion, so no acknowledged write is lost.
- Split brain. If the old leader is not actually dead — just unreachable — and the cluster promotes a new leader, both leaders will accept writes. When the partition heals, the two write streams must be reconciled, and there is no general-purpose mechanism for this in a single-leader system. The standard defense is fencing: the new leader is given a generation number, and any write tagged with an older generation is rejected by the storage layer.
- Coordinator failures. The component that decides "the leader is dead, promote a follower" is itself a distributed system, and getting it wrong is how you produce more outages than you prevent. Production systems use Raft (Module 3) or ZooKeeper to make the failover decision, because consensus is exactly the right tool for "we all agree on who is leader now."
- Hard timeouts and flapping. If the failover threshold is too aggressive, transient network blips trigger failovers that immediately reverse. If it is too lenient, real failures leave the cluster unavailable for too long. There is no universally correct timeout, and tuning it is one of the standing operational tasks.
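The fencing defense from the split-brain item can be illustrated with a small sketch, assuming the storage layer remembers the highest generation number it has seen and that the failover coordinator hands out strictly increasing generations. `FencedStore` and `StaleGeneration` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct StaleGeneration {
    offered: u64,
    current: u64,
}

/// Hypothetical storage layer that enforces fencing on every write.
#[derive(Default)]
struct FencedStore {
    highest_generation: u64,
    value: i64,
}

impl FencedStore {
    /// Accept the write only if it carries a generation at least as new
    /// as the newest one seen so far; otherwise reject it.
    fn write(&mut self, generation: u64, value: i64) -> Result<(), StaleGeneration> {
        if generation < self.highest_generation {
            return Err(StaleGeneration {
                offered: generation,
                current: self.highest_generation,
            });
        }
        self.highest_generation = generation;
        self.value = value;
        Ok(())
    }
}

fn main() {
    let mut store = FencedStore::default();

    // Generation 1: the original leader writes normally.
    store.write(1, 100).unwrap();

    // The coordinator promotes a follower and assigns generation 2.
    store.write(2, 200).unwrap();

    // The old leader was only partitioned, not dead. Its late write still
    // carries generation 1 and is rejected instead of overwriting 200.
    let late = store.write(1, 150);
    println!("late write from old leader: {:?}", late);
    println!("stored value: {}", store.value);
}
```

The check lives in the storage layer rather than in the leaders because a partitioned old leader cannot be trusted to know it has been replaced.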
The catalog has had two failovers in three years. The first lost two minutes of acknowledged writes because asynchronous replication had not caught up. The second succeeded cleanly because the team had since switched to synchronous replication for the primary's first follower. The cost is higher steady-state write latency. The tradeoff is documented in the runbook.
Replication Lag: The Anomaly Surface
Even with a synchronous follower, only that one follower is guaranteed to have each acknowledged write. The other followers are still asynchronous, and the time between the leader committing and a follower applying is replication lag. Replication lag is the source of the anomalies that Lesson 3 covers in detail (read-after-write, monotonic reads, consistent prefix reads), but the underlying phenomenon is worth naming here: a system that routes reads to followers is a system that will, sometimes, return stale data.
This is not a bug. It is the inherent behavior of any system that prefers read scalability over consistency. The question is whether the staleness is bounded and whether the application can tolerate it. For the catalog, conjunction predictions that read from a follower can tolerate seconds of staleness; orbital element submissions cannot tolerate any staleness on their own read path (a ground station must see its own write). The system handles this by routing the writer's subsequent reads to the leader for a session-scoped window, which is the standard read-your-writes pattern.
Code Examples
A Minimal Single-Leader Replicated Counter
The smallest correct shape of single-leader replication fits in a single file. This example shows the leader/follower roles, the replication log as a tokio::sync::broadcast channel, and the asynchronous-acknowledgment model.
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct LogEntry {
    seq: u64,
    delta: i64,
}

struct Leader {
    state: AtomicU64,
    next_seq: AtomicU64,
    // Broadcast channel models the replication log. In production this is a
    // durable WAL on disk; the broadcast channel is a faithful concurrency
    // model of what 'followers subscribe to the log' looks like.
    log: broadcast::Sender<LogEntry>,
}

impl Leader {
    fn new() -> Arc<Self> {
        let (tx, _) = broadcast::channel(1024);
        Arc::new(Self {
            state: AtomicU64::new(0),
            next_seq: AtomicU64::new(0),
            log: tx,
        })
    }

    /// Write path. Returns immediately after local commit and log append -
    /// this is asynchronous replication. To make it synchronous, we would
    /// wait here for a quorum of followers to acknowledge a given seq.
    fn apply(&self, delta: i64) -> u64 {
        let seq = self.next_seq.fetch_add(1, Ordering::SeqCst);
        // Apply locally. The local commit must happen before the log append
        // is observable - otherwise a follower could see an entry that the
        // leader does not yet have. (In a real WAL system, the order is
        // reversed: log first, then apply, with crash-recovery replay.)
        if delta >= 0 {
            self.state.fetch_add(delta as u64, Ordering::SeqCst);
        } else {
            self.state.fetch_sub((-delta) as u64, Ordering::SeqCst);
        }
        // Best-effort broadcast - if no followers are subscribed, the entry
        // is dropped from the broadcast buffer (this is fine; we still have
        // it locally). In production the log is persisted and followers can
        // catch up by reading the persistent log.
        let _ = self.log.send(LogEntry { seq, delta });
        seq
    }

    fn read(&self) -> u64 {
        // Leader reads are linearizable: the leader has the latest value.
        // Production systems add a 'read index' check to ensure we're still
        // the leader (see Module 3).
        self.state.load(Ordering::SeqCst)
    }

    fn subscribe(&self) -> broadcast::Receiver<LogEntry> {
        self.log.subscribe()
    }
}

struct Follower {
    state: AtomicU64,
    last_applied_seq: AtomicU64,
}

impl Follower {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            state: AtomicU64::new(0),
            last_applied_seq: AtomicU64::new(0),
        })
    }

    async fn run(self: Arc<Self>, mut log: broadcast::Receiver<LogEntry>) {
        // Apply log entries in order. If we lag behind, the broadcast channel
        // will return Lagged, and we'd need to bootstrap from a snapshot.
        // This is the 'follower fell behind, needs reseeding' case from DDIA.
        while let Ok(entry) = log.recv().await {
            if entry.delta >= 0 {
                self.state.fetch_add(entry.delta as u64, Ordering::SeqCst);
            } else {
                self.state.fetch_sub((-entry.delta) as u64, Ordering::SeqCst);
            }
            self.last_applied_seq.store(entry.seq, Ordering::SeqCst);
        }
    }

    fn read(&self) -> u64 {
        // Follower reads may be stale by an unbounded amount, depending on
        // replication lag. The caller must tolerate this or route to leader.
        self.state.load(Ordering::SeqCst)
    }
}

#[tokio::main]
async fn main() {
    let leader = Leader::new();
    let follower = Follower::new();
    let f = follower.clone();
    let log = leader.subscribe();
    tokio::spawn(f.run(log));

    leader.apply(10);
    leader.apply(5);
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;

    println!("leader: {}, follower: {}", leader.read(), follower.read());
    // Both should print 15. If we read the follower immediately after apply,
    // we could see 0 - that's replication lag in action.
}
```
This is small enough to reason about and big enough to demonstrate every shape that matters in production: a write path, a replication log, follower subscription, and the lag window between leader commit and follower apply. The production version replaces broadcast::channel with a durable WAL, replaces AtomicU64 with a real storage engine, and adds the failover coordinator and fencing mechanism described above. The shape, however, is the same.
A Failover Hazard: Acknowledged Writes Lost Without Synchronous Replication
This snippet captures the failure mode of asynchronous-only replication:
```rust
// Sketch only: `apply_async`, `wait_for_quorum`, `TelemetrySample`, and the
// `Result` alias are placeholders, not part of the counter example above.

// SCENARIO: leader acknowledges write, crashes before propagation completes.
async fn submit_telemetry(leader: &Leader, sample: TelemetrySample) -> Result<()> {
    let _seq = leader.apply_async(sample).await?; // local commit only
    // At this point, the write is durable on the leader's disk but the
    // followers have not yet seen it. If the leader's storage fails before
    // the WAL ships, the write is gone - even though the client got an ack.
    Ok(()) // We return success to the client here.
}

// What synchronous replication adds:
async fn submit_telemetry_safe(leader: &Leader, sample: TelemetrySample) -> Result<()> {
    let seq = leader.apply_async(sample).await?;
    leader.wait_for_quorum(seq).await?; // block until N followers ack
    Ok(()) // Now safe: the write exists on >=N+1 machines.
}
```
The two functions look almost identical and have radically different operational properties. The first is the source of "we acked the write but it's not there after failover" incidents. The second is the source of "writes are slower than they used to be after we tightened replication policy" tickets. Choose deliberately; document the choice; review it during the postmortem of the first incident that touches it.
Read-Your-Writes via Session-Pinned Routing
The simplest read-your-writes implementation pins a session's reads to the leader for a window after each write:
```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct SessionRouter {
    leader_endpoint: String,
    follower_endpoints: Vec<String>,
    last_write: Mutex<Option<Instant>>,
}

impl SessionRouter {
    const STICKY_WINDOW: Duration = Duration::from_secs(10);

    fn route_read(&self) -> &str {
        let last = self.last_write.lock().unwrap();
        match *last {
            // Within the sticky window, route reads to the leader. The leader
            // is linearizable, so the writer is guaranteed to see its own
            // recent write. The cost is leader load.
            Some(t) if t.elapsed() < Self::STICKY_WINDOW => &self.leader_endpoint,
            // Outside the window, follower reads are fine. Replication lag
            // far exceeding STICKY_WINDOW is treated as a separate alert.
            _ => &self.follower_endpoints[0],
        }
    }

    fn note_write(&self) {
        *self.last_write.lock().unwrap() = Some(Instant::now());
    }
}

fn main() {
    let router = SessionRouter {
        leader_endpoint: "leader.catalog.meridian.internal".into(),
        follower_endpoints: vec!["follower-1.catalog.meridian.internal".into()],
        last_write: Mutex::new(None),
    };

    println!("initial read routes to: {}", router.route_read());
    router.note_write();
    println!("post-write read routes to: {}", router.route_read());
}
```
The choice of STICKY_WINDOW is operational: too short and reads fall back to a follower that has not yet applied the session's write, producing exactly the stale read the window exists to prevent; too long and the leader becomes a read bottleneck for every active client. The window should be longer than the 99th-percentile replication lag. Monitoring should alert when actual lag approaches the window value; that alert is the early warning that the policy is about to start producing anomalies.
Key Takeaways
- Single-leader replication is the default replication model for a reason: it gives you a linearizable single object (the leader), a clear write path, and well-understood failure modes. Almost every production system has a "single leader" tier somewhere, even when it advertises as multi-master or leaderless.
- Synchronous vs asynchronous replication is a durability/latency tradeoff. "One synchronous follower plus the leader" is the practical choice for systems that cannot lose acknowledged writes; pure asynchronous is the choice for high-volume systems where individual writes are not individually critical.
- The replication log implementation (statement-based, WAL-shipping, logical) constrains your operational story. WAL shipping requires version-compatible replicas; logical replication enables cross-version and cross-engine replication at the cost of higher traffic volume.
- Failover is the hardest operation in the system. It loses writes, can produce split brains, and depends on a failure detector whose timeout has no universally correct value. Production failover requires a consensus-based coordinator (Module 3), fencing tokens, and explicit synchronous replication of the candidate replica.
- Replication lag is not a bug; it is the price of follower-served reads. Bound the staleness with monitoring, route session-critical reads to the leader, and document the tolerance every workload expects. Lesson 3 (read consistency) makes this discipline concrete.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 6, "Replication" — specifically the sections "Single-Leader Replication," "Setting Up New Followers," "Handling Node Outages," and "Implementation of Replication Logs." Specific failure stories (the catalog's two failovers, the synchronous replication policy decision) are synthesized illustrative scenarios consistent with documented industry patterns; they are not real Meridian incidents and the operational numbers (window sizes, lag thresholds) are illustrative.
Lesson 2: Multi-Leader and Leaderless Replication
Context
The catalog's single-leader replication works when there is a clear "primary" datacenter that all writes flow through. But the Constellation Catalog needs to accept writes at twelve ground stations spanning every continent, often with hundreds of milliseconds of inter-region latency, and sometimes with hours of disconnection. Forcing every Antarctic ground station's update to round-trip through a primary in Virginia is not a viable design: latency is unacceptable for routine writes, and a transatlantic partition would silence Antarctica entirely.
This is the regime where multi-leader and leaderless replication become real options. Multi-leader replication lets multiple regions accept writes locally and reconcile asynchronously across regions. It is the model behind Couchbase's cross-datacenter replication, BDR for PostgreSQL, and, historically, MySQL multi-master setups. Leaderless replication abolishes the leader concept entirely: clients send writes to multiple replicas in parallel, and reads gather responses from multiple replicas and reconcile. This is the Dynamo model, used by Cassandra, ScyllaDB, and Riak.
Both models gain you availability during partitions and lower write latency in exchange for one significant cost: writes can conflict, and the system has to detect and resolve those conflicts. This lesson covers when each model is the right choice, the conflict detection mechanisms (which build on the vector clocks from Module 1), and the menu of resolution strategies. By the end, you should be able to evaluate whether a given workload belongs in a single-leader system, a multi-leader system, or a leaderless system — and to articulate the cost of each choice precisely.
Core Concepts
Multi-Leader Replication: Local Writes, Asynchronous Cross-Region Sync
In a multi-leader configuration, each region (or datacenter, or in some cases each user device) has its own leader. Writes are accepted locally and propagate asynchronously to leaders in other regions. From a single region's perspective, the model looks exactly like single-leader. The complication is the cross-region path.
The standard use cases for multi-leader, per DDIA:
Multi-datacenter operation. A multi-region deployment with a leader per region. Writes are local-latency for every region. Cross-region replication is asynchronous and tolerates inter-region partitions. This is what the catalog needs.
Clients with offline operation. A calendar app where each device is itself a leader. Writes happen locally without connectivity; they sync when the device reconnects. CouchDB, PouchDB, and many mobile-first systems use this shape.
Collaborative editing. Google Docs and similar tools are effectively multi-leader: each user's local edit is accepted immediately and propagated to other users. The conflict resolution is the operational transform or CRDT layer.
The benefit is local-write latency and partition tolerance. The cost is write conflicts: two leaders can accept conflicting writes for the same key, and the system has no a-priori way to decide which is correct.
Handling Conflicts: Avoid, Detect, Resolve
Conflict-handling strategies fall on a spectrum from "structurally impossible" to "manually resolved." DDIA's framing maps well:
Conflict avoidance. Structure the data so that conflicts cannot occur. The catalog could partition orbital objects by hemisphere: each region owns writes for its hemisphere's objects, and any write for an object outside the local hemisphere is forwarded. This is effectively reverting to single-leader for each piece of data, just with the leader selected by data partition rather than by topology. It is the simplest conflict story (no conflicts to handle) but it requires the data to have a partition that maps cleanly to writers.
Last-write-wins (LWW). Each write carries a timestamp; on conflict, the larger timestamp wins. This is the most common default and the most subtly wrong. Cassandra's default behavior is LWW with wall-clock timestamps; the failure mode (the MSS-17 incident from Module 1) is that clock skew silently picks the wrong winner. LWW is fine when concurrent writes to the same key are extremely rare and the system can tolerate dropping the loser — but treat LWW with wall-clock timestamps as silently lossy until proven otherwise.
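To make the skew failure concrete, here is a minimal sketch of a wall-clock LWW merge. The `Write` struct, the `lww_merge` function, and the 500 ms skew are illustrative assumptions; real systems attach timestamps per cell or per row, but the comparison is the same.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Write {
    value: &'static str,
    wall_clock_ms: u64, // timestamp assigned by the writer's local clock
}

/// Last-write-wins: keep whichever write carries the larger timestamp.
/// Ties keep `a`; a real system needs a deterministic tiebreaker here.
fn lww_merge(a: Write, b: Write) -> Write {
    if b.wall_clock_ms > a.wall_clock_ms { b } else { a }
}

fn main() {
    // True order of events: the first write happens at t = 1000 ms, the
    // second at t = 1200 ms. The first writer's clock runs 500 ms fast.
    let first = Write { value: "attitude_old", wall_clock_ms: 1_000 + 500 };
    let second = Write { value: "attitude_new", wall_clock_ms: 1_200 };

    let winner = lww_merge(first, second);

    // The older write wins because its skewed timestamp is larger, and
    // nothing in the result records that a newer write was discarded.
    println!("LWW kept: {}", winner.value);
}
```

The merge is deterministic and convergent, which is why it is attractive as a default; the loss is silent, which is why it needs justification.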
Manual resolution. On conflict, surface both versions to the application; the application (or a human) decides. Git's merge conflicts are the canonical example: when two branches modify the same file, the user resolves. The cost is operator burden; the benefit is that no automatic policy can quietly make the wrong choice.
Automatic semantic merge. For specific data types — sets, counters, last-write maps with causal context — convergent replicated data types (CRDTs) provide automatic merges that are guaranteed to produce the same result regardless of merge order. The catalog's "set of authorized ground stations" could be a grow-only set CRDT; concurrent additions converge automatically. Operational transforms (used in collaborative editors) are the equivalent for ordered sequences. CRDTs are powerful where they apply but require modeling the data as a specific algebraic structure, which is not always possible.
The catalog uses conflict avoidance for orbital object writes (each region owns its assigned objects) and CRDT-based merge for the global registry of active satellites (a grow-only set). The conflict resolution policy is per-table, not per-database.
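A grow-only set is small enough to sketch in full. The `GSet` type below is an illustrative assumption, not the catalog's implementation, and the satellite names are invented; the property it demonstrates is that merge is a set union, so replicas converge regardless of the order in which they exchange state.

```rust
use std::collections::BTreeSet;

/// Grow-only set CRDT: elements can be added but never removed.
#[derive(Debug, Clone, Default, PartialEq)]
struct GSet {
    items: BTreeSet<String>,
}

impl GSet {
    fn add(&mut self, item: &str) {
        self.items.insert(item.to_string());
    }

    /// Merge is set union: commutative, associative, and idempotent,
    /// which is what makes the result independent of merge order.
    fn merge(&mut self, other: &GSet) {
        self.items.extend(other.items.iter().cloned());
    }

    fn contains(&self, item: &str) -> bool {
        self.items.contains(item)
    }
}

fn main() {
    // Two regions add different satellites while disconnected.
    let mut pacific = GSet::default();
    let mut atlantic = GSet::default();
    pacific.add("MSS-17");
    atlantic.add("MSS-31");

    // Exchange state in both directions.
    let pacific_before = pacific.clone();
    pacific.merge(&atlantic);
    atlantic.merge(&pacific_before);

    assert_eq!(pacific, atlantic);
    assert!(pacific.contains("MSS-17") && pacific.contains("MSS-31"));
    println!("converged registry: {:?}", pacific.items);
}
```

The limitation is in the name: a G-Set cannot express removal, so decommissioning a satellite would need a different CRDT or a different policy.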
Leaderless Replication: The Dynamo Model
Leaderless replication, introduced by Amazon's Dynamo paper and popularized by Cassandra, Riak, and Voldemort, takes a different approach. There is no leader. Clients write to N replicas in parallel and wait for W of them to acknowledge; reads query R replicas and reconcile.
If W + R > N, every read quorum intersects every write quorum: any read overlaps with the latest successful write on at least one replica, so the reader can detect and prefer the newest value. This is the formula Dynamo-style systems use. Common configurations are N=3, W=2, R=2; or N=5, W=3, R=3.
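The overlap claim can be checked by brute force for small clusters. The sketch below represents each replica subset as a bitmask and verifies that every possible write set of size W shares a replica with every possible read set of size R; the function name is an illustrative assumption.

```rust
/// Returns true if every W-sized replica set intersects every R-sized
/// replica set in a cluster of `n` replicas (brute force over bitmasks).
fn every_read_sees_a_write(n: u32, w: u32, r: u32) -> bool {
    let all_subsets = 0u32..(1 << n);
    let write_sets: Vec<u32> = all_subsets.clone().filter(|m| m.count_ones() == w).collect();
    let read_sets: Vec<u32> = all_subsets.filter(|m| m.count_ones() == r).collect();
    write_sets
        .iter()
        .all(|ws| read_sets.iter().all(|rs| (ws & rs) != 0))
}

fn main() {
    for (n, w, r) in [(3, 2, 2), (5, 3, 3), (3, 1, 1), (3, 2, 1), (5, 1, 5)] {
        let overlap = every_read_sees_a_write(n, w, r);
        // The brute-force answer matches the W + R > N rule in every case.
        assert_eq!(overlap, w + r > n);
        println!("N={n} W={w} R={r}: guaranteed overlap = {overlap}");
    }
}
```

The check covers only the strict quorum case, in which writes and reads go to the key's designated N replicas.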
Three mechanisms make this work in practice:
Read repair. When a client reads from R replicas and detects that some have stale values, it writes the newest value back to the lagging replicas as part of the read. This is opportunistic anti-entropy on the read path.
Anti-entropy / Merkle trees. Background processes compare replicas and reconcile divergences. Riak and Cassandra use Merkle trees to identify diverged ranges efficiently. This is the catch-all that ensures eventual convergence even for keys that are not being read frequently.
Hinted handoff. When a target replica is unreachable, the writer leaves a "hint" with another node, which forwards the write when the target comes back. This is what allows the system to accept writes during partial node outages without sacrificing the durability target — at the cost of some staleness during the outage.
The Dynamo model gives you sloppy quorums: when nodes fail, the system can broaden the replica set to accept writes from any reachable node, then reconcile when the original replicas return. This trades strong durability guarantees during partitions (writes may land on nodes that are not the "right" replicas) for availability. The catalog's TLE registry uses sloppy quorums with N=5, W=3, R=3, and tolerates the resulting eventual consistency.
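Hinted handoff, the mechanism behind sloppy quorums, can be sketched in a few dozen lines. This is an in-memory illustration under stated assumptions: the `Node` and `Cluster` types and the "first reachable node holds the hint" policy are hypothetical, and real systems choose the hint holder from the ring and expire hints after a deadline.

```rust
use std::collections::HashMap;

// Hypothetical sketch of hinted handoff. `Node`, `Cluster`, and the
// "first reachable node holds the hint" policy are illustrative assumptions.
#[derive(Default)]
struct Node {
    up: bool,
    data: HashMap<String, String>,
    /// Hints held on behalf of unreachable nodes: (target node, key, value).
    hints: Vec<(usize, String, String)>,
}

struct Cluster {
    nodes: Vec<Node>,
}

impl Cluster {
    fn new(n: usize) -> Self {
        let nodes = (0..n).map(|_| Node { up: true, ..Node::default() }).collect();
        Cluster { nodes }
    }

    /// Write to `target`; if it is down, leave a hint on the first reachable
    /// node. Returns false only when no node at all is reachable.
    fn write(&mut self, target: usize, key: &str, value: &str) -> bool {
        if self.nodes[target].up {
            self.nodes[target].data.insert(key.into(), value.into());
            return true;
        }
        match self.nodes.iter_mut().find(|n| n.up) {
            Some(holder) => {
                holder.hints.push((target, key.into(), value.into()));
                true
            }
            None => false,
        }
    }

    /// Called when `target` comes back: replay every hint addressed to it.
    fn recover(&mut self, target: usize) {
        self.nodes[target].up = true;
        let mut replay = Vec::new();
        for node in &mut self.nodes {
            // Move hints for `target` out; keep hints for other targets.
            let (mine, rest): (Vec<_>, Vec<_>) =
                node.hints.drain(..).partition(|(t, _, _)| *t == target);
            node.hints = rest;
            replay.extend(mine);
        }
        for (_, k, v) in replay {
            self.nodes[target].data.insert(k, v);
        }
    }
}

fn main() {
    let mut cluster = Cluster::new(3);
    cluster.nodes[2].up = false;
    let accepted = cluster.write(2, "MSS-17/tle", "epoch-42");
    println!("accepted: {}, hints on node 0: {}", accepted, cluster.nodes[0].hints.len());
    cluster.recover(2);
    println!("node 2 has key: {}", cluster.nodes[2].data.contains_key("MSS-17/tle"));
}
```

Between the write and the recovery, node 2 holds no copy of the value. That window is the staleness cost the text describes.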
Detecting Concurrent Writes Without a Leader
In single-leader replication, the leader imposes a total order on writes; conflicts cannot occur because there is exactly one writer. In multi-leader and leaderless systems, two writes to the same key from different writers can be concurrent in the causal sense — neither writer observed the other's write.
The mechanism for detecting this is the vector clock from Module 1. Riak attaches a vector clock to every object; when two writes have vector clocks that neither dominates, the system stores both as siblings and surfaces them to the application on the next read. This is the operational shape of "we detected a conflict; the application decides."
DDIA's discussion of "Detecting Concurrent Writes" is the practical version of the theory we developed in Module 1: vector clocks are not just a teaching example; they are the production mechanism that lets leaderless systems detect concurrent writes instead of guessing an order. The cost is bookkeeping (every value carries a vector clock; sibling values may accumulate); the benefit is that the system never silently drops a write because of clock skew.
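The comparison that decides between "overwrite" and "store as sibling" is a partial-order check. A minimal sketch, assuming clocks are maps from node id to counter with missing entries treated as zero; the `Causality` enum and the helper names are illustrative, not Riak's API.

```rust
use std::collections::HashMap;

type VClock = HashMap<String, u64>;

#[derive(Debug, PartialEq)]
enum Causality {
    Equal,
    Before,
    After,
    Concurrent,
}

/// Compare two vector clocks. Missing entries count as zero.
fn compare(a: &VClock, b: &VClock) -> Causality {
    let mut a_ahead = false;
    let mut b_ahead = false;
    for k in a.keys().chain(b.keys()) {
        let av = a.get(k).copied().unwrap_or(0);
        let bv = b.get(k).copied().unwrap_or(0);
        if av > bv {
            a_ahead = true;
        }
        if bv > av {
            b_ahead = true;
        }
    }
    match (a_ahead, b_ahead) {
        (false, false) => Causality::Equal,
        (true, false) => Causality::After,
        (false, true) => Causality::Before,
        // Each side has seen something the other has not: keep both as siblings.
        (true, true) => Causality::Concurrent,
    }
}

/// Test helper: build a clock from (node, counter) pairs.
fn clock(entries: &[(&str, u64)]) -> VClock {
    entries.iter().map(|(k, v)| (k.to_string(), *v)).collect()
}

fn main() {
    let base = clock(&[("pacific", 1)]);
    let pacific = clock(&[("pacific", 2)]);
    let indian = clock(&[("pacific", 1), ("indian_ocean", 1)]);
    println!("{:?}", compare(&pacific, &base)); // After
    println!("{:?}", compare(&pacific, &indian)); // Concurrent
}
```

Only the `Concurrent` case produces siblings; `Before` and `After` are ordinary overwrites in one direction or the other.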
Topology in Multi-Leader Systems
When you have N leaders, you have to decide how they replicate to each other. Three topologies appear:
Star (one hub, several spokes). Every leader writes to a central coordinator that forwards to the others. Simple but the hub is a bottleneck and a single point of failure.
Circle / ring. Each leader forwards to one neighbor, which forwards to the next, around a loop. The replication delay across the full ring is the worst-case latency. A single failed leader breaks the ring.
All-to-all. Every leader sends every write to every other leader. Maximum throughput, maximum redundancy, but writes can arrive out of order — and "out of order" in a multi-leader system means causally inverted, which is what vector clocks are for.
The catalog uses all-to-all because the twelve ground stations are well-connected to each other and the per-write fan-out is manageable at the catalog's write rate. The conflict detection layer handles causal inversions.
Code Examples
A Multi-Leader Conflict Surfaced via Vector Clock
This sketches the shape of a multi-leader write path that surfaces concurrent writes as siblings, modeled on Riak's behavior.
use std::cmp::max;
use std::collections::HashMap;

#[derive(Clone, Debug)]
pub struct VersionedValue {
    pub value: String,
    pub clock: HashMap<String, u64>,
}

pub enum WriteOutcome {
    Applied(VersionedValue),
    Siblings(Vec<VersionedValue>),
}

#[derive(Default)]
pub struct MultiLeaderStore {
    // Per-key state: either a single dominant value, or a set of siblings
    // that the application has not yet resolved.
    state: HashMap<String, Vec<VersionedValue>>,
}

impl MultiLeaderStore {
    pub fn write(&mut self, node_id: &str, key: &str, value: String) -> WriteOutcome {
        let existing = self.state.entry(key.to_string()).or_default();
        // The incoming clock is built from the union of existing siblings
        // (so that the new write 'dominates' anything the writer had observed)
        // plus a tick of the writer's own slot.
        let mut new_clock: HashMap<String, u64> = HashMap::new();
        for sib in existing.iter() {
            for (k, v) in &sib.clock {
                let entry = new_clock.entry(k.clone()).or_insert(0);
                *entry = max(*entry, *v);
            }
        }
        *new_clock.entry(node_id.to_string()).or_insert(0) += 1;
        let new_version = VersionedValue { value, clock: new_clock };
        // The new write dominates any existing sibling whose clock it has
        // observed. Surviving siblings are those that are NOT dominated by
        // the new write.
        let surviving: Vec<VersionedValue> = existing
            .drain(..)
            .filter(|sib| !dominates(&new_version.clock, &sib.clock))
            .collect();
        let mut new_state = surviving;
        new_state.push(new_version);
        self.state.insert(key.to_string(), new_state.clone());
        if new_state.len() == 1 {
            WriteOutcome::Applied(new_state.into_iter().next().unwrap())
        } else {
            WriteOutcome::Siblings(new_state)
        }
    }
}

fn dominates(a: &HashMap<String, u64>, b: &HashMap<String, u64>) -> bool {
    let mut strictly = false;
    let keys: std::collections::HashSet<&String> = a.keys().chain(b.keys()).collect();
    for k in keys {
        let av = a.get(k).copied().unwrap_or(0);
        let bv = b.get(k).copied().unwrap_or(0);
        if av < bv {
            return false;
        }
        if av > bv {
            strictly = true;
        }
    }
    strictly
}

fn main() {
    let mut store = MultiLeaderStore::default();
    // Pacific writes 'attitude_X', Indian Ocean concurrently writes 'attitude_Y'
    // from the same baseline. Neither observed the other.
    match store.write("pacific", "MSS-17/attitude", "X".into()) {
        WriteOutcome::Applied(_) => println!("pacific: applied"),
        WriteOutcome::Siblings(_) => println!("pacific: siblings exist"),
    }
    // Indian Ocean writes from the same starting state, so the second branch
    // does not observe the Pacific write. We simulate this by directly
    // injecting a sibling rather than reading first.
    let mut store2 = MultiLeaderStore::default();
    store2.write("indian_ocean", "MSS-17/attitude", "Y".into());
    // Now the two stores reconcile by exchanging their versions.
    for v in store2.state.get("MSS-17/attitude").unwrap() {
        match store.write_external(v.clone(), "MSS-17/attitude") {
            WriteOutcome::Siblings(s) => println!("merged: {} siblings", s.len()),
            WriteOutcome::Applied(_) => println!("merged: dominated existing"),
        }
    }
}

impl MultiLeaderStore {
    /// Merge an externally-produced versioned value (e.g., received from
    /// another leader in the all-to-all replication topology).
    fn write_external(&mut self, incoming: VersionedValue, key: &str) -> WriteOutcome {
        let existing = self.state.entry(key.to_string()).or_default();
        let surviving: Vec<VersionedValue> = existing
            .drain(..)
            .filter(|sib| !dominates(&incoming.clock, &sib.clock))
            .collect();
        let dominated_by_existing =
            surviving.iter().any(|s| dominates(&s.clock, &incoming.clock));
        let mut new_state = surviving;
        if !dominated_by_existing {
            new_state.push(incoming);
        }
        let n = new_state.len();
        self.state.insert(key.to_string(), new_state.clone());
        if n == 1 {
            WriteOutcome::Applied(new_state.into_iter().next().unwrap())
        } else {
            WriteOutcome::Siblings(new_state)
        }
    }
}
This is the essential shape. The writer attaches a vector clock derived from what it has observed; existing siblings the new write dominates are discarded; siblings the new write does not dominate are preserved. The application reads and sees either a single value or a set of siblings to resolve. The same mechanism handles both local writes and merge events from other leaders.
A Leaderless Quorum Write with Read Repair
This shows the Dynamo-style write path: send to N, wait for W, return success.
// SKETCH: leaderless write with W=2, N=3 quorum.
use anyhow::Result;
use tokio::time::{Duration, timeout};
use futures::stream::{FuturesUnordered, StreamExt};
const N: usize = 3;
const W: usize = 2;
const R: usize = 2;
const REPLICA_TIMEOUT: Duration = Duration::from_millis(500);
async fn quorum_write(
replicas: &[Replica],
key: &str,
value: VersionedValue,
) -> Result<()> {
// Dispatch the write to all N replicas in parallel. Each write runs as its
// own spawned task, so it keeps running after this function returns.
let mut pending: FuturesUnordered<_> = replicas
.iter()
.take(N)
.map(|r| {
let r = r.clone();
let v = value.clone();
let k = key.to_string();
tokio::spawn(async move { timeout(REPLICA_TIMEOUT, r.write(&k, v)).await })
})
.collect();
let mut acks = 0;
let mut errors = 0;
// Drain results as they arrive. Return as soon as W replicas ack;
// any later acks are ignored (the request returns; dropping a JoinHandle
// detaches its task, so the slow writes still complete in the background).
while let Some(result) = pending.next().await {
match result {
// Layers, outermost first: task join, timeout, replica write.
Ok(Ok(Ok(()))) => {
acks += 1;
if acks >= W {
return Ok(()); // Quorum met.
}
}
_ => {
errors += 1;
// If too many errors, we cannot meet the quorum.
if errors > N - W {
return Err(anyhow::anyhow!("quorum unreachable"));
}
}
}
}
if acks >= W { Ok(()) } else { Err(anyhow::anyhow!("quorum unreachable")) }
}
// Read with read-repair: collect R responses, identify the newest, write
// back to lagging replicas.
async fn quorum_read(
replicas: &[Replica],
key: &str,
) -> Result<VersionedValue> {
// Tag each request with its replica index: FuturesUnordered yields results
// in completion order, not dispatch order, so a running counter would
// attribute responses to the wrong replicas.
let mut pending: FuturesUnordered<_> = replicas
.iter()
.take(N)
.enumerate()
.map(|(idx, r)| {
let k = key.to_string();
async move { (idx, timeout(REPLICA_TIMEOUT, r.read(&k)).await) }
})
.collect();
let mut responses: Vec<(usize, VersionedValue)> = Vec::new();
while let Some((idx, result)) = pending.next().await {
if let Ok(Ok(Some(v))) = result {
responses.push((idx, v));
if responses.len() >= R {
break;
}
}
}
let newest = responses
.iter()
.max_by(|(_, a), (_, b)| compare_clocks(&a.clock, &b.clock))
.map(|(_, v)| v.clone())
.ok_or_else(|| anyhow::anyhow!("no quorum"))?;
// Read repair: write back to lagging replicas. Don't block the response
// on this - fire-and-forget.
for (idx, val) in &responses {
if !val.clock.eq(&newest.clock) {
let target = replicas[*idx].clone();
let k = key.to_string();
let v = newest.clone();
tokio::spawn(async move {
let _ = target.write(&k, v).await;
});
}
}
Ok(newest)
}
struct Replica;
impl Clone for Replica { fn clone(&self) -> Self { Replica } }
impl Replica {
async fn write(&self, _k: &str, _v: VersionedValue) -> Result<()> { Ok(()) }
async fn read(&self, _k: &str) -> Result<Option<VersionedValue>> { Ok(None) }
}
#[derive(Clone)] struct VersionedValue { clock: std::collections::HashMap<String, u64> }
fn compare_clocks(_a: &std::collections::HashMap<String, u64>, _b: &std::collections::HashMap<String, u64>) -> std::cmp::Ordering { std::cmp::Ordering::Equal }
Three things to notice. First, the early-return on quorum: as soon as W replicas have acknowledged, the call returns — the remaining writes proceed in the background but do not block the client. Second, read-repair is fire-and-forget: the read returns the newest value immediately and lazily updates lagging replicas. Third, the failure model is explicit: if too many replicas fail to respond within the timeout, the quorum is unreachable and the operation returns an error rather than silently producing a partial write.
A CRDT for the Authorized-Ground-Stations Set
When the data type permits, a CRDT avoids conflict resolution entirely:
use std::collections::HashSet;

// G-Set (Grow-only Set) CRDT. Two replicas can independently add items;
// merging is union, which is commutative, associative, and idempotent -
// so the merge result is independent of the order replicas reconcile in.
#[derive(Default, Clone)]
pub struct GSet<T: Eq + std::hash::Hash + Clone> {
    items: HashSet<T>,
}

impl<T: Eq + std::hash::Hash + Clone> GSet<T> {
    pub fn add(&mut self, item: T) {
        self.items.insert(item);
    }

    pub fn contains(&self, item: &T) -> bool {
        self.items.contains(item)
    }

    pub fn merge(&mut self, other: &GSet<T>) {
        for item in &other.items {
            self.items.insert(item.clone());
        }
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}

fn main() {
    let mut pacific = GSet::default();
    pacific.add("ground-pacific-01");
    pacific.add("ground-pacific-02");
    let mut indian = GSet::default();
    indian.add("ground-indian-01");
    indian.add("ground-pacific-01"); // overlap
    pacific.merge(&indian);
    println!("merged set size: {}", pacific.len()); // 3

    // The merge is commutative: doing it the other way produces the same set.
    let mut indian2 = GSet::default();
    indian2.add("ground-indian-01");
    indian2.add("ground-pacific-01");
    let mut pacific2 = GSet::default();
    pacific2.add("ground-pacific-01");
    pacific2.add("ground-pacific-02");
    indian2.merge(&pacific2);
    println!("symmetric merge size: {}", indian2.len()); // 3
}
The constraint that makes this work is that addition is the only operation. The moment you need removal, you need a more complex CRDT (an OR-Set or LWW-Set) that tracks causal context for removes, because "is this element present" depends on whether the add or the remove was more recent. For the authorized-ground-stations set, where stations are added and decommissioning is rare and audited, a G-Set is sufficient. For more dynamic sets, more sophisticated CRDTs apply — see Shapiro et al., "Conflict-free Replicated Data Types" (2011).
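To make the "causal context for removes" concrete, here is a minimal observed-remove set (OR-Set) sketch. It is an illustration under stated assumptions, not the catalog's implementation: tags are (replica id, counter) pairs, tombstones are never garbage-collected, and the type and method names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical OR-Set sketch. Every add gets a unique tag; a remove
// tombstones only the tags the removing replica has observed.
type Tag = (String, u64);

#[derive(Default, Clone)]
struct OrSet {
    replica: String,
    counter: u64,
    adds: HashMap<String, HashSet<Tag>>, // element -> tags of every add seen
    tombstones: HashSet<Tag>,            // tags whose add has been removed
}

impl OrSet {
    fn new(replica: &str) -> Self {
        OrSet { replica: replica.to_string(), ..OrSet::default() }
    }

    fn add(&mut self, item: &str) {
        self.counter += 1;
        let tag = (self.replica.clone(), self.counter);
        self.adds.entry(item.to_string()).or_default().insert(tag);
    }

    /// Remove only the add-tags this replica has observed. A concurrent add
    /// on another replica carries a tag not listed here, so it survives.
    fn remove(&mut self, item: &str) {
        if let Some(tags) = self.adds.get(item) {
            self.tombstones.extend(tags.iter().cloned());
        }
    }

    fn contains(&self, item: &str) -> bool {
        self.adds
            .get(item)
            .map_or(false, |tags| tags.iter().any(|t| !self.tombstones.contains(t)))
    }

    /// Merge is a union of both components, so it is commutative,
    /// associative, and idempotent, like the G-Set merge.
    fn merge(&mut self, other: &OrSet) {
        for (item, tags) in &other.adds {
            self.adds.entry(item.clone()).or_default().extend(tags.iter().cloned());
        }
        self.tombstones.extend(other.tombstones.iter().cloned());
    }
}

fn main() {
    let mut pacific = OrSet::new("pacific");
    let mut indian = OrSet::new("indian");
    pacific.add("ground-pacific-01");
    indian.merge(&pacific);

    // Concurrently: indian decommissions the station, pacific re-adds it.
    indian.remove("ground-pacific-01");
    pacific.add("ground-pacific-01");

    pacific.merge(&indian);
    indian.merge(&pacific);
    println!("pacific sees station: {}", pacific.contains("ground-pacific-01"));
    println!("indian sees station: {}", indian.contains("ground-pacific-01"));
}
```

Both replicas converge on "present": the re-add carries a tag the remove never observed, so the add wins. Whether add-wins is the right policy for decommissioning is an application decision, which is one reason the audited G-Set remains the simpler choice for this table.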
Key Takeaways
- Multi-leader replication trades local-write latency and partition tolerance for the burden of conflict resolution. The catalog uses it for geographic regions; calendars and collaborative editors use it for offline-capable clients.
- Leaderless (Dynamo-style) replication uses parallel writes to N replicas and quorum reads from R replicas. When W + R > N the system gives quorum-consistent reads, but writes can still conflict across concurrent operations.
- Conflict resolution strategies fall on a spectrum: avoid (structure the data to prevent conflicts), automatic (LWW or CRDTs), or manual (surface siblings to the application). Default LWW with wall-clock timestamps is silently lossy under clock skew; treat it as a deliberate choice, not a default.
- Vector clocks are the production mechanism for detecting concurrent writes in multi-leader and leaderless systems. When a system says it "stores siblings" or "surfaces conflicts," it almost certainly means a vector-clock comparison underneath.
- Replication topology (star, ring, all-to-all) determines the redundancy and bandwidth profile of cross-replica traffic. All-to-all has no single point of failure but requires conflict detection because writes can be observed in causally inverted orders.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 6, "Multi-Leader Replication," "Leaderless Replication," "Handling Concurrent Writes to the Same Key," and "Detecting Concurrent Writes." The Dynamo model traces to DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007). CRDTs are from Shapiro, Preguiça, Baquero, Zawirski, "Conflict-Free Replicated Data Types" (SSS 2011). Specific quorum parameters (N=3 W=2 R=2 etc.) are documented Cassandra/Riak defaults but should be verified against current vendor docs.
Lesson 3: Read Consistency and Replication Lag
Context
Two weeks after enabling follower-served reads in the catalog, the on-call queue acquired three new categories of ticket. "I submitted a TLE update; the next read returned the old value, but the read after that returned the new one." "My terminal showed the satellite as 'active' and then thirty seconds later showed it as 'pending' — it went backward in time." "Operator A says the launch is approved; operator B at a different ground station says it isn't. They're both looking at the same satellite record." All three are correct observations of catalog state. None of them are bugs in the catalog code. All of them are manifestations of replication lag producing observable anomalies that the original single-instance catalog could not produce.
Replication lag is not eliminable in any system that serves reads from asynchronous followers; the question is which specific anomalies the application can tolerate and which it cannot. DDIA names three: read-after-write, monotonic reads, and consistent prefix reads. Each describes a class of anomaly, and each has a corresponding "session guarantee" that an application can request — typically at the cost of routing some reads to the leader or pinning a session to a specific replica.
This lesson is the operational counterpart to Lessons 1 and 2. The previous lessons covered how replication works (single-leader, multi-leader, leaderless). This one covers what your users will see when it doesn't work the way they expected, and what session guarantees you can offer to bound the surprise. By the end, you should be able to diagnose a replication-lag anomaly from a user report alone, and to choose the right session guarantee for each workload in the catalog.
Core Concepts
Read-After-Write Consistency (a.k.a. Reading Your Own Writes)
The most common anomaly: a user submits a write, immediately reads the same data, and sees the previous value rather than the one they just wrote. The cause is that the write went to the leader and the read went to a follower that has not yet applied the write.
The user-facing symptom is profound confusion. The submission UI shows "save successful" and the next page load shows the old data. Users assume the system has lost their write. Even when the data is correctly written and propagation will complete in milliseconds, the experience makes the system feel broken.
Read-your-writes consistency is the guarantee that a client never sees the system in a state older than its own most recent write. It is a per-client guarantee: client A may still see stale data that client B just wrote, but no client sees its own writes regress.
The standard implementation strategies, in order of operational simplicity:
- Pin the client's reads to the leader for a window after a write. Simplest, most reliable, costs leader read load. The catalog's session-pinning router from Lesson 1 implements this.
- Track the write's log position on the client and route reads to a replica that has caught up. The client carries a "last write LSN" token; reads include the token, and any replica behind that LSN forwards the read elsewhere or waits. PostgreSQL exposes a standby's replay position via pg_last_wal_replay_lsn(); the application binds the write's LSN to the user's session and compares the two.
- Track the write's log position server-side, indexed by user. A coordinator service maps each user to their last write position and routes their reads accordingly. More moving parts; supports stateless clients.
The catalog uses option 1 for ground-station write sessions. The orbital catalog's analytics tier — where the workload is mostly reads, and the writes are bulk imports done by a separate service — does not implement read-your-writes at all, because the analytics user never reads back their own write.
The case where read-your-writes is not enough: cross-device sessions. A user submits a write from their phone and reads from their laptop; the laptop has no session token tying it to the recent write. The standard answer is to use the user's account identity rather than the device session, but this requires server-side state tied to identity, which is a significantly more elaborate mechanism than session-token pinning.
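Option 1, pinning to the leader for a window after a write, is small enough to sketch in full. This is an illustration, not the session-pinning router from Lesson 1: the `PinningRouter` type, the endpoint names, and the two-second window are assumptions, and the current time is passed in explicitly so the behavior is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical sketch of option 1: pin a session's reads to the leader for a
// fixed window after each write. Names and the window are assumptions.
struct PinningRouter {
    leader: String,
    follower: String,
    pin_window: Duration,
    last_write: HashMap<String, Instant>,
}

impl PinningRouter {
    fn record_write(&mut self, session: &str, now: Instant) {
        self.last_write.insert(session.to_string(), now);
    }

    /// Route to the leader while the session's last write is younger than the
    /// pin window; otherwise a follower is acceptable.
    fn route(&self, session: &str, now: Instant) -> &str {
        match self.last_write.get(session) {
            Some(t) if now.saturating_duration_since(*t) < self.pin_window => &self.leader,
            _ => &self.follower,
        }
    }
}

fn main() {
    let mut router = PinningRouter {
        leader: "leader.catalog".into(),
        follower: "follower-1".into(),
        pin_window: Duration::from_secs(2),
        last_write: HashMap::new(),
    };
    let t0 = Instant::now();
    println!("before write: {}", router.route("session-A", t0));
    router.record_write("session-A", t0);
    println!("100 ms after: {}", router.route("session-A", t0 + Duration::from_millis(100)));
    println!("3 s after: {}", router.route("session-A", t0 + Duration::from_secs(3)));
    println!("other session: {}", router.route("session-B", t0));
}
```

The guarantee holds only if the window exceeds worst-case replication lag. When lag spikes past the window, this approach silently stops providing read-your-writes, which is the main argument for the LSN-token options.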
Monotonic Reads
The user reads a record and sees value X. They reload. They see value W, where W is older than X. The clock has gone backward — not in real time, but in their observable view of system state.
The cause: the first read landed on a follower close to the leader's LSN; the second read landed on a different follower that was further behind. The catalog's state is monotonic from the leader's perspective, but the follower-routed reads can hop between replicas with different lag profiles, producing an observable regression.
Monotonic reads is the guarantee that within a single client session, successive reads never see system state go backward in time. The standard implementation is read-from-the-same-replica: pin the client's reads (within a session) to a single follower. The follower may still lag behind the leader, but it cannot go backward. The cost is that pinning reduces load distribution — a single follower may become hot.
A subtler implementation: the client tracks the leader LSN it has observed and refuses reads from replicas behind that LSN. This is the same token-based mechanism as read-your-writes but tracking observed reads rather than just observed writes. Both can be implemented together with a unified "session position token."
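The unified "session position token" can be sketched as a single LSN that advances on both writes and reads. The following is a minimal illustration under stated assumptions: the `Session` and `Replica` types are hypothetical, and a rejected read returns an error where a real router would retry on another replica or fall back to the leader.

```rust
// Hypothetical sketch of a unified session position token covering both
// read-your-writes and monotonic reads. Types and names are assumptions.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

struct Replica {
    name: &'static str,
    applied: Lsn,
}

#[derive(Default)]
struct Session {
    /// Highest LSN this session has written or observed in a read.
    position: Lsn,
}

impl Session {
    fn note_write(&mut self, lsn: Lsn) {
        self.position = self.position.max(lsn);
    }

    /// Serve the read only if the replica has applied everything the session
    /// has already seen; on success, advance the token to the replica's LSN.
    fn read_from(&mut self, replica: &Replica) -> Result<Lsn, String> {
        if replica.applied < self.position {
            return Err(format!("{} is behind session position", replica.name));
        }
        self.position = replica.applied;
        Ok(replica.applied)
    }
}

fn main() {
    let fresh = Replica { name: "follower-1", applied: Lsn(100) };
    let stale = Replica { name: "follower-2", applied: Lsn(95) };
    let mut session = Session::default();

    println!("{:?}", session.read_from(&fresh)); // Ok(Lsn(100))
    println!("{:?}", session.read_from(&stale)); // Err: would go backward
    session.note_write(Lsn(110));
    println!("{:?}", session.read_from(&fresh)); // Err: own write not applied yet
}
```

Unlike replica pinning, the token lets reads move between followers freely; the only constraint is that each replica chosen is at or ahead of the token.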
Consistent Prefix Reads
The third anomaly is the most subtle and the easiest to miss in testing. The system contains causally related writes — a question, then an answer; a satellite launch, then a status update — and reads from a replica that has the answer but not the question, or the status update but not the launch. The user sees the second event before the first, which is logically incoherent.
Consistent prefix reads is the guarantee that if a sequence of writes happens in a particular order, then any reader sees them in the same order (or doesn't see the later ones yet). It is fundamentally a guarantee about causality preservation across replication.
In a single-leader system, this is largely automatic for a single client because the leader emits writes in order and followers apply them in order. The anomaly appears when:
- Sharded systems route writes for different keys to different leaders. Causal dependencies that span shards are not preserved by the shard-level ordering. The fix is to track causal dependencies explicitly (e.g., with vector clocks), to assign commit timestamps from a coordinated time source (e.g., Spanner's TrueTime), or to use a single global ordering (which is what consensus protocols give you).
- Multi-leader systems with all-to-all topology propagate writes from different leaders along different paths. Without a causal-ordering mechanism (vector clocks!), the order in which a third leader observes the writes may not match the order they were applied.
- Followers serving reads from different parts of the log. If a client reads from one follower for key A and another for key B, the per-follower ordering is preserved but the cross-key ordering is not.
The catalog's solution for the multi-leader regime is causal consistency tracking with vector clocks (the mechanism from Module 1 and Lesson 2). Every read carries the vector clock of the reader's session; replicas refuse to serve reads that would violate the reader's observed causal order. This is more sophisticated than read-your-writes or monotonic reads, and it is the right tool when ordering across keys matters operationally.
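The replica-side check described above reduces to two small functions: one that decides whether a replica may serve a session, and one that folds what the session just read into its clock. This is a sketch under stated assumptions, not the catalog's code; the function names and the map-based clock representation are illustrative.

```rust
use std::collections::HashMap;

type VClock = HashMap<String, u64>;

/// True when `replica` has applied at least everything recorded in `session`.
/// Missing entries count as zero.
fn covers(replica: &VClock, session: &VClock) -> bool {
    session
        .iter()
        .all(|(node, seen)| replica.get(node).copied().unwrap_or(0) >= *seen)
}

/// After a successful read, fold the replica's clock into the session's
/// clock (entry-wise maximum) so later reads cannot observe an older prefix.
fn observe(session: &mut VClock, replica: &VClock) {
    for (node, count) in replica {
        let entry = session.entry(node.clone()).or_insert(0);
        *entry = (*entry).max(*count);
    }
}

/// Test helper: build a clock from (node, counter) pairs.
fn clock(entries: &[(&str, u64)]) -> VClock {
    entries.iter().map(|(k, v)| (k.to_string(), *v)).collect()
}

fn main() {
    let mut session = VClock::new();
    let replica_a = clock(&[("pacific", 3), ("indian_ocean", 1)]);
    let replica_b = clock(&[("pacific", 2), ("indian_ocean", 1)]);

    // A fresh session has observed nothing, so any replica may serve it.
    println!("replica_a may serve: {}", covers(&replica_a, &session));
    observe(&mut session, &replica_a);

    // replica_b is missing a pacific write the session has already seen.
    println!("replica_b may serve: {}", covers(&replica_b, &session));
}
```

A replica that fails the check can either wait until it catches up or redirect the read; which one is right depends on the latency budget of the calling service.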
Session Guarantees: The Composable Layer
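Read-your-writes and monotonic reads are two of the four session guarantees formalized by Terry et al. in the 1994 Bayou session-guarantees paper. Consistent prefix reads is not one of the four; it is DDIA's name for a closely related ordering property. The full set from the paper is: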
- Read-your-writes — a client sees its own writes.
- Monotonic reads — a client doesn't see state regress.
- Monotonic writes — a client's writes are applied in the order they were submitted.
- Writes-follow-reads — if a client read X and then wrote Y, any reader that sees Y also sees X.
These can be offered independently or together. A system that offers all four is providing causal consistency at the session level, which is the strongest model achievable without coordination across writers. Strong consistency models like linearizability provide all four plus cross-client guarantees, at the cost of coordination.
The operational pattern: treat session guarantees as a menu offered to the application, not a property of the entire database. The catalog's TLE update endpoint requests read-your-writes (writers see their own submissions). The analytics dashboard requests monotonic reads (operators don't see time-traveling state). The conjunction-prediction service requests consistent prefix reads (cause precedes effect). The same underlying database supports all three by varying the routing policy per session.
Bounded Staleness as an Operational Knob
Independent of session guarantees, the system can offer bounded staleness: a guarantee that any read from a follower is no more than T seconds (or K log entries) behind the leader. This is what Azure Cosmos DB calls "bounded staleness consistency"; Cassandra has no equivalent built-in level, so deployments approximate it with the consistency level LOCAL_QUORUM plus monitoring of replication lag.
Bounded staleness is operationally useful because it gives the application a number to reason about. "Reads may be up to 5 seconds stale" is something a dashboard can communicate to its users; "reads may be arbitrarily stale" is not. The cost is that the system must reject reads from replicas exceeding the staleness bound, which under prolonged replication lag manifests as availability degradation rather than correctness violation. This is a deliberate tradeoff: surface the lag as an alert rather than hide it behind silently stale data.
The catalog publishes a staleness metric: every replica reports its current lag to a metrics pipeline, and any replica exceeding 1 second of lag is removed from the read-routing pool. The pool re-includes the replica once it catches up. The application sees consistent low-lag reads even when individual replicas are temporarily lagging; the cost is fewer replicas during catch-up periods, which is an availability tradeoff the team has accepted.
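The pool-membership rule described above is a filter over reported lag. A minimal sketch, assuming each replica reports its lag as a duration; the `ReplicaLag` type and endpoint names are hypothetical, and only the 1-second threshold comes from the text.

```rust
use std::time::Duration;

// Hypothetical sketch: build the read-routing pool from reported lag.
struct ReplicaLag {
    endpoint: &'static str,
    lag: Duration,
}

/// Keep replicas whose reported lag does not exceed `max_lag`.
/// An empty result means the caller must fall back to the leader or fail.
fn read_pool(replicas: &[ReplicaLag], max_lag: Duration) -> Vec<&'static str> {
    replicas
        .iter()
        .filter(|r| r.lag <= max_lag)
        .map(|r| r.endpoint)
        .collect()
}

fn main() {
    let replicas = [
        ReplicaLag { endpoint: "follower-1", lag: Duration::from_millis(120) },
        ReplicaLag { endpoint: "follower-2", lag: Duration::from_millis(2_400) },
        ReplicaLag { endpoint: "follower-3", lag: Duration::from_millis(900) },
    ];
    let pool = read_pool(&replicas, Duration::from_secs(1));
    println!("read pool: {:?}", pool);
}
```

Re-running the filter on every metrics update gives the re-inclusion behavior for free: a replica rejoins the pool as soon as its reported lag drops back under the bound.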
Code Examples
A Session-Token-Based Read Router
This implementation routes reads based on a session token that tracks the highest write LSN the session has observed:
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Default, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(u64);

pub struct ReplicaInfo {
    pub endpoint: String,
    pub current_lsn: Lsn,
}

pub struct ReadRouter {
    leader: String,
    replicas: Mutex<Vec<ReplicaInfo>>,
    // Per-session: the LSN the session has observed. Any read from this
    // session must go to a replica with current_lsn >= session_lsn.
    sessions: Mutex<HashMap<String, Lsn>>,
}

impl ReadRouter {
    pub fn new(leader: String, replicas: Vec<ReplicaInfo>) -> Self {
        Self {
            leader,
            replicas: Mutex::new(replicas),
            sessions: Mutex::new(HashMap::new()),
        }
    }

    /// Called after a successful write. Records the LSN the leader assigned
    /// so subsequent reads in this session can wait for the followers to
    /// catch up.
    pub fn record_write(&self, session: &str, lsn: Lsn) {
        let mut s = self.sessions.lock().unwrap();
        let entry = s.entry(session.to_string()).or_insert(Lsn(0));
        if lsn > *entry {
            *entry = lsn;
        }
    }

    /// Called on read. Returns the endpoint of a replica that satisfies the
    /// session's monotonicity requirement, or the leader if no replica is
    /// sufficiently caught up.
    pub fn route(&self, session: &str) -> String {
        let sessions = self.sessions.lock().unwrap();
        let required = sessions.get(session).copied().unwrap_or(Lsn(0));
        let replicas = self.replicas.lock().unwrap();
        // Prefer the freshest replica that has reached `required`. Falling
        // back to the leader guarantees read-your-writes; the cost is leader
        // load when followers lag.
        replicas
            .iter()
            .filter(|r| r.current_lsn >= required)
            .max_by_key(|r| r.current_lsn)
            .map(|r| r.endpoint.clone())
            .unwrap_or_else(|| self.leader.clone())
    }

    /// Background task updates this from heartbeats; in production this would
    /// poll each replica's pg_last_wal_replay_lsn() equivalent.
    pub fn update_replica_lsn(&self, endpoint: &str, lsn: Lsn) {
        let mut replicas = self.replicas.lock().unwrap();
        if let Some(r) = replicas.iter_mut().find(|r| r.endpoint == endpoint) {
            r.current_lsn = lsn;
        }
    }
}

fn main() {
    let router = ReadRouter::new(
        "leader.catalog".into(),
        vec![
            ReplicaInfo { endpoint: "follower-1".into(), current_lsn: Lsn(100) },
            ReplicaInfo { endpoint: "follower-2".into(), current_lsn: Lsn(95) },
        ],
    );
    // Initial read - no writes yet, any follower works
    println!("initial: {}", router.route("session-A"));
    // Write happens, LSN advances to 110
    router.record_write("session-A", Lsn(110));
    // No follower has caught up to 110 yet - route to leader
    println!("post-write: {}", router.route("session-A"));
    // Follower-1 catches up
    router.update_replica_lsn("follower-1", Lsn(115));
    // Now follower-1 satisfies the session requirement
    println!("after catch-up: {}", router.route("session-A"));
}
The key property is that the session token, not the read endpoint, is what carries the guarantee. A read from session A and a read from session B can land on different replicas; each is consistent with its own session's history. The replicas don't need to know about session ordering; they just expose their current LSN, and the router picks accordingly.
Detecting a Consistent-Prefix Violation in Practice
For systems that need consistent prefix reads across shards or replicas, the test harness needs to detect violations. This is a sketch of what such a test looks like:
// Test scenario: a satellite launches (writes 'state=active' to key MSS-23)
// and then immediately reports telemetry (writes a telemetry row keyed by
// MSS-23). A reader following the convoy must see the launch before the
// telemetry, even if the two writes go to different shards.
use anyhow::Result;
use std::time::Duration;
async fn assert_consistent_prefix(client: &CatalogClient) -> Result<()> {
// Write 1: satellite state transition (shard A)
client.put("satellites/MSS-23/state", "active").await?;
// Write 2: first telemetry sample (shard B, references MSS-23)
client.put("telemetry/MSS-23/0001", r#"{"alt": 550}"#).await?;
// A subsequent reader should not be able to see the telemetry without
// also seeing the state transition. Acceptable observations are (a) both
// writes, (b) neither write yet, or (c) the state without the telemetry
// (a valid prefix that is still propagating). Telemetry without state is
// the consistent-prefix violation.
for _attempt in 0..20 {
// Read the effect before the cause. Reading the state first could report
// a false violation if both writes arrive between the two reads.
let telemetry = client.get("telemetry/MSS-23/0001").await?;
let state = client.get("satellites/MSS-23/state").await?;
match (state, telemetry) {
// (a) Both writes visible, and the state is the one we wrote.
(Some(state), Some(_)) if state == "active" => return Ok(()),
(Some(_), Some(_)) => anyhow::bail!("unexpected state value"),
// (b) or (c): propagation in progress; retry. Caps at 20 attempts.
(None, None) | (Some(_), None) => {
tokio::time::sleep(Duration::from_millis(50)).await;
}
(None, Some(_)) => {
// VIOLATION: saw effect without cause.
anyhow::bail!("consistent-prefix violation: telemetry visible before state");
}
}
}
anyhow::bail!("writes never converged");
}
struct CatalogClient;
impl CatalogClient {
async fn put(&self, _k: &str, _v: &str) -> Result<()> { Ok(()) }
async fn get(&self, _k: &str) -> Result<Option<String>> { Ok(None) }
}
The test is intentionally about what should never be observed. A passing run does not prove the system is consistent — it proves no violation was observed this time. The way to gain confidence is to run the test under partition injection, replication lag, and adversarial scheduling, and accumulate enough successful runs to bound the violation probability. This is what Jepsen-style testing automates.
Bounded Staleness Enforcement on the Read Path
The router from earlier can be extended to enforce a maximum staleness:
use std::time::{Duration, Instant};

struct StalenessAwareReplica {
    endpoint: String,
    last_seen_leader_lsn: u64,
    last_seen_at: Instant,
}

impl StalenessAwareReplica {
    fn is_within_bound(&self, max_lag: Duration) -> bool {
        // The replica is considered within bound if its last heartbeat is
        // recent. A stale heartbeat means we don't know how far behind it is,
        // and the conservative choice is to remove it from the pool.
        self.last_seen_at.elapsed() < max_lag
    }
}

struct BoundedStalenessRouter {
    replicas: Vec<StalenessAwareReplica>,
    leader: String,
    max_staleness: Duration,
}

impl BoundedStalenessRouter {
    fn route(&self) -> &str {
        for r in &self.replicas {
            if r.is_within_bound(self.max_staleness) {
                return &r.endpoint;
            }
        }
        // No replica is within the staleness bound. Fall back to the leader.
        // In some configurations the right behavior is to return an error
        // (preserve correctness over availability); here we fall back, which
        // is the right call when the leader is the still-current source.
        &self.leader
    }
}

fn main() {
    let router = BoundedStalenessRouter {
        replicas: vec![
            StalenessAwareReplica {
                endpoint: "follower-1".into(),
                last_seen_leader_lsn: 100,
                last_seen_at: Instant::now(),
            },
        ],
        leader: "leader".into(),
        max_staleness: Duration::from_secs(5),
    };
    println!("routing to: {}", router.route());
}
This is the staleness equivalent of a circuit breaker (Module 4): if a replica's freshness signal goes stale, it is removed from the pool until it recovers. The cost is reduced read capacity; the benefit is that the application sees only reads that are within the documented staleness bound.
Key Takeaways
- Replication lag produces three distinct user-visible anomalies: read-after-write (a writer doesn't see their own write), monotonic reads (state appears to regress), and consistent prefix reads (cause appears after effect). Each has a corresponding session guarantee.
- Read-your-writes is the most commonly needed guarantee. The cheapest implementation is to pin the writer's reads to the leader for a window after each write. The more sophisticated implementations track the leader LSN per session and route to any replica caught up to that LSN.
- Monotonic reads is provided by pinning a session's reads to a single replica, accepting that replica's lag profile in exchange for never going backward in observable state.
- Consistent prefix reads requires causal-order tracking across writes — typically vector clocks (Module 1) or commit timestamps from a coordinated time source like Spanner's TrueTime. This is the guarantee that fails most often in sharded systems where different keys live on different shards.
- Bounded staleness is the operational tool for making lag tolerable: surface lag as a metric, remove lagging replicas from the read pool, alert when no replica is within bound. Replication lag is not a bug; unbounded staleness is.
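The session pinning described for monotonic reads can be sketched as a stable hash from session ID to replica. This is a minimal illustration, not the router from the lesson: the function name pinned_replica and the replica names are hypothetical, and it assumes the replica list does not change during a session.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Picks the same replica for a given session on every call, so the session
/// never hops to a replica that is further behind than the one it last read.
fn pinned_replica<'a>(session_id: &str, replicas: &'a [String]) -> Option<&'a str> {
    if replicas.is_empty() {
        return None;
    }
    // DefaultHasher::new() starts from fixed keys, so the mapping is stable
    // for the lifetime of the process.
    let mut hasher = DefaultHasher::new();
    session_id.hash(&mut hasher);
    let index = (hasher.finish() % replicas.len() as u64) as usize;
    Some(replicas[index].as_str())
}
```

If the pinned replica is removed from the pool, the session has to be re-pinned, and the new replica may be behind the old one. Carrying the session's last-read LSN across the re-pin is one way to close that gap.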
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 6, "Problems with Replication Lag" — specifically the subsections "Reading Your Own Writes," "Monotonic Reads," and "Consistent Prefix Reads." The original session-guarantee taxonomy is from Terry, Demers, Petersen, Spreitzer, Theimer, Welch, "Session Guarantees for Weakly Consistent Replicated Data" (PDIS 1994). Specific implementation details (PostgreSQL's pg_last_wal_replay_lsn, Cosmos DB's bounded staleness) are illustrative and should be verified against current product documentation.
Module 02 Project — Replicated Telemetry Store
Mission Brief
Incident ticket CN-2607-014
Severity: P3
Reporter: Mission Control Platform, Telemetry Group
Status: Open
The telemetry pipeline writes samples to the Constellation Catalog at roughly 10 kHz. Reads come from three sources: the operations dashboard (operators viewing real-time state), the conjunction-prediction service (reads orbital state and recent telemetry together), and the analytics tier (bulk historical scans). The current single-instance design cannot keep up, and the team is migrating to a leader/follower configuration.
This project is the proving ground for that migration. You will build a Rust crate, replicated_telemetry_store, that implements:
- Single-leader replication with one synchronous and N asynchronous followers.
- A pluggable read router that supports three modes (Leader, AnyFollower, and SessionConsistent) corresponding to "linearizable reads at leader cost," "fastest reads, accept staleness," and "read-your-writes via LSN tokens."
- A simulated network layer that lets tests inject replication lag, drop messages, and partition the cluster.
- A test suite that demonstrates each of the three replication-lag anomalies and shows that the corresponding session guarantees eliminate them.
The deliverable does not need to be production-ready storage — an in-memory map is fine. The deliverable must be a correct demonstration that you can reason about replication lag, route reads accordingly, and prove the routing works under adversarial conditions.
Repository Layout
replicated-telemetry-store/
├── Cargo.toml
├── src/
│ ├── lib.rs
│ ├── leader.rs # Leader: write path, log emission, sync replication wait
│ ├── follower.rs # Follower: log apply, lag tracking
│ ├── router.rs # ReadRouter with the three modes
│ ├── network.rs # Simulated bus with lag injection
│ └── lsn.rs # Lsn type, ordering, monotonic generator
├── tests/
│ ├── read_after_write.rs
│ ├── monotonic_reads.rs
│ ├── consistent_prefix.rs
│ └── partition_durability.rs
└── README.md
Required API
// lsn.rs
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
pub struct Lsn(pub u64);
// leader.rs
pub struct Leader {
// synchronous followers - writes wait for these
// asynchronous followers - writes do not wait
}
impl Leader {
pub fn new(node_id: String, network: Arc<Network>) -> Self;
pub fn add_sync_follower(&self, follower_id: &str);
pub fn add_async_follower(&self, follower_id: &str);
pub async fn write(&self, key: String, value: Vec<u8>) -> Result<Lsn>;
pub async fn read(&self, key: &str) -> Option<Vec<u8>>;
pub fn current_lsn(&self) -> Lsn;
}
// follower.rs
pub struct Follower {
// applies entries from the leader's log; tracks last applied LSN
}
impl Follower {
pub fn new(node_id: String, network: Arc<Network>) -> Self;
pub async fn run(&self);
pub async fn read(&self, key: &str) -> Option<Vec<u8>>;
pub fn current_lsn(&self) -> Lsn;
}
// router.rs
pub enum ReadMode {
Leader,
AnyFollower,
SessionConsistent { session_id: String },
}
pub struct ReadRouter {
leader: Arc<Leader>,
followers: Vec<Arc<Follower>>,
}
impl ReadRouter {
pub fn new(leader: Arc<Leader>, followers: Vec<Arc<Follower>>) -> Self;
pub async fn read(&self, key: &str, mode: ReadMode) -> Option<Vec<u8>>;
pub fn record_session_write(&self, session_id: &str, lsn: Lsn);
}
// network.rs
pub struct Network {
// pluggable lag injection per-pair, partition control
}
impl Network {
pub fn new() -> Self;
pub fn inject_lag(&self, from: &str, to: &str, lag: Duration);
pub fn partition(&self, group_a: &[&str], group_b: &[&str]);
pub fn heal(&self);
}
Acceptance Criteria
- cargo build --release completes without warnings under #![deny(warnings)].
- cargo test passes all integration tests with zero flakes across 10 consecutive runs.
- cargo clippy -- -D warnings produces no lints.
- The leader correctly waits for synchronous-follower acknowledgment before returning from write(). Test: inject a 500ms lag on the sync follower path; a write call should take at least 500ms to complete.
- An asynchronous-follower path does not block writes. Test: inject a 5-second lag on the async follower; a write call completes in well under 5 seconds.
- The Leader read mode is linearizable: under any lag injection, a read immediately after a write sees the new value.
- The AnyFollower read mode demonstrably exhibits read-after-write violations under injected lag. Test: write a value, immediately read with AnyFollower; the test asserts that sometimes the read returns the previous value when lag is injected.
- The SessionConsistent read mode eliminates read-after-write violations. Test: under the same injected lag, a session that writes-then-reads always sees the new value.
- The SessionConsistent read mode falls back to the leader when no follower is caught up. Test: with 5-second lag on all followers, session-consistent reads in the first second land on the leader.
- Monotonic reads test: with two followers at different lag levels, repeated reads from an AnyFollower mode can demonstrably regress (the test detects the regression). With SessionConsistent mode, no regression is observed across 1000 successive reads.
- Partition durability test: with one synchronous follower and one async follower, after partitioning the leader from the async follower and then crashing the leader, the synchronous follower can be promoted with no acknowledged writes lost. The async follower's missing writes are recovered when it rejoins.
- (self-assessed) The README explains the three read modes clearly enough that a new engineer could choose the right mode for a workload they describe.
- (self-assessed) The lag-injection tests are deterministic: running them 100 times produces consistent results, with violations appearing under AnyFollower and not under SessionConsistent every single time.
- (self-assessed) The code path for synchronous-follower acknowledgment is straightforward to extend to a quorum (e.g., "wait for 2 of 3 followers"). The current add_sync_follower API does not preclude this generalization.
Expected Output
cargo test --release read_after_write -- --nocapture:
[setup] leader + 1 sync follower + 2 async followers
[setup] injecting 200ms lag on async-follower-1
[setup] injecting 200ms lag on async-follower-2
[any-follower] write 'MSS-23/state' = 'active' returned in 5ms (sync only)
[any-follower] immediate read returned: None
[any-follower] re-read after 250ms returned: Some('active')
PASS: AnyFollower exhibits read-after-write anomaly under lag
[session-consistent] write 'MSS-23/state' = 'active' returned in 5ms
[session-consistent] immediate read returned: Some('active') from leader (no follower caught up)
[session-consistent] re-read after 250ms returned: Some('active') from async-follower-1
PASS: SessionConsistent eliminates read-after-write anomaly
Hints
1. Modeling the synchronous-follower wait
The leader's write() needs to: (1) append to its local state, (2) generate the LSN, (3) dispatch to all followers via the network, (4) wait for synchronous followers' acks, (5) return. The "wait for acks" part is the structurally new piece. A tokio::sync::oneshot channel per write, completed by the sync-follower path on ack, is the cleanest way to model this. A oneshot sender is consumed by its single send, so the async-follower path cannot complete the same channel; give that path no sender at all (or a separate one that write() never awaits), so the leader's write does not wait for it.
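The shape of step (4) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: std::sync::mpsc and OS threads stand in for the tokio oneshot and async tasks, and spawn_follower and write_and_wait are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Simulates one follower: applies the entry after `lag`, then acks if it
/// was handed an ack channel. Only the synchronous follower gets one.
fn spawn_follower(
    lag: Duration,
    ack: Option<mpsc::Sender<u64>>,
    lsn: u64,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        thread::sleep(lag); // stand-in for network and apply latency
        if let Some(tx) = ack {
            // The receiver may already be gone; ignoring the error is fine here.
            let _ = tx.send(lsn);
        }
    })
}

/// Dispatches to both followers and returns how long the write blocked.
/// It waits for the synchronous follower's ack only.
fn write_and_wait(lsn: u64, sync_lag: Duration, async_lag: Duration) -> Duration {
    let started = Instant::now();
    let (ack_tx, ack_rx) = mpsc::channel();
    let _sync_follower = spawn_follower(sync_lag, Some(ack_tx), lsn);
    let _async_follower = spawn_follower(async_lag, None, lsn); // fire and forget
    let acked = ack_rx.recv().expect("sync follower dropped its ack sender");
    assert_eq!(acked, lsn);
    started.elapsed()
}
```

The async follower never receives a sender, so nothing it does can unblock or delay the write. Extending this to a quorum means handing clones of the sender to several followers and counting acks.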
2. Simulating lag without sleeping in tests
Real tokio::time::sleep calls work but make tests slow. Consider tokio::time::pause() and tokio::time::advance() in tests: you can advance virtual time without actually waiting, which makes a "500ms lag" test complete in microseconds. The lag-injection module should call tokio::time::sleep, which respects the virtual-time pause.
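The idea behind virtual time can also be shown without any runtime. The sketch below is a hand-rolled delivery queue driven by an explicit clock; it is not tokio's API and not the required Network type. SimNetwork, send, and advance are hypothetical names chosen for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::Duration;

/// A virtual clock plus a queue of in-flight messages.
struct SimNetwork {
    now: Duration,
    // Min-heap ordered by (delivery time, send sequence number).
    in_flight: BinaryHeap<Reverse<(Duration, u64, String)>>,
    next_seq: u64,
}

impl SimNetwork {
    fn new() -> Self {
        Self { now: Duration::ZERO, in_flight: BinaryHeap::new(), next_seq: 0 }
    }

    /// Schedules `msg` for delivery `lag` after the current virtual time.
    fn send(&mut self, msg: &str, lag: Duration) {
        self.in_flight
            .push(Reverse((self.now + lag, self.next_seq, msg.to_string())));
        self.next_seq += 1;
    }

    /// Moves virtual time forward and returns every message now due,
    /// earliest first. No real time passes.
    fn advance(&mut self, by: Duration) -> Vec<String> {
        self.now += by;
        let mut delivered = Vec::new();
        while let Some(Reverse((due, _, _))) = self.in_flight.peek() {
            if *due > self.now {
                break;
            }
            let Reverse((_, _, msg)) = self.in_flight.pop().expect("peeked entry exists");
            delivered.push(msg);
        }
        delivered
    }
}
```

A test that sends with a 500ms lag and then calls advance(500ms) sees the delivery immediately, which is the same property the paused tokio clock provides for sleep-based lag injection.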
3. Testing for "sometimes" anomalies
The AnyFollower mode produces read-after-write violations sometimes — depending on which follower is chosen and when. To make the test deterministic, control the follower selection (e.g., a select_follower hook that the test can override) and force selection of the lagging follower. The test asserts that selecting a known-lagging follower produces the anomaly; this is far more robust than "run it 1000 times and hope to observe a violation."
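One possible shape for the selection hook is a boxed closure the test can replace. This is a sketch with hypothetical names (FollowerView, SelectFollower, AnyFollowerRouter); the required ReadRouter API does not prescribe how selection is injected.

```rust
/// A follower reduced to what this sketch needs.
struct FollowerView {
    name: &'static str,
    value: Option<&'static str>, // what a read of the test key returns right now
}

/// Hypothetical hook: given the follower count, return the index to read from.
type SelectFollower = Box<dyn Fn(usize) -> usize>;

struct AnyFollowerRouter {
    followers: Vec<FollowerView>,
    select_follower: SelectFollower,
}

impl AnyFollowerRouter {
    /// Returns the chosen follower's name and the value it served.
    fn read(&self) -> Option<(&'static str, Option<&'static str>)> {
        if self.followers.is_empty() {
            return None;
        }
        // The modulo keeps a misbehaving hook from indexing out of bounds.
        let index = (self.select_follower)(self.followers.len()) % self.followers.len();
        let chosen = &self.followers[index];
        Some((chosen.name, chosen.value))
    }
}
```

Production code would install a random or round-robin selector; the anomaly test installs one that always returns the lagging follower's index and asserts on the stale read.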
4. LSN ordering and session tracking
The router's SessionConsistent mode needs to track each session's last-written LSN. A HashMap<String, Lsn> behind a Mutex is the simplest implementation. After each write, the caller passes the LSN returned by the leader to record_session_write(), which stores it against the session. On read(), the router compares each follower's current_lsn() against the session's required LSN. Falling back to the leader when no follower satisfies the requirement is correctness-preserving but costs leader load; that is the documented tradeoff.
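A minimal sketch of that bookkeeping follows. It assumes follower progress is passed in as a slice of applied LSNs rather than read from live Follower objects, and the SessionTracker and Target names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

#[derive(Debug, PartialEq)]
enum Target {
    Follower(usize),
    Leader,
}

struct SessionTracker {
    last_write: Mutex<HashMap<String, Lsn>>,
}

impl SessionTracker {
    fn new() -> Self {
        Self { last_write: Mutex::new(HashMap::new()) }
    }

    fn record_session_write(&self, session_id: &str, lsn: Lsn) {
        let mut map = self.last_write.lock().unwrap();
        let entry = map.entry(session_id.to_string()).or_insert(lsn);
        // Keep the maximum so an out-of-order call cannot lower the requirement.
        if lsn > *entry {
            *entry = lsn;
        }
    }

    /// `follower_lsns[i]` is follower i's last applied LSN. Returns the first
    /// follower that has caught up to the session, or the leader if none has.
    fn route(&self, session_id: &str, follower_lsns: &[Lsn]) -> Target {
        let required = self
            .last_write
            .lock()
            .unwrap()
            .get(session_id)
            .copied()
            .unwrap_or(Lsn(0));
        follower_lsns
            .iter()
            .position(|applied| *applied >= required)
            .map(Target::Follower)
            .unwrap_or(Target::Leader)
    }
}
```

A session with no recorded write has a requirement of Lsn(0), so any follower qualifies for it.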
5. Partition durability
For the durability test: have the leader configured with one sync follower and one async follower. Inject lag on the async path. Write 100 entries. Verify the sync follower has all 100; the async follower has < 100. Partition the leader (no further writes can complete). "Crash" the leader by setting its state to a special failed flag. Promote the sync follower. Heal the partition. Verify the async follower catches up to the sync follower's state. The test asserts no acknowledged writes were lost.
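The sequence of steps can be rehearsed on plain in-memory logs before wiring it to the real Leader and Follower types. The sketch below is such a rehearsal: NodeLog, catch_up, and run_scenario are hypothetical names, and lag is modeled by letting the async follower catch up only on every Nth write.

```rust
/// In-memory stand-in for a node's replicated log.
#[derive(Default)]
struct NodeLog {
    entries: Vec<u32>,
}

/// Copies every entry `dst` is missing from `src` (catch-up after healing).
fn catch_up(dst: &mut NodeLog, src: &NodeLog) {
    let have = dst.entries.len();
    dst.entries.extend_from_slice(&src.entries[have..]);
}

/// Runs the scenario and returns (acknowledged writes, new leader's log,
/// async follower's log). `async_applies_every` must be non-zero.
fn run_scenario(async_applies_every: usize) -> (Vec<u32>, NodeLog, NodeLog) {
    let mut leader = NodeLog::default();
    let mut sync_follower = NodeLog::default();
    let mut async_follower = NodeLog::default();
    let mut acknowledged = Vec::new();

    for value in 0..100u32 {
        leader.entries.push(value);
        sync_follower.entries.push(value); // the write blocks until this ack
        acknowledged.push(value); // only now is the client told "ok"
        // Injected lag: the async follower catches up only on every Nth write.
        if (value as usize + 1) % async_applies_every == 0 {
            catch_up(&mut async_follower, &leader);
        }
    }
    assert!(async_follower.entries.len() < 100, "async follower should lag");

    drop(leader); // "crash": the old leader's state is gone
    let new_leader = sync_follower; // promotion
    catch_up(&mut async_follower, &new_leader); // heal, then replicate
    (acknowledged, new_leader, async_follower)
}
```

The property under test is that the acknowledged list equals the promoted follower's log, which holds here because acknowledgment happens only after the sync follower has the entry.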
6. Why the AnyFollower mode is intentionally bad
It's tempting to make AnyFollower smarter — e.g., to fall back to the leader when no follower has the latest LSN. Don't. The point of AnyFollower is to model the "fast, possibly stale" mode that real production systems offer for non-critical reads. Making it smarter erases the educational distinction between modes. If you want a hybrid mode, make it a fourth option (BestEffortFresh or similar), not a modification of AnyFollower.
Source Anchors
- DDIA 2nd Edition, Chapter 6 — "Single-Leader Replication," "Problems with Replication Lag"
- Terry, Demers, Petersen, Spreitzer, Theimer, Welch, "Session Guarantees for Weakly Consistent Replicated Data" (PDIS 1994) — the canonical reference for session guarantees
- PostgreSQL streaming replication documentation — for an example of how synchronous and asynchronous followers are configured in a real system
Module 03 — Consensus & Raft
"The Antarctic relay path has a 14-minute coverage gap. During that gap, the catalog's leader-election runbook calls for a human operator to declare promotion. The November storm proved this doesn't scale."
Mission Context
Modules 1 and 2 established the failure model (the network and time are unreliable) and the basic replication strategies that build reliability on top. This module covers the consensus mechanisms that allow a cluster to make safe, automatic decisions despite that unreliability — most importantly, which node is the leader, but also which writes are committed, which membership changes have taken effect, and which transactions have been atomically committed.
Consensus is theoretically difficult (FLP impossibility) and practically achievable (Raft, Paxos under partial synchrony). The three lessons in this module trace that arc: why consensus is hard and why it is the right tool; how Raft works at the level of detail needed to implement and operate it; and what the protocol's operational regime — membership changes, log compaction, partition recovery — looks like in production. The capstone project asks you to implement a working Raft library and demonstrate it survives adversarial scheduling.
Once a team operates a Raft cluster, they have moved from a "single leader with manual failover" architecture to a "cluster that elects its own leader" architecture. This is the architectural inflection point that the rest of the Constellation Network depends on — coordination services (Module 5), distributed locks, fencing tokens, and most automated cluster management all build on consensus.
Lessons
| # | Title | Source |
|---|---|---|
| 1 | Why Consensus Is Hard | DDIA Ch. 10 + FLP 1985 |
| 2 | The Raft Algorithm — Leader Election & Log Replication | Raft paper (Ongaro & Ousterhout 2014) |
| 3 | Raft in Practice — Membership, Snapshots, Recovery | Raft paper + Ongaro dissertation |
Project
Orbital Raft — implement a Raft consensus library in Rust. The project covers the full protocol: leader election, log replication, persistent state, single-server membership changes, snapshots and InstallSnapshot, and a test harness that injects partitions and verifies both safety (no two leaders in a term, no committed entry lost) and liveness (the cluster elects a leader and commits commands within bounded time) under adversarial conditions.
Position
Module 3 of 6 in the Distributed Systems track.
What You Should Be Able to Do After This Module
- Explain why consensus is required for linearizable operations and automatic leader election, and articulate the FLP impossibility and how Raft sidesteps it.
- Trace a write through Raft from client submission to commit to state-machine application, identifying every persistence and quorum-acknowledgment step along the way.
- Identify the election restriction in a Raft implementation and explain why it is safety-critical.
- Diagnose a Raft cluster's operational state from its metrics: leader stability, commit latency, election frequency, snapshot rate.
- Choose between read-index and lease-read for linearizable reads, and articulate the clock-skew dependency of the lease-read approach.
- Reason about cluster behavior under partition scenarios — which side makes progress, what happens to uncommitted entries when the partition heals, and why odd-numbered clusters are preferred.
Source Materials
- DDIA 2nd Edition (Kleppmann & Riccomini, 2026), Chapter 10 — "Consensus" and "Consensus in Practice." The chapter framing for why consensus is needed and how it is structurally equivalent to atomic commit, linearizable CAS, and total-order broadcast.
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014) — the canonical Raft paper. Required reading; readable; rigorous. This module's pedagogy follows the paper's decomposition into election, replication, and safety.
- Ongaro, "Consensus: Bridging Theory and Practice" (Stanford PhD dissertation, 2014) — the deep reference for the operational concerns covered in Lesson 3: membership changes, snapshots, linearizable reads, performance optimizations.
- Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process" (Journal of the ACM, April 1985). The original FLP impossibility paper. Foundational but dense; the discussion in Lesson 1 is sufficient unless you want the formal proof.
- etcd raft library source (github.com/etcd-io/raft) — a high-quality production Raft reference. Useful for comparing design choices to your project implementation.
Source notes on individual lessons flag specific claims for verification.
Lesson 1: Why Consensus Is Hard
Context
The Constellation Network's most consequential decision is one that has not yet been automated: when the primary leader of the catalog cluster fails, who promotes the new leader? Today this is a paging-followed-by-runbook procedure. An operator confirms the failure (often by waiting for several other operators to confirm the same observation), uses an out-of-band channel to declare the promotion, and updates the configuration of the cluster's followers to point to the new leader. The procedure works because there is exactly one operations team and they can coordinate among themselves. It does not scale, it does not work at 03:00 local time on a weekend, and it does not work when the operations team's communication channel is the same network that just partitioned.
The mechanism the team needs is distributed consensus: a protocol by which a group of nodes can agree on a value (in this case, "who is the leader now") in the presence of failures and network partitions. Consensus is the mechanism behind every modern automated failover system, every distributed lock service, every cluster-membership coordinator. ZooKeeper, etcd, Consul, and the consensus layer of Spanner, CockroachDB, and YugabyteDB all implement variants of consensus.
This lesson sets up the why of consensus before Lesson 2 covers the how (Raft). By the end, you should understand why consensus is theoretically difficult (the FLP impossibility result), why it is practically achievable (with weak synchrony assumptions), and why the family of problems — atomic commit, leader election, total-order broadcast, linearizable CAS — are all the same problem in disguise. This framing matters because it tells you when reaching for consensus is the right tool, and when something cheaper will do.
Core Concepts
What Consensus Is, Precisely
DDIA defines consensus in terms of four formal properties that any consensus protocol must satisfy:
- Uniform agreement: No two nodes decide on different values. (Often shortened to "agreement"; this is a safety property.)
- Integrity: No node decides twice. Once a node decides, the decision stands.
- Validity: If a node decides on value v, then v was proposed by some node.
- Termination: Every non-failing node eventually decides on a value. (This is a liveness property.)
Agreement and integrity say that nothing bad happens (no inconsistent or repeated decisions). Validity says the system can't decide on a value out of nowhere. Termination says something good eventually happens (the protocol completes).
These are not trivial requirements. Most weaker protocols give you only a subset. Best-effort broadcast satisfies validity but not agreement. Single-leader replication satisfies agreement (the leader decides) but only conditional termination (if the leader is up). A consensus protocol is one that satisfies all four properties in the presence of any minority of failures.
The "minority of failures" qualifier is essential. No protocol can tolerate an arbitrary number of failures: if every node fails, no protocol can produce a decision. The standard assumption is that fewer than half the nodes fail, so a cluster of n = 2f + 1 nodes tolerates f failures. This is why Raft, Paxos, and their relatives are typically configured with odd numbers of nodes (3, 5, 7): growing a three-node cluster to four raises the majority from 2 to 3 but still tolerates only one failure, so the extra node adds cost without adding fault tolerance.
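The majority arithmetic is small enough to write down directly. The two helper names below are hypothetical; the formulas are the standard ones for majority quorums.

```rust
/// Smallest number of nodes that forms a majority of an `n`-node cluster.
fn quorum_size(n: usize) -> usize {
    n / 2 + 1
}

/// Largest number of simultaneous failures that still leaves a majority.
fn tolerated_failures(n: usize) -> usize {
    n.saturating_sub(1) / 2
}
```

For n = 3 and n = 4 the tolerated failure count is 1 in both cases, while the quorum grows from 2 to 3. Every write in the four-node cluster waits on one more acknowledgment for no gain in fault tolerance.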
The FLP Impossibility Result
In 1985, Fischer, Lynch, and Paterson published a result that is one of the foundational impossibility theorems of distributed systems: in a fully asynchronous system, no deterministic consensus protocol can guarantee termination in the presence of even a single faulty process.
The intuition: in a fully asynchronous network, you cannot distinguish a node that has crashed from a node that is merely slow. A consensus protocol must wait to hear from enough nodes to know the decision is valid — but if it waits forever for a node that is "just slow," it violates termination; if it gives up too early, it might be wrong about which value is decided, violating agreement.
The result sounds devastating but is more constraining than fatal. It says that you cannot have guaranteed termination in a fully asynchronous system. Real consensus protocols sidestep it in two ways:
Randomization. Randomized consensus protocols (Ben-Or's algorithm is the classic example) terminate with probability 1, but no fixed bound on the number of rounds holds for every execution. For practical purposes convergence is fast. FLP is not contradicted, because the result covers only deterministic protocols. Bitcoin's proof-of-work is a looser relative: it makes agreement itself probabilistic, with confidence in a block growing as more blocks are built on top of it.
Partial synchrony. Real networks are not fully asynchronous; they are "mostly synchronous, sometimes not." The partial-synchrony model (Dwork, Lynch, Stockmeyer 1988) assumes that message delays can be arbitrary for a while, but that after some unknown, finite point in time (often called the global stabilization time) messages are delivered within a bounded delay. Under partial synchrony, deterministic consensus is possible, and Raft and Paxos are the production protocols for this model. They make progress during synchronous periods and stall without violating safety during asynchronous ones, resuming when the network stabilizes.
The practical takeaway: when you read that a consensus protocol "tolerates network delays," what is meant is that it preserves safety (no wrong decisions) under arbitrary delays, and preserves liveness (progress) only when the network is synchronous enough. This is the right model. Permanent asynchrony is indistinguishable from total network failure, against which no protocol can make progress.
Why Linearizability Requires Consensus
Module 1 introduced linearizability as the strongest single-object consistency model. DDIA's deep dive in Chapter 10 makes a specific claim: linearizable operations require consensus. Not just resemble consensus, not just enable consensus — they are equivalent in their fundamental requirements.
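The argument: a linearizable register is one where every operation appears to take effect at a single point in time, and all clients observe the same total order of operations. Producing this total order requires that the set of replicas agree on the order. Agreement on an order is consensus; specifically, total-order broadcast, where the value being agreed on is "the next operation in the log." Strictly, the equivalence is with read-modify-write operations such as compare-and-set: a register that supports only plain reads and writes can be made linearizable with quorum reads and writes alone, but as soon as the database exposes CAS, uniqueness constraints, or transactions, it needs consensus.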
The implications are practical:
- Any database that offers linearizability under failures is running consensus internally. Spanner, CockroachDB, FoundationDB, and etcd all expose linearizable operations and all run Paxos or Raft.
- A database that advertises strong consistency without consensus is making a weaker claim. It may be linearizable when no failures occur; it may degrade silently when they do.
- Performance-wise, every linearizable operation pays the consensus round-trip cost. This is why high-throughput systems offer linearizability as an opt-in (per-operation or per-table) rather than the default — the cost is too high for workloads that can tolerate weaker consistency.
Atomic Commit and Consensus Are the Same Problem
The other place consensus appears unexpectedly is in atomic commit — the problem of ensuring that a transaction either commits on all participating nodes or aborts on all of them. The classical solution is two-phase commit (2PC), and DDIA discusses 2PC's pathologies in detail: it blocks indefinitely if the coordinator fails between the prepare and commit phases.
The deep observation is that atomic commit and consensus are equivalent: any solution to one gives you a solution to the other. 2PC's blocking behavior is not a quirk of 2PC; it is a manifestation of the same fundamental difficulty as FLP. The fix is to replace the single coordinator with a consensus-protected log of commit decisions, which is what production systems do — Spanner's 2PC is built on top of Paxos, CockroachDB's distributed transactions are built on Raft.
The atomic commit problem has its own four formal properties (validity, agreement, integrity, termination — same names, slightly different meanings), and they map cleanly onto consensus. If you understand consensus, you understand atomic commit; the implementation details differ but the impossibility is the same.
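The blocking behavior of 2PC described above comes down to what a participant is allowed to do when the coordinator goes silent. The sketch below encodes that decision as a pure function; the type and function names are hypothetical, and it models a single participant's view only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParticipantState {
    Working,   // has not voted yet
    Prepared,  // voted "yes" and is waiting for the decision
    Committed,
    Aborted,
}

#[derive(Debug, PartialEq)]
enum TimeoutAction {
    AbortLocally,
    BlockInDoubt,
    NothingToDo,
}

/// What a participant may safely do when it stops hearing from the coordinator.
fn on_coordinator_silence(state: ParticipantState) -> TimeoutAction {
    match state {
        // No "yes" vote has been sent, so aborting cannot contradict anyone.
        ParticipantState::Working => TimeoutAction::AbortLocally,
        // Voted "yes": the coordinator may have decided either way, and another
        // participant may already have acted on it. Guessing risks divergence.
        ParticipantState::Prepared => TimeoutAction::BlockInDoubt,
        ParticipantState::Committed | ParticipantState::Aborted => TimeoutAction::NothingToDo,
    }
}
```

The Prepared arm is the one a consensus-protected decision log removes: the participant can learn the outcome from the log's surviving majority instead of waiting for one specific coordinator to return.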
Single-Value Consensus and Compare-and-Set
The simplest expressions of consensus are single-value consensus (decide on one value, once) and compare-and-set (decide which of two writes to a register wins). Both can be implemented on top of total-order broadcast: each operation is broadcast through the log, and the first to land on the deciding offset is the one that takes effect.
DDIA's table makes this explicit: single-value consensus, atomic commit, total-order broadcast, linearizable CAS, lock acquisition, and uniqueness constraints are all equivalent in computational power. Any of these can be reduced to any other. The "consensus number" of a primitive (Herlihy 1991) measures how powerful it is: linearizable CAS has infinite consensus number, meaning it can implement consensus for any number of processes. Lower-power primitives (atomic increment, fetch-and-add) cannot implement consensus past a certain group size.
The practical implication: when you reach for a distributed lock service or a leader election service, you are reaching for consensus. The advertised primitive (lock, election, CAS) is a thin veneer over a consensus log. ZooKeeper's API exposes locks; etcd's API exposes a KV store with linearizable operations; both are consensus engines under the hood. Knowing this lets you pick the right level of abstraction — and lets you recognize when you're solving a consensus problem by accident, with mechanisms that don't actually provide consensus guarantees.
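The reduction from compare-and-set to total-order broadcast can be sketched in a few lines: once every replica sees the same ordered log, applying it deterministically yields the same winner everywhere. The log here is a plain slice, which assumes the ordering problem is already solved; CasOp and apply_log are hypothetical names.

```rust
/// One compare-and-set request as it appears in the totally ordered log.
struct CasOp {
    proposer: &'static str,
    expected: Option<&'static str>,
    new: &'static str,
}

/// Applies the log in order. Returns the final register value and the
/// proposers whose CAS took effect.
fn apply_log(log: &[CasOp]) -> (Option<&'static str>, Vec<&'static str>) {
    let mut register: Option<&'static str> = None;
    let mut winners = Vec::new();
    for op in log {
        // A CAS succeeds only if the register still holds the expected value.
        if register == op.expected {
            register = Some(op.new);
            winners.push(op.proposer);
        }
    }
    (register, winners)
}
```

With two nodes both proposing themselves as leader against an empty register, whichever entry lands first in the log wins, and the second CAS fails on every replica. All of the difficulty lives in producing the ordered log, which is the consensus problem itself.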
When Not to Reach for Consensus
Given how powerful and important consensus is, the temptation is to use it for everything. Don't. Consensus pays a per-operation latency cost (at minimum, one round-trip to a majority quorum) and a configuration cost (running and operating a consensus cluster). For many problems, cheaper mechanisms suffice:
- Eventually consistent updates (CRDTs, vector clocks, anti-entropy) — when the application can tolerate temporary divergence.
- Single-leader replication with manual failover — when leader election doesn't need to be automatic.
- Sharding by key — when independent shards don't need cross-shard agreement.
- Best-effort broadcast — for non-critical notifications.
Use consensus when you need: automatic leader election that survives failures; a linearizable operation under failures; an atomic decision among nodes that must converge to the same answer; or membership changes that must be totally ordered. The catalog's leader election is one of these. The catalog's bulk telemetry ingest is not — it does not need consensus, and forcing it through a consensus log would limit throughput unacceptably.
Code Examples
A Toy Single-Value Consensus Decision (Sketch)
The smallest illustrative shape — not a production protocol — captures the structural difficulty:
// TOY: a single-value consensus over N nodes with a fixed proposer.
// This is NOT Raft or Paxos; it elides leader election, log replication,
// and crash recovery. The point is to expose what consensus *requires*.
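use anyhow::Result;
struct Node { id: u64 }
// The result borrows from `value` (or is a 'static string reported by a node),
// so the lifetime is named explicitly; elision cannot pick between the two
// reference parameters.
async fn propose<'a>(nodes: &[Node], value: &'a str) -> Result<&'a str> {
// Phase 1: ask a majority if they have already accepted a value.
// If any has, we must adopt that value (agreement / safety).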
let mut acks = 0;
let mut existing_value: Option<&str> = None;
for node in nodes {
if let Some(prev) = ask_node(node).await? {
existing_value = Some(prev); // Honor what was already decided.
}
acks += 1;
if acks > nodes.len() / 2 { break; } // Majority reached.
}
let chosen = existing_value.unwrap_or(value);
// Phase 2: tell a majority to accept the chosen value.
// Once a majority has accepted, the decision is final.
let mut applies = 0;
for node in nodes {
if accept_value(node, chosen).await? {
applies += 1;
if applies > nodes.len() / 2 {
return Ok(chosen);
}
}
}
anyhow::bail!("could not achieve majority")
}
async fn ask_node(_n: &Node) -> Result<Option<&'static str>> { Ok(None) }
async fn accept_value(_n: &Node, _v: &str) -> Result<bool> { Ok(true) }
Three things to notice. First, the two-phase structure: we cannot just tell nodes the value; we first have to discover whether a previous decision exists, then propagate the chosen value. Without phase 1, two simultaneous proposers could pick different values and both reach majority, violating agreement. Second, the majority quorums: every phase contacts more than half the nodes, which is the mechanism that guarantees overlap — any two majority quorums share at least one node, so any decision visible to one quorum is visible to the next. Third, the failure mode: if a majority is unreachable, the protocol returns an error rather than producing an unsafe decision. This is the consensus tradeoff: preserve safety, sacrifice availability when the quorum is gone.
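The overlap claim for majority quorums can be checked by brute force on small clusters. The sketch below enumerates node subsets as bitmasks; the function name and the size cap are choices made for this illustration.

```rust
/// Returns true if every pair of subsets with at least `quorum_size` members,
/// drawn from an `n`-node cluster, shares at least one node.
fn quorums_always_intersect(n: u32, quorum_size: u32) -> bool {
    assert!(n >= 1 && n <= 10, "brute force is only sensible for small n");
    // Each bit of the mask marks one node as a member of the subset.
    let quorums: Vec<u32> = (0u32..(1u32 << n))
        .filter(|mask| mask.count_ones() >= quorum_size)
        .collect();
    quorums
        .iter()
        .all(|a| quorums.iter().all(|b| (a & b) != 0))
}
```

With quorum_size set to n / 2 + 1 the check passes for every n tried. With exactly half of a four-node cluster it fails, because nodes {0, 1} and {2, 3} form two disjoint "quorums" that could decide different values.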
Real protocols (Raft, Paxos) add leader election to avoid live-lock between competing proposers, log replication to handle a stream of decisions efficiently, and crash recovery to make the protocol durable. Lesson 2 covers Raft's specific instantiation of these mechanisms.
A Failure Detector That Underpins Consensus
The consensus protocol relies on a failure detector to suspect that a node has crashed. The most basic implementation is a heartbeat:
use std::collections::HashMap;
use std::time::{Duration, Instant};
pub struct HeartbeatDetector {
    last_seen: HashMap<String, Instant>,
    timeout: Duration,
}
impl HeartbeatDetector {
    pub fn new(timeout: Duration) -> Self {
        Self { last_seen: HashMap::new(), timeout }
    }
    pub fn note_heartbeat(&mut self, node: &str) {
        self.last_seen.insert(node.to_string(), Instant::now());
    }
    pub fn suspected_dead(&self) -> Vec<&String> {
        let now = Instant::now();
        self.last_seen
            .iter()
            // `seen` is a `&&Instant` here; dereference explicitly so the
            // closure compiles on every Rust edition.
            .filter(|(_, seen)| now.duration_since(**seen) > self.timeout)
            .map(|(name, _)| name)
            .collect()
    }
}
fn main() {
    let mut det = HeartbeatDetector::new(Duration::from_secs(3));
    det.note_heartbeat("node-a");
    det.note_heartbeat("node-b");
    // Imagine some time passes...
    println!("suspected: {:?}", det.suspected_dead());
}
The choice of timeout is the consensus protocol's liveness knob. Too short and the protocol declares healthy nodes dead, triggering unnecessary leader elections; too long and real failures take too long to recover. Module 4 covers more sophisticated detectors (phi accrual) that adapt to observed network conditions, but the principle is the same: consensus requires some failure detector, and the detector's accuracy directly determines the protocol's availability profile.
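The tradeoff can be made concrete by replaying a recorded trace of heartbeat gaps from a healthy node against candidate timeouts. This is a sketch under simple assumptions: the trace is supplied by the caller, detection delay is approximated by the timeout itself, and TimeoutReport and evaluate_timeout are hypothetical names.

```rust
use std::time::Duration;

/// What a given timeout would have cost on a recorded heartbeat trace.
#[derive(Debug, PartialEq)]
struct TimeoutReport {
    /// Times a healthy node would have been declared dead.
    false_suspicions: usize,
    /// Silence required before a node that really crashed is suspected.
    detection_delay: Duration,
}

fn evaluate_timeout(healthy_gaps: &[Duration], timeout: Duration) -> TimeoutReport {
    TimeoutReport {
        // Any gap longer than the timeout triggers a suspicion, even though
        // the node was alive the whole time.
        false_suspicions: healthy_gaps.iter().filter(|gap| **gap > timeout).count(),
        detection_delay: timeout,
    }
}
```

On a trace with one 900ms pause among 100ms gaps, a 500ms timeout produces one spurious suspicion, and a 1-second timeout produces none at the price of doubling the time to notice a real crash.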
Key Takeaways
- Consensus is the protocol by which a group of nodes agree on a value despite failures and asynchrony. The four formal properties (agreement, integrity, validity, termination) are what distinguish consensus from cheaper protocols.
- The FLP impossibility result says deterministic consensus cannot guarantee termination in a fully asynchronous system. Practical protocols sidestep this with partial synchrony: they preserve safety always, and make progress whenever the network is synchronous enough.
- Linearizability and consensus are equivalent in their requirements. Any database that offers linearizable operations under failures is running consensus internally. The performance cost is the consensus round-trip per operation.
- Atomic commit, total-order broadcast, linearizable CAS, leader election, and distributed locks are all the same problem. If you understand consensus, you understand all of them; the implementation surface differs but the impossibility profile does not.
- Consensus is expensive. Use it when you need linearizability under failures, automatic leader election, or totally-ordered membership changes. Use cheaper mechanisms (eventually consistent updates, manual failover, sharding) when the application can tolerate them.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 10, "Linearizability," "Ordering Guarantees," "Distributed Transactions and Consensus" — particularly the subsection "Consensus." The FLP result is from Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process" (Journal of the ACM, April 1985). Partial synchrony is from Dwork, Lynch, Stockmeyer, "Consensus in the Presence of Partial Synchrony" (Journal of the ACM, April 1988). The consensus-number framework is from Herlihy, "Wait-Free Synchronization" (TOPLAS, 1991). Specific equivalence claims among consensus, atomic commit, CAS, and total-order broadcast are standard textbook results; the framing here follows DDIA's presentation.
Lesson 2: The Raft Algorithm — Leader Election and Log Replication
Context
The previous lesson made the case that consensus is the right tool for the catalog's leader-election problem and laid out what consensus requires. This lesson covers Raft, the specific protocol the Constellation Network will implement. Raft was designed by Diego Ongaro and John Ousterhout at Stanford and published in 2014, with the explicit goal of being understandable. Paxos, the canonical consensus protocol since Lamport first described it in 1989 (the paper was not published until 1998), is famously difficult to teach and famously easy to get wrong in implementation. Raft achieves the same correctness guarantees with a structure that decomposes cleanly into three subproblems: leader election, log replication, and safety.
That decomposition is why Raft is now the default consensus protocol for new systems. etcd, Consul, CockroachDB, TiKV, MongoDB, and the Kafka KRaft mode are all built on Raft or Raft variants. Knowing Raft is no longer optional for engineers working on distributed data systems; it is the lingua franca.
By the end of this lesson, you should be able to: describe the three roles in Raft and the transitions between them; trace a single write through the log-replication path from client to commit; identify the safety property that the leader-election restriction enforces; and recognize the failure modes (split votes, deposed leaders, lost AppendEntries) that the protocol is designed to handle. Lesson 3 will cover the operational concerns — membership changes, log compaction, and partition recovery — that the protocol's basic shape doesn't address.
Core Concepts
The Three Roles: Follower, Candidate, Leader
Every Raft node is in one of three states at any time:
Follower. The default state. Followers passively receive log entries from a leader, vote on candidate requests, and reset an election timeout whenever they hear from a legitimate leader. If the election timeout expires without contact, the follower becomes a candidate.
Candidate. A follower that has decided to call an election. Candidates increment their current term, vote for themselves, and request votes from all other nodes. If a candidate receives votes from a majority of nodes in the same term, it becomes the leader. If it receives an AppendEntries from a node claiming to be a leader for the current or higher term, it reverts to follower. If the election times out without a winner (a split vote), it starts a new election with an incremented term.
Leader. The exclusive writer for the cluster. The leader receives client requests, appends them to its log, replicates the entries to followers, and tells followers when entries are committed. The leader sends periodic heartbeats (empty AppendEntries) to maintain its authority.
The state machine is small enough to memorize, but the dynamics matter. Term numbers monotonically increase across the cluster; a node that hears a higher term than its own immediately reverts to follower and adopts the higher term. This is the mechanism that prevents stale leaders from continuing to act after a new leader is elected.
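The role transitions described above fit in a single match expression. The following is a minimal sketch, not part of the catalog's codebase: the `Event` enum and its variant names are hypothetical labels for the triggers the text describes.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Role {
    Follower,
    Candidate,
    Leader,
}

/// Hypothetical event names for the triggers described in the text.
#[derive(Clone, Copy, Debug)]
pub enum Event {
    /// The election timer expired without contact from a leader.
    ElectionTimeout,
    /// Votes from a majority arrived in the current term.
    WonMajority,
    /// AppendEntries arrived from a leader in the current term.
    HeardFromLeader,
    /// Any RPC arrived carrying a term higher than our own.
    SawHigherTerm,
}

pub fn next_role(role: Role, event: Event) -> Role {
    match (role, event) {
        // Universal rule: a higher term always forces a step down.
        (_, Event::SawHigherTerm) => Role::Follower,
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate,
        // Split vote: stay a candidate and retry in a new term.
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate,
        (Role::Candidate, Event::WonMajority) => Role::Leader,
        (Role::Candidate, Event::HeardFromLeader) => Role::Follower,
        // Every other combination leaves the role unchanged.
        (role, _) => role,
    }
}
```

The first arm is deliberately listed before all others: it applies to every role, which is what keeps a stale leader from acting once it learns of a newer term.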
Leader Election: How a Cluster Picks a Leader
Election begins when a follower's election timeout fires without a heartbeat from a leader. The timeout is randomized within a window (typically 150–300 ms) — this is the protocol's defense against split votes. If every follower had the same timeout, they would all become candidates simultaneously and split the vote.
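A sketch of how a node might pick its timeout inside the 150–300 ms window. To stay within the standard library, this example derives the jitter from a hash of the node id and term; that choice is an assumption made for illustration, and a production implementation would normally draw from a real random number generator. The function name and constants are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

const MIN_TIMEOUT_MS: u64 = 150;
const MAX_TIMEOUT_MS: u64 = 300;

/// Pick an election timeout in [150 ms, 300 ms).
/// Illustrative only: the jitter comes from hashing (node_id, term) so that
/// different nodes, and the same node in different terms, tend to land on
/// different values. Real implementations use a proper RNG.
pub fn election_timeout(node_id: &str, term: u64) -> Duration {
    let mut hasher = DefaultHasher::new();
    node_id.hash(&mut hasher);
    term.hash(&mut hasher);
    let jitter = hasher.finish() % (MAX_TIMEOUT_MS - MIN_TIMEOUT_MS);
    Duration::from_millis(MIN_TIMEOUT_MS + jitter)
}
```

Re-rolling the timeout on every term matters as much as the initial spread: two candidates that collided once should not collide again on the retry.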
The candidate increments its current term, votes for itself, and sends a RequestVote RPC to all other nodes. A node grants a vote if all of the following are true:
- The candidate's term is at least as high as the node's current term.
- The node has not already voted for someone else in this term.
- The candidate's log is at least as up-to-date as the node's log. (See "The Election Restriction" below.)
If a candidate receives votes from a majority, it becomes leader and immediately sends heartbeats to assert its authority — this stops other followers from timing out and starting their own elections.
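The majority test itself is one line of arithmetic. A minimal sketch, with hypothetical function names, that counts the candidate's own vote alongside the votes granted by peers:

```rust
/// Smallest number of nodes that forms a majority of `cluster_size`.
pub fn quorum(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// `peer_votes` is the number of votes granted by other nodes.
/// The candidate always votes for itself, hence the +1.
pub fn wins_election(peer_votes: usize, cluster_size: usize) -> bool {
    peer_votes + 1 >= quorum(cluster_size)
}
```

In a 5-node cluster a candidate needs two peer votes; in a 4-node cluster it also needs two, which is one reason an even-sized cluster buys no extra fault tolerance over the next smaller odd size.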
A few failure modes the protocol must handle:
Split vote. Two candidates each get less than a majority. The election times out, both increment their terms, and randomize their next election timeout. The randomization makes it overwhelmingly likely that one will start its next election before the other, monopolizing the votes.
Network partition. A leader on the minority side of a partition cannot get heartbeats acknowledged by a majority. After its own commit-progress check fails (or after another node on the majority side starts an election), the partitioned leader effectively becomes irrelevant — its term will be superseded when the partition heals. This is the "deposed leader" scenario that fencing tokens (Module 5) defend against.
Cascading election timeouts. If many followers time out at once, they all become candidates and split the vote repeatedly. The randomized timeout prevents this from being a permanent failure, but it can produce a flurry of failed elections before the cluster converges. This is why the Raft paper requires the election timeout to sit well above the typical network round-trip and far below the mean time between node failures; 150–300 ms is the band the paper suggests for a low-latency network.
Log Replication: From Client Request to Commit
Once a leader is elected, the catalog's clients submit writes to it. The leader's job is to replicate every write to a majority of nodes before considering it committed. The mechanism:
- Client sends a request to the leader.
- Leader appends the entry to its local log with the current term and a monotonically increasing index.
- Leader sends AppendEntries RPCs to each follower, containing the new entry plus a reference to the immediately preceding entry (prevLogIndex, prevLogTerm).
- Each follower verifies that its log matches at prevLogIndex (the log matching property). If yes, it appends the new entry and acknowledges. If no, it rejects, prompting the leader to back up and try with an earlier index.
- Once the leader sees acknowledgment from a majority of nodes (including itself), the entry is committed.
- Leader applies the entry to its state machine and informs followers (via the commitIndex in subsequent AppendEntries) that they can apply too.
- Leader responds to the client.
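The "majority has the entry" step can be sketched from the leader's side: given the highest index each follower is known to hold, find the highest index present on a quorum. The function name is hypothetical. One rule from the Raft paper is not shown here: a leader only advances its commit index this way when the entry at that index was written in its own current term.

```rust
/// Highest log index known to be stored on a majority of the cluster.
/// `follower_match` holds, per follower, the highest index the leader has
/// confirmed on that follower; the leader's own last index is counted too.
pub fn majority_replicated_index(leader_last_index: u64, follower_match: &[u64]) -> u64 {
    let mut indexes: Vec<u64> = follower_match.to_vec();
    indexes.push(leader_last_index);
    // Sort descending: position k holds an index stored on at least k+1 nodes.
    indexes.sort_unstable_by(|a, b| b.cmp(a));
    let quorum = indexes.len() / 2 + 1;
    indexes[quorum - 1]
}
```

With a leader at index 7 and followers at 7, 5, 3, and 2, the result is 5: three of five nodes hold everything up to index 5, and the two slow followers do not hold the cluster back.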
Two structural details deserve attention. First, the log matching property: if two logs contain an entry with the same index and term, then the logs are identical up to that entry. This is the invariant that lets the protocol detect and repair log divergence efficiently. Followers verify it before appending; leaders use it to back up after rejection.
Second, the commit point is when a majority has the entry, not when all have it. A slow follower does not block commits. The slow follower will eventually catch up via subsequent AppendEntries, but the cluster does not wait for it. This is the same shape as the "one synchronous follower plus the leader" pattern from Module 2, generalized to a majority quorum.
The Election Restriction: The Safety Critical Detail
The most delicate part of Raft — and the one that distinguishes it from naive consensus protocols — is the election restriction. A node grants a vote to a candidate only if the candidate's log is at least as up-to-date as the voter's log. "Up-to-date" is defined precisely: a log A is more up-to-date than log B if A's last entry has a higher term, or has the same term and a higher index.
The purpose of this restriction is to prevent a leader from being elected that does not have all the committed entries. Without the restriction, a candidate with an incomplete log could win an election and start replacing committed entries with new ones — violating the safety property that committed entries are durable.
The argument that the restriction works: any committed entry exists on a majority of nodes. A candidate that wins an election needs votes from a majority. The two majorities must overlap (any two majorities share at least one node). The overlapping node has the committed entry. The election restriction forces the candidate's log to be at least as up-to-date as the overlapping node's, which means the candidate has the committed entry. Therefore, no leader can be elected that is missing a committed entry.
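The overlap step in that argument is a pigeonhole count: two majorities of size q drawn from n nodes must share at least 2q − n members. A small sketch (function names are hypothetical) that makes the arithmetic checkable:

```rust
/// Smallest number of nodes that forms a majority of `n`.
pub fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Minimum number of nodes shared by any two majorities of an n-node cluster.
/// Two sets of size q inside n nodes overlap in at least 2q - n nodes.
pub fn min_majority_overlap(n: usize) -> usize {
    2 * quorum(n) - n
}
```

The result is 1 for odd cluster sizes and 2 for even sizes, never 0, so the node that holds the committed entry always appears among the voters.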
The proof is in the original Raft paper; the operational consequence is that committed entries are durable across leader changes. This is the safety guarantee that consensus must provide and that 2PC-style protocols famously do not.
Heartbeats and the Cost of Leadership
The leader sends periodic heartbeats (empty AppendEntries) to followers to maintain its authority. Heartbeats are sent at an interval shorter than the election timeout — typically 50 ms when the election timeout is 150–300 ms. The 3× margin gives the protocol some slack for network jitter without spawning unnecessary elections.
The CPU and bandwidth cost of heartbeats is non-trivial at scale. A cluster of 5 nodes generates 4 heartbeats every 50 ms from the leader, plus 4 responses — 160 messages per second of cluster-internal traffic just to maintain leadership. For wide-area Raft clusters (the catalog's twelve ground stations), this is unacceptable, and production systems use longer heartbeat intervals or alternative liveness checks. We will discuss this in Lesson 3 under "Raft for wide-area deployments."
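The message count above generalizes to any cluster size and interval. A minimal sketch of the arithmetic, with a hypothetical function name, counting one heartbeat and one response per follower per interval:

```rust
/// Cluster-internal messages per second spent on heartbeats alone.
/// Each interval the leader sends one heartbeat to every follower and
/// receives one response back.
pub fn heartbeat_messages_per_sec(nodes: u64, interval_ms: u64) -> u64 {
    assert!(interval_ms > 0, "heartbeat interval must be positive");
    let per_round = 2 * nodes.saturating_sub(1);
    per_round * 1000 / interval_ms
}
```

For 5 nodes at 50 ms this gives the 160 messages per second quoted above. A 12-node cluster at the same interval would generate 440, and stretching the interval to 500 ms brings that down to 44.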
Why "Safety Always, Liveness When Synchronous"
Raft's design preserves the four consensus properties with one important qualification: safety holds under any failure model, including arbitrary message delays, lost messages, network partitions, and node crashes. Liveness (the protocol making progress) requires a window of partial synchrony — specifically, the heartbeat must reach a majority of followers within the election timeout often enough that an election can complete.
If the network is permanently asynchronous — heartbeats never arrive in bounded time — Raft will spend its life electing and re-electing without ever committing entries. This is not a Raft bug; it is the FLP impossibility re-expressed. Production systems mitigate by tuning timeouts to the observed network behavior and by alerting when election rate exceeds a baseline. A spike in election count is the canonical signal that the network has degraded faster than the cluster can adapt.
Code Examples
Raft State Skeleton
A minimal Raft state representation in Rust. This is not a complete implementation, but the structural shape:
```rust
use std::collections::HashMap;
use std::time::Instant;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Clone, Debug)]
pub struct LogEntry {
    pub term: u64,
    pub index: u64,
    pub command: Vec<u8>,
}

pub struct RaftState {
    pub role: Role,

    // Persistent state - must survive crash, written before responding to RPCs
    pub current_term: u64,
    pub voted_for: Option<String>, // candidate this node voted for in current term
    pub log: Vec<LogEntry>,

    // Volatile state - reset on restart
    pub commit_index: u64,
    pub last_applied: u64,

    // Leader-only volatile state - reset on election
    pub next_index: HashMap<String, u64>,
    pub match_index: HashMap<String, u64>,

    // Election timing
    pub last_heartbeat: Instant,
}

impl RaftState {
    pub fn new() -> Self {
        Self {
            role: Role::Follower,
            current_term: 0,
            voted_for: None,
            log: Vec::new(),
            commit_index: 0,
            last_applied: 0,
            next_index: Default::default(),
            match_index: Default::default(),
            last_heartbeat: Instant::now(),
        }
    }

    /// Called when this node receives any RPC with a term higher than its own.
    /// This is the universal rule that prevents stale leaders from continuing
    /// to act after a new term has been established.
    pub fn observe_higher_term(&mut self, new_term: u64) {
        if new_term > self.current_term {
            self.current_term = new_term;
            self.voted_for = None;
            self.role = Role::Follower;
        }
    }
}
```
The split into persistent and volatile state is essential. Persistent state (current_term, voted_for, log) must be flushed to disk before any RPC response; losing it on crash would violate safety, because the node could vote twice in the same term. Volatile state (commit_index, last_applied) is safe to lose. Both restart at zero: the node relearns commit_index from the leader's next AppendEntries and rebuilds its state machine by reapplying the log.
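What "flushed to disk before any RPC response" can look like in practice is sketched below for the two small fields, current_term and voted_for. Everything here is an assumption made for illustration: the function names, the two-line text format, and the write-then-rename strategy are not taken from any particular Raft library. Full durability on most filesystems also requires syncing the parent directory after the rename, which this sketch omits.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write term and vote to a temp file, force it to stable storage, then
/// rename over the real file so a crash never leaves a half-written record.
pub fn persist_hard_state(path: &Path, current_term: u64, voted_for: Option<&str>) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    writeln!(file, "{}", current_term)?;
    writeln!(file, "{}", voted_for.unwrap_or(""))?;
    file.sync_all()?; // do not reply to the RPC until this returns
    fs::rename(&tmp, path)
}

/// Read the record back on restart. An empty second line means "no vote".
pub fn load_hard_state(path: &Path) -> io::Result<(u64, Option<String>)> {
    let text = fs::read_to_string(path)?;
    let mut lines = text.lines();
    let term = lines
        .next()
        .and_then(|l| l.parse::<u64>().ok())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "missing term"))?;
    let voted_for = lines.next().filter(|l| !l.is_empty()).map(str::to_string);
    Ok((term, voted_for))
}
```

The ordering is the point: the vote is only sent after `persist_hard_state` returns `Ok`, so a node that crashes and restarts reloads the same vote it already handed out.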
The RequestVote Handler
This is the heart of leader election:
```rust
use std::sync::Mutex;
use std::time::Instant;

// Trimmed copies of the skeleton types so this block compiles on its own.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Clone, Debug)]
pub struct LogEntry {
    pub term: u64,
    pub index: u64,
}

pub struct RaftState {
    pub role: Role,
    pub current_term: u64,
    pub voted_for: Option<String>,
    pub log: Vec<LogEntry>,
    pub last_heartbeat: Instant,
}

impl RaftState {
    fn observe_higher_term(&mut self, t: u64) {
        if t > self.current_term {
            self.current_term = t;
            self.voted_for = None;
            self.role = Role::Follower;
        }
    }
}

pub struct RequestVoteArgs {
    pub term: u64,
    pub candidate_id: String,
    pub last_log_index: u64,
    pub last_log_term: u64,
}

pub struct RequestVoteReply {
    pub term: u64,
    pub vote_granted: bool,
}

pub fn handle_request_vote(
    state: &Mutex<RaftState>,
    args: RequestVoteArgs,
) -> RequestVoteReply {
    let mut s = state.lock().unwrap();

    // Universal rule: any RPC with a higher term forces revert to follower.
    s.observe_higher_term(args.term);

    // Reject if the candidate's term is stale.
    if args.term < s.current_term {
        return RequestVoteReply { term: s.current_term, vote_granted: false };
    }

    // Reject if we've already voted for someone else this term.
    if let Some(ref voted) = s.voted_for {
        if voted != &args.candidate_id {
            return RequestVoteReply { term: s.current_term, vote_granted: false };
        }
    }

    // The election restriction: the candidate's log must be at least as
    // up-to-date as ours. That means a higher last term, or the same last
    // term with a last index at least as high as ours.
    // Read the index from the entry itself rather than from log.len(), so the
    // check stays correct once compaction drops a prefix of the log.
    let our_last_index = s.log.last().map(|e| e.index).unwrap_or(0);
    let our_last_term = s.log.last().map(|e| e.term).unwrap_or(0);
    let candidate_log_ok = args.last_log_term > our_last_term
        || (args.last_log_term == our_last_term
            && args.last_log_index >= our_last_index);
    if !candidate_log_ok {
        return RequestVoteReply { term: s.current_term, vote_granted: false };
    }

    // Grant the vote. Persist voted_for to disk before returning - if we crash
    // after responding but before persisting, we could vote twice on restart.
    s.voted_for = Some(args.candidate_id);
    // (persist_voted_for(s.voted_for) would go here in a real implementation)

    // Reset heartbeat timer: hearing from a candidate counts as activity,
    // preventing this node from also starting an election.
    s.last_heartbeat = Instant::now();

    RequestVoteReply { term: s.current_term, vote_granted: true }
}
```
Every branch in this handler corresponds to a safety case. Skip the term check and you can vote for stale candidates; skip the election restriction and you can elect a leader missing committed entries; skip the persistence and you can double-vote on crash recovery. Raft is a protocol where the code looks short until you account for every case the safety proof requires.
AppendEntries: The Log Replication Path
The AppendEntries handler is structurally similar but handles a different invariant:
```rust
use std::sync::Mutex;
use std::time::Instant;

// Trimmed copies of the skeleton types so this block compiles on its own.
#[derive(Clone, Debug)]
pub struct LogEntry {
    pub term: u64,
    pub index: u64,
}

pub struct RaftState {
    pub current_term: u64,
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub last_heartbeat: Instant,
}

impl RaftState {
    // Reduced version of the skeleton's method: this trimmed state has no
    // role or voted_for field, so it only adopts the higher term.
    fn observe_higher_term(&mut self, t: u64) {
        if t > self.current_term {
            self.current_term = t;
        }
    }
}

pub struct AppendEntriesArgs {
    pub term: u64,
    pub leader_id: String,
    pub prev_log_index: u64,
    pub prev_log_term: u64,
    pub entries: Vec<LogEntry>,
    pub leader_commit: u64,
}

pub struct AppendEntriesReply {
    pub term: u64,
    pub success: bool,
}

pub fn handle_append_entries(
    state: &Mutex<RaftState>,
    args: AppendEntriesArgs,
) -> AppendEntriesReply {
    let mut s = state.lock().unwrap();
    s.observe_higher_term(args.term);

    // Reject stale-term AppendEntries from old leaders.
    if args.term < s.current_term {
        return AppendEntriesReply { term: s.current_term, success: false };
    }

    // Heartbeat received from legitimate leader: reset election timer.
    s.last_heartbeat = Instant::now();

    // The log matching property check: our log must agree with the leader
    // at prev_log_index. If it doesn't, we reject - the leader will back up
    // and retry with an earlier index until we find a common point.
    if args.prev_log_index > 0 {
        let prev_entry = s.log.iter().find(|e| e.index == args.prev_log_index);
        match prev_entry {
            Some(e) if e.term == args.prev_log_term => {} // match - continue
            _ => return AppendEntriesReply { term: s.current_term, success: false },
        }
    }

    // Conflict resolution: if an existing entry conflicts with a new one
    // (same index but different term), truncate our log from that entry
    // onward. This is the mechanism that recovers from leader changes that
    // produced divergent suffixes.
    for new_entry in &args.entries {
        if let Some(pos) = s.log.iter().position(|e| e.index == new_entry.index) {
            if s.log[pos].term != new_entry.term {
                s.log.truncate(pos);
                break;
            }
        }
    }

    // Append entries we don't already have.
    for new_entry in args.entries {
        if !s.log.iter().any(|e| e.index == new_entry.index) {
            s.log.push(new_entry);
        }
    }

    // Advance commit_index to whatever the leader has committed, capped by
    // the index of the last entry we now have.
    if args.leader_commit > s.commit_index {
        let last_new_index = s.log.last().map(|e| e.index).unwrap_or(0);
        s.commit_index = args.leader_commit.min(last_new_index);
    }

    AppendEntriesReply { term: s.current_term, success: true }
}
```
The truncate-and-replace behavior is what allows the cluster to recover from a leader change that produced a divergent suffix. If a new leader has different entries at indices the old leader replicated, the followers' uncommitted suffixes are overwritten — which is safe precisely because the election restriction ensures the new leader had every committed entry.
Key Takeaways
- Raft has three roles (Follower, Candidate, Leader) and the role transitions are driven by a randomized election timeout. Any RPC with a higher term forces the receiver back to Follower; this is the mechanism that keeps the cluster on a single term.
- Leader election uses a majority-vote protocol with the election restriction: a vote is granted only if the candidate's log is at least as up-to-date as the voter's. This ensures any elected leader has all committed entries.
- Log replication is two-phase: the leader appends to its local log, replicates to followers via AppendEntries, and considers the entry committed once a majority (including itself) has it. The log matching property — same index and term implies identical logs up to that point — is what makes divergence repair tractable.
- Persistent state (current_term, voted_for, log) must be flushed to disk before responding to any RPC. Skipping persistence is one of the easiest ways to violate Raft's safety guarantees on crash recovery.
- Raft preserves safety unconditionally. It preserves liveness only when the network is synchronous enough for a majority to exchange heartbeats within the election timeout. A spike in election count is the canonical signal that the network has degraded.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 10, "Consensus" and "Single-leader replication and consensus" — and in the canonical Raft paper: Diego Ongaro and John Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" (USENIX ATC 2014). The Raft paper is strongly recommended supplemental reading; it is one of the clearest systems papers ever published. Specific implementation details (timeout values, message formats) follow the paper's conventions; production Raft libraries (etcd, TiKV) deviate in operational specifics that should be checked before using in production.
Lesson 3: Raft in Practice — Membership Changes, Log Compaction, Recovery
Context
The Raft algorithm covered in Lesson 2 is what a correct cluster running on a fixed set of nodes does. Real clusters do not run on fixed sets of nodes. Operators add capacity, retire failing machines, expand from three nodes to five for higher fault tolerance, and shrink back when the workload drops. Real clusters accumulate log entries indefinitely until either disk fills up or replay-on-restart takes hours. Real clusters experience network partitions that split the cluster into pieces, leave the minority partition unable to make progress, and require the operator to understand whether the cluster will heal on its own when the partition resolves.
These are the operational concerns that the basic Raft algorithm does not address by itself. The Raft paper extends the algorithm with mechanisms for cluster membership changes (adding or removing nodes safely), log compaction (discarding old log entries via snapshots), and partition behavior (what happens during and after partition events). This lesson covers all three. The catalog's deployment runs in this operational regime continuously; understanding these mechanisms is what separates a Raft cluster you can operate from one that will produce incidents you cannot diagnose.
By the end of this lesson, you should be able to: describe the joint-consensus mechanism for safely changing cluster membership; identify when log compaction is needed and how snapshots interact with InstallSnapshot RPCs; predict cluster behavior under specific partition scenarios; and recognize the configurations under which a cluster can permanently fail to make progress.
Core Concepts
Membership Changes: The Naive Approach Doesn't Work
Suppose the catalog cluster runs on three nodes (A, B, C) and the team wants to expand to five nodes (A, B, C, D, E) for higher fault tolerance. The naive approach — push a configuration change through the cluster as an ordinary log entry — has a subtle but devastating safety failure.
The problem: during the transition, some nodes have applied the new configuration and some have not. Suppose A applies the new configuration first (it now thinks the cluster has 5 nodes) while B and C have not yet applied it (they still think the cluster has 3 nodes). Two disjoint majorities are now possible: A plus D and E is 3 of 5, a majority under the new configuration; B plus C is 2 of 3, a majority under the old. Both can elect leaders simultaneously. Both can commit different log entries. The safety property is destroyed.
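The disjoint-majority scenario is easy to check mechanically. The sketch below (the helper name is hypothetical) tests whether a set of nodes forms a majority of a given configuration, then applies it to the A, B, C to A, B, C, D, E example.

```rust
use std::collections::HashSet;

/// True if the members of `config` that appear in `voters` are more than
/// half of `config`.
pub fn is_majority(voters: &HashSet<&str>, config: &[&str]) -> bool {
    let count = config.iter().filter(|n| voters.contains(*n)).count();
    count > config.len() / 2
}

pub fn demo() -> (bool, bool, bool) {
    let c_old = ["A", "B", "C"];
    let c_new = ["A", "B", "C", "D", "E"];
    // A has switched to the new configuration and gathers D and E.
    let side_new: HashSet<&str> = ["A", "D", "E"].iter().copied().collect();
    // B and C still believe in the old configuration.
    let side_old: HashSet<&str> = ["B", "C"].iter().copied().collect();
    (
        is_majority(&side_new, &c_new),  // majority under the new config
        is_majority(&side_old, &c_old),  // majority under the old config
        side_new.is_disjoint(&side_old), // and they share no node
    )
}
```

All three values come out true, which is exactly the unsafe situation: two quorums that cannot see each other.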
The Raft paper's solution is joint consensus — a two-phase membership change that explicitly handles the transition.
Joint Consensus
The transition from configuration C_old to C_new proceeds through a transitional state C_old,new that requires majorities from both the old and new configurations.
Phase 1: The leader appends a special log entry representing C_old,new and replicates it. Each node switches to this configuration as soon as the entry is in its log, without waiting for it to commit. From that point on, any decision that node takes part in (leader election, log commit) requires a majority of C_old AND a majority of C_new. No decision can be made by either side alone.
Phase 2: Once C_old,new is committed, the leader proposes the entry representing C_new alone. When committed, the cluster transitions fully to the new configuration. C_old nodes that are no longer in C_new can be removed from the cluster.
The joint-consensus mechanism preserves safety because there is no point at which a decision could be made by a majority of one configuration alone. During the transition, every decision requires overlap with both configurations.
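The joint rule reduces to a conjunction of two majority checks. A minimal sketch under that reading, with hypothetical function names, where `acks` is the set of nodes that voted for a candidate or acknowledged an entry:

```rust
use std::collections::HashSet;

/// True if more than half of `config` appears in `acks`.
fn has_majority(acks: &HashSet<&str>, config: &[&str]) -> bool {
    let count = config.iter().filter(|n| acks.contains(*n)).count();
    count > config.len() / 2
}

/// Under C_old,new a decision needs a majority of the old configuration
/// AND a majority of the new one.
pub fn joint_quorum(acks: &HashSet<&str>, c_old: &[&str], c_new: &[&str]) -> bool {
    has_majority(acks, c_old) && has_majority(acks, c_new)
}
```

With C_old = A, B, C and C_new = A, B, C, D, E, the set A, D, E fails (no majority of C_old), the set B, C fails (no majority of C_new), and the set A, B, D passes. The two groups that could act independently under the naive approach are both blocked.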
The cost is operational complexity: a membership change requires committing two log entries instead of one, and the cluster is in an unusual state between them. Production Raft libraries (etcd's Raft, hashicorp/raft) implement this for you, but you should understand the shape because the operational signals during a membership change — temporary increase in commit latency, brief windows when the cluster cannot tolerate a failure — are direct consequences of the joint-consensus structure.
Single-Server Membership Changes (the Simpler Alternative)
Ongaro's doctoral dissertation later proposed a simpler approach: single-server membership changes. Instead of changing membership in arbitrary increments, restrict each change to a single node added or removed at a time. With this restriction, the old and new configurations differ by exactly one node, and any majority of the new configuration overlaps with any majority of the old configuration (because removing or adding one node cannot create disjoint majorities).
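The overlap claim can be verified with the same pigeonhole count used for the election restriction. The sketch below assumes the old configuration is a subset of the new one (a pure grow operation); the function names are hypothetical.

```rust
/// Smallest number of nodes that forms a majority of `n`.
pub fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Minimum overlap between any majority of an `n_old`-node configuration and
/// any majority of an `n_new`-node configuration that contains it.
/// Both majorities live inside the n_new nodes, so they share at least
/// quorum(n_old) + quorum(n_new) - n_new members (or none, if that is <= 0).
pub fn min_overlap_when_growing(n_old: usize, n_new: usize) -> usize {
    assert!(n_new >= n_old, "this sketch only models adding nodes");
    (quorum(n_old) + quorum(n_new)).saturating_sub(n_new)
}
```

Adding one node always leaves an overlap of exactly 1, whatever the starting size. Jumping from 3 to 5 in one step gives 0, which is the disjoint-majority failure described earlier.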
This is the approach most production Raft implementations use, because it is simpler to reason about and simpler to implement correctly. To go from 3 nodes to 5, the operator adds one node, waits for the cluster to commit the membership change, then adds the next, and so on. Each step is a normal log entry; no joint-consensus state is needed.
The catalog uses single-server changes. The runbook for "add a new ground-station-tier replica" is: provision the node, register it with the cluster via the membership API, wait for the membership-change entry to commit, repeat. The operational story is much simpler than joint consensus, at the cost of more steps for large changes.
Log Compaction via Snapshots
A Raft log grows forever. A cluster running for a year, accepting one write per second, accumulates about 31.5 million log entries. At 1 KiB per entry that is roughly 30 GiB of log, and replaying it on a fresh follower can take hours. The protocol needs a mechanism to discard old entries.
The mechanism is snapshots: periodically, each node captures the current state of its state machine, writes it to disk, and discards log entries up to the snapshot point. When a new follower joins, or a follower that has fallen too far behind needs to catch up, the leader sends an InstallSnapshot RPC containing the snapshot, after which normal log replication resumes from the snapshot's point.
Three operational details matter:
When to snapshot. Production systems snapshot every N log entries or every M bytes. Too frequent and the CPU cost of snapshotting dominates; too rare and the log grows unbounded. One common heuristic is to snapshot when the log exceeds 10× the snapshot size, balancing snapshot frequency against catch-up cost.
Concurrent application and snapshotting. A snapshot of a large state machine takes time. During that time, the cluster continues to accept writes. The state machine implementation must support copy-on-write snapshots or some equivalent mechanism that allows reads (for the snapshot) to proceed concurrently with writes (for the active workload). RocksDB-backed Raft implementations use RocksDB's checkpoint feature for this.
InstallSnapshot vs incremental catch-up. When a follower falls behind, the leader first tries to catch it up via normal AppendEntries with earlier indices. If the earlier entries have been compacted away (the leader's log no longer goes that far back), the leader falls back to sending the entire snapshot. This is much more expensive — gigabytes vs kilobytes of network traffic — and a frequent indication of either a chronically lagging follower or a misconfigured snapshot interval.
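Two of the decisions above are simple predicates, sketched here with hypothetical names and parameters. The first applies a size-ratio trigger with a floor so that a tiny or empty state machine does not snapshot constantly; the second picks between incremental catch-up and InstallSnapshot based on whether the follower's next entry is still in the leader's log.

```rust
/// Snapshot once the log is at least `ratio` times the last snapshot's size,
/// but never before it reaches `min_log_bytes`.
pub fn should_snapshot(log_bytes: u64, last_snapshot_bytes: u64, ratio: u64, min_log_bytes: u64) -> bool {
    log_bytes >= min_log_bytes && log_bytes >= last_snapshot_bytes.saturating_mul(ratio)
}

#[derive(Debug, PartialEq, Eq)]
pub enum CatchUp {
    /// The entry the follower needs next is still in the leader's log.
    AppendEntries { from_index: u64 },
    /// That entry was compacted away; ship the whole snapshot instead.
    InstallSnapshot { last_included_index: u64 },
}

/// `snapshot_last_index` is the last log index covered by the leader's
/// snapshot; entries at or below it are no longer in the log.
pub fn plan_catch_up(follower_next_index: u64, snapshot_last_index: u64) -> CatchUp {
    if follower_next_index > snapshot_last_index {
        CatchUp::AppendEntries { from_index: follower_next_index }
    } else {
        CatchUp::InstallSnapshot { last_included_index: snapshot_last_index }
    }
}
```

Counting how often `plan_catch_up` returns `InstallSnapshot` for an established follower is a cheap way to surface the lagging-follower and misconfigured-interval problems mentioned above.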
Partition Recovery: What the Cluster Does During and After
The way a Raft cluster behaves during a network partition depends on which side of the partition contains the majority.
Majority side. Continues to operate normally. It elects a leader (if the partition is brand new and the old leader was on the minority side) or retains the existing leader. Writes commit normally. The minority side's nodes are detected as failed (no AppendEntries acks) but the cluster proceeds without them.
Minority side. Cannot commit new entries. The local leader (if any) accepts writes from clients but cannot replicate them to a majority. After some time, the local leader may step down, or may continue trying indefinitely (the behavior here depends on the implementation; some Raft libraries automatically step down after losing majority contact for a sustained period). Clients connected to the minority side experience write failures.
Read behavior on the minority side. This is where implementations differ. The strict reading of Raft says reads should also require a majority quorum (a "read index" check) to be linearizable, which means minority-side reads should fail. Some implementations allow reads from the local follower as a degraded mode. The catalog's deployment requires linearizable reads, so minority-side reads fail; this is documented and accepted as a tradeoff for correctness.
Healing the partition. When connectivity restores, nodes on the minority side observe heartbeats from the majority side, advance their terms (if necessary), and resume normal operation as followers. Any log entries the minority side appended but did not commit are truncated and replaced by the majority side's log. This is the same truncation behavior as the deposed-leader scenario from Lesson 2.
The pathological case: a partition that splits the cluster so that no side holds a majority. The cluster makes no progress on any side until the partition heals. This is why production Raft clusters use an odd number of nodes: a 5-node cluster cannot split into two equal halves, so any two-way partition (3-2 or 4-1) leaves one side with a quorum, while a 4-node cluster split 2-2 is stalled until the partition heals. Odd sizing does not help against a multi-way split: a 5-node cluster partitioned 2-2-1 has no side with the three nodes a quorum requires.
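Which side of a partition can keep working is a counting question. A minimal sketch, with a hypothetical function name, that takes the number of nodes on each side and returns the position of the side holding a strict majority of the whole cluster, if any:

```rust
/// Given the node count on each side of a partition, return the position of
/// the side that holds a strict majority of the whole cluster, if one exists.
pub fn side_with_quorum(sides: &[usize]) -> Option<usize> {
    let total: usize = sides.iter().sum();
    sides.iter().position(|&size| size > total / 2)
}
```

A 3-2 split returns the three-node side; a 2-2 split returns `None`, which is the stalled case.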
When the Cluster Permanently Fails to Make Progress
A Raft cluster can be in a state where it cannot make progress even when nominally healthy. The most common cases:
Majority failure. If more than half the nodes are permanently dead, no quorum can form. The cluster cannot recover without operator intervention. The recovery procedure is to manually force a configuration change to a configuration where the surviving nodes form a majority. This is invasive and requires manual procedures that vary by implementation.
Persistent partial synchrony failure. If the network is degraded such that no leader can maintain heartbeats to a majority within the election timeout, the cluster will spend its time electing and re-electing without committing. The signal is a high election rate with low commit progress. The mitigation is to widen the election timeout or to investigate the underlying network degradation.
Disk failure on a quorum-essential node. If a node's persistent log is corrupted and replicas are unavailable, the cluster can lose committed entries. Production systems address this by replicating the log durably to multiple disks per node, by configuring cluster size to tolerate the expected failure rate, and by alerting on disk error rates before they reach catastrophic levels.
The operational discipline is to monitor: leader stability (term changes per hour), commit latency (p50, p99), election frequency, snapshot frequency, and log size per node. A change in any of these metrics is the early signal of a problem that has not yet produced an outage.
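The "high election rate with low commit progress" signal can be expressed as a simple predicate over a monitoring window. The sketch below is an assumption-laden illustration: the `WindowSample` fields and both thresholds are hypothetical and would need to be mapped onto whatever metrics the deployment actually exports.

```rust
/// Counters sampled from one node over a monitoring window.
/// Field names are illustrative, not an existing metrics schema.
struct WindowSample {
    term_changes: u64,
    entries_committed: u64,
}

#[derive(Debug, PartialEq)]
enum Health {
    Ok,
    ElectionChurn,
}

/// Flags a window with many term changes and little commit progress.
/// Both thresholds are placeholders to be tuned per deployment.
fn classify(sample: &WindowSample, max_term_changes: u64, min_commits: u64) -> Health {
    if sample.term_changes > max_term_changes && sample.entries_committed < min_commits {
        Health::ElectionChurn
    } else {
        Health::Ok
    }
}

fn main() {
    let steady = WindowSample { term_changes: 1, entries_committed: 12_000 };
    let churning = WindowSample { term_changes: 40, entries_committed: 3 };
    println!("steady: {:?}", classify(&steady, 5, 100));
    println!("churning: {:?}", classify(&churning, 5, 100));
}
```

A window with many commits and many term changes is deliberately not flagged here; that pattern is worth a separate alert, since it suggests leadership is unstable but the cluster is still making progress.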
Read-Only Operations: Linearizable Reads in Raft
Raft preserves linearizability for writes by construction. Reads are subtler. The naive approach — let the leader serve reads from its local state — has a flaw: a deposed leader (whose term has been superseded but who has not yet heard about it) can serve stale reads. The Raft paper proposes two mechanisms:
Read index. Before serving a read, the leader confirms it is still the leader by sending a heartbeat to all followers and waiting for majority acknowledgment. It also waits until its state machine has applied every entry up to its commit index, so the read reflects all committed writes. Only then does the leader read from its local state. This adds a round-trip to every read but guarantees linearizability.
Lease reads. The leader maintains a "lease" that says no other node can become leader for some duration. Within the lease, the leader can serve reads from local state without coordination. The lease is typically tied to the election timeout: if no follower will start an election within election_timeout, the leader can serve reads without checking. This is faster but requires bounded clock skew across the cluster — a fact that ties this lesson back to Module 1's discussion of clock unreliability.
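A lease check can be sketched as follows, assuming the lease is measured from the moment the leader sent the heartbeat round that a majority acknowledged, and assuming an operator-chosen upper bound on clock drift. The `Lease` type and its field names are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Leader-side lease bookkeeping. All names here are illustrative.
struct Lease {
    /// When the leader sent the heartbeat round that a majority acknowledged.
    round_started: Instant,
    election_timeout: Duration,
    /// Assumed upper bound on clock drift over one election timeout.
    max_drift: Duration,
}

impl Lease {
    /// The lease covers the election timeout minus the drift margin,
    /// counted from the start of the acknowledged heartbeat round.
    fn is_valid(&self, now: Instant) -> bool {
        let safe_window = self.election_timeout.saturating_sub(self.max_drift);
        now.saturating_duration_since(self.round_started) < safe_window
    }
}

fn main() {
    let start = Instant::now();
    let lease = Lease {
        round_started: start,
        election_timeout: Duration::from_millis(150),
        max_drift: Duration::from_millis(20),
    };
    println!("valid at +100ms: {}", lease.is_valid(start + Duration::from_millis(100)));
    println!("valid at +140ms: {}", lease.is_valid(start + Duration::from_millis(140)));
}
```

When `is_valid` returns false, the leader would fall back to a read-index round rather than serve from local state. The drift margin is the part that depends on the clock assumptions: if the real drift exceeds `max_drift`, the lease can overlap a new leader's term.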
The catalog uses read-index for linearizable reads (the orbital catalog and the conjunction-prediction service). It uses local-leader reads (without read-index) for the operational dashboard, where the team has accepted the rare possibility of stale reads as a worthwhile tradeoff for latency.
Code Examples
Single-Server Membership Change
The shape of a single-server membership change, as a configuration log entry:
#[derive(Clone, Debug)]
pub struct ConfigChange {
    pub change_type: ChangeType,
    pub node_id: String,
    pub address: String,
}

#[derive(Clone, Copy, Debug)]
pub enum ChangeType {
    Add,
    Remove,
}

#[derive(Clone, Debug)]
pub struct Configuration {
    pub voters: Vec<String>,
}

impl Configuration {
    pub fn apply(&mut self, change: &ConfigChange) -> Result<(), String> {
        match change.change_type {
            ChangeType::Add => {
                if self.voters.contains(&change.node_id) {
                    return Err(format!("node {} already in cluster", change.node_id));
                }
                self.voters.push(change.node_id.clone());
                Ok(())
            }
            ChangeType::Remove => {
                let pos = self.voters.iter().position(|n| n == &change.node_id);
                match pos {
                    None => Err(format!("node {} not in cluster", change.node_id)),
                    Some(p) => {
                        self.voters.remove(p);
                        Ok(())
                    }
                }
            }
        }
    }

    pub fn majority(&self) -> usize {
        self.voters.len() / 2 + 1
    }
}

fn main() {
    let mut config = Configuration {
        voters: vec!["A".into(), "B".into(), "C".into()],
    };
    println!("initial majority: {} of {}", config.majority(), config.voters.len());
    config.apply(&ConfigChange {
        change_type: ChangeType::Add,
        node_id: "D".into(),
        address: "10.0.0.4".into(),
    }).unwrap();
    println!("after add D: majority {} of {}", config.majority(), config.voters.len());
}
The configuration is itself stored as a log entry and replicated like any other write. The persistence and replication of configuration changes use the same Raft machinery as data writes, which is what makes the membership change safe: it goes through the same election restrictions and majority requirements. One difference from data entries, per the Raft paper: a node starts using a new configuration as soon as the entry is appended to its log, without waiting for it to commit.
A Read-Index Implementation
// Linearizable read via read-index.
use anyhow::Result;

async fn linearizable_read(node: &RaftNode, key: &str) -> Result<Option<Vec<u8>>> {
    // 1. Confirm we are still the leader: send a heartbeat round.
    //    This is the same AppendEntries RPC used for replication, but with
    //    no new entries - just confirming followers still recognize us.
    let confirmed = node.confirm_leadership().await?;
    if !confirmed {
        return Err(anyhow::anyhow!("not leader"));
    }
    // 2. Record the current commit index. We will wait until our state
    //    machine has applied up to this index before serving the read.
    //    (In a high-throughput system, the state machine application can lag
    //    the log commit; we must wait for it to catch up.)
    let read_index = node.commit_index();
    node.wait_until_applied(read_index).await?;
    // 3. Read from local state.
    Ok(node.state_machine_get(key))
}

// Stub node so the example type-checks; a real node does the actual work.
struct RaftNode;

impl RaftNode {
    async fn confirm_leadership(&self) -> Result<bool> { Ok(true) }
    fn commit_index(&self) -> u64 { 0 }
    async fn wait_until_applied(&self, _i: u64) -> Result<()> { Ok(()) }
    fn state_machine_get(&self, _k: &str) -> Option<Vec<u8>> { None }
}
The cost is a heartbeat round per read. For a 5-node cluster, this is 4 heartbeats and 4 acknowledgments per read — non-trivial but acceptable for most workloads. The catalog's read-index implementation batches reads within a window so that one heartbeat serves many reads, amortizing the cost.
Snapshot Trigger Logic
pub struct SnapshotPolicy {
    pub min_log_entries_between_snapshots: u64,
    pub log_size_multiplier: u64, // snapshot when log > N × snapshot_size
}

pub struct SnapshotTrigger {
    policy: SnapshotPolicy,
    last_snapshot_log_index: u64,
    last_snapshot_size_bytes: u64,
}

impl SnapshotTrigger {
    pub fn should_snapshot(&self, current_log_index: u64, current_log_bytes: u64) -> bool {
        // saturating_sub avoids a u64 underflow panic if the caller passes an
        // index older than the last snapshot.
        let entries_since = current_log_index.saturating_sub(self.last_snapshot_log_index);
        if entries_since < self.policy.min_log_entries_between_snapshots {
            return false;
        }
        // Snapshot when the log has grown to N× the previous snapshot size.
        // This bounds the worst-case catch-up cost: a new follower's catch-up
        // is at most one snapshot plus N× snapshot size of log replay.
        let threshold = self
            .last_snapshot_size_bytes
            .saturating_mul(self.policy.log_size_multiplier);
        current_log_bytes > threshold
    }

    pub fn note_snapshot_taken(&mut self, log_index: u64, snapshot_bytes: u64) {
        self.last_snapshot_log_index = log_index;
        self.last_snapshot_size_bytes = snapshot_bytes;
    }
}
The trigger logic is local to each node; nodes do not coordinate snapshots. A consequence is that different nodes snapshot at different points in time, and a new follower may receive a snapshot from any leader. The InstallSnapshot RPC carries the snapshot's last_included_index and last_included_term so the receiver knows where to resume log replication from.
Key Takeaways
- Membership changes use joint consensus (the original Raft mechanism) or single-server changes (the simpler variant most production systems implement). Both preserve safety; single-server is simpler operationally at the cost of requiring step-at-a-time changes.
- Log compaction via snapshots is essential for long-running clusters. The snapshot trigger should balance snapshot CPU cost against new-follower catch-up cost; 10× log-to-snapshot ratio is a standard heuristic.
- Partition behavior is asymmetric: the majority side continues to operate; the minority side cannot commit new entries. Healing the partition causes truncation of uncommitted entries on the minority side. Two equal-sized partitions in an even-numbered cluster stall until the partition heals, so use odd-numbered clusters.
- Linearizable reads in Raft require either a read-index round (heartbeat-confirmed leadership before serving the read) or lease reads (within a bounded clock-skew assumption). Local-leader reads without coordination are not linearizable — they may serve stale data from a deposed leader.
- Cluster health metrics — leader stability, commit latency, election frequency, snapshot rate — are the early signals of operational problems. A change in any of these is worth investigating before it produces an outage.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 10, "Consensus in Practice" — and in the original Raft paper (Ongaro & Ousterhout 2014) plus Ongaro's PhD dissertation, "Consensus: Bridging Theory and Practice" (Stanford 2014), which is the authoritative reference for single-server membership changes, log compaction, and the read-index/lease-read mechanisms. The dissertation is the recommended deep-dive for engineers implementing or operating production Raft. Specific operational parameters (snapshot ratios, election timeout bands) are conventional defaults; production deployments tune these based on workload.
Module 03 Project — Orbital Raft
Mission Brief
Incident ticket CN-2611-009
Severity: P1 (track-defining)
Reporter: Constellation Operations, Antarctic Watch
Status: Open
The Antarctic relay path has a 14-minute coverage gap during which Antarctic ground stations lose contact with the rest of the Constellation Network. The catalog's leader-election runbook calls for a human operator to declare promotion during a leader outage. During the November storm, this took 47 minutes — the Antarctic team was offline for the duration of the gap, the next operator on call did not see the page until 06:32 local, and by then a partial split-brain had developed because the Antarctic-side replica had continued accepting writes without quorum.
The fix is to automate leader election with a consensus protocol. You are implementing Orbital Raft — a Raft consensus library, in Rust, designed to handle the constellation's specific operational regime: 5-node clusters, intermittent partitions, and recovery without human intervention.
This is the most significant project in the track. The deliverable is a Rust crate, orbital_raft, that implements:
- The Raft state machine (Follower / Candidate / Leader transitions).
- The Raft RPCs (RequestVote, AppendEntries, InstallSnapshot).
- Persistent state (current_term, voted_for, log) flushed to disk before any RPC response.
- Single-server membership changes (add/remove one node at a time).
- A test harness that injects partitions, drops messages, and verifies safety and liveness properties.
The bar is correctness under adversarial conditions, not performance. Production Raft is its own subspecialty; this project demonstrates that you understand the protocol well enough to operate one.
Repository Layout
orbital-raft/
├── Cargo.toml
├── src/
│ ├── lib.rs
│ ├── state.rs # RaftState, term, role, log
│ ├── rpc.rs # RequestVote, AppendEntries, InstallSnapshot types
│ ├── node.rs # RaftNode: the state machine driver
│ ├── transport.rs # Network abstraction (real or simulated)
│ ├── storage.rs # Persistent state I/O (fsync-backed file or sled)
│ ├── membership.rs # Single-server config changes
│ └── snapshot.rs # Snapshot trigger logic and InstallSnapshot
├── tests/
│ ├── leader_election.rs
│ ├── log_replication.rs
│ ├── partition_safety.rs
│ ├── membership_change.rs
│ └── snapshot_install.rs
└── README.md
Required API
// node.rs
pub struct RaftNode<SM: StateMachine> {
    // owns the RaftState, drives election timeouts, dispatches RPCs
}

pub trait StateMachine: Send + Sync {
    fn apply(&self, command: &[u8]) -> Vec<u8>;
    fn snapshot(&self) -> Vec<u8>;
    fn restore(&self, snapshot: &[u8]);
}

impl<SM: StateMachine> RaftNode<SM> {
    pub fn new(
        node_id: String,
        peers: Vec<String>,
        transport: Arc<dyn Transport>,
        storage: Arc<dyn Storage>,
        state_machine: Arc<SM>,
    ) -> Self;
    pub async fn run(&self);
    pub async fn submit(&self, command: Vec<u8>) -> Result<Vec<u8>>;
    pub async fn change_membership(&self, change: ConfigChange) -> Result<()>;
    pub fn is_leader(&self) -> bool;
}

// transport.rs
#[async_trait]
pub trait Transport: Send + Sync {
    async fn request_vote(&self, target: &str, args: RequestVoteArgs) -> Result<RequestVoteReply>;
    async fn append_entries(&self, target: &str, args: AppendEntriesArgs) -> Result<AppendEntriesReply>;
    async fn install_snapshot(&self, target: &str, args: InstallSnapshotArgs) -> Result<InstallSnapshotReply>;
}

// storage.rs
#[async_trait]
pub trait Storage: Send + Sync {
    async fn save_state(&self, term: u64, voted_for: Option<String>) -> Result<()>;
    async fn load_state(&self) -> Result<(u64, Option<String>)>;
    async fn append_log(&self, entries: &[LogEntry]) -> Result<()>;
    async fn load_log(&self) -> Result<Vec<LogEntry>>;
    async fn truncate_log_after(&self, index: u64) -> Result<()>;
    async fn save_snapshot(&self, snapshot: &Snapshot) -> Result<()>;
    async fn load_snapshot(&self) -> Result<Option<Snapshot>>;
}
Acceptance Criteria
- cargo build --release completes without warnings under #![deny(warnings)].
- cargo test --release passes all integration tests with zero flakes across 50 consecutive runs.
- cargo clippy -- -D warnings produces no lints.
- Leader election test: start a 5-node cluster, observe that exactly one node becomes leader within 2 election timeouts. After killing the leader, another node becomes leader within 2 election timeouts.
- Single-leader-per-term test: across 1000 randomly-seeded test runs, no term ever has more than one leader.
- Log replication test: submit 1000 commands to the leader; verify that all five nodes eventually have identical logs after waiting for replication.
- Commit-only-on-majority test: under a partition where the leader is on the minority side, no command submitted to that leader is reported as committed.
- Election restriction test: construct a scenario where one node has a longer log but stale terms; verify the cluster does not elect that node leader (the more-up-to-date logs win).
- Persistence test: crash all 5 nodes; restart them; verify that the cluster recovers and that no committed entry is lost. (Implementation: simulated crash via dropping the in-memory node state while preserving the on-disk storage.)
- Membership change test: start a 3-node cluster; add a 4th node; remove the original first node; verify the cluster maintains availability throughout and the resulting 3-node cluster (nodes 2, 3, 4) is correctly configured.
- Snapshot test: configure aggressive snapshot triggers; submit enough commands to trigger snapshotting; bring up a new follower; verify the new follower receives an InstallSnapshot RPC, applies it, and catches up to current state.
- Partition recovery test: partition a 5-node cluster into 3 and 2; submit commands to the majority side; heal the partition; verify the minority side's nodes catch up and no committed entry is lost.
- (self-assessed) The code is structured so that the state machine, transport, and storage are independently swappable. Tests verify this by using an in-memory transport and storage that differ from any real-world implementation.
- (self-assessed) The README explains, in plain prose, what guarantees the implementation provides and what it does NOT provide (no linearizable reads via lease, no joint consensus for arbitrary membership changes, etc.). A reader should understand the scope and limitations after one pass.
- (self-assessed) Persistence ordering is correct: any state that must be durable before a response is observably fsynced. A code reviewer should be able to confirm this by inspecting save_state and append_log and following their callers.
Expected Output
cargo test --release leader_election -- --nocapture:
[t=0.000s] cluster=[A,B,C,D,E] started
[t=0.000s] A: follower, term=0
[t=0.000s] B: follower, term=0
[t=0.150s] D: election timeout, becoming candidate (term=1)
[t=0.155s] D: requested votes from {A,B,C,E}
[t=0.158s] A: vote granted to D (term=1)
[t=0.159s] B: vote granted to D (term=1)
[t=0.162s] C: vote granted to D (term=1)
[t=0.165s] D: received majority votes, becoming leader (term=1)
[t=0.165s] D: sending initial heartbeats
PASS: cluster elected D as leader in term 1 within 2 election timeouts
[t=2.000s] D crashed
[t=2.150s] A: election timeout, becoming candidate (term=2)
[t=2.155s] A: requested votes from {B,C,E}
[t=2.160s] A: received majority votes, becoming leader (term=2)
PASS: cluster re-elected A as leader in term 2 after D's failure
Hints
1. Structure the state machine as an event loop
The cleanest Raft implementation is a single async loop per node that processes events: incoming RPC, election timeout, heartbeat timeout, new client command. A tokio::select! over channels for each event type makes the state transitions easy to follow. The loop's body is a giant match on self.role and the event kind. Avoid spreading the state machine across many threads — single-loop is correct by construction; multi-threaded is correct only if you're disciplined about locking.
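The single-loop shape can be sketched without an async runtime. The example below uses a `std::sync::mpsc` channel as a stand-in for the `tokio::select!` over several channels, and reduces the node to a pure role-transition function. The `Event` variants are a simplification chosen for illustration; a real node also carries terms, logs, and timers.

```rust
use std::sync::mpsc;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Clone, Copy, Debug)]
enum Event {
    ElectionTimeout,
    WonElection,
    /// Any RPC or reply carrying a term higher than ours.
    SawHigherTerm,
    /// AppendEntries from a valid leader for the current term.
    HeartbeatFromLeader,
}

/// One step of the role state machine: a match on (role, event).
fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        (_, Event::SawHigherTerm) => Role::Follower,
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate,
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate, // start a new election
        (Role::Candidate, Event::WonElection) => Role::Leader,
        (Role::Candidate, Event::HeartbeatFromLeader) => Role::Follower,
        (unchanged, _) => unchanged,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for e in [Event::ElectionTimeout, Event::WonElection, Event::SawHigherTerm] {
        tx.send(e).unwrap();
    }
    drop(tx); // closing the channel ends the loop below

    let mut role = Role::Follower;
    for event in rx {
        let next = step(role, event);
        println!("{role:?} + {event:?} -> {next:?}");
        role = next;
    }
}
```

Keeping `step` free of I/O makes the transition table unit-testable on its own, which is most of the benefit of the single-loop design.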
2. Persist BEFORE responding to RPCs
Raft's safety relies on persistent state being durable before any RPC response that depends on it. In RequestVote: persist voted_for before returning the reply. In AppendEntries: persist log entries before acknowledging. The cost is a disk write per RPC, which in production is mitigated by batching. For this project, an unbatched fsync per RPC is acceptable. Pay attention to the failure mode of "responded but didn't persist" — it manifests as double-voting on crash recovery, which is a safety violation.
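The ordering requirement can be made observable in tests by recording operations. The sketch below is a simplified, synchronous RequestVote handler against a hypothetical in-memory `RecordingStorage`; it omits the log up-to-date check and does no real disk I/O, so it illustrates only the persist-then-reply ordering.

```rust
/// In-memory stand-in for durable storage that records the order of operations.
#[derive(Default)]
struct RecordingStorage {
    term: u64,
    voted_for: Option<String>,
    ops: Vec<&'static str>,
}

impl RecordingStorage {
    fn save_state(&mut self, term: u64, voted_for: Option<String>) {
        // A real implementation would write and fsync here before returning.
        self.term = term;
        self.voted_for = voted_for;
        self.ops.push("persist");
    }
}

struct VoteReply {
    term: u64,
    granted: bool,
}

/// Simplified RequestVote handler: term and prior-vote checks only.
fn handle_request_vote(st: &mut RecordingStorage, candidate: &str, term: u64) -> VoteReply {
    if term < st.term {
        return VoteReply { term: st.term, granted: false };
    }
    // A newer term clears any vote cast in an older term.
    let prior_vote = if term > st.term { None } else { st.voted_for.clone() };
    let granted = match &prior_vote {
        None => true,
        Some(v) => v.as_str() == candidate,
    };
    let new_vote = if granted { Some(candidate.to_string()) } else { prior_vote };
    // The durable write happens first; only then is the reply produced.
    st.save_state(term, new_vote);
    st.ops.push("reply");
    VoteReply { term, granted }
}

fn main() {
    let mut st = RecordingStorage::default();
    let first = handle_request_vote(&mut st, "A", 1);
    let second = handle_request_vote(&mut st, "B", 1);
    println!("A granted={} (term {})", first.granted, first.term);
    println!("B granted={} (term {})", second.granted, second.term);
    println!("ops: {:?}", st.ops);
}
```

A test can assert on `ops` directly: every "reply" must be preceded by a "persist" for the same request.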
3. Testing safety vs liveness separately
Safety tests assert "no two leaders in same term," "no committed entry lost," "log matching property holds." These tests should pass under any scheduling, any message order, any failure injection. Liveness tests assert "cluster elects a leader within bounded time," "submitted commands eventually commit." These tests require some synchrony — they will fail if you inject permanent message loss. Structure the test suite so that safety tests run under chaotic injection and liveness tests run under bounded injection.
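A safety check of this kind is usually a pure function over events collected during the run, so it can be applied to any schedule the harness produces. A minimal sketch of the "no two leaders in the same term" check follows; the `LeaderEvent` record is a hypothetical shape for what a test harness might log.

```rust
use std::collections::HashMap;

/// One "became leader" observation collected during a test run.
struct LeaderEvent {
    term: u64,
    node: &'static str,
}

/// Election safety: at most one distinct leader per term,
/// regardless of when the observations were made.
fn election_safety_holds(events: &[LeaderEvent]) -> bool {
    let mut leader_for_term: HashMap<u64, &str> = HashMap::new();
    for e in events {
        let recorded = leader_for_term.entry(e.term).or_insert(e.node);
        if *recorded != e.node {
            return false;
        }
    }
    true
}

fn main() {
    let ok = [
        LeaderEvent { term: 1, node: "D" },
        LeaderEvent { term: 2, node: "A" },
    ];
    let violated = [
        LeaderEvent { term: 3, node: "B" },
        LeaderEvent { term: 3, node: "C" },
    ];
    println!("clean run holds: {}", election_safety_holds(&ok));
    println!("split-brain run holds: {}", election_safety_holds(&violated));
}
```

Because the check ignores timing entirely, it stays valid under chaotic injection. Liveness checks, by contrast, need a deadline and therefore a bounded fault schedule.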
4. Simulating crashes
A "crash" in tests is implemented as dropping the RaftNode while preserving its Storage. On restart, construct a new RaftNode pointing at the same Storage and verify it recovers the term, voted_for, and log. This catches the persistence-correctness bugs that are hardest to find by inspection — particularly the case where you persisted current_term but not voted_for, allowing a node to vote twice in the same term across a crash.
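The drop-and-restart pattern can be sketched with a shared in-memory disk. Everything here is a simplified stand-in: `Disk` and `Node` are hypothetical types covering only the term and vote, not the project's `Storage` trait or the log.

```rust
use std::sync::{Arc, Mutex};

/// Survives "crashes" because the test keeps its own Arc to it.
#[derive(Default)]
struct Disk {
    term: u64,
    voted_for: Option<String>,
}

struct Node {
    disk: Arc<Mutex<Disk>>,
    term: u64,
    voted_for: Option<String>,
}

impl Node {
    /// Recover volatile fields from whatever the disk holds.
    fn start(disk: Arc<Mutex<Disk>>) -> Node {
        let (term, voted_for) = {
            let d = disk.lock().unwrap();
            (d.term, d.voted_for.clone())
        };
        Node { disk, term, voted_for }
    }

    /// Grant a vote if none was cast this term; persist term and vote together.
    fn vote(&mut self, term: u64, candidate: &str) -> bool {
        if term < self.term {
            return false;
        }
        if term > self.term {
            self.term = term;
            self.voted_for = None;
        }
        if self.voted_for.as_deref().map_or(true, |v| v == candidate) {
            self.voted_for = Some(candidate.to_string());
            let mut d = self.disk.lock().unwrap();
            d.term = self.term;
            d.voted_for = self.voted_for.clone();
            true
        } else {
            false
        }
    }
}

fn main() {
    let disk = Arc::new(Mutex::new(Disk::default()));
    let mut node = Node::start(disk.clone());
    println!("vote for A in term 1: {}", node.vote(1, "A"));
    drop(node); // simulated crash: in-memory state is gone, the disk remains
    let mut restarted = Node::start(disk.clone());
    println!("after restart, vote for B in term 1: {}", restarted.vote(1, "B"));
}
```

If `vote` persisted only the term and not `voted_for`, the restarted node would grant the second vote, which is the double-voting bug this style of test is meant to catch.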
5. The election restriction test scenario
To exercise the election restriction: create a 3-node cluster, partition node C from A and B, advance the term on A and B by triggering an election between them while C is isolated, then submit some commands so A and B have committed entries in term 2 that C does not have. C's log has no term-2 entries. Even if C's log is longer by index (stale term-1 entries), the election restriction compares the last log term before the log length, so A and B refuse to vote for C because their last log term (2) is higher than C's (1). Verify that after the partition heals, C ends up as a follower, not the leader.
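The comparison a voter performs is small enough to state directly. This sketch follows the up-to-date rule from the Raft paper (last log term first, then last log index); the `LogTail` alias and function name are illustrative.

```rust
/// (last_log_term, last_log_index) of a node's log.
type LogTail = (u64, u64);

/// Raft's up-to-date check: compare last terms first;
/// on a tie, the log with the higher last index wins.
fn candidate_is_up_to_date(candidate: LogTail, voter: LogTail) -> bool {
    candidate.0 > voter.0 || (candidate.0 == voter.0 && candidate.1 >= voter.1)
}

fn main() {
    let c: LogTail = (1, 9); // isolated node: long log, stale term
    let a: LogTail = (2, 4); // majority-side node: shorter log, newer term
    println!("A votes for C: {}", candidate_is_up_to_date(c, a));
    println!("C votes for A: {}", candidate_is_up_to_date(a, c));
}
```

With these values, A refuses C despite C's higher last index, and C would grant its vote to A.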
6. Snapshot test pitfalls
InstallSnapshot is subtle. The receiver must: (a) discard its log entries up to and including last_included_index (and discard the whole log if it holds no entry matching last_included_index and last_included_term), (b) restore the state machine from the snapshot, (c) set last_applied to last_included_index, (d) be ready to receive subsequent AppendEntries starting from last_included_index + 1. A common bug is to forget step (a) and end up with stale log entries that conflict with the new snapshot. Verify the test asserts the receiver's log is exactly the entries after the snapshot point, not the entries plus the pre-snapshot remnants.
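The receiver-side steps can be sketched as below. This follows the Raft paper's rule for step (a): keep the log suffix only if the log contains the snapshot's last entry with a matching term, otherwise discard the whole log. The `Follower` struct is a hypothetical reduction, and the early return for stale snapshots is an assumption of this sketch rather than something the lesson specifies.

```rust
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    index: u64,
    term: u64,
}

struct Follower {
    log: Vec<LogEntry>,
    last_applied: u64,
    commit_index: u64,
    state: Vec<u8>,
}

impl Follower {
    fn install_snapshot(&mut self, last_included_index: u64, last_included_term: u64, data: &[u8]) {
        // Assumption: a snapshot older than what is already applied is ignored.
        if last_included_index <= self.last_applied {
            return;
        }
        // (a) Keep the suffix only if our log agrees with the snapshot's last entry.
        let matches = self
            .log
            .iter()
            .any(|e| e.index == last_included_index && e.term == last_included_term);
        if matches {
            self.log.retain(|e| e.index > last_included_index);
        } else {
            self.log.clear();
        }
        // (b) Restore the state machine from the snapshot bytes.
        self.state = data.to_vec();
        // (c) Everything up to the snapshot point now counts as applied.
        self.last_applied = last_included_index;
        self.commit_index = self.commit_index.max(last_included_index);
        // (d) The next AppendEntries is expected to start at last_included_index + 1.
    }
}

fn main() {
    let mut f = Follower {
        log: (1..=5).map(|i| LogEntry { index: i, term: 1 }).collect(),
        last_applied: 2,
        commit_index: 2,
        state: Vec::new(),
    };
    f.install_snapshot(3, 1, b"snap");
    println!("log after install: {:?}", f.log);
    println!("last_applied={}, commit_index={}", f.last_applied, f.commit_index);
}
```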
Source Anchors
- DDIA 2nd Edition, Chapter 10 — "Consensus," "Consensus in Practice," "Membership and Coordination Services"
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" (USENIX ATC 2014) — the Raft paper; this project's primary reference
- Ongaro, "Consensus: Bridging Theory and Practice" (Stanford PhD dissertation, 2014) — the deep reference for membership changes, snapshots, and operational concerns
- The etcd Raft library (github.com/etcd-io/raft) — a high-quality production reference implementation; useful for comparing design choices
Module 04 — Fault Tolerance Patterns
"The catalog's resilience layer has prevented 14 documented incidents. It has also failed to prevent three incidents that occurred because of failure modes the resilience layer had not been designed for. The next outage is, by definition, an unknown failure mode."
Mission Context
Modules 1 through 3 built up the foundations: how to reason about an unreliable network and clock, how to replicate state across nodes, and how to reach agreement on shared state via consensus. The constellation needs one more layer before it can run unattended at 3 AM: the operational discipline that bounds failure when something does go wrong, and the testing practice that discovers failure modes before they discover production.
This module covers three patterns and one practice. Failure detection (Lesson 1) is the primitive every higher-level recovery mechanism depends on; without an accurate liveness signal, elections happen for no reason and real failures take too long to surface. Circuit breakers and bulkheads (Lesson 2) bound the resource coupling that turns a single slow downstream into a constellation-wide cascade. Chaos engineering (Lesson 3) is the discipline that closes the inevitable gap between the failure modes you designed against and the failure modes the system will actually encounter.
The lessons are synthesis-heavy compared to Modules 1–3, because the canonical references (DDIA Ch 9 for failure detection; Release It! and the Hystrix design documents for resilience patterns; Netflix and FoundationDB writeups for chaos) are scattered across multiple sources. Source notes call out where claims are synthesized versus directly cited.
Lessons
| # | Title | Source |
|---|---|---|
| 1 | Failure Detection — Heartbeats, Timeouts, Phi Accrual | DDIA Ch. 9 + Chandra-Toueg 1996 + Phi Accrual 2004 |
| 2 | Circuit Breakers and Bulkheads | Release It! + Hystrix design |
| 3 | Chaos Engineering and Fault Injection | Principles of Chaos Engineering + FoundationDB simulation |
Project
Ground Station Failover — a Rust crate that integrates the module's patterns into an operational failover system. Phi accrual failure detection, rate-based circuit breakers, semaphore bulkheads, retry budgets with backoff, and a chaos-injection harness for verifying the integrated system survives the failure modes it was designed for.
Position
Module 4 of 6 in the Distributed Systems track.
What You Should Be Able to Do After This Module
- Calibrate a failure detector to the actual latency distribution of the network it monitors, and articulate the tradeoff between detection speed and false-positive rate.
- Recognize resource-coupling failure modes in code by inspection and choose the appropriate isolation pattern (semaphore vs thread-pool bulkhead) for each.
- Configure a circuit breaker with rate-based thresholds, minimum-volume floors, and Half-Open recovery probing.
- Design a retry policy that does not amplify load: exponential backoff, jitter, retry budget, and idempotency requirements.
- Plan a chaos engineering program with a rotating set of experiment categories, measurable success metrics (MTTD, MTTR, findings per experiment), and the operational guardrails to run experiments in production safely.
- Distinguish where chaos engineering applies (service-level failure modes), where deterministic simulation applies (low-level consensus and storage), and where neither replaces good architecture review.
Source Materials
- DDIA 2nd Edition (Kleppmann & Riccomini, 2026), Chapter 9 — covers fault detection and the response to degraded performance. The primary direct source for Lesson 1.
- Michael Nygard, Release It! (2nd ed, Pragmatic Bookshelf, 2018) — the canonical reference for circuit breaker, bulkhead, and stability patterns. Strongly recommended for engineers operating production systems.
- Casey Rosenthal & Nora Jones, Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020) — the modern reference for chaos engineering practice.
- Hayashibara et al., "The φ Accrual Failure Detector" (SRDS 2004) — for the adaptive failure detection algorithm.
- Chandra & Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems" (JACM 1996) — the foundational failure detector taxonomy.
- Will Wilson, "Testing Distributed Systems w/ Deterministic Simulation" (Strange Loop 2014) — FoundationDB's approach. Recommended viewing.
Track-level synthesis note: Foundations of Scalable Systems — the source book originally planned for parts of this module — was not available during authoring. Lessons 1, 2, and 3 are synthesized from training knowledge plus the cited papers and books. Source-note callouts within each lesson flag specific claims that should be cross-referenced against Foundations of Scalable Systems or another systems-scaling reference before publication.
Lesson 1: Failure Detection — Heartbeats, Timeouts, and Phi Accrual
Context
Every mechanism we have built so far — Raft elections, replication failovers, conflict resolution — depends on a single primitive that we have so far treated as a given: the ability to suspect that a node has failed. When a heartbeat is overdue, when an AppendEntries does not return, when a TCP connection hangs in a half-open state, some component of the system must declare "this node is suspected dead" and trigger the appropriate response (a new election, a failover, a circuit breaker trip). This component is the failure detector, and its design choices determine the cluster's behavior under partial network degradation.
The failure detector is also the most direct manifestation of the truth from Module 1 Lesson 1: a timeout does not tell you the remote is dead; it tells you that you stopped waiting. Choosing the timeout is choosing the false-positive vs false-negative tradeoff. Choosing the detector algorithm is choosing how aggressively the system adapts to changing network conditions. Production systems spend significant engineering effort on this layer because the failure detector is what determines whether the cluster behaves correctly during incidents — and incidents are exactly the time when correctness matters most.
This lesson covers the spectrum of failure detection approaches, from simple fixed-timeout heartbeats to the phi accrual failure detector used by Cassandra and Akka. By the end, you should be able to choose a failure detection approach for a given workload, calibrate its timeouts and thresholds against the operational regime, and recognize the failure modes (flapping, false positives, slow detection) that each approach is and is not robust to.
Core Concepts
The Spectrum of Failure Detectors
DDIA's discussion of failure detection introduces a useful spectrum (originally from Chandra & Toueg 1996):
Perfect failure detector. Suspects exactly the failed processes. Available only with synchronous assumptions that real systems don't have.
Eventually perfect failure detector. Eventually suspects exactly the failed processes, but may produce false positives or false negatives temporarily. Achievable under partial synchrony; this is what real systems aim for.
Eventually strong failure detector. Eventually suspects all failed processes, and eventually at least one correct process is never suspected. Sufficient for solving consensus.
The takeaway is operational: no real failure detector is perfect. The choice is between approaches that fail in different ways under different conditions. The goal is to pick an approach whose failure modes are tolerable for the workload, not to find one that never fails.
Fixed-Timeout Heartbeats: The Baseline
The simplest detector: each node sends a heartbeat at interval H, and a monitor declares a node dead if no heartbeat arrives within timeout T (typically T = 3 * H). This is what Raft, etcd, ZooKeeper, and most production systems use as a baseline.
The two parameters interact:
- H (heartbeat interval). Shorter means faster detection but more network traffic. For a 5-node cluster in which every node sends one heartbeat per 50ms interval, that's 5 × 20 = 100 heartbeats per second of cluster-internal traffic just for liveness. For wide-area deployments with 12 ground stations, this becomes meaningful.
- T (timeout). Longer means fewer false positives during transient network blips but slower detection of real failures. The 3× rule (T = 3 * H) is a reasonable default: it tolerates one or two missed heartbeats from jitter but declares a failure within 3H of the actual event.
The failure modes:
- False positives under network congestion. When the network is slow but not failed, heartbeats are delayed past T, triggering spurious failure declarations. In Raft, this means spurious elections; in a circuit breaker, this means breaking on healthy services.
- Slow detection under bursty failure. A node that fails immediately after sending a heartbeat will not be detected for up to T. For services with strict latency budgets, this is too slow.
- Flapping. A node that intermittently fails (e.g., a process that periodically GCs for 200ms) repeatedly crosses the threshold, causing alternating "alive" and "suspected" declarations. Each flap may trigger an expensive recovery action; the system spends more time recovering than running.
The constellation network's standard is H=50ms, T=150ms for ground-station-to-leader heartbeats within a region, and H=500ms, T=2s for cross-region paths. The two settings reflect the different latency regimes; using the same settings everywhere either produces false positives on the cross-region path or slow detection on the intra-region path.
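A fixed-timeout detector is a few lines of state. The sketch below takes the clock as a parameter so tests can drive it deterministically; the type and method names are illustrative, and the intra-region values (H=50ms, T=3H) are taken from the text above.

```rust
use std::time::{Duration, Instant};

struct FixedTimeoutDetector {
    timeout: Duration,
    last_heartbeat: Instant,
}

impl FixedTimeoutDetector {
    /// `multiplier` is the T/H ratio, e.g. 3 for the 3× rule.
    fn new(heartbeat_interval: Duration, multiplier: u32, now: Instant) -> Self {
        Self { timeout: heartbeat_interval * multiplier, last_heartbeat: now }
    }

    fn record_heartbeat(&mut self, now: Instant) {
        self.last_heartbeat = now;
    }

    /// Suspected once more than T has elapsed since the last heartbeat.
    fn is_suspected(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_heartbeat) > self.timeout
    }
}

fn main() {
    let t0 = Instant::now();
    let ms = Duration::from_millis;
    let mut detector = FixedTimeoutDetector::new(ms(50), 3, t0);
    println!("suspected at +149ms: {}", detector.is_suspected(t0 + ms(149)));
    println!("suspected at +151ms: {}", detector.is_suspected(t0 + ms(151)));
    detector.record_heartbeat(t0 + ms(160));
    println!("suspected at +200ms after late heartbeat: {}", detector.is_suspected(t0 + ms(200)));
}
```

The last line of `main` shows the flapping mechanism in miniature: a heartbeat that arrives just after the timeout flips the node back to "alive" immediately.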
Adaptive Detectors: Phi Accrual
The fixed-timeout approach assumes a stable distribution of inter-heartbeat times. Production networks aren't stable. A detector that uses the same threshold during a midday traffic spike as during an idle period will be wrong in one of those regimes.
Phi accrual is the technique pioneered by Hayashibara et al. (2004) and adopted by Cassandra, Akka, and Apache HBase. Instead of "declare dead after T seconds without heartbeat," phi accrual outputs a continuous suspicion level: a number that grows as the current silence becomes less plausible given the inter-heartbeat history.
The algorithm maintains a sliding window of recent inter-heartbeat times, fits a distribution to them (a normal distribution in the original paper; Cassandra's implementation uses an exponential approximation), and on each tick computes P, the probability that a heartbeat from a live node would arrive later than the current gap since the last heartbeat. The output is phi = -log10(P). Higher phi = more suspicious: phi = 1 corresponds to P = 0.1, and phi = 3 to P = 0.001. The application chooses a phi threshold above which to declare the node dead.
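The computation can be sketched with the simplest distribution choice. The example below assumes exponentially distributed inter-heartbeat times, so that P = exp(-gap / mean) and phi = gap / (mean × ln 10). That is a simplification chosen for brevity, not the distribution fit described in the paper, and the `PhiDetector` type is hypothetical.

```rust
use std::collections::VecDeque;

struct PhiDetector {
    /// Sliding window of recent inter-heartbeat gaps, in milliseconds.
    window: VecDeque<f64>,
    capacity: usize,
    last_arrival_ms: Option<f64>,
}

impl PhiDetector {
    fn new(capacity: usize) -> Self {
        Self { window: VecDeque::new(), capacity, last_arrival_ms: None }
    }

    fn heartbeat(&mut self, now_ms: f64) {
        if let Some(prev) = self.last_arrival_ms {
            if self.window.len() >= self.capacity {
                self.window.pop_front();
            }
            self.window.push_back(now_ms - prev);
        }
        self.last_arrival_ms = Some(now_ms);
    }

    /// Returns None during bootstrap (no usable interval history yet).
    fn phi(&self, now_ms: f64) -> Option<f64> {
        let last = self.last_arrival_ms?;
        if self.window.is_empty() {
            return None;
        }
        let mean = self.window.iter().sum::<f64>() / self.window.len() as f64;
        if mean <= 0.0 {
            return None;
        }
        let gap = (now_ms - last).max(0.0);
        // Exponential model: P = exp(-gap / mean), so -log10(P) = gap / (mean * ln 10).
        Some(gap / (mean * std::f64::consts::LN_10))
    }
}

fn main() {
    let mut detector = PhiDetector::new(100);
    for t in [0.0, 50.0, 100.0, 150.0] {
        detector.heartbeat(t);
    }
    for now in [200.0, 500.0, 1100.0] {
        println!("t={now}ms phi={:?}", detector.phi(now));
    }
}
```

Under this model phi grows linearly with the gap, scaled by the observed mean interval. That scaling is the adaptive part: if the mean interval doubles, the same silence produces half the suspicion.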
The advantages:
- Adapts to network conditions. If inter-heartbeat times are normally clustered around 50ms with occasional 100ms outliers, the detector tolerates the outliers without flagging. If conditions degrade and 100ms becomes the new normal, the detector recalibrates.
- Suspicion is continuous. The application can take graduated action: at phi=3, log a warning; at phi=5, redirect new traffic; at phi=8, declare dead and trigger failover. Different consumers of the suspicion signal can use different thresholds without re-running the detector.
- Operational tunability. The phi threshold is a single dimensionless parameter, easier to reason about than a timeout (which interacts with the heartbeat interval).
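The graduated response described in the list above maps directly onto a threshold table. This sketch uses the example thresholds from the text (3, 5, 8); the `Action` enum is illustrative, and real consumers would each hold their own thresholds.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    None,
    LogWarning,
    RedirectNewTraffic,
    DeclareDead,
}

/// Maps a suspicion level to the strongest action whose threshold it meets.
fn action_for(phi: f64) -> Action {
    if phi >= 8.0 {
        Action::DeclareDead
    } else if phi >= 5.0 {
        Action::RedirectNewTraffic
    } else if phi >= 3.0 {
        Action::LogWarning
    } else {
        Action::None
    }
}

fn main() {
    for phi in [1.2, 3.5, 6.0, 9.1] {
        println!("phi={phi}: {:?}", action_for(phi));
    }
}
```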
The disadvantages:
- Implementation complexity. Maintaining the sliding window, fitting the distribution, and computing phi efficiently is non-trivial. Several open-source implementations exist (Akka's, Cassandra's, the akka-failure-detector crate); reimplementing for a new system is rarely worth it.
- Sensitivity to bootstrap. The detector needs some history before it can produce meaningful phi values. During bootstrap, it falls back to fixed timeouts.
- Not necessarily appropriate for all workloads. For workloads where the cost of false positives is high and the cost of slow detection is low (financial trading, for example), a more conservative fixed timeout may be the right call.
The catalog uses fixed-timeout heartbeats inside the Raft consensus layer (Raft's safety does not depend on timing, but its availability analysis assumes election timeouts drawn from a fixed range, and an adaptive timeout would mean re-validating election behavior) and phi accrual for service-to-service liveness (where graduated suspicion drives load balancing and circuit breaker decisions).
Gossip-Based Failure Detection
For large clusters (hundreds or thousands of nodes), point-to-point heartbeating doesn't scale. With N nodes, every-to-every heartbeating is O(N²) messages per round, which dominates network traffic at large N.
Gossip protocols (covered in detail in Module 5) provide a scalable alternative: each node tracks failure information for a small subset of peers, exchanges information with random peers periodically, and the cluster-wide view emerges via epidemic propagation. The SWIM protocol (Das, Gupta, Motivala 2002) is the canonical reference, and HashiCorp's Serf is a widely deployed implementation.
The tradeoff is delayed propagation: information about a failure takes O(log N) gossip rounds to reach the entire cluster, where N is the cluster size. For latency-critical detection (a Raft cluster), this is too slow. For broad cluster-membership awareness (Cassandra-style "which nodes are up"), it's the right shape.
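To put rough numbers on the two scaling claims, here is a minimal sketch. It assumes an idealized push-gossip model in which the number of informed nodes doubles every round; real protocols waste some sends and need somewhat more rounds, so treat the second function as the O(log N) shape only. Both function names are hypothetical.

```rust
/// Messages per round when every node heartbeats every other node directly.
pub fn all_to_all_messages(n: u64) -> u64 {
    n * n.saturating_sub(1)
}

/// Rounds needed to inform all `n` nodes if the informed set doubles each
/// round (idealized: fanout 1, no duplicate deliveries).
pub fn ideal_gossip_rounds(n: u64) -> u32 {
    let mut informed: u64 = 1;
    let mut rounds = 0;
    while informed < n {
        informed = informed.saturating_mul(2);
        rounds += 1;
    }
    rounds
}
```

For a 60-node deployment (48 satellites plus 12 ground stations) this gives 3,540 point-to-point messages per round versus 6 idealized gossip rounds; at 1,000 nodes the figures are 999,000 messages and 10 rounds.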
The constellation's 48-satellite + 12-ground-station deployment is small enough that point-to-point heartbeating is feasible, but the architecture team chose gossip for the data plane (telemetry distribution) on the basis that it scales to future expansion without rearchitecting.
Failure Suspicion vs Failure Declaration
A subtle distinction worth naming: suspicion (a probability the node may be dead) is different from declaration (a decision to act as if the node is dead). Production systems separate these.
Suspicion is continuous and cheap. Every component that consumes liveness information (the load balancer, the circuit breaker, the Raft cluster) can subscribe to suspicion updates and react in proportion. A 30% suspicion might mean "don't send new requests"; an 80% suspicion might mean "drop existing connections."
Declaration is discrete and expensive. Acting as if a node is dead — promoting a follower, evicting from the cluster, triggering reconciliation — has costs that should not be paid casually. Declarations should require sustained high suspicion, not transient spikes.
The phi accrual model makes this separation natural: phi is the suspicion signal, and each consumer chooses its own declaration threshold. The fixed-timeout model conflates the two, which is one of the reasons phi accrual is preferred for systems with many consumers of liveness information.
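One way to require "sustained high suspicion, not transient spikes" is to count consecutive samples above the threshold. This is a minimal sketch; the type name and the consecutive-sample rule are assumptions for illustration, and a time-based rule would work equally well.

```rust
/// Declares only after phi has stayed at or above `threshold` for `required`
/// consecutive evaluations.
pub struct SustainedGate {
    threshold: f64,
    required: u32,
    streak: u32,
}

impl SustainedGate {
    pub fn new(threshold: f64, required: u32) -> Self {
        Self { threshold, required, streak: 0 }
    }

    /// Feed one phi sample; returns true once the streak is long enough.
    pub fn observe(&mut self, phi: f64) -> bool {
        if phi >= self.threshold {
            self.streak = self.streak.saturating_add(1);
        } else {
            // A single healthy sample resets the count, so spikes never add up.
            self.streak = 0;
        }
        self.streak >= self.required
    }
}
```

With `SustainedGate::new(8.0, 3)`, the sample sequence 9.0, 9.5, 2.0, 8.0, 8.1, 12.0 declares only on the last sample: the dip to 2.0 discards the first two high readings.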
Failure Detector Calibration
Whatever detector you choose, calibration is operational work that pays off:
- Measure the actual inter-heartbeat distribution. Histograms of p50, p99, p99.9 inter-heartbeat times tell you what "normal" looks like. The timeout (or phi threshold) should be set relative to this baseline.
- Estimate the false-positive rate. In a healthy cluster, how often does the detector spuriously flag a node? If the answer is "weekly," the cluster will see a steady drumbeat of unnecessary failover attempts.
- Measure detection latency. When a real failure occurs, how long until the detector flags it? This bounds the recovery latency the cluster can offer.
- Track the calibration over time. Network conditions drift. A detector tuned for last year's traffic patterns may be poorly tuned for this year's. Periodic recalibration is part of normal operations.
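The first calibration step can be sketched with a nearest-rank percentile over recorded inter-heartbeat gaps. The function names and the "high percentile times a safety factor" rule are assumptions for illustration; the right percentile and factor depend on the deployment.

```rust
/// Nearest-rank percentile of a sample set. `q` must be in (0, 100].
/// Returns None for an empty sample set or an out-of-range `q`.
pub fn percentile(samples: &[f64], q: f64) -> Option<f64> {
    if samples.is_empty() || !(q > 0.0 && q <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest value with at least q% of samples at or below it.
    let rank = ((q / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Timeout candidate: a high percentile of observed gaps times a safety factor.
pub fn suggested_timeout_ms(samples: &[f64], q: f64, safety_factor: f64) -> Option<f64> {
    percentile(samples, q).map(|p| p * safety_factor)
}
```

For the ten gaps 50, 48, 52, 51, 49, 100, 50, 47, 53, 50 (milliseconds), the p50 is 50 and the p99 is the 100ms outlier, so a 1.5 safety factor on p99 suggests a 150ms timeout.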
The catalog publishes failure-detector metrics — false positive count per day, p99 detection latency, current phi values — to the operations dashboard. The metrics are reviewed quarterly and the thresholds adjusted. This sounds tedious. It is. It is also the difference between a system that ages gracefully and one that produces an increasing rate of incidents over time.
Code Examples
Fixed-Timeout Heartbeat Detector
use std::collections::HashMap;
use std::time::{Duration, Instant};

pub struct FixedHeartbeatDetector {
    last_seen: HashMap<String, Instant>,
    timeout: Duration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeStatus {
    Alive,
    Suspected,
    Dead,
}

impl FixedHeartbeatDetector {
    pub fn new(timeout: Duration) -> Self {
        Self { last_seen: HashMap::new(), timeout }
    }

    pub fn note_heartbeat(&mut self, node: &str) {
        self.last_seen.insert(node.to_string(), Instant::now());
    }

    /// Returns the status of a node based on time since last heartbeat.
    /// The 'suspected' band gives consumers a chance to react before
    /// the node is declared dead.
    pub fn status(&self, node: &str) -> NodeStatus {
        match self.last_seen.get(node) {
            None => NodeStatus::Dead,
            Some(&t) => {
                let elapsed = t.elapsed();
                if elapsed < self.timeout / 2 {
                    NodeStatus::Alive
                } else if elapsed < self.timeout {
                    NodeStatus::Suspected
                } else {
                    NodeStatus::Dead
                }
            }
        }
    }
}

fn main() {
    let mut det = FixedHeartbeatDetector::new(Duration::from_millis(150));
    det.note_heartbeat("node-a");
    // Status is Alive immediately after heartbeat
    println!("{:?}", det.status("node-a"));
    // Status of a never-seen node is Dead
    println!("{:?}", det.status("node-z"));
}
The Alive / Suspected / Dead three-state model is the minimal version of the suspicion-vs-declaration separation. Production systems extend this with finer-grained suspicion levels.
Phi Accrual (Simplified)
A correct phi accrual implementation requires careful numerical work; this sketch shows the structural shape:
use std::collections::VecDeque;
use std::time::Instant;

pub struct PhiAccrualDetector {
    intervals: VecDeque<f64>, // recent inter-heartbeat times in milliseconds
    max_samples: usize,
    last_heartbeat: Option<Instant>,
}

impl PhiAccrualDetector {
    pub fn new(max_samples: usize) -> Self {
        Self { intervals: VecDeque::new(), max_samples, last_heartbeat: None }
    }

    pub fn note_heartbeat(&mut self) {
        let now = Instant::now();
        if let Some(prev) = self.last_heartbeat {
            let interval_ms = now.duration_since(prev).as_secs_f64() * 1000.0;
            self.intervals.push_back(interval_ms);
            if self.intervals.len() > self.max_samples {
                self.intervals.pop_front();
            }
        }
        self.last_heartbeat = Some(now);
    }

    /// Compute phi based on current time-since-last-heartbeat and the
    /// distribution of past intervals. Uses a normal approximation;
    /// production implementations choose the distribution empirically.
    pub fn phi(&self) -> f64 {
        let last = match self.last_heartbeat {
            None => return f64::INFINITY, // never seen - maximum suspicion
            Some(t) => t,
        };
        if self.intervals.is_empty() {
            return 0.0; // not enough data
        }
        let n = self.intervals.len() as f64;
        let mean: f64 = self.intervals.iter().sum::<f64>() / n;
        let variance: f64 = self
            .intervals
            .iter()
            .map(|x| (x - mean).powi(2))
            .sum::<f64>()
            / n;
        let stddev = variance.sqrt().max(1.0); // floor to avoid divide-by-zero
        let elapsed_ms = last.elapsed().as_secs_f64() * 1000.0;
        // Normal-approximation P(time-since-last > elapsed). Real
        // implementations use a more accurate model and handle the tail
        // probability with proper numerical techniques.
        let z = (elapsed_ms - mean) / stddev;
        let p = 0.5 * erfc(z / std::f64::consts::SQRT_2);
        if p <= 0.0 { f64::INFINITY } else { -p.log10() }
    }

    pub fn is_alive(&self, phi_threshold: f64) -> bool {
        self.phi() < phi_threshold
    }
}

/// Complementary error function via the Abramowitz-Stegun 7.1.26 polynomial
/// (absolute error about 1.5e-7, so very large phi values are approximate).
/// Stands in for a libm-style erfc so the sketch has no external dependencies.
fn erfc(x: f64) -> f64 {
    let ax = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * ax);
    let poly = t
        * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    let tail = poly * (-ax * ax).exp();
    if x >= 0.0 { tail } else { 2.0 - tail }
}

fn main() {
    let mut det = PhiAccrualDetector::new(100);
    det.note_heartbeat();
    // After only one heartbeat, no interval data - phi is 0
    println!("phi = {}", det.phi());
    println!("alive at threshold 8: {}", det.is_alive(8.0));
}
The production-grade version is significantly more careful: it uses a proper tail probability (not the normal approximation), handles the warm-up period with conservative defaults, and exposes a separate "interval mean" and "interval stddev" for the operations dashboard. The akka-failure-detector crate is a good reference implementation.
Composing Suspicion with Declaration
The pattern for separating suspicion from declaration:
use std::sync::Arc;
use std::time::Instant;

// Stand-in for the PhiAccrualDetector from the previous example, so this
// sketch compiles on its own.
pub struct PhiAccrualDetector;

impl PhiAccrualDetector {
    pub fn phi(&self) -> f64 {
        0.0
    }
}

pub struct LivenessSupervisor {
    detector: Arc<PhiAccrualDetector>,
    // Declaration is sticky: once a node is declared dead, it stays declared
    // until phi falls below the clear threshold. This prevents rapid flapping.
    declared_dead: bool,
    declared_at: Option<Instant>,
}

impl LivenessSupervisor {
    pub fn new(detector: Arc<PhiAccrualDetector>) -> Self {
        Self { detector, declared_dead: false, declared_at: None }
    }

    pub fn evaluate(&mut self, declare_threshold: f64, clear_threshold: f64) {
        let phi = self.detector.phi();
        if !self.declared_dead && phi > declare_threshold {
            self.declared_dead = true;
            self.declared_at = Some(Instant::now());
            // trigger failover, alert, etc.
        } else if self.declared_dead && phi < clear_threshold {
            // Hysteresis: clear only when phi drops well below the threshold.
            // This prevents flapping if phi hovers around the declare value.
            self.declared_dead = false;
            self.declared_at = None;
            // trigger recovery, re-add to load balancer, etc.
        }
    }
}

fn main() {
    let mut sup = LivenessSupervisor::new(Arc::new(PhiAccrualDetector));
    sup.evaluate(8.0, 3.0);
    println!("declared dead: {} (since {:?})", sup.declared_dead, sup.declared_at);
}
The hysteresis between declare_threshold and clear_threshold is operational defense against flapping. Without it, a node whose phi oscillates around the declare value will be declared and undeclared repeatedly, producing a stream of expensive recovery actions.
Key Takeaways
- A failure detector is the primitive every higher-level recovery mechanism (elections, failovers, circuit breakers) depends on. Its accuracy determines whether the cluster's reactions are appropriate or pathological.
- Fixed-timeout heartbeats are the practical baseline. The 3× rule (T = 3 * H) is a sensible default; calibrating H and T to the actual network's latency distribution is the work that turns "good enough" into "operationally robust."
- Phi accrual is the adaptive alternative used by Cassandra, Akka, and similar systems. It outputs a continuous suspicion level rather than a binary declaration, which lets multiple consumers react with different thresholds.
- Suspicion (continuous, cheap, frequently updated) and declaration (discrete, expensive, sticky with hysteresis) should be separated. Conflating them produces flapping and unnecessary recovery actions.
- Calibration is ongoing work. Network conditions drift; detector thresholds tuned six months ago may be poorly tuned now. Treat false-positive rate and detection latency as monitored metrics, not set-and-forget config.
Source note: This lesson synthesizes from multiple sources. The Chandra & Toueg failure detector taxonomy is from "Unreliable Failure Detectors for Reliable Distributed Systems" (Journal of the ACM, 1996). Phi accrual is from Hayashibara, Defago, Yared, Katayama, "The φ Accrual Failure Detector" (SRDS 2004). SWIM is from Das, Gupta, Motivala (2002). DDIA Chapter 9 covers fixed-timeout heartbeats but does not deeply treat adaptive detectors; that content is synthesized from training knowledge and the cited papers. Specific operational parameters (50ms heartbeats, phi thresholds) are illustrative and should be calibrated for any production deployment. Foundations of Scalable Systems was unavailable as a source; the synthesis here should be cross-checked against that text or another systems-scaling reference before publication.
Lesson 2: Circuit Breakers and Bulkheads
Context
During the September incident, a single misbehaving downstream propagated failure through the entire Constellation Network in 47 seconds: the conjunction-prediction service began responding with a p99 latency of 8 seconds instead of the usual 80ms. The chain was straightforward in retrospect. Callers held connections open waiting for replies, each new request consumed another connection, the connection pool was exhausted, and every other dependency of the calling services started timing out as their thread pools blocked on the exhausted callers. By the time the on-call engineer logged in, the dashboard showed errors propagating through services that had nothing to do with the original failure. A single slow downstream had taken out half the constellation.
The fix is a category of patterns that bound failure rather than transmit it: circuit breakers, which cut off calls to a misbehaving downstream before they consume local resources; bulkheads, which isolate resource pools so that exhaustion of one pool does not starve others; and timeouts and retries with budget, which prevent any individual call from holding resources indefinitely. These patterns are the operational vocabulary of resilient distributed systems. Hystrix popularized them at Netflix in the early 2010s; their lineage traces back to Michael Nygard's Release It! (2007), which is still the canonical reference.
This lesson covers the three patterns and the failure modes they address. By the end, you should be able to identify the resource-coupling that allows a single slow dependency to take out a service, choose the right pattern to break that coupling, and recognize the configurations (overly aggressive timeouts, undersized bulkheads, retries that amplify load) that make these patterns produce more incidents than they prevent.
Core Concepts
The Resource Coupling Problem
The root cause of the September incident is a class of failure that has nothing to do with the network being unreliable, the clock being unreliable, or replication lag. It is a property of shared resources: connections, thread pool slots, file descriptors, memory. When a service makes calls to a downstream that is slow but not failed, each in-flight call holds a connection. Eventually, the local connection pool is exhausted. New requests that needed those connections now also wait, even if they were destined for other downstreams that are healthy.
This is resource coupling: the failure of one dependency consumes the resources required for other dependencies' health. The dependency graph in the operations diagram says service A depends on B; the resource coupling says service A's calls to C also fail when B is slow, because A's local thread pool is full of stuck calls to B.
The defenses against resource coupling are structural. Tuning individual call timeouts does not fix it on its own: shorter timeouts turn slow calls into failures, and those failures feed retries. You fix it by partitioning resources so failures are isolated.
The Circuit Breaker Pattern
A circuit breaker is a wrapper around a downstream call that maintains state across calls and stops issuing them when the downstream is unhealthy. It is modeled on the electrical circuit breaker: when current exceeds the safe threshold, the breaker trips and stops conducting until manually reset.
The state machine has three states:
Closed. Calls pass through to the downstream. The breaker tracks success/failure rates over a rolling window. If the failure rate exceeds a threshold (typical: 50% over 10 seconds with a minimum of 20 calls), the breaker transitions to Open.
Open. Calls return immediately with an error — no downstream call is made. After a configurable timeout (typical: 30 seconds), the breaker transitions to Half-Open to test recovery.
Half-Open. A single call (or a small probe set) is allowed through. If it succeeds, the breaker transitions back to Closed. If it fails, the breaker returns to Open with the timeout restarted.
The pattern's value is twofold. First, failing fast: an Open breaker returns errors in microseconds instead of waiting for the downstream's timeout (often seconds). The caller can fall back to a degraded behavior, cached data, or a useful error message far faster than waiting for the slow downstream to time out. Second, giving the downstream room to recover: a service overloaded by retry storms cannot recover under load. Cutting off the retries by tripping the breaker reduces load on the downstream, often enough for it to recover on its own.
The catalog wraps every cross-service call in a circuit breaker. The conjunction-prediction service has its own breaker; the orbital-element-registry has another; the telemetry-ingest pipeline has a third. When one trips, only the calls to that service fail fast; the others continue to operate normally.
Bulkheads: Isolating Resource Pools
The bulkhead pattern takes its name from naval architecture: a ship is divided into compartments, so flooding in one compartment does not sink the whole vessel. Applied to software: separate resource pools per dependency, so that resource exhaustion in one pool does not affect others.
The standard implementation:
- Per-dependency thread pools. Each downstream service gets its own thread pool. Calls to service A use pool A; calls to service B use pool B. A slow service A fills pool A but does not consume the threads available for B.
- Per-dependency connection pools. Same shape, applied to HTTP connections or database connections. Stuck calls to the slow downstream do not exhaust connections needed for healthy downstreams.
- Per-tenant queue isolation. In multi-tenant systems, separate queues per tenant prevent one heavy tenant from starving others.
The cost is resource overhead: N thread pools require N pool memory footprints, and the worst-case total resource budget is N × pool_size rather than a single shared pool. This is acceptable: the alternative is shared pools that can be exhausted by any single misbehaving consumer.
A subtler variant is semaphore isolation: instead of dedicated threads, each dependency gets a permit count. A call to dependency A acquires a permit from A's semaphore; if all permits are taken, the call fails fast. Semaphores are cheaper than thread pools (no thread context switching) but provide weaker isolation — a CPU-heavy operation on shared threads still affects other dependencies. Hystrix uses thread pools by default and semaphores for low-latency operations; the choice is workload-specific.
Timeouts and the Retry Budget
The third leg of the resilience stool is timeouts with explicit deadline propagation and bounded retries.
Every call should have a timeout. The timeout should be short enough to bound resource consumption (long timeouts hold resources during slow failures) but long enough to accommodate the normal latency tail. The standard heuristic is timeout = p99.9_latency + safety_margin. Anything tighter generates spurious failures; anything looser fails to bound resource use.
Deadline propagation is the related discipline: when service A receives a request with a 5-second deadline and calls service B, the call to B should carry a deadline that is 5 seconds minus elapsed-on-A. If A has already spent 3 seconds, the call to B should time out in 2 seconds, not 5. Without deadline propagation, deep call chains can do work that has already missed its real deadline, wasting downstream resources on requests the caller has given up on. gRPC carries the deadline natively (as the grpc-timeout request header); plain HTTP requires an application-level convention.
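The arithmetic of deadline propagation is small enough to sketch directly. The function names are hypothetical, and the per-call cap is an added assumption: a dependency's own configured timeout should still apply when the inherited budget is larger.

```rust
use std::time::Duration;

/// Budget left for a downstream call, given the deadline the caller was handed
/// and the time already spent locally. Returns None when the deadline has
/// already passed, in which case the downstream call should not be made at all.
pub fn remaining_budget(deadline: Duration, elapsed: Duration) -> Option<Duration> {
    deadline.checked_sub(elapsed).filter(|d| !d.is_zero())
}

/// Timeout for the downstream call: the remaining budget, capped by the
/// per-call timeout configured for that dependency.
pub fn downstream_timeout(
    deadline: Duration,
    elapsed: Duration,
    per_call_cap: Duration,
) -> Option<Duration> {
    remaining_budget(deadline, elapsed).map(|left| left.min(per_call_cap))
}
```

With a 5-second deadline and 3 seconds already spent, `remaining_budget` returns 2 seconds; once 5 seconds or more have elapsed it returns `None`, which tells the caller to fail immediately instead of issuing work nobody is waiting for.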
Retries are dangerous and necessary. Necessary because transient failures (network blips, brief overloads) are real and retrying often succeeds. Dangerous because retries multiply load: a caller sending 1000 RPS with up to 3 retries per request sends about 1.9× that volume to the downstream when half of all attempts fail, and 4× when every attempt fails. During a failure, this is exactly the load profile that prevents the downstream from recovering.
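As a rough check on retry amplification, assume each attempt fails independently with probability p. That is an idealization, since failures during a real outage are correlated. The expected number of attempts per request is then a short geometric sum. The function name is illustrative.

```rust
/// Expected attempts per original request when each attempt fails
/// independently with probability `p` and up to `max_retries` retries follow
/// the first attempt: 1 + p + p^2 + ... + p^max_retries.
pub fn expected_attempts(p: f64, max_retries: u32) -> f64 {
    (0..=max_retries).map(|k| p.powi(k as i32)).sum()
}
```

At p = 0.5 with 3 retries this gives 1.875 attempts per request. The 4× worst case is reached only at p = 1, which is also the moment the downstream can least afford it.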
The mitigations:
- Exponential backoff with jitter. Don't retry immediately; wait base * 2^attempt * random(). Without jitter, simultaneous retries from many callers form synchronized waves that re-hit the downstream simultaneously.
- Retry budget. Limit retries to a percentage of base traffic (e.g., 10%). When the failure rate is high enough that retries exceed the budget, retries are dropped. This prevents the retry-amplification cascade.
- Idempotency. Every operation that is retried must be idempotent. This is the same point from Module 1 Lesson 1: at-least-once delivery is only safe under idempotency.
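The retry budget from the list above can be sketched with two counters. This is a minimal illustration, not the catalog's RPC framework: the type name is hypothetical, and the counters are cumulative for simplicity where a production version would use a sliding window or token bucket so that old traffic stops earning budget.

```rust
/// Retry budget: retries are granted only while they stay within a fixed
/// percentage of first-attempt traffic.
pub struct RetryBudget {
    budget_percent: u64, // e.g. 10 for a 10% budget
    requests: u64,       // first attempts seen
    retries: u64,        // retries granted so far
}

impl RetryBudget {
    pub fn new(budget_percent: u64) -> Self {
        Self { budget_percent, requests: 0, retries: 0 }
    }

    /// Call once per original (non-retry) request.
    pub fn record_request(&mut self) {
        self.requests += 1;
    }

    /// Returns true and consumes budget if one more retry fits; returns false
    /// when the retry should be dropped.
    pub fn try_retry(&mut self) -> bool {
        // Integer arithmetic keeps the cut-off exact.
        let allowed = self.requests.saturating_mul(self.budget_percent) / 100;
        if self.retries < allowed {
            self.retries += 1;
            true
        } else {
            false
        }
    }
}
```

After 100 requests at a 10% budget, ten retries are granted and the eleventh is refused until more first-attempt traffic arrives.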
The catalog's RPC framework provides timeouts (mandatory), deadline propagation (gRPC-native), and a retry budget (10% of base RPS). Engineers adding new endpoints inherit all three by default; opting out is a code-review discussion.
Composing Circuit Breakers, Bulkheads, and Timeouts
These patterns layer cleanly:
- Timeout bounds the time any individual call holds a resource.
- Bulkhead bounds the total resource budget per dependency.
- Circuit breaker stops issuing calls to a dependency that has failed enough to trip.
- Retry with backoff and budget handles transient failures without amplifying load.
A resilient call to a downstream looks like: check the circuit breaker (if open, fail fast); enter the bulkhead (semaphore acquire); make the call with a timeout; if it fails, record the failure for the breaker's state and either retry (within budget) or fail. Each layer addresses a different failure mode.
The cost is implementation complexity: every cross-service call needs all four layers, and getting them right requires non-trivial code. The mitigation is to centralize the resilience layer in a single library or framework: every call goes through the same wrapper, the wrapper provides all four behaviors, and individual endpoints just configure thresholds. The catalog's ResilientClient type is this wrapper; it's a few hundred lines of code that protects every cross-service call.
Anti-Patterns
The patterns are powerful but easy to misconfigure. The anti-patterns:
Timeouts shorter than dependency p99 latency. Produces spurious failures during normal operation; the system spends time in retry rather than productive work.
Timeouts longer than upstream deadlines. Wastes resources on requests the upstream has already given up on. Deadline propagation prevents this when used correctly.
Bulkheads sized for steady-state, not burst. When a dependency spikes briefly, calls back up and exceed the bulkhead. The fix is either larger bulkheads (more resources held idle in steady-state) or a queue with bounded depth (calls fail fast when the queue is full).
Circuit breakers with thresholds tuned to traffic volume. A breaker tuned for "trip after 100 failures" will rarely trip in low-traffic dependencies (because total volume is low) and trip too easily in high-traffic ones (because absolute failure count crosses the threshold even when failure rate is normal). The fix is rate-based thresholds with a minimum-volume floor.
Retries without budget or backoff. The retry storm pattern: a service degrades, retries multiply load, the degradation worsens, retries multiply more, the service fails completely. The AWS S3 outage of February 2017 shows how far a cascade can spread once dependents pile onto a degraded service, although its trigger, per the public post-incident summary, was an operator command that removed too much capacity rather than a retry storm.
The patterns are correct only when configured for the workload. Most operational pain from circuit breakers and bulkheads comes from misconfiguration, not from the patterns themselves.
Code Examples
A Simple Circuit Breaker
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BreakerState {
    Closed,
    Open,
    HalfOpen,
}

pub struct CircuitBreaker {
    state: Mutex<State>,
    failure_threshold: u32,
    success_threshold: u32, // probes needed in half-open to close
    open_duration: Duration,
}

struct State {
    state: BreakerState,
    consecutive_failures: u32,
    consecutive_successes: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, success_threshold: u32, open_duration: Duration) -> Self {
        Self {
            state: Mutex::new(State {
                state: BreakerState::Closed,
                consecutive_failures: 0,
                consecutive_successes: 0,
                opened_at: None,
            }),
            failure_threshold,
            success_threshold,
            open_duration,
        }
    }

    /// Called before a downstream call. Returns Ok(()) if the call may proceed,
    /// or Err(()) if the breaker is open and the call should fail fast.
    pub fn allow(&self) -> Result<(), ()> {
        let mut s = self.state.lock().unwrap();
        match s.state {
            BreakerState::Closed => Ok(()),
            // Probe call. This sketch admits every caller while half-open;
            // a stricter breaker would cap the number of concurrent probes.
            BreakerState::HalfOpen => Ok(()),
            BreakerState::Open => {
                let elapsed = s.opened_at.map(|t| t.elapsed()).unwrap_or_default();
                if elapsed >= self.open_duration {
                    // Transition to half-open: allow a probe.
                    s.state = BreakerState::HalfOpen;
                    s.consecutive_successes = 0;
                    Ok(())
                } else {
                    Err(())
                }
            }
        }
    }

    /// Called after a successful downstream call.
    pub fn note_success(&self) {
        let mut s = self.state.lock().unwrap();
        s.consecutive_failures = 0;
        match s.state {
            BreakerState::Closed => {}
            BreakerState::HalfOpen => {
                s.consecutive_successes += 1;
                if s.consecutive_successes >= self.success_threshold {
                    s.state = BreakerState::Closed;
                    s.opened_at = None;
                }
            }
            BreakerState::Open => {
                // Shouldn't happen - a call only completes if allow() returned Ok.
            }
        }
    }

    /// Called after a failed downstream call.
    pub fn note_failure(&self) {
        let mut s = self.state.lock().unwrap();
        s.consecutive_successes = 0;
        s.consecutive_failures += 1;
        match s.state {
            BreakerState::Closed => {
                if s.consecutive_failures >= self.failure_threshold {
                    s.state = BreakerState::Open;
                    s.opened_at = Some(Instant::now());
                }
            }
            BreakerState::HalfOpen => {
                // Probe failed - back to open.
                s.state = BreakerState::Open;
                s.opened_at = Some(Instant::now());
            }
            BreakerState::Open => {}
        }
    }

    pub fn current_state(&self) -> BreakerState {
        self.state.lock().unwrap().state
    }
}

fn main() {
    let cb = CircuitBreaker::new(3, 2, Duration::from_secs(30));
    println!("initial: {:?}", cb.current_state());
    // Three consecutive failures trip the breaker.
    cb.note_failure();
    cb.note_failure();
    cb.note_failure();
    println!("after 3 failures: {:?}", cb.current_state());
    // Calls now fail fast.
    assert!(cb.allow().is_err());
}
This is the consecutive-failure variant; production breakers typically use a rolling-window failure rate, which is more sophisticated but follows the same state machine.
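The rolling-window variant can be sketched as follows. It covers only the trip decision (failure rate over a window with a minimum-volume floor), not the full state machine. Timestamps are caller-supplied milliseconds so the logic stays deterministic; a real breaker would read a monotonic clock. All names are illustrative assumptions.

```rust
use std::collections::VecDeque;

/// Rolling-window failure-rate tracker for a rate-based breaker.
pub struct RollingWindow {
    window_ms: u64,
    min_calls: usize,
    failure_rate_threshold: f64,
    outcomes: VecDeque<(u64, bool)>, // (timestamp_ms, succeeded)
}

impl RollingWindow {
    pub fn new(window_ms: u64, min_calls: usize, failure_rate_threshold: f64) -> Self {
        Self { window_ms, min_calls, failure_rate_threshold, outcomes: VecDeque::new() }
    }

    /// Record one call outcome and drop anything older than the window.
    pub fn record(&mut self, now_ms: u64, succeeded: bool) {
        self.outcomes.push_back((now_ms, succeeded));
        self.evict(now_ms);
    }

    fn evict(&mut self, now_ms: u64) {
        while let Some(&(t, _)) = self.outcomes.front() {
            if now_ms.saturating_sub(t) > self.window_ms {
                self.outcomes.pop_front();
            } else {
                break;
            }
        }
    }

    /// True when the window holds at least `min_calls` calls and the failure
    /// rate is at or above the threshold. Below the floor it never trips.
    pub fn should_trip(&mut self, now_ms: u64) -> bool {
        self.evict(now_ms);
        let total = self.outcomes.len();
        if total < self.min_calls {
            return false;
        }
        let failures = self.outcomes.iter().filter(|o| !o.1).count();
        failures as f64 / total as f64 >= self.failure_rate_threshold
    }
}
```

With `RollingWindow::new(10_000, 20, 0.5)`, ten failures alone do not trip the breaker because the volume floor is not met. Ten failures out of twenty calls do, and the window empties again once those calls age out.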
Semaphore-Based Bulkhead
use std::sync::Arc;
use std::time::Duration;

use anyhow::Result;
use tokio::sync::Semaphore;

pub struct Bulkhead {
    semaphore: Arc<Semaphore>,
    acquire_timeout: Duration,
}

impl Bulkhead {
    pub fn new(max_concurrent: usize, acquire_timeout: Duration) -> Self {
        Self {
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
            acquire_timeout,
        }
    }

    pub async fn execute<F, R>(&self, f: F) -> Result<R>
    where
        F: std::future::Future<Output = R>,
    {
        // Try to acquire a permit within the timeout. If we can't, the
        // bulkhead is full - fail fast rather than queueing indefinitely.
        let permit = tokio::time::timeout(self.acquire_timeout, self.semaphore.acquire())
            .await
            .map_err(|_| anyhow::anyhow!("bulkhead acquire timeout"))?
            .map_err(|_| anyhow::anyhow!("bulkhead semaphore closed"))?;
        let result = f.await;
        drop(permit);
        Ok(result)
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let bulkhead = Bulkhead::new(2, Duration::from_millis(100));
    let result = bulkhead.execute(async { "downstream-call-result" }).await?;
    println!("got: {}", result);
    Ok(())
}
The semaphore version is the lightest-weight bulkhead; thread-pool bulkheads (one pool per dependency) provide stronger isolation when CPU contention is the concern.
Composing the Layers
// Resilient call combining bulkhead, circuit breaker, timeout, and retry.
use std::time::Duration;

use anyhow::Result;

pub struct ResilientClient {
    bulkhead: Bulkhead,
    breaker: CircuitBreaker,
    call_timeout: Duration,
    max_retries: u32,
}

impl ResilientClient {
    pub async fn call<F, R, Fut>(&self, op: F) -> Result<R>
    where
        F: Fn() -> Fut,
        Fut: std::future::Future<Output = Result<R>>,
    {
        // Outer retry loop with exponential backoff.
        for attempt in 0..=self.max_retries {
            // 1. Check the circuit breaker.
            if self.breaker.allow().is_err() {
                anyhow::bail!("circuit breaker open");
            }
            // 2. Enter the bulkhead and execute with timeout.
            let result = self
                .bulkhead
                .execute(tokio::time::timeout(self.call_timeout, op()))
                .await;
            match result {
                // Layers, outermost first: bulkhead, timeout, the call itself.
                Ok(Ok(Ok(r))) => {
                    self.breaker.note_success();
                    return Ok(r);
                }
                Ok(Ok(Err(_))) | Ok(Err(_)) | Err(_) => {
                    self.breaker.note_failure();
                    if attempt < self.max_retries {
                        // Exponential backoff with jitter.
                        let base = Duration::from_millis(100);
                        let backoff = base * 2u32.saturating_pow(attempt);
                        let jitter = rand::random::<f64>();
                        tokio::time::sleep(backoff.mul_f64(0.5 + jitter * 0.5)).await;
                    }
                }
            }
        }
        anyhow::bail!("max retries exceeded")
    }
}

// Stand-ins so this example compiles on its own. The real Bulkhead and
// CircuitBreaker are the types from the two previous examples, and `rand`
// stands in for the rand crate.
struct Bulkhead;

impl Bulkhead {
    async fn execute<F: std::future::Future>(&self, f: F) -> Result<F::Output> {
        Ok(f.await)
    }
}

struct CircuitBreaker;

impl CircuitBreaker {
    fn allow(&self) -> Result<()> {
        Ok(())
    }
    fn note_success(&self) {}
    fn note_failure(&self) {}
}

mod rand {
    pub fn random<T>() -> f64 {
        0.5
    }
}
The composition is the production shape. Every call to a downstream goes through this layer. The configuration is per-dependency: each downstream has its own ResilientClient instance with thresholds tuned to that dependency's characteristics.
Key Takeaways
- Resource coupling is the failure mode where one slow dependency exhausts shared resources (connections, threads), preventing healthy work on other dependencies. The structural fix is to partition resources, not to tune individual timeouts.
- Circuit breakers fail fast on calls to misbehaving downstreams, freeing local resources for other work and giving the downstream room to recover. The three-state machine (Closed, Open, Half-Open) handles the recovery probe pattern.
- Bulkheads isolate resource pools per dependency so that exhaustion in one pool does not propagate to others. Thread-pool bulkheads provide stronger isolation; semaphore bulkheads are lighter weight.
- Timeouts, retries with backoff and jitter, and retry budgets are the third leg of the resilience stool. Without backoff, simultaneous retries form synchronized waves; without budget, retries amplify load during failures.
- The patterns layer cleanly: bulkhead, circuit breaker, timeout, retry. Compose them in a centralized resilience library so every cross-service call inherits the protection without per-endpoint work.
Source note: This lesson synthesizes from the Hystrix design (Netflix, 2012, archived), Michael Nygard's Release It! (2nd ed, 2018, Pragmatic Bookshelf) — the canonical reference for these patterns — and DDIA Chapter 9's brief treatment of "Degraded performance and partial functionality." Specific operational parameters (10% retry budget, p99.9 + safety margin timeouts) are illustrative and should be calibrated to the workload. The AWS S3 February 2017 outage reference is from the public post-incident report; specific details should be verified before publication. Foundations of Scalable Systems was unavailable; the resilience-pattern material would normally cite that text and should be cross-referenced.
Lesson 3: Chaos Engineering and Fault Injection
Context
The catalog's resilience layer — failure detection, circuit breakers, bulkheads, retries — was deployed two years ago. It has prevented 14 documented incidents, according to the after-action records. It has also failed to prevent three incidents that occurred because of failure modes the resilience layer had not been designed for: a downstream returning correct responses with a 30-second latency that fell within the timeout but above the bulkhead-acquire wait; a coordinator service that returned HTTP 200 with an error body the caller's framework did not parse as an error; and a deadlock in the resilience layer itself that occurred under a specific race between circuit-breaker state transitions and pool acquisition.
The pattern is consistent. Resilience patterns protect against known failure modes. The next outage is, by definition, an unknown failure mode — something the system has not been tested against. The discipline that closes the gap is chaos engineering: the deliberate, controlled injection of failures into production-like environments to discover failure modes before they discover you.
Chaos engineering was popularized at Netflix in the early 2010s with the Chaos Monkey tool, which randomly terminated production instances during business hours to verify that the architecture survived single-node failures. The practice has matured into a broader discipline covering network partitions, latency injection, dependency failures, configuration errors, and the human-factors elements of incident response. By the end of this lesson, you should understand the principles of chaos engineering, the spectrum of fault-injection tools and approaches, and the operational discipline (game days, deterministic simulation testing, runbook validation) that converts chaos engineering from a "nice idea" into reliable knowledge about how the system actually fails.
Core Concepts
The Principles of Chaos Engineering
The Principles of Chaos Engineering, formalized by the Netflix team and published at principlesofchaos.org, lay out a short list of guiding ideas. Four of them shape the rest of this lesson:
Build a hypothesis around steady-state behavior. Before injecting a failure, define what "normal" looks like. The hypothesis is "in the steady state of N requests per second and P% error rate, when we inject failure F, the system will still serve at least M% of those requests with no observable degradation beyond X." Without a measurable steady-state hypothesis, the experiment cannot have a clear result.
Vary real-world events. The failures injected should be ones the system will plausibly encounter in production: instance termination, network latency, packet loss, dependency unavailability, region outages. Inventing artificial failures that production won't produce is less valuable than reproducing realistic ones.
Run experiments in production. This is the principle that distinguishes chaos engineering from traditional testing. Production has properties — actual traffic, actual data distributions, actual operator behavior — that no test environment can replicate. The discipline is to run controlled chaos experiments in production while bounding the blast radius so that user-visible impact is minimized.
Automate experiments to run continuously. A failure mode discovered once and then never re-checked will silently regress as the system evolves. Automation makes chaos a continuous discipline rather than a quarterly event.
These principles are aspirational; production chaos engineering also requires substantial guardrails. The catalog's chaos program runs experiments only during business hours, only when on-call is staffed, only with explicit dashboards and kill-switches, and only after the experiment has passed in lower environments. The discipline is "controlled chaos," not "chaos."
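The hypothesis template and the guardrail list above can both be written down as data, which makes an experiment's result mechanical rather than a matter of opinion. The following is a minimal sketch, not the catalog's actual tooling: every type and field name (`SteadyStateHypothesis`, `Observation`, `Guardrails`, `Verdict`) is hypothetical, and the two thresholds stand in for the M and X of the hypothesis template.

```rust
/// Steady-state hypothesis for one experiment. Field names are hypothetical.
#[derive(Debug, Clone)]
pub struct SteadyStateHypothesis {
    /// Minimum fraction of requests (0.0..=1.0) that must still succeed.
    pub min_success_ratio: f64,
    /// Maximum tolerated p99 latency, in milliseconds, during injection.
    pub max_p99_ms: u64,
}

/// What monitoring reported over the injection window.
#[derive(Debug, Clone)]
pub struct Observation {
    pub total_requests: u64,
    pub successful_requests: u64,
    pub p99_ms: u64,
}

/// Preconditions mirroring the guardrails listed above.
#[derive(Debug, Clone)]
pub struct Guardrails {
    pub business_hours: bool,
    pub on_call_staffed: bool,
    pub kill_switch_ready: bool,
    pub passed_lower_envs: bool,
}

impl Guardrails {
    /// Every guardrail must hold before any fault is injected.
    pub fn allow_experiment(&self) -> bool {
        self.business_hours
            && self.on_call_staffed
            && self.kill_switch_ready
            && self.passed_lower_envs
    }
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Held,
    Violated(String),
    /// No traffic was observed, so the experiment proves nothing.
    Inconclusive,
}

impl SteadyStateHypothesis {
    /// Compare what was observed during injection against the hypothesis.
    pub fn evaluate(&self, obs: &Observation) -> Verdict {
        if obs.total_requests == 0 {
            return Verdict::Inconclusive;
        }
        let ratio = obs.successful_requests as f64 / obs.total_requests as f64;
        if ratio < self.min_success_ratio {
            return Verdict::Violated(format!(
                "success ratio {:.3} below required {:.3}",
                ratio, self.min_success_ratio
            ));
        }
        if obs.p99_ms > self.max_p99_ms {
            return Verdict::Violated(format!(
                "p99 {}ms above limit {}ms",
                obs.p99_ms, self.max_p99_ms
            ));
        }
        Verdict::Held
    }
}
```

A runner built on this shape would check `allow_experiment()` first and refuse to inject if any guardrail is false. The `Inconclusive` verdict is the case the text warns about: without measurable steady-state traffic, the experiment has no clear result.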
The Chaos Tool Lineage
Chaos engineering as practice traces through several generations of tooling:
Chaos Monkey (Netflix, 2011): randomly terminates EC2 instances during business hours. The original; still the standard introduction. Forces the architecture to tolerate single-instance failures, which surfaces dependencies on specific machine instances rather than on roles.
Simian Army (Netflix, 2012): extends Chaos Monkey with siblings such as Chaos Gorilla (availability-zone failure; the later Chaos Kong covers region failure), Latency Monkey (delay injection), Janitor Monkey (resource cleanup), and Conformity Monkey (configuration drift). The pattern: each monkey injects one specific failure mode, automated to run regularly.
Gremlin, Chaos Mesh, LitmusChaos (mid-2010s onward): commercial and open-source platforms that productize fault injection. Network partitions, CPU/memory stress, disk failure, time skew, and dependency-specific failures (Kafka, Redis, gRPC) are all injectable as targeted experiments.
Deterministic simulation testing (FoundationDB, 2014; TigerBeetle, 2020s): instead of injecting failures in production, simulate the entire system at the level of individual messages and CPU operations, then run millions of random schedules to find concurrency bugs. FoundationDB's simulation is what allowed them to claim "we found the bugs Jepsen would find before Jepsen would have found them", and the claim held up.
The catalog uses a layered approach: deterministic simulation for the consensus and replication layers (where bugs are subtle and the state space is tractable), Gremlin-style production fault injection for service-level failures, and game days (covered below) for the human-factors layer.
What to Inject: A Taxonomy of Failure Modes
A useful checklist of failure modes to consider for any new system:
Infrastructure failures. Instance termination, machine restart, disk full, network partition between zones, DNS unavailability, certificate expiry. The Chaos Monkey lineage covers these well.
Network failures. Packet loss, latency injection, asymmetric routes, MTU changes, intermittent dropouts. Latency injection is particularly valuable because it surfaces timeout misconfiguration without the system actually failing.
Dependency failures. A downstream returning errors, returning malformed data, or returning correct responses with high latency. Each is a different failure mode with different defenses needed.
Configuration errors. Wrong feature flags, expired credentials, mismatched versions across the cluster, schema migrations that haven't propagated. Configuration changes are among the most frequently cited causes of real outages, and they are testable: deliberately inject a wrong config and see what happens.
Time-related failures. Clock skew between nodes, NTP unavailability, leap-second handling, system clock jumping backward. The Module 1 lesson on clock unreliability covers exactly the failure modes in this category.
Resource exhaustion. Memory pressure, CPU saturation, file descriptor leaks, thread pool exhaustion. These are the failures that classically take down systems that have only been tested under normal load.
Adversarial input. Malformed requests, very large requests, requests with unusual character encodings, replay attacks. Less about chaos engineering and more about robustness testing, but the line is blurry.
A mature chaos program covers all of these on a rotating cadence. The catalog's chaos engineering team runs one experiment per category per quarter, with the specific scenario varying so the system is not just tested against the same fault repeatedly.
Game Days
The human-factors element of incident response is the part most easily neglected. A system can be technically resilient and still produce protracted outages because the on-call engineer doesn't know which runbook to follow, can't find the relevant dashboard, or doesn't have the access credentials for the failing component.
Game days are scheduled exercises where a controlled failure is injected and the operations team responds as if it were a real incident. The team uses the actual runbook, the actual dashboards, the actual escalation paths. The result is a list of gaps: the runbook step that references a deleted dashboard, the alert that doesn't actually page anyone, the credential that needs a manager's approval to use during off-hours.
The discipline:
- Schedule the game day in advance. Not a surprise; the team knows it will happen, but not the specific scenario. This mimics the real on-call experience without the genuine stress of a 3 AM page.
- Inject the failure with realistic constraints. The injector is a separate team or person; the responding team doesn't know what was injected and must diagnose from first principles using only production tooling.
- Time-box and observe. A facilitator tracks the response, noting decisions and bottlenecks. The blameless retro afterward is where the gaps surface.
- Convert findings to backlog items. Every "we didn't have a runbook for this" becomes a runbook item. Every "we couldn't find the right dashboard" becomes a dashboard improvement.
Game days are slow and expensive (a full one takes a half-day from a dozen people), but they surface the soft-skill and tooling gaps that automated chaos cannot. The catalog runs four game days per year: two announced in advance and two where only leadership knows the date.
Deterministic Simulation Testing
For the lower layers of the stack (consensus, storage, replication), production chaos engineering is too coarse. A bug that requires a specific interleaving of three message receives between Raft followers may not manifest in years of production traffic, but a deterministic simulation that tries thousands of message orderings per minute will find it in an afternoon.
The pattern, pioneered at FoundationDB:
- Build the system as a deterministic state machine. Every component — networking, storage, threading — is implemented behind an interface that allows a simulator to control its behavior. In production the implementation is real network/disk/threads; in tests, the simulator substitutes mock implementations with deterministic random scheduling.
- Inject faults at the simulation level. The simulator can drop, reorder, and delay messages arbitrarily; pause and resume "threads" at any point; corrupt disk writes; introduce arbitrary clock skew.
- Run with random seeds. Each test run produces a deterministic trace from a seed. A failing run can be replayed exactly by re-running the same seed.
- Run millions of seeds. The state space of a non-trivial system is too large to exhaustively explore, but random sampling finds bugs with high probability. FoundationDB ran billions of simulation-seconds before they considered the system production-ready.
The cost is that the system must be architected from the start to support deterministic simulation. Retrofitting an existing codebase is hard. TigerBeetle and FoundationDB are designed for it; most other systems get partial benefit by simulating a subset of components (the Raft layer, for instance) while leaving others untested.
The catalog's Raft implementation uses deterministic simulation (the Module 3 project includes a simplified version). Other components rely on more conventional testing. The result is that the Raft layer has very high confidence and the others have moderate confidence — which matches the cost asymmetry of bugs in each layer.
Measuring the Value of Chaos Engineering
A common pushback against chaos engineering: "we're spending engineering time creating problems instead of solving them." The response requires measurement.
Useful metrics:
- Findings per experiment. How many real bugs or operational gaps does each chaos experiment surface? If the answer is consistently zero, either the system is genuinely resilient (great) or the experiments are not exercising the system's weaknesses (more likely).
- Mean time to detection (MTTD). When a failure is injected, how long until monitoring detects it? Chaos engineering exposes detection gaps directly.
- Mean time to recovery (MTTR). After the failure is detected, how long until the system returns to baseline? Recovery time is a function of the runbook, the tooling, and the automation — each of which chaos surfaces.
- Production incident rate. The lagging indicator. A successful chaos program correlates with a decrease in production incident rate, holding architecture stable.
The catalog tracks all of these. After two years of chaos engineering, MTTD has dropped from 12 minutes to 3 minutes, MTTR has dropped from 47 minutes to 18 minutes, and production incident rate (per million requests) has dropped 60%. The chaos engineering investment was sustained, then refined, on the basis of these numbers.
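The metrics above are means over many experiments, so they fall out of a simple fold over per-experiment records. This sketch assumes a hypothetical `ExperimentRecord` shape (it is not the catalog's schema) and treats "never detected" as a separate count rather than as a very large detection time, since averaging it in would hide the gap.

```rust
use std::time::Duration;

/// One chaos experiment's outcome. Field names are hypothetical.
pub struct ExperimentRecord {
    /// None means monitoring never detected the injected failure.
    pub time_to_detect: Option<Duration>,
    /// None means the system did not return to baseline in the window.
    pub time_to_recover: Option<Duration>,
    pub findings: u32,
}

pub struct Summary {
    pub mttd: Option<Duration>,
    pub mttr: Option<Duration>,
    pub undetected: usize,
    pub findings_per_experiment: f64,
}

/// Arithmetic mean of a set of durations; None when the set is empty.
fn mean(durations: impl Iterator<Item = Duration>) -> Option<Duration> {
    let (sum, n) = durations.fold((Duration::ZERO, 0u32), |(s, n), d| (s + d, n + 1));
    if n == 0 { None } else { Some(sum / n) }
}

pub fn summarize(records: &[ExperimentRecord]) -> Summary {
    let findings: u32 = records.iter().map(|r| r.findings).sum();
    Summary {
        // Only experiments that were actually detected contribute to MTTD.
        mttd: mean(records.iter().filter_map(|r| r.time_to_detect)),
        mttr: mean(records.iter().filter_map(|r| r.time_to_recover)),
        undetected: records.iter().filter(|r| r.time_to_detect.is_none()).count(),
        findings_per_experiment: if records.is_empty() {
            0.0
        } else {
            findings as f64 / records.len() as f64
        },
    }
}
```

Reporting `undetected` alongside MTTD matters: a program whose MTTD improves only because the hardest failures are never detected at all is getting worse, not better.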
Code Examples
A Simple Latency Injector
For service-level fault injection, the cheapest mechanism is a wrapper that introduces controlled delays:
// Requires the tokio crate for tokio::time::sleep.
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

pub struct LatencyInjector {
    delay_ms: AtomicU64,        // 0 means disabled
    probability_pct: AtomicU64, // 0-100
}

impl LatencyInjector {
    pub fn new() -> Self {
        Self {
            delay_ms: AtomicU64::new(0),
            probability_pct: AtomicU64::new(0),
        }
    }

    pub fn configure(&self, delay_ms: u64, probability_pct: u64) {
        self.delay_ms.store(delay_ms, Ordering::SeqCst);
        self.probability_pct.store(probability_pct.min(100), Ordering::SeqCst);
    }

    pub fn disable(&self) {
        self.delay_ms.store(0, Ordering::SeqCst);
        self.probability_pct.store(0, Ordering::SeqCst);
    }

    /// Called at the start of each request. Injects delay with the configured
    /// probability. The delay is a single fixed value; production injectors use
    /// distributions that match observed real-world latency tails.
    pub async fn maybe_inject(&self) {
        let prob = self.probability_pct.load(Ordering::SeqCst);
        let delay = self.delay_ms.load(Ordering::SeqCst);
        if prob == 0 || delay == 0 {
            return;
        }
        // Roll the dice. Probability is per-request.
        let roll = fastrand::u64(0..100);
        if roll < prob {
            tokio::time::sleep(Duration::from_millis(delay)).await;
        }
    }
}

// Stand-in for the `fastrand` crate so the listing has no RNG dependency.
// It always returns 50, so this stub only injects when probability_pct > 50;
// use the real crate for actual per-request randomness.
mod fastrand {
    pub fn u64(_r: std::ops::Range<u64>) -> u64 {
        50
    }
}

fn main() {
    let inj = LatencyInjector::new();
    inj.configure(200, 5); // 5% of requests get 200ms delay
    println!("latency injector configured");
    // Production: the injector is wired into every cross-service call.
}
The configuration is dynamically adjustable, so the chaos team can ramp up the injection rate gradually and back off if the system shows distress beyond the experimental budget. Production injectors integrate with feature flag systems so the kill switch is one toggle away.
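The ramp-and-back-off behavior described above can be reduced to one small decision function evaluated once per observation interval. This is a sketch under assumed names (`RampPolicy`, `next_pct`, and the error-budget field are all hypothetical); it backs off to zero on any budget breach, which is one conservative choice among several.

```rust
/// Hypothetical ramp policy for an injection probability expressed in percent.
pub struct RampPolicy {
    /// Percentage points added per interval while the system stays healthy.
    pub step_pct: u64,
    /// Ceiling for the injection probability.
    pub max_pct: u64,
    /// Highest error rate (0.0..=1.0) the experiment is allowed to cause.
    pub error_budget: f64,
}

impl RampPolicy {
    /// Returns the injection probability to use for the next interval.
    pub fn next_pct(&self, current_pct: u64, observed_error_rate: f64) -> u64 {
        if observed_error_rate > self.error_budget {
            // Distress beyond the experimental budget: stop injecting entirely.
            0
        } else {
            // Healthy: ramp up by one step, never past the ceiling.
            current_pct.saturating_add(self.step_pct).min(self.max_pct)
        }
    }
}
```

The returned value would be passed to something like `LatencyInjector::configure` each interval. Dropping straight to zero rather than stepping down keeps the kill path as simple as the feature-flag toggle it mirrors.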
A Deterministic Simulation Harness (Sketch)
// SKETCH: deterministic simulation framework for a Raft-style protocol.
use std::collections::VecDeque;

pub struct SimEnvironment {
    nodes: Vec<SimNode>,
    network: SimNetwork,
    rng: SeededRng,
    clock: SimulatedClock,
}

pub struct SimNetwork {
    // Pending messages, controllable by the simulator.
    pending: VecDeque<(String, String, Vec<u8>)>,
    drop_probability: f64,
    reorder_probability: f64,
}

impl SimEnvironment {
    pub fn new(seed: u64, node_count: usize) -> Self { /* ... */ unimplemented!() }

    /// Single simulation step: deliver one message, advance one node's clock,
    /// or inject one fault. The choice is made by the seeded RNG, so the
    /// entire run is determined by the seed.
    pub fn step(&mut self) -> StepResult {
        unimplemented!()
    }

    pub fn run_until(&mut self, predicate: impl Fn(&SimEnvironment) -> bool) -> usize {
        let mut steps = 0;
        while !predicate(self) {
            self.step();
            steps += 1;
            if steps > 1_000_000 { panic!("simulation diverged"); }
        }
        steps
    }
}

// In a test:
// for seed in 0..1_000_000 {
//     let mut env = SimEnvironment::new(seed, 5);
//     env.run_until(|e| e.has_committed_index(10));
//     assert!(env.no_safety_violations());
// }

struct SimNode;
struct SeededRng;
struct SimulatedClock;
struct StepResult;
The structure is the point. A bug found by seed 483729 can be reproduced exactly by re-running with that seed: no "it's flaky" excuses, no "we couldn't reproduce." Bugs are deterministically reproducible, which makes them deterministically fixable.
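The replay property can be demonstrated in a few lines without any of the Raft machinery. The sketch below is a toy, not the harness above: it uses SplitMix64 as a stand-in seeded RNG (the catalog's actual generator is not specified here), and the `Event` and `run` names are hypothetical. All nondeterminism, meaning which message is delivered next and whether it is dropped, flows from the seed.

```rust
/// SplitMix64: a small, well-known PRNG, used so the sketch needs no crates.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// What the simulated network did with each message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Delivered(u32),
    Dropped(u32),
}

/// Runs one simulated schedule. The only source of randomness is `seed`,
/// so the returned trace is a pure function of (seed, message_count).
pub fn run(seed: u64, message_count: u32) -> Vec<Event> {
    let mut rng = SplitMix64::new(seed);
    let mut pending: Vec<u32> = (0..message_count).collect();
    let mut trace = Vec::with_capacity(pending.len());
    while !pending.is_empty() {
        // Picking an arbitrary pending message models network reordering.
        let idx = (rng.next_u64() % pending.len() as u64) as usize;
        let msg = pending.swap_remove(idx);
        // Roughly one message in ten is dropped instead of delivered.
        if rng.next_u64() % 10 == 0 {
            trace.push(Event::Dropped(msg));
        } else {
            trace.push(Event::Delivered(msg));
        }
    }
    trace
}
```

Calling `run` twice with the same seed yields identical traces, which is the whole mechanism behind replaying a failing seed. The property breaks the moment any component consults wall-clock time, thread scheduling, or an unseeded RNG, which is why the real pattern puts every such dependency behind a simulator-controlled interface.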
Game Day Tracker (Sketch of Structure)
use std::time::Instant;

pub struct GameDayEvent {
    pub timestamp: Instant,
    pub actor: String,
    pub action: String,
    pub notes: String,
}

pub struct GameDay {
    pub scenario_name: String,
    pub scheduled_at: Instant,
    pub injection_at: Option<Instant>,
    pub detected_at: Option<Instant>,
    pub mitigated_at: Option<Instant>,
    pub resolved_at: Option<Instant>,
    pub timeline: Vec<GameDayEvent>,
    pub findings: Vec<String>,
}

impl GameDay {
    pub fn mttd(&self) -> Option<std::time::Duration> {
        match (self.injection_at, self.detected_at) {
            (Some(i), Some(d)) => Some(d.duration_since(i)),
            _ => None,
        }
    }

    pub fn mttr(&self) -> Option<std::time::Duration> {
        match (self.detected_at, self.resolved_at) {
            (Some(d), Some(r)) => Some(r.duration_since(d)),
            _ => None,
        }
    }

    pub fn record_event(&mut self, actor: &str, action: &str, notes: &str) {
        self.timeline.push(GameDayEvent {
            timestamp: Instant::now(),
            actor: actor.to_string(),
            action: action.to_string(),
            notes: notes.to_string(),
        });
    }
}

fn main() {
    let gd = GameDay {
        scenario_name: "primary region failure".to_string(),
        scheduled_at: Instant::now(),
        injection_at: None,
        detected_at: None,
        mitigated_at: None,
        resolved_at: None,
        timeline: Vec::new(),
        findings: Vec::new(),
    };
    println!("game day: {}", gd.scenario_name);
}
The point is not the code — game day tracking is process work, not engineering. The point is that the data produced by a game day is structured enough to feed back into MTTD and MTTR metrics, and findings should be queryable and trackable like any other engineering work.
Key Takeaways
- Resilience patterns protect against known failure modes; chaos engineering discovers the unknown ones. Both are necessary. Defenses without testing are theoretical; testing without defenses produces incidents.
- The principles of chaos engineering — steady-state hypothesis, real-world events, production experiments, continuous automation — are aspirational. Production chaos requires guardrails: business hours, kill switches, bounded blast radius, on-call coverage.
- Chaos targets span infrastructure, network, dependencies, configuration, time, resources, and adversarial input. A mature program rotates through these categories rather than testing the same fault repeatedly.
- Game days surface human-factors gaps that automated chaos cannot: missing runbooks, broken dashboards, escalation paths that don't work at 3 AM. The slow, expensive ones produce findings that other testing misses.
- Deterministic simulation testing is the right tool for low-level concurrent systems (consensus, storage). It requires architectural support and substantial up-front cost, and it produces dramatically higher confidence than conventional testing for the systems where it applies.
Source note: This lesson synthesizes from Netflix's published material on Chaos Monkey and the Simian Army (techblog.netflix.com), the Principles of Chaos Engineering document (principlesofchaos.org), Casey Rosenthal & Nora Jones, Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020), and the FoundationDB simulation testing approach as described in Will Wilson's "Testing Distributed Systems w/ Deterministic Simulation" talk (Strange Loop 2014). DDIA Chapter 9 has a brief discussion under "Fault injection" and "Formal Methods and Randomized Testing." Specific operational numbers (MTTD reduction from 12 to 3 minutes, etc.) are illustrative and not real Meridian metrics. Foundations of Scalable Systems would have been the standard reference here and was unavailable; the content should be cross-checked against that text before publication.
Module 04 Project — Ground Station Failover
Mission Brief
Incident ticket CN-2702-003
Severity: P2
Reporter: Constellation Operations, Atlantic Watch
Status: Open
The Atlantic ground station primary uplink failed at 11:23Z. The backup uplink should have taken over within 30 seconds. Instead, the failover procedure stalled for 8 minutes because: (1) the failure detector took 4 minutes to declare the primary dead (the detector was tuned for cross-region paths, not the Atlantic intra-region path); (2) the circuit breaker on the satellite-command service was configured with a 50-failure threshold, but at the Atlantic traffic rate that's a full day of failures; (3) when the failover finally completed, a retry storm from the queued commands overloaded the backup uplink for another 5 minutes.
You are building Ground Station Failover, a Rust crate that integrates the patterns from this module into an operational failover system: phi accrual failure detection, circuit breakers and bulkheads on outgoing commands, retry budgets with backoff, and a chaos-injection harness for verifying the system works under adversarial conditions.
Repository Layout
ground-station-failover/
├── Cargo.toml
├── src/
│ ├── lib.rs
│ ├── detector.rs # FixedHeartbeatDetector + PhiAccrualDetector
│ ├── breaker.rs # CircuitBreaker with Closed/Open/HalfOpen state
│ ├── bulkhead.rs # Semaphore-based bulkhead
│ ├── client.rs # ResilientClient composing breaker + bulkhead + timeout + retry
│ ├── failover.rs # PrimaryBackupFailover: monitors primary, promotes backup
│ └── chaos.rs # Latency / drop / failure injection harness
├── tests/
│ ├── detection_calibration.rs
│ ├── breaker_state_machine.rs
│ ├── bulkhead_isolation.rs
│ ├── failover_end_to_end.rs
│ └── retry_budget.rs
└── README.md
Required API
// detector.rs
pub trait FailureDetector: Send + Sync {
    fn note_heartbeat(&self, node: &str);
    fn suspicion_level(&self, node: &str) -> f64;
    fn is_alive(&self, node: &str, threshold: f64) -> bool;
}
pub struct PhiAccrualDetector { /* ... */ }
pub struct FixedHeartbeatDetector { /* ... */ }

// breaker.rs
pub enum BreakerState { Closed, Open, HalfOpen }
pub struct CircuitBreaker { /* ... */ }
impl CircuitBreaker {
    pub fn new(failure_rate_threshold: f64, min_volume: u32, open_duration: Duration) -> Self;
    pub fn allow(&self) -> Result<(), ()>;
    pub fn note_success(&self);
    pub fn note_failure(&self);
    pub fn state(&self) -> BreakerState;
}

// bulkhead.rs
pub struct Bulkhead { /* ... */ }
impl Bulkhead {
    pub fn new(max_concurrent: usize, acquire_timeout: Duration) -> Self;
    pub async fn execute<F: Future<Output = R>, R>(&self, f: F) -> Result<R>;
}

// client.rs
pub struct ResilientClient { /* ... */ }
impl ResilientClient {
    pub async fn call<F, R, Fut>(&self, op: F) -> Result<R>
    where F: Fn() -> Fut, Fut: Future<Output = Result<R>>;
}

// failover.rs
pub struct PrimaryBackupFailover { /* ... */ }
impl PrimaryBackupFailover {
    pub fn new(primary: ResilientClient, backup: ResilientClient, detector: Arc<dyn FailureDetector>) -> Self;
    pub async fn run(&self); // monitor + promote on detected failure
    pub fn current_active(&self) -> ActiveTier;
}
pub enum ActiveTier { Primary, Backup }
Acceptance Criteria
- cargo build --release completes without warnings under #![deny(warnings)].
- cargo test --release passes all integration tests with zero flakes across 50 consecutive runs.
- cargo clippy -- -D warnings produces no lints.
- Detector calibration test: with synthetic inter-heartbeat times drawn from a normal distribution (mean 50ms, stddev 10ms), the phi accrual detector reports phi < 2 in steady state. After the next heartbeat is 300ms late, phi rises above the configured declare threshold.
- Detection latency test: when the primary stops sending heartbeats, the detector declares it dead within 3× the configured base timeout. (For phi accrual: within the time phi takes to climb from the calibrated baseline to the declare threshold.)
- Breaker state machine test: under a sequence of 100 failures, the breaker transitions Closed → Open. After the open duration elapses, a probe transitions Open → HalfOpen. A successful probe transitions HalfOpen → Closed; a failed probe transitions HalfOpen → Open.
- Rate-based threshold test: the breaker uses a failure-rate threshold with a minimum-volume floor. With 5 failures out of 5 calls, the breaker does NOT trip (volume below minimum). With 50 failures out of 100 calls, the breaker DOES trip (rate exceeds threshold and volume above minimum).
- Bulkhead isolation test: two bulkheads in the same process; one is saturated with slow operations. The other bulkhead's operations continue at normal latency. Test asserts the second bulkhead's p99 latency is within 10% of its baseline.
- Failover end-to-end test: with primary configured to fail (all calls error), backup configured to succeed, and detector configured with phi threshold 8: the failover system detects the failure, promotes the backup, and the next 100 commands succeed against the backup with no errors propagated to the caller.
- Retry budget test: under 50% downstream failure rate, the total retry-attempt count is bounded to 10% of the base call count. The retry-budget mechanism rejects retries that would exceed the budget rather than amplifying load.
- Chaos injection harness: the chaos module provides programmatic injection of latency, error responses, and dropped requests, with dynamic configuration (the chaos can be turned on, ramped, and turned off during a test).
- (self-assessed) The README explains the failure modes the system is designed to handle and explicitly names the failure modes it is NOT designed to handle (e.g., Byzantine failures, data corruption, deliberate adversarial behavior).
- (self-assessed) The phi accrual implementation is verified against a published reference; if it deviates, the deviations are documented and justified.
- (self-assessed) The retry-budget mechanism uses a sliding window (not a fixed bucket) and the window size is configurable. The default is documented and justified.
Expected Output
cargo test --release failover_end_to_end -- --nocapture:
[t=0.000s] Failover system initialized; primary=Atlantic-A, backup=Atlantic-B
[t=0.000s] Detector: phi-accrual, declare threshold=8.0
[t=0.000s] Initial state: PRIMARY active
[t=0.050s] Primary heartbeats: arriving at 50ms intervals, phi=0.4
[t=2.000s] Chaos: Atlantic-A enters failure mode (all calls error)
[t=2.050s] Primary heartbeats: STOPPED
[t=2.450s] Detector: phi=3.1 (suspected)
[t=2.700s] Detector: phi=6.4 (suspected, approaching declare)
[t=2.850s] Detector: phi=8.3 (DECLARED dead)
[t=2.851s] Failover: promoting backup Atlantic-B
[t=2.852s] Failover: state=BACKUP active
[t=2.852s] Subsequent 100 commands directed to backup
[t=4.852s] All 100 commands succeeded against Atlantic-B
PASS: failover completed in 852ms; no caller-visible errors after promotion
Hints
1. Detector tuning is workload-specific
The phi accrual paper gives example thresholds (typically 8.0), but the right value for the constellation is empirical. Run the detector against synthetic inter-heartbeat distributions matching your target environment, observe the phi distribution under normal and degraded conditions, and pick a threshold that separates them with a comfortable margin. Treat the chosen threshold as configuration, not a hardcoded constant.
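To calibrate a threshold you need a phi function to run the synthetic distributions through. The sketch below is one way to compute it, under stated assumptions: it models inter-heartbeat times as normally distributed and uses a logistic approximation to the normal CDF (the shape some production detectors use) instead of an exact error function, which the Rust standard library does not provide. `IntervalWindow` and `phi` are hypothetical names, not the project's required API.

```rust
use std::collections::VecDeque;

/// Sliding window of recent inter-heartbeat intervals, in milliseconds.
pub struct IntervalWindow {
    samples: VecDeque<f64>,
    capacity: usize,
}

impl IntervalWindow {
    pub fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record one interval, evicting the oldest sample when full.
    pub fn record(&mut self, interval_ms: f64) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(interval_ms);
    }

    pub fn mean(&self) -> f64 {
        self.samples.iter().sum::<f64>() / self.samples.len().max(1) as f64
    }

    /// Population standard deviation of the recorded intervals.
    pub fn std_dev(&self) -> f64 {
        let m = self.mean();
        let var = self.samples.iter().map(|x| (x - m).powi(2)).sum::<f64>()
            / self.samples.len().max(1) as f64;
        var.sqrt()
    }
}

/// phi = -log10(P(next heartbeat arrives later than `elapsed_ms`)).
/// Uses a logistic approximation of the normal CDF (an assumption of this
/// sketch). Returns +infinity once the tail probability underflows to zero,
/// which still compares correctly against any finite threshold.
pub fn phi(elapsed_ms: f64, mean_ms: f64, std_dev_ms: f64) -> f64 {
    let std_dev = std_dev_ms.max(1e-3); // guard against a zero-variance window
    let y = (elapsed_ms - mean_ms) / std_dev;
    let e = (-y * (1.5976 + 0.070566 * y * y)).exp();
    if elapsed_ms > mean_ms {
        -(e / (1.0 + e)).log10()
    } else {
        -(1.0 - 1.0 / (1.0 + e)).log10()
    }
}
```

For calibration, feed the window from your synthetic distribution, then evaluate `phi` at a range of elapsed times and look at where normal jitter ends and real lateness begins. A minimum floor on the standard deviation (larger than the `1e-3` guard here) is a common tuning knob, because a very regular heartbeat otherwise makes phi spike on trivial jitter.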
2. Rate-based vs count-based breaker thresholds
Count-based thresholds ("trip after 50 failures") don't generalize across traffic levels. A high-traffic dependency hits 50 failures quickly during normal operation; a low-traffic one might never accumulate 50 failures even during sustained outages. Use rate-based thresholds instead ("50% failure rate over a 10-second window with minimum 20 calls"); the minimum-volume floor prevents flapping at low traffic.
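The trip decision itself is only a few lines once the window counts exist. This is a sketch of the decision only, with hypothetical names (`WindowCounts`, `should_trip`); maintaining the window is left out. It treats a rate equal to the threshold as tripping, which is an assumption chosen to match the "50 failures out of 100 calls" acceptance case under a 50% threshold.

```rust
/// Call outcomes observed in the current window. Hypothetical shape.
pub struct WindowCounts {
    pub calls: u32,
    pub failures: u32,
}

/// Rate-based trip decision with a minimum-volume floor.
pub fn should_trip(counts: &WindowCounts, failure_rate_threshold: f64, min_volume: u32) -> bool {
    // Below the floor there is too little evidence to act on, however bad
    // the rate looks: 5 failures out of 5 calls must not trip the breaker.
    if counts.calls == 0 || counts.calls < min_volume {
        return false;
    }
    let rate = counts.failures as f64 / counts.calls as f64;
    // Assumption: a rate exactly at the threshold trips the breaker.
    rate >= failure_rate_threshold
}
```

With a threshold of 0.5 and a floor of 20, five failures out of five calls does not trip and fifty out of a hundred does, which is the behavior the rate-based threshold test expects.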
3. Bulkhead acquire-timeout vs queue-depth
A bulkhead can either queue requests (calls wait for a permit) or reject (calls fail when no permit is available). Queueing introduces unpredictable latency; rejection produces clearer error signals. Most production bulkheads queue with an acquire timeout: calls wait up to T for a permit, then fail. This is the shape the project's Bulkhead::execute takes.
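The queue-with-acquire-timeout shape can be shown with the standard library alone. Note the assumption: this is a blocking, thread-based analog built on `Mutex` and `Condvar`, whereas the project's `Bulkhead::execute` is async and would more naturally sit on an async semaphore. The names `BlockingBulkhead` and `BulkheadFull` are hypothetical.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Returned when no permit became available within the acquire timeout.
#[derive(Debug, PartialEq)]
pub struct BulkheadFull;

pub struct BlockingBulkhead {
    available: Mutex<usize>,
    released: Condvar,
    acquire_timeout: Duration,
}

/// Gives the permit back on drop, even if the protected closure panics.
struct Permit<'a>(&'a BlockingBulkhead);

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.0.available.lock().unwrap() += 1;
        self.0.released.notify_one();
    }
}

impl BlockingBulkhead {
    pub fn new(max_concurrent: usize, acquire_timeout: Duration) -> Self {
        Self {
            available: Mutex::new(max_concurrent),
            released: Condvar::new(),
            acquire_timeout,
        }
    }

    /// Waits up to `acquire_timeout` for a permit, then runs `f`.
    pub fn execute<R>(&self, f: impl FnOnce() -> R) -> Result<R, BulkheadFull> {
        let guard = self.available.lock().unwrap();
        // Sleep while no permits are free, but never past the timeout.
        let (mut guard, wait) = self
            .released
            .wait_timeout_while(guard, self.acquire_timeout, |avail| *avail == 0)
            .unwrap();
        if wait.timed_out() {
            return Err(BulkheadFull);
        }
        *guard -= 1;
        drop(guard); // do not hold the mutex while the operation runs
        let _permit = Permit(self);
        Ok(f())
    }
}
```

The error type is deliberately distinct from any downstream error. A caller composing this with a circuit breaker can then avoid counting `BulkheadFull` as a downstream failure, since it is a local resource decision.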
4. Retry budget mechanics
The retry budget is typically implemented as a sliding-window counter. Track (base_calls, retry_calls) over a sliding window of configurable length (e.g., the most recent 10 seconds). When considering a retry: if retry_calls / base_calls would exceed the budget (e.g., 0.1), reject the retry. This caps retry volume regardless of the failure rate. The counters can be approximate (atomic counters) for performance; exact accounting under contention is more expensive than the value it provides.
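The mechanics described above can be sketched with two timestamp queues. This version is exact rather than approximate and takes `&mut self`, so it trades the performance the hint mentions for clarity; a production variant would likely use bucketed atomic counters. The `RetryBudget` type and its method names are hypothetical. Time is passed in explicitly so tests stay deterministic.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

pub struct RetryBudget {
    window: Duration,
    max_ratio: f64, // e.g. 0.1 allows retries up to 10% of base calls
    base_calls: VecDeque<Instant>,
    retries: VecDeque<Instant>,
}

impl RetryBudget {
    pub fn new(window: Duration, max_ratio: f64) -> Self {
        Self { window, max_ratio, base_calls: VecDeque::new(), retries: VecDeque::new() }
    }

    /// Drop every timestamp that has slid out of the window.
    fn evict(&mut self, now: Instant) {
        let window = self.window;
        for queue in [&mut self.base_calls, &mut self.retries] {
            while let Some(&oldest) = queue.front() {
                if now.duration_since(oldest) > window {
                    queue.pop_front();
                } else {
                    break;
                }
            }
        }
    }

    /// Record a first attempt. Base calls are what earn retry budget.
    pub fn record_call(&mut self, now: Instant) {
        self.evict(now);
        self.base_calls.push_back(now);
    }

    /// Returns true and records the retry if it fits the budget;
    /// returns false, recording nothing, if it would exceed it.
    pub fn try_retry(&mut self, now: Instant) -> bool {
        self.evict(now);
        let allowed = self.base_calls.len() as f64 * self.max_ratio;
        if (self.retries.len() + 1) as f64 <= allowed {
            self.retries.push_back(now);
            true
        } else {
            false
        }
    }
}
```

Because the budget is a ratio of base calls inside the window, it shrinks automatically when traffic drops, and an idle client has no stored-up allowance to spend in a burst.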
5. Wiring the chaos module to tests
The chaos module's API should be programmatic: chaos.set_latency(Duration::from_millis(200), 0.05) to inject 200ms latency on 5% of calls. The chaos handle is held by the test and dropped at end, disabling the injection. This makes tests self-contained — no global state, no order-dependent interactions between tests. Each test configures its own chaos and tears it down.
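One way to get the "dropped at end, disabling the injection" behavior is an RAII guard. The sketch below is an assumption about the shape, not the required design: here `set_latency` returns a `ChaosGuard` whose `Drop` clears the configuration, and `Chaos`, `ChaosGuard`, and `current` are hypothetical names. Only latency is shown; error and drop injection would follow the same pattern.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

#[derive(Default)]
struct ChaosState {
    latency_ms: AtomicU64,
    // Probability stored as parts-per-million so it fits in an atomic integer.
    probability_ppm: AtomicU64,
}

/// Cheap-to-clone handle shared between the test and the code under test.
#[derive(Clone, Default)]
pub struct Chaos {
    state: Arc<ChaosState>,
}

/// Clears the injected latency when dropped.
pub struct ChaosGuard {
    state: Arc<ChaosState>,
}

impl Chaos {
    pub fn new() -> Self {
        Self::default()
    }

    /// Enable latency injection until the returned guard is dropped.
    #[must_use = "injection stops as soon as the guard is dropped"]
    pub fn set_latency(&self, latency: Duration, probability: f64) -> ChaosGuard {
        let ppm = (probability.clamp(0.0, 1.0) * 1_000_000.0) as u64;
        self.state.latency_ms.store(latency.as_millis() as u64, Ordering::SeqCst);
        self.state.probability_ppm.store(ppm, Ordering::SeqCst);
        ChaosGuard { state: Arc::clone(&self.state) }
    }

    /// The active (latency, probability) pair, or None when disabled.
    pub fn current(&self) -> Option<(Duration, f64)> {
        let ms = self.state.latency_ms.load(Ordering::SeqCst);
        let ppm = self.state.probability_ppm.load(Ordering::SeqCst);
        if ms == 0 || ppm == 0 {
            None
        } else {
            Some((Duration::from_millis(ms), ppm as f64 / 1_000_000.0))
        }
    }
}

impl Drop for ChaosGuard {
    fn drop(&mut self) {
        self.state.latency_ms.store(0, Ordering::SeqCst);
        self.state.probability_ppm.store(0, Ordering::SeqCst);
    }
}
```

Each test constructs its own `Chaos`, so there is no global state to leak between tests, and a test that panics mid-way still disables its injection during unwinding.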
6. Integration vs unit tests
Each module should have unit tests for its individual behavior. The integration tests verify composition: detector + breaker + bulkhead + retry + failover working together. The composition tests are where the subtle bugs hide — for example, a breaker that trips on bulkhead rejection (incorrect: bulkhead rejection isn't a downstream failure, it's a local resource decision) or a retry that doesn't decrement the budget on success (incorrect: the budget should count attempts, not just failures).
Source Anchors
- Michael Nygard, Release It! (2nd ed, Pragmatic Bookshelf, 2018) — the canonical reference for circuit breaker, bulkhead, timeout, and retry patterns
- Hayashibara et al., "The φ Accrual Failure Detector" (SRDS 2004) — the phi accrual paper
- Casey Rosenthal & Nora Jones, Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020) — for the chaos injection module's design
- DDIA 2nd Edition, Chapter 9 — failure detection and "Degraded performance and partial functionality"
Module 05 — Distributed Coordination
"Two registry instances each believed they held the daily-merge lock. The bug was not in the lock service. The bug was in treating a distributed lock like a single-machine mutex."
Mission Context
Modules 1 through 4 covered the foundations: the failure model, replication, consensus, fault tolerance. This module covers the coordination primitives built on those foundations: distributed locks and leases (Lesson 1), service discovery and load balancing (Lesson 2), and gossip protocols (Lesson 3). These primitives are what real services use to call each other, share state, and avoid stepping on each other's resources. They are also where the previous modules' theoretical hazards (clock unreliability, partial failure, replication lag) become operational realities.
The three lessons connect: locks use consensus from Module 3 plus fencing tokens (themselves a Module 4 idea); discovery uses failure detection from Module 4 combined with the load-balancing algorithms introduced here; gossip is the scale-up alternative to centralized registries, with its own consistency model that requires the vector-clock and CRDT thinking from Modules 1 and 2.
The opening incidents — the April lock-without-fencing corruption (Lesson 1), the static-config-file decay (Lesson 2), the central-registry CPU saturation (Lesson 3) — are the standard operational pathologies of growing distributed systems. Knowing the patterns is what lets you choose the right primitive at design time rather than discover the failure at incident time.
Lessons
| # | Title | Source |
|---|---|---|
| 1 | Distributed Locks and Leases | DDIA Ch. 10 + Kleppmann's fencing tokens post |
| 2 | Service Discovery and Load Balancing | Synthesis + Mitzenmacher 2001 + Karger et al. 1997 |
| 3 | Gossip Protocols | Demers 1987 + Das/Gupta/Motivala 2002 (SWIM) |
Project
Telemetry Gossip — implement a SWIM-based gossip layer that replaces a central health registry for a 250-node edge fleet. The project covers the membership-state structures, the SWIM indirect-probe failure detection, the push-pull gossip exchange, and a deterministic test harness that verifies convergence properties and partition behavior at scale.
Position
Module 5 of 6 in the Distributed Systems track.
What You Should Be Able to Do After This Module
- Use a distributed lock service correctly: leases with explicit renewal, fencing tokens validated at the protected resource, monotonic-time renewal loops, halt-on-lease-loss semantics.
- Recognize the holder-pauses failure mode and other ways that distributed locks differ from single-machine mutexes.
- Choose between client-side and server-side service discovery for a given workload and articulate the operational tradeoffs.
- Implement and tune load balancing algorithms: round-robin, P2C, consistent hashing with virtual nodes, weighted variants.
- Choose between centralized registry and gossip-based membership based on cluster size and consistency requirements.
- Reason about gossip propagation time, fanout, and bandwidth as a function of cluster size, and tune the protocol's parameters against the operational regime.
Source Materials
- DDIA 2nd Edition (Kleppmann & Riccomini, 2026), Chapter 10 — "Membership and Coordination Services." The primary direct source for Lesson 1.
- Martin Kleppmann, "How to do distributed locking" (kleppmann.com, 2016) — the canonical public reference for fencing tokens.
- Hunt et al., "ZooKeeper: Wait-free coordination for Internet-scale systems" (USENIX ATC 2010) — ZooKeeper's foundational paper.
- Karger et al., "Consistent Hashing and Random Trees" (STOC 1997) — the consistent-hashing original.
- Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" (IEEE TPDS 2001) — the P2C analysis.
- Demers et al., "Epidemic Algorithms for Replicated Database Maintenance" (PODC 1987) — the gossip-protocol foundational paper.
- Das, Gupta, Motivala, "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (DSN 2002) — the SWIM paper.
- The Envoy, NGINX, and HashiCorp Consul/Serf documentation for production-grade reference implementations of the patterns covered.
Track-level synthesis note: Foundations of Scalable Systems — the source book originally planned for parts of Lessons 2 and 3 — was not available during authoring. Synthesized content is flagged in each lesson's source note. Specific operational parameters (lease durations, gossip periods, P2C fanout) are illustrative; production deployments should calibrate against the actual workload.
Lesson 1: Distributed Locks and Leases
Context
The Constellation Network has a recurring class of operations that require mutual exclusion across the cluster: exactly one ground station at a time should be commanding a given satellite during a pass; exactly one orbital-element-registry node at a time should be processing the daily TLE merge; exactly one scheduler instance at a time should be assigning pass windows. These are textbook mutual-exclusion problems. On a single machine, you would use a mutex. Across a cluster, you cannot — and the reasons why have already been the source of one expensive Meridian incident.
In April 2024, two registry instances each believed they held the daily-merge lock simultaneously. The first instance had requested the lock and proceeded with the merge. Partway through, its host paused for 11 seconds (a stop-the-world garbage collection in a Java sidecar; the merge process was Rust but shared the host). During the pause, the lock service decided the first instance had failed and granted the lock to the second instance. The second instance started merging. When the first instance's host resumed, it continued its merge — completely unaware that its lock had been revoked and reassigned. Two simultaneous merges produced a corrupt TLE database that took six hours to reconstruct.
The bug is not in the lock service. The bug is in treating the lock service like a single-machine mutex. A distributed lock is fundamentally probabilistic in a way that a single-machine mutex is not: holding the lock at time T does not mean the holder is the only entity that will believe it holds the lock at T+1. This lesson covers the mechanisms — leases, fencing tokens, and consensus-backed coordination services — that make distributed mutual exclusion safe. By the end, you should be able to use a distributed lock service correctly, identify the failure modes of misuse, and understand why the right answer to "we just need a distributed mutex" is rarely "deploy ZooKeeper."
Core Concepts
Why a Distributed Mutex Cannot Look Like a Single-Machine Mutex
The single-machine mutex contract is simple: at any given instant, at most one thread holds the lock. The kernel enforces this by atomically updating the lock's owner field. The thread that holds the lock can rely on its uniqueness without any additional checks.
A distributed lock cannot offer this guarantee. The lock service is a process on a different machine; the holder is connected to it by an unreliable network; the holder's view of "do I still hold the lock" is at best the moment it last heard from the service. Two specific failure modes break the single-machine intuition:
The holder pauses. The holder is alive but unresponsive (GC pause, host CPU stall, swap thrash). The lock service times out the holder and reassigns the lock. The first holder resumes and continues executing as if it still holds the lock. Two holders, no exclusion.
The network partitions. The holder is on one side of a partition; the lock service is on the other. The holder still believes it holds the lock; the service may have reassigned it. Again, two holders.
The single-machine kernel could not produce either failure mode. The distributed lock service inherently can. The defense is structural (fencing tokens), not procedural (a longer timeout).
Leases: Time-Bounded Locks
A lease is a lock that expires automatically after a duration unless renewed. The holder acquires a lease with some TTL; before the TTL elapses, the holder renews the lease (extending its expiry); if the holder fails to renew (because it has crashed or is partitioned), the lease expires and the lock service can grant it to someone else.
The lease pattern is the standard distributed lock primitive. ZooKeeper's ephemeral znodes implement leases (the znode expires when the session times out); etcd's leases are explicit; Consul's session-tied keys are leases. The mechanism shifts the cost of detection from "the lock service detecting a holder is dead" to "the holder proving it is still alive."
The lease duration is the operational knob. Too short: holders spend most of their time renewing rather than doing work, and a brief network blip can spuriously expire the lease. Too long: a failed holder's lock remains unavailable for up to the full duration after the failure. The standard tradeoff is to choose a lease duration that is a multiple of the renewal interval (e.g., 30-second lease, 10-second renewal), so that a lost renewal does not expire the lease: after a successful renewal, the attempts 10 and 20 seconds later both land before the 30-second expiry, and only one of them has to succeed.
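The arithmetic behind that tradeoff can be sketched as a small helper. The function name and the model are assumptions made for illustration: renewal attempts fire at a fixed interval counted from the last successful renewal, network latency is ignored, and an attempt that lands exactly at the expiry instant is counted as too late.

```rust
use std::time::Duration;

/// Number of consecutive renewal attempts that can fail without the lease
/// expiring. Hypothetical helper: attempts fire every `interval` after the
/// last successful renewal and must land strictly before `ttl` elapses.
pub fn tolerated_missed_renewals(ttl: Duration, interval: Duration) -> u32 {
    if interval.is_zero() || interval >= ttl {
        // No attempt lands before expiry, so nothing can be lost safely.
        return 0;
    }
    // Attempts land at interval, 2 * interval, ... strictly before ttl.
    let attempts_before_expiry = (ttl.as_nanos() - 1) / interval.as_nanos();
    // At least one attempt must succeed, so all but one may be lost.
    (attempts_before_expiry - 1) as u32
}
```

Under this strict model, a 30-second lease with a 10-second renewal interval tolerates one lost renewal, and shortening the interval to 5 seconds tolerates four. Real deployments should leave extra headroom for request latency.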
What leases do not fix: the holder-pauses-during-its-work problem. A holder that pauses for longer than the lease TTL has implicitly lost the lease but has no immediate way to know. When the holder resumes, the lease has expired, possibly been granted to another holder, and the original holder may continue its work in violation of the mutual-exclusion contract. Leases alone are insufficient. The fix is fencing.
Fencing Tokens: The Safety-Critical Pattern
The fencing token pattern, popularized by Martin Kleppmann's "How to do distributed locking" (2016), is the mechanism that makes distributed locks safe in the presence of holder pauses and partitions.
The mechanism: when the lock service grants a lease, it issues a monotonically increasing fencing token — a number larger than any previously-issued token. The holder includes this token in every request to the protected resource. The resource (e.g., the registry's storage layer) tracks the highest token it has seen and rejects any request with a lower token.
With fencing tokens, the April incident would have played out differently. The first registry instance would have acquired the lock with token 42 and started merging. During its pause, the lock service would have reassigned the lock to the second instance with token 43, and the second instance's writes with token 43 would have been accepted, raising the highest token the storage layer had seen to 43. When the first instance resumed and submitted a merge write with token 42, the storage layer would have observed 42 < 43 and rejected the write. The corruption would not have occurred.
Fencing tokens require cooperation from the protected resource. The lock service alone cannot enforce fencing — it can issue tokens, but the resource must check them. This is why distributed locks alone are not enough; the entire write path from holder to resource must propagate and validate the token. Systems that retrofit distributed locks onto storage that doesn't understand fencing are vulnerable to exactly the April incident.
Consensus-Backed Coordination Services
The lock service itself is a distributed system, and getting it wrong is how you produce more outages than you prevent. Production systems use consensus-backed coordination services: ZooKeeper, etcd, and Consul.
ZooKeeper (Apache, 2008) uses the ZAB protocol (similar in spirit to Raft) to maintain a replicated, hierarchical namespace of small data nodes (znodes) that resembles a filesystem. Locks are implemented with ephemeral znodes: the first client to create the ephemeral node for a lock path holds the lock (the standard recipe uses ephemeral sequential children, and the client whose child has the lowest sequence number is the holder), and the znode automatically disappears when the holder's session ends. Watches let waiting clients be notified when the lock-holding znode disappears. ZooKeeper's API is one of the most influential designs in the field; many other systems were built on top of it.
etcd (CoreOS, 2013) uses Raft and exposes a key-value store with linearizable operations, leases, and watches. Its API is simpler than ZooKeeper's — flat keys instead of hierarchical paths, explicit leases instead of ephemeral nodes — and the deployment is single-binary. etcd is the coordination service for Kubernetes and many other cloud-native systems.
Consul (HashiCorp, 2014) provides similar primitives plus service discovery, health checking, and a gossip-based membership layer. It is more opinionated about what coordination is for; its API is shaped around the specific patterns (service registration, health checks, KV configuration) rather than primitives that you compose.
The three services have different operational characteristics — ZooKeeper is heavier and battle-tested; etcd is simpler and fast-iterating; Consul is integrated and opinionated — but the core capability is the same: a consensus-backed store of small data items with strong consistency and watch semantics. The catalog uses etcd for membership and coordination, on the basis that the team values simplicity and the operational story is well-understood.
What Coordination Services Are For (and What They Are Not For)
Coordination services are designed for small, slowly-changing, high-value state: cluster membership, leader election, configuration, lock state, schema versions. The data is tiny (kilobytes, not gigabytes), the write rate is low (hundreds per second across the cluster, not millions), and the consistency guarantees are strong.
They are not designed for:
- High-throughput data storage. etcd will not serve as your TLE database. The consensus overhead per write is too high for application data.
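- Large value storage. A coordination service is not a blob store. Per-value size limits are small, on the order of a megabyte or less by default (etcd's default request limit is 1.5 MiB; Consul caps KV values at 512 KB), and values in practice should be far smaller.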
- Pub/sub at scale. While watches exist, they are not designed to drive millions of subscribers. Use Kafka or NATS for that workload.
Treating a coordination service as a general-purpose database is a recurring failure mode. The service collapses under load; the team blames the service rather than recognizing the workload mismatch. The right framing is: coordination services are the consensus layer's user-facing API. Use them for the consensus-shaped problems and use other tools for everything else.
Lease Renewal and Clock Discipline
Lease renewal has a subtle dependency on time. The lock service measures the lease's age from when it was granted; the holder measures from when it should renew. The two clocks must agree closely enough that the holder renews before the service times out.
This is the same clock-unreliability concern from Module 1. If the holder's clock runs slow, it may delay renewals past the service's timeout. If the service's clock runs fast, it may time out renewals that arrived in real time but appeared late to the service. The defenses:
- Use monotonic time for renewal intervals. Instant::elapsed on the holder; a monotonic clock on the service (in Go, the monotonic reading that time.Now() records and time.Since uses). Don't use wall-clock time for interval calculations.
- Renew well before the lease expires. A 10-second renewal interval with a 30-second lease tolerates one missed renewal outright, and a second only if the next attempt still arrives before expiry. If the renewals were "just before expiry," any clock skew would expire the lease.
- Treat lease expiry as a holder failure mode. A holder whose lease has expired must stop using its lock — issuing more writes will be rejected by fencing (in a system that uses fencing) and will produce divergence (in a system that doesn't).
The discipline is: a lease is not just a lock with a timeout; it is a lock with explicit renewal and explicit handling of lease loss. Holders should be designed to detect "my lease may have expired" and respond by stopping work and re-acquiring (or surfacing the failure to the operator), rather than continuing as if nothing has changed.
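One way a holder can detect "my lease may have expired" is to keep its own conservative estimate of the expiry. The sketch below is an illustration under stated assumptions, not Meridian's implementation: the LocalLease type, its margin parameter, and guard_step are hypothetical names. The timestamp is taken before the acquire or renew request is sent, so the holder's estimate of expiry is never later than the service's.

```rust
use std::time::{Duration, Instant};

/// Holder-side view of a lease. All names here are hypothetical.
pub struct LocalLease {
    /// Taken before the acquire/renew request was sent, so the holder's
    /// estimate of expiry is never later than the service's.
    requested_at: Instant,
    ttl: Duration,
    /// Safety margin subtracted from the TTL to absorb clock-rate drift
    /// and the time a write spends in flight.
    margin: Duration,
}

impl LocalLease {
    pub fn new(requested_at: Instant, ttl: Duration, margin: Duration) -> Self {
        Self { requested_at, ttl, margin }
    }

    /// True only while the holder can be reasonably sure the lease is live.
    pub fn probably_valid_at(&self, now: Instant) -> bool {
        let usable = self.ttl.saturating_sub(self.margin);
        now.saturating_duration_since(self.requested_at) < usable
    }

    /// Record a successful renewal whose request was sent at `requested_at`.
    pub fn renewed(&mut self, requested_at: Instant) {
        self.requested_at = requested_at;
    }
}

/// Check before each unit of work; halt rather than continue on a stale lease.
pub fn guard_step(lease: &LocalLease) -> Result<(), &'static str> {
    if lease.probably_valid_at(Instant::now()) {
        Ok(())
    } else {
        Err("lease may have expired; stop work and re-acquire")
    }
}
```

This check is advisory only. A pause can still occur between the check and the write, so it reduces doomed work but does not replace fencing at the protected resource.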
Code Examples
A Lease-Based Lock with Fencing Token
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct FencingToken(u64);

#[derive(Clone, Debug)]
pub struct Lease {
    pub holder: String,
    pub token: FencingToken,
    pub expires_at: Instant,
}

pub struct LockService {
    state: Mutex<LockServiceState>,
}

struct LockServiceState {
    current_lease: Option<Lease>,
    next_token: u64,
}

impl LockService {
    pub fn new() -> Self {
        Self {
            state: Mutex::new(LockServiceState {
                current_lease: None,
                next_token: 1,
            }),
        }
    }

    pub fn acquire(&self, holder: &str, ttl: Duration) -> Option<Lease> {
        let mut s = self.state.lock().unwrap();
        // If a lease is active and not expired, deny.
        if let Some(ref existing) = s.current_lease {
            if existing.expires_at > Instant::now() {
                return None;
            }
        }
        // Grant a new lease with the next token.
        let token = FencingToken(s.next_token);
        s.next_token += 1;
        let lease = Lease {
            holder: holder.to_string(),
            token,
            expires_at: Instant::now() + ttl,
        };
        s.current_lease = Some(lease.clone());
        Some(lease)
    }

    pub fn renew(&self, token: FencingToken, ttl: Duration) -> bool {
        let mut s = self.state.lock().unwrap();
        match s.current_lease.as_mut() {
            // Only the current, unexpired lease can be renewed.
            Some(lease) if lease.token == token && lease.expires_at > Instant::now() => {
                lease.expires_at = Instant::now() + ttl;
                true
            }
            _ => false,
        }
    }
}

// The protected resource validates the fencing token on every write.
pub struct ProtectedStore {
    highest_token_seen: Mutex<FencingToken>,
}

impl ProtectedStore {
    pub fn new() -> Self {
        Self {
            highest_token_seen: Mutex::new(FencingToken(0)),
        }
    }

    pub fn write(&self, token: FencingToken, _data: &[u8]) -> Result<(), &'static str> {
        let mut highest = self.highest_token_seen.lock().unwrap();
        if token < *highest {
            // A delayed write from a deposed holder. Reject it.
            return Err("fencing: stale token rejected");
        }
        *highest = token;
        // ... apply the write ...
        Ok(())
    }
}

fn main() {
    let svc = LockService::new();
    let store = ProtectedStore::new();

    // A short TTL so the example can show expiry in real time.
    let ttl = Duration::from_millis(50);

    let lease_a = svc.acquire("merger-a", ttl).unwrap();
    // merger-a does some work...
    store.write(lease_a.token, b"merge-step-1").unwrap();

    // Simulate merger-a pausing past lease expiry, then merger-b acquiring.
    thread::sleep(Duration::from_millis(100));
    let lease_b = svc.acquire("merger-b", ttl).unwrap();
    store.write(lease_b.token, b"merge-step-1").unwrap();

    // After merger-a resumes, its write is rejected by fencing.
    let stale = store.write(lease_a.token, b"merge-step-2");
    println!(
        "lease_a token: {:?}, lease_b token: {:?}, stale write: {:?}",
        lease_a.token, lease_b.token, stale
    );
}
The structural point: the lock service grants the token, but the store enforces it. Both pieces are required. Skipping either — using a lock service without fencing, or using fencing without a real consensus-backed token issuer — leaves the holes that produce the April incident.
Lease Renewal Loop
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Minimal stand-ins so the example is self-contained; in practice these are
// the types from the previous example.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct FencingToken(u64);

pub struct Lease {
    pub token: FencingToken,
}

pub struct LockService;

impl LockService {
    pub fn renew(&self, _t: FencingToken, _d: Duration) -> bool {
        true
    }
}

pub async fn renewal_loop(
    svc: Arc<LockService>,
    initial_lease: Lease,
    renewal_interval: Duration,
    ttl: Duration,
    keep_running: Arc<AtomicBool>,
) -> Result<(), &'static str> {
    let current_token = initial_lease.token;
    while keep_running.load(Ordering::SeqCst) {
        // Sleep, using monotonic time. tokio::time::sleep is monotonic.
        tokio::time::sleep(renewal_interval).await;
        // Attempt renewal. If the lock service rejects (lease already expired),
        // we cannot continue safely: someone else may now hold the lease.
        if !svc.renew(current_token, ttl) {
            // Lease lost. Stop work immediately - subsequent writes would be
            // fenced anyway, but better to halt than to perform doomed work.
            return Err("lease lost; cannot continue holding the lock");
        }
    }
    Ok(())
}
The renewal loop is small and important. It uses monotonic time (via tokio::time::sleep), it stops cleanly on shutdown, and it surfaces lease loss to the caller as an error — not as a silent state change. The caller (the merger, the scheduler, whoever) is responsible for halting work on lease loss.
A Common Anti-Pattern
For contrast, the anti-pattern:
// ANTI-PATTERN: distributed lock without fencing.
use anyhow::Result;
use std::time::Duration;

async fn do_critical_section(svc: &LockService) -> Result<()> {
    let lease = svc.acquire("worker", Duration::from_secs(30)).await?;
    // ... do work ...
    // PROBLEM: if this code path takes longer than 30 seconds (due to GC, swap,
    // cosmic ray, whatever), the lease expires. The lock service grants the
    // lock to someone else. We continue executing here, oblivious. The
    // critical section is no longer exclusive.
    perform_writes_without_token(); // <-- bug
    svc.release(lease).await?;
    Ok(())
}

// Stubs so the snippet compiles on its own.
struct LockService;
struct Lease;

impl LockService {
    async fn acquire(&self, _w: &str, _d: Duration) -> Result<Lease> {
        Ok(Lease)
    }
    async fn release(&self, _l: Lease) -> Result<()> {
        Ok(())
    }
}

fn perform_writes_without_token() {}
The fix is fencing: pass lease.token to perform_writes, and have the storage layer validate it on every write. The distributed lock alone is not sufficient.
Key Takeaways
- A distributed lock is not a single-machine mutex. The holder can pause, the network can partition, and the lock service can revoke the lease — all without the holder's immediate knowledge. Code that treats a distributed lock as if it were a single-machine mutex will be wrong under those conditions.
- Leases (time-bounded locks with explicit renewal) are the standard distributed lock primitive. They shift detection from the service to the holder: the holder must prove liveness via renewal. The lease duration is tuned against the renewal interval to tolerate transient renewal failures.
- Fencing tokens make distributed locks safe in the presence of holder pauses. The lock service issues monotonically increasing tokens; the protected resource validates each write's token against the highest seen. Without fencing, leases are insufficient.
- Coordination services (ZooKeeper, etcd, Consul) are the consensus-backed implementations of locks, leases, and related primitives. They are designed for small, slowly-changing, high-value state — not for general-purpose data storage.
- Lease renewal depends on time. Use monotonic clocks for renewal intervals; renew well before lease expiry to tolerate clock skew; treat lease loss as a halt-and-recover event, not a silent state change.
Source note: This lesson is grounded in DDIA 2nd Edition (Kleppmann & Riccomini), Chapter 10, "Membership and Coordination Services" — and in Martin Kleppmann's blog post "How to do distributed locking" (kleppmann.com, 2016), which is the canonical public reference for fencing tokens. ZooKeeper details are from the ZooKeeper documentation and Hunt et al., "ZooKeeper: Wait-free coordination for Internet-scale systems" (USENIX ATC 2010). The April incident is illustrative and not based on a real Meridian event. Specific operational parameters (30-second lease, 10-second renewal) are common defaults but should be calibrated per workload.
Lesson 2: Service Discovery and Load Balancing
Context
The Constellation Network has grown from twelve services to forty-three in the last eighteen months. Each ground station runs an instance of the telemetry-ingest service; each region runs replicas of the orbital-element-registry; each pass-window scheduler talks to a dozen downstream services to validate decisions. The original deployment configuration — a static config file listing every service's IP and port — was last accurate in March of last year. Today, it is updated by hand on Tuesdays during the maintenance window, and roughly one in five incidents traces back to a service being addressable from a config file that no longer matches reality.
Static configuration does not work at this scale. The system needs service discovery — a mechanism by which a service that wants to call orbital-registry gets a current list of healthy orbital-registry instances without anyone editing a config file. The system also needs load balancing — once it has a list of instances, the caller needs to pick one that is alive, not overloaded, and ideally close to the caller in network terms.
The two problems are entangled. A service discovery system that returns dead instances forces clients to do their own health checking. A load balancer that can't see the current instance list can't distribute load. The mature production answer integrates the two: the discovery layer exposes only live instances, and the load balancer uses real-time health and load signals to route. This lesson covers the discovery patterns (client-side vs server-side, registries vs DNS), the load-balancing algorithms (round-robin, least-loaded, consistent hashing, power-of-two-choices), and the operational discipline that connects them to the failure detector from Module 4.
Core Concepts
Service Discovery: Client-Side vs Server-Side
Two architectural patterns dominate service discovery, and each has different operational implications.
Client-side discovery. The caller queries a registry directly for instances of the target service, picks one, and connects. Examples: Netflix Eureka, HashiCorp Consul (in client-side mode), the etcd-based discovery used by many Kubernetes operators. The caller is responsible for caching the registry's response, refreshing it periodically, handling registry failures, and implementing the load balancing.
The advantages: no extra network hop; the caller has full visibility into available instances and can make sophisticated routing decisions; the load-balancing logic runs on every caller, distributing CPU cost. The disadvantages: every caller must implement the same discovery/balancing logic, leading to drift between languages and frameworks; clients can hold stale lists; the registry becomes a critical dependency for every caller.
Server-side discovery. The caller connects to a load balancer (HAProxy, NGINX, Envoy, AWS ELB, Kubernetes Service), which itself queries the registry and forwards to a backend. The caller doesn't need to know anything beyond "the load balancer's address." Examples: every Kubernetes Service, AWS Application Load Balancer, GCP Cloud Load Balancer.
The advantages: callers are simple; one well-tested load-balancing implementation serves all clients; routing decisions can be sophisticated without duplicating logic. The disadvantages: extra network hop (the load balancer is in the request path); the load balancer is a critical dependency that must be highly available; some routing decisions (e.g., affinity based on caller identity) are harder to implement.
The constellation uses both. The high-volume internal RPC paths (telemetry ingest → registry → catalog) use client-side discovery with a shared library that handles caching, health, and balancing. The external API gateway and the inter-region links use server-side discovery via an Envoy mesh, on the basis that the operational simplicity is worth the extra hop.
Registries: Push vs Pull, Static vs Dynamic
The registry is the source of truth for "which instances exist." How it learns about instances and how clients learn from it shapes the system's behavior under failure.
Pull-based registries. Services register themselves on startup (HTTP POST to the registry) and refresh their registration periodically. The registry expires registrations that haven't refreshed within a TTL. Clients pull the current list on demand. Eureka and Consul work this way.
Push-based registries. The registry watches the underlying infrastructure (Kubernetes API server, Nomad job state, cloud instance metadata) and is automatically updated by the orchestrator. Clients receive push notifications via watches or streams when the list changes. The Kubernetes Endpoints model and most service mesh control planes work this way.
DNS-based discovery. A degenerate registry where service names map to A/AAAA records (or SRV records for ports). Universal compatibility — every language has a DNS client — at the cost of weak semantics: DNS caching means clients can have stale records for the TTL duration; DNS has no health awareness on its own. Many systems use DNS for discovery with very short TTLs (5–30 seconds), accepting the staleness window as the tradeoff for universal compatibility.
The push-based pattern is operationally cleaner when an orchestrator is the source of truth: the orchestrator decides which instances exist, and the registry reflects that. Pull-based is the older pattern but still widely used; it tolerates an absent orchestrator at the cost of more registration logic in each service.
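The pull-based pattern, with self-registration and TTL expiry, can be sketched in memory as follows. This is a minimal illustration, not a real registry: TtlRegistry and its method names are hypothetical, time is passed in explicitly so the behavior is easy to follow, and a production registry would be replicated and reached over the network.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal in-memory registry with TTL-based expiry (hypothetical names).
pub struct TtlRegistry {
    ttl: Duration,
    // service name -> (instance id -> (endpoint, last refresh time))
    entries: HashMap<String, HashMap<String, (String, Instant)>>,
}

impl TtlRegistry {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Register or refresh an instance. Services call this on startup and
    /// then periodically; a refresh is just a re-registration.
    pub fn register(&mut self, service: &str, id: &str, endpoint: &str, now: Instant) {
        self.entries
            .entry(service.to_string())
            .or_default()
            .insert(id.to_string(), (endpoint.to_string(), now));
    }

    /// Endpoints whose registration was refreshed within the TTL.
    /// Expired entries are dropped as a side effect.
    pub fn live_endpoints(&mut self, service: &str, now: Instant) -> Vec<String> {
        let ttl = self.ttl;
        let instances = match self.entries.get_mut(service) {
            Some(instances) => instances,
            None => return Vec::new(),
        };
        instances.retain(|_, entry| now.saturating_duration_since(entry.1) <= ttl);
        let mut live: Vec<String> = instances.values().map(|e| e.0.clone()).collect();
        live.sort(); // deterministic order for callers
        live
    }
}
```

An instance that stops refreshing simply ages out of the list after the TTL, which is what lets this pattern work without an orchestrator, at the cost of a staleness window equal to the TTL.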
Health Awareness in Discovery
A service that has registered itself isn't necessarily healthy. A service that has registered itself and not yet been removed from the registry isn't necessarily still running. The registry needs to actively distinguish healthy instances from registered ones.
Three mechanisms appear:
TTL expiry. Registrations expire if not refreshed. The crashed service stops refreshing; after the TTL, the registry removes it. This catches hard crashes but is slow (the TTL is typically tens of seconds, during which clients still see the dead instance).
Active health checks from the registry. The registry probes each registered instance periodically — HTTP /health, TCP connect, gRPC health check. Failed probes remove the instance from the active list immediately. Faster than TTL expiry; costs more registry CPU.
Passive health checks at the client. The load balancer or client library tracks per-instance failure rates and excludes unhealthy instances from rotation. This catches failures the active probes miss (a service that returns 200 to /health but 500 to actual requests). Envoy's "outlier detection" is the canonical implementation.
The constellation uses TTL expiry as a baseline (every registry entry has a 30-second TTL with 10-second refresh) plus passive health detection at the client (10% failure rate over 30 seconds excludes the instance for 60 seconds). The combination provides both correctness (TTL eventually removes truly dead instances) and responsiveness (passive checks react in seconds to actual failures).
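The passive half of that policy (failure rate over a window, then a timed exclusion) can be sketched per instance as below. The PassiveHealth type is hypothetical, and the min_samples guard is an added assumption so that a single failed request on a nearly idle instance does not eject it. Time is passed in explicitly.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Client-side passive health tracker for one instance (hypothetical sketch).
pub struct PassiveHealth {
    window: Duration,
    failure_ratio: f64,
    ejection: Duration,
    min_samples: usize,
    outcomes: VecDeque<(Instant, bool)>, // (when, succeeded)
    ejected_until: Option<Instant>,
}

impl PassiveHealth {
    pub fn new(window: Duration, failure_ratio: f64, ejection: Duration, min_samples: usize) -> Self {
        Self {
            window,
            failure_ratio,
            ejection,
            min_samples,
            outcomes: VecDeque::new(),
            ejected_until: None,
        }
    }

    /// Record the outcome of one request observed at `now`.
    pub fn record(&mut self, now: Instant, succeeded: bool) {
        self.outcomes.push_back((now, succeeded));
        // Drop observations that have aged out of the window.
        while let Some(&(t, _)) = self.outcomes.front() {
            if now.saturating_duration_since(t) > self.window {
                self.outcomes.pop_front();
            } else {
                break;
            }
        }
        let total = self.outcomes.len();
        let failures = self.outcomes.iter().filter(|o| !o.1).count();
        if total >= self.min_samples && failures as f64 / total as f64 >= self.failure_ratio {
            self.ejected_until = Some(now + self.ejection);
            self.outcomes.clear(); // start fresh once the ejection ends
        }
    }

    /// Whether the instance may be included in rotation at `now`.
    pub fn is_eligible(&self, now: Instant) -> bool {
        match self.ejected_until {
            Some(until) => now >= until,
            None => true,
        }
    }
}
```

With a 30-second window, a 0.10 ratio, and a 60-second ejection, this mirrors the numbers quoted above. The load balancer would filter its candidate list through is_eligible before picking.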
Load Balancing Algorithms
Once you have a list of healthy instances, you need to pick one. The algorithms span a wide range of complexity:
Round-robin. Cycle through instances in order. Simple, stateless, fair if instances are equivalent in capacity. Fails when instances are not equivalent — a slow instance still gets its fair share of requests and accumulates a backlog.
Random. Pick uniformly at random. Same statistical properties as round-robin in the limit; doesn't require any coordination between callers, which matters when callers are independent processes that can't share state.
Least-connections / least-loaded. Pick the instance with the fewest in-flight requests. Routes around slow or overloaded instances naturally. Requires tracking per-instance state, which is fine for a single load balancer but harder for distributed callers (each caller sees only its own connections, not the global load).
Power-of-two-choices. Pick two instances at random, then pick the less-loaded of the two. Surprisingly effective: produces near-optimal load distribution with vastly less state than least-loaded. The standard reference is Mitzenmacher's 2001 paper. Used by NGINX (the random two least_conn upstream method), Envoy, and Akka.
Consistent hashing. Map each request's key to an instance via a hash ring. The same key always routes to the same instance (until the instance set changes). Essential for caching tiers and any system where instance affinity matters. The Karger et al. 1997 paper is the canonical reference; modern implementations use jump consistent hashing (Lamping & Veach 2014) or rendezvous hashing for simpler implementation.
Weighted variants. Each algorithm can be weighted by instance capacity, geographic distance, or recent latency. Production load balancers typically support weighting; the operational value is matching real instance characteristics.
The constellation uses power-of-two-choices for general RPC, consistent hashing for the catalog's read tier (so the same TLE is always served by the same replica, maximizing cache hit rate), and weighted round-robin for the cross-region paths (weighted by inter-region link bandwidth).
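For the weighted round-robin case, one well-known variant is smooth weighted round-robin, which interleaves picks rather than sending a burst to the heaviest backend. The sketch below is illustrative: the SmoothWrr type, the backend names, and the weights are hypothetical, and nothing here implies this is the variant the cross-region paths actually run.

```rust
/// Smooth weighted round-robin over a fixed set of backends.
/// Hypothetical sketch; backend names and weights are illustrative.
pub struct SmoothWrr {
    names: Vec<String>,
    weights: Vec<i64>,
    current: Vec<i64>,
}

impl SmoothWrr {
    pub fn new(backends: &[(&str, u32)]) -> Self {
        Self {
            names: backends.iter().map(|b| b.0.to_string()).collect(),
            weights: backends.iter().map(|b| i64::from(b.1)).collect(),
            current: vec![0; backends.len()],
        }
    }

    /// Pick the next backend: raise every backend's running score by its
    /// weight, choose the highest score (first wins on ties), then lower the
    /// winner's score by the total weight.
    pub fn pick(&mut self) -> Option<&str> {
        let total: i64 = self.weights.iter().sum();
        if total == 0 {
            return None; // no backends, or all weights are zero
        }
        let mut best = 0;
        for i in 0..self.current.len() {
            self.current[i] += self.weights[i];
            if self.current[i] > self.current[best] {
                best = i;
            }
        }
        self.current[best] -= total;
        Some(self.names[best].as_str())
    }
}
```

With weights 5, 1, 1 for backends a, b, c, seven consecutive picks yield a, a, b, a, c, a, a: the heavy backend gets five of seven requests, but the light ones are spread through the cycle instead of starved until the end.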
Latency-Aware Routing
When instances span different network regions, a load balancer that picks uniformly produces a mixture of fast and slow routes. Latency-aware routing measures per-instance response time and prefers faster instances.
The implementations:
EWMA-based. Each caller maintains an exponentially weighted moving average of per-instance latency. Routing prefers low-EWMA instances. Simple, low-overhead, adapts quickly to changes. The cost is per-caller state and the cold-start problem (a new instance starts with no EWMA).
P2C with latency. The power-of-two-choices variant where the "less loaded" comparison uses recent latency instead of connection count. Inherits P2C's good statistical properties.
Zone-aware routing. Hard-code or auto-detect that instances are in the same region as the caller and prefer them. Used heavily in cloud deployments to keep cross-AZ traffic minimal.
The constellation uses zone-aware routing for the standard case (calls stay within the same region whenever possible) and EWMA-based selection as the fallback when the local region is unavailable.
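The EWMA-based selection can be sketched as follows. The names, the smoothing factor alpha, and the cold-start policy (treat an unmeasured instance as having a configurable assumed latency) are assumptions for illustration only.

```rust
/// Per-instance latency tracker using an exponentially weighted moving
/// average. Hypothetical sketch; `alpha` and the cold-start policy are
/// assumptions.
pub struct LatencyEwma {
    alpha: f64,
    value_ms: Option<f64>,
}

impl LatencyEwma {
    pub fn new(alpha: f64) -> Self {
        Self { alpha: alpha.clamp(0.0, 1.0), value_ms: None }
    }

    /// Fold one latency sample into the average.
    pub fn observe(&mut self, sample_ms: f64) {
        self.value_ms = Some(match self.value_ms {
            None => sample_ms, // first sample seeds the average
            Some(prev) => self.alpha * sample_ms + (1.0 - self.alpha) * prev,
        });
    }

    pub fn get(&self) -> Option<f64> {
        self.value_ms
    }
}

/// Index of the instance with the lowest EWMA. Instances with no samples
/// yet are scored as `cold_start_ms`.
pub fn pick_lowest(trackers: &[LatencyEwma], cold_start_ms: f64) -> Option<usize> {
    trackers
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            let la = a.get().unwrap_or(cold_start_ms);
            let lb = b.get().unwrap_or(cold_start_ms);
            la.total_cmp(&lb)
        })
        .map(|(i, _)| i)
}
```

The cold_start_ms value is the knob for the cold-start problem: set it low and new instances receive traffic immediately (and get measured quickly); set it high and they are only tried when everything else is slow.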
The Failure Detector Connection
The failure detector from Module 4 Lesson 1 is the substrate that makes service discovery and load balancing work. The discovery layer's "healthy instance list" is built from failure-detector signals — actively probed or passively observed. The load balancer's "this instance is slow" decision is a suspicion-level reading from the same detector.
This is why the constellation team standardized on phi accrual for service-to-service liveness: the same continuous suspicion signal drives discovery (high suspicion → temporarily remove from the rotation), load balancing (medium suspicion → reduce weight), and circuit breaking (sustained high suspicion → open the breaker). The three layers consume different thresholds on the same signal.
The alternative — three separate detection mechanisms with different timeouts — is what produces flapping. A discovery layer that removes instances at 5 seconds, a load balancer that re-includes at 10 seconds, and a circuit breaker that trips at 30 seconds will interact in ways that nobody designed. Centralize the liveness signal.
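A single signal with layered thresholds can be expressed as one decision function that all three layers share. The sketch below assumes a phi-style suspicion value is already available; the type names and any numeric thresholds a caller supplies are hypothetical placeholders, not the constellation's real settings.

```rust
use std::time::Duration;

/// What the routing layers do at a given suspicion level.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum RoutingAction {
    Normal,
    ReduceWeight,
    RemoveFromRotation,
    OpenBreaker,
}

/// Thresholds on one shared suspicion signal (hypothetical values).
pub struct SuspicionPolicy {
    /// Medium suspicion: the load balancer reduces the instance's weight.
    pub reduce_weight_at: f64,
    /// High suspicion: discovery temporarily removes the instance.
    pub remove_at: f64,
    /// High suspicion sustained this long opens the circuit breaker.
    pub breaker_sustain: Duration,
}

impl SuspicionPolicy {
    /// `phi` is the current suspicion level; `high_for` is how long it has
    /// continuously been at or above `remove_at`.
    pub fn decide(&self, phi: f64, high_for: Duration) -> RoutingAction {
        if phi >= self.remove_at {
            if high_for >= self.breaker_sustain {
                RoutingAction::OpenBreaker
            } else {
                RoutingAction::RemoveFromRotation
            }
        } else if phi >= self.reduce_weight_at {
            RoutingAction::ReduceWeight
        } else {
            RoutingAction::Normal
        }
    }
}
```

Because every layer derives its action from the same function and the same input, the thresholds are ordered by construction, and the layers cannot disagree about whether an instance is in or out.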
Code Examples
A Client-Side Discovery Cache
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

#[derive(Clone, Debug)]
pub struct ServiceInstance {
    pub id: String,
    pub endpoint: String,
    pub last_health_check: Instant,
}

pub struct DiscoveryCache {
    inner: Arc<Mutex<HashMap<String, Vec<ServiceInstance>>>>,
    ttl: Duration,
}

impl DiscoveryCache {
    pub fn new(ttl: Duration) -> Self {
        Self {
            inner: Arc::new(Mutex::new(HashMap::new())),
            ttl,
        }
    }

    /// Called periodically (e.g., every 10s) to refresh the cache from the
    /// registry. The new list replaces the old one wholesale, so stale
    /// entries (not seen in this refresh) are evicted - this is what removes
    /// deregistered or expired instances from the cache.
    pub fn refresh(&self, service: &str, instances: Vec<ServiceInstance>) {
        let mut cache = self.inner.lock().unwrap();
        cache.insert(service.to_string(), instances);
    }

    /// Returns the current cached instance list. Callers may further filter
    /// by health (using the passive health detector) before picking one.
    pub fn instances(&self, service: &str) -> Vec<ServiceInstance> {
        let cache = self.inner.lock().unwrap();
        cache.get(service).cloned().unwrap_or_default()
    }

    /// Returns the instances that are considered fresh (last_health_check
    /// within the TTL). Filters out any instance whose health-check timestamp
    /// is older than the TTL, even if it's still in the cache.
    pub fn fresh_instances(&self, service: &str) -> Vec<ServiceInstance> {
        self.instances(service)
            .into_iter()
            // elapsed() avoids computing `Instant::now() - ttl`, which can
            // panic if the result cannot be represented on the platform.
            .filter(|i| i.last_health_check.elapsed() <= self.ttl)
            .collect()
    }
}
The cache is a simple read-mostly structure with periodic refresh. The two-level filtering (refresh writes the cache; fresh_instances filters by freshness on read) handles the case where the refresh is delayed or the registry itself has stale entries.
Power-of-Two-Choices Load Balancer
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Requires the `fastrand` crate for random index selection.

#[derive(Clone)]
pub struct ServiceInstance {
    pub id: String,
    pub endpoint: String,
}

pub struct InstanceState {
    pub instance: ServiceInstance,
    pub in_flight: AtomicU64,
    pub recent_failures: AtomicU64,
}

pub struct P2CBalancer {
    instances: Vec<Arc<InstanceState>>,
}

impl P2CBalancer {
    pub fn new(instances: Vec<ServiceInstance>) -> Self {
        Self {
            instances: instances
                .into_iter()
                .map(|i| {
                    Arc::new(InstanceState {
                        instance: i,
                        in_flight: AtomicU64::new(0),
                        recent_failures: AtomicU64::new(0),
                    })
                })
                .collect(),
        }
    }

    /// Power-of-two-choices: pick two instances at random, return the one
    /// with fewer in-flight requests. Surprisingly effective: produces
    /// near-optimal load distribution with O(1) work per call and no
    /// cross-caller coordination.
    pub fn pick(&self) -> Option<Arc<InstanceState>> {
        let n = self.instances.len();
        if n == 0 {
            return None;
        }
        if n == 1 {
            return Some(self.instances[0].clone());
        }
        let a = fastrand::usize(0..n);
        // Draw b from the n - 1 remaining indices, skipping a, so the two
        // candidates are always distinct without a retry loop.
        let mut b = fastrand::usize(0..n - 1);
        if b >= a {
            b += 1;
        }
        let inst_a = &self.instances[a];
        let inst_b = &self.instances[b];
        // Compare by in_flight; tiebreak by recent_failures (prefer fewer).
        // Tuples compare lexicographically, which gives exactly that order.
        let score_a = (
            inst_a.in_flight.load(Ordering::Relaxed),
            inst_a.recent_failures.load(Ordering::Relaxed),
        );
        let score_b = (
            inst_b.in_flight.load(Ordering::Relaxed),
            inst_b.recent_failures.load(Ordering::Relaxed),
        );
        if score_a <= score_b {
            Some(inst_a.clone())
        } else {
            Some(inst_b.clone())
        }
    }
}
The key property: each caller picks independently, with no coordination, and the statistical distribution of choices converges to within a constant factor of optimal. Mitzenmacher's paper proves this; the practical observation is that P2C beats round-robin by a wide margin under heterogeneous load and is essentially free to implement.
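The gap between one random choice and two is easy to check empirically. The following is a minimal sketch, not the production balancer: it models requests as balls that never complete (so load only grows), uses a small SplitMix64 generator so it needs no crates, and the names `max_load` and `SplitMix64` are illustrative assumptions.

```rust
/// SplitMix64: a tiny deterministic PRNG, used here only to keep the sketch
/// dependency-free and reproducible.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Roughly uniform index in 0..n (modulo bias is negligible here).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Assign `requests` to `servers` and return the heaviest server's load.
/// With `two_choices` false, each request goes to one random server; with it
/// true, each request samples two distinct servers and joins the lighter one.
pub fn max_load(servers: usize, requests: usize, two_choices: bool, seed: u64) -> u64 {
    assert!(servers >= 2, "need at least two servers to compare");
    let mut rng = SplitMix64(seed);
    let mut load = vec![0u64; servers];
    for _ in 0..requests {
        let a = rng.below(servers);
        let pick = if two_choices {
            // Second draw comes from the remaining servers, so b != a.
            let mut b = rng.below(servers - 1);
            if b >= a {
                b += 1;
            }
            if load[b] < load[a] { b } else { a }
        } else {
            a
        };
        load[pick] += 1;
    }
    load.into_iter().max().unwrap_or(0)
}
```

Calling `max_load(100, 10_000, false, 42)` and `max_load(100, 10_000, true, 42)` compares the two policies at a mean load of 100 per server; the two-choice run should sit much closer to the mean.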
Consistent Hashing for Cache-Affinity Routing
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

#[derive(Clone)]
pub struct ServiceInstance {
    pub id: String,
}

pub struct ConsistentHashRing {
    // BTreeMap gives O(log n) lookup for the next-larger key, which is the
    // canonical consistent-hash lookup. Each instance occupies multiple
    // 'virtual nodes' on the ring to smooth distribution.
    ring: BTreeMap<u64, ServiceInstance>,
    virtual_nodes_per_instance: usize,
}

impl ConsistentHashRing {
    pub fn new(virtual_nodes_per_instance: usize) -> Self {
        Self {
            ring: BTreeMap::new(),
            virtual_nodes_per_instance,
        }
    }

    pub fn add_instance(&mut self, instance: ServiceInstance) {
        for vn in 0..self.virtual_nodes_per_instance {
            let key = hash_key(&format!("{}#{}", instance.id, vn));
            self.ring.insert(key, instance.clone());
        }
    }

    pub fn remove_instance(&mut self, instance_id: &str) {
        for vn in 0..self.virtual_nodes_per_instance {
            let key = hash_key(&format!("{}#{}", instance_id, vn));
            self.ring.remove(&key);
        }
    }

    /// Return the instance responsible for the given key. The instance set
    /// changes only when nodes are added or removed; otherwise the same key
    /// always maps to the same instance.
    pub fn lookup(&self, key: &str) -> Option<&ServiceInstance> {
        if self.ring.is_empty() {
            return None;
        }
        let h = hash_key(key);
        // Find the smallest ring position >= h; if none, wrap around to the
        // smallest position overall.
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, instance)| instance)
    }
}

// DefaultHasher keeps the sketch dependency-free, but its algorithm is not
// guaranteed to stay the same across Rust releases. A ring shared between
// separately built processes should use an explicitly specified hash.
fn hash_key(s: &str) -> u64 {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}
The virtual_nodes_per_instance parameter is the smoothing knob. With one virtual node per instance, the distribution is uneven (an instance can own a much larger arc of the ring than its share). With 100–200 virtual nodes per instance, the distribution becomes nearly uniform with negligible CPU cost. Production systems built on this idea use many virtual nodes per physical instance: the Dynamo paper describes the technique, and Cassandra long defaulted to 256 tokens per node (the default was lowered to 16 in Cassandra 4.0, alongside a smarter token-allocation algorithm).
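The smoothing effect can be measured directly. This is a rough sketch under stated assumptions: instance IDs (`inst-N`) and keys (`key-N`) are synthetic, the ring stores instance indices instead of full records, and `imbalance` is a hypothetical helper, not part of the lesson's API. It reports the busiest instance's share divided by the mean share, so 1.0 means a perfectly even spread.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_key(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Build a ring with `instances` nodes and `vnodes` virtual nodes each, route
/// `keys` synthetic keys through it, and return max share / mean share.
pub fn imbalance(instances: usize, vnodes: usize, keys: usize) -> f64 {
    assert!(instances > 0 && vnodes > 0 && keys > 0);

    // Ring position -> index of the owning instance.
    let mut ring: BTreeMap<u64, usize> = BTreeMap::new();
    for inst in 0..instances {
        for vn in 0..vnodes {
            ring.insert(hash_key(&format!("inst-{}#{}", inst, vn)), inst);
        }
    }

    let mut counts = vec![0usize; instances];
    for k in 0..keys {
        let h = hash_key(&format!("key-{}", k));
        // Same lookup rule as the ring above: next position clockwise,
        // wrapping to the smallest position if none is larger.
        let owner = ring
            .range(h..)
            .next()
            .or_else(|| ring.iter().next())
            .map(|(_, inst)| *inst)
            .expect("ring is non-empty");
        counts[owner] += 1;
    }

    let max = *counts.iter().max().expect("at least one instance") as f64;
    let mean = keys as f64 / instances as f64;
    max / mean
}
```

Comparing `imbalance(10, 1, 20_000)` with `imbalance(10, 150, 20_000)` shows how far a single virtual node per instance is from an even split and how much of that the extra virtual nodes recover.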
Key Takeaways
- Service discovery is the dynamic-configuration layer that replaces static IP lists. Client-side discovery puts logic in each caller (simpler infrastructure, more language duplication); server-side puts a load balancer in the path (one well-tested implementation, extra network hop).
- The registry needs active or passive health awareness to be useful. TTL-based expiry catches hard crashes; passive failure detection at the client catches the subtle "200 to /health, 500 to real requests" failure mode. Use both.
- Power-of-two-choices is the practical default for load balancing across equivalent instances: O(1) per call, near-optimal distribution, no coordination required. Round-robin is acceptable when instances are truly equivalent; least-loaded is appropriate when one balancer has full visibility.
- Consistent hashing is the right tool when key affinity matters (caching, sharded state). Use virtual nodes per physical instance to smooth distribution; production deployments use hundreds of virtual nodes each.
- The failure detector underlies discovery and load balancing both. Use one detector with multiple thresholds for the different consumers (discovery removal, load balancer reweighting, circuit breaker tripping) rather than separate detection mechanisms that interact unpredictably.
Source note: This lesson is synthesized from training knowledge plus the canonical sources for each load balancing algorithm. Karger et al., "Consistent Hashing and Random Trees" (STOC 1997) is the consistent hashing original. Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" (IEEE TPDS 2001) is the P2C analysis. The Envoy and NGINX documentation are the practical references for production configurations. DDIA Chapter 10 briefly discusses service discovery under "Membership and Coordination Services" but does not go deep on load balancing algorithms. Foundations of Scalable Systems would have been the standard reference here and was unavailable; cross-check before publication.
Lesson 3: Gossip Protocols
Context
The Constellation Network's membership list (which satellites are currently in the active grid, which ground stations are operational, which compute nodes can accept pass-window jobs) needs to be visible from every node. The naive approach is a central registry, and for a 60-node deployment this works. But the next phase of the network adds 200 edge-compute nodes (one per ground station, each with a small fleet of GPUs for on-site image processing) and the central registry becomes a bottleneck. With 260 nodes each querying every five seconds, the registry serves about 52 queries per second; the registry CPU spikes; the operations team starts adding read replicas and load balancers in front of the registry. The architecture has begun to fight the workload.
For membership and other "eventually-consistent broadcast" workloads, a different shape is right: gossip protocols. Each node tracks a small subset of peers, periodically exchanges state with a random peer, and the cluster-wide view emerges via epidemic propagation. The mechanism is decentralized, scales as O(log N) propagation time for cluster size N, and is naturally resilient to individual node failures. Gossip is what Cassandra uses for cluster membership, what SWIM is built on, what HashiCorp Serf implements, what Riak uses for ring state propagation, and what Consul uses, through Serf, for cluster membership.
This lesson covers the family of gossip protocols, the SWIM failure-detection variant in detail, and the operational properties — propagation time, message complexity, convergence guarantees — that distinguish gossip from the centralized alternatives. By the end, you should be able to choose between centralized and gossip-based mechanisms for a given workload, and tune a gossip protocol's parameters against the network's actual characteristics.
Core Concepts
Why Gossip Works
The core idea, formalized by Demers et al. at Xerox PARC in 1987: when each of N nodes randomly chooses K peers to exchange state with per round, information about any new fact spreads through the cluster in O(log N / log K) rounds with high probability. The "epidemic" framing is exact — gossip protocols are mathematically the same as biological epidemic models, with the same convergence properties.
The three flavors of gossip, distinguished by the direction of information flow:
Push gossip. Each node periodically picks a random peer and sends it state. The recipient merges; that's the entire interaction. Simple but slow to converge — late in the epidemic, most pushes go to peers that already have the information.
Pull gossip. Each node periodically picks a random peer and requests state from it. Faster than push late in the epidemic (a node that doesn't have the information is exactly the node that benefits from pulling).
Push-pull gossip. Each node periodically picks a peer and exchanges state bidirectionally. Each round, both peers end with the union of their state. Push-pull dominates push and pull in convergence speed and is the standard production form.
All three converge with high probability under the same conditions: messages are delivered with reasonable probability, the random peer selection is fair, and no information is permanently lost. None require strong network synchrony or majority quorums. The cost is convergence delay: a fact takes multiple gossip rounds to spread, so gossip is appropriate for state that can tolerate seconds of staleness (membership, configuration) and inappropriate for state that requires immediate consistency (consensus decisions, financial transactions).
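The difference between the three flavors shows up clearly in a round-based simulation. The sketch below is illustrative only: it assumes synchronous rounds, reliable delivery, one uniformly random peer per node per round, and a single fact that starts at node 0. `Mode`, `rounds_to_converge`, and `SplitMix64` are names made up for this example.

```rust
/// SplitMix64: a tiny deterministic PRNG so the sketch needs no crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    Push,
    Pull,
    PushPull,
}

/// Number of rounds until all `n` nodes know a fact that starts at node 0.
pub fn rounds_to_converge(n: usize, mode: Mode, seed: u64) -> u32 {
    assert!(n >= 2, "need at least two nodes");
    let mut rng = SplitMix64(seed);
    let mut informed = vec![false; n];
    informed[0] = true;
    let mut rounds = 0;

    while informed.iter().any(|&known| !known) {
        // Every exchange in a round is decided on the state at round start.
        let before = informed.clone();
        for node in 0..n {
            // Pick a random peer other than ourselves.
            let mut peer = rng.below(n - 1);
            if peer >= node {
                peer += 1;
            }
            let pushes = mode != Mode::Pull;
            let pulls = mode != Mode::Push;
            if pushes && before[node] {
                informed[peer] = true; // we send our state to the peer
            }
            if pulls && before[peer] {
                informed[node] = true; // we fetch the peer's state
            }
        }
        rounds += 1;
    }
    rounds
}
```

Running `rounds_to_converge(1000, Mode::Push, 7)` next to the `Mode::PushPull` variant gives a feel for how much the bidirectional exchange shortens the tail of the epidemic.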
SWIM: Scalable Failure Detection via Gossip
The SWIM protocol (Scalable Weakly-consistent Infection-style Membership) was introduced by Das, Gupta, and Motivala at Cornell University in 2002. It is the canonical reference for gossip-based failure detection, used by HashiCorp Serf and many other production systems.
SWIM has two parts: a failure detector that combines direct and indirect probes, and a dissemination component that propagates joins, leaves, and failures via gossip.
The failure detector. Periodically, each node picks a random target and sends a PING. If the target replies with ACK within a timeout, the target is alive. If not, the prober sends a PING-REQ to K other random nodes, asking each to PING the target on its behalf and relay any ACK. If any of the indirect probes succeeds, the target is alive. If all fail, the prober declares the target suspected. This indirect-probe mechanism is what gives SWIM its key property: a single network blip on the prober's own path doesn't cause a false positive, because the indirect probes reach the target over independent paths.
The membership dissemination. Membership updates (joins, suspicions, failures) are piggybacked on existing PING/ACK messages. There's no separate broadcast — the same packets that carry liveness probes also carry membership state. The effect is that membership changes propagate at the same speed as failure detection, with no extra network cost.
Suspicion mechanism. SWIM introduces a "suspected" state between "alive" and "dead." When a node is suspected, the cluster has a window (typically a few seconds) before declaring it definitely dead. During that window, the suspected node can disseminate its own "I'm alive" message via gossip and clear the suspicion. This dramatically reduces false-positive rates compared to simple timeout-based detection.
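The suspicion window can be sketched as a small per-member state machine. This is a simplified model under assumptions: time is passed in explicitly so the logic is testable, a Dead verdict is treated as final, and `SuspicionTracker` and `Liveness` are hypothetical names rather than types from SWIM or any library.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Liveness {
    Alive,
    Suspected { since: Instant },
    Dead,
}

/// Tracks one remote member's liveness as seen by the local node.
pub struct SuspicionTracker {
    state: Liveness,
    timeout: Duration,
}

impl SuspicionTracker {
    pub fn new(timeout: Duration) -> Self {
        Self { state: Liveness::Alive, timeout }
    }

    pub fn state(&self) -> Liveness {
        self.state
    }

    /// Direct and indirect probes all failed. Only an Alive member becomes
    /// Suspected; repeated failures do not restart the suspicion window.
    pub fn probe_failed(&mut self, now: Instant) {
        if self.state == Liveness::Alive {
            self.state = Liveness::Suspected { since: now };
        }
    }

    /// The member refuted the suspicion (or answered a probe) in time.
    pub fn refuted(&mut self) {
        if matches!(self.state, Liveness::Suspected { .. }) {
            self.state = Liveness::Alive;
        }
    }

    /// Called periodically: promote Suspected to Dead once the window ends.
    pub fn tick(&mut self, now: Instant) {
        if let Liveness::Suspected { since } = self.state {
            if now.saturating_duration_since(since) >= self.timeout {
                self.state = Liveness::Dead;
            }
        }
    }
}
```

Passing `now` into `probe_failed` and `tick` keeps the tracker free of wall-clock reads, which makes the timeout behavior straightforward to exercise in tests.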
The catalog uses SWIM for the edge-compute layer (200+ nodes) and a more direct heartbeat mechanism for the consensus tier (5 nodes in each Raft cluster). The reasoning: SWIM's O(log N) scaling matters when N is large; for small clusters the direct mechanism is simpler and just as effective.
Anti-Entropy Across Replicas
Gossip is also the mechanism behind anti-entropy: the background process that reconciles diverged replicas in eventually-consistent storage systems. Cassandra, Riak, and Dynamo (the Amazon system described in the 2007 paper) all use anti-entropy to ensure replicas converge even for keys that are not being read.
The standard implementation uses Merkle trees: each replica computes a tree of hashes over its key ranges. Two replicas exchange the root hashes; if they match, the replicas are consistent and no further work is needed. If they differ, the replicas recurse down the tree, exchanging child hashes until the diverging ranges are localized to specific keys, which are then reconciled.
The cost is per-replica: each node builds and maintains the Merkle tree for its data. The savings are dramatic when few keys have diverged: with N keys, localizing a single diverged key takes O(log N) hash comparisons rather than O(N) per-key comparisons. Cassandra's anti-entropy repair is built on this comparison.
The gossip layer is what makes anti-entropy work cluster-wide: replicas don't need to coordinate; each periodically picks a peer (via the same gossip random selection) and runs anti-entropy with it. Over time, every pair gets exchanged, and the cluster converges.
Versioned State and Vector Clocks in Gossip
When two nodes exchange membership state via gossip, they must reconcile entries that disagree. The disagreement might be a genuine update ("node X has joined") or a stale view ("node X was alive when I last checked, but maybe it has failed since"). The reconciliation needs to identify which view is newer.
Two approaches:
Lamport-style logical clocks. Each entry carries a generation number incremented on each update. The higher number wins on reconciliation. Simple but loses information about concurrent updates.
Vector clocks per entry. Each entry carries a vector clock (Module 1) showing which other nodes' updates it has incorporated. Conflicting entries are detected as siblings and resolved via application logic. More precise but heavier.
Cassandra uses Lamport-style timestamps; Riak uses vector clocks. Both work; the choice depends on whether the system needs to detect concurrent updates as conflicts (Riak) or accepts last-write-wins (Cassandra).
The constellation's gossip layer uses Lamport timestamps for membership state (failure detection is naturally last-write-wins — a node is either alive or dead, with the most recent observation winning) and vector clocks for replica metadata in the data layer (where concurrent updates from different regions are operationally meaningful).
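The practical difference between the two approaches is that a vector clock can report "neither is newer". A minimal sketch of that comparison follows, assuming clocks are maps from node ID to counter with missing entries treated as zero; `VectorClock`, `ClockOrder`, `clock`, and `compare` are illustrative names, not the Module 1 API.

```rust
use std::collections::HashMap;

pub type VectorClock = HashMap<String, u64>;

#[derive(Debug, PartialEq, Eq)]
pub enum ClockOrder {
    Equal,
    Before,     // a happened before b
    After,      // a happened after b
    Concurrent, // neither dominates: a genuine conflict
}

/// Convenience constructor for examples and tests.
pub fn clock(entries: &[(&str, u64)]) -> VectorClock {
    entries
        .iter()
        .map(|(node, count)| (node.to_string(), *count))
        .collect()
}

/// Compare two vector clocks component-wise. A missing entry counts as 0.
pub fn compare(a: &VectorClock, b: &VectorClock) -> ClockOrder {
    let mut a_ahead = false;
    let mut b_ahead = false;
    for node in a.keys().chain(b.keys()) {
        let ca = a.get(node).copied().unwrap_or(0);
        let cb = b.get(node).copied().unwrap_or(0);
        if ca > cb {
            a_ahead = true;
        }
        if cb > ca {
            b_ahead = true;
        }
    }
    match (a_ahead, b_ahead) {
        (false, false) => ClockOrder::Equal,
        (true, false) => ClockOrder::After,
        (false, true) => ClockOrder::Before,
        (true, true) => ClockOrder::Concurrent,
    }
}
```

A single generation counter would force one of two concurrent regional updates to win silently; here `compare` returns `Concurrent` and leaves the resolution to application logic.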
Convergence Properties and Tuning
Gossip's mathematical analysis gives concrete numbers. With cluster size N, gossip rate f (gossip rounds per second), and fanout k (peers contacted per round), an update reaches all nodes in roughly log(N) / log(k) rounds, which is the O(log N / log K) bound from the start of this lesson, so the expected propagation time is roughly log(N) / (f * log(k)) seconds. Treat this as an order-of-magnitude estimate for k ≥ 2, not an exact prediction.
The operational parameters:
- Gossip period. How often each node initiates a gossip exchange. Standard values are 1–5 seconds. Faster gossip means faster convergence at higher CPU/network cost.
- Fanout. Number of peers contacted per round. Higher fanout = faster convergence but more bandwidth. SWIM typically uses fanout 3.
- Sub-cluster (peer set) size. Each node tracks a subset of cluster members for gossip targeting. For small clusters (under ~50), every node tracks every other; for larger clusters, partial views with periodic refresh.
The constellation's edge layer uses 1-second gossip periods with fanout 3. The math: for 250 nodes, expected propagation time is log(250) / (1 * log(3)) ≈ 5 seconds. The metric (membership change propagation time) is monitored; the parameters are tuned when the metric drifts above the design budget.
When Gossip Is the Wrong Tool
Despite its scaling properties, gossip is wrong for several workloads:
Strongly consistent state. Gossip provides eventual convergence; it does not provide linearizability or consensus. If two nodes need to agree atomically (as in Raft's leader election), gossip is the wrong shape — use consensus.
Latency-critical decisions. Gossip takes seconds to converge. If a decision needs to be based on globally-consistent state right now, gossip will be too slow. The local failure detector (Module 4) is faster for individual node liveness; gossip is for cluster-wide membership view.
Very small clusters. For 5 nodes, every-to-every gossip is fine but a centralized registry is simpler. The crossover where gossip becomes operationally advantageous is somewhere in the dozens.
Adversarial environments. Gossip's assumption is that all participants are honest. Byzantine-tolerant gossip exists (peer review of state, signed updates) but it's substantially more complex. Blockchain consensus protocols are essentially Byzantine-tolerant gossip plus economic incentives; that's a different problem.
The catalog's split — consensus for the 5-node tier, gossip for the 250-node edge — is the standard cloud-native answer. The two layers communicate at a defined boundary (the consensus tier exposes a small consensus-backed registry; the edge layer reads from it but does its own gossip for the high-volume membership info).
Code Examples
A Push-Pull Gossip Step
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MemberInfo {
    pub node_id: String,
    pub generation: u64,
    pub status: MemberStatus,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum MemberStatus {
    Alive,
    Suspected,
    Dead,
}

pub struct MembershipState {
    members: Mutex<HashMap<String, MemberInfo>>,
}

impl MembershipState {
    pub fn new() -> Self {
        Self { members: Mutex::new(HashMap::new()) }
    }

    /// Merge our state with state received from a peer. For each member, the
    /// higher generation wins. This is Lamport-style last-write-wins on the
    /// per-member version counter.
    pub fn merge(&self, peer_state: &HashMap<String, MemberInfo>) {
        let mut ours = self.members.lock().unwrap();
        for (id, peer_info) in peer_state {
            match ours.get(id) {
                Some(our_info) if our_info.generation >= peer_info.generation => {
                    // We have the same or newer info; keep ours.
                }
                _ => {
                    // Peer has newer info (or we have none); adopt it.
                    ours.insert(id.clone(), peer_info.clone());
                }
            }
        }
    }

    pub fn snapshot(&self) -> HashMap<String, MemberInfo> {
        self.members.lock().unwrap().clone()
    }

    pub fn mark_alive(&self, node_id: &str) {
        let mut m = self.members.lock().unwrap();
        let entry = m.entry(node_id.to_string()).or_insert(MemberInfo {
            node_id: node_id.to_string(),
            generation: 0,
            status: MemberStatus::Alive,
        });
        entry.generation += 1;
        entry.status = MemberStatus::Alive;
    }
}

/// Push-pull gossip: send our state to the peer; receive theirs in return.
/// Both ends merge. After the exchange, both have the union of their state.
/// `peer_send` stands in for the network round trip; a real transport would
/// be awaited here, which is why the function is async.
pub async fn gossip_step(
    local: &MembershipState,
    peer_send: impl FnOnce(HashMap<String, MemberInfo>) -> HashMap<String, MemberInfo>,
) {
    let snapshot = local.snapshot();
    let peer_response = peer_send(snapshot);
    local.merge(&peer_response);
}

#[tokio::main]
async fn main() {
    let local = MembershipState::new();
    local.mark_alive("ground-pacific");
    local.mark_alive("ground-atlantic");

    let peer_state = {
        let m = MembershipState::new();
        m.mark_alive("ground-pacific");
        m.mark_alive("ground-indian"); // peer knows about a node we don't
        m
    };

    // The peer's half of the exchange: merge what we pushed, then reply with
    // its own (now merged) state.
    gossip_step(&local, |our_snapshot| {
        peer_state.merge(&our_snapshot);
        peer_state.snapshot()
    })
    .await;

    let after = local.snapshot();
    println!("after gossip, we know about {} nodes", after.len());
    println!("after gossip, the peer knows about {} nodes", peer_state.snapshot().len());
    // Both lines should print 3: pacific, atlantic, and indian.
}
The merge is the core operation. The exchange is symmetric: each side ends with the union of state. After O(log N) rounds, every node has every other node's state, with high probability.
SWIM Indirect Probe
use std::time::Duration;

use anyhow::Result;
use futures::future::join_all; // from the `futures` crate
use tokio::time::timeout;

// Placeholder endpoint: both calls succeed immediately. A real
// implementation would send PING / PING-REQ over the transport.
pub struct PeerEndpoint;

impl PeerEndpoint {
    async fn ping(&self) -> Result<()> {
        Ok(())
    }

    async fn ping_req(&self, _target: &PeerEndpoint) -> Result<bool> {
        Ok(true)
    }
}

pub async fn swim_probe(
    target: &PeerEndpoint,
    indirect_peers: &[PeerEndpoint],
    direct_timeout: Duration,
    indirect_timeout: Duration,
) -> bool {
    // Step 1: direct probe. Most checks succeed here. The probe counts only
    // if it finished in time AND the ping itself returned Ok.
    if matches!(timeout(direct_timeout, target.ping()).await, Ok(Ok(()))) {
        return true;
    }

    // Step 2: indirect probes. Ask a few random peers to probe the target on
    // our behalf. If any path succeeds, the target is alive (and the failure
    // is in our direct path to it, not in the target itself). The probes run
    // concurrently, so this step takes at most one indirect_timeout.
    let probes = indirect_peers
        .iter()
        .map(|peer| timeout(indirect_timeout, peer.ping_req(target)));
    let results = join_all(probes).await;
    if results.into_iter().any(|r| matches!(r, Ok(Ok(true)))) {
        return true;
    }

    // All paths failed; declare suspected (not dead - the dissemination layer
    // will run the suspicion-timeout protocol to confirm).
    false
}
The indirect-probe layer is what gives SWIM its low false-positive rate. A direct timeout could mean "target is dead" or "my path to target is broken"; the indirect probes distinguish these. If indirect probes succeed, the target is alive and the issue is on the local node's network — exactly the case where you don't want to falsely declare the target dead.
Merkle Tree for Anti-Entropy (Sketch)
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

#[derive(Clone, Debug)]
pub struct MerkleNode {
    pub hash: [u8; 32],
    pub range: (u64, u64),
    pub children: Option<Box<(MerkleNode, MerkleNode)>>,
}

impl MerkleNode {
    /// Build a Merkle tree over key-value pairs in a half-open key range.
    /// A leaf hashes every key-value pair that falls in its range; internal
    /// nodes are hashes of the concatenation of their children's hashes.
    pub fn build(data: &BTreeMap<u64, Vec<u8>>, range: (u64, u64), depth: usize) -> Self {
        if depth == 0 || range.1 - range.0 <= 1 {
            // Leaf: hash all the data in this range.
            let mut hasher = Sha256Stub::new();
            for (k, v) in data.range(range.0..range.1) {
                hasher.update(&k.to_be_bytes());
                hasher.update(v);
            }
            return MerkleNode { hash: hasher.finalize(), range, children: None };
        }

        // Midpoint computed without overflow for ranges near u64::MAX.
        let mid = range.0 + (range.1 - range.0) / 2;
        let left = MerkleNode::build(data, (range.0, mid), depth - 1);
        let right = MerkleNode::build(data, (mid, range.1), depth - 1);

        let mut hasher = Sha256Stub::new();
        hasher.update(&left.hash);
        hasher.update(&right.hash);
        MerkleNode {
            hash: hasher.finalize(),
            range,
            children: Some(Box::new((left, right))),
        }
    }

    /// Compare two trees: return the leaf ranges where they differ. The
    /// O(log N) speedup comes from skipping subtrees where root hashes match.
    pub fn diverged_ranges(a: &MerkleNode, b: &MerkleNode, out: &mut Vec<(u64, u64)>) {
        if a.hash == b.hash {
            return;
        }
        match (&a.children, &b.children) {
            (None, _) | (_, None) => out.push(a.range),
            (Some(ac), Some(bc)) => {
                Self::diverged_ranges(&ac.0, &bc.0, out);
                Self::diverged_ranges(&ac.1, &bc.1, out);
            }
        }
    }
}

// Sha256Stub stands in for sha2::Sha256 (which would require adding the sha2
// crate). It wraps std's DefaultHasher so different inputs yield different
// digests, but it is NOT cryptographic: only the first 8 bytes carry data.
struct Sha256Stub(DefaultHasher);

impl Sha256Stub {
    fn new() -> Self {
        Self(DefaultHasher::new())
    }

    fn update(&mut self, b: &[u8]) {
        self.0.write(b);
    }

    fn finalize(self) -> [u8; 32] {
        let mut out = [0u8; 32];
        out[..8].copy_from_slice(&self.0.finish().to_be_bytes());
        out
    }
}
In a real anti-entropy exchange, peers exchange root hashes first; if they match, the keys are consistent and no further work is needed. If not, peers exchange child hashes recursively, narrowing in on the diverged ranges. Only those keys are re-exchanged at the per-key level. For a 1-million-key replica with one diverged key, this is O(log N) hash comparisons rather than O(N) key comparisons — a dramatic speedup.
Key Takeaways
- Gossip protocols spread information epidemically: each node periodically exchanges state with random peers, and updates propagate in O(log N) rounds for cluster size N. The mechanism is decentralized, resilient, and scales to thousands of nodes without a central bottleneck.
- SWIM is the canonical gossip-based failure detector. The two-step probing (direct, then indirect via K random peers) dramatically reduces false positives, and the suspicion-state mechanism gives a node a chance to refute a false positive before it's declared dead.
- Anti-entropy uses gossip plus Merkle trees to reconcile diverged replicas efficiently. Cassandra, Riak, and Dynamo use this pattern for background consistency repair.
- Gossip is appropriate for cluster-wide state that can tolerate seconds of staleness (membership, configuration, routing tables). It is inappropriate for state that requires immediate consistency (consensus decisions). Use gossip for the broadcast workloads; use consensus for the agreement workloads.
- The standard architectural split is consensus for the small high-value tier (Raft cluster, 3–7 nodes) and gossip for the large edge tier (membership across hundreds of nodes). The two layers communicate at a defined boundary; the consensus tier acts as a small, strongly-consistent registry that the gossip tier reads.
Source note: This lesson is synthesized from training knowledge plus the canonical sources. Demers et al., "Epidemic Algorithms for Replicated Database Maintenance" (PODC 1987) is the original gossip-protocol paper. Das, Gupta, Motivala, "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (DSN 2002) is the SWIM paper. Merkle trees for anti-entropy are documented in the Cassandra and Riak operational guides. DDIA's Chapter 6 briefly mentions anti-entropy in the leaderless replication section but does not deeply treat gossip protocols. Foundations of Scalable Systems would have been the natural reference and was unavailable; cross-check before publication.
Module 05 Project — Telemetry Gossip
Mission Brief
Incident ticket CN-2703-018
Severity: P2
Reporter: Constellation Operations, Edge Fleet
Status: Open
The edge-compute fleet has reached 240 nodes and the central health registry is now the dominant source of intra-cluster traffic. The registry's CPU is at 80% steady-state; query latency for the membership API has crept from 5ms to 110ms over the last quarter; an outage of the registry on March 14 left the entire fleet uncoordinated for 47 minutes because every node's view of "who else is alive" froze.
You are building Telemetry Gossip, a Rust crate that replaces the central health registry with a SWIM-based gossip layer. Each node tracks the cluster membership independently via gossip; the central registry remains only as a small consensus-backed source of authoritative cluster configuration (which nodes belong to the cluster); per-node health and load state propagates via gossip with no central bottleneck.
Repository Layout
telemetry-gossip/
├── Cargo.toml
├── src/
│ ├── lib.rs
│ ├── membership.rs # MemberInfo, MemberStatus, MembershipState
│ ├── swim.rs # SwimDetector: direct + indirect probe
│ ├── gossip.rs # PushPullGossip: peer selection + state exchange
│ ├── transport.rs # Network abstraction (real or simulated)
│ └── node.rs # GossipNode: integrated SWIM + gossip + state
├── tests/
│ ├── convergence.rs
│ ├── swim_indirect_probe.rs
│ ├── partition_recovery.rs
│ └── scale_simulation.rs
└── README.md
Required API
// membership.rs
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum MemberStatus { Alive, Suspected, Dead }

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MemberInfo {
    pub node_id: String,
    pub status: MemberStatus,
    pub generation: u64, // Lamport-style timestamp for LWW reconciliation
    pub last_observed: Instant,
}

pub struct MembershipState {
    members: Mutex<HashMap<String, MemberInfo>>,
}

impl MembershipState {
    pub fn new() -> Self;
    pub fn upsert(&self, info: MemberInfo);
    pub fn merge(&self, peer_state: &[MemberInfo]);
    pub fn snapshot(&self) -> Vec<MemberInfo>;
    pub fn alive_members(&self) -> Vec<MemberInfo>;
    pub fn member(&self, id: &str) -> Option<MemberInfo>;
}

// swim.rs
pub struct SwimDetector { /* ... */ }

impl SwimDetector {
    pub fn new(transport: Arc<dyn Transport>, indirect_k: usize) -> Self;
    pub async fn probe(&self, target: &str, members: &[MemberInfo]) -> bool;
}

// gossip.rs
pub struct PushPullGossip { /* ... */ }

impl PushPullGossip {
    pub fn new(state: Arc<MembershipState>, transport: Arc<dyn Transport>) -> Self;
    pub async fn gossip_round(&self) -> Result<()>;
}

// node.rs
pub struct GossipNode { /* ... */ }

impl GossipNode {
    pub fn new(
        node_id: String,
        seeds: Vec<String>,
        transport: Arc<dyn Transport>,
    ) -> Self;
    pub async fn start(&self);
    pub fn members(&self) -> Vec<MemberInfo>;
}
Acceptance Criteria
- cargo build --release completes without warnings under #![deny(warnings)].
- cargo test --release passes all integration tests with zero flakes across 50 consecutive runs.
- cargo clippy -- -D warnings produces no lints.
- Convergence test: start a 20-node cluster; have node 0 update its own state; verify that all 20 nodes reflect the update within 5 gossip rounds (allowing for the O(log N) propagation bound).
- State reconciliation test: two nodes have divergent views of a third node's status (one says Alive, one says Dead). After a single push-pull exchange, both end with the higher-generation value.
- SWIM indirect probe test: node A's direct path to node B is broken (network injection drops A→B but not other paths). A probes B directly, fails, falls back to indirect probes via two other nodes, succeeds. A correctly classifies B as alive.
- Suspicion-state test: when a target genuinely fails (no node can reach it), the cluster transitions it to Suspected, then after a configurable timeout to Dead. A target that recovers during the Suspected window broadcasts an Alive update that clears the suspicion.
- Partition recovery test: partition a 10-node cluster into 7+3. While partitioned, each side's view of the other freezes (neither side can learn the other's state); the 7-node side correctly identifies the 3-node side as Dead via SWIM. After the partition heals, the 3-node side learns the 7-node side's state and the 7-node side updates to reflect the 3-node side's recovery.
- Scale simulation: simulate a 250-node cluster (in-memory transport, simulated message delivery). Measure: (a) per-node memory used for membership state, (b) network bytes per gossip round, (c) average convergence time for a state change. The test reports these as observable metrics; the README documents them.
- (self-assessed) The README explains the relationship between the consensus-backed "cluster configuration" registry and the gossip-based membership state. A reader should understand which guarantees come from which layer.
- (self-assessed) The SWIM indirect probe respects the configured K parameter and does not blow up if fewer than K indirect peers are available. Edge cases (1-node cluster, 2-node cluster) are handled gracefully.
- (self-assessed) The state-reconciliation logic handles concurrent updates from the same node (same generation, different status) deterministically. Document the tiebreaker policy.
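One possible tiebreaker policy for the last criterion, sketched under assumptions: at equal generation the more pessimistic status wins (Dead over Suspected over Alive), encoded through the enum's declaration order. This is an illustrative choice, not a policy the project mandates, and `Entry` and `resolve` are hypothetical names.

```rust
/// Declaration order doubles as tiebreak precedence: the derived `Ord`
/// makes Alive < Suspected < Dead.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum MemberStatus {
    Alive,
    Suspected,
    Dead,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entry {
    pub generation: u64,
    pub status: MemberStatus,
}

/// Pick the winning entry. Higher generation wins; at equal generation the
/// higher-precedence status wins. Comparing (generation, status) tuples makes
/// the result independent of argument order, so both peers in an exchange
/// reach the same answer.
pub fn resolve(ours: Entry, theirs: Entry) -> Entry {
    if (theirs.generation, theirs.status) > (ours.generation, ours.status) {
        theirs
    } else {
        ours
    }
}
```

Because the comparison is a total order on `(generation, status)`, `resolve(a, b)` and `resolve(b, a)` always agree, which is the determinism the criterion asks for.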
Expected Output
cargo test --release convergence -- --nocapture:
[t=0.000s] Cluster initialized: 20 nodes (n0..n19), gossip period 100ms
[t=0.100s] All nodes initial state: all members Alive
[t=0.500s] n0: update self status -> generation incremented
[t=0.600s] gossip round 1 complete. Nodes aware of update: 3
[t=0.700s] gossip round 2 complete. Nodes aware of update: 9
[t=0.800s] gossip round 3 complete. Nodes aware of update: 18
[t=0.900s] gossip round 4 complete. Nodes aware of update: 20
PASS: Convergence in 4 rounds (O(log 20) = ~4.3, within bound)
Hints
1. Generation numbers and reconciliation
Each node owns its own generation counter for its own status. When a node updates its own state, it increments its generation and broadcasts. Other nodes adopt the higher generation. For "this node is dead" updates from other nodes about us, the protocol gets more interesting: the suspected node can refute by broadcasting an even-higher generation Alive update. This is the SWIM suspicion-refutation mechanism. Implement it as: the prober proposes generation G+1 with status Suspected; the suspected node refutes by issuing G+2 with status Alive.
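The G+1 / G+2 exchange described in this hint can be sketched as three small functions. This follows the hint's scheme as written and is not the SWIM paper's exact incarnation rules; the helper names `suspect`, `refute`, and `merge` are assumptions for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberStatus {
    Alive,
    Suspected,
    Dead,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MemberInfo {
    pub node_id: String,
    pub generation: u64,
    pub status: MemberStatus,
}

/// Prober side: having last seen the target at generation G, propose
/// Suspected at G+1.
pub fn suspect(current: &MemberInfo) -> MemberInfo {
    MemberInfo {
        node_id: current.node_id.clone(),
        generation: current.generation + 1,
        status: MemberStatus::Suspected,
    }
}

/// Target side: on receiving a Suspected or Dead claim about ourselves,
/// answer with Alive at one generation above the claim. Returns None when
/// the claim concerns another node or is not an accusation.
pub fn refute(self_id: &str, claim: &MemberInfo) -> Option<MemberInfo> {
    if claim.node_id != self_id || claim.status == MemberStatus::Alive {
        return None;
    }
    Some(MemberInfo {
        node_id: self_id.to_string(),
        generation: claim.generation + 1,
        status: MemberStatus::Alive,
    })
}

/// Observer side: adopt an incoming entry only if its generation is higher.
pub fn merge(ours: &mut MemberInfo, incoming: &MemberInfo) {
    if incoming.node_id == ours.node_id && incoming.generation > ours.generation {
        *ours = incoming.clone();
    }
}
```

An observer that receives the accusation and the refutation in either order ends at Alive with the higher generation, because a late-arriving accusation carries a lower generation and is ignored.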
2. Peer selection in gossip
For small clusters (under ~50), every node knows every other and picks uniformly. For larger clusters, partial views — each node tracks a subset, refreshing periodically — are more bandwidth-efficient. The project's required API supports the small-cluster case; extending to partial views is a worthwhile self-assessed improvement once the basics work.
3. Suspicion timeout vs gossip period
The suspicion timeout should be at least 2–3× the gossip period, so a node has multiple chances to refute. Too short and false-positive suspicions race the refutation; too long and real failures take too long to declare. A 100ms gossip period with a 5s suspicion timeout is a reasonable default for the test environment.
4. Testing convergence deterministically
Real-time-based convergence tests can be flaky. Instead, drive the simulation with explicit "round" ticks: the test calls gossip_round() on each node in sequence, then asserts on the resulting state. This makes the test reproducible and the bound assertion (O(log N) rounds) exact.
5. The 250-node scale simulation
Spawning 250 real tokio tasks is fine but excessive for a unit test. An alternative: keep all 250 nodes in a single test function, drive them sequentially via the simulated transport, and time the rounds. The simulation runs deterministically; the metrics it reports are exact rather than empirical. The "real" performance of the production deployment can be benchmarked separately.
6. Memory budget per node
Each MemberInfo is roughly 100 bytes (node id, status, generation, timestamp). For 250 members, that's 25KB per node, which is trivial. The interesting limit is in the gossip payload: a full-state exchange sends 250 × 100 bytes = 25KB, which at 1Hz is 25KB/s outbound per node for each peer contacted per round (with a similar amount inbound, since push-pull returns the peer's state). Sub-cluster peering or delta-state gossip reduces this; the project's required minimum is the 25KB/s baseline, with optimization noted as a self-assessed improvement.
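The budget arithmetic in this hint is simple enough to keep as a helper next to the scale simulation. A minimal sketch, assuming full-state gossip (every exchange carries the whole membership table) and the hint's estimate of 100 bytes per entry; the function names are hypothetical.

```rust
/// Bytes needed to hold (or send) the full membership table.
pub fn full_state_bytes(members: u64, bytes_per_member: u64) -> u64 {
    members * bytes_per_member
}

/// Bytes per second a node sends when it pushes its full state to `fanout`
/// peers in each of `rounds_per_sec` gossip rounds.
pub fn outbound_bytes_per_sec(
    members: u64,
    bytes_per_member: u64,
    rounds_per_sec: u64,
    fanout: u64,
) -> u64 {
    full_state_bytes(members, bytes_per_member) * rounds_per_sec * fanout
}
```

With 250 members at 100 bytes each, one round per second, and a single peer per round, this reproduces the 25KB/s baseline; raising the fanout to 3 triples it, which is where delta-state gossip starts to pay off.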
Source Anchors
- Demers et al., "Epidemic Algorithms for Replicated Database Maintenance" (PODC 1987) — original gossip protocol analysis
- Das, Gupta, Motivala, "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (DSN 2002) — SWIM paper
- HashiCorp Serf documentation — production reference for SWIM-style membership (Serf is written in Go, but the design carries over directly to Rust)
- DDIA 2nd Edition, Chapter 6 (anti-entropy section) and Chapter 10 (coordination services)
Module 06 — Scheduling and Resource Management
"Today the system handles roughly 800 satellite passes per day. The next-quarter plan adds 30 more satellites and three more ground stations, projecting 1,400 passes per day. The existing scheduler — a Python script that runs first-fit placement on a single thread with no priority awareness — already drops roughly 4% of pass-window jobs during peak hours."
Mission Context
The previous five modules built the foundations: failure model, replication, consensus, fault tolerance, coordination. This final module covers the discipline that puts work onto resources: scheduling (which job runs next) and resource allocation (which machine runs it). Where Modules 1-5 made the system correct under failure, Module 6 makes the system efficient under load.
The three lessons cover three layers of the problem. Scheduling algorithms (Lesson 1) — FIFO, priority, EDF, fair-share, MLFQ — are the policy primitives that decide ordering when jobs compete for the same resource. Resource allocation (Lesson 2) — bin-packing, multi-dimensional matching, headroom, overcommitment, eviction — is the placement layer that maps jobs to machines. Work stealing and task migration (Lesson 3) — Chase-Lev deques, in-process redistribution, cluster-level migration — handle the imbalance that emerges after placement.
The track converges in the capstone project. The Pass Window Scheduler is the integration of every pattern in this module and several from earlier modules: priority+EDF ordering with fair-share across regions; best-fit multi-dimensional placement with headroom and constraints; work-stealing across scheduler workers; cluster-level task migration when load becomes imbalanced. The scheduler is what would replace the Python prototype that currently drops 4% of pass-window jobs.
The lessons are synthesis-heavy because scheduling is treated only briefly in DDIA and the canonical references (Borg paper, Kubernetes scheduler docs, operating systems texts) are scattered. Source notes call out where claims are synthesized versus cited.
Lessons
| # | Title | Source |
|---|---|---|
| 1 | Scheduling Algorithms | Synthesis + Liu & Layland 1973 + OS texts |
| 2 | Resource Allocation and Bin Packing | Synthesis + Borg paper + Kubernetes docs |
| 3 | Work Stealing and Task Migration | Blumofe & Leiserson 1994 + Chase & Lev 2005 + tokio docs |
Project
Pass Window Scheduler — implement a production-quality scheduler that integrates priority+EDF discipline, best-fit multi-dimensional placement, work-stealing across scheduler workers, and cluster-level migration. The required acceptance criteria include a simulated 1,400-jobs-per-day workload meeting 99.9% deadline compliance, region-fair throughput, and no scheduler thread bottleneck. The capstone for the entire Distributed Systems track.
Position
Module 6 of 6 in the Distributed Systems track — the final module.
What You Should Be Able to Do After This Module
- Choose a scheduling discipline (FIFO, priority+aging, EDF, fair-share, MLFQ, or a combination) appropriate for a given workload and explain the failure modes each is and is not vulnerable to.
- Identify scheduling pathologies in production systems by name: starvation, priority inversion, convoy effects, thundering herd, deadline-miss cascades.
- Apply bin-packing algorithms (first-fit, best-fit, worst-fit, decreasing variants) and choose the right one based on workload characteristics.
- Reason about the utilization vs latency tradeoff using queueing theory's 1/(1-ρ) formula, and pick target utilizations matched to each workload's latency budget.
- Compose headroom, overcommitment, and eviction policies as complementary mechanisms rather than as alternatives.
- Distinguish in-process work stealing (fine-grained, continuous) from cluster-level task migration (coarse-grained, infrequent), and recognize the workloads where neither is appropriate.
Source Materials
- Liu & Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment" (Journal of the ACM, 1973) — the EDF and rate-monotonic original.
- Blumofe & Leiserson, "Scheduling Multithreaded Computations by Work Stealing" (FOCS 1994; JACM 1999) — the canonical work-stealing paper.
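- Chase & Lev, "Dynamic Circular Work-Stealing Deque" (SPAA 2005) — the standard lock-free work-stealing deque. crossbeam-deque (used by Rayon) implements it; the per-worker run queues in Tokio and Go are related fixed-size designs.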
- Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — the canonical reference for integrated cluster scheduling.
- Coffman, Garey, Johnson, "Approximation Algorithms for Bin Packing: A Survey" (in Hochbaum, Approximation Algorithms for NP-Hard Problems, 1996) — bin-packing theory reference.
- Kubernetes' scheduler documentation (kubernetes.io/docs/concepts/scheduling-eviction/) — the practical production reference for filtering, scoring, taints/tolerations, affinity.
- Tokio runtime documentation (tokio.rs/blog) — the production Rust work-stealing implementation.
- A general operating-systems text (Tanenbaum, Modern Operating Systems; Silberschatz, Operating System Concepts) for the MLFQ and OS-scheduler-level material.
Track-level synthesis note: Foundations of Scalable Systems was unavailable as a source during authoring. The scheduling-and-resource-management chapters of that book are the natural companion text to this module; the synthesized material here should be cross-checked against it before publication. Specific operational parameters (15% headroom, 30% migration threshold, work-stealing fanout) are illustrative defaults — production deployments should calibrate against actual workload characteristics.
Lesson 1: Scheduling Algorithms
Context
The Constellation Network's compute grid runs in fragments: each ground station has a fleet of GPUs for on-site image processing; each satellite has a small embedded compute budget for in-orbit triage; the central catalog tier has a large CPU pool for batch reconciliation. Jobs arriving at the system — process this raster, predict that conjunction, reconcile this catalog — must be assigned to specific machines. The decision of which job runs where, and when, is the scheduling problem, and the answer determines whether the system meets its latency budgets, whether it utilizes hardware efficiently, and whether high-priority work (conjunction alerts) gets through during periods of high load.
The naïve scheduler is a single global queue with a single worker per machine: pull the next job off the queue, hand it to the next idle machine. This is FIFO, and it is sufficient for systems with one workload, one machine class, and no priority differentiation. The constellation has none of these properties. Conjunction alerts must preempt routine telemetry processing. Pass-window-scoped jobs have deadlines (they must complete before the next pass) that FIFO cannot honor. Some workloads (large image processing) belong on GPUs; others (text-shaped log analysis) belong on CPUs. The scheduler must reason about all of this.
This lesson covers the family of scheduling algorithms (FIFO, priority, fair-share, deadline-aware EDF, and multi-level feedback) and the tradeoffs each makes. By the end, you should be able to choose a scheduling discipline for a given workload, articulate the failure modes (starvation, priority inversion, convoy effects) each is and is not vulnerable to, and recognize when the right answer is to partition the workload across multiple schedulers rather than to find a single perfect algorithm.
Core Concepts
FIFO and Its Limits
First-in-first-out scheduling pulls jobs from a queue in arrival order. Each machine takes the next job when it becomes idle. The implementation is one shared queue per scheduling domain.
The properties:
- Fair in the sense of arrival order. Every job waits its turn; no job is favored over another by anything other than arrival time.
- Simple. The implementation is a single concurrent queue. No priority comparison, no preemption, no deadline tracking.
- Convoy effect prone. A long-running job at the head of the queue blocks the shorter jobs behind it. On a single worker, if a 10-second job is followed by ten 1-second jobs, every short job waits at least 10 seconds before it starts. With several workers pulling from the same queue the effect is diluted, but it returns whenever long jobs occupy every worker at once.
- No priority awareness. Critical jobs (conjunction alerts) wait behind routine jobs; the scheduler cannot distinguish.
FIFO is the right choice for workloads that are genuinely homogeneous and unconcerned with priority. For the catalog's background reconciliation pipeline — every job is the same shape and the system has hours to complete them — FIFO is the right tool. For anything that needs differential treatment, FIFO is the baseline that more sophisticated schedulers improve on.
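The convoy effect is easy to quantify for a single worker. The sketch below is illustrative only (the function names are hypothetical); it totals the time jobs spend waiting to start under arrival order and, for comparison, under shortest-first order, which assumes durations are known in advance.

```rust
/// Total queueing delay (time spent waiting before starting) when one worker
/// runs `durations` in the given order.
pub fn total_wait(durations: &[u64]) -> u64 {
    let mut clock: u64 = 0;
    let mut total: u64 = 0;
    for d in durations {
        total += clock; // this job waited until everything before it finished
        clock += d;
    }
    total
}

/// Same jobs, shortest first (requires knowing durations up front).
pub fn total_wait_shortest_first(durations: &[u64]) -> u64 {
    let mut sorted = durations.to_vec();
    sorted.sort_unstable();
    total_wait(&sorted)
}
```

For one 10-second job followed by ten 1-second jobs, arrival order accumulates 145 seconds of waiting across the eleven jobs (about 13 seconds each on average), while shortest-first accumulates 55 seconds (5 seconds each). The work done is identical; only the ordering changed.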
Priority Scheduling
Priority scheduling assigns each job a priority class; the scheduler always picks the highest-priority pending job. Within a priority class, jobs are typically FIFO.
The structure: instead of one queue, the scheduler has N queues, one per priority class. To pick the next job, scan from highest priority down and take the first non-empty queue's head.
The properties:
- High-priority work is unblocked by low-priority work. Conjunction alerts (P0) run before background reconciliation (P2) regardless of arrival order.
- Starvation risk for low-priority work. If high-priority jobs arrive faster than the system can process them, low-priority jobs may never run. The classic fix is priority aging: a job's effective priority increases the longer it waits. After waiting long enough, even a P2 job is effectively P0 and runs.
- Priority inversion. A low-priority job holds a lock that a high-priority job needs; the high-priority job waits behind the low-priority job, effectively inverted. The fix is priority inheritance: the lock-holder's priority is temporarily raised to match the highest-priority waiter. Linux supports this for real-time threads through priority-inheritance mutexes (PTHREAD_PRIO_INHERIT); most application-level schedulers do not.
The constellation uses three priority levels: P0 for safety-critical (conjunction alerts, anomaly responses), P1 for time-bounded operational (pass-window jobs), and P2 for background (reconciliation, archival). Priority aging triggers after 5 minutes in queue. There is no class below P2: jobs that don't fit one of the three levels are not scheduled.
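The "one queue per class, scan from the top" structure can be sketched in a few lines. This is a minimal illustration without aging or preemption; the `Class` enum and `PriorityQueues` type are hypothetical names chosen to mirror the three levels described above.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Class { P0 = 0, P1 = 1, P2 = 2 }

#[derive(Default)]
pub struct PriorityQueues {
    queues: [VecDeque<String>; 3], // index 0 = P0 (most urgent)
}

impl PriorityQueues {
    pub fn submit(&mut self, class: Class, job_id: &str) {
        self.queues[class as usize].push_back(job_id.to_string());
    }

    /// Scan from P0 downward and take the head of the first non-empty queue.
    /// Within a class the order is FIFO.
    pub fn pick(&mut self) -> Option<String> {
        self.queues.iter_mut().find_map(|q| q.pop_front())
    }
}
```

Submitting a P2 job, then a P1 job, then a P0 job yields them in the order P0, P1, P2 on successive calls to `pick`. Note that a steady stream of P0 submissions would keep the P2 queue waiting forever, which is the starvation case aging is meant to address.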
Fair-Share / Weighted Fair Scheduling
In multi-tenant systems, FIFO and priority schedulers can be hijacked by a single high-volume tenant. Fair-share scheduling allocates resources by tenant, with each tenant guaranteed a share regardless of how many jobs it submits.
The structure: each tenant has a quota (a share of system capacity). The scheduler tracks per-tenant consumption and prefers the tenant whose recent usage is furthest below quota. Hadoop YARN's Fair Scheduler, Apache Mesos's Dominant Resource Fairness allocator, and the Kubernetes API server's Priority and Fairness feature all use variants of this.
The key insight: fairness is a property of allocation over time, not of any single scheduling decision. A tenant that has used 80% of capacity in the last minute is "ahead of fair"; a tenant that has used 5% is "behind fair." The scheduler advances the behind-fair tenants until balance is restored.
Weighted fair scheduling generalizes: each tenant has a weight, and the fair share is proportional to the weight. Tenant A with weight 3 and tenant B with weight 1 get a 3:1 split of capacity under contention; if A submits fewer jobs than its share, the unused capacity goes to B.
The constellation's Pass Window Scheduler treats each region as a tenant: each region has a weight proportional to the number of satellites it tracks, and the scheduler ensures every region's pass-window jobs run on time even if one region (the densely-tracked North Atlantic) submits more jobs than another.
Deadline-Aware Scheduling: Earliest Deadline First (EDF)
When jobs have hard deadlines — "this conjunction prediction must complete before T+30 seconds, after which the result is useless" — the scheduler should prefer jobs with the earliest deadlines.
Earliest Deadline First (EDF) picks the pending job with the earliest deadline. On a single preemptive processor it is provably optimal: if any schedule meets all deadlines, EDF finds one.
The properties:
- Optimal for hard real-time. EDF is the theoretical baseline for real-time scheduling on a single processor (Liu & Layland 1973).
- Degrades badly under overload. When too many jobs are submitted to meet all deadlines, EDF doesn't degrade gracefully — it tries to meet every deadline and misses them in a cascading pattern. Real-time systems often add admission control: reject new jobs at submission time if the system cannot meet their deadlines.
- Mixed-criticality complications. EDF doesn't know that a P0 job's deadline is more important to meet than a P2 job's. In practice, deadline-aware schedulers combine deadlines with priority classes: within a priority class, EDF; across classes, priority dominates.
The constellation's pass-window scheduler is essentially EDF with priority classes. Conjunction alerts have priority over routine pass-window jobs; within either class, the job with the earliest deadline goes first.
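Admission control for EDF can be sketched with a simple feasibility check. The version below is a simplification under stated assumptions: one worker, every queued job ready to run now, and trustworthy runtime estimates. The names `Pending`, `edf_feasible`, and `try_admit` are hypothetical.

```rust
#[derive(Clone, Debug)]
pub struct Pending {
    pub id: String,
    pub runtime_s: u64,  // estimated execution time
    pub deadline_s: u64, // absolute deadline, in seconds from "now"
}

/// True if running `jobs` in EDF order on one worker meets every deadline.
pub fn edf_feasible(jobs: &[Pending]) -> bool {
    let mut ordered: Vec<&Pending> = jobs.iter().collect();
    ordered.sort_by_key(|j| j.deadline_s);
    let mut finish = 0u64;
    ordered.iter().all(|j| {
        finish += j.runtime_s; // completion time of this job in EDF order
        finish <= j.deadline_s
    })
}

/// Admission control: accept `candidate` only if the queue stays feasible.
pub fn try_admit(queue: &mut Vec<Pending>, candidate: Pending) -> bool {
    queue.push(candidate);
    if edf_feasible(queue.as_slice()) {
        true
    } else {
        queue.pop(); // reject: admitting it would cause at least one miss
        false
    }
}
```

Rejecting at submission time turns a cascade of late results into one explicit refusal that the submitter can act on, for example by routing the job to another region.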
Multi-Level Feedback Queue (MLFQ)
For workloads where job durations are unknown in advance — "is this job going to take 1 second or 1 hour?" — fixed-priority and deadline-aware schedulers cannot make optimal decisions. The multi-level feedback queue schedules without prior knowledge by adjusting a job's priority based on its observed behavior.
The structure:
- New jobs enter the highest-priority queue.
- The scheduler always runs from the highest non-empty queue.
- If a job uses its full time slice (i.e., is CPU-bound), it's demoted to the next lower queue with a longer time slice.
- Short jobs that complete within their time slice stay at high priority; long jobs gradually migrate down.
The effect: short jobs finish quickly (they don't get demoted), long jobs eventually get fair share at lower priority. Variants of this algorithm underlie the time-sharing schedulers in the BSDs, Solaris, and Windows. Linux's Completely Fair Scheduler takes a different route (weighted fair sharing over per-task virtual runtime) but pursues the same goal of favoring short, interactive work without knowing job lengths in advance.
The constellation doesn't use MLFQ at the application level: the workloads have known characteristics, so the scheduler can use explicit priority and deadline information. But the operating system underneath every machine runs its own time-sharing scheduler, and the application scheduler interacts with it in subtle ways (a job that the application scheduler considers low-priority may still be high-priority at the OS level if it's interactive).
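The MLFQ rules can be sketched as a small offline simulation. This is a teaching sketch, not an OS scheduler: all tasks are present at the start, there is no periodic priority boost, and the function name `run_mlfq` and its parameters are hypothetical.

```rust
use std::collections::VecDeque;

#[derive(Debug)]
struct Task { id: &'static str, remaining: u32 }

/// Simulate an MLFQ with one queue per entry in `quanta` (index 0 = highest
/// priority). Returns task ids in completion order.
pub fn run_mlfq(tasks: &[(&'static str, u32)], quanta: &[u32]) -> Vec<&'static str> {
    assert!(!quanta.is_empty() && quanta.iter().all(|&q| q > 0));
    let mut levels: Vec<VecDeque<Task>> = quanta.iter().map(|_| VecDeque::new()).collect();
    // New tasks enter the highest-priority queue.
    for &(id, remaining) in tasks {
        levels[0].push_back(Task { id, remaining });
    }
    let mut finished = Vec::new();
    // Always run from the highest non-empty queue.
    while let Some(level) = levels.iter().position(|q| !q.is_empty()) {
        let mut task = levels[level].pop_front().expect("queue is non-empty");
        let slice = quanta[level].min(task.remaining);
        task.remaining -= slice;
        if task.remaining == 0 {
            finished.push(task.id);
        } else {
            // The task used its whole slice, so demote it. The lowest queue
            // demotes to itself, which makes it plain round-robin.
            let next = (level + 1).min(levels.len() - 1);
            levels[next].push_back(task);
        }
    }
    finished
}
```

With quanta of 2, 4, and 8 units and tasks `batch` (10 units), `short-a` (1 unit), and `short-b` (2 units) submitted in that order, both short tasks complete before `batch` even though `batch` arrived first.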
When to Partition Workloads Rather Than Build One Scheduler
A common scheduler antipattern: trying to build one global scheduler that handles every workload, every priority class, every deadline, every machine type. The result is a scheduler that is complex, slow to tune, and rarely optimal for any individual workload.
The alternative is workload partitioning: separate schedulers (with separate queues, separate machines) for genuinely different workloads. The constellation runs three schedulers:
- Pass Window Scheduler — deadline-driven, priority-aware, region-fair. Handles jobs scoped to satellite passes (typically 5-15 minute windows).
- Conjunction Alert Scheduler — strict priority, no resource pre-allocation. Handles safety-critical jobs that must preempt everything else.
- Reconciliation Scheduler — FIFO across a dedicated pool of low-priority machines. Handles background work that runs whenever capacity is available.
Each scheduler is simpler than a single unified scheduler would be. The cost is some inefficiency at the boundaries — a machine assigned to reconciliation cannot be used for pass-window jobs even if the pass-window scheduler is overloaded. The benefit is operational comprehensibility: when something goes wrong, the on-call knows which scheduler is involved and what its rules are.
Scheduler Failure Modes
A few specific operational failure modes worth recognizing:
Starvation. Some class of jobs never runs because the scheduler always prefers another. Fix: priority aging or fair-share guarantees.
Priority inversion. A high-priority job is blocked by a low-priority job holding a shared resource. Fix: priority inheritance or lock-free designs.
Convoy effects. A small number of long jobs cause many short jobs to queue behind them, producing poor response times for the short jobs even when overall throughput is unchanged. Fix: separate queues for short vs long jobs, or preemption.
Thundering herd. Many jobs become eligible to run at the same instant (e.g., a scheduled batch fires). Fix: jitter in the scheduling tick or rate-limited release.
Stale-deadline accumulation. Jobs whose deadlines have already passed continue to occupy queue slots, displacing fresh work. Fix: admission control plus aggressive aging-out of expired jobs.
Resource fragmentation. Available capacity is spread across machines such that no single machine has enough for the next job. Fix: bin-packing-aware scheduling or workload consolidation, covered in Lesson 2.
The catalog has hit four of these six in production over the past two years. The operational playbook for each is a specific scheduling discipline; the discipline is configured at design time, not discovered during the incident.
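As one concrete example, the aging-out half of the stale-deadline fix can be sketched with an ordered map. The layout is an assumption made for illustration: the queue is keyed by `(deadline, sequence)` with deadlines as integer timestamps, and `drop_expired` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

/// Remove every job whose deadline is at or before `now` and return the
/// expired job ids in deadline order. Keys are (deadline, sequence) so jobs
/// with equal deadlines can coexist.
pub fn drop_expired(queue: &mut BTreeMap<(u64, u64), String>, now: u64) -> Vec<String> {
    // split_off leaves keys below (now + 1, 0) in `queue` and returns the rest,
    // i.e. the jobs whose deadline is still in the future.
    let live = queue.split_off(&(now.saturating_add(1), 0));
    let expired = std::mem::replace(queue, live);
    expired.into_values().collect()
}
```

Running this before each pick keeps expired work from displacing fresh jobs; the returned ids can be reported to submitters as explicit deadline misses rather than silently lingering in the queue.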
Code Examples
A Priority-Aging Scheduler
    use std::collections::BinaryHeap;
    use std::time::{Duration, Instant};

    // Ord is derived so that (ScheduleKey, Job) tuples can live in a
    // BinaryHeap; the Job ordering only breaks ties between identical keys.
    #[derive(Clone, Debug, Eq, PartialEq, Ord, PartialOrd)]
    pub struct Job {
        pub id: String,
        pub priority: u8, // higher = more important
        pub submitted_at: Instant,
    }

    #[derive(Eq, PartialEq)]
    struct ScheduleKey {
        effective_priority: u8,
        submitted_at: Instant,
    }

    impl Ord for ScheduleKey {
        fn cmp(&self, other: &Self) -> std::cmp::Ordering {
            // Higher priority first; within priority, earlier submission first.
            self.effective_priority
                .cmp(&other.effective_priority)
                .then_with(|| other.submitted_at.cmp(&self.submitted_at))
        }
    }

    impl PartialOrd for ScheduleKey {
        fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
            Some(self.cmp(other))
        }
    }

    pub struct AgingScheduler {
        queue: BinaryHeap<(ScheduleKey, Job)>,
        aging_step: Duration, // priority increases by 1 per step waited
        max_priority: u8,
    }

    impl AgingScheduler {
        pub fn new(aging_step: Duration, max_priority: u8) -> Self {
            Self { queue: BinaryHeap::new(), aging_step, max_priority }
        }

        pub fn submit(&mut self, job: Job) {
            let key = ScheduleKey {
                effective_priority: job.priority,
                submitted_at: job.submitted_at,
            };
            self.queue.push((key, job));
        }

        /// Pick the next job, applying aging to all queued jobs first. This is
        /// O(n) per pick; production implementations use a more efficient
        /// data structure (a tree indexed by 'next aging tick').
        pub fn pick(&mut self) -> Option<Job> {
            let now = Instant::now();
            // Take the entries out, age them, and re-heapify in O(n).
            let mut updated = std::mem::take(&mut self.queue).into_vec();
            for (key, job) in updated.iter_mut() {
                let waited = now.duration_since(job.submitted_at);
                let aging_steps =
                    (waited.as_secs_f64() / self.aging_step.as_secs_f64()) as u8;
                key.effective_priority = job
                    .priority
                    .saturating_add(aging_steps)
                    .min(self.max_priority);
            }
            self.queue = BinaryHeap::from(updated);
            self.queue.pop().map(|(_, job)| job)
        }
    }

    fn main() {
        let mut sched = AgingScheduler::new(Duration::from_secs(60), 10);
        let now = Instant::now();
        sched.submit(Job {
            id: "P3-old".into(),
            priority: 3,
            // Backdated by 10 minutes. Instant subtraction may panic if the
            // platform clock cannot represent the result (for example on a
            // host that booted less than 10 minutes ago).
            submitted_at: now - Duration::from_secs(600),
        });
        sched.submit(Job {
            id: "P5-fresh".into(),
            priority: 5,
            submitted_at: now,
        });
        let next = sched.pick();
        println!("scheduler picked: {:?}", next.map(|j| j.id));
        // P3-old has aged 10 priority steps (10 minutes / 1 minute each): 3 + 10
        // = 13, capped at 10. P5-fresh is still at 5. P3-old wins.
    }
The aging mechanism prevents starvation: even very-low-priority jobs eventually rise to compete with high-priority ones. The cost is implementation complexity and the need to re-prioritize the queue periodically.
Earliest-Deadline-First With Priority Classes
    use std::collections::BTreeMap;
    use std::time::{Duration, Instant};

    #[derive(Clone, Debug)]
    pub struct DeadlineJob {
        pub id: String,
        pub priority_class: u8, // higher = more important
        pub deadline: Instant,
    }

    #[derive(Default)]
    pub struct PriorityEDFScheduler {
        // Map of priority class -> BTreeMap of (deadline, sequence) -> job.
        // Within a class, BTreeMap ordering gives EDF: smallest deadline first.
        // The sequence number keeps two jobs with the same deadline from
        // overwriting each other.
        queues: BTreeMap<u8, BTreeMap<(Instant, u64), DeadlineJob>>,
        next_seq: u64,
    }

    impl PriorityEDFScheduler {
        pub fn new() -> Self {
            Self::default()
        }

        pub fn submit(&mut self, job: DeadlineJob) {
            let key = (job.deadline, self.next_seq);
            self.next_seq += 1;
            self.queues.entry(job.priority_class).or_default().insert(key, job);
        }

        /// Pick the highest-priority class's earliest-deadline job.
        pub fn pick(&mut self) -> Option<DeadlineJob> {
            // Scan classes from highest priority to lowest and pop the
            // earliest deadline from the first non-empty one.
            self.queues
                .values_mut()
                .rev()
                .find(|q| !q.is_empty())
                .and_then(|q| q.pop_first())
                .map(|(_, job)| job)
        }
    }

    fn main() {
        let mut sched = PriorityEDFScheduler::new();
        let now = Instant::now();
        sched.submit(DeadlineJob {
            id: "routine-1".into(),
            priority_class: 1,
            deadline: now + Duration::from_secs(60),
        });
        sched.submit(DeadlineJob {
            id: "alert-1".into(),
            priority_class: 5,
            deadline: now + Duration::from_secs(120),
        });
        sched.submit(DeadlineJob {
            id: "routine-2".into(),
            priority_class: 1,
            deadline: now + Duration::from_secs(30),
        });
        // alert-1 (class 5) outranks both routine jobs (class 1) regardless of deadline.
        let next = sched.pick();
        println!("first pick: {:?}", next.map(|j| j.id)); // alert-1
        let next = sched.pick();
        println!("second pick: {:?}", next.map(|j| j.id)); // routine-2 (earlier deadline)
    }
The two-level structure encodes the policy: priority dominates across classes; within a class, EDF orders by deadline. This is the standard pattern for mixed-criticality real-time scheduling.
Weighted Fair-Share Scheduling
    use std::collections::{HashMap, VecDeque};

    #[derive(Clone, Debug)]
    pub struct TenantJob {
        pub tenant: String,
        pub job_id: String,
    }

    #[derive(Default)]
    pub struct FairShareScheduler {
        weights: HashMap<String, f64>,
        consumed: HashMap<String, f64>, // virtual time consumed per tenant
        queues: HashMap<String, VecDeque<TenantJob>>,
    }

    impl FairShareScheduler {
        pub fn new() -> Self {
            Self::default()
        }

        pub fn add_tenant(&mut self, tenant: &str, weight: f64) {
            self.weights.insert(tenant.to_string(), weight);
            self.consumed.entry(tenant.to_string()).or_insert(0.0);
            self.queues.entry(tenant.to_string()).or_default();
        }

        pub fn submit(&mut self, job: TenantJob) {
            // Tenants that were never registered still get a consumption
            // entry (and the default weight of 1.0) so their jobs are picked.
            self.consumed.entry(job.tenant.clone()).or_insert(0.0);
            self.queues.entry(job.tenant.clone()).or_default().push_back(job);
        }

        /// Pick the tenant whose virtual-time consumption is lowest relative to
        /// their weight. This is a virtual-time (stride-style) weighted
        /// fair-share algorithm.
        pub fn pick(&mut self) -> Option<TenantJob> {
            // Find the tenant with the smallest consumed/weight ratio that has a
            // pending job. Lower ratio = further behind fair = picked next.
            let next_tenant = self
                .consumed
                .iter()
                .filter(|(t, _)| self.queues.get(*t).map_or(false, |q| !q.is_empty()))
                .min_by(|(t1, c1), (t2, c2)| {
                    let w1 = self.weights.get(*t1).copied().unwrap_or(1.0);
                    let w2 = self.weights.get(*t2).copied().unwrap_or(1.0);
                    (**c1 / w1).total_cmp(&(**c2 / w2))
                })
                .map(|(t, _)| t.clone())?;
            let job = self.queues.get_mut(&next_tenant)?.pop_front()?;
            // Advance the picked tenant's virtual time. The increment is the
            // "size" of the work; for simplicity here we treat each job as 1 unit.
            *self.consumed.entry(next_tenant).or_insert(0.0) += 1.0;
            Some(job)
        }
    }

    fn main() {
        let mut sched = FairShareScheduler::new();
        sched.add_tenant("pacific", 3.0); // weight 3
        sched.add_tenant("indian", 1.0); // weight 1
        // Both tenants submit more than they can run in 12 picks, so they stay
        // in contention for the whole demonstration.
        for i in 0..12 {
            sched.submit(TenantJob { tenant: "pacific".into(), job_id: format!("p-{}", i) });
            sched.submit(TenantJob { tenant: "indian".into(), job_id: format!("i-{}", i) });
        }
        // Pacific has 3x the weight; over the next 12 picks, expect 9 pacific, 3 indian.
        let mut counts = (0, 0);
        for _ in 0..12 {
            let job = sched.pick().unwrap();
            if job.tenant == "pacific" {
                counts.0 += 1;
            } else {
                counts.1 += 1;
            }
        }
        println!("pacific: {}, indian: {}", counts.0, counts.1); // 9, 3
    }
The weighted fair-share gives each tenant a guaranteed share under contention. When the queue is unbalanced (tenant A has many jobs, tenant B has none), tenant A gets all the capacity it can use; when both compete, they get their proportional share.
Key Takeaways
- FIFO is fair-by-arrival but vulnerable to convoy effects and priority blindness. It's the right tool only for genuinely homogeneous workloads with no priority differentiation.
- Priority scheduling lets high-priority work preempt routine work, at the cost of starvation risk for low-priority work. Priority aging is the standard defense against starvation; priority inheritance is the defense against priority inversion.
- Fair-share scheduling allocates resources by tenant proportion rather than per-job FIFO. Weighted variants give each tenant a guaranteed share that scales with the weight; unused capacity flows to other tenants on demand.
- Earliest Deadline First is optimal for hard real-time on a single processor but degrades badly under overload. Combine with priority classes for mixed-criticality workloads; add admission control to prevent the cascade-miss failure mode.
- When workloads are genuinely different, multiple schedulers are simpler and more operable than one universal scheduler. Partition by workload class and accept some inefficiency at the boundaries.
Source note: This lesson synthesizes from training knowledge plus the canonical scheduling references. Liu & Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment" (Journal of the ACM, 1973) is the EDF original. Multi-level feedback queue traces to Corbató et al.'s CTSS work in the 1960s. Fair-share scheduling at scale traces through the Hadoop Fair Scheduler and the Linux Completely Fair Scheduler. DDIA does not treat scheduling directly; this content is synthesized and should be cross-checked against an operating systems text (Tanenbaum, Modern Operating Systems) or a scheduler-specific reference. Foundations of Scalable Systems was unavailable; the operational guidance here would normally cite that text.
Lesson 2: Resource Allocation and Bin Packing
Context
The Constellation Network's compute fleet runs at 38% average CPU utilization. Half the machines are idle most of the time; the other half are full enough that new pass-window jobs are routinely rejected for lack of capacity. The configuration is correct in the sense that no single machine is overloaded; the configuration is wrong in the sense that the fleet has more aggregate capacity than the workload requires, but the workload is poorly distributed across it. The catalog spends roughly $480k per quarter on machines that are mostly idle, while the pass-window scheduler regularly drops jobs for lack of room.
The problem is not the scheduling discipline from Lesson 1 (priorities and deadlines are working fine). The problem is placement: when a new job arrives, which machine should run it? The naive answer — "any machine with enough free capacity" — produces the observed distribution: the first-fit machine accumulates jobs until full; subsequent machines get the residual; one or two machines stay mostly empty because no individual job needs the room they offer. This is a bin-packing problem, and the answers differ in important ways from the scheduling problems of Lesson 1.
This lesson covers resource allocation algorithms — first-fit, best-fit, worst-fit, bin-packing approximations — and the operational practices (overcommitment, headroom reservation, eviction) that make them work in real systems. By the end, you should be able to choose a placement algorithm for a given workload, recognize the fragmentation patterns that cause utilization gaps, and understand the tradeoff between high utilization (good for cost) and low queuing delay (good for latency).
Core Concepts
The Bin-Packing Problem
Classical bin-packing: given items of various sizes, pack them into the minimum number of bins of fixed capacity. The problem is NP-hard in general. Real schedulers don't solve it optimally; they use approximation algorithms.
First-Fit. Place each item in the first bin that has room. Simple, fast, online (handles items as they arrive without seeing the full sequence). Known to use at most 1.7× the optimal number of bins for any input.
Best-Fit. Place each item in the bin with the smallest residual that still fits. Tends to fill bins tightly, leaving small remainders. It has the same worst-case ratio as first-fit (1.7× optimal) and is often slightly better in practice.
Worst-Fit. Place each item in the bin with the largest residual. Distributes items across bins, leaving large remainders everywhere. Better for handling subsequent large items but produces poorer utilization overall.
Decreasing variants (First-Fit-Decreasing, Best-Fit-Decreasing). Sort items by size, largest first, before placing. Improves the asymptotic approximation ratio to 11/9 (about 1.22×) of optimal, substantially better than the online variants, but requires knowing all items in advance.
For online scheduling (items arriving over time), best-fit-decreasing-by-priority is a useful approximation: when placing a new job, prefer the machine with the smallest residual that still fits, breaking ties by the priority of jobs already running there. This concentrates load on a smaller number of machines, leaving others free for large incoming jobs.
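The four heuristics differ only in how they choose among the bins that fit, which a short sketch makes concrete. This is a single-dimension illustration with integer sizes; the `Fit` enum and the `pack` and `pack_decreasing` functions are hypothetical names.

```rust
#[derive(Clone, Copy)]
pub enum Fit { First, Best, Worst }

/// Pack `items` online into bins of size `capacity`. Returns the free space
/// left in each opened bin, so `len()` is the number of bins used.
pub fn pack(items: &[u32], capacity: u32, fit: Fit) -> Vec<u32> {
    let mut free: Vec<u32> = Vec::new();
    for &item in items {
        assert!(item <= capacity, "item larger than a bin");
        let candidates = free.iter().enumerate().filter(|&(_, &f)| f >= item);
        let chosen = match fit {
            Fit::First => candidates.map(|(i, _)| i).next(),
            Fit::Best => candidates.min_by_key(|&(_, &f)| f).map(|(i, _)| i),
            Fit::Worst => candidates.max_by_key(|&(_, &f)| f).map(|(i, _)| i),
        };
        match chosen {
            Some(i) => free[i] -= item,
            None => free.push(capacity - item), // nothing fits: open a new bin
        }
    }
    free
}

/// Offline variant: sort items largest-first, then pack.
pub fn pack_decreasing(items: &[u32], capacity: u32, fit: Fit) -> Vec<u32> {
    let mut sorted = items.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    pack(&sorted, capacity, fit)
}
```

For the item sequence 5, 7, 5, 2, 4, 2, 5, 1, 6 with bins of size 10, all three online variants open 5 bins, while first-fit-decreasing needs only 4, which is optimal here because the items sum to 37.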
Multi-Dimensional Bin Packing
Real machines have multiple resource dimensions: CPU, memory, disk, network, GPU. A job requires some quantity of each. Bin packing along one dimension is the classical problem; along multiple dimensions, it gets considerably harder.
The standard heuristic is dot-product (alignment) packing. Each job is characterized by a vector of resource requirements; each machine by a vector of available capacity. The job fits if every dimension is satisfied. The placement algorithm picks the machine whose remaining capacity most closely matches the job's profile, minimizing the "waste" along dimensions where the job doesn't fully consume the residual. Dominant resource fairness (DRF) applies the same vector view to the separate question of sharing capacity fairly between tenants.
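Kubernetes' default scheduler works in the same spirit. It first filters out nodes that cannot run the pod (insufficient resources, node selectors, taints without matching tolerations), then scores the remaining candidates on multiple criteria (resource fit, affinity, anti-affinity) and picks the highest-scoring node. The resource-fit score is configurable: the default LeastAllocated strategy spreads load across nodes, while MostAllocated packs nodes tightly in the best-fit style described here.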
The constellation's pass-window scheduler treats CPU, memory, and GPU as separate dimensions. A telemetry-processing job that wants 4 CPU cores and 8GB RAM is placed on the machine whose residual most closely matches (4 CPU, 8GB, 0 GPU). This concentrates GPU-using jobs on GPU machines (because non-GPU jobs find their best match on non-GPU machines) and avoids the failure mode where a CPU-heavy job lands on a GPU machine, displacing future GPU-hungry workloads.
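One way to express "residual most closely matches" is to sum the leftover in each dimension as a fraction of the machine's total and pick the smallest. The sketch below makes that choice explicitly; it is a normalized-leftover best fit rather than a literal dot product, and the `Resources`, `Machine`, and `place` names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Resources { pub cpu: f64, pub mem_gb: f64, pub gpu: f64 }

impl Resources {
    fn fits(&self, req: &Resources) -> bool {
        self.cpu >= req.cpu && self.mem_gb >= req.mem_gb && self.gpu >= req.gpu
    }
}

pub struct Machine { pub name: &'static str, pub total: Resources, pub free: Resources }

/// Leftover after placing `req`, with each dimension expressed as a fraction
/// of the machine's total capacity and summed. Lower means a tighter fit.
fn waste(free: &Resources, total: &Resources, req: &Resources) -> f64 {
    let frac = |f: f64, r: f64, t: f64| if t > 0.0 { (f - r) / t } else { 0.0 };
    frac(free.cpu, req.cpu, total.cpu)
        + frac(free.mem_gb, req.mem_gb, total.mem_gb)
        + frac(free.gpu, req.gpu, total.gpu)
}

/// Best fit across all dimensions: among machines that fit, minimise waste.
pub fn place<'a>(machines: &'a [Machine], req: &Resources) -> Option<&'a Machine> {
    machines
        .iter()
        .filter(|m| m.free.fits(req))
        .min_by(|a, b| {
            waste(&a.free, &a.total, req).total_cmp(&waste(&b.free, &b.total, req))
        })
}
```

Because an idle GPU counts as a full unit of waste for a job that requests no GPU, CPU-only jobs are steered toward CPU-only machines whenever one fits, which is the concentration effect described above.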
Overcommitment and Headroom
Statistical multiplexing — the same idea that powers networking — applies to compute too. If you allocate every machine to its full nominal capacity, then any spike pushes the system into degraded performance. If you reserve headroom — typically 10-25% of capacity — the spikes are absorbed without spilling.
The flip side: overcommitment intentionally allocates more capacity than the machine has, on the theory that not all allocated jobs will use their full reservation simultaneously. Cloud providers do this aggressively; a "4 CPU" instance often shares hardware with other "4 CPU" instances on the assumption that most are not fully utilizing their allocation at any given moment.
The two practices are complementary. Headroom on individual machines prevents per-machine failure; overcommitment at the fleet level extracts more aggregate utilization. The numbers must be operationally calibrated: too much overcommitment and the spikes correlate (everyone needs CPU at the same time, and someone gets throttled); too little and the fleet is underutilized.
The catalog uses 15% per-machine headroom and 130% fleet-level overcommitment. The combination matches the observed workload profile: spikes are uncorrelated enough that 30% overcommitment is safe, and 15% per-machine headroom absorbs the burstiness of individual jobs.
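The two numbers apply to different quantities, which is easy to lose track of. A minimal sketch of the arithmetic, assuming headroom caps measured usage per machine while overcommitment caps the sum of reservations across the fleet; the function names are hypothetical and the fleet size in the usage note is made up for illustration.

```rust
/// Highest measured usage a machine should carry once headroom is held back.
/// `headroom_pct` of 15 means 15% of capacity is kept free.
pub fn usage_ceiling(capacity: u64, headroom_pct: u64) -> u64 {
    capacity * (100 - headroom_pct.min(100)) / 100
}

/// Total reservations the fleet may hand out under an overcommit ratio.
/// `overcommit_pct` of 130 means 130% of physical capacity.
pub fn reservation_budget(fleet_capacity: u64, overcommit_pct: u64) -> u64 {
    fleet_capacity * overcommit_pct / 100
}

/// Admit a new reservation only if the fleet-wide budget still covers it.
pub fn admit(reserved: u64, request: u64, budget: u64) -> bool {
    reserved.saturating_add(request) <= budget
}
```

For a hypothetical fleet of ten 8,000-millicore machines, `usage_ceiling(8000, 15)` is 6,800 millicores per machine and `reservation_budget(80_000, 130)` is 104,000 millicores of reservations fleet-wide.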
Eviction and Preemption
When the fleet is genuinely overloaded — the headroom is consumed and there's nowhere to put the next high-priority job — the scheduler has two choices: reject the job or evict an existing one to make room.
Hard eviction terminates a running job. The work in progress is lost; the job may be retried later when capacity returns. This is what Kubernetes does to lower-priority pods under node pressure.
Soft eviction sends a signal asking the job to pause or terminate gracefully. The job has a chance to checkpoint state, return partial results, or release resources cleanly. Useful when the work is non-trivial to redo.
Suspension pauses the job in place (writing its state to disk if necessary), runs the higher-priority job, then resumes. Conceptually clean; operationally complex.
The constellation uses hard eviction for low-priority batch work and soft eviction with a 30-second grace period for medium-priority work. P0 jobs (conjunction alerts) can preempt any lower-priority job; the lesson here is that the eviction policy is part of the priority model and should be explicit, not implicit.
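Making the policy explicit can be as small as one function that every preemption path must call. The sketch below assumes the numbering used in the module project (0 = P0, 1 = P1, 2 = P2, lower number wins) and assumes any class may preempt a strictly lower one; the text above only states this for P0, so treat the P1-over-P2 case as an assumption. `Eviction` and `eviction_for` are hypothetical names.

```rust
use std::time::Duration;

/// How a victim job is removed to make room.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Eviction {
    /// Terminate immediately; in-progress work is lost.
    Hard,
    /// Signal the job and wait up to `grace` before terminating it.
    Soft { grace: Duration },
}

/// Returns the eviction to apply when `preemptor` needs the capacity held by
/// `victim`, or None if the preemptor is not allowed to evict it.
/// Priority classes: 0 = P0, 1 = P1, 2 = P2 (lower number = higher priority).
pub fn eviction_for(preemptor: u8, victim: u8) -> Option<Eviction> {
    if preemptor >= victim {
        return None; // equal or lower priority never preempts
    }
    match victim {
        // Medium-priority work gets a 30-second grace period to checkpoint.
        1 => Some(Eviction::Soft { grace: Duration::from_secs(30) }),
        // Low-priority batch work is terminated outright.
        _ => Some(Eviction::Hard),
    }
}
```

Because P0 is the highest class, no preemptor can ever satisfy `preemptor < 0`, so P0 jobs are never evicted by construction rather than by a separate rule.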
Affinity, Anti-Affinity, and Constraints
Real placement decisions are not just about resource sizes. Some constraints:
Affinity. This job should run on the same machine as that one (data locality), or in the same rack (network locality), or in the same region. Implementations: Kubernetes' nodeAffinity and podAffinity; HashiCorp Nomad's constraints.
Anti-affinity. This job should NOT run on the same machine as that one (redundancy, blast-radius limitation). Two replicas of the same service should be on different machines so a single failure doesn't take both down.
Tagging / node selectors. This job needs a GPU, this job needs SSD, this job needs to be in the EU for data-residency reasons. The scheduler filters candidate machines by tag match.
Taints and tolerations. Some machines are reserved for specific workloads (e.g., dedicated to streaming pipelines); only jobs that explicitly tolerate the taint can be placed there. Kubernetes uses this; the pattern is a defense against accidental placement on specialized machines.
The constellation's pass-window scheduler treats ground station as a hard constraint (a pass-window job must run on the station that owns the pass) and GPU availability as a tagged requirement. The scheduler's placement algorithm filters candidates by constraint first, then applies bin-packing on the remaining set.
Utilization vs Latency Tradeoff
There is a fundamental tradeoff in placement that does not have a free-lunch resolution. Pushing utilization toward 100% — packing jobs as tightly as possible — produces high utilization but increases queuing delay (the next job is more likely to wait). Keeping utilization lower (say, 60-70%) reduces queuing delay but leaves more capacity idle.
The math, from queueing theory: for an M/M/1 queue, the mean time a job spends in the system (waiting plus service) is the mean service time multiplied by 1 / (1 - ρ), where ρ is the utilization. At ρ=0.5 the multiplier is 2; at ρ=0.9 it is 10; at ρ=0.95 it is 20, ten times the ρ=0.5 figure for less than double the utilization. This is the queueing latency cliff that separates "healthy under load" from "in distress under load."
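The multiplier is cheap to compute when sizing a target utilization. A minimal sketch under the M/M/1 assumption (Poisson arrivals, exponential service times, one server); `slowdown` is a hypothetical helper name, and real workloads with burstier arrivals will do worse than this.

```rust
/// M/M/1 mean time in system, expressed as a multiple of the mean service
/// time: 1 / (1 - rho). Returns None outside 0 <= rho < 1, where the queue
/// is either undefined or grows without bound.
pub fn slowdown(rho: f64) -> Option<f64> {
    if (0.0..1.0).contains(&rho) {
        Some(1.0 / (1.0 - rho))
    } else {
        None
    }
}
```

For example, a service with a 10 ms mean service time and a 100 ms response budget has a multiplier budget of 10, which this model reaches at ρ=0.9 with no margin left for variance.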
The operational implication: target utilization should match the latency budget. A workload with strict latency requirements (sub-100ms response time) should target 60-70% utilization at most. A batch workload tolerant of queueing can target 85-90%. Trying to push the latency-sensitive workload to 90% utilization is the canonical failure that produces "the system worked fine yesterday and is timing out today."
The catalog runs the conjunction-alert pipeline at 50% utilization (because alerts can't wait) and the reconciliation pipeline at 85% (because reconciliation can wait). The two share underlying capacity but the scheduler's target utilization differs by workload class.
Code Examples
A Best-Fit Bin Packing Placement
use std::collections::HashMap;

#[derive(Clone, Debug)]
pub struct ResourceRequest {
    pub cpu_millicores: u32,
    pub memory_mb: u32,
}

#[derive(Clone, Debug)]
pub struct Machine {
    pub id: String,
    pub capacity: ResourceRequest,
    pub used: ResourceRequest,
}

impl Machine {
    pub fn residual(&self) -> ResourceRequest {
        ResourceRequest {
            cpu_millicores: self.capacity.cpu_millicores.saturating_sub(self.used.cpu_millicores),
            memory_mb: self.capacity.memory_mb.saturating_sub(self.used.memory_mb),
        }
    }

    pub fn fits(&self, req: &ResourceRequest) -> bool {
        let r = self.residual();
        r.cpu_millicores >= req.cpu_millicores && r.memory_mb >= req.memory_mb
    }

    /// Sum of leftover across both dimensions. Lower = tighter fit = better.
    /// Millicores and megabytes are added unweighted, which is enough for a
    /// demo. saturating_sub keeps this panic-free if `fits` was not checked.
    pub fn slack_score(&self, req: &ResourceRequest) -> u64 {
        let r = self.residual();
        r.cpu_millicores.saturating_sub(req.cpu_millicores) as u64
            + r.memory_mb.saturating_sub(req.memory_mb) as u64
    }
}

pub struct BestFitPlacer {
    machines: HashMap<String, Machine>,
}

impl BestFitPlacer {
    pub fn new(machines: Vec<Machine>) -> Self {
        Self {
            machines: machines.into_iter().map(|m| (m.id.clone(), m)).collect(),
        }
    }

    /// Pick the machine with the smallest residual that still fits the request.
    /// This concentrates load on a smaller set of machines, leaving others free
    /// for jobs that need more room.
    pub fn place(&mut self, req: ResourceRequest) -> Option<String> {
        let best = self
            .machines
            .values()
            .filter(|m| m.fits(&req))
            .min_by_key(|m| m.slack_score(&req))
            .map(|m| m.id.clone())?;
        let machine = self.machines.get_mut(&best)?;
        machine.used.cpu_millicores += req.cpu_millicores;
        machine.used.memory_mb += req.memory_mb;
        Some(best)
    }

    pub fn release(&mut self, machine_id: &str, req: ResourceRequest) {
        if let Some(m) = self.machines.get_mut(machine_id) {
            m.used.cpu_millicores = m.used.cpu_millicores.saturating_sub(req.cpu_millicores);
            m.used.memory_mb = m.used.memory_mb.saturating_sub(req.memory_mb);
        }
    }
}

fn main() {
    let machines = vec![
        Machine {
            id: "edge-a".into(),
            capacity: ResourceRequest { cpu_millicores: 8000, memory_mb: 16000 },
            // 2 CPU, 4GB residual
            used: ResourceRequest { cpu_millicores: 6000, memory_mb: 12000 },
        },
        Machine {
            id: "edge-b".into(),
            capacity: ResourceRequest { cpu_millicores: 8000, memory_mb: 16000 },
            // 7 CPU, 14GB residual
            used: ResourceRequest { cpu_millicores: 1000, memory_mb: 2000 },
        },
    ];
    let mut placer = BestFitPlacer::new(machines);

    // A small job (1 CPU, 2GB) should land on edge-a (tighter fit), leaving
    // edge-b free for a future larger job.
    let placed = placer.place(ResourceRequest { cpu_millicores: 1000, memory_mb: 2000 });
    println!("placed on: {:?}", placed); // edge-a
}
The best-fit heuristic concentrates load. Worst-fit would have placed the job on edge-b, spreading load more evenly but eating into the largest residual in the fleet. Either is defensible; the choice depends on the expected job mix. If large jobs are expected, favor best-fit, because it keeps big residuals intact for them. If jobs are uniformly small and per-machine burst room matters more, favor worst-fit, because it keeps every machine away from its ceiling.
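The two heuristics differ by a single comparison, which a side-by-side sketch makes visible. This is a standalone illustration with a hypothetical `Residual` type rather than an extension of `BestFitPlacer`; the slack metric is the same unweighted sum used above.

```rust
/// Free capacity left on one machine (hypothetical type for this sketch).
#[derive(Clone, Debug)]
pub struct Residual {
    pub id: &'static str,
    pub cpu_millicores: u32,
    pub memory_mb: u32,
}

/// Leftover after placing the request, or None if the request does not fit.
fn slack(r: &Residual, cpu: u32, mem: u32) -> Option<u64> {
    let c = r.cpu_millicores.checked_sub(cpu)?;
    let m = r.memory_mb.checked_sub(mem)?;
    Some(u64::from(c) + u64::from(m))
}

/// Best-fit: the feasible machine with the least leftover.
pub fn best_fit(rs: &[Residual], cpu: u32, mem: u32) -> Option<&'static str> {
    rs.iter()
        .filter_map(|r| slack(r, cpu, mem).map(|s| (s, r.id)))
        .min_by_key(|&(s, _)| s)
        .map(|(_, id)| id)
}

/// Worst-fit: the feasible machine with the most leftover.
pub fn worst_fit(rs: &[Residual], cpu: u32, mem: u32) -> Option<&'static str> {
    rs.iter()
        .filter_map(|r| slack(r, cpu, mem).map(|s| (s, r.id)))
        .max_by_key(|&(s, _)| s)
        .map(|(_, id)| id)
}
```

On the edge-a/edge-b residuals from the example, a 1 CPU / 2GB request goes to edge-a under best-fit and edge-b under worst-fit; a request that only edge-b can hold goes to edge-b under both.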
Headroom-Aware Placement
#[derive(Clone, Debug)]
pub struct ResourceRequest {
    pub cpu_millicores: u32,
    pub memory_mb: u32,
}

#[derive(Clone, Debug)]
pub struct Machine {
    pub id: String,
    pub capacity: ResourceRequest,
    pub used: ResourceRequest,
}

impl Machine {
    /// True if the request fits within the machine's full nominal capacity.
    pub fn fits(&self, r: &ResourceRequest) -> bool {
        self.used.cpu_millicores.saturating_add(r.cpu_millicores) <= self.capacity.cpu_millicores
            && self.used.memory_mb.saturating_add(r.memory_mb) <= self.capacity.memory_mb
    }
}

pub struct HeadroomPlacer {
    machines: Vec<Machine>,
    headroom_pct: u32, // e.g., 15 for 15%
}

impl HeadroomPlacer {
    /// Capacity available for placement once the headroom is held back.
    pub fn effective_capacity(&self, m: &Machine) -> ResourceRequest {
        // Saturate so a misconfigured headroom above 100 cannot underflow,
        // and multiply in u64 so large capacities cannot overflow.
        let keep = u64::from(100u32.saturating_sub(self.headroom_pct));
        let scale = |v: u32| (u64::from(v) * keep / 100) as u32;
        ResourceRequest {
            cpu_millicores: scale(m.capacity.cpu_millicores),
            memory_mb: scale(m.capacity.memory_mb),
        }
    }

    pub fn fits_with_headroom(&self, m: &Machine, req: &ResourceRequest) -> bool {
        let cap = self.effective_capacity(m);
        m.used.cpu_millicores.saturating_add(req.cpu_millicores) <= cap.cpu_millicores
            && m.used.memory_mb.saturating_add(req.memory_mb) <= cap.memory_mb
    }

    /// Place a job; if no machine has room within the headroom budget, try
    /// again ignoring headroom (operational override for high-priority work).
    pub fn place(&self, req: &ResourceRequest, allow_headroom_override: bool) -> Option<String> {
        // First pass: respect headroom.
        let with_headroom = self
            .machines
            .iter()
            .find(|m| self.fits_with_headroom(m, req))
            .map(|m| m.id.clone());
        if with_headroom.is_some() {
            return with_headroom;
        }

        // Second pass: only for high-priority jobs allowed to override.
        // The job must still fit within the machine's nominal capacity.
        if allow_headroom_override {
            self.machines
                .iter()
                .find(|m| m.fits(req))
                .map(|m| m.id.clone())
        } else {
            None
        }
    }
}
The two-pass placement encodes the policy: routine jobs honor the headroom; P0 jobs can override it. The override is a deliberate choice — using up the headroom means losing the buffer against spikes — and is restricted to the workload class that justifies it.
A Tagged-Constraint Filter
use std::collections::HashSet;

#[derive(Clone, Debug)]
pub struct ResourceRequest {
    pub cpu_millicores: u32,
    pub memory_mb: u32,
}

#[derive(Clone, Debug)]
pub struct TaggedMachine {
    pub id: String,
    pub tags: HashSet<String>, // e.g., {"gpu", "ssd", "region=us-west"}
    pub capacity: ResourceRequest,
    pub used: ResourceRequest,
}

#[derive(Clone, Debug)]
pub struct ConstrainedJob {
    pub req: ResourceRequest,
    pub required_tags: HashSet<String>,
    pub anti_affinity_machines: HashSet<String>, // don't place here
}

pub fn place_with_constraints(
    machines: &[TaggedMachine],
    job: &ConstrainedJob,
) -> Option<String> {
    // Residual (cpu, memory). Saturating, so a machine whose usage exceeds
    // its capacity reads as full instead of panicking on underflow.
    fn residual(m: &TaggedMachine) -> (u32, u32) {
        (
            m.capacity.cpu_millicores.saturating_sub(m.used.cpu_millicores),
            m.capacity.memory_mb.saturating_sub(m.used.memory_mb),
        )
    }

    machines
        .iter()
        // All required tags must be present.
        .filter(|m| job.required_tags.iter().all(|t| m.tags.contains(t)))
        // Skip machines excluded by anti-affinity.
        .filter(|m| !job.anti_affinity_machines.contains(&m.id))
        // Capacity check.
        .filter(|m| {
            let (cpu, mem) = residual(m);
            cpu >= job.req.cpu_millicores && mem >= job.req.memory_mb
        })
        // Best-fit among the candidates: smallest total residual wins.
        .min_by_key(|m| {
            let (cpu, mem) = residual(m);
            u64::from(cpu) + u64::from(mem)
        })
        .map(|m| m.id.clone())
}
The filter-then-pack pattern is the standard structure for constrained placement: filter the candidate set by all hard constraints first, then apply the bin-packing heuristic to the remaining candidates. The filter is fast (set operations); the packing is the more interesting algorithm.
Key Takeaways
- Bin-packing is NP-hard in general; production schedulers use approximation algorithms. First-fit and best-fit are the standard online algorithms; decreasing variants improve the approximation ratio at the cost of requiring full knowledge of items in advance.
- Multi-dimensional resource allocation (CPU + memory + GPU + ...) uses dot-product matching: pick the machine whose residual capacity vector most closely matches the job's requirement vector. Kubernetes and similar schedulers score candidates on several criteria in the same filter-then-score spirit.
- Headroom (reserved capacity per machine, typically 10-25%) and overcommitment (total fleet allocation exceeds physical capacity) are complementary practices. Headroom absorbs per-machine spikes; overcommitment extracts aggregate utilization from uncorrelated demand.
- Eviction policy is part of the priority model. Hard, soft, and suspension are the three mechanisms; the choice depends on the cost of redoing the evicted work versus the urgency of the preempting job.
- Utilization trades against latency. Target utilization should match the latency budget: latency-critical workloads cap below 70%; batch workloads can target 85-90%. Pushing latency-critical work toward 95% utilization is the canonical failure mode.
Source note: This lesson is synthesized from training knowledge plus the canonical references. Bin-packing theory traces to Johnson et al. (1974) for the first approximation analyses; "Bin Packing" by Coffman, Garey, Johnson (1996) is the survey reference. Multi-dimensional packing for clusters is treated in the Borg paper (Verma et al., EuroSys 2015) and Kubernetes' scheduler documentation. The queueing-latency formula 1/(1-ρ) is standard M/M/1 queue theory. DDIA does not treat scheduling or placement directly. Foundations of Scalable Systems and an operating-systems text are the natural references and were not available; cross-check before publication.
Lesson 3: Work Stealing and Task Migration
Context
The Pass Window Scheduler from Lessons 1 and 2 makes placement decisions when jobs arrive. Once placed, jobs run on their assigned machine until completion. This works when placement decisions are good and workloads are predictable. It does not work when one machine ends up with several long-running jobs while neighboring machines sit idle, or when the cost of a job turns out to be much larger than estimated at placement time, or when a machine's effective capacity changes mid-flight (a GPU thermal throttle, a noisy-neighbor VM, a brief hardware degradation).
The mechanism for handling these cases is work redistribution after placement — either pulling work toward idle resources (work stealing) or pushing work away from saturated ones (task migration). Tokio's runtime, the Go scheduler, Cilk, and most modern parallel-task frameworks use work stealing internally. Cluster schedulers like Borg and Kubernetes use migration for less granular cases. The two mechanisms are different in detail but solve the same problem: load imbalance that emerges after initial placement.
This lesson covers the two patterns, the data structures that make them efficient (Chase-Lev deques, ABA-resistant queues), and the operational tradeoffs (migration cost, locality loss, cache effects). By the end, you should be able to recognize when work stealing is the right pattern, understand why tokio's multi-threaded scheduler is built the way it is, and choose between fine-grained intra-process stealing and coarse-grained cross-machine migration for a given workload.
Core Concepts
Why Initial Placement Is Not Enough
Initial placement decisions are necessarily approximate. The scheduler estimates a job's resource requirements before placing it, but estimates can be wrong: a "5-minute" reconciliation that turns out to take 50 minutes, a "1 GB" data load that grows to 10 GB after expansion, a CPU-bound job that turns out to be memory-bound. Misestimation produces load imbalance that placement cannot anticipate.
Environmental changes also matter. A machine's effective capacity changes over time: thermal throttling, sibling VMs becoming noisy, network congestion to a downstream service that turns CPU-bound jobs into I/O-waiting ones. The placement decision was correct at the time; it's wrong now.
Finally, work arrival patterns vary. A region experiences a burst of jobs while another sits idle. Placement decisions made during the burst land on different machines than they would have during the quiet period. The cumulative effect is imbalance even when each individual decision was reasonable.
The fix is post-placement redistribution. Two patterns dominate.
Work Stealing: Pull-Based Redistribution
In work stealing, idle workers proactively pull work from busy workers. The pattern was popularized by Cilk in 1995 (Blumofe & Leiserson) and is now standard in modern thread pools.
The structure:
- Each worker has a local task queue (typically a double-ended queue / deque).
- The worker pushes and pops tasks from one end (the "back") in LIFO order. LIFO is cache-friendly: recently-created tasks reference recently-touched data.
- When a worker has no local tasks, it picks a random other worker and steals from the opposite end of that worker's deque (the "front"). FIFO from the front means stealing oldest tasks, which are typically the coarsest-grained.
- The owner pushes/pops one end without coordination; thieves pop from the other end. This is what makes work stealing efficient — the steal operation only needs synchronization when the owner is near the same end as the thief.
The Chase-Lev deque (Chase & Lev 2005) is the standard lock-free implementation. The owner's push/pop are nearly synchronization-free; the thief's steal uses CAS to detect conflicts. Rayon builds on a Chase-Lev deque (via crossbeam-deque). Tokio's multi-threaded runtime and the Go scheduler use fixed-size per-worker run queues with batch stealing, a close relative of the same design.
The properties:
- Excellent cache locality. Each worker reuses tasks it created, keeping working set local. Only stealing crosses cache boundaries.
- No central coordinator. Each worker steals from random others; the load balances naturally across the pool.
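- Asymptotic optimality. Blumofe and Leiserson show that a work-stealing scheduler on P workers runs a fully strict computation in expected time T1/P + O(T∞), where T1 is the total work and T∞ is the critical-path length. That is near-linear speedup whenever the workload has ample parallelism.
- Cost is in steal attempts. When everyone is busy, idle workers may attempt many steals before succeeding (or before they get new local work). Production runtimes bound this overhead by backing off between failed attempts, often exponentially, and eventually parking the idle worker.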
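The backoff mentioned in the last property can be sketched with the standard library alone. The helper names `backoff_delay` and `steal_with_backoff`, the 1 µs base, and the 1 ms cap are all assumptions for illustration; real runtimes typically spin briefly and then park rather than sleep.

```rust
use std::time::Duration;

/// Delay before retrying after failed steal attempt number `attempt`
/// (0-based). Doubles from `base` on each attempt and is capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating once the shift would overflow a u32.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Call `try_steal` up to `max_attempts` times, sleeping with exponential
/// backoff after each failure. Returns the first stolen item, if any.
pub fn steal_with_backoff<T>(
    mut try_steal: impl FnMut() -> Option<T>,
    max_attempts: u32,
) -> Option<T> {
    for attempt in 0..max_attempts {
        if let Some(task) = try_steal() {
            return Some(task);
        }
        std::thread::sleep(backoff_delay(
            attempt,
            Duration::from_micros(1),
            Duration::from_millis(1),
        ));
    }
    None
}
```

The cap matters as much as the doubling: without it, a worker that has been idle for a while would react slowly when work finally appears.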
The constellation's per-machine scheduler — the userspace runtime that distributes work across CPU cores within a single machine — is built on tokio's work-stealing pool. The cluster-level scheduler operates at a coarser granularity.
Task Migration: Push-Based Cross-Machine Redistribution
At the cluster level, work stealing's fine granularity isn't right. Moving a job from one machine to another is expensive: the job's working memory must be copied, file descriptors must be migrated or recreated, and the job typically must be restarted (or at least paused and resumed). With per-migration overhead in the hundreds of milliseconds or more, you don't want to migrate every time you observe imbalance.
The cluster pattern is task migration: a monitoring layer periodically observes load distribution and migrates specific jobs to rebalance. The migration is intentional and infrequent, triggered by significant imbalance rather than continuous balancing.
The decision logic:
- Identify a hotspot machine (utilization significantly above the fleet average).
- Identify a target machine (utilization significantly below the fleet average, fits the job's requirements).
- Pick a migratable job on the hotspot. Some jobs are not migratable (long-running stateful tasks with significant in-memory state); others migrate easily (stateless batch jobs).
- Pause the job on the hotspot, transfer its state to the target, resume on the target.
The migration cost (time during which the job is paused or running degraded) is the main tradeoff. Kubernetes' descheduler approximates this pattern by evicting pods from overloaded nodes so the scheduler can place them again elsewhere; no state is transferred. Mesos handles it explicitly via its framework API; Borg's reschedule rate is one of the operational metrics that bounds the system's churn.
The constellation's cluster scheduler migrates low-priority batch jobs (which can tolerate the pause-and-resume) when the utilization spread across the fleet, busiest machine minus least busy, exceeds 30 percentage points. P0 and P1 jobs are not migrated except in extreme cases, because for them the migration cost is higher than the imbalance cost.
Hybrid Patterns: Local Stealing + Cluster Migration
The two patterns are complementary, not alternatives. Modern systems use both at different scales:
- Within a process, work stealing across CPU cores. Tokio, Go, Rayon. Fine-grained; high frequency; low cost.
- Within a cluster, occasional task migration. Kubernetes, Borg, the constellation's pass-window scheduler. Coarse-grained; low frequency; significant cost.
- Between availability zones or regions, almost never. The cost of moving large jobs across regions is enormous; place correctly the first time.
The pattern: as granularity coarsens, redistribution frequency drops. Task-level stealing inside a process can happen millions of times per second; job migration happens a few times per minute; cross-region migration happens during planned maintenance, not autonomously.
The Cost of Migration
Migration is not free. The specific costs:
Pause and resume. The job stops on the source, doesn't do work for the migration duration, and resumes on the target. For latency-critical workloads, this pause is the entire SLO budget.
State transfer. The job's in-memory state must be moved. For a 4 GB job, this is bandwidth and time — and CPU on both ends to serialize and deserialize.
Connection re-establishment. Long-lived TCP connections, file handles, database sessions — all must be recreated. Some can be migrated transparently (Linux's criu can checkpoint and restore many connection states); most require the job to re-establish.
Cache cold-start. The target machine's local caches (CPU L1/L2/L3, page cache, application-level caches) are cold for the migrated job. The first few seconds run slower than steady-state.
Locality loss. If the job was placed near its data (same machine, same rack, same region), the migration may break that locality, increasing cross-network latency for the job's data accesses.
For most workloads, these costs are tolerable as long as migration is rare. For latency-critical workloads, they may exceed the latency budget, in which case migration is the wrong tool.
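A back-of-the-envelope estimate is usually enough to decide whether a migration can fit inside a budget. The sketch below models only the two costs that are easy to quantify up front, state transfer and reconnection; cache cold-start and locality loss are left out. All names and the link speed in the usage note are hypothetical.

```rust
use std::time::Duration;

/// Time to move `state_bytes` over a link sustaining `bytes_per_sec`.
/// Returns None for a zero-bandwidth link.
pub fn transfer_time(state_bytes: u64, bytes_per_sec: u64) -> Option<Duration> {
    if bytes_per_sec == 0 {
        return None;
    }
    Some(Duration::from_secs_f64(state_bytes as f64 / bytes_per_sec as f64))
}

/// Estimated pause: state transfer plus time to re-establish connections.
/// Serialization CPU, cache warm-up, and locality loss are not modeled.
pub fn migration_pause(
    state_bytes: u64,
    bytes_per_sec: u64,
    reconnect: Duration,
) -> Option<Duration> {
    Some(transfer_time(state_bytes, bytes_per_sec)? + reconnect)
}

/// Migration is only a candidate if the estimated pause fits the budget.
pub fn within_budget(pause: Duration, budget: Duration) -> bool {
    pause <= budget
}
```

A 4 GB job over a link sustaining 1.25 GB/s (roughly 10 Gbit/s) transfers in about 3.2 seconds before any reconnection time is added, which is fine for batch work and far outside any sub-second latency budget.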
Stealing Granularity and Locality Tradeoffs
Work stealing's overhead depends on granularity. Stealing one task at a time gives the most even balance but pays the synchronization cost on every steal. Stealing in batches amortizes that cost at the price of coarser balance: Tokio moves roughly half of the victim's queue per steal, and some parallel frameworks steal far larger chunks (a thousand tasks at a time).
The locality tradeoff: tasks that share data benefit from running on the same worker. A naive work-stealing scheduler might steal a task that needs data the original worker just loaded into cache. The data has to move to the new worker, which is slower than if the task had stayed.
Modern work-stealing schedulers add locality controls: tasks can declare data they need, and the scheduler prefers to keep them on the worker that produced the data. Tokio's closest equivalent is LocalSet (tokio::task::LocalSet), which runs a group of tasks on a single thread so they never migrate; some HPC frameworks expose more granular locality affinities.
The constellation's image-processing pipeline runs on tokio's standard work-stealing pool for most tasks but uses a LocalSet for stages that have heavy CPU-cache-resident state. The tradeoff is deliberate: lose some load-balancing benefit for the heavy stages in exchange for keeping the working set local.
When Not to Steal or Migrate
A few cases where the redistribution patterns are wrong:
Workloads with strict ordering. If task B must execute after task A and on the same machine, stealing B to another worker breaks the ordering or the locality. Document and enforce the ordering at the scheduler level.
Workloads with significant per-machine warm state. A long-running database, a model server that has loaded a multi-GB model — migrating these is much more expensive than tolerating some imbalance.
Latency-critical real-time workloads. Migration introduces a pause; stealing introduces queueing variance. If the latency budget is sub-millisecond, neither is appropriate.
Workloads with affinity constraints not visible to the scheduler. A job that has cached significant data from a specific upstream service; the scheduler doesn't know about the cache and migrates the job, breaking the implicit affinity.
In each case, the right answer is to not redistribute. The discipline is to know when redistribution is helpful and when it produces more harm than good.
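One way to enforce that discipline is to make the exclusions a checked property of the job rather than tribal knowledge. The sketch below is an assumption-heavy illustration: `JobProfile`, its fields, and the thresholds passed to `may_redistribute` are all hypothetical, and the hidden-affinity case only works if an operator records it in `pinned_by_operator`.

```rust
use std::time::Duration;

/// Facts about a job that bear on whether it may be stolen or migrated.
#[derive(Clone, Debug)]
pub struct JobProfile {
    /// Must run after, and on the same machine as, a peer task.
    pub ordered_with_peer: bool,
    /// Warm in-memory state that would have to be rebuilt or moved.
    pub warm_state_mb: u64,
    /// End-to-end latency budget for the job's requests.
    pub latency_budget: Duration,
    /// Operator-declared affinity the scheduler cannot observe itself.
    pub pinned_by_operator: bool,
}

/// True only if none of the four exclusions applies.
/// `max_state_mb` and `min_budget` are policy knobs chosen by the operator.
pub fn may_redistribute(job: &JobProfile, max_state_mb: u64, min_budget: Duration) -> bool {
    !job.ordered_with_peer
        && !job.pinned_by_operator
        && job.warm_state_mb <= max_state_mb
        && job.latency_budget >= min_budget
}
```

The default when a fact is unknown should be the conservative one: a job with no profile is treated as not redistributable until someone says otherwise.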
Code Examples
A Simplified Work-Stealing Deque
use std::collections::VecDeque;
use std::sync::Mutex;

pub struct WorkStealingDeque<T> {
    // Mutex-protected for simplicity; production implementations (Chase-Lev)
    // are lock-free with atomic operations on indices.
    inner: Mutex<VecDeque<T>>,
}

impl<T> WorkStealingDeque<T> {
    pub fn new() -> Self {
        Self { inner: Mutex::new(VecDeque::new()) }
    }

    /// Owner: push to the back. LIFO with pop_back means recent tasks first.
    pub fn push(&self, task: T) {
        self.inner.lock().unwrap().push_back(task);
    }

    /// Owner: pop from the back. Cache-friendly.
    pub fn pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_back()
    }

    /// Thief: steal from the front. FIFO from the front means stealing oldest
    /// tasks, which are typically coarsest-grained (better steal value).
    pub fn steal(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }

    pub fn len(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}

fn main() {
    let deque: WorkStealingDeque<u32> = WorkStealingDeque::new();

    // Owner pushes tasks
    for i in 0..10 {
        deque.push(i);
    }

    // Owner pops (LIFO)
    println!("owner pops: {:?}", deque.pop()); // 9 (most recent)
    println!("owner pops: {:?}", deque.pop()); // 8

    // Thief steals (FIFO from other end)
    println!("thief steals: {:?}", deque.steal()); // 0 (oldest)
    println!("thief steals: {:?}", deque.steal()); // 1

    println!("remaining: {}", deque.len()); // 6
}
The asymmetry is the point: the owner operates at one end (LIFO for cache benefits), the thief at the other (FIFO for steal-value). In the lock-free Chase-Lev variant, these operations require synchronization only when the deque is nearly empty and owner and thief converge on the same task.
A Work-Stealing Pool of Workers
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

pub type Task = Box<dyn FnOnce() + Send>;

// Mutex-backed stand-in for the deque from the previous example.
struct WorkStealingDeque<T>(Mutex<VecDeque<T>>);

impl<T> WorkStealingDeque<T> {
    fn new() -> Self { Self(Mutex::new(VecDeque::new())) }
    fn push(&self, t: T) { self.0.lock().unwrap().push_back(t); }
    fn pop(&self) -> Option<T> { self.0.lock().unwrap().pop_back() }
    fn steal(&self) -> Option<T> { self.0.lock().unwrap().pop_front() }
}

pub struct Pool {
    workers: Vec<Arc<WorkStealingDeque<Task>>>,
    shutdown: Arc<AtomicBool>,
}

impl Pool {
    pub fn new(num_workers: usize) -> Self {
        let workers: Vec<Arc<WorkStealingDeque<Task>>> = (0..num_workers)
            .map(|_| Arc::new(WorkStealingDeque::new()))
            .collect();
        let shutdown = Arc::new(AtomicBool::new(false));

        for (i, worker) in workers.iter().enumerate() {
            let worker = worker.clone();
            let peers = workers.clone();
            let shutdown = shutdown.clone();
            thread::spawn(move || {
                run_worker(i, worker, peers, shutdown);
            });
        }
        Self { workers, shutdown }
    }

    /// Queue a task on worker `idx` (wrapped modulo the pool size).
    pub fn submit(&self, idx: usize, task: Task) {
        self.workers[idx % self.workers.len()].push(task);
    }
}

impl Drop for Pool {
    // Signal workers to exit; without this the flag is never set and the
    // worker threads keep polling until the process ends.
    fn drop(&mut self) {
        self.shutdown.store(true, Ordering::Relaxed);
    }
}

fn run_worker(
    own_idx: usize,
    own_queue: Arc<WorkStealingDeque<Task>>,
    peers: Vec<Arc<WorkStealingDeque<Task>>>,
    shutdown: Arc<AtomicBool>,
) {
    while !shutdown.load(Ordering::Relaxed) {
        // Try local queue first - cache-friendly.
        if let Some(task) = own_queue.pop() {
            task();
            continue;
        }

        // Local queue is empty; try stealing. Peers are scanned in index
        // order here; real runtimes start from a random victim.
        let mut stole = false;
        for (i, peer) in peers.iter().enumerate() {
            if i == own_idx {
                continue;
            }
            if let Some(task) = peer.steal() {
                task();
                stole = true;
                break;
            }
        }

        if !stole {
            // No work anywhere; brief sleep before retrying.
            // Production runtimes use parking with wakeup on submit;
            // sleep is a simplification.
            thread::sleep(Duration::from_millis(1));
        }
    }
}
This is the structural shape. Production implementations (tokio, rayon) are much more sophisticated: lock-free deques, work-stealing with backoff, park/unpark to avoid spinning when there's no work, and explicit task locality hints. The pattern, however, is the same as what's shown here.
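The park/unpark piece is the easiest of those refinements to show in isolation. Below is a minimal sketch using only the standard library's `Mutex` and `Condvar`: an idle worker blocks until a submit wakes it instead of polling on a 1 ms sleep. `ParkingQueue` is a hypothetical name, and for brevity it is a single shared queue rather than per-worker deques, so it illustrates the wakeup mechanism and not stealing itself.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// A shared task queue whose consumers park instead of spinning.
pub struct ParkingQueue<T> {
    tasks: Mutex<VecDeque<T>>,
    available: Condvar,
}

impl<T> ParkingQueue<T> {
    pub fn new() -> Self {
        Self { tasks: Mutex::new(VecDeque::new()), available: Condvar::new() }
    }

    /// Enqueue a task and wake one parked worker, if any.
    pub fn submit(&self, task: T) {
        self.tasks.lock().unwrap().push_back(task);
        self.available.notify_one();
    }

    /// Block until a task is available or `timeout` elapses.
    /// The timeout lets a worker periodically re-check a shutdown flag.
    pub fn take(&self, timeout: Duration) -> Option<T> {
        let guard = self.tasks.lock().unwrap();
        // Sleeps while the queue is empty; the condition is re-checked on
        // every wakeup, so spurious wakeups are handled.
        let (mut guard, _timed_out) = self
            .available
            .wait_timeout_while(guard, timeout, |q| q.is_empty())
            .unwrap();
        guard.pop_front()
    }
}
```

A parked worker consumes no CPU and wakes within scheduler latency of a submit, which is the property the 1 ms sleep loop only approximates.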
A Task Migration Decision
use std::collections::HashMap;
use std::time::Duration;

#[derive(Clone, Debug)]
pub struct MachineLoad {
    pub id: String,
    pub utilization: f64,
}

#[derive(Clone, Debug)]
pub struct MigrateableJob {
    pub id: String,
    pub estimated_size: f64, // expected utilization contribution
    pub migration_cost: Duration,
}

/// Returns (job id, source machine, target machine) if a migration is warranted.
pub fn should_migrate(
    machines: &[MachineLoad],
    jobs_by_machine: &HashMap<String, Vec<MigrateableJob>>,
    threshold: f64, // e.g., 0.30 = migrate when max - min exceeds 30 percentage points
) -> Option<(String, String, String)> {
    if machines.len() < 2 {
        return None;
    }

    // total_cmp gives a total order on f64, so a NaN reading cannot panic here.
    let max = machines.iter().max_by(|a, b| a.utilization.total_cmp(&b.utilization))?;
    let min = machines.iter().min_by(|a, b| a.utilization.total_cmp(&b.utilization))?;

    // Only migrate if the imbalance is significant.
    if max.utilization - min.utilization < threshold {
        return None;
    }

    // Pick the cheapest-to-move job on the hotspot.
    let jobs = jobs_by_machine.get(&max.id)?;
    let job = jobs.iter().min_by(|a, b| a.migration_cost.cmp(&b.migration_cost))?;

    // Verify the move doesn't itself produce imbalance the other way.
    if min.utilization + job.estimated_size > max.utilization - job.estimated_size {
        return None;
    }

    Some((job.id.clone(), max.id.clone(), min.id.clone()))
}

fn main() {
    let machines = vec![
        MachineLoad { id: "edge-a".into(), utilization: 0.92 },
        MachineLoad { id: "edge-b".into(), utilization: 0.45 },
        MachineLoad { id: "edge-c".into(), utilization: 0.50 },
    ];
    let mut jobs = HashMap::new();
    jobs.insert(
        "edge-a".to_string(),
        vec![MigrateableJob {
            id: "reconcile-batch-17".into(),
            estimated_size: 0.10,
            migration_cost: Duration::from_secs(15),
        }],
    );

    let decision = should_migrate(&machines, &jobs, 0.30);
    println!("migration decision: {:?}", decision);
    // Expect Some((job, from, to)) because edge-a (0.92) - edge-b (0.45) > 0.30.
}
The decision pattern is deliberately conservative: only migrate when imbalance is large; only migrate jobs with low cost; verify the move improves balance rather than just shifting the problem. Production migration loops add more checks (job age, eviction history, anti-thrash protections) on top of these.
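Of those extra checks, anti-thrash protection is the simplest to sketch: remember when each job last moved and refuse to move it again too soon. `MigrationCooldown` and `try_record` are hypothetical names, and the 10-minute interval in the usage note is an arbitrary example, not the constellation's setting.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks the last migration time per job to prevent ping-ponging.
pub struct MigrationCooldown {
    min_interval: Duration,
    last_moved: HashMap<String, Instant>,
}

impl MigrationCooldown {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_moved: HashMap::new() }
    }

    /// Returns true and records the move if `job_id` has not been migrated
    /// within `min_interval` of `now`; returns false otherwise.
    pub fn try_record(&mut self, job_id: &str, now: Instant) -> bool {
        match self.last_moved.get(job_id) {
            Some(&prev) if now.saturating_duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_moved.insert(job_id.to_string(), now);
                true
            }
        }
    }
}
```

A migration loop would call `try_record` on the job chosen by `should_migrate` and skip the move when it returns false. With `MigrationCooldown::new(Duration::from_secs(600))`, a job moved at time t cannot be moved again before t + 10 minutes.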
Key Takeaways
- Initial placement is approximate; redistribution after placement is necessary when workloads misestimate, environments change, or arrival patterns vary. Work stealing and task migration are the two redistribution mechanisms.
- Work stealing is the right pattern for fine-grained in-process work distribution: each worker has a local deque, owners use one end (LIFO, cache-friendly), thieves the other (FIFO, coarse-grained steal-value). Tokio, Rayon, and Go's scheduler all use this pattern.
- Task migration is the right pattern for coarse-grained cluster-level redistribution: less frequent, more expensive per operation, triggered by significant imbalance rather than continuous rebalancing.
- The two patterns layer cleanly: work stealing within a process for nanosecond-granularity, task migration within a cluster for second-granularity. Cross-region migration is rare and intentional.
- Both patterns have costs (steal overhead, migration pause). The discipline is to know when redistribution helps and when its costs exceed the imbalance it would correct. Latency-critical real-time workloads typically can't tolerate either.
Source note: This lesson is synthesized from training knowledge plus the canonical references. Blumofe & Leiserson, "Scheduling Multithreaded Computations by Work Stealing" (FOCS 1994; JACM 1999) is the canonical work-stealing paper. Chase & Lev, "Dynamic Circular Work-Stealing Deque" (SPAA 2005) is the standard lock-free deque. Tokio's runtime documentation (tokio.rs/blog) covers the production Rust implementation. Borg (Verma et al., EuroSys 2015) is the cluster-scale reference for migration. DDIA does not treat work stealing or task migration. Foundations of Scalable Systems and a parallel-programming text are the natural references and were not available; cross-check before publication.
Module 06 Project — Pass Window Scheduler
Mission Brief
Incident ticket CN-2708-002
Severity: P1 (track-defining)
Reporter: Constellation Operations, Pass Coordination
Status: Open
The constellation's pass windows are getting denser. Today the system handles roughly 800 satellite passes per day across 48 satellites and 12 ground stations; the next-quarter plan adds 30 more satellites and three more ground stations, projecting 1,400 passes per day. The existing scheduler — a Python script that runs first-fit placement on a single thread with no priority awareness — already drops roughly 4% of pass-window jobs during peak hours. Scaling 75% beyond current load will break the scheduler outright.
You are building the Pass Window Scheduler, a Rust crate that integrates the patterns from this entire module: priority+EDF scheduling discipline (Lesson 1), best-fit multi-dimensional bin-packing with headroom and constraints (Lesson 2), and work stealing across worker threads within the scheduler itself plus task migration when the cluster load becomes imbalanced (Lesson 3).
The deliverable is a scheduler that demonstrably handles 1,400 passes per simulated day with a 99.9% deadline-meeting rate, fair-share allocation across regions, and no scheduler thread becoming a bottleneck.
Repository Layout
pass-window-scheduler/
├── Cargo.toml
├── src/
│ ├── lib.rs
│ ├── job.rs # PassWindowJob, deadline, priority_class, resource req
│ ├── machine.rs # MachineFleet, capacity, used, residual
│ ├── placer.rs # Best-fit multi-dim with headroom and constraints
│ ├── queue.rs # Per-region priority+EDF queue with aging
│ ├── scheduler.rs # Scheduler: submit, tick, dispatch
│ ├── stealing.rs # Work-stealing across scheduler workers
│ └── migration.rs # Cluster-level migration decisions
├── tests/
│ ├── deadline_compliance.rs
│ ├── fair_share.rs
│ ├── bin_packing.rs
│ ├── work_stealing.rs
│ ├── migration_triggers.rs
│ └── simulation_scale.rs
└── README.md
Required API
// job.rs
use std::collections::HashSet;
use std::time::{Duration, Instant};
#[derive(Clone, Debug)]
pub struct PassWindowJob {
    pub id: String,
    pub region: String,               // "pacific", "atlantic", "indian", etc.
    pub priority_class: u8,           // 0 = P0 (alerts), 1 = P1 (pass-window), 2 = P2 (batch)
    pub deadline: Instant,            // absolute deadline by which the job must complete
    pub estimated_runtime: Duration,
    pub req: ResourceRequest,
    pub required_tags: HashSet<String>,
    pub submitted_at: Instant,
}
#[derive(Clone, Debug)]
pub struct ResourceRequest {
    pub cpu_millicores: u32,
    pub memory_mb: u32,
    pub gpu_count: u32,
}
// machine.rs
use std::collections::{HashMap, HashSet};
#[derive(Clone, Debug)]
pub struct Machine {
    pub id: String,
    pub region: String,
    pub tags: HashSet<String>,
    pub capacity: ResourceRequest,
    pub used: ResourceRequest,
    pub running: Vec<String>,         // job IDs currently running here
}
pub struct MachineFleet { /* ... */ }
// Signatures only; the bodies are the project work.
impl MachineFleet {
    pub fn add_machine(&mut self, m: Machine);
    pub fn machines(&self) -> &[Machine];
    pub fn machine(&self, id: &str) -> Option<&Machine>;
    pub fn utilization(&self) -> HashMap<String, f64>;
}
// scheduler.rs
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
// Result<T> is assumed here to be a crate-level alias over the crate's error type.
pub struct Scheduler { /* ... */ }
impl Scheduler {
    pub fn new(fleet: Arc<Mutex<MachineFleet>>, num_workers: usize, headroom_pct: u32) -> Self;
    pub fn submit(&self, job: PassWindowJob) -> Result<()>;
    pub fn tick(&self);               // run one scheduling cycle
    pub fn metrics(&self) -> SchedulerMetrics;
}
pub struct SchedulerMetrics {
    pub total_submitted: u64,
    pub total_completed: u64,
    pub deadline_misses: u64,
    pub queue_depth_by_priority: HashMap<u8, usize>,
    pub migrations_executed: u64,
    pub steal_attempts: u64,
    pub steal_successes: u64,
}
Acceptance Criteria
- cargo build --release completes without warnings under #![deny(warnings)].
- cargo test --release passes all integration tests with zero flakes across 20 consecutive runs.
- cargo clippy -- -D warnings produces no lints.
- Deadline compliance test: simulate 1,400 jobs per day distributed across the 12-region fleet with realistic resource shapes. Measure deadline-meeting rate; assert ≥99.9%.
- Priority test: with the queue saturated by P2 jobs, a newly-submitted P0 job is dispatched within one scheduler tick. Assert P0 latency p99 is bounded regardless of P2 backlog.
- Fair-share test: simulate two regions submitting jobs at different rates — pacific submits 2× indian's volume — with equal region weights. Assert that pacific jobs do not consume more than ~60% of capacity (some imbalance is tolerable, but pacific shouldn't fully starve indian).
- Weighted fair-share test: same setup but with pacific weight=3 and indian weight=1. Assert pacific gets approximately 75% of capacity.
- Best-fit placement test: submit jobs of varying sizes to a heterogeneous fleet. Assert that jobs are placed on machines with the tightest residual fit (verified by reading per-machine utilization after placement).
- Headroom respect test: with 15% headroom configured, normal-priority jobs do not push any machine past 85% utilization. P0 jobs CAN override headroom; assert this works correctly.
- Constraint filter test: GPU-tagged jobs are only placed on GPU machines; ground-station-tagged jobs (for a specific pass) are only placed at the requested ground station. Assert that constraint violations never occur.
- Work stealing test: with N scheduler workers, simulate skewed job submission (all jobs initially go to worker 0). Assert that worker 0's queue depth is eventually balanced with the others via stealing.
- Migration trigger test: create a fleet imbalance by manipulating per-machine utilization directly. Assert that the migration loop detects the imbalance (a utilization spread of more than 30 percentage points between machines) and initiates a migration. Assert that no migration is initiated when the spread is within bounds.
- Scale simulation: run the full simulated day with 1,400 jobs in no more than 60 seconds of wall-clock time (a 1440x acceleration, roughly three orders of magnitude faster than real time). Assert all metrics are within their respective bounds.
- (self-assessed) The README explains how the four mechanisms (priority+EDF, best-fit placement, work stealing, task migration) interact, and what failure modes the integrated system is and is not designed to handle.
- (self-assessed) The scheduler exposes per-queue and per-machine metrics so operators can diagnose bottlenecks. The README documents the metric set and recommended alerting thresholds.
- (self-assessed) The work-stealing implementation uses a Chase-Lev-style asymmetric deque (or equivalent lock-free structure) so the steal path doesn't contend with the owner's push/pop in the common case.
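The best-fit, headroom, and constraint-filter criteria above can be sketched in one placement function. This is a minimal sketch, not the required implementation: the Res and Machine types are trimmed stand-ins for the Required API structs, best_fit and leftover are hypothetical names, and the score (sum of leftover fractions across the three dimensions) is one plausible definition of "tightest residual fit" chosen for illustration. It also assumes that only priority class 0 may use the headroom.

```rust
use std::collections::HashSet;

// Trimmed stand-ins for ResourceRequest and Machine (assumption: only the
// fields needed for placement are kept).
#[derive(Clone, Copy, Debug)]
pub struct Res {
    pub cpu_millicores: u32,
    pub memory_mb: u32,
    pub gpu_count: u32,
}

#[derive(Clone, Debug)]
pub struct Machine {
    pub id: String,
    pub tags: HashSet<String>,
    pub capacity: Res,
    pub used: Res,
}

/// Fraction of one dimension left unused after placing `req`, or None if the
/// request would exceed `limit_pct` percent of capacity.
fn leftover(capacity: u32, used: u32, req: u32, limit_pct: u32) -> Option<f64> {
    let usable = capacity as u64 * limit_pct as u64 / 100;
    let after = used as u64 + req as u64;
    if after > usable {
        return None;
    }
    if capacity == 0 {
        return Some(0.0); // dimension absent on this machine and not requested
    }
    Some((usable - after) as f64 / capacity as f64)
}

/// Best fit: among machines that satisfy the tag constraints and have room,
/// pick the one with the smallest total leftover.
pub fn best_fit<'a>(
    machines: &'a [Machine],
    req: Res,
    required_tags: &HashSet<String>,
    headroom_pct: u32,
    priority_class: u8,
) -> Option<&'a Machine> {
    // Assumption: P0 jobs may consume the headroom, everything else may not.
    let limit_pct = if priority_class == 0 { 100 } else { 100 - headroom_pct.min(100) };
    let mut best: Option<(&Machine, f64)> = None;
    for m in machines {
        if !required_tags.is_subset(&m.tags) {
            continue; // constraint filter
        }
        let score = leftover(m.capacity.cpu_millicores, m.used.cpu_millicores, req.cpu_millicores, limit_pct)
            .zip(leftover(m.capacity.memory_mb, m.used.memory_mb, req.memory_mb, limit_pct))
            .zip(leftover(m.capacity.gpu_count, m.used.gpu_count, req.gpu_count, limit_pct))
            .map(|((cpu, mem), gpu)| cpu + mem + gpu);
        if let Some(s) = score {
            if best.map_or(true, |(_, b)| s < b) {
                best = Some((m, s));
            }
        }
    }
    best.map(|(m, _)| m)
}
```

With 15% headroom, a machine already at 75% CPU rejects a P1 job that would take it to 100%, while a P0 job with the same request is placed there because it is the tightest fit.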
Expected Output
cargo test --release simulation_scale -- --nocapture:
[setup] fleet: 90 machines across 12 regions
[setup] scheduler: 8 workers, 15% headroom, region-weighted fair-share
[setup] simulating 1,400 jobs over a day, accelerated 1440x (1 minute wall-time per simulated day)
[t=0.000s] simulation start
[t=15.000s] simulated 6 hours; 350 jobs completed; 0 deadline misses; avg utilization 62%
[t=30.000s] simulated 12 hours; 720 jobs completed; 0 deadline misses; avg utilization 68%
[t=45.000s] simulated 18 hours; 1,070 jobs completed; 1 deadline miss; avg utilization 71%
[t=60.000s] simulation complete
total submitted: 1,400
total completed: 1,400
deadline misses: 1 (0.07%)
migrations executed: 4
steal successes / steal attempts: 12,847 / 18,103 (71.0%)
per-region throughput (jobs/day): pacific=312, atlantic=298, indian=187, ...
PASS: 99.9% deadline compliance achieved
Hints
1. Order the four mechanisms in the request path
The order matters. On submit: (a) place the job into the per-region priority+EDF queue. On each scheduling tick: (b) each worker pulls the next job from its local queue (with aging applied); (c) it attempts placement via best-fit on the fleet, filtered by constraints; (d) on placement failure, the job stays queued (potentially with an aging boost); (e) on placement success, the job is dispatched to the machine. Separately: (f) every N ticks, check for fleet imbalance and trigger migration if needed. Work stealing operates between the per-worker queues and redistributes the scheduling work itself, not the jobs already running on the fleet.
2. Separate scheduling work from scheduled work
There are two distinct workloads in this project: the scheduler's own work (deciding where to place jobs) and the jobs themselves (running on the fleet). The work-stealing pool is for the scheduler's work — multiple worker threads each making placement decisions. The placement decisions themselves use the bin-packing algorithm against the fleet. Don't confuse them: tokio's work-stealing is for the scheduler workers; the fleet machines are the bin-packing targets, not work-stealing peers.
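To make the separation concrete, here is a minimal sketch of the scheduler-side queues only. It is deliberately a lock-based stand-in built on std (one Mutex<VecDeque> per worker), not the Chase-Lev deque the acceptance criteria ask for, and WorkerQueues is a hypothetical name. The items are units of scheduling work (for example, job IDs awaiting a placement decision); fleet machines never appear here.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// One queue of pending placement decisions per scheduler worker.
pub struct WorkerQueues<T> {
    queues: Vec<Mutex<VecDeque<T>>>,
}

impl<T> WorkerQueues<T> {
    pub fn new(num_workers: usize) -> Self {
        Self {
            queues: (0..num_workers).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }

    /// Enqueue a unit of scheduling work on a specific worker.
    pub fn push(&self, worker: usize, item: T) {
        self.queues[worker].lock().unwrap().push_back(item);
    }

    /// Owner path: take the newest local item. If the local queue is empty,
    /// steal the oldest item from the first non-empty peer.
    pub fn next(&self, worker: usize) -> Option<T> {
        if let Some(item) = self.queues[worker].lock().unwrap().pop_back() {
            return Some(item);
        }
        let n = self.queues.len();
        for offset in 1..n {
            let victim = (worker + offset) % n;
            if let Some(item) = self.queues[victim].lock().unwrap().pop_front() {
                return Some(item);
            }
        }
        None
    }

    /// Current depth of one worker's queue, useful for the skew test.
    pub fn depth(&self, worker: usize) -> usize {
        self.queues[worker].lock().unwrap().len()
    }
}
```

The owner and the thieves take from opposite ends, which is the shape a Chase-Lev deque preserves without the locks. In this project the local queue would be the priority+EDF queue rather than a plain deque, so the owner would pop by deadline; the sketch only shows where stealing sits.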
3. Deadlines and EDF in a priority+EDF queue
The EDF property within a priority class means the queue must be ordered by deadline (smallest first). A BTreeMap keyed by deadline gives O(log n) insertion and an O(log n) peek at the earliest deadline (the tree is shallow, so the peek is cheap in practice). Because two jobs can share a deadline, include a tie-breaker such as the job ID in the key. The priority dimension means you have N such BTreeMaps, one per class. Dispatching is: find the highest non-empty class, peek the smallest deadline, that's the next job. Aging interacts: when computing the effective priority class, factor in time waited. A job that should be P2 but has waited an hour effectively becomes P1.
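The queue described above could look roughly like this. It is a sketch under stated assumptions: Queued is a trimmed stand-in for PassWindowJob, PriorityEdfQueue and its methods are hypothetical names, the key carries a sequence number so that jobs with identical deadlines do not overwrite each other, and aging only promotes P2 to P1 (never into P0).

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

const NUM_CLASSES: usize = 3;

// Trimmed stand-in for PassWindowJob.
pub struct Queued {
    pub id: String,
    pub priority_class: u8,
    pub deadline: Instant,
    pub submitted_at: Instant,
}

pub struct PriorityEdfQueue {
    // One deadline-ordered map per priority class; index 0 is P0.
    classes: [BTreeMap<(Instant, u64), Queued>; NUM_CLASSES],
    next_seq: u64,
    aging_step: Duration,
}

impl PriorityEdfQueue {
    pub fn new(aging_step: Duration) -> Self {
        Self {
            classes: [BTreeMap::new(), BTreeMap::new(), BTreeMap::new()],
            next_seq: 0,
            aging_step,
        }
    }

    pub fn push(&mut self, job: Queued) {
        let class = (job.priority_class as usize).min(NUM_CLASSES - 1);
        self.classes[class].insert((job.deadline, self.next_seq), job);
        self.next_seq += 1;
    }

    /// Promote P2 jobs that have waited at least `aging_step` into P1.
    pub fn age(&mut self, now: Instant) {
        let step = self.aging_step;
        let (upper, lower) = self.classes.split_at_mut(2);
        let stale: Vec<(Instant, u64)> = lower[0]
            .iter()
            .filter(|(_, j)| now.saturating_duration_since(j.submitted_at) >= step)
            .map(|(k, _)| *k)
            .collect();
        for key in stale {
            if let Some(job) = lower[0].remove(&key) {
                upper[1].insert(key, job);
            }
        }
    }

    /// Highest non-empty class first, earliest deadline within that class.
    pub fn pop(&mut self) -> Option<Queued> {
        self.classes
            .iter_mut()
            .find_map(|m| m.pop_first().map(|(_, j)| j))
    }
}
```

Calling age once per tick before pop keeps the promotion logic out of the dispatch path. A promoted job keeps its original deadline, so it competes inside P1 on EDF terms rather than jumping the whole class.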
4. Per-machine utilization tracking
The fleet's utilization is updated on every placement (add the job's resource cost to the machine's used) and on every completion (subtract). Use atomic counters or fine-grained locks; the placement path is hot. A coarse-grained Mutex<HashMap<MachineId, Machine>> becomes a bottleneck at scale. It is acceptable within the project's required scope, but a design meant to scale further would give each machine its own RwLock so that reading one machine's state does not block placement on another.
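As an illustration of the atomic-counter option, the sketch below tracks a single resource dimension with an AtomicU32 and reserves capacity with a compare-and-swap loop. CpuSlot and its methods are hypothetical names, and the single-dimension scope is an assumption made to keep the example short.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// One resource dimension of one machine, updated without a lock.
pub struct CpuSlot {
    capacity_millicores: u32,
    used_millicores: AtomicU32,
}

impl CpuSlot {
    pub fn new(capacity_millicores: u32) -> Self {
        Self {
            capacity_millicores,
            used_millicores: AtomicU32::new(0),
        }
    }

    /// Atomically reserve `req` millicores; returns false if it would not fit.
    pub fn try_reserve(&self, req: u32) -> bool {
        self.used_millicores
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |used| {
                used.checked_add(req)
                    .filter(|&total| total <= self.capacity_millicores)
            })
            .is_ok()
    }

    /// Return capacity on job completion; saturates at zero on over-release.
    pub fn release(&self, req: u32) {
        let _ = self
            .used_millicores
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |used| {
                Some(used.saturating_sub(req))
            });
    }

    /// Used fraction in the range 0.0..=1.0.
    pub fn utilization(&self) -> f64 {
        if self.capacity_millicores == 0 {
            return 0.0;
        }
        self.used_millicores.load(Ordering::Acquire) as f64 / self.capacity_millicores as f64
    }
}
```

Independent atomics per dimension do not make a multi-dimensional reservation atomic as a whole: a placement that reserves CPU and then fails on memory has to release the CPU again. If that rollback becomes awkward, a per-machine lock is the simpler choice.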
5. Migration cost estimation
The migration trigger needs to estimate: (a) is the imbalance significant enough to migrate? (b) is the migration cost lower than the benefit? Use a simple heuristic: imbalance > 30 percentage points triggers consideration; pick a job whose migration_cost is < some threshold (say, 30 seconds); verify the destination has room with headroom; verify the move doesn't reverse the imbalance. The required tests are deterministic — the metrics expose whether migrations happen and roughly how often.
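The heuristic above can be sketched as a pure function over the busiest and least-busy machine. Everything named here is hypothetical (Candidate, Migration, MigrationPolicy, plan_migration), and the sketch assumes each job's load is expressed in percentage points of a machine's capacity, which only holds for machines of equal size. Among the candidates that pass every check it picks the cheapest to move; that tie-break is a choice made for the example.

```rust
#[derive(Clone, Debug)]
pub struct Candidate {
    pub job_id: String,
    pub load_pct: f64,            // share of one machine's capacity
    pub migration_cost_secs: f64, // estimated pause while the job moves
}

#[derive(Clone, Debug, PartialEq)]
pub struct Migration {
    pub job_id: String,
    pub from: String,
    pub to: String,
}

pub struct MigrationPolicy {
    pub imbalance_threshold_pct: f64,
    pub max_cost_secs: f64,
    pub headroom_pct: f64,
}

/// `source` and `dest` are (machine id, utilization in percent).
pub fn plan_migration(
    policy: &MigrationPolicy,
    source: (&str, f64),
    dest: (&str, f64),
    jobs_on_source: &[Candidate],
) -> Option<Migration> {
    let (source_id, source_util) = source;
    let (dest_id, dest_util) = dest;
    // (a) Is the imbalance significant enough to consider migrating?
    if source_util - dest_util <= policy.imbalance_threshold_pct {
        return None;
    }
    jobs_on_source
        .iter()
        // (b) The move itself must be cheap enough.
        .filter(|j| j.migration_cost_secs < policy.max_cost_secs)
        // The destination must keep its headroom after the move.
        .filter(|j| dest_util + j.load_pct <= 100.0 - policy.headroom_pct)
        // The move must not reverse the imbalance.
        .filter(|j| dest_util + j.load_pct <= source_util - j.load_pct)
        .min_by(|a, b| a.migration_cost_secs.total_cmp(&b.migration_cost_secs))
        .map(|j| Migration {
            job_id: j.job_id.clone(),
            from: source_id.to_string(),
            to: dest_id.to_string(),
        })
}
```

Keeping the decision free of I/O and clocks makes the migration trigger test straightforward: set the utilizations directly and assert on the returned plan.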
6. Simulating accelerated time
For the 1440x simulation, don't use real tokio::time::sleep: at wall-clock pace the run would take a full day. Use a virtual clock (a wrapper around an AtomicU64 that the simulation advances) and have the scheduler's deadline checks consult the virtual clock. This makes the simulation deterministic, fast, and reproducible. tokio::time::pause plus tokio::time::advance (both behind tokio's test-util feature) is one way; a custom Clock trait that the scheduler uses is another. Either is fine.
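A minimal sketch of the custom-clock option follows. The Clock trait, SystemClock, VirtualClock, and deadline_missed are all hypothetical names. The virtual clock stores elapsed microseconds in an AtomicU64 and adds them to a fixed base Instant, so it stays compatible with the Instant-typed deadline field in the Required API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Time source the scheduler consults instead of calling Instant::now directly.
pub trait Clock: Send + Sync {
    fn now(&self) -> Instant;
}

/// Production clock: real time.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now(&self) -> Instant {
        Instant::now()
    }
}

/// Simulation clock: time moves only when the simulation advances it.
pub struct VirtualClock {
    base: Instant,
    elapsed_micros: AtomicU64,
}

impl VirtualClock {
    pub fn starting_now() -> Self {
        Self {
            base: Instant::now(),
            elapsed_micros: AtomicU64::new(0),
        }
    }

    /// Move simulated time forward; safe to call from any thread.
    pub fn advance(&self, by: Duration) {
        self.elapsed_micros
            .fetch_add(by.as_micros() as u64, Ordering::SeqCst);
    }
}

impl Clock for VirtualClock {
    fn now(&self) -> Instant {
        self.base + Duration::from_micros(self.elapsed_micros.load(Ordering::SeqCst))
    }
}

/// Deadline check written against the trait, so tests can inject VirtualClock.
pub fn deadline_missed(clock: &dyn Clock, deadline: Instant) -> bool {
    clock.now() > deadline
}
```

In the scale simulation the driver loop would call advance with a fixed simulated step per tick, so a full simulated day is just 1,440 one-minute steps regardless of how fast the host machine runs them.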
Source Anchors
- All sources cited in the three lessons of this module
- Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — the canonical paper on integrated cluster scheduling
- Kubernetes' scheduler documentation (kubernetes.io/docs/concepts/scheduling-eviction/) — the practical production reference
- The tokio runtime's work-stealing implementation (tokio.rs/blog) — for the scheduler-worker layer