Meridian Space Academy

Meridian Space Academy is the internal engineering training program for Meridian Space Systems. It exists to bring senior engineers — most arriving from web, fintech, or systems backgrounds where Rust's role is established but the operational domain is not — up to working competence on the production data systems that power Meridian's flight operations.

The curriculum is organized as five tracks of six modules each. Every module is grounded in a real piece of Meridian's stack: a service we operate, a pipeline we maintain, or a system we are actively rebuilding. The lessons read like internal engineering documentation because that is what they are.


The mission

Meridian operates a 48-satellite Earth observation constellation, twelve ground station sites distributed across four continents, and a forward operations node at the Artemis lunar base. The original control plane was a Python codebase written for a six-satellite constellation in 2018. It does not scale. Single-threaded I/O loops cannot keep up with the uplink and downlink schedule of forty-eight active vehicles, the GIL rules out parallel execution of the frame handling that surrounds all that I/O, and the dynamic type system has let an unmanageable amount of operational fragility accumulate over six years.

The replacement is being written in Rust. This curriculum is how new engineers — and existing engineers moving into the platform team — develop the mental model required to contribute to that replacement without making the system worse.


Prerequisites

Three or more years of Rust experience. The curriculum assumes fluency with ownership, borrowing, lifetimes, trait bounds, generics, error handling with Result, and Cargo workspaces. There are no introductory lessons on these topics; references are made as needed but the basics are not re-taught.

Engineers who have not yet shipped Rust to production should complete the Rustlings exercises and read Programming Rust (Blandy, Orendorff, Tindall) before starting. This is not a beginner curriculum, and starting it without the prerequisites in place produces frustration without learning.


The five tracks

Foundation — Mission Control Platform (shipped)

Async Rust, concurrency, message passing, networking, data layout, and performance profiling, anchored in the rebuild of the legacy Python control plane. Modules cover the tokio runtime and task lifecycle, shared-state vs. message-passing concurrency, channel patterns for telemetry fan-in and fan-out, TCP and UDP servers for ground station traffic, cache-friendly data layouts for hot paths, and CPU and allocation profiling. Foundation is the prerequisite for every other track.

Database Internals — Orbital Object Registry (shipped)

A storage engine for the Orbital Object Registry — the system of record for every tracked orbital object, its TLE history, and the conjunction analyses computed against it. Modules cover page-level storage and the buffer pool, B+ tree indexes, LSM trees and compaction, write-ahead logging and crash recovery, transactions and isolation levels, and query processing with the Volcano model. The capstone projects compose into a working single-node OLTP storage engine over the course of the track.

Data Pipelines — Space Domain Awareness Fusion (shipped)

Real-time sensor fusion across radar, optical, and inter-satellite link feeds, producing the conjunction alerts that flight ops acts on. Modules cover stream processing semantics, pipeline orchestration internals, watermarks and event-time, exactly-once delivery, and backpressure. The mission framing is the SDA Fusion service that ingests tens of thousands of observations per second from heterogeneous sources and emits a unified, deduplicated catalog.

Data Lakes — Artemis Base Cold Archive (planned)

A versioned mission-data lakehouse for the Artemis base archive. Modules cover columnar file formats (Parquet), table format internals (Iceberg), partition layout and clustering, compaction and table maintenance, and time travel and lineage. The mission framing is the cold-archive storage tier for high-resolution sensor data that flows down from the lunar base on a multi-day cadence and must remain queryable for the operational lifetime of the program.

Distributed Systems — Constellation Network (planned)

Consensus, replication, and partitioning across the 48-satellite grid and its ground footprint. Modules cover failure detection, leader election and Raft, replication and quorum, sharding and rebalancing, and split-brain recovery. The mission framing is the inter-satellite coordination layer that maintains constellation-wide state — orbital schedules, contact windows, downlink priorities — without a stable connection to any single ground site.


Foundation first. Every other track depends on async Rust, channels, and the data-oriented performance habits introduced there.

After Foundation, the Database Internals, Data Pipelines, Data Lakes, and Distributed Systems tracks may be taken in any order. They reference one another where useful — the Database track's WAL chapter is referenced from the Distributed Systems track's replication chapter, for example — but each is self-contained and does not require the others as a prerequisite.


How a module works

Every module follows the same structure:

  • Three lessons. Written readings with code examples and source citations where applicable. Lessons that synthesize from training knowledge rather than a primary source are flagged with a Source note callout at the top.
  • Three quizzes. One quiz at the end of each lesson, mixing multiple choice, free response, and Rust tracing problems. A score of seventy percent or higher is required to mark a lesson complete.
  • One capstone project brief. A production-realistic engineering task that exercises the module's material. Projects are scoped to one to two weeks of focused work and include explicit acceptance criteria, suggested architecture, and the relevant Meridian operational context.

A module is complete when all three lessons are passed and the capstone project brief has been worked through end to end. The capstone is not optional reading; it is where the module's material becomes a real system.


Source material

Each lesson cites the primary text it draws from. The core references across the curriculum are Async Rust (Flitton, Morton), Database Internals (Petrov), Designing Data-Intensive Applications (Kleppmann), The Linux Programming Interface (Kerrisk), and the Raft and Spanner papers. Lessons that synthesize beyond a primary source are explicitly marked.

Module 01 — Async Rust Fundamentals

Track: Foundation — Mission Control Platform
Position: Module 1 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 1, 2, 7
Quiz pass threshold: 70% on all three lessons to unlock the project



Mission Context

Meridian's legacy Python control plane was built for a 6-satellite constellation. It handles ground station connections sequentially: one connection at a time, blocking on each telemetry frame before moving to the next. At 48 satellites across 12 ground station sites, this architecture is the primary bottleneck in the control plane. During peak pass windows, the broker accumulates up to 40 seconds of delivery lag — unacceptable for conjunction avoidance workflows that require sub-10-second frame delivery.

This module establishes the async Rust foundation for the replacement system. Before writing any production control plane code, you need an accurate model of how async Rust executes — not at the surface level of #[tokio::main], but at the level of futures, the polling contract, executor scheduling, and task lifecycle. Every architectural decision in the modules that follow depends on this model.


What You Will Learn

By the end of this module you will be able to:

  • Implement the Future trait directly and trace the polling lifecycle from first call to completion
  • Explain the waker contract and identify futures that will silently stall due to missing waker registration
  • Distinguish between tokio::spawn, tokio::join!, and tokio::select! and apply each to the correct concurrency pattern
  • Configure a Tokio runtime explicitly via Builder, size worker and blocking thread pools for a given workload profile, and isolate high-frequency I/O workloads from blocking work
  • Cancel tasks safely using .abort() and tokio::time::timeout, understanding exactly where and when the future is dropped
  • Implement a graceful shutdown sequence with a bounded drain deadline, using RAII and CancellationToken for async cleanup

Lessons

Lesson 1 — The async/await Model: Futures, Polling, and the Executor Loop

Covers the Future trait, the poll function, Poll::Ready vs Poll::Pending, the waker contract, Pin, and the executor's task queue. Includes a manually implemented future to make the state machine mechanics explicit.

Key question this lesson answers: What actually happens at every await point, and what causes a task to silently stall?

lesson-01-async-await-model.md / lesson-01-quiz.toml


Lesson 2 — The Tokio Runtime: Spawning Tasks, the Scheduler, and Thread Pools

Covers Tokio's multi-thread work-stealing scheduler, the distinction between worker threads and blocking threads, tokio::task::spawn_blocking, and explicit runtime configuration via Builder. Includes a dual-runtime pattern for isolating ingress and housekeeping workloads.

Key question this lesson answers: How do you configure the runtime for your actual workload rather than the defaults, and when does blocking work need to leave the async executor?

lesson-02-tokio-runtime.md / lesson-02-quiz.toml


Lesson 3 — Task Lifecycle: Cancellation, Timeouts, and JoinHandle Management

Covers JoinHandle<T> semantics, cooperative cancellation with .abort(), RAII cleanup on cancellation, tokio::time::timeout, tokio::select! for racing futures, and a complete graceful shutdown pattern using broadcast and a bounded drain deadline.

Key question this lesson answers: How do you cleanly terminate a task — whether it completes normally, times out, or receives a shutdown signal — without leaking resources or corrupting state?

lesson-03-task-lifecycle.md / lesson-03-quiz.toml


Capstone Project — Async Telemetry Ingestion Broker

Build the TCP ingress layer for Meridian's replacement control plane. The broker accepts concurrent connections from ground stations, reads length-prefixed telemetry frames, fans each frame out to multiple downstream handlers via a broadcast channel, and shuts down gracefully on Ctrl-C with a 10-second drain deadline.

Acceptance is against 7 verifiable criteria including concurrent connection handling, broadcast fan-out correctness, slow-handler isolation, and bounded graceful shutdown.

project-async-telemetry-broker.md


Prerequisites

This module assumes you are comfortable with Rust ownership, borrowing, traits, and closures. It does not re-explain language fundamentals. If you are new to async Rust generally, the module starts from first principles at the trait level — but it expects you to read Rust error messages without assistance.

What Comes Next

Module 2 — Concurrency Primitives builds directly on this foundation: now that you understand how the executor runs tasks, Module 2 covers how those tasks share state safely — Mutex, RwLock, atomics, and memory ordering. The ground station command queue project in Module 2 connects directly to the telemetry broker you build here.

Lesson 1 — The async/await Model: Futures, Polling, and the Executor Loop

Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 1–2



Context

Meridian's legacy Python control plane was designed for a 6-satellite constellation. It handles ground station connections sequentially: accept a connection, process its telemetry frame, move to the next connection. At 6 satellites, this was acceptable. At 48 satellites across 12 ground station sites, it is a bottleneck. A single slow uplink from a station in the Atacama Desert holds up frames from every other active connection. The Python GIL rules out parallel frame processing without forking processes, which multiplies memory overhead and complicates shared state.

The replacement control plane is being written in Rust with tokio. Before writing any of that system, you need an accurate mental model of how async Rust actually executes code — not at the level of the tokio macro, but at the level of futures, the polling protocol, and the executor's task queue. Misunderstanding this model is the root cause of most async Rust bugs in production: dropped wakers, blocking the executor thread, and state machine explosions that are impossible to reason about.

This lesson covers the mechanics that every await desugars into. By the end, you should be able to read a future's poll implementation and trace exactly when it will make progress, when it will yield, and what will wake it back up. That skill is indispensable when debugging a hung ground station connection at 0300.

Source: Async Rust, Chapters 1–2 (Flitton & Morton)


Core Concepts

What Async Actually Is

Async programming does not add CPU cores. It reorganizes work so that dead time — waiting for a network response, waiting for a disk write — is used to make progress on other tasks. The classic analogy: you do not stand still while the kettle boils. You put the bread in the toaster. The key insight is that both tasks share one pair of hands but interleave their execution during wait periods.

In Rust, this interleaving is explicit and zero-cost. There is no language runtime preempting your code behind your back. Instead, you write state machines, and the Rust compiler compiles each async fn into one of those state machines for you. await is a yield point — a place where the current task volunteers to give up the thread so another task can run.

This is the critical difference from threads: with threads, preemption is involuntary. With async tasks, yield is voluntary, at every await. A task that never hits an await — one that runs a tight CPU loop — will starve every other task on that executor thread. This is not hypothetical. In Meridian's uplink pipeline, a single malformed frame that triggers O(n²) validation holds the entire thread if there's no await in the hot path.
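
A minimal sketch, not Meridian code, makes the starvation failure mode concrete. On a single-threaded runtime, a CPU loop with no await point keeps the heartbeat task off the thread for the duration; the commented-out yield_now line is the fix:

use tokio::time::{sleep, Duration};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Heartbeat task: should print every 100ms.
    tokio::spawn(async {
        loop {
            sleep(Duration::from_millis(100)).await;
            println!("heartbeat");
        }
    });

    // CPU-bound task with no await point: it holds the only worker thread
    // until it finishes, so the heartbeat cannot run while it computes.
    let cpu_task = tokio::spawn(async {
        let mut acc = 0u64;
        for i in 0..200_000_000u64 {
            acc = acc.wrapping_add(std::hint::black_box(i));
            // The fix: a voluntary yield point in the hot loop.
            // if i % 1_000_000 == 0 { tokio::task::yield_now().await; }
        }
        acc
    });

    println!("sum: {}", cpu_task.await.unwrap());
}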

The Future Trait

Every value produced by an async fn or an async block implements Future. The trait is:

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

Poll has two variants: Poll::Ready(value) when the computation is complete, and Poll::Pending when the future cannot yet make progress and should be woken up later.

The poll function is not async. This matters: futures are polled synchronously. The executor calls poll; the future runs synchronously until it either completes or reaches a point where it cannot proceed, and then it returns. If it returns Pending, it is the future's responsibility to arrange for a wake-up. If it returns Pending without registering a waker, the task will never run again — a silent deadlock.

The Waker Contract

Context carries a Waker. The Waker is a handle that, when called, schedules the associated task back onto the executor's run queue. The contract is: if poll returns Pending, it must first have arranged for the waker to be invoked when the awaited resource becomes available, typically by cloning cx.waker() and storing the clone where the resource's completion path can call wake(). Calling wake_by_ref() just before returning Pending is legal but amounts to busy-polling: the task is immediately re-queued and re-polled.

Violating this contract — returning Pending without registering the waker — produces a future that stalls forever with no error. The executor sees a pending task, never reschedules it, and the task silently vanishes from the run queue. At the Meridian scale, this manifests as a ground station connection that goes quiet mid-session: no error, no disconnect, just silence until the session timeout fires.

The executor side of this contract: when a waker is called, the executor re-queues the task and eventually calls poll again. The future may be polled many times before it completes. The state it needs to resume must be owned by the future struct itself — this is why async Rust desugars async fn into a struct that holds all local variables as fields.
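
To make the contract concrete, here is a hand-rolled future completed by a plain OS thread (a sketch in the spirit of the classic async-book timer, not production code). The only thing the thread knows about the executor is the Waker stored during poll; calling it is what re-queues the task:

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};
use std::thread;
use std::time::Duration;

struct Shared {
    done: bool,
    waker: Option<Waker>,
}

// A delay future driven by a plain OS thread rather than the runtime's timer.
struct ThreadDelay {
    shared: Arc<Mutex<Shared>>,
}

impl ThreadDelay {
    fn new(dur: Duration) -> Self {
        let shared = Arc::new(Mutex::new(Shared { done: false, waker: None }));
        let for_thread = Arc::clone(&shared);
        thread::spawn(move || {
            thread::sleep(dur);
            let mut s = for_thread.lock().unwrap();
            s.done = true;
            // Fulfil the contract: wake the task that last polled us.
            if let Some(w) = s.waker.take() {
                w.wake();
            }
        });
        ThreadDelay { shared }
    }
}

impl Future for ThreadDelay {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.shared.lock().unwrap();
        if s.done {
            Poll::Ready(())
        } else {
            // Store the CURRENT waker on every poll; the task may have
            // migrated to a different worker since the previous poll.
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

#[tokio::main]
async fn main() {
    ThreadDelay::new(Duration::from_millis(100)).await;
    println!("woken by the background thread's waker");
}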

Pinning

The poll signature takes Pin<&mut Self> rather than &mut Self. Pin prevents the future from being moved in memory after it has been pinned. This matters because async state machines frequently contain self-referential structures: a future that awaits another future may hold a reference into its own fields. If the outer future were moved, that reference would dangle.

Pin enforces at compile time that once you call poll, the future cannot be moved. For futures composed entirely of Unpin types (most standard types), this is a no-op. For futures holding references into themselves — which the compiler generates automatically from async fn — it is essential.

Practical implication: you cannot call poll directly on a future obtained from an async fn without first pinning it via Box::pin(future) or tokio::pin!(future). tokio::spawn handles this for you; you only encounter it directly when building custom executors or when polling a future by reference inside select!.

tokio::pin! — Polling a Future by Reference

tokio::pin! pins a value to the current stack frame in place, making it safe to poll by mutable reference. The common situation where this matters: you need to start an async operation once and track its progress across multiple iterations of a select! loop, rather than restarting it fresh on every iteration.

Consider fetching a TLE catalog update while simultaneously processing incoming session commands. The fetch should run to completion in the background; the command loop should not restart it on each iteration:

use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

async fn fetch_tle_update() -> Vec<u8> {
    // Simulate a slow catalog fetch — ~200ms in production.
    sleep(Duration::from_millis(200)).await;
    vec![0u8; 64] // placeholder TLE payload
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(8);

    // Spawn a sender to simulate incoming commands.
    tokio::spawn(async move {
        for cmd in ["REPOINT", "STATUS", "RESET"] {
            sleep(Duration::from_millis(60)).await;
            let _ = tx.send(cmd.to_string()).await;
        }
    });

    // Create the future ONCE, outside the loop.
    let tle_fetch = fetch_tle_update();
    // Pin it to the stack so we can poll it by reference (&mut tle_fetch).
    tokio::pin!(tle_fetch);

    let mut tle_done = false;

    loop {
        tokio::select! {
            // Poll the same future instance each iteration.
            // Without tokio::pin!, each iteration would call fetch_tle_update()
            // again, creating a brand-new future and discarding all progress.
            tle = &mut tle_fetch, if !tle_done => {
                println!("TLE update received: {} bytes", tle.len());
                tle_done = true;
            }
            Some(cmd) = rx.recv() => {
                println!("command received: {cmd}");
                if cmd == "RESET" { break; }
            }
            else => break,
        }
    }
}

Two things to notice. First, tle_fetch is created before the loop and pinned with tokio::pin!. Inside select!, &mut tle_fetch polls the same future on every iteration — it accumulates progress across polls. If you wrote fetch_tle_update() directly inside select!, you would get a new future each time and the fetch would restart from zero on every loop iteration.

Second, the , if !tle_done precondition disables the branch once the fetch has completed. This is essential: if the branch stays enabled after the future resolves, select! would attempt to poll an already-completed future on the next iteration, causing an "async fn resumed after completion" panic. The precondition guards against this; Lesson 3 covers select! branch preconditions in full.

The Executor Loop

The executor maintains a run queue of tasks ready to be polled. Its loop is approximately:

  1. Pop a task from the ready queue.
  2. Call poll on it.
  3. If Poll::Ready, the task is done — drop it.
  4. If Poll::Pending, the task is parked. It will be re-queued only when its waker is called.

Tasks are not re-polled speculatively. They are polled exactly when woken. This means a task can sit in Pending state indefinitely if nothing triggers its waker — which is the correct behavior for a task waiting on a network connection that has gone silent.

tokio::spawn places a task on the executor's ready queue. tokio::join! polls multiple futures concurrently on the same task — no new OS threads, no new tasks, just interleaved polling within the same scheduler slot. tokio::spawn creates a new independent task that can be scheduled on any worker thread.
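
The loop above can be made concrete. The following toy single-task block_on is a sketch built only on the standard library, not how Tokio is implemented. The thread itself stands in for the run queue: it parks when the future is Pending, and the waker's whole job is to unpark it:

use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// The single-thread stand-in for "re-queue the task": wake = unpark.
struct ThreadWaker(std::thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(std::thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out, // step 3: task done
            // Step 4: park until some Waker clone unparks us, then re-poll.
            Poll::Pending => std::thread::park(),
        }
    }
}

fn main() {
    // An async block with no await completes on the first poll.
    let answer = block_on(async { 21 * 2 });
    println!("{answer}");
}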


Code Examples

Implementing Future Directly — A Telemetry Frame Validator

This example implements Future manually to illustrate what async/await desugars into. In production this would be an async fn, but seeing the state machine explicitly clarifies exactly when control yields and what triggers resumption.

The scenario: validating an incoming telemetry frame header requires checking a CRC that is computed in a background thread pool. The future polls a oneshot channel for the result.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::sync::oneshot;

/// Represents a frame whose header CRC is being validated asynchronously.
/// The validation runs on a blocking thread; this future waits for its result.
pub struct FrameValidationFuture {
    // oneshot::Receiver implements Future directly, but we wrap it here
    // to show the polling mechanics explicitly.
    receiver: oneshot::Receiver<bool>,
}

impl FrameValidationFuture {
    pub fn new(receiver: oneshot::Receiver<bool>) -> Self {
        Self { receiver }
    }
}

impl Future for FrameValidationFuture {
    type Output = Result<(), String>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Pin::new is safe here because oneshot::Receiver is Unpin.
        // For a self-referential type we'd need unsafe or box-pinning.
        match Pin::new(&mut self.receiver).poll(cx) {
            Poll::Ready(Ok(true)) => Poll::Ready(Ok(())),
            Poll::Ready(Ok(false)) => {
                Poll::Ready(Err("CRC validation failed".to_string()))
            }
            Poll::Ready(Err(_)) => {
                // Sender dropped without sending — the validator thread panicked
                // or was cancelled. Treat as a validation failure, not a panic.
                Poll::Ready(Err("Validator thread terminated unexpectedly".to_string()))
            }
            // The result is not ready yet. The oneshot::Receiver has already
            // registered cx's waker — it will call it when a value is sent.
            // We return Pending; the executor parks this task.
            Poll::Pending => Poll::Pending,
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel::<bool>();

    // Simulate the CRC validator running on a blocking thread pool.
    tokio::spawn(async move {
        // In production: tokio::task::spawn_blocking(|| compute_crc(...)).await
        // Here we just send a valid result immediately.
        let _ = tx.send(true);
    });

    let validation = FrameValidationFuture::new(rx);
    match validation.await {
        Ok(()) => println!("Frame header valid — forwarding to telemetry pipeline"),
        Err(e) => eprintln!("Frame rejected: {e}"),
    }
}

The poll implementation delegates to the inner oneshot::Receiver's own poll. When Receiver::poll returns Pending, it has already stored the waker from cx internally. When tx.send(true) fires, Receiver calls that waker, which re-queues this task. No manual waker management is needed here because we compose with a type that already handles it correctly.

This is the pattern to follow when building custom futures: compose with existing futures and channel primitives wherever possible. Write unsafe waker code only when you are bridging to a non-async notification source (an epoll fd, a hardware interrupt, a C library callback).

Concurrent Polling with tokio::join! vs. Sequential await

Sequential await for two telemetry frame fetches from different ground stations means the second fetch does not start until the first completes:

// SEQUENTIAL — total latency = latency(station_a) + latency(station_b)
let frame_a = fetch_frame("gs-atacama").await?;
let frame_b = fetch_frame("gs-svalbard").await?;

tokio::join! polls both concurrently on the same task. While one is pending, the executor can drive the other forward:

use anyhow::Result;
use tokio::net::TcpStream;

async fn fetch_frame(station_id: &str) -> Result<Vec<u8>> {
    // Simplified: in production this reads from a persistent connection pool.
    let _stream = TcpStream::connect(format!("{station_id}:7777")).await?;
    // ... read frame bytes ...
    Ok(vec![]) // placeholder
}

#[tokio::main]
async fn main() -> Result<()> {
    // CONCURRENT — total latency ≈ max(latency_a, latency_b)
    // Both futures are polled in the same task; no new OS threads are created.
    let (frame_a, frame_b) = tokio::join!(
        fetch_frame("gs-atacama"),
        fetch_frame("gs-svalbard")
    );

    // Both results are available here; handle errors independently.
    match (frame_a, frame_b) {
        (Ok(a), Ok(b)) => {
            println!("Received {} + {} bytes from ground stations", a.len(), b.len());
        }
        (Err(e), _) | (_, Err(e)) => {
            eprintln!("Ground station fetch failed: {e}");
        }
    }
    Ok(())
}

tokio::join! is appropriate when the futures are independent and you need both results. If you only need the first result and want to cancel the loser, use tokio::select!. If the futures have no data dependency and need to run across multiple threads simultaneously, tokio::spawn each and join the handles.
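
A sketch of that last variant, reusing the fetch_frame placeholder from above. Each fetch becomes an independent task that the scheduler may place on separate worker threads:

use anyhow::Result;

async fn fetch_both_in_parallel() -> Result<(Vec<u8>, Vec<u8>)> {
    // Each spawn creates an independent task; unlike join!, the two fetches
    // can run on different worker threads at the same time. The string
    // literals are &'static str, which satisfies spawn's 'static bound.
    let a = tokio::spawn(fetch_frame("gs-atacama"));
    let b = tokio::spawn(fetch_frame("gs-svalbard"));

    // The first ? propagates JoinError (panic or abort); the second
    // propagates the application error from the fetch itself.
    Ok((a.await??, b.await??))
}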


Key Takeaways

  • The Future trait's poll method is synchronous. An async runtime is a loop that calls poll on ready tasks; it does not preempt running tasks. A future that does significant CPU work without an await will monopolize its executor thread.

  • If poll returns Poll::Pending without registering the context's waker, the task is silently parked forever. Always verify that the resource you're awaiting will call the waker when it becomes available.

  • Pin<&mut Self> exists to prevent futures from being moved after polling begins. For futures containing self-referential state (which the compiler generates automatically), this is load-bearing. Most composed futures are Unpin; the constraint only bites when bridging to raw async primitives.

  • tokio::join! achieves concurrency within a single task by interleaved polling. It does not create threads or new tasks. Use it for independent I/O operations that should proceed simultaneously but whose results you need together.

  • tokio::pin! pins a future to the current stack frame so it can be polled by mutable reference across multiple select! iterations. Use it when you need to start an operation once and track its progress, not restart it on each loop. Always pair it with a precondition (, if !done) to prevent polling the future after it has already resolved.

  • Every async fn is compiled into a state machine struct. Variables that live across await points become fields of that struct. Understanding this explains why async Rust futures can be large, why they must be pinned, and why capturing large values across await points inflates memory use.


Lesson 2 — The Tokio Runtime: Spawning Tasks, the Scheduler, and Thread Pools

Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 2 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapter 7



Context

The Meridian control plane receives telemetry from 48 satellite uplinks simultaneously. Each uplink connection is long-lived: a ground station holds a persistent TCP session with the control plane and streams frames at irregular intervals driven by orbital geometry and antenna alignment. Alongside these connections, the control plane runs housekeeping tasks — session health checks, TLE refresh from the catalog, and periodic flush of buffered frames to the downstream aggregator.

The #[tokio::main] macro stands up a default multi-threaded runtime and runs your async main function in it. For prototyping and simple services, this is sufficient. For a system with the throughput profile and operational requirements of Meridian's control plane, you need to understand what that runtime is actually doing — how many threads it allocates, how it distributes work across them, what happens when a blocking operation enters the mix, and how to configure it for your actual workload rather than the defaults.

This lesson covers Tokio's scheduler architecture, the distinction between async tasks and blocking tasks, how to size thread pools for I/O-bound vs. compute-bound work, and how to configure the runtime explicitly via Builder. The goal is not to tune prematurely — it is to understand the model well enough to make deliberate choices rather than accepting defaults that may be wrong for your system.

Source: Async Rust, Chapter 7 (Flitton & Morton)


Core Concepts

The Multi-Thread Scheduler

Tokio's default multi_thread scheduler maintains a pool of worker threads — by default, one per logical CPU core. Each worker thread has a local run queue. Tasks spawned with tokio::spawn from outside the runtime land on a shared global queue; tasks spawned from within a worker thread go onto that worker's local queue, and idle workers steal work from their peers' queues. This work-stealing design keeps all workers busy when there is backlog, at the cost of some cross-thread synchronization.

Each worker runs the same loop from Lesson 1: pop a ready task, call poll, re-queue it if it returns Pending, drop it if Ready. When a worker's local queue is empty, it attempts to steal tasks from other workers' queues before checking the global queue. The global_queue_interval configuration controls how many local-queue tasks a worker processes before checking the global queue — the default is 61. Lowering this value gives newly spawned tasks lower latency at the cost of more global-queue contention.
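
When that trade matters, the interval is an explicit Builder knob. A minimal sketch; the value shown is illustrative, not a recommendation:

use tokio::runtime::Builder;

fn main() {
    let _runtime = Builder::new_multi_thread()
        // Check the global queue every 31 local tasks instead of the
        // default 61: lower spawn-to-first-poll latency, more contention.
        .global_queue_interval(31)
        .enable_all()
        .build()
        .expect("failed to build runtime");
}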

The current_thread runtime (the default for #[tokio::test], and selectable with #[tokio::main(flavor = "current_thread")]) runs all tasks on the calling thread. It is appropriate for services that are purely I/O-bound with no CPU-intensive tasks and where single-threaded throughput is sufficient. The Meridian control plane uses the multi-thread runtime.

Worker Threads and Blocking Threads

Tokio distinguishes between two kinds of threads:

Worker threads run the async executor loop. They poll futures and run async task code. There should be enough of them to saturate your I/O capacity without exceeding your core count. A typical production setting is num_cpus::get(), which Builder::new_multi_thread() uses by default.

Blocking threads are spawned on demand by tokio::task::spawn_blocking. They run synchronous, blocking code — file I/O, CPU-intensive computation, synchronous library calls — in a separate thread pool that does not interfere with the async worker threads. The key rule: never perform blocking I/O or long CPU work directly on an async worker thread. Doing so parks that thread for the duration, reducing effective parallelism and starving other tasks.

max_blocking_threads caps the number of blocking threads that can exist simultaneously. The default is 512. For the Meridian control plane, which may process TLE bulk imports concurrently with live uplinks, sizing this correctly prevents runaway thread creation under load spikes.

tokio::spawn and Task Identity

tokio::spawn places a new task onto the runtime's global queue. It returns a JoinHandle<T> — a handle to the spawned task's eventual output. The task is independent of the spawner: if the spawner drops the handle, the task continues running (though its output is lost). If you need the task's output, keep the handle and .await it. If you need to cancel the task, call .abort() on the handle.

Spawned tasks must be 'static — they cannot borrow from the spawning scope. If the task needs data from the spawner, move it in with async move { ... }, clone cheaply clonable data (like Arc-wrapped state), or use channels to communicate.

A common mistake is spawning a task per connection without any admission control. At 48 uplinks with 100 frames per second each, that is 4,800 task-spawns per second for frame processing alone. Task creation has overhead. For Meridian's frame processing workload, a bounded task pool or a pipeline of fixed workers is more appropriate than an unbounded spawn-per-frame pattern.
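
One admission-control sketch, assuming a semaphore bound rather than the broker's actual design: a fixed permit count caps in-flight frame tasks, and acquisition applies backpressure at the ingress loop.

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{sleep, Duration};

const MAX_IN_FLIGHT: usize = 256;

async fn process_frame(frame: Vec<u8>) {
    sleep(Duration::from_millis(5)).await; // placeholder for real work
    let _ = frame.len();
}

#[tokio::main]
async fn main() {
    let permits = Arc::new(Semaphore::new(MAX_IN_FLIGHT));

    for _seq in 0..10_000u32 {
        // Waits here once 256 tasks are in flight, so a burst becomes
        // backpressure at the edge instead of flooding the scheduler.
        let permit = permits.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            process_frame(vec![0u8; 1024]).await;
            drop(permit); // frees the slot for the next frame
        });
    }
    // Sketch only: a real broker would also drain outstanding tasks on shutdown.
}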

Configuring the Runtime Explicitly

The #[tokio::main] macro is shorthand for building a default runtime and blocking on the async main function. Replacing it with an explicit Builder gives fine-grained control:

use tokio::runtime::Builder;

// Placeholder entry point; in the control plane this is server setup.
async fn async_main() {}

fn main() {
    let runtime = Builder::new_multi_thread()
        .worker_threads(8)
        .max_blocking_threads(16)
        .thread_name("meridian-worker")
        .thread_stack_size(2 * 1024 * 1024)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async_main());
}

The runtime is a value. You can have multiple runtimes in the same process — useful when you need strict resource isolation between subsystems (e.g., keeping the telemetry ingress runtime separate from the housekeeping runtime so a housekeeping spike does not starve active uplinks).


Code Examples

Explicit Runtime Configuration for the Meridian Control Plane

Meridian's control plane has two distinct workload profiles that benefit from isolated runtimes: the high-frequency telemetry ingress path (many short-lived I/O tasks) and the housekeeping path (fewer, slower tasks including blocking TLE catalog refreshes). Sharing a single runtime risks head-of-line blocking when a TLE import saturates the blocking thread pool.

use std::sync::LazyLock;
use tokio::runtime::{Builder, Runtime};

// Ingress runtime: tuned for concurrent I/O — one worker per core,
// minimal blocking threads since real blocking work routes to the
// housekeeping runtime.
static INGRESS_RUNTIME: LazyLock<Runtime> = LazyLock::new(|| {
    Builder::new_multi_thread()
        .worker_threads(num_cpus::get())
        .max_blocking_threads(4)
        .thread_name("meridian-ingress")
        .thread_stack_size(2 * 1024 * 1024)
        .on_thread_start(|| tracing::debug!("ingress worker started"))
        .enable_all()
        .build()
        .expect("failed to build ingress runtime")
});

// Housekeeping runtime: fewer workers, more blocking threads for
// catalog refreshes and frame archival.
static HOUSEKEEPING_RUNTIME: LazyLock<Runtime> = LazyLock::new(|| {
    Builder::new_multi_thread()
        .worker_threads(2)
        .max_blocking_threads(32)
        .thread_name("meridian-housekeeping")
        .enable_all()
        .build()
        .expect("failed to build housekeeping runtime")
});

async fn handle_uplink_session() {
    // This runs on an ingress worker thread.
    // Long-running I/O awaits are fine here.
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tracing::info!("uplink session processed");
}

async fn refresh_tle_catalog() {
    // CPU + blocking I/O — route to spawn_blocking so we do not
    // park an ingress worker for the duration of the refresh.
    tokio::task::spawn_blocking(|| {
        // Synchronous HTTP fetch + file write; blocks for ~200ms.
        tracing::info!("TLE catalog refreshed");
    })
    .await
    .expect("TLE refresh panicked");
}

fn main() {
    // Ingress and housekeeping run in separate thread pools.
    // A TLE refresh spike cannot starve active uplink sessions.
    std::thread::spawn(|| {
        HOUSEKEEPING_RUNTIME.block_on(async {
            loop {
                refresh_tle_catalog().await;
                tokio::time::sleep(tokio::time::Duration::from_secs(300)).await;
            }
        });
    });

    INGRESS_RUNTIME.block_on(async {
        // In production: bind TCP listener, accept connections,
        // spawn handle_uplink_session per connection.
        for _ in 0..48 {
            INGRESS_RUNTIME.spawn(handle_uplink_session());
        }
        tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
    });
}

The on_thread_start hook enables per-thread tracing setup. In a production deployment, this is where you would initialize thread-local metrics recorders. The thread_name setting surfaces in top, htop, and perf output — essential when profiling which runtime is responsible for CPU usage.

Dispatching Blocking Work Correctly

The most common async-correctness mistake in production Rust services is calling blocking code on an async worker thread. The rule is simple but frequently violated: if a function does not have async in its signature and it does any I/O or takes more than a few hundred microseconds, it belongs in spawn_blocking.

use anyhow::Result;
use tokio::task;

/// Parse and validate a batch of TLE records from a raw string.
/// TLE parsing is synchronous and O(n) with input length.
/// On a 100KB batch, this can take several milliseconds.
fn parse_tle_batch_blocking(raw: String) -> Result<Vec<String>> {
    // Synchronous parsing — no I/O, but CPU-bound for large inputs.
    raw.lines()
        .filter(|l| l.starts_with("1 ") || l.starts_with("2 "))
        .map(|l| Ok(l.to_string()))
        .collect()
}

async fn ingest_tle_update(raw_batch: String) -> Result<Vec<String>> {
    // Moving raw_batch into spawn_blocking satisfies the 'static bound.
    // The closure executes on a blocking thread; we await the JoinHandle.
    let records = task::spawn_blocking(move || parse_tle_batch_blocking(raw_batch))
        .await
        // The outer error is a JoinError (task panicked or was aborted).
        // Propagate it as an application error.
        .map_err(|e| anyhow::anyhow!("TLE parser panicked: {e}"))??;

    Ok(records)
}

#[tokio::main]
async fn main() -> Result<()> {
    let raw = "1 25544U 98067A   21275.52500000  .00001234  00000-0  12345-4 0  9999\n\
               2 25544  51.6400 337.6640 0007417  62.6000 297.5200 15.48889583300000\n"
        .to_string();

    let records = ingest_tle_update(raw).await?;
    println!("Parsed {} TLE records", records.len());
    Ok(())
}

The double ? on .await.map_err(...)?? deserves explanation: spawn_blocking returns Result<T, JoinError>, and parse_tle_batch_blocking itself returns Result<Vec<String>, anyhow::Error>. The first ? propagates JoinError (after mapping it), the second propagates the inner application error. Collapsing these correctly is a common stumbling point — do not use .unwrap() on JoinHandle in production code; a parser panic should not take down the ingress runtime.


Key Takeaways

  • Tokio's multi-thread scheduler uses work-stealing across a pool of worker threads (defaulting to one per logical CPU). Tasks spawned via tokio::spawn enter the global queue and are picked up by idle workers.

  • Worker threads and blocking threads serve different purposes. Never run synchronous blocking I/O or long CPU computation on a worker thread. Use tokio::task::spawn_blocking to route blocking work to the dedicated blocking thread pool.

  • Explicit Builder configuration lets you control thread counts, stack sizes, thread names, and lifecycle hooks. Use it in production to separate high-frequency I/O workloads from lower-frequency blocking workloads, preventing one from starving the other.

  • tokio::spawn creates a task with 'static lifetime. If you need to share data from the spawning scope, move it into the closure with async move, wrap it in Arc, or communicate via channels.

  • Multiple runtimes in the same process are a valid pattern for resource isolation. Ingress and housekeeping workloads with fundamentally different resource profiles benefit from separate thread pools rather than competing on a shared executor.


Lesson 3 — Task Lifecycle: Cancellation, Timeouts, and JoinHandle Management

Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 3 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 2, 7



Context

The Meridian control plane manages connections that span orbital passes. A ground station connection is live while the target satellite is above the horizon — typically 8 to 12 minutes. When the pass ends, the connection should be torn down cleanly: in-flight frames flushed, session state persisted, downstream consumers notified. If the control plane is restarted mid-pass — rolling deploy, crash recovery, OOM kill — active tasks must be cancelled in a way that does not corrupt shared state or leave downstream systems with partial data.

Understanding task lifecycle is not optional for this system. Tasks that outlive their useful scope waste resources. Tasks cancelled without cleanup leave corrupted state. Tasks that silently swallow their errors make incident response a guessing game. The Tokio JoinHandle, the .abort() call, and tokio::time::timeout are the instruments for managing these concerns; this lesson covers each one in depth.

Source: Async Rust, Chapters 2 & 7 (Flitton & Morton)


Core Concepts

The JoinHandle and Task Output

tokio::spawn returns JoinHandle<T>. The handle has two primary uses: waiting for the task's output with .await, and cancelling the task with .abort().

.await on a JoinHandle<T> produces Result<T, JoinError>. JoinError indicates one of two things: the task panicked, or it was aborted. Distinguishing them matters:

match handle.await {
    Ok(value) => { /* normal completion */ }
    Err(e) if e.is_panic() => { /* task panicked — log and recover */ }
    Err(e) if e.is_cancelled() => { /* task was aborted */ }
    Err(_) => unreachable!(),
}

If you drop a JoinHandle without awaiting it, the task continues running — it is not cancelled. This is the correct behavior for fire-and-forget tasks. If you need the task to stop when the handle is dropped, use tokio_util::task::AbortOnDropHandle (a wrapper that calls .abort() on drop) or implement the same pattern manually.
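
A minimal hand-rolled version of that abort-on-drop pattern; tokio_util::task::AbortOnDropHandle provides the same behavior with more care:

use tokio::task::JoinHandle;
use tokio::time::{sleep, Duration};

struct AbortOnDrop<T>(JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        // Cooperative: the task is cancelled at its next .await point.
        self.0.abort();
    }
}

#[tokio::main]
async fn main() {
    let guard = AbortOnDrop(tokio::spawn(async {
        loop {
            sleep(Duration::from_secs(1)).await;
        }
    }));

    // Dropping the guard aborts the background loop; a bare JoinHandle
    // would have let it run until the runtime shut down.
    drop(guard);
}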

Task Cancellation with .abort()

.abort() sends a cancellation signal to the task. The task does not stop immediately — it is cancelled at the next .await point. This is cooperative cancellation: the task's state machine is dropped when it next yields to the executor, which runs the Drop implementation of any held values.

The implication: resources guarded by RAII are dropped correctly on cancellation. A tokio::net::TcpStream held by the task will be closed. A MutexGuard will be released. A tokio::fs::File will be closed, but not necessarily flushed: Tokio's File buffers writes internally, and dropping it does not guarantee buffered data reaches the OS, so call flush().await before relying on Drop. What is not guaranteed: any code after the .await where cancellation occurred simply never runs. If you have cleanup logic that must run regardless of cancellation, it must be in a Drop impl, not in code that follows an .await.

// This cleanup logic may NOT run if the task is cancelled at the await:
async fn session_handler(id: u64) {
    process_frames().await; // <-- task may be cancelled here
    // The following line may never execute if aborted above.
    persist_session_state(id).await; // NOT guaranteed on cancellation
}

// This cleanup logic WILL run on cancellation because it is in Drop:
struct Session {
    id: u64,
    state: SessionState,
}

impl Drop for Session {
    fn drop(&mut self) {
        // Synchronous cleanup only — no async here.
        // Flush to a synchronous in-memory buffer; a separate housekeeping
        // task drains the buffer to persistent storage.
        tracing::info!(session_id = self.id, "session dropped, state buffered");
    }
}

CancellationToken and TaskTracker

broadcast and watch channels work for shutdown signalling, but tokio-util provides two purpose-built primitives that are cleaner for the specific problem of cooperative shutdown.

CancellationToken is a cloneable, shareable cancellation handle. Any clone of a token represents the same cancellation event: when .cancel() is called on any one of them, all clones see it. Tasks wait on .cancelled(), which returns a future that resolves when the token is cancelled:

use tokio::time::{sleep, Duration};
use tokio_util::sync::CancellationToken;

async fn uplink_session(station_id: u32, token: CancellationToken) {
    loop {
        tokio::select! {
            // cancelled() is just a future — it composes naturally with select!.
            _ = token.cancelled() => {
                // Run async cleanup here before returning.
                // This is the key advantage over .abort(): we choose when to stop
                // and can flush state, send final messages, close connections.
                tracing::info!(station_id, "session received cancellation — draining");
                flush_pending_frames().await;
                break;
            }
            _ = process_next_frame(station_id) => {
                // Normal frame processing continues until token is cancelled.
            }
        }
    }
}

async fn flush_pending_frames() {
    sleep(Duration::from_millis(10)).await; // placeholder
}

async fn process_next_frame(_id: u32) {
    sleep(Duration::from_millis(50)).await; // placeholder
}

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();

    // Clone the token for each task — all clones share the same cancellation.
    let handles: Vec<_> = (0..4)
        .map(|id| {
            let t = token.clone();
            tokio::spawn(uplink_session(id, t))
        })
        .collect();

    // Simulate running for a short time then shutting down.
    sleep(Duration::from_millis(120)).await;

    // Cancel all sessions simultaneously with one call.
    token.cancel();

    for handle in handles {
        let _ = handle.await;
    }
    tracing::info!("all sessions shut down");
}

The critical difference from .abort(): when the token fires, the task's select! arm runs, giving the task the opportunity to execute async cleanup before it exits. .abort() drops the future at the next .await with no opportunity for the task to run any further code.

CancellationToken::child_token() creates a child that is cancelled when the parent is cancelled, but can also be cancelled independently. Use this for hierarchical shutdown: cancel the top-level token to shut down everything, or cancel a child token to shut down one subsystem while leaving others running.
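
A minimal sketch of the hierarchy; the subsystem names are illustrative:

use tokio_util::sync::CancellationToken;

fn main() {
    let root = CancellationToken::new();
    let ingress = root.child_token();
    let housekeeping = root.child_token();

    // Cancel one subsystem independently; ingress is unaffected.
    housekeeping.cancel();
    assert!(housekeeping.is_cancelled());
    assert!(!ingress.is_cancelled());

    // Cancelling the root propagates to every child.
    root.cancel();
    assert!(ingress.is_cancelled());
}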

TaskTracker solves the drain-waiting problem more cleanly than collecting JoinHandles into a Vec. Spawn tasks through the tracker; call .close() when no more tasks will be added; then .wait() to block until all tracked tasks finish:

use tokio::time::{sleep, Duration};
use tokio_util::task::TaskTracker;

#[tokio::main]
async fn main() {
    let tracker = TaskTracker::new();
    let token = tokio_util::sync::CancellationToken::new();

    for station_id in 0..12u32 {
        let t = token.clone();
        tracker.spawn(async move {
            tokio::select! {
                _ = t.cancelled() => {
                    tracing::info!(station_id, "session shutting down");
                }
                _ = sleep(Duration::from_secs(300)) => {
                    tracing::info!(station_id, "session pass complete");
                }
            }
        });
    }

    // Signal that no more tasks will be spawned.
    // wait() will not resolve until close() has been called.
    tracker.close();

    // Trigger shutdown.
    sleep(Duration::from_millis(50)).await;
    token.cancel();

    // Block until all 12 sessions finish their cleanup.
    tracker.wait().await;
    tracing::info!("all sessions drained");
}

tracker.wait() only resolves after both conditions are true: all spawned tasks have completed, and tracker.close() has been called. The close() requirement prevents a race where wait() resolves between the last task finishing and the next one being spawned. Always call close() before wait().

tokio::time::timeout

tokio::time::timeout(duration, future) wraps any future and adds a deadline. If the future does not complete within the duration, it is cancelled and the wrapper returns Err(tokio::time::error::Elapsed).

use tokio::time::{timeout, Duration};

async fn fetch_frame_with_deadline(station: &str) -> anyhow::Result<Vec<u8>> {
    timeout(Duration::from_secs(5), fetch_frame(station))
        .await
        // Elapsed is returned as Err — map it to an application error.
        .map_err(|_| anyhow::anyhow!("ground station {station} timed out after 5s"))?
}

A critical detail: timeout cancels the inner future when the deadline fires — with the same cooperative semantics as .abort(). The future is dropped at its next .await point after the deadline. If the future holds a database transaction or has submitted writes that should be rolled back on timeout, the transaction handle's Drop must handle the rollback.

For scenarios where you want to retry on timeout, wrap the timeout in a loop. For scenarios where you want to give a task one deadline with no retry, timeout is the right primitive. For scenarios where you want to cancel based on an external signal (graceful shutdown, satellite pass end), use CancellationToken or tokio::select! with a shutdown receiver.
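
A retry-on-timeout sketch, reusing the fetch_frame placeholder from Lesson 1. Each attempt gets a fresh deadline; after three timeouts the error surfaces to the caller:

use tokio::time::{timeout, Duration};

async fn fetch_frame_with_retries(station: &str) -> anyhow::Result<Vec<u8>> {
    for attempt in 1..=3u32 {
        match timeout(Duration::from_secs(5), fetch_frame(station)).await {
            // Inner future completed: return its result, Ok or Err.
            Ok(result) => return result,
            Err(_elapsed) => {
                tracing::warn!(station, attempt, "fetch timed out; retrying");
            }
        }
    }
    Err(anyhow::anyhow!("station {station} timed out on all 3 attempts"))
}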

tokio::select! for Racing Futures

tokio::select! polls multiple futures concurrently and completes with the first one that becomes ready, cancelling the others. It is the right tool for:

  • Racing a task against a timeout
  • Racing a task against a shutdown signal
  • Implementing priority receive patterns on multiple channels
use tokio::sync::oneshot;

async fn session_with_shutdown(
    session: impl std::future::Future<Output = ()>,
    mut shutdown: oneshot::Receiver<()>,
) {
    tokio::select! {
        _ = session => {
            tracing::info!("session completed normally");
        }
        _ = &mut shutdown => {
            // Shutdown signal received — session future is cancelled here.
            // RAII cleanup in the session's Drop runs.
            tracing::info!("session cancelled: shutdown signal received");
        }
    }
}

The branch that wins is executed; the branches that lose are cancelled (futures dropped at their next await point). If you need to do async cleanup when the losing branch is cancelled, you cannot do it inside select! — you need CancellationToken combined with a cleanup task.

Important: all branches of a select! run concurrently on the same task. They are never truly simultaneous — only one executes at a time — but they are polled in interleaved fashion within a single scheduler slot. This is distinct from tokio::spawn, which creates a new task that can run on a different worker thread. select! is lightweight concurrent multiplexing; spawn is independent parallel scheduling.

select! Loop Patterns and Branch Preconditions

select! is most often used inside a loop. Two patterns come up constantly in production systems.

Multi-channel drain with else: when a session task needs to drain from multiple upstream channels until all are closed:

use tokio::sync::mpsc;

async fn drain_uplinks(
    mut primary: mpsc::Receiver<Vec<u8>>,
    mut redundant: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        tokio::select! {
            // select! randomly picks which ready branch to check first —
            // this prevents the redundant channel from always being starved
            // if the primary is consistently busy.
            Some(frame) = primary.recv() => {
                process_frame(frame, "primary");
            }
            Some(frame) = redundant.recv() => {
                process_frame(frame, "redundant");
            }
            // else fires when ALL patterns fail — both channels returned None,
            // meaning both are closed. This is the clean exit condition.
            else => {
                tracing::info!("all uplink channels closed — drain complete");
                break;
            }
        }
    }
}

fn process_frame(frame: Vec<u8>, source: &str) {
    tracing::debug!(bytes = frame.len(), source, "frame processed");
}

The else branch is not optional when you pattern-match on Some(...). If both channels close and there is no else, select! will panic because no branch can make progress. Always include else when all branches use fallible patterns.

Branch preconditions: the , if condition syntax disables a branch before select! evaluates it. This is essential when polling a pinned future by reference inside a loop — once the future completes, the branch must be disabled or the next iteration will attempt to poll an already-resolved future, causing a panic:

use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

async fn catalog_refresh() -> Vec<u8> {
    sleep(Duration::from_millis(100)).await;
    vec![0u8; 128]
}

#[tokio::main]
async fn main() {
    let (tx, mut cmd_rx) = mpsc::channel::<String>(8);
    // Drop the sender immediately so recv() returns None and the else
    // branch can fire; with a live sender that never sends, the loop
    // would hang forever once the refresh completes.
    drop(tx);

    let refresh = catalog_refresh();
    tokio::pin!(refresh);
    let mut refresh_done = false;

    for _ in 0..5 {
        tokio::select! {
            // Branch is disabled once refresh_done = true.
            // Without this precondition: panic on second iteration.
            result = &mut refresh, if !refresh_done => {
                println!("catalog refreshed: {} bytes", result.len());
                refresh_done = true;
            }
            Some(cmd) = cmd_rx.recv() => {
                println!("command: {cmd}");
            }
            else => break,
        }
    }
}

When the precondition is false, select! simply skips that branch. If all branches are disabled by preconditions, select! panics — so structure your logic to ensure at least one branch is always eligible or an else handles the case.

Graceful Shutdown Pattern

A production service needs a defined shutdown sequence. For the Meridian control plane:

  1. Stop accepting new connections.
  2. Signal active session tasks to finish or cancel.
  3. Wait for tasks to drain (with a deadline — do not wait forever).
  4. Flush pending telemetry to downstream consumers.
  5. Exit cleanly.
use std::time::Duration;
use tokio::sync::broadcast;

struct ShutdownCoordinator {
    sender: broadcast::Sender<()>,
}

impl ShutdownCoordinator {
    fn new() -> Self {
        let (sender, _) = broadcast::channel(1);
        Self { sender }
    }

    fn subscribe(&self) -> broadcast::Receiver<()> {
        self.sender.subscribe()
    }

    async fn shutdown(&self, tasks: Vec<tokio::task::JoinHandle<()>>) {
        // Signal all subscribers.
        let _ = self.sender.send(());

        // Give tasks 10 seconds to drain. Stragglers are not aborted here;
        // process teardown (see below) deals with them.
        let deadline = Duration::from_secs(10);
        let _ = tokio::time::timeout(deadline, async {
            for handle in tasks {
                // Ignore individual task errors during shutdown.
                let _ = handle.await;
            }
        })
        .await;
    }
}

The coordinator sends a shutdown signal over a broadcast channel. Each session task holds a Receiver and uses tokio::select! to race its work against the shutdown signal. After broadcasting, shutdown awaits all handles with a 10-second deadline. Any task that has not completed by then is left to the OS — in a containerized environment, the container will be killed by the orchestrator anyway.
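
The session side of that pattern, sketched with a hypothetical do_session_work placeholder:

async fn session_task(mut shutdown: tokio::sync::broadcast::Receiver<()>) {
    loop {
        tokio::select! {
            _ = shutdown.recv() => {
                // Drain in-flight frames here; finishing inside the
                // coordinator's 10-second deadline avoids being stranded.
                break;
            }
            _ = do_session_work() => {}
        }
    }
}

// Placeholder for the real frame-processing step.
async fn do_session_work() {
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
}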


Code Examples

Managing a Satellite Pass Session with Full Lifecycle Control

A pass session has a well-defined lifetime: it starts when the satellite rises above the ground station horizon and ends when it sets. The session task must complete cleanly if the pass ends normally, abort gracefully on shutdown, and timeout if the satellite goes silent mid-pass (antenna tracking failure, power anomaly).

use anyhow::Result;
use tokio::{
    sync::oneshot,
    time::{timeout, Duration},
};
use tracing::{info, warn};

#[derive(Debug)]
struct PassSession {
    satellite_id: u32,
    ground_station: String,
}

impl Drop for PassSession {
    fn drop(&mut self) {
        // Synchronous state flush — no async.
        // In production, push final state to a lock-free ring buffer
        // that a background writer drains to persistent storage.
        info!(
            satellite_id = self.satellite_id,
            ground_station = %self.ground_station,
            "PassSession dropped — flushing state synchronously"
        );
    }
}

impl PassSession {
    async fn run(&mut self) -> Result<()> {
        info!(
            satellite_id = self.satellite_id,
            "pass session started"
        );
        // Simulate frame processing loop.
        // In production: read frames from TcpStream, validate, forward.
        for frame_num in 0u32..100 {
            tokio::time::sleep(Duration::from_millis(50)).await;
            info!(frame = frame_num, "frame processed");
        }
        Ok(())
    }
}

async fn manage_pass(
    satellite_id: u32,
    ground_station: String,
    pass_duration: Duration,
    mut shutdown_rx: oneshot::Receiver<()>,
) -> Result<()> {
    let mut session = PassSession {
        satellite_id,
        ground_station,
    };

    // Race: session completion, pass duration timeout, or shutdown signal.
    tokio::select! {
        result = timeout(pass_duration, session.run()) => {
            match result {
                Ok(Ok(())) => info!(satellite_id, "pass completed normally"),
                Ok(Err(e)) => warn!(satellite_id, "session error: {e}"),
                Err(_) => warn!(satellite_id, "pass duration exceeded — session timed out"),
            }
        }
        _ = &mut shutdown_rx => {
            // The timeout(session.run()) future is dropped by select!;
            // PassSession::drop runs when manage_pass returns, flushing
            // state before the task exits.
            warn!(satellite_id, "pass cancelled: shutdown received");
        }
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let (shutdown_tx, shutdown_rx) = oneshot::channel::<()>();

    let handle = tokio::spawn(manage_pass(
        25544,
        "gs-svalbard".to_string(),
        Duration::from_secs(30),
        shutdown_rx,
    ));

    // Simulate shutdown signal after 1 second.
    tokio::time::sleep(Duration::from_secs(1)).await;
    let _ = shutdown_tx.send(());

    match handle.await {
        Ok(Ok(())) => info!("task completed"),
        Ok(Err(e)) => warn!("task error: {e}"),
        Err(e) if e.is_cancelled() => warn!("task was aborted externally"),
        Err(e) => warn!("task panicked: {e}"),
    }

    Ok(())
}

Key decisions in this code: the Drop impl handles synchronous cleanup, which is guaranteed to run whether the session completes normally, times out, or is cancelled. The select! gives the session three possible exit paths with distinct log entries — observable, diagnosable behavior rather than silent state corruption. The outer .await on the handle distinguishes between clean task exit, application errors, external abort, and panics.


Key Takeaways

  • JoinHandle<T> awaits as Result<T, JoinError>. Distinguish between panics and cancellation using e.is_panic() / e.is_cancelled(). Never .unwrap() a JoinHandle in production code without a comment explaining the invariant.

  • Dropping a JoinHandle does not cancel the task. Call .abort() explicitly if you need cancellation on drop. .abort() takes effect at the task's next .await point, not immediately — synchronous code between yield points runs to completion.

  • Async cleanup after an .await is not guaranteed on cancellation. Put mandatory cleanup in Drop (synchronous) or use CancellationToken to intercept the shutdown signal and run async teardown before the task exits.

  • tokio::time::timeout wraps any future with a deadline. On expiry, it cancels the inner future at its next .await. Resources held by the cancelled future are dropped via RAII — no manual cleanup needed if your types implement Drop correctly.

  • tokio::select! runs all branches on the same task — they multiplex, they do not parallelize. Branches randomly compete for selection when multiple are ready, which prevents starvation. Use tokio::spawn when you need true independent scheduling; use select! when you need lightweight concurrency within a single task.

  • select! branch preconditions (, if condition) disable a branch before evaluation. Always use them with pinned futures in loops to prevent the "async fn resumed after completion" panic.

  • In select! loops, always include an else branch when all active branches use fallible patterns like Some(...). The else branch fires when all patterns fail to match — typically when all channels are closed — and provides the clean exit condition.

  • CancellationToken (from tokio-util) is the preferred cancellation primitive for cooperative shutdown. Cloning shares the same cancellation event. .cancelled().await composes naturally with select! and, unlike .abort(), allows the task to run async cleanup before exiting.

  • TaskTracker (from tokio-util) is the preferred drain primitive for shutdown. Spawn tasks through the tracker, call .close() when done spawning, then .wait().await to block until all tasks finish. This avoids the JoinHandle Vec pattern and correctly handles the close/wait ordering requirement.


Project — Async Telemetry Ingestion Broker

Module: Foundation — M01: Async Rust Fundamentals
Prerequisite: All three module quizzes passed (≥70%)



Mission Brief

TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0041 — Telemetry Ingestion Broker Replacement


The legacy Python telemetry broker is being decommissioned. It accepted connections sequentially on a single thread and could not keep up beyond 12 concurrent ground station feeds. With constellation expansion to 48 LEO satellites and 12 active ground station sites, the broker routinely falls behind during peak pass windows, accumulating up to 40 seconds of lag before flushing — unacceptable for conjunction avoidance workflows that require sub-10-second delivery.

Your task is to implement the replacement broker in Rust using Tokio. The broker must accept concurrent TCP connections from ground stations, parse incoming telemetry frames, and fan each frame out to multiple registered downstream handlers — without blocking on any single slow handler.

The broker does not perform conjunction computation. It is a pure ingress and distribution layer. Correctness, throughput, and clean lifecycle management are the acceptance criteria.


System Specification

Connection Model

  • Ground stations connect over TCP to a configurable bind address.
  • Each connection streams telemetry frames encoded as length-prefixed byte sequences: a 4-byte big-endian u32 length header followed by length bytes of payload.
  • Connections are persistent for the duration of a satellite pass (8–12 minutes). They may drop and reconnect within a pass without notice.
  • The broker must handle up to 48 concurrent connections without degradation.

Frame Routing

  • Registered downstream handlers receive every frame via a bounded tokio::sync::broadcast channel.
  • If a slow handler's receiver falls behind and the broadcast channel fills, it is the handler's problem — the broker must not block or slow its ingress path to accommodate a slow consumer.
  • The broker logs a warning when a receiver falls behind (broadcast returns RecvError::Lagged).

Lifecycle

  • The broker accepts a shutdown signal (a tokio::sync::watch or oneshot channel) and performs graceful shutdown:
    1. Stop accepting new connections.
    2. Signal all active session tasks to drain and exit.
    3. Wait up to 10 seconds for tasks to finish.
    4. Force-abort any remaining tasks and exit.
  • Session tasks must flush their in-progress frame before shutting down (complete the current frame read, then exit — do not abort mid-frame).

Expected Output

A binary crate (meridian-broker) that:

  1. Binds a TCP listener on a configurable address (default 0.0.0.0:7777).
  2. Spawns a new async task per incoming connection.
  3. Each task reads frames using the length-prefix protocol.
  4. Each parsed frame is sent over a broadcast::Sender<Frame>.
  5. A configurable number of simulated downstream handler tasks subscribe to the broadcast channel and print/log received frames.
  6. Ctrl-C triggers graceful shutdown with the sequence described above.

The binary should run, accept at least one connection from telnet or netcat with hand-crafted bytes, and log frame receipt and shutdown cleanly.


Acceptance Criteria

Each criterion is verifiable; the verification method follows it.

  1. Broker accepts ≥2 simultaneous TCP connections without either blocking the other.
     Verify: connect two nc sessions concurrently.
  2. Frames are delivered to all registered downstream handlers.
     Verify: log output shows frame receipt on each handler.
  3. A slow downstream handler does not stall frame ingestion on other connections.
     Verify: add a tokio::time::sleep in one handler; other connections continue at full rate.
  4. Ctrl-C triggers graceful shutdown; in-progress frame reads complete before the task exits.
     Verify: observable in log output.
  5. If shutdown drain exceeds 10 seconds, remaining tasks are aborted.
     Verify: simulate a stuck task and verify the process exits within 11 seconds.
  6. No .unwrap() on JoinHandle::await or channel send/receive in production paths.
     Verify: code review.
  7. spawn_blocking is used for any synchronous I/O or CPU-intensive frame processing.
     Verify: code review.

Frame Format Reference

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Payload Length (u32 BE)                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Payload (variable length)                   |
|                          ...                                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A Frame struct for the purpose of this project:

#[derive(Clone, Debug)]
pub struct Frame {
    pub station_id: String,
    pub payload: Vec<u8>,
}

Hints

Hint 1 — Reading length-prefixed frames

tokio::io::AsyncReadExt provides .read_exact(&mut buf) which reads exactly buf.len() bytes or returns an error. Use it to read the 4-byte header, parse the length, allocate the payload buffer, and read the payload:

use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

async fn read_frame(stream: &mut TcpStream) -> anyhow::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf).await?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload).await?;
    Ok(payload)
}
Hint 2 — Broadcast channel for fan-out

tokio::sync::broadcast::channel(capacity) returns (Sender<T>, Receiver<T>). Additional receivers are created with sender.subscribe(). Receivers that fall behind by more than capacity messages receive Err(RecvError::Lagged(n)) — not an error in the fatal sense, just a signal that they missed n messages. Log it and continue receiving.

use tokio::sync::broadcast;

let (tx, _rx) = broadcast::channel::<Frame>(256);

// Downstream handler
let mut rx = tx.subscribe();
tokio::spawn(async move {
    loop {
        match rx.recv().await {
            Ok(frame) => { /* process */ }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                tracing::warn!(missed = n, "handler fell behind");
            }
            Err(broadcast::error::RecvError::Closed) => break,
        }
    }
});
Hint 3 — Graceful shutdown with watch channel

tokio::sync::watch is well-suited for broadcasting a shutdown signal to an arbitrary number of tasks: one sender, many receivers, each receiver can check the current value or wait for a change.

use tokio::sync::watch;

let (shutdown_tx, shutdown_rx) = watch::channel(false);

// In each session task:
let mut shutdown = shutdown_rx.clone();
tokio::select! {
    result = read_frames(&mut stream) => { /* ... */ }
    _ = shutdown.changed() => {
        tracing::info!("shutdown received, finishing current frame");
        // complete current frame read if mid-frame, then return
    }
}

// In shutdown handler:
let _ = shutdown_tx.send(true);
Hint 4 — Collecting JoinHandles for drain

Keep a Vec<JoinHandle<()>> of spawned session tasks. During shutdown, wrap the drain loop in tokio::time::timeout:

let drain_deadline = Duration::from_secs(10);
let drain_result = tokio::time::timeout(drain_deadline, async {
    for handle in session_handles {
        let _ = handle.await; // ignore individual task errors
    }
}).await;

if drain_result.is_err() {
    tracing::warn!("drain deadline exceeded — some tasks may not have flushed");
}

After the timeout, any remaining JoinHandles are simply dropped (their tasks keep running), or you can collect them first and .abort() the stragglers explicitly.


Reference Implementation

Attempt the project first; then compare against the reference implementation below.
// src/main.rs
use anyhow::Result;
use std::sync::Arc;
use tokio::{
    net::{TcpListener, TcpStream},
    sync::{broadcast, watch},
    time::{timeout, Duration},
};
use tokio::io::AsyncReadExt;
use tracing::{info, warn};

#[derive(Clone, Debug)]
pub struct Frame {
    pub station_id: String,
    pub payload: Vec<u8>,
}

async fn handle_connection(
    mut stream: TcpStream,
    station_id: String,
    frame_tx: broadcast::Sender<Frame>,
    mut shutdown_rx: watch::Receiver<bool>,
) {
    info!(station = %station_id, "connection established");
    loop {
        // Race only the 4-byte header read against shutdown. read_exact is
        // not cancel-safe, so racing an entire frame read (as in Hint 1)
        // would abandon partially-read frames on shutdown; once a header
        // has arrived, the payload read below runs to completion instead.
        let mut len_buf = [0u8; 4];
        tokio::select! {
            result = stream.read_exact(&mut len_buf) => {
                if let Err(e) = result {
                    // EOF or read error — connection dropped.
                    info!(station = %station_id, "connection closed: {e}");
                    break;
                }
            }
            changed = shutdown_rx.changed() => {
                // Sender dropped or value changed — check the flag either way.
                if changed.is_err() || *shutdown_rx.borrow() {
                    info!(station = %station_id, "shutdown — no frame in progress, exiting");
                    break;
                }
                // The broker only ever sends `true`, so this re-arm is
                // effectively unreachable.
                continue;
            }
        }

        let len = u32::from_be_bytes(len_buf) as usize;
        if len > 65_536 {
            // Reject oversized frames — likely a protocol error or malicious client.
            warn!(station = %station_id, "frame length {len} exceeds maximum 65536 bytes");
            break;
        }

        // Complete the current frame even if shutdown arrives mid-read.
        let mut payload = vec![0u8; len];
        if let Err(e) = stream.read_exact(&mut payload).await {
            info!(station = %station_id, "connection closed mid-frame: {e}");
            break;
        }

        let frame = Frame {
            station_id: station_id.clone(),
            payload,
        };
        // A broadcast send fails only when every receiver has been dropped —
        // the broker is shutting down, so stop reading.
        if frame_tx.send(frame).is_err() {
            break;
        }

        // Observe a pending shutdown between frames, before the next header read.
        if *shutdown_rx.borrow() {
            info!(station = %station_id, "shutdown — current frame flushed, exiting");
            break;
        }
    }

    info!(station = %station_id, "connection handler exiting");
}

fn spawn_handler(
    id: usize,
    mut rx: broadcast::Receiver<Frame>,
) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        loop {
            match rx.recv().await {
                Ok(frame) => {
                    info!(
                        handler = id,
                        station = %frame.station_id,
                        bytes = frame.payload.len(),
                        "frame received"
                    );
                }
                Err(broadcast::error::RecvError::Lagged(n)) => {
                    warn!(handler = id, missed = n, "handler fell behind — lagged");
                }
                Err(broadcast::error::RecvError::Closed) => {
                    info!(handler = id, "broadcast channel closed, handler exiting");
                    break;
                }
            }
        }
    })
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let bind_addr = "0.0.0.0:7777";
    let listener = TcpListener::bind(bind_addr).await?;
    info!("meridian broker listening on {bind_addr}");

    let (frame_tx, _) = broadcast::channel::<Frame>(256);
    let (shutdown_tx, shutdown_rx) = watch::channel(false);

    // Spawn 3 downstream handlers.
    let mut handler_handles: Vec<tokio::task::JoinHandle<()>> = (0..3)
        .map(|i| spawn_handler(i, frame_tx.subscribe()))
        .collect();

    // Ctrl-C handler.
    let shutdown_tx = Arc::new(shutdown_tx);
    let shutdown_tx_ctrlc = shutdown_tx.clone();
    tokio::spawn(async move {
        tokio::signal::ctrl_c().await.expect("failed to listen for ctrl-c");
        info!("ctrl-c received — initiating graceful shutdown");
        let _ = shutdown_tx_ctrlc.send(true);
    });

    let mut session_handles: Vec<tokio::task::JoinHandle<()>> = Vec::new();
    let mut conn_id = 0usize;

    loop {
        // Stop accepting new connections once shutdown is signalled.
        if *shutdown_rx.borrow() {
            break;
        }

        tokio::select! {
            accept = listener.accept() => {
                match accept {
                    Ok((stream, addr)) => {
                        conn_id += 1;
                        let station_id = format!("gs-{conn_id}@{addr}");
                        let handle = tokio::spawn(handle_connection(
                            stream,
                            station_id,
                            frame_tx.clone(),
                            shutdown_rx.clone(),
                        ));
                        session_handles.push(handle);
                    }
                    Err(e) => warn!("accept error: {e}"),
                }
            }
            _ = shutdown_rx.changed() => {
                if *shutdown_rx.borrow() {
                    break;
                }
            }
        }
    }

    info!("draining {} active sessions (10s deadline)", session_handles.len());

    // Drop the broadcast sender so downstream handlers see Closed after drain.
    drop(frame_tx);

    let drain_result = timeout(Duration::from_secs(10), async {
        for handle in session_handles {
            let _ = handle.await;
        }
        for handle in handler_handles.drain(..) {
            let _ = handle.await;
        }
    })
    .await;

    if drain_result.is_err() {
        warn!("drain deadline exceeded — forcing exit");
    } else {
        info!("all tasks drained cleanly");
    }

    Ok(())
}

Cargo.toml dependencies:

[dependencies]
tokio = { version = "1", features = ["full"] }
anyhow = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }

Testing the broker manually:

# Terminal 1: run the broker
RUST_LOG=info cargo run

# Terminal 2: send a frame (4-byte length prefix = 5, then "hello")
printf '\x00\x00\x00\x05hello' | nc localhost 7777

# Terminal 3: send concurrently (length prefix = 8, then "meridian")
printf '\x00\x00\x00\x08meridian' | nc localhost 7777

# Ctrl-C in Terminal 1 to trigger graceful shutdown

Reflection

After completing this project, you have built the entry point for Meridian's control plane ingress. The patterns used here — broadcast fan-out, select!-driven shutdown, bounded drain with timeout, JoinHandle collection — recur throughout the rest of the Foundation modules and into the Data Pipelines track.

Consider for further exploration: what happens if the broker receives 10,000 connections? At what point does the spawn-per-connection model become a problem, and what is the alternative? How would you add backpressure from downstream handlers back to the ingress path without stalling the broker? These questions are the starting point for Module 3 (Message Passing Patterns).

Module 02 — Concurrency Primitives

Track: Foundation — Mission Control Platform
Position: Module 2 of 6
Source material: Rust Atomics and Locks — Mara Bos, Chapters 1–3
Quiz pass threshold: 70% on all three lessons to unlock the project



Mission Context

The Meridian control plane is not a purely async system. The async runtime handles the high-frequency I/O path. But the control plane also runs CPU-bound conjunction checks, synchronous vendor libraries with C FFI, a shared priority command table written by multiple connections and read by the session dispatcher, and lock-free statistics counters sampled by the monitoring dashboard.

The Python system handled shared state with a global threading lock. Over the last six months of operation, that lock has caused four production incidents. This module establishes the Rust concurrency model that replaces it — not by eliminating shared state, but by giving you the type-level guarantees and primitive toolkit to reason about it precisely.


What You Will Learn

By the end of this module you will be able to:

  • Distinguish OS threads from async tasks at the scheduling level, and route work to the correct model for its characteristics — blocking work to spawn_blocking or scoped threads, I/O-bound work to async tasks
  • Use Send and Sync to reason about which types can cross thread boundaries, and understand why Rc, Cell, and raw pointers opt out
  • Implement shared mutable state with Mutex<T> and RwLock<T>, manage MutexGuard lifetimes correctly, and handle lock poisoning appropriately
  • Identify the three deadlock patterns that cause most production incidents and apply the structural patterns that prevent them
  • Use tokio::sync::Mutex when locks must be held across .await points in async code
  • Apply atomic operations (fetch_add, compare_exchange, load/store) and select the correct memory ordering (Relaxed, Acquire/Release, SeqCst) for the intended guarantee

Lessons

Lesson 1 — Threads vs Async Tasks: When to Use Each and Why

Covers std::thread::spawn vs tokio::spawn, the preemptive vs cooperative scheduling distinction, thread::scope for scoped threads with borrowed data, Send and Sync as the compiler's enforcement mechanism, and spawn_blocking as the bridge between the two models.

Key question this lesson answers: When a piece of work needs to happen concurrently, how do you decide between an OS thread and an async task — and what goes wrong if you choose incorrectly?

lesson-01-threads-vs-async.md / lesson-01-quiz.toml


Lesson 2 — Shared State: Mutex, RwLock, and Avoiding Deadlocks

Covers Mutex<T> mechanics (locking, MutexGuard, RAII unlock, lock poisoning), RwLock<T> and the read-heavy access pattern, MutexGuard lifetime pitfalls in async code, tokio::sync::Mutex, Condvar for blocking on data conditions, and the three deadlock patterns with structural prevention strategies.

Key question this lesson answers: How do you share mutable data between threads correctly, and what are the failure modes that Rust's type system does not prevent?

lesson-02-shared-state.md / lesson-02-quiz.toml


Lesson 3 — Atomics and Memory Ordering: Acquire/Release/SeqCst in Practice

Covers atomic types and the operations they support (load/store, fetch_add/fetch_sub, compare_exchange), memory ordering (Relaxed, Acquire/Release, AcqRel, SeqCst), the happens-before relationship established by the Acquire/Release pair, and the practical decision of when to use atomics vs a mutex.

Key question this lesson answers: When is a mutex overkill, and how do you safely share single values between threads without locking — including ensuring the processor and compiler do not reorder the operations that matter?

lesson-03-atomics.md / lesson-03-quiz.toml


Capstone Project — Ground Station Command Queue

Build a typed, concurrent priority command queue for Meridian's mission operations system. The queue accepts commands from multiple concurrent ground station producer threads, dispatches them to a consumer in priority order with FIFO tie-breaking, blocks producers when at capacity (without dropping commands), exposes lock-free metrics readable without acquiring the queue lock, and shuts down gracefully by draining remaining commands.

Acceptance is against 7 verifiable criteria including correct priority dispatch, non-busy-waiting, lock-free metrics access, and clean shutdown drain.

project-command-queue.md


Prerequisites

Module 1 (Async Rust Fundamentals) must be complete. This module assumes you understand how async tasks are scheduled and why blocking an async worker thread is harmful — that understanding is the foundation for the threads-vs-async distinction in Lesson 1 and the async Mutex guidance in Lesson 2.

What Comes Next

Module 3 — Message Passing Patterns builds the next layer: rather than sharing state between concurrent actors, you pass ownership of data through channels. The command queue from this module's project is extended in Module 3 with a tokio::sync::mpsc front-end, moving backpressure into async channel semantics.

Lesson 1 — Threads vs Async Tasks: When to Use Each and Why

Module: Foundation — M02: Concurrency Primitives
Position: Lesson 1 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapter 1



Context

The Meridian control plane is not a purely async system. The async runtime handles the high-frequency path — accepting ground station connections, reading telemetry frames, routing them to downstream consumers. But the control plane also runs work that has no business on an async worker thread: a vendor-supplied TLE validation library with a synchronous C FFI, a CPU-intensive conjunction check that processes several hundred orbital elements per pass, and a legacy configuration parser that performs synchronous file I/O.

The Python system handled this by running everything on threads, leaning on the GIL to serialize concurrent access. The Rust replacement needs a deliberate model. The first decision you make when writing any new piece of the control plane is: does this go on an async task or an OS thread? Getting this wrong produces either a system that starves its async executor with blocking work, or one that spawns OS threads unnecessarily, paying per-thread stack overhead at scale.

This lesson establishes the model. Every rule here has a corresponding failure mode that has been observed in Meridian's staging environment.

Source: Rust Atomics and Locks, Chapter 1 (Bos)


Core Concepts

The Fundamental Difference

An OS thread is scheduled by the kernel. The kernel decides when it runs, when it is preempted, and which CPU core it runs on. The thread has its own stack (typically 2–8 MB by default), and blocking — whether on I/O, a mutex, or std::thread::sleep — is perfectly safe: the kernel parks the thread and runs something else.

An async task is scheduled by the executor. It runs until it voluntarily yields at an await point. It shares executor worker threads with other tasks. Blocking on the worker thread — calling a synchronous library, running a long computation, sleeping with std::thread::sleep — starves every other task scheduled on that thread. There is no kernel to preempt you and run something else.

This is the core rule: any call that can block a thread for non-trivial time belongs on an OS thread, not on an async worker thread. In Tokio, the mechanism is spawn_blocking, which routes the closure to a dedicated blocking thread pool. From the async side, it looks like an awaitable future. On the execution side, it gets a real OS thread.
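
A minimal sketch of that bridge — the summation closure is a stand-in for any blocking call:

use tokio::task;

#[tokio::main]
async fn main() {
    // Stand-in for CPU-bound work that would otherwise stall a worker thread.
    let checksum = task::spawn_blocking(|| (0u64..1_000_000).sum::<u64>())
        .await
        .expect("blocking task panicked");
    println!("checksum: {checksum}");
}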

std::thread::spawn — Ownership and Lifetimes

std::thread::spawn takes a closure that is Send + 'static. The 'static requirement means the thread cannot borrow from the spawning scope — it must own everything it uses, or access data through shared references that are themselves 'static (like Arc).

use std::thread;
use std::sync::Arc;

fn main() {
    let catalog = Arc::new(vec!["ISS", "CSS", "STARLINK-1"]);

    let handle = thread::spawn({
        let catalog = Arc::clone(&catalog);
        move || {
            // catalog is owned by this thread — no borrow, no lifetime issue.
            println!("Thread sees {} objects", catalog.len());
        }
    });

    handle.join().unwrap();
    println!("Main sees {} objects", catalog.len());
}

The Arc::clone before the move is idiomatic: clone the handle, not the data. The thread gets its own Arc pointer (cheap — one atomic increment), and both threads share the underlying Vec. When both Arcs drop, the Vec deallocates.

thread::scope — Scoped Threads with Borrowed Data

The 'static requirement on spawn prevents borrowing stack data. thread::scope lifts this restriction: threads spawned within a scope are guaranteed to finish before the scope exits, which allows them to borrow data from the enclosing frame.

use std::thread;

fn validate_tle_batch(records: &[String]) -> usize {
    let mid = records.len() / 2;
    let (left, right) = records.split_at(mid);

    // Scoped threads can borrow `left` and `right` — no Arc, no clone.
    thread::scope(|s| {
        let left_handle = s.spawn(|| left.iter().filter(|r| r.starts_with("1 ")).count());
        let right_handle = s.spawn(|| right.iter().filter(|r| r.starts_with("1 ")).count());

        // scope blocks here until both threads finish.
        left_handle.join().unwrap() + right_handle.join().unwrap()
    })
}

fn main() {
    let records: Vec<String> = (0..100)
        .map(|i| format!("{} {:05}U record", if i % 2 == 0 { "1" } else { "2" }, i))
        .collect();
    println!("{} valid TLE lines", validate_tle_batch(&records));
}

thread::scope is the right tool for data-parallel CPU work over a borrowed slice — exactly the conjunction check pattern in the Meridian pipeline. No heap allocation, no Arc, no 'static constraint. The compiler enforces that the borrowed data outlives the scope.

Send and Sync — The Type System's Enforcement

Rust enforces thread safety through two marker traits (Rust Atomics and Locks, Ch. 1):

Send: a type is Send if ownership of a value of that type can be transferred to another thread. Arc<T> is Send (if T: Send + Sync), but Rc<T> is not — Rc's reference count is non-atomic and would race if shared across threads.

Sync: a type is Sync if it can be shared between threads by shared reference. i32 is Sync. Cell<i32> is not — mutating through a shared reference is not safe across threads.

The compiler enforces these automatically. You cannot accidentally send an Rc<T> to another thread — thread::spawn requires Send, and Rc does not implement it. You cannot share a RefCell<T> across threads by reference — sharing requires Sync, which RefCell does not implement; wrap the data in a Mutex instead.

Both traits are auto traits, implemented automatically: a struct whose fields are all Send is itself Send. The common exceptions are raw pointers (*const T, *mut T), Rc, Cell, RefCell, and types that wrap OS handles that are not thread-safe. When you implement a type that wraps these, you must opt in to Send/Sync manually with unsafe impl, accepting responsibility for the invariant.
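
A cheap compile-time check for these properties is a pair of marker-bound helper functions — a sketch, with TelemetryBatch as a hypothetical all-Send type:

fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

// Hypothetical type: every field is Send + Sync, so the struct is too.
#[allow(dead_code)]
struct TelemetryBatch {
    frames: Vec<Vec<u8>>,
    station: String,
}

fn main() {
    assert_send::<TelemetryBatch>(); // compiles — auto traits derived
    assert_sync::<TelemetryBatch>();
    // assert_send::<std::rc::Rc<u32>>(); // compile error: Rc is not Send
}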

Choosing the Right Model

The decision tree for any piece of work in the control plane:

Work type                                          Right model      Mechanism
Concurrent TCP connections, channel receive/send   Async task       tokio::spawn
CPU-bound computation (conjunction check, CRC)     Blocking thread  spawn_blocking
Synchronous vendor library (C FFI)                 Blocking thread  spawn_blocking
Synchronous file I/O (std::fs)                     Blocking thread  spawn_blocking
Data-parallel work over borrowed data              Scoped threads   thread::scope
Independent long-running background service        OS thread        thread::spawn

The cost difference matters at scale. An OS thread on Linux has a default 8 MB stack reservation (even if physical pages are not committed until used), a kernel thread structure, and scheduling overhead. Tokio tasks use a few hundred bytes of heap. The control plane at 48 uplinks can sustain thousands of concurrent tasks trivially; it cannot sustain thousands of OS threads without careful stack-size tuning.


Code Examples

Mixing Async and Blocking: The Vendor TLE Validator

The TLE validation library provided by Meridian's orbit data vendor is a synchronous C library wrapped in a Rust FFI crate. It performs checksum validation and orbital element range checking — purely CPU work, no I/O, but it takes 2–15ms per record depending on complexity. Calling it from an async task would stall the executor for the duration.

use std::time::Duration;
use tokio::task;

// Simulates a synchronous vendor library call.
// In production: calls into the C FFI wrapper.
fn validate_tle_sync(line1: &str, line2: &str) -> Result<(), String> {
    // Vendor library does checksum + orbital element bounds checking.
    // Blocks for 2–15ms depending on record complexity.
    std::thread::sleep(Duration::from_millis(5)); // placeholder
    if line1.starts_with("1 ") && line2.starts_with("2 ") {
        Ok(())
    } else {
        Err(format!("malformed TLE: {line1}"))
    }
}

async fn validate_tle_async(line1: String, line2: String) -> Result<(), String> {
    // Move strings into the blocking closure.
    // spawn_blocking runs on the dedicated blocking thread pool —
    // async worker threads are not touched.
    task::spawn_blocking(move || validate_tle_sync(&line1, &line2))
        .await
        // JoinError means the blocking thread panicked.
        .map_err(|e| format!("validator panicked: {e}"))?
}

#[tokio::main]
async fn main() {
    // Six demo records stand in for the 48 concurrent sessions.
    // Each validation runs on the blocking pool; none stall the async workers.
    let tasks: Vec<_> = (0..6).map(|i| {
        tokio::spawn(validate_tle_async(
            format!("1 {:05}U 98067A   21275.52  .00001234  00000-0  12345-4 0  999{i}", i),
            format!("2 {:05}  51.6400 337.6640 0007417  62.6000 297.5200 15.4888958300000{i}", i),
        ))
    }).collect();

    for (i, t) in tasks.into_iter().enumerate() {
        match t.await.unwrap() {
            Ok(()) => println!("record {i}: valid"),
            Err(e) => println!("record {i}: {e}"),
        }
    }
}

Scoped Threads for Parallel Conjunction Screening

The conjunction screening pass runs every 10 minutes against the full 50k-object catalog. It splits the catalog across CPU cores using scoped threads. The catalog is a large Vec<OrbitalRecord> — no clone, no Arc, just borrowed slices distributed across workers.

use std::thread;

#[derive(Clone)]
struct OrbitalRecord {
    norad_id: u32,
    altitude_km: f64,
}

struct ConjunctionAlert {
    object_a: u32,
    object_b: u32,
    closest_approach_km: f64,
}

fn screen_shard(shard: &[OrbitalRecord], threshold_km: f64) -> Vec<ConjunctionAlert> {
    // Simplified: real implementation computes relative positions via SGP4.
    shard.windows(2)
        .filter(|pair| (pair[0].altitude_km - pair[1].altitude_km).abs() < threshold_km)
        .map(|pair| ConjunctionAlert {
            object_a: pair[0].norad_id,
            object_b: pair[1].norad_id,
            closest_approach_km: (pair[0].altitude_km - pair[1].altitude_km).abs(),
        })
        .collect()
}

fn run_conjunction_screen(catalog: &[OrbitalRecord], threshold_km: f64) -> Vec<ConjunctionAlert> {
    let num_cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);
    let shard_size = (catalog.len() + num_cores - 1) / num_cores;

    thread::scope(|s| {
        let handles: Vec<_> = catalog
            .chunks(shard_size)
            .map(|shard| s.spawn(move || screen_shard(shard, threshold_km)))
            .collect();

        handles.into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

fn main() {
    let catalog: Vec<OrbitalRecord> = (0..1000)
        .map(|i| OrbitalRecord { norad_id: i, altitude_km: 400.0 + (i as f64 * 0.3) })
        .collect();

    let alerts = run_conjunction_screen(&catalog, 5.0);
    println!("{} conjunction alerts generated", alerts.len());
}

Each shard runs on its own OS thread via thread::scope, borrowing its slice without any heap allocation for sharing. The scope blocks until all workers finish, then results are collected. This is the correct pattern for data-parallel CPU work where all input data is available upfront and results need to be aggregated.


Key Takeaways

  • OS threads are preemptively scheduled by the kernel. Async tasks are cooperatively scheduled by the executor. Blocking on an async worker thread — any call that does not yield at await — starves other tasks on that thread.

  • Use spawn_blocking for any synchronous, blocking, or CPU-intensive work that originates in an async context. It routes work to a dedicated thread pool separate from the async workers.

  • thread::scope allows scoped threads to borrow data from the enclosing frame without Arc or 'static constraints. It is the right tool for data-parallel work over borrowed slices. The scope blocks until all spawned threads finish.

  • Send and Sync are marker traits enforced at compile time. Send permits transferring ownership across threads; Sync permits sharing by reference. Violating these constraints — sending Rc, sharing Cell — is a compile error, not a runtime race.

  • The thread vs async decision is about scheduling model, not concurrency. Both models run work concurrently. The difference is what happens when work blocks: OS threads can block safely; async tasks cannot.


Lesson 2 — Shared State: Mutex, RwLock, and Avoiding Deadlocks

Module: Foundation — M02: Concurrency Primitives
Position: Lesson 2 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapter 1



Context

The Meridian command queue maintains a shared priority table: incoming operator commands are written by the command ingress task, read by the session dispatch task, and occasionally queried by the monitoring dashboard. The Python system used a global dictionary with a threading lock. In production, that lock has been involved in three separate deadlock incidents — two in the same deployment week — all caused by the same root pattern: lock acquired, function called, that function also acquires the same lock.

Rust does not prevent deadlocks at compile time. But it gives you the tools to reason about them precisely: Mutex<T> and RwLock<T> make the protected data visible in the type signature, MutexGuard makes it impossible to access data without holding the lock, and RAII makes it impossible to forget to release it. This lesson covers how these primitives work, the failure modes that remain after Rust's type system has done its job, and the patterns that prevent them.

Source: Rust Atomics and Locks, Chapter 1 (Bos)


Core Concepts

Mutex<T> — Exclusive Access with RAII

std::sync::Mutex<T> wraps a value of type T and enforces that only one thread can access it at a time. The data is inaccessible without locking. There is no way to accidentally read T without going through .lock().

.lock() returns LockResult<MutexGuard<'_, T>>. The MutexGuard dereferences to T and automatically releases the lock when it drops. There is no .unlock() method. The lock is released when the guard goes out of scope — or, critically, when it is explicitly dropped.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let command_count = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4).map(|_| {
        let counter = Arc::clone(&command_count);
        thread::spawn(move || {
            for _ in 0..1000 {
                // Lock is acquired here. Guard is dropped at end of block.
                let mut count = counter.lock().unwrap();
                *count += 1;
                // Guard dropped here — lock released before next iteration.
            }
        })
    }).collect();

    for h in handles { h.join().unwrap(); }
    println!("commands processed: {}", command_count.lock().unwrap());
}

The Arc provides shared ownership across threads (Rc is not Send and will not compile here). The Mutex provides exclusive access. This is the standard pattern for shared mutable state between threads.

Lock Poisoning

When a thread panics while holding a Mutex lock, the mutex is marked poisoned. Subsequent calls to .lock() return Err(PoisonError). The data is still accessible through the error — err.into_inner() returns the MutexGuard — but the poison signals that the data may be in an inconsistent state.

In practice, most Meridian code uses .unwrap() on mutex locks. This is deliberate: if a thread panics while holding the command queue lock, it is not safe to continue operating on potentially corrupted queue state. Propagating the panic is the correct response. The cases where you would recover from a poisoned mutex are rare and require domain-specific knowledge about what "inconsistent" means for that data.

One place where .unwrap() is wrong: in a test or in a thread that genuinely needs to clean up a partially-written state. In those cases, match on the LockResult explicitly.
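
A minimal sketch of the explicit-recovery case — appropriate only when you know how to repair the state the panicking thread left behind:

use std::sync::Mutex;

fn main() {
    let pending = Mutex::new(vec![1u32, 2, 3]);

    let mut guard = match pending.lock() {
        Ok(guard) => guard,
        Err(poisoned) => {
            // A writer panicked mid-update. Here we know the repair:
            // discard the partially-written batch and start clean.
            let mut guard = poisoned.into_inner();
            guard.clear();
            guard
        }
    };
    guard.push(4);
    println!("{:?}", *guard);
}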

MutexGuard Lifetime — A Common Bug

The most common Mutex bug in Rust code is holding a guard longer than intended, or — worse — holding it across an .await point in async code. A guard held across an await parks the lock for the duration of the async operation. If another task tries to acquire the same lock, it will block the async worker thread (since std::sync::Mutex::lock blocks, not yields).

use std::sync::Mutex;

fn main() {
    let data = Mutex::new(vec![1u32, 2, 3]);

    // Pitfall: a guard lives until it is dropped. Re-locking a mutex
    // while its guard is still alive never succeeds.
    {
        let guard = data.lock().unwrap();
        if guard.contains(&2) {
            drop(guard); // Explicitly release before re-locking.
            data.lock().unwrap().push(4);
        }
        // Without the explicit drop, the second lock() above would
        // deadlock or panic — the guard would still be alive.
    }

    println!("{:?}", data.lock().unwrap());
}

In async code, use tokio::sync::Mutex instead of std::sync::Mutex when a lock must be held across an .await: its lock() yields to the executor while waiting rather than blocking the thread. Even then, avoid holding a tokio::sync::MutexGuard across a slow .await — the lock is held for the duration of that await, and every other waiter stays queued behind it.
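
A minimal sketch of the async-lock version — lock() is awaited, so a contended lock yields instead of parking the worker:

use std::sync::Arc;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    let sessions = Arc::new(Mutex::new(Vec::<u32>::new()));

    let writer = {
        let sessions = Arc::clone(&sessions);
        tokio::spawn(async move {
            // Awaiting the lock yields to the executor; the worker
            // thread stays free to run other tasks meanwhile.
            let mut guard = sessions.lock().await;
            guard.push(25544);
            // Guard drops here — keep the critical section short.
        })
    };

    writer.await.unwrap();
    println!("{:?}", *sessions.lock().await);
}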

RwLock<T> — Read Concurrency, Write Exclusivity

RwLock<T> distinguishes between reads and writes. Multiple readers can hold the lock simultaneously; a writer requires exclusive access. This is the concurrent version of RefCell.

It is appropriate when reads are frequent and writes are rare. For the Meridian session state table: many tasks read current session state, but writes only happen when sessions start or end. An RwLock allows those many concurrent reads without serializing them.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

type SessionTable = Arc<RwLock<HashMap<u32, String>>>;

fn register_session(table: &SessionTable, id: u32, station: String) {
    // Write lock — exclusive.
    table.write().unwrap().insert(id, station);
}

fn query_session(table: &SessionTable, id: u32) -> Option<String> {
    // Read lock — concurrent with other readers.
    table.read().unwrap().get(&id).cloned()
}

fn main() {
    let table: SessionTable = Arc::new(RwLock::new(HashMap::new()));

    register_session(&table, 25544, "gs-svalbard".into());

    let readers: Vec<_> = (0..4).map(|_| {
        let t = Arc::clone(&table);
        thread::spawn(move || {
            // All four reader threads can hold the read lock simultaneously.
            println!("{:?}", query_session(&t, 25544));
        })
    }).collect();

    for r in readers { r.join().unwrap(); }
}

RwLock is not always faster than Mutex. If writes are frequent, readers pay the overhead of checking for pending writers. On some platforms, RwLock can starve writers if readers continuously hold the lock. Profile before committing to RwLock as an optimisation. For the common case of a hot write path with rare reads, Mutex is simpler and often faster.

Deadlock Patterns and How to Prevent Them

The classic deadlock involves two resources and two threads acquiring them in opposite order — though, as the second pattern below shows, a single lock is enough when re-entry is involved. Rust's type system does not prevent this. Three patterns cause the vast majority of deadlocks in production:

Lock ordering violation: Thread A acquires lock 1 then lock 2. Thread B acquires lock 2 then lock 1. Each holds what the other needs. Prevention: establish a global lock acquisition order and document it. If the command queue lock must always be acquired before the session table lock, enforce that convention in code review (a sketch of the convention follows these patterns).

Re-entrant locking: std::sync::Mutex is not reentrant. A thread that calls .lock() on a mutex it already holds never gets the lock — the exact behavior is unspecified, but in practice it deadlocks or panics. This is the source of Meridian's production incidents: a function that acquires the lock, calls a helper, and the helper also acquires the same lock.

Prevention: keep lock-holding code flat. Do not call functions while holding a lock unless you can verify they do not acquire the same lock. If a function is callable both with and without a lock held, split it into two versions or restructure the locking scope.

Holding guards across blocking calls: In synchronous code: holding a MutexGuard while calling a function that blocks on I/O. In async code: holding a std::sync::MutexGuard across an .await.

Prevention: minimize the scope of guards. Acquire, mutate, release. Do not hold a lock while doing I/O. In async code, use tokio::sync::Mutex or restructure to release the lock before awaiting.
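
A minimal sketch of the lock-ordering convention from the first pattern, with hypothetical control-plane fields:

use std::sync::Mutex;

// Documented acquisition order: command_queue is always taken
// BEFORE session_table. (Hypothetical fields for illustration.)
struct ControlPlaneState {
    command_queue: Mutex<Vec<String>>,
    session_table: Mutex<Vec<u32>>,
}

impl ControlPlaneState {
    fn assign_command(&self, cmd: String, session_id: u32) {
        // Every code path takes the locks in the same order,
        // so no two threads can hold them in opposite order.
        let mut queue = self.command_queue.lock().unwrap();
        let mut table = self.session_table.lock().unwrap();
        queue.push(cmd);
        table.push(session_id);
        // Both guards drop here, releasing in reverse order.
    }
}

fn main() {
    let state = ControlPlaneState {
        command_queue: Mutex::new(Vec::new()),
        session_table: Mutex::new(Vec::new()),
    };
    state.assign_command("SAFE_MODE".into(), 25544);
    println!("{} queued", state.command_queue.lock().unwrap().len());
}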


Code Examples

The Meridian Priority Command Queue

The command queue receives operator commands from the ground network interface. Commands have integer priorities. The session dispatcher reads the highest-priority pending command. Multiple ground network connections can write concurrently.

use std::collections::BinaryHeap;
use std::cmp::Reverse;
use std::sync::{Arc, Mutex, Condvar};
use std::thread;
use std::time::Duration;

#[derive(Eq, PartialEq)]
struct Command {
    priority: u8,
    payload: String,
}

impl Ord for Command {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        // Higher priority = higher value in max-heap.
        self.priority.cmp(&other.priority)
    }
}
impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

struct CommandQueue {
    // Mutex + Condvar is the standard pattern for blocking producers/consumers.
    inner: Mutex<BinaryHeap<Command>>,
    available: Condvar,
}

impl CommandQueue {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            inner: Mutex::new(BinaryHeap::new()),
            available: Condvar::new(),
        })
    }

    fn push(&self, cmd: Command) {
        self.inner.lock().unwrap().push(cmd);
        // Notify one waiting consumer that data is available.
        self.available.notify_one();
    }

    fn pop_blocking(&self) -> Command {
        let mut queue = self.inner.lock().unwrap();
        // Condvar::wait releases the mutex and blocks until notified,
        // then reacquires the mutex before returning.
        loop {
            if let Some(cmd) = queue.pop() {
                return cmd;
            }
            queue = self.available.wait(queue).unwrap();
        }
    }
}

fn main() {
    let queue = CommandQueue::new();

    // Producer threads simulate ground network connections.
    let producers: Vec<_> = (0..3).map(|i| {
        let q = Arc::clone(&queue);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(i * 10));
            q.push(Command {
                priority: (i as u8 % 3) + 1,
                payload: format!("CMD-{i:04}"),
            });
            println!("producer {i}: pushed priority {}", (i as u8 % 3) + 1);
        })
    }).collect();

    // Consumer runs on a separate thread — simulates session dispatcher.
    let q = Arc::clone(&queue);
    let consumer = thread::spawn(move || {
        for _ in 0..3 {
            let cmd = q.pop_blocking();
            println!("dispatcher: executing '{}' (priority {})", cmd.payload, cmd.priority);
        }
    });

    for p in producers { p.join().unwrap(); }
    consumer.join().unwrap();
}

The Condvar solves the busy-wait problem: without it, the consumer would spin, polling the queue in a tight loop and wasting CPU. Condvar::wait atomically releases the mutex and parks the thread, then reacquires the mutex before returning. The .unwrap() on lock() is intentional: if a producer panics while holding the lock, corrupting the queue, the consumer should not continue silently.


Key Takeaways

  • Mutex<T> makes protected data inaccessible without locking. MutexGuard is the only way to reach the data, and it releases the lock on drop. There is no way to forget to unlock — but there are ways to hold the lock longer than intended.

  • Lock poisoning marks a mutex as potentially inconsistent when a thread panics while holding it. Most production code uses .unwrap() on locks, propagating the panic. Recover from a poisoned mutex only when you can correct the inconsistent state.

  • RwLock<T> allows concurrent reads and exclusive writes. It is appropriate when reads are dominant. It is not always faster than Mutex on write-heavy paths — profile before optimizing.

  • Three deadlock patterns cover most production incidents: lock ordering violations (acquire in inconsistent order across threads), re-entrant locking (acquiring a lock you already hold), and holding guards across blocking calls. Document lock acquisition order and minimize guard scope.

  • In async code, std::sync::Mutex::lock blocks the OS thread, which parks the async worker. Use tokio::sync::Mutex when the lock may be contended and the wait must yield to the executor. Never hold any MutexGuard across a slow .await.

  • Condvar is the correct primitive for blocking on a data condition (waiting for a non-empty queue, waiting for a flag). It atomically releases the mutex and parks the thread, avoiding busy-waiting.


Lesson 3 — Atomics and Memory Ordering: Acquire/Release/SeqCst in Practice

Module: Foundation — M02: Concurrency Primitives
Position: Lesson 3 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapters 2–3



Context

The Meridian control plane increments a frame counter every time a telemetry frame is received — 4,800 times per second at full uplink load across 48 satellites. The per-session heartbeat timer fires every 100ms. The frame drop rate is sampled by the monitoring dashboard every second. None of these operations need the overhead of a mutex lock. They need a single integer that multiple threads can read and write without data races.

This is the domain of atomics. std::sync::atomic provides integer and boolean types that support safe concurrent mutation without locking. The operations are indivisible — they either complete entirely or have not happened yet — which prevents the torn reads and non-atomic increments that would corrupt counters under concurrent access.

But atomics are not free. The memory ordering argument on every atomic operation — Relaxed, Acquire, Release, AcqRel, SeqCst — controls what guarantees the processor and compiler make about the ordering of operations across threads. Getting this wrong produces bugs that are invisible in development and intermittent in production.

Source: Rust Atomics and Locks, Chapters 2–3 (Bos)


Core Concepts

What Atomic Operations Guarantee

An atomic operation is indivisible: it either completes entirely before any other operation on the same variable, or it has not happened yet (Rust Atomics and Locks, Ch. 2). Two threads simultaneously performing counter += 1 on a plain integer is a data race — undefined behavior: the read-modify-write is three separate operations, and the interleaving is unpredictable. Two threads simultaneously calling counter.fetch_add(1, Relaxed) is defined and correct: each fetch_add is a single atomic step.
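
A quick demonstration that increments are never lost under contention — two threads hammering one counter:

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::thread;

static COUNTER: AtomicU64 = AtomicU64::new(0);

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..100_000 {
                    COUNTER.fetch_add(1, Relaxed); // one indivisible step
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    // Every increment lands: never a torn or lost update.
    assert_eq!(COUNTER.load(Relaxed), 200_000);
    println!("{}", COUNTER.load(Relaxed));
}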

The available types live in std::sync::atomic: AtomicBool, AtomicI8/AtomicU8 through AtomicI64/AtomicU64, AtomicIsize/AtomicUsize, and AtomicPtr<T>. All support mutation through a shared reference (&AtomicUsize) — interior mutability without RefCell-style runtime borrow checks.

Every atomic operation takes an Ordering argument. The ordering is not about the value — it is about the visibility of other memory operations to other threads.

Load, Store, and Fetch-and-Modify

The three basic operation families:

Load and store — read or write the atomic value:

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

static FRAME_COUNT: AtomicU64 = AtomicU64::new(0);

fn read_frame_count() -> u64 {
    FRAME_COUNT.load(Relaxed)      // atomic read
}

fn reset_frame_count() {
    FRAME_COUNT.store(0, Relaxed); // atomic write
}

Fetch-and-modify — atomically modify the value and return the previous value (Rust Atomics and Locks, Ch. 2):

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

fn main() {
    let counter = AtomicU64::new(100);
    let old = counter.fetch_add(23, Relaxed);
    assert_eq!(old, 100);          // returned the value before the add
    assert_eq!(counter.load(Relaxed), 123); // value after the add
}

The full set: fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor, fetch_max, fetch_min, and swap. Use these in preference to compare-and-exchange when the operation fits — they are simpler and the compiler can map them to a single hardware instruction.
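
For example, a high-water mark fits fetch_max directly — no compare-and-exchange loop needed:

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

fn main() {
    // High-water mark for the largest frame seen on this uplink.
    let max_frame_bytes = AtomicU64::new(0);

    for size in [512u64, 4096, 1024] {
        max_frame_bytes.fetch_max(size, Relaxed);
    }
    assert_eq!(max_frame_bytes.load(Relaxed), 4096);
}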

compare_exchange — The General Atomic Primitive

compare_exchange atomically checks whether the current value equals an expected value, and if so, replaces it with a new value. It returns the previous value on success, and the actual current value on failure (Rust Atomics and Locks, Ch. 2):

use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

fn increment_if_below(a: &AtomicU32, limit: u32) -> bool {
    let mut current = a.load(Relaxed);
    loop {
        if current >= limit { return false; }
        match a.compare_exchange(current, current + 1, Relaxed, Relaxed) {
            Ok(_) => return true,   // successfully incremented
            Err(v) => current = v,  // another thread changed it; retry
        }
    }
}

fn main() {
    let seq = AtomicU32::new(0);
    println!("{}", increment_if_below(&seq, 5)); // true
}

The loop-and-retry pattern is fundamental: load the current value, compute the desired new value without holding any lock, then swap atomically only if the value has not changed since the load. If it has changed, retry. This is a lock-free algorithm — no thread ever blocks, and a failed compare_exchange means some other thread's operation succeeded, so the system as a whole always makes progress.

compare_exchange_weak may spuriously fail (return Err even when the value matches) on some architectures. Use it in loops where spurious failure just triggers another iteration. Use the strong version when you need a guarantee that success or failure is definitive.
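
The same retry loop with the weak variant — a spurious failure just costs one extra iteration:

use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

fn increment_if_below_weak(a: &AtomicU32, limit: u32) -> bool {
    let mut current = a.load(Relaxed);
    loop {
        if current >= limit { return false; }
        match a.compare_exchange_weak(current, current + 1, Relaxed, Relaxed) {
            Ok(_) => return true,
            // Real conflict or spurious failure — either way, reload and retry.
            Err(v) => current = v,
        }
    }
}

fn main() {
    let seq = AtomicU32::new(0);
    assert!(increment_if_below_weak(&seq, 5));
    assert_eq!(seq.load(Relaxed), 1);
}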

The ABA problem: if a value changes from A to B and back to A between the load and the CAS, compare_exchange will succeed even though the value was modified. For simple counters and flags this is harmless; for pointer-based data structures it can be a correctness issue.

Memory Ordering — The Model

Processors and compilers reorder operations when it does not change single-threaded program behavior. In concurrent code, these reorderings can change observed behavior across threads. Memory ordering tells the compiler and processor what reorderings are permissible around a given atomic operation (Rust Atomics and Locks, Ch. 3).

Relaxed — no ordering guarantees beyond consistency on the single atomic variable. All threads see modifications of a given atomic in the same total order, but operations on different variables may be reordered arbitrarily. Use for statistics counters and progress indicators where you only care about the eventual value, not the timing relationship with other operations.

Release (stores) / Acquire (loads) — the most important pair. A release-store establishes a happens-before relationship with any subsequent acquire-load that reads the stored value:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering::{Acquire, Release, Relaxed}};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    thread::spawn(|| {
        DATA.store(12345, Relaxed);     // (1) write data
        READY.store(true, Release);     // (2) publish: everything before this is visible...
    });

    while !READY.load(Acquire) {        // (3) ...once this returns true.
        std::hint::spin_loop();
    }
    println!("{}", DATA.load(Relaxed)); // guaranteed to print 12345
}

Once the acquire-load at (3) sees true, the happens-before relationship guarantees that (1) is visible. Without the Acquire/Release pair — using Relaxed on both — the processor could see READY as true while DATA still holds 0.

The names come from the mutex pattern: a mutex unlock is a release-store; a mutex lock-acquire is an acquire-load. Everything the thread did before releasing the mutex is visible to the thread that acquires it next.

AcqRel — both Acquire and Release in a single operation. Used for read-modify-write operations (like fetch_add or compare_exchange) that must both see all prior releases and publish all prior stores.
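
A sketch of a hypothetical two-stage handoff: each stage advances a shared state word with an AcqRel compare_exchange, so the single read-modify-write both sees the previous stage's writes and publishes its own:

use std::sync::atomic::{AtomicU32, AtomicU64, Ordering::{AcqRel, Relaxed}};
use std::thread;

static STAGE: AtomicU32 = AtomicU32::new(0);
static SCRATCH: AtomicU64 = AtomicU64::new(0);

fn main() {
    let a = thread::spawn(|| {
        SCRATCH.store(10, Relaxed);                              // stage A's work
        STAGE.compare_exchange(0, 1, AcqRel, Relaxed).unwrap();  // publish it
    });
    let b = thread::spawn(|| {
        // Spin until stage A has advanced the state word.
        while STAGE.compare_exchange(1, 2, AcqRel, Relaxed).is_err() {
            std::hint::spin_loop();
        }
        // The acquire half of AcqRel guarantees stage A's write is visible here.
        SCRATCH.store(SCRATCH.load(Relaxed) + 5, Relaxed);
    });
    a.join().unwrap();
    b.join().unwrap();
    assert_eq!(SCRATCH.load(Relaxed), 15);
}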

SeqCst — sequentially consistent: the strongest ordering. All SeqCst operations across all threads form a single total order that every thread agrees on. This is stronger than Acquire/Release and is rarely needed. Use it when you have two threads each setting a flag and then reading the other's flag, and you need to guarantee that at least one thread sees the other's write (Rust Atomics and Locks, Ch. 3). In nearly all other cases, Acquire/Release is sufficient.
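
A minimal sketch of exactly that scenario — with SeqCst on all four operations, both loads returning false is impossible; with Acquire/Release it would be allowed:

use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
use std::thread;

static A: AtomicBool = AtomicBool::new(false);
static B: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        A.store(true, SeqCst);
        B.load(SeqCst) // did the other thread set its flag first?
    });
    let t2 = thread::spawn(|| {
        B.store(true, SeqCst);
        A.load(SeqCst)
    });
    let (saw_b, saw_a) = (t1.join().unwrap(), t2.join().unwrap());
    // The single total order over all four SeqCst operations guarantees at
    // least one store precedes the other thread's load.
    assert!(saw_b || saw_a);
}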

When to Reach for Atomics vs Mutex

Atomics are not a general replacement for mutexes. They are appropriate for:

  • Single-value counters and flags (frame counts, connection counts, shutdown flags)
  • Lock-free reference counting (the internal mechanism of Arc)
  • Progress indicators shared between threads
  • Single-producer/single-consumer patterns where acquire/release establishes the necessary ordering

Mutexes are appropriate for:

  • Protecting multi-field structs where all fields must be updated atomically
  • Any operation that requires a multi-step transaction
  • Data structures that cannot be represented as a single atomic value

Reaching for SeqCst everywhere is not safe by default — it has higher cost on some architectures (notably ARM) and the extra strength is rarely needed. Start with Acquire/Release. If your correctness argument requires a global total order across multiple atomics, then SeqCst is warranted.


Code Examples

Multi-Thread Frame Counter with Atomic Statistics

The telemetry pipeline tracks three counters: total frames received, total frames dropped (due to backpressure), and bytes processed. These are written by 48 uplink tasks and read by the monitoring dashboard. A mutex would serialize all 48 writes; atomics let them proceed in parallel.

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering::{Relaxed, Release, Acquire}};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

struct PipelineMetrics {
    frames_received: AtomicU64,
    frames_dropped: AtomicU64,
    bytes_processed: AtomicU64,
    // Shutdown flag: Release on write, Acquire on read.
    shutdown: AtomicBool,
}

impl PipelineMetrics {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            frames_received: AtomicU64::new(0),
            frames_dropped: AtomicU64::new(0),
            bytes_processed: AtomicU64::new(0),
            shutdown: AtomicBool::new(false),
        })
    }

    fn record_frame(&self, bytes: u64) {
        // Relaxed: these counters are for monitoring only.
        // The exact ordering relative to other threads' stores doesn't matter;
        // we only care about the eventual totals.
        self.frames_received.fetch_add(1, Relaxed);
        self.bytes_processed.fetch_add(bytes, Relaxed);
    }

    fn record_drop(&self) {
        self.frames_dropped.fetch_add(1, Relaxed);
    }

    fn signal_shutdown(&self) {
        // Release: ensures all frame counts written before this are visible
        // to any thread that reads shutdown with Acquire.
        self.shutdown.store(true, Release);
    }

    fn should_stop(&self) -> bool {
        // Acquire: establishes happens-before with the Release store above.
        // Any Relaxed loads on frames_received etc. after this call
        // will see all stores from before signal_shutdown().
        self.shutdown.load(Acquire)
    }

    fn snapshot(&self) -> (u64, u64, u64) {
        (
            self.frames_received.load(Relaxed),
            self.frames_dropped.load(Relaxed),
            self.bytes_processed.load(Relaxed),
        )
    }
}

fn main() {
    let metrics = PipelineMetrics::new();

    // Simulate 4 uplink tasks.
    let workers: Vec<_> = (0..4).map(|i| {
        let m = Arc::clone(&metrics);
        thread::spawn(move || {
            for _ in 0..100 {
                if m.should_stop() { break; }
                m.record_frame(1024);
                if i == 0 { m.record_drop(); } // simulate occasional drops on uplink 0
                thread::sleep(Duration::from_millis(1)); // pace the loop so shutdown is observable
            }
        })
    }).collect();

    // Monitoring thread samples every 5ms.
    let m = Arc::clone(&metrics);
    let monitor = thread::spawn(move || {
        for _ in 0..3 {
            thread::sleep(Duration::from_millis(5));
            let (recv, drop, bytes) = m.snapshot();
            println!("recv={recv} drop={drop} bytes={bytes}");
        }
        m.signal_shutdown();
    });

    for w in workers { w.join().unwrap(); }
    monitor.join().unwrap();
    let (recv, drop, bytes) = metrics.snapshot();
    println!("final: recv={recv} drop={drop} bytes={bytes}");
}

The Acquire/Release pair on the shutdown flag ensures that after any thread reads should_stop() as true, all Relaxed frame counts written before signal_shutdown() are visible. Without this pair, the monitoring thread could observe the shutdown flag as set but still see stale frame counts from before the shutdown writes.


Key Takeaways

  • Atomic operations are indivisible: a fetch_add on an AtomicU64 is a single step with no observable intermediate state. Plain integer += is not atomic — concurrent modification is undefined behavior.

  • fetch_add and friends return the value before the operation. This is intentional: it lets you use the old value to implement compare-and-swap patterns or sequence counters.

  • compare_exchange is the general-purpose lock-free primitive. The loop-and-retry pattern — load, compute, CAS, retry on failure — enables lock-free algorithms where no thread ever blocks.

  • Relaxed ordering gives only modification order on a single variable. It is correct for statistics counters and progress indicators where cross-variable ordering does not matter.

  • Acquire/Release establishes happens-before across threads. A release-store publishes all preceding memory operations; an acquire-load that reads that value sees all of them. This is what makes mutex unlock/lock, Arc drop/clone, and cross-thread data handoffs safe.

  • SeqCst provides a global total order across all SeqCst operations on all threads. Use it only when you need to coordinate two or more flags where the relative order matters globally. In practice, Acquire/Release covers the vast majority of use cases.


Project — Ground Station Command Queue

Module: Foundation — M02: Concurrency Primitives
Prerequisite: All three module quizzes passed (≥70%)



Mission Brief

TO: Platform Engineering
FROM: Mission Operations Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0044 — Priority Command Queue Implementation


The legacy Python command queue used a global dictionary with a threading lock. Over the past six months it has been involved in four production incidents: two deadlocks from re-entrant locking, one priority inversion where a low-priority housekeeping command blocked an emergency SAFE_MODE injection, and one data race when a monitoring process read the queue mid-write.

The replacement must be a typed, concurrent priority queue in Rust. It accepts mission-critical commands from multiple concurrent ground network connections, dispatches them in priority order to the session controller, and exposes lock-free metrics to the monitoring system — without the failure modes of the Python implementation.


System Specification

Command Model

Commands have a u8 priority (0 = lowest, 255 = highest). Predefined priorities:

Command type      Priority
SAFE_MODE         255
ABORT_PASS        200
REPOINT           100
STATUS_REQUEST     50
HOUSEKEEPING       10

A Command struct:

#[derive(Debug, Eq, PartialEq)]
pub struct Command {
    pub priority: u8,
    pub kind: CommandKind,
    pub issued_at: std::time::Instant,
}

#[derive(Debug, Eq, PartialEq)]
pub enum CommandKind {
    SafeMode,
    AbortPass,
    Repoint { azimuth: u32, elevation: u32 }, // integral degrees, so Eq can be derived
    StatusRequest,
    Housekeeping,
}

Queue Behaviour

  • Multiple producer threads (one per ground station connection) push commands concurrently.
  • One consumer thread (the session dispatcher) pops the highest-priority command. If multiple commands share the same priority, the oldest (by issued_at) is dispatched first.
  • When the queue is empty, the consumer blocks without busy-waiting.
  • The queue has a configurable capacity. If full, a push from a producer blocks until space is available. Blocking producers must not block the consumer.

Metrics

The following counters are maintained lock-free and available to the monitoring system without acquiring any lock:

  • commands_pushed — total commands ever pushed (all priorities)
  • commands_dispatched — total commands ever dispatched
  • safe_mode_count — number of SAFE_MODE commands dispatched (priority 255)

Shutdown

The queue accepts a shutdown signal. On shutdown:

  1. No new pushes are accepted — producers get an Err(QueueShutdown).
  2. The consumer drains any remaining commands in priority order.
  3. Once the queue is empty and shutdown is signalled, the consumer returns.

Expected Output

A library crate (meridian-cmdqueue) with:

  • A CommandQueue type with push, pop, and shutdown methods
  • An Arc<Metrics> accessible from the CommandQueue with the three lock-free counters
  • A binary that demonstrates: 3 producer threads pushing 5 commands each, 1 consumer thread dispatching them in priority order, a monitoring thread sampling metrics every 20ms, and shutdown after all producers finish

The output should clearly show commands being dispatched in priority order (not FIFO).


Acceptance Criteria

All seven criteria are verifiable:

  1. Commands dispatched in priority order (highest first, then oldest-first within priority).
     Verify: log output order.
  2. Consumer blocks without busy-waiting when queue is empty.
     Verify: no >5% CPU when idle (measure with top).
  3. Multiple concurrent producers do not cause data races.
     Verify: runs under cargo test with --test-threads=1 via loom or ThreadSanitizer.
  4. Metrics counters are readable without acquiring any queue lock.
     Verify: code review — metrics accessed via atomics only.
  5. Shutdown drains remaining commands before consumer exits.
     Verify: log shows all pushed commands dispatched before exit.
  6. Producer push blocks (does not drop commands) when queue is at capacity.
     Verify: test with capacity=2 and 10 concurrent pushes.
  7. No .unwrap() on Mutex::lock() without a comment on the invariant.
     Verify: code review.

Hints

Hint 1 — Implementing priority + FIFO ordering on BinaryHeap

BinaryHeap is a max-heap: the greatest element pops first. To get "highest priority first, then oldest first within the same priority," implement Ord on Command so that a greater Command means dispatched sooner — compare priority naturally (higher wins), then reverse the issued_at comparison (older wins within a priority):

use std::cmp::Ordering;
use std::time::Instant;

struct Command { priority: u8, issued_at: Instant }

impl Ord for Command {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority.cmp(&other.priority)
            .then_with(|| other.issued_at.cmp(&self.issued_at)) // older = higher
    }
}
impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl PartialEq for Command {
    fn eq(&self, other: &Self) -> bool { self.priority == other.priority && self.issued_at == other.issued_at }
}
impl Eq for Command {}

Hint 2 — Blocking push with capacity using Mutex + two Condvars

Two Condvars: one signals "not full" (wake a blocked producer), one signals "not empty" (wake the consumer):

use std::sync::{Mutex, Condvar};
use std::collections::BinaryHeap;

struct QueueInner<T> {
    heap: BinaryHeap<T>,
    capacity: usize,
    shutdown: bool,
}

struct CommandQueue<T> {
    inner: Mutex<QueueInner<T>>,
    not_empty: Condvar,
    not_full: Condvar,
}

Push blocks on not_full when the heap is at capacity; pop blocks on not_empty when the heap is empty. Each operation notifies the other condvar after completing.

Hint 3 — Lock-free metrics with atomics

Counters increment in push and pop, which both hold the mutex. But the monitoring thread must read without the mutex. Use atomics for all three counters — write from inside the locked section (ordering is Relaxed since the mutex provides the actual happens-before relationship), read from the monitoring thread with Relaxed:

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::Arc;

pub struct Metrics {
    pub commands_pushed: AtomicU64,
    pub commands_dispatched: AtomicU64,
    pub safe_mode_count: AtomicU64,
}

impl Metrics {
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            commands_pushed: AtomicU64::new(0),
            commands_dispatched: AtomicU64::new(0),
            safe_mode_count: AtomicU64::new(0),
        })
    }
}

Hint 4 — Shutdown drain sequence

Set shutdown = true in the inner state while holding the mutex, then notify_all() on both condvars. In push, check shutdown after acquiring the lock and return Err if set. In pop, check shutdown && heap.is_empty() — if both are true, return None to signal the consumer to exit:

// In pop:
let mut inner = self.inner.lock().unwrap();
loop {
    if let Some(cmd) = inner.heap.pop() {
        self.not_full.notify_one();
        return Some(cmd);
    }
    if inner.shutdown {
        return None; // Queue is empty and shutdown — consumer exits
    }
    inner = self.not_empty.wait(inner).unwrap();
}

Reference Implementation

// src/lib.rs
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::{Arc, Condvar, Mutex};
use std::time::Instant;

#[derive(Debug)]
pub struct QueueShutdown;

impl std::fmt::Display for QueueShutdown {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "command queue is shut down")
    }
}

#[derive(Debug, Eq, PartialEq)]
pub enum CommandKind {
    SafeMode,
    AbortPass,
    Repoint { azimuth: u32, elevation: u32 },
    StatusRequest,
    Housekeeping,
}

#[derive(Debug)]
pub struct Command {
    pub priority: u8,
    pub kind: CommandKind,
    pub issued_at: Instant,
}

impl Ord for Command {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            // Within the same priority, older commands go first.
            .then_with(|| other.issued_at.cmp(&self.issued_at))
    }
}
impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
// Keep PartialEq consistent with the manual Ord: equal iff cmp says Equal.
// A derived PartialEq would also compare `kind`, disagreeing with Ord.
impl PartialEq for Command {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Command {}

pub struct Metrics {
    pub commands_pushed: AtomicU64,
    pub commands_dispatched: AtomicU64,
    pub safe_mode_count: AtomicU64,
}

impl Metrics {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            commands_pushed: AtomicU64::new(0),
            commands_dispatched: AtomicU64::new(0),
            safe_mode_count: AtomicU64::new(0),
        })
    }
}

struct Inner {
    heap: BinaryHeap<Command>,
    capacity: usize,
    shutdown: bool,
}

pub struct CommandQueue {
    inner: Mutex<Inner>,
    not_empty: Condvar,
    not_full: Condvar,
    pub metrics: Arc<Metrics>,
}

impl CommandQueue {
    pub fn new(capacity: usize) -> Arc<Self> {
        Arc::new(Self {
            inner: Mutex::new(Inner {
                heap: BinaryHeap::with_capacity(capacity),
                capacity,
                shutdown: false,
            }),
            not_empty: Condvar::new(),
            not_full: Condvar::new(),
            metrics: Metrics::new(),
        })
    }

    pub fn push(&self, cmd: Command) -> Result<(), QueueShutdown> {
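        // Invariant: no code path panics while holding this lock, so it is never poisoned.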
        let mut inner = self.inner.lock().unwrap();
        loop {
            if inner.shutdown {
                return Err(QueueShutdown);
            }
            if inner.heap.len() < inner.capacity {
                let is_safe_mode = matches!(cmd.kind, CommandKind::SafeMode);
                inner.heap.push(cmd);
                // Relaxed: the mutex provides the happens-before. These are
                // statistics only — no cross-variable ordering needed.
                self.metrics.commands_pushed.fetch_add(1, Relaxed);
                if is_safe_mode {
                    self.metrics.safe_mode_count.fetch_add(1, Relaxed);
                }
                self.not_empty.notify_one();
                return Ok(());
            }
            // Queue full — block until space opens or shutdown.
            inner = self.not_full.wait(inner).unwrap();
        }
    }

    pub fn pop(&self) -> Option<Command> {
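        // Invariant: no code path panics while holding this lock, so it is never poisoned.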
        let mut inner = self.inner.lock().unwrap();
        loop {
            if let Some(cmd) = inner.heap.pop() {
                self.metrics.commands_dispatched.fetch_add(1, Relaxed);
                self.not_full.notify_one();
                return Some(cmd);
            }
            if inner.shutdown {
                return None;
            }
            inner = self.not_empty.wait(inner).unwrap();
        }
    }

    pub fn shutdown(&self) {
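        // Invariant: no code path panics while holding this lock, so it is never poisoned.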
        let mut inner = self.inner.lock().unwrap();
        inner.shutdown = true;
        // Wake all blocked producers and the consumer.
        self.not_empty.notify_all();
        self.not_full.notify_all();
    }
}

// src/main.rs (demonstration binary)
use meridian_cmdqueue::{Command, CommandKind, CommandQueue};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

fn main() {

    let queue = CommandQueue::new(20);
    let metrics = Arc::clone(&queue.metrics);

    // Three producer threads — simulate ground network connections.
    let producers: Vec<_> = (0..3u8).map(|gs| {
        let q = Arc::clone(&queue);
        thread::spawn(move || {
            let priorities = [255u8, 200, 100, 50, 10];
            for &priority in &priorities {
                thread::sleep(Duration::from_millis(gs as u64 * 5));
                let kind = match priority {
                    255 => CommandKind::SafeMode,
                    200 => CommandKind::AbortPass,
                    100 => CommandKind::Repoint { azimuth: 180, elevation: 45 },
                    50  => CommandKind::StatusRequest,
                    _   => CommandKind::Housekeeping,
                };
                match q.push(Command { priority, kind, issued_at: Instant::now() }) {
                    Ok(()) => println!("gs-{gs}: pushed priority {priority}"),
                    Err(e) => println!("gs-{gs}: push rejected — {e}"),
                }
            }
        })
    }).collect();

    // Consumer thread — session dispatcher.
    let q = Arc::clone(&queue);
    let consumer = thread::spawn(move || {
        while let Some(cmd) = q.pop() {
            println!("dispatcher: {:?} (priority {})", cmd.kind, cmd.priority);
            thread::sleep(Duration::from_millis(10));
        }
        println!("dispatcher: queue drained, exiting");
    });

    // Monitoring thread.
    let monitor = thread::spawn(move || {
        for _ in 0..4 {
            thread::sleep(Duration::from_millis(20));
            println!(
                "metrics: pushed={} dispatched={} safe_mode={}",
                metrics.commands_pushed.load(std::sync::atomic::Ordering::Relaxed),
                metrics.commands_dispatched.load(std::sync::atomic::Ordering::Relaxed),
                metrics.safe_mode_count.load(std::sync::atomic::Ordering::Relaxed),
            );
        }
    });

    for p in producers { p.join().unwrap(); }
    queue.shutdown();

    consumer.join().unwrap();
    monitor.join().unwrap();
}

Reflection

The command queue built here uses all three concurrency layers from this module: OS threads for the producer and consumer, Mutex + Condvar for blocking coordination, and atomics for the metrics that must be readable without acquiring any lock. The relationship between these layers — the mutex providing the happens-before for the atomic writes, the condvar providing the non-busy-waiting block, the atomics avoiding any lock on the read path — is the pattern used throughout the Meridian control plane.

The natural next question: the blocking push is correct but puts an upper bound on producer throughput. In Module 3, this queue is extended with a tokio::sync::mpsc front-end that moves the backpressure into async channel semantics rather than blocking OS threads.

Module 03 — Message Passing Patterns

Track: Foundation — Mission Control Platform
Position: Module 3 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 3, 6, 7, 8
Quiz pass threshold: 70% on all three lessons to unlock the project



Mission Context

Module 2 built shared-state concurrency: Mutex, RwLock, atomics. Those primitives protect data that multiple actors need to touch. This module takes the complementary approach: instead of sharing data, pass ownership through channels. Producers and consumers are decoupled — each owns its state exclusively, communicating only through typed messages.

For the Meridian control plane, message passing is the primary architecture for the telemetry pipeline. 48 satellite uplinks funnel frames into a priority-ordered aggregator. TLE catalog updates fan out to every active session simultaneously. The shutdown signal propagates to all tasks through a watched value. None of these require shared mutable state — they compose entirely from channel primitives.


What You Will Learn

By the end of this module you will be able to:

  • Create bounded mpsc channels, size them for backpressure, clone senders for multiple concurrent producers, and design consumer loops that terminate cleanly when all senders drop
  • Implement the actor pattern: an async task that owns its state exclusively and exposes all operations as messages, using oneshot channels for request-response within the message protocol
  • Distribute events to all subscribers using broadcast, handle RecvError::Lagged correctly, and size the broadcast capacity for the slowest realistic consumer
  • Distribute current state to many readers using watch, understand the difference between event distribution and state distribution, and apply Arc<T> inside a watch for cheap config reads
  • Merge multiple independent async sources into one stream using shared-sender MPSC (uniform sources), select! { biased; } (priority sources), and a router actor (dynamic sources)
  • Choose between mpsc, broadcast, watch, and oneshot given a fan-in or fan-out requirement

Lessons

Lesson 1 — tokio::mpsc: Bounded Channels, Backpressure, and Sender Cloning

Covers mpsc::channel(capacity), Sender::clone for multiple producers, send().await as the backpressure mechanism, try_send for non-blocking producers, the consumer loop termination on sender drop, oneshot for request-response, and the actor pattern as the structural idiom that emerges from MPSC channels.

Key question this lesson answers: How do you safely move work between concurrent async tasks without shared state, and what ensures slow consumers are not overwhelmed by fast producers?

lesson-01-mpsc.md / lesson-01-quiz.toml


Lesson 2 — Broadcast and Watch Channels: Fan-Out Patterns

Covers broadcast::channel(capacity) for event fan-out (every subscriber gets every message), RecvError::Lagged handling, watch::channel(initial) for state fan-out (latest value, change notification), borrow() for lock-free reads, and the decision matrix for choosing between mpsc, broadcast, and watch.

Key question this lesson answers: How do you distribute one event or one value to many concurrent tasks, and when does missing an intermediate update matter?

lesson-02-broadcast-watch.md / lesson-02-quiz.toml


Lesson 3 — Fan-In Aggregation: Merging Streams from Multiple Satellite Feeds

Covers shared-sender MPSC for uniform fan-in, select! { biased; } for priority fan-in with two priority levels, message tagging with typed source identifiers, and the router actor for dynamic fan-in (sources registered and removed at runtime).

Key question this lesson answers: How do you merge many independent async sources into one stream with control over priority, fairness, and dynamic source registration?

lesson-03-fan-in.md / lesson-03-quiz.toml


Capstone Project — Multi-Source Telemetry Aggregator

Build the full telemetry aggregation pipeline: a router actor with dynamic source registration, a priority fan-in that ensures emergency frames are never delayed behind routine telemetry, a bounded frame processor with backpressure, a broadcast fan-out to downstream consumers, atomic pipeline statistics exposed through a watch channel, and a clean shutdown sequence.

Acceptance is against 7 verifiable criteria including emergency frame priority, dynamic source registration, backpressure enforcement, lossless shutdown drain, and lagged broadcast handling.

project-telemetry-aggregator.md


Prerequisites

Modules 1 and 2 must be complete. Module 1 established how async tasks are scheduled and why they cooperatively yield — essential for understanding why bounded channel backpressure works without blocking threads. Module 2 established the shared-state model that message passing replaces — understanding both models is necessary to choose the right one for a given problem.

What Comes Next

Module 4 — Network Programming connects the message-passing pipeline to the network. The telemetry aggregator from this module gains a TCP listener front-end, turning the router actor into a full ground station connection broker that accepts connections from the 12 Meridian ground station sites.

Lesson 1 — tokio::mpsc: Bounded Channels, Backpressure, and Sender Cloning

Module: Foundation — M03: Message Passing Patterns
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapter 8



Context

Module 2's command queue used a Mutex<BinaryHeap> plus two Condvars to share state between threads. That approach works, but it couples the producers and consumer through a shared data structure — every access requires acquiring the same lock, and the consumer must hold the lock while inspecting queue contents. Under contention at 48-uplink load, that lock becomes a bottleneck.

The alternative model is message passing: producers send values into a channel; the consumer receives from it. There is no shared data structure, no explicit locking, and no Arc to pass around. The channel itself manages all synchronization. The backpressure mechanism is built in: when the channel is full, send yields rather than blocking a thread, and the async runtime can schedule other work while the producer waits.

This lesson covers tokio::sync::mpsc — the multi-producer, single-consumer channel that is the workhorse of most async Rust systems. It also covers oneshot for request-response patterns and introduces the actor model as the structural pattern that emerges naturally from combining channels with task ownership.

Source: Async Rust, Chapter 8 (Flitton & Morton)


Core Concepts

MPSC Channels: The Model

tokio::sync::mpsc::channel(capacity) creates a bounded channel and returns a (Sender<T>, Receiver<T>) pair. The capacity is the maximum number of messages that can sit in the channel before senders must wait:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity of 32: up to 32 messages can be buffered.
    // If the receiver falls behind, the 33rd send will yield.
    let (tx, mut rx) = mpsc::channel::<String>(32);

    tokio::spawn(async move {
        tx.send("frame-001".to_string()).await.unwrap();
    });

    while let Some(msg) = rx.recv().await {
        println!("received: {msg}");
    }
}

recv() returns None when all Sender handles have been dropped — this is the clean shutdown signal for a consumer loop. No explicit close call is needed; the channel closes naturally when the last sender drops.

Sender is Clone: Multiple Producers

Sender<T> implements Clone. Each clone is an independent handle to the same channel. This is the "multi-producer" part of MPSC — any number of tasks can hold a Sender and push messages concurrently. The receiver sees messages from all senders interleaved, in the order they are delivered to the channel.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u32, Vec<u8>)>(64);

    // Each uplink session gets its own cloned sender.
    for satellite_id in 0..4u32 {
        let tx = tx.clone();
        tokio::spawn(async move {
            for seq in 0u8..3 {
                let frame = vec![satellite_id as u8, seq];
                // Yields if channel is full — backpressure in action.
                tx.send((satellite_id, frame)).await
                    .expect("aggregator task dropped");
            }
        });
    }

    // Drop the original sender so the channel closes when all
    // spawned tasks finish. Without this drop, rx.recv() never
    // returns None — the original sender keeps the channel alive.
    drop(tx);

    while let Some((sat, frame)) = rx.recv().await {
        println!("sat {sat}: {:?}", frame);
    }
}

The drop of the original tx after spawning is important and easy to forget. If any Sender clone outlives its usefulness, the channel stays open and the consumer loop blocks forever. The idiomatic pattern is to clone before spawning and drop the original.

Backpressure and Capacity Sizing

A bounded channel applies backpressure: when the channel reaches capacity, send().await yields and does not return until the consumer has drained a slot. This is the async equivalent of a blocking queue — it prevents fast producers from overwhelming a slow consumer.

try_send is the non-blocking variant. It returns Err(TrySendError::Full(_)) immediately if the channel is full rather than yielding. Use it when the producer should take an alternative action (log, drop, route to overflow) rather than applying backpressure:

use tokio::sync::mpsc;

async fn forward_or_drop(tx: &mpsc::Sender<Vec<u8>>, frame: Vec<u8>) {
    match tx.try_send(frame) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(frame)) => {
            // Aggregator is falling behind — record the drop and continue.
            // In production: increment a metrics counter here.
            tracing::warn!(bytes = frame.len(), "frame dropped: aggregator full");
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            tracing::error!("aggregator task has exited");
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(8);
    // Demonstrate try_send behaviour
    for i in 0u8..12 {
        forward_or_drop(&tx, vec![i]).await;
    }
    drop(tx);
    let mut count = 0;
    while rx.recv().await.is_some() { count += 1; }
    println!("received {count} frames (8 max due to capacity)");
}

Capacity sizing: too small causes unnecessary producer backpressure; too large hides a slow consumer until the buffer is exhausted. For the Meridian aggregator, a capacity of 2–4× the expected burst size is a reasonable starting point. Profile under realistic load.

unbounded_channel() provides no capacity limit — senders never yield. Use it only when backpressure is handled at an outer layer and unbounded buffering is acceptable (e.g., a metrics sink that can absorb any burst). Unbounded channels can cause OOM if the consumer is slower than the producers.
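
A minimal sketch of the difference: on an unbounded channel, send is synchronous and never waits, so nothing pushes back on the producer:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<u64>();
    for i in 0..1_000u64 {
        tx.send(i).unwrap(); // no .await — the buffer just grows
    }
    drop(tx);
    let mut n = 0u32;
    while rx.recv().await.is_some() { n += 1; }
    println!("buffered all {n} messages with no backpressure");
}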

oneshot: Request-Response

tokio::sync::oneshot is a single-message channel: exactly one send, exactly one receive. It is the correct primitive for request-response patterns, where a task sends a request and needs to await the result:

use tokio::sync::{mpsc, oneshot};

enum ControlMsg {
    GetQueueDepth { reply: oneshot::Sender<usize> },
    Flush,
}

async fn aggregator(mut rx: mpsc::Receiver<ControlMsg>) {
    let mut queue: Vec<Vec<u8>> = Vec::new();
    while let Some(msg) = rx.recv().await {
        match msg {
            ControlMsg::GetQueueDepth { reply } => {
                // reply.send consumes the sender — can only respond once.
                let _ = reply.send(queue.len());
            }
            ControlMsg::Flush => {
                println!("flushing {} frames", queue.len());
                queue.clear();
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<ControlMsg>(8);
    tokio::spawn(aggregator(rx));

    // Ask the aggregator for its current queue depth.
    let (reply_tx, reply_rx) = oneshot::channel::<usize>();
    tx.send(ControlMsg::GetQueueDepth { reply: reply_tx }).await.unwrap();
    let depth = reply_rx.await.unwrap();
    println!("aggregator queue depth: {depth}");
}

The oneshot::Sender is embedded in the message itself. When the aggregator handles the message, it sends back through the oneshot and the caller's reply_rx.await resolves. This pattern — sometimes called the "mailbox" or "actor" pattern — eliminates the need for any shared state between the caller and the aggregator.

The Actor Pattern

An actor is an async task that owns its state exclusively and exposes its functionality entirely through message passing (Async Rust, Ch. 8). No locks, no shared Arc, no exposed fields. Every operation on the actor's state happens sequentially within the actor's message loop — concurrent safety is structural, not from locking.

The advantages: the actor's state is never accessed concurrently. There are no data races by construction. Testing is straightforward — send messages, check responses. Adding operations means adding enum variants, not adding lock guards.

The tradeoffs: all operations are async (each call involves a channel send and an await). If many callers need responses simultaneously, the actor is a serialization point. If the actor's work is CPU-intensive, it blocks its own message loop. Both are solvable — the first with multiple actors, the second with spawn_blocking inside the loop — but they require deliberate design.


Code Examples

Telemetry Frame Aggregator Actor

The aggregator is an actor: it owns the frame buffer exclusively, receives frames and control messages through a single channel, and responds to queries via embedded oneshot channels. No locks anywhere.

use tokio::sync::{mpsc, oneshot};
use std::collections::VecDeque;

const MAX_BUFFER: usize = 1000;

#[derive(Debug)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence: u64,
    payload: Vec<u8>,
}

enum AggregatorMsg {
    /// A new frame from an uplink session.
    Frame(TelemetryFrame),
    /// Request: how many frames are buffered?
    Depth { reply: oneshot::Sender<usize> },
    /// Drain the buffer and return all frames.
    Drain { reply: oneshot::Sender<Vec<TelemetryFrame>> },
}

async fn run_aggregator(mut rx: mpsc::Receiver<AggregatorMsg>) {
    let mut buffer: VecDeque<TelemetryFrame> = VecDeque::with_capacity(MAX_BUFFER);

    while let Some(msg) = rx.recv().await {
        match msg {
            AggregatorMsg::Frame(frame) => {
                if buffer.len() >= MAX_BUFFER {
                    tracing::warn!(
                        satellite_id = frame.satellite_id,
                        "buffer full — dropping oldest frame"
                    );
                    buffer.pop_front();
                }
                buffer.push_back(frame);
            }
            AggregatorMsg::Depth { reply } => {
                let _ = reply.send(buffer.len());
            }
            AggregatorMsg::Drain { reply } => {
                let frames: Vec<_> = buffer.drain(..).collect();
                let _ = reply.send(frames);
            }
        }
    }
    tracing::info!("aggregator: all senders dropped, shutting down");
}

/// A typed handle to the aggregator actor.
/// Hides the channel internals from callers.
#[derive(Clone)]
struct AggregatorHandle {
    tx: mpsc::Sender<AggregatorMsg>,
}

impl AggregatorHandle {
    fn spawn(capacity: usize) -> Self {
        let (tx, rx) = mpsc::channel(capacity);
        tokio::spawn(run_aggregator(rx));
        Self { tx }
    }

    async fn send_frame(&self, frame: TelemetryFrame) -> anyhow::Result<()> {
        self.tx.send(AggregatorMsg::Frame(frame)).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))
    }

    async fn depth(&self) -> anyhow::Result<usize> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(AggregatorMsg::Depth { reply: reply_tx }).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))?;
        reply_rx.await.map_err(|_| anyhow::anyhow!("aggregator dropped reply"))
    }

    async fn drain(&self) -> anyhow::Result<Vec<TelemetryFrame>> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(AggregatorMsg::Drain { reply: reply_tx }).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))?;
        reply_rx.await.map_err(|_| anyhow::anyhow!("aggregator dropped reply"))
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();

    let agg = AggregatorHandle::spawn(128);

    // Simulate 4 concurrent uplink sessions each sending 3 frames.
    let tasks: Vec<_> = (0..4u32).map(|sat_id| {
        let agg = agg.clone();
        tokio::spawn(async move {
            for seq in 0u64..3 {
                agg.send_frame(TelemetryFrame {
                    satellite_id: sat_id,
                    sequence: seq,
                    payload: vec![sat_id as u8; 64],
                }).await.unwrap();
            }
        })
    }).collect();

    for t in tasks { t.await.unwrap(); }

    println!("buffered: {}", agg.depth().await?);
    let frames = agg.drain().await?;
    println!("drained {} frames", frames.len());
    Ok(())
}

The AggregatorHandle is the public API. Callers see send_frame, depth, and drain — they never interact with the channel directly. The handle is Clone, so it can be shared freely across tasks by cloning, with no Arc<Mutex<...>> needed.


Key Takeaways

  • tokio::sync::mpsc::channel(capacity) creates a bounded channel. The capacity is the backpressure valve: send().await yields when the channel is full, preventing fast producers from overwhelming slow consumers.

  • Sender<T> is Clone. Every clone is an independent producer on the same channel. The channel closes when all senders drop. Always drop the original sender after spawning cloned senders, or the consumer loop will block forever.

  • try_send is the non-blocking variant. Use it when the producer should take an alternative action — drop, log, route to overflow — rather than yielding. Prefer send().await when backpressure is the correct response.

  • oneshot is the single-message channel for request-response patterns. Embed the oneshot::Sender in the message to allow the receiver to reply exactly once. The Sender is consumed on send — using it more than once is a compile error.

  • The actor pattern — an async task that owns its state exclusively and receives all operations as messages — eliminates shared state and all associated locking. It is the structural pattern that emerges naturally from MPSC channels in async systems.


Lesson 2 — Broadcast and Watch Channels: Fan-Out Patterns for Telemetry Distribution

Module: Foundation — M03: Message Passing Patterns
Position: Lesson 2 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 6, 7



Context

MPSC channels move work from many producers to one consumer. The inverse problem is fan-out: distributing one event to many consumers. The Meridian control plane has two distinct fan-out requirements that call for different solutions.

The first: when a TLE catalog update arrives, every active uplink session needs to process it. Each session must see the update — no session should receive it twice, and no session should miss it. This is an event-distribution problem.

The second: the shutdown flag. Every task in the control plane needs to know when the system is shutting down, but they do not need to receive a separate "shutdown event" — they just need to be able to check the current value at any time. This is a state-distribution problem.

Tokio provides a dedicated primitive for each. broadcast solves event distribution: every subscriber receives every message. watch solves state distribution: subscribers observe the latest value and are notified when it changes.

Source: Async Rust, Chapters 6–7 (Flitton & Morton)


Core Concepts

tokio::sync::broadcast — Every Subscriber Gets Every Message

broadcast::channel(capacity) returns a (Sender<T>, Receiver<T>) pair. Additional receivers are created by calling sender.subscribe() — each receiver gets its own position in the channel and receives every message sent after the subscription point.

use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (tx, _rx) = broadcast::channel::<String>(16);

    // Each session gets its own receiver.
    let mut session_a = tx.subscribe();
    let mut session_b = tx.subscribe();

    tx.send("TLE-UPDATE-2024-001".to_string()).unwrap();

    // Both sessions receive the same message independently.
    println!("A: {}", session_a.recv().await.unwrap());
    println!("B: {}", session_b.recv().await.unwrap());
}

Sender::send does not require await — it is synchronous. Messages are placed in a ring buffer; receivers read from their own position in that buffer.

The Lagged Error and What to Do With It

The broadcast channel has a fixed capacity ring buffer. If a slow receiver falls behind by more than capacity messages, it loses its position in the buffer. The next recv() call returns Err(RecvError::Lagged(n)), where n is the number of messages missed.

This is not a fatal error. The receiver continues to work — it simply missed n messages and will receive all subsequent ones. Whether missing messages is acceptable depends on the use case. For TLE catalog updates, a session that missed 3 updates can request a fresh fetch. For an audit log, missing messages is a compliance issue.

use tokio::sync::broadcast;

async fn session_loop(mut rx: broadcast::Receiver<Vec<u8>>) {
    loop {
        match rx.recv().await {
            Ok(frame) => {
                // Normal path.
                process_update(frame).await;
            }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                // Receiver fell behind — n messages were lost from this receiver's view.
                // Log the gap and continue; the next recv will succeed.
                tracing::warn!(missed = n, "session fell behind broadcast — requesting resync");
                request_catalog_resync().await;
            }
            Err(broadcast::error::RecvError::Closed) => {
                // All senders dropped — broadcast channel is done.
                tracing::info!("broadcast channel closed, session exiting");
                break;
            }
        }
    }
}

async fn process_update(_frame: Vec<u8>) {}
async fn request_catalog_resync() {}

Capacity sizing for broadcast is more sensitive than for MPSC. The slowest subscriber determines whether lagging occurs. If subscribers have variable processing speeds, size the capacity to accommodate the slowest realistic consumer under load, plus a safety margin.
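
A minimal sketch of lagging in isolation — capacity 2, four sends, and a receiver that never kept up:

use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = broadcast::channel::<u32>(2);
    for i in 0..4 {
        tx.send(i).unwrap();
    }
    // The ring buffer retains only the last 2 messages; the first recv
    // reports how many this receiver missed.
    match rx.recv().await {
        Err(broadcast::error::RecvError::Lagged(n)) => println!("missed {n}"), // missed 2
        other => println!("unexpected: {other:?}"),
    }
    // After the lag report, reception resumes at the oldest retained message.
    assert_eq!(rx.recv().await.unwrap(), 2);
}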

tokio::sync::watch — Latest Value, Change Notification

watch::channel(initial_value) creates a single-value channel: the sender can update the value at any time, and receivers are notified when it changes. Receivers always see the latest value; intermediate values may be missed if the sender updates faster than the receiver reads.

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel::<bool>(false);

    // Clone the receiver for multiple tasks.
    let mut rx2 = rx.clone();

    tokio::spawn(async move {
        // Wait for the value to change.
        rx2.changed().await.unwrap();
        println!("shutdown signal received");
    });

    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tx.send(true).unwrap();
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
}

watch::Receiver::borrow() returns the current value without waiting. changed().await waits for the next change and then lets you borrow() the new value. This is the pattern for config reloading: tasks watch for a config change, then read the new config with borrow().

watch is the correct primitive for the Meridian shutdown flag — much better than a broadcast channel. The shutdown event needs to be observed once by each task, and latecomers (tasks that check the flag after shutdown is signalled) need to see true immediately. A broadcast receiver created after the shutdown send would miss the message. A watch receiver always sees the current state.
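
A minimal sketch of that latecomer property — a receiver created after the signal still observes the current value immediately:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, _rx) = watch::channel(false);
    tx.send(true).unwrap(); // shutdown signalled

    // Subscribing after the send: a broadcast receiver would have missed the
    // message entirely; a watch receiver sees the current state at once.
    let late = tx.subscribe();
    assert!(*late.borrow());
    println!("late subscriber observes shutdown = {}", *late.borrow());
}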

Choosing Between mpsc, broadcast, and watch

Pattern                                              Channel    Use when
Work queue: one item consumed once                   mpsc       48 sessions each send frames to one aggregator
Event broadcast: every subscriber gets every event   broadcast  TLE update delivered to all active sessions
State sync: subscribers need the latest value        watch      Shutdown flag, config updates, current orbital state
One-shot reply                                       oneshot    Request-response within an actor message
The key question: does each message need to be consumed exactly once (mpsc), by every subscriber (broadcast), or is only the latest value relevant (watch)?

watch for Configuration Distribution

A common pattern in the Meridian control plane: runtime configuration loaded at startup and potentially reloaded via a management API. All tasks need to read the current config, and they need to be notified when it changes:

use tokio::sync::watch;
use std::sync::Arc;

#[derive(Clone, Debug)]
struct ControlPlaneConfig {
    max_frame_size: usize,
    session_timeout_secs: u64,
}

async fn uplink_session(
    satellite_id: u32,
    mut config_rx: watch::Receiver<Arc<ControlPlaneConfig>>,
) {
    loop {
        // Read current config — no lock, no await.
        let config = config_rx.borrow().clone();

        tokio::select! {
            // Process frames using current config.
            _ = tokio::time::sleep(
                tokio::time::Duration::from_secs(config.session_timeout_secs)
            ) => {
                tracing::warn!(satellite_id, "session timeout");
                break;
            }
            // React to config changes mid-session.
            Ok(()) = config_rx.changed() => {
                let new_config = config_rx.borrow().clone();
                tracing::info!(
                    satellite_id,
                    max_frame = new_config.max_frame_size,
                    "config reloaded"
                );
                // Loop continues with new config.
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let initial = Arc::new(ControlPlaneConfig {
        max_frame_size: 65536,
        session_timeout_secs: 600,
    });

    let (config_tx, config_rx) = watch::channel(Arc::clone(&initial));

    // Spawn a few sessions.
    for sat_id in 0..3u32 {
        let rx = config_rx.clone();
        tokio::spawn(uplink_session(sat_id, rx));
    }

    // Simulate a config reload.
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
    config_tx.send(Arc::new(ControlPlaneConfig {
        max_frame_size: 32768,
        session_timeout_secs: 300,
    })).unwrap();

    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
}

Arc<Config> avoids cloning the full config struct on every borrow(). The Arc::clone is cheap (one atomic increment); the config data is shared read-only across tasks.


Code Examples

TLE Catalog Update Broadcaster

When the orbit data pipeline ingests a new TLE batch, it publishes the update over a broadcast channel. Every active session task receives the update and can refresh its orbital prediction model.

use tokio::sync::broadcast;
use std::sync::Arc;

#[derive(Clone, Debug)]
struct TleUpdate {
    batch_id: u32,
    records: Arc<Vec<String>>,
}

async fn session_task(
    satellite_id: u32,
    mut tle_rx: broadcast::Receiver<TleUpdate>,
    shutdown_rx: tokio::sync::watch::Receiver<bool>,
) {
    let mut shutdown = shutdown_rx.clone();
    loop {
        tokio::select! {
            result = tle_rx.recv() => {
                match result {
                    Ok(update) => {
                        tracing::info!(
                            satellite_id,
                            batch = update.batch_id,
                            records = update.records.len(),
                            "TLE update applied"
                        );
                    }
                    Err(broadcast::error::RecvError::Lagged(n)) => {
                        tracing::warn!(satellite_id, missed = n, "TLE lag — resyncing");
                    }
                    Err(broadcast::error::RecvError::Closed) => break,
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
        }
    }
    tracing::info!(satellite_id, "session task exiting");
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    let (tle_tx, _) = broadcast::channel::<TleUpdate>(32);
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);

    // Spawn 4 sessions, each with its own broadcast receiver.
    for sat_id in 0..4u32 {
        let tle_rx = tle_tx.subscribe();
        let sd = shutdown_rx.clone();
        tokio::spawn(session_task(sat_id, tle_rx, sd));
    }

    // Publish a TLE update to all sessions.
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tle_tx.send(TleUpdate {
        batch_id: 42,
        records: Arc::new(vec!["1 25544U...".to_string(); 100]),
    }).unwrap();

    // Trigger shutdown.
    tokio::time::sleep(tokio::time::Duration::from_millis(20)).await;
    shutdown_tx.send(true).unwrap();
    tokio::time::sleep(tokio::time::Duration::from_millis(20)).await;
}

The combination of broadcast for events and watch for state is idiomatic Tokio. The broadcast channel delivers the catalog update to every session independently; the watch channel distributes the shutdown signal to all tasks simultaneously. The select! in the session loop races the two — whichever fires first wins.


Key Takeaways

  • broadcast::channel(capacity) distributes every message to every subscriber. Subscribers receive from their own position in a ring buffer. Additional receivers are created with sender.subscribe() — and a receiver created after a message is sent does not receive that message retroactively.

  • RecvError::Lagged(n) is recoverable. A lagged receiver missed n messages but can continue receiving future ones. Whether missing messages is acceptable is application-specific; always handle it explicitly rather than treating it as a fatal error.

  • watch::channel(initial) is for state distribution: the latest value, not every intermediate value. borrow() reads without waiting. changed().await waits for the next update. Receivers created after a send see the current value immediately.

  • Use broadcast when every subscriber must receive every event. Use watch when subscribers need the current state and can tolerate missing intermediate updates. Use mpsc when each message should be consumed by exactly one task.

  • Arc<Config> wrapped in a watch channel is the idiomatic pattern for distributing read-heavy configuration to many tasks. The watch notify is cheap; the config read is a lock-free borrow().


Lesson 3 — Fan-In Aggregation: Merging Streams from Multiple Satellite Feeds

Module: Foundation — M03: Message Passing Patterns
Position: Lesson 3 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 3, 8



Context

Lesson 1 covered moving data from many producers to one consumer via MPSC. That is fan-in at its simplest: all producers push to the same channel. But the Meridian aggregator's real requirements are more demanding. The 48 uplink sessions produce at different rates. Archived replay feeds produce at a different priority level than live feeds. A session that goes silent should not block the aggregator from processing the other 47. A priority command frame from a SAFE_MODE event should not wait behind a queue of housekeeping frames.

These requirements call for structured fan-in: merging multiple independent async sources into one stream, with control over priority, fairness, and behaviour when sources are slow or silent. This lesson covers three fan-in patterns — shared-sender MPSC, select!-based merge with priority, and the router actor pattern — and when to use each.

Source: Async Rust, Chapters 3 & 8 (Flitton & Morton)


Core Concepts

Shared-Sender Fan-In: The Simple Case

The simplest fan-in is the one already established in Lesson 1: clone the Sender, give each producer a clone, and let the single Receiver consume them all. Every message enters the same queue; the consumer sees them in arrival order.

use tokio::sync::mpsc;

async fn uplink_producer(satellite_id: u32, tx: mpsc::Sender<(u32, Vec<u8>)>) {
    for seq in 0u8..5 {
        let frame = vec![satellite_id as u8, seq];
        if tx.send((satellite_id, frame)).await.is_err() {
            break; // Aggregator shut down.
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u32, Vec<u8>)>(256);

    for sat_id in 0..4u32 {
        tokio::spawn(uplink_producer(sat_id, tx.clone()));
    }
    drop(tx);

    while let Some((sat, frame)) = rx.recv().await {
        println!("sat {sat}: {:?}", frame);
    }
}

This is correct and efficient for uniform, same-priority inputs. It has one limitation: arrival order provides no priority control. A SAFE_MODE frame from satellite 7 waits behind whatever housekeeping frames arrived first.

select!-Based Priority Fan-In

When sources have different priorities, select! can implement a priority receive by always checking a high-priority channel before a lower-priority one. Tokio's select! macro randomly selects among ready branches for fairness, but the biased modifier overrides this and evaluates branches in source order:

use tokio::sync::mpsc;

async fn priority_aggregator(
    mut high: mpsc::Receiver<Vec<u8>>,
    mut low: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        // biased: always check high-priority first.
        // Without biased, both channels are polled in random order —
        // low-priority frames could be dispatched before high-priority ones
        // if both are ready simultaneously.
        tokio::select! {
            biased;
            Some(frame) = high.recv() => {
                println!("HIGH: {} bytes", frame.len());
            }
            Some(frame) = low.recv() => {
                println!("LOW: {} bytes", frame.len());
            }
            else => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (high_tx, high_rx) = mpsc::channel::<Vec<u8>>(64);
    let (low_tx, low_rx) = mpsc::channel::<Vec<u8>>(256);

    // High-priority: SAFE_MODE and emergency commands.
    tokio::spawn(async move {
        high_tx.send(vec![0xFF; 8]).await.unwrap(); // emergency frame
    });

    // Low-priority: housekeeping telemetry.
    tokio::spawn(async move {
        for _ in 0..3 {
            low_tx.send(vec![0x00; 64]).await.unwrap();
        }
    });

    priority_aggregator(high_rx, low_rx).await;
}

biased is important here. Without it, if both channels have messages ready, select! randomly picks which to process — a high-priority frame could wait behind three low-priority frames. With biased, the high-priority channel is always drained first. The tradeoff: if the high-priority channel receives messages faster than they are processed, the low-priority channel is starved. For mission-critical applications like SAFE_MODE injection, this is the intended behaviour.

This pattern directly implements what Async Rust Chapter 3 builds when constructing a priority async queue with HIGH_CHANNEL and LOW_CHANNEL — the concept is the same, applied to async channels rather than thread queues.

Tagging Messages with Source Identity

When fan-in merges undifferentiated Vec<u8> frames from multiple sources, the consumer cannot determine which satellite the frame came from. Tag messages at the producer side with an enum or a source identifier:

use tokio::sync::mpsc;

#[derive(Debug)]
enum FeedKind {
    LiveUplink { satellite_id: u32 },
    ArchivedReplay { mission_id: String },
}

#[derive(Debug)]
struct TaggedFrame {
    source: FeedKind,
    sequence: u64,
    payload: Vec<u8>,
}

async fn live_uplink(sat_id: u32, tx: mpsc::Sender<TaggedFrame>) {
    for seq in 0u64..3 {
        let _ = tx.send(TaggedFrame {
            source: FeedKind::LiveUplink { satellite_id: sat_id },
            sequence: seq,
            payload: vec![sat_id as u8; 32],
        }).await;
    }
}

async fn replay_feed(mission: String, tx: mpsc::Sender<TaggedFrame>) {
    for seq in 0u64..2 {
        let _ = tx.send(TaggedFrame {
            source: FeedKind::ArchivedReplay { mission_id: mission.clone() },
            sequence: seq,
            payload: vec![0xAA; 128],
        }).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<TaggedFrame>(128);

    for sat_id in 0..3u32 {
        tokio::spawn(live_uplink(sat_id, tx.clone()));
    }
    tokio::spawn(replay_feed("ARTEMIS-IV".to_string(), tx.clone()));
    drop(tx);

    while let Some(frame) = rx.recv().await {
        match &frame.source {
            FeedKind::LiveUplink { satellite_id } => {
                println!("live sat {satellite_id} seq {}: {} bytes",
                    frame.sequence, frame.payload.len());
            }
            FeedKind::ArchivedReplay { mission_id } => {
                println!("replay {mission_id} seq {}: {} bytes",
                    frame.sequence, frame.payload.len());
            }
        }
    }
}

Using an enum for source identity is more robust than a raw integer: the compiler enforces that all source types are handled. When a new source type is added, match exhaustiveness forces updates at all handling sites.

The Router Actor Pattern

For more than two or three sources, or when sources are created dynamically (e.g., a new ground station connection comes online mid-session), a router actor is the correct abstraction. The router owns a set of active input channels, polls them all, and forwards to a single output channel. This is the pattern Async Rust Chapter 8 builds as the foundation of its actor system.

use tokio::sync::mpsc;
use std::collections::HashMap;

#[derive(Debug)]
struct TaggedFrame {
    source_id: u32,
    payload: Vec<u8>,
}

enum RouterMsg {
    /// Register a new uplink feed.
    AddFeed { source_id: u32, feed: mpsc::Receiver<Vec<u8>> },
    /// Remove an uplink feed (session ended).
    RemoveFeed { source_id: u32 },
}

async fn router_actor(
    mut ctrl: mpsc::Receiver<RouterMsg>,
    out: mpsc::Sender<TaggedFrame>,
) {
    // Tokio's mpsc doesn't provide a built-in multi-receiver select,
    // so we use a secondary MPSC where all feeds forward their frames.
    let (internal_tx, mut internal_rx) = mpsc::channel::<TaggedFrame>(512);
    let mut feed_handles: HashMap<u32, tokio::task::JoinHandle<()>> = HashMap::new();

    loop {
        tokio::select! {
            // Control messages: add or remove feeds.
            Some(msg) = ctrl.recv() => {
                match msg {
                    RouterMsg::AddFeed { source_id, mut feed } => {
                        let fwd_tx = internal_tx.clone();
                        let handle = tokio::spawn(async move {
                            while let Some(payload) = feed.recv().await {
                                if fwd_tx.send(TaggedFrame { source_id, payload }).await.is_err() {
                                    break; // Router shut down.
                                }
                            }
                            tracing::debug!(source_id, "feed task exiting");
                        });
                        feed_handles.insert(source_id, handle);
                    }
                    RouterMsg::RemoveFeed { source_id } => {
                        if let Some(handle) = feed_handles.remove(&source_id) {
                            handle.abort(); // Feed task no longer needed.
                        }
                    }
                }
            }
            // Frames from all registered feeds, already merged via the internal channel.
            Some(frame) = internal_rx.recv() => {
                if out.send(frame).await.is_err() {
                    break; // Downstream consumer has shut down.
                }
            }
            else => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (ctrl_tx, ctrl_rx) = mpsc::channel::<RouterMsg>(8);
    let (out_tx, mut out_rx) = mpsc::channel::<TaggedFrame>(256);

    tokio::spawn(router_actor(ctrl_rx, out_tx));

    // Register two satellite feeds dynamically.
    for sat_id in [25544u32, 48274] {
        let (feed_tx, feed_rx) = mpsc::channel::<Vec<u8>>(32);
        ctrl_tx.send(RouterMsg::AddFeed {
            source_id: sat_id,
            feed: feed_rx,
        }).await.unwrap();

        tokio::spawn(async move {
            for i in 0u8..3 {
                feed_tx.send(vec![i; 16]).await.unwrap();
            }
        });
    }

    drop(ctrl_tx);

    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
    let mut count = 0;
    while let Ok(frame) = tokio::time::timeout(
        tokio::time::Duration::from_millis(20),
        out_rx.recv()
    ).await {
        if let Some(f) = frame {
            println!("sat {}: {} bytes", f.source_id, f.payload.len());
            count += 1;
        } else {
            break;
        }
    }
    println!("total frames: {count}");
}

Each registered feed gets a dedicated forwarding task that moves frames to the router's internal channel. The router selects between control messages (add/remove feeds) and forwarded frames. Adding a new satellite source at runtime is a single ctrl_tx.send(RouterMsg::AddFeed {...}) call — no restructuring of the select loop. One caveat in this demo: the router holds internal_tx itself, so internal_rx never closes and the router never exits on its own; main drains with a timeout and relies on process exit to tear the router down. A production router would track feed completion explicitly.


Key Takeaways

  • Shared-sender MPSC is the simplest fan-in: all producers clone the Sender, and the consumer reads from the single Receiver. Use it when sources have equal priority and arrival order is acceptable.

  • select! with biased implements priority fan-in: the first branch is always evaluated before the second. Use it for two or three sources with different priority levels. Without biased, select! randomizes branch selection — a high-priority source is not guaranteed to be drained first when both are ready.

  • Tag messages at the source with a typed identifier (enum or struct field) rather than relying on arrival order to infer provenance. An enum exhaustiveness check at the consumer forces all source types to be handled explicitly.

  • The router actor pattern handles dynamic fan-in: sources can be registered and deregistered at runtime via control messages. Each source gets a dedicated forwarding task that converts its Receiver into tagged frames on the internal channel. The router selects between control and data messages.

  • Fan-in and fan-out compose: an aggregator can receive from a router (fan-in) and forward to a broadcast channel (fan-out), building a full hub-and-spoke telemetry pipeline from these primitives, as the sketch below shows.
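
A minimal sketch of that composition, with illustrative channel names; frames merged into one mpsc receiver are re-published on a broadcast sender:

use tokio::sync::{broadcast, mpsc};

// Bridge: drain the fan-in side, re-publish on the fan-out side.
async fn hub(mut merged: mpsc::Receiver<Vec<u8>>, fanout: broadcast::Sender<Vec<u8>>) {
    while let Some(frame) = merged.recv().await {
        // send() errs only when no subscriber exists — not fatal for a hub.
        let _ = fanout.send(frame);
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>(64);
    let (bcast_tx, mut dashboard) = broadcast::channel::<Vec<u8>>(64);
    let mut archive = bcast_tx.subscribe();

    tokio::spawn(hub(rx, bcast_tx));

    // Two producers fan in.
    for id in 0u8..2 {
        let tx = tx.clone();
        tokio::spawn(async move { tx.send(vec![id; 4]).await.unwrap(); });
    }
    drop(tx);

    // Two consumers fan out; recv() returns Err(Closed) once the hub exits.
    while let Ok(f) = dashboard.recv().await { println!("dashboard: {f:?}"); }
    while let Ok(f) = archive.recv().await { println!("archive: {f:?}"); }
}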


Project — Multi-Source Telemetry Aggregator

Module: Foundation — M03: Message Passing Patterns
Prerequisite: All three module quizzes passed (≥70%)



Mission Brief

TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0047 — Telemetry Aggregation Pipeline


The control plane currently receives telemetry from 48 LEO satellite uplinks and two archived replay feeds simultaneously during mission replay operations. Each source produces frames at independent rates. Emergency command frames from any source must be processed before routine telemetry. Downstream analytics consumers need every frame; a monitoring dashboard needs only the latest pipeline statistics.

Your task is to build the telemetry aggregation pipeline that connects these sources to their consumers. The pipeline must: fan-in all sources into a priority-ordered stream, fan-out to a downstream frame processor and to a monitoring dashboard, apply backpressure so fast sources cannot overwhelm the processor, and shut down cleanly when signalled.


System Specification

Frame Types

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub enum FramePriority {
    Emergency,  // SAFE_MODE, ABORT commands
    Routine,    // Standard telemetry
}

#[derive(Debug, Clone)]
pub struct Frame {
    pub source_id: u32,
    pub source_kind: SourceKind,
    pub priority: FramePriority,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

#[derive(Debug, Clone)]
pub enum SourceKind {
    LiveUplink,
    ArchivedReplay,
}
}

Pipeline Architecture

[Uplink 0..48]  ──┐
                  ├─► [Router Actor] ─► [Priority Fan-In] ─► [Frame Processor]
[Replay 0..2]   ──┘         │                                        │
                             └──────────────────────────────► [Broadcast: all frames]
                                                                      │
                                                              [Dashboard] [Archive]

[watch: shutdown] ──────────────────────────────────────► All tasks
[watch: stats]    ◄──────────────────── Frame Processor (updates atomically)

Behavioural Requirements

Fan-in: Frames from live uplinks and archived replays are merged via a router actor that supports dynamic source registration. Emergency frames must be prioritised over routine frames when both are available simultaneously.

Backpressure: The frame processor has a bounded input channel (capacity 64). When the processor is saturated, backpressure propagates up to the priority fan-in, which in turn applies pressure to the router's internal channel. Routine sources are slowed; emergency frames still make progress due to priority ordering.

Fan-out: Every processed frame is sent over a broadcast channel to all downstream consumers. The monitoring dashboard subscribes; an archive writer task subscribes. The dashboard is allowed to lag and handles RecvError::Lagged gracefully.

Stats: The pipeline maintains three AtomicU64 counters: frames_routed, frames_processed, emergency_count. These are exposed via a watch channel as a PipelineStats snapshot, updated by the frame processor after each frame.

Shutdown: A watch<bool> shutdown signal is distributed to all tasks. On signal: (1) stop accepting new frames from sources, (2) drain the priority fan-in channel, (3) close the broadcast channel, (4) all tasks exit within 5 seconds.


Expected Output

A binary that:

  1. Starts a router actor accepting dynamic source registration
  2. Registers 4 live uplink sources (each sending 10 frames) and 1 replay source (sending 5 frames)
  3. Marks 2 of the 10 frames from each live uplink source as Emergency
  4. Runs a frame processor that logs each frame with its priority and source
  5. Runs a monitoring task that reads watch<PipelineStats> every 50ms and prints stats
  6. Runs a downstream archive task subscribed to the broadcast channel
  7. Sends shutdown signal after all sources finish; all tasks exit cleanly

The output should clearly show emergency frames being processed before routine frames from the same batch.


Acceptance Criteria

  1. Emergency frames processed before queued routine frames from the same source.
     Verifiable: yes — log order.
  2. New sources can be registered at runtime via the router control channel.
     Verifiable: yes — sources registered mid-run.
  3. Frame processor channel capacity is enforced — producers yield when full.
     Verifiable: yes — add tokio::time::sleep in the processor and verify producers do not drop frames.
  4. All downstream consumers receive every processed frame via broadcast.
     Verifiable: yes — counts match between processor and archive consumer.
  5. Stats watch channel provides the latest snapshot without acquiring any lock.
     Verifiable: yes — code review: only atomic loads in the stats read path.
  6. Shutdown drains the fan-in channel before exiting.
     Verifiable: yes — no frames lost after the shutdown signal.
  7. Lagged broadcast receivers log a warning and continue — they do not crash.
     Verifiable: yes — introduce a slow archive task and verify Lagged is handled.

Hints

Hint 1 — Priority fan-in with biased select!

Use two channels from the router: one for emergency frames, one for routine. The priority fan-in selects with biased:

#![allow(unused)]
fn main() {
async fn priority_fan_in(
    mut emergency_rx: tokio::sync::mpsc::Receiver<Frame>,
    mut routine_rx: tokio::sync::mpsc::Receiver<Frame>,
    out_tx: tokio::sync::mpsc::Sender<Frame>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            biased;
            Some(f) = emergency_rx.recv() => {
                if out_tx.send(f).await.is_err() { break; }
            }
            Some(f) = routine_rx.recv() => {
                if out_tx.send(f).await.is_err() { break; }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
            else => break,
        }
    }
}
}
Hint 2 — Router actor with dynamic registration

The router forwards all sources to two internal channels split by priority. Each registered source gets a forwarding task:

#![allow(unused)]
fn main() {
enum RouterMsg {
    AddSource {
        source_id: u32,
        source_kind: SourceKind,
        feed: tokio::sync::mpsc::Receiver<Frame>,
    },
    RemoveSource { source_id: u32 },
}
}

The forwarding task reads from the feed and sends to the appropriate internal channel based on frame.priority.
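
A sketch of that forwarding task, assuming the router's internal senders are named emergency_tx and routine_tx (Frame and FramePriority are the spec types above):

#![allow(unused)]
fn main() {
async fn forward_feed(
    mut feed: tokio::sync::mpsc::Receiver<Frame>,
    emergency_tx: tokio::sync::mpsc::Sender<Frame>,
    routine_tx: tokio::sync::mpsc::Sender<Frame>,
) {
    while let Some(frame) = feed.recv().await {
        // Split by priority into the router's two internal channels.
        let dest = match frame.priority {
            FramePriority::Emergency => &emergency_tx,
            FramePriority::Routine => &routine_tx,
        };
        if dest.send(frame).await.is_err() {
            break; // Router side shut down.
        }
    }
}
}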

Hint 3 — Stats snapshot with watch + atomics

The frame processor updates atomic counters after each frame, then sends a snapshot to the watch channel:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::Arc;

#[derive(Clone, Debug, Default)]
pub struct PipelineStats {
    pub frames_processed: u64,
    pub emergency_count: u64,
}

struct StatsTracker {
    frames_processed: AtomicU64,
    emergency_count: AtomicU64,
    tx: tokio::sync::watch::Sender<PipelineStats>,
}

impl StatsTracker {
    fn record(&self, is_emergency: bool) {
        self.frames_processed.fetch_add(1, Relaxed);
        if is_emergency {
            self.emergency_count.fetch_add(1, Relaxed);
        }
        // Publish a snapshot — receivers always see the latest.
        let _ = self.tx.send(PipelineStats {
            frames_processed: self.frames_processed.load(Relaxed),
            emergency_count: self.emergency_count.load(Relaxed),
        });
    }
}
}
Hint 4 — Broadcast fan-out with lagged handling

The archive consumer treats Lagged as a recoverable gap rather than a failure:
#![allow(unused)]
fn main() {
async fn archive_consumer(
    mut rx: tokio::sync::broadcast::Receiver<Frame>,
) {
    let mut archived = 0u64;
    loop {
        match rx.recv().await {
            Ok(frame) => {
                archived += 1;
                tracing::debug!(
                    source = frame.source_id,
                    seq = frame.sequence,
                    "archived"
                );
            }
            Err(tokio::sync::broadcast::error::RecvError::Lagged(n)) => {
                // Archive fell behind — note the gap and continue.
                tracing::warn!(missed = n, "archive lagged");
            }
            Err(tokio::sync::broadcast::error::RecvError::Closed) => {
                tracing::info!(total = archived, "archive consumer done");
                break;
            }
        }
    }
}
}

Reference Implementation

// This reference implementation is intentionally condensed.
// A production implementation would split into modules.
use tokio::sync::{broadcast, mpsc, watch};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::Arc;
use std::collections::HashMap;
use tokio::time::{sleep, Duration};

#[derive(Debug, Clone)]
pub enum FramePriority { Emergency, Routine }

#[derive(Debug, Clone)]
pub enum SourceKind { LiveUplink, ArchivedReplay }

#[derive(Debug, Clone)]
pub struct Frame {
    pub source_id: u32,
    pub source_kind: SourceKind,
    pub priority: FramePriority,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

#[derive(Clone, Debug, Default)]
pub struct PipelineStats {
    pub frames_processed: u64,
    pub emergency_count: u64,
}

enum RouterMsg {
    AddSource {
        source_id: u32,
        feed: mpsc::Receiver<Frame>,
    },
}

async fn router_actor(
    mut ctrl: mpsc::Receiver<RouterMsg>,
    emergency_tx: mpsc::Sender<Frame>,
    routine_tx: mpsc::Sender<Frame>,
) {
    let (internal_tx, mut internal_rx) = mpsc::channel::<Frame>(512);
    let mut handles: HashMap<u32, tokio::task::JoinHandle<()>> = HashMap::new();

    loop {
        tokio::select! {
            Some(msg) = ctrl.recv() => {
                match msg {
                    RouterMsg::AddSource { source_id, mut feed } => {
                        let fwd = internal_tx.clone();
                        let h = tokio::spawn(async move {
                            while let Some(frame) = feed.recv().await {
                                if fwd.send(frame).await.is_err() { break; }
                            }
                        });
                        handles.insert(source_id, h);
                    }
                }
            }
            Some(frame) = internal_rx.recv() => {
                let dest = match frame.priority {
                    FramePriority::Emergency => &emergency_tx,
                    FramePriority::Routine   => &routine_tx,
                };
                if dest.send(frame).await.is_err() { break; }
            }
            else => break,
        }
    }
}

async fn priority_fan_in(
    mut emerg_rx: mpsc::Receiver<Frame>,
    mut routine_rx: mpsc::Receiver<Frame>,
    out_tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            biased;
            Some(f) = emerg_rx.recv()  => { if out_tx.send(f).await.is_err() { break; } }
            Some(f) = routine_rx.recv() => { if out_tx.send(f).await.is_err() { break; } }
            Ok(()) = shutdown.changed() => { if *shutdown.borrow() { break; } }
            else => break,
        }
    }
}

async fn frame_processor(
    mut rx: mpsc::Receiver<Frame>,
    bcast_tx: broadcast::Sender<Frame>,
    stats_tx: watch::Sender<PipelineStats>,
    processed: Arc<AtomicU64>,
    emergency: Arc<AtomicU64>,
) {
    while let Some(frame) = rx.recv().await {
        let is_emerg = matches!(frame.priority, FramePriority::Emergency);
        tracing::info!(
            source = frame.source_id, seq = frame.sequence,
            priority = if is_emerg { "EMERGENCY" } else { "routine" },
            "processed"
        );
        processed.fetch_add(1, Relaxed);
        if is_emerg { emergency.fetch_add(1, Relaxed); }
        let _ = stats_tx.send(PipelineStats {
            frames_processed: processed.load(Relaxed),
            emergency_count: emergency.load(Relaxed),
        });
        let _ = bcast_tx.send(frame);
    }
}

async fn archive_consumer(mut rx: broadcast::Receiver<Frame>) {
    let mut count = 0u64;
    loop {
        match rx.recv().await {
            Ok(_) => count += 1,
            Err(broadcast::error::RecvError::Lagged(n)) => {
                tracing::warn!(missed = n, "archive lagged");
            }
            Err(broadcast::error::RecvError::Closed) => {
                tracing::info!(total = count, "archive done");
                break;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    let (shutdown_tx, shutdown_rx) = watch::channel(false);
    let (stats_tx, mut stats_rx) = watch::channel(PipelineStats::default());
    let (bcast_tx, _) = broadcast::channel::<Frame>(128);
    let (ctrl_tx, ctrl_rx) = mpsc::channel::<RouterMsg>(8);
    let (emerg_tx, emerg_rx) = mpsc::channel::<Frame>(64);
    let (routine_tx, routine_rx) = mpsc::channel::<Frame>(256);
    let (proc_tx, proc_rx) = mpsc::channel::<Frame>(64);

    let processed = Arc::new(AtomicU64::new(0));
    let emergency = Arc::new(AtomicU64::new(0));

    // Start pipeline tasks.
    tokio::spawn(router_actor(ctrl_rx, emerg_tx, routine_tx));
    tokio::spawn(priority_fan_in(emerg_rx, routine_rx, proc_tx, shutdown_rx.clone()));
    tokio::spawn(frame_processor(
        proc_rx, bcast_tx.clone(), stats_tx,
        Arc::clone(&processed), Arc::clone(&emergency),
    ));
    tokio::spawn(archive_consumer(bcast_tx.subscribe()));

    // Register 4 live uplink sources.
    for sat_id in 0..4u32 {
        let (feed_tx, feed_rx) = mpsc::channel::<Frame>(32);
        ctrl_tx.send(RouterMsg::AddSource { source_id: sat_id, feed: feed_rx }).await.unwrap();
        tokio::spawn(async move {
            for seq in 0u64..10 {
                let priority = if seq < 2 { FramePriority::Emergency } else { FramePriority::Routine };
                feed_tx.send(Frame {
                    source_id: sat_id, source_kind: SourceKind::LiveUplink,
                    priority, sequence: seq, payload: vec![sat_id as u8; 32],
                }).await.unwrap();
                sleep(Duration::from_millis(5)).await;
            }
        });
    }

    // Stats monitor.
    tokio::spawn(async move {
        for _ in 0..4 {
            sleep(Duration::from_millis(50)).await;
            stats_rx.changed().await.ok();
            let s = stats_rx.borrow().clone();
            println!("stats: processed={} emergency={}", s.frames_processed, s.emergency_count);
        }
    });

    sleep(Duration::from_millis(300)).await;
    println!("sending shutdown");
    shutdown_tx.send(true).unwrap();
    sleep(Duration::from_millis(100)).await;
    println!("final: processed={} emergency={}",
        processed.load(Relaxed), emergency.load(Relaxed));
}

Reflection

This project assembles the full message-passing toolkit from Module 3. The router actor provides dynamic fan-in with independent source lifecycle management. The priority fan-in ensures emergency frames are never delayed by routine traffic. The broadcast channel distributes every processed frame to all downstream consumers. The watch channel distributes state — shutdown signal and pipeline stats — without requiring consumers to hold any lock.

The pattern here — router → priority queue → processor → broadcast — recurs throughout Meridian's data pipeline architecture. In Module 4 (Network Programming), the router actor gains TCP listener integration, turning it into a full ground station connection broker.

Module 04 — Network Programming

Track: Foundation — Mission Control Platform
Position: Module 4 of 6
Source material: Tokio tutorial I/O and Framing chapters; reqwest documentation; tokio::net API docs
Quiz pass threshold: 70% on all three lessons to unlock the project

Note on source book: Network Programming with Rust (Chanda, 2018) uses pre-async/await Tokio 0.1 APIs that are incompatible with current Tokio 1.x. Lesson content is grounded in the current Tokio tutorial and API documentation rather than that book.



Mission Context

The Meridian control plane's telemetry pipeline now has a complete message-passing architecture (Module 3). What it still lacks is the network layer: the actual TCP connections from ground stations that feed the pipeline. This module builds that layer — connecting the abstract pipeline to the physical network.

The control plane operates three distinct network protocols simultaneously: persistent TCP sessions with ground stations (framed, long-lived, must reconnect on failure), UDP datagrams from SDA radar and optical sensors (high-frequency, latency-sensitive, loss-tolerant), and outbound HTTP calls to the TLE catalog API and mission operations endpoints (request-response, with retry logic).


What You Will Learn

By the end of this module you will be able to:

  • Build async TCP servers with tokio::net::TcpListener, spawn per-connection tasks, handle EOF correctly, and shut down the accept loop cleanly via a watch channel shutdown signal
  • Use AsyncReadExt::read_exact for length-prefix framing, split sockets with TcpStream::split() and into_split() for concurrent read/write, and wrap writers in BufWriter to reduce syscall overhead
  • Add per-session timeouts to detect silent connections (antenna tracking failures, network blackouts) without leaving ghost sessions open
  • Bind and use tokio::net::UdpSocket in both connected and unconnected modes, understand why UDP receive buffers must be sized to the maximum datagram, and apply try_send rather than blocking in high-frequency sensor pipelines
  • Build a production reqwest::Client with appropriate timeout configuration, share it via Clone across async tasks, use error_for_status() correctly, and implement exponential backoff retry logic that distinguishes retryable server errors from non-retryable client errors

Lessons

Lesson 1 — TCP Servers with tokio::net: Listeners, Connection Handling, and Graceful Shutdown

Covers TcpListener::bind and the accept loop, AsyncReadExt/AsyncWriteExt extension traits, read_exact for framing, EOF handling, TcpStream::split() vs into_split(), BufWriter for write batching, read timeouts, and graceful shutdown of both the accept loop and individual connections.

Key question this lesson answers: How do you build a TCP server that handles many concurrent connections correctly — reading frames, handling EOF, splitting for bidirectional I/O, and shutting down cleanly?

lesson-01-tcp-servers.md / lesson-01-quiz.toml


Lesson 2 — UDP and Datagram Protocols: Low-Latency Sensor Data

Covers UdpSocket::bind, recv_from/send_to semantics, connected vs unconnected mode, concurrent send/receive via Arc<UdpSocket>, buffer sizing and IP fragmentation, OS socket buffer tuning with socket2, and the decision between UDP and TCP for high-frequency sensor pipelines.

Key question this lesson answers: When does UDP's lack of ordering and reliability become an advantage, and how do you structure a receiver that does not block on a slow downstream consumer?

lesson-02-udp.md / lesson-02-quiz.toml


Lesson 3 — HTTP Clients with reqwest: Async REST Calls

Covers reqwest::Client construction and sharing, ClientBuilder timeout configuration, error_for_status(), .json() for serialization/deserialization, retry logic with exponential backoff and jitter, status-code-based retry decisions, and multiple clients for services with different SLOs.

Key question this lesson answers: How do you build a robust HTTP client that handles transient failures without hammering a rate-limited API, and correctly distinguishes retryable errors from permanent ones?

lesson-03-http-clients.md / lesson-03-quiz.toml


Capstone Project — Ground Station Network Client

Build the full ground station client: connects to a TCP endpoint using the length-prefix framing protocol, automatically reconnects on failure with exponential backoff, runs a background TLE refresh via HTTP with retry logic, forwards received frames to the downstream aggregator pipeline via try_send, publishes session state via a watch channel, and shuts down cleanly including a GOODBYE frame to the peer.

Acceptance is against 7 verifiable criteria including automatic reconnection, bounded backoff, 5-minute failure timeout, TLE retry, non-blocking frame forwarding, mid-frame shutdown safety, and state machine correctness.

project-gs-client.md


Prerequisites

Modules 1–3 must be complete. Module 1 established the async task model and tokio::select! — both used extensively in connection handlers. Module 3 established the message-passing pipeline that network frames feed into. Understanding mpsc::Sender and try_send from Module 3 is prerequisite to the UDP and TCP lessons' discussion of non-blocking frame forwarding.

What Comes Next

Module 5 — Data-Oriented Design in Rust shifts from I/O to computation: how to lay out structs for CPU cache efficiency, when to use struct-of-arrays vs array-of-structs, and arena allocation for high-throughput frame processing. The telemetry frames arriving via the TCP and UDP clients from this module are processed in bulk in Module 5.

Lesson 1 — TCP Servers with tokio::net: Listeners, Connection Handling, and Graceful Shutdown

Module: Foundation — M04: Network Programming
Position: Lesson 1 of 3
Source: Tokio tutorial — I/O and Framing chapters (tokio.rs/tokio/tutorial)

Source note: Network Programming with Rust (Chanda) uses pre-async/await Tokio 0.1 APIs that are incompatible with current Tokio 1.x. This lesson is grounded in the current Tokio tutorial and Tokio 1.x API documentation.



Context

Every uplink session in the Meridian control plane begins with a TCP connection from a ground station. The Module 1 broker project sketched this accept loop in broad strokes. This lesson provides the complete model: how TcpListener binds and accepts connections, how to split a socket for concurrent read and write, how AsyncReadExt and AsyncWriteExt handle framed protocols, how a connection handler exits cleanly on EOF or error, and how the accept loop itself shuts down gracefully without leaking tasks.

The patterns here are not specific to Meridian. Every TCP server in Rust's async ecosystem — from a Redis clone to a satellite control plane — uses the same building blocks. Understanding them at the structural level means you can build, debug, and extend any such system.


Core Concepts

TcpListener — Binding and Accepting

tokio::net::TcpListener::bind(addr) binds the socket and returns a TcpListener. listener.accept().await waits for the next incoming connection and returns a (TcpStream, SocketAddr) pair. The accept call is async — while waiting, the executor can run other tasks.

use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:7777").await?;

    loop {
        let (socket, addr) = listener.accept().await?;
        println!("connection from {addr}");
        // Each connection gets its own task.
        tokio::spawn(async move {
            handle_connection(socket).await;
        });
    }
}

async fn handle_connection(_socket: tokio::net::TcpStream) {
    // ... read frames, process, respond
}

The accept loop spawns a new task per connection and immediately loops back to accept the next one. The connection handler runs concurrently with all other handlers and with the accept loop itself. This is the fundamental async TCP server structure.

One critical detail: an error from listener.accept() does not always mean the listener is broken. ECONNABORTED (the peer reset the connection before accept completed) and resource-exhaustion errors such as EMFILE are transient and should be logged and retried; Tokio handles would-block internally, so it never surfaces here. An unrecoverable error (e.g., the listener fd was closed) should terminate the loop. A simple approach is to log the error and continue; a production-grade implementation adds exponential backoff on repeated errors, as sketched below.
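
A sketch of that backoff, assuming every accept error is worth retrying after a capped, doubling delay:

#![allow(unused)]
fn main() {
use tokio::net::{TcpListener, TcpStream};
use tokio::time::{sleep, Duration};

// Retry transient accept errors with a doubling delay (capped at 1s)
// so an error storm cannot spin the accept loop at 100% CPU.
async fn accept_with_backoff(listener: &TcpListener) -> (TcpStream, std::net::SocketAddr) {
    let mut delay = Duration::from_millis(10);
    loop {
        match listener.accept().await {
            Ok(conn) => return conn,
            Err(e) => {
                tracing::warn!("accept error: {e}; retrying in {delay:?}");
                sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(1));
            }
        }
    }
}
}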

AsyncRead, AsyncWrite, and Their Extension Traits

tokio::net::TcpStream implements both AsyncRead and AsyncWrite, but you almost never call their methods directly. Instead you use the extension traits AsyncReadExt and AsyncWriteExt (from tokio::io), which provide ergonomic higher-level methods:

Method                           Description
read(&mut buf)                   Read up to buf.len() bytes; returns 0 on EOF
read_exact(&mut buf)             Read exactly buf.len() bytes; errors on EOF
read_u32(), read_u64(), etc.     Read a big-endian integer
write_all(&buf)                  Write all bytes in buf
write_u32(n), etc.               Write a big-endian integer

read_exact is the right primitive for fixed-size framing (like Meridian's 4-byte length prefix). It guarantees the buffer is fully populated before returning, handling the case where the underlying read returns fewer bytes than requested.

EOF handling: read() returning Ok(0) means the remote has closed the write half of the connection. Any subsequent read() will also return Ok(0). When you see this, exit the read loop — continuing to call read() on a closed stream creates a 100% CPU spin loop.

#![allow(unused)]
fn main() {
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

async fn read_frame(stream: &mut TcpStream) -> anyhow::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];

    // read_exact returns Err(UnexpectedEof) if the connection closes mid-header.
    match stream.read_exact(&mut len_buf).await {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => {
            // Clean EOF at frame boundary — connection closed normally.
            return Ok(None);
        }
        Err(e) => return Err(e.into()),
    }

    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        anyhow::bail!("frame too large: {len} bytes");
    }

    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload).await?;
    Ok(Some(payload))
}
}

Splitting a Socket: io::split and TcpStream::split

A TcpStream implements both AsyncRead and AsyncWrite, but Rust's borrow rules prevent passing &mut stream to two concurrent operations at the same time. To read and write concurrently — for example, to handle a bidirectional protocol or to send heartbeat responses while reading frames — the socket must be split.

TcpStream::split() splits by reference. Both halves must remain on the same task, but the read and write can be used independently within a single select! or sequential pair. Zero cost — no Arc, no Mutex.

TcpStream::into_split() splits by value into an OwnedReadHalf and OwnedWriteHalf. Each half can be sent to a different task. Internally this costs an Arc: slightly more overhead than the reference split, but needed when the read and write halves must move to truly independent tasks. (The generic tokio::io::split() works on any AsyncRead + AsyncWrite type but pays for an internal lock; for TcpStream, prefer into_split().)

#![allow(unused)]
fn main() {
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

async fn bidirectional_handler(stream: TcpStream) -> anyhow::Result<()> {
    // into_split: value split — each half can move to separate tasks.
    let (mut reader, mut writer) = stream.into_split();

    // Write task: sends periodic heartbeats.
    let write_task = tokio::spawn(async move {
        loop {
            tokio::time::sleep(tokio::time::Duration::from_secs(30)).await;
            if writer.write_all(b"HEARTBEAT\n").await.is_err() {
                break;
            }
        }
    });

    // Read task: processes incoming frames.
    let mut buf = vec![0u8; 4096];
    loop {
        let n = reader.read(&mut buf).await?;
        if n == 0 { break; } // EOF
        tracing::debug!(bytes = n, "frame received");
    }

    write_task.abort();
    Ok(())
}
}

Use TcpStream::split() (reference) when both read and write stay in one task. Use TcpStream::into_split() (value) when they need to move to separate tasks.
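
For contrast with the into_split example above, a minimal echo handler using the zero-cost reference split (a sketch; the echo protocol is illustrative):

#![allow(unused)]
fn main() {
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

// Both halves borrow the stream and stay on one task. No Arc, no extra task.
async fn echo_with_split(mut stream: TcpStream) -> anyhow::Result<()> {
    let (mut reader, mut writer) = stream.split();
    let mut buf = [0u8; 1024];
    loop {
        let n = reader.read(&mut buf).await?;
        if n == 0 { break; } // EOF — peer closed its write half.
        writer.write_all(&buf[..n]).await?;
    }
    Ok(())
}
}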

BufWriter — Reducing Syscalls on the Write Path

Each write_all call is a syscall. For a protocol that sends many small writes (header bytes, then payload bytes), the overhead accumulates. Wrapping the write half in tokio::io::BufWriter buffers writes and flushes them in larger batches:

#![allow(unused)]
fn main() {
use tokio::io::{AsyncWriteExt, BufWriter};
use tokio::net::TcpStream;

async fn write_framed(stream: TcpStream, payload: &[u8]) -> anyhow::Result<()> {
    // BufWriter with 8KB internal buffer — flushes when full or on explicit flush().
    let mut writer = BufWriter::new(stream);

    // These two writes go to the internal buffer, not to the socket.
    let len = payload.len() as u32;
    writer.write_all(&len.to_be_bytes()).await?;
    writer.write_all(payload).await?;

    // flush() pushes the buffered bytes to the socket in one syscall.
    writer.flush().await?;
    Ok(())
}
}

Always call flush() after writing a complete logical unit (a frame, a response). If you return from the handler without flushing, buffered data is silently dropped when the BufWriter drops.

Graceful Shutdown of the Accept Loop

A simple loop { listener.accept().await? } has no shutdown path. The pattern from Lesson 3 of Module 1 applies here: race the accept against a shutdown signal with select!:

#![allow(unused)]
fn main() {
use tokio::net::TcpListener;
use tokio::sync::watch;

async fn accept_loop(
    listener: TcpListener,
    mut shutdown: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            accept = listener.accept() => {
                match accept {
                    Ok((socket, addr)) => {
                        tracing::info!(%addr, "connection accepted");
                        let sd = shutdown.clone();
                        tokio::spawn(async move {
                            connection_handler(socket, sd).await;
                        });
                    }
                    Err(e) => {
                        tracing::warn!("accept error: {e}");
                        // Continue — transient errors are normal.
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    tracing::info!("accept loop shutting down");
                    break;
                }
            }
        }
    }
}

async fn connection_handler(
    _socket: tokio::net::TcpStream,
    _shutdown: watch::Receiver<bool>,
) {
    // Read frames; check shutdown between reads.
}
}

Pass the watch::Receiver into each connection handler so that individual connections can also respond to the shutdown signal — stopping mid-read cleanly rather than being forcibly dropped.


Code Examples

Production Ground Station TCP Server

A complete TCP server for a Meridian ground station connection. Reads length-prefixed frames, forwards them to the telemetry aggregator from Module 3, and shuts down cleanly.

use anyhow::Result;
use tokio::{
    io::{AsyncReadExt, AsyncWriteExt},
    net::{TcpListener, TcpStream},
    sync::{mpsc, watch},
    time::{timeout, Duration},
};
use tracing::{info, warn};

#[derive(Debug)]
struct TelemetryFrame {
    station_id: String,
    payload: Vec<u8>,
}

async fn read_frame(stream: &mut TcpStream) -> Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match stream.read_exact(&mut len_buf).await {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e.into()),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        anyhow::bail!("frame too large: {len}");
    }
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf).await?;
    Ok(Some(buf))
}

async fn handle_connection(
    mut stream: TcpStream,
    station_id: String,
    frame_tx: mpsc::Sender<TelemetryFrame>,
    mut shutdown: watch::Receiver<bool>,
) {
    info!(station = %station_id, "session started");
    loop {
        tokio::select! {
            // Biased: poll the read branch first so a completed frame is
            // delivered before the shutdown branch is considered.
            biased;
            frame = timeout(Duration::from_secs(60), read_frame(&mut stream)) => {
                match frame {
                    // Session timeout — ground station went silent.
                    Err(_elapsed) => {
                        warn!(station = %station_id, "session timeout");
                        break;
                    }
                    Ok(Ok(Some(payload))) => {
                        if frame_tx.send(TelemetryFrame {
                            station_id: station_id.clone(),
                            payload,
                        }).await.is_err() {
                            break; // Aggregator shut down.
                        }
                    }
                    Ok(Ok(None)) => {
                        info!(station = %station_id, "connection closed by peer");
                        break;
                    }
                    Ok(Err(e)) => {
                        warn!(station = %station_id, "read error: {e}");
                        break;
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    info!(station = %station_id, "shutdown signal — closing session");
                    break;
                }
            }
        }
    }
    // Send a clean close to the peer.
    let _ = stream.shutdown().await;
    info!(station = %station_id, "session ended");
}

pub async fn run_tcp_server(
    bind_addr: &str,
    frame_tx: mpsc::Sender<TelemetryFrame>,
    shutdown: watch::Receiver<bool>,
) -> Result<()> {
    let listener = TcpListener::bind(bind_addr).await?;
    info!("ground station server listening on {bind_addr}");
    let mut conn_id = 0usize;
    let mut sd = shutdown.clone();

    loop {
        tokio::select! {
            accept = listener.accept() => {
                match accept {
                    Ok((socket, addr)) => {
                        conn_id += 1;
                        let station_id = format!("gs-{conn_id}@{addr}");
                        tokio::spawn(handle_connection(
                            socket,
                            station_id,
                            frame_tx.clone(),
                            shutdown.clone(),
                        ));
                    }
                    // Transient accept errors (ECONNABORTED, fd exhaustion)
                    // are not fatal; log and keep accepting.
                    Err(e) => warn!("accept error: {e}"),
                }
            }
            Ok(()) = sd.changed() => {
                if *sd.borrow() { break; }
            }
        }
    }
    info!("accept loop exited");
    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();
    let (frame_tx, mut frame_rx) = mpsc::channel::<TelemetryFrame>(256);
    let (shutdown_tx, shutdown_rx) = watch::channel(false);

    // Frame consumer.
    tokio::spawn(async move {
        while let Some(frame) = frame_rx.recv().await {
            info!(station = %frame.station_id, bytes = frame.payload.len(), "frame received");
        }
    });

    // Shutdown after 2 seconds for demo purposes.
    let sd = shutdown_tx;
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        let _ = sd.send(true);
    });

    run_tcp_server("0.0.0.0:7777", frame_tx, shutdown_rx).await
}

Several production decisions are embedded here: the timeout around read_frame handles silent connections (antenna loss, network blackout) without leaving ghost sessions open, and stream.shutdown() sends a TCP FIN to the peer on clean exit. The biased select! polls the read branch first, so a frame that completes at the same instant the shutdown signal fires is still delivered. Note that biased does not protect a read still in flight: if shutdown wins, that future is dropped mid-frame, which is acceptable here because the session is closing anyway. The Module 4 capstone addresses mid-frame shutdown explicitly.


Key Takeaways

  • TcpListener::bind().await binds the socket; listener.accept().await yields a (TcpStream, SocketAddr). Spawn a task per connection and loop back immediately — the accept loop should never be blocked by connection handling.

  • read() returning Ok(0) is EOF — the remote closed its write half. Continuing to call read() after EOF creates a spin loop. Always exit the read loop on Ok(0).

  • read_exact is the correct primitive for fixed-size framing. It handles short reads internally and returns UnexpectedEof if the connection closes before the buffer is filled.

  • Use TcpStream::split() for same-task read/write splitting (zero cost). Use TcpStream::into_split() when the read and write halves must move to separate tasks.

  • BufWriter batches small writes. Always call flush() after writing a complete logical unit — unflushed data is silently dropped when the writer drops.

  • Add a timeout to reads in long-lived connections. Ground stations go silent without warning. A 60-second read timeout detects ghost sessions that would otherwise hold open resources indefinitely.


Lesson 2 — UDP and Datagram Protocols: Low-Latency Sensor Data

Module: Foundation — M04: Network Programming
Position: Lesson 2 of 3
Source: Synthesized from training knowledge and tokio::net::UdpSocket documentation

Source note: This lesson synthesizes from current tokio::net::UdpSocket API documentation and training knowledge. Verify recv_from/send_to semantics and connected-vs-unconnected behaviour against the current API docs if anything appears to have changed; note that Tokio 1.x has no split() on UdpSocket, since its send and receive methods take &self.



Context

The Meridian Space Domain Awareness network includes optical sensors and radar installations that report raw detection events at high frequency with strict latency requirements. A radar return needs to reach the conjunction analysis pipeline in under 50ms. At that latency budget, TCP's per-packet acknowledgment and retransmission overhead is a liability, not a feature. When the occasional dropped packet is acceptable — or when the application layer manages its own loss detection — UDP is the right transport.

UDP is a datagram protocol: each send and recv corresponds to exactly one discrete packet. There are no streams, no connection establishment, no ordering guarantees, and no retransmission. What you get is low overhead, minimal kernel buffering, and latency that is bounded only by the network, not by protocol machinery.

This lesson covers tokio::net::UdpSocket: binding, sending, receiving, splitting for concurrent send/receive, and the design decisions around UDP in a high-frequency sensor pipeline.


Core Concepts

UDP Socket Basics

UdpSocket::bind(addr) creates a UDP socket bound to a local address. Unlike TCP, there is no accept loop and no connection concept. A single bound socket can send to any address and receive from any address:

use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Bind to receive on all interfaces, port 9090.
    let socket = UdpSocket::bind("0.0.0.0:9090").await?;

    let mut buf = [0u8; 1024];
    loop {
        // recv_from returns the number of bytes and the sender's address.
        let (n, addr) = socket.recv_from(&mut buf).await?;
        println!("received {n} bytes from {addr}: {:?}", &buf[..n]);

        // Echo back.
        socket.send_to(&buf[..n], addr).await?;
    }
}

recv_from waits for the next datagram. If the incoming datagram is larger than buf, the excess bytes are silently discarded — there is no partial read concept in UDP. Size your buffer to the maximum expected datagram, not the average.

Connected Mode vs. Unconnected Mode

An unconnected UDP socket can communicate with any remote address. A connected UDP socket is associated with one specific remote address via socket.connect(addr) — this is not a TCP handshake, just a filter on the local OS socket:

use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0").await?; // OS assigns port

    // "Connect" to the sensor — enables send/recv instead of send_to/recv_from.
    // Datagrams from other addresses are filtered out.
    socket.connect("192.168.1.100:5500").await?;

    socket.send(b"POLL").await?;

    let mut buf = [0u8; 256];
    let n = socket.recv(&mut buf).await?;
    println!("sensor response: {:?}", &buf[..n]);
    Ok(())
}

After connect(), use send/recv instead of send_to/recv_from. The OS filters datagrams to only those from the connected address, which is useful for point-to-point sensor polling. For a server receiving from many sensors, use the unconnected mode with recv_from.

Sharing a Socket for Concurrent Send/Receive

Unlike TcpStream, tokio's UdpSocket does not need splitting at all: send_to and recv_from take &self, so one socket can serve concurrent send and receive tasks. The idiomatic pattern is to wrap the socket in Arc and clone the handle into each task:

use std::sync::Arc;
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let socket = Arc::new(UdpSocket::bind("0.0.0.0:9090").await?);

    // For UdpSocket, Arc-sharing is the idiomatic split pattern
    // because both send_to and recv_from take &self.
    let recv_socket = Arc::clone(&socket);
    let send_socket = Arc::clone(&socket);

    let recv_task = tokio::spawn(async move {
        let mut buf = [0u8; 1024];
        loop {
            let (n, addr) = recv_socket.recv_from(&mut buf).await.unwrap();
            println!("recv {n} bytes from {addr}");
        }
    });

    let send_task = tokio::spawn(async move {
        // Periodic heartbeat to a known sensor address.
        loop {
            tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
            send_socket.send_to(b"HEARTBEAT", "192.168.1.100:5500")
                .await.unwrap();
        }
    });

    let _ = tokio::join!(recv_task, send_task);
    Ok(())
}

UdpSocket's send_to and recv_from take &self (shared reference), so wrapping in Arc lets multiple tasks share the same socket without splitting. This differs from TcpStream where read and write require &mut self.

Buffer Sizing and Packet Loss

UDP datagrams have a maximum payload of 65,507 bytes (the IPv4 limit: 65,535 minus the 20-byte IP header and 8-byte UDP header), but practical limits are far lower. A datagram that exceeds the network MTU (typically 1500 bytes on Ethernet) is fragmented at the IP layer, and if any fragment is lost, the entire datagram is discarded. For high-frequency sensor data, keep individual datagrams under 1472 bytes (1500 MTU - 20 IP header - 8 UDP header) to avoid fragmentation.
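
A tiny guard that encodes this budget (a sketch; the constant simply restates the arithmetic above):

#![allow(unused)]
fn main() {
// Enforce the no-fragmentation budget before sending: 1500-byte Ethernet MTU
// minus 20 bytes of IPv4 header and 8 bytes of UDP header.
const MAX_DATAGRAM: usize = 1500 - 20 - 8; // 1472 bytes

fn fits_in_one_datagram(payload: &[u8]) -> anyhow::Result<()> {
    anyhow::ensure!(
        payload.len() <= MAX_DATAGRAM,
        "datagram would fragment: {} > {MAX_DATAGRAM} bytes",
        payload.len()
    );
    Ok(())
}
}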

Buffer the receive socket at the OS level with SO_RCVBUF if sensor bursts arrive faster than the application can drain them. This requires socket2 or nix crate access to set socket options before wrapping in tokio::net::UdpSocket:

#![allow(unused)]
fn main() {
use socket2::{Socket, Domain, Type};
use std::net::SocketAddr;
use tokio::net::UdpSocket;

async fn bind_with_large_buffer(addr: &str) -> anyhow::Result<UdpSocket> {
    let addr: SocketAddr = addr.parse()?;
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, None)?;
    socket.set_reuse_address(true)?;
    // 4MB receive buffer to absorb radar bursts.
    socket.set_recv_buffer_size(4 * 1024 * 1024)?;
    socket.bind(&addr.into())?;
    socket.set_nonblocking(true)?;
    Ok(UdpSocket::from_std(socket.into())?)
}
}

When to Choose UDP over TCP

Situation                                                       Preferred
Radar/optical detection events, < 50ms latency budget           UDP
Telemetry frames requiring ordered delivery and reliability     TCP
Configuration commands — must not be lost                       TCP
Periodic status heartbeats where loss is acceptable             UDP
Bulk TLE catalog transfer                                       TCP
High-frequency position updates where only the latest matters   UDP

The core tradeoff: TCP adds ordering, reliability, and flow control at the cost of latency and per-connection overhead. UDP provides a raw datagram channel — if reliability matters, implement it yourself (sequence numbers, ACKs, retransmission) at the application layer.
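
If reliability is layered on top, per-sender sequence numbers are the usual starting point. A sketch of gap detection, with illustrative names:

#![allow(unused)]
fn main() {
// A monotonically increasing sequence number per sender reveals gaps.
struct SeqTracker {
    next_expected: u64,
}

impl SeqTracker {
    /// Returns how many datagrams were lost before `seq`.
    fn observe(&mut self, seq: u64) -> u64 {
        if seq < self.next_expected {
            return 0; // Late duplicate or reordered datagram — already counted.
        }
        let lost = seq - self.next_expected;
        self.next_expected = seq + 1;
        lost
    }
}
}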


Code Examples

SDA Radar Sensor Receiver

The Meridian SDA network has radar stations that broadcast detection events as UDP datagrams. The receiver processes them and forwards to the conjunction analysis pipeline. Packet loss is tolerable — a missed radar return is worse than a delayed one, but the next sweep arrives in 250ms anyway.

use std::net::SocketAddr;
use std::sync::Arc;
use tokio::net::UdpSocket;
use tokio::sync::mpsc;
use tokio::time::{timeout, Duration};

#[derive(Debug)]
struct RadarDetection {
    sensor_id: u32,
    azimuth_deg: f32,
    elevation_deg: f32,
    range_km: f32,
    timestamp_ms: u64,
}

fn parse_detection(buf: &[u8], addr: SocketAddr) -> Option<RadarDetection> {
    // Wire format: 4-byte sensor_id | 4-byte azimuth (f32 BE) |
    //              4-byte elevation (f32 BE) | 4-byte range (f32 BE) |
    //              8-byte timestamp (u64 BE)
    if buf.len() < 24 {
        return None; // Malformed datagram — discard silently.
    }
    let sensor_id = u32::from_be_bytes(buf[0..4].try_into().ok()?);
    let azimuth = f32::from_be_bytes(buf[4..8].try_into().ok()?);
    let elevation = f32::from_be_bytes(buf[8..12].try_into().ok()?);
    let range = f32::from_be_bytes(buf[12..16].try_into().ok()?);
    let timestamp = u64::from_be_bytes(buf[16..24].try_into().ok()?);

    tracing::debug!(%addr, sensor_id, "detection received");
    Some(RadarDetection {
        sensor_id,
        azimuth_deg: azimuth,
        elevation_deg: elevation,
        range_km: range,
        timestamp_ms: timestamp,
    })
}

async fn radar_receiver(
    bind_addr: &str,
    tx: mpsc::Sender<RadarDetection>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) -> anyhow::Result<()> {
    let socket = Arc::new(UdpSocket::bind(bind_addr).await?);
    tracing::info!("radar receiver listening on {bind_addr}");

    let mut buf = [0u8; 1472]; // Stay under MTU to avoid fragmentation.

    loop {
        tokio::select! {
            biased;
            recv = socket.recv_from(&mut buf) => {
                match recv {
                    Ok((n, addr)) => {
                        if let Some(detection) = parse_detection(&buf[..n], addr) {
                            // Non-blocking — drop if pipeline is full rather than
                            // blocking the receive loop. A queued radar sweep is
                            // useless by the time it clears the backlog.
                            if tx.try_send(detection).is_err() {
                                tracing::warn!("detection pipeline full — datagram dropped");
                            }
                        }
                    }
                    Err(e) => {
                        tracing::warn!("recv error: {e}");
                        // UDP recv errors are typically transient — continue.
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    tracing::info!("radar receiver shutting down");
                    break;
                }
            }
        }
    }
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let (tx, mut rx) = mpsc::channel::<RadarDetection>(512);
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);

    tokio::spawn(radar_receiver("0.0.0.0:9090", tx, shutdown_rx));

    // Consumer: conjunction analysis pipeline.
    tokio::spawn(async move {
        while let Some(det) = rx.recv().await {
            tracing::info!(
                sensor = det.sensor_id,
                az = det.azimuth_deg,
                el = det.elevation_deg,
                range = det.range_km,
                "detection processed"
            );
        }
    });

    // Demo: shut down after 5 seconds.
    tokio::time::sleep(Duration::from_secs(5)).await;
    shutdown_tx.send(true)?;
    tokio::time::sleep(Duration::from_millis(100)).await;
    Ok(())
}

try_send instead of send().await is deliberate here. If the conjunction pipeline is saturated, blocking the radar receive loop means subsequent datagrams pile up in the OS socket buffer and eventually overflow it too. Dropping one detection and keeping the receive loop running is the correct behaviour for high-frequency sensor data where recency matters more than completeness.


Key Takeaways

  • UDP is a datagram protocol — each send/recv is one discrete packet with no ordering, reliability, or congestion control. Use it when latency matters more than reliability, or when the application layer manages loss detection.

  • recv_from returns the number of bytes received and the sender's address. If the datagram is larger than the buffer, excess bytes are silently discarded. Size receive buffers to the maximum expected datagram, not the average.

  • connect() on a UDP socket is not a handshake — it sets a default remote address and filters incoming datagrams from other addresses. Use connected mode for point-to-point polling; use unconnected mode for servers receiving from many sources.

  • UdpSocket's send_to and recv_from take &self. Wrapping in Arc lets multiple tasks share one socket without a formal split — unlike TcpStream which requires into_split() or split() for concurrent access.

  • Keep datagrams under 1472 bytes on Ethernet networks to avoid IP fragmentation. A single lost IP fragment drops the entire datagram.

  • In high-frequency sensor pipelines, use try_send rather than send().await when forwarding to a downstream channel. Blocking the receive loop on a full channel is worse than dropping one datagram.


Lesson 3 — HTTP Clients with reqwest: Async REST Calls to Meridian's Mission API

Module: Foundation — M04: Network Programming
Position: Lesson 3 of 3
Source: Synthesized from reqwest documentation and training knowledge

Source note: This lesson synthesizes from reqwest 0.12.x API documentation and training knowledge. Verify connection pool configuration options against the current reqwest::ClientBuilder docs if behaviour differs.



Context

The Meridian control plane is not an island. It fetches TLE updates from the external Space-Track catalog API, posts conjunction alerts to the mission operations REST endpoint, and retrieves ground station configuration from an internal config service. All of these are HTTP calls — outbound, async, with retry logic and timeouts.

reqwest is the standard async HTTP client for Rust. It wraps hyper (the underlying HTTP implementation) with a high-level, ergonomic API, built-in connection pooling, JSON support through serde, and configurable timeout and retry behaviour. Understanding how to use it correctly — particularly how Client is shared, how connection pools work, and how to handle failures robustly — is essential for any Rust service that communicates with external APIs.


Core Concepts

Client — Shared, Pooled, Long-Lived

reqwest::Client manages a connection pool internally. Building a Client is expensive — it allocates the pool, sets up TLS configuration, and initializes the DNS resolver. A Client is designed to be created once and cloned cheaply for sharing across tasks.

#![allow(unused)]
fn main() {
use reqwest::Client;
use std::time::Duration;

fn build_client() -> anyhow::Result<Client> {
    Ok(Client::builder()
        // Overall request timeout: connection + headers + body.
        .timeout(Duration::from_secs(30))
        // How long to wait for the TCP connection to establish.
        .connect_timeout(Duration::from_secs(5))
        // Keep connections alive for reuse — avoids TCP handshake per request.
        .pool_idle_timeout(Duration::from_secs(90))
        .pool_max_idle_per_host(10)
        // User-Agent header for all requests.
        .user_agent("meridian-control-plane/1.0")
        .build()?)
}
}

Client is Clone — cloning it is a reference count increment that shares the same underlying connection pool. Pass a Client to tasks by cloning, not by wrapping in Arc<Mutex<Client>>. The Arc is already inside Client.

Never create a new Client per request. Each new Client is a new connection pool — you lose all the benefit of connection reuse and accumulate resource overhead proportional to your request rate.
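A sketch of the sharing pattern — the per-station status endpoint here is hypothetical, for illustration only:

#![allow(unused)]
fn main() {
use reqwest::Client;

// Share one Client across many tasks by cloning the handle — every clone
// uses the same underlying connection pool.
async fn spawn_station_pollers(client: Client) {
    for station in 0..12u32 {
        let client = client.clone(); // cheap: a reference-count bump
        tokio::spawn(async move {
            // Hypothetical endpoint, for illustration only.
            let url = format!("https://api.meridian.internal/stations/{station}/status");
            match client.get(&url).send().await {
                Ok(resp) => tracing::info!(station, status = %resp.status(), "polled"),
                Err(e) => tracing::warn!(station, "poll failed: {e}"),
            }
        });
    }
}
}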

Making Requests

The basic request pattern: call a method on the Client to get a RequestBuilder, add headers and body, call .send().await, check the status, and deserialize the response:

#![allow(unused)]
fn main() {
use reqwest::Client;
use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize)]
struct TleRecord {
    norad_id: u32,
    name: String,
    line1: String,
    line2: String,
}

async fn fetch_tle(client: &Client, norad_id: u32) -> anyhow::Result<TleRecord> {
    let url = format!("https://api.meridian.internal/tle/{norad_id}");
    let response = client
        .get(&url)
        .header("X-API-Key", "mission-control-key")
        .send()
        .await?;

    // error_for_status() converts 4xx/5xx responses into Err.
    // Without this, a 404 or 500 is not an error — you receive the body.
    let response = response.error_for_status()?;

    let record: TleRecord = response.json().await?;
    Ok(record)
}
}

error_for_status() is important. A 404 or 503 does not cause .send().await to return Err — only network errors do. If you omit error_for_status(), the error-page body of a 500 response is handed to the JSON deserializer as if it were a TleRecord, producing a confusing parse error rather than a clear HTTP error.

Sending JSON Bodies

For POST and PUT requests with JSON bodies, use .json(&value) on the RequestBuilder. It serializes the value with serde, sets the Content-Type: application/json header, and sets the body:

#![allow(unused)]
fn main() {
use reqwest::Client;
use serde::Serialize;

#[derive(Serialize)]
struct ConjunctionAlert {
    object_a: u32,
    object_b: u32,
    tca_seconds: f64,
    miss_distance_km: f64,
}

async fn post_alert(client: &Client, alert: &ConjunctionAlert) -> anyhow::Result<()> {
    client
        .post("https://api.meridian.internal/alerts")
        .json(alert)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
}

.json() requires the json feature on reqwest (enabled by default). For large payloads that should be streamed rather than buffered in memory, use .body(reqwest::Body::wrap_stream(stream)) instead.
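A sketch of the streaming variant, assuming reqwest's stream feature is enabled (required for Body::wrap_stream) and the futures-util crate is available; the /bulk endpoint and the in-memory chunk source are hypothetical:

#![allow(unused)]
fn main() {
use reqwest::{Body, Client};

// Stream the body from a sequence of byte chunks instead of buffering it.
// A real pipeline would stream from a file or a channel.
async fn post_streamed(client: &Client) -> anyhow::Result<()> {
    let chunks: Vec<Result<&'static [u8], std::io::Error>> =
        vec![Ok(b"chunk-1".as_slice()), Ok(b"chunk-2".as_slice())];
    let stream = futures_util::stream::iter(chunks);
    client
        .post("https://api.meridian.internal/bulk")
        .body(Body::wrap_stream(stream))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
}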

Retry Logic with Exponential Backoff

External APIs fail transiently — rate limits, brief outages, transient DNS failures. A single retry with a fixed delay is rarely sufficient. Exponential backoff with jitter is the standard approach: wait 1s, then 2s, then 4s, with random jitter to avoid thundering herds:

#![allow(unused)]
fn main() {
use reqwest::{Client, StatusCode};
use tokio::time::{sleep, Duration};

async fn fetch_with_retry(
    client: &Client,
    url: &str,
    max_attempts: u32,
) -> anyhow::Result<String> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        let result = client.get(url).send().await;

        match result {
            Ok(resp) if resp.status().is_success() => {
                return Ok(resp.text().await?);
            }
            Ok(resp) if resp.status() == StatusCode::TOO_MANY_REQUESTS => {
                // Respect Retry-After header if present, otherwise backoff.
                let retry_after = resp
                    .headers()
                    .get("Retry-After")
                    .and_then(|v| v.to_str().ok())
                    .and_then(|s| s.parse::<u64>().ok())
                    .unwrap_or(0);
                let delay = if retry_after > 0 {
                    Duration::from_secs(retry_after)
                } else {
                    backoff_delay(attempt)
                };
                tracing::warn!(attempt, url, ?delay, "rate limited — backing off");
                if attempt >= max_attempts { anyhow::bail!("rate limit exhausted"); }
                sleep(delay).await;
            }
            Ok(resp) if resp.status().is_server_error() => {
                tracing::warn!(attempt, url, status = %resp.status(), "server error");
                if attempt >= max_attempts {
                    anyhow::bail!("server error after {max_attempts} attempts");
                }
                sleep(backoff_delay(attempt)).await;
            }
            Ok(resp) => {
                // 4xx client errors (except 429) are not retryable.
                anyhow::bail!("request failed: HTTP {}", resp.status());
            }
            Err(e) if e.is_connect() || e.is_timeout() => {
                tracing::warn!(attempt, url, "network error: {e}");
                if attempt >= max_attempts { return Err(e.into()); }
                sleep(backoff_delay(attempt)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
}

fn backoff_delay(attempt: u32) -> Duration {
    // Exponential backoff: 1s, 2s, 4s, 8s, 16s, capped at 30s.
    // Add jitter to avoid thundering herd.
    use std::time::SystemTime;
    let base = Duration::from_secs(
        (1u64 << attempt.saturating_sub(1).min(5)).min(30),
    );
    // Cheap jitter source: the sub-second component of the wall clock
    // (subsec_millis is already in 0..1000).
    let jitter_ms = SystemTime::now()
        .duration_since(SystemTime::UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_millis();
    base + Duration::from_millis(jitter_ms as u64)
}
}

Retry strategy by status code:

  • 5xx (server error): Retry with backoff — transient server issues.
  • 429 (too many requests): Retry with backoff, respect Retry-After header.
  • 408 (request timeout) or connection/timeout errors: Retry with backoff.
  • 4xx (client errors) except 429: Do not retry — the request itself is malformed.
  • Success: Return immediately.

Configuring Timeouts Correctly

A single .timeout(Duration) sets the overall request timeout (connection + sending + receiving). For fine-grained control:

#![allow(unused)]
fn main() {
use reqwest::Client;
use std::time::Duration;

fn build_production_client() -> anyhow::Result<Client> {
    Ok(Client::builder()
        // TCP connection timeout — fail fast if service is unreachable.
        .connect_timeout(Duration::from_secs(3))
        // Total time budget for the entire request (all phases).
        .timeout(Duration::from_secs(15))
        // How long an idle connection can sit in the pool before being closed.
        .pool_idle_timeout(Duration::from_secs(60))
        .build()?)
}
}

For the Meridian TLE catalog API — a slow external service that can take up to 10 seconds to respond during load — set the timeout to 12–15 seconds. For the internal mission ops REST endpoint on the same datacenter network, 3–5 seconds is appropriate. Do not use the same Client configuration for both if the timeout requirements differ significantly — build two clients.
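One way to structure that — a sketch using the timeout values suggested above:

#![allow(unused)]
fn main() {
use reqwest::Client;
use std::time::Duration;

// Separate Client profiles: one for the slow external catalog, one for the
// fast internal API. Each keeps its own connection pool and timeout budget.
struct HttpClients {
    space_track: Client, // slow external service
    mission_ops: Client, // fast internal service
}

fn build_clients() -> anyhow::Result<HttpClients> {
    Ok(HttpClients {
        space_track: Client::builder()
            .connect_timeout(Duration::from_secs(5))
            .timeout(Duration::from_secs(15))
            .build()?,
        mission_ops: Client::builder()
            .connect_timeout(Duration::from_secs(3))
            .timeout(Duration::from_secs(5))
            .build()?,
    })
}
}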


Code Examples

TLE Catalog HTTP Client for the Control Plane

The control plane fetches TLE updates from Space-Track on a 10-minute schedule. It also exposes a REST endpoint for on-demand TLE queries. This example shows both directions: fetching and posting, with retry logic and a shared client.

use anyhow::{Context, Result};
use reqwest::{Client, StatusCode};
use serde::{Deserialize, Serialize};
use std::time::Duration;
use tokio::time::sleep;

#[derive(Debug, Deserialize, Clone)]
pub struct TleRecord {
    pub norad_id: u32,
    pub name: String,
    pub line1: String,
    pub line2: String,
    pub epoch: String,
}

#[derive(Debug, Serialize)]
pub struct ConjunctionReport {
    pub object_a_id: u32,
    pub object_b_id: u32,
    pub tca_unix: f64,
    pub miss_distance_km: f64,
    pub probability: f64,
}

pub struct MissionApiClient {
    client: Client,
    base_url: String,
    api_key: String,
}

impl MissionApiClient {
    pub fn new(base_url: String, api_key: String) -> Result<Self> {
        let client = Client::builder()
            .connect_timeout(Duration::from_secs(5))
            .timeout(Duration::from_secs(20))
            .pool_max_idle_per_host(4)
            .user_agent("meridian-control-plane/1.0")
            .build()
            .context("failed to build HTTP client")?;
        Ok(Self { client, base_url, api_key })
    }

    /// Fetch a single TLE record with up to 3 retry attempts.
    pub async fn get_tle(&self, norad_id: u32) -> Result<TleRecord> {
        let url = format!("{}/tle/{norad_id}", self.base_url);
        let mut attempt = 0u32;
        loop {
            attempt += 1;
            let response = self.client
                .get(&url)
                .header("X-API-Key", &self.api_key)
                .send()
                .await;

            match response {
                Ok(resp) if resp.status().is_success() => {
                    return resp.json::<TleRecord>().await
                        .context("failed to parse TLE response");
                }
                Ok(resp) if resp.status().is_server_error() && attempt < 3 => {
                    tracing::warn!(norad_id, attempt, status = %resp.status(), "retrying");
                    sleep(Duration::from_secs(1 << attempt)).await;
                }
                Ok(resp) => {
                    anyhow::bail!("TLE fetch failed: HTTP {}", resp.status());
                }
                Err(e) if (e.is_connect() || e.is_timeout()) && attempt < 3 => {
                    tracing::warn!(norad_id, attempt, "network error: {e}, retrying");
                    sleep(Duration::from_secs(1 << attempt)).await;
                }
                Err(e) => return Err(e).context("TLE fetch network error"),
            }
        }
    }

    /// Post a conjunction report to the mission operations endpoint.
    pub async fn post_conjunction(&self, report: &ConjunctionReport) -> Result<()> {
        self.client
            .post(format!("{}/conjunctions", self.base_url))
            .header("X-API-Key", &self.api_key)
            .json(report)
            .send()
            .await
            .context("failed to send conjunction report")?
            .error_for_status()
            .context("conjunction report rejected")?;
        Ok(())
    }

    /// Fetch all active TLEs in a specified altitude band (batch request).
    pub async fn get_tle_batch(&self, min_km: u32, max_km: u32) -> Result<Vec<TleRecord>> {
        self.client
            .get(format!("{}/tle/batch", self.base_url))
            .query(&[("min_alt_km", min_km), ("max_alt_km", max_km)])
            .header("X-API-Key", &self.api_key)
            .send()
            .await?
            .error_for_status()?
            .json::<Vec<TleRecord>>()
            .await
            .context("failed to parse TLE batch response")
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let api = MissionApiClient::new(
        "https://api.meridian.internal".to_string(),
        "mission-control-key".to_string(),
    )?;

    // Periodic TLE refresh loop.
    let api_ref = std::sync::Arc::new(api);
    let refresh_api = std::sync::Arc::clone(&api_ref);

    tokio::spawn(async move {
        loop {
            match refresh_api.get_tle(25544).await {
                Ok(tle) => tracing::info!(name = %tle.name, "TLE refreshed"),
                Err(e) => tracing::error!("TLE refresh failed: {e}"),
            }
            sleep(Duration::from_secs(600)).await;
        }
    });

    // Post a conjunction report.
    api_ref.post_conjunction(&ConjunctionReport {
        object_a_id: 25544,
        object_b_id: 48274,
        tca_unix: 1_735_000_000.0,
        miss_distance_km: 0.8,
        probability: 0.003,
    }).await?;

    sleep(Duration::from_secs(1)).await;
    Ok(())
}

The MissionApiClient wraps the reqwest::Client and encodes the API contract — base URL, auth header, response types — in one place. Callers interact with typed methods rather than raw HTTP primitives. The Arc::new(api) pattern is appropriate here because Client is already internally reference-counted; wrapping in Arc just lets the MissionApiClient struct itself be shared. A simpler option is to pass &MissionApiClient to async functions directly, since MissionApiClient is Send + Sync.


Key Takeaways

  • Create one Client per configuration profile and share it across tasks via Clone. Each new Client is a new connection pool — creating one per request wastes connection setup overhead and defeats pooling.

  • Always call error_for_status() after .send().await unless you explicitly want to handle 4xx/5xx response bodies. HTTP error responses do not return Err from send().

  • Use .json(&value) for serializing request bodies with serde. Use .json::<T>() on the response for deserialization. Both require the json feature (enabled by default).

  • Distinguish retryable errors (5xx, 429, connection/timeout errors) from non-retryable ones (4xx client errors). Apply exponential backoff with jitter for retryable failures. Respect Retry-After headers on 429 responses.

  • Set connect_timeout separately from the overall .timeout. A short connect timeout (3–5s) fails fast on unreachable services without waiting for the full request timeout budget.

  • For different external services with different latency profiles and rate limits, use separate Client instances with separate configurations rather than sharing one client across everything.


Project — Ground Station Network Client

Module: Foundation — M04: Network Programming
Prerequisite: All three module quizzes passed (≥70%)



Mission Brief

TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0051 — Ground Station Network Client Implementation


The Meridian control plane currently uses a Python subprocess to manage ground station TCP connections. It provides no reconnection logic, no session health monitoring, and no integration with the TLE catalog API for per-session orbital data refresh. Under antenna tracking interruptions, sessions drop and are never re-established. Under Space-Track API rate limiting, TLE data becomes stale without any backoff or retry.

Your task is to build the ground station network client — the component that owns the full lifecycle of a ground station TCP session: connect, read frames, reconnect on failure, refresh TLE data via HTTP, and shut down cleanly.


System Specification

Connection Management

The client connects to a ground station TCP endpoint (host:port). The length-prefix framing protocol from Lesson 1 applies: 4-byte big-endian u32 length header followed by length bytes of payload.

On connection loss (EOF, read error, timeout), the client reconnects automatically with exponential backoff: 1s, 2s, 4s, 8s, up to 30s maximum. If reconnection fails for more than 5 minutes total, the client marks the station as Failed and stops retrying.

Session Lifecycle

Connecting → Connected → Receiving frames → [disconnect] → Reconnecting → Connected → ...
                                          → [shutdown signal] → Disconnecting → Stopped
                                          → [5 min failure] → Failed

The current session state is tracked as an enum and exposed via a watch channel so monitoring systems can observe it.

TLE Refresh

Each active session periodically fetches the TLE record for the session's assigned satellite from the mission API (GET /tle/{norad_id}). The refresh interval is configurable (default: 10 minutes). The HTTP client uses a connect_timeout of 3s and overall timeout of 15s. On 5xx or network errors, the refresh is retried with exponential backoff (up to 3 attempts). On 429, the backoff respects a Retry-After header if present.

Frame Forwarding

Successfully received frames are forwarded to a tokio::sync::mpsc::Sender<Frame>. The frame includes the station ID, the session's current TLE record (if available), and the raw payload. If the downstream channel is full, the frame is dropped and a warning is logged.

Shutdown

A watch::Receiver<bool> shutdown signal is accepted. On signal: complete the current frame read (do not abort mid-frame), flush any buffered writes (send a final GOODBYE frame to the peer), close the TCP connection cleanly, and exit.


Expected Output

A library crate (meridian-gs-client) with:

  • A GroundStationClient struct with run() method
  • A SessionState enum and watch channel for state observation
  • A Frame struct forwarded to the downstream channel
  • A test binary that: connects to a local echo server (you implement a minimal echo server in the test), receives 5 frames, triggers reconnect by having the echo server drop the connection, verifies reconnection, then triggers shutdown

The test binary output should clearly show: initial connection, frame receipt, connection drop, reconnection, and clean shutdown.


Acceptance Criteria

  1. Client reconnects automatically on connection loss with exponential backoff.
     Verify: drop the server connection and confirm reconnection in the logs.

  2. Reconnection backoff is bounded at 30 seconds.
     Verify: check the timing between reconnect attempts under sustained failure.

  3. Client marks the station as Failed after 5 minutes of failed reconnections.
     Verify: simulate sustained connection refusal and confirm the state transition.

  4. TLE refresh runs on the configured interval and retries on 5xx/network errors.
     Verify: mock server returning 503 then 200.

  5. Frame forwarding uses try_send — channel-full does not block the receive loop.
     Verify: code review and a test with a slow downstream consumer.

  6. Shutdown completes the current frame before exiting.
     Verify: send a large frame and trigger shutdown mid-send; the frame arrives complete.

  7. Session state transitions are correctly published to the watch channel.
     Verify: an observer task sees all transitions in order.

Hints

Hint 1 — Session state machine
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq)]
pub enum SessionState {
    Connecting { attempt: u32 },
    Connected { since: std::time::Instant },
    Reconnecting { attempt: u32, next_retry: std::time::Instant },
    Disconnecting,
    Failed { reason: String },
    Stopped,
}
}

Publish state changes via watch::Sender<SessionState>. Observers call borrow() to read the current state or changed().await to wait for the next transition.
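For example, a minimal observer task — a sketch assuming the SessionState enum from Hint 1 is in scope:

#![allow(unused)]
fn main() {
use tokio::sync::watch;

// Log the initial state, then every transition, until the client side
// drops the watch sender.
async fn observe_state(mut rx: watch::Receiver<SessionState>) {
    let initial = rx.borrow().clone();
    tracing::info!(?initial, "session state");
    while rx.changed().await.is_ok() {
        let state = rx.borrow().clone();
        tracing::info!(?state, "session state changed");
    }
}
}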

Hint 2 — Reconnect loop structure
#![allow(unused)]
fn main() {
async fn run_with_reconnect(
    config: &ClientConfig,
    tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
    state_tx: watch::Sender<SessionState>,
) {
    let mut attempt = 0u32;
    let mut start = std::time::Instant::now();

    loop {
        if *shutdown.borrow() { break; }
        if start.elapsed() > std::time::Duration::from_secs(300) {
            let _ = state_tx.send(SessionState::Failed {
                reason: "reconnection window exceeded".into(),
            });
            break;
        }

        let _ = state_tx.send(SessionState::Connecting { attempt });
        match tokio::net::TcpStream::connect(&config.addr).await {
            Ok(stream) => {
                attempt = 0; // Reset on successful connection.
                let _ = state_tx.send(SessionState::Connected {
                    since: std::time::Instant::now(),
                });
                // Run the session until it disconnects or shutdown.
                run_session(stream, config, &tx, &mut shutdown, &state_tx).await;
                // Reset the 5-minute failure window once a live session ends.
                start = std::time::Instant::now();
                if *shutdown.borrow() { break; }
            }
            Err(e) => {
                tracing::warn!("connection failed (attempt {attempt}): {e}");
            }
        }

        attempt += 1;
        let delay = std::time::Duration::from_secs((1u64 << attempt.saturating_sub(1).min(5)).min(30));
        let _ = state_tx.send(SessionState::Reconnecting {
            attempt,
            next_retry: std::time::Instant::now() + delay,
        });
        tokio::time::sleep(delay).await;
    }
}
}
Hint 3 — TLE refresh as a background task per session

Spawn a TLE refresh task when the session connects. Abort it when the session disconnects. Use a watch::Sender<Option<TleRecord>> to share the current TLE with the frame handler:

#![allow(unused)]
fn main() {
async fn run_tle_refresh(
    http: reqwest::Client,
    norad_id: u32,
    interval: std::time::Duration,
    tle_tx: tokio::sync::watch::Sender<Option<TleRecord>>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            _ = tokio::time::sleep(interval) => {
                match fetch_tle_with_retry(&http, norad_id, 3).await {
                    Ok(tle) => { let _ = tle_tx.send(Some(tle)); }
                    Err(e) => tracing::warn!(norad_id, "TLE refresh failed: {e}"),
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
        }
    }
}
}
Hint 4 — Sending a GOODBYE frame on clean shutdown
#![allow(unused)]
fn main() {
use tokio::io::AsyncWriteExt;

async fn send_goodbye(stream: &mut tokio::net::TcpStream) {
    const GOODBYE: &[u8] = b"GOODBYE";
    let len = (GOODBYE.len() as u32).to_be_bytes();
    // Best-effort — ignore errors (we're shutting down anyway).
    let _ = stream.write_all(&len).await;
    let _ = stream.write_all(GOODBYE).await;
    let _ = stream.flush().await;
    let _ = stream.shutdown().await;
}
}

Reference Implementation

Reveal reference implementation
#![allow(unused)]
fn main() {
use anyhow::Result;
use reqwest::Client as HttpClient;
use serde::Deserialize;
use std::time::{Duration, Instant};
use tokio::{
    io::{AsyncReadExt, AsyncWriteExt},
    net::TcpStream,
    sync::{mpsc, watch},
    time::sleep,
};
use tracing::{info, warn};

#[derive(Debug, Clone, Deserialize)]
pub struct TleRecord {
    pub norad_id: u32,
    pub name: String,
    pub line1: String,
    pub line2: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SessionState {
    Connecting { attempt: u32 },
    Connected,
    Reconnecting { attempt: u32 },
    Failed { reason: String },
    Stopped,
}

#[derive(Debug)]
pub struct Frame {
    pub station_id: String,
    pub tle: Option<TleRecord>,
    pub payload: Vec<u8>,
}

pub struct ClientConfig {
    pub station_id: String,
    pub addr: String,
    pub norad_id: u32,
    pub api_base_url: String,
    pub tle_refresh_interval: Duration,
}

async fn read_frame(stream: &mut TcpStream) -> Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match stream.read_exact(&mut len_buf).await {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e.into()),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 { anyhow::bail!("frame too large: {len}"); }
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf).await?;
    Ok(Some(buf))
}

async fn fetch_tle(http: &HttpClient, base_url: &str, norad_id: u32) -> Result<TleRecord> {
    let url = format!("{base_url}/tle/{norad_id}");
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        match http.get(&url).send().await {
            Ok(r) if r.status().is_success() => {
                return Ok(r.json::<TleRecord>().await?);
            }
            Ok(r) if r.status().is_server_error() && attempt < 3 => {
                warn!(norad_id, attempt, "TLE fetch server error, retrying");
                sleep(Duration::from_secs(1 << attempt)).await;
            }
            Ok(r) => anyhow::bail!("TLE fetch: HTTP {}", r.status()),
            Err(e) if (e.is_connect() || e.is_timeout()) && attempt < 3 => {
                warn!(norad_id, attempt, "TLE fetch network error, retrying");
                sleep(Duration::from_secs(1 << attempt)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
}

async fn run_session(
    mut stream: TcpStream,
    config: &ClientConfig,
    http: &HttpClient,
    frame_tx: &mpsc::Sender<Frame>,
    tle_tx: &watch::Sender<Option<TleRecord>>,
    mut shutdown: watch::Receiver<bool>,
) {
    // Kick off TLE refresh task for this session.
    let (session_shutdown_tx, session_shutdown_rx) = watch::channel(false);
    let tle_refresh = {
        let http = http.clone();
        let base = config.api_base_url.clone();
        let norad_id = config.norad_id;
        let interval = config.tle_refresh_interval;
        let tle_tx = tle_tx.clone();
        tokio::spawn(async move {
            let mut sd = session_shutdown_rx;
            loop {
                tokio::select! {
                    _ = sleep(interval) => {
                        match fetch_tle(&http, &base, norad_id).await {
                            Ok(tle) => { let _ = tle_tx.send(Some(tle)); }
                            Err(e) => warn!("TLE refresh failed: {e}"),
                        }
                    }
                    Ok(()) = sd.changed() => { if *sd.borrow() { break; } }
                }
            }
        })
    };

    loop {
        tokio::select! {
            biased;
            frame = tokio::time::timeout(Duration::from_secs(60), read_frame(&mut stream)) => {
                match frame {
                    Err(_) => { warn!(station = %config.station_id, "session timeout"); break; }
                    Ok(Ok(Some(payload))) => {
                        let tle = tle_tx.borrow().clone();
                        let f = Frame {
                            station_id: config.station_id.clone(),
                            tle,
                            payload,
                        };
                        if frame_tx.try_send(f).is_err() {
                            warn!(station = %config.station_id, "frame dropped: pipeline full");
                        }
                    }
                    Ok(Ok(None)) => { info!(station = %config.station_id, "peer closed connection"); break; }
                    Ok(Err(e)) => { warn!(station = %config.station_id, "read error: {e}"); break; }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    info!(station = %config.station_id, "shutdown — sending GOODBYE");
                    let _ = session_shutdown_tx.send(true);
                    let payload = b"GOODBYE";
                    let len = (payload.len() as u32).to_be_bytes();
                    let _ = stream.write_all(&len).await;
                    let _ = stream.write_all(payload).await;
                    let _ = stream.flush().await;
                    let _ = stream.shutdown().await;
                    break;
                }
            }
        }
    }

    let _ = session_shutdown_tx.send(true);
    let _ = tle_refresh.await;
}

pub async fn run_client(
    config: ClientConfig,
    frame_tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
    state_tx: watch::Sender<SessionState>,
) {
    let http = HttpClient::builder()
        .connect_timeout(Duration::from_secs(3))
        .timeout(Duration::from_secs(15))
        .build()
        .expect("failed to build HTTP client");

    let (tle_tx, _) = watch::channel::<Option<TleRecord>>(None);
    let mut attempt = 0u32;
    let mut start = Instant::now();

    loop {
        if *shutdown.borrow() { break; }

        if start.elapsed() > Duration::from_secs(300) {
            let _ = state_tx.send(SessionState::Failed {
                reason: "5-minute reconnect window exceeded".into(),
            });
            return;
        }

        let _ = state_tx.send(SessionState::Connecting { attempt });
        match TcpStream::connect(&config.addr).await {
            Ok(stream) => {
                attempt = 0;
                let _ = state_tx.send(SessionState::Connected);
                info!(station = %config.station_id, "connected to {}", config.addr);
                run_session(stream, &config, &http, &frame_tx, &tle_tx, shutdown.clone()).await;
                // Reset the 5-minute failure window once a live session ends.
                start = Instant::now();
                if *shutdown.borrow() { break; }
                info!(station = %config.station_id, "session ended, will reconnect");
            }
            Err(e) => {
                warn!(station = %config.station_id, attempt, "connection failed: {e}");
            }
        }

        attempt += 1;
        let delay = Duration::from_secs((1u64 << attempt.saturating_sub(1).min(5)).min(30));
        let _ = state_tx.send(SessionState::Reconnecting { attempt });
        info!(station = %config.station_id, "reconnecting in {delay:?}");
        sleep(delay).await;
    }

    let _ = state_tx.send(SessionState::Stopped);
    info!(station = %config.station_id, "client stopped");
}
}

Reflection

The ground station client built here is the connection layer that sits between the raw TCP socket and the telemetry aggregator from Module 3. It combines all three lessons of this module directly: TcpListener/TcpStream from Lesson 1 carry the framed session protocol, reqwest from Lesson 3 drives the TLE refresh background task within the session, and UDP from Lesson 2 could be added for out-of-band sensor feeds from the same station.

The reconnection loop pattern — state machine published to a watch channel, exponential backoff, failure timeout — is universal. It applies equally to database connections, message broker connections, and any other persistent network resource that needs supervisory recovery behaviour.

Module 05 — Data-Oriented Design in Rust

Track: Foundation — Mission Control Platform
Position: Module 5 of 6
Source material: Rust for Rustaceans — Jon Gjengset, Chapters 2, 9
Quiz pass threshold: 70% on all three lessons to unlock the project



Mission Context

The Meridian telemetry processor runs at 62,000 frames per second. The conjunction avoidance pipeline requires 100,000. The gap is not a missing algorithm or a suboptimal data structure — it is allocator pressure and cache waste, both caused by data layout decisions made when defining types. Each frame allocates a Vec<u8> on the global heap. Each deduplication pass loads 2.4× more data than the deduplication logic uses.

Data-oriented design is a discipline for making data layout decisions that align with CPU hardware realities: cache lines are 64 bytes, cache misses cost 100–300 cycles, and SIMD instructions operate on contiguous uniform-type data. The three techniques in this module — cache-optimal struct layout, SoA field separation, and arena allocation — directly address the two profiling findings above.


What You Will Learn

By the end of this module you will be able to:

  • Explain how field alignment and padding inflate struct sizes, use repr attributes to control layout, and write const assertions to lock in size expectations at compile time
  • Identify false sharing between concurrent tasks, apply repr(align(64)) with padding to isolate per-thread data to separate cache lines, and separate hot fields from cold fields in structs used in high-volume collections
  • Explain when SoA layout outperforms AoS (field-subset sequential operations) and when AoS outperforms SoA (per-entity random access), implement an OrbitalCatalog using field grouping, and transition from AoS to SoA incrementally without a full rewrite
  • Implement a bump/arena allocator for same-lifetime batch allocations, contrast its allocation cost with the global allocator, use thread-local arenas for zero-contention concurrent allocation, and identify when arena allocation is inappropriate (mixed lifetimes, individual deallocation)

Lessons

Lesson 1 — Cache-Friendly Data Layouts: Struct Layout, Padding, and Cache Line Alignment

Covers alignment and padding mechanics, repr(C) vs repr(Rust) vs repr(packed) vs repr(align(n)), false sharing between concurrent tasks, repr(align(64)) for per-thread counter isolation, and hot/cold field separation. Grounded in Rust for Rustaceans, Chapter 2.

Key question this lesson answers: How does field order affect struct size, what causes false sharing between concurrent tasks, and how do you isolate hot data from cold data?

lesson-01-cache-friendly-layouts.md / lesson-01-quiz.toml


Lesson 2 — Struct-of-Arrays vs Array-of-Structs: When Each Wins

Covers the AoS and SoA layout patterns, the cache utilisation argument for each, the conditions that favour SoA (field-subset sequential scans, SIMD, large N), the conditions that favour AoS (per-entity random access, all-field operations), the hybrid field-grouping pattern, and incremental AoS-to-SoA transition via a companion index.

Key question this lesson answers: When does splitting fields into separate vectors improve performance, and when does it hurt?

lesson-02-soa-vs-aos.md / lesson-02-quiz.toml


Lesson 3 — Arena Allocation: Bump Allocators for High-Throughput Telemetry Processing

Covers the global allocator's cost for high-frequency short-lived allocations, the bump allocator pattern (O(1) alloc, O(1) epoch free), the lifetime constraint, thread-local arenas for zero-contention concurrent allocation, the bumpalo crate interface, and the workloads where arena allocation is inappropriate.

Key question this lesson answers: When is the global allocator the bottleneck, and how does a bump allocator eliminate that overhead for same-lifetime batch objects?

lesson-03-arena-allocation.md / lesson-03-quiz.toml


Capstone Project — High-Throughput Telemetry Packet Processor

Rebuild the Meridian telemetry processor core to achieve ≥100,000 frames/sec using all three techniques: a 24-byte FrameHeader with const size assertion, SoA separation of headers from arena-allocated payloads, bump arena for batch payload allocation with O(1) epoch reset, and SoA-based deduplication operating only on the hot header array.

Acceptance is against 7 verifiable criteria including compile-time size assertions, no per-frame heap allocations, correct arena reset, and measured throughput.

project-telemetry-processor.md


Prerequisites

Modules 1–4 must be complete. Module 2 (Concurrency Primitives) introduced atomic operations and the false sharing problem — Lesson 1 of this module extends that with the repr(align(64)) solution. Module 5's content stands alone otherwise; it does not build on the networking or message-passing material from Modules 3–4.

What Comes Next

Module 6 — Performance and Profiling gives you the measurement tools to validate the optimisations introduced here: criterion for reliable microbenchmarks, flamegraph and perf for identifying hot paths, and heap profiling for measuring allocator pressure. You will profile the processor built in this module's project and verify the improvement against the baseline.

Lesson 1 — Cache-Friendly Data Layouts: Struct Layout, Padding, and Cache Line Alignment

Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 1 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 2



Context

The Meridian telemetry processor receives 100,000 frames per second at peak load across 48 uplinks. Each frame passes through validation, deduplication, and routing — operations that read specific fields from a TelemetryFrame struct on every iteration. At that throughput, the cost of a CPU cache miss — roughly 100–300 clock cycles to fetch from RAM, compared to 4 cycles for an L1 cache hit — is the difference between keeping up and falling behind.

CPU cache performance is not something you can bolt on after profiling shows a bottleneck. It is determined by the decisions you make when you define your data types. How fields are ordered. How large structs are. Whether hot fields and cold fields share a cache line. These decisions are locked in by the struct definition, and changing them later requires touching every callsite that constructs or accesses the type.

This lesson covers the mechanics that determine how Rust lays out your types in memory, the repr attributes that control those mechanics, and how to make decisions that keep hot data in cache.

Source: Rust for Rustaceans, Chapter 2 (Gjengset)


Core Concepts

Alignment and Padding

Every type has an alignment requirement — the CPU needs its address to be a multiple of some power of two. A u8 needs 1-byte alignment. A u32 needs 4-byte alignment. A u64 needs 8-byte alignment.
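You can confirm these requirements directly; the values below assume a typical 64-bit target:

use std::mem::align_of;

fn main() {
    println!("u8:  {}", align_of::<u8>());  // 1
    println!("u16: {}", align_of::<u16>()); // 2
    println!("u32: {}", align_of::<u32>()); // 4
    println!("u64: {}", align_of::<u64>()); // 8 on x86-64
    println!("f64: {}", align_of::<f64>()); // 8 on x86-64
}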

When you put fields of different alignments in a struct, the compiler inserts padding bytes between fields to satisfy alignment requirements (Rust for Rustaceans, Ch. 2). Consider this struct with #[repr(C)] (which preserves field order):

#![allow(unused)]
fn main() {
#[repr(C)]
struct BadLayout {
    tiny: bool,    // 1 byte
    // 3 bytes padding — to align `normal` to 4 bytes
    normal: u32,   // 4 bytes
    small: u8,     // 1 byte
    // 7 bytes padding — to align `long` to 8 bytes
    long: u64,     // 8 bytes
    short: u16,    // 2 bytes
    // 6 bytes padding — to make total size a multiple of alignment (8)
}
// Total: 32 bytes. Actual data: 16 bytes. Wasted: 16 bytes — half the struct is padding.
}

With #[repr(Rust)] (the default), the compiler is free to reorder fields — in practice, by decreasing alignment — eliminating most padding:

#![allow(unused)]
fn main() {
// Default Rust layout — compiler reorders fields for minimal padding.
// Effective order: long (8), normal (4), short (2), tiny (1), small (1)
struct GoodLayout {
    tiny: bool,
    normal: u32,
    small: u8,
    long: u64,
    short: u16,
}
// Total: 16 bytes. Same fields, no wasted padding.
}

The difference at scale: a Vec<BadLayout> of 1 million elements occupies 32 MB. A Vec<GoodLayout> with the same data occupies 16 MB — fitting twice as many elements in the same cache footprint and roughly halving cache misses for sequential access.

You can verify sizes at compile time with std::mem::size_of:

#[repr(C)]
struct BadLayout { tiny: bool, normal: u32, small: u8, long: u64, short: u16 }
struct GoodLayout { tiny: bool, normal: u32, small: u8, long: u64, short: u16 }

fn main() {
    // Confirm the size difference at compile time.
    const _: () = assert!(std::mem::size_of::<BadLayout>() == 32);
    const _: () = assert!(std::mem::size_of::<GoodLayout>() == 16);
    println!("BadLayout: {} bytes", std::mem::size_of::<BadLayout>());
    println!("GoodLayout: {} bytes", std::mem::size_of::<GoodLayout>());
}

Use const assertions as compile-time guards on struct sizes for types that appear in high-volume collections. When a future change accidentally adds padding, the assertion fails at compile time rather than silently degrading cache performance.

repr Attributes

repr(Rust) — the default. The compiler may reorder fields for minimal padding and does not guarantee a specific layout. This is optimal for Rust-only code but incompatible with C interop.

repr(C) — fields laid out in declaration order, C-compatible. Required when passing structs across FFI boundaries. At the cost of potentially more padding if fields are not ordered by descending alignment.

repr(packed) — removes all padding. Fields may be misaligned, which can be much slower on x86 (unaligned loads trigger microcode assists) and cause bus errors on architectures that require alignment. Use only when minimizing memory footprint is more important than access speed — for example, serialized wire formats, or extremely memory-constrained environments.
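A minimal sketch of the trade-off, using a hypothetical wire-format header:

#![allow(unused)]
fn main() {
// Hypothetical wire-format header: packed to match the serialized layout.
#[repr(C, packed)]
struct WireHeader {
    flag: u8,    // offset 0
    value: u32,  // offset 1 — misaligned
}

// No padding: 5 bytes instead of 8.
const _: () = assert!(std::mem::size_of::<WireHeader>() == 5);

fn read_value(h: &WireHeader) -> u32 {
    // Read by copy — the compiler emits an unaligned load. Taking a plain
    // reference to the misaligned field (&h.value) is a compile error.
    h.value
}
}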

repr(align(n)) — forces the struct to have at least n byte alignment. The most common use in systems programming is cache line alignment for concurrent data structures:

#![allow(unused)]
fn main() {
use std::sync::atomic::AtomicU64;

// Each counter occupies a full 64-byte cache line.
// Without this: two counters from different threads share a cache line,
// causing false sharing — each write by one thread invalidates the
// other thread's cache entry even though they touch different data.
#[repr(align(64))]
struct CacheAlignedCounter {
    value: AtomicU64,
    _pad: [u8; 56], // Explicit padding to fill the 64-byte cache line.
}
}

Cache Lines and False Sharing

A CPU cache line is 64 bytes on x86-64. The CPU fetches and evicts cache lines as atomic units — not individual bytes or words. When two logical pieces of data share a cache line, any write to either one invalidates the entire line in every other core's cache.

False sharing occurs when two threads write to different variables that happen to occupy the same cache line (Rust for Rustaceans, Ch. 2). Each write by either thread causes the cache line to bounce between cores — effectively serializing what should be independent writes:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// BAD: both counters fit in one 16-byte struct, sharing a cache line.
// Thread A's writes to `a` invalidate thread B's cached copy of the line,
// which also contains `b`. Both threads contend on the same cache line.
struct SharedCounters {
    a: AtomicU64,
    b: AtomicU64,
}

// GOOD: each counter on its own cache line.
#[repr(align(64))]
struct IsolatedCounter {
    value: AtomicU64,
}

fn demonstrate_false_sharing() {
    // With SharedCounters: threads A and B writing independently
    // still cause cache coherence traffic because they share a line.

    // With two IsolatedCounter instances: threads write truly independently.
    let counter_a = IsolatedCounter { value: AtomicU64::new(0) };
    let counter_b = IsolatedCounter { value: AtomicU64::new(0) };

    // counter_a and counter_b now occupy separate cache lines.
    // Writes by one thread do not invalidate the other's cache entry.
    counter_a.value.fetch_add(1, Ordering::Relaxed);
    counter_b.value.fetch_add(1, Ordering::Relaxed);
}
}

Hot Field / Cold Field Separation

Not all fields in a struct are accessed with equal frequency. For a TelemetryFrame, the routing fields (satellite_id, sequence) are read on every frame. The full payload is only read when forwarding downstream. Putting hot and cold data in the same struct means every cache miss for a hot field also loads the cold payload into cache — evicting other useful data.

The pattern: split the struct. Keep a hot "header" struct with frequently accessed fields, and access the cold data via an Arc<Vec<u8>> or a separate index:

#![allow(unused)]
fn main() {
use std::sync::Arc;

// Hot: accessed on every frame for routing decisions.
// 24 bytes — fits comfortably in cache alongside many sibling headers.
struct FrameHeader {
    satellite_id: u32,   // 4 bytes
    sequence: u64,       // 8 bytes
    timestamp_ms: u64,   // 8 bytes
    flags: u8,           // 1 byte
    _pad: [u8; 3],       // 3 bytes padding (explicit, documented)
}

// Cold: accessed only when forwarding to downstream consumers.
// Heap-allocated; not loaded until needed.
struct FrameBody {
    header: FrameHeader,
    payload: Arc<Vec<u8>>,  // Heap allocation keeps cold data out of hot path.
}
}

A Vec<FrameHeader> for routing decisions keeps 24-byte hot entries packed. 64 bytes (one cache line) holds 2 full headers plus change — much better than loading 24 + payload.len() bytes per frame just to check a sequence number.


Code Examples

Verifying Layout Decisions at Compile Time

Use constant assertions to lock in size expectations for hot types. This catches accidental regressions — adding a field that introduces padding shows up as a compile error immediately.

use std::mem::{size_of, align_of};

/// A telemetry frame header optimized for sequential scanning.
/// Fields ordered by alignment (descending) to minimize padding.
#[derive(Debug, Clone, Copy)]
pub struct TelemetryHeader {
    pub timestamp_ms: u64,      // 8 bytes — largest alignment first
    pub sequence: u64,          // 8 bytes
    pub satellite_id: u32,      // 4 bytes
    pub byte_count: u32,        // 4 bytes
    pub flags: u8,              // 1 byte
    pub station_id: u8,         // 1 byte
    pub _reserved: [u8; 6],     // 6 bytes explicit pad — documented intent
}

// Lock in the expected size at compile time.
// If a future change causes unexpected padding, this fails to compile.
const _SIZE_CHECK: () = assert!(size_of::<TelemetryHeader>() == 32);
const _ALIGN_CHECK: () = assert!(align_of::<TelemetryHeader>() == 8);

/// A per-uplink session counter, cache-line aligned to prevent false sharing.
/// 48 sessions each updating their own counter never contend on a shared line.
#[repr(align(64))]
pub struct SessionCounter {
    pub frames_received: u64,
    pub bytes_received: u64,
    pub frames_dropped: u64,
    _pad: [u8; 40],  // Pad to fill 64-byte cache line: 3×8 + 40 = 64.
}

const _COUNTER_ALIGN: () = assert!(align_of::<SessionCounter>() == 64);
const _COUNTER_SIZE: () = assert!(size_of::<SessionCounter>() == 64);

fn main() {
    println!("TelemetryHeader: {} bytes", size_of::<TelemetryHeader>());
    println!("SessionCounter:  {} bytes (cache-line aligned)", size_of::<SessionCounter>());

    // Verify that an array of counters places each on its own cache line.
    let counters: Vec<SessionCounter> = (0..4)
        .map(|_| SessionCounter {
            frames_received: 0,
            bytes_received: 0,
            frames_dropped: 0,
            _pad: [0; 40],
        })
        .collect();

    // Each counter is 64 bytes and 64-byte aligned — no false sharing.
    for (i, c) in counters.iter().enumerate() {
        let addr = c as *const _ as usize;
        println!("counter[{i}] at 0x{addr:x} (aligned: {})", addr % 64 == 0);
    }
}

Key Takeaways

  • The compiler inserts padding between fields to satisfy alignment requirements. Field order determines how much padding is inserted. Ordering fields by decreasing size (largest alignment first) minimizes padding with default repr(Rust).

  • repr(Rust) (default) allows the compiler to reorder fields — usually optimal. repr(C) preserves field order for FFI compatibility at the potential cost of more padding. repr(packed) removes padding but risks misaligned access penalties.

  • repr(align(n)) forces a minimum alignment. Use it to ensure hot atomic counters occupy separate cache lines when accessed from multiple threads concurrently, preventing false sharing.

  • False sharing occurs when two threads write to different variables that share a 64-byte cache line. The fix is repr(align(64)) with explicit padding to fill the cache line.

  • Separate hot fields (read on every iteration) from cold fields (read rarely). A struct that bundles both forces the CPU to load cold data into cache on every hot access, evicting more useful data. Use a header struct for hot fields and heap-allocated or indexed access for cold data.

  • Use const assertions on size_of and align_of for types in high-volume collections. They turn accidental layout regressions into compile errors rather than silent performance degradation.


Lesson 2 — Struct-of-Arrays vs Array-of-Structs: When Each Wins

Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 2 of 3
Source: Synthesized from training knowledge. Concepts would benefit from verification against Mike Acton's CppCon 2014 talk "Data-Oriented Design and C++" and Chandler Carruth's related talks.



Context

The Meridian conjunction screening pass processes 50,000 orbital elements every 10 minutes. Each screening step reads the altitude and inclination of every object in the catalog. It does not read the object name, the launch date, the operator contact, or any other administrative metadata. Those fields exist in the catalog, but the screening loop does not touch them.

With a conventional struct design — one OrbitalObject struct with all fields — the screening loop loads each full struct into cache on every iteration. The fields it actually uses are 16 bytes; the fields it ignores are perhaps 120 bytes. The ratio: roughly 12% of every cache line brought in is useful data. The rest is wasted memory bandwidth.

This is the core insight behind struct-of-arrays (SoA) layout: if an operation only accesses a subset of fields, store those fields contiguously rather than interleaved with irrelevant fields. Processing altitudes[0..50000] accesses only altitude data; there is no orbital metadata in the working set, no cache line waste.

This lesson covers when SoA beats AoS, when AoS beats SoA, and how to implement the transition idiomatically in Rust.


Core Concepts

Array-of-Structs (AoS): The Default

The conventional layout: a Vec<T> where T is a struct containing all fields for one entity.

#![allow(unused)]
fn main() {
// Array-of-Structs: all fields for one object are contiguous.
#[derive(Debug, Clone)]
struct OrbitalObject {
    norad_id: u32,         // 4 bytes
    altitude_km: f64,      // 8 bytes — used in conjunction screening
    inclination_deg: f64,  // 8 bytes — used in conjunction screening
    raan_deg: f64,         // 8 bytes — used in conjunction screening
    name: [u8; 24],        // 24 bytes — NOT used in conjunction screening
    launch_year: u16,      // 2 bytes — NOT used in conjunction screening
    _pad: [u8; 10],        // 10 bytes explicit padding (4 + 24 + 24 + 2 + 10 = 64)
}
// One OrbitalObject: 64 bytes. One cache line.
// The screening loop uses 28 bytes (norad_id + 3 doubles).
// 36 bytes of every cache line are irrelevant to screening.

fn screen_conjunctions_aos(objects: &[OrbitalObject], threshold_km: f64) -> Vec<u32> {
    let mut alerts = Vec::new();
    for (i, a) in objects.iter().enumerate() {
        for b in &objects[i+1..] {
            // Each iteration loads a full 64-byte OrbitalObject into cache.
            // Only altitude_km and norad_id are actually used.
            let delta = (a.altitude_km - b.altitude_km).abs();
            if delta < threshold_km {
                alerts.push(a.norad_id);
            }
        }
    }
    alerts
}
}

For a 50,000-object catalog, AoS loads 50,000 × 64 bytes = 3.2 MB per pass, even though only 28 bytes per object matter. At a 64-byte cache line, 56% of cache bandwidth is wasted on unused fields.

Struct-of-Arrays (SoA): Fields in Separate Vectors

SoA inverts the layout: instead of one Vec<Object>, maintain separate Vec<field_type> for each field. Objects are indexed by position across all vectors.

#![allow(unused)]
fn main() {
// Struct-of-Arrays: each field is a contiguous array.
// Access patterns that touch only a few fields see only those fields in cache.
struct OrbitalCatalog {
    // "Hot" fields — accessed every screening pass.
    norad_ids:       Vec<u32>,
    altitudes_km:    Vec<f64>,
    inclinations_deg: Vec<f64>,
    raans_deg:       Vec<f64>,
    // "Cold" fields — accessed only for display / export.
    names:           Vec<[u8; 24]>,
    launch_years:    Vec<u16>,
}

impl OrbitalCatalog {
    fn len(&self) -> usize { self.norad_ids.len() }

    fn push(&mut self, id: u32, alt: f64, inc: f64, raan: f64,
            name: [u8; 24], launch: u16) {
        self.norad_ids.push(id);
        self.altitudes_km.push(alt);
        self.inclinations_deg.push(inc);
        self.raans_deg.push(raan);
        self.names.push(name);
        self.launch_years.push(launch);
    }
}

fn screen_conjunctions_soa(catalog: &OrbitalCatalog, threshold_km: f64) -> Vec<u32> {
    let alts = &catalog.altitudes_km;
    let ids  = &catalog.norad_ids;
    let mut alerts = Vec::new();

    for i in 0..catalog.len() {
        for j in i+1..catalog.len() {
            // Only altitudes_km is touched here — 8 bytes per element.
            // 8 f64s fit in one cache line.
            // For 50k objects, working set for altitudes_km = 400KB (fits in L2).
            let delta = (alts[i] - alts[j]).abs();
            if delta < threshold_km {
                alerts.push(ids[i]);
            }
        }
    }
    alerts
}
}

The screening loop now accesses only altitudes_km — 50,000 × 8 bytes = 400 KB, which fits in a typical L2 cache (512KB–2MB). The names, launch years, and RAAN values are never loaded. Cache utilisation is near 100%.

When SoA Wins

SoA is most effective when:

  1. Operations access a small subset of fields. The conjunction screening loop uses 3 of 6 fields. SIMD vectorization of the altitude comparison operates on 4 doubles per instruction with AVX2 (8 with AVX-512).

  2. Processing is sequential over all objects. Iterating altitudes_km[0..50000] is a linear scan — the hardware prefetcher predicts the access pattern and pre-fetches cache lines ahead of the loop.

  3. Field values have uniform types amenable to SIMD. A Vec<f64> can be processed with f64x4 or f64x8 SIMD instructions. An AoS loop cannot be auto-vectorized as efficiently because the fields are interleaved (see the sketch after this list).

  4. Objects are added and removed infrequently. SoA requires synchronized insertion and removal across all vectors. Random insertion in the middle is O(n) for every field vector simultaneously.
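To make point 3 concrete, a minimal sketch: the contiguous, uniformly-typed f64 slice is exactly the shape a release build can auto-vectorise, whereas the equivalent AoS loop strides over interleaved fields.

#![allow(unused)]
fn main() {
// SoA: a tight loop over a contiguous f64 slice — amenable to SIMD compares
// under auto-vectorisation in a typical release build.
fn count_below(altitudes_km: &[f64], threshold_km: f64) -> usize {
    altitudes_km.iter().filter(|&&a| a < threshold_km).count()
}
}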

When AoS Wins

AoS is more appropriate when:

  1. Operations access all or most fields of one object at a time. Constructing a display record or serializing one object reads all fields — SoA forces jumping across multiple vectors.

  2. Access is random by index. AoS keeps all of object 25544's data together — one lookup, typically one cache miss. SoA scatters the same object's fields across one cache line per field vector (see the sketch after this list).

  3. Objects are frequently inserted, removed, or moved. AoS insertion is a single push. SoA insertion is push across all field vectors — more work and more cache lines touched.

  4. The struct has few fields or all fields are typically accessed together. If the struct is small (≤ 32 bytes) and all fields are used in every operation, SoA provides no benefit and complicates the API.
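A minimal sketch of the per-entity access contrast from points 1 and 2:

#![allow(unused)]
fn main() {
struct Obj { alt: f64, inc: f64, name: [u8; 24] }

// AoS: one index lookup lands on the object's fields together —
// typically a single cache line.
fn aos_lookup(objs: &[Obj], i: usize) -> (f64, f64) {
    let o = &objs[i];
    (o.alt, o.inc)
}

// SoA: the same lookup touches one cache line per field vector.
fn soa_lookup(alts: &[f64], incs: &[f64], i: usize) -> (f64, f64) {
    (alts[i], incs[i])
}
}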

Hybrid: AoS with Field Grouping

The practical approach is not a binary AoS vs SoA choice — it is grouping fields by access pattern:

#![allow(unused)]
fn main() {
/// Hot group: fields accessed in every pass of the screening loop.
#[derive(Debug, Clone, Copy)]
struct ObjectHot {
    altitude_km:      f64,
    inclination_deg:  f64,
    raan_deg:         f64,
    eccentricity:     f64,
}

/// Cold group: fields accessed for display, export, and audit only.
#[derive(Debug, Clone)]
struct ObjectCold {
    norad_id:    u32,
    launch_year: u16,
    name:        String,
    operator:    String,
}

/// The catalog splits hot and cold data into separate vectors.
/// The index is the common key between them.
struct OrbitalCatalog {
    hot:  Vec<ObjectHot>,   // Dense; accessed every screening pass.
    cold: Vec<ObjectCold>,  // Sparse access; not in screening hot path.
}
}

The screening pass operates only on hot — a Vec<ObjectHot> of 50,000 × 32 bytes = 1.6 MB, fitting in L3 cache. cold is loaded only for display or export, where its access pattern (one object at a time by index) makes AoS natural.


Code Examples

Parallel Altitude Screening with Rayon and SoA

With altitudes stored in a contiguous Vec<f64>, the screening loop is naturally parallelisable — each parallel chunk accesses an independent range of the altitude slice:

// Note: rayon is not among this example's dependencies; the sequential scan
// below shows the data layout. With rayon = "1" in Cargo.toml, the same loop
// parallelises by importing rayon::prelude::* and swapping .iter() for
// .par_iter().

/// Simplified O(n) altitude band screening (not the full O(n²) conjunction check).
/// Finds objects in a dangerous altitude band. SoA makes this trivially parallel
/// and cache-friendly: the working set is just Vec<f64>.
fn screen_altitude_band(
    altitudes_km: &[f64],
    norad_ids: &[u32],
    min_km: f64,
    max_km: f64,
) -> Vec<u32> {
    assert_eq!(altitudes_km.len(), norad_ids.len());

    // Sequential: all altitudes fit in one contiguous slice.
    // Hardware prefetcher maximises cache utilisation.
    altitudes_km
        .iter()
        .zip(norad_ids.iter())
        .filter_map(|(&alt, &id)| {
            if alt >= min_km && alt <= max_km {
                Some(id)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Simulate a 10,000-object catalog.
    let altitudes_km: Vec<f64> = (0..10_000u32)
        .map(|i| 350.0 + (i as f64) * 0.05)
        .collect();
    let norad_ids: Vec<u32> = (0..10_000u32).collect();

    // Screen for objects in the 400–450 km band (high debris density).
    let alerts = screen_altitude_band(&altitudes_km, &norad_ids, 400.0, 450.0);
    println!("{} objects in 400–450km band", alerts.len());

    // Verify the working set is contiguous and predictable:
    let working_set_bytes = altitudes_km.len() * std::mem::size_of::<f64>();
    println!("working set: {} KB", working_set_bytes / 1024);
}
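
A hedged sketch of the parallel version, shown commented out since rayon is not available in the Playground. It assumes rayon = "1" in Cargo.toml; rayon's par_iter() splits the contiguous slice into independent per-thread ranges, so the loop body is unchanged:

// use rayon::prelude::*;
//
// fn screen_altitude_band_par(
//     altitudes_km: &[f64],
//     norad_ids: &[u32],
//     min_km: f64,
//     max_km: f64,
// ) -> Vec<u32> {
//     altitudes_km
//         .par_iter()
//         .zip(norad_ids.par_iter())
//         .filter_map(|(&alt, &id)| {
//             if alt >= min_km && alt <= max_km { Some(id) } else { None }
//         })
//         .collect()
// }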

Transposing AoS to SoA Incrementally

Transitioning an existing AoS codebase to SoA does not require a full rewrite. Extract the hot fields into a companion SoA structure and index both by the same key:

// Existing AoS type — not changed, other code still uses it.
#[derive(Debug, Clone)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence:     u64,
    timestamp_ms: u64,
    station_id:   u8,
    payload:      Vec<u8>,
}

// New SoA hot path for bulk sequence-number deduplication.
// Built from the AoS data; kept in sync on insert.
struct FrameSequenceIndex {
    satellite_ids: Vec<u32>,
    sequences:     Vec<u64>,
}

impl FrameSequenceIndex {
    fn from_frames(frames: &[TelemetryFrame]) -> Self {
        Self {
            satellite_ids: frames.iter().map(|f| f.satellite_id).collect(),
            sequences:     frames.iter().map(|f| f.sequence).collect(),
        }
    }

    /// Find all duplicate (satellite_id, sequence) pairs — O(n) scan,
    /// cache-friendly because both vecs are small and contiguous.
    fn find_duplicates(&self) -> Vec<usize> {
        let mut seen = std::collections::HashSet::new();
        self.satellite_ids
            .iter()
            .zip(self.sequences.iter())
            .enumerate()
            .filter_map(|(i, (&sat, &seq))| {
                if !seen.insert((sat, seq)) { Some(i) } else { None }
            })
            .collect()
    }
}

fn main() {
    // All frames share sequence 0 while satellite_id cycles 0,1,2 — frames
    // 3 and 4 duplicate the (satellite_id, sequence) pairs of frames 0 and 1.
    let frames: Vec<TelemetryFrame> = (0..5u32).map(|i| TelemetryFrame {
        satellite_id: i % 3,
        sequence: 0,
        timestamp_ms: 1_700_000_000 + i as u64,
        station_id: 1,
        payload: vec![i as u8; 128],
    }).collect();

    let index = FrameSequenceIndex::from_frames(&frames);
    let dups = index.find_duplicates();
    println!("{} duplicate frames found", dups.len());
}

The FrameSequenceIndex co-exists with the original Vec<TelemetryFrame>. Hot operations use the index; display and forwarding use the original frames. The transition is incremental — no global refactor required.


Key Takeaways

  • AoS is natural when operations access all or most fields of one entity at a time, or when random access by index is common. SoA is natural when operations process all entities but only a few fields — the common case in simulation and batch processing.

  • SoA improves cache utilisation for field-subset operations because the working set is smaller: processing altitudes_km[0..n] loads only altitude data, not names, metadata, or other cold fields.

  • Sequential access of a contiguous Vec<f64> is maximally cache-friendly and SIMD-friendly. The hardware prefetcher predicts linear access patterns; SIMD intrinsics or auto-vectorisation require uniformly-typed contiguous data.

  • The practical pattern is field grouping: split hot fields (accessed in every loop) from cold fields (accessed occasionally), and store them in separate vecs. This is a hybrid AoS/SoA approach.

  • Transitioning from AoS to SoA does not require a full rewrite. Extract hot fields into a companion SoA index, keep both in sync on insert, and route hot-path operations through the index.


Lesson 3 — Arena Allocation: Bump Allocators for High-Throughput Telemetry Processing

Module: Foundation — M05: Data-Oriented Design in Rust Position: Lesson 3 of 3 Source: Rust for Rustaceans — Jon Gjengset, Chapter 9 (GlobalAlloc). SoA and arena patterns synthesized from training knowledge.

Source note: The GlobalAlloc trait and its safety requirements are covered in Rust for Rustaceans, Ch. 9. The bump allocator pattern and its application to telemetry pipelines are synthesized from training knowledge. Recommended further reading: the bumpalo crate documentation and Alexis Beingessner's "The Allocator API, Bump Allocation, and You" (Gankra.github.io).



Context

The Meridian telemetry processor allocates a Vec<u8> for every telemetry frame payload. At 100,000 frames per second, that is 100,000 malloc/free round-trips per second hitting the global allocator. The global allocator (the system allocator by default, or a drop-in replacement such as jemalloc) is designed for general-purpose allocation: it handles arbitrary sizes, arbitrary lifetimes, and thread-safe concurrent access. This generality has a cost: each allocation acquires an internal lock or performs atomic operations, searches for a free block of appropriate size, and updates allocator metadata.

For short-lived objects that all die together — a batch of frames processed in one scheduling quantum, all freed at the end — a bump allocator eliminates all of that overhead. A bump allocator maintains a pointer into a pre-allocated slab of memory. Each allocation is one pointer addition. Deallocation is a no-op — the entire slab is reclaimed at once when the allocation epoch ends. For the right workload, this is 10–100× faster than the global allocator.

This lesson covers how bump allocators work, when they are appropriate, and how to implement and use them in Rust for high-throughput frame processing.


Core Concepts

The Global Allocator and Its Overhead

Every Box::new(x), Vec::new(), and String::new() in Rust calls the global allocator via the GlobalAlloc trait (Rust for Rustaceans, Ch. 9):

#![allow(unused)]
fn main() {
// The GlobalAlloc trait (simplified from std):
pub unsafe trait GlobalAlloc {
    unsafe fn alloc(&self, layout: std::alloc::Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: std::alloc::Layout);
}
}

The global allocator (the system allocator by default; often replaced with jemalloc in production deployments) handles:

  • Thread safety (internal locks or lock-free data structures)
  • Size classes and free lists for different allocation sizes
  • Fragmentation management
  • Returning memory to the OS when freed

For long-lived, variously-sized allocations with arbitrary lifetimes, this is correct and often fast. For thousands of small, short-lived allocations that all have the same lifetime, it is expensive overkill.

Bump Allocation: The Pattern

A bump allocator owns a contiguous slab of memory. Each allocation is a pointer increment:

[slab start]  →  [offset]  →  [slab end]
                    ↑
               pointer bumps forward on each allocation

Freeing individual allocations is not supported. The entire slab is reset in one operation when all allocations are no longer needed — the "epoch" ends and the offset pointer resets to zero.

Properties:

  • Allocation: O(1), typically one integer addition and a bounds check.
  • Deallocation: O(1) total for all allocations from one epoch — reset the offset.
  • Thread safety: A single-threaded bump allocator has no synchronisation overhead. A thread-local bump allocator gives each thread its own slab with no contention.
  • Fragmentation: None within the epoch. Memory is never reused for a different allocation during the epoch — no fragmentation.
  • Limitation: Cannot free individual allocations. All allocations from one bump allocator share the same lifetime.

Using bumpalo for Safe Bump Allocation

The bumpalo crate provides a production-quality bump allocator with a safe Rust interface:

// bumpalo is not available in the Playground — this shows the API.
// Add to Cargo.toml: bumpalo = { version = "3", features = ["collections"] }
// use bumpalo::Bump;
// use bumpalo::collections::Vec as BumpVec;

// Illustrating the pattern with a manual approach instead:
struct BumpArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        Self {
            slab: vec![0u8; capacity],
            offset: 0,
        }
    }

    /// Allocate `size` bytes aligned to `align`.
    /// Returns None if the slab is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        // Align the current offset up.
        let aligned = (self.offset + align - 1) & !(align - 1);
        let end = aligned + size;
        if end > self.slab.len() {
            return None; // Out of space.
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    /// Reset the arena — all previous allocations are invalidated.
    fn reset(&mut self) {
        self.offset = 0;
    }

    fn used(&self) -> usize { self.offset }
    fn capacity(&self) -> usize { self.slab.len() }
}

fn main() {
    let mut arena = BumpArena::new(4096);

    // Allocate space for 10 u64 values.
    let buf = arena.alloc(10 * 8, 8).expect("arena exhausted");
    println!("allocated {} bytes, used {}/{}", buf.len(), arena.used(), arena.capacity());

    // Reset — all allocations invalidated, slab reused.
    arena.reset();
    println!("after reset: used {}", arena.used());
}

Thread-Local Arenas for Concurrent Processing

For a multi-threaded pipeline where each worker thread processes its own batch of frames, a thread-local arena eliminates all lock contention:

#![allow(unused)]
fn main() {
use std::cell::RefCell;

const ARENA_CAPACITY: usize = 16 * 1024 * 1024; // 16MB per thread

thread_local! {
    // Each worker thread has its own private arena.
    // No synchronisation — no atomic operations, no locks.
    static FRAME_ARENA: RefCell<Vec<u8>> = RefCell::new(vec![0u8; ARENA_CAPACITY]);
    static ARENA_OFFSET: RefCell<usize> = const { RefCell::new(0) };
}

fn alloc_frame_buffer(size: usize) -> *mut u8 {
    FRAME_ARENA.with(|arena| {
        ARENA_OFFSET.with(|offset| {
            let mut off = offset.borrow_mut();
            let aligned = (*off + 7) & !7; // 8-byte alignment
            let end = aligned + size;
            let arena = arena.borrow();
            if end > arena.len() {
                panic!("thread-local arena exhausted — increase ARENA_CAPACITY or reduce batch size");
            }
            *off = end;
            // Pointer to this allocation's aligned offset in the slab —
            // not the slab base.
            unsafe { arena.as_ptr().add(aligned) as *mut u8 }
        })
    })
}

fn reset_thread_arena() {
    ARENA_OFFSET.with(|offset| *offset.borrow_mut() = 0);
}
}

In practice, use bumpalo::Bump in a thread_local! instead of building the unsafe version above. bumpalo handles alignment, growth, and lifetime correctly with a safe interface.

Epoch-Based Processing: The Right Workload

The bump allocator pattern maps naturally onto batch processing where all objects in a batch share the same lifetime:

use std::time::Instant;

/// Simulates a frame batch processor using a bump-style pre-allocated pool.
/// Each frame's payload is drawn from the batch buffer.
/// When the batch is complete, the buffer is reset — no individual frees.
struct FrameBatchProcessor {
    /// Pre-allocated buffer for all frame payloads in one batch.
    payload_pool: Vec<u8>,
    pool_offset: usize,
    batch_size: usize,
    frames_in_batch: usize,
}

impl FrameBatchProcessor {
    fn new(batch_size: usize, max_payload_per_frame: usize) -> Self {
        Self {
            payload_pool: vec![0u8; batch_size * max_payload_per_frame],
            pool_offset: 0,
            batch_size,
            frames_in_batch: 0,
        }
    }

    /// Claim space for a frame payload from the pool.
    /// Returns a mutable slice for the caller to fill.
    fn claim_payload_slot(&mut self, size: usize) -> Option<&mut [u8]> {
        if self.frames_in_batch >= self.batch_size {
            return None; // Batch full.
        }
        let end = self.pool_offset + size;
        if end > self.payload_pool.len() {
            return None; // Pool exhausted.
        }
        let slot = &mut self.payload_pool[self.pool_offset..end];
        self.pool_offset = end;
        self.frames_in_batch += 1;
        Some(slot)
    }

    /// Process the current batch and reset for the next one.
    /// All payload slots are implicitly freed — no individual deallocation.
    fn flush_and_reset(&mut self) -> usize {
        let processed = self.frames_in_batch;
        self.pool_offset = 0;
        self.frames_in_batch = 0;
        processed
    }
}

fn main() {
    let mut processor = FrameBatchProcessor::new(1000, 1024);
    let start = Instant::now();

    // Simulate processing 100,000 frames in batches of 1,000.
    let mut total = 0;
    for _batch in 0..100 {
        for _frame in 0..1000 {
            // Claim a 256-byte payload slot — no malloc.
            if let Some(slot) = processor.claim_payload_slot(256) {
                slot[0] = 0xAA; // Simulate writing frame data.
            }
        }
        total += processor.flush_and_reset();
    }

    let elapsed = start.elapsed();
    println!("processed {total} frames in {:?}", elapsed);
    println!("~{:.0} frames/sec", total as f64 / elapsed.as_secs_f64());
}

When Not to Use Bump Allocation

Bump allocators are not appropriate when:

  • Lifetimes are mixed. If some objects from a batch need to outlive the batch (e.g., forwarding a specific frame to a slow downstream consumer while releasing the rest), a bump allocator cannot express this. The solution is to copy out the long-lived objects to global-allocator memory before resetting, as sketched after this list.

  • Individual deallocation is required. A bump allocator cannot free one allocation while keeping others alive. Use a pool allocator (fixed-size slots with a free list) if individual deallocation of same-sized objects is needed.

  • Batches are unpredictably sized. If you cannot bound the total allocation size of a batch, the arena may exhaust. Size the arena conservatively — or use bumpalo, which supports growth by chaining multiple slabs.
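
A minimal sketch of that copy-out pattern, using a plain Vec<u8> slab as a stand-in for the arena; the names (slot, forward_later) are illustrative:

fn main() {
    // Stand-in for a bump arena slab.
    let mut slab = vec![0u8; 4096];

    // Claim a slot for one frame and fill it — arena-lifetime data.
    let slot = &mut slab[0..64];
    slot.fill(0xAB);

    // This frame must outlive the batch: copy it to global-allocator
    // memory *before* the epoch ends.
    let forward_later: Vec<u8> = slot.to_vec();

    // Epoch ends — the reset invalidates every slot, but the escaped
    // copy remains valid.
    slab.fill(0);
    println!("escaped {} bytes, first byte {:#04x}",
        forward_later.len(), forward_later[0]);
}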


Code Examples

Comparing Global vs Arena Allocation for Frame Batches

This benchmark illustrates the overhead difference. Exact numbers depend on the machine and allocator, but for small, short-lived allocations the arena is typically 5–20× faster.

use std::time::Instant;

const FRAMES: usize = 100_000;
const PAYLOAD_SIZE: usize = 256;

fn bench_global_alloc() -> std::time::Duration {
    let start = Instant::now();
    for _ in 0..FRAMES {
        // Vec::with_capacity triggers one malloc per frame; the drop at the
        // end of the iteration triggers one free — 100,000 round-trips total.
        let mut v = Vec::with_capacity(PAYLOAD_SIZE);
        for i in 0..PAYLOAD_SIZE {
            v.push(i as u8);
        }
        // black_box keeps v observable so the optimiser cannot delete the
        // allocation and the writes.
        std::hint::black_box(v);
    }
    start.elapsed()
}

fn bench_arena_alloc() -> std::time::Duration {
    // Pre-allocate a slab for the entire batch.
    let mut slab = vec![0u8; FRAMES * PAYLOAD_SIZE];
    let start = Instant::now();
    let mut offset = 0;
    for frame_idx in 0..FRAMES {
        let start_byte = offset;
        let end_byte = offset + PAYLOAD_SIZE;
        for (i, byte) in slab[start_byte..end_byte].iter_mut().enumerate() {
            *byte = (frame_idx ^ i) as u8;
        }
        offset = end_byte;
    }
    // All frames "freed" by resetting offset to 0 — one operation.
    offset = 0;
    // Keep the writes and the reset observable to the optimiser.
    std::hint::black_box((offset, &slab));
    start.elapsed()
}

fn main() {
    // Warm up to avoid measurement noise from cold caches.
    let _ = bench_global_alloc();
    let _ = bench_arena_alloc();

    let global_time = bench_global_alloc();
    let arena_time  = bench_arena_alloc();

    println!("global alloc: {:?}", global_time);
    println!("arena alloc:  {:?}", arena_time);
    let speedup = global_time.as_nanos() as f64 / arena_time.as_nanos() as f64;
    println!("arena speedup: {speedup:.1}×");
}

The arena's advantage grows with allocation count. At 100,000 256-byte frames, the arena avoids 100,000 malloc/free round-trips. The global allocator also has to find and merge free blocks over time as the heap fragments; the arena has zero fragmentation overhead.


Key Takeaways

  • The global allocator (malloc/free) is general-purpose: thread-safe, handles arbitrary sizes and lifetimes, manages fragmentation. Its generality has overhead — internal synchronisation, free list management, metadata updates.

  • A bump allocator eliminates this overhead for objects with a shared lifetime. Allocation is one integer addition. Deallocation is resetting one offset — all objects from one epoch freed simultaneously.

  • The lifetime constraint is the critical requirement. If any object from a bump-allocated batch must outlive the batch, copy it out to the global allocator before resetting. Do not try to mix lifetimes within one arena.

  • Thread-local arenas eliminate all cross-thread contention. Each worker thread gets its own slab; no lock, no atomic operation, no cache line bounce for allocation.

  • Use bumpalo in production. It handles alignment, growth via chained slabs, and safe lifetimes. Implement your own bump allocator only for educational purposes or in no_std environments where crate dependencies are restricted.

  • Profile before optimising. The global allocator is fast for typical workloads. Bump allocation is a targeted optimisation for high-frequency, same-lifetime allocation patterns — not a universal replacement for Vec or Box.


Project — High-Throughput Telemetry Packet Processor

Module: Foundation — M05: Data-Oriented Design in Rust Prerequisite: All three module quizzes passed (≥70%)


Mission Brief

TO: Platform Engineering FROM: Mission Control Systems Lead CLASSIFICATION: UNCLASSIFIED // INTERNAL SUBJECT: RFC-0055 — Telemetry Packet Processor Performance Target


The current Rust-language telemetry processor runs at 62,000 frames per second under sustained load. The conjunction avoidance pipeline requires 100,000 frames per second to maintain sub-10-second delivery windows during peak orbital density events. The gap is 38%. Profiling shows two root causes:

  1. Allocator pressure. The processor allocates a Vec<u8> per frame payload on the global heap. At 100k fps, this is 100k malloc/free round-trips per second — 18% of CPU time.
  2. Cache waste. The TelemetryFrame struct packs hot routing fields with cold payload data. Sequential scan of 100k frames for deduplication loads 2.4× more data than the deduplication logic uses.

Your task is to rebuild the processor core using the three techniques from this module: cache-optimal struct layout, SoA separation of hot and cold data, and arena allocation for frame payloads.


System Specification

Frame Structure

#![allow(unused)]
fn main() {
/// Hot fields — accessed in every pass (routing, deduplication, sorting).
/// Must fit in ≤ 32 bytes and be ordered by descending alignment.
#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub timestamp_ms: u64,
    pub sequence:     u64,
    pub satellite_id: u32,
    pub byte_count:   u16,
    pub station_id:   u8,
    pub flags:        u8,
}

/// Cold data — accessed only when forwarding to downstream consumers.
/// Held as a reference into the batch arena; lifetime is one processing epoch.
pub struct FramePayload<'arena> {
    pub data: &'arena [u8],
}
}

Processing Pipeline

The processor receives frames in batches of up to 10,000. For each batch:

  1. Claim payload space from the batch arena for each frame.
  2. Validate each frame's CRC (simulated: check that flags & 0x80 == 0).
  3. Deduplicate by (satellite_id, sequence) — discard duplicates using a SoA scan over hot headers.
  4. Sort the batch by timestamp_ms ascending — sort only the header array, not the payloads.
  5. Forward unique sorted frames to a tokio::sync::mpsc::Sender<ForwardedFrame>.
  6. Reset the arena — all payload allocations freed simultaneously.

Performance Target

  • Process 100,000 frames per second sustained across a benchmark of 10,000 batches × 1,000 frames.
  • Arena allocation must be used for frame payloads — no Vec<u8> per payload.
  • Hot field access (deduplication and sort) must operate on the header array, not the full frame struct.
  • Struct size assertions must compile: size_of::<FrameHeader>() == 24.

Output

A binary crate that:

  1. Generates synthetic frame batches
  2. Runs the full pipeline (validate → deduplicate → sort → forward) for 10,000 batches
  3. Reports frames per second, percentage of duplicates discarded, and arena reset count
  4. Confirms no per-frame heap allocations occur in the hot path (verified by measuring allocator calls)

Acceptance Criteria

  1. size_of::<FrameHeader>() == 24 — const assertion in source. Verify: compile-time.
  2. Frame payloads allocated from the batch arena, not the global heap. Verify: code review — no Vec::new() or Box::new() in the hot path.
  3. Deduplication operates on &[FrameHeader] — no full struct access. Verify: code review.
  4. Sort operates on the header array by index — payloads not moved. Verify: code review.
  5. Arena resets after each batch — used bytes reset to 0. Verify: assert in batch loop.
  6. Benchmark reports ≥ 100,000 frames/sec on a modern laptop. Verify: timing output.
  7. Duplicate detection uses a HashSet<(u32, u64)> on header fields only. Verify: code review.

Hints

Hint 1 — FrameHeader size assertion
#![allow(unused)]
fn main() {
const _: () = assert!(
    std::mem::size_of::<FrameHeader>() == 24,
    "FrameHeader must be 24 bytes — check field order and padding"
);
}

Field order for 24 bytes with no padding:

  • timestamp_ms: u64 (8)
  • sequence: u64 (8)
  • satellite_id: u32 (4)
  • byte_count: u16 (2)
  • station_id: u8 (1)
  • flags: u8 (1)

Total: 24 bytes, alignment 8, no padding.

Hint 2 — Batch arena design

Pre-allocate the slab once per processor lifetime. Reset between batches:

#![allow(unused)]
fn main() {
pub struct BatchArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BatchArena {
    pub fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    /// Allocate `size` bytes; returns a mutable slice into the slab.
    pub fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (self.offset + 7) & !7; // 8-byte alignment
        let end = aligned + size;
        if end > self.slab.len() { return None; }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    /// Reset — all previous allocations implicitly freed.
    pub fn reset(&mut self) { self.offset = 0; }
    pub fn used(&self) -> usize { self.offset }
}
}

Size the arena for the worst-case batch — max_batch_size * max_payload_size — plus a little slack for alignment padding.

Hint 3 — SoA deduplication

Maintain a Vec<FrameHeader> (hot, dense) separate from payloads:

#![allow(unused)]
fn main() {
use std::collections::HashSet;

fn deduplicate(headers: &[FrameHeader]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, h)| {
            if seen.insert((h.satellite_id, h.sequence)) {
                Some(i) // Unique — keep this index.
            } else {
                None    // Duplicate — discard.
            }
        })
        .collect()
}
}

The deduplication loop touches only satellite_id and sequence from FrameHeader — 12 bytes of the 24-byte struct. With 1,000 headers per batch at 24 bytes each, the working set is 24 KB — fits in L1 cache.

Hint 4 — Sort headers by timestamp without moving payloads

Sort an index array by headers[i].timestamp_ms, not the headers themselves. This avoids any payload movement:

#![allow(unused)]
fn main() {
fn sort_by_timestamp(indices: &mut Vec<usize>, headers: &[FrameHeader]) {
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
}
}

Iterating indices in sorted order gives frames in timestamp order without copying or moving any data.


Reference Implementation

use std::collections::HashSet;
use std::time::Instant;

// --- FrameHeader ---

#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub timestamp_ms: u64,
    pub sequence:     u64,
    pub satellite_id: u32,
    pub byte_count:   u16,
    pub station_id:   u8,
    pub flags:        u8,
}

const _SIZE: () = assert!(std::mem::size_of::<FrameHeader>() == 24);
const _ALIGN: () = assert!(std::mem::align_of::<FrameHeader>() == 8);

// --- BatchArena ---

pub struct BatchArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BatchArena {
    pub fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    pub fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (self.offset + 7) & !7;
        let end = aligned + size;
        if end > self.slab.len() { return None; }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    pub fn reset(&mut self) { self.offset = 0; }
    pub fn used(&self) -> usize { self.offset }
}

// --- Pipeline ---

fn validate(header: &FrameHeader) -> bool {
    header.flags & 0x80 == 0  // Simulated CRC: high bit = error flag.
}

fn deduplicate_indices(headers: &[FrameHeader]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers.iter().enumerate().filter_map(|(i, h)| {
        if seen.insert((h.satellite_id, h.sequence)) { Some(i) } else { None }
    }).collect()
}

fn sort_indices_by_timestamp(indices: &mut Vec<usize>, headers: &[FrameHeader]) {
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
}

fn process_batch(
    arena: &mut BatchArena,
    batch: &[(u64, u64, u32, u16, u8, u8, Vec<u8>)], // (ts, seq, sat, bytes, stn, flags, raw_payload)
) -> (usize, usize) {
    // 1. Fill header array and claim arena slots for payloads.
    let mut headers: Vec<FrameHeader> = Vec::with_capacity(batch.len());
    let mut payload_offsets: Vec<(usize, usize)> = Vec::with_capacity(batch.len()); // (start, len)

    for (ts, seq, sat, bytes, stn, flags, payload) in batch {
        let header = FrameHeader {
            timestamp_ms: *ts,
            sequence: *seq,
            satellite_id: *sat,
            byte_count: *bytes,
            station_id: *stn,
            flags: *flags,
        };

        // Validate before claiming arena space.
        if !validate(&header) { continue; }

        let slot = match arena.alloc(payload.len()) {
            Some(s) => s,
            None => break, // Arena full — drop remaining frames.
        };
        slot.copy_from_slice(payload);

        let start = arena.used() - payload.len();
        payload_offsets.push((start, payload.len()));
        headers.push(header);
    }

    // 2. Deduplicate on hot header array — SoA benefit.
    let mut unique_indices = deduplicate_indices(&headers);
    let duplicates = headers.len() - unique_indices.len();

    // 3. Sort by timestamp — header array only, no payload movement.
    sort_indices_by_timestamp(&mut unique_indices, &headers);

    let forwarded = unique_indices.len();

    // 4. Reset arena — all payload slots freed in O(1).
    arena.reset();

    (forwarded, duplicates)
}

fn main() {
    const BATCH_SIZE: usize = 1_000;
    const NUM_BATCHES: usize = 10_000;
    const MAX_PAYLOAD: usize = 256;

    let mut arena = BatchArena::new(BATCH_SIZE * MAX_PAYLOAD + 64);

    // Generate synthetic batch data.
    let batch: Vec<(u64, u64, u32, u16, u8, u8, Vec<u8>)> = (0..BATCH_SIZE)
        .map(|i| {
            // Frames at i % 3 == 2 repeat the previous frame's key, so
            // (satellite_id, sequence) duplicates at a ~33% rate.
            let key = i - usize::from(i % 3 == 2);
            (
                1_700_000_000 + i as u64,
                key as u64,
                (key % 48) as u32,
                MAX_PAYLOAD as u16,
                (i % 12) as u8,
                0u8,
                vec![(i & 0xFF) as u8; MAX_PAYLOAD],
            )
        })
        .collect();

    let start = Instant::now();
    let mut total_forwarded = 0usize;
    let mut total_duplicates = 0usize;
    let mut resets = 0usize;

    for _ in 0..NUM_BATCHES {
        let (fwd, dups) = process_batch(&mut arena, &batch);
        total_forwarded += fwd;
        total_duplicates += dups;
        resets += 1;
        assert_eq!(arena.used(), 0, "arena must be reset after each batch");
    }

    let elapsed = start.elapsed();
    let total_frames = BATCH_SIZE * NUM_BATCHES;
    let fps = total_frames as f64 / elapsed.as_secs_f64();

    println!("--- Telemetry Processor Benchmark ---");
    println!("frames:     {}", total_frames);
    println!("forwarded:  {}", total_forwarded);
    println!("duplicates: {} ({:.1}%)", total_duplicates,
        100.0 * total_duplicates as f64 / total_frames as f64);
    println!("resets:     {}", resets);
    println!("elapsed:    {:.2?}", elapsed);
    println!("throughput: {:.0} frames/sec", fps);
    println!("FrameHeader size: {} bytes", std::mem::size_of::<FrameHeader>());
}

Reflection

The three optimisations in this project compose:

  • Struct layout ensures the FrameHeader array is compact (24 bytes/entry, no wasted padding). 24 KB for 1,000 headers — fits in L1 cache, fully available during the deduplication and sort passes.
  • SoA separation means deduplication and sorting never touch payload data — the payload arena is not in the working set during hot-path operations.
  • Arena allocation eliminates 100,000 per-second malloc/free round-trips. All payloads for one batch are freed in a single pointer reset.

Each optimisation is independently valuable. Together, they target the three most common sources of throughput bottlenecks in high-frequency data pipelines: allocator pressure, memory bandwidth waste, and cache thrashing.

The benchmarking mindset from Module 6 (Performance and Profiling) will give you the tools to measure these improvements precisely — comparing before and after with criterion, identifying the limiting factor with perf, and validating that the improvements hold under realistic workload conditions.

Module 06 — Performance & Profiling

Track: Foundation — Mission Control Platform Position: Module 6 of 6 (Foundation track complete) Source material: Rust for Rustaceans — Jon Gjengset, Chapter 6; criterion, cargo-flamegraph, perf, dhat documentation Quiz pass threshold: 70% on all three lessons to unlock the project



Mission Context

The Module 5 telemetry processor achieves 100,000 frames per second in isolation. The integrated control plane pipeline runs at 71,000. The 29% gap is not in the algorithm — it is in measurement blind spots: unmeasured allocations, unverified assumptions about what the compiler optimises away, and code paths that look fast but are not.

Performance engineering without measurement is optimism. This module provides the measurement toolkit: criterion for reliable benchmarks, flamegraph and perf for identifying hot paths, and allocation counting for detecting hidden heap overhead. The project combines all three into a structured audit that turns a performance gap into a documented, measured, verified improvement.


What You Will Learn

By the end of this module you will be able to:

  • Identify the three failure modes of naive Instant::now() benchmarks: dead-code elimination, constant folding, and I/O overhead masking the function under test
  • Apply std::hint::black_box correctly to both inputs and outputs to prevent compiler optimisations from invalidating benchmark results
  • Write criterion benchmarks with proper setup/measurement separation, interpret confidence intervals and p-values, and run parameterised benchmarks across input sizes
  • Configure the release profile with debug symbols for profiling, generate flamegraphs with cargo flamegraph, and identify hot paths from flamegraph visual patterns
  • Read perf stat output to diagnose whether a workload is compute-bound, memory-bound, or branch-prediction-bound before generating a flamegraph
  • Use a #[global_allocator] counting wrapper to count allocations in a specific code path, embed zero-allocation assertions in CI, and eliminate common hidden allocation sources (HashMap::new(), Vec::collect(), format!())

Lessons

Lesson 1 — Benchmarking with criterion: Writing Reliable Microbenchmarks

Covers the three failure modes of naive timing loops, std::hint::black_box placement for both input and output, criterion API and confidence interval interpretation, setup/measurement separation, benchmarking at realistic input sizes, and reading the statistical significance output.

Key question this lesson answers: How do you know your benchmark is measuring what you think it is, and how do you distinguish a real performance change from measurement noise?

lesson-01-benchmarking.md / lesson-01-quiz.toml


Lesson 2 — CPU Profiling with flamegraph and perf: Finding Hot Paths

Covers the sampling profiler model, configuring release builds with debug symbols for profiling, perf stat hardware counter diagnosis (IPC, cache miss rate, branch miss rate), cargo flamegraph workflow, reading flamegraph visual patterns (wide flat bars, deep towers, distributed overhead), and #[inline(never)] for profiling visibility.

Key question this lesson answers: Which function is consuming the most CPU time, and how do you distinguish a compute-bound bottleneck from a memory-bound one?

lesson-02-flamegraph.md / lesson-02-quiz.toml


Lesson 3 — Memory Profiling: Heap Allocation Tracking and Reducing Allocator Pressure

Covers the allocation cost model, #[global_allocator] counting wrappers for exact per-path allocation counts, HashMap::with_capacity and Vec::with_capacity pre-allocation, clear() for buffer reuse across batches, dhat for call-site-attributed heap profiling, and CI-embedded zero-allocation assertions.

Key question this lesson answers: How many allocations happen in the hot path, which call sites are responsible, and how do you make that a CI assertion rather than a one-time finding?

lesson-03-memory-profiling.md / lesson-03-quiz.toml


Capstone Project — Meridian Control Plane Performance Audit

Apply the full three-phase audit workflow to the integrated telemetry pipeline: establish a criterion baseline, generate a flamegraph to identify the hot path, use a counting allocator to quantify per-stage allocation overhead, implement the highest-impact fix, and verify the improvement is statistically significant (p < 0.05). Document findings in audit.md.

Acceptance is against 7 verifiable criteria including correct criterion usage, flamegraph generation, per-stage allocation counts, a documented fix, and a p < 0.05 improvement.

project-performance-audit.md


Prerequisites

Modules 1–5 must be complete. Module 5 (Data-Oriented Design) established the optimisations being measured here — this module gives you the tools to verify that those optimisations actually work and to prevent future regressions. Module 2 (Concurrency Primitives) introduced atomic operations, which are used by the counting allocator in Lesson 3.

Foundation Track Complete

With Module 6 complete, the Foundation track is done. The six modules cover the complete toolset for building Meridian's control plane in Rust: async task scheduling, concurrency primitives, message-passing architectures, network I/O, data-oriented design, and performance measurement. The four specialisation tracks — Database Internals, Data Pipelines, Data Lakes, and Distributed Systems — are now unlocked and can be taken in any order.

Lesson 1 — Benchmarking with criterion: Writing Reliable Microbenchmarks

Module: Foundation — M06: Performance & Profiling Position: Lesson 1 of 3 Source: Rust for Rustaceans — Jon Gjengset, Chapter 6



Context

The Module 5 project processor targets 100,000 frames per second. You have a number — but how confident are you in it? The benchmark loop used Instant::now() / elapsed() around a single iteration. That measurement is subject to three failure modes documented in Rust for Rustaceans Ch. 6: performance variance between runs (caused by CPU temperature, OS scheduler interrupts, memory layout), compiler optimisation eliminating the code under test entirely, and I/O overhead masking the actual function cost. A timing loop that contains a println! is usually measuring the speed of terminal output, not your function.

The criterion crate addresses all three. It runs each benchmark hundreds of times, applies statistical analysis to separate signal from noise, detects and reports outliers, and generates comparison reports that tell you whether a change is a real regression or measurement noise. When Meridian's CI pipeline regresses the frame processor throughput by 15%, criterion is how you prove the regression is real, quantify its size, and track it to the specific commit.

Source: Rust for Rustaceans, Chapter 6 (Gjengset)


Core Concepts

Why Instant::now() Loops Are Not Enough

Consider this naive benchmark:

fn my_function(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();
    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        let _ = my_function(&data);
    }
    println!("took {:?}", start.elapsed());
}

Two problems. First, the compiler may eliminate my_function entirely — the result is bound to _ and discarded, so nothing in the code requires the computation to happen (Rust for Rustaceans, Ch. 6). In release mode, the loop body may compile to nothing. Second, a single run on a loaded machine is noise, not signal. CPU clock scaling, branch predictor warmup, and OS scheduler preemption all add variance. A function that takes 50µs may measure anywhere from 40µs to 200µs depending on external conditions.

criterion Basics

Add criterion to Cargo.toml:

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "frame_processor"
harness = false

A criterion benchmark in benches/frame_processor.rs:

// Note: criterion is a dev-dependency — not available in the Playground.
// This demonstrates the API. In production add criterion = "0.5" to Cargo.toml.

// use criterion::{black_box, criterion_group, criterion_main, Criterion};

// fn bench_deduplication(c: &mut Criterion) {
//     let headers: Vec<u64> = (0..1000).collect();
//     c.bench_function("dedup_1000", |b| {
//         b.iter(|| {
//             // black_box prevents the compiler from optimising away the input
//             // or treating the result as dead code.
//             black_box(deduplicate(black_box(&headers)))
//         })
//     });
// }
//
// criterion_group!(benches, bench_deduplication);
// criterion_main!(benches);

// Illustrating the structure with std::hint::black_box instead:
fn deduplicate(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn main() {
    let headers: Vec<u64> = (0..1000).collect();

    // Warm up
    for _ in 0..100 {
        std::hint::black_box(deduplicate(std::hint::black_box(&headers)));
    }

    // Measure
    let iterations = 10_000;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        std::hint::black_box(deduplicate(std::hint::black_box(&headers)));
    }
    let elapsed = start.elapsed();
    println!("deduplicate(1000): {:.2?} per iteration",
        elapsed / iterations);
}

black_box: Preventing Dead-Code Elimination

std::hint::black_box (or criterion::black_box) is the key primitive for correct benchmarks. It is an identity function that tells the compiler: assume this value is used in some arbitrary way (Rust for Rustaceans, Ch. 6). This prevents two failure modes:

Eliminating dead computation: if the benchmark discards the result with let _ = expensive(), the compiler may eliminate the call. black_box(expensive()) forces the computation to occur because the compiler must assume black_box uses its argument.

Constant-folding inputs: if the input to a benchmark is a compile-time constant, the compiler may pre-compute the result at compile time. black_box(input) forces the compiler to treat the input as runtime-unknown.

fn sum_slice(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();

    // WRONG: compiler may eliminate the call — result discarded, input known.
    {
        let start = std::time::Instant::now();
        for _ in 0..10_000 {
            let _ = sum_slice(&data);
        }
        println!("(likely wrong) took {:?}", start.elapsed());
    }

    // CORRECT: black_box prevents dead-code elimination and constant folding.
    {
        let start = std::time::Instant::now();
        for _ in 0..10_000 {
            std::hint::black_box(sum_slice(std::hint::black_box(&data)));
        }
        println!("(correct) took {:?}", start.elapsed());
    }
}

Note the placement: black_box on the input prevents constant folding (the compiler must treat the slice as runtime-unknown). black_box on the output prevents dead-code elimination (the compiler must assume the result is used).

Benchmark Structure Best Practices

Separate setup from measurement. Criterion's b.iter(|| { ... }) closure is the measured unit. Anything outside it is setup and runs once. Constructing test data inside the measured closure inflates the result with allocation cost.

// Illustrating the pattern with manual timing:
fn build_test_headers(n: usize) -> Vec<u64> {
    // This is setup — not what we want to measure.
    (0..n as u64).collect()
}

fn deduplicate_headers(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn bench_with_setup() {
    // Build test data ONCE — not inside the measured loop.
    let headers = build_test_headers(1000);

    let iterations = 100_000u32;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        // Only the function under test is measured.
        std::hint::black_box(deduplicate_headers(std::hint::black_box(&headers)));
    }
    let elapsed = start.elapsed();
    println!("deduplicate(1000): {:.2?}/iter", elapsed / iterations);
}

fn main() {
    bench_with_setup();
}

Benchmark at realistic input sizes. A function that is O(n log n) may be cache-bound at n=100 and compute-bound at n=100,000. Benchmark at the sizes you actually use in production. For Meridian's conjunction screen, that is 50,000 objects — not 100.

Use criterion's input size parameter for scaling analysis. BenchmarkGroup lets you benchmark the same function at multiple input sizes and plot throughput vs. size. The slope of that plot tells you whether your function is cache-bound (throughput drops sharply above L2 size) or compute-bound (throughput scales smoothly).
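
A hedged sketch of such a group, reusing the deduplicate function above — commented out because criterion is a dev-dependency; check the details against your criterion version:

// use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
//
// fn bench_dedup_scaling(c: &mut Criterion) {
//     let mut group = c.benchmark_group("dedup_scaling");
//     for &n in &[1_000usize, 10_000, 50_000] {
//         // Setup outside b.iter() — built once per size, not per sample.
//         let headers: Vec<u64> = (0..n as u64).collect();
//         group.throughput(Throughput::Elements(n as u64));
//         group.bench_with_input(BenchmarkId::from_parameter(n), &headers, |b, h| {
//             b.iter(|| black_box(deduplicate(black_box(h))))
//         });
//     }
//     group.finish();
// }
//
// criterion_group!(benches, bench_dedup_scaling);
// criterion_main!(benches);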

Interpreting Criterion Output

cargo bench produces output like:

dedup_1000              time:   [12.453 µs 12.501 µs 12.554 µs]
                        change: [-2.1431% -1.6789% -1.1920%] (p=0.00 < 0.05)
                        Performance has improved.

The three numbers are the lower bound, estimate, and upper bound of the 95% confidence interval. If you see a wide interval (e.g., [5 µs, 50 µs, 200 µs]), measurement variance is high — run on a quieter machine, increase the sample count (--sample-size), or lengthen the measurement window (--measurement-time).

The change line compares against the previous run (stored in target/criterion). A p-value below 0.05 means the change is statistically significant with 95% confidence. Changes with p > 0.05 are likely noise.


Code Examples

Benchmarking the Telemetry Processor with Parameterised Input Sizes

This example uses black_box correctly and varies input size to understand the performance profile across the range of realistic batch sizes.

use std::collections::HashSet;
use std::hint::black_box;
use std::time::{Duration, Instant};

fn deduplicate(headers: &[(u32, u64)]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers.iter().enumerate()
        .filter_map(|(i, &(sat, seq))| {
            if seen.insert((sat, seq)) { Some(i) } else { None }
        })
        .collect()
}

fn sort_by_timestamp(indices: &mut Vec<usize>, timestamps: &[u64]) {
    indices.sort_unstable_by_key(|&i| timestamps[i]);
}

/// Run `iterations` iterations, return median per-iteration duration.
fn time_fn<F: Fn()>(f: F, iterations: u32) -> Duration {
    // Warm up — let branch predictor and instruction cache settle.
    for _ in 0..10 { f(); }

    let start = Instant::now();
    for _ in 0..iterations { f(); }
    start.elapsed() / iterations
}

fn main() {
    println!("{:<10} {:>15} {:>15} {:>15}",
        "n_frames", "dedup (µs)", "sort (µs)", "total (µs)");
    println!("{}", "-".repeat(55));

    for &n in &[100usize, 500, 1_000, 5_000, 10_000] {
        // Build test data once — not in the measured loop.
        let headers: Vec<(u32, u64)> = (0..n)
            .map(|i| ((i % 48) as u32, (i / 3) as u64)) // ~33% duplicates
            .collect();
        let timestamps: Vec<u64> = (0..n).map(|i| (n - i) as u64).collect();

        let dedup_time = time_fn(|| {
            black_box(deduplicate(black_box(&headers)));
        }, 10_000);

        let sort_time = time_fn(|| {
            // Sorting mutates its input, so `indices` must be rebuilt each
            // iteration; the measurement therefore includes the rebuild cost.
            let mut indices = (0..n).collect::<Vec<_>>();
            sort_by_timestamp(black_box(&mut indices), black_box(&timestamps));
            black_box(indices);
        }, 10_000);

        println!("{:<10} {:>15.2} {:>15.2} {:>15.2}",
            n,
            dedup_time.as_secs_f64() * 1e6,
            sort_time.as_secs_f64() * 1e6,
            (dedup_time + sort_time).as_secs_f64() * 1e6,
        );
    }
}

The slope of the dedup time as n grows reveals whether the function is O(n) (linear slope on a linear plot) or showing cache effects (steeper slope beyond a threshold). If dedup time grows faster than linearly above n=1000, the HashSet working set has exceeded L1 cache and you are paying for L2/L3 misses.


Key Takeaways

  • Instant::now() around a single-pass loop is not a reliable benchmark. Performance variance between runs, compiler dead-code elimination, and I/O in the loop can all produce completely wrong numbers (Rust for Rustaceans, Ch. 6).

  • std::hint::black_box (or criterion::black_box) prevents the compiler from eliminating benchmark code as dead. Apply it to both the input (prevent constant folding) and the output (prevent dead-code elimination).

  • criterion runs each benchmark the statistically appropriate number of times, computes confidence intervals, detects outliers, and reports whether changes between runs are statistically significant. Use p < 0.05 as the threshold for treating a change as real.

  • Separate setup from measurement. Build test data outside the measured closure. Benchmark at realistic production input sizes, not toy sizes. Use parameterised input to understand performance scaling behaviour.

  • A 95% confidence interval that is narrow (< 5% spread) indicates a reliable measurement. A wide interval indicates high variance — run on a quieter machine or use cargo bench -- --sample-size 200 for more samples.


Lesson 2 — CPU Profiling with flamegraph and perf: Finding Hot Paths

Module: Foundation — M06: Performance & Profiling Position: Lesson 2 of 3 Source: Synthesized from training knowledge (cargo flamegraph, perf, pprof documentation)

Source note: This lesson synthesizes from cargo-flamegraph, Linux perf, and pprof documentation. Verify specific CLI flags against your installed version of perf — options vary between kernel versions.


Context

criterion tells you that the frame deduplication function takes 12.5µs. It does not tell you why. Is it the HashSet insertions? The iterator chain? A memory allocation path? To answer that question, you need a CPU profiler — a tool that samples the program's call stack at regular intervals and shows you where time is being spent.

The flamegraph is the standard visualisation for this: a call tree where width encodes time and the call stack grows upward. The widest frames at the top are where your program actually spends its time. A deep narrow tower is a deep but fast call chain. A wide flat bar is a hot leaf function. Reading flamegraphs is a skill that takes a few profiling sessions to develop, but the insight-to-effort ratio is very high.

This lesson covers the two-tool profiling workflow for Rust on Linux: perf to collect samples and cargo flamegraph to generate the visualisation.


Core Concepts

The Profiling Workflow

CPU profiling works by sampling: the OS timer fires at regular intervals (typically 99 Hz or 999 Hz), captures the current call stack, and records the sample. After the program finishes, the accumulated stack samples are folded into a call tree and rendered as a flamegraph. Functions that appear in more samples are proportionally wider in the graph.

The standard workflow:

# 1. Build with debug info (but optimisations enabled — profile release code).
#    debug = true in [profile.release] preserves symbols without losing optimisations.
#    Add to Cargo.toml:
#    [profile.release]
#    debug = true

# 2. Install cargo-flamegraph (wraps perf or dtrace depending on platform).
cargo install flamegraph

# 3. Profile the binary.
cargo flamegraph --bin meridian-processor -- --frames 100000

# 4. Open the generated flamegraph.svg in a browser.
#    Click any frame to zoom in. Search by function name.

On Linux, cargo flamegraph uses perf record under the hood. On macOS, it uses dtrace. The output is always a flamegraph.svg.

Building for Profiling: Debug Symbols in Release Mode

Profiling a debug build measures the wrong thing — debug code contains bounds checks, non-inlined functions, and other overhead that does not exist in production. Profile release builds.

But release builds strip debug symbols by default — the flamegraph shows mangled symbol addresses instead of function names. The fix: add debug info to the release profile without disabling optimisations:

# Cargo.toml
[profile.release]
debug = true      # Include debug symbols (DWARF info).
opt-level = 3     # Keep full optimisation.
# Note: debug = true increases binary size (~3-10×) but has negligible
# runtime overhead. Strip the binary before deploying to production.

Alternatively, use the profiling profile convention:

[profile.profiling]
inherits = "release"
debug = true

Then cargo build --profile profiling && cargo flamegraph --profile profiling.

Reading a Flamegraph

A flamegraph stacks call frames vertically — the root (main) at the bottom, callees above. Width is proportional to the percentage of samples that included that frame in the call stack. The top-most wide frames are the actual hot spots.

Patterns to recognise:

Wide flat bar at the top — a leaf function consuming significant CPU. Investigate whether it can be optimised directly (algorithm, data structure choice) or eliminated (caching, avoiding the call).

Wide bar with many narrow children — a function that spends time distributed across many callees. No single child is dominant; the function itself may be doing overhead work.

Deep narrow tower — a long call chain that is individually fast. Usually indicates overhead from indirection (dynamic dispatch, many small function calls). #[inline] or refactoring may help.

[unknown] frames — samples from code without debug symbols (runtime, system libraries). Usually not actionable. Can often be reduced with DWARF-based stack unwinding (perf record --call-graph dwarf) or by installing debug symbols for the libraries involved.

perf stat: Hardware Counter Snapshot

Before generating a flamegraph, perf stat gives a quick diagnostic of what kind of bottleneck you have:

perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses \
    ./target/release/meridian-processor --frames 100000

Output:

 Performance counter stats for './target/release/meridian-processor':

      4,521,847,032      cycles
      6,234,891,045      instructions          #    1.38  insn per cycle
         12,847,334      cache-misses          #    8.23% of all cache refs
        156,234,123      cache-references
          2,341,234      branch-misses         #    0.21% of all branches

Instructions per cycle (IPC): 1.38 is moderate. Modern CPUs can sustain 3–4 IPC. Low IPC (< 1.5) suggests the processor is stalling — often on memory latency (cache misses) or branch mispredictions.

Cache miss rate: 8.23% is high. Typically < 1% is good. High cache miss rates point to the data layout problems covered in Module 5 — large structs, poor locality, random access patterns.

Branch miss rate: 0.21% is normal. > 5% suggests unpredictable branches — sorting or using branchless comparisons may help.
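
As a minimal illustration of the branchless option, a data-dependent branch can often be replaced by consuming the comparison as an integer — a sketch of the idea, not a drop-in fix:

fn count_in_band(altitudes: &[f64], min_km: f64, max_km: f64) -> usize {
    altitudes
        .iter()
        // The bool is converted to 0 or 1 and summed — nothing for the
        // branch predictor to mispredict on random data.
        .map(|&a| usize::from(a >= min_km && a <= max_km))
        .sum()
}

fn main() {
    let alts = [398.0, 412.5, 447.0, 461.2];
    println!("{} objects in band", count_in_band(&alts, 400.0, 450.0));
}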

cargo-flamegraph in Practice

// Example: a function with a deliberately inefficient hot path
// to demonstrate profiling workflow.

fn find_conjunctions_naive(
    altitudes: &[f64],
    norad_ids: &[u32],
    threshold_km: f64,
) -> Vec<(u32, u32)> {
    let mut alerts = Vec::new();
    let n = altitudes.len();
    for i in 0..n {
        for j in (i + 1)..n {
            // This inner loop is O(n²) — will show as wide in a flamegraph.
            // The call to f64::abs() will likely appear as a hot child.
            if (altitudes[i] - altitudes[j]).abs() < threshold_km {
                alerts.push((norad_ids[i], norad_ids[j]));
            }
        }
    }
    alerts
}

fn main() {
    // Simulate workload for profiling.
    let n = 5_000;
    let altitudes: Vec<f64> = (0..n).map(|i| 400.0 + (i as f64) * 0.1).collect();
    let norad_ids: Vec<u32> = (0..n as u32).collect();

    let alerts = find_conjunctions_naive(&altitudes, &norad_ids, 2.0);
    println!("{} conjunction alerts", alerts.len());
}

In a flamegraph of this code, find_conjunctions_naive will be wide at the top (O(n²) iterations), with the subtraction and abs() call visible as the actual hot operations. The outer loop iteration overhead and the Vec::push for matches will also appear.

The flamegraph makes it immediately obvious: the inner loop is the hot path. The fix — using a sort + linear scan instead of O(n²) comparison — is visible from the profile before reading a single line of source.

Annotating Hot Functions with #[inline(never)]

By default, the compiler inlines small functions, which is good for performance but bad for profiling — inlined calls disappear into their callers in the flamegraph. For functions you specifically want to measure in isolation:

// Prevents inlining — this function will appear as a distinct frame in the flamegraph.
// Remove before production use if inlining is desired for performance.
#[inline(never)]
fn compute_altitude_delta(a: f64, b: f64) -> f64 {
    (a - b).abs()
}

fn main() {
    // In a flamegraph, compute_altitude_delta will appear as its own frame,
    // making it easy to see exactly how much time the subtraction + abs costs.
    let result = compute_altitude_delta(410.0, 408.5);
    println!("{}", result);
}

Use #[inline(never)] temporarily during profiling investigations. Remove it afterward — the compiler's inlining decisions are generally correct for production code.


Code Examples

A Profiling-Instrumented Processor Binary

The entry point for profiling runs a realistic workload of sufficient duration for the sampler to collect meaningful data. Too short (< 1 second) and there are too few samples for a reliable flamegraph.

use std::collections::HashSet;
use std::hint::black_box;
use std::time::Instant;

fn build_test_data(n: usize) -> (Vec<u64>, Vec<(u32, u64)>) {
    let timestamps: Vec<u64> = (0..n as u64).rev().collect();
    let headers: Vec<(u32, u64)> = (0..n)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();
    (timestamps, headers)
}

#[inline(never)] // Visible as its own frame in flamegraph
fn dedup_pass(headers: &[(u32, u64)]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers.iter().enumerate()
        .filter_map(|(i, &(sat, seq))| {
            if seen.insert((sat, seq)) { Some(i) } else { None }
        })
        .collect()
}

#[inline(never)] // Visible as its own frame in flamegraph
fn sort_pass(indices: &mut Vec<usize>, timestamps: &[u64]) {
    indices.sort_unstable_by_key(|&i| timestamps[i]);
}

fn process_batch(timestamps: &[u64], headers: &[(u32, u64)]) -> usize {
    let mut indices = dedup_pass(headers);
    sort_pass(&mut indices, timestamps);
    indices.len()
}

fn main() {
    // Run enough iterations for perf to collect ~1000+ samples.
    // At 99 Hz sampling, we need ~10 seconds of CPU time.
    let (timestamps, headers) = build_test_data(10_000);
    let batches = 50_000;

    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..batches {
        total += black_box(process_batch(
            black_box(&timestamps),
            black_box(&headers),
        ));
    }
    let elapsed = start.elapsed();

    println!("processed {} batches, {} unique frames", batches, total);
    println!("throughput: {:.0} batches/sec", batches as f64 / elapsed.as_secs_f64());
    println!("wall time:  {:.2?}", elapsed);
}

The #[inline(never)] attributes on dedup_pass and sort_pass ensure they appear as distinct frames in the flamegraph. The black_box calls prevent dead-code elimination from interfering with the profiling workload. The loop runs long enough to collect statistically meaningful samples.


Key Takeaways

  • Profile release builds with debug symbols (debug = true in [profile.release]). Profiling debug builds measures overhead that does not exist in production.

  • perf stat provides a hardware counter snapshot before you generate a flamegraph. High cache miss rate (> 5%) points to data layout issues; low IPC (< 1.5) suggests the processor is stalling on memory; high branch miss rate suggests unpredictable conditionals.

  • In a flamegraph, width encodes time. Wide frames at the top are hot leaf functions — the actual bottleneck. Wide frames with narrow children indicate distributed overhead. Deep narrow towers indicate fast call chains, not hot spots.

  • #[inline(never)] temporarily prevents a function from being inlined so it appears as a distinct frame in the profiler. Remove it after the investigation — inlining is correct for production code.

  • A profiling session should last at least 5–10 seconds of CPU time for reliable sample counts at 99 Hz. Use a workload that resembles production access patterns at production input sizes.


Lesson 3 — Memory Profiling: Heap Allocation Tracking and Reducing Allocator Pressure

Module: Foundation — M06: Performance & Profiling
Position: Lesson 3 of 3
Source: Synthesized from training knowledge (dhat, heaptrack, jemalloc statistics, custom allocator wrappers)

Source note: This lesson synthesizes from dhat (Valgrind/DHAT profiler), heaptrack documentation, and allocator counting patterns. Verify dhat-rs API against the current crate version.



Context

The CPU flamegraph from Lesson 2 shows the telemetry processor spending 18% of time in malloc and free. The criterion benchmark from Lesson 1 confirms: 12.5µs per 1000-frame batch, 2.3µs of which is allocator overhead. The fix from Module 5 — arena allocation — eliminates this. But before implementing it, you need to know: exactly how many allocations happen per batch? Which call sites are responsible? Are there unexpected allocations from library code that you assumed was allocation-free?

Memory profiling answers these questions. Unlike CPU profiling (which samples stochastically), allocation profiling intercepts every alloc and dealloc call — giving you exact counts, sizes, and call sites. The tools: dhat for lightweight in-process counting, heaptrack for comprehensive heap timeline recording, and a custom counting allocator for targeted measurements in CI.


Core Concepts

The Allocation Cost Model

Every Vec::new(), Box::new(), String::from(), and collection growth hits the global allocator. The actual cost depends on the allocator (jemalloc is faster than the system allocator for concurrent workloads), the allocation size (small allocations have higher per-byte overhead), and contention (the global allocator serialises concurrent allocations internally).
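
Swapping the global allocator is a one-line change when you want to measure the allocator's contribution directly. A sketch using the tikv-jemallocator crate (crate name and version are assumptions — verify against crates.io):

# Cargo.toml
[dependencies]
tikv-jemallocator = "0.5"

use tikv_jemallocator::Jemalloc;

// Route every heap allocation in this binary through jemalloc rather
// than the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let v: Vec<u64> = (0..1_000).collect(); // Served by jemalloc.
    println!("{} elements", v.len());
}

Benchmark the workload under both allocators; if throughput moves materially, allocator pressure — not the algorithm — is the bottleneck.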

Profiling allocation patterns reveals three categories of allocations:

Long-lived allocations — startup config, connection state, per-session data structures. These are unavoidable and not a throughput problem.

Per-batch allocations — temporary buffers, work vectors, accumulators that are created and freed within one processing epoch. These are the target of arena allocation — eliminate them with pre-allocation.

Unexpected allocations — library calls that allocate internally even though the API looks allocation-free. format!(), HashMap::new(), Vec::collect() when the iterator doesn't know its size. These show up in memory profiles and are often surprising.

Counting Allocations: The Simplest Approach

Before reaching for a full memory profiler, a counting allocator wrapper tells you exactly how many allocations occur in a specific code path. This works in any environment and imposes very low overhead:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Wraps the system allocator and counts every alloc/dealloc.
struct CountingAllocator;

static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
static DEALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
static ALLOC_BYTES: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

fn snapshot() -> (u64, u64, u64) {
    (
        ALLOC_COUNT.load(Ordering::Relaxed),
        DEALLOC_COUNT.load(Ordering::Relaxed),
        ALLOC_BYTES.load(Ordering::Relaxed),
    )
}

fn reset_counters() {
    ALLOC_COUNT.store(0, Ordering::Relaxed);
    DEALLOC_COUNT.store(0, Ordering::Relaxed);
    ALLOC_BYTES.store(0, Ordering::Relaxed);
}

// --- Application code under test ---

fn process_frames(frames: &[Vec<u8>]) -> usize {
    // Allocates a HashSet internally.
    let mut seen = std::collections::HashSet::new();
    frames.iter().filter(|f| seen.insert(f.as_ptr())).count()
}

fn main() {
    let frames: Vec<Vec<u8>> = (0..100).map(|i| vec![i as u8; 256]).collect();

    // Reset — we only want to count allocations from process_frames.
    reset_counters();

    let result = process_frames(&frames);

    let (allocs, deallocs, bytes) = snapshot();
    println!("process_frames({}) result: {}", frames.len(), result);
    println!("  allocations:   {allocs}");
    println!("  deallocations: {deallocs}");
    println!("  bytes:         {bytes}");
}

The output reveals exactly how many times the global allocator was called inside process_frames. If the count is non-zero when it should be zero (the function is supposed to be allocation-free), you have a hidden allocation to hunt down.

Common Hidden Allocation Sources

HashSet::new() and HashMap::new() — these start empty (no allocation) but allocate on the first insert, then reallocate and rehash each time the capacity is exceeded. HashSet::with_capacity(n) pre-allocates for n elements, eliminating both the first allocation and the growth steps from the hot path.

Vec::collect() without a size hint — collect sizes the new Vec from the iterator's size_hint() lower bound; chained adapters like filter report a lower bound of zero, so the Vec starts small and grows (allocating) as elements arrive. When you know the final size, prefer Vec::with_capacity(n) followed by .extend(iter).

format!() and string operations — every format! call allocates a String. In hot paths, prefer writing to a pre-allocated String with write! or push_str, or avoid String entirely in favour of a stack buffer.

Arc::new() is not free — cloning an Arc does not allocate (it only increments a reference count), but Arc::new() allocates the shared control block. In a hot path, pre-create the Arc at batch setup time rather than per-frame.

Iterator adapters that buffer — .sorted() from itertools collects into a Vec before yielding; .chunks() from itertools buffers internally. Check whether an adapter is allocation-free before using it in a hot path.
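
Two of these fixes are mechanical once the profile points at them. A sketch contrasting the allocating and reuse-based forms of string formatting (function names are illustrative):

use std::fmt::Write;

// Allocating form: one fresh String per call.
fn label_allocating(sat: u32, seq: u64) -> String {
    format!("sat={sat} seq={seq}")
}

// Reuse-based form: write into a caller-owned buffer.
fn label_reused(buf: &mut String, sat: u32, seq: u64) {
    buf.clear(); // Retains capacity from previous calls.
    // write! to a String is infallible; unwrap is safe here.
    write!(buf, "sat={sat} seq={seq}").unwrap();
}

fn main() {
    let mut buf = String::with_capacity(64);
    for i in 0..3u64 {
        label_reused(&mut buf, (i % 48) as u32, i); // No allocation after warm-up.
        println!("{buf}");
    }
    println!("{}", label_allocating(0, 0)); // Allocates every call.
}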

dhat: In-Process Heap Profiling

dhat (from Valgrind, with a Rust port via the dhat crate) instruments every allocation with a call-site stack trace. It produces a profile that shows, for each allocation site, the total bytes allocated, the peak live bytes, and the number of calls:

# Cargo.toml
[dependencies]
dhat = { version = "0.3", optional = true }

[features]
dhat-heap = ["dhat"]
// In main.rs — only active when the dhat-heap feature is enabled.
// cfg gate prevents any overhead in production builds.
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // ... run workload ...
    println!("dhat profile written on drop of _profiler");
}

Run with: cargo run --features dhat-heap. At program exit, dhat writes dhat-heap.json. View it at https://nnethercote.github.io/dh_view/dh_view.html.

The profile shows total bytes allocated per call site — letting you immediately identify which function is responsible for most allocations, even if that function is inside a library you did not write.

Reducing Allocator Pressure: Patterns

Pre-allocate with with_capacity:

fn process_batch_optimised(n: usize) -> Vec<usize> {
    // Pre-allocate with known capacity — no reallocation on push.
    let mut result = Vec::with_capacity(n);
    let mut seen = std::collections::HashSet::with_capacity(n);

    for i in 0..n {
        if seen.insert(i % (n / 2)) {  // ~50% are unique
            result.push(i);
        }
    }
    result
}

fn main() {
    let batch = process_batch_optimised(10_000);
    println!("{} unique items", batch.len());
}

Reuse allocations across calls with clear() instead of dropping and reallocating:

struct FrameProcessor {
    // Persistent buffers — allocated once, reused every batch.
    seen:    std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}

impl FrameProcessor {
    fn new(expected_batch_size: usize) -> Self {
        Self {
            seen:    std::collections::HashSet::with_capacity(expected_batch_size),
            indices: Vec::with_capacity(expected_batch_size),
        }
    }

    fn process(&mut self, headers: &[(u32, u64)]) -> &[usize] {
        // clear() retains the allocation — no new malloc per batch.
        self.seen.clear();
        self.indices.clear();

        for (i, &(sat, seq)) in headers.iter().enumerate() {
            if self.seen.insert((sat, seq)) {
                self.indices.push(i);
            }
        }
        &self.indices
    }
}

fn main() {
    let mut processor = FrameProcessor::new(1000);
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();

    for batch_num in 0..5 {
        let unique = processor.process(&headers);
        println!("batch {batch_num}: {} unique frames", unique.len());
    }
}

The FrameProcessor struct holds the HashSet and Vec across batch calls. Each batch calls clear() — which sets the length to zero but retains the allocated capacity. After the first batch warms up the allocation, subsequent batches make zero allocator calls for these data structures.


Code Examples

Measuring Allocations Per Batch in CI

Embedding an allocation count assertion in CI ensures that future refactors do not accidentally reintroduce per-frame allocations:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

struct CountingAllocator;
static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

// --- Frame processor under test ---

struct Processor {
    seen:    std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}

impl Processor {
    fn new(cap: usize) -> Self {
        Self {
            seen:    std::collections::HashSet::with_capacity(cap),
            indices: Vec::with_capacity(cap),
        }
    }

    fn process_batch(&mut self, headers: &[(u32, u64)]) -> usize {
        self.seen.clear();
        self.indices.clear();
        for (i, &key) in headers.iter().enumerate() {
            if self.seen.insert(key) {
                self.indices.push(i);
            }
        }
        self.indices.len()
    }
}

fn main() {
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();

    let mut processor = Processor::new(1000);

    // Warm up — first batch may allocate as HashSet grows.
    processor.process_batch(&headers);

    // Reset counter — subsequent batches should be allocation-free.
    ALLOC_COUNT.store(0, Relaxed);

    // Run 100 batches.
    for _ in 0..100 {
        std::hint::black_box(processor.process_batch(std::hint::black_box(&headers)));
    }

    let allocs = ALLOC_COUNT.load(Relaxed);
    println!("allocations across 100 batches: {allocs}");

    // In CI: assert!(allocs == 0, "unexpected allocations in hot path: {allocs}");
    if allocs == 0 {
        println!("PASS: hot path is allocation-free after warm-up");
    } else {
        println!("WARN: {allocs} unexpected allocations detected");
    }
}

The pattern: warm up once (let pre-allocated capacity fill), reset the counter, then assert zero allocations across subsequent batches. This assertion in CI will fail the build if any refactor introduces a hidden allocation.
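
The same check can live as an integration test rather than a bespoke binary. A sketch, assuming the crate under test exports the counting allocator, its counter, and the Processor (mycrate and the re-export names are placeholders), run single-threaded so no other test allocates concurrently:

// tests/alloc_regression.rs — hypothetical path and imports.
// Run with: cargo test --release -- --test-threads=1
use mycrate::{CountingAllocator, Processor, ALLOC_COUNT};
use std::sync::atomic::Ordering::Relaxed;

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

#[test]
fn hot_path_is_allocation_free_after_warmup() {
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 3) as u64))
        .collect();
    let mut processor = Processor::new(1000);

    // Warm up: the first batch fills the pre-allocated capacity.
    processor.process_batch(&headers);

    ALLOC_COUNT.store(0, Relaxed);
    for _ in 0..100 {
        std::hint::black_box(processor.process_batch(std::hint::black_box(&headers)));
    }
    assert_eq!(ALLOC_COUNT.load(Relaxed), 0, "hot path allocated after warm-up");
}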


Key Takeaways

  • Memory profiling reveals the call sites responsible for allocations, total bytes allocated per site, and peak live bytes. dhat (via the dhat crate) provides this with minimal production overhead when gated behind a feature flag.

  • A counting allocator wrapper (#[global_allocator] with atomic counters) is the fastest way to count allocations in a specific code path. Use it to establish a baseline, then assert zero allocations in CI for hot paths.

  • HashSet::with_capacity(n) and Vec::with_capacity(n) pre-allocate to avoid grow-and-rehash allocations. If you know the expected size, always use with_capacity.

  • clear() retains the underlying allocation. Use it to reuse Vec and HashMap buffers across batches rather than dropping and reallocating each time.

  • Common hidden allocation sources: format!(), HashMap::new() without capacity, Vec::collect() on unsized iterators, iterator adapters that buffer internally (.sorted(), .chunks() on non-slice iterators), and Arc::new() in a per-frame code path.

  • Profile allocations before optimising. The counting allocator tells you how many allocations happen. The flamegraph from Lesson 2 tells you where time is spent. Together they give a complete picture: is the bottleneck the allocation count, the allocator latency, or the subsequent memory access pattern?


Project — Meridian Control Plane Performance Audit

Module: Foundation — M06: Performance & Profiling
Prerequisite: All three module quizzes passed (≥70%)



Mission Brief

TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0058 — Control Plane Performance Audit and Remediation


The telemetry processor built in Module 5 achieves 100,000 frames per second in isolation. When integrated with the full control plane pipeline — ground station TCP ingress, deduplication, sort, downstream forwarding — the integrated system runs at 71,000 frames per second, 29% below target.

Your task is to conduct a structured performance audit of the integrated pipeline, identify the bottleneck using the tools from this module, implement a targeted fix with measurable improvement, and document the result.


Pipeline Under Audit

The pipeline processes frames through four stages:

[TCP Ingress] → [Validator] → [Deduplicator] → [Forwarder]

Each stage has a measurable input and output rate. Profiling tools tell you which stage is the bottleneck and which specific function within that stage consumes the most CPU.


Audit Procedure

Phase 1: Establish a Baseline with criterion

Write a criterion benchmark for the full pipeline (not just the processor). Measure:

  • Frames per second through the complete pipeline
  • Per-stage latency breakdown (validator, deduplicator, forwarder separately)
  • Memory allocation count per batch (using a counting allocator)

The baseline establishes the starting point. Every fix must demonstrate measurable improvement against this baseline — not just "it felt faster".

Phase 2: CPU Profile with flamegraph

Run cargo flamegraph on the pipeline binary for 30 seconds under sustained load. Identify:

  • Which stage occupies the most flamegraph width
  • Which function within that stage is the hot leaf
  • Whether the flamegraph shows malloc/free as significant contributors

Phase 3: Memory Profile with a Counting Allocator

Integrate the counting allocator from Lesson 3. For each batch of 1,000 frames:

  • Count total allocations per batch
  • Count allocations per stage (reset/snapshot around each stage)
  • Identify which stage is responsible for the most allocations

Phase 4: Implement and Measure a Fix

Based on the profiling findings, implement the highest-impact fix. Typical candidates:

  • Replace Vec::new() in the deduplicator with a reused buffer (clear() pattern)
  • Replace HashMap::new() with HashMap::with_capacity(batch_size)
  • Replace format!() in the validator with a pre-allocated error buffer
  • Apply arena allocation for payloads that were missed in Module 5

Re-run the criterion benchmark. Document the before/after comparison.


Expected Output

A workspace with:

  1. A meridian-pipeline binary crate implementing the four-stage pipeline
  2. A benches/pipeline.rs criterion benchmark measuring the full pipeline and each stage
  3. An audit.md document recording:
    • Baseline criterion output (copy from terminal)
    • Flamegraph findings (which function was the hot path)
    • Allocation counts per stage per batch (from counting allocator)
    • The fix implemented
    • Post-fix criterion output showing improvement
    • criterion's statistical significance output (p-value)

Acceptance Criteria

#   Criterion                                                                           Verifiable
1   criterion benchmark runs and produces confidence intervals for the full pipeline    Yes — cargo bench output
2   black_box applied correctly — input and output both wrapped                         Yes — code review
3   Test data built outside the criterion closure, not inside                           Yes — code review
4   Flamegraph generated for a ≥ 30-second profiling run                                Yes — flamegraph.svg present
5   Allocation counts per stage documented in audit.md                                  Yes — numbers in the document
6   At least one measurable fix implemented and documented with before/after timing     Yes — audit.md
7   criterion reports p < 0.05 for the improvement (statistically significant)          Yes — criterion output in audit.md

Hints

Hint 1 — Criterion benchmark structure
// benches/pipeline.rs
// build_test_headers and run_pipeline are your pipeline's own entry points.
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_pipeline(c: &mut Criterion) {
    let mut group = c.benchmark_group("pipeline");

    for batch_size in [100, 500, 1000, 5000].iter() {
        let headers = build_test_headers(*batch_size);

        group.bench_with_input(
            BenchmarkId::new("full", batch_size),
            batch_size,
            |b, _| {
                b.iter(|| {
                    black_box(run_pipeline(black_box(&headers)))
                })
            },
        );
    }
    group.finish();
}

criterion_group!(benches, bench_pipeline);
criterion_main!(benches);
Hint 2 — Per-stage allocation counting
// Reset counter, run stage, snapshot:
ALLOC_COUNT.store(0, Ordering::Relaxed);
let result = run_validator(black_box(&frames));
let validator_allocs = ALLOC_COUNT.load(Ordering::Relaxed);

ALLOC_COUNT.store(0, Ordering::Relaxed);
let deduped = run_deduplicator(black_box(&result));
let dedup_allocs = ALLOC_COUNT.load(Ordering::Relaxed);

println!("validator: {validator_allocs} allocs/batch");
println!("deduplicator: {dedup_allocs} allocs/batch");
Hint 3 — Reusing buffers between batches

If the deduplicator creates a new HashSet each batch, convert it to a persistent struct:

pub struct Deduplicator {
    seen: std::collections::HashSet<(u32, u64)>,
    unique_indices: Vec<usize>,
}

impl Deduplicator {
    pub fn new(expected_batch: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(expected_batch),
            unique_indices: Vec::with_capacity(expected_batch),
        }
    }

    pub fn process(&mut self, headers: &[(u32, u64)]) -> &[usize] {
        self.seen.clear();           // Retains allocation.
        self.unique_indices.clear(); // Retains allocation.
        for (i, &key) in headers.iter().enumerate() {
            if self.seen.insert(key) {
                self.unique_indices.push(i);
            }
        }
        &self.unique_indices
    }
}
Hint 4 — Flamegraph build configuration

Add to Cargo.toml:

[profile.release]
debug = true

[profile.profiling]
inherits = "release"
debug = true

Build and profile:

cargo build --profile profiling
cargo flamegraph --profile profiling --bin meridian-pipeline -- \
    --duration 30 --batch-size 1000

If cargo flamegraph is not installed: cargo install flamegraph. Requires perf on Linux or dtrace on macOS.


Reference Implementation

// src/main.rs — pipeline implementation for profiling
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::hint::black_box;
use std::time::Instant;

// --- Counting allocator ---

struct CountingAllocator;
static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}
#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

// --- Pipeline stages ---

#[inline(never)]
fn validate(headers: &[(u32, u64, u8)]) -> Vec<(u32, u64)> {
    headers.iter()
        .filter(|&&(_, _, flags)| flags & 0x80 == 0)
        .map(|&(sat, seq, _)| (sat, seq))
        .collect()
}

pub struct Deduplicator {
    seen:    std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}

impl Deduplicator {
    pub fn new(cap: usize) -> Self {
        Self {
            seen:    std::collections::HashSet::with_capacity(cap),
            indices: Vec::with_capacity(cap),
        }
    }

    #[inline(never)]
    pub fn process(&mut self, valid: &[(u32, u64)]) -> &[usize] {
        self.seen.clear();
        self.indices.clear();
        for (i, &key) in valid.iter().enumerate() {
            if self.seen.insert(key) { self.indices.push(i); }
        }
        &self.indices
    }
}

#[inline(never)]
fn forward(valid: &[(u32, u64)], unique: &[usize]) -> usize {
    unique.iter().map(|&i| valid[i].0 as usize).sum()
}

fn run_pipeline(
    headers: &[(u32, u64, u8)],
    dedup: &mut Deduplicator,
) -> usize {
    let valid = validate(headers);
    let unique = dedup.process(&valid).to_vec();
    forward(&valid, &unique)
}

fn main() {
    let batch_size = 1_000usize;
    let headers: Vec<(u32, u64, u8)> = (0..batch_size)
        .map(|i| ((i % 48) as u32, (i / 3) as u64, 0u8))
        .collect();

    let mut dedup = Deduplicator::new(batch_size);

    // Warm up.
    for _ in 0..10 { run_pipeline(&headers, &mut dedup); }

    // Measure allocations per batch.
    ALLOC_COUNT.store(0, Relaxed);
    for _ in 0..1000 {
        black_box(run_pipeline(black_box(&headers), &mut dedup));
    }
    let allocs = ALLOC_COUNT.load(Relaxed);
    println!("allocs across 1000 batches: {allocs}");
    println!("allocs per batch: {:.1}", allocs as f64 / 1000.0);

    // Throughput measurement.
    let batches = 100_000u32;
    let start = Instant::now();
    for _ in 0..batches {
        black_box(run_pipeline(black_box(&headers), &mut dedup));
    }
    let elapsed = start.elapsed();
    let fps = (batches as usize * batch_size) as f64 / elapsed.as_secs_f64();
    println!("throughput: {:.0} frames/sec", fps);
    println!("elapsed: {:.2?}", elapsed);
}

Reflection

The audit methodology in this project — baseline, profile, identify, fix, verify — is the standard performance engineering workflow. The workflow is the skill, not the specific tools. perf and flamegraph will be replaced by better tools; the habit of measuring before and after, asserting statistical significance, and documenting findings will not.

The counting allocator CI assertion from Lesson 3 is the instrument that keeps the improvements from this module from being silently regressed six months from now. Every performance optimisation needs a regression test. For throughput, that test is a criterion baseline stored in target/criterion. For allocation-freedom, it is an assert_eq!(allocs, 0) assertion in the CI pipeline.

With Module 6 complete, the full Foundation track is done. Every capability the control plane relies on — async scheduling, concurrency primitives, message passing, networking, data layout, and performance measurement — is now in your toolkit. The track-specific modules (Database Internals, Data Pipelines, Data Lakes, Distributed Systems) build directly on this foundation.

Module 01 — Storage Engine Fundamentals

Track: Database Internals — Orbital Object Registry
Position: Module 1 of 6
Source material: Database Internals — Alex Petrov, Chapters 1–4; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

SDA INCIDENT REPORT — OOR-2026-0041
Classification: OPERATIONAL DEFICIENCY
Subject: TLE index query latency exceeding conjunction avoidance SLA

ESA's Space Surveillance and Tracking (SST) division has notified Meridian Space Systems that our Two-Line Element (TLE) index cannot scale past 100,000 tracked orbital objects. Current architecture stores TLE records as serialized JSON blobs in PostgreSQL — every conjunction query triggers a full table scan. With the post-fragmentation debris field projected to add 12,000 new objects this quarter, the system will exceed the 500ms conjunction query SLA within 60 days.

Directive: Build a purpose-built storage engine for the Orbital Object Registry. Start with the lowest layer — how bytes hit disk and come back.

This module establishes the foundational layer of the Orbital Object Registry storage engine. Before you can index, query, or recover data, you need a reliable on-disk format and an efficient way to move pages between disk and memory. Every decision made here — page size, record layout, eviction policy — propagates upward through the entire engine.


Learning Outcomes

After completing this module, you will be able to:

  1. Design a fixed-size page format with headers, magic bytes, and checksums for integrity verification
  2. Implement a buffer pool that caches hot pages in memory and evicts cold pages using LRU or CLOCK policies
  3. Explain why random I/O is the dominant cost in storage engines and how page-aligned access patterns reduce it
  4. Implement a slotted page layout that supports variable-length records with in-page compaction
  5. Reason about the tradeoffs between page size, I/O amplification, and internal fragmentation
  6. Map TLE records to a binary page format suitable for the Orbital Object Registry

Lesson Summary

Lesson 1 — File Formats and Page Layout

How storage engines organize bytes on disk. Fixed-size pages, headers, magic bytes, and the page abstraction that separates logical records from physical storage. Why 4KB or 8KB pages align with OS and hardware boundaries.

Key question: Why do storage engines use fixed-size pages instead of variable-length records written sequentially?

Lesson 2 — Buffer Pool Management

The page cache that sits between the storage engine and the OS. LRU and CLOCK eviction policies, page pinning, dirty page tracking, and the flush protocol. Why the buffer pool exists even though the OS has its own page cache.

Key question: When should a storage engine bypass the OS page cache and manage its own buffer pool?

Lesson 3 — Slotted Pages

How to store variable-length records within a fixed-size page. The slot array, free space pointer, and in-page compaction. How deletions create fragmentation and how the engine reclaims space without rewriting the entire page.

Key question: How does a slotted page maintain stable record identifiers when records are moved during compaction?


Capstone Project — TLE Record Page Manager

Build a page manager that reads and writes orbital TLE records to a custom binary page format backed by a simple buffer pool. The page manager must support insert, lookup by slot, delete, and page-level compaction. Acceptance criteria and the full project brief are in project-tle-page-manager.md.


File Index

module-01-storage-engine-fundamentals/
├── README.md                          ← this file
├── lesson-01-page-layout.md           ← File formats and page layout
├── lesson-01-quiz.toml                ← Quiz (5 questions)
├── lesson-02-buffer-pool.md           ← Buffer pool management
├── lesson-02-quiz.toml                ← Quiz (5 questions)
├── lesson-03-slotted-pages.md         ← Slotted pages
├── lesson-03-quiz.toml                ← Quiz (5 questions)
└── project-tle-page-manager.md        ← Capstone project brief

Prerequisites

  • Foundation Track completed (all 6 modules)
  • Familiarity with std::fs::File, Read, Write, Seek traits
  • Basic understanding of how operating systems manage file I/O

What Comes Next

Module 2 (B-Tree Index Structures) builds on the page abstraction from this module. The B-tree nodes you implement in Module 2 are stored in the pages you design here. The buffer pool you build here is the same buffer pool that serves page requests for the B-tree and, later, the LSM engine.

Lesson 1 — File Formats and Page Layout

Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 1–3; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3

Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: exact page header field sizes in Petrov Ch. 3, the magic byte conventions across SQLite/InnoDB/RocksDB, and Petrov's specific framing of the page abstraction layer.



Context

Every storage engine eventually answers the same question: how do bytes get from memory to disk and back? The answer is almost never "write them sequentially and hope for the best." Sequential writes are fast, but random reads against an unstructured file are catastrophic — seeking to an arbitrary byte offset in a 10GB file on a spinning disk costs 5–10ms per seek. Even on SSDs, random 512-byte reads are an order of magnitude slower than reading aligned 4KB blocks.

The solution, used by virtually every production storage engine from SQLite to RocksDB to PostgreSQL, is the page abstraction: divide the file into fixed-size blocks (pages), give each page a numeric identifier, and build all higher-level structures — indices, records, free lists — on top of this uniform unit. Pages align with the OS virtual memory system (typically 4KB) and the storage device's block size, which means reads and writes hit the hardware at its natural granularity.

For the Orbital Object Registry, each TLE record is approximately 140 bytes (two 69-character lines plus metadata). A single 4KB page can hold roughly 25–28 TLE records. With 100,000 tracked objects, the entire catalog fits in approximately 4,000 pages — about 16MB. The page format you design in this lesson is the physical foundation that every subsequent module builds on.


Core Concepts

The Page Abstraction

A page is a fixed-size block of bytes — the atomic unit of I/O in a storage engine. The engine never reads or writes less than one page. This constraint seems wasteful (reading 4KB to retrieve a 140-byte TLE record), but it aligns with how hardware and operating systems actually work:

  • Disk drives read and write in sectors (512 bytes or 4KB for modern drives). Reading 1 byte costs the same as reading 4KB — the drive fetches the entire sector regardless.
  • The OS page cache manages memory in 4KB pages. A storage engine that uses the same page size gets free alignment with the kernel's caching layer.
  • mmap and direct I/O both operate on page-aligned boundaries. Misaligned reads require the kernel to fetch extra pages and copy the relevant bytes — an unnecessary overhead.

Common page sizes: 4KB (SQLite default), 8KB (PostgreSQL default, compile-time configurable), 16KB (InnoDB default), 64KB (some OLAP systems). Larger pages reduce the number of I/O operations for sequential scans but increase I/O amplification for point lookups (you read 64KB to get 140 bytes). The Orbital Object Registry uses 4KB pages — the catalog is small enough that point lookup amplification matters more than scan throughput.

Page Layout

Every page begins with a header that identifies the page and describes its contents. The header is the first thing the engine reads after loading a page from disk, and it must contain enough information to interpret the rest of the page without external context.

A minimal page header contains:

Field               Size      Purpose
Magic bytes         4 bytes   Identifies this as a valid OOR page (e.g., 0x4F4F5231 = "OOR1")
Page ID             4 bytes   Unique identifier for this page within the file
Page type           1 byte    Discriminant: data page, index page, overflow page, free page
Record count        2 bytes   Number of active records in this page
Free space offset   2 bytes   Byte offset where free space begins
Checksum            4 bytes   CRC32 of the page body for integrity verification

Total header: 17 bytes. The remaining 4,079 bytes (in a 4KB page) are available for records.

Magic bytes serve two purposes: they let the engine detect corrupted or misidentified files on open (if the first 4 bytes of the file aren't OOR1, this isn't an OOR database), and they enable file-level identification by external tools (file command, hex editors). Production systems like SQLite use SQLite format 3\000 as the first 16 bytes of the file header.

Checksums detect bit rot and partial writes. A page whose checksum doesn't match its body was either corrupted on disk or partially written during a crash. The engine must reject it and attempt recovery from the WAL (Module 4). CRC32 is standard; some engines use xxHash for speed or SHA-256 for cryptographic integrity.

File Organization

The database file is a contiguous sequence of pages. Page 0 is typically a file header page that stores metadata: database version, page size, total page count, pointer to the free list head, and engine configuration. Pages 1 through N hold data.

┌──────────┬──────────┬──────────┬──────────┬─────┐
│  Page 0  │  Page 1  │  Page 2  │  Page 3  │ ... │
│ (header) │  (data)  │  (data)  │  (free)  │     │
└──────────┴──────────┴──────────┴──────────┴─────┘
     ↑
     File header: version, page size, page count,
     free list head → Page 3

Addressing: Given a page ID and the page size, the byte offset in the file is page_id * page_size. This arithmetic is the reason pages must be fixed-size — variable-size pages would require a separate index to locate each page, adding a layer of indirection to every I/O operation.

Free list management: When a page is deallocated (all records deleted, or a B-tree node is merged), it goes on the free list rather than being returned to the OS. The next allocation request takes a page from the free list before extending the file. This avoids filesystem fragmentation and keeps the file size stable under delete-heavy workloads.
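
A sketch of that allocation path, using the PageFile type defined later in this lesson and a hypothetical convention that a free page stores the next free page ID in its first four body bytes (0 meaning end of list):

fn allocate_from_free_list(
    pf: &mut PageFile,
    free_list_head: &mut Option<u32>,
) -> io::Result<u32> {
    if let Some(free_id) = *free_list_head {
        // Pop the head: read the "next free" pointer out of the free page.
        let mut buf = vec![0u8; PAGE_SIZE];
        pf.read_page(free_id, &mut buf)?;
        let next = u32::from_le_bytes(
            buf[HEADER_SIZE..HEADER_SIZE + 4].try_into().unwrap(),
        );
        *free_list_head = if next == 0 { None } else { Some(next) };
        Ok(free_id)
    } else {
        // Free list empty — extend the file by one zeroed page.
        pf.allocate_page()
    }
}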

Alignment and Direct I/O

When a storage engine bypasses the OS page cache (using O_DIRECT on Linux), all reads and writes must be aligned to the device's block size — typically 512 bytes or 4KB. Misaligned I/O fails with EINVAL. Even when not using direct I/O, aligned access avoids read-modify-write cycles in the kernel's page cache.

In Rust, allocating page-aligned buffers requires care. Vec<u8> does not guarantee alignment beyond the default allocator's alignment (typically 8 or 16 bytes). For direct I/O, you need explicit alignment:

use std::alloc::{alloc, dealloc, Layout};

/// Allocate a page-aligned buffer for direct I/O.
/// Safety: caller must ensure `page_size` is a nonzero power of two, and
/// must free the returned pointer with `dealloc` using the same layout.
fn alloc_aligned_page(page_size: usize) -> *mut u8 {
    let layout = Layout::from_size_align(page_size, page_size)
        .expect("page_size must be a power of two");
    // Safety: layout is valid (non-zero size, power-of-two alignment)
    unsafe { alloc(layout) }
}

Production engines wrap this in a PageBuf type that handles allocation, deallocation, and provides safe access to the underlying bytes.
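
A sketch of such a wrapper — the PageBuf name follows the text above, but the exact API here is illustrative:

use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Owned, page-aligned buffer. Frees itself on drop with the same layout.
struct PageBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl PageBuf {
    fn new(page_size: usize) -> Self {
        let layout = Layout::from_size_align(page_size, page_size)
            .expect("page_size must be a nonzero power of two");
        // Safety: layout has nonzero size and power-of-two alignment.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "aligned page allocation failed");
        Self { ptr, layout }
    }

    fn as_slice(&self) -> &[u8] {
        // Safety: ptr is valid for layout.size() bytes for the life of self.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // Safety: as above; &mut self guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for PageBuf {
    fn drop(&mut self) {
        // Safety: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}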


Code Examples

Defining the Page Format for Orbital TLE Records

The Orbital Object Registry needs a page format that can store TLE records with their associated metadata. This example defines the page header, serialization, and deserialization logic.

use std::io::{self, Read, Write, Seek, SeekFrom};
use std::fs::File;

const PAGE_SIZE: usize = 4096;
const MAGIC: [u8; 4] = [0x4F, 0x4F, 0x52, 0x31]; // "OOR1"
const HEADER_SIZE: usize = 17;

/// Page types in the Orbital Object Registry.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageType {
    FileHeader = 0,
    Data = 1,
    Index = 2,
    Free = 3,
    Overflow = 4,
}

/// Fixed-size page header. Sits at byte 0 of every page.
#[derive(Debug)]
struct PageHeader {
    magic: [u8; 4],
    page_id: u32,
    page_type: PageType,
    record_count: u16,
    free_space_offset: u16,
    checksum: u32,
}

impl PageHeader {
    fn new(page_id: u32, page_type: PageType) -> Self {
        Self {
            magic: MAGIC,
            page_id,
            page_type,
            record_count: 0,
            // Free space starts immediately after the header
            free_space_offset: HEADER_SIZE as u16,
            checksum: 0,
        }
    }

    fn serialize(&self, buf: &mut [u8]) {
        buf[0..4].copy_from_slice(&self.magic);
        buf[4..8].copy_from_slice(&self.page_id.to_le_bytes());
        buf[8] = self.page_type as u8;
        buf[9..11].copy_from_slice(&self.record_count.to_le_bytes());
        buf[11..13].copy_from_slice(&self.free_space_offset.to_le_bytes());
        buf[13..17].copy_from_slice(&self.checksum.to_le_bytes());
    }

    fn deserialize(buf: &[u8]) -> io::Result<Self> {
        if buf[0..4] != MAGIC {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "invalid page magic bytes — not an OOR page",
            ));
        }
        Ok(Self {
            magic: MAGIC,
            page_id: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
            page_type: match buf[8] {
                0 => PageType::FileHeader,
                1 => PageType::Data,
                2 => PageType::Index,
                3 => PageType::Free,
                4 => PageType::Overflow,
                _ => return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "unknown page type discriminant",
                )),
            },
            record_count: u16::from_le_bytes(buf[9..11].try_into().unwrap()),
            free_space_offset: u16::from_le_bytes(buf[11..13].try_into().unwrap()),
            checksum: u32::from_le_bytes(buf[13..17].try_into().unwrap()),
        })
    }
}

Notice that all multi-byte integers use little-endian encoding (to_le_bytes/from_le_bytes). This is a deliberate choice — the engine should produce the same on-disk format regardless of the host architecture. Big-endian is equally valid (and simplifies key comparison in B-trees, as we'll see in Module 2), but you must pick one and enforce it everywhere. Mixing endianness across page types is a subtle bug that survives unit tests and explodes in production.
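
The big-endian aside is easy to verify concretely: byte-wise comparison of big-endian encodings agrees with numeric order, which is why index keys are often stored big-endian.

fn main() {
    let (a, b): (u32, u32) = (1, 256);
    // Big-endian bytes sort the way the numbers do: [0,0,0,1] < [0,0,1,0].
    assert!(a.to_be_bytes() < b.to_be_bytes());
    // Little-endian bytes do not: [1,0,0,0] > [0,1,0,0] lexicographically.
    assert!(a.to_le_bytes() > b.to_le_bytes());
    println!("byte-order ordering checks passed");
}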

Reading and Writing Pages to Disk

The page I/O layer translates between page IDs and file offsets. This is the lowest layer of the storage engine — everything above it thinks in pages, not bytes.

/// Low-level page I/O against the database file.
struct PageFile {
    file: File,
    page_size: usize,
}

impl PageFile {
    fn open(path: &str, page_size: usize) -> io::Result<Self> {
        let file = File::options()
            .read(true)
            .write(true)
            .create(true)
            .open(path)?;
        Ok(Self { file, page_size })
    }

    /// Read a page from disk into the provided buffer.
    /// The buffer must be exactly `page_size` bytes.
    fn read_page(&mut self, page_id: u32, buf: &mut [u8]) -> io::Result<()> {
        assert_eq!(buf.len(), self.page_size);
        let offset = page_id as u64 * self.page_size as u64;
        self.file.seek(SeekFrom::Start(offset))?;
        self.file.read_exact(buf)?;
        Ok(())
    }

    /// Write a page buffer to disk at the correct offset.
    fn write_page(&mut self, page_id: u32, buf: &[u8]) -> io::Result<()> {
        assert_eq!(buf.len(), self.page_size);
        let offset = page_id as u64 * self.page_size as u64;
        self.file.seek(SeekFrom::Start(offset))?;
        self.file.write_all(buf)?;
        // Note: we do NOT fsync here. Durability is the WAL's job (Module 4).
        // Calling fsync on every page write would destroy throughput —
        // a single fsync costs 1-10ms on SSD, 10-30ms on spinning disk.
        Ok(())
    }

    /// Allocate a new page at the end of the file. Returns the new page ID.
    fn allocate_page(&mut self) -> io::Result<u32> {
        let file_len = self.file.seek(SeekFrom::End(0))?;
        let page_id = (file_len / self.page_size as u64) as u32;
        let zeroed = vec![0u8; self.page_size];
        self.file.write_all(&zeroed)?;
        Ok(page_id)
    }
}

Two things to notice: first, read_page uses read_exact, not read. A short read (fewer bytes than page_size) means the file is truncated or corrupted — the engine must not silently accept a partial page. Second, write_page does not call fsync. This is intentional. The WAL (Module 4) provides durability guarantees; the page file relies on the WAL for crash recovery. Calling fsync on every page write would reduce throughput from thousands of pages/second to fewer than 100 on spinning disk.

Computing and Verifying Page Checksums

Every page is checksummed before being written to disk. On read, the checksum is verified before the page contents are trusted. This catches bit rot, partial writes, and storage firmware bugs.

/// CRC32 checksum of the page body (everything after the header).
/// The checksum field itself is excluded from the input, so the result
/// is deterministic regardless of the value currently stored there.
/// Uses the crc32fast crate (add `crc32fast = "1"` to Cargo.toml).
fn compute_checksum(page_buf: &[u8]) -> u32 {
    // Checksum covers bytes HEADER_SIZE..PAGE_SIZE (the body). The header,
    // including the checksum field at bytes 13..17, is excluded.
    let body = &page_buf[HEADER_SIZE..];
    crc32fast::hash(body)
}

fn write_page_with_checksum(
    page_file: &mut PageFile,
    page_id: u32,
    buf: &mut [u8],
) -> io::Result<()> {
    let checksum = compute_checksum(buf);
    buf[13..17].copy_from_slice(&checksum.to_le_bytes());
    page_file.write_page(page_id, buf)
}

fn read_and_verify_page(
    page_file: &mut PageFile,
    page_id: u32,
    buf: &mut [u8],
) -> io::Result<()> {
    page_file.read_page(page_id, buf)?;
    let stored = u32::from_le_bytes(buf[13..17].try_into().unwrap());
    let computed = compute_checksum(buf);
    if stored != computed {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "page {} checksum mismatch: stored={:#010x}, computed={:#010x}",
                page_id, stored, computed
            ),
        ));
    }
    Ok(())
}

The checksum covers only the page body; the header, including the checksum field itself, is excluded. This avoids a circular dependency: you can't include the checksum in the data being checksummed. Some engines (e.g., PostgreSQL) checksum the entire page with the checksum field zeroed before computation — both approaches work, but you must document which one you use.


Key Takeaways

  • The page is the atomic unit of I/O in a storage engine. All reads and writes operate on full pages, never partial pages. This aligns with hardware block sizes and OS page cache granularity.
  • Page size is a tradeoff: larger pages reduce I/O count for scans but increase amplification for point lookups. 4KB is the default for OLTP-style workloads like the Orbital Object Registry.
  • Every page starts with a header containing magic bytes, page ID, type discriminant, and a checksum. The header must be self-describing — the engine should be able to interpret any page without external context.
  • Byte order must be fixed across the entire on-disk format. Pick little-endian or big-endian and enforce it everywhere. Never rely on native endianness.
  • Page writes do not call fsync. Durability is provided by the write-ahead log, not by synchronous page flushes. This is a fundamental architectural decision that separates high-throughput engines from naive implementations.

Lesson 2 — Buffer Pool Management

Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 5; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3

Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific CLOCK algorithm variant, his framing of the buffer pool vs. OS page cache tradeoff, and the exact dirty page flush protocols described in Chapter 5.



Context

The page I/O layer from Lesson 1 reads and writes pages directly to disk. Every read_page call triggers a system call, a disk seek (on HDD) or a flash translation layer lookup (on SSD), and a DMA transfer. For the Orbital Object Registry's conjunction query workload — which repeatedly accesses the same hot set of TLE records during a pass window — going to disk for every page request is unacceptable. A conjunction check against 100 objects would issue 100+ page reads, taking 50–100ms on SSD and over a second on spinning disk.

The buffer pool solves this by caching recently-accessed pages in memory. It sits between the storage engine's upper layers (B-tree, LSM, query processor) and the page I/O layer. When a page is requested, the buffer pool checks whether it's already in memory. If so, it returns a pointer to the cached copy — no disk I/O required. If not, it evicts a cold page to make room, reads the requested page from disk, caches it, and returns it. For a well-tuned buffer pool with a hot working set that fits in memory, the hit rate exceeds 99%, and the storage engine operates almost entirely from RAM.

The buffer pool exists even though the OS already has a page cache. The difference is control: the OS page cache uses a generic LRU policy that doesn't know about access patterns specific to the storage engine (sequential scan flooding, index traversal locality). A purpose-built buffer pool can use workload-aware eviction, pin pages during multi-step operations, and track dirty pages for coordinated flushing.


Core Concepts

Buffer Pool Architecture

The buffer pool is a fixed-size array of frames — each frame holds one page-sized buffer plus metadata. The metadata tracks:

  • Page ID — which on-disk page is currently loaded in this frame
  • Pin count — how many active references exist to this frame. A pinned page cannot be evicted.
  • Dirty flag — whether the page has been modified since it was loaded from disk. Dirty pages must be written back before eviction.
  • Reference bit (for CLOCK) — whether the page has been accessed recently

The buffer pool also maintains a page table — a hash map from page ID to frame index — for O(1) lookups. When the engine requests page 42, the buffer pool checks the page table. If page 42 maps to frame 7, the engine gets a reference to frame 7's buffer. If page 42 is not in the page table, the buffer pool must evict a page and load 42 from disk.

Page Table (HashMap<PageId, FrameId>)
┌─────────┬─────────┐
│ Page 42 │ Frame 7 │
│ Page 13 │ Frame 2 │
│ Page 99 │ Frame 0 │
│   ...   │   ...   │
└─────────┴─────────┘

Frame Array
┌─────────┬─────────┬─────────┬─────────┐
│ Frame 0 │ Frame 1 │ Frame 2 │ Frame 3 │ ...
│ pg=99   │ (empty) │ pg=13   │ (empty) │
│ pin=1   │         │ pin=0   │         │
│ dirty=N │         │ dirty=Y │         │
└─────────┴─────────┴─────────┴─────────┘

Eviction Policies: LRU

Least Recently Used (LRU) evicts the page that hasn't been accessed for the longest time. The intuition: if a page hasn't been needed recently, it's unlikely to be needed soon. LRU is implemented with a doubly-linked list — on every access, the page is moved to the head of the list. On eviction, the tail page is removed.

LRU's weakness is scan flooding: a single sequential scan over the entire database evicts every hot page from the buffer pool, even if those pages are accessed hundreds of times per second by other queries. After the scan completes, every subsequent request misses the buffer pool and goes to disk. This is catastrophic for the OOR — a full catalog export scan would evict the conjunction query hot set.

Mitigations: LRU-K (track the K-th most recent access, not just the most recent), 2Q (separate queues for first-access and re-access pages), or ARC (adaptive replacement cache). PostgreSQL uses a clock-sweep approximation. MySQL/InnoDB uses a two-list LRU with a "young" and "old" sublist.

Eviction Policies: CLOCK

CLOCK approximates LRU without the overhead of maintaining a linked list. Each frame has a single reference bit. When a page is accessed, its reference bit is set to 1. When the buffer pool needs to evict, it sweeps through the frames in circular order (like a clock hand):

  1. If the current frame's reference bit is 1, clear it to 0 and advance the hand.
  2. If the current frame's reference bit is 0 and the page is not pinned and not dirty (or if dirty, flush it first), evict this page.

CLOCK is cheaper per operation than LRU (no linked list manipulation on every access — just set a bit) and provides comparable hit rates for most workloads. It's the default in many systems.

The weakness is the same as LRU: a full scan sets every reference bit to 1, requiring the clock hand to sweep the entire pool before any page can be evicted. Scan-resistant enhancements mitigate this — PostgreSQL, for example, routes large sequential scans through a small ring of buffers so scanned pages never flood the main pool.
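
A minimal sketch of the sweep itself, assuming frames carry the reference bit and pin count described above and the hand index persists across calls:

struct ClockFrame {
    ref_bit: bool,
    pin_count: u32,
}

/// Advance the clock hand until an unpinned frame with a cleared
/// reference bit is found. Returns None if every frame is pinned.
fn find_victim(frames: &mut [ClockFrame], hand: &mut usize) -> Option<usize> {
    // At most two full sweeps: the first pass may only clear reference bits.
    for _ in 0..frames.len() * 2 {
        let i = *hand;
        *hand = (*hand + 1) % frames.len();
        if frames[i].pin_count > 0 {
            continue; // Pinned frames are never victims.
        }
        if frames[i].ref_bit {
            frames[i].ref_bit = false; // Second chance: clear and move on.
        } else {
            return Some(i); // Cold and unpinned — evict this frame.
        }
    }
    None
}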

Page Pinning

A page is pinned when the engine is actively using it and it must not be evicted. The pin count tracks how many concurrent users hold a reference to the page. A page is evictable only when pin_count == 0.

Pinning is critical for correctness: if the engine is in the middle of reading a B-tree node and the buffer pool evicts that page, the engine reads garbage. The protocol:

  1. Fetch a page: buffer pool loads or finds it, increments pin count, returns a handle.
  2. Use the page: engine reads or writes the page data.
  3. Unpin the page: engine decrements pin count when done. If the engine modified the page, it marks it dirty.

Failing to unpin a page is a resource leak — the page can never be evicted, and eventually the buffer pool fills with pinned pages and all fetch requests fail. In Rust, RAII handles this naturally: the page handle decrements the pin count in its Drop implementation.

Dirty Page Flushing

A dirty page has been modified in memory but not yet written back to disk. The buffer pool tracks dirty pages and flushes them in two scenarios:

  1. Eviction flush: when a dirty page is selected for eviction, it must be written to disk before the frame can be reused.
  2. Background flush: a periodic background thread scans for dirty pages and writes them to disk proactively, reducing the chance that an eviction will stall on a synchronous write.

The buffer pool does not call fsync after every flush. Durability is the WAL's responsibility (Module 4). The buffer pool's flush is an optimization — it keeps the page file reasonably up-to-date so that crash recovery doesn't have to replay the entire WAL.


Code Examples

A Simple LRU Buffer Pool for the Orbital Object Registry

This buffer pool caches OOR pages in memory and evicts the least recently used unpinned page when the pool is full.

use std::collections::{HashMap, VecDeque};
use std::io;

const PAGE_SIZE: usize = 4096;

/// Metadata for a single buffer pool frame.
struct Frame {
    page_id: Option<u32>,
    data: [u8; PAGE_SIZE],
    pin_count: u32,
    is_dirty: bool,
}

impl Frame {
    fn new() -> Self {
        Self {
            page_id: None,
            data: [0u8; PAGE_SIZE],
            pin_count: 0,
            is_dirty: false,
        }
    }
}

/// LRU buffer pool backed by the OOR page file.
struct BufferPool {
    frames: Vec<Frame>,
    page_table: HashMap<u32, usize>,  // page_id -> frame_index
    /// LRU order: front = most recently used, back = least recently used.
    /// Contains frame indices. Only unpinned frames participate in LRU.
    lru_list: VecDeque<usize>,
    page_file: PageFile,  // From Lesson 1
}

impl BufferPool {
    fn new(num_frames: usize, page_file: PageFile) -> Self {
        let mut frames = Vec::with_capacity(num_frames);
        let mut lru_list = VecDeque::with_capacity(num_frames);
        for i in 0..num_frames {
            frames.push(Frame::new());
            lru_list.push_back(i); // All frames start as free (evictable)
        }
        Self {
            frames,
            page_table: HashMap::new(),
            lru_list,
            page_file,
        }
    }

    /// Fetch a page into the buffer pool. Returns the frame index.
    /// The page is pinned — caller MUST call `unpin` when done.
    fn fetch_page(&mut self, page_id: u32) -> io::Result<usize> {
        // Fast path: page is already in the pool
        if let Some(&frame_idx) = self.page_table.get(&page_id) {
            self.frames[frame_idx].pin_count += 1;
            self.move_to_front(frame_idx);
            return Ok(frame_idx);
        }

        // Slow path: need to load from disk. Find an evictable frame.
        let frame_idx = self.find_evict_target()?;

        // If the frame holds a dirty page, flush it before reuse
        if let Some(old_page_id) = self.frames[frame_idx].page_id {
            if self.frames[frame_idx].is_dirty {
                self.page_file.write_page(
                    old_page_id,
                    &self.frames[frame_idx].data,
                )?;
            }
            self.page_table.remove(&old_page_id);
        }

        // Load the requested page into the frame
        self.page_file.read_page(page_id, &mut self.frames[frame_idx].data)?;
        self.frames[frame_idx].page_id = Some(page_id);
        self.frames[frame_idx].pin_count = 1;
        self.frames[frame_idx].is_dirty = false;
        self.page_table.insert(page_id, frame_idx);
        self.move_to_front(frame_idx);

        Ok(frame_idx)
    }

    /// Unpin a page. Caller must indicate whether the page was modified.
    fn unpin(&mut self, frame_idx: usize, is_dirty: bool) {
        let frame = &mut self.frames[frame_idx];
        assert!(frame.pin_count > 0, "unpin called on unpinned frame");
        frame.pin_count -= 1;
        if is_dirty {
            frame.is_dirty = true;
        }
    }

    /// Find the least recently used unpinned frame.
    fn find_evict_target(&self) -> io::Result<usize> {
        // Scan from the back (LRU end) for an unpinned frame
        for &frame_idx in self.lru_list.iter().rev() {
            if self.frames[frame_idx].pin_count == 0 {
                return Ok(frame_idx);
            }
        }
        Err(io::Error::new(
            io::ErrorKind::Other,
            "buffer pool exhausted: all frames are pinned",
        ))
    }

    /// Move a frame to the front of the LRU list (most recently used).
    fn move_to_front(&mut self, frame_idx: usize) {
        self.lru_list.retain(|&idx| idx != frame_idx);
        self.lru_list.push_front(frame_idx);
    }
}

The move_to_front implementation is O(n) because VecDeque::retain scans the entire list. A production buffer pool uses an intrusive doubly-linked list for O(1) LRU updates — Rust crates like intrusive-collections provide this. The O(n) approach is correct and sufficient for understanding the algorithm; the optimization matters only when the buffer pool has thousands of frames and fetch rates exceed 100k/sec.

Notice the pin-count assert in unpin: a double-unpin is a logic bug that must crash immediately in development. In production, this would be debug_assert! to avoid panicking on a user-facing code path.
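
The background-flush protocol from Core Concepts maps onto this pool as a short helper — a sketch against the field layout above (a production pool would run it from a timer thread):

impl BufferPool {
    /// Write every dirty, resident page back to disk without evicting it.
    /// No fsync — durability remains the WAL's job.
    fn flush_all(&mut self) -> io::Result<()> {
        for i in 0..self.frames.len() {
            if self.frames[i].is_dirty {
                if let Some(page_id) = self.frames[i].page_id {
                    self.page_file.write_page(page_id, &self.frames[i].data)?;
                    self.frames[i].is_dirty = false;
                }
            }
        }
        Ok(())
    }
}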

RAII Page Handle for Automatic Unpinning

Rust's ownership system prevents the "forgot to unpin" bug class entirely. A page handle unpins automatically when it goes out of scope.

/// RAII handle to a pinned buffer pool page.
/// Automatically unpins the page when dropped.
struct PageHandle<'a> {
    pool: &'a mut BufferPool,
    frame_idx: usize,
    dirty: bool,
}

impl<'a> PageHandle<'a> {
    fn data(&self) -> &[u8; PAGE_SIZE] {
        &self.pool.frames[self.frame_idx].data
    }

    fn data_mut(&mut self) -> &mut [u8; PAGE_SIZE] {
        self.dirty = true;
        &mut self.pool.frames[self.frame_idx].data
    }
}

impl<'a> Drop for PageHandle<'a> {
    fn drop(&mut self) {
        self.pool.unpin(self.frame_idx, self.dirty);
    }
}

This is one of the places where Rust's borrow checker provides a genuine advantage over C/C++ buffer pool implementations. In C, every code path that fetches a page must remember to unpin it — including error paths, early returns, and exception-like longjmp flows. In Rust, the Drop implementation runs unconditionally when the handle leaves scope. The borrow checker also prevents holding a &mut reference to the page data after the handle is dropped, which would alias freed memory in C.

The tradeoff: the &'a mut BufferPool borrow means you can only hold one PageHandle at a time with this design. A production buffer pool uses Arc<Mutex<...>> or unsafe interior mutability to allow multiple concurrent page handles — we'll revisit this pattern when we implement B-tree traversal in Module 2.
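
A usage sketch, assuming the handle is constructed directly from the frame index returned by fetch_page (a dedicated fetch_handle constructor is left to the project):

fn read_first_byte(pool: &mut BufferPool, page_id: u32) -> io::Result<u8> {
    let frame_idx = pool.fetch_page(page_id)?; // page is now pinned
    let handle = PageHandle { pool, frame_idx, dirty: false };
    // However this function returns from here on, Drop unpins the page.
    Ok(handle.data()[0])
}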


Key Takeaways

  • The buffer pool is a fixed-size array of page-sized frames with a hash map for O(1) page-to-frame lookup. It eliminates disk I/O for hot pages and is the single largest performance lever in any storage engine.
  • LRU eviction is simple but vulnerable to scan flooding. CLOCK approximates LRU at lower cost per operation. Production engines use hybrid policies (LRU-K, 2Q, ARC) to resist scan-induced cache pollution.
  • Page pinning prevents eviction during active use. In Rust, RAII handles make pin leaks impossible — the Drop implementation guarantees unpinning on all code paths, including panics.
  • Dirty pages are flushed on eviction and by background threads. The buffer pool does not call fsync — durability is the WAL's job.
  • The "all frames pinned" error means the buffer pool is undersized for the workload's concurrency level. In the OOR, this can happen during peak conjunction checking if every active query holds a page pin simultaneously.

Lesson 3 — Slotted Pages

Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 3; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3

Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific slotted page layout, his terminology for the slot directory vs. cell pointer array, and the compaction algorithm described in Chapter 3.



Context

The page format from Lesson 1 stores records at fixed offsets. This works for fixed-size records — and if every TLE record were exactly 140 bytes, that would be sufficient. But real TLE data is messier: newer objects have additional metadata fields (drag coefficients, maneuver flags, covariance matrices), older legacy records omit optional fields, and record sizes will grow as ESA adds new conjunction assessment data. A format that requires all records to be the same size either wastes space (padding every record to the maximum) or breaks when schema evolves.

The slotted page layout solves this by decoupling record identity from record position. Records are addressed by slot numbers, and a slot directory at the beginning of the page maps each slot to the record's actual byte offset and length within the page. Records grow from the end of the page toward the front, while the slot directory grows from the front toward the end. They meet in the middle — and when they collide, the page is full.

This layout is the standard for every major relational database (PostgreSQL, MySQL/InnoDB, SQLite) and many key-value stores. Understanding it is prerequisite for everything that follows: B-tree nodes (Module 2) are slotted pages, WAL log records reference slot IDs (Module 4), and MVCC version chains (Module 5) track records by their page-and-slot address.


Core Concepts

Slotted Page Layout

A slotted page has three regions:

┌──────────────────────────────────────────────┐
│ Page Header (17 bytes — from Lesson 1)       │
├──────────────────────────────────────────────┤
│ Slot Directory (grows →)                     │
│ [slot 0: offset, len] [slot 1: offset, len] │
├──────────────────────────────────────────────┤
│                                              │
│           Free Space                         │
│                                              │
├──────────────────────────────────────────────┤
│ Records (grow ←)                             │
│ [record 1 data] [record 0 data]             │
└──────────────────────────────────────────────┘

Slot directory: An array of (offset, length) pairs, one per record. Slot 0 is the first entry. Each entry is 4 bytes (2 bytes offset + 2 bytes length), supporting records up to 65,535 bytes and offsets within a 64KB page. For 4KB pages, this is more than sufficient.

Records: Stored at the end of the page, growing backward (toward lower offsets). The first record inserted goes at the very end of the page; subsequent records are placed just before the previous one.

Free space: The gap between the end of the slot directory and the start of the record region. As records are inserted, the free space shrinks from both sides. The page is full when slot_directory_end >= record_region_start.

Record Addressing: (PageId, SlotId)

Higher layers of the storage engine refer to records by a Record ID (RID): a (page_id, slot_id) pair. This identifier is stable — it doesn't change when records are moved within the page during compaction, because the slot directory is updated to reflect the new offset. External references (B-tree leaf pointers, index entries) store RIDs, not raw byte offsets.

This indirection is what makes the slotted page powerful: the engine can rearrange the physical layout of records within a page (to reclaim fragmented space) without invalidating any external references. The slot ID stays the same; only the offset in the slot directory changes.

When a record is deleted, its slot entry is marked as tombstoned (offset set to a sentinel like 0xFFFF) but not removed from the directory. Removing it would shift all subsequent slot IDs by one, invalidating every external reference to those slots. Tombstoned slots can be reused for future inserts.

Insertion

To insert a record of N bytes:

  1. Check if there is enough free space: free_space >= N + 4 (4 bytes for the new slot entry).
  2. Find the next available slot. If there's a tombstoned slot, reuse it. Otherwise, append a new entry to the directory.
  3. Write the record at record_region_start - N.
  4. Update the slot entry with (offset, N).
  5. Update the page header: increment record count, adjust free_space_offset.

If the free space check fails, the page is full. The caller must either split the page (in a B-tree) or allocate a new page (in a heap file).

Deletion and Fragmentation

Deleting a record tombstones its slot entry and marks the record's bytes as reclaimable. But it doesn't shift other records — doing so would change their offsets and require updating every other slot entry that points past the deleted record.

This creates internal fragmentation: the deleted record's bytes become garbage stranded between valid records. Over time, a page can have plenty of total free space but no contiguous block large enough for a new record.

Before delete:
[Header][Slot 0][Slot 1][Slot 2]   [free]   [Rec 2][Rec 1][Rec 0]

After deleting record 1:
[Header][Slot 0][TOMB ][Slot 2]   [free]   [Rec 2][DEAD ][Rec 0]
                                             ← gap here cannot be used
                                               unless compacted →

Page Compaction

Compaction reclaims fragmented space by sliding all live records to the end of the page (closing the gaps left by deleted records) and updating their slot directory entries to reflect the new offsets. After compaction, all free space is contiguous.

The algorithm:

  1. Collect all live records (slot entries that are not tombstoned), sorted by their current offset in descending order.
  2. Starting from the end of the page, write each record contiguously.
  3. Update each slot entry with the new offset.
  4. Update the page header's free_space_offset.

Compaction is an in-page operation — it never spills to disk or affects other pages. It runs when an insert fails due to fragmentation (total free space is sufficient, but contiguous free space is not). Some engines compact proactively during quiet periods to avoid stalling inserts.

Overflow Pages

A single record might exceed the page's usable space (4,079 bytes in a 4KB page). This shouldn't happen for TLE records (140 bytes), but the engine must handle it for forward compatibility — covariance matrices and conjunction assessment reports can be kilobytes.

The solution: when a record is too large, store the first portion in the primary page and the remainder in one or more overflow pages. The slot entry points to the in-page portion, which contains a pointer to the overflow chain. This is sometimes called TOAST (The Oversized Attribute Storage Technique) in PostgreSQL terminology.

For the Orbital Object Registry, overflow pages are unlikely but should be supported. The implementation can be deferred until the schema actually requires it.
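
A sketch of what the chain bookkeeping might look like when that day comes — a hypothetical layout, not part of this module's implementation:

const NO_OVERFLOW: u32 = u32::MAX; // sentinel: record fits entirely in-page

/// Hypothetical trailer at the end of an oversized record's in-page portion.
struct OverflowTrailer {
    /// Page ID holding the next chunk; NO_OVERFLOW terminates the chain
    next_overflow_page: u32,
    /// Number of record bytes stored in that overflow page
    chunk_len: u16,
}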


Code Examples

A Slotted Page Implementation for TLE Records

This implements the core slotted page logic: insert, lookup, delete, and compaction.

const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 17;
const SLOT_SIZE: usize = 4; // 2 bytes offset + 2 bytes length
const TOMBSTONE: u16 = 0xFFFF;

/// A slotted page that stores variable-length records.
struct SlottedPage {
    data: [u8; PAGE_SIZE],
}

impl SlottedPage {
    fn new(page_id: u32) -> Self {
        let mut page = Self {
            data: [0u8; PAGE_SIZE],
        };
        // Initialize header (simplified — reuse PageHeader from Lesson 1)
        let mut header = PageHeader::new(page_id, PageType::Data);
        header.serialize(&mut page.data);
        // Records grow downward from the end; start the region at PAGE_SIZE
        // so free_space() is correct before the first insert.
        page.set_record_region_start(PAGE_SIZE as u16);
        page
    }

    /// Number of slots (including tombstoned ones).
    fn slot_count(&self) -> u16 {
        u16::from_le_bytes(self.data[9..11].try_into().unwrap())
    }

    fn set_slot_count(&mut self, count: u16) {
        self.data[9..11].copy_from_slice(&count.to_le_bytes());
    }

    /// Byte offset where the record region begins (records grow downward).
    fn record_region_start(&self) -> u16 {
        u16::from_le_bytes(self.data[11..13].try_into().unwrap())
    }

    fn set_record_region_start(&mut self, offset: u16) {
        self.data[11..13].copy_from_slice(&offset.to_le_bytes());
    }

    /// Read a slot directory entry.
    fn read_slot(&self, slot_id: u16) -> (u16, u16) {
        let base = HEADER_SIZE + (slot_id as usize) * SLOT_SIZE;
        let offset = u16::from_le_bytes(
            self.data[base..base + 2].try_into().unwrap()
        );
        let length = u16::from_le_bytes(
            self.data[base + 2..base + 4].try_into().unwrap()
        );
        (offset, length)
    }

    fn write_slot(&mut self, slot_id: u16, offset: u16, length: u16) {
        let base = HEADER_SIZE + (slot_id as usize) * SLOT_SIZE;
        self.data[base..base + 2].copy_from_slice(&offset.to_le_bytes());
        self.data[base + 2..base + 4].copy_from_slice(&length.to_le_bytes());
    }

    /// Free space available for new records + slot entries.
    fn free_space(&self) -> usize {
        let slot_dir_end = HEADER_SIZE + (self.slot_count() as usize) * SLOT_SIZE;
        let rec_start = self.record_region_start() as usize;
        if rec_start > slot_dir_end {
            rec_start - slot_dir_end
        } else {
            0
        }
    }

    /// Insert a record. Returns the slot ID on success.
    fn insert(&mut self, record: &[u8]) -> Option<u16> {
        let needed = record.len() + SLOT_SIZE; // record bytes + new slot entry
        if self.free_space() < needed {
            return None; // Page full — caller should try compaction or new page
        }

        // Find a tombstoned slot to reuse, or allocate a new one
        let slot_id = self.find_free_slot();

        // Place the record at the end of the record region
        let new_offset = self.record_region_start() - record.len() as u16;
        let start = new_offset as usize;
        let end = start + record.len();
        self.data[start..end].copy_from_slice(record);

        // Update the slot directory
        self.write_slot(slot_id, new_offset, record.len() as u16);
        self.set_record_region_start(new_offset);

        Some(slot_id)
    }

    /// Look up a record by slot ID. Returns None if the slot is
    /// tombstoned or out of range.
    fn get(&self, slot_id: u16) -> Option<&[u8]> {
        if slot_id >= self.slot_count() {
            return None;
        }
        let (offset, length) = self.read_slot(slot_id);
        if offset == TOMBSTONE {
            return None; // Deleted record
        }
        let start = offset as usize;
        let end = start + length as usize;
        Some(&self.data[start..end])
    }

    /// Delete a record by tombstoning its slot entry.
    fn delete(&mut self, slot_id: u16) -> bool {
        if slot_id >= self.slot_count() {
            return false;
        }
        let (offset, _) = self.read_slot(slot_id);
        if offset == TOMBSTONE {
            return false; // Already deleted
        }
        self.write_slot(slot_id, TOMBSTONE, 0);
        true
    }

    /// Find a tombstoned slot to reuse, or allocate a new one.
    fn find_free_slot(&mut self) -> u16 {
        let count = self.slot_count();
        for i in 0..count {
            let (offset, _) = self.read_slot(i);
            if offset == TOMBSTONE {
                return i;
            }
        }
        // No tombstoned slots — extend the directory
        self.set_slot_count(count + 1);
        count
    }
}

Key design decisions: the slot directory grows forward from the header, records grow backward from the end of the page, and the two regions meet in the middle. This maximizes usable space — there's no fixed boundary between "slot space" and "record space." A page with few large records uses most of its space for record data; a page with many small records uses more for the slot directory.

The insert method does not attempt compaction automatically. The caller is responsible for detecting "free space exists but is fragmented" and calling compact() before retrying. This keeps the insert path simple and predictable.
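
A quick round-trip showing the API surface (assuming the PageHeader type from Lesson 1 is in scope):

let mut page = SlottedPage::new(7);
let slot = page.insert(b"TLE record bytes").expect("empty page has space");
assert_eq!(page.get(slot), Some(&b"TLE record bytes"[..]));
assert!(page.delete(slot));
assert_eq!(page.get(slot), None); // tombstoned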

Page Compaction: Defragmenting Live Records

When deletes have fragmented the record region, compaction slides all live records together and reclaims the gaps.

impl SlottedPage {
    /// Compact the page: slide all live records to the end,
    /// eliminating gaps from deleted records.
    fn compact(&mut self) {
        let slot_count = self.slot_count();

        // Collect live records: (slot_id, data_copy)
        let mut live_records: Vec<(u16, Vec<u8>)> = Vec::new();
        for i in 0..slot_count {
            let (offset, length) = self.read_slot(i);
            if offset != TOMBSTONE {
                let start = offset as usize;
                let end = start + length as usize;
                live_records.push((i, self.data[start..end].to_vec()));
            }
        }

        // Rewrite records contiguously from the end of the page
        let mut cursor = PAGE_SIZE as u16;
        for (slot_id, record) in &live_records {
            cursor -= record.len() as u16;
            let start = cursor as usize;
            let end = start + record.len();
            self.data[start..end].copy_from_slice(record);
            self.write_slot(*slot_id, cursor, record.len() as u16);
        }

        self.set_record_region_start(cursor);
    }
}

This implementation copies live records into a temporary Vec and writes them back. A more memory-efficient approach would sort records by offset and slide them in-place, but the copy approach is correct, simple, and fast enough for 4KB pages. The total data moved is at most 4,079 bytes — negligible compared to the cost of a single disk I/O.

After compaction, the page's free space is contiguous. An insert that failed before compaction (due to fragmentation) will succeed after it — assuming the total free space is sufficient.
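
A fragmentation scenario showing why the caller needs compact() — sizes illustrative; the free-space figures assume the 17-byte header and 4-byte slots from above:

let mut page = SlottedPage::new(1);
let a = page.insert(&[0xAA; 1000]).unwrap();
let b = page.insert(&[0xBB; 1000]).unwrap();
let c = page.insert(&[0xCC; 1000]).unwrap();
page.delete(a);
page.delete(c);

// ~2,000 bytes are now dead, but free_space() still reports ~1,067 —
// it only sees the gap between the slot directory and record_region_start.
assert!(page.insert(&[0xDD; 1500]).is_none());

page.compact();
// All free space is contiguous again (~3,067 bytes), so the insert succeeds.
assert!(page.insert(&[0xDD; 1500]).is_some());
assert_eq!(page.get(b).map(|r| r[0]), Some(0xBB));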


Key Takeaways

  • Slotted pages decouple record identity (slot ID) from physical position (byte offset). Records can be moved within the page without invalidating external references.
  • The (page_id, slot_id) Record ID is the stable address used by B-tree leaf nodes, index entries, and MVCC version chains. Every higher layer depends on this abstraction.
  • Deletions create internal fragmentation. Compaction reclaims fragmented space by sliding live records together — an in-page operation that touches no other pages.
  • Tombstoning (not removing) deleted slot entries preserves slot ID stability. A removed slot would shift all subsequent IDs, breaking every external reference.
  • The "free space" calculation must account for both record bytes and slot directory growth. An insert that appears to fit by record size alone may fail because there's no room for the new slot entry.

Project — TLE Record Page Manager

Module: Database Internals — M01: Storage Engine Fundamentals
Track: Orbital Object Registry
Estimated effort: 4–6 hours


SDA Incident Report — OOR-2026-0042

Classification: ENGINEERING DIRECTIVE
Subject: Prototype page manager for the Orbital Object Registry

Ref: OOR-2026-0041 (TLE index latency deficiency)

The first deliverable in the OOR storage engine build is a page manager capable of reading and writing TLE records to a custom binary page format. This component sits at the bottom of the storage stack — every subsequent module builds on it. The page manager must demonstrate correct page layout, buffer pool caching, slotted page record management, and integrity verification via checksums.



Objective

Build a PageManager that:

  1. Manages a database file composed of fixed-size 4KB pages
  2. Implements a buffer pool with LRU or CLOCK eviction
  3. Uses slotted pages for variable-length TLE record storage
  4. Verifies page integrity with CRC32 checksums on every read
  5. Supports insert, lookup by Record ID (page_id, slot_id), delete, and page compaction

TLE Record Format

For this project, a TLE record is a byte blob with the following structure:

/// A Two-Line Element record for a tracked orbital object.
struct TleRecord {
    /// NORAD catalog number (unique object ID, e.g., 25544 for ISS)
    norad_id: u32,
    /// International designator (e.g., "98067A")
    intl_designator: [u8; 8],
    /// Epoch year + fractional day (e.g., 24045.5 = Feb 14 2024, 12:00 UTC)
    epoch: f64,
    /// Mean motion (revolutions per day)
    mean_motion: f64,
    /// Eccentricity (dimensionless, 0–1)
    eccentricity: f64,
    /// Inclination (degrees)
    inclination: f64,
    /// Right ascension of ascending node (degrees)
    raan: f64,
    /// Argument of perigee (degrees)
    arg_perigee: f64,
    /// Mean anomaly (degrees)
    mean_anomaly: f64,
    /// Drag term (B* coefficient)
    bstar: f64,
    /// Element set number (for provenance tracking)
    element_set: u16,
    /// Revolution number at epoch
    rev_number: u32,
}

Serialized size: 4 + 8 + (8 × 8) + 2 + 4 = 82 bytes. Use little-endian encoding for all fields. You may add a 2-byte record-length prefix if your slotted page implementation requires it.


Acceptance Criteria

  1. Page I/O correctness. Pages are written to and read from a file at the correct offsets. A page written at page_id * 4096 is read back identically.

  2. Checksum verification. Every read_page call computes a CRC32 over the page body and compares it to the stored checksum. A tampered page (any bit flipped in the body) is detected and returns an error.

  3. Buffer pool hit rate. Insert 200 TLE records across multiple pages, then read them back in the same order. The buffer pool (configured with 8 frames) should achieve a hit rate above 90% on the read pass. Print the hit/miss counts.

  4. Slotted page insert and lookup. Insert 40 records into a single page. Look up each by its (page_id, slot_id) and verify the data matches.

  5. Delete and compaction. Delete every other record (slots 0, 2, 4, ...). Verify that lookups to deleted slots return None. Compact the page and verify that all remaining records are still accessible by their original slot IDs.

  6. Page full handling. Insert records until a page reports full. Verify that the failure is detected before corrupting any data. Allocate a new page and continue inserting.

  7. Deterministic output. The program runs without external dependencies beyond std and crc32fast. Output includes the buffer pool hit/miss stats and a summary of records inserted/read/deleted.


Starter Structure

tle-page-manager/
├── Cargo.toml
├── src/
│   ├── main.rs          # Entry point: runs the acceptance criteria
│   ├── page.rs          # PageHeader, SlottedPage, checksums
│   ├── buffer_pool.rs   # BufferPool, Frame, eviction policy
│   ├── page_file.rs     # PageFile: raw I/O to the database file
│   └── tle.rs           # TleRecord serialization/deserialization

Hints

Hint 1 — Serializing TLE records

Use to_le_bytes() for each field and concatenate them into a Vec<u8>. For deserialization, slice the byte buffer at the known offsets and use from_le_bytes(). Do not use serde or bincode — the point of this project is to understand raw binary layout.

impl TleRecord {
    fn serialize(&self) -> Vec<u8> {
        let mut buf = Vec::with_capacity(82);
        buf.extend_from_slice(&self.norad_id.to_le_bytes());
        buf.extend_from_slice(&self.intl_designator);
        buf.extend_from_slice(&self.epoch.to_le_bytes());
        buf.extend_from_slice(&self.mean_motion.to_le_bytes());
        buf.extend_from_slice(&self.eccentricity.to_le_bytes());
        buf.extend_from_slice(&self.inclination.to_le_bytes());
        buf.extend_from_slice(&self.raan.to_le_bytes());
        buf.extend_from_slice(&self.arg_perigee.to_le_bytes());
        buf.extend_from_slice(&self.mean_anomaly.to_le_bytes());
        buf.extend_from_slice(&self.bstar.to_le_bytes());
        buf.extend_from_slice(&self.element_set.to_le_bytes());
        buf.extend_from_slice(&self.rev_number.to_le_bytes());
        buf
    }
}

Hint 2 — Buffer pool sizing

With 200 TLE records at 82 bytes each and ~49 records per page (4,079 usable bytes / 82 bytes ≈ 49, minus slot overhead), you need approximately 5 pages. An 8-frame buffer pool can hold the entire working set — but only if pages aren't evicted prematurely. Make sure your LRU implementation correctly promotes re-accessed pages.

Hint 3 — Compaction correctness check

After compacting, iterate all slot IDs and verify:

  • Live records return the same data as before compaction
  • Tombstoned slots still return None
  • The page's free_space() reading increased — the dead bytes between records were reclaimed into the contiguous region
  • All remaining free space is contiguous (no more gaps between records)

Hint 4 — Checksum verification testing

To test checksum detection, write a valid page to disk, then flip a single bit in the page body using raw file I/O. Read the page back through the buffer pool and verify that it returns a checksum error, not corrupted data.

// Flip bit 0 of byte 20 in page 1 (bring std::io::{Read, Seek, SeekFrom, Write} into scope)
let offset = 1 * PAGE_SIZE + 20;
file.seek(SeekFrom::Start(offset as u64))?;
let mut byte = [0u8; 1];
file.read_exact(&mut byte)?;
byte[0] ^= 0x01; // flip lowest bit
file.seek(SeekFrom::Start(offset as u64))?;
file.write_all(&byte)?;

Reference Implementation


The reference implementation is intentionally omitted for this project. The three lessons provide all the code building blocks — your job is to integrate them into a working system. If you get stuck:

  1. Start with page_file.rs — get raw page I/O working first
  2. Add page.rs — implement PageHeader and SlottedPage from Lesson 1 and 3
  3. Add buffer_pool.rs — wrap the page file with caching from Lesson 2
  4. Add tle.rs — serialization is straightforward byte manipulation
  5. Wire them together in main.rs — run each acceptance criterion sequentially

What Comes Next

The page manager you build here is used directly by Module 2. B-tree nodes are stored as slotted pages in the buffer pool. The (page_id, slot_id) Record ID becomes the leaf-node pointer format in the B+ tree index.

Module 02 — B-Tree Index Structures

Track: Database Internals — Orbital Object Registry
Position: Module 2 of 6
Source material: Database Internals — Alex Petrov, Chapters 2, 4–6; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

SDA INCIDENT REPORT — OOR-2026-0043
Classification: PERFORMANCE DEFICIENCY
Subject: NORAD catalog ID lookups require full page scans

The page manager from Module 1 stores TLE records but provides no way to locate a specific record without scanning every page. A conjunction query for NORAD ID 25544 (ISS) currently reads all data pages sequentially — O(N) in the number of pages. With 100,000 tracked objects across ~2,000 data pages, a single point lookup takes 2–5ms. During a pass window with 500 conjunction checks per second, this saturates the I/O subsystem.

Directive: Build a B+ tree index over NORAD catalog IDs. Point lookups must be O(log N) in the number of records. Range scans over contiguous NORAD ID ranges must traverse only the relevant leaf pages.

The B-tree is the most widely deployed index structure in database systems. PostgreSQL, MySQL/InnoDB, SQLite, and most file systems use B-tree variants for ordered key lookups. This module covers the structure, invariants, and maintenance operations (splits and merges) that keep the tree balanced under insert and delete workloads.


Learning Outcomes

After completing this module, you will be able to:

  1. Describe the B-tree invariants (minimum fill factor, sorted keys, balanced height) and explain why they guarantee O(log N) lookups
  2. Implement node split and merge operations that maintain B-tree balance on insert and delete
  3. Distinguish between B-trees and B+ trees, and explain why B+ trees are preferred for range scans and disk-based storage
  4. Implement a B+ tree leaf-level linked list for efficient range scans over NORAD ID ranges
  5. Analyze the I/O cost of B-tree operations in terms of tree height and page size
  6. Integrate a B+ tree index with the page manager from Module 1

Lesson Summary

Lesson 1 — B-Tree Structure: Keys, Pointers, and Invariants

The B-tree data structure: internal nodes, leaf nodes, the fill factor invariant, and the guarantee of O(log N) height. How keys and child pointers are arranged within a node, and how the tree is traversed for point lookups.

Key question: What is the maximum height of a B-tree indexing 100,000 NORAD IDs with a branching factor of 200?

Lesson 2 — Node Splits and Merges

Maintaining B-tree balance under writes. How inserts cause node splits (bottom-up), how deletes cause node merges or redistributions, and how these operations propagate up the tree. The difference between eager and lazy merge strategies.

Key question: Can a single insert into a B-tree with height H cause more than H page writes?

Lesson 3 — B+ Trees and Range Scans

The B+ tree variant: all data in leaf nodes, internal nodes hold only separator keys, and leaf nodes are linked for sequential access. Why this layout is optimal for disk-based storage engines that need both point lookups and range scans.

Key question: Why do B+ trees outperform B-trees for range scans even when both have the same height?


Capstone Project — B+ Tree TLE Index Engine

Build a B+ tree index over NORAD catalog IDs that supports point lookups, range scans, inserts, and deletes. The index is backed by the page manager from Module 1 — each B+ tree node is a slotted page in the buffer pool. Full project brief in project-btree-index.md.


File Index

module-02-btree-index-structures/
├── README.md                          ← this file
├── lesson-01-btree-structure.md       ← B-tree structure and invariants
├── lesson-01-quiz.toml                ← Quiz (5 questions)
├── lesson-02-splits-merges.md         ← Node splits and merges
├── lesson-02-quiz.toml                ← Quiz (5 questions)
├── lesson-03-bplus-trees.md           ← B+ trees and range scans
├── lesson-03-quiz.toml                ← Quiz (5 questions)
└── project-btree-index.md             ← Capstone project brief

Prerequisites

  • Module 1 (Storage Engine Fundamentals) completed
  • Understanding of slotted pages and the buffer pool

What Comes Next

Module 3 (LSM Trees & Compaction) introduces a fundamentally different index structure optimized for write-heavy workloads. The B+ tree you build here is read-optimized — every insert modifies pages in-place, which is expensive for high write throughput. The LSM tree takes the opposite approach: batch writes in memory and flush them to immutable files. Understanding both structures and their tradeoffs is essential for choosing the right approach for the OOR's workload.

Lesson 1 — B-Tree Structure: Keys, Pointers, and Invariants

Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 2 and 4

Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific notation for B-tree order vs. branching factor, and his framing of the fill factor invariant.



Context

A heap file of slotted pages provides O(1) access by Record ID (page_id, slot_id), but O(N) access by key — finding NORAD ID 25544 requires scanning every data page. For the Orbital Object Registry, this is the difference between a 0.05ms indexed lookup and a 5ms full scan. At 500 conjunction checks per second, indexed lookups consume 25ms of I/O per second. Full scans consume 2,500ms — the system spends more time scanning than computing.

The B-tree is a balanced, sorted, multi-way tree optimized for disk-based storage. Each node occupies one page, and the tree's branching factor (number of children per node) is determined by how many keys fit in a page. A B-tree with a branching factor of 200 and 100,000 records has a height of 3 — any record can be found in at most 3 page reads. Compare this to a binary search tree, which would have height ~17 and require 17 page reads.

B-trees were invented in 1970 by Bayer and McCreight specifically for disk-based access patterns. Every modern relational database and most file systems use B-tree variants as their primary index structure.


Core Concepts

Tree Structure

A B-tree of order m has the following properties:

  • Every internal node has at most m children and at most m - 1 keys.
  • Every internal node (except the root) has at least ⌈m/2⌉ children.
  • The root has at least 2 children (unless it is a leaf).
  • All leaves are at the same depth.
  • Keys within each node are sorted in ascending order.

The keys in an internal node serve as separators — they direct the search to the correct child. For a node with keys [K₁, K₂, K₃] and children [C₀, C₁, C₂, C₃]:

  • All keys in subtree C₀ are < K₁
  • All keys in subtree C₁ are ≥ K₁ and < K₂
  • All keys in subtree C₂ are ≥ K₂ and < K₃
  • All keys in subtree C₃ are ≥ K₃

                    [30 | 60]
                   /    |    \
          [10|20]    [40|50]    [70|80|90]
          /  |  \    /  |  \    /  |  |  \
        ... ... ... ... ... ... ... ... ... ...

Branching Factor and Tree Height

The branching factor determines how wide the tree is — and therefore how shallow. For the Orbital Object Registry:

  • Page size: 4KB
  • Key size: 4 bytes (NORAD ID as u32)
  • Child pointer size: 4 bytes (page ID as u32)
  • Node header overhead: ~20 bytes

Usable space per node: 4096 - 20 = 4076 bytes. Each key-pointer pair: 4 + 4 = 8 bytes. Maximum keys per node: 4076 / 8 ≈ 509. So the branching factor is approximately 510.

Tree height for N records with branching factor B: h = ⌈log_B(N)⌉, counting levels from root to leaf inclusive — the smallest h such that B^h ≥ N.

Records          Height (B = 510)    Page Reads per Lookup
100,000          2                   2
1,000,000        2                   2
100,000,000      3                   3

With branching factor 510, the entire 100,000-record OOR catalog is reachable in 2 page reads (root + leaf). The root node is almost always cached in the buffer pool, so in practice most lookups require only 1 disk read (the leaf node).
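
A quick sanity check of the table — a throwaway helper, not part of the engine:

/// Smallest h with b^h >= n: the tree height for n records, branching factor b.
fn btree_height(n: u64, b: u64) -> u32 {
    let (mut h, mut capacity) = (1, b);
    while capacity < n {
        capacity *= b;
        h += 1;
    }
    h
}

assert_eq!(btree_height(100_000, 510), 2);
assert_eq!(btree_height(100_000_000, 510), 3);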

Point Lookup Algorithm

To find key K:

  1. Start at the root node.
  2. Binary search the node's keys to find the correct child pointer.
  3. Follow the child pointer to the next level.
  4. Repeat until a leaf node is reached.
  5. Binary search the leaf node for K.

Each level requires one page read and one binary search. Binary search within a node is O(log m) comparisons — negligible compared to the page read cost.

Node Layout on Disk

Each B-tree node is stored as a page in the page file. The node layout within a page:

┌────────────────────────────────────────────┐
│ Node Header                                │
│  - node_type: u8 (Internal=0, Leaf=1)      │
│  - key_count: u16                          │
│  - parent_page_id: u32 (for split propagation) │
├────────────────────────────────────────────┤
│ Key-Pointer Pairs (for internal nodes):    │
│  [child_0] [key_0] [child_1] [key_1] ...  │
│                                            │
│ Key-Value Pairs (for leaf nodes):          │
│  [key_0] [rid_0] [key_1] [rid_1] ...      │
│  (RID = page_id + slot_id of data record)  │
└────────────────────────────────────────────┘

Internal nodes store (child_page_id, key) pairs. Leaf nodes store (key, record_id) pairs where the record ID points to the actual TLE record in a data page (from Module 1).

The Fill Factor Invariant

The minimum fill requirement (at least ⌈m/2⌉ children per internal node) is what guarantees the tree stays balanced. Without it, degenerate deletions could produce a tree where one branch is much deeper than another, destroying the O(log N) guarantee.

The fill factor also ensures space efficiency — every node is at least half full, so the tree uses at most 2x the minimum space needed. In practice, B-trees maintain an average fill factor of ~67% (between the 50% minimum and 100% maximum), and bulk-loaded trees can achieve >90%.


Code Examples

B-Tree Node Representation

This defines the on-disk layout for B-tree nodes in the Orbital Object Registry, where keys are NORAD catalog IDs (u32) and values are Record IDs.

/// Record ID: a pointer to a TLE record in a data page.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RecordId {
    page_id: u32,
    slot_id: u16,
}

/// A B-tree node stored in a single page.
#[derive(Debug)]
enum BTreeNode {
    Internal(InternalNode),
    Leaf(LeafNode),
}

#[derive(Debug)]
struct InternalNode {
    page_id: u32,
    /// Separator keys. `keys[i]` is the boundary between children[i] and children[i+1].
    keys: Vec<u32>,
    /// Child page IDs. `children.len() == keys.len() + 1`.
    children: Vec<u32>,
}

#[derive(Debug)]
struct LeafNode {
    page_id: u32,
    /// Sorted key-value pairs. Keys are NORAD IDs, values are Record IDs.
    keys: Vec<u32>,
    values: Vec<RecordId>,
}

impl InternalNode {
    /// Find the child page that could contain the given key.
    fn find_child(&self, key: u32) -> u32 {
        // Binary search for the first separator key > search key.
        // The child to the left of that separator is the correct subtree.
        let pos = self.keys.partition_point(|&k| k <= key);
        self.children[pos]
    }
}

impl LeafNode {
    /// Point lookup: find the Record ID for a given NORAD ID.
    fn find(&self, key: u32) -> Option<RecordId> {
        match self.keys.binary_search(&key) {
            Ok(idx) => Some(self.values[idx]),
            Err(_) => None,
        }
    }
}

The partition_point method is the correct choice for internal node search — it returns the index of the first separator greater than the search key, which is exactly the child that owns the key's range. Using binary_search would be error-prone: an exact match (Ok) must be mapped to the child right of the separator while a miss (Err) maps to the child on the left, and its returned index is unspecified when duplicate keys exist.

Traversal: Root-to-Leaf Lookup

A complete point lookup traverses from the root to a leaf, reading one page per level.

/// Look up a NORAD ID in the B-tree. Returns the Record ID if found.
fn btree_lookup(
    root_page_id: u32,
    key: u32,
    buffer_pool: &mut BufferPool,
) -> io::Result<Option<RecordId>> {
    let mut current_page_id = root_page_id;

    loop {
        let frame_idx = buffer_pool.fetch_page(current_page_id)?;
        let page_data = buffer_pool.frame_data(frame_idx);
        let node = deserialize_node(page_data)?;
        // Unpin immediately — we've extracted the data we need.
        // In a real implementation, we'd hold the pin during the
        // search for concurrency safety (see Module 5).
        buffer_pool.unpin(frame_idx, false);

        match node {
            BTreeNode::Internal(internal) => {
                current_page_id = internal.find_child(key);
                // Continue traversal — follow the child pointer
            }
            BTreeNode::Leaf(leaf) => {
                return Ok(leaf.find(key));
            }
        }
    }
}

This reads at most h pages where h is the tree height. For the OOR (100k records, branching factor ~510), h = 2. The root page is almost always cached in the buffer pool (it's accessed by every lookup), so the typical cost is 1 disk read — just the leaf page.

The comment about unpinning immediately is important: in a concurrent engine (Module 5), you'd hold the pin while searching to prevent the page from being evicted mid-traversal. For single-threaded Module 2, immediate unpin is safe and keeps the buffer pool available.


Key Takeaways

  • A B-tree with branching factor B and N records has height O(log_B(N)). With B ≈ 500 (common for 4KB pages with small keys), a tree indexing 100 million records is only 3 levels deep.
  • The fill factor invariant (nodes at least half full) guarantees balanced height and prevents degenerate trees. Splits and merges (Lesson 2) maintain this invariant.
  • Internal nodes contain separator keys and child pointers. Leaf nodes contain the actual key-to-record-ID mapping. The search algorithm binary-searches within each node and follows pointers down the tree.
  • The branching factor is determined by page size and key/pointer sizes. Larger pages or smaller keys mean a wider tree and fewer I/O operations per lookup.
  • Root and upper-level internal nodes are almost always cached in the buffer pool, so the practical I/O cost of a lookup is usually just 1 page read (the leaf).

Lesson 2 — Node Splits and Merges

Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapters 4–5

Source note: This lesson was synthesized from training knowledge. Verify Petrov's specific split/merge algorithm variants and his treatment of lazy vs. eager rebalancing against Chapters 4–5.



Context

A B-tree that only grows (inserts, never deletes) would eventually have every node at 100% capacity. The next insert into a full node would fail — unless the tree can restructure itself. Node splitting is the mechanism: when a node overflows, it divides into two half-full nodes and promotes a separator key to the parent. This maintains the B-tree invariant (all leaves at the same depth) and keeps every node between half and completely full.

The reverse operation — node merging — handles deletions. When a node drops below the minimum fill factor (half full), it either borrows keys from a sibling or merges with a sibling. Without merging, a delete-heavy workload could leave the tree full of nearly-empty nodes, wasting space and degrading scan performance.

For the OOR, inserts happen when new objects are cataloged or existing TLEs are updated (≈1,000/day for routine operations, burst to 10,000+ during fragmentation events). Deletes happen when objects re-enter the atmosphere or are reclassified. The split/merge machinery ensures the index stays balanced through both workload patterns.


Core Concepts

Leaf Node Split

When an insert arrives at a full leaf node:

  1. Allocate a new leaf page from the page manager.
  2. Move the upper half of the keys (and their record IDs) to the new node.
  3. The median key becomes the separator — it is promoted to the parent internal node.
  4. The parent inserts the separator key with a pointer to the new node.

Before split (leaf full, max 4 keys):
  Parent: [...| 30 |...]
               |
  Leaf:   [10, 20, 30, 40]   ← inserting 25

After split:
  Parent: [...| 25 | 30 |...]
               |       |
  Left:   [10, 20]    Right: [25, 30, 40]

The choice of median matters: promoting the middle key keeps both new nodes as close to half-full as possible, maximizing the number of inserts before the next split. Some implementations promote the first key of the right node instead — simpler but slightly less balanced.

Internal Node Split

If the parent is also full when receiving the promoted separator, the parent itself must split. This propagation continues upward until either a non-full ancestor is found or the root splits. A root split is the only operation that increases the tree's height:

  1. Split the root into two children.
  2. Create a new root with one separator key pointing to the two children.
  3. Tree height increases by 1.

Root splits are rare — for a B-tree with branching factor 510, the root doesn't split until it already holds 509 keys, so the tree can hold up to roughly 510 × 509 ≈ 260,000 records at height 2 before needing height 3.

The Cost of Splits

A single insert can trigger a cascade of splits from leaf to root. In the worst case (every ancestor is full), inserting one key causes h splits — one per level. Each split writes 2 pages (the original node and the new node) plus modifies the parent, so the worst-case write amplification is 2h + 1 page writes for one insert.

In practice, cascading splits are rare. The average cost of an insert is approximately 1.5 page writes: one for the leaf update and 0.5 for the amortized split cost (since splits happen once per ~B/2 inserts).

Deletion and Underflow

When a key is deleted from a leaf:

  1. Remove the key and its record ID from the leaf.
  2. If the leaf is still at least half full, done.
  3. If the leaf is below half full (underflow), rebalance.

Rebalancing options, tried in order:

  • Redistribute from a sibling: If an adjacent sibling has more than the minimum number of keys, transfer one key from the sibling through the parent. This keeps both nodes at valid fill levels.
  • Merge with a sibling: If both the underflowing node and its sibling are at minimum, merge them into one node and remove the separator from the parent.

Merge reduces the parent's key count by one. If the parent then underflows, the same process propagates upward. A merge at the root level (when the root has only one child) reduces the tree height by 1.

Redistribution vs. Merge

Redistribution (sibling has spare keys):
  Parent: [...| 30 |...]          Parent: [...| 25 |...]
           |       |         →              |       |
  Left: [10]    Right: [25,30,40]   Left: [10,25]  Right: [30,40]

Merge (both at minimum):
  Parent: [...| 30 |...]          Parent: [...|...]
           |       |         →           |
  Left: [10]    Right: [30]        Merged: [10,30]

Redistribution is preferred because it doesn't change the tree structure — no nodes are created or destroyed, no parent keys are removed. Merge is the fallback when redistribution isn't possible.
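
A minimal sketch of leaf-level redistribution, using the LeafNode type from Lesson 1 — the caller installs the returned separator in the parent, and the "sibling has spare keys" check is omitted:

impl LeafNode {
    /// Borrow the smallest entry from a richer right sibling.
    /// Returns the new separator key for the parent.
    fn borrow_from_right(&mut self, right: &mut LeafNode) -> u32 {
        // Every key in `right` is greater than every key in `self`,
        // so appending the sibling's smallest entry keeps `self` sorted.
        let key = right.keys.remove(0);
        let value = right.values.remove(0);
        self.keys.push(key);
        self.values.push(value);
        // The separator must be <= every key remaining in the right node.
        right.keys[0]
    }
}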

Lazy vs. Eager Rebalancing

Not all implementations rebalance immediately on underflow. Lazy rebalancing tolerates slightly-underfull nodes, deferring merges until a periodic compaction pass or until the node becomes completely empty. This reduces write amplification at the cost of slightly lower space efficiency and slightly higher scan costs (more nodes to traverse).

PostgreSQL's B-tree implementation, for example, does not merge on every delete — it marks deleted entries as "dead" and reclaims space during VACUUM. This is partly because merge operations require exclusive locks on multiple nodes, which would block concurrent readers.

For the OOR, lazy rebalancing is the pragmatic choice: the delete rate (~100/day for atmospheric re-entries) is low enough that occasional underfull nodes have negligible impact on scan performance.


Code Examples

Leaf Node Split

Splitting a full leaf node during insert, promoting the median key to the parent.

impl LeafNode {
    /// Split this leaf and return (median_key, new_right_leaf).
    /// After split, `self` retains the lower half of the keys.
    fn split(&mut self, new_page_id: u32) -> (u32, LeafNode) {
        let mid = self.keys.len() / 2;

        // The median key is promoted to the parent as a separator
        let median_key = self.keys[mid];

        // Right half moves to the new node
        let right_keys = self.keys.split_off(mid);
        let right_values = self.values.split_off(mid);

        let right = LeafNode {
            page_id: new_page_id,
            keys: right_keys,
            values: right_values,
            // In a B+ tree, link the leaves (see Lesson 3)
            next_leaf: self.next_leaf,
        };

        // Update the left node's forward pointer to the new right sibling
        self.next_leaf = Some(new_page_id);

        (median_key, right)
    }
}

split_off(mid) is the right tool here — it moves the elements from index mid to the end into a freshly allocated Vec with a single memcpy, leaving the left node with [0..mid). The median key is promoted to the parent but also remains in the right leaf — in a B+ tree, leaf nodes hold all keys, and internal nodes hold copies as separators.

Inserting into a B-Tree with Split Propagation

A top-level insert that handles splits propagating up the tree.

/// Insert a key-value pair into the B+ tree.
/// If the root splits, creates a new root and increases tree height.
fn btree_insert(
    root_page_id: &mut u32,
    key: u32,
    rid: RecordId,
    buffer_pool: &mut BufferPool,
) -> io::Result<()> {
    // Traverse to the leaf, collecting the path of ancestors
    let (leaf_page_id, ancestors) = find_leaf_with_path(
        *root_page_id, key, buffer_pool
    )?;

    // Attempt to insert into the leaf
    let overflow = insert_into_leaf(leaf_page_id, key, rid, buffer_pool)?;

    if let Some((promoted_key, new_page_id)) = overflow {
        // Leaf split occurred — propagate up
        propagate_split(
            root_page_id, promoted_key, new_page_id,
            &ancestors, buffer_pool
        )?;
    }

    Ok(())
}

/// Propagate a split upward through the ancestors.
fn propagate_split(
    root_page_id: &mut u32,
    mut promoted_key: u32,
    mut new_child_page_id: u32,
    ancestors: &[u32],  // page IDs from root to parent-of-leaf
    buffer_pool: &mut BufferPool,
) -> io::Result<()> {
    // Walk ancestors from bottom (parent of leaf) to top (root)
    for &ancestor_page_id in ancestors.iter().rev() {
        let overflow = insert_into_internal(
            ancestor_page_id, promoted_key, new_child_page_id,
            buffer_pool,
        )?;

        match overflow {
            None => return Ok(()), // Ancestor had room — done
            Some((key, page_id)) => {
                promoted_key = key;
                new_child_page_id = page_id;
                // Continue propagating
            }
        }
    }

    // If we reach here, the root itself split.
    // Create a new root pointing to the old root and the new child.
    let new_root_page = buffer_pool.allocate_page()?;
    let new_root = InternalNode {
        page_id: new_root_page,
        keys: vec![promoted_key],
        children: vec![*root_page_id, new_child_page_id],
    };
    serialize_and_write_node(&new_root, buffer_pool)?;
    *root_page_id = new_root_page;

    Ok(())
}

The ancestor path is collected during the initial traversal. This avoids re-traversing the tree during split propagation, which would be both slower and incorrect under concurrent modifications (a problem we'll address in Module 5).

The root_page_id is passed as &mut u32 because a root split changes it — the old root becomes a child of the new root. In a production engine, the root page ID is stored in the file header page and updated atomically with WAL protection.


Key Takeaways

  • Node splits maintain the B-tree invariant by dividing overfull nodes and promoting a separator key to the parent. Splits propagate upward; root splits are the only operation that increases tree height.
  • The amortized cost of an insert is ~1.5 page writes. The worst case (a full cascade) is 2h+1 writes but is rare: a leaf splits once per ~B/2 inserts, and a node i levels above the leaves splits only once per ~(B/2)^(i+1) inserts.
  • Deletions trigger rebalancing when a node drops below half full. Redistribution (borrowing from a sibling) is preferred; merging is the fallback. Merges propagate upward like splits.
  • Lazy rebalancing (deferring merges) reduces write amplification and lock contention at the cost of slightly underfull nodes. Most production B-tree implementations use some form of lazy deletion.
  • The ancestor path must be collected during traversal for split propagation. Re-traversing the tree after a split is both slower and unsafe under concurrent access.

Lesson 3 — B+ Trees and Range Scans

Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapters 5–6; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3

Source note: This lesson was synthesized from training knowledge. Verify Petrov's treatment of B+ tree leaf linking, his comparison of B-tree vs. B+ tree I/O cost for range scans, and his coverage of prefix compression in Chapter 6.



Context

The B-tree from Lessons 1 and 2 stores key-value pairs in both internal and leaf nodes. This is correct and complete for point lookups, but it has a significant limitation for range scans: the data is distributed across all levels of the tree. Scanning NORAD IDs 40000–40500 requires traversing internal nodes to find the start, then potentially bouncing between internal and leaf nodes to collect all matching records.

The B+ tree variant solves this by separating concerns: internal nodes contain only separator keys and child pointers (they are a pure navigation structure), and all data records live in leaf nodes. Leaf nodes are linked into a doubly-linked list, so a range scan only needs one tree traversal (to find the starting leaf) followed by a sequential walk along the leaf chain.

This is why virtually every production relational database uses B+ trees, not plain B-trees, as their primary index structure. The leaf-level linked list turns range scans from O(K log N) (re-traversing from the root for each of the K matching keys) into O(log N + K) — a massive improvement for the OOR's conjunction query workload, which frequently scans orbital parameter ranges.


Core Concepts

B+ Tree vs. B-Tree

Property                 B-Tree                                   B+ Tree
Data location            All nodes (internal + leaf)              Leaf nodes only
Internal node contents   Keys + values + child pointers           Keys + child pointers only
Range scan support       Must re-traverse or backtrack            Sequential leaf walk
Branching factor         Lower (values take internal space)       Higher (more keys per internal node)
Point lookup I/O         Can be fewer reads (data higher up)      Always traverses to a leaf
Disk space for keys      Each key stored once                     Separators duplicated in internal nodes

For the OOR, the B+ tree's higher branching factor (more keys per internal node) and efficient range scans outweigh the slight overhead of always traversing to leaf nodes. The conjunction query workload is dominated by range scans over orbital parameter ranges, not single-key lookups.

Leaf-Level Linked List

B+ tree leaf nodes are connected in a linked list sorted by key order. Each leaf stores a pointer to its right sibling (and optionally its left sibling for reverse scans):

Root: [30 | 60]
       /    |    \
     /      |      \
 [10,20] → [30,40,50] → [60,70,80,90]
  (leaf 1)    (leaf 2)       (leaf 3)

A range scan for keys 25–55:

  1. Tree traversal: root → leaf 2 (first key ≥ 25 is 30).
  2. Sequential scan: read leaf 2 (keys 30, 40, 50), follow next pointer to leaf 3 (key 60 > 55, stop).
  3. Total I/O: 2 page reads (root + leaf 2) + 1 page read (leaf 3) = 3 pages. The root is cached, so practical I/O is 2 page reads.

Without the linked list, the engine would need to return to the root for each key, or implement a complex in-order traversal using a stack of parent pointers.

Separator Keys in Internal Nodes

In a B+ tree, internal node keys are separators — they do not need to be exact copies of leaf keys. They only need to correctly direct traffic. If the left child's maximum key is 29 and the right child's minimum key is 30, any value in the range [30, ...] works as a separator. Some implementations use abbreviated separators (the shortest key that correctly divides the two children) to fit more entries per internal node.

This means a delete from a leaf does not necessarily require updating the parent's separator. If you delete key 30 from a leaf, the separator in the parent can remain 30 — it still correctly directs traffic because all keys in the right child are ≥ 30 (the next key might be 31). The separator only needs updating if the split boundary itself changes.

Prefix and Suffix Compression

Leaf nodes in a B+ tree often contain keys with significant commonality — for example, NORAD IDs in the range 40000–40499 share the prefix "400". Prefix compression stores the common prefix once and encodes each key as a delta from the prefix. Suffix truncation removes key suffixes that are unnecessary for correct comparison within the node.

For the OOR's u32 NORAD IDs, these optimizations provide modest savings (4-byte keys have limited prefix sharing). They become critical for composite keys or string keys — a B+ tree indexing international designators like "2024-001A" through "2024-999Z" would benefit substantially.
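
A sketch of the idea for byte-string keys — common_prefix_len is an illustrative helper, not from the source material:

/// Length of the prefix shared by every key in a sorted node.
fn common_prefix_len(keys: &[&[u8]]) -> usize {
    let (first, last) = match (keys.first(), keys.last()) {
        (Some(f), Some(l)) => (f, l),
        _ => return 0,
    };
    // In a sorted node, the first and last keys bound the shared prefix.
    first.iter().zip(last.iter()).take_while(|(a, b)| a == b).count()
}

// ["2024-001A", "2024-017C", "2024-099B"] share "2024-0" (6 bytes):
// store the prefix once and each key as a 3-byte suffix.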

Bulk Loading

Building a B+ tree by inserting records one at a time produces nodes at ~50-67% average fill. Bulk loading builds the tree bottom-up from sorted data:

  1. Sort all records by key.
  2. Pack records into leaf nodes at near-100% fill.
  3. Build internal nodes bottom-up by taking separator keys from each pair of adjacent leaves.
  4. Repeat upward until the root is created.

Bulk loading is O(N) in the number of records and produces an optimally-packed tree. For the OOR's initial catalog load (100,000 records), bulk loading fills ~200 leaf pages at 100% fill. One-by-one inserts would produce ~300-400 pages at 50-67% fill.
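
A sketch of the leaf-packing step (step 2), using the BPlusLeafNode type defined in the code examples below. Here alloc_page stands in for the page manager's allocator, per_leaf must be nonzero, and the internal levels (steps 3–4) are built the same way from each leaf's (first key, page_id) pair:

fn bulk_load_leaves(
    entries: &[(u32, RecordId)],   // pre-sorted by key (step 1)
    per_leaf: usize,               // entries per leaf at 100% fill
    mut alloc_page: impl FnMut() -> u32,
) -> Vec<BPlusLeafNode> {
    let mut leaves: Vec<BPlusLeafNode> = Vec::new();
    for chunk in entries.chunks(per_leaf) {
        let page_id = alloc_page();
        let prev_page = leaves.last().map(|l| l.page_id);
        // Link the previous leaf forward to the one we are about to create
        if let Some(prev) = leaves.last_mut() {
            prev.next_leaf = Some(page_id);
        }
        leaves.push(BPlusLeafNode {
            page_id,
            keys: chunk.iter().map(|&(k, _)| k).collect(),
            values: chunk.iter().map(|&(_, r)| r).collect(),
            next_leaf: None,
            prev_leaf: prev_page,
        });
    }
    leaves
}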


Code Examples

Extending the leaf node from Lesson 1 with linked-list pointers for range scan support.

/// B+ tree leaf node with sibling links for range scans.
#[derive(Debug, Clone)]
struct BPlusLeafNode {
    page_id: u32,
    keys: Vec<u32>,
    values: Vec<RecordId>,
    /// Forward pointer to the next leaf (right sibling)
    next_leaf: Option<u32>,
    /// Backward pointer for reverse scans (optional)
    prev_leaf: Option<u32>,
}

impl BPlusLeafNode {
    /// Range scan: iterate all keys in [start, end] starting from this leaf.
    /// Returns a lazy iterator that follows the leaf chain.
    fn range_scan_from<'a>(
        &self,
        start: u32,
        end: u32,
        buffer_pool: &'a mut BufferPool,
    ) -> BPlusRangeScanIterator<'a> {
        // Find the first key >= start in this leaf
        let start_idx = self.keys.partition_point(|&k| k < start);

        // The iterator owns a copy of the starting leaf. A production
        // engine would hold a pinned page handle instead of cloning.
        BPlusRangeScanIterator {
            current_leaf: Some(self.clone()),
            idx: start_idx,
            end_key: end,
            buffer_pool,
            done: false,
        }
    }
}

The range scan iterator is lazy — it reads the next leaf page only when the current leaf's keys are exhausted. This avoids reading leaf pages beyond what the caller actually consumes (important if the caller stops early after finding a match).

Range Scan Iterator

The iterator follows the leaf chain, reading one page at a time.

/// Iterator over a range of B+ tree leaf entries.
/// Follows the leaf-level linked list until the end key is exceeded.
struct BPlusRangeScanIterator<'a> {
    current_leaf: Option<BPlusLeafNode>,
    idx: usize,
    end_key: u32,
    buffer_pool: &'a mut BufferPool,
    done: bool,
}

impl<'a> BPlusRangeScanIterator<'a> {
    fn next_entry(&mut self) -> Option<io::Result<(u32, RecordId)>> {
        loop {
            if self.done {
                return None;
            }

            let leaf = self.current_leaf.as_ref()?;

            // Check if there are more entries in the current leaf
            if self.idx < leaf.keys.len() {
                let key = leaf.keys[self.idx];
                if key > self.end_key {
                    self.done = true;
                    return None;
                }
                let rid = leaf.values[self.idx];
                self.idx += 1;
                return Some(Ok((key, rid)));
            }

            // Current leaf exhausted — follow the chain
            match leaf.next_leaf {
                None => {
                    self.done = true;
                    return None;
                }
                Some(next_page_id) => {
                    match self.load_leaf(next_page_id) {
                        Ok(next_leaf) => {
                            self.current_leaf = Some(next_leaf);
                            self.idx = 0;
                            // Loop back to yield from the new leaf
                        }
                        Err(e) => {
                            self.done = true;
                            return Some(Err(e));
                        }
                    }
                }
            }
        }
    }

    fn load_leaf(&mut self, page_id: u32) -> io::Result<BPlusLeafNode> {
        let frame_idx = self.buffer_pool.fetch_page(page_id)?;
        let data = self.buffer_pool.frame_data(frame_idx);
        let leaf = deserialize_leaf_node(data)?;
        self.buffer_pool.unpin(frame_idx, false);
        Ok(leaf)
    }
}
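
To drive the scanner with for loops and iterator adapters, a thin Iterator impl can delegate to next_entry. This adapter is a small addition beyond the listing above:

impl<'a> Iterator for BPlusRangeScanIterator<'a> {
    type Item = io::Result<(u32, RecordId)>;

    fn next(&mut self) -> Option<Self::Item> {
        self.next_entry()
    }
}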

This pattern — a struct that holds iteration state and lazily loads pages — is the Volcano iterator model that we'll formalize in Module 6 (Query Processing). Every B+ tree range scan in every database engine works this way: traverse to the starting leaf, then pull one record at a time from the leaf chain, fetching the next page only when the current one is exhausted.


Key Takeaways

  • B+ trees store all data in leaf nodes and use internal nodes purely for navigation. This maximizes the internal node branching factor and enables efficient range scans via the leaf-level linked list.
  • Range scans are O(log N + K): one tree traversal to the starting leaf, then K sequential leaf reads. This is the primary advantage over plain B-trees and hash indices.
  • Separator keys in internal nodes don't need to be exact copies of leaf keys — any value that correctly directs traffic works. This enables prefix compression and abbreviated separators.
  • Bulk loading produces a B+ tree at near-100% fill factor in O(N) time, compared to O(N log N) for one-by-one inserts at ~50-67% fill. Always bulk-load when building an index from scratch.
  • The leaf-level scan iterator is the first instance of the Volcano iterator pattern — a pull-based interface that lazily fetches pages on demand. This pattern recurs throughout the query processing stack.

Project — B+ Tree TLE Index Engine

Module: Database Internals — M02: B-Tree Index Structures
Track: Orbital Object Registry
Estimated effort: 6–8 hours



SDA Incident Report — OOR-2026-0043

Classification: ENGINEERING DIRECTIVE
Subject: Build ordered index for NORAD catalog ID lookups and range scans

The page manager from Module 1 stores TLE records but requires full scans for key-based access. Build a B+ tree index over NORAD catalog IDs that provides O(log N) point lookups and efficient range scans via a leaf-level linked list. The index must integrate with the existing page manager and buffer pool.


Objective

Build a B+ tree index that:

  1. Uses the page manager and buffer pool from Module 1 — each B+ tree node is stored as a page
  2. Supports point lookups by NORAD catalog ID in O(log N) page reads
  3. Supports range scans over NORAD ID ranges using the leaf-level linked list
  4. Handles inserts with automatic node splitting and split propagation
  5. Handles deletes with tombstoning (lazy rebalancing is acceptable)
  6. Provides a bulk-load operation for initial catalog construction

Acceptance Criteria

  1. Point lookup correctness. Insert 10,000 TLE records with random NORAD IDs. Look up each by ID and verify the returned Record ID matches.

  2. Range scan correctness. Insert NORAD IDs 1–10,000. Scan range [5000, 5100] and verify exactly 101 records returned in sorted order.

  3. Split handling. Insert records until at least 3 leaf splits occur. Verify the tree remains balanced (all leaves at same depth) and all records retrievable.

  4. Bulk-load efficiency. Bulk-load 100,000 sorted records. Verify leaf fill factor above 95%. Compare leaf count to one-by-one insertion.

  5. Delete correctness. Delete 1,000 records by NORAD ID. Verify lookups return None for deleted keys and remaining records are unaffected.

  6. Integration with buffer pool. Run full test suite with only 16 buffer pool frames. Verify correctness under eviction pressure.

  7. Deterministic output. Print tree height, leaf count, fill factor, and buffer pool hit/miss stats after each test phase.


Starter Structure

btree-index/
├── Cargo.toml
├── src/
│   ├── main.rs            # Entry point: runs acceptance criteria
│   ├── btree.rs           # BPlusTree: insert, lookup, range_scan, delete, bulk_load
│   ├── node.rs            # InternalNode, LeafNode: serialization, split, merge
│   ├── page.rs            # Reuse from Module 1
│   ├── buffer_pool.rs     # Reuse from Module 1
│   ├── page_file.rs       # Reuse from Module 1
│   └── tle.rs             # Reuse from Module 1

Hints

Hint 1 — Node serialization format

Use a 1-byte discriminant at the start of each node page to distinguish internal from leaf. Internal: [type=0][key_count][child_0][key_0][child_1].... Leaf: [type=1][key_count][next_leaf][prev_leaf][key_0][rid_0]....

Hint 2 — Ancestor path for split propagation

During root-to-leaf traversal, push each internal node's page ID onto a Vec<u32>. After a leaf split, pop ancestors one at a time to insert the promoted key. If the ancestor splits too, continue popping. Empty stack = create new root.

Hint 3 — Bulk-load algorithm

  1. Sort all records by NORAD ID.
  2. Pack keys into leaf nodes at capacity, write each, record its page ID and last key.
  3. Build internal nodes bottom-up: group separators into internal node pages.
  4. Repeat step 3 until one root remains.
  5. Link leaves into a doubly-linked list.

Hint 4 — Buffer pool pressure during splits

Split propagation can pin the split node, the new node, and the parent simultaneously (3 frames). With a 16-frame pool and a 2-level tree, this is safe. But unpin nodes as soon as you've serialized them back — don't hold all three longer than necessary.


What Comes Next

Module 3 introduces LSM trees — a fundamentally different approach. Where B+ trees update pages in-place, LSM trees batch writes in memory and flush immutable files. You'll understand when each is appropriate for the OOR workload.

Module 03 — LSM Trees & Compaction

Track: Database Internals — Orbital Object Registry
Position: Module 3 of 6
Source material: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Course — Alex Chi Z
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

SDA INCIDENT REPORT — OOR-2026-0044
Classification: PERFORMANCE DEFICIENCY
Subject: B+ tree write throughput insufficient for fragmentation event ingestion

During the Cosmos-2251 debris cascade simulation, the OOR must ingest 12,000 new TLE records in under 60 seconds. The B+ tree index from Module 2 achieves ~200 inserts/second — each insert requires a root-to-leaf traversal and potential node split, generating 2-4 random page writes per insert. At this rate, ingesting 12,000 records takes 60 seconds, consuming the entire conjunction window.

Directive: Evaluate and implement an LSM-tree-based storage architecture. LSM trees batch writes in memory and flush them as immutable sorted files, converting random writes to sequential writes. The tradeoff: reads become more expensive (must check multiple files), but write throughput increases by 10–100x.

The LSM tree (Log-Structured Merge Tree) is the dominant architecture for write-heavy storage engines. RocksDB, LevelDB, Cassandra, HBase, CockroachDB, and TiKV all use LSM-tree variants. Where the B+ tree is optimized for read-heavy workloads with moderate writes, the LSM tree is optimized for write-heavy workloads where reads can tolerate checking multiple sorted structures.

This module covers the full LSM architecture: memtables, sorted string tables (SSTs), the write and read paths, compaction strategies, and read optimizations (bloom filters, block cache). It draws on the mini-lsm course structure and the LSM coverage in Database Internals Chapter 7.


Learning Outcomes

After completing this module, you will be able to:

  1. Describe the LSM write path (memtable → immutable memtable → SST flush) and explain why it converts random writes to sequential I/O
  2. Implement a memtable backed by a sorted data structure (e.g., BTreeMap) and flush it to an immutable SST file
  3. Design an SST file format with data blocks, index blocks, and metadata blocks
  4. Explain the three amplification factors (read, write, space) and how compaction strategies trade between them
  5. Compare leveled, tiered, and FIFO compaction strategies and choose the appropriate strategy for a given workload
  6. Implement bloom filters and a block cache to reduce read amplification in an LSM engine

Lesson Summary

Lesson 1 — Memtables and Sorted String Tables

The LSM write path: how writes are batched in an in-memory memtable, frozen into immutable memtables, and flushed to sorted string table (SST) files on disk. The SST file format: data blocks, index blocks, bloom filter blocks, and footer. How the read path probes the memtable first, then SSTs from newest to oldest.

Key question: Why does the LSM tree maintain both a mutable and one or more immutable memtables instead of writing directly from the mutable memtable to disk?

Lesson 2 — Compaction Strategies

The core problem: as SSTs accumulate, reads slow down (more files to check) and space grows (deleted keys aren't reclaimed until compacted). Compaction merges SSTs to reduce read amplification and reclaim space, at the cost of write amplification. Leveled compaction, tiered (universal) compaction, and FIFO compaction — each trades differently between read, write, and space amplification.

Key question: Can you design a compaction strategy that minimizes all three amplification factors simultaneously?

Lesson 3 — Bloom Filters, Block Cache, and Read Optimization

LSM reads are expensive — they must check the memtable plus potentially every SST level. Bloom filters let the engine skip SSTs that definitely don't contain the target key. The block cache keeps hot SST data blocks in memory. Together, they reduce the effective read amplification from O(levels) to near O(1) for point lookups.

Key question: A bloom filter with a 1% false positive rate eliminates 99% of unnecessary SST reads. What is the cost of increasing it to 0.1%?


Capstone Project — LSM-Backed TLE Storage Engine

Build an LSM storage engine for the Orbital Object Registry that supports put, get, delete, and scan operations. The engine must implement memtable→SST flush, a basic leveled compaction strategy, and bloom filters for point lookup optimization. Full project brief in project-lsm-engine.md.


File Index

module-03-lsm-trees-compaction/
├── README.md                          ← this file
├── lesson-01-memtables-ssts.md        ← Memtables and sorted string tables
├── lesson-01-quiz.toml                ← Quiz (5 questions)
├── lesson-02-compaction.md            ← Compaction strategies
├── lesson-02-quiz.toml                ← Quiz (5 questions)
├── lesson-03-read-optimization.md     ← Bloom filters, block cache, read path
├── lesson-03-quiz.toml                ← Quiz (5 questions)
└── project-lsm-engine.md              ← Capstone project brief

Prerequisites

  • Module 1 (Storage Engine Fundamentals) — page I/O concepts
  • Module 2 (B-Tree Index Structures) — understanding of B+ tree tradeoffs (to compare against)

What Comes Next

Module 4 (Write-Ahead Logging & Recovery) adds durability to the LSM engine. Currently, a crash loses all data in the memtable (which is in-memory only). The WAL ensures that every write is persisted before being acknowledged, and the recovery process replays the WAL to rebuild the memtable after a crash.

Lesson 1 — Memtables and Sorted String Tables

Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Week 1

Source note: This lesson was synthesized from training knowledge and the Mini-LSM course structure. Verify Petrov's specific SSTable format description and Kleppmann's LSM compaction cost analysis against the source texts.



Context

The B+ tree from Module 2 provides O(log N) reads but suffers under write-heavy workloads: every insert modifies a leaf page in place, and node splits amplify writes further. For the Orbital Object Registry's burst ingestion scenario (12,000 new TLE records in under 60 seconds during a fragmentation event), the B+ tree's write amplification of 10–20x is the bottleneck.

The Log-Structured Merge Tree (LSM) eliminates in-place updates entirely. All writes go to an in-memory sorted structure called a memtable. When the memtable reaches a size threshold, it is frozen (becomes immutable) and flushed to disk as a Sorted String Table (SSTable) — a file of sorted key-value pairs that is never modified after creation. Reads probe the memtable first, then search SSTables from newest to oldest.

This architecture makes writes trivially fast: inserting a key-value pair is a single in-memory operation (an insert into a skip list or B-tree in RAM). The cost is shifted to reads, which must check multiple SSTables, and to background compaction, which merges SSTables to keep read amplification bounded. The entire LSM design is a bet that write throughput matters more than read latency for many workloads — and for TLE burst ingestion, it does.


Core Concepts

The Memtable

The memtable is a mutable, in-memory sorted data structure that buffers incoming writes. Common implementations:

  • Skip list (used by LevelDB, RocksDB): O(log N) insert and lookup, lock-free concurrent reads, good cache behavior. The standard choice.
  • Red-black tree or B-tree in memory: O(log N) operations, but harder to make lock-free.
  • Sorted vector: O(N) insert (shift), O(log N) lookup. Only viable for very small memtables.

For the OOR, a BTreeMap<Vec<u8>, Vec<u8>> is the simplest correct implementation. Production engines use skip lists for concurrent access, but the algorithm is the same: maintain sorted order in memory, flush to disk when full.

Key operations:

  • Put(key, value): Insert or update a key in the memtable. O(log N).
  • Delete(key): Insert a tombstone — a special marker that indicates the key has been deleted. The tombstone must persist through SSTables so that older versions of the key are masked.
  • Get(key): Look up a key. Returns the value, or the tombstone if deleted, or None if the key is not in the memtable.

The tombstone design is critical: a delete cannot simply remove the key from the memtable, because older SSTables on disk may still contain the key. Without a tombstone, a read would miss the memtable (key not present), then find the old value in an SSTable and incorrectly return it.

Memtable Freeze and Flush

When the memtable reaches its size threshold (typically 4–64MB), the engine:

  1. Freezes the current memtable — it becomes immutable (no more writes).
  2. Creates a new active memtable for incoming writes.
  3. In the background, flushes the immutable memtable to disk as an SSTable.

The freeze-then-flush pattern ensures writes are never blocked by disk I/O. The only latency-sensitive operation is the in-memory insert. Multiple immutable memtables can exist simultaneously (queued for flush), but each consumes memory, so the engine must flush faster than new memtables are created.

Write path:
  Put(key, value)
       │
       ▼
  Active Memtable (mutable, in-memory)
       │ (size threshold reached)
       ▼
  Immutable Memtable (frozen, in-memory)
       │ (background flush)
       ▼
  SSTable on disk (sorted, immutable)

SSTable Format

An SSTable is a file containing sorted key-value pairs organized into data blocks. The standard layout:

┌──────────────────────────────────────────────┐
│ Data Block 0: [k0:v0, k1:v1, k2:v2, ...]     │
│ Data Block 1: [k3:v3, k4:v4, k5:v5, ...]     │
│ ...                                          │
│ Data Block N: [kM:vM, ...]                   │
├──────────────────────────────────────────────┤
│ Meta Block: Bloom filter (optional)          │
├──────────────────────────────────────────────┤
│ Index Block: [block_0_last_key → offset,     │
│               block_1_last_key → offset,     │
│               ...                         ]  │
├──────────────────────────────────────────────┤
│ Footer: index_offset, meta_offset, magic     │
└──────────────────────────────────────────────┘

Data blocks (typically 4KB each) contain sorted key-value pairs. Keys within a block can use prefix compression — store the shared prefix once and encode each key as a delta.

Index block maps the last key of each data block to the block's file offset. A point lookup binary-searches the index block to find which data block might contain the key, then searches within that block.

Meta block stores auxiliary data — most importantly, a bloom filter for fast negative lookups (Lesson 3).

Footer is the last few bytes of the file, containing offsets to the index and meta blocks. The reader starts by reading the footer, then uses its offsets to locate everything else.

SSTables are immutable — once written, they are never modified. Updates and deletes are handled by writing new SSTables that supersede older ones. This immutability is the source of LSM's concurrency simplicity: readers can access any SSTable without locks (the file doesn't change under them), and the only coordination needed is between the flush/compaction writers and the metadata that tracks which SSTables are active.

The Read Path

To read a key from an LSM engine:

  1. Check the active memtable. If found, return immediately.
  2. Check each immutable memtable from newest to oldest.
  3. Check each SSTable from newest to oldest (L0, then L1, then L2, ...).
  4. If the key is not found anywhere, it does not exist.

At each level, finding a tombstone means the key was deleted — stop searching and return "not found." This is why tombstones must be ordered correctly: a tombstone in a newer SSTable masks the key's value in all older SSTables.

The worst case is a negative lookup (key doesn't exist): the engine must check every memtable and every SSTable before concluding the key is absent. This is where bloom filters (Lesson 3) provide the biggest win — they let the engine skip entire SSTables in O(1) per filter check.

The Merge Iterator

Range scans (and compaction) require merging sorted streams from multiple sources — the memtable and several SSTables. The merge iterator (also called a multi-way merge) takes N sorted iterators and produces a single sorted stream:

  1. Maintain a min-heap of the current key from each source.
  2. Pop the smallest key. If multiple sources have the same key, take the one from the newest source (the memtable or the most recent SSTable).
  3. Advance the source that produced the popped key.

The newest-wins rule is what makes updates and deletes work correctly: a newer value for the same key supersedes the older one, and a newer tombstone masks the older value.
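
A minimal sketch of the newest-wins merge over in-memory runs follows. The merge_sorted name, the (u32, u32) entries, and the Vec-based sources stand in for the memtable and SSTable iterators; tombstone handling is elided:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Multi-way merge with newest-wins semantics. `sources[0]` is the newest
/// run; each run is sorted by key.
fn merge_sorted(sources: Vec<Vec<(u32, u32)>>) -> Vec<(u32, u32)> {
    // Min-heap of (key, source_id): pop the smallest key, and for equal
    // keys the lowest source_id (the newest source) first.
    let mut heap = BinaryHeap::new();
    let mut cursors: Vec<usize> = vec![0; sources.len()];
    for (sid, src) in sources.iter().enumerate() {
        if let Some(&(k, _)) = src.first() {
            heap.push(Reverse((k, sid)));
        }
    }

    let mut out: Vec<(u32, u32)> = Vec::new();
    while let Some(Reverse((key, sid))) = heap.pop() {
        let (_, v) = sources[sid][cursors[sid]];
        // Newest-wins: emit a key only the first time we see it.
        if out.last().map(|&(k, _)| k) != Some(key) {
            out.push((key, v));
        }
        // Advance the source that produced this entry.
        cursors[sid] += 1;
        if let Some(&(next_key, _)) = sources[sid].get(cursors[sid]) {
            heap.push(Reverse((next_key, sid)));
        }
    }
    out
}

For example, merge_sorted(vec![vec![(1, 10), (3, 30)], vec![(1, 9), (2, 20)]]) yields [(1, 10), (2, 20), (3, 30)]: the newer source's value for key 1 wins.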


Code Examples

A Simple Memtable Backed by BTreeMap

The memtable stores key-value pairs in sorted order. Tombstones are represented as None values.

use std::collections::BTreeMap;

/// Memtable: in-memory sorted store for LSM writes.
/// Keys are byte slices. Values are `Option<Vec<u8>>` where
/// `None` represents a tombstone (deleted key).
struct MemTable {
    map: BTreeMap<Vec<u8>, Option<Vec<u8>>>,
    size_bytes: usize,
    size_limit: usize,
}

impl MemTable {
    fn new(size_limit: usize) -> Self {
        Self {
            map: BTreeMap::new(),
            size_bytes: 0,
            size_limit,
        }
    }

    /// Insert or update a key-value pair.
    /// Size accounting is approximate: an update adds its full size again
    /// rather than subtracting the replaced entry. That is good enough for
    /// a flush threshold.
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.size_bytes += key.len() + value.len();
        self.map.insert(key, Some(value));
    }

    /// Mark a key as deleted (insert a tombstone).
    fn delete(&mut self, key: Vec<u8>) {
        self.size_bytes += key.len();
        self.map.insert(key, None); // None = tombstone
    }

    /// Look up a key. Returns:
    /// - Some(Some(value)) if the key exists
    /// - Some(None) if the key is tombstoned (deleted)
    /// - None if the key is not in this memtable
    fn get(&self, key: &[u8]) -> Option<&Option<Vec<u8>>> {
        self.map.get(key)
    }

    /// True if the memtable has reached its size limit and should be frozen.
    fn should_flush(&self) -> bool {
        self.size_bytes >= self.size_limit
    }

    /// Iterate all entries in sorted order for flushing to an SSTable.
    fn iter(&self) -> impl Iterator<Item = (&Vec<u8>, &Option<Vec<u8>>)> {
        self.map.iter()
    }
}

The three-valued return from get is essential: None means "this memtable has no information about this key — keep searching older sources." Some(None) means "this key was deleted — stop searching." Some(Some(value)) means "here's the value." Collapsing the first two cases would make deleted keys reappear from older SSTables.

SSTable Builder: Flushing the Memtable to Disk

When the memtable is frozen, its entries are written to an SSTable file in sorted order.

use std::io::{self, Write, BufWriter};
use std::fs::File;

const BLOCK_SIZE: usize = 4096;

/// Builds an SSTable file from sorted key-value pairs.
struct SsTableBuilder {
    writer: BufWriter<File>,
    /// Index entries: (last_key_in_block, block_offset)
    index: Vec<(Vec<u8>, u64)>,
    current_block: Vec<u8>,
    current_block_offset: u64,
    /// Most recently added key; becomes the index entry for the block
    /// that is still open when the file is finalized.
    last_key: Vec<u8>,
    entry_count: usize,
}

impl SsTableBuilder {
    fn new(path: &str) -> io::Result<Self> {
        let file = File::create(path)?;
        Ok(Self {
            writer: BufWriter::new(file),
            index: Vec::new(),
            current_block: Vec::new(),
            current_block_offset: 0,
            last_key: Vec::new(),
            entry_count: 0,
        })
    }

    /// Add a key-value pair. Keys must be added in sorted order.
    fn add(&mut self, key: &[u8], value: Option<&[u8]>) -> io::Result<()> {
        // Encode the entry: [key_len: u16][key][is_tombstone: u8][value_len: u16][value]
        let is_tombstone = value.is_none();
        let val = value.unwrap_or(&[]);

        self.current_block.extend_from_slice(&(key.len() as u16).to_le_bytes());
        self.current_block.extend_from_slice(key);
        self.current_block.push(if is_tombstone { 1 } else { 0 });
        self.current_block.extend_from_slice(&(val.len() as u16).to_le_bytes());
        self.current_block.extend_from_slice(val);

        self.last_key = key.to_vec();
        self.entry_count += 1;

        // If the block is full, flush it
        if self.current_block.len() >= BLOCK_SIZE {
            self.flush_block()?;
        }
        Ok(())
    }

    fn flush_block(&mut self) -> io::Result<()> {
        let offset = self.current_block_offset;
        self.writer.write_all(&self.current_block)?;
        // Record the last key actually written into this block, so the
        // final (partial) block gets a correct index entry in finish().
        self.index.push((self.last_key.clone(), offset));
        self.current_block_offset += self.current_block.len() as u64;
        self.current_block.clear();
        Ok(())
    }

    /// Finalize the SSTable: write the index block and footer.
    fn finish(mut self) -> io::Result<()> {
        // Flush any remaining data in the current block
        if !self.current_block.is_empty() {
            self.flush_block()?;
        }

        // Write the index block
        let index_offset = self.current_block_offset;
        for (key, offset) in &self.index {
            self.writer.write_all(&(key.len() as u16).to_le_bytes())?;
            self.writer.write_all(key)?;
            self.writer.write_all(&offset.to_le_bytes())?;
        }

        // Write the footer: index_offset + magic
        self.writer.write_all(&index_offset.to_le_bytes())?;
        self.writer.write_all(b"OORSST01")?; // 8-byte magic
        self.writer.flush()?;

        Ok(())
    }
}

The builder writes entries into fixed-size blocks and records the last key and offset of each block in the index. The footer at the end of the file lets the reader locate the index without scanning the entire file. This is the same layout used by LevelDB and RocksDB's table format (with more compression and filtering in production).

Notice that add requires keys in sorted order — the caller (the memtable flush code) is responsible for iterating the memtable in order. Violating this invariant produces a corrupt SSTable where binary search returns wrong results.
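
Tying the two pieces together, a hypothetical flush_memtable helper (not part of the lesson's listings) drains a frozen memtable into the builder. BTreeMap iteration order satisfies the builder's sorted-input invariant automatically:

/// Flush a frozen memtable to a new SSTable file.
fn flush_memtable(memtable: &MemTable, path: &str) -> std::io::Result<()> {
    let mut builder = SsTableBuilder::new(path)?;
    for (key, value) in memtable.iter() {
        // `value.as_deref()` maps Some(vec) to Some(&[u8]) and keeps
        // tombstones as None, matching the builder's signature.
        builder.add(key, value.as_deref())?;
    }
    builder.finish()
}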


Key Takeaways

  • The LSM write path is: active memtable → freeze → immutable memtable → background flush → SSTable on disk. Writes are never blocked by disk I/O — they complete as soon as the in-memory insert finishes.
  • Deletes are tombstones, not removals. A tombstone must persist through SSTables to mask older versions of the key. Compaction eventually garbage-collects tombstones once no older version exists.
  • SSTables are immutable sorted files partitioned into data blocks with an index block for O(log B) block lookup (where B is the number of blocks). Immutability enables lock-free concurrent reads.
  • The read path checks memtable first, then SSTables from newest to oldest. Negative lookups (key doesn't exist) are the worst case — they must check every source. Bloom filters (Lesson 3) mitigate this.
  • The merge iterator produces a single sorted stream from multiple sources, with newest-wins semantics for duplicate keys. This is the core data structure for both reads and compaction.

Lesson 2 — Compaction Strategies

Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Week 2

Source note: This lesson was synthesized from training knowledge and the Mini-LSM course compaction chapters. Verify the specific amplification formulas against Petrov Chapter 7 and the RocksDB Tuning Guide.



Context

Without compaction, the LSM engine accumulates SSTables indefinitely. Every flush creates a new SSTable in Level 0 (L0). After 100 flushes, there are 100 L0 SSTables — and a point lookup must check all of them. Read amplification grows linearly with the number of SSTables. Space amplification grows too: deleted keys still consume space in older SSTables, and updated keys have multiple versions.

Compaction is the background process that merges SSTables to reduce read and space amplification. It reads multiple SSTables, merge-sorts their entries (applying tombstones and keeping only the newest version of each key), and writes the result as fewer, larger SSTables. The merged input files are then deleted.

The compaction strategy determines which SSTables to merge and when. Different strategies make different tradeoffs between three amplification factors:

  • Read amplification (RA): How many SSTables must be checked for a single read. Lower RA = faster reads.
  • Write amplification (WA): How many times each byte of user data is written to disk over its lifetime. Lower WA = faster ingestion.
  • Space amplification (SA): How much extra disk space is used beyond the logical data size. Lower SA = less storage cost.

No strategy minimizes all three simultaneously — this is the fundamental tradeoff of LSM design. Choosing a strategy means choosing which factor to sacrifice for the workload at hand.


Core Concepts

Level 0 and the Flush Problem

When a memtable is flushed, it becomes an L0 SSTable. L0 is special: its SSTables have overlapping key ranges (each SSTable contains whatever keys were in the memtable at freeze time, which can be any subset of the key space). This means a point lookup at L0 must check every L0 SSTable — there's no way to narrow the search by key range.

All compaction strategies share a common first step: merge L0 SSTables into L1, where SSTables have non-overlapping key ranges. In L1 and below, a point lookup can determine which SSTable(s) to check based on key range alone (binary search on SSTable boundaries), reducing read amplification.

L0: [a-z] [b-y] [c-x]    ← overlapping, must check all
L1: [a-f] [g-m] [n-z]    ← non-overlapping, check at most 1
L2: [a-c] [d-f] [g-i] ...← non-overlapping, check at most 1

Leveled Compaction

Leveled compaction (default in RocksDB, used by LevelDB) organizes SSTables into levels with exponentially increasing size targets. Each level's total size is a fixed multiple (the size ratio, typically 10) of the level above:

  • L1 target: 256MB
  • L2 target: 2.56GB (10 × L1)
  • L3 target: 25.6GB (10 × L2)

When a level exceeds its target size, the engine picks one SSTable from that level and merges it with all overlapping SSTables in the next level.

Key property: Within each level (L1+), SSTables have non-overlapping key ranges. This means at most one SSTable per level needs to be checked for a point lookup.

Amplification characteristics:

  • Read amplification: O(L) where L is the number of levels. With size ratio 10, a 1TB database has ~4 levels → ~4 SSTable reads per lookup. Excellent.
  • Write amplification: High — in the worst case, a single SSTable is rewritten once per level transition. With size ratio 10 and 4 levels, write amplification is approximately 10 × L = 40x. Each byte of data is rewritten ~10 times per level hop.
  • Space amplification: Low — each key has at most one copy per level, and compaction removes obsolete versions. Typically 1.1–1.2x.

Leveled compaction is optimal for read-heavy workloads with limited tolerance for space overhead — exactly the OOR's conjunction query workload after the initial bulk ingestion.

Tiered (Universal) Compaction

Tiered compaction (RocksDB's "Universal" mode, used by Cassandra) groups SSTables into tiers (sorted runs) of similar size. When the number of tiers at a size level reaches a threshold, they are merged into a single larger tier.

Key property: Each tier is a sorted run (non-overlapping internally), but different tiers can overlap with each other. A read must check one SSTable per tier.

Amplification characteristics:

  • Read amplification: O(T) where T is the number of tiers. Worse than leveled because tiers overlap.
  • Write amplification: Low — each compaction merges tiers of similar size, so each byte is rewritten fewer times. Typically 2–5x for well-configured tiering.
  • Space amplification: High — during compaction, both the input tiers and the output tier exist simultaneously, requiring up to 2x the logical data size in temporary space. Permanent space amplification is also higher because multiple tiers may hold different versions of the same key.

Tiered compaction is optimal for write-heavy workloads where ingestion throughput matters more than read latency — exactly the OOR's burst ingestion during a fragmentation event.

FIFO Compaction

FIFO compaction simply deletes the oldest SSTable when total storage exceeds a threshold. No merging occurs. This is appropriate for time-series data with a natural expiration window — if the OOR only needs TLE records from the last 7 days, FIFO compaction automatically ages out older data.

Amplification characteristics:

  • Read amplification: High — all SSTables persist until aged out.
  • Write amplification: 1x — data is written once (the initial flush) and never rewritten.
  • Space amplification: Bounded by the retention window.

FIFO is unsuitable for the OOR's catalog (which must retain all active objects indefinitely) but useful for the telemetry stream (which has natural time-based expiration).

Amplification Tradeoff Summary

Strategy    Read Amp        Write Amp       Space Amp       Best For
Leveled     Low (O(L))      High (~10L)     Low (~1.1x)     Read-heavy, space-sensitive
Tiered      Medium (O(T))   Low (~2-5x)     High (~2x)      Write-heavy, burst ingestion
FIFO        High            Minimal (1x)    Time-bounded    TTL data, append-only streams

The RocksDB wiki summarizes the tradeoff space clearly: it is generally impossible to minimize all three amplification factors simultaneously. The compaction strategy is a knob that slides between them.

Compaction Mechanics: The Merge Process

Regardless of strategy, the actual compaction operation is the same:

  1. Select input SSTables (determined by the strategy).
  2. Create a merge iterator over all input SSTables.
  3. For each key in sorted order:
    • If multiple versions exist, keep only the newest.
    • If the newest version is a tombstone and there are no older SSTables that might contain the key, drop the tombstone (garbage collection).
    • Otherwise, write the entry to the output SSTable(s).
  4. Split output into new SSTables when they reach the target file size.
  5. Atomically update the LSM metadata to swap old SSTables for new ones.
  6. Delete the old input SSTables.

Step 5 is critical for crash safety — if the engine crashes during compaction, it must not lose data. Either the old SSTables or the new ones should be the active set, never a mix. This is typically handled by writing a manifest (metadata log) that records which SSTables are active, and updating it atomically (via rename or WAL).
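
The merge loop in miniature: this sketch materializes and sorts the inputs for clarity, where a real engine streams through the merge iterator from Lesson 1. The compact function, its Vec-based inputs, and the bottommost flag are illustrative, not the project's prescribed API.

/// Compact sorted runs into one run. `inputs[0]` is the newest run; entries
/// are (key, value) where value None is a tombstone.
fn compact(
    inputs: Vec<Vec<(Vec<u8>, Option<Vec<u8>>)>>,
    bottommost: bool,
) -> Vec<(Vec<u8>, Option<Vec<u8>>)> {
    // Flatten to (key, source_id, value); lower source_id = newer run.
    let mut entries: Vec<(Vec<u8>, usize, Option<Vec<u8>>)> = inputs
        .into_iter()
        .enumerate()
        .flat_map(|(sid, run)| run.into_iter().map(move |(k, v)| (k, sid, v)))
        .collect();
    // Sorting by (key, source_id) puts the newest version of each key first.
    entries.sort();

    let mut out: Vec<(Vec<u8>, Option<Vec<u8>>)> = Vec::new();
    let mut last_key: Option<Vec<u8>> = None;
    for (key, _sid, value) in entries {
        if last_key.as_ref() == Some(&key) {
            continue; // an older version of an already-decided key: drop it
        }
        last_key = Some(key.clone());
        // Keep tombstones unless this is the bottommost level, where no
        // older SSTable can contain the key (garbage collection).
        if value.is_some() || !bottommost {
            out.push((key, value));
        }
    }
    out
}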

Compaction Scheduling

Compaction runs in background threads and must be scheduled carefully:

  • Too little compaction: Read amplification grows as uncompacted SSTables accumulate.
  • Too much compaction: Write bandwidth is consumed by compaction, starving foreground writes.
  • Compaction during peak load: The background I/O from compaction interferes with foreground query latency.

Production engines use rate limiting (RocksDB's rate_limiter) to cap compaction I/O bandwidth, and priority scheduling to defer compaction during high-load periods. The SILK paper (USENIX ATC '19) formalized this as a latency-aware compaction scheduler.

For the OOR, the practical guideline: run compaction aggressively during quiet periods (between pass windows) and throttle during conjunction query bursts.


Code Examples

A Simple Leveled Compaction Controller

This determines which SSTables to compact and when, based on level size targets.

/// Metadata for an SSTable in the LSM state.
#[derive(Debug, Clone)]
struct SstMeta {
    id: u64,
    level: usize,
    size_bytes: u64,
    min_key: Vec<u8>,
    max_key: Vec<u8>,
}

/// LSM state: tracks all active SSTables by level.
struct LsmState {
    /// L0 SSTables (overlapping key ranges, newest first)
    l0_sstables: Vec<SstMeta>,
    /// L1+ levels: each level is a vec of non-overlapping SSTables sorted by key range
    levels: Vec<Vec<SstMeta>>,
    /// Size ratio between adjacent levels (typically 10)
    size_ratio: u64,
    /// L1 target size in bytes
    l1_target_bytes: u64,
}

/// A compaction task: which SSTables to merge and where to put the output.
struct CompactionTask {
    input_ssts: Vec<SstMeta>,
    output_level: usize,
}

impl LsmState {
    /// Determine if compaction is needed and generate a task.
    fn generate_compaction_task(&self) -> Option<CompactionTask> {
        // Priority 1: Too many L0 SSTables (merge all L0 into L1)
        if self.l0_sstables.len() >= 4 {
            let mut inputs: Vec<SstMeta> = self.l0_sstables.clone();
            // Include all L1 SSTables that overlap with any L0 SSTable
            let l0_min = inputs.iter().map(|s| &s.min_key).min().unwrap().clone();
            let l0_max = inputs.iter().map(|s| &s.max_key).max().unwrap().clone();
            if let Some(l1) = self.levels.get(0) {
                for sst in l1 {
                    if sst.max_key >= l0_min && sst.min_key <= l0_max {
                        inputs.push(sst.clone());
                    }
                }
            }
            return Some(CompactionTask {
                input_ssts: inputs,
                output_level: 1,
            });
        }

        // Priority 2: A level exceeds its target size
        for (i, level) in self.levels.iter().enumerate() {
            let level_num = i + 1; // levels[0] = L1
            let target = self.l1_target_bytes * self.size_ratio.pow(i as u32);
            let actual: u64 = level.iter().map(|s| s.size_bytes).sum();

            if actual > target {
                // Pick the SSTable with the most overlap in the next level
                // (simplified: pick the first SSTable)
                if let Some(sst) = level.first() {
                    let mut inputs = vec![sst.clone()];
                    // Add overlapping SSTables from the next level
                    if let Some(next_level) = self.levels.get(i + 1) {
                        for next_sst in next_level {
                            if next_sst.max_key >= sst.min_key
                                && next_sst.min_key <= sst.max_key
                            {
                                inputs.push(next_sst.clone());
                            }
                        }
                    }
                    return Some(CompactionTask {
                        input_ssts: inputs,
                        output_level: level_num + 1,
                    });
                }
            }
        }

        None // No compaction needed
    }
}

The L0-to-L1 compaction merges all L0 SSTables with the overlapping portion of L1. This is the most expensive compaction operation (L0 SSTables overlap each other, so the entire key range may be involved), but it's necessary to establish the non-overlapping property at L1.

The level-to-level compaction picks a single SSTable from the overfull level and merges it with the overlapping SSTables in the next level. Production implementations (RocksDB) cycle through SSTables in key order to ensure uniform compaction across the key space, preventing hot spots.


Key Takeaways

  • Compaction is the LSM engine's background maintenance process — it merges SSTables to reduce read amplification and space amplification at the cost of write amplification.
  • The three amplification factors (read, write, space) are fundamentally in tension. No compaction strategy minimizes all three. Leveled compaction favors reads; tiered favors writes; FIFO favors write throughput for TTL data.
  • Leveled compaction organizes SSTables into levels with non-overlapping key ranges and exponentially increasing size targets. Write amplification of ~10x per level is the cost of O(L) read amplification.
  • Tiered compaction groups SSTables into sorted runs of similar size. Write amplification of 2-5x is the reward for tolerating higher read and space amplification.
  • L0 is special: SSTables have overlapping key ranges and must all be checked on every read. Flushing L0 to L1 is the highest-priority compaction task.
  • Compaction scheduling must balance background I/O against foreground query latency. Throttle during peak load, compact aggressively during quiet periods.

Lesson 3 — Bloom Filters, Block Cache, and Read Optimization

Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapters 7–8; Mini-LSM Week 1 Day 7

Source note: This lesson was synthesized from training knowledge. Verify the bloom filter bits-per-key formula and Petrov's block cache eviction policies against the source texts.



Context

The LSM read path from Lesson 1 checks every SSTable from newest to oldest. For a database with 50 SSTables, a negative lookup (key doesn't exist) requires 50 SSTable probes — each involving reading an index block and potentially a data block from disk. Even with leveled compaction limiting the effective probe count to one SSTable per level, a 4-level LSM still reads 4 SSTables per negative lookup.

Two optimizations close the read performance gap between LSM trees and B+ trees:

  1. Bloom filters — a probabilistic data structure attached to each SSTable that answers "is this key possibly in this SSTable?" in O(1) with no disk I/O. A "no" answer is definitive (the key is definitely not there), so the engine skips the entire SSTable. With a 1% false positive rate, 99% of unnecessary SSTable probes are eliminated.

  2. Block cache — an in-memory cache of recently-read SSTable data blocks and index blocks. Hot blocks stay in memory, eliminating disk reads for frequently-accessed keys. Combined with index block pinning (keeping all index blocks in memory permanently), this makes most SSTable lookups a single data block read.

Together, these two techniques reduce the practical I/O cost of an LSM point lookup to approximately 1 disk read — competitive with the B+ tree.


Core Concepts

Bloom Filters

A bloom filter is a bit array of m bits with k independent hash functions. To add a key, hash it with all k functions and set the corresponding bits. To check a key, hash it and verify all k bits are set. If any bit is 0, the key is definitely not in the set. If all bits are 1, the key is probably in the set (but might be a false positive).

The false positive probability for a bloom filter with m bits, k hash functions, and n inserted keys is approximately:

FPR ≈ (1 - e^(-kn/m))^k

The optimal number of hash functions for a given m/n (bits per key) ratio is k = (m/n) × ln(2).

Practical sizing for the OOR:

Bits per key   False positive rate   Memory per 10,000 keys
5              9.2%                  6.1 KB
10             0.82%                 12.2 KB
14             0.08%                 17.1 KB
20             0.0006%               24.4 KB

10 bits per key is the standard choice (RocksDB default) — it provides a ~1% false positive rate at modest memory cost. For the OOR's 100,000 keys, the bloom filter per SSTable adds ~122 KB, trivially small compared to the SSTable data itself.

A bloom filter is stored in the SSTable's meta block and loaded into memory when the SSTable is opened. It is never written to disk again — it's read-only once created. This means the filter is computed during the SSTable build (compaction or flush) and persists for the SSTable's lifetime.

Bloom Filter in the Read Path

When the engine needs to check an SSTable for a key:

  1. Consult the in-memory bloom filter for that SSTable.
  2. If the filter says "no" → skip the SSTable entirely. Zero disk I/O.
  3. If the filter says "maybe" → read the index block, find the candidate data block, read the data block, and search for the key.

For a negative lookup across 50 SSTables with a 1% FPR:

  • Without bloom filters: 50 SSTable probes (50 index reads + up to 50 data reads).
  • With bloom filters: ~0.5 SSTable probes on average (50 × 0.01 = 0.5 false positives).

This transforms the LSM's worst-case operation into a near-best-case: negative lookups, which previously cost roughly one probe per SSTable (dozens of disk reads), now average about half a disk read in total.

Block Cache

The block cache is an LRU cache (similar to the buffer pool from Module 1) that stores recently-accessed SSTable blocks in memory. Unlike the buffer pool, which caches full pages, the block cache stores individual SSTable blocks (typically 4KB) keyed by (sst_id, block_offset).

Two categories of blocks are cached:

  • Data blocks — the actual key-value data. Cached on demand (when a read hits the block).
  • Index blocks — the SSTable's internal index mapping last-key-per-block to block offset. Frequently accessed (every point lookup into the SSTable reads the index block first).

Many engines pin index blocks and filter blocks in the cache — they are loaded when the SSTable is opened and never evicted. This guarantees that every SSTable lookup requires at most one disk read (for the data block), because the filter and index are always in memory.

The block cache sits above the OS page cache and provides the engine with workload-aware caching. Like the buffer pool, it exists because the OS page cache doesn't understand SSTable access patterns — it can't distinguish between a hot data block and a cold compaction input.
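
A sketch of the caching mechanics behind the get_or_load call used in the Lesson 3 code example below (which additionally decodes the raw bytes into a searchable data block). Eviction here picks an arbitrary victim for brevity; a production cache maintains true LRU order and accounts capacity in bytes. Note that read_exact_at is Unix-only (Windows offers seek_read).

use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Block cache keyed by (sst_id, block_offset).
struct BlockCache {
    blocks: HashMap<(u64, u64), Vec<u8>>,
    capacity: usize, // max number of cached blocks (bytes in a real engine)
}

impl BlockCache {
    fn get_or_load(
        &mut self,
        sst_id: u64,
        offset: u64,
        size: usize,
        file: &File,
    ) -> io::Result<&Vec<u8>> {
        // Make room before inserting a new block.
        if self.blocks.len() >= self.capacity
            && !self.blocks.contains_key(&(sst_id, offset))
        {
            if let Some(&victim) = self.blocks.keys().next() {
                self.blocks.remove(&victim); // arbitrary victim, not true LRU
            }
        }
        match self.blocks.entry((sst_id, offset)) {
            Entry::Occupied(e) => Ok(e.into_mut()), // hit: zero disk I/O
            Entry::Vacant(e) => {
                let mut buf = vec![0u8; size];
                file.read_exact_at(&mut buf, offset)?; // miss: one disk read
                Ok(e.insert(buf))
            }
        }
    }
}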

Prefix Bloom Filters

Standard bloom filters answer "is this exact key in the SSTable?" For range scans, you need a different question: "does this SSTable contain any keys with this prefix?" A prefix bloom filter hashes key prefixes instead of full keys, enabling prefix-based filtering.

For the OOR, a prefix bloom on the first 3 bytes of the NORAD ID would let range scans skip SSTables that don't contain any keys in the target range. The false positive rate is higher (more keys share a prefix than match exactly), but the I/O savings for range scans are significant.

Combining Optimizations: End-to-End Read Path

A fully optimized LSM point lookup:

  1. Check active memtable (in-memory, O(log N)).
  2. Check immutable memtables (in-memory, O(log N) each).
  3. For each SSTable, newest to oldest:
    a. Check the bloom filter (in-memory, O(k) hash operations). If negative → skip.
    b. Read the index block (in block cache → 0 disk I/O if pinned).
    c. Binary search the index block for the target data block.
    d. Read the data block (in block cache → 0 I/O if hot; 1 disk read if cold).
    e. Search the data block for the key.
  4. First match (value or tombstone) terminates the search.

For a positive lookup on a hot key: 0 disk reads (everything in cache). For a negative lookup: 0 disk reads (bloom filters reject all SSTables). For a cold positive lookup: 1 disk read (the data block; index and filter are pinned).


Code Examples

A Simple Bloom Filter for SSTable Key Filtering

This implements the core bloom filter operations — build during SSTable creation, query during reads.

use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;

/// A bloom filter for probabilistic key membership testing.
struct BloomFilter {
    bits: Vec<u8>,
    num_bits: usize,
    num_hashes: u32,
}

impl BloomFilter {
    /// Create a bloom filter sized for `num_keys` with the given
    /// bits-per-key ratio. Optimal hash count is computed automatically.
    fn new(num_keys: usize, bits_per_key: usize) -> Self {
        let num_bits = num_keys * bits_per_key;
        let num_bytes = (num_bits + 7) / 8;
        // Optimal k = bits_per_key * ln(2) ≈ bits_per_key * 0.693
        let num_hashes = ((bits_per_key as f64) * 0.693).ceil() as u32;
        let num_hashes = num_hashes.max(1).min(30); // Clamp to [1, 30]

        Self {
            bits: vec![0u8; num_bytes],
            num_bits,
            num_hashes,
        }
    }

    /// Add a key to the bloom filter.
    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.num_hashes {
            let bit_pos = self.hash(key, i) % self.num_bits;
            self.bits[bit_pos / 8] |= 1 << (bit_pos % 8);
        }
    }

    /// Check if a key might be in the set.
    /// Returns false → definitely not in the set.
    /// Returns true → possibly in the set (check the SSTable).
    fn may_contain(&self, key: &[u8]) -> bool {
        for i in 0..self.num_hashes {
            let bit_pos = self.hash(key, i) % self.num_bits;
            if self.bits[bit_pos / 8] & (1 << (bit_pos % 8)) == 0 {
                return false; // Definitive: key is NOT in the set
            }
        }
        true // All bits set — key is PROBABLY in the set
    }

    /// Generate the i-th hash of a key using double hashing.
    /// h(i) = h1 + i * h2, where h1 and h2 are independent hashes.
    fn hash(&self, key: &[u8], i: u32) -> usize {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let hash1 = h1.finish();

        let mut h2 = DefaultHasher::new();
        // Mix in a constant to get an independent second hash
        (key, 0xDEADBEEFu32).hash(&mut h2);
        let hash2 = h2.finish();

        (hash1.wrapping_add((i as u64).wrapping_mul(hash2))) as usize
    }

    /// Serialize the bloom filter for storage in the SSTable meta block.
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = Vec::with_capacity(4 + 4 + self.bits.len());
        buf.extend_from_slice(&(self.num_bits as u32).to_le_bytes());
        buf.extend_from_slice(&self.num_hashes.to_le_bytes());
        buf.extend_from_slice(&self.bits);
        buf
    }
}

The double hashing technique generates k hash values from just two base hashes: h(i) = h1 + i × h2. This is mathematically equivalent to using k independent hash functions for bloom filter purposes (proven by Kirsch and Mitzenmacher, 2006) and much cheaper to compute.

The DefaultHasher is SipHash, which is well-distributed but not the fastest. Production bloom filters use xxHash or wyhash for speed. The algorithm is the same regardless of hash function — only throughput changes.

Integrating the Bloom Filter into SSTable Reads

Modifying the LSM read path to consult bloom filters before reading SSTable blocks.

/// Check an SSTable for a key, using the bloom filter to skip if possible.
fn check_sstable(
    sst: &SsTableReader,
    key: &[u8],
    block_cache: &mut BlockCache,
) -> io::Result<Option<Option<Vec<u8>>>> {
    // Step 1: Bloom filter check (in-memory, zero I/O)
    if !sst.bloom_filter.may_contain(key) {
        return Ok(None); // Definitely not in this SSTable
    }

    // Step 2: Index block lookup (cached or pinned, usually zero I/O)
    let block_handle = sst.find_block_for_key(key, block_cache)?;

    // Step 3: Data block read (1 disk read if not cached)
    let data_block = block_cache.get_or_load(
        sst.id,
        block_handle.offset,
        block_handle.size,
        &sst.file,
    )?;

    // Step 4: Search the data block for the key
    match data_block.find(key) {
        Some(entry) => Ok(Some(entry.value.clone())), // Found (value or tombstone)
        None => Ok(None), // Key not in this block (bloom filter false positive)
    }
}

When the bloom filter returns false (Step 1 in the code above), the entire SSTable is skipped — no index read, no data read, no disk I/O. This is the single biggest read-path optimization in the LSM architecture.

When the bloom filter returns true but the key isn't actually in the SSTable (false positive), the engine performs an unnecessary index + data block read. At 1% FPR, this happens once per 100 negative probes per SSTable — rare enough to be negligible.


Key Takeaways

  • Bloom filters eliminate 99% of unnecessary SSTable probes for negative lookups at 10 bits per key. This transforms the LSM's worst case (negative lookups checking every SSTable) into a near-zero-I/O operation.
  • The false positive rate is tunable via bits-per-key: 10 bits → ~1% FPR, 14 bits → ~0.1% FPR. The OOR should use 10 bits per key as the default, matching RocksDB's default.
  • Block cache stores recently-accessed SSTable blocks in memory. Pinning index and filter blocks guarantees that every SSTable lookup costs at most 1 disk read (for the data block).
  • The fully optimized LSM read path: bloom filter (in-memory) → index block (pinned in cache) → data block (cached or 1 disk read). For hot keys, this is 0 disk reads — competitive with B+ tree performance.
  • Prefix bloom filters extend filtering to range scans by hashing key prefixes instead of full keys. Higher false positive rate but significant I/O savings for prefix-based range queries.

Project — LSM-Backed TLE Storage Engine

Module: Database Internals — M03: LSM Trees & Compaction
Track: Orbital Object Registry
Estimated effort: 8–10 hours



SDA Incident Report — OOR-2026-0044

Classification: ENGINEERING DIRECTIVE
Subject: Build LSM storage engine prototype for write-optimized TLE ingestion

The B+ tree index cannot sustain burst ingestion rates during fragmentation events. Build an LSM-tree-based storage engine that batches writes in a memtable, flushes to immutable SSTables, and uses leveled compaction to bound read amplification. Bloom filters on each SSTable must reduce negative lookup cost to near-zero.


Objective

Build a complete LSM storage engine that:

  1. Accepts put(key, value) and delete(key) into an in-memory memtable
  2. Freezes and flushes the memtable to SSTable files when a size threshold is reached
  3. Supports get(key) by probing the memtable, then SSTables from newest to oldest
  4. Implements a simple leveled compaction (merge all L0 into L1) triggered by SSTable count
  5. Attaches a bloom filter to each SSTable for fast negative lookups
  6. Supports scan(start_key, end_key) via a merge iterator over all sources

Acceptance Criteria

  1. Write throughput. Insert 100,000 TLE records with a 4MB memtable limit. Measure and print the total time and records/second. Target: >50,000 records/second (in-memory memtable inserts).

  2. Memtable flush. Verify that SSTables are created on disk after the memtable reaches the size threshold. Print the number of SSTables after all inserts.

  3. Point lookup correctness. After all inserts, look up 1,000 random NORAD IDs and verify each returns the correct TLE record. Look up 1,000 non-existent IDs and verify each returns None.

  4. Bloom filter effectiveness. Report bloom filter hit/miss stats: how many SSTable probes were skipped by the bloom filter during the 1,000 negative lookups. Target: >95% skip rate.

  5. Delete correctness. Delete 1,000 records. Verify get returns None for deleted keys. Verify that non-deleted keys adjacent to deleted keys still return correct values.

  6. Compaction. Trigger compaction (merge L0 SSTables into L1). Verify that the number of SSTables decreases. Verify all keys are still accessible after compaction. Verify that deleted keys (tombstones) are garbage-collected if compaction output is the bottommost level.

  7. Range scan. Scan NORAD IDs [40000, 40500]. Verify the results are in sorted order and include exactly the expected keys.


Starter Structure

lsm-storage/
├── Cargo.toml
├── src/
│   ├── main.rs            # Entry point: runs acceptance criteria
│   ├── memtable.rs        # MemTable: BTreeMap-backed sorted store
│   ├── sstable.rs         # SsTableBuilder, SsTableReader
│   ├── bloom.rs           # BloomFilter: insert, may_contain, serialize
│   ├── merge_iter.rs      # MergeIterator: multi-way merge over sorted sources
│   ├── compaction.rs      # Compaction controller and merge logic
│   └── lsm.rs             # LsmEngine: top-level API (put, get, delete, scan)

Hints

Hint 1 — SSTable file naming

Name SSTable files with a monotonically increasing ID: sst_000001.dat, sst_000002.dat, etc. Higher IDs are newer. The LSM state (which SSTables are active) can be tracked in memory as a Vec<SstMeta> per level. Persist the LSM state to a manifest file for crash recovery (or defer this to Module 4).

Hint 2 — Simple L0→L1 compaction trigger

The simplest compaction trigger: when the number of L0 SSTables reaches 4, merge all L0 SSTables into a single sorted L1 SSTable. This eliminates L0's overlapping key ranges. If L1 already has SSTables, include them in the merge to maintain the non-overlapping invariant.

Hint 3 — Merge iterator design

Use a BinaryHeap<Reverse<(key, source_id, value)>> as the min-heap. The source_id breaks ties: lower source ID = newer source. When popping an entry, skip all entries with the same key from older sources (they are superseded).

Hint 4 — Atomicity of SSTable swap

During compaction, write all output SSTables before modifying the LSM state. Then atomically update the state (swap old inputs for new outputs). If the engine crashes mid-compaction, the old SSTables are still valid — the output SSTables are orphaned files that can be cleaned up. This is the simplest crash-safe compaction strategy without a full WAL/manifest (which Module 4 adds).


What Comes Next

Module 4 (WAL & Recovery) adds durability. The memtable is volatile — if the process crashes, unflushed writes are lost. The WAL logs every write before it enters the memtable, enabling recovery. The manifest log tracks which SSTables are active, enabling crash-safe compaction.

Module 04 — Write-Ahead Logging & Recovery

Track: Database Internals — Orbital Object Registry
Position: Module 4 of 6
Source material: Database Internals — Alex Petrov, Chapters 9–10; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

SDA INCIDENT REPORT — OOR-2026-0045
Classification: DATA LOSS INCIDENT
Subject: 2,400 TLE records lost after unplanned power failure

At 03:17 UTC, a PDU failure at Ground Station Bravo caused an unclean shutdown of the OOR storage engine. The active memtable contained approximately 2,400 TLE updates from the preceding 12-minute pass window. Because the memtable is a volatile in-memory structure, all 2,400 records were lost. The LSM engine restarted with only the previously flushed SSTables, leaving the catalog 12 minutes stale. Two conjunction alerts were delayed because the missing TLEs contained the most recent orbital elements for objects in a close-approach trajectory.

Directive: Implement a write-ahead log. Every mutation must be logged to durable storage before it is applied to the memtable. On crash recovery, replay the WAL to reconstruct the memtable to its pre-crash state.


Learning Outcomes

After completing this module, you will be able to:

  1. Explain the write-ahead rule and why it is the foundation of crash recovery
  2. Implement a WAL that logs key-value operations to an append-only file with checksummed records
  3. Describe the ARIES recovery protocol — analysis, redo, and undo phases
  4. Implement crash recovery by replaying the WAL to reconstruct the memtable
  5. Design a checkpointing strategy that limits WAL replay time after a crash
  6. Reason about the tradeoff between fsync frequency and durability guarantees

Lesson Summary

Lesson 1 — WAL Fundamentals

The write-ahead rule, log record format, LSN ordering, and the WAL's role in the LSM write path. Why fsync is the only guarantee of durability, and the latency cost of calling it.

Key question: What is the maximum data loss window under group commit with 50ms batch intervals?

Lesson 2 — Crash Recovery

The ARIES recovery protocol adapted for the LSM engine. Analysis phase (determine which WAL records need replay), redo phase (replay committed operations into the memtable), and how the manifest tracks SSTable state for consistent restart.

Key question: If the engine crashes during a compaction, does it use the old or new SSTables on recovery?

Lesson 3 — Checkpointing

Fuzzy checkpoints that snapshot the LSM state without blocking writes. How checkpoints bound WAL replay time and enable WAL truncation. The tradeoff between checkpoint frequency and recovery time.

Key question: What is the maximum WAL replay time for the OOR with 60-second checkpoint intervals?


Capstone Project — Durable TLE Update Pipeline

Add WAL-based durability to the Module 3 LSM engine. Every write is logged before entering the memtable. On simulated crash, the engine recovers to a consistent state by replaying the WAL. Full brief in project-durable-pipeline.md.


File Index

module-04-wal-recovery/
├── README.md
├── lesson-01-wal-fundamentals.md
├── lesson-01-quiz.toml
├── lesson-02-crash-recovery.md
├── lesson-02-quiz.toml
├── lesson-03-checkpointing.md
├── lesson-03-quiz.toml
└── project-durable-pipeline.md

Prerequisites

  • Module 3 (LSM Trees & Compaction) completed

What Comes Next

Module 5 (Transactions & Isolation) adds concurrent read/write support with MVCC snapshot isolation, building on the durable foundation established here.

Lesson 1 — WAL Fundamentals

Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 9–10; Mini-LSM Week 2 Day 6

Source note: This lesson was synthesized from training knowledge. Verify Petrov's WAL record format, LSN semantics, and fsync discussion against Chapters 9–10.



Context

The LSM engine from Module 3 achieves high write throughput by buffering writes in a memtable and flushing them to SSTables in the background. But the memtable is volatile — it lives in process memory. If the process crashes, the OS kills the process, or power fails, the memtable's contents are lost. For the OOR, this means losing every TLE update since the last flush — potentially minutes of orbital data during an active pass window.

The Write-Ahead Log (WAL) is the solution: an append-only file on durable storage that records every mutation before it is applied to the memtable. The key insight is the write-ahead rule: no modification to the in-memory state is visible until its corresponding log record has been durably written to the WAL. If the engine crashes after logging but before flushing, the WAL contains enough information to reconstruct the memtable by replaying the logged operations.

This changes the durability guarantee from "data is safe after SSTable flush" (every 5–60 seconds) to "data is safe after WAL write" (every operation, or every batch). The cost is one sequential disk write per operation (or per batch) — but sequential appends are cheap, especially on SSDs.


Core Concepts

The Write-Ahead Rule

The rule is simple and inviolable: log first, then mutate. The LSM write path becomes:

  1. Serialize the operation (put or delete) into a WAL log record.
  2. Append the record to the WAL file.
  3. Call fsync on the WAL file (or batch fsync for a group of records).
  4. Apply the operation to the memtable.
  5. Return success to the caller.

If the engine crashes between steps 2 and 4, the WAL contains the operation but the memtable does not — recovery will replay it. If the engine crashes before step 2, the operation was never logged — the caller did not receive a success response, so the operation is not considered committed.

WAL Record Format

Each WAL record is a self-contained, checksummed unit:

┌──────────────────────────────────────────┐
│ Record Header                            │
│  - LSN: u64 (log sequence number)        │
│  - record_type: u8 (Put=1, Delete=2)     │
│  - key_len: u16                          │
│  - value_len: u16                        │
│  - checksum: u32 (CRC32 of key+value)    │
├──────────────────────────────────────────┤
│ Key bytes (key_len bytes)                │
├──────────────────────────────────────────┤
│ Value bytes (value_len bytes, if Put)    │
└──────────────────────────────────────────┘

The Log Sequence Number (LSN) is a monotonically increasing identifier assigned to each record. LSNs establish a total order over all operations — recovery replays records in LSN order to reconstruct the exact pre-crash state. The LSN also correlates WAL records with SSTable flushes: when the memtable is flushed, the engine records the highest LSN contained in that flush. On recovery, only WAL records with LSNs greater than the last flushed LSN need to be replayed.

The checksum protects against partial writes: if the engine crashes mid-write, the incomplete record will have an invalid checksum and is discarded during recovery.

fsync Strategies

fsync is expensive: 0.1–1ms on SSD, 5–30ms on spinning disk. Three strategies for trading durability against throughput:

Per-operation fsync: Maximum durability — at most one operation can be lost. Throughput limited by fsync latency (~1,000–10,000 ops/sec on SSD).

Group commit (batch fsync): Buffer multiple WAL writes, then fsync the batch. If the batch covers 100 operations and fsync takes 0.5ms, the amortized cost is 5µs per operation. Standard approach — used by RocksDB, PostgreSQL, MySQL.

No fsync: Maximum throughput, minimum durability — a crash can lose up to 30 seconds of data. Acceptable for caches, not for the OOR.

For the OOR, group commit is the correct choice: batch TLE updates per pass window, fsync once per batch.
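
A sketch of group commit against the WalWriter defined below in Code Examples: buffered appends for the whole batch, then a single fsync.

/// Log a batch of puts, then fsync once; the fsync cost is amortized
/// across every operation in the batch.
fn log_batch(wal: &mut WalWriter, batch: &[(Vec<u8>, Vec<u8>)]) -> io::Result<u64> {
    let mut last_lsn = 0;
    for (key, value) in batch {
        last_lsn = wal.log_put(key, value)?; // buffered append, no fsync
    }
    wal.sync()?;  // one fsync makes the whole batch durable
    Ok(last_lsn)  // callers may be acknowledged up to this LSN
}

Only after sync returns may the engine report success for any operation in the batch; acknowledging earlier would violate the write-ahead rule.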

WAL in the LSM Write Path

put(key, value)
     │
     ▼
 WAL append (serialize + write + optional fsync)
     │
     ▼
 Memtable insert (in-memory, fast)
     │
     ▼
 Return success

When the memtable is flushed to an SSTable, the engine records the flushed LSN. WAL records at or below this LSN are no longer needed for recovery, enabling WAL truncation.


Code Examples

WAL Writer: Appending Checksummed Records

use std::io::{self, Write, BufWriter};
use std::fs::{File, OpenOptions};

#[repr(u8)]
#[derive(Clone, Copy)]
enum WalRecordType {
    Put = 1,
    Delete = 2,
}

struct WalWriter {
    writer: BufWriter<File>,
    next_lsn: u64,
}

impl WalWriter {
    fn open(path: &str) -> io::Result<Self> {
        let file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(path)?;
        Ok(Self {
            writer: BufWriter::new(file),
            next_lsn: 0,
        })
    }

    fn log_put(&mut self, key: &[u8], value: &[u8]) -> io::Result<u64> {
        self.write_record(WalRecordType::Put, key, Some(value))
    }

    fn log_delete(&mut self, key: &[u8]) -> io::Result<u64> {
        self.write_record(WalRecordType::Delete, key, None)
    }

    fn write_record(
        &mut self,
        rec_type: WalRecordType,
        key: &[u8],
        value: Option<&[u8]>,
    ) -> io::Result<u64> {
        let lsn = self.next_lsn;
        self.next_lsn += 1;
        let val = value.unwrap_or(&[]);

        // Checksum covers key + value
        let mut hasher = crc32fast::Hasher::new();
        hasher.update(key);
        hasher.update(val);
        let checksum = hasher.finalize();

        // Header: LSN(8) + type(1) + key_len(2) + val_len(2) + checksum(4) = 17 bytes
        self.writer.write_all(&lsn.to_le_bytes())?;
        self.writer.write_all(&[rec_type as u8])?;
        self.writer.write_all(&(key.len() as u16).to_le_bytes())?;
        self.writer.write_all(&(val.len() as u16).to_le_bytes())?;
        self.writer.write_all(&checksum.to_le_bytes())?;
        self.writer.write_all(key)?;
        self.writer.write_all(val)?;

        Ok(lsn)
    }

    /// Flush to durable storage. Call after a batch for group commit.
    fn sync(&mut self) -> io::Result<()> {
        self.writer.flush()?;
        self.writer.get_ref().sync_all()
    }
}

WAL Reader: Replaying Records for Recovery

use std::io::{self, Read, BufReader};
use std::fs::File;

struct WalRecord {
    lsn: u64,
    rec_type: WalRecordType,
    key: Vec<u8>,
    value: Vec<u8>,
}

struct WalReader {
    reader: BufReader<File>,
}

impl WalReader {
    fn open(path: &str) -> io::Result<Self> {
        Ok(Self { reader: BufReader::new(File::open(path)?) })
    }

    fn next_record(&mut self) -> io::Result<Option<WalRecord>> {
        let mut header = [0u8; 17];
        match self.reader.read_exact(&mut header) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
            Err(e) => return Err(e),
        }

        let lsn = u64::from_le_bytes(header[0..8].try_into().unwrap());
        let rec_type = match header[8] {
            1 => WalRecordType::Put,
            2 => WalRecordType::Delete,
            _ => return Err(io::Error::new(
                io::ErrorKind::InvalidData, "invalid WAL record type",
            )),
        };
        let key_len = u16::from_le_bytes(header[9..11].try_into().unwrap()) as usize;
        let val_len = u16::from_le_bytes(header[11..13].try_into().unwrap()) as usize;
        let stored_checksum = u32::from_le_bytes(header[13..17].try_into().unwrap());

        let mut key = vec![0u8; key_len];
        let mut value = vec![0u8; val_len];
        self.reader.read_exact(&mut key)?;
        self.reader.read_exact(&mut value)?;

        // Verify checksum — detects partial writes from crashes
        let mut hasher = crc32fast::Hasher::new();
        hasher.update(&key);
        hasher.update(&value);
        if hasher.finalize() != stored_checksum {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("WAL checksum mismatch at LSN {} — partial write detected", lsn),
            ));
        }

        Ok(Some(WalRecord { lsn, rec_type, key, value }))
    }
}

The checksum verification is the corruption boundary: a failed checksum means the engine crashed mid-write. All prior records are valid; this record and everything after it are discarded.


Key Takeaways

  • The write-ahead rule is absolute: log the operation to durable storage before applying it to the memtable. This guarantees that committed operations survive crashes.
  • fsync is the durability boundary. Group commit amortizes the cost across many operations — the standard approach for production engines.
  • Each WAL record is self-contained with a CRC32 checksum. Partial writes are detected and discarded during recovery.
  • The LSN orders all operations and correlates WAL records with SSTable flushes. WAL records at or below the flushed LSN are safe to truncate.
  • WAL writes are sequential appends — the cheapest form of disk I/O.

Lesson 2 — Crash Recovery

Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 10

Source note: This lesson was synthesized from training knowledge. Verify Petrov's ARIES protocol adaptation for LSM engines against Chapter 10.



Context

The WAL ensures every committed operation is recorded on durable storage. After a crash, the engine must use that log to return to a consistent state. This is the job of the crash recovery protocol — a deterministic procedure that reads the WAL, determines what was lost, and reconstructs the in-memory state.

For the OOR's LSM engine, recovery is simpler than for a traditional B-tree database because SSTables are immutable. There are no dirty pages to redo or uncommitted transactions to undo — the only volatile state is the memtable. Recovery reconstructs the memtable by replaying WAL records that were not yet flushed to an SSTable.

The classical protocol is ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). ARIES has three phases — analysis, redo, and undo. For the LSM engine, we adapt it: the analysis phase determines the recovery starting point, the redo phase replays the WAL into the memtable, and the undo phase is unnecessary (no uncommitted transactions to roll back at this stage).


Core Concepts

The Manifest

The manifest is a metadata file that records the LSM engine's durable state: which SSTables are active, what level each belongs to, and the highest flushed LSN. On recovery, the manifest is the starting point.

The manifest is itself an append-only log. Each entry records a state change:

[1] AddSSTable { id: 1, level: 0, min_key: "a", max_key: "z", flushed_lsn: 500 }
[2] AddSSTable { id: 2, level: 0, min_key: "b", max_key: "y", flushed_lsn: 1000 }
[3] Compaction { removed: [1, 2], added: [3], output_level: 1, flushed_lsn: 1000 }
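
A sketch of manifest replay over an in-memory form of these entries, in the spirit of the ManifestReader::replay used in the recovery code below; the variant and field names mirror the examples above but are illustrative:

use std::collections::HashMap;

enum ManifestEntry {
    AddSsTable { id: u64, level: u32, flushed_lsn: u64 },
    Compaction { removed: Vec<u64>, added: Vec<(u64, u32)>, flushed_lsn: u64 },
}

/// Fold entries, in log order, into the current durable state:
/// the active SSTable set (id -> level) and the highest flushed LSN.
fn replay_manifest(entries: &[ManifestEntry]) -> (HashMap<u64, u32>, u64) {
    let mut active = HashMap::new();
    let mut flushed_lsn = 0u64;
    for entry in entries {
        match entry {
            ManifestEntry::AddSsTable { id, level, flushed_lsn: lsn } => {
                active.insert(*id, *level);
                flushed_lsn = flushed_lsn.max(*lsn);
            }
            ManifestEntry::Compaction { removed, added, flushed_lsn: lsn } => {
                for id in removed { active.remove(id); }
                for (id, level) in added { active.insert(*id, *level); }
                flushed_lsn = flushed_lsn.max(*lsn);
            }
        }
    }
    (active, flushed_lsn)
}

Because the manifest is append-only, replaying it from the beginning always reproduces the same state, no matter when the engine crashed.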

LSM Recovery Protocol

  1. Read the manifest. Reconstruct the active SSTable set and determine the highest flushed LSN.

  2. Open the WAL. Seek to the first record with LSN > flushed_lsn.

  3. Replay WAL records. For each valid record, apply it to a fresh memtable:

    • Put → insert the key-value pair
    • Delete → insert a tombstone
  4. Stop at corruption. If a record's checksum fails, stop replaying. That record and everything after it are discarded.

  5. Resume normal operation. The memtable now contains all committed-but-unflushed operations.

Recovery timeline:
  [SSTables on disk]  [WAL on disk]
  ├─ flushed to LSN 1000 ─┤─ LSN 1001..1234 valid ─┤─ LSN 1235 corrupt ─┤
                           │                         │
                    Replay into memtable          Stop here

Crash During Compaction

If the engine crashes mid-compaction:

  • Before manifest update: Old SSTables are still listed as active. Partially-written new SSTables are orphaned files. Recovery uses the old SSTables. Orphans are cleaned up.
  • After manifest update: New SSTables are active. Old SSTables are marked for deletion. Recovery uses the new SSTables.

The manifest update is the atomicity boundary. The pattern: write data first, then atomically update metadata.

Crash During Flush

If the engine crashes mid-flush:

  • Before manifest update: The partial SSTable is orphaned. The WAL still contains all records. Recovery replays the WAL.
  • After manifest update: The SSTable is complete and active. WAL records up to the flushed LSN are redundant.

Orphan Cleanup

On startup, the engine scans the data directory for SSTable files not referenced by the manifest. These are orphans from interrupted compactions or flushes. They are deleted before normal operation begins.


Code Examples

LSM Engine Recovery

impl LsmEngine {
    fn recover(db_path: &str) -> io::Result<Self> {
        // Phase 1: Read manifest
        let (sst_state, flushed_lsn) = ManifestReader::replay(
            &format!("{}/MANIFEST", db_path)
        )?;
        eprintln!("Recovery: {} SSTables, flushed LSN = {}", sst_state.total_count(), flushed_lsn);

        // Phase 2: Replay WAL
        let mut memtable = MemTable::new(16 * 1024 * 1024);
        let mut replayed = 0u64;
        let mut next_lsn = flushed_lsn + 1;

        if let Ok(mut reader) = WalReader::open(&format!("{}/WAL", db_path)) {
            loop {
                match reader.next_record() {
                    Ok(Some(record)) => {
                        if record.lsn <= flushed_lsn { continue; }
                        match record.rec_type {
                            WalRecordType::Put => memtable.put(record.key, record.value),
                            WalRecordType::Delete => memtable.delete(record.key),
                        }
                        next_lsn = record.lsn + 1;
                        replayed += 1;
                    }
                    Ok(None) => break,
                    Err(e) => {
                        eprintln!("Recovery: corrupt record ({}), {} replayed", e, replayed);
                        break;
                    }
                }
            }
        }
        eprintln!("Recovery: replayed {} WAL records", replayed);

        // Phase 3: Clean up orphaned SSTables
        cleanup_orphans(db_path, &sst_state)?;

        // Phase 4: Open fresh WAL and manifest for new writes.
        // (LSN assignment must continue from the recovered next_lsn,
        // not restart at zero.)
        let wal = WalWriter::open(&format!("{}/WAL", db_path))?;
        let manifest = ManifestWriter::open(&format!("{}/MANIFEST", db_path))?;

        Ok(Self { active_memtable: memtable, immutable_memtables: Vec::new(),
                  sst_state, wal, manifest, next_lsn })
    }
}

fn cleanup_orphans(db_path: &str, sst_state: &LsmState) -> io::Result<()> {
    let active_ids: std::collections::HashSet<u64> = sst_state.all_sst_ids().collect();
    for entry in std::fs::read_dir(db_path)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().to_string();
        if name.starts_with("sst_") && name.ends_with(".dat") {
            let id: u64 = name[4..name.len()-4].parse().unwrap_or(u64::MAX);
            if !active_ids.contains(&id) {
                eprintln!("Recovery: deleting orphaned SSTable {}", name);
                std::fs::remove_file(entry.path())?;
            }
        }
    }
    Ok(())
}

The recovery procedure is deterministic: given the same manifest and WAL files, it always produces the same memtable state.


Key Takeaways

  • LSM crash recovery is simpler than B-tree recovery because SSTables are immutable. The only volatile state to reconstruct is the memtable.
  • The manifest tracks active SSTables and the flushed LSN. It is the recovery starting point and the atomicity boundary for compaction and flush operations.
  • Recovery replays WAL records with LSN > flushed_lsn into a fresh memtable. Partial writes are detected by checksum failure and discarded.
  • The pattern "write data, then atomically update metadata" applies to both flushes and compactions. The manifest update is the commit point.
  • Orphan cleanup on startup removes SSTable files from interrupted operations that were never recorded in the manifest.

Lesson 3 — Checkpointing

Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 10

Source note: This lesson was synthesized from training knowledge. Verify Petrov's fuzzy checkpoint algorithm and his WAL truncation semantics against Chapter 10.



Context

Without checkpointing, the WAL grows indefinitely. If the engine has been running for 24 hours with 100,000 TLE updates, the WAL contains 100,000 records — all of which must be scanned during recovery to find those with LSN > flushed_lsn. Recovery time grows linearly with WAL size. For a system that must return to service within seconds after a crash (the OOR's conjunction avoidance SLA requires <30s recovery), unbounded WAL growth is unacceptable.

A checkpoint snapshots the current LSM state and records the position where recovery should start. After a checkpoint, WAL records before the checkpoint position can be safely deleted, bounding both WAL size and recovery time.


Core Concepts

What a Checkpoint Records

A checkpoint writes the following to the manifest:

  1. Checkpoint LSN — the highest LSN that is fully durable (either in an SSTable or committed to the WAL and fsync'd at checkpoint time).
  2. Active SSTable list — every SSTable currently in the LSM state, with level assignments.
  3. Memtable status — the LSN range of the current active memtable (not yet flushed).

After the checkpoint, the WAL can be truncated up to the minimum recovery LSN: the smallest LSN that might still need replay. This is the lower bound of the active memtable's LSN range at checkpoint time.

Fuzzy Checkpoints

A sharp checkpoint freezes all writes, flushes the memtable, records the state, and then resumes. This guarantees that the checkpoint LSN is fully consistent — but it blocks writes for the duration of the flush (potentially seconds).

A fuzzy checkpoint avoids blocking writes:

  1. Record the current memtable's LSN range and the active SSTable list.
  2. Write this snapshot to the manifest.
  3. Continue accepting writes — the memtable keeps growing.

The tradeoff: recovery after a fuzzy checkpoint must replay WAL records from the memtable's start LSN (not the checkpoint LSN), because the memtable was not flushed at checkpoint time. Fuzzy checkpoints are faster (no flush) but result in slightly longer recovery (more WAL records to replay).

In the LSM architecture, fuzzy checkpoints are natural: the memtable flush is a form of checkpointing. Every time a memtable is flushed to an SSTable, the flushed LSN advances, and older WAL records become eligible for truncation. Explicit checkpoints are only needed if flush intervals are very long.

WAL Truncation

After a checkpoint (or flush), WAL records below the minimum recovery LSN can be deleted. Two strategies:

Segment-based: The WAL is split into fixed-size segments (e.g., 64MB files). A segment can be deleted when all its records have LSN ≤ the minimum recovery LSN. Simple and efficient — the filesystem handles cleanup.

Single-file with logical truncation: The WAL is one file. A "truncation point" is maintained in the manifest. On recovery, records before this point are skipped. The file is physically truncated (or rewritten) during periodic maintenance.

Segment-based is the standard approach (used by RocksDB, Kafka, PostgreSQL's WAL segments). It avoids the complexity of in-place truncation and enables simple space reclamation.

Recovery Time Analysis

Recovery time = time to read manifest + time to replay WAL records.

Manifest replay is fast (typically <100 entries). WAL replay dominates: each record requires deserialization and a memtable insert. At ~1µs per memtable insert and 100,000 records to replay, recovery takes ~100ms for the replay phase.

Checkpointing bounds this: with checkpoints every 60 seconds and 3,000 writes/sec, the maximum WAL replay is ~180,000 records = ~180ms. Well within the OOR's 30-second recovery SLA.

Coordinating Checkpoints with Compaction

Checkpoints and compaction both modify the manifest. To prevent conflicts:

  1. Acquire a manifest lock before writing a checkpoint or compaction result.
  2. Write the manifest entry.
  3. Release the lock.

The lock is held briefly (one file write + fsync), so contention is low. The manifest itself is append-only, so there are no conflicting edits — only the ordering of entries matters.


Code Examples

Checkpoint Implementation

impl LsmEngine {
    /// Write a fuzzy checkpoint to the manifest.
    /// This records the current LSM state without blocking writes.
    fn checkpoint(&mut self) -> io::Result<()> {
        // Snapshot the current state
        let active_ssts: Vec<SstMeta> = self.sst_state.all_sstables().cloned().collect();
        let memtable_min_lsn = self.active_memtable_min_lsn();
        let checkpoint_lsn = self.next_lsn.saturating_sub(1); // last assigned LSN

        // Write checkpoint to manifest
        self.manifest.write_checkpoint(CheckpointEntry {
            checkpoint_lsn,
            memtable_min_lsn,
            active_sstables: active_ssts.iter().map(|s| s.id).collect(),
        })?;
        self.manifest.sync()?;

        eprintln!(
            "Checkpoint at LSN {}: {} active SSTables, \
             WAL replay starts at LSN {}",
            checkpoint_lsn, active_ssts.len(), memtable_min_lsn,
        );

        // Truncate WAL segments that are fully below memtable_min_lsn
        self.wal.truncate_before(memtable_min_lsn)?;

        Ok(())
    }

    fn active_memtable_min_lsn(&self) -> u64 {
        // The earliest LSN in the active memtable is the minimum
        // recovery point — WAL records before this are redundant.
        // If the memtable is empty, use the flushed LSN.
        self.active_memtable
            .min_lsn()
            .unwrap_or(self.sst_state.flushed_lsn())
    }
}

The checkpoint does not flush the memtable — it records where the memtable starts (min LSN) so recovery knows where to begin WAL replay. This is the "fuzzy" part: writes continue during and after the checkpoint, but the WAL truncation point is safely advanced.

WAL Segment Manager

/// Manages WAL as a series of fixed-size segments for clean truncation.
struct WalSegmentManager {
    dir: String,
    segment_size: usize,
    active_segment: WalWriter,
    active_segment_id: u64,
}

impl WalSegmentManager {
    /// Truncate (delete) all WAL segments whose max LSN is below the given LSN.
    fn truncate_before(&mut self, min_recovery_lsn: u64) -> io::Result<()> {
        let entries = std::fs::read_dir(&self.dir)?;
        for entry in entries {
            let entry = entry?;
            let name = entry.file_name().to_string_lossy().to_string();
            if name.starts_with("wal_") && name.ends_with(".log") {
                // Parse segment ID from filename: wal_000042.log → 42
                let seg_id = name[4..10].parse::<u64>().unwrap_or(u64::MAX);
                // Each segment covers a known LSN range.
                // Conservative: only delete if segment_max_lsn < min_recovery_lsn
                if self.segment_max_lsn(seg_id) < min_recovery_lsn {
                    std::fs::remove_file(entry.path())?;
                    eprintln!("WAL: deleted segment {}", name);
                }
            }
        }
        Ok(())
    }

    fn segment_max_lsn(&self, segment_id: u64) -> u64 {
        // In practice, track this in memory or in the segment header.
        // Simplified stand-in: assume a fixed average record size of
        // ~103 bytes, so each segment holds a known number of records.
        (segment_id + 1) * (self.segment_size as u64 / 103)
    }
}

Segment-based truncation is simple: delete files whose records are all below the recovery starting point. No in-place file modification, no complex bookkeeping. The filesystem handles space reclamation.


Key Takeaways

  • Checkpoints bound WAL size and recovery time by recording a recovery starting point. Without checkpoints, the WAL grows indefinitely and recovery scans the entire log.
  • Fuzzy checkpoints avoid blocking writes — they snapshot the LSM state without flushing the memtable. The tradeoff is slightly longer recovery (WAL replay from the memtable's start LSN, not the checkpoint LSN).
  • In an LSM engine, every memtable flush is implicitly a checkpoint — it advances the flushed LSN and makes earlier WAL records eligible for truncation.
  • WAL segments (fixed-size log files) enable clean truncation by deleting entire segment files. This is simpler and more efficient than truncating a single growing file.
  • Recovery time for the OOR: ~180ms worst case with 60-second checkpoint intervals and 3,000 writes/sec. Well within the 30-second conjunction avoidance SLA.

Project — Durable TLE Update Pipeline

Module: Database Internals — M04: Write-Ahead Logging & Recovery
Track: Orbital Object Registry
Estimated effort: 6–8 hours



SDA Incident Report — OOR-2026-0045

Classification: ENGINEERING DIRECTIVE
Subject: Add WAL-based durability to the LSM storage engine

Ref: OOR-2026-0045 (data loss incident after PDU failure)

Integrate a write-ahead log into the Module 3 LSM engine. Every mutation must be logged before it enters the memtable. The engine must recover to a consistent state after simulated crashes.


Acceptance Criteria

  1. WAL write path. Every put and delete call appends a checksummed record to the WAL before modifying the memtable. Verify by inspecting the WAL file after 1,000 inserts.

  2. Clean recovery. Insert 5,000 records, gracefully shut down, then recover. All 5,000 records must be accessible after recovery.

  3. Crash recovery. Insert 5,000 records. Simulate a crash by calling std::process::abort() (or simply skipping the shutdown routine). Restart and recover. Records up to the last fsync'd batch must be accessible. Report how many records were recovered vs. the expected count.

  4. Crash during flush. Insert records until a memtable flush is triggered. Simulate a crash mid-flush (after writing the SSTable but before updating the manifest). Recover and verify all data is intact — the orphaned SSTable is ignored, and the WAL is replayed to reconstruct the memtable.

  5. WAL truncation. After recovery, trigger a flush and checkpoint. Verify the WAL is truncated — old segments are deleted, and the remaining WAL contains only records above the flushed LSN.

  6. Recovery time. Measure recovery time for WAL sizes of 10,000, 50,000, and 100,000 records. Report the time for each. Target: recovery < 500ms for 100,000 records.

  7. Manifest correctness. After multiple flush/compaction/checkpoint cycles, recover the engine and verify the manifest correctly reports the active SSTable set and flushed LSN.


Starter Structure

durable-pipeline/
├── Cargo.toml
├── src/
│   ├── main.rs            # Entry point: runs acceptance criteria
│   ├── wal.rs             # WalWriter, WalReader, WalSegmentManager
│   ├── manifest.rs        # ManifestWriter, ManifestReader, checkpoint
│   ├── lsm.rs             # LsmEngine with WAL integration and recovery
│   ├── memtable.rs        # Reuse from Module 3
│   ├── sstable.rs         # Reuse from Module 3
│   ├── bloom.rs           # Reuse from Module 3
│   └── compaction.rs      # Reuse from Module 3

Hints

Hint 1 — Simulating a crash

The simplest crash simulation: after writing N records, drop the LsmEngine without calling any shutdown method, then construct a new LsmEngine::recover(). Alternatively, write to a temporary directory, copy/rename files to simulate partial state, and then recover from the copy.
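
A sketch of the drop-without-shutdown approach; the directory path and the byte-slice put/get signatures are assumptions:

fn crash_recovery_check() -> io::Result<()> {
    let dir = "/tmp/oor-durable-test";
    {
        // (Assume recover() treats a missing manifest as empty state.)
        let mut engine = LsmEngine::recover(dir)?;
        for i in 0..5_000u32 {
            engine.put(format!("NORAD-{:05}", i).as_bytes(), b"tle-bytes")?;
        }
    } // dropped without any shutdown call: the simulated crash

    let engine = LsmEngine::recover(dir)?;
    let recovered = (0..5_000u32)
        .filter(|i| {
            engine
                .get(format!("NORAD-{:05}", i).as_bytes())
                .ok()
                .flatten()
                .is_some()
        })
        .count();
    eprintln!("recovered {} / 5000 records", recovered);
    Ok(())
}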

Hint 2 — Manifest format

Keep the manifest simple: a sequence of newline-delimited JSON records. Each record is either {"type": "add_sst", "id": 42, "level": 1, "flushed_lsn": 5000} or {"type": "remove_sst", "ids": [31, 32, 33]} or {"type": "checkpoint", "lsn": 10000, "active_ssts": [42, 43]}. Parse with serde_json or manual string parsing.

Hint 3 — Crash-during-flush simulation

Write the SSTable file, then abort before writing the manifest entry. On recovery, the manifest doesn't list the SSTable. Scan the data directory for SSTable files not in the manifest and delete them (orphan cleanup). Replay the WAL to reconstruct the memtable.


What Comes Next

Module 5 (Transactions & Isolation) adds MVCC support — concurrent readers see consistent snapshots while writers continue modifying the database. The WAL and manifest from this module provide the durability foundation that MVCC transactions depend on.

Module 05 — Transactions & Isolation

Track: Database Internals — Orbital Object Registry
Position: Module 5 of 6
Source material: Database Internals — Alex Petrov, Chapters 12–13; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7; Mini-LSM Week 3
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

SDA INCIDENT REPORT — OOR-2026-0046
Classification: DATA ANOMALY
Subject: Conjunction query returned stale TLE data during concurrent catalog update

At 14:22 UTC, a conjunction assessment for NORAD 43013 used TLE epoch 2026-084.2 while a concurrent bulk update was writing epoch 2026-084.7 for the same object. The assessment computed a miss distance of 2.3km using the stale epoch. The updated epoch would have yielded 0.8km — below the avoidance maneuver threshold. The conjunction alert was delayed by 4 minutes until the next assessment cycle picked up the updated TLE.

Root cause: The LSM engine provides no isolation between concurrent readers and writers. A long-running conjunction query can read a mix of old and new TLE versions, producing inconsistent results.

Directive: Implement multi-version concurrency control (MVCC) with snapshot isolation. Every conjunction query must see a consistent snapshot of the catalog — either entirely before or entirely after any concurrent update.


Learning Outcomes

After completing this module, you will be able to:

  1. Explain the ACID properties and which guarantees are provided by each isolation level
  2. Implement two-phase locking (2PL) and explain why it prevents all anomalies but limits concurrency
  3. Implement MVCC snapshot isolation in an LSM engine using timestamped keys
  4. Explain write skew and why snapshot isolation does not prevent it
  5. Design a garbage collection strategy for old MVCC versions
  6. Reason about the tradeoff between isolation level and concurrent throughput

Lesson Summary

Lesson 1 — ACID Properties and Isolation Levels

What Atomicity, Consistency, Isolation, and Durability mean concretely. The isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) and which anomalies each prevents.

Lesson 2 — Two-Phase Locking (2PL)

Lock-based concurrency control. Shared and exclusive locks, the growing and shrinking phases, strict 2PL, and deadlock detection. Why 2PL is correct but limits throughput.

Lesson 3 — MVCC and Snapshot Isolation

Multi-version concurrency control: storing multiple versions of each key with timestamps. Snapshot reads, write conflicts, and garbage collection. Adapted for the LSM architecture using timestamped keys (following Mini-LSM Week 3).


Capstone Project — Conjunction Query Engine with MVCC Snapshot Reads

Add MVCC snapshot isolation to the LSM engine. Concurrent conjunction queries see consistent catalog snapshots. Concurrent writers do not block readers. Full brief in project-conjunction-engine.md.


File Index

module-05-transactions-isolation/
├── README.md
├── lesson-01-acid-isolation.md
├── lesson-01-quiz.toml
├── lesson-02-two-phase-locking.md
├── lesson-02-quiz.toml
├── lesson-03-mvcc-snapshots.md
├── lesson-03-quiz.toml
└── project-conjunction-engine.md

Prerequisites

  • Module 4 (WAL & Recovery) completed

What Comes Next

Module 6 (Query Processing) adds structured query execution on top of the transactional storage engine — the volcano iterator model, vectorized execution, and join algorithms.

Lesson 1 — ACID Properties and Isolation Levels

Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 12; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7

Source note: This lesson was synthesized from training knowledge. Verify Kleppmann's isolation level taxonomy and anomaly definitions against Chapter 7.



Context

The OOR's LSM engine from Modules 3–4 provides durable, crash-recoverable storage for TLE records. But it offers no guarantees about what happens when multiple operations execute concurrently. A conjunction query reading NORAD 43013's TLE while a bulk update is overwriting it can see a partially-updated record — or a mix of old and new versions across different objects. The result is a conjunction assessment computed against a catalog state that never actually existed.

Transactions are the abstraction that prevents this. A transaction groups multiple operations into a single atomic, isolated unit. The ACID properties define what "correct" means for transactions, and the isolation level determines how strictly concurrent transactions are separated.


Core Concepts

ACID Properties

Atomicity: All operations in a transaction succeed or all fail. If a bulk TLE update covers 500 objects and fails on object 347, the first 346 updates are rolled back. The catalog is never left in a partially-updated state.

Consistency: The database moves from one valid state to another. Application-level invariants (e.g., every NORAD ID is unique, every TLE has a valid epoch) are preserved across transactions. Consistency is primarily the application's responsibility — the database enforces it through constraints.

Isolation: Concurrent transactions appear to execute serially. A conjunction query running alongside a bulk update sees either the entirely pre-update or entirely post-update catalog — never a mix. The isolation level determines how strictly this is enforced.

Durability: Once a transaction commits, its effects survive crashes. This is the WAL's job (Module 4).

Isolation Levels and Anomalies

Each isolation level prevents a specific set of anomalies — situations where concurrent execution produces results that no serial execution could produce.

Read Uncommitted: No isolation. A transaction can read another transaction's uncommitted writes. Vulnerable to dirty reads (reading data that may be rolled back).

Read Committed: A transaction only sees committed data. Prevents dirty reads. Still vulnerable to non-repeatable reads (reading the same key twice and getting different values because another transaction committed between the two reads).

Repeatable Read / Snapshot Isolation: A transaction sees a consistent snapshot taken at transaction start. Prevents dirty reads and non-repeatable reads. Still vulnerable to write skew (two transactions read overlapping data, make disjoint writes, and produce a state that neither would have produced alone).

Serializable: Full isolation — concurrent transactions produce results equivalent to some serial ordering. Prevents all anomalies including write skew. Most expensive to enforce.

Level                 Dirty Read   Non-Repeatable Read   Phantom Read   Write Skew
Read Uncommitted      ✗            ✗                     ✗              ✗
Read Committed        ✓            ✗                     ✗              ✗
Repeatable Read       ✓            ✓                     ✗/✓            ✗
Snapshot Isolation    ✓            ✓                     ✓              ✗
Serializable          ✓            ✓                     ✓              ✓

✓ = prevented, ✗ = possible

For the OOR, snapshot isolation is the practical target: conjunction queries need a consistent view of the catalog (preventing dirty reads, non-repeatable reads, and phantoms), but full serializability's overhead is unnecessary for a read-dominated workload.

Write Skew: The Anomaly Snapshot Isolation Misses

Two conjunction analysts each read that the other is on duty. Both decide to go off-duty simultaneously, leaving no one on watch. Each transaction's writes are consistent with its own read snapshot, but the combined result violates the invariant "at least one analyst on duty."

In the OOR context: two concurrent TLE update transactions each read that a different ground station is providing TLE data for NORAD 25544. Each decides to delete the other station's TLE (deduplication). Result: both TLEs are deleted, and the object has no TLE data. Each transaction saw a valid state, but the combined result is invalid.

Snapshot isolation does not prevent this because neither transaction writes a key that the other reads — they write disjoint keys. The conflict is at the application invariant level, not the data access level. Preventing write skew requires serializable isolation (2PL or serializable snapshot isolation, SSI).


Code Examples

Transaction Interface for the OOR

/// A transaction handle that provides snapshot isolation.
struct Transaction {
    /// Snapshot timestamp — all reads see data as of this moment.
    read_ts: u64,
    /// Commit timestamp — assigned at commit time.
    write_ts: Option<u64>,
    /// Buffered writes — applied to the engine only on commit.
    write_set: Vec<(Vec<u8>, Option<Vec<u8>>)>,
}

impl Transaction {
    fn begin(engine: &LsmEngine) -> Self {
        Self {
            read_ts: engine.current_timestamp(),
            write_ts: None,
            write_set: Vec::new(),
        }
    }

    /// Read a key as of this transaction's snapshot timestamp.
    fn get(&self, key: &[u8], engine: &LsmEngine) -> io::Result<Option<Vec<u8>>> {
        // Read the version of the key that was committed at or before read_ts
        engine.get_at_timestamp(key, self.read_ts)
    }

    /// Buffer a write (applied on commit).
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.write_set.push((key, Some(value)));
    }

    fn delete(&mut self, key: Vec<u8>) {
        self.write_set.push((key, None));
    }

    /// Commit the transaction: apply all buffered writes atomically.
    /// (Simplified: a production engine logs the batch as a single WAL
    /// unit and checks for write-write conflicts first; see Lesson 3.)
    fn commit(mut self, engine: &mut LsmEngine) -> io::Result<()> {
        let write_ts = engine.next_timestamp();
        self.write_ts = Some(write_ts);

        // Apply all writes with the commit timestamp
        for (key, value) in self.write_set {
            match value {
                Some(val) => engine.put_with_ts(&key, &val, write_ts)?,
                None => engine.delete_with_ts(&key, write_ts)?,
            }
        }
        Ok(())
    }
}

The key insight: reads use the read_ts (taken at transaction start), so they always see a consistent snapshot. Writes are buffered and applied atomically with a write_ts (taken at commit time). Other transactions that started before write_ts will not see these writes — they read at their own read_ts.
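
Usage, assuming the engine methods referenced above exist; a reader that begins before a concurrent update keeps seeing its own snapshot:

fn snapshot_read_example(engine: &mut LsmEngine) -> io::Result<()> {
    // The conjunction query takes its snapshot first.
    let query = Transaction::begin(engine);

    // A concurrent bulk update commits a newer TLE for the same object.
    let mut update = Transaction::begin(engine);
    update.put(b"NORAD-43013".to_vec(), b"epoch 2026-084.7".to_vec());
    update.commit(engine)?;

    // The query still reads at its own read_ts: it sees the pre-update
    // version (or None if the key did not exist), never the new epoch.
    let tle = query.get(b"NORAD-43013", engine)?;
    eprintln!("query sees: {:?}", tle);
    Ok(())
}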


Key Takeaways

  • ACID properties define transaction correctness. Atomicity (all-or-nothing), Isolation (concurrent transactions don't interfere), and Durability (committed data survives crashes) are the storage engine's responsibility. Consistency is primarily the application's.
  • Snapshot isolation gives each transaction a consistent view of the database as of its start time. This prevents dirty reads, non-repeatable reads, and phantom reads — sufficient for the OOR's conjunction query workload.
  • Write skew is the anomaly that snapshot isolation misses. It occurs when two transactions read overlapping data and write disjoint keys, producing a combined result that neither would have produced alone.
  • The transaction interface separates read path (snapshot at read_ts) from write path (buffered, applied at write_ts). This is the foundation for MVCC (Lesson 3).

Lesson 2 — Two-Phase Locking (2PL)

Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 12; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7

Source note: This lesson was synthesized from training knowledge. Verify Petrov's 2PL description and deadlock detection algorithms against Chapter 12.



Context

Before MVCC became dominant, lock-based concurrency control was the standard approach to transaction isolation. Two-Phase Locking (2PL) is the classical protocol: transactions acquire locks before accessing data, and release them only after completing all operations. The "two phases" are the growing phase (acquiring locks) and the shrinking phase (releasing locks). A transaction never acquires a new lock after releasing any lock.

2PL provides full serializability — the strongest isolation level. But it comes at a cost: writers block readers, readers block writers, and concurrent throughput drops significantly under contention. For the OOR, where conjunction queries must not block TLE ingestion, 2PL's blocking behavior is problematic. Understanding 2PL is essential context for appreciating why MVCC (Lesson 3) is the preferred approach for read-heavy workloads.


Core Concepts

Lock Types

Shared lock (S): Allows the holder to read the locked resource. Multiple transactions can hold shared locks on the same resource simultaneously. Shared locks prevent writes but allow concurrent reads.

Exclusive lock (X): Allows the holder to read and write the locked resource. Only one transaction can hold an exclusive lock at a time. Exclusive locks block both reads and writes from other transactions.

Compatibility matrix:

                  S held     X held
S requested       ✓ grant    ✗ wait
X requested       ✗ wait     ✗ wait

The Two Phases

Growing phase: The transaction acquires locks as needed (shared for reads, exclusive for writes). It never releases any lock during this phase.

Shrinking phase: After the transaction decides to commit (or abort), it releases all locks. Once any lock is released, no new locks can be acquired.

Strict 2PL is the common variant: all locks are held until the transaction commits or aborts. No locks are released during the shrinking phase — they are all released at once at commit time. This prevents cascading aborts (where one transaction's abort forces other transactions that read its uncommitted data to also abort).

Deadlocks

Two transactions can deadlock if each holds a lock the other needs:

  • Transaction A holds an exclusive lock on NORAD 25544 and requests a shared lock on NORAD 43013.
  • Transaction B holds an exclusive lock on NORAD 43013 and requests a shared lock on NORAD 25544.
  • Neither can proceed. Both are stuck waiting.

Detection strategies:

  • Timeout: If a lock wait exceeds a threshold, abort the transaction and retry. Simple but imprecise — the timeout may be too long (wasted time) or too short (false positives).
  • Wait-for graph: Maintain a directed graph of which transactions are waiting for which. A cycle in the graph indicates a deadlock. Abort one transaction in the cycle (typically the youngest or the one with the least work done). A minimal cycle check is sketched below.
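
A sketch of that cycle check; the graph maps each waiting transaction to the transactions it waits on:

use std::collections::{HashMap, HashSet};

/// Edge A -> B means transaction A is waiting for a lock that B holds.
/// Any cycle in this graph is a deadlock.
fn has_deadlock(waits_for: &HashMap<u64, Vec<u64>>) -> bool {
    fn visit(
        node: u64,
        graph: &HashMap<u64, Vec<u64>>,
        visiting: &mut HashSet<u64>,
        done: &mut HashSet<u64>,
    ) -> bool {
        if done.contains(&node) {
            return false; // already fully explored, no cycle through here
        }
        if !visiting.insert(node) {
            return true; // back edge: reached a node on the current path
        }
        for &next in graph.get(&node).map(|v| v.as_slice()).unwrap_or(&[]) {
            if visit(next, graph, visiting, done) {
                return true;
            }
        }
        visiting.remove(&node);
        done.insert(node);
        false
    }

    let mut visiting = HashSet::new();
    let mut done = HashSet::new();
    waits_for.keys().any(|&n| visit(n, waits_for, &mut visiting, &mut done))
}

On detection, the lock manager picks a victim in the cycle (typically the youngest transaction), aborts it, and releases its locks so the others can proceed.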

2PL Performance Characteristics

Under low contention (few transactions accessing the same keys), 2PL performs well — most lock requests are granted immediately. Under high contention (many transactions accessing overlapping keys), performance degrades:

  • Writers block readers: a bulk TLE update holding exclusive locks on 500 objects blocks all conjunction queries that need any of those objects.
  • Lock overhead: acquiring and releasing locks, checking the wait-for graph, and managing the lock table all consume CPU.
  • Deadlock aborts: wasted work when a deadlock victim is rolled back and retried.

For the OOR's workload — frequent long-running read transactions (conjunction queries) alongside burst write transactions (TLE ingestion) — 2PL would cause conjunction queries to stall during every ingestion burst.


Code Examples

A Simple Lock Manager

use std::collections::HashMap;
use std::sync::{Mutex, Condvar};

#[derive(Debug, Clone, Copy, PartialEq)]
enum LockMode { Shared, Exclusive }

struct LockEntry {
    mode: LockMode,
    holders: Vec<u64>,    // Transaction IDs holding this lock
    wait_queue: Vec<(u64, LockMode)>, // Transactions waiting for this lock
}

struct LockManager {
    locks: Mutex<HashMap<Vec<u8>, LockEntry>>,
    cond: Condvar,
}

impl LockManager {
    fn acquire(&self, txn_id: u64, key: &[u8], mode: LockMode) -> bool {
        let mut locks = self.locks.lock().unwrap();
        loop {
            let entry = locks.entry(key.to_vec()).or_insert_with(|| LockEntry {
                mode: LockMode::Shared,
                holders: Vec::new(),
                wait_queue: Vec::new(),
            });

            // If release() granted us the lock while we were waiting,
            // we already hold it.
            if entry.holders.contains(&txn_id) {
                return true;
            }

            let can_grant = match (mode, entry.holders.is_empty()) {
                (_, true) => true, // No holders — any mode is fine
                (LockMode::Shared, false) => entry.mode == LockMode::Shared,
                (LockMode::Exclusive, false) => false,
            };

            if can_grant {
                entry.mode = mode;
                entry.holders.push(txn_id);
                entry.wait_queue.retain(|&(id, _)| id != txn_id);
                return true;
            }

            // Cannot grant — queue at most once, then block until notified.
            if !entry.wait_queue.iter().any(|&(id, _)| id == txn_id) {
                entry.wait_queue.push((txn_id, mode));
            }
            // (Simplified — a real implementation also checks for deadlock.)
            locks = self.cond.wait(locks).unwrap();
        }
    }

    fn release(&self, txn_id: u64, key: &[u8]) {
        let mut locks = self.locks.lock().unwrap();
        if let Some(entry) = locks.get_mut(key) {
            entry.holders.retain(|&id| id != txn_id);
            if entry.holders.is_empty() {
                // Grant to the first waiter
                if let Some((waiter_id, waiter_mode)) = entry.wait_queue.first().copied() {
                    entry.holders.push(waiter_id);
                    entry.mode = waiter_mode;
                    entry.wait_queue.remove(0);
                }
            }
            self.cond.notify_all();
        }
    }
}

This simplified lock manager illustrates the core mechanics. Production lock managers use per-key condition variables (not a single global one), hash-based lock tables for O(1) lookup, and wait-for graph tracking for deadlock detection.
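
Strict 2PL on top of this lock manager, sketched; the keys and transaction bookkeeping are illustrative:

/// Growing phase: acquire every lock before releasing any.
/// Shrinking phase (strict 2PL): release everything only at commit.
fn run_strict_2pl_txn(mgr: &LockManager, txn_id: u64) {
    let mut held: Vec<Vec<u8>> = Vec::new();

    mgr.acquire(txn_id, b"NORAD-25544", LockMode::Shared); // read lock
    held.push(b"NORAD-25544".to_vec());

    mgr.acquire(txn_id, b"NORAD-43013", LockMode::Exclusive); // write lock
    held.push(b"NORAD-43013".to_vec());

    // ... reads and writes execute here, under the locks ...

    // Commit point: release all locks at once.
    for key in &held {
        mgr.release(txn_id, key);
    }
}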


Key Takeaways

  • Two-phase locking provides serializable isolation by ensuring transactions acquire all locks before releasing any. Strict 2PL holds all locks until commit.
  • 2PL's blocking behavior is the fundamental problem: writers block readers and readers block writers. For read-heavy workloads like conjunction queries, this creates unacceptable stalls during concurrent writes.
  • Deadlocks are an inherent risk of lock-based concurrency. Detection via wait-for graphs and resolution via transaction abort are standard but add overhead and wasted work.
  • 2PL is still used in some systems (MySQL/InnoDB for certain isolation levels, distributed databases for coordination). Understanding it provides essential context for why MVCC is preferred.

Lesson 3 — MVCC and Snapshot Isolation

Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 13; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7; Mini-LSM Week 3

Source note: This lesson was synthesized from training knowledge and the Mini-LSM Week 3 MVCC chapters. Verify Petrov's MVCC version chain description and Kleppmann's snapshot isolation anomaly analysis.



Context

MVCC solves the fundamental problem of 2PL — writers blocking readers — by keeping multiple versions of each key. Instead of locking a key and making other transactions wait, the engine stores every version alongside a timestamp. Readers pick the version that matches their snapshot timestamp; writers create new versions without disturbing old ones. Readers never block writers. Writers never block readers. The only conflict is writer-writer on the same key.

For the LSM architecture, MVCC is a natural fit. The LSM already stores sorted key-value pairs — extending keys to include a timestamp is a straightforward encoding change. Mini-LSM's Week 3 implements exactly this: the key format changes from user_key to (user_key, timestamp), where newer timestamps sort first. A snapshot read at timestamp T scans for the first version of each key with timestamp ≤ T.


Core Concepts

Timestamped Keys in LSM

The MVCC key format encodes the user key and a commit timestamp into a single sortable byte string:

MVCC key = [user_key_bytes] [timestamp as big-endian u64, inverted]

The timestamp is stored as big-endian and bitwise inverted (XOR with u64::MAX) so that newer timestamps sort before older ones in the LSM's byte-order comparison. This means a scan for key "NORAD-25544" encounters the newest version first — exactly what a snapshot read needs.

Example ordering for key "NORAD-25544":

"NORAD-25544" | ts=110 (inverted: 0xFFFFFFFFFFFFFF91)  ← newest, sorts first
"NORAD-25544" | ts=80  (inverted: 0xFFFFFFFFFFFFFFAF)
"NORAD-25544" | ts=50  (inverted: 0xFFFFFFFFFFFFFFCD)  ← oldest, sorts last

Snapshot Read

A transaction with read_ts = 100 reading key K:

  1. Seek to the first MVCC key with prefix K.
  2. Scan versions from newest to oldest.
  3. Return the first version with commit_ts ≤ read_ts.
  4. If the first matching version is a tombstone, the key is deleted in this snapshot — return None.

This is O(V) where V is the number of versions of the key. In practice V is small (1–5 for most keys) because compaction garbage-collects old versions.

Write Path with Timestamps

When a transaction commits with write_ts = 105:

  1. For each key in the write set, create an MVCC key (user_key, 105).
  2. Write all MVCC key-value pairs to the memtable (via the WAL, as in Module 4).
  3. These versions become visible to any transaction with read_ts ≥ 105.

Write-write conflicts: if two transactions both write the same user key, the later commit must detect the conflict. Simple approach: check if any version of the key with commit_ts > txn.read_ts exists at commit time. If so, another transaction has written this key after our snapshot — abort and retry.
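
A sketch of that commit-time check; newest_version_ts is an assumed helper that returns the most recent commit timestamp for a user key (versions sort newest-first, so it is the first version encountered):

impl LsmEngine {
    /// True if another transaction committed a version of `user_key`
    /// after this transaction's snapshot was taken.
    fn has_write_conflict(&self, user_key: &[u8], read_ts: u64) -> io::Result<bool> {
        // newest_version_ts: hypothetical helper over the MVCC scan path.
        if let Some(newest_ts) = self.newest_version_ts(user_key)? {
            return Ok(newest_ts > read_ts);
        }
        Ok(false) // the key has never been written: no conflict
    }
}

Commit then walks the write set, aborts on the first conflicting key, and otherwise applies the whole batch at a freshly assigned write_ts.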

Watermark and Garbage Collection

Old versions accumulate. If every write creates a new version, the database grows without bound. Garbage collection removes versions that are no longer visible to any active transaction.

The watermark is the minimum read_ts among all active transactions. A version is safe to garbage-collect once a newer version exists with commit_ts at or below the watermark; every reader has read_ts ≥ watermark and will resolve to that newer version or a later one, never the old one.

Active transactions: read_ts = [100, 150, 200]
Watermark = 100

Key "NORAD-25544" versions:
  ts=200  (value_v3)  ← keep (above watermark, no newer version)
  ts=110  (value_v2)  ← keep (above watermark, may be needed by ts=100..149 txns)
  ts=80   (value_v1)  ← keep (ts=100 txn might need it — 80 ≤ 100 and next version is 110)
  ts=30   (value_v0)  ← GARBAGE: ts=80 exists and 30 < watermark, so no txn will ever read v0

Note that v0 (ts=30) would only be needed by a transaction reading at some ts in 30..79. The watermark of 100 guarantees no transaction has read_ts < 100, and any reader at ts ≥ 100 resolves to v1 (ts=80) or newer, so v0 is safe to collect.

Garbage collection happens during compaction: when the merge iterator encounters multiple versions of the same key, it keeps every version above the watermark, plus the newest version at or below the watermark (still needed by transactions reading exactly at the watermark). All older versions are dropped.
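
A sketch of this retention rule applied per user key during the merge, over commit timestamps sorted newest-first (values elided):

/// Keep every version above the watermark, plus the newest version at
/// or below it; drop everything older.
fn retain_versions(ts_newest_first: &[u64], watermark: u64) -> Vec<u64> {
    let mut keep = Vec::new();
    let mut kept_at_or_below = false;
    for &ts in ts_newest_first {
        if ts > watermark {
            keep.push(ts); // still visible to some active snapshot
        } else if !kept_at_or_below {
            keep.push(ts); // visible to readers at exactly the watermark
            kept_at_or_below = true;
        } // older superseded versions below the watermark are garbage
    }
    keep
}

Applied to the example above (watermark 100, versions at ts = 200, 110, 80, 30), this keeps 200, 110, and 80, and drops 30.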

Write Batch Atomicity

MVCC writes must be atomic — all keys in a transaction get the same write_ts, and they all become visible at once. In the LSM engine, this means all keys in a write batch are logged to the WAL as a single unit and inserted into the memtable together. The write_ts is assigned from a global monotonic counter protected by a mutex (as in Mini-LSM's approach).


Code Examples

MVCC Key Encoding

/// Encode a user key and timestamp into an MVCC key.
/// Timestamps are inverted so newer versions sort first.
fn encode_mvcc_key(user_key: &[u8], timestamp: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(user_key.len() + 8);
    key.extend_from_slice(user_key);
    // Invert timestamp: newer (larger) timestamps become smaller bytes,
    // sorting first in ascending byte order.
    key.extend_from_slice(&(!timestamp).to_be_bytes());
    key
}

/// Decode an MVCC key back into user key and timestamp.
fn decode_mvcc_key(mvcc_key: &[u8]) -> (&[u8], u64) {
    let ts_start = mvcc_key.len() - 8;
    let user_key = &mvcc_key[..ts_start];
    let inverted_ts = u64::from_be_bytes(
        mvcc_key[ts_start..].try_into().unwrap()
    );
    (user_key, !inverted_ts)
}

Snapshot Read with MVCC

impl LsmEngine {
    /// Read the value of a key at the given snapshot timestamp.
    fn get_at_timestamp(
        &self,
        user_key: &[u8],
        read_ts: u64,
    ) -> io::Result<Option<Vec<u8>>> {
        // Seek to the newest version of this key
        let seek_key = encode_mvcc_key(user_key, u64::MAX);

        // Create a merge iterator over memtable + SSTables
        let mut iter = self.create_merge_iterator(&seek_key)?;

        while let Some((mvcc_key, value)) = iter.next()? {
            let (key, ts) = decode_mvcc_key(&mvcc_key);

            // Stop if we've moved past this user key
            if key != user_key {
                return Ok(None);
            }

            // Skip versions newer than our snapshot
            if ts > read_ts {
                continue;
            }

            // This is the newest version visible to us
            return match value {
                Some(val) => Ok(Some(val)),
                None => Ok(None), // Tombstone — key is deleted in this snapshot
            };
        }

        Ok(None) // Key not found in any source
    }
}

The seek to (user_key, u64::MAX) positions the iterator at the newest possible version of the key (since u64::MAX is the largest timestamp, and inverted it becomes the smallest byte value). The iterator then scans forward in byte order, which walks versions from newest to oldest, until it finds one with ts ≤ read_ts.


Key Takeaways

  • MVCC stores multiple versions of each key with commit timestamps. Readers select the version matching their snapshot timestamp. Writers create new versions without disturbing old ones.
  • In the LSM architecture, MVCC keys are encoded as (user_key, inverted_timestamp) so that newer versions sort first in byte order. This makes snapshot reads efficient — the first matching version is the correct one.
  • Readers never block writers, and writers never block readers. The only conflict is write-write on the same key, detected at commit time.
  • Garbage collection removes old versions that are below the watermark (minimum active read_ts). This happens during compaction and is essential for bounding space amplification.
  • Write skew remains possible under snapshot isolation. For the OOR, this is an acceptable tradeoff — conjunction queries need consistent snapshots, not full serializability.

Project — Conjunction Query Engine with MVCC Snapshot Reads

Module: Database Internals — M05: Transactions & Isolation
Track: Orbital Object Registry
Estimated effort: 8–10 hours



SDA Incident Report — OOR-2026-0046

Classification: ENGINEERING DIRECTIVE
Subject: Add MVCC snapshot isolation to the OOR storage engine

Ref: OOR-2026-0046 (stale TLE data in conjunction assessment)

Extend the LSM engine with MVCC support. Conjunction queries must see consistent catalog snapshots. Concurrent TLE updates must not block or corrupt reads.


Acceptance Criteria

  1. MVCC key encoding. Encode user keys with inverted big-endian timestamps. Verify that newer versions sort before older versions in byte order.

  2. Snapshot read correctness. Insert key "NORAD-25544" at timestamps 50, 80, and 110. Read at timestamps 60, 90, and 120. Verify each read returns the correct version (ts=50, ts=80, ts=110 respectively).

  3. Tombstone visibility. Insert key "NORAD-99999" at ts=50, delete at ts=80. Read at ts=60 → value. Read at ts=90 → None.

  4. Concurrent reads and writes. Spawn two threads: one performs 10,000 reads at a fixed snapshot, the other performs 1,000 writes with incrementing timestamps. Verify all reads return consistent results (no torn reads, no version mixing). Writers must not block readers.

  5. Write conflict detection. Start two transactions with overlapping read timestamps. Both write the same key. The first to commit succeeds; the second detects the conflict and is aborted.

  6. Garbage collection. Set watermark to 100. Insert versions at ts=30, 70, 90, 120 for a key. Run compaction. Verify that ts=30 and ts=70 are garbage-collected, ts=90 is retained (newest version at or below the watermark), and ts=120 is retained (above the watermark).

  7. Conjunction simulation. Load 10,000 TLE records. Start a conjunction query (snapshot read over 100 objects). While the query is running, update 50 of those objects. Verify the query sees only the pre-update versions.


Starter Structure

conjunction-engine/
├── Cargo.toml
├── src/
│   ├── main.rs          # Entry point
│   ├── mvcc.rs          # MVCC key encoding, Transaction, conflict detection
│   ├── lsm.rs           # Extended with timestamp-aware get/put/scan
│   ├── compaction.rs    # Extended with watermark-aware GC
│   └── (reuse remaining modules from Modules 3–4)

Hints

Hint 1 — Timestamp encoding

Use !timestamp (bitwise NOT) converted to big-endian bytes, appended to the user key. This makes newer timestamps sort first without modifying the LSM's comparator.

Hint 2 — Write conflict detection

At commit time, scan the memtable and SSTables for any version of the key with commit_ts > txn.read_ts. If found, another transaction wrote this key after our snapshot — abort.
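A sketch of that check against a BTreeMap of MVCC-encoded keys (standing in for the memtable and SSTable probes), reusing encode_mvcc_key and decode_mvcc_key from the lesson:

use std::collections::BTreeMap;

/// True if any version of `user_key` committed after our snapshot at `read_ts`.
fn has_conflict(
    versions: &BTreeMap<Vec<u8>, Vec<u8>>,
    user_key: &[u8],
    read_ts: u64,
) -> bool {
    // The newest version of a key sorts first under the inverted encoding,
    // so one probe at (user_key, u64::MAX) finds it.
    let start = encode_mvcc_key(user_key, u64::MAX);
    if let Some((mvcc_key, _)) = versions.range(start..).next() {
        let (key, ts) = decode_mvcc_key(mvcc_key);
        return key == user_key && ts > read_ts;
    }
    false
}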

Hint 3 — Watermark computation

Maintain a BTreeSet<u64> of all active transaction read timestamps. The watermark is the minimum value in the set. When a transaction commits or aborts, remove its read_ts. Use a mutex to protect the set.
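A sketch of that tracker. Note the assumption that read timestamps are unique; if two transactions can share one, replace the set with a counted map (BTreeMap<u64, usize>):

use std::collections::BTreeSet;
use std::sync::Mutex;

struct WatermarkTracker {
    active: Mutex<BTreeSet<u64>>,
}

impl WatermarkTracker {
    fn begin(&self, read_ts: u64) {
        self.active.lock().unwrap().insert(read_ts);
    }
    fn end(&self, read_ts: u64) {
        self.active.lock().unwrap().remove(&read_ts);
    }
    /// Minimum active read_ts. With no active transactions this is None;
    /// the GC policy for that case (e.g., use the latest commit ts) is yours.
    fn watermark(&self) -> Option<u64> {
        self.active.lock().unwrap().iter().next().copied()
    }
}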


What Comes Next

Module 6 (Query Processing) builds structured query execution on top of the MVCC storage engine — scan operators, join algorithms, and the volcano iterator model for composable query plans.

Module 06 — Query Processing

Track: Database Internals — Orbital Object Registry
Position: Module 6 of 6
Source material: Database Internals — Alex Petrov, Chapters 14–15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

SDA INCIDENT REPORT — OOR-2026-0047
Classification: PERFORMANCE DEFICIENCY
Subject: Multi-source TLE merge exceeds conjunction window deadline

The OOR ingests TLE data from 5 independent sources (18th SDS, ESA SST, LeoLabs, ExoAnalytic, Numerica). When multiple sources provide TLEs for the same object, the engine must merge them — selecting the most recent epoch, resolving conflicts, and joining against the master catalog. Currently this is done in application code with ad-hoc nested loops. A full catalog merge of 100,000 objects from 5 sources takes 45 seconds. The conjunction pipeline requires merge results within 10 seconds.

Directive: Implement a structured query processing layer: scan operators, join algorithms, and a composable execution model that can be optimized for the catalog merge workload.


Learning Outcomes

After completing this module, you will be able to:

  1. Implement the volcano (iterator) model for pull-based query execution with composable operators
  2. Explain vectorized execution and why processing column batches outperforms row-at-a-time for analytical queries
  3. Implement nested-loop, hash, and sort-merge join algorithms and determine which is optimal for a given workload
  4. Compose scan, filter, projection, and join operators into a query execution plan
  5. Analyze the I/O and memory costs of different join strategies for the OOR catalog merge workload

Lesson Summary

Lesson 1 — The Volcano (Iterator) Model

Pull-based query execution. Each operator (scan, filter, join) implements next() → Option<Row>. Operators compose like iterator chains. Pipelining and its limitations.

Lesson 2 — Vectorized Execution

Processing batches of rows (column vectors) instead of one row at a time. Cache efficiency, SIMD potential, and why OLAP engines (DuckDB, Velox, DataFusion) use vectorized execution.

Lesson 3 — Join Algorithms

Nested-loop join, hash join, and sort-merge join. Cost models, memory requirements, and when each algorithm is optimal. Application to the OOR multi-source TLE merge.


Capstone Project — Orbital Catalog Merge System

Build a query execution engine that merges TLE data from 5 sources using composable operators. The merge pipeline uses scan, filter, sort, and join operators composed in the volcano model. Full brief in project-catalog-merge.md.


File Index

module-06-query-processing/
├── README.md
├── lesson-01-volcano-model.md
├── lesson-01-quiz.toml
├── lesson-02-vectorized-execution.md
├── lesson-02-quiz.toml
├── lesson-03-join-algorithms.md
├── lesson-03-quiz.toml
└── project-catalog-merge.md

Prerequisites

  • Module 5 (Transactions & Isolation) completed
  • All previous modules in the Database Internals track

Track Complete

This is the final module of the Database Internals track. After completing it, you will have built a storage engine from the ground up: page layout (M1) → B-tree indexing (M2) → LSM write-optimized storage (M3) → crash recovery (M4) → MVCC concurrency (M5) → query processing (M6).

Lesson 1 — The Volcano (Iterator) Model

Module: Database Internals — M06: Query Processing
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 14

Source note: This lesson was synthesized from training knowledge. Verify Petrov's volcano model description and pipelining analysis against Chapter 14.



Context

The B+ tree range scan iterator from Module 2 and the LSM merge iterator from Module 3 are both instances of a general pattern: pull-based iteration. A consumer calls next(), the producer returns the next item or signals completion. The volcano model (also called the iterator model, introduced by Goetz Graefe in 1990) generalizes this into a complete query execution framework.

Every query operator — scan, filter, project, join, sort, aggregate — implements the same interface: open(), next(), close(). Operators compose by nesting: a filter's next() calls its child scan's next() and applies the predicate. A join's next() calls both children's next() methods to find matching rows. The entire query plan is a tree of iterators, and execution proceeds one row at a time from the root.

This model is simple, composable, and universal — it's used by PostgreSQL, MySQL, SQLite, and most traditional database engines.


Core Concepts

The Operator Interface

Every query operator implements:

trait Operator {
    /// Initialize the operator (open files, allocate buffers).
    fn open(&mut self) -> io::Result<()>;

    /// Return the next row, or None if exhausted.
    fn next(&mut self) -> io::Result<Option<Row>>;

    /// Release resources.
    fn close(&mut self) -> io::Result<()>;
}

Row is a tuple of typed values — for the OOR, it's a TLE record or a subset of its fields.

Operator Composition

A query plan is a tree of operators. The root operator's next() pulls data through the entire tree:

   ProjectOperator (select norad_id, epoch, mean_motion)
        │
   FilterOperator (where inclination > 80.0)
        │
   SeqScanOperator (scan TLE table)

Execution: Project.next() calls Filter.next(), which calls SeqScan.next() repeatedly until finding a row with inclination > 80.0, then returns it to Project, which extracts the requested columns.

Pipelining

In the volcano model, rows pipeline through operators: each row flows from leaf to root without being materialized in an intermediate buffer. SeqScan produces row 1, Filter checks it and passes it to Project, which emits it. Then row 2, and so on.

Pipelining is memory-efficient — only one row is in flight at a time (per operator). But some operators break the pipeline:

  • Sort: Must consume all input rows before producing any output (can't emit the smallest row until it has seen all rows).
  • Hash join (build phase): Must consume the entire build side before probing with the probe side.
  • Aggregate: Must consume all input to compute aggregates (e.g., COUNT, AVG).

These are pipeline breakers (also called blocking operators). They require materializing intermediate results, consuming memory proportional to the input size.
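To make the breaker concrete, here is a minimal Sort operator against the same Operator trait and TleRow used in the Code Examples section below. Everything interesting happens in open(), which drains the child completely before next() can emit a single row:

use std::io;

/// Sort operator: a pipeline breaker. open() materializes the entire input.
struct Sort {
    child: Box<dyn Operator>,
    key_fn: Box<dyn Fn(&TleRow) -> u32>,
    buffer: Vec<TleRow>,
    cursor: usize,
}

impl Operator for Sort {
    fn open(&mut self) -> io::Result<()> {
        self.child.open()?;
        while let Some(row) = self.child.next()? {
            self.buffer.push(row); // O(N) memory: the cost of breaking the pipeline
        }
        let key_fn = &self.key_fn;
        self.buffer.sort_by_key(|row| key_fn(row));
        self.cursor = 0;
        Ok(())
    }
    fn next(&mut self) -> io::Result<Option<TleRow>> {
        if self.cursor < self.buffer.len() {
            let row = self.buffer[self.cursor].clone();
            self.cursor += 1;
            Ok(Some(row))
        } else {
            Ok(None)
        }
    }
    fn close(&mut self) -> io::Result<()> { self.child.close() }
}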

Volcano Model Limitations

CPU overhead: Each next() call involves a virtual dispatch (in Rust, a call through a trait object): one function call per operator per row. For 100,000 rows through 5 operators, that's 500,000 function calls. On modern CPUs, this overhead is small for I/O-bound queries but significant for CPU-bound analytical queries.

Cache inefficiency: Processing one row at a time means the CPU repeatedly jumps between different operator code paths, thrashing the instruction cache. Vectorized execution (Lesson 2) addresses this by processing batches.

No SIMD opportunity: Single-row processing cannot exploit SIMD instructions that process multiple values simultaneously.

For the OOR's I/O-bound conjunction queries (dominated by SSTable reads), the volcano model's CPU overhead is acceptable. For the CPU-bound catalog merge (5 sources × 100k objects × comparison logic), vectorized execution provides a significant speedup.


Code Examples

Operator Trait and Basic Operators

use std::io;

/// A row in the OOR catalog: simplified for the query layer.
#[derive(Debug, Clone)]
struct TleRow {
    norad_id: u32,
    epoch: f64,
    inclination: f64,
    mean_motion: f64,
    source: String,
}

/// Pull-based query operator.
trait Operator {
    fn open(&mut self) -> io::Result<()>;
    fn next(&mut self) -> io::Result<Option<TleRow>>;
    fn close(&mut self) -> io::Result<()>;
}

/// Sequential scan over an in-memory vector of TLE records.
struct SeqScan {
    data: Vec<TleRow>,
    cursor: usize,
}

impl Operator for SeqScan {
    fn open(&mut self) -> io::Result<()> {
        self.cursor = 0;
        Ok(())
    }
    fn next(&mut self) -> io::Result<Option<TleRow>> {
        if self.cursor < self.data.len() {
            let row = self.data[self.cursor].clone();
            self.cursor += 1;
            Ok(Some(row))
        } else {
            Ok(None)
        }
    }
    fn close(&mut self) -> io::Result<()> { Ok(()) }
}

/// Filter operator: passes through rows that match a predicate.
struct Filter {
    child: Box<dyn Operator>,
    predicate: Box<dyn Fn(&TleRow) -> bool>,
}

impl Operator for Filter {
    fn open(&mut self) -> io::Result<()> { self.child.open() }
    fn next(&mut self) -> io::Result<Option<TleRow>> {
        loop {
            match self.child.next()? {
                Some(row) if (self.predicate)(&row) => return Ok(Some(row)),
                Some(_) => continue,  // Row doesn't match — pull next
                None => return Ok(None),
            }
        }
    }
    fn close(&mut self) -> io::Result<()> { self.child.close() }
}

/// Projection operator: transforms rows.
struct Projection {
    child: Box<dyn Operator>,
    project_fn: Box<dyn Fn(TleRow) -> TleRow>,
}

impl Operator for Projection {
    fn open(&mut self) -> io::Result<()> { self.child.open() }
    fn next(&mut self) -> io::Result<Option<TleRow>> {
        match self.child.next()? {
            Some(row) => Ok(Some((self.project_fn)(row))),
            None => Ok(None),
        }
    }
    fn close(&mut self) -> io::Result<()> { self.child.close() }
}

Composing a Query Plan

fn build_high_inclination_query(tle_data: Vec<TleRow>) -> Box<dyn Operator> {
    let scan = Box::new(SeqScan { data: tle_data, cursor: 0 });

    let filter = Box::new(Filter {
        child: scan,
        predicate: Box::new(|row: &TleRow| row.inclination > 80.0),
    });

    let project = Box::new(Projection {
        child: filter,
        project_fn: Box::new(|mut row: TleRow| {
            // Strip fields we don't need downstream
            row.source = String::new();
            row
        }),
    });

    project
}

fn execute_query(mut plan: Box<dyn Operator>) -> io::Result<Vec<TleRow>> {
    plan.open()?;
    let mut results = Vec::new();
    while let Some(row) = plan.next()? {
        results.push(row);
    }
    plan.close()?;
    Ok(results)
}

The query plan is built bottom-up (scan → filter → project) and executed top-down (project pulls from filter pulls from scan). This separation between plan construction and execution is what makes the volcano model so composable — you can swap operators, add layers, or optimize the plan without changing the execution engine.


Key Takeaways

  • The volcano model defines a universal operator interface: open(), next(), close(). Every query operator — scan, filter, join, sort — implements this interface and composes with any other operator.
  • Pipelining passes rows through operators one at a time without intermediate materialization. Pipeline breakers (sort, hash build, aggregate) must buffer all input before producing output.
  • The model's simplicity is its strength: it's easy to implement, test, and extend. Its weakness is per-row overhead (function calls, cache misses) that hurts CPU-bound analytical queries.
  • The B+ tree range scan (Module 2) and LSM merge iterator (Module 3) are both volcano-model operators. The query layer from this module composes on top of them.

Lesson 2 — Vectorized Execution

Module: Database Internals — M06: Query Processing
Position: Lesson 2 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6

Source note: This lesson was synthesized from training knowledge. Verify Kleppmann's columnar processing analysis against Chapter 6. Additional references: MonetDB/X100 paper (Boncz et al., 2005), DuckDB architecture documentation.



Context

The volcano model processes one row at a time. For the OOR's catalog merge — 500,000 rows from 5 sources, with comparison logic on 4 floating-point fields — the per-row function call overhead and cache inefficiency dominate CPU time. Vectorized execution addresses this by changing the unit of work from a single row to a batch (or vector) of rows.

Instead of next() → Option<Row>, vectorized operators return next_batch() → Option<ColumnBatch>, where a ColumnBatch contains 256–2048 rows stored in columnar format (one array per column). Operators process entire columns at once — a filter evaluates a predicate on an f64 array of 1024 inclination values in a tight loop, producing a selection bitmap. This tight loop is cache-friendly (one contiguous array), branch-predictor-friendly (same instruction repeated), and SIMD-exploitable (process 4 or 8 values per instruction).


Core Concepts

Row-at-a-Time vs. Batch-at-a-Time

Row-at-a-time (volcano): Each operator call processes 1 row. For N rows through K operators: N × K function calls, each touching a different memory region.

Batch-at-a-time (vectorized): Each operator call processes B rows. For N rows through K operators: (N/B) × K function calls. Each call processes a contiguous array, keeping the CPU cache warm and enabling auto-vectorization (SIMD).

The batch size B is typically 1024 or 2048 — large enough to amortize per-call overhead, small enough to fit in L1/L2 cache.

Columnar Batch Format

A vectorized batch stores data in columnar layout — one array per column:

Row-oriented (volcano):
  Row 0: { norad_id: 25544, epoch: 84.7, inclination: 51.6, ... }
  Row 1: { norad_id: 43013, epoch: 84.2, inclination: 97.4, ... }

Columnar (vectorized):
  norad_id:    [25544, 43013, ...]    ← contiguous u32 array
  epoch:       [84.7,  84.2,  ...]    ← contiguous f64 array
  inclination: [51.6,  97.4,  ...]    ← contiguous f64 array

A filter on inclination > 80.0 processes the inclination array without touching norad_id or epoch — only the relevant column is loaded into cache. The Rust compiler can auto-vectorize the tight comparison loop into SIMD instructions (e.g., _mm256_cmp_pd comparing 4 f64 values per instruction).

Selection Vectors

Instead of copying matching rows to a new batch (expensive), vectorized engines use a selection vector — an array of indices into the batch that identifies which rows passed the filter:

// Filter: inclination > 80.0
let inclinations: &[f64] = &batch.inclination;
let mut selection: Vec<u32> = Vec::new();
for (i, &inc) in inclinations.iter().enumerate() {
    if inc > 80.0 {
        selection.push(i as u32);
    }
}
// selection = [1, 3, 7, ...]  ← indices of matching rows

Downstream operators use the selection vector to skip non-matching rows without copying data. This avoids the memory allocation and copy cost of materializing filtered batches.
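Continuing the snippet above, a downstream aggregation can read through the selection vector directly (field name hypothetical, same columnar shape as the batch):

// Average mean motion over only the rows that passed the filter, no copying.
// (Guard against an empty selection in real code.)
let mean_motions: &[f64] = &batch.mean_motion;
let sum: f64 = selection.iter().map(|&i| mean_motions[i as usize]).sum();
let avg = sum / selection.len() as f64;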

Apache Arrow as the Columnar Format

Apache Arrow defines a standardized in-memory columnar format used by DataFusion, DuckDB (internally similar), Polars, and many data processing engines. Key features:

  • Zero-copy sharing between operators — no serialization/deserialization between pipeline stages.
  • Validity bitmaps for null handling — one bit per value indicating null/non-null.
  • Dictionary encoding for low-cardinality string columns — stores unique values once and references them by index.

For the OOR, using Arrow arrays for TLE column batches enables integration with the broader Rust data ecosystem (the arrow crate).


Code Examples

Vectorized Filter Operator

const BATCH_SIZE: usize = 1024;

/// A batch of TLE records in columnar format.
struct TleBatch {
    norad_ids: Vec<u32>,
    epochs: Vec<f64>,
    inclinations: Vec<f64>,
    mean_motions: Vec<f64>,
    len: usize,
}

/// Vectorized operator interface.
trait VecOperator {
    fn open(&mut self) -> io::Result<()>;
    fn next_batch(&mut self) -> io::Result<Option<TleBatch>>;
    fn close(&mut self) -> io::Result<()>;
}

/// Vectorized filter: evaluates predicate on entire columns at once.
struct VecFilter {
    child: Box<dyn VecOperator>,
    /// Returns a boolean mask: true for rows that pass the filter.
    predicate: Box<dyn Fn(&TleBatch) -> Vec<bool>>,
}

impl VecOperator for VecFilter {
    fn open(&mut self) -> io::Result<()> { self.child.open() }

    fn next_batch(&mut self) -> io::Result<Option<TleBatch>> {
        loop {
            match self.child.next_batch()? {
                None => return Ok(None),
                Some(batch) => {
                    let mask = (self.predicate)(&batch);
                    let filtered = apply_mask(&batch, &mask);
                    if filtered.len > 0 {
                        return Ok(Some(filtered));
                    }
                    // Entire batch filtered out — pull next
                }
            }
        }
    }

    fn close(&mut self) -> io::Result<()> { self.child.close() }
}

/// Apply a boolean mask to a batch, keeping only rows where mask[i] is true.
fn apply_mask(batch: &TleBatch, mask: &[bool]) -> TleBatch {
    let mut out = TleBatch {
        norad_ids: Vec::new(), epochs: Vec::new(),
        inclinations: Vec::new(), mean_motions: Vec::new(), len: 0,
    };
    for (i, &keep) in mask.iter().enumerate() {
        if keep {
            out.norad_ids.push(batch.norad_ids[i]);
            out.epochs.push(batch.epochs[i]);
            out.inclinations.push(batch.inclinations[i]);
            out.mean_motions.push(batch.mean_motions[i]);
            out.len += 1;
        }
    }
    out
}

The predicate function operates on entire columns: |batch| batch.inclinations.iter().map(|&inc| inc > 80.0).collect(). This tight loop over a contiguous f64 array is exactly the pattern the compiler auto-vectorizes into SIMD instructions. A production implementation would use selection vectors instead of copying rows.


Key Takeaways

  • Vectorized execution processes batches of rows (typically 1024) instead of single rows, reducing per-row overhead by 100–1000x for CPU-bound operations.
  • Columnar layout stores each column as a contiguous array, enabling cache-efficient processing and SIMD auto-vectorization. A filter on one column never touches other columns.
  • Selection vectors track which rows pass a filter without copying data. This avoids materialization cost and keeps downstream operators working on the original arrays.
  • Apache Arrow provides a standardized columnar format for zero-copy interop between operators and libraries. The arrow Rust crate is the foundation for DataFusion.
  • Vectorized execution is most impactful for CPU-bound analytical queries (aggregations, joins, comparisons). For I/O-bound point lookups, the volcano model is sufficient.

Lesson 3 — Join Algorithms

Module: Database Internals — M06: Query Processing
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6

Source note: This lesson was synthesized from training knowledge. Verify Petrov's join algorithm cost models and Kleppmann's distributed join discussion against the source chapters.



Context

The OOR's catalog merge problem is fundamentally a join: match TLE records from 5 sources on NORAD catalog ID, then select the best TLE for each object (most recent epoch, highest source priority). In SQL terms: SELECT * FROM source_a JOIN source_b ON a.norad_id = b.norad_id.

The choice of join algorithm determines whether this merge takes 45 seconds (nested-loop) or under 1 second (hash join). This lesson covers the three fundamental join algorithms, their cost models, and when each is optimal.


Core Concepts

Nested-Loop Join

The simplest join: for each row in the outer table, scan the entire inner table for matches.

for each row_a in source_a:          # |A| iterations
    for each row_b in source_b:      # |B| iterations per outer row
        if row_a.norad_id == row_b.norad_id:
            emit (row_a, row_b)

Cost: O(|A| × |B|) comparisons. For two 100k-row sources: 10 billion comparisons. Completely impractical for the OOR catalog merge.

When to use: Only when one side is very small (< 100 rows) or when no better algorithm is available (no index, insufficient memory for a hash table). Also useful for non-equi joins (e.g., a.epoch > b.epoch) where hash join doesn't apply.

Block nested-loop improves on this by loading a block of outer-table rows into memory and scanning the inner table once per block rather than once per row. With blocks of B rows, the number of inner-table scans drops from |A| to |A|/B.
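A sketch of the block variant over in-memory slices (on disk, the outer chunks would be page-sized reads):

/// Block nested-loop join: scan the inner table once per outer block,
/// not once per outer row.
fn block_nested_loop_join(
    outer: &[TleRow],
    inner: &[TleRow],
    block_size: usize,
) -> Vec<(TleRow, TleRow)> {
    let mut results = Vec::new();
    for block in outer.chunks(block_size) {
        // One pass over the inner table serves the whole block.
        for row_b in inner {
            for row_a in block {
                if row_a.norad_id == row_b.norad_id {
                    results.push((row_a.clone(), row_b.clone()));
                }
            }
        }
    }
    results
}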

Hash Join

Build a hash table on the smaller input (the build side), then probe it with the larger input (the probe side).

Build phase: Scan the build side and insert each row into a hash table keyed by the join column.

Probe phase: Scan the probe side. For each row, hash the join column and look up matching rows in the hash table.

Build: hash_table = {}
for each row_b in source_b:
    hash_table[row_b.norad_id].append(row_b)

Probe:
for each row_a in source_a:
    for each row_b in hash_table[row_a.norad_id]:
        emit (row_a, row_b)

Cost: O(|A| + |B|) — one scan of each input. Hash table operations are O(1) amortized.

Memory: The hash table must fit in memory. Size ≈ |build_side| × (key_size + row_size + overhead). For 100k TLE records at ~100 bytes each: ~10MB. Trivially fits in memory.

When to use: Equi-joins (join on equality) where the build side fits in memory. This is the default join algorithm in most query engines for good reason — it's optimal for the vast majority of join workloads.

For the OOR: hash join merges two 100k-row sources in ~200k operations. Five sources require 4 sequential hash joins (or a multi-way hash join), all completing in under 100ms.

Sort-Merge Join

Sort both inputs on the join column, then merge them in a single pass (like the LSM merge iterator from Module 3).

Sort phase: Sort both inputs by the join key. O(|A| log |A| + |B| log |B|).

Merge phase: Advance two cursors through the sorted inputs, matching on the join key. O(|A| + |B|).

Sort source_a by norad_id
Sort source_b by norad_id

cursor_a = 0, cursor_b = 0
while cursor_a < |A| and cursor_b < |B|:
    if a[cursor_a].norad_id == b[cursor_b].norad_id:
        emit (a[cursor_a], b[cursor_b])
        advance both cursors (handling duplicates)
    elif a[cursor_a].norad_id < b[cursor_b].norad_id:
        cursor_a += 1
    else:
        cursor_b += 1

Cost: O(|A| log |A| + |B| log |B|) for the sort phases, O(|A| + |B|) for the merge. Dominated by the sort.

When to use: When inputs are already sorted (e.g., from an index scan or a preceding sort operator), the sort phase is free and the total cost is O(|A| + |B|) — optimal. Also useful when the join result must be sorted (the output is already in join-key order). Handles non-memory-fitting inputs gracefully via external sort.

For the OOR: if TLE sources are pre-sorted by NORAD ID (which they often are, since NORAD IDs are sequential), sort-merge join is optimal — the sort phase costs nothing, and the merge is a single linear pass.

Cost Comparison

Algorithm      Time                    Memory               Pre-sorted Input
Nested-loop    O(A × B)                O(1)                 No benefit
Hash join      O(A + B)                O(min(A, B))         No benefit
Sort-merge     O(A log A + B log B)    O(A + B) for sort    O(A + B) if pre-sorted

Multi-Way Join for 5 Sources

The OOR catalog merge joins 5 sources. Strategies:

Sequential pairwise: Join source 1 with 2, then result with 3, then with 4, then with 5. Four hash joins. Total cost: O(5 × N) where N is the source size. Simple and effective.

Multi-way sort-merge: Sort all 5 sources by NORAD ID, then merge all 5 simultaneously using a priority queue (exactly the merge iterator from Module 3). One pass through all data. Optimal if sources are pre-sorted.

For the OOR, the multi-way sort-merge is the better choice: TLE sources arrive pre-sorted by NORAD ID, and the merge iterator is already implemented.
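A sketch of the k-way merge core, assuming each source slice is pre-sorted by norad_id; wrapping it in the Operator interface is part of the project:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge k pre-sorted sources into one norad_id-ordered stream of references.
fn multiway_merge<'a>(sources: &[&'a [TleRow]]) -> Vec<&'a TleRow> {
    // Min-heap of (key, source index, cursor): smallest norad_id on top.
    let mut heap = BinaryHeap::new();
    for (si, src) in sources.iter().enumerate() {
        if let Some(row) = src.first() {
            heap.push(Reverse((row.norad_id, si, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((_key, si, ci))) = heap.pop() {
        out.push(&sources[si][ci]);
        if ci + 1 < sources[si].len() {
            heap.push(Reverse((sources[si][ci + 1].norad_id, si, ci + 1)));
        }
    }
    out
}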


Code Examples

Hash Join Implementation

use std::collections::HashMap;

/// Hash join: match TLE records from two sources on NORAD ID.
fn hash_join(
    build_side: &[TleRow],   // Smaller source
    probe_side: &[TleRow],   // Larger source
) -> Vec<(TleRow, TleRow)> {
    // Build phase: index the build side by NORAD ID
    let mut hash_table: HashMap<u32, Vec<&TleRow>> = HashMap::new();
    for row in build_side {
        hash_table.entry(row.norad_id).or_default().push(row);
    }

    // Probe phase: look up each probe-side row in the hash table
    let mut results = Vec::new();
    for probe_row in probe_side {
        if let Some(matches) = hash_table.get(&probe_row.norad_id) {
            for &build_row in matches {
                results.push((build_row.clone(), probe_row.clone()));
            }
        }
    }
    results
}

Sort-Merge Join for Pre-Sorted Sources

/// Sort-merge join on pre-sorted inputs. Both inputs must be sorted by norad_id.
fn sort_merge_join(
    left: &[TleRow],
    right: &[TleRow],
) -> Vec<(TleRow, TleRow)> {
    let mut results = Vec::new();
    let mut li = 0;
    let mut ri = 0;

    while li < left.len() && ri < right.len() {
        match left[li].norad_id.cmp(&right[ri].norad_id) {
            std::cmp::Ordering::Equal => {
                // Collect all rows with this key from both sides
                let key = left[li].norad_id;
                let l_start = li;
                while li < left.len() && left[li].norad_id == key { li += 1; }
                let r_start = ri;
                while ri < right.len() && right[ri].norad_id == key { ri += 1; }

                // Cross product of matching rows (for equi-join)
                for l in &left[l_start..li] {
                    for r in &right[r_start..ri] {
                        results.push((l.clone(), r.clone()));
                    }
                }
            }
            std::cmp::Ordering::Less => li += 1,
            std::cmp::Ordering::Greater => ri += 1,
        }
    }
    results
}

The sort-merge join's merge phase is identical to the LSM merge iterator logic. For the OOR's unique NORAD IDs (no duplicates within a source), the cross-product in the equal case always produces exactly one match — the merge is linear.


Key Takeaways

  • Nested-loop join is O(A × B) — only viable for very small inputs. Hash join is O(A + B) with O(min(A,B)) memory. Sort-merge join is O(A + B) if inputs are pre-sorted.
  • Hash join is the default for equi-joins in most query engines. It requires the build side to fit in memory, which is almost always true for the OOR's workload sizes.
  • Sort-merge join is optimal when inputs are pre-sorted (the sort phase is free). The LSM merge iterator from Module 3 is already a sort-merge join — the same algorithm applies here.
  • The OOR catalog merge (5 pre-sorted sources × 100k objects) is best served by a multi-way sort-merge: one linear pass through all sources using a merge iterator with a priority queue.
  • Join algorithm selection is a query optimization decision. The execution engine should support all three algorithms and choose based on input sizes, sort order, and available memory.

Project — Orbital Catalog Merge System

Module: Database Internals — M06: Query Processing
Track: Orbital Object Registry
Estimated effort: 6–8 hours



SDA Incident Report — OOR-2026-0047

Classification: ENGINEERING DIRECTIVE
Subject: Build a structured query engine for multi-source TLE catalog merging

Replace the ad-hoc nested-loop catalog merge with a composable query execution engine. The engine must support scan, filter, sort, and join operators, and merge TLE data from 5 sources within the conjunction pipeline's 10-second deadline.


Acceptance Criteria

  1. Volcano operators. Implement SeqScan, Filter, Projection, and Sort operators using the Operator trait. Compose them into a plan that scans 100,000 TLE records, filters by inclination > 80°, and projects to (norad_id, epoch).

  2. Hash join. Implement a hash join operator. Join two 100k-row sources on NORAD ID. Verify the output contains exactly the matching pairs.

  3. Sort-merge join. Implement a sort-merge join for pre-sorted inputs. Join two 100k-row sources (pre-sorted by NORAD ID). Verify output matches the hash join result.

  4. Multi-way merge. Implement a 5-way merge join using a min-heap (reuse the merge iterator pattern from Module 3). Merge 5 sources of 100k records each, all sorted by NORAD ID. Verify the merged output is sorted and contains all matching records.

  5. Performance target. The 5-way merge of 500,000 total records must complete in under 2 seconds. Print elapsed time. Compare against a naive nested-loop join on a subset (1,000 records per source) and report the speedup.

  6. Conflict resolution. When multiple sources provide TLEs for the same NORAD ID, select the TLE with the most recent epoch. Print the number of conflicts resolved and the winning source for 10 sample objects.

  7. Vectorized filter (bonus). Implement a vectorized filter that processes batches of 1024 rows in columnar format. Compare its throughput to the row-at-a-time volcano filter on 100,000 rows.


Starter Structure

catalog-merge/
├── Cargo.toml
├── src/
│   ├── main.rs          # Entry point
│   ├── operator.rs      # Operator trait, SeqScan, Filter, Projection, Sort
│   ├── hash_join.rs     # HashJoinOperator
│   ├── sort_merge.rs    # SortMergeJoinOperator
│   ├── merge_iter.rs    # MultiWayMerge (reuse from Module 3)
│   ├── vectorized.rs    # VecFilter (bonus)
│   └── tle.rs           # TleRow, test data generation

Test Data Generation

Generate synthetic TLE data for 5 sources:

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

fn generate_source(source_name: &str, num_objects: usize) -> Vec<TleRow> {
    // Deterministic seed per source so runs are reproducible. Any stable
    // hash of the source name works; this one is illustrative.
    let seed = source_name
        .bytes()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(b as u64));
    let mut rng = StdRng::seed_from_u64(seed);
    (0..num_objects).map(|i| TleRow {
        norad_id: i as u32 + 1, // NORAD IDs 1..100000
        epoch: 84.0 + rng.gen::<f64>() * 0.5, // slight epoch variation per source
        inclination: rng.gen::<f64>() * 180.0,
        mean_motion: 14.0 + rng.gen::<f64>() * 2.0,
        source: source_name.to_string(),
    }).collect()
}

Each source provides TLEs for the same 100k objects but with slightly different epochs and measurements. The merge resolves conflicts by picking the most recent epoch per NORAD ID.


Hints

Hint 1 — Hash join operator as a volcano operator

The hash join operator's open() consumes the entire build side (calling build_child.next() until None, building the hash table). Then next() probes one row at a time from the probe side. The build phase is a pipeline breaker; the probe phase pipelines.
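A skeleton of that shape, simplified to a semi-join because the lesson's Operator yields single TleRows (the full project will want a joined-row type or a buffer of pending matches); the names are suggestions, not a required interface:

use std::collections::HashMap;
use std::io;

struct HashJoinOperator {
    build_child: Box<dyn Operator>,
    probe_child: Box<dyn Operator>,
    table: HashMap<u32, Vec<TleRow>>,
}

impl Operator for HashJoinOperator {
    fn open(&mut self) -> io::Result<()> {
        self.build_child.open()?;
        // Pipeline breaker: drain the build side into the hash table.
        while let Some(row) = self.build_child.next()? {
            self.table.entry(row.norad_id).or_default().push(row);
        }
        self.build_child.close()?;
        self.probe_child.open()
    }
    fn next(&mut self) -> io::Result<Option<TleRow>> {
        // Probe phase pipelines: one probe row per call.
        while let Some(row) = self.probe_child.next()? {
            if self.table.contains_key(&row.norad_id) {
                return Ok(Some(row));
            }
        }
        Ok(None)
    }
    fn close(&mut self) -> io::Result<()> { self.probe_child.close() }
}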

Hint 2 — Conflict resolution as a post-merge step

After the 5-way merge produces groups of TLEs with the same NORAD ID, apply a "group-by" operator that collects all rows with the same key and emits the winner (most recent epoch). This is a simple reduce over each group.
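A sketch of that reduce over the merge's sorted output (winner selection by epoch only; add source priority as a tiebreaker if you need it):

/// Collapse a norad_id-sorted stream to one winner per ID: most recent epoch.
fn resolve_conflicts(merged: Vec<TleRow>) -> Vec<TleRow> {
    let mut out: Vec<TleRow> = Vec::new();
    for row in merged {
        match out.last_mut() {
            Some(best) if best.norad_id == row.norad_id => {
                if row.epoch > best.epoch {
                    *best = row; // newer epoch wins
                }
            }
            _ => out.push(row),
        }
    }
    out
}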

Hint 3 — Performance measurement

Use std::time::Instant for timing. Measure the merge end-to-end (including any sort phases). For the nested-loop comparison, use a small subset (1,000 rows per source) to avoid waiting minutes.


Track Complete

Congratulations. You have built a storage engine from the ground up:

Module                              What You Built
M1: Storage Engine Fundamentals     Page layout, buffer pool, slotted pages
M2: B-Tree Index Structures         B+ tree with splits, merges, range scans
M3: LSM Trees & Compaction          Memtable, SSTables, leveled compaction, bloom filters
M4: WAL & Recovery                  Write-ahead log, crash recovery, checkpointing
M5: Transactions & Isolation        MVCC snapshot isolation, write conflict detection, GC
M6: Query Processing                Volcano model, vectorized execution, hash/sort-merge joins

The Orbital Object Registry is now a fully functional, crash-recoverable, transactional storage engine with indexed access and structured query execution. The ESA deadline has been met.

Module 01 — Stream Processing Foundations

Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 1 of 6
Source material: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11; Streaming Data — Andrew Psaltis, Chapters 1–3; Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7; Kafka: The Definitive Guide — Shapira et al., Chapter 1
Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

OPS ALERT — SDA-2026-0118
Classification: PIPELINE STAND-UP
Subject: Heterogeneous sensor ingestion for SDA Fusion Service

Meridian's existing Space Domain Awareness pipeline is a Python script that polls each sensor source on a 30-second cron, batches the results into Parquet, and uploads to S3. End-to-end latency from observation to fused track is currently 4–7 minutes. After last quarter's Cosmos-1408 anti-satellite test, the post-event debris environment requires sub-30-second conjunction detection to maintain constellation safety.

Directive: Stand up the front door of the SDA Fusion Service. Three sensor source types (X-band radar arrays, optical telescopes, inter-satellite link feeds) must be ingested as a continuous stream, normalized into a common observation envelope, and forwarded to downstream fusion stages. No more cron, no more batching at the edge.

This module establishes the conceptual and physical foundations of stream processing — what a stream is, what it isn't, and how production systems are built around the source/sink boundary. Every architectural decision that follows in this track (orchestration, windowing, delivery semantics, observability) assumes you can reason fluently about the model introduced here.

The code you write in this module is the literal first stage of the SDA Fusion Service. It will be extended, not replaced, in every subsequent module.


Learning Outcomes

After completing this module, you will be able to:

  1. Define what makes a system a stream processor rather than a low-latency batch processor, and articulate the operational consequences of the difference
  2. Design a common observation envelope that unifies heterogeneous sensor wire formats without losing source-specific provenance
  3. Implement async source and sink abstractions in Rust using the dataflow model — operators that consume from upstream and produce to downstream
  4. Choose between push, pull, and poll ingestion patterns for a given source based on latency, control, and reliability requirements
  5. Reason about the bounded-vs-unbounded data distinction and its implications for memory, completeness, and correctness

Lesson Summary

Lesson 1 — Streams, Sources, and Sinks

The conceptual model. What distinguishes a stream from a queue, a log, or a continuously polled batch. The source/sink abstraction as the boundary of every streaming pipeline. Bounded vs unbounded data and why the distinction shapes the entire system. The observation envelope pattern for unifying heterogeneous sources.

Key question: If a sensor produces a fixed dataset of 10 million observations from a one-time fragmentation event, is processing that dataset a stream or a batch problem?

Lesson 2 — The Dataflow Model

The streaming abstraction in production systems. Operators as functions over streams: map, filter, fold, merge, partition. Why dataflow composition beats imperative loops for pipeline code. The graph topology — sources at the edges, sinks at the other edges, operators in between. State, statelessness, and where state lives in a streaming pipeline.

Key question: Why does the dataflow model treat the pipeline as a graph that runs continuously rather than as a function that is called with a batch of inputs?

Lesson 3 — Push, Pull, and Poll Semantics

The three patterns by which data enters and traverses a pipeline. Push (the source initiates and forwards), pull (the consumer requests and receives), poll (the consumer requests on a schedule). Why Kafka consumers poll rather than subscribe to a callback. Where each pattern fits in the SDA fusion topology — radar arrays push, optical archives pull, ISL beacons poll. The hidden cost of polling and the hidden risk of pushing.

Key question: The optical telescope archive exposes only an HTTP REST endpoint with no notification mechanism. How should the ingestion service interact with it, and what are the operational consequences?


Capstone Project — SDA Sensor Ingestion Service

Build the front door of the SDA Fusion Service. Three async source tasks (radar, optical, ISL) consume from their respective wire formats, normalize observations into a common Observation envelope, and forward them to a shared sink. The sink writes a structured event log that downstream stages will consume. Acceptance criteria, suggested architecture, and the full project brief are in project-sensor-ingestion.md.

This is the module where the SDA Fusion Service begins to exist. Every subsequent module's project extends what you build here.


File Index

module-01-stream-processing-foundations/
├── README.md                              ← this file
├── lesson-01-streams-sources-sinks.md     ← Streams, sources, sinks
├── lesson-01-quiz.toml                    ← Quiz (5 questions)
├── lesson-02-dataflow-model.md            ← The dataflow model
├── lesson-02-quiz.toml                    ← Quiz (5 questions)
├── lesson-03-push-pull-poll.md            ← Push, pull, and poll semantics
├── lesson-03-quiz.toml                    ← Quiz (5 questions)
└── project-sensor-ingestion.md            ← Capstone project brief

Prerequisites

  • Foundation Track completed (all 6 modules) — async Rust, channels, network programming, and data layout are assumed
  • Familiarity with tokio, tokio::sync::mpsc, serde, and anyhow::Result
  • Working understanding of TCP, UDP, and HTTP — the three transport types you'll deal with in the project

What Comes Next

Module 2 (Pipeline Orchestration Internals) takes the source-to-sink primitives you build here and composes them into a multi-stage DAG with a real task scheduler. The Observation envelope you define in this module is the data structure that flows through every subsequent stage of the pipeline.

Lesson 1 — Streams, Sources, and Sinks

Module: Data Pipelines — M01: Stream Processing Foundations
Position: Lesson 1 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Stream Processing); Streaming Data — Andrew Psaltis, Chapter 1 (Introducing Streaming Data); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Ingestion: Bounded vs Unbounded Data)


Context

Every streaming system answers one question before any other: where does data enter, and where does it leave? The pipeline between those two points can grow arbitrarily complex — windowing, joins, stateful aggregation, exactly-once delivery — but the entry and exit points are non-negotiable. They are the contract between the pipeline and the rest of the world. Get them wrong and no amount of clever processing recovers the system.

The mental shift from batch to streaming is harder than it sounds. A batch job is a function: you give it a finite input, it returns a finite output, it terminates. A streaming pipeline is a process: it runs forever, it consumes a potentially infinite input, it produces a potentially infinite output, and "completion" is a category error. Most production incidents in streaming systems trace back to engineers who built a batch job and called it a stream — fixed-size buffers that fill up, retries that assume idempotency the source doesn't provide, "is the job done yet?" health checks against a process that has no notion of done.

For the SDA Fusion Service, the source side is three heterogeneous feeds. Ground-based X-band radar arrays push detection records over UDP at 100–500 Hz per array. Optical telescopes expose a REST API that returns observation batches when polled. The constellation's inter-satellite link beacons emit position reports on a 1 Hz cadence over a custom binary protocol. The sink side is a single fusion stage that expects one normalized envelope format. This module's job is to define what that envelope is, what a source and sink mean in this system, and how to think about the data crossing the boundary.


Core Concepts

What Makes a Stream a Stream

The defining property of a stream is unboundedness: there is no expectation that the input will end. This is the property that forces every other architectural decision in a streaming system. Memory cannot grow without bound, so state must be either fixed-size or explicitly bounded by time or count. Completeness checks (is the result of a COUNT(*) over the stream ever final?) cannot terminate, so they must be redefined as point-in-time approximations. Failure recovery cannot rely on "rerun the job from the start" because there is no start.

The DDIA framing is precise: a stream is a sequence of events, each event a small, immutable record describing something that happened at a point in time. Events are produced by producers (also called sources, publishers, emitters) and consumed by consumers (also called sinks, subscribers, recipients). The stream itself is not the events — it is the abstraction over the path they take from producer to consumer.

A finite, replayable input is not a stream just because it is processed event-by-event. A 10 GB log file processed line-by-line in a Spark job is a batch job with a streaming-style execution model. The distinction matters because finite inputs admit completion: you can compute exact aggregates, you can verify correctness end-to-end, you can retry the whole computation. Unbounded inputs admit none of these. Bounded data running through streaming infrastructure (sometimes called "stream-with-known-end") is real but rare in practice; treating an unbounded source as if it were bounded is a common and expensive mistake.

Bounded vs Unbounded Data

The bounded/unbounded distinction is the most important typology in data engineering. Reis and Housley make the case in Fundamentals of Data Engineering Ch. 7: the boundary you draw at ingestion shapes every downstream system. If you treat data as bounded, you can use ETL, you can do exact joins, you can cleanly partition by date. If you treat it as unbounded, you must use windowing, you must accept approximation, you must deal with late arrivals.

Sensor data from the SDA Fusion Service is unambiguously unbounded — the radar arrays will emit observations as long as there are objects in the sky and the radar has power. The fusion service has no notion of "the last observation"; the input is a continuous flow that the pipeline drinks from for as long as it operates.

Some sources are bounded but feel unbounded. The optical telescope archive exposes the last 24 hours of observations on demand. From the pipeline's perspective, the source produces observations now, and the archive happens to also expose recent ones. Treating it as a bounded "give me the last 24 hours" source produces a different system than treating it as an unbounded "show me observations as they appear" source — even though the underlying data is the same. The latter requires the pipeline to track what it has already consumed (a watermark or offset) and to poll only for new observations since that point.

The architectural rule: bounded sources can be treated as unbounded (just process them and stop when they end), but unbounded sources cannot be treated as bounded without introducing artificial cutoffs. When in doubt, treat it as unbounded.

Sources and Sinks as a Boundary

A source is the component that produces events into the pipeline. It is the boundary between the pipeline and the upstream world — the radar firmware, the satellite telemetry stream, the upstream Kafka topic, the message queue. The source's responsibility is to deliver events into the pipeline's first stage in a format the pipeline understands.

A sink is the symmetric component at the output: the boundary between the pipeline and the downstream world. The sink writes events to wherever they need to go — another Kafka topic, an object store, a database, a subscriber callback, a downstream service.

The pivotal insight is that source and sink are positions, not types. A Kafka topic is a sink for the pipeline that produces into it and a source for the pipeline that consumes from it. This is why production streaming architectures look like graphs of pipelines connected by durable queues — each queue is the sink of its writer and the source of its readers, and the queue's durability decouples them in time and in failure modes.

Three properties of a source determine how the pipeline must interact with it:

  1. Replayability. Can the pipeline ask the source for events it has already consumed? Kafka can (configurable retention; consumers track offsets); a UDP radar feed cannot (packets arrive once and are gone if missed).
  2. Ordering guarantees. Does the source guarantee a total order, a per-partition order, or no order at all? UDP gives no order. Kafka guarantees per-partition order. ISL beacons typically guarantee per-satellite order but not cross-satellite order.
  3. Delivery guarantees. Does the source guarantee at-least-once delivery, at-most-once, or exactly-once? UDP is at-most-once. TCP-based sources are at-least-once if the application acks correctly. Kafka producers can be configured for exactly-once via the idempotent producer (covered in Module 5).

These three properties propagate. A pipeline cannot offer a stronger delivery guarantee than its weakest source unless it is willing to drop, deduplicate, or buffer to make up the difference.

The Observation Envelope

When the pipeline accepts events from heterogeneous sources, the first stage's job is to normalize them into a single internal format. This is the envelope — a wrapper that preserves source-specific provenance while presenting a uniform interface to downstream stages.

For SDA fusion, every observation, regardless of source, has the same logical content: something was detected at some position at some time. The wire formats differ wildly — radar produces complex IQ samples reduced to range-rate pairs; optical produces angular measurements with timestamps; ISL produces full state vectors — but the downstream correlator does not need to know any of that. It needs position, time, source identifier, and uncertainty.

The envelope pattern:

struct Observation {
    // Identity and provenance
    source_id: SourceId,           // which sensor produced this
    source_kind: SourceKind,       // radar | optical | isl
    sensor_timestamp: SystemTime,  // when the sensor recorded it
    ingest_timestamp: SystemTime,  // when we received it

    // The actual observation payload
    target: ObservationTarget,     // what was observed (position, range-rate, etc.)
    uncertainty: Uncertainty,      // standard deviation or covariance

    // Routing and dedup
    observation_id: Uuid,          // unique per observation
}

The envelope is thin. It carries enough provenance to trace any observation back to its source and enough payload for the next stage to act on, but no more. Production envelopes drift toward fat over time as engineers add convenience fields; resist this. Every field in the envelope is paid for in CPU (deserialization), memory (buffering), and network (when stages run on different hosts).

A common mistake is to make the envelope a sum type that holds the original wire format plus normalized fields. This produces an envelope twice the necessary size and tempts downstream stages to peek at the original wire format, breaking the abstraction. If the original wire format must be preserved (for replay, audit, or forensic analysis), write it to a parallel "raw observations" sink at ingestion time. Don't carry it through the pipeline.


Code Examples

Defining the Observation Envelope

The envelope is the type that flows through every stage of the SDA Fusion Service. The choice of representation here propagates everywhere — a poor envelope makes every downstream lesson harder.

use serde::{Deserialize, Serialize};
use std::time::SystemTime;
use uuid::Uuid;

/// The kind of sensor that produced an observation. This drives source-specific
/// handling at later stages (e.g., uncertainty models differ by sensor type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum SourceKind {
    /// Ground-based X-band radar — high-rate, range-rate measurements
    Radar,
    /// Ground-based optical telescope — angular measurements
    Optical,
    /// Inter-satellite link beacon — full state vector reports
    InterSatelliteLink,
}

/// Stable identifier for the specific sensor that produced this observation.
/// Distinct from SourceKind: there are 14 X-band radars in the network, and we
/// want to know *which* one detected this object.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct SourceId(pub String);

/// What the sensor observed. The variants reflect what each sensor type
/// actually measures — we do not pretend they all measure position vectors.
/// The correlator stage (Module 2) is responsible for fusing these into
/// position estimates.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ObservationTarget {
    /// Range, range-rate, azimuth, elevation from a radar
    RangeRate { range_m: f64, range_rate_m_s: f64, az_rad: f64, el_rad: f64 },
    /// Right ascension and declination from an optical telescope
    Angular { ra_rad: f64, dec_rad: f64 },
    /// ECI-frame state vector from an ISL beacon
    StateVector { position_m: [f64; 3], velocity_m_s: [f64; 3] },
}

/// Per-measurement uncertainty. Production code uses full covariance matrices;
/// this is a starting representation that we will refine in Module 3.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Uncertainty {
    pub sigma: f64,
}

/// The canonical envelope. Every observation in the pipeline has this shape.
/// Downstream stages do not look at wire formats — only at this struct.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Observation {
    pub observation_id: Uuid,
    pub source_id: SourceId,
    pub source_kind: SourceKind,
    pub sensor_timestamp: SystemTime,
    pub ingest_timestamp: SystemTime,
    pub target: ObservationTarget,
    pub uncertainty: Uncertainty,
}

Three things to notice. First, observation_id is a UUID, not a sequence number. UUIDs are generated at the source without coordination — sequence numbers would require a central allocator and would become a bottleneck under load. The tradeoff is that UUIDs are 16 bytes versus 8 for a u64; for SDA's volumes (hundreds of thousands of observations per minute), this is a non-issue. Second, the envelope holds both sensor_timestamp and ingest_timestamp. This distinction (event time vs processing time) becomes load-bearing in Module 3, but the data must be captured at ingestion or it is lost forever. Third, ObservationTarget is a sum type rather than a normalized "always position vector" representation. Forcing premature unification at ingestion discards information that the correlation stage needs. Normalize the envelope; preserve the measurement.

A Source Trait

A source is a thing that produces a stream of observations. In Rust, the natural shape is an async trait that returns successive items:

use anyhow::Result;
use async_trait::async_trait;

/// A source produces observations from some upstream system. Implementations
/// hide the wire-format details behind this trait.
///
/// Cancellation: implementations should be cancel-safe at await points.
/// Dropping the future returned by `next` must not corrupt the source's state.
#[async_trait]
pub trait ObservationSource: Send {
    /// Yield the next observation, or None if the source has terminated.
    /// For genuinely unbounded sources (radar feeds), this never returns None.
    async fn next(&mut self) -> Result<Option<Observation>>;

    /// A stable identifier for logging and metrics. Not the same as
    /// SourceId on the envelope — a single source instance may produce
    /// observations from multiple SourceIds (e.g., one ISL listener
    /// receives beacons from many satellites).
    fn name(&self) -> &str;
}

Two design choices to flag. The trait returns Option<Observation> rather than just Observation because we want to signal graceful termination without using an error. Errors are reserved for actual failures (network drop, deserialization error, source-specific protocol violation). A radar source that never returns None is correct. An optical archive source that returns None when the archive has no new observations and the polling cadence has been satisfied is also correct. Second, the trait is not a Stream. We could have used futures::Stream<Item = Result<Observation>> and gained combinator support, but that buys us less than it costs: the explicit next method makes lifecycle management (logging, retries, source-specific timeouts) easier to compose. Modules 2 and 4 will build the orchestration layer around this trait.
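As a usage sketch, this is roughly how a driver task might pump any ObservationSource into a channel. The channel type and the retry/shutdown policy belong to Module 2's orchestrator, so treat the details here as placeholders:

use tokio::sync::mpsc;

/// Drive one source until it terminates (or forever, for unbounded sources).
async fn drive_source(
    mut source: Box<dyn ObservationSource>,
    tx: mpsc::Sender<Observation>,
) {
    loop {
        match source.next().await {
            Ok(Some(obs)) => {
                if tx.send(obs).await.is_err() {
                    break; // downstream hung up: shut this source down
                }
            }
            Ok(None) => break, // bounded source drained gracefully
            // A production driver applies per-source retry and backoff here.
            Err(e) => eprintln!("source {}: {e:#}", source.name()),
        }
    }
}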

A Minimal UDP Radar Source

This is the actual code that ingests from one of the X-band radar arrays. The radar firmware emits a fixed-size binary frame over UDP at 100–500 Hz; our job is to deserialize it and emit envelopes.

use anyhow::{Context, Result};
use async_trait::async_trait;
use std::time::{SystemTime, UNIX_EPOCH, Duration};
use tokio::net::UdpSocket;
use uuid::Uuid;

/// Wire format emitted by the X-band radar firmware. 64 bytes, packed.
/// Field layout is documented in Meridian-RF-2024-RADAR-WIRE-FORMAT.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct RadarFrame {
    array_id: u32,
    target_track_id: u32,
    timestamp_ns: u64,        // sensor-local clock since UNIX epoch
    range_m: f64,
    range_rate_m_s: f64,
    azimuth_rad: f64,
    elevation_rad: f64,
    sigma_range_m: f32,
    sigma_rate_m_s: f32,
    _reserved: [u8; 8],       // pads the frame to the documented 64 bytes
}
const RADAR_FRAME_SIZE: usize = std::mem::size_of::<RadarFrame>();

pub struct UdpRadarSource {
    socket: UdpSocket,
    name: String,
    buf: Box<[u8; 1500]>,  // standard MTU; larger frames would arrive truncated
}

impl UdpRadarSource {
    pub async fn bind(addr: &str, name: impl Into<String>) -> Result<Self> {
        let socket = UdpSocket::bind(addr)
            .await
            .with_context(|| format!("binding radar source on {addr}"))?;
        Ok(Self {
            socket,
            name: name.into(),
            buf: Box::new([0u8; 1500]),
        })
    }
}

#[async_trait]
impl ObservationSource for UdpRadarSource {
    async fn next(&mut self) -> Result<Option<Observation>> {
        // recv_from is cancel-safe in tokio: a dropped future leaves no
        // partially-consumed datagram. This matters for the orchestrator
        // (Module 2), which cancels source tasks during shutdown.
        let (n, _peer) = self.socket.recv_from(&mut self.buf[..]).await
            .context("recv_from on radar UDP socket")?;

        if n != RADAR_FRAME_SIZE {
            // Truncated or oversized frame — surface an error; the
            // orchestrator logs it and calls next() again. UDP gives no
            // recovery; the radar emits the next frame in ~2-10ms.
            anyhow::bail!("radar frame size {n} != expected {RADAR_FRAME_SIZE}");
        }

        // SAFETY: we just verified the byte count matches the struct size,
        // and RadarFrame is #[repr(C, packed)] of POD types. The radar firmware
        // is documented to emit little-endian on the wire and our hosts are
        // little-endian; if we ever deploy on big-endian hosts we add a swap.
        let frame: RadarFrame = unsafe {
            std::ptr::read_unaligned(self.buf.as_ptr() as *const RadarFrame)
        };

        let sensor_ts = UNIX_EPOCH + Duration::from_nanos(frame.timestamp_ns);
        let array_id = frame.array_id; // copy out of packed struct for formatting

        Ok(Some(Observation {
            observation_id: Uuid::new_v4(),
            source_id: SourceId(format!("radar-{}", array_id)),
            source_kind: SourceKind::Radar,
            sensor_timestamp: sensor_ts,
            ingest_timestamp: SystemTime::now(),
            target: ObservationTarget::RangeRate {
                range_m: frame.range_m,
                range_rate_m_s: frame.range_rate_m_s,
                az_rad: frame.azimuth_rad,
                el_rad: frame.elevation_rad,
            },
            uncertainty: Uncertainty {
                sigma: frame.sigma_range_m as f64,
            },
        }))
    }

    fn name(&self) -> &str { &self.name }
}

A few points worth dwelling on. UDP gives no delivery guarantee — if a radar frame is lost in transit, it's gone. For a sensor producing 100–500 frames per second per array, this is acceptable; the consequence is slightly higher uncertainty in the fused track, not a missed conjunction. If we needed at-least-once delivery here, we would need a different transport (Kafka with a radar-side producer, for instance) and we would lose the simplicity of UDP. The choice of transport is a delivery-guarantee decision, not just a performance decision. We will return to this in Module 5. The unsafe block deserializing the frame is necessary because the wire format is #[repr(C, packed)] and UDP buffers have no alignment guarantee. Production systems use crates like zerocopy or bytemuck to make this safe; we use raw read_unaligned here for clarity. Either way, the cost of the deserialization is single-digit nanoseconds per frame, far below the per-frame budget.
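
For reference, a hedged sketch of the safe variant. This assumes zerocopy's API as of version 0.7 (FromBytes/FromZeroes derives and a read_from method returning Option); method names differ in later versions, so check the version you pin.

use zerocopy::FromBytes;

/// Generic checked read: works for any wire struct that derives FromBytes.
/// read_from returns None unless buf.len() exactly matches size_of::<T>(),
/// which replaces both the length check and the unsafe read_unaligned.
fn parse_frame<T: FromBytes>(buf: &[u8]) -> anyhow::Result<T> {
    T::read_from(buf).ok_or_else(|| {
        anyhow::anyhow!("frame size {} != expected {}", buf.len(), std::mem::size_of::<T>())
    })
}

// At the call site, once RadarFrame derives FromBytes + FromZeroes:
//     let frame: RadarFrame = parse_frame(&self.buf[..n])?;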

A Minimal Sink

A sink consumes observations and writes them somewhere. The simplest possible sink is one that pushes envelopes onto an MPSC channel for the next stage to consume:

use anyhow::Result;
use tokio::sync::mpsc;

/// A sink consumes observations from a source and forwards them onward.
/// The simplest sink is a channel send: the next stage owns the receiver
/// and pulls from it.
pub struct ChannelSink {
    tx: mpsc::Sender<Observation>,
    name: String,
}

impl ChannelSink {
    pub fn new(tx: mpsc::Sender<Observation>, name: impl Into<String>) -> Self {
        Self { tx, name: name.into() }
    }

    /// Forward an observation to the downstream stage.
    /// Returns Err if the receiver has been dropped (downstream is gone).
    pub async fn write(&self, obs: Observation) -> Result<()> {
        // .send().await applies backpressure: if the channel is full,
        // this future does not resolve until there is capacity. The
        // upstream source is forced to wait. This is the right behavior —
        // we will analyze it in depth in Module 4.
        self.tx.send(obs).await
            .map_err(|_| anyhow::anyhow!("downstream receiver dropped"))?;
        Ok(())
    }

    pub fn name(&self) -> &str { &self.name }
}

The single most important property of this sink is that write awaits. When the downstream channel is full (because the next stage is slow), the source's next().await plus the sink's write().await form a chain that propagates backpressure all the way back to the radar UDP socket — at which point we start dropping packets at the kernel level rather than building unbounded memory in the application. This is the foundation of the dataflow model we'll cover in Lesson 2 and the explicit subject of Module 4. A non-awaiting sink (one that internally buffered into an unbounded queue) would silently OOM the process during a fragmentation event surge. The choice between bounded and unbounded internal buffering is a load-bearing architectural decision masquerading as an implementation detail.
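
For contrast, a minimal sketch of the anti-pattern just described. The only difference is the channel constructor: send is no longer async, and with it the backpressure chain is gone.

use tokio::sync::mpsc;

/// Anti-pattern sketch: an unbounded sink. send() returns immediately and
/// always succeeds, so a slow downstream stage shows up as unbounded queue
/// growth in this process instead of a stalled source.
pub struct UnboundedSink {
    tx: mpsc::UnboundedSender<Observation>,
}

impl UnboundedSink {
    pub fn write(&self, obs: Observation) -> anyhow::Result<()> {
        self.tx
            .send(obs) // no .await: nothing upstream ever waits
            .map_err(|_| anyhow::anyhow!("downstream receiver dropped"))
    }
}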


Key Takeaways

  • A stream is defined by unboundedness: the input has no expected end. This single property dictates that state must be bounded, completeness is point-in-time, and recovery cannot rerun from the start. Treating an unbounded source as bounded is a category error that produces predictable failures under load.
  • Sources and sinks are boundary positions, not data types. The Kafka topic that is the sink of one pipeline is the source of the next. This composition is why production streaming architectures look like graphs connected by durable queues.
  • A source is characterized by three properties — replayability, ordering, delivery guarantees — and the pipeline cannot offer stronger guarantees than its weakest source without compensating mechanisms.
  • The observation envelope unifies heterogeneous wire formats behind a single internal type. Capture provenance (source ID, sensor timestamp, ingest timestamp) at the boundary; preserve the original measurement form rather than prematurely normalizing to a single shape.
  • tokio::sync::mpsc::Sender::send().await is the foundation of backpressure. A source that awaits its sink and a sink that awaits its downstream channel form a chain that propagates pressure to the kernel. Internal unbounded buffering breaks this chain and produces silent OOMs.

Lesson 2 — The Dataflow Model

Module: Data Pipelines — M01: Stream Processing Foundations
Position: Lesson 2 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Stream Processing, "Processing Streams" section); Kafka: The Definitive Guide — Shapira et al., Chapter 14 (Stream Processing: Topology, State); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 3 ("The Dataflow Model and Unified Batch and Streaming")


Context

The dataflow model is the conceptual frame that production stream processing systems are organized around. Kafka Streams, Apache Flink, Apache Beam, the Stream combinators in the futures and tokio-stream crates, every Rust pipeline you'll write in this track — they all express their work as a graph of operators that consume from upstream, transform, and produce downstream. Understanding the model gives you a vocabulary to reason about pipelines that scales from single-process Rust binaries to multi-cluster Beam jobs.

The shift from imperative to dataflow is the same shift you made when you moved from for loops to iterator combinators in Rust. An imperative pipeline says "loop over inputs, do step A, do step B, do step C." A dataflow pipeline says "compose operator A with operator B with operator C; the framework decides how to schedule the work, how to parallelize it, where to introduce buffering, when to checkpoint." The shift is more than aesthetic — the dataflow representation is what enables the framework to make decisions an imperative loop can't (parallel execution across operators, automatic backpressure, exactly-once via barrier markers, topology-aware optimization). Reis and Housley argue in Fundamentals of Data Engineering Ch. 3 that the dataflow model is what unifies batch and streaming computation: a batch is a stream that ends, and the same operator graph can run over either.

For the SDA Fusion Service, the operator graph is the architecture document. Sources at the edges (radar, optical, ISL), sinks at the other edges (the catalog store, the alert emitter), and a chain of operators in between (normalize, dedupe, correlate, filter, enrich). When the conjunction-alert latency SLA is at risk, the question is which operator in the graph is the bottleneck. When the pipeline is rebuilt for a new sensor, the question is where in the graph the new branch attaches. Without the graph, every conversation is about implementation; with the graph, the conversation can be about architecture.


Core Concepts

Operators as Functions Over Streams

In the dataflow model, an operator is a function that takes one or more input streams and produces one or more output streams. The function is total — it is defined for every possible input event — but the output is not constrained to be a one-for-one mapping. An operator may emit zero, one, or many output events for each input event, and the relationship between input rate and output rate is part of the operator's specification.

The five canonical operator shapes are:

  • Map. One input event in, one output event out. Stateless. The dominant operator in normalization stages — converting a wire-format frame to an Observation envelope is a map.
  • Filter. One input event in, zero or one output events out. Stateless. Used to drop observations that fail validation, fall outside the area of interest, or duplicate ones already seen.
  • FlatMap. One input event in, zero or more output events out. Stateless. A radar frame containing multiple detected targets becomes one event per target.
  • Fold (Aggregate). One or more input events in, one output event out, with state accumulated across events. Stateful. Computing a running mean of range-rate per object is a fold.
  • Window (Group). A grouping operator that collects events into bounded buckets — by time, by count, by session — and emits one output per bucket when the window closes. Stateful and time-aware. Conjunction risk computation in Module 3 is a windowed operator.

A sixth shape that doesn't fit neatly into the above is the join, which takes two input streams and produces a single output stream containing matched pairs. Joins are the most expensive operator class — they require state proportional to the unmatched-but-still-relevant events from both sides — and we cover them in detail in Module 3.

The DDIA framing is that operators are streaming versions of relational algebra. Map is projection, filter is selection, fold is aggregation, window is group-by, join is join. The same algebraic identities hold (you can push filters past maps, you can fuse adjacent maps), and the same costs apply (joins are the expensive operation, aggregations require state). If you have an intuition for how a SQL query gets optimized, you have the foundation to reason about a streaming topology.
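
The identities are easiest to see in Rust's own iterator combinators. A small sketch with stand-in transforms: a filter that reads only a field the map preserves can be pushed ahead of the map, and the output is unchanged.

/// Stand-ins: an expensive enrichment that preserves the key, and a
/// cheap predicate on the key alone.
fn enrich(ev: (u64, u64)) -> (u64, u64) { (ev.0, ev.1 * 2) } // (key, value)
fn in_aoi(key: u64) -> bool { key % 3 == 0 }                 // area-of-interest check

fn main() {
    let events = (0..100u64).map(|k| (k, k + 1));
    // map-then-filter: enrich runs on all 100 events.
    let a: Vec<_> = events.clone().map(enrich).filter(|e| in_aoi(e.0)).collect();
    // filter pushed past the map: enrich runs only on survivors, because
    // the predicate reads a field the map leaves untouched. Same output.
    let b: Vec<_> = events.filter(|e| in_aoi(e.0)).map(enrich).collect();
    assert_eq!(a, b);
}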

The Pipeline as a Topology

The full pipeline — sources to sinks — is a directed graph. Vertices are operators (including sources and sinks); edges are streams flowing between operators. Kafka Streams calls this a topology; Flink calls it a job graph; Beam calls it a pipeline. They are all the same object.

The topology has structural properties that matter:

  • Linear vs branching. A linear topology has a single path from source to sink. A branching topology has fan-out (one operator feeds multiple downstreams) or fan-in (multiple operators feed one downstream). The SDA pipeline is both: three sources fan in to a single normalization stage, then fan out to a correlator and a raw archive sink in parallel.
  • Acyclic vs cyclic. Almost all production topologies are acyclic. Cycles introduce hard problems: when does a fixed point exist? How is termination defined? How does backpressure traverse a cycle? Iterative algorithms in Beam and Flink support cycles with explicit barrier semantics, but the cost is significant. Treat cycles as a smell.
  • Stateless vs stateful operators. Stateless operators (map, filter, flatmap) can be parallelized trivially — replicate the operator N ways and load-balance events across the replicas. Stateful operators (fold, window, join) require partitioning — events that share a key must go to the same operator replica, because the state for that key is held there.

The topology view is what makes streaming pipelines explainable to other engineers and to operators. A diagram showing source-to-sink connectivity, with annotations for which operators are stateful and how the streams are partitioned, is more useful than any amount of source code for understanding why the pipeline behaves the way it does.

Statelessness, State, and Partitioning

The most consequential distinction among operators is whether they carry state. A stateless operator is a pure function — given the same input, it produces the same output every time. It can be torn down and rebuilt on a new host with no recovery; it can run with arbitrary parallelism. A stateful operator carries information between events: a counter, a window of recent values, a lookup table. State is the source of operational complexity in streaming systems. Every stateful operator is a question about checkpointing, recovery, exactly-once semantics, and partitioning.

Where does the state live? Three choices, in order of increasing operational cost:

  • In-process state. A HashMap inside the operator. Fast, simple, lost on crash, doesn't survive rescaling. Acceptable for low-importance operators or for operators whose state can be reconstructed from the input stream by replaying recent events.
  • Embedded persistent state. RocksDB or sled inside the operator process. Fast for local access, requires explicit checkpointing for recovery, requires partition-aware redistribution when scaling. This is what Kafka Streams and Flink use for their state backends.
  • External state. A separate database or cache (Redis, Cassandra, the OOR storage engine you built in the Database Internals track). Slow per access, easy to share across operator replicas, decouples scaling from state. Used when state must be queryable from outside the pipeline.

The choice of state backing is one of the most consequential decisions in pipeline architecture. We will not implement embedded persistent state in this track — it would consume the entire track on its own — but we will use in-process state in Modules 2 and 3 and discuss the implications of moving to embedded state in Module 5.

Partitioning is the bridge between state and parallelism. A stateful operator is partitioned by a key — for the SDA correlator, the key is the orbital object identifier. All observations of object 2024-001A route to the same operator replica, where the state for that object lives. Partitioning is what allows stateful operators to scale: add more replicas, repartition the stream by key, and each replica owns a disjoint subset of keys. Partitioning is also where ordering guarantees come from in streaming systems: within a single partition, events are processed in order; across partitions, no order is guaranteed.
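
A minimal sketch of what partitioning looks like in the channel-based style used later in this lesson. DefaultHasher here is illustrative; a production system pins a stable hash so the key-to-replica mapping is reproducible across builds.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route every event for the same key to the same replica index
/// (assumes replicas > 0). That replica then owns all state for the key.
fn partition_for(key: &str, replicas: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % replicas
}

// Usage: one bounded channel per stateful-operator replica, and the
// upstream stage routes by key:
//     txs[partition_for(&object_id, txs.len())].send(obs).await?;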

Why Dataflow Beats Imperative Loops

You could write the SDA pipeline as a single async function that reads from sources, transforms in-line, and writes to sinks. Many systems start that way. Three things go wrong as the pipeline grows:

  1. Mixing concerns. The function ends up containing transport details (UDP receive logic), serialization (binary frame parsing), validation (drop frames with impossible range rates), business logic (correlation), and observability (metrics and lineage). Every change touches a function that touches everything else.

  2. Fixed parallelism. A monolithic loop runs at a single rate. If correlation is slow, ingestion is slow. If ingestion is slow, the radar UDP buffer overflows. The dataflow model lets each operator run at its own rate, with bounded buffers between them — slow operators can be replicated, fast operators can stay single-threaded.

  3. No structural visibility. When a metric goes wrong (P99 latency rises, throughput drops, errors spike), the only handle on the system is the call stack. The dataflow model gives every edge in the graph a name and lets you instrument each one independently. Per-stage lag, per-stage throughput, per-stage error rate become first-class observable properties.

The Kafka Streams architecture documentation makes this explicit: the topology is the artifact you reason about, debug against, and scale. The code that implements the topology is much shorter and changes much less often than the equivalent imperative code would.


Code Examples

Building Operators on tokio Channels

The simplest way to express a pipeline graph in Rust is one task per operator, connected by mpsc channels. Each operator is a long-running async function that owns one or more receivers and one or more senders.

use anyhow::Result;
use tokio::sync::mpsc;

/// A stateless map operator. Reads observations, applies a transformation,
/// forwards the result.
///
/// The signature is parameterized by the transform function for reusability;
/// in the SDA pipeline this is used for normalization, enrichment, and
/// schema migration.
pub async fn map_operator<F>(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
    mut transform: F,
) -> Result<()>
where
    F: FnMut(Observation) -> Observation + Send,
{
    while let Some(obs) = input.recv().await {
        let transformed = transform(obs);
        // .send().await applies backpressure if downstream is full.
        // If the receiver is dropped, the operator terminates cleanly —
        // a downstream shutdown signal naturally propagates upstream.
        if output.send(transformed).await.is_err() {
            tracing::info!("map operator: downstream closed, shutting down");
            return Ok(());
        }
    }
    Ok(())
}

/// A stateless filter operator. Drops observations for which the predicate
/// returns false. Used in SDA for area-of-interest filtering and validation.
pub async fn filter_operator<F>(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
    mut predicate: F,
) -> Result<()>
where
    F: FnMut(&Observation) -> bool + Send,
{
    while let Some(obs) = input.recv().await {
        if predicate(&obs) {
            if output.send(obs).await.is_err() {
                return Ok(());
            }
        }
        // Dropped observations: no send, no backpressure stall.
        // The filter operator runs at the input rate.
    }
    Ok(())
}

Two design points. The operators take exclusive ownership of their Receiver (mpsc receivers cannot be cloned), while the Sender side is cloneable, and the asymmetry is intentional. When a topology has fan-in (multiple operators feed one downstream), each upstream operator owns its own clone of the same Sender. When it has fan-out (one operator feeds multiple downstreams), the operator holds a separate Sender for each downstream channel and sends to each in turn. Second, the operators terminate cleanly when their output channel is closed. This produces a clean shutdown propagation: closing the final sink causes the last operator to terminate, which drops its receiver, which causes the operator before it to terminate, and so on back to the source. This is the streaming-system equivalent of unwinding a call stack — but it works across asynchronous task boundaries.
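
The third stateless shape from the Core Concepts section follows the same pattern. A sketch of a flatmap operator; the SDA normalizer uses this shape when a single radar frame carries multiple detected targets.

/// A stateless flatmap operator: zero or more outputs per input.
pub async fn flat_map_operator<F, I>(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
    mut expand: F,
) -> Result<()>
where
    F: FnMut(Observation) -> I + Send,
    I: IntoIterator<Item = Observation> + Send,
    I::IntoIter: Send,
{
    while let Some(obs) = input.recv().await {
        for out in expand(obs) {
            // Each expanded event is subject to the same backpressure as
            // a map output; a closed downstream ends the operator cleanly.
            if output.send(out).await.is_err() {
                return Ok(());
            }
        }
    }
    Ok(())
}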

A Stateful Fold Operator

The first interesting operator in the SDA pipeline is a stateful one: the deduplicator. The same orbital object can be detected by multiple radar arrays during a single pass; the deduplicator collapses these into a single observation per (object, time-window) pair before forwarding to the correlator.

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Deduplicates observations within a sliding time window keyed on object ID.
/// State is held in-process; on crash, we lose the dedup state and may
/// briefly emit duplicates as the window refills. This is acceptable for
/// the SDA pipeline; alternative state backings are discussed in Module 5.
///
/// The window is *not* an event-time window — that comes in Module 3.
/// This is a processing-time approximation suitable for ingestion-time dedup.
pub async fn dedup_operator(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
    window: Duration,
) -> Result<()> {
    // State: object ID -> last-seen time (in processing-time / wall clock).
    // For SDA volumes (~5e4 active objects), this HashMap stays under 5MB.
    // We periodically prune entries older than `window` to bound memory.
    let mut last_seen: HashMap<String, Instant> = HashMap::new();
    let mut last_pruned = Instant::now();

    while let Some(obs) = input.recv().await {
        // The proper dedup key is a per-object track ID, but the envelope
        // does not carry one: mapping per-source track IDs to global object
        // IDs is the correlator's job (Module 3). Until then we key on
        // source identity, which makes this stage a per-source rate limiter
        // rather than a true per-object dedup; Module 3 replaces it.
        let key = format!("{:?}:{}", obs.source_kind, obs.source_id.0);
        let now = Instant::now();

        let should_emit = match last_seen.get(&key) {
            Some(&t) if now.duration_since(t) < window => false,
            _ => true,
        };

        if should_emit {
            last_seen.insert(key, now);
            if output.send(obs).await.is_err() {
                return Ok(());
            }
        }

        // Periodic pruning to bound memory. Run every `window` interval.
        if now.duration_since(last_pruned) > window {
            last_seen.retain(|_, &mut t| now.duration_since(t) < window * 2);
            last_pruned = now;
        }
    }
    Ok(())
}

The state in this operator is a HashMap<String, Instant>. It survives across observations but does not survive a restart — if the process crashes, the next batch of observations will all appear novel and may be emitted twice. For the SDA pipeline this is an acceptable failure mode; the conjunction analysis tolerates a brief uptick in duplicate events after a restart, and we save the operational cost of a persistent state backend. This tradeoff — accepting transient correctness violations to avoid persistent state — is one of the most common in streaming systems and one that should always be made explicitly. We will revisit it in Module 5 when discussing exactly-once semantics. Note that the dedup window here is in processing time (wall-clock since we last saw this key). This is fine for an ingestion-time guard but is not the same thing as an event-time window — which we cover in Module 3, and which is necessary for any time-correctness guarantee.

Composing the Topology

With operators defined, the topology is built by spawning each operator as a task and wiring the channels between them. The structure of the spawning code is the structure of the pipeline graph.

use tokio::task::JoinSet;

/// Spawns the M1 ingestion topology:
///
///     [radar_src]   ──┐
///     [optical_src] ──┼─→ [normalize] ─→ [dedup] ─→ [sink]
///     [isl_src]     ──┘
///
/// This is the operator graph for the SDA Sensor Ingestion Service.
/// Module 2 extends it with a real orchestrator; Module 3 replaces dedup
/// with a windowed correlator.
pub fn spawn_ingestion_topology(
    mut radar: UdpRadarSource,
    mut optical: OpticalArchiveSource,
    mut isl: IslBeaconSource,
    final_sink: mpsc::Sender<Observation>,
) -> JoinSet<Result<()>> {
    let mut tasks = JoinSet::new();

    // Fan-in: one channel shared by all three sources into normalize.
    // Buffer size 1024 is a starting point; Module 4 covers buffer sizing.
    let (n_tx, n_rx) = mpsc::channel::<Observation>(1024);

    // Source tasks: each pushes to the same n_tx, so we clone it per source.
    let n_tx_radar = n_tx.clone();
    tasks.spawn(async move {
        loop {
            match radar.next().await? {
                Some(obs) => {
                    if n_tx_radar.send(obs).await.is_err() { break; }
                }
                None => break,
            }
        }
        Ok(())
    });
    let n_tx_optical = n_tx.clone();
    tasks.spawn(async move {
        loop {
            match optical.next().await? {
                Some(obs) => {
                    if n_tx_optical.send(obs).await.is_err() { break; }
                }
                None => break,
            }
        }
        Ok(())
    });
    let n_tx_isl = n_tx;  // last clone goes here, drop n_tx
    tasks.spawn(async move {
        loop {
            match isl.next().await? {
                Some(obs) => {
                    if n_tx_isl.send(obs).await.is_err() { break; }
                }
                None => break,
            }
        }
        Ok(())
    });

    // Normalize -> dedup edge.
    let (d_tx, d_rx) = mpsc::channel::<Observation>(1024);
    tasks.spawn(map_operator(n_rx, d_tx, |obs| {
        // Normalize timestamps: ensure ingest_timestamp is set.
        // The actual normalization rules grow as the pipeline matures.
        obs
    }));

    // Dedup -> final sink.
    tasks.spawn(dedup_operator(d_rx, final_sink, Duration::from_millis(500)));

    tasks
}

Notice that the function signature is the topology specification. The arguments name the inputs and outputs; the body lays out which operators connect to which channels. A reader unfamiliar with the codebase can understand the pipeline shape from this function alone — they don't need to chase through five files to understand what flows where. This is the dataflow model paying off: the structure of the code mirrors the structure of the data. Production systems with many operators eventually outgrow inline channel wiring and adopt a topology builder DSL (Kafka Streams' StreamsBuilder is the canonical example), but the principle is the same: declare the graph, let the framework spawn the tasks. We will build a small topology builder in Module 2.

Source note: This lesson synthesizes the dataflow model from DDIA Ch. 11 (which discusses operators and stream processing without using a single canonical name for the framework), Kafka Ch. 14 (which uses the term "topology" extensively), and FDE Ch. 3 (which uses "Dataflow Model" to specifically reference the Apache Beam paper of that name). Engineers familiar with one framework's vocabulary will recognize the others — the concepts are stable across implementations.


Key Takeaways

  • The pipeline is a graph of operators, not a sequence of function calls. Sources at the edges, sinks at the other edges, operators connected by streams in between. This representation is the artifact you architect, debug, and scale against.
  • Operators come in five canonical shapes — map, filter, flatmap, fold, window — plus join. Stateless operators (map, filter, flatmap) parallelize trivially; stateful ones (fold, window, join) require partitioning by key.
  • State is the source of operational cost in streaming systems. Every stateful operator forces decisions about checkpointing, recovery, and where state lives (in-process, embedded persistent, external). Choose deliberately and document the consequences.
  • Backpressure propagates along the topology when channels are bounded and operators await on send. This is the inverse of the imperative-loop pitfall: in dataflow code, doing nothing (awaiting) is the correct behavior under load.
  • The topology specification is architecture documentation. A function whose body wires up channels and spawns operators is the most direct expression of the pipeline's structure — clearer than any prose description. Production systems graduate to a topology DSL but the principle is unchanged.

Lesson 3 — Push, Pull, and Poll Semantics

Module: Data Pipelines — M01: Stream Processing Foundations
Position: Lesson 3 of 3
Source: Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 ("Push Versus Pull Versus Poll Patterns" and "Consumer Pull and Push"); Streaming Data — Andrew Psaltis, Chapter 2 (Common Interaction Patterns); Kafka: The Definitive Guide — Shapira et al., Chapter 4 (Kafka Consumers — The Poll Loop)


Context

Three patterns govern how data crosses boundaries in a streaming pipeline. Push: the producer initiates contact and forwards data to the consumer. Pull: the consumer initiates contact and requests data from the producer. Poll: the consumer initiates contact on a schedule, regardless of whether new data exists. The choice among them is one of the most consequential and underrated decisions in streaming architecture. The wrong choice produces systems that look fine in development and fall over in production — push pipelines that overwhelm slow consumers, pull pipelines that introduce avoidable latency, poll pipelines that burn capacity asking sources that have nothing to say.

Reis and Housley make a strong case in Fundamentals of Data Engineering Chapter 7: every interaction between a source and the pipeline (and between every pair of stages within the pipeline) is one of these three patterns, and the choice should be deliberate. Most production confusion comes from systems that have grown organically into a mix of all three without anyone designing the mix on purpose. The Kafka consumer's poll loop, documented in detail in Shapira et al. Chapter 4, is the canonical example of a deliberately chosen pattern — Kafka could have been push-based, the designers chose poll-based, and the choice shapes the system's operational properties end-to-end.

For the SDA Fusion Service, the three sensor sources naturally land on different patterns. Radar arrays push UDP frames whether anyone is listening; optical archive servers expose a REST endpoint that must be pulled; ISL beacons emit on a fixed cadence that requires polling because there is no notification mechanism on the wire protocol. The pipeline cannot impose a single pattern on all three — it must adapt, and the adaptation logic lives at the boundary. Understanding what each pattern costs and what it guarantees is how you build that boundary correctly.


Core Concepts

Push Semantics

In push, the producer initiates the transfer. When new data exists, the producer sends it to the consumer immediately, without waiting for a request. The consumer is reactive: it receives data when the producer decides.

Push has two compelling properties. First, latency is minimal — data flows from producer to consumer with one network round trip and no waiting. Second, the producer needs no model of consumer demand — it produces at its natural rate and the consumer either keeps up or doesn't. For high-rate, latency-sensitive sources (radar at 200 Hz; market data; sensor telemetry), push is the natural fit.

The cost of push is that flow control is the consumer's responsibility, and getting it wrong is catastrophic. A push consumer that cannot keep up has only three options:

  1. Buffer. Store incoming data until processing catches up. If the producer's rate exceeds the consumer's processing rate persistently, the buffer grows without bound and the consumer OOMs.
  2. Drop. Discard data the consumer cannot process. This is the UDP model — packets in excess of the receive buffer are dropped at the kernel level. Acceptable for high-rate sensor data where individual events are low-value; unacceptable for events that must not be lost.
  3. Push back. Tell the producer to slow down. This requires a feedback channel that push semantics do not provide by default — implementing it is essentially layering pull semantics on top of push.

Push is the right choice when (a) latency matters more than reliability, (b) the producer's rate is bounded by something the consumer can rely on (sensor specifications, network bandwidth), and (c) the cost of a dropped event is low enough to absorb. Outside of those conditions, push is hazardous — it produces systems that work until they don't and fail badly when they do.

Pull Semantics

In pull, the consumer initiates the transfer. When the consumer is ready for more data, it sends a request; the producer responds with whatever is available. The consumer is in control: it consumes at its own rate.

Pull's cardinal property is demand-driven flow. The consumer asks only when it can process; the producer never sends faster than the consumer can absorb. This eliminates the consumer-side OOM mode of push and makes the producer-consumer relationship symmetric: both sides have a model of the other's pace.

Pull's cost is latency — every event waits at the producer until the next pull request arrives. For low-rate, latency-tolerant sources, this is negligible. For high-rate sources, it adds round-trip latency to every event and can become a bottleneck if the pull rate cannot keep up with production. Kafka mitigates this by letting a single pull (poll) return many events at once — max.poll.records (default 500) caps how many records one poll hands the application — so the round-trip cost is amortized across the batch. Network Programming with Rust covers similar batching strategies for non-Kafka pull-based protocols.

Pull is the right choice when (a) the consumer needs control over pace (because it is itself rate-limited downstream, or it processes batches), (b) the producer can hold data until requested (which means it has buffered or persistent storage), and (c) the additional round-trip latency is acceptable. Most production message queue clients are pull-based for exactly these reasons.

Poll Semantics

Polling is pull on a fixed schedule. The consumer issues pull requests at regular intervals — every 30 seconds, every minute, every hour — regardless of whether the producer has new data. It is the simplest possible pull strategy and the easiest to reason about.

Polling has one major virtue: statelessness. The consumer needs no notification mechanism, no long-lived connection, no producer-side state about pending data. Every poll cycle is independent. This is why polling dominates in environments where stateful protocols are expensive: HTTP-based archive APIs, cron-driven batch ingest, IoT devices on intermittent connections, and the Meridian ISL beacon protocol — which has no notification support and exposes only a "give me the latest state" endpoint.

Polling has two costs. First, inherent latency: an event produced just after a poll cycle must wait until the next cycle. With a 30-second poll interval, average per-event latency is 15 seconds and worst-case is 30 seconds. Second, wasted requests: most polls return nothing new, especially when the producer is quiet. The poll-rate-vs-latency tradeoff is direct — halve the interval, double the request rate, halve the average latency. There is no free lunch.

The Kafka poll loop is a hybrid: it is poll-based from the consumer's perspective (the consumer calls poll(timeout) in a loop), but the broker uses long polling to avoid the wasted-request cost. The broker holds the request open until either new data arrives or the timeout expires; if new data arrives during that window, it is returned immediately. Long polling collapses poll's wasted-request cost while preserving the consumer-driven flow control. Kafka: The Definitive Guide Ch. 4 documents the relevant configurations: fetch.min.bytes (don't return until at least this much data is available, up to the timeout) and fetch.max.wait.ms (don't hold the request longer than this).

Pattern Mismatch and Adaptation

The most common architecture problem in pipelines is pattern mismatch: the source uses one pattern and the pipeline expects another. The radar UDP source is push; the optical archive is pull; the ISL beacon must be polled. The pipeline's first stage cannot impose a single pattern on all three sources — it must adapt at the boundary, and the adaptation logic lives in the source implementation.

Three adaptation patterns:

  • Push to pull. The source maintains an internal buffer; an external next() call pulls from the buffer. The radar source uses this pattern: recv_from() is push-based at the kernel level, but the source's next() method exposes a pull interface to the rest of the pipeline. The buffer (the kernel UDP receive buffer) is bounded; overflow drops at the kernel.

  • Pull to poll. The source wraps a pull-based remote API in a polling loop. The optical archive source uses this pattern: it sleeps for the poll interval, makes a pull request to the REST endpoint, and yields the results. The poll interval is the dominant latency cost.

  • Poll to pull. The source polls on its own schedule and exposes the most recently fetched data on next(). The ISL beacon source uses this when the beacon's protocol doesn't support pulls — a background task polls, the foreground next() reads from a shared cell. The latency includes the polling interval plus the time to detect a change.

The pipeline interior, after the source layer, runs entirely on push semantics — tokio::sync::mpsc is push from the sender's perspective, with backpressure providing the rate-control loop that pure push lacks. This is a deliberate architecture choice: by adapting at the boundary, the rest of the pipeline gets uniform semantics and you don't have to reason about three different patterns at every operator.
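
The poll-to-pull shape is the least obvious in code. A hedged sketch, assuming a hypothetical poll_beacon() protocol client (not the production ISL code): a background task polls on its own cadence and publishes into a tokio watch cell; next() awaits a change and clones the latest value out.

use async_trait::async_trait;
use std::time::Duration;
use tokio::sync::watch;

/// Stand-in for the real beacon wire-protocol client (hypothetical).
async fn poll_beacon() -> Result<Observation> {
    unimplemented!("beacon protocol client; out of scope for this sketch")
}

pub struct PolledBeaconSource {
    rx: watch::Receiver<Option<Observation>>,
    name: String,
}

impl PolledBeaconSource {
    pub fn spawn(interval: Duration, name: impl Into<String>) -> Self {
        let (tx, rx) = watch::channel(None);
        tokio::spawn(async move {
            loop {
                match poll_beacon().await {
                    // send() fails only when every receiver is gone;
                    // that is the poller's shutdown signal.
                    Ok(obs) => {
                        if tx.send(Some(obs)).is_err() { break; }
                    }
                    Err(e) => tracing::warn!("beacon poll failed: {e:#}"),
                }
                tokio::time::sleep(interval).await;
            }
        });
        Self { rx, name: name.into() }
    }
}

#[async_trait]
impl ObservationSource for PolledBeaconSource {
    async fn next(&mut self) -> Result<Option<Observation>> {
        // changed() resolves when the poller publishes; Err means the
        // poller task is gone, which we treat as graceful termination.
        if self.rx.changed().await.is_err() {
            return Ok(None);
        }
        Ok(self.rx.borrow_and_update().clone())
    }

    fn name(&self) -> &str { &self.name }
}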

Choosing a Pattern

The decision matrix is small:

Source property                       Push                             Pull                       Poll
Latency-sensitive (<100 ms)           ✓                                only with batching         ✗
Producer rate exceeds consumer rate   risky                            ✓                          ✓
Producer cannot buffer                ✓                                ✗                          ✗
Consumer wants to control pace        ✗                                ✓                          ✓
No notification mechanism on wire     adapt to push at source          adapt to pull at source    ✓
Network is intermittent               risky                            ✓                          ✓
Source is bursty                      only with consumer-side buffer   ✓                          poor — wastes polls during bursts

The Reis and Housley framing in FDE Ch. 7 is that push is appropriate for systems where latency dominates, pull for systems where consumer control dominates, and poll for systems where simplicity dominates. None is universally correct; the right answer is whichever matches the source's wire-protocol capabilities and the pipeline's downstream needs. The wrong answer is whichever the developer was most familiar with from a prior project.


Code Examples

Adapting an HTTP Pull Source: The Optical Archive

The optical telescope archive exposes a REST endpoint at https://optical-archive.meridian.internal/observations?since={timestamp}. The endpoint returns a JSON array of observations. The source's job is to wrap this into a pull-based ObservationSource.

use anyhow::{Context, Result};
use async_trait::async_trait;
use reqwest::Client;
use serde::Deserialize;
use std::collections::VecDeque;
use std::time::{Duration, SystemTime};
use tokio::time::sleep;

#[derive(Deserialize)]
struct OpticalRecord {
    obs_id: String, // the archive's own record ID; unused for now (we mint our own)
    timestamp_ns: u64,
    ra_rad: f64,
    dec_rad: f64,
    sigma_arcsec: f64,
    site_id: String,
}

pub struct OpticalArchiveSource {
    client: Client,
    endpoint: String,
    poll_interval: Duration,
    /// High-water mark: timestamp of the most recent observation we have
    /// already consumed. Sent as the `since` parameter to avoid reprocessing.
    /// Persisting this across restarts is a Module 5 concern; for now we
    /// start fresh and may briefly reprocess on restart.
    high_water_mark_ns: u64,
    /// Local buffer of fetched-but-not-yet-yielded observations. Smooths
    /// the burstiness of a single fetch returning many records.
    buffer: VecDeque<Observation>,
    name: String,
}

impl OpticalArchiveSource {
    pub fn new(endpoint: impl Into<String>, poll_interval: Duration) -> Self {
        Self {
            client: Client::new(),
            endpoint: endpoint.into(),
            poll_interval,
            high_water_mark_ns: 0,
            buffer: VecDeque::new(),
            name: "optical-archive".into(),
        }
    }

    /// Fetch the next batch of observations from the archive. Returns the
    /// number of new observations buffered.
    async fn fetch_batch(&mut self) -> Result<usize> {
        let url = format!("{}?since={}", self.endpoint, self.high_water_mark_ns);
        let records: Vec<OpticalRecord> = self.client
            .get(&url)
            .timeout(Duration::from_secs(10))
            .send()
            .await
            .context("optical archive HTTP GET")?
            .error_for_status()?
            .json()
            .await
            .context("optical archive JSON parse")?;

        let count = records.len();
        for r in records {
            // Advance the high-water mark as we consume the batch.
            // The archive returns records in timestamp order, so the last
            // record sets the new mark.
            self.high_water_mark_ns = self.high_water_mark_ns.max(r.timestamp_ns);

            self.buffer.push_back(Observation {
                observation_id: uuid::Uuid::new_v4(),
                source_id: SourceId(format!("optical-{}", r.site_id)),
                source_kind: SourceKind::Optical,
                sensor_timestamp: SystemTime::UNIX_EPOCH
                    + Duration::from_nanos(r.timestamp_ns),
                ingest_timestamp: SystemTime::now(),
                target: ObservationTarget::Angular {
                    ra_rad: r.ra_rad,
                    dec_rad: r.dec_rad,
                },
                uncertainty: Uncertainty {
                    // Convert arcseconds to radians for downstream uniformity.
                    sigma: r.sigma_arcsec * std::f64::consts::PI / (180.0 * 3600.0),
                },
            });
        }
        Ok(count)
    }
}

#[async_trait]
impl ObservationSource for OpticalArchiveSource {
    async fn next(&mut self) -> Result<Option<Observation>> {
        loop {
            // Buffered observations from a prior fetch take priority —
            // we drain them before issuing a new pull.
            if let Some(obs) = self.buffer.pop_front() {
                return Ok(Some(obs));
            }
            // Buffer empty: fetch a new batch. If the archive returns
            // nothing, sleep for the poll interval and try again.
            // This is the poll part of the pattern.
            match self.fetch_batch().await {
                Ok(0) => {
                    sleep(self.poll_interval).await;
                    // Loop back to fetch again.
                }
                Ok(_) => {
                    // Buffer now has records; loop back to pop_front.
                }
                Err(e) => {
                    // Transient error: log and retry after the poll interval.
                    // Module 5 covers proper retry/backoff strategies; this
                    // is the minimum viable behavior.
                    tracing::warn!("optical archive fetch failed: {e:#}");
                    sleep(self.poll_interval).await;
                }
            }
        }
    }

    fn name(&self) -> &str { &self.name }
}

This source is poll-to-pull adaptation in code. The wire protocol is HTTP (a pull primitive), but the consumer's next() is also pull (the pipeline asks when it wants more). The polling layer is internal: when the buffer empties, the source decides whether to issue another HTTP request immediately or sleep for the poll interval. The since query parameter is the watermark mechanism — without it, every poll would return all observations and the source would either drown in duplicates or have to deduplicate downstream. The watermark approach is the standard way to convert a non-incremental REST API into an incremental stream. Note the operational consequence of the poll interval: at 5 seconds, the average optical observation waits 2.5 seconds at the archive before reaching the pipeline. For SDA's conjunction-detection SLA (sub-30-second end-to-end), this is comfortable. If the SLA tightened to 5 seconds, we would either need to push the optical archive team to add a notification mechanism (push) or shorten the poll interval significantly (which costs them server capacity). This is the conversation the pattern choice forces.

Long-Polling: The Kafka Pattern Applied

For sources where latency matters but the producer can hold open requests, long polling combines the simplicity of polling with the latency of push. The pipeline's poll request stays open at the producer until either new data arrives or the timeout expires.

/// A long-polling source. The remote endpoint supports a `wait_ms` parameter:
/// the request blocks at the server until either a new observation is
/// available or `wait_ms` elapses, whichever comes first. This pattern is
/// borrowed directly from how Kafka's poll() works at the broker.
pub struct LongPollSource {
    client: Client,
    endpoint: String,
    /// How long the server holds the request before returning empty.
    /// Trading off: longer = lower request rate but slower shutdown response.
    wait_ms: u32,
    high_water_mark_ns: u64,
    name: String,
}

#[async_trait]
impl ObservationSource for LongPollSource {
    async fn next(&mut self) -> Result<Option<Observation>> {
        loop {
            let url = format!(
                "{}?since={}&wait_ms={}",
                self.endpoint, self.high_water_mark_ns, self.wait_ms,
            );
            // Note the request timeout exceeds wait_ms — the server will
            // reliably return within wait_ms; we add a small grace period
            // to tolerate network jitter without aborting valid requests.
            let resp = self.client
                .get(&url)
                .timeout(Duration::from_millis(self.wait_ms as u64 + 2_000))
                .send()
                .await?
                .error_for_status()?
                .json::<Vec<OpticalRecord>>()
                .await?;

            if resp.is_empty() {
                // Server returned no new data within wait_ms.
                // Loop immediately to issue another long poll — no sleep.
                continue;
            }
            // Server returned records: convert and yield the first one.
            // (Production code would buffer the rest like OpticalArchiveSource.)
            let r = &resp[0];
            self.high_water_mark_ns = self.high_water_mark_ns.max(r.timestamp_ns);
            return Ok(Some(/* convert r to Observation, see prior example */
                Observation {
                    observation_id: uuid::Uuid::new_v4(),
                    source_id: SourceId(format!("optical-{}", r.site_id)),
                    source_kind: SourceKind::Optical,
                    sensor_timestamp: SystemTime::UNIX_EPOCH
                        + Duration::from_nanos(r.timestamp_ns),
                    ingest_timestamp: SystemTime::now(),
                    target: ObservationTarget::Angular {
                        ra_rad: r.ra_rad,
                        dec_rad: r.dec_rad,
                    },
                    uncertainty: Uncertainty {
                        sigma: r.sigma_arcsec * std::f64::consts::PI / (180.0 * 3600.0),
                    },
                }
            ));
        }
    }

    fn name(&self) -> &str { &self.name }
}

The latency profile of long polling is the producer's data-arrival latency plus one network round trip — essentially the same as push, with no producer-side notification mechanism required. The cost is slightly more server capacity (each consumer holds an open connection) and a deliberate choice of wait_ms that balances request volume against shutdown responsiveness. Kafka brokers default to a 500-ms maximum wait, which is the right order of magnitude for most systems. Note that long polling shifts complexity to the server: the server must support holding requests open, which not all REST APIs do. When it does, long polling is almost always preferable to fixed-interval polling.

A Push Source with Backpressure Adaptation

Sometimes a push source needs to be slowed down. The radar UDP source can't be (the radar emits whether anyone is listening), but a TCP-based push source can be — by simply not reading from the socket. Modern OS TCP stacks signal backpressure all the way to the producer when the consumer's receive buffer fills.

use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

/// A push-based source over TCP. The producer streams length-prefixed binary
/// frames. The transport-level backpressure (TCP windowing) automatically
/// slows the producer when we stop reading — but only if our application-level
/// reads are themselves controlled by backpressure from the downstream sink.
pub struct TcpPushSource {
    stream: TcpStream,
    buf: Vec<u8>,
    name: String,
}

impl TcpPushSource {
    pub async fn connect(addr: &str, name: impl Into<String>) -> Result<Self> {
        let stream = TcpStream::connect(addr).await?;
        Ok(Self {
            stream,
            buf: vec![0u8; 65_536],
            name: name.into(),
        })
    }
}

#[async_trait]
impl ObservationSource for TcpPushSource {
    async fn next(&mut self) -> Result<Option<Observation>> {
        // Read a 4-byte length prefix. A clean EOF here means the peer
        // closed the connection between frames — graceful termination,
        // not an error (read_exact reports it as UnexpectedEof).
        let mut len_buf = [0u8; 4];
        if let Err(e) = self.stream.read_exact(&mut len_buf).await {
            return if e.kind() == std::io::ErrorKind::UnexpectedEof {
                Ok(None)
            } else {
                Err(e.into())
            };
        }
        let frame_len = u32::from_le_bytes(len_buf) as usize;
        if frame_len > self.buf.len() {
            anyhow::bail!("push source: frame size {frame_len} exceeds buffer {}", self.buf.len());
        }
        // Read exactly frame_len bytes.
        self.stream.read_exact(&mut self.buf[..frame_len]).await?;
        // Parse and return as Observation. Implementation omitted for brevity;
        // see the radar source for the deserialization pattern.
        Ok(Some(parse_isl_frame(&self.buf[..frame_len])?))
    }

    fn name(&self) -> &str { &self.name }
}

The interesting thing about this source is what happens when the pipeline downstream is slow. The orchestrator's call to next() will not happen as quickly. The TCP stream's read buffer (kernel-level) fills up. The kernel's TCP window shrinks. The remote producer's send window shrinks. The producer's writes block at the syscall level. The producer is slowed down — automatically, by the operating system, with no application-level coordination required. This is the hidden virtue of TCP-based push: backpressure traverses the network for free, as long as the application never reads faster than it can process. A push source that internally buffered into an unbounded queue would defeat this. Reading with read_exact directly inside next(), with no internal buffering task, preserves it. The Network Programming with Rust text covers TCP windowing and its interaction with application-level I/O in detail.

Source note: This lesson synthesizes pattern-choice guidance from FDE Ch. 7 ("Push Versus Pull Versus Poll Patterns"), Streaming Data Ch. 2 ("Common Interaction Patterns"), and Kafka Ch. 4 ("The Poll Loop"). The long-polling pattern as described matches Kafka's broker-side fetch.max.wait.ms mechanism; the SDA pipeline applies the same pattern to a custom REST API. The TCP-windowing claim about backpressure-for-free is well-established in network-programming texts (Stevens, TCP/IP Illustrated) but worth verifying against the production behavior of the specific TCP stack and kernel you deploy on.


Key Takeaways

  • Push, pull, and poll are not implementation details — they are architectural choices that determine latency, flow control, and failure modes for every interaction in the pipeline. Choose deliberately, document the choice, and review it when requirements change.
  • Push minimizes latency but offloads flow control to the consumer. When push is appropriate (high-rate, low-event-cost sources where drop-on-overload is acceptable), it is the best choice. When it isn't, push systems fail badly under load.
  • Pull gives the consumer rate control. Most production message queue clients are pull-based; the round-trip cost is amortized by batching multiple events per pull request. Kafka's max.poll.records and fetch.min.bytes are the canonical knobs.
  • Long polling is the practical compromise. It collapses the wasted-request cost of fixed-interval polling while preserving consumer-driven flow control. When the producer supports it, long polling is almost always preferable to fixed-interval polling.
  • TCP windowing provides free backpressure for push-over-TCP sources, as long as the application never reads faster than it can process. Internal unbounded application buffering breaks the chain. Stick to bounded buffers and synchronous read-then-process loops to preserve transport-level backpressure end-to-end.

Capstone Project — SDA Sensor Ingestion Service

Module: Data Pipelines — M01: Stream Processing Foundations
Estimated effort: 1–2 weeks of focused work
Prerequisites: All three lessons in this module passed at ≥70%


Mission Brief

OPS DIRECTIVE — SDA-2026-0118 / Phase 1 Implementation Classification: PIPELINE STAND-UP — INGESTION TIER

Stand up the front door of the Space Domain Awareness Fusion Service. Three sensor source types (X-band radar arrays over UDP, optical telescope archive over HTTP, ISL beacon network over TCP) must be unified into a single stream of normalized observation envelopes ready for fusion in downstream stages. The legacy Python ingestion script will be retired when this service reaches feature parity.

Success criteria for Phase 1: the service ingests from all three source types simultaneously, normalizes to the canonical Observation envelope, and writes to a structured event log that downstream stages can consume. Sustained throughput of 10,000 observations per second with end-to-end ingest-to-log latency under 250 ms at the 99th percentile.


What You're Building

A standalone Rust binary, sda-ingest, that:

  1. Connects to three configured source types — UDP radar, HTTP optical archive, TCP ISL beacon listener
  2. Wraps each source in an ObservationSource implementation that normalizes wire formats into the canonical Observation envelope
  3. Composes the three sources into a single fan-in topology that feeds a downstream sink
  4. Writes the normalized stream to a structured JSONL event log on local disk (one observation per line, atomically rotated by size)
  5. Exposes a small HTTP control plane for health checks and basic metrics (per-source ingest rate, channel occupancy, error counters)

The service must run as a long-lived process and gracefully shut down on SIGTERM — flushing the event log, closing source connections, and exiting cleanly. No data should be lost on a clean shutdown; some data may be in flight on a hard kill, and that is acceptable for this module (Module 5 covers durability).

The deliverable is the binary, the test suite, and a 1-page operational README documenting how to run it, what configuration it expects, and what its observable behavior looks like under load.


Suggested Architecture

┌────────────────────┐
│ UDP Radar Source   │──┐
│ (1-3 arrays)       │  │
└────────────────────┘  │
                        │
┌────────────────────┐  │   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ HTTP Optical       │──┼──→│  Normalize   │──→│   Validate   │──→│  JSONL Sink  │
│ Archive Source     │  │   │  (map)       │   │   (filter)   │   │              │
└────────────────────┘  │   └──────────────┘   └──────────────┘   └──────────────┘
                        │
┌────────────────────┐  │
│ TCP ISL Beacon     │──┘
│ Listener           │
└────────────────────┘

Each source runs in its own tokio task. All three feed a shared mpsc::Sender<Observation> (cloned three ways) that drains into the normalize operator. Normalize feeds validate; validate feeds the JSONL sink. The HTTP control plane runs on a separate task; it shares an Arc<Metrics> with the data-plane tasks for read access.
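
A hedged sketch of the shared metrics handle mentioned above; the field names are illustrative, not prescribed. Atomic counters keep the data plane lock-free, and Relaxed ordering is sufficient for independent monotonic counters.

use std::sync::atomic::{AtomicU64, Ordering};

/// Shared as Arc<Metrics>: data-plane tasks write, the control plane reads.
#[derive(Default)]
pub struct Metrics {
    pub radar_ingested: AtomicU64,
    pub optical_ingested: AtomicU64,
    pub isl_ingested: AtomicU64,
    pub deser_errors: AtomicU64,
    pub sink_bytes_written: AtomicU64,
}

impl Metrics {
    /// Bump a counter from the data plane.
    pub fn incr(counter: &AtomicU64, by: u64) {
        counter.fetch_add(by, Ordering::Relaxed);
    }

    /// Snapshot a counter for the /metrics JSON response.
    pub fn read(counter: &AtomicU64) -> u64 {
        counter.load(Ordering::Relaxed)
    }
}

// Usage from a source task:
//     Metrics::incr(&metrics.radar_ingested, 1);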

You may diverge from this architecture if you have a defensible reason. Document it in the operational README.


Acceptance Criteria

Functional Requirements

  • The Observation envelope is defined as in Lesson 1 and used unchanged across all three sources
  • UdpRadarSource consumes the documented wire format (length, layout per Lesson 1) and produces valid Observation records
  • OpticalArchiveSource polls the HTTP endpoint with a since watermark, buffers multi-record responses, and produces valid Observation records (mock the HTTP endpoint for testing — a small mockito or wiremock setup is fine)
  • IslBeaconSource accepts incoming TCP connections, deserializes the wire format, and produces valid Observation records (mock the TCP producer for testing — a tokio::net::TcpListener paired with a producer task is fine)
  • All three sources implement the same ObservationSource trait
  • The topology fan-in correctly merges all three streams into the normalize operator
  • The JSONL sink writes one observation per line, with atomic rotation when the current file exceeds 64 MiB
  • On SIGTERM, the service stops accepting new observations from sources, drains the in-flight pipeline, flushes the JSONL sink, and exits within 5 seconds

Quality Requirements

  • Every source handles a deserialization error by logging it and continuing — one bad frame must not stop the source
  • Every channel in the topology is bounded; buffer sizes are chosen and documented (a comment per channel is sufficient)
  • All await points in the data plane are cancel-safe; a dropped task does not corrupt source state
  • No .unwrap() or .expect() in non-startup code paths (initialization may panic on misconfiguration)
  • At least one integration test exercises the full pipeline end-to-end with all three (mocked) sources running concurrently

Operational Requirements

  • HTTP control plane exposes GET /health returning HTTP 200 when all source tasks are alive, 503 if any source has terminated unexpectedly
  • HTTP control plane exposes GET /metrics returning a JSON object with at minimum:
    • per-source ingest rate (observations per second, EWMA over 30s)
    • per-channel occupancy (current count and capacity)
    • per-source deserialization error count (lifetime of the process)
    • JSONL sink bytes written (lifetime of the process)
  • The service logs structured events (tracing + tracing-subscriber JSON formatter) — one log line per significant lifecycle event, not one per observation
  • A 1-page README.md in the project root documents: build, run, configure, expected metrics under steady-state load, known failure modes

Self-Assessed Stretch Goals

  • (self-assessed) Throughput sustained at 10,000 obs/sec with P99 ingest-to-log latency under 250 ms on a developer laptop. Provide a criterion benchmark and a flamegraph profile showing where time is spent.
  • (self-assessed) The optical source supports both fixed-interval polling and long-polling modes, configurable. Document the latency tradeoff in the README.
  • (self-assessed) The pipeline gracefully handles a "fragmentation event" simulation: drive the radar source at 10x normal rate for 60 seconds and observe that no observations are silently dropped at the application layer (UDP-level kernel drops are acceptable; document them).

Hints

How should I model the configuration?

A small TOML file is the path of least resistance:

[radar]
bind_addrs = ["0.0.0.0:7001", "0.0.0.0:7002"]

[optical]
endpoint = "https://optical-archive.example/observations"
poll_interval_ms = 5000

[isl]
listen_addr = "0.0.0.0:7100"

[sink]
output_dir = "/var/log/sda/ingest"
rotation_bytes = 67108864  # 64 MiB

[control_plane]
bind_addr = "127.0.0.1:9100"

serde + toml makes this trivial. figment if you want layered config (file + env vars). Don't over-engineer; you can add a config crate later.
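A sketch of the matching config types, assuming serde and toml as dependencies; the field names mirror the TOML above:

use serde::Deserialize;
use std::net::SocketAddr;
use std::path::PathBuf;

#[derive(Debug, Deserialize)]
pub struct Config {
    pub radar: RadarConfig,
    pub optical: OpticalConfig,
    pub isl: IslConfig,
    pub sink: SinkConfig,
    pub control_plane: ControlPlaneConfig,
}

#[derive(Debug, Deserialize)]
pub struct RadarConfig { pub bind_addrs: Vec<SocketAddr> }

#[derive(Debug, Deserialize)]
pub struct OpticalConfig { pub endpoint: String, pub poll_interval_ms: u64 }

#[derive(Debug, Deserialize)]
pub struct IslConfig { pub listen_addr: SocketAddr }

#[derive(Debug, Deserialize)]
pub struct SinkConfig { pub output_dir: PathBuf, pub rotation_bytes: u64 }

#[derive(Debug, Deserialize)]
pub struct ControlPlaneConfig { pub bind_addr: SocketAddr }

pub fn load(path: &str) -> anyhow::Result<Config> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}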

What buffer size should the channels use?

The general rule from Module 4: buffers are sized to absorb expected short-term burstiness, not to be a primary backpressure mechanism.

For ingest-to-normalize, a buffer of 1024–4096 is reasonable for SDA's volumes. The dominant cost of an oversized buffer is increased latency under load — every observation in the buffer is one in front of yours. The dominant risk of an undersized buffer is unnecessary backpressure oscillation if downstream is bursty.

Pick a number, document why, and revisit once you have load-test data. You will revisit this in Module 4 with much more rigor.
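In code, the documentation obligation is a comment next to the capacity. A sketch, with a placeholder number:

use tokio::sync::mpsc;

fn ingest_channel() -> (mpsc::Sender<Observation>, mpsc::Receiver<Observation>) {
    // ingest → normalize: 2048 ≈ 200 ms of headroom at the 10k obs/sec
    // Phase 1 target. Sized to absorb bursts, not to act as the primary
    // backpressure mechanism. Revisit with load-test data (Module 4).
    mpsc::channel(2048)
}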

How should I test the UDP radar source?

Spin up a tokio::net::UdpSocket in the test that sends the wire format to the source's bind address. The source thinks it's reading from a real radar; the test constructs the bytes and emits them. This pattern works for any push-over-UDP source.

use tokio::net::UdpSocket;

#[tokio::test]
async fn radar_source_decodes_valid_frame() {
    let bind = "127.0.0.1:0";  // OS picks port
    let mut source = UdpRadarSource::bind(bind, "test-radar").await.unwrap();
    let source_addr = source.local_addr().unwrap();

    // The producer side: encode a known frame and send it.
    let producer = UdpSocket::bind("127.0.0.1:0").await.unwrap();
    let frame = encode_test_radar_frame(/* ... */);
    producer.send_to(&frame, source_addr).await.unwrap();

    let obs = source.next().await.unwrap().unwrap();
    assert_eq!(obs.source_kind, SourceKind::Radar);
    // ... assert other fields
}

You'll need to expose local_addr() on the source, or have the test know the bind address ahead of time (less robust because of port races).

What's a clean way to handle SIGTERM?

tokio::signal::ctrl_c() for SIGINT, tokio::signal::unix::signal(SignalKind::terminate()) for SIGTERM. Combine them with a tokio::select! against the main service loop; on signal, drop the source senders (which closes the channels), and let the normalize → validate → sink chain drain naturally.

use tokio::signal::unix::SignalKind;

let mut sigterm = tokio::signal::unix::signal(SignalKind::terminate())?;
tokio::select! {
    _ = sigterm.recv() => {
        tracing::info!("SIGTERM received; initiating graceful shutdown");
    }
    _ = tokio::signal::ctrl_c() => {
        tracing::info!("SIGINT received; initiating graceful shutdown");
    }
    res = run_service(&mut topology) => {
        // service exited on its own — usually means a source error propagated
        return res;
    }
}
// Drop sources to close their channel senders; downstream drains.
topology.shutdown().await;

The topology.shutdown() method is yours to design — typically it joins all tasks with a deadline and force-aborts any that don't finish in time.

How verbose should the metrics be?

Resist the urge to add a metric per operation. The four metric families required by the acceptance criteria are sufficient for Module 1.

You will revisit metrics with rigor in Module 6, where Reis and Housley's DataOps framing and Kafka's monitoring chapter (Shapira et al. Ch. 13) provide the proper foundation. For now, four metrics that you understand and that work correctly are far better than twenty that you cargo-culted from a Kafka dashboard.
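One workable shape for the shared metrics state — a sketch assuming plain atomic counters; the exact fields and names are yours to choose:

use std::sync::atomic::{AtomicU64, Ordering};

/// Shared between data-plane tasks (writers) and the control plane
/// (reader) behind an Arc<Metrics>. Plain atomics; no locks on the hot path.
#[derive(Default)]
pub struct Metrics {
    pub radar_deser_errors: AtomicU64,
    pub optical_deser_errors: AtomicU64,
    pub isl_deser_errors: AtomicU64,
    pub sink_bytes_written: AtomicU64,
    // Per-source rate EWMAs and channel occupancy are sampled by the
    // control plane rather than pushed; see the /metrics handler.
}

impl Metrics {
    pub fn record_sink_write(&self, bytes: u64) {
        self.sink_bytes_written.fetch_add(bytes, Ordering::Relaxed);
    }
}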


Getting Started

Recommended order:

  1. Define the envelope. Get Observation and the trait in place; write a unit test that round-trips it through serde JSON (a sketch follows this list).
  2. Implement the simplest source. The UDP radar source is the most self-contained — no HTTP, no TCP listener, no state. Start there. Get it tested end-to-end with a UDP producer in the test harness.
  3. Implement the JSONL sink. Get observations flowing source → channel → sink to disk before adding the other sources.
  4. Add the optical source. This is the most complex one because of the HTTP polling and watermark management. Mock the HTTP server in tests.
  5. Add the ISL TCP source. Apply what you learned from radar plus what you learned from the optical-source error handling.
  6. Wire the topology together. Compose the three sources into a fan-in; spawn each as a task; verify the integration test.
  7. Add the control plane. Health and metrics last; they are the cherry on top, not the foundation.
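For step 1, the round-trip test is a few lines. A sketch, assuming the envelope derives Serialize, Deserialize, and PartialEq, with a hypothetical sample_observation() constructor and serde_json in dev-dependencies:

#[test]
fn observation_round_trips_through_json() {
    let original = sample_observation(); // hypothetical test constructor
    let json = serde_json::to_string(&original).unwrap();
    let decoded: Observation = serde_json::from_str(&json).unwrap();
    assert_eq!(original, decoded);
}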

Aim for a working end-to-end pipeline by day 4 even if everything in it is minimal. Optimize and harden after that. Premature optimization (specifically, premature buffer-size tuning) is a common time-sink in this project.


What This Module Sets Up

In Module 2 you will replace this hand-spawned topology with a real orchestrator: a Rust task DAG executor that schedules operators, manages retries, and propagates idempotency keys across stage boundaries. The Observation envelope and the source/sink traits you define here will not change. The topology composition will become declarative.

In Module 3 you will replace the JSONL sink with an event-time correlation operator that windows observations by sensor timestamp and computes conjunction risk. The watermark logic in your optical source becomes the foundation for event-time watermark propagation across the pipeline.

You are not building a throwaway. You are building the first stage of a system that grows for five more modules.

Module 02 — Pipeline Orchestration Internals

Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 2 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, the chapters on tokio::task, JoinSet, cancellation, and structured shutdown; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (operator-graph execution and failure handling); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapters 2–3 (Orchestration as an Undercurrent; Plan for Failure)
Quiz pass threshold: 70% on all four lessons to unlock the project


Mission Context

OPS ALERT — SDA-2026-0119
Classification: ORCHESTRATION TIER STAND-UP
Subject: Replace hand-spawned topology with declarative orchestrator

Phase 1 ingestion (sda-ingest) is in production with three sensor sources, a normalize stage, a validate stage, and a JSONL sink. The next-quarter roadmap adds five more sources, a windowed dedup, a cross-sensor correlator, an alert emitter, and an audit sink. The hand-spawned topology in main.rs has reached the practical limit of what one engineer can hold in their head. Two operational gaps from last quarter's incidents are also unresolved: a panicking task is silently torn down with no recovery hook (SDA-2026-0094 postmortem), and the optical poller's retry logic uses unjittered fixed delays (SDA-2026-0103 postmortem — the 90-minute outage extension after a 30-second partner blip).

Directive: Build the orchestrator. A declarative DAG of operators, supervised by a single supervisor, with retry and circuit-breaker primitives that respect downstream characteristics, and runtime-level bulkheading for the new orbital propagator. The Phase 1 binary is refactored to use the orchestrator with no behavioral regression beyond the documented failure-handling improvements.

This module is the connective tissue of the rest of the track. The orchestrator built here is what every subsequent module's project hangs on. Module 3 changes one operator (the dedup → windowed correlator); Module 4 hardens channel boundaries against burst load; Module 5 makes operator state crash-safe via checkpointing; Module 6 wraps the assembled system in observability. None of those changes alter the orchestrator's API. The shape established here is load-bearing for the next four modules.

The mental model the module installs: an operator is a Task, a topology is an OperatorGraph, a running pipeline is a BuiltGraph driven by a Supervisor. Failures are dispatched on a four-case TaskExit enum. Retries use decorrelated jitter. Resource-isolation needs are met with separate runtimes. None of these patterns are SDA-specific; the orchestrator crate is meant to be reusable across any pipeline that fits the dataflow model.


Learning Outcomes

After completing this module, you will be able to:

  1. Articulate why a pipeline is naturally expressed as a graph of supervised tasks rather than a single async function, and explain the operational properties of each shape
  2. Distinguish CPU-bound from IO-bound operators and place each on the correct part of the Tokio runtime (spawn vs spawn_blocking)
  3. Reason about cancel-safety as a per-await-point property and identify operator implementations that violate it
  4. Build a declarative OperatorGraph with build-time validation of cycles and disconnected operators
  5. Implement a supervisor with bounded restart budgets that distinguishes panics from errors from clean exits
  6. Design retry policies that classify errors correctly, use decorrelated-jitter backoff, and compose with idempotency to produce effective exactly-once processing
  7. Apply circuit breakers and runtime-level bulkheading where retries alone are insufficient

Lesson Summary

Lesson 1 — The Task Model

What tokio::spawn actually does, why CPU-bound work belongs on spawn_blocking, what JoinHandle::abort actually means (cooperative, observed at the next await point), and what cancel-safety means as a per-await-point property. Closes with the Task wrapper struct that gives the orchestrator a uniform handle type for every operator and the TaskExit enum that distinguishes the four operationally meaningful exit cases.

Key question: If JoinHandle::abort() is called and the handle resolves to Err(JoinError) with is_cancelled() == true, does that mean the task has actually stopped running?

Lesson 2 — DAG Scheduling

The OperatorGraph builder, Kahn's algorithm topological sort with cycle detection, the four-pass build() (per-role validation, topo sort, channel allocation, spawn), and JoinSet for managing N operator handles with whichever-finishes-first semantics. Why the bounded-channel-per-edge invariant is what makes backpressure-traversal-through-the-DAG tractable and why fan-in/fan-out are expressed as explicit router operators rather than multi-edge vertices.

Key question: The pipeline has three sources fanning into a single normalize operator. What does the topo-sorted spawn order look like, and which property of the order is what makes the channel-wiring code work?

Lesson 3 — Retries and Idempotency

Three pieces of retry discipline: classifying transient vs permanent vs discardable errors (and why the classification is the operator's responsibility, not the framework's), exponential backoff with decorrelated jitter (and why fixed-delay retries amplify outages), idempotency as a property of the operation that lets at-least-once delivery compose into effective exactly-once. The with_retry wrapper, the RetryDisposition enum, and the dedup sink with sliding-window seen-set bounded by both time and count.

Key question: A hundred operator instances all hit the same downstream failure at the same instant. With fixed-delay retries, what happens to the downstream during recovery, and why?
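For reference, decorrelated jitter is a three-line formula. A sketch assuming the rand crate; the formulation follows the widely cited AWS Architecture Blog post:

use rand::Rng;
use std::time::Duration;

/// next_delay = random_between(base, prev * 3), clamped to cap.
/// Unlike fixed delay, a fleet of retriers spreads out over time
/// instead of re-converging on the downstream at the same instant.
fn decorrelated_jitter(base: Duration, cap: Duration, prev: Duration) -> Duration {
    let low = base.as_millis() as u64;
    let high = (prev.as_millis() as u64).saturating_mul(3).max(low + 1);
    let next = rand::thread_rng().gen_range(low..high);
    Duration::from_millis(next).min(cap)
}

Seed prev with base on the first attempt; each subsequent attempt passes the delay it just slept.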

Lesson 4 — Failure Modes

What retry cannot address: panics, cascading slowdowns, resource exhaustion in shared pools. The supervisor pattern with JoinSet::join_next and TaskExit dispatch. Bounded restart budgets and why "always restart" is dangerous. Three levels of bulkheading (channel, runtime, process). Cascading failures and the discipline of addressing the cause rather than the symptom. Circuit breakers as the fail-fast complement to retries.

Key question: The validate operator panics on a bad input. The pipeline has no supervisor. What happens to the pipeline's apparent behavior, and what changes when the supervisor pattern is wired in?


Capstone Project — Fusion Pipeline Orchestrator

Build the sda-orchestrator library: declarative OperatorGraph, supervised JoinSet-driven Task lifecycle, retry wrapper with decorrelated jitter, circuit breaker, and runtime-level bulkheading via a dedicated tokio::runtime::Runtime for the orbital propagator. The Phase 1 sda-ingest binary from Module 1 is refactored to use the new orchestrator with no behavioral regression beyond the documented failure-handling improvements. Acceptance criteria, suggested architecture, deterministic-timing test patterns, and the full project brief are in project-fusion-orchestrator.md.

The orchestrator is not a throwaway. The interface stays stable through Modules 3, 4, 5, and 6.


File Index

module-02-pipeline-orchestration-internals/
├── README.md                                  ← this file
├── lesson-01-task-model.md                    ← The task model
├── lesson-01-quiz.toml                        ← Quiz (5 questions)
├── lesson-02-dag-scheduling.md                ← DAG scheduling
├── lesson-02-quiz.toml                        ← Quiz (5 questions)
├── lesson-03-retries-idempotency.md           ← Retries and idempotency
├── lesson-03-quiz.toml                        ← Quiz (5 questions)
├── lesson-04-failure-modes.md                 ← Failure modes
├── lesson-04-quiz.toml                        ← Quiz (5 questions)
└── project-fusion-orchestrator.md             ← Capstone project brief

Prerequisites

  • Module 1 (Stream Processing Foundations) completed — the Observation envelope, the ObservationSource trait, the ChannelSink pattern, and the bounded-channel backpressure model are assumed
  • Foundation Track completed — async Rust, channels, network programming
  • Familiarity with tokio::task::JoinSet, tokio::sync::oneshot, tokio_util::sync::CancellationToken, and anyhow::Result
  • Working comfort reading and writing structured logs with tracing

What Comes Next

Module 3 (Event Time and Watermarks) replaces the processing-time dedup operator from Module 1 with a windowed event-time correlator that computes conjunction risk from observations of the same orbital event arriving from multiple sensors. The orchestrator interface stays the same — only one operator's implementation changes. Watermark propagation becomes a property of the graph's edges, which the orchestrator's channel structure already accommodates.

Lesson 1 — The Task Model

Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 1 of 4
Source: Async Rust — Maxwell Flitton & Caroline Morton, the Tokio runtime and tasks chapters (tokio::spawn, JoinHandle, spawn_blocking, cancellation via select!); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 ("The Data Engineering Lifecycle: Orchestration as an Undercurrent")


Context

Module 1 stood up a working pipeline by spawning tasks ad hoc — one for each radar source, one for each optical poller, one normalize stage, one channel sink. The wiring code was a single function in main.rs that grew thirty lines longer every time a new operator was added. That approach scales until it doesn't, and it has stopped scaling. The next quarter's roadmap adds five more sensor sources, a cross-sensor correlator, a windowed dedup stage, an alert emitter, and an audit sink. Hand-spawning that topology produces a function nobody wants to touch. We need an orchestrator: a layer that accepts a description of the pipeline and runs it, deciding what to spawn where, when to restart, and how to shut down cleanly.

Before we can build the orchestrator, we need a precise account of what it is orchestrating. The unit of orchestration is the task — an async function spawned onto the Tokio runtime, paired with a JoinHandle for lifecycle reference. Most engineers writing async Rust have spawned tasks; far fewer have a working mental model of what the runtime guarantees, where the cooperative-scheduling contract breaks, and what cancel-safety actually means in practice. This lesson establishes that vocabulary. The next three lessons (DAG scheduling, retries and idempotency, failure modes) each build on it.

The Reis & Housley framing is useful here: orchestration is one of the "undercurrents" of the data engineering lifecycle — present everywhere, easy to ignore until it breaks. Fundamentals of Data Engineering makes the point that orchestration is not a service you call out to; it is the connective tissue of the whole system. That framing matters because it shapes what the orchestrator is responsible for. Not "scheduling jobs" in the Airflow sense — the SDA Fusion Service is one long-running pipeline, not a graph of nightly batch jobs. Orchestration here means: own the lifecycle of every operator in the pipeline, supervise it, and present a single coherent abstraction to the rest of the program.


Core Concepts

Tasks as the Unit of Work

A task is an async function that has been handed to the Tokio runtime to drive. The runtime owns scheduling — picking which task to poll next, which worker thread to poll it on, and when to suspend a task that is awaiting an unready future. The application owns lifecycle reference through the JoinHandle returned by tokio::spawn. This split is the thing that makes async Rust productive at scale: the runtime is in charge of when work happens; the application is in charge of what work happens.

A task is much cheaper than an OS thread. A thread carries its own stack (typically 8 MiB of reserved virtual address space on Linux, committed lazily), a kernel-scheduled execution context, and the per-thread bookkeeping the kernel maintains. A task is a single allocation containing the future plus a small header — usually under a kilobyte. Spawning a million tasks in a single process is routine; spawning a million OS threads is not. This cheapness is why streaming pipelines are naturally expressed as tasks-per-operator: we can afford one task per source, one per stage, one per partition replica, without thinking hard about the resource cost.

A pipeline operator is one task per stage instance. The radar source task reads from its UDP socket and forwards observations to the next channel. The normalize task reads from that channel and forwards. Every stage is independently schedulable; the runtime interleaves them across worker threads as their inputs become available. Module 1's spawn_ingestion_topology already worked this way; the orchestrator we will build in Lesson 2 makes the pattern explicit and composable.
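The shape of such an operator, sketched — normalize here is a hypothetical pure function:

use tokio::sync::mpsc;

async fn normalize_operator(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
) -> anyhow::Result<()> {
    // Runs until the upstream closes. Suspends — yielding its worker
    // thread — whenever the input is empty or the output is full.
    while let Some(obs) = input.recv().await {
        output.send(normalize(obs)).await?;
    }
    Ok(())
}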

CPU-Bound vs IO-Bound Tasks

Tokio is a cooperative scheduler. A task progresses by being polled; it yields control by hitting an .await that returns Pending. Between awaits, the task holds the worker thread it is running on. If a task spends 200 ms on a CPU-bound computation between awaits, no other task scheduled on that worker thread runs for 200 ms — even if that worker has a thousand pending tasks queued. This is the cooperative-scheduling contract, and violating it does not produce a clean error. It produces tail-latency spikes that look like "the runtime is mysteriously slow."

The rule: async tasks must yield frequently. The threshold has hardened around 10 microseconds as the natural budget per uninterrupted run — small enough that the runtime stays responsive at hundreds of thousands of tasks per second, large enough that you are not yielding inside tight inner loops. When the work between awaits exceeds 10 µs by more than a small factor — let alone 10 ms or 200 ms — the work belongs on a thread, not a task. The mechanism is tokio::task::spawn_blocking, which dispatches the closure to a separate thread pool sized for blocking work (default: 512 threads, configurable). The result comes back as a JoinHandle<T> you can .await from your task, but the work itself does not block the async runtime.

For SDA, the placement decisions are concrete. UDP frame deserialization in the radar source is sub-microsecond — stays in async. JSON deserialization of an optical archive response is single-digit microseconds — stays in async. The orbital propagation library's propagate(state_vector, dt) -> state_vector call runs in 1–5 ms per object — spawn_blocking. The cross-sensor correlator's covariance update is in the hundreds of microseconds — borderline; benchmark before deciding. When in doubt, measure. When unable to measure, prefer spawn_blocking for any computation that touches floating-point matrix algebra or calls into a non-async C library.

Lifecycle and JoinHandle

tokio::spawn(future) returns a JoinHandle<T> where T is the future's output. The handle is your sole reference to the running task. It carries three operations worth understanding precisely:

  • .await on the handle waits for the task to finish and yields its Result<T, JoinError>. The error case captures task panics and aborts; the success case is the future's actual return value.
  • .abort() signals the task to stop. This is cooperative: the runtime sets a cancellation flag, and the task observes it the next time it reaches an await point. A task in a tight CPU loop with no awaits ignores .abort() indefinitely. This is the same problem that puts CPU-bound work on spawn_blocking — and it has the same solution.
  • Drop of the handle does not cancel the task. By default, dropping a JoinHandle simply detaches the task; it continues running, and you have lost your reference to it. Detached tasks are a frequent source of operational problems: they outlive the code that spawned them, they accumulate without being supervised, and they hold resources that nobody else can release. The orchestrator must never detach an operator task.

For long-running pipeline operators — the kind that drain a channel and forward to the next — the handle never resolves under normal operation. The task runs as long as the upstream channel produces. "Completion" for an operator means the upstream closed (the source ran out, or the orchestrator triggered shutdown). The handle resolving with Ok(()) is the shutdown-success signal; resolving with Err(JoinError) is a panic signal; never resolving (until aborted) is the steady state.
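The three operations in miniature — a sketch against tokio's public API:

use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        loop { sleep(Duration::from_secs(1)).await; } // yields every second
    });

    handle.abort(); // cooperative: observed at the task's next await point
    let err = handle.await.unwrap_err();
    assert!(err.is_cancelled()); // aborted, not panicked

    // Dropping the handle instead of aborting would have *detached* the
    // task: it keeps running with no owner. The orchestrator never does this.
}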

Cancel-Safety

A future is cancel-safe if dropping it at any await point leaves no observable side effects partially applied. Equivalently: the future has either completed an operation or not started it; there is no third state where a row was half-inserted, a transaction was started but not committed, or a kernel buffer was half-read. Cancel-safety is a property of an individual await point, not of a function as a whole. A function with one cancel-safe await and one non-cancel-safe await is non-cancel-safe.

The reason this matters for the orchestrator is that shutdown is implemented by aborting tasks, and aborting a task is effectively dropping its current future. If an operator's inner loop is non-cancel-safe at the await point that gets aborted, the operator leaves the system in an inconsistent state on shutdown. This is the underlying mechanism for "the system runs fine until we deploy a new version and then conjunction alerts go missing for ninety seconds" — the alerts that were in flight got aborted at a non-cancel-safe await.

The Tokio primitives are largely cancel-safe by design: mpsc::Receiver::recv, Notify::notified, time::sleep, select! with cancel-safe arms, UdpSocket::recv_from. Many third-party crates are not. Database drivers that hold a transaction across an await are usually non-cancel-safe (the transaction stays open with no committer if the future is dropped). HTTP clients that stream a response body are non-cancel-safe at the body-read await. The discipline: when you write or use a non-cancel-safe future, wrap it in select! against the cancellation signal and explicitly handle the cancel branch — close the transaction, drop the response, release the lock — before the task exits.

A Task Abstraction for the Orchestrator

The plain JoinHandle<Result<()>> is enough to spawn and abort a task, but the orchestrator needs more. It needs a name (for logs and metrics), a restart policy (Lesson 4 builds this out), and the ability to query whether the task is alive and what failed if it isn't. We wrap the handle:

pub struct Task {
    name: String,
    handle: JoinHandle<Result<()>>,
    restart_policy: RestartPolicy,
}

This is the type that Lesson 2's DAG scheduler operates over and Lesson 4's supervisor restarts. The wrapper costs almost nothing at runtime — it is three fields next to an already-existing handle — but it gives the orchestrator a uniform handle type for every operator in the topology. Heterogeneous return types (some operators emit Result<Vec<Observation>>, some Result<()>) are not a concern in practice: every operator the orchestrator manages returns Result<()> because it runs forever until shut down, and a non-() return value would be discarded anyway. The standardization is on purpose.


Code Examples

Spawning a CPU-Bound Operator with spawn_blocking

The orbital propagator is the canonical CPU-bound operator in the SDA pipeline. Given a state vector and a propagation interval, it integrates the equations of motion forward in time — heavy floating-point work, no I/O, single-digit milliseconds per call. Putting that work in a normal async task is the standard mistake.

use anyhow::Result;
use tokio::sync::mpsc;
use tokio::task;

/// The wrong way: a CPU-bound operator in an async task.
/// This blocks the worker thread it runs on for the full duration of
/// `propagate`, starving every other task on that worker.
pub async fn propagate_inline(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<PropagatedObservation>,
) -> Result<()> {
    while let Some(obs) = input.recv().await {
        // propagate is 1-5ms of pure CPU. The tokio worker thread
        // running this task is unavailable to anything else for the
        // duration. With 8 worker threads, eight slow propagations
        // in flight stalls the entire runtime until they complete.
        let propagated = orbital::propagate(obs.target, obs.sensor_timestamp);
        output.send(PropagatedObservation { obs, propagated }).await?;
    }
    Ok(())
}

/// The right way: hand the CPU work to the blocking pool, await its
/// JoinHandle from inside the async context. The async worker thread
/// remains free to poll other tasks while the propagation runs.
pub async fn propagate_offloaded(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<PropagatedObservation>,
) -> Result<()> {
    while let Some(obs) = input.recv().await {
        // spawn_blocking returns a JoinHandle<T>. Awaiting it suspends
        // *this* task without blocking the worker thread; the runtime
        // is free to poll other operators in the meantime. The closure
        // runs on the blocking pool (default 512 threads).
        let propagated = task::spawn_blocking(move || {
            orbital::propagate(obs.target, obs.sensor_timestamp)
        })
        .await?;  // ? surfaces the JoinError (panic or abort on the blocking pool)
        output.send(PropagatedObservation { obs, propagated }).await?;
    }
    Ok(())
}

The 10-microsecond budget says anything past it should yield; the threshold for offloading to spawn_blocking is higher — closer to a millisecond — because the dispatch itself (allocating a closure, signaling the blocking pool) costs single-digit microseconds. Below that, you spend more on the offload than you save. Above it, you lose at least an order of magnitude of throughput by leaving the work inline. The two-version comparison above makes the difference visible in flame graphs: propagate_inline shows continuous on-CPU time with no idle worker threads even when input arrives in bursts; propagate_offloaded shows CPU on the blocking pool and idle async workers ready to handle the next operator's work.

A Cancel-Unsafe Operator and How to Fix It

The orchestrator triggers shutdown by aborting operator tasks. If an operator holds a resource across a non-cancel-safe await, that resource is leaked when the task is aborted. The example below is a sink that writes observations to an embedded SQLite — a common pattern for the audit sink — and the bug it has and the fix that resolves it.

use anyhow::Result;
use rusqlite::{params, Connection};
use tokio::select;
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

/// Cancel-unsafe: the transaction is opened, observations are written
/// inside it, and the commit happens after .await on the channel recv.
/// If the task is aborted while waiting for the next batch, the
/// transaction stays open — and SQLite holds a writer lock the rest
/// of the process cannot release until the connection is dropped.
pub async fn audit_sink_unsafe(
    mut input: mpsc::Receiver<Observation>,
    mut conn: Connection,
) -> Result<()> {
    loop {
        let txn = conn.transaction()?;       // begin
        for _ in 0..1000 {
            let obs = match input.recv().await {  // ← abort point inside txn
                Some(o) => o,
                None => return Ok(()),
            };
            txn.execute(INSERT_OBS_SQL, params![&obs.observation_id])?;
        }
        txn.commit()?;
    }
}

/// Cancel-safe: select! between the channel recv and an explicit
/// cancellation token. The cancel branch closes the transaction
/// before the task exits. Aborts of the task itself become
/// rare — shutdown comes through the token.
pub async fn audit_sink_safe(
    mut input: mpsc::Receiver<Observation>,
    mut conn: Connection,
    shutdown: CancellationToken,
) -> Result<()> {
    loop {
        let txn = conn.transaction()?;
        for _ in 0..1000 {
            select! {
                recv = input.recv() => match recv {
                    Some(obs) => {
                        txn.execute(INSERT_OBS_SQL, params![&obs.observation_id])?;
                    }
                    None => {
                        txn.commit()?;
                        return Ok(());
                    }
                },
                _ = shutdown.cancelled() => {
                    // Explicit unwind: drop the in-flight transaction
                    // before returning. SQLite will roll it back when
                    // the txn binding goes out of scope.
                    drop(txn);
                    return Ok(());
                }
            }
        }
        txn.commit()?;
    }
}

Two design points are worth lingering on. First, the fix is not "make the SQLite call cancel-safe" — that is not a property the underlying library can offer. The fix is to wrap the non-cancel-safe await in a select! whose other arm is a cancellation signal that the orchestrator controls. The task is now the one that decides how to unwind, on its own terms. Second, the cooperative shutdown via CancellationToken makes JoinHandle::abort a fallback rather than the primary mechanism. Production orchestrators call shutdown.cancel() first, give every operator a grace window (typically 5–10 seconds) to drain, and only fall back to .abort() for operators that have not exited. Lesson 4 returns to this pattern.

The Task Wrapper

This is the type the rest of the orchestrator works with. Heterogeneous operator implementations — a UDP source, an HTTP poller, a windowed correlator — all become Task values once spawned, with a uniform interface for the scheduler and supervisor.

use anyhow::{Context, Result};
use std::time::{Duration, Instant};
use tokio::sync::oneshot;
use tokio::task::{JoinError, JoinHandle};

/// What the supervisor should do when this task exits.
#[derive(Debug, Clone, Copy)]
pub enum RestartPolicy {
    /// Restart the operator on any non-graceful exit.
    Always,
    /// Restart up to N times within the given window.
    Bounded { max_restarts: u32, window: Duration },
    /// Never restart; failure of this operator should fail the pipeline.
    /// Reserved for operators where data integrity is at stake (e.g.,
    /// a sink whose retry would produce double-writes).
    Never,
}

/// Spawned operator handle. The orchestrator stores one of these per
/// operator in the running topology.
pub struct Task {
    name: String,
    handle: JoinHandle<Result<()>>,
    restart_policy: RestartPolicy,
    spawned_at: Instant,
}

impl Task {
    /// Spawn an operator and wrap its JoinHandle.
    pub fn spawn(
        name: impl Into<String>,
        restart_policy: RestartPolicy,
        future: impl std::future::Future<Output = Result<()>> + Send + 'static,
    ) -> Self {
        Self {
            name: name.into(),
            handle: tokio::spawn(future),
            restart_policy,
            spawned_at: Instant::now(),
        }
    }

    pub fn name(&self) -> &str { &self.name }
    pub fn restart_policy(&self) -> RestartPolicy { self.restart_policy }
    pub fn uptime(&self) -> Duration { self.spawned_at.elapsed() }

    /// True if the task has not yet completed. Cheap; useful in
    /// the supervisor's poll loop.
    pub fn is_alive(&self) -> bool { !self.handle.is_finished() }

    /// Wait for the task to exit. Distinguishes operator-returned
    /// errors from runtime panics so the supervisor can react
    /// differently to each (Lesson 4).
    pub async fn join(self) -> TaskExit {
        match self.handle.await {
            Ok(Ok(())) => TaskExit::Ok,
            Ok(Err(e)) => TaskExit::Errored(e),
            Err(join_err) if join_err.is_panic() => {
                // The panic payload is a Box<dyn Any>; Debug on it prints
                // nothing useful, so downcast the common string cases.
                let payload = join_err.into_panic();
                let msg = payload
                    .downcast_ref::<&str>()
                    .map(|s| s.to_string())
                    .or_else(|| payload.downcast_ref::<String>().cloned())
                    .unwrap_or_else(|| "non-string panic payload".to_string());
                TaskExit::Panicked(msg)
            }
            Err(_) => TaskExit::Aborted,
        }
    }

    /// Cooperative shutdown via the orchestrator's cancellation
    /// token would normally have run first; this is the fallback
    /// for operators that did not honor the token in time.
    pub fn abort(&self) { self.handle.abort(); }
}

#[derive(Debug)]
pub enum TaskExit {
    Ok,
    Errored(anyhow::Error),
    Panicked(String),
    Aborted,
}

The join method is the single most important part of this type. It collapses Tokio's two-level error reporting (Result<Result<()>, JoinError>) into a flat TaskExit enum that distinguishes the four operationally meaningful cases. A panicked operator is a bug we want to alert on. An errored operator returned Err(_) from its future — it is an expected failure mode (a network partition, a transient API failure) and should be retried per its policy. An aborted operator was shut down by the orchestrator and is not a failure. An Ok exit means the operator's input ran out — the source closed cleanly — which is normal at end-of-stream but unusual in steady state. The supervisor in Lesson 4 dispatches on TaskExit directly; everything past this lesson assumes the wrapper exists.


Key Takeaways

  • A task is the unit of work the runtime schedules; a JoinHandle is your application's sole reference to it. Drop of the handle detaches rather than cancels, and detached tasks accumulate silently — the orchestrator must own every operator's handle.
  • The cooperative-scheduling contract demands tasks yield at sub-millisecond granularity. CPU-bound work belongs on spawn_blocking, not on tokio::spawn. Failure to honor this rule produces tail-latency spikes that look like runtime instability rather than the application bug they are.
  • JoinHandle::abort is cooperative: it sets a flag observed at the next await point. CPU-bound tasks ignore aborts until they yield. Production shutdown is a two-step protocol: signal a CancellationToken for cooperative drain, then fall back to .abort() for stragglers.
  • Cancel-safety is a per-await-point property. Non-cancel-safe awaits — database transactions, streaming HTTP bodies, custom locks — must be wrapped in select! against an explicit cancellation signal so the operator unwinds on its own terms. Aborting a non-cancel-safe future leaks the resource it held.
  • The Task wrapper turns Tokio's two-level error reporting into a four-case TaskExit enum (Ok, Errored, Panicked, Aborted) that the supervisor in Lesson 4 dispatches on. Standardize on Result<()> for every operator; standardize on Task for every spawned handle.

Lesson 2 — DAG Scheduling

Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 2 of 4
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (the operator-graph execution model in stream processing); Async Rust — Maxwell Flitton & Caroline Morton, sections on tokio::task::JoinSet and tokio::select!; Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 (Orchestration as the connective tissue of the lifecycle)

Source note: The DDIA chapter discussion of operator-graph execution is referenced from training knowledge of the printed text; the core model — vertices as operators, edges as channels, scheduling via topological order — is well-established and unchanged across editions. Specific algorithmic choices (Kahn's algorithm vs DFS-postorder) are decisions made for this curriculum, not direct citations.


Context

Lesson 1 established the task as the unit of orchestration and the Task wrapper as the orchestrator's handle on each one. We now need a way to describe an entire pipeline's worth of operators — radar source, ISL listener, optical poller, three normalizers, a windowed dedup, a correlator, a conjunction emitter, an audit sink — and turn that description into running tasks with their channels correctly wired. Module 1's spawn_ingestion_topology did this by hard-coding the topology in a long function. That technique is fine for three operators and fragile for ten. The orchestrator replaces it with a declarative graph: the application says what the pipeline looks like, and the orchestrator figures out what to spawn and in what order.

The data structure that captures "what the pipeline looks like" is a directed acyclic graph. Vertices are operators; directed edges are typed channels carrying observations (or, later, watermarks and barriers) from one operator to the next. The acyclic constraint is operationally critical: a pipeline with a cycle is a pipeline that can deadlock under backpressure, and a pipeline that can deadlock under backpressure is one we want to refuse to start rather than start and hope. We will spend most of this lesson building the graph type and its scheduler. The remaining time is on the operational properties the graph buys us: clean shutdown via JoinSet, end-to-end backpressure preservation, and detection of unreachable or orphaned operators at build time rather than at 3 AM.

The forward references matter. Module 3 adds watermark propagation to the graph's edges; the graph type designed here must accommodate that addition without a rewrite. Module 4 deepens the backpressure analysis; the bounded-channel-per-edge invariant established here is what makes it tractable. Module 5 adds checkpoint barriers as graph-level events; same invariant applies. The shape of OperatorGraph is load-bearing for the rest of the track.


Core Concepts

The Pipeline as a DAG

A pipeline's logical structure is a graph. Vertices are operators — the named pieces of work the pipeline performs (a source, a normalizer, a windowed dedup, a sink). Edges are channels — the typed conduits along which data flows from one operator to the next. The graph is directed because data flows in one direction along each edge. The graph is acyclic because allowing cycles introduces deadlock conditions that are tedious to reason about and unnecessary for the SDA pipeline's needs.

The vertices carry the operator's identity and the means to spawn it: a name, the async function (or closure) that runs the operator, the channel ends it expects to receive, and its restart policy from Lesson 1. The edges carry the channel itself plus its capacity. Edges between specific operators are typed — a channel from radar to normalize carries Observation, a channel from windowed-dedup to correlator might carry DedupedObservation — but the graph as a whole is heterogeneously typed and represented internally with type erasure (the operator function is boxed; the channels are stored in a typed map keyed on edge ID). This is a deliberate tradeoff that we will spell out in the code: the static-typing alternative produces a graph type whose generic parameters explode with the number of edges, and the orchestrator becomes harder to use than the manual spawn it replaces.

Forbidding cycles is a real constraint, not just a convention. A cyclic pipeline has at least one operator whose downstream depends on its own upstream; under backpressure, the upstream blocks waiting for the downstream to consume, and the downstream blocks waiting for the upstream to produce. Without a tie-breaking mechanism (an explicit budget on the cycle, an unbounded buffer, a priority order on which side blocks first), the cycle deadlocks. SDA's pipelines do not need cycles — every legitimate use case (a "rerun the failed observations" loop, a "feed audit results back to the source for sampling weight adjustment" path) can be expressed as a separate downstream operator that re-emits into a new pipeline. The orchestrator refuses to build a graph that contains a cycle, and produces an error message naming the cycle.

Topological Sort and Channel Wiring

Spawning the operators in arbitrary order does not work. The downstream operator's task must be created with the receiver end of the channel that connects it to the upstream operator. That receiver does not exist until the channel itself is created, which (in the natural construction order) happens when the upstream is being prepared. The natural flow is therefore: walk the graph in topological order (every operator visited before any of its descendants), create each operator's outgoing channels first, hand the receiver halves to the descendants when their turn comes.

Two algorithms produce a topological order: Kahn's algorithm (repeatedly pick a vertex with no remaining incoming edges and remove it) and DFS post-order (depth-first traversal, emit each vertex after all its descendants). Both are O(V + E). Kahn's is operationally preferred for pipelines because the order it produces is level-by-level — sources first, then their immediate consumers, then theirs, and so on — which corresponds to how an operator engineer thinks about the pipeline. DFS post-order can produce a less intuitive order in which a deep chain is finished before a shallow sibling is started; the spawned tasks are then more interleaved than the engineer expected.

The wiring problem has a two-pass structure that the code below makes explicit: the first pass over the graph creates every channel (sender + receiver pair) and stores each pair in a map keyed on the edge ID; the second pass spawns each operator with the receiver end of its incoming edge and the sender end of its outgoing edge. The separation is what makes the wiring robust: by the time any operator spawns, every channel it touches already exists, so no operator is ever handed a placeholder end. Single-pass approaches that try to allocate channels lazily as operators are walked tend to produce subtle bugs — operators spawned with placeholder channels that have to be patched, or scheduling orders that look topological but skip an edge.

JoinSet for Lifecycle Management

A pipeline with N operators produces N JoinHandles. The orchestrator's supervisor (Lesson 4) needs to react when any one of them completes — usually because of a panic or an error, sometimes because of a clean shutdown signal. Iterating over a Vec<JoinHandle> and calling .await on each is the wrong pattern: it serializes on the slowest operator's completion and offers no way to react to whichever finishes first. The right primitive is tokio::task::JoinSet.

A JoinSet<Result<()>> is a bag of in-flight tasks with a join_next() -> Future<Output = Option<Result<Result<()>, JoinError>>> method. The future resolves whenever any task in the set finishes, returning that task's result. The supervisor awaits join_next in a loop, dispatching on each result as it arrives. When the supervisor decides the pipeline should shut down, it calls JoinSet::abort_all(), which sends an abort signal to every task and lets the supervisor drain the resulting JoinErrors through the same join_next loop. The structural property: every operator's lifecycle ends in the supervisor's hands, never through a detached return.

JoinSet is also the right place for the heterogeneous-return-type discipline from Lesson 1. Every operator returns Result<()>; that uniformity is what lets JoinSet be a single typed structure for the entire pipeline. The standardization paid for itself the moment we needed a single supervisor.
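The skeleton of that loop, sketched with the dispatch bodies elided (Lesson 4 fills them in):

use tokio::task::JoinSet;

async fn supervise(mut set: JoinSet<anyhow::Result<()>>) {
    // Resolves whenever *any* operator finishes; None means the set is empty.
    while let Some(joined) = set.join_next().await {
        match joined {
            Ok(Ok(())) => { /* clean exit — normal at end-of-stream */ }
            Ok(Err(_e)) => { /* operator error — retry per policy (Lesson 4) */ }
            Err(je) if je.is_panic() => { /* bug — alert */ }
            Err(_aborted) => { /* orchestrator-initiated shutdown */ }
        }
    }
}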

Backpressure Through the DAG

Module 1 established mpsc::Sender::send().await as the foundation of backpressure: a full channel suspends the upstream until the downstream catches up. The DAG inherits that property as long as every edge in the DAG is a bounded channel. The orchestrator enforces this with a graph-level invariant: there are no unbounded_channel calls anywhere in the graph builder. Edges are sized at construction time; sizes are documented per-edge with a comment about expected burst behavior; Module 4 develops the sizing discipline.

The DAG-level corollary is that backpressure traverses the entire graph. A slow sink causes its upstream's channel to fill, suspends that upstream, which causes its upstream's channel to fill, and so on back to the sources. The sources themselves either suspend on their own producing primitive (the UDP socket recv, the HTTP poll) or, for sources that cannot suspend (a radar feed that produces whether we are listening or not), drop at the kernel level. The pipeline never grows unbounded internal buffers under load. This is the property the audit script in Lesson 3 verifies and the property the Module 4 burst test exercises.

The DAG is also the natural place to check cycles. We forbid them because a cyclic pipeline cannot have this backpressure-traversal property — a cycle has no "back" to propagate to. If a future requirement legitimately needs cyclic data flow (a feedback path for late-arriving observations, a "retry the failed conjunction analysis" loop), it must be implemented as a feed-through to a new pipeline, not as a back-edge in the existing one. The constraint is a feature.

Unreachable Tasks and Orphaned Channels

The first non-trivial bug a pipeline-graph type catches is the disconnection bug. An engineer adds a new operator, registers it with the graph, but forgets to call connect(upstream, new_operator) or connect(new_operator, downstream). The graph builds. The operator spawns. It either reads from an empty channel forever (its upstream was never wired) or writes into a channel nobody reads (its downstream was never wired) — and the former scenario has no symptoms until somebody notices the operator is silent, while the latter scenario fills its channel and applies false backpressure to its upstream. Both bugs cost real time when they happen in production.

The graph builder catches both cases at build() time. An operator that has no incoming edges is either a registered source (legitimate) or a misconfigured operator (illegal). An operator that has no outgoing edges is either a registered sink (legitimate) or a misconfigured operator (illegal). The builder's role enumeration distinguishes these cases; an operator declared as add_operator (interior) is required to have both an incoming and an outgoing edge, while add_source and add_sink adjust the constraint accordingly. The error message names the offending operator and the missing edge direction. This kind of build-time validation is what gives the declarative orchestrator its primary advantage over hand-spawned topology: bugs that would require runtime instrumentation to detect become compile-time-style errors at startup.


Code Examples

The OperatorGraph Builder

The graph is constructed by a builder that collects vertices and edges and validates them at build() time. Operators are stored as type-erased boxed closures so the graph itself is not generic over each operator's type signature.

use anyhow::{bail, Result};
use std::future::Future;
use std::pin::Pin;
use tokio::sync::mpsc;

/// What an operator does once spawned. Erased to a boxed future so the
/// graph stores a heterogeneous collection.
pub type OperatorFuture = Pin<Box<dyn Future<Output = Result<()>> + Send + 'static>>;

/// An operator's role in the topology. The builder uses this to validate
/// that operators have the right edges connected.
#[derive(Debug, Clone, Copy)]
pub enum Role { Source, Operator, Sink }

struct VertexSpec {
    name: String,
    role: Role,
    /// Constructed once the channels are wired in build(). Closure takes
    /// the operator's incoming and outgoing channel-end handles.
    factory: Box<dyn FnOnce(WiredEnds) -> OperatorFuture + Send>,
    incoming: Option<EdgeId>,
    outgoing: Option<EdgeId>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct EdgeId(u32);

/// The handles passed into the operator's factory closure when build()
/// wires the topology.
pub struct WiredEnds {
    pub rx: Option<mpsc::Receiver<Observation>>,
    pub tx: Option<mpsc::Sender<Observation>>,
}

pub struct OperatorGraph {
    vertices: Vec<VertexSpec>,
    edges: Vec<EdgeSpec>,
}

struct EdgeSpec {
    id: EdgeId,
    from_idx: usize,
    to_idx: usize,
    capacity: usize,
}

impl OperatorGraph {
    pub fn new() -> Self {
        Self { vertices: Vec::new(), edges: Vec::new() }
    }

    pub fn add_source(&mut self, name: impl Into<String>,
                      factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
                      -> usize
    {
        self.push_vertex(name.into(), Role::Source, factory)
    }

    pub fn add_operator(&mut self, name: impl Into<String>,
                        factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
                        -> usize
    {
        self.push_vertex(name.into(), Role::Operator, factory)
    }

    pub fn add_sink(&mut self, name: impl Into<String>,
                    factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
                    -> usize
    {
        self.push_vertex(name.into(), Role::Sink, factory)
    }

    fn push_vertex(&mut self, name: String, role: Role,
                   factory: impl FnOnce(WiredEnds) -> OperatorFuture + Send + 'static)
                   -> usize
    {
        let idx = self.vertices.len();
        self.vertices.push(VertexSpec {
            name,
            role,
            factory: Box::new(factory),
            incoming: None,
            outgoing: None,
        });
        idx
    }

    pub fn connect(&mut self, from: usize, to: usize, capacity: usize) -> Result<EdgeId> {
        if self.vertices[from].outgoing.is_some() {
            bail!("operator {} already has an outgoing edge", self.vertices[from].name);
        }
        if self.vertices[to].incoming.is_some() {
            bail!("operator {} already has an incoming edge", self.vertices[to].name);
        }
        let id = EdgeId(self.edges.len() as u32);
        self.edges.push(EdgeSpec { id, from_idx: from, to_idx: to, capacity });
        self.vertices[from].outgoing = Some(id);
        self.vertices[to].incoming = Some(id);
        Ok(id)
    }
}

The shape is more verbose than a typical graph type because it carries the operator-spawn closure inline. The closure captures whatever the operator needs (config, source addresses, etc.) and is only invoked once build() has wired the channels. This delays the construction of channel-touching state until we actually have the channels, which is exactly the two-pass structure the lesson described. The single-incoming and single-outgoing restriction (one edge per direction per operator) is a simplification that the SDA pipeline does not need to violate; fan-in and fan-out are handled by intermediate router operators rather than by multi-edge vertices. This keeps the topo-sort and the wiring logic simple at modest cost in expressiveness, and the cost is recoverable when needed by introducing explicit fan-in/fan-out operators with their own typed semantics.

Topo-Sort, Cycle Detection, and Build

The build() step is where validation happens. It runs Kahn's algorithm to produce a topological order and to detect cycles, validates per-role edge requirements, allocates the channels, and spawns each operator with its wired channel ends.

use std::collections::{HashMap, VecDeque};

pub struct BuiltGraph {
    /// Spawned tasks, in topological order. The supervisor takes ownership.
    pub tasks: Vec<Task>,
}

impl OperatorGraph {
    pub fn build(self) -> Result<BuiltGraph> {
        // Pass 1: validate role constraints.
        for v in &self.vertices {
            match v.role {
                Role::Source if v.incoming.is_some() =>
                    bail!("source {} has an incoming edge; sources have no upstream", v.name),
                Role::Sink if v.outgoing.is_some() =>
                    bail!("sink {} has an outgoing edge; sinks have no downstream", v.name),
                Role::Operator if v.incoming.is_none() =>
                    bail!("operator {} has no incoming edge; did you forget to connect()?", v.name),
                Role::Operator if v.outgoing.is_none() =>
                    bail!("operator {} has no outgoing edge; did you forget to connect()?", v.name),
                Role::Source if v.outgoing.is_none() =>
                    bail!("source {} has no outgoing edge; did you forget to connect()?", v.name),
                Role::Sink if v.incoming.is_none() =>
                    bail!("sink {} has no incoming edge; did you forget to connect()?", v.name),
                _ => {}
            }
        }

        // Pass 2: Kahn's algorithm for topo sort + cycle detection.
        let n = self.vertices.len();
        let mut in_degree = vec![0usize; n];
        for e in &self.edges { in_degree[e.to_idx] += 1; }

        let mut ready: VecDeque<usize> = (0..n).filter(|&i| in_degree[i] == 0).collect();
        let mut order: Vec<usize> = Vec::with_capacity(n);

        while let Some(idx) = ready.pop_front() {
            order.push(idx);
            for e in &self.edges {
                if e.from_idx == idx {
                    in_degree[e.to_idx] -= 1;
                    if in_degree[e.to_idx] == 0 { ready.push_back(e.to_idx); }
                }
            }
        }

        if order.len() != n {
            // Vertices missing from the topo order sit on a cycle or are fed only by one.
            let cycle_members: Vec<&str> = (0..n)
                .filter(|i| !order.contains(i))
                .map(|i| self.vertices[i].name.as_str())
                .collect();
            bail!("pipeline graph has a cycle involving operators: {:?}", cycle_members);
        }

        // Pass 3: allocate channels (sender + receiver pair per edge).
        let mut chan_tx: HashMap<EdgeId, mpsc::Sender<Observation>> = HashMap::new();
        let mut chan_rx: HashMap<EdgeId, mpsc::Receiver<Observation>> = HashMap::new();
        for e in &self.edges {
            let (tx, rx) = mpsc::channel(e.capacity);
            chan_tx.insert(e.id, tx);
            chan_rx.insert(e.id, rx);
        }

        // Pass 4: walk in topo order, spawning each operator with its wired ends.
        // build() consumed self, so each VertexSpec can be taken by value and
        // its FnOnce factory moved out exactly once.
        let mut vertices: Vec<Option<VertexSpec>> =
            self.vertices.into_iter().map(Some).collect();
        let mut tasks: Vec<Task> = Vec::with_capacity(n);
        for idx in order {
            let v = vertices[idx].take().expect("topo order visits each vertex once");
            let rx = v.incoming.and_then(|e| chan_rx.remove(&e));
            let tx = v.outgoing.and_then(|e| chan_tx.remove(&e));
            // spawn_via_factory (the Lesson 1 helper) invokes the factory and
            // wraps the resulting future in a supervised Task.
            tasks.push(spawn_via_factory(&v.name, v.role, v.factory, WiredEnds { rx, tx }));
        }

        Ok(BuiltGraph { tasks })
    }
}

Three things to notice. The validation in pass 1 rejects misconfigured graphs before any channels are allocated, which gives the engineer the cleanest possible error message — naming the operator that is missing its edge — rather than a downstream "channel is empty forever" symptom at runtime. The Kahn's-algorithm pass in pass 2 doubles as both topological sort and cycle detection: any vertex left with nonzero in-degree at the end is on a cycle (or fed only by one), and the unreached vertices name the cycle's neighborhood in the error message. The channel allocation in pass 3 is a single pass that creates every channel before any operator spawns; pass 4 then walks the topological order and hands each operator its channel ends, removing them from the maps as it goes. The remove rather than get is intentional: each receiver is owned by exactly one operator, and the map's emptiness at the end is itself a sanity check.

Cycle Detection in Practice

What the cycle error looks like to the engineer who wrote the offending pipeline. The example below intends radar → normalize → dedup → audit but transposes two connect() calls, wiring normalize and dedup into a loop (normalize → dedup → normalize) while the source runs straight to the sink. Note that connect()'s single-edge rule does not catch this — every vertex still has at most one edge per direction — so the bug survives until build().

fn build_buggy_pipeline() -> Result<BuiltGraph> {
    let mut g = OperatorGraph::new();
    let radar     = g.add_source("radar", radar_factory());
    let normalize = g.add_operator("normalize", normalize_factory());
    let dedup     = g.add_operator("dedup", dedup_factory());
    let sink      = g.add_sink("audit", audit_factory());

    g.connect(radar,     sink,        64)?; // meant to be (radar, normalize)
    g.connect(normalize, dedup,     1024)?;
    g.connect(dedup,     normalize, 1024)?; // meant to be (dedup, sink): accidental back-edge

    g.build() // returns Err with the cycle named
}

// The error returned by g.build():
// Error: pipeline graph has a cycle involving operators: ["normalize", "dedup"]

The diagnostic is not perfect — Kahn's algorithm reports the set of vertices in cycles rather than the specific edges that form them — but it is enough to direct the engineer to the right neighborhood. Production graph types augment this with a second pass that finds the strongly connected components and reports each cycle's edge sequence; the SDA orchestrator's diagnostic is sufficient for the topology sizes we expect (tens of operators per pipeline) and we leave the SCC enhancement for when it is needed. The point of the example is that a misconfigured pipeline fails at build time with an actionable message, not at runtime with a deadlocked operator.
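
The diagnostic is also cheap to pin down in a unit test. A sketch mirroring the capstone's "DAG cycle test" acceptance criterion (build() fails in pass 2, before any task spawns, so no async runtime is needed):

#[test]
fn build_rejects_cycles_and_names_the_members() {
    let err = match build_buggy_pipeline() {
        Ok(_) => panic!("cycle must be rejected at build time"),
        Err(e) => e,
    };
    let msg = err.to_string();
    assert!(msg.contains("cycle"), "unexpected error: {msg}");
    assert!(msg.contains("normalize") && msg.contains("dedup"), "unexpected error: {msg}");
}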


Key Takeaways

  • The pipeline is a directed acyclic graph of operators connected by typed bounded channels. Vertices are operators with a name, role (source / operator / sink), and spawn factory; edges are channels with a documented capacity. The acyclic constraint is a deadlock-prevention requirement, not a stylistic choice.
  • OperatorGraph::build() runs four passes: per-role edge validation, Kahn's topological sort with cycle detection, channel allocation in a single pass, then operator spawning in topological order. Each pass produces an actionable error message at the earliest possible point.
  • tokio::task::JoinSet<Result<()>> is the right primitive for owning N operator handles. It supports join_next for whichever-finishes-first reaction and abort_all for clean shutdown. Standardize every operator on Result<()> and the JoinSet is homogeneously typed.
  • Using bounded channels everywhere preserves the end-to-end backpressure property the rest of the track depends on. The graph builder makes "no unbounded_channel" a structural invariant, not a code-review rule.
  • Build-time validation catches disconnection bugs (unwired operators) and cycle bugs (back-edges). Both produce readable startup-time errors rather than runtime symptoms that require instrumentation to diagnose.

Lesson 3 — Retries and Idempotency

Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 3 of 4
Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery — Send Acknowledgments, Configuring Producer Retries, Idempotent Producer); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (failure handling in stream processing)


Context

Network calls in a streaming pipeline fail. The optical-archive REST endpoint goes down for thirty seconds during a partner deploy. The Kafka broker the alert sink is producing into has a leader election. The conjunction-emitter HTTP subscriber returns 503 because its database is being patched. None of these are "the pipeline is broken" — they are normal transient conditions in a system whose dependencies have their own operational lifecycle. The pipeline must keep running through them, which means it must retry.

Naive retries are themselves a source of incidents. A pipeline that retries every failure with no backoff turns a one-second downstream blip into a thundering-herd amplification of a hundred operator instances all reconnecting at once, each triggering more downstream work, each failing again. A pipeline that retries permanent errors (a 4xx, a deserialization failure, a schema-mismatch exception) loops forever on a poison pill. A pipeline that retries idempotent operations gets the right answer; one that retries non-idempotent operations produces duplicates whose downstream cost is sometimes invisible (a duplicate row in an analytics table) and sometimes catastrophic (a duplicate "fire thrusters" command to a satellite).

This lesson assembles three pieces of discipline. What to retry — distinguishing transient from permanent errors so retries help rather than hurt. How to retry — exponential backoff with jitter, so a hundred instances retrying after the same failure do not all retry at the same instant. How to make retries safe — idempotency, the property that lets at-least-once delivery (the default Kafka producer guarantee) compose into effective exactly-once processing (covered fully in Module 5; previewed here as the tooling we install now). The mission framing is concrete: an SDA-2026-NNNN incident two months ago saw a junior engineer add naive retries to the optical poller, which during a 30-second archive outage hammered the recovering archive with five thousand reconnect attempts per second and extended the outage to ninety minutes. The architectural fix from that postmortem is what this lesson encodes.


Core Concepts

Transient vs Permanent Errors

The first decision in a retry policy is whether a given error is a thing retrying might fix. Transient errors are conditions that are likely to resolve on their own within a useful timescale: timeouts, 5xx responses, connection-refused, broker-not-available, leader-election-in-progress. Permanent errors are conditions that will not resolve no matter how many times you ask: 4xx responses (the request is malformed), deserialization failures (the bytes are not what you expected), schema-mismatch errors (the producer and consumer disagree about the type), validation failures (the observation references a satellite that has been decommissioned). Discardable errors are a third category — invariant violations that should drop the event entirely without retry or DLQ, like a malformed UUID in a field that is required to be a UUID and is generated at the source.

The classification logic lives in the operator, not in the framework. A general retry wrapper that retries every error is the wrong shape. The right shape is a wrapper that acts on a RetryDisposition (Retry, Permanent, or Discard) produced by the operator's own classification. Lesson 4 introduces the dead-letter queue as the destination for Permanent errors (not retried; not discarded; routed somewhere the engineer can examine them), so for now we focus on the retry path itself.

A useful default for unknown errors is Permanent. If you do not know whether retrying helps, assume it does not — better to surface the unknown error to the operational dashboard than to loop on a poison pill. Engineers add explicit Retry disposition for errors they have classified; everything else falls through to Permanent and gets attention.

Exponential Backoff with Jitter

When a transient error occurs, retrying instantly is rarely right. The downstream that is failing is usually doing so because it is overloaded, recovering from a fault, or being deployed; an instant retry adds load to a system that needs the opposite. The basic shape of a sensible retry policy is exponential backoff: wait some initial delay, double it on each retry, cap at some maximum. The exponential growth means an outage of any duration eventually backs off to a low retry rate; the cap prevents runaway delays past the point where recovery is plausible.

Exponential backoff alone has a subtle problem at scale. A hundred operator instances all hit the same downstream failure at the same instant. They all back off the same delay. They all retry at the same instant. The downstream is still recovering, fails again, and the cycle repeats. The retry traffic looks like a square wave. The fix is jitter: a random component added to (or replacing) the deterministic delay. With jitter, the hundred instances retry at randomly distributed times within a window, smoothing the load.

The two jitter formulas worth knowing are full jitter and decorrelated jitter. Full jitter: delay = random_uniform(0, backoff_cap) — discard the deterministic backoff entirely; the cap is the only thing that matters. Decorrelated jitter: delay = min(cap, random_uniform(initial, prev_delay * 3)) — the previous attempt's delay anchors the upper bound of the sample, and the cap bounds the result. Decorrelated jitter is the AWS Architecture Blog's recommendation and the one we use here. Its delays skew longer than full jitter's (the median delay is higher), which puts less retry pressure on a slow-recovering downstream.
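
A side-by-side sketch of the two formulas, for illustration only (the production wrapper later in this lesson implements the decorrelated variant):

use rand::Rng;
use std::time::Duration;

// Full jitter: the deterministic backoff is discarded entirely;
// only the cap matters.
fn full_jitter(rng: &mut impl Rng, cap: Duration) -> Duration {
    Duration::from_millis(rng.gen_range(0..=cap.as_millis() as u64))
}

// Decorrelated jitter: the previous delay anchors the upper bound
// of the sample; the cap bounds the final result.
fn decorrelated_jitter(
    rng: &mut impl Rng,
    initial: Duration,
    cap: Duration,
    prev: Duration,
) -> Duration {
    let lo = initial.as_millis() as u64;
    let hi = (prev.as_millis() as u64).saturating_mul(3).max(lo);
    Duration::from_millis(rng.gen_range(lo..=hi)).min(cap)
}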

Idempotency Keys

Retries can produce duplicates. A producer that retries after a partial failure may have actually succeeded on the original attempt (the success ack was simply lost); the retry is then a duplicate. A consumer that processes-then-commits may crash between the process and the commit; on restart it reads the message again and processes it twice. At-least-once delivery — the natural guarantee of any retry-capable system — admits duplicates by definition. To compose it into something stronger, we need the operations downstream of delivery to be idempotent: applying them twice with the same input produces the same result as applying them once.

Idempotency is a property of the operation, not the framework. Setting a record's value to a specific number is idempotent; incrementing a counter is not. Inserting a row keyed on observation_id with ON CONFLICT DO NOTHING is idempotent; a plain INSERT is not. A POST request with an Idempotency-Key header that the server respects is idempotent; the same POST without the header is not. The pipeline's operators must each be designed with the question "what does this operation do if I call it twice with the same input?" answered.
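
A sketch of the database case, assuming tokio_postgres, an observations table invented for this example, and the crate's UUID parameter-binding feature:

use tokio_postgres::Client;

// An INSERT that is safe to retry: a duplicate observation_id makes the
// statement a no-op instead of a second row.
async fn insert_observation(client: &Client, obs: &Observation) -> anyhow::Result<u64> {
    let rows = client
        .execute(
            "INSERT INTO observations (observation_id, payload) \
             VALUES ($1, $2) ON CONFLICT (observation_id) DO NOTHING",
            &[&obs.observation_id, &obs.payload],
        )
        .await?;
    Ok(rows) // 1 on first write, 0 on a deduplicated retry
}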

The natural idempotency key for SDA is the envelope's observation_id — a UUID generated at the source, carried through every stage, present on every observation. For derived events (a ConjunctionRisk produced by the correlator from two observations), the natural key is a hash of the inputs' observation_ids plus the window ID; deterministic, content-derived, identical across retries of the same input set. The tooling installed here will be reused in Module 5 when we discuss exactly-once delivery in depth.
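
One way to build such a key, sketched with the uuid crate's name-based (v5) UUIDs; the namespace constant is this example's own invention, and any fixed UUID that every producer instance agrees on would do:

use uuid::Uuid;

// Fixed namespace for derived-event keys (requires the uuid crate's "v5"
// feature). This specific constant is invented for the sketch.
const DERIVED_NS: Uuid = Uuid::from_u128(0x5da5_f00d_0000_0000_0000_0000_0000_0001);

fn derived_event_key(mut input_ids: Vec<Uuid>, window_id: u64) -> Uuid {
    // Sort for order-independence: the same input set yields the same key
    // regardless of arrival order, across retries and across instances.
    input_ids.sort();
    let mut name = Vec::with_capacity(input_ids.len() * 16 + 8);
    for id in &input_ids {
        name.extend_from_slice(id.as_bytes());
    }
    name.extend_from_slice(&window_id.to_be_bytes());
    Uuid::new_v5(&DERIVED_NS, &name)
}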

Where to Carry the Key

The envelope carries the key. The downstream sink uses the key for dedup. The middle operators do not need the key for their own correctness (a stateless map operator is idempotent regardless), but they must propagate it to the downstream. The key must not change as the envelope passes through stages — a stage that "enriches" the observation by attaching a catalog entry must not regenerate observation_id; it must preserve it. This is the rule that makes the rest of the system composable: the producer-side at-least-once guarantee plus the sink-side dedup, with the same key visible at both ends, gives the pipeline effective exactly-once.

External system boundaries also need the key. An HTTP request to a downstream service includes the observation_id as the Idempotency-Key header; the downstream service uses it to dedup retries server-side. A Kafka producer with enable.idempotence=true (the underlying mechanism is producer ID + sequence number, but the conceptual model is the same) ensures the broker drops duplicate messages. A database write uses INSERT ... ON CONFLICT (observation_id) DO NOTHING. The pattern is the same in every case: the key crosses the boundary, the downstream uses it.
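
The HTTP boundary, sketched with reqwest; alert_url and ConjunctionAlert (assumed to implement Serialize) are stand-ins, and the downstream must actually honor the Idempotency-Key header for the dedup to happen:

use reqwest::Client;

async fn post_alert(client: &Client, alert_url: &str, alert: &ConjunctionAlert) -> anyhow::Result<()> {
    client
        .post(alert_url)
        // The key crosses the boundary; the downstream dedups on it.
        .header("Idempotency-Key", alert.observation_id.to_string())
        .json(alert) // requires reqwest's "json" feature
        .send()
        .await?
        .error_for_status()?; // 4xx/5xx become errors for the retry layer
    Ok(())
}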

At-Least-Once + Idempotent = Effective Exactly-Once

This is the conceptual frame the rest of the track depends on. True exactly-once delivery — the network actually delivering each message exactly once — is impossible without either coordination (two-phase commit, transactional Kafka producers across topics) or strong assumptions about the network. Pragmatic exactly-once is achieved by combining two things that are individually achievable: at-least-once delivery at the transport layer (achievable with retries) and idempotent processing at the application layer (achievable with operation design). The two together produce a system in which every event is processed as if it were delivered exactly once, even though under the covers the transport layer may have delivered some events many times.

This frame is foundational for Module 5, where checkpointing, dead-letter queues, and exactly-once Kafka producers each get full treatment. It is previewed here so that every retry decision in this lesson is made with that downstream landscape in mind: we are not trying to avoid duplicates; we are trying to make sure duplicates are safe.


Code Examples

A Retry Wrapper with Decorrelated-Jitter Backoff

The wrapper takes a closure that performs the operation, plus a policy struct. It loops, dispatching on the operator's RetryDisposition. The backoff is computed per attempt with decorrelated jitter using the previous delay as the anchor.

use anyhow::Result;
use rand::Rng;
use std::time::Duration;
use tokio::time::sleep;

#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub initial: Duration,
    pub cap: Duration,
    pub max_attempts: u32,
}

#[derive(Debug)]
pub enum RetryDisposition<T> {
    /// Operation succeeded; return the value.
    Ok(T),
    /// Transient error; the wrapper should retry per the policy.
    Retry(anyhow::Error),
    /// Permanent error; the wrapper should not retry. Caller decides
    /// whether to DLQ (Lesson 4) or propagate.
    Permanent(anyhow::Error),
    /// Discard with no retry, no DLQ. The event is invariant-violating
    /// in a way that does not warrant operational attention.
    Discard,
}

/// Retry the given operation per the policy. Returns Ok(Some(v)) on
/// success, Ok(None) when the operator discards the event, and Err on
/// permanent failure or attempt-budget exhaustion.
pub async fn with_retry<T, F, Fut>(policy: RetryPolicy, mut op: F) -> Result<Option<T>>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = RetryDisposition<T>>,
{
    let mut attempt: u32 = 0;
    let mut prev_delay = policy.initial;
    loop {
        attempt += 1;
        match op().await {
            RetryDisposition::Ok(v) => return Ok(Some(v)),
            RetryDisposition::Discard => return Ok(None),
            RetryDisposition::Permanent(e) => {
                return Err(e.context(format!("permanent failure on attempt {attempt}")));
            }
            RetryDisposition::Retry(e) if attempt >= policy.max_attempts => {
                return Err(e.context(format!(
                    "exhausted retry budget after {attempt} attempts"
                )));
            }
            RetryDisposition::Retry(_) => {
                // Decorrelated jitter: random_uniform(initial, prev_delay * 3),
                // capped at policy.cap. This produces a per-instance schedule
                // that is uncorrelated with other instances retrying the same
                // downstream — no thundering herd.
                let upper_bound = (prev_delay.as_millis() as u64).saturating_mul(3);
                let upper = upper_bound.max(policy.initial.as_millis() as u64);
                let delay_ms = rand::thread_rng()
                    .gen_range(policy.initial.as_millis() as u64..=upper);
                let delay = Duration::from_millis(delay_ms).min(policy.cap);
                prev_delay = delay;
                sleep(delay).await;
            }
        }
    }
}

Two things worth noting. The wrapper accepts a FnMut() -> Future rather than a single future — this matters because each retry needs to be a fresh operation. Awaiting a future consumes it; there is no way to retry the same future twice. The closure's job is to construct a fresh future on each invocation. The second point: the Discard arm returns Ok(None) rather than Err(_). This distinguishes the "this event was an invariant violation we chose to drop" case from the "this operation failed" case at the type level. The caller can dispatch on the Option and update a discards_total metric without using errors as control flow.

Classifying Errors at the Boundary

The HTTP-source-side operator is the canonical place where transient and permanent errors arrive interleaved. The classification logic looks at the response status code and the error variant; it returns the right RetryDisposition for the wrapper to act on.

use reqwest::{Client, StatusCode};

async fn poll_optical_archive(
    client: &Client,
    endpoint: &str,
    since: chrono::DateTime<chrono::Utc>,
) -> RetryDisposition<Vec<RawObservation>> {
    let resp = match client.get(endpoint).query(&[("since", since.to_rfc3339())]).send().await {
        Ok(r) => r,
        Err(e) if e.is_timeout() || e.is_connect() => {
            return RetryDisposition::Retry(e.into());
        }
        Err(e) => {
            // Other reqwest errors (URL parse, header mismatch) are
            // configuration bugs — permanent, not transient.
            return RetryDisposition::Permanent(e.into());
        }
    };

    match resp.status() {
        s if s.is_success() => match resp.json::<Vec<RawObservation>>().await {
            Ok(v) => RetryDisposition::Ok(v),
            Err(e) => {
                // Body is malformed JSON or has the wrong schema. Retrying
                // does not help; the next response will be the same shape.
                RetryDisposition::Permanent(e.into())
            }
        },
        s if s.is_server_error() => {
            // 500-class: transient. Retry.
            RetryDisposition::Retry(anyhow::anyhow!("optical archive {s}"))
        }
        StatusCode::TOO_MANY_REQUESTS => {
            // 429: explicitly transient, with the server asking us to slow.
            // Real code would honor a Retry-After header here.
            RetryDisposition::Retry(anyhow::anyhow!("optical archive 429"))
        }
        s if s.is_client_error() => {
            // 400-class: permanent. The request is malformed or unauthorized.
            // Retrying produces the same error.
            RetryDisposition::Permanent(anyhow::anyhow!("optical archive {s}"))
        }
        s => RetryDisposition::Permanent(anyhow::anyhow!("optical archive unexpected {s}")),
    }
}

The matching is exhaustive on the categories that matter — transient timeouts and connect failures, transient 5xx and 429, permanent 4xx, permanent everything-else. The Permanent for malformed JSON is a real consideration: if a partner's API change rolls out without coordination, the field they renamed produces a deserialization error on every request, and the pipeline starts pumping DLQ entries. That is the right behavior — alerting on DLQ growth is how we discover the partner's breaking change. Retrying the deserialization in a tight loop instead would mask the partner's bug and produce the same effect as a self-inflicted DDoS.
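
Composing the two pieces looks like this; the policy values are illustrative, not tuned:

use std::time::Duration;

async fn poll_with_policy(
    client: &reqwest::Client,
    endpoint: &str,
    since: chrono::DateTime<chrono::Utc>,
) -> anyhow::Result<Option<Vec<RawObservation>>> {
    let policy = RetryPolicy {
        initial: Duration::from_millis(200),
        cap: Duration::from_secs(30),
        max_attempts: 8,
    };
    // Each closure invocation constructs a fresh future, as with_retry
    // requires; the classification lives inside poll_optical_archive.
    with_retry(policy, || poll_optical_archive(client, endpoint, since)).await
}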

Idempotent Sink-Side Write

The dedup sink keeps a sliding-window set of recently-seen observation IDs and writes downstream only on novel observations. This is the application-layer half of the at-least-once-plus-idempotent composition.

use anyhow::Result;
use std::collections::{BTreeSet, VecDeque};
use std::time::{Duration, SystemTime};
use tokio::sync::mpsc;
use uuid::Uuid;

/// A sink that drops duplicate observations within a rolling window.
/// The window's capacity bounds memory; observations seen outside the
/// window may be re-emitted (acceptable for the SDA pipeline's downstream
/// alert subscriber, which itself is idempotent on alert ID).
pub struct DedupSink {
    seen: BTreeSet<Uuid>,
    /// Insertion order, used for FIFO eviction.
    order: VecDeque<(Uuid, SystemTime)>,
    window: Duration,
    capacity: usize,
    downstream: mpsc::Sender<Observation>,
}

impl DedupSink {
    pub fn new(window: Duration, capacity: usize, downstream: mpsc::Sender<Observation>) -> Self {
        Self { seen: BTreeSet::new(), order: VecDeque::new(), window, capacity, downstream }
    }

    pub async fn write(&mut self, obs: Observation) -> Result<()> {
        let now = SystemTime::now();
        // Evict expired entries first.
        while let Some(&(id, ts)) = self.order.front() {
            if now.duration_since(ts).unwrap_or_default() > self.window
                || self.order.len() >= self.capacity
            {
                self.order.pop_front();
                self.seen.remove(&id);
            } else {
                break;
            }
        }
        if self.seen.contains(&obs.observation_id) {
            // Duplicate; drop silently. (Production: increment a metric.)
            return Ok(());
        }
        self.seen.insert(obs.observation_id);
        self.order.push_back((obs.observation_id, now));
        self.downstream.send(obs).await
            .map_err(|_| anyhow::anyhow!("dedup sink downstream dropped"))
    }
}

The dedup window is bounded both by time and by count — the size cap is the safety valve in case the time-based eviction gets behind during a burst. A real production sink would also persist the seen set across restarts (via a small embedded store) so that a process restart does not produce a duplicate-observation surge while the seen set rebuilds; Module 5's checkpointing lesson supplies that machinery. Until then, a process restart causes one window's worth of potential duplicates downstream — acceptable for SDA's alert subscriber, which has its own idempotency on alert ID, and worth flagging as a deliberate cost. The pattern at every layer is the same: choose a key, choose a bound, dedup against the bound.


Key Takeaways

  • Errors are either transient (retry helps), permanent (retry makes it worse), or discardable (drop without operational attention). The classification is the operator's responsibility, not the framework's. Default unknown errors to permanent; that surfaces them rather than masking them.
  • Exponential backoff with jitter is the right retry shape. Decorrelated jitter (random_uniform(initial, prev_delay * 3), capped) keeps a hundred operator instances from synchronizing their retries during a downstream outage. Naive fixed-delay retries amplify outages into self-inflicted DDoS events.
  • Idempotency is a property of the operation, not of the framework. The pipeline composes at-least-once delivery (achievable with retries) plus idempotent operations (achievable by design) into effective exactly-once processing. This frame is the tooling that Module 5 will fully develop.
  • The idempotency key is the envelope's observation_id for SDA, propagated unchanged through every stage. External boundaries — Kafka producers, HTTP requests, database writes — each have their own way of consuming the key (idempotent producer, Idempotency-Key header, ON CONFLICT DO NOTHING); the pattern is the same at every boundary.
  • The dedup sink is the canonical exactly-once-effective endpoint: a sliding-window set keyed on observation_id, bounded by both time and count, with documented behavior on cold start and known cost on duplicate-burst conditions.

Lesson 4 — Failure Modes

Module: Data Pipelines — M02: Pipeline Orchestration Internals
Position: Lesson 4 of 4
Source: Async Rust — Maxwell Flitton & Caroline Morton, sections on panics in async tasks and structured shutdown; Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 3 ("Plan for Failure," "Build Loosely Coupled Systems")

Source note: The supervisor pattern as presented here draws on the Erlang/OTP design that influenced subsequent supervised-task systems and is well-documented across many sources beyond the cited texts. The Erlang one-for-one and rest-for-one strategies are not directly cited from a single source; they are general knowledge in the field and adapted here to the SDA orchestrator.


Context

Lesson 1 established the task; Lesson 2 wired tasks into a graph; Lesson 3 made each task's network calls survive transient failures. The pipeline now tolerates the failures that retry can address. This lesson handles the three classes of failure that retry cannot. Panics, where an operator hits a panic! or an unwrap() on None and the task is torn down by the runtime. Cascading slowdowns, where one operator's degradation propagates through the topology in ways that look like other operators failing. Resource exhaustion in shared pools, where one misbehaving operator starves the rest of the pipeline of file descriptors, connections, or thread time.

The mission framing is concrete. Two months ago, a release rolled out with a new validate operator that called unwrap() on the lookup result of a thread-local catalog cache. Some keys had been evicted from the cache during a startup race; the unwrap panicked. The pipeline did not crash; the panicking task was simply removed from the runtime, its channels orphaned, its upstream blocked on a full channel, its downstream starved of input. For the next four minutes, conjunction alerts kept flowing on the post-validate stream, built from observations that had skipped the validate stage. Two of those alerts later turned out to be false positives that validate would have filtered. The postmortem traced the problem to a missing supervisor — there was nothing watching the validate task and restarting it.

The discipline this lesson installs is explicit failure-mode management. The supervisor pattern detects an operator's exit, classifies it, and applies a policy (restart, escalate, or fail). Bulkheading separates resources so one operator's failure cannot starve the others. Circuit breakers provide local fail-fast behavior for repeatedly-failing downstreams. Together, these patterns turn the pipeline from "runs until something panics" into "runs through panics with documented recovery semantics." Reis & Housley's framing is direct: planning for failure is what distinguishes a system that operates from a system that runs.


Core Concepts

Panic vs Error in Tokio Tasks

A future returns errors via Result. A future panics when its execution hits a runtime panic — an unwrap on a None, an out-of-bounds index, a division by zero, an explicit panic! macro. Tokio captures the panic at the task boundary and surfaces it through the task's JoinHandle: awaiting the handle returns Err(JoinError) where is_panic() is true. The runtime continues running; only the panicked task is torn down. The application's view of the panic depends entirely on whether anyone is awaiting the handle. If the handle was detached (dropped), the panic is silent — logged at most, and the application has no recovery hook.

The corollary is structural: the orchestrator must own every operator's JoinHandle and join on it. Detached operators that panic disappear with no signal beyond a log line. The Task wrapper from Lesson 1 keeps the handle owned; the supervisor in this lesson awaits each handle through JoinSet::join_next and dispatches on the four-case TaskExit enum (Ok, Errored, Panicked, Aborted). The four cases call for different actions. A panic is a programming bug that should be alerted on, not silently retried — the same panic at the same code path will recur on restart, and an unbounded restart loop on a panic burns CPU without making progress. An error is a runtime condition (a Permanent retry-disposition that propagated past the wrapper) that may or may not warrant restart depending on policy. An abort is a deliberate shutdown signal from the orchestrator. An ok-exit means the operator's input ended cleanly, which is normal at end-of-stream.
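
A minimal sketch of that collapse, assuming the TaskExit variants from Lesson 1; the JoinError accessors (is_panic, into_panic) are tokio's own:

use tokio::task::JoinHandle;

async fn join_to_exit(handle: JoinHandle<anyhow::Result<()>>) -> TaskExit {
    match handle.await {
        Ok(Ok(())) => TaskExit::Ok,
        Ok(Err(e)) => TaskExit::Errored(e),
        Err(join_err) if join_err.is_panic() => {
            // Recover the panic payload as a string where possible.
            let payload = join_err.into_panic();
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            TaskExit::Panicked(msg)
        }
        // Not a panic: the task was cancelled via abort().
        Err(_) => TaskExit::Aborted,
    }
}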

The default Tokio behavior on a panic in a task is "ignore" — the runtime captures the panic, marks the handle as failed, and continues. Tokio does have an unstable runtime option (unhandled_panic, behind the tokio_unstable cfg) to shut the runtime down when a task panics unobserved; we do not enable it. A pipeline that dies on any operator panic is more brittle than one that supervises its operators. The supervisor decides the response; the runtime's job is to surface the panic, not to act on it.

The Supervisor Pattern

A supervisor is a parent task that owns the JoinHandles of its children, watches for any of them to exit, and applies a policy on exit. The pattern is the same one that made Erlang's OTP framework famous: structured concurrency with explicit failure dispatch. The SDA orchestrator's supervisor is a loop over JoinSet::join_next with a match on the resulting TaskExit.

The policy decides the response. Restart spawns a new instance of the operator (using its registered factory closure), inserting the new Task into the JoinSet. Escalate propagates the failure upward — in our single-supervisor design, this means tearing down the entire pipeline. Ignore lets the operator stay dead while the pipeline continues with the missing stage; it is rarely the right answer for SDA but useful for purely observational sidecars (a metrics exporter, a debug logger). The policy is per-operator and registered when the operator is added to the graph.

A naive "always restart" policy is dangerous because it does not bound the restart rate. An operator that panics on its first input panics again on the same input after restart, and again, and again — a tight restart loop that burns CPU and produces a flood of metrics noise without ever making progress. The right shape is a bounded restart budget: at most N restarts within a time window W. Past the budget, the supervisor escalates. Erlang-style supervisors conventionally bound restart intensity the same way; the SDA orchestrator's default is 5 restarts in 60 seconds, configurable per-operator. The policy is RestartPolicy::Bounded { max_restarts, window } from Lesson 1's Task wrapper.

Bulkheading

A bulkhead is a physical partition that prevents flooding from spreading between compartments of a ship; in software, it is a separation of resources so that a failure or resource exhaustion in one part of the system cannot affect other parts. Three levels are useful for the SDA orchestrator.

Channel-level bulkheading is what we already have: every operator pair is connected by its own bounded channel, so a slow downstream applies backpressure only to its direct upstream, not to the entire pipeline. A misbehaving correlator does not block the audit sink; the audit sink reads from a different channel that is unaffected.

Runtime-level bulkheading separates the Tokio runtime workers used by different sets of operators. The default Tokio runtime has a shared worker pool; a CPU-bound operator (placed on spawn_blocking, not spawn) holds a blocking-pool thread while it runs. The blocking pool's default size of 512 is sized for the assumption that blocking work is incidental, not the steady-state workload. For SDA, the orbital propagator runs a steady stream of CPU-bound calls; if it shares the blocking pool with file I/O on the audit sink, the propagator can starve the audit sink of pool slots during a fragmentation-event surge. The fix is to give the propagator its own runtime (a dedicated tokio::runtime::Runtime with a separate blocking pool) and have the propagator operator submit to that runtime instead. This is more complex than channel-level bulkheading and reserved for operators where the resource isolation is needed.

Process-level bulkheading is out of scope for this track but worth naming. Production deployments of the SDA pipeline run different stages in different processes (or even different hosts) so that an OOM in the correlator does not kill the radar source. The separation is bought via Kafka topics between stages, which Module 5 covers in depth.

Cascading Failures

A cascading failure is when one operator's degradation produces failure-like symptoms in unrelated operators that are not themselves degrading. Three common shapes.

Backpressure-induced timeouts: a slow correlator's full channel suspends the upstream normalize, which suspends its upstream radar source's send. The normalize and the source are not failing, but their per-event latency goes up. If a downstream operator (an alert emitter that consumes from after the correlator) has a per-event deadline, the added latency blows through its budget and it times out — and the timeout looks like the alert emitter is broken even though the cause is the correlator's slowness. The correct response is not to retry the alert emitter (which makes the situation worse by adding more requests to a slow downstream) but to address the correlator.

Resource-exhaustion cascades: an operator with a leaked file descriptor hits the process's FD limit. Subsequent operators that try to open files (the audit sink writing to disk; the optical poller opening TCP connections) fail with EMFILE. The fault propagates with no obvious connection to its source.

Retry-storm cascades: an operator's retry policy (Lesson 3) is misconfigured with no jitter. A downstream blip triggers synchronized retries across a fleet of operator instances, which prevent the downstream from recovering, which extends the blip into an outage, which extends the retry storm. The original blip was 200 ms; the cascade is hours.

The architectural response to cascades is to address the cause, not the symptom. Backpressure-induced timeouts are diagnosed by tracing per-stage occupancy and processing latency together; a full channel upstream of the timing-out operator is the smoking gun. Resource exhaustion is diagnosed by tracking FD counts, connection counts, blocking-pool slot counts as Prometheus gauges. Retry storms are prevented by jitter (already done in Lesson 3) and detected by retry-rate metrics. Module 6 builds this observability tooling out fully.

Circuit Breakers

A circuit breaker is a local pattern for protecting a downstream from a fleet of operator instances all hammering it during a failure. The breaker has three states. Closed: the breaker passes calls through normally and tracks the failure rate. Open: the breaker has observed too many failures recently and rejects calls immediately without making the downstream call at all. Half-open: after a cooldown, the breaker lets a single call through to test if the downstream has recovered; on success, it closes; on failure, it opens again.

The breaker complements the retry wrapper from Lesson 3. Retry handles individual call failures; the breaker handles patterns of failure. Together they produce a system that handles brief blips with a few retries, sustained outages with the breaker opening to spare the downstream from amplification, and recovery with the breaker probing to detect the downstream's return without flooding it.

The breaker's tuning is operational. The trip threshold (failures-per-window before opening) is set high enough that brief blips do not trip but low enough that genuine outages do. The cooldown (time spent open before going to half-open) is set to the longest expected recovery time; too short causes thrash, too long delays recovery detection. The half-open success criterion is usually a single call; some implementations use a small batch with a quorum requirement. The SDA orchestrator's default is a 50% failure rate over a 30-second window to trip, a 30-second cooldown, and a single-call probe — sized to match the downstream's typical recovery characteristics and adjustable per-downstream.


Code Examples

A Supervisor with Bounded Restart Budget

The supervisor loops over JoinSet::join_next, dispatching on the TaskExit enum from Lesson 1, applying the per-operator restart policy, and escalating when the budget is exhausted.

use anyhow::Result;
use std::collections::HashMap;
use std::time::Instant;
use tokio::task::JoinSet;

pub struct Supervisor {
    /// In-flight operator tasks by name. JoinSet drives the watch loop.
    set: JoinSet<(String, TaskExit)>,
    /// Per-operator restart history, used for budget enforcement.
    /// Each entry is the timestamps of recent restarts.
    restart_history: HashMap<String, Vec<Instant>>,
    /// Per-operator factories so the supervisor can respawn.
    factories: HashMap<String, OperatorFactory>,
    policies: HashMap<String, RestartPolicy>,
}

pub type OperatorFactory = Box<dyn FnMut() -> OperatorFuture + Send>;

#[derive(Debug)]
pub enum SupervisorEvent {
    /// An operator panicked; not retried. Pipeline should be torn down.
    Panicked { name: String, message: String },
    /// An operator exhausted its restart budget; pipeline should be torn down.
    BudgetExhausted { name: String },
    /// All operators exited cleanly; pipeline shut down normally.
    AllOk,
}

impl Supervisor {
    /// Run the supervisor loop. Returns when all operators have exited
    /// or when one fails in a non-recoverable way.
    pub async fn run(&mut self) -> SupervisorEvent {
        loop {
            match self.set.join_next().await {
                None => return SupervisorEvent::AllOk,
                Some(Ok((name, TaskExit::Ok))) => {
                    // End-of-stream from one operator. Other operators may
                    // still be running; let them finish naturally.
                    tracing::info!(operator = %name, "operator exited cleanly");
                }
                Some(Ok((name, TaskExit::Panicked(msg)))) => {
                    // Panic is a programming bug. Do not restart; escalate.
                    return SupervisorEvent::Panicked { name, message: msg };
                }
                Some(Ok((name, TaskExit::Errored(e)))) => {
                    // An operator returned Err. Apply its restart policy.
                    let policy = self.policies.get(&name).copied()
                        .unwrap_or(RestartPolicy::Never);
                    if !self.try_restart(&name, policy) {
                        return SupervisorEvent::BudgetExhausted { name };
                    }
                }
                Some(Ok((name, TaskExit::Aborted))) => {
                    // Aborted by the orchestrator. Not a failure.
                    tracing::info!(operator = %name, "operator aborted by orchestrator");
                }
                Some(Err(join_err)) if join_err.is_panic() => {
                    // A panic in a respawned task lands here as a raw
                    // JoinError: respawn() does not re-wrap the future in
                    // Task::join. Real code maps JoinError::id back to the
                    // operator name; we escalate with what we have.
                    return SupervisorEvent::Panicked {
                        name: String::from("<respawned operator>"),
                        message: join_err.to_string(),
                    };
                }
                Some(Err(join_err)) => {
                    tracing::error!(?join_err, "join_next produced unexpected JoinError");
                }
            }
        }
    }

    fn try_restart(&mut self, name: &str, policy: RestartPolicy) -> bool {
        match policy {
            RestartPolicy::Never => false,
            RestartPolicy::Always => {
                self.respawn(name);
                true
            }
            RestartPolicy::Bounded { max_restarts, window } => {
                let history = self.restart_history.entry(name.to_string()).or_default();
                let cutoff = Instant::now() - window;
                history.retain(|ts| *ts >= cutoff);
                if history.len() >= max_restarts as usize {
                    tracing::error!(operator = %name, "restart budget exhausted");
                    return false;
                }
                history.push(Instant::now());
                self.respawn(name);
                true
            }
        }
    }

    fn respawn(&mut self, name: &str) {
        if let Some(factory) = self.factories.get_mut(name) {
            let future = factory();
            let name_owned = name.to_string();
            self.set.spawn(async move {
                let result = future.await;
                let exit = match result {
                    Ok(()) => TaskExit::Ok,
                    Err(e) => TaskExit::Errored(e),
                };
                (name_owned, exit)
            });
            tracing::info!(operator = %name, "operator restarted");
        }
    }
}

The supervisor is the type that the rest of the orchestrator hangs on. Its key property is structural: every spawned task's exit funnels through join_next, so every panic, error, abort, and clean exit is observed and dispatched on. There are no detached tasks in the system the supervisor knows about; if one is added by accident (a tokio::spawn somewhere outside the supervisor's view), it is a structural bug. The restart budget enforcement is straightforward: keep a sliding window of restart timestamps per operator, evict expired entries on each restart attempt, escalate when the budget is exhausted. The escalation just exits the supervisor's run loop with BudgetExhausted; the caller is the orchestrator's top-level entrypoint, which decides whether to abort the rest of the pipeline or whether to retry the supervisor itself with a longer cooldown — usually the former.
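
A sketch of the top-level dispatch, with the surrounding entrypoint assumed:

async fn run_pipeline(mut supervisor: Supervisor) -> anyhow::Result<()> {
    match supervisor.run().await {
        SupervisorEvent::AllOk => {
            tracing::info!("pipeline drained; clean shutdown");
            Ok(())
        }
        SupervisorEvent::Panicked { name, message } => {
            // Programming bug: tear everything down and surface loudly.
            anyhow::bail!("operator {name} panicked: {message}")
        }
        SupervisorEvent::BudgetExhausted { name } => {
            anyhow::bail!("operator {name} exhausted its restart budget")
        }
    }
}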

A Circuit Breaker for the Optical Archive

The breaker wraps calls to the flaky downstream. It tracks recent failures and trips when the failure rate exceeds a threshold; while open, calls return Err immediately without making the downstream call. After the cooldown, it lets a single probe through.

use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy)]
enum BreakerState {
    Closed,
    Open { opened_at: Instant },
    HalfOpen,
}

pub struct CircuitBreaker {
    state: Mutex<BreakerState>,
    /// Window of (timestamp, was_failure) tuples for failure-rate calc.
    history: Mutex<Vec<(Instant, bool)>>,
    threshold: f32,    // e.g., 0.5 = trip on 50% failures
    window: Duration,
    cooldown: Duration,
}

impl CircuitBreaker {
    pub fn new(threshold: f32, window: Duration, cooldown: Duration) -> Self {
        Self {
            state: Mutex::new(BreakerState::Closed),
            history: Mutex::new(Vec::new()),
            threshold,
            window,
            cooldown,
        }
    }

    /// Returns true if the call should be allowed through. False means
    /// the breaker is open and the caller should fail-fast without
    /// touching the downstream.
    pub fn allow(&self) -> bool {
        let mut state = self.state.lock().unwrap();
        match *state {
            BreakerState::Closed => true,
            BreakerState::HalfOpen => {
                // Already testing; do not allow concurrent probes.
                false
            }
            BreakerState::Open { opened_at } => {
                if opened_at.elapsed() >= self.cooldown {
                    *state = BreakerState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    pub fn record_outcome(&self, was_failure: bool) {
        let now = Instant::now();
        let mut history = self.history.lock().unwrap();
        history.push((now, was_failure));
        // Drop entries outside the window.
        let cutoff = now - self.window;
        history.retain(|(ts, _)| *ts >= cutoff);
        let total = history.len();
        let failures = history.iter().filter(|(_, f)| *f).count();
        let rate = failures as f32 / total.max(1) as f32;
        let mut state = self.state.lock().unwrap();
        match *state {
            BreakerState::HalfOpen => {
                if was_failure {
                    *state = BreakerState::Open { opened_at: now };
                } else {
                    *state = BreakerState::Closed;
                    history.clear();
                }
            }
            BreakerState::Closed => {
                if total >= 5 && rate >= self.threshold {
                    *state = BreakerState::Open { opened_at: now };
                }
            }
            BreakerState::Open { .. } => {
                // Outcome from a stale in-flight call; ignore.
            }
        }
    }
}

The breaker integrates with the retry wrapper from Lesson 3 by gating the retry attempt: if breaker.allow() returns false, the retry attempt is short-circuited and the wrapper continues to the next backoff. The combination produces the layered response the lesson promises: individual transient failures are retried; sustained failure patterns trip the breaker, sparing the downstream from a fleet of synchronized retries; recovery is detected by the half-open probe without flooding the downstream. The minimum-call threshold (total >= 5) prevents the breaker from tripping on a single early failure during operator startup, when the failure rate is mathematically high but the sample is too small to be meaningful.
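
A sketch of that gating, reusing the Lesson 3 wrapper; recording only transient failures against the breaker is this example's policy choice:

use std::time::Duration;

async fn poll_with_breaker(
    breaker: &CircuitBreaker,
    client: &reqwest::Client,
    endpoint: &str,
    since: chrono::DateTime<chrono::Utc>,
) -> anyhow::Result<Option<Vec<RawObservation>>> {
    let policy = RetryPolicy {
        initial: Duration::from_millis(200),
        cap: Duration::from_secs(30),
        max_attempts: 8,
    };
    with_retry(policy, || async move {
        if !breaker.allow() {
            // Breaker open: skip the downstream call entirely and let the
            // wrapper back off to the next attempt.
            return RetryDisposition::Retry(anyhow::anyhow!("circuit breaker open"));
        }
        let disposition = poll_optical_archive(client, endpoint, since).await;
        // Only transient failures count against the downstream's health;
        // permanent errors are our bug, not its outage.
        breaker.record_outcome(matches!(&disposition, RetryDisposition::Retry(_)));
        disposition
    })
    .await
}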

Bulkheading the CPU-Bound Propagator

A dedicated runtime for the propagator isolates it from the main async runtime's blocking pool. The propagator submits to its own pool; the rest of the pipeline submits to the default. A starvation in one does not affect the other.

use std::sync::Arc;
use tokio::runtime::Runtime;

pub struct PropagatorPool {
    runtime: Arc<Runtime>,
}

impl PropagatorPool {
    /// Build a dedicated runtime for CPU-bound propagation. The pool size
    /// is documented per-deployment based on expected propagation rate.
    pub fn new(blocking_threads: usize) -> Self {
        let runtime = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(2)         // small async pool for handle wiring
            .max_blocking_threads(blocking_threads)
            .thread_name("propagator")
            .enable_all()
            .build()
            .expect("propagator runtime build");
        Self { runtime: Arc::new(runtime) }
    }

    /// Submit propagation work. The closure runs on the propagator's
    /// blocking pool, completely isolated from the main runtime's pool.
    pub async fn propagate(&self, obs: Observation) -> Result<PropagatedObservation> {
        let runtime = self.runtime.clone();
        let handle = runtime.spawn_blocking(move || {
            orbital::propagate(obs.target, obs.sensor_timestamp)
        });
        let propagated = handle.await??;
        Ok(PropagatedObservation { obs, propagated })
    }
}

The cost is that the propagator now lives behind an Arc<Runtime> rather than directly in the main runtime, and the pipeline graph has to pass the PropagatorPool to operators that need it. The benefit is operational isolation: a propagator surge that blocks every blocking thread it has does not touch the audit sink's blocking-pool needs. Channel-level bulkheading would not have addressed this; the audit sink and propagator share a different resource (blocking pool slots) than the channels that connect them, and the channel between them is irrelevant to the starvation. This is the case the lesson called out for runtime-level bulkheading.


Key Takeaways

  • Panics surface through JoinHandle as Err(JoinError) with is_panic() true. The orchestrator must own every operator's handle; detached tasks that panic disappear silently. The Task wrapper from Lesson 1 plus JoinSet::join_next is the structural mechanism.
  • The supervisor pattern dispatches on TaskExit. Restart on errors per the operator's RestartPolicy; escalate on panics (do not retry programming bugs); ignore aborts (deliberate shutdown). Always use a bounded restart budget — this module's default is 5 restarts in 60 seconds — and escalate on budget exhaustion.
  • Channel-level bulkheading is already provided by per-operator bounded channels. Runtime-level bulkheading separates blocking pools for resource-isolated operators (the propagator with its own Runtime). Process-level bulkheading is for Module 5 and beyond.
  • Cascading failures are diagnosed by recognizing that the failing-looking operator is downstream of the actual cause. Address the cause, not the symptom — retrying a timed-out alert emitter does not fix a slow correlator. Module 6's observability tooling makes the cause visible.
  • Circuit breakers complement retries: retries handle individual failures, breakers handle failure patterns. Closed → Open on threshold breach → Half-Open after cooldown → probe → Closed or back to Open. The combination prevents synchronized-retry amplification of downstream outages.

Capstone Project — Fusion Pipeline Orchestrator

Module: Data Pipelines — M02: Pipeline Orchestration Internals
Estimated effort: 1–2 weeks of focused work
Prerequisites: All four lessons in this module passed at ≥70%


Mission Brief

OPS DIRECTIVE — SDA-2026-0119 / Phase 2 Implementation
Classification: ORCHESTRATION TIER STAND-UP

The Phase 1 ingestion service from Module 1 (sda-ingest) is in production and stable, but the next quarter's roadmap adds five new sensor sources, a windowed dedup stage, a cross-sensor correlator, an alert emitter, and an audit sink. The current hand-spawned topology in main.rs is at the practical limit of what one engineer can hold in their head. The Phase 1 postmortem also flagged two operational gaps: a panicking task is silently torn down with no recovery hook, and the retry policy on the optical poller is unjittered fixed-delay (the 90-minute outage extension last quarter is the canonical incident).

Phase 2 builds the orchestrator that addresses both gaps. The deliverable is a Rust library that accepts a declarative DAG of operators, spawns them with their channels correctly wired, supervises their lifecycles, and applies retry policy with jitter. The Phase 1 binary is refactored to use the orchestrator with no behavioral regression beyond the documented failure-handling improvements.

Success criteria for Phase 2: the orchestrator handles every Phase 1 source plus a synthetic 5-source future-load profile; a panic in any operator surfaces to operations rather than disappearing; transient downstream failures recover via backoff without thundering-herd behavior. Failure-isolated subsystems (the new orbital propagator) run on a dedicated runtime to avoid blocking-pool contention with the rest of the pipeline.


What You're Building

A Rust library crate, sda-orchestrator, that exposes:

  1. An OperatorGraph builder with add_source, add_operator, add_sink, and connect methods (Lesson 2)
  2. A BuiltGraph::run(supervisor_policy) -> Future that spawns the topology with the supervisor (Lesson 4) wired in
  3. The Task wrapper (Lesson 1), RetryPolicy and with_retry (Lesson 3), and CircuitBreaker (Lesson 4) types as public API
  4. Cycle detection and per-role edge validation in OperatorGraph::build() with named-operator error messages
  5. Bounded-restart-budget supervision with structured logging on every supervisor decision

Plus the refactor of the Phase 1 sda-ingest binary:

  1. The Module 1 binary is refactored to declare its topology as an OperatorGraph rather than hand-spawn it; behavior is preserved end-to-end
  2. The optical-archive HTTP poller wraps its requests in with_retry using decorrelated-jitter backoff
  3. The (new for this module) orbital propagator runs on a dedicated tokio::runtime::Runtime for blocking-pool isolation

The deliverable is the library, the refactored binary, the test suite (including a deterministic supervisor-restart test using tokio::time::pause), and a 1-page operational README documenting the orchestrator's API and its failure-handling guarantees.


Suggested Architecture

                                    OperatorGraph (declarative)
                                              │
                                              │ build()
                                              ▼
   ┌───────────────────────────────────────────────────────────────┐
   │  BuiltGraph: per-edge channels allocated, operators spawned   │
   │             in topological order with their wired ends.       │
   └───────────────────────────────┬───────────────────────────────┘
                                   │ run(policy)
                                   ▼
                          ┌───────────────────┐
                          │    Supervisor     │
                          │  (JoinSet loop)   │
                          └─────────┬─────────┘
                                    │
       ┌────────────────────────────┴────────────────────────────┐
       │                                                         │
       ▼                                                         ▼
   ┌───────┐  ┌────────────┐  ┌──────────┐  ┌──────────────┐  ┌──────┐
   │ radar │→│  normalize │→ │  dedup   │→│   correlator  │→│ sink │
   │  src  │  └────────────┘  └──────────┘  │ (Propagator-  │  │      │
   └───────┘     ↑   ↑                       │  Pool runtime)│  └──────┘
   ┌───────┐     │   │                       └──────────────┘
   │optical│─────┘   │
   │  src  │         │           Each edge: bounded mpsc::channel
   └───────┘         │           Each operator: Task wrapped, supervised
   ┌───────┐         │           Each network call: with_retry + breaker
   │  ISL  │─────────┘
   │  src  │
   └───────┘

The orchestrator does not know about Observation specifically; it operates on type-erased operator factories that consume and produce arbitrary types. The library is generic in the sense that the application (the SDA binary) wires up its own operator types and topology. Resist the temptation to bake SDA-specific assumptions into the orchestrator crate; it is meant to be reusable across pipelines.


Acceptance Criteria

Functional Requirements

  • OperatorGraph exposes new, add_source, add_operator, add_sink, connect, build matching the Lesson 2 signatures
  • OperatorGraph::build() runs all four passes (per-role validation, Kahn's topo sort with cycle detection, channel allocation, spawn) and produces actionable error messages on validation failures
  • BuiltGraph::run(policy) drives the supervisor loop and returns a SupervisorEvent that distinguishes clean shutdown from panic from budget exhaustion
  • Task::join() collapses Tokio's two-level error reporting into a TaskExit enum (Ok, Errored, Panicked, Aborted)
  • RestartPolicy::{Never, Always, Bounded { max_restarts, window }} is honored by the supervisor
  • with_retry(policy, op) retries RetryDisposition::Retry results with decorrelated-jitter backoff, propagates Permanent immediately, and discards Discard cleanly
  • CircuitBreaker implements Closed → Open → HalfOpen transitions with the threshold, window, and cooldown the lesson described; integration with with_retry is documented in the API
  • The sda-ingest binary is refactored to use OperatorGraph declaratively; the topology fits in a single build_topology() function under 80 lines

Quality Requirements

  • DAG cycle test: a unit test attempts to build a graph with a cycle and asserts the error message names the cycle's vertices
  • Disconnection test: a unit test attempts to build a graph with an unconnected operator and asserts the per-role validation error names the operator and the missing direction
  • Supervisor restart test: a unit test injects an error from a fake operator, asserts the supervisor restarts it within the budget, then asserts budget exhaustion when the error rate exceeds the policy. Use tokio::time::pause() and advance for deterministic timing — no flaky sleep in tests.
  • Supervisor panic test: a unit test injects a panic from a fake operator and asserts the supervisor returns SupervisorEvent::Panicked without restart attempts
  • Decorrelated-jitter math test: a unit test fixes the RNG and asserts the per-attempt delays match the documented schedule
  • No .unwrap() or .expect() in non-startup code paths

Operational Requirements

  • HTTP control plane (extending Module 1's): adds GET /metrics fields for operator_restart_total{operator}, operator_uptime_seconds{operator}, circuit_breaker_state{breaker} (encoded as 0/1/2 for Closed/Open/HalfOpen), and retry_attempts_total{operator}
  • Structured log line on every supervisor decision: spawn, restart, budget-exhausted, escalate, clean-exit. JSON formatter, one event per decision (not one per observation).
  • The operational README updated for Phase 2: documents the orchestrator API, the new metrics, and the new failure-handling semantics. One-page constraint preserved.

Self-Assessed Stretch Goals

  • (self-assessed) The optical source survives a 30-second downstream outage with no operator restarts and no observation drops at the application layer (the kernel may drop UDP frames during the outage; that is acceptable). Demonstrate via integration test using wiremock to simulate the outage.
  • (self-assessed) OperatorGraph::build() for a 10-operator topology completes in under 100 ms (cold) and under 10 ms (warm). Provide a criterion benchmark.
  • (self-assessed) The PropagatorPool (dedicated runtime for orbital propagation) is demonstrated to be isolated from the main runtime: an artificial 100x propagator load does not affect the main runtime's audit-sink P99 latency. Include the load-test harness.

Hints

How should I represent operators in the graph without making it generic over every operator type?

A boxed factory closure is the cleanest path. The graph stores Box<dyn FnOnce(WiredEnds) -> OperatorFuture + Send>, where OperatorFuture = Pin<Box<dyn Future<Output = Result<()>> + Send>>. The closure is called once build() has the channels; it captures whatever the operator needs (config, addresses, references to shared state).

use std::{future::Future, pin::Pin};

type OperatorFuture = Pin<Box<dyn Future<Output = Result<()>> + Send>>;
type OperatorFactory = Box<dyn FnOnce(WiredEnds) -> OperatorFuture + Send>;

let radar_factory: OperatorFactory = Box::new(|ends| {
    Box::pin(async move {
        let radar = UdpRadarSource::bind("0.0.0.0:7001", "radar-01").await?;
        let tx = ends.tx.expect("source has tx");
        run_source_loop(radar, tx).await
    })
});

This keeps the graph type non-generic at the cost of a heap allocation per operator at build time — negligible for the topology sizes the SDA pipeline reaches.

How do I handle the channel-creation order problem in the topo-sort spawn pass?

The two-pass structure from the lesson:

// Pass 3: allocate every channel up-front.
let mut chan_tx: HashMap<EdgeId, mpsc::Sender<Observation>> = HashMap::new();
let mut chan_rx: HashMap<EdgeId, mpsc::Receiver<Observation>> = HashMap::new();
for e in &self.edges {
    let (tx, rx) = mpsc::channel(e.capacity);
    chan_tx.insert(e.id, tx);
    chan_rx.insert(e.id, rx);
}

// Pass 4: walk in topo order, hand each operator its channel ends.
for idx in topo_order {
    let v = &self.vertices[idx];
    let rx = v.incoming.and_then(|e| chan_rx.remove(&e));
    let tx = v.outgoing.and_then(|e| chan_tx.remove(&e));
    let future = (v.factory)(WiredEnds { rx, tx });
    let task = Task::spawn(&v.name, v.restart_policy, future);
    tasks.push(task);
}

The receiver halves are removed (not get'd) because each receiver belongs to exactly one operator and mpsc receivers are not Clone — the remove moves ownership to the operator. The maps' emptiness at the end of pass 4 is itself a structural sanity check.
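A minimal form of that check, assuming EdgeId derives Debug (names as in the sketch above):

// End of pass 4: every channel end must have been moved into an operator.
debug_assert!(chan_rx.is_empty(), "unwired receiver ends: {:?}", chan_rx.keys());
debug_assert!(chan_tx.is_empty(), "unwired sender ends: {:?}", chan_tx.keys());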

How do I test the supervisor's restart logic without flaky timing?

tokio::time::pause() makes time deterministic in tests: sleep and Instant are driven by tokio::time::advance, not by wall clock. The supervisor's window-based budget can be tested by advancing time forward over the window and observing eviction.

#[tokio::test(start_paused = true)]
async fn supervisor_restarts_within_budget() {
    let policy = RestartPolicy::Bounded {
        max_restarts: 3,
        window: Duration::from_secs(60),
    };
    let mut sup = Supervisor::for_test(policy, failing_factory());

    // First failure within budget: should restart.
    sup.simulate_exit(TaskExit::Errored(anyhow!("boom")));
    assert_eq!(sup.restart_count(), 1);

    // Three more failures: restarts 2 and 3, then the fourth failure
    // exceeds the budget.
    for _ in 0..3 {
        sup.simulate_exit(TaskExit::Errored(anyhow!("boom")));
    }
    assert!(matches!(sup.event(), SupervisorEvent::BudgetExhausted { .. }));

    // Advance past the window; the restart history evicts and new
    // failures are restartable again.
    tokio::time::advance(Duration::from_secs(61)).await;
    sup.simulate_exit(TaskExit::Errored(anyhow!("boom")));
    assert_eq!(sup.restart_count(), 4);
}

The Supervisor::for_test constructor and simulate_exit methods are testing-only API that you expose with #[cfg(test)] or behind a testing feature flag. The principle: the supervisor should be testable without spawning actual tasks.
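The TaskExit values the test injects come from Task::join's collapse of Tokio's two-level result (the Lesson 1 API). A sketch of that collapse, assuming the operator future's output type is anyhow::Result<()>:

use tokio::task::JoinHandle;

pub enum TaskExit {
    Ok,
    Errored(anyhow::Error),
    Panicked(Box<dyn std::any::Any + Send + 'static>),
    Aborted,
}

/// Collapse the two levels: the outer JoinError (panic or abort)
/// and the inner operator Result.
pub async fn join_to_exit(handle: JoinHandle<anyhow::Result<()>>) -> TaskExit {
    match handle.await {
        Ok(Ok(())) => TaskExit::Ok,
        Ok(Err(e)) => TaskExit::Errored(e),
        Err(join_err) if join_err.is_panic() => TaskExit::Panicked(join_err.into_panic()),
        Err(_) => TaskExit::Aborted,
    }
}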

How do I refactor the M1 binary without breaking the integration tests?

Two-step refactor. First, build the orchestrator library with the test harness asserting the supervisor and graph behavior. Second, refactor the binary in-place, leaving its existing integration tests unmodified — they should pass against the refactored binary because the observable behavior is unchanged.

A common temptation is to build a parallel sda-ingest-v2 binary alongside the original. Resist this; it produces two binaries to maintain. The right approach is a single binary whose internals change. Keep the original integration test suite running on every commit during the refactor.

What restart policy should I default to per operator?

The defaults that match SDA's operational stance:

  • Sources (radar, optical, ISL): Bounded { max_restarts: 5, window: Duration::from_secs(60) }. Sources are the most external part of the pipeline; they are most likely to encounter transient external failures (a network blip, a partner deploy). Restart-with-budget is the right shape.
  • Stateless operators (normalize, validate): Bounded { max_restarts: 5, window: Duration::from_secs(60) }. Same defaults as sources — they have no state to corrupt and restart is cheap.
  • Stateful operators (dedup, correlator): Bounded { max_restarts: 3, window: Duration::from_secs(60) } — fewer restarts because state loss on each restart costs more, and a higher recurrence rate suggests a deeper problem.
  • Sinks where data integrity is at stake (audit log, alert emitter): Never. A restart of these may produce duplicate emits to downstream subscribers; better to fail the pipeline loudly. Module 5's idempotent-sink machinery will let you change this default later.

Document the choice per-operator in the topology declaration with a one-line comment explaining the reasoning.
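A sketch of what that looks like, assuming the Lesson 2 builder methods take a name, a factory, and a restart policy; the factory names are placeholders for whatever the binary defines:

use std::time::Duration;

let mut graph = OperatorGraph::new();

// Sources: most exposed to transient external failures; restart with budget.
let radar = graph.add_source("radar-src", radar_factory,
    RestartPolicy::Bounded { max_restarts: 5, window: Duration::from_secs(60) });

// Stateless: nothing to corrupt, restart is cheap.
let normalize = graph.add_operator("normalize", normalize_factory,
    RestartPolicy::Bounded { max_restarts: 5, window: Duration::from_secs(60) });

// Stateful: each restart loses window state; recurrence means a deeper bug.
let dedup = graph.add_operator("dedup", dedup_factory,
    RestartPolicy::Bounded { max_restarts: 3, window: Duration::from_secs(60) });

// Integrity-critical sink: a restart may duplicate emits. Fail loudly.
let audit = graph.add_sink("audit-sink", audit_factory, RestartPolicy::Never);

graph.connect(radar, normalize)?;
graph.connect(normalize, dedup)?;
graph.connect(dedup, audit)?;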


Getting Started

Recommended order:

  1. Task wrapper. Define Task::spawn, Task::join, TaskExit, RestartPolicy. Write unit tests that spawn synthetic tasks (success, error, panic) and assert the right TaskExit variant.
  2. OperatorGraph builder. Define the builder API and the per-role validation. Write tests for: well-formed graphs build, dangling edges are rejected, role mismatches are rejected.
  3. Topo sort + cycle detection. Implement Kahn's algorithm in build(). Write tests for: linear chains, fan-in, fan-out, and (failing) cycles.
  4. Channel allocation and spawn. The two-pass structure from the hint. Write an end-to-end test that builds a 3-operator graph and confirms data flows from source to sink.
  5. Supervisor. Wrap the JoinSet loop. Write the deterministic-timing tests using tokio::time::pause.
  6. with_retry and CircuitBreaker. These are independent of the orchestrator; they can be developed and tested in isolation, then wired into the optical source's polling code.
  7. PropagatorPool. The dedicated-runtime wrapper for the orbital propagator (a sketch follows below). The propagator itself is mocked for this project — the real propagator is from Meridian's orbital crate, which is out of scope. A mock_propagate(state, dt) -> state function that does a deterministic-but-CPU-bound computation is sufficient.
  8. Refactor sda-ingest. The topology declaration becomes a single build_topology() function. Keep the existing integration tests passing.
  9. Operational README. Document the orchestrator API and the new metrics. One page, terse.

Aim for a working orchestrator and a passing-tests refactor by day 7 even if the operational polish (control-plane metrics, README) is incomplete. The orchestrator's correctness is what matters; the polish is finishing work.
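For item 7, a minimal sketch of the dedicated-runtime wrapper, assuming a mock_propagate(state, dt) -> state function and a placeholder OrbitalState type:

use tokio::sync::oneshot;

pub struct PropagatorPool {
    rt: tokio::runtime::Runtime,
}

impl PropagatorPool {
    pub fn new(worker_threads: usize) -> std::io::Result<Self> {
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(1)
            .max_blocking_threads(worker_threads)
            .thread_name("propagator")
            .enable_all()
            .build()?;
        Ok(Self { rt })
    }

    /// Run one propagation on the dedicated runtime. The caller awaits
    /// from the main runtime without occupying its blocking pool.
    pub async fn propagate(&self, state: OrbitalState, dt: f64) -> anyhow::Result<OrbitalState> {
        let (tx, rx) = oneshot::channel();
        self.rt.spawn_blocking(move || {
            // CPU-bound work runs on the propagator runtime's blocking pool.
            let _ = tx.send(mock_propagate(state, dt));
        });
        rx.await.map_err(|_| anyhow::anyhow!("propagator task dropped"))
    }
}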


What This Module Sets Up

In Module 3 you will replace this module's processing-time dedup operator with an event-time windowed correlator. The orchestrator interface stays the same; only one operator's implementation changes. The watermark machinery you build there flows along the same channels the orchestrator wired up here.

In Module 4 you will harden the channel boundaries against burst-load failure modes. The bounded-channel-per-edge invariant the orchestrator enforces structurally is what makes that work tractable. You will revisit buffer sizing with rigor.

In Module 5 you will make the windowed operator's state crash-safe via checkpointing. The supervisor's restart machinery you built here is what the checkpoint recovery path hooks into. The Never restart policy on data-integrity-critical sinks gets revisited with idempotent-sink tooling that lets it become Bounded safely.

The orchestrator is not a throwaway. It is the connective tissue every subsequent module's project hangs on.

Module 03 — Event Time and Watermarks

Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 3 of 6
Source material: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Reasoning About Time, Windowing, Knowing When You're Ready, Out-of-Order Events); Streaming Data — Andrew Psaltis, Chapter 4 (Analyzing Streaming Data, Windowing Patterns, Watermarks); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Stream Processing Concepts: Time, Out-of-Sequence Events); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Late-Arriving Data)
Quiz pass threshold: 70% on all four lessons to unlock the project


Mission Context

OPS ALERT — SDA-2026-0142
Classification: CORRELATION TIER UPGRADE
Subject: Replace processing-time dedup with event-time windowed correlation

The Phase 2 orchestrator from Module 2 is in production. Its dedup operator at the top of the correlation tier is processing-time-based — buckets observations by arrival time, not by sensor_timestamp. Internal review of conjunction alerts from the past quarter found a 2.3% rate of cross-sensor mismatches traceable to optical-vs-radar arrival skew straddling the 5-second dedup window. Two of the missed correlations were later determined to be real conjunctions. The fix is structural: replace processing-time dedup with event-time windowed correlation that respects sensor_timestamp regardless of arrival skew, with watermarks driving window close and allowed-lateness handling the long tail of partner-API delays.

This module is the correctness foundation of the rest of the track. The orchestrator built in Module 2 stays unchanged — only one operator's implementation changes (dedup → correlator). What does change is the conceptual machinery: every event-time-aware operator from this point on consumes a watermark stream, maintains windowed state with explicit close triggers, and handles the rare late event explicitly rather than silently dropping it. The patterns this module installs — event-time-on-the-envelope, per-source max-lateness, min-of-inputs watermark propagation, allowed-lateness retention, retract-and-correct downstream emission — are the canonical streaming-system shape that Module 4 (backpressure and flow control), Module 5 (delivery guarantees and fault tolerance), and Module 6 (observability and lineage) all build on.

The mental model the module installs is the four-piece event-time discipline: (1) the envelope carries event time and ingest time as separate first-class fields, (2) operators bucket by event time using one of four canonical window shapes, (3) watermarks are guarantees that drive window close, propagated through fan-ins by the min rule, (4) late events past the watermark are handled explicitly per a per-output strategy. Every event-time pipeline that succeeds in production is some combination of these four; pipelines that skip any of them have correctness bugs that look like flakiness.


Learning Outcomes

After completing this module, you will be able to:

  1. Distinguish event time from ingest time on the observation envelope and choose the right time for any aggregation question
  2. Reason about per-source clock heterogeneity (GPS-locked, NTP-synced, drifting) and carry source-specific quality forward through the pipeline
  3. Choose among the four canonical window shapes (tumbling, hopping, sliding, session) based on the question being asked, not on the perceived complexity of each
  4. Implement per-event sliding windows with bounded memory and correct event-time eviction, plus session windows with the production-safety double-bound
  5. Define watermarks precisely as guarantees, generate them at sources from per-source max-lateness, and propagate them through fan-in operators via the min-of-inputs rule
  6. Handle late events with the appropriate strategy — drop, allowed-lateness, or retract — and recognize which strategy fits which output
  7. Compose the operator-level retract-and-correct pattern with M2's at-least-once-plus-idempotent delivery to produce effective exactly-once at the sink

Lesson Summary

Lesson 1 — Event Time vs Processing Time

The two operationally distinct timestamps every observation carries: sensor_timestamp (when the event happened in the world) and ingest_timestamp (when the pipeline received it). Sensor-clock heterogeneity carried forward as ClockQuality so per-source skew can widen the correlator's matching window. Out-of-order arrival as the rule. Lag (split into source lag and pipeline lag) as the master diagnostic for "is the problem ours or theirs."

Key question: A radar observation arrives 80 ms after its event time; an optical observation of the same event arrives 30 seconds after its event time. Should they be correlated, and which window-assignment strategy gets that right?

Lesson 2 — Windowing

The four canonical window shapes — tumbling, hopping, sliding, session — and the question shape each fits. The conjunction-risk question fits per-event sliding windows. BTreeMap-keyed-on-window-end as the data structure that makes close-up-to-watermark a cheap O(log N) prefix query. Session windows' production-safety double-bound (gap timeout AND max session duration) as the safety valve that prevents unbounded growth.

Key question: The correlator must answer "for each new observation, find others within W of its event_time." Which window shape does that question fit, and what is the cost profile?

Lesson 3 — Watermarks

The watermark as a per-event-time guarantee, not an estimate. Heuristic watermarks computed as max_observed_event_time - max_lateness, with per-source documented bounds (radar 100ms, optical 30s, ISL 10s for SDA). The min-of-inputs rule for propagation through fan-in operators, and the operational consequence: the slowest source dominates the downstream watermark. Watermarks interleaved with data items on the same channel via enum SourceItem { Observation(_), Watermark(_) } to preserve their relationship to the data they bound.

Key question: Three sources at watermarks T-100ms, T-30s, T-10s. What is the downstream watermark, and what is the operational consequence of that answer?

Lesson 4 — Late Data

The three strategies for events that arrive after their watermark: drop (cheap, lossy), accumulate-with-allowed-lateness (medium cost, eventual completeness), retract-and-correct (highest cost, strongest correctness). Two-tier window state (active and retained). Retract-then-insert ordering with sequence numbers for downstream correction. Retract-aware sinks with strict-greater UPSERT semantics that absorb duplicates and out-of-order retransmits.

Key question: A late observation invalidates a previously-emitted conjunction alert. The alert subscriber has already triggered an avoidance maneuver. Which late-data strategy should the correlator use, and why?


Capstone Project — Conjunction Window Engine

Replace the M2 dedup operator with a windowed event-time correlator. Per-source watermarks, min-of-inputs fan-in propagation, per-key sliding windows of 30 seconds with 5 seconds of allowed lateness, retract-then-insert emission on late events, and a sequence-keyed retraction-aware SQLite sink. The replay-correctness test (byte-identical output under random arrival order) is the canary for every windowing bug. Acceptance criteria, suggested architecture, and the full project brief are in project-conjunction-window-engine.md.

The orchestrator from Module 2 is unchanged; only one operator's implementation changes. The patterns established here repeat in every subsequent module's project.


File Index

module-03-event-time-and-watermarks/
├── README.md                                  ← this file
├── lesson-01-event-vs-processing-time.md      ← Event time vs processing time
├── lesson-01-quiz.toml                        ← Quiz (5 questions)
├── lesson-02-windowing.md                     ← Windowing
├── lesson-02-quiz.toml                        ← Quiz (5 questions)
├── lesson-03-watermarks.md                    ← Watermarks
├── lesson-03-quiz.toml                        ← Quiz (5 questions)
├── lesson-04-late-data.md                     ← Late data
├── lesson-04-quiz.toml                        ← Quiz (5 questions)
└── project-conjunction-window-engine.md       ← Capstone project brief

Prerequisites

  • Module 1 (Stream Processing Foundations) and Module 2 (Pipeline Orchestration Internals) completed — the Observation envelope, the OperatorGraph builder, the supervisor pattern, and the at-least-once-plus-idempotent delivery frame are all assumed
  • Foundation Track completed — async Rust, channels, BTreeMap and VecDeque algorithmic intuitions
  • Familiarity with tokio::sync::mpsc, std::time::SystemTime, and serde for the watermark envelope extension
  • Comfort with the tokio::select! pattern from Module 2's cancel-safety lesson

What Comes Next

Module 4 (Backpressure and Flow Control) hardens the channel boundaries against burst load. The bounded-channel-per-edge invariant from M2 plus the watermark machinery from M3 are both inputs to that work — burst load affects watermark advance, and watermark stall affects late-event handling. The flow-policy machinery developed in M4 plugs in upstream of the windowed correlator without changing its windowing logic.

Lesson 1 — Event Time vs Processing Time

Module: Data Pipelines — M03: Event Time and Watermarks
Position: Lesson 1 of 4
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 ("Reasoning About Time" — the three different times in stream processing); Streaming Data — Andrew Psaltis, Chapter 4 (Analyzing Streaming Data); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Stream Processing Concepts: Time)


Context

Module 1's Observation envelope carried two timestamps, and Module 1 said the distinction would matter later. This is later. The dedup operator in M1 (and the orchestrator that wraps it in M2) assigns observations to windows by processing time — the wall-clock instant at which the pipeline received the observation. That works for ingestion-tier deduplication but it is wrong for correlation-tier reasoning. Two radars observe the same orbital event at the same instant. Their observations leave the radars within microseconds of each other. They arrive at the pipeline 100 milliseconds apart because one radar's link to the ground network has a long-haul fiber detour. A processing-time correlator concludes "two separate events." An event-time correlator concludes "one event, two views." The conjunction risk computation depends on the latter being right.

The mental shift this lesson installs is that every observation has multiple timestamps that are operationally non-interchangeable. Sensor timestamp is when the event happened in the world. Ingest timestamp is when the pipeline received it. The pipeline's wall clock is when now is. Each of these answers a different question. Throughput metrics (events per second across the pipeline) want processing time. SLO compliance (P99 ingest-to-emit latency) wants ingest time as the start. Conjunction-window assignment (which observations should be considered together for risk computation) wants sensor timestamp. Confusing them produces incorrect aggregates that are individually plausible but collectively inconsistent.

This module's job is to take the orchestrator from Module 2 and replace its processing-time dedup with an event-time windowed correlator. The lessons proceed in dependency order. This lesson establishes the time vocabulary. Lesson 2 builds windows on top of it. Lesson 3 introduces watermarks — the mechanism by which the pipeline decides "I have seen all events for window W with sufficient confidence to emit the window's result." Lesson 4 handles the late events that arrive after a watermark has already declared the window closed. The capstone project replaces M2's dedup operator with the result.


Core Concepts

The Three Times

DDIA Chapter 11 makes the case precisely: an event in a streaming system can carry up to three distinct timestamps, and a production pipeline that does not distinguish them will have correctness bugs that look like flakiness.

  • Event time — when the event actually happened in the source system. For an SDA observation, this is the instant the sensor's hardware recorded the detection. The radar's GPS-disciplined clock captures this to nanosecond precision. The optical telescope's NTP-disciplined clock captures it to about ten milliseconds. The ISL beacon's onboard satellite clock captures it with drift up to seconds between syncs.
  • Ingest time (sometimes called server time) — when the pipeline received the event. This is what SystemTime::now() returns in the source operator's recv loop. It is monotonic-ish across observations from the same sensor but not across sensors.
  • Processing time — the wall-clock instant at which a given operator processes the event. Different operators process the same event at different processing times because the event flows through them sequentially. For most aggregation purposes this and ingest time are interchangeable; the distinction matters when you are reasoning about a single operator's local clock.

For SDA's purposes, ingest time and processing time can be collapsed: every observation's ingest_timestamp is captured at the ingestion-tier source operator (Module 1), and downstream operators inherit that timestamp without modification. The two distinct times the pipeline carries forward are event time (sensor_timestamp) and ingest time (ingest_timestamp). The lessons that follow refer to these by their envelope field names.
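For reference, the envelope fields this module's code examples lean on, sketched here; the authoritative Observation definition is Module 1's, and only these field names matter below:

use std::time::SystemTime;

#[derive(Debug, Clone)]
pub struct Observation {
    pub sensor_timestamp: SystemTime, // event time: when the sensor recorded it
    pub ingest_timestamp: SystemTime, // ingest time: when the pipeline first saw it
    pub source_kind: SourceKind,      // radar / optical / ISL
    pub target_object_id: ObjectId,   // correlation key
    // ...payload fields elided
}

#[derive(Debug, Clone, Copy)]
pub enum SourceKind { Radar, Optical, Isl }

#[derive(Debug, Clone, Hash, PartialEq, Eq)]
pub struct ObjectId(pub u64);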

Where Each Time Belongs

The decision is per-question, not per-pipeline. A useful rule of thumb is "what is the question being asked, and what time would make the answer right?"

  • How many events did we see this minute? → Processing/ingest time (the operator's local view)
  • What is the P99 latency from sensor to alert? → Both: (emit_time - sensor_timestamp) for end-to-end, (emit_time - ingest_timestamp) for pipeline-only
  • Which observations are part of this orbital event? → Event time (sensor_timestamp); we want all observations that physically co-occurred
  • Is the pipeline keeping up with real time? → The lag: (now - sensor_timestamp) summarized over the recent window
  • Did we receive any observation in the last second? → Ingest time
  • For a 5-second event-time window starting at T, when can we close it? → That is the watermark question (Lesson 3)

The trap is using the wrong time for the question and getting an answer that looks plausible. A conjunction correlator that buckets by ingest time produces correct-looking output most of the time — the typical optical-vs-radar arrival skew is small enough that most observations of the same event do land in the same processing-time bucket. The bug shows up only when a sensor source has a delay that pushes an observation across a bucket boundary. That is when correlations are missed. The bug is rare and silent and load-dependent and exactly the kind of thing that gets diagnosed only after a high-profile incident.

Clock Skew and Source Quality

Every sensor's clock has its own accuracy story, and the pipeline must accept the heterogeneity rather than pretend it away.

  • GPS-disciplined clocks — radar arrays. Accurate to about 100 nanoseconds against UTC. Locked to a constellation that disciplines drift continuously. The radar's sensor_timestamp is the most trustworthy event time in the system.
  • NTP-disciplined clocks — ground-based optical telescopes. Accurate to about 10 milliseconds in steady state, occasionally worse when the NTP server is degraded. NTP self-reports its sync state, which is data the source operator can include in the envelope as a quality flag.
  • Onboard satellite clocks for ISL beacons — disciplined by the satellite's own GPS or by occasional ground-loop syncs. Drift between syncs can grow to seconds. The beacon's envelope reports time_since_last_sync_s so the pipeline can estimate the drift and treat the timestamp accordingly.

The pipeline cannot fix bad clocks at the source, but it can carry the per-source quality forward so downstream operators can reason about it. A correlator that knows an ISL beacon's timestamp is uncertain to within ±5 seconds can either widen the matching window for that source or downweight its contribution; a correlator that treats every timestamp as ground truth produces incorrect correlations during sync drift.

Out-of-Order Arrival

Even with perfectly synchronized clocks, observations arrive at the pipeline out of event-time order. The optical archive's 30-second polling interval means optical observations can lag radar observations of the same event by up to 30 seconds of event time. The ISL beacon's 10-second buffering before downlink means beacon observations can lag both. A correlator that assumes events arrive in event-time order produces correctness bugs the first time a slow source's observation arrives after a faster source's observation that was generated later.

Out-of-order arrival is the rule, not the exception. The system must accept it. The mechanism is the watermark protocol covered in Lesson 3, which generalizes "wait for late events" into a tractable operator-level abstraction. The lesson here is conceptual: design every event-time operator under the assumption that observations arrive in arbitrary order with respect to event time, and verify the assumption with a replay test that injects deliberate out-of-order traffic.

Lag as the Master Diagnostic

The diagnostic metric that operations depends on most is lag: lag = ingest_timestamp - sensor_timestamp (or now - sensor_timestamp for currently-streaming events). Lag answers "how far behind real time is the pipeline?" — the single most operationally important question for an event-driven system.

Lag's two components separate cleanly. Source lag (ingest_timestamp - sensor_timestamp) is how long the pipeline took to receive the event after the sensor recorded it. Source lag changes when sensors get slower, when network paths degrade, when partner APIs back up. Pipeline lag (now - ingest_timestamp for a still-flowing event, or for an emitted event the difference between emit-time and ingest-time) is how long the pipeline itself takes once it has the event. Pipeline lag changes when operators slow down, when channels back up, when the pipeline is overloaded. Distinguishing the two is what lets ops answer "is the problem ours or theirs?" without playing detective.

A naive lag metric (just now - sensor_timestamp summarized as a histogram) is a useful single-number diagnostic and the right thing to put on the dashboard. A diagnostic-grade metric (split by source for source lag, by stage for pipeline lag) is the right thing to have available when you need it. Module 6 builds these out fully; for this module we instrument lag at one point — the sink — and use it as the SLO indicator.


Code Examples

Source-Side Event-Time Capture with Quality Flags

The radar source already populates sensor_timestamp from its frame data. We extend each source's emission with a quality flag describing how trustworthy the timestamp is. This is small, additive, and the foundation of every event-time decision downstream.

use std::time::Duration;

/// How trustworthy is this observation's sensor_timestamp?
/// Carried alongside the timestamp so downstream operators can
/// widen matching windows or downweight observations from
/// sources with degraded clocks.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum ClockQuality {
    /// GPS-disciplined; accurate to ~100ns against UTC.
    GpsLocked,
    /// NTP-disciplined; accurate to ~10ms in steady state.
    NtpSynced { last_sync_age_s: u32 },
    /// Onboard clock with measurable drift since last discipline.
    OnboardDrift { time_since_sync_s: u32 },
    /// Source could not provide quality; treat conservatively.
    Unknown,
}

impl ClockQuality {
    /// Worst-case event-time uncertainty for this source. Used by
    /// the correlator to widen its matching window.
    pub fn max_skew(&self) -> Duration {
        match *self {
            ClockQuality::GpsLocked => Duration::from_micros(1),
            ClockQuality::NtpSynced { last_sync_age_s } => {
                // NTP accuracy degrades roughly linearly with sync age,
                // capped at the original 10ms accuracy + 1ms per minute.
                let extra_ms = (last_sync_age_s / 60) as u64;
                Duration::from_millis(10 + extra_ms)
            }
            ClockQuality::OnboardDrift { time_since_sync_s } => {
                // Conservative: 100us drift per second since last sync.
                Duration::from_micros((time_since_sync_s as u64) * 100)
            }
            ClockQuality::Unknown => Duration::from_secs(5),
        }
    }
}

The max_skew method gives the correlator a single number to expand its event-time matching window for observations from this source. A radar observation gets a 1-microsecond skew (effectively zero); an ISL beacon two minutes past its last sync gets 12 milliseconds; a beacon at the bound of its sync interval (an hour without sync, hypothetically) gets 360 milliseconds. The correlator widens its window by the max-skew of any participating observation, ensuring legitimate correlations are not missed because of clock differences. The pattern is a close cousin of what Kafka Streams calls a grace period and what Flink calls allowed lateness for out-of-order events.
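Correlator-side use is a single widening step; treating clock_quality as an envelope field is an assumption of this sketch:

// Widen the event-time matching window by both observations' worst-case
// skew before testing |t_a - t_b| <= effective_window.
let effective_window = base_window
    + obs_a.clock_quality.max_skew()
    + obs_b.clock_quality.max_skew();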

Computing and Emitting Lag

The lag operator sits at the sink end of the pipeline (or near it). It computes the lag for every event flowing through and exports it as a histogram. The operator is stateless and adds only two histogram records to the hot path.

use std::time::{Duration, SystemTime};

/// Compute end-to-end lag for an observation emitted at the sink.
/// Returns (source_lag, pipeline_lag).
/// source_lag = ingest_timestamp - sensor_timestamp (how late the sensor's
///              event was when the pipeline first saw it)
/// pipeline_lag = now - ingest_timestamp (how long the pipeline itself
///                took to process the event)
pub fn compute_lag(obs: &Observation) -> (Duration, Duration) {
    let now = SystemTime::now();
    let source_lag = obs
        .ingest_timestamp
        .duration_since(obs.sensor_timestamp)
        .unwrap_or_else(|_| {
            // Negative source lag means the source's clock is ahead of
            // ours — possible during deploy if the source is GPS-locked
            // and our host's NTP is degraded. Surface as zero rather
            // than silently subtracting.
            Duration::ZERO
        });
    let pipeline_lag = now
        .duration_since(obs.ingest_timestamp)
        .unwrap_or(Duration::ZERO);
    (source_lag, pipeline_lag)
}

/// Operator that observes lag at the sink and exports it.
/// The Prometheus histogram split by source kind makes it actionable:
/// a source-specific lag spike points at the source; a uniform spike
/// points at the pipeline.
pub fn observe_lag(obs: &Observation) {
    let (source_lag, pipeline_lag) = compute_lag(obs);
    let kind = format!("{:?}", obs.source_kind);
    metrics::histogram!("source_lag_seconds", "source" => kind.clone())
        .record(source_lag.as_secs_f64());
    metrics::histogram!("pipeline_lag_seconds", "source" => kind)
        .record(pipeline_lag.as_secs_f64());
}

The negative-lag handling is a real concern — the pipeline's host clock and a source's clock can disagree, and the difference can be in either direction. The right behavior is to surface the disagreement as a separate signal (a clock_skew_observed_total counter) rather than producing absurd lag values. The implementation collapses to zero for simplicity here; production code should split this out and alert on persistent negative lag. The Prometheus split by source is the operationally important part: when the lag dashboard shows the radar's source lag has doubled while the optical's is unchanged, you know what to investigate.

A Pitfall: Window Assignment Using Processing Time

The buggy operator below assigns observations to 5-second windows by calling Instant::now() at processing time. A teammate proposes it during code review with the rationale "this is simpler and the difference is small." It is wrong. The example shows the failure mode the lesson keeps warning about.

use std::time::{Instant, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;

/// Window identifier: the index of a 5-second bucket.
pub struct WindowId(pub u64);

// BUG: assigns to windows by processing time, not by event time.
// Observations of the same event from different sources land in
// different windows because they arrive at slightly different times.
pub async fn buggy_window_assigner(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<(WindowId, Observation)>,
) -> Result<()> {
    let start = Instant::now();
    while let Some(obs) = input.recv().await {
        let elapsed_secs = start.elapsed().as_secs();
        let window_id = WindowId(elapsed_secs / 5); // 5-second windows
        output.send((window_id, obs)).await?;
    }
    Ok(())
}

// CORRECT: assigns by sensor_timestamp. Observations of the same
// physical event land in the same window regardless of arrival skew.
pub async fn correct_window_assigner(
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<(WindowId, Observation)>,
    epoch: SystemTime,
) -> Result<()> {
    while let Some(obs) = input.recv().await {
        let event_offset = obs
            .sensor_timestamp
            .duration_since(epoch)
            .unwrap_or_default();
        let window_id = WindowId(event_offset.as_secs() / 5);
        output.send((window_id, obs)).await?;
    }
    Ok(())
}

Two observations that physically co-occurred — same orbital event, same instant of detection — but arriving at the pipeline 100 milliseconds apart land in different processing-time windows whenever the 100-ms gap straddles a 5-second boundary. In the SDA pipeline's actual deployment, the optical-radar arrival distribution puts about 2% of correlated events across such a boundary, and the buggy operator mis-correlates all of them. That is enough to produce a mis-correlation rate that operations notices but cannot easily explain: every correlator failure produces a defensible-looking output, and only the aggregate statistics reveal the bias. The fix is exactly the operator above: bucket by sensor_timestamp, not by elapsed processing time. Lesson 2 develops this into a full windowing operator with bounded memory and explicit close conditions.


Key Takeaways

  • Every observation carries two operationally distinct timestamps: sensor_timestamp (when the event happened in the world) and ingest_timestamp (when the pipeline received it). Processing time at any given operator can be collapsed to ingest time for SDA's purposes; the distinction that matters is event time vs ingest time.
  • The right time depends on the question. Throughput and SLO metrics want ingest/processing time. Window assignment for correlation wants event time. Confusing them produces output that looks correct in aggregate but is silently wrong in ways that only the aggregate-of-aggregates statistics reveal.
  • Sensor clocks are heterogeneous: GPS-locked (radar, ~100ns), NTP-synced (optical, ~10ms), onboard-with-drift (ISL, up to seconds). Carry clock quality forward in the envelope so downstream operators can widen matching windows or downweight observations from degraded sources.
  • Out-of-order arrival is the rule in event-time pipelines. Late observations from slow sources arrive after observations of later events from faster sources. Every event-time operator must be designed with this assumption; the watermark protocol in Lesson 3 makes the rule operational.
  • Lag (now - sensor_timestamp) is the master diagnostic for an event-driven pipeline. Split into source lag and pipeline lag to answer "is the problem ours or theirs?" without ambiguity. Module 6 builds the full observability stack on top of this primitive.

Lesson 2 — Windowing

Module: Data Pipelines — M03: Event Time and Watermarks
Position: Lesson 2 of 4
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Windowing); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Time Windows in Stream Processing); Streaming Data — Andrew Psaltis, Chapter 4 (Windowing patterns and their implementation costs)


Context

Lesson 1 established that observations carry an event time and that correlation logic must bucket them by event time rather than by arrival time. This lesson is about how to do that bucketing efficiently and correctly. A window is a bounded slice of the event stream — defined by the rule "all events whose event-time falls within this range" — that the operator can hold in memory, compute over, and emit a result for. Windows are how unbounded streams admit aggregation: you cannot compute "all conjunction risks ever," but you can compute "all conjunction risks where the contributing observations fell within this 30-second event-time slice."

The choice of window shape is a correctness decision, not a performance optimization. The conjunction risk computation cares about pairs of orbital objects whose closest-approach time falls within a small window. If we use the wrong window shape — say, fixed 30-second buckets aligned to clock minutes — every legitimate conjunction whose closest approach happens to straddle a bucket boundary is silently missed. Choosing the right shape requires understanding the four canonical kinds (tumbling, hopping, sliding, session), the cost profile of each, and the question each is shaped to answer. This lesson takes them in turn and develops a sliding-window operator that the capstone correlator builds on.

The forward references stay concrete. The window operator built here holds events in memory until it can declare a window "closed" and emit its result. Lesson 3 supplies the close mechanism — watermarks. Lesson 4 handles late events that arrive after a window has been closed. The capstone project replaces M2's processing-time dedup with this lesson's sliding-window correlator. The pattern of "windowed accumulation, watermark-driven emit, allowed-lateness for stragglers" is the standard streaming-system shape; we are building the SDA-specific instance of it.


Core Concepts

Tumbling Windows

The simplest window shape. Fixed size, non-overlapping, every event lands in exactly one window. A 5-second tumbling window over an event-time stream produces one window for [0, 5), the next for [5, 10), the next for [10, 15), and so on. The boundaries are typically aligned to a stable epoch (00:00:00 UTC, or the pipeline's start time, or some other fixed reference) so that a given event_time always maps to the same window across replicas.

Tumbling windows are the right choice for aggregates over disjoint time slices: events-per-minute counts, hourly throughput summaries, "how many observations did each sensor produce this minute." They are the wrong choice for any question of the form "find pairs of events within a small time gap" — the gap can straddle a tumbling-window boundary, and events in adjacent windows are never seen together by the operator. The conjunction-risk computation is exactly that kind of question, which is why M3 needs sliding windows rather than tumbling.

Memory cost: bounded by the largest single window's events. Once a window closes (Lesson 3), its state is freed. State per active window scales with the window size and the per-key event rate.

Hopping Windows

A generalization of tumbling. Fixed size, fixed step (sometimes called advance or slide), step typically smaller than size — which produces overlap. A 30-second window with 5-second step produces windows for [0, 30), [5, 35), [10, 40), ..., overlapping by 25 seconds each. Every event lands in size / step windows simultaneously (30 / 5 = 6 here).

Hopping windows are useful for "rolling aggregates emitted on a regular cadence" — every 5 seconds, emit the count of events in the last 30 seconds. The emit cadence (step) is decoupled from the aggregation breadth (size). Some streaming systems call this a sliding window; the terminology is unfortunately not standardized. We use hopping for fixed-size-fixed-step and reserve sliding for the per-event-driven shape below.

Memory cost: scales with size / step. Each event is held in size / step windows simultaneously. For a 30-second window with 1-second step, every event is in 30 windows. The per-event memory is small (a reference, not a copy), but the constant factor is real and shows up at scale.
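The membership computation is mechanical. A sketch, assuming whole-second granularity and windows aligned to the epoch (size an exact multiple of step):

use std::time::{Duration, SystemTime};

/// End times of every hopping window a given event falls into.
/// Windows that would start before the epoch are skipped.
fn hopping_window_ends(
    event_time: SystemTime,
    epoch: SystemTime,
    size: Duration,
    step: Duration,
) -> Vec<SystemTime> {
    let offset = event_time.duration_since(epoch).unwrap_or_default().as_secs();
    let (size_s, step_s) = (size.as_secs().max(1), step.as_secs().max(1));
    // The last window that starts at or before the event...
    let mut start = ((offset / step_s) * step_s) as i64;
    let mut ends = Vec::new();
    // ...then walk backward one step at a time while the window still
    // covers the event. For size 30 and step 5 this yields 6 windows.
    while start >= 0 && (start as u64) + size_s > offset {
        ends.push(epoch + Duration::from_secs(start as u64 + size_s));
        start -= step_s as i64;
    }
    ends
}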

Sliding Windows (Per-Event)

Every event creates its own window of [event_time - W, event_time] for some window size W. Maximum overlap. For a 30-second sliding window, an event at time T anchors a window covering T-30 to T, and the next event at T+1 anchors a window covering T-29 to T+1. The "matching set" for any given event is whatever other events fall within W of its event_time.

This is the right shape for the conjunction-risk question. The natural framing of the problem is "for each new observation, what other observations are within 30 seconds of event_time?" — exactly what a per-event sliding window computes. Production systems implement sliding windows as a deque per key, with the front evicted as new events arrive that push the deque's tail past the window boundary. The capstone operator builds this.

Memory cost: bounded by the most-active key's event rate × W. For a 30-second window over a key that sees 100 events per second, the deque holds 3000 entries — trivial. For a key that sees 100,000 events per second the deque is much larger; mitigations are key-level rate limits, sampling, or shorter windows. Module 4's load-shedding work develops these mitigations.

Session Windows

Variable-length, defined by gap of inactivity. A session window collects events while they keep arriving within a gap timeout of each other; when the gap is exceeded, the window closes. The classic use is "user web session" — collect events while the user is active, close the session when they go idle for 5 minutes. For SDA, the natural use is "satellite pass" — a satellite's beacons during a single overhead pass form a session, with the gap between passes (when the satellite is below the horizon) closing the session.

Session windows are the only window shape with state proportional to the active session's duration, not to a fixed configuration. A satellite that orbits for 90 minutes during a long pass produces a 90-minute session; a quiet pass might be 10 seconds. The gap timeout is what bounds the session — without it, a continuous event stream produces an unbounded session that never closes.

Memory cost: the dominant risk. A misconfigured gap timeout (too long) or a continuous source (a beacon that emits without gaps) produces unbounded growth. Production code adds a hard maximum session duration alongside the gap timeout: even if events keep arriving, force-close after N minutes. The double-bound makes session windows safe to deploy.

Window State and the Close Trigger

For every window class, the operator maintains state per active window: the events that have been assigned to it and any aggregations being computed (running counts, partial join results, etc.). The state grows as events arrive and shrinks when windows close. The close trigger — the condition under which the operator declares "this window is done; emit its result and discard its state" — is the single most important correctness decision in a windowed operator.

The naive close trigger is wall-clock time: a 5-second tumbling window starting at T closes when wall-clock-time reaches T+5. This is wrong in event-time semantics because late events with sensor_timestamp ≤ T+5 may still arrive after wall-clock-time T+5 — a tumbling window closed by wall clock would miss them. The correct close trigger is the watermark — Lesson 3's full topic. The intuition we install here is structural: the windowed operator does not decide on its own when to close a window; it consumes a watermark signal that tells it "no event with event_time ≤ X will arrive after this point," and it closes any window whose end ≤ X.

Until Lesson 3, this lesson's operator implementations declare windows closed via a placeholder mechanism (an explicit close_windows_up_to(T) method). The capstone operator wires that placeholder to a real watermark stream.


Code Examples

Tumbling Window Operator

The simplest implementation. A BTreeMap<WindowEnd, WindowState> keyed on the window's end time. Each event is bucketed by its sensor_timestamp; close-up-to-watermark walks the map's prefix and emits.

use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;

/// Tumbling-window operator. Emits aggregated state for every window
/// whose end has been crossed by the watermark. Close-trigger is supplied
/// externally via close_up_to(); Lesson 3 wires this to a watermark.
pub struct TumblingWindow {
    /// Windows by end time. BTreeMap gives O(log N) range-prefix iteration
    /// in close_up_to(), which is the hot path on watermark advance.
    windows: BTreeMap<SystemTime, Vec<Observation>>,
    window_size: Duration,
    epoch: SystemTime,
    output: mpsc::Sender<WindowResult>,
}

impl TumblingWindow {
    pub fn new(
        window_size: Duration,
        epoch: SystemTime,
        output: mpsc::Sender<WindowResult>,
    ) -> Self {
        Self {
            windows: BTreeMap::new(),
            window_size,
            epoch,
            output,
        }
    }

    /// Assign an observation to its window and accumulate it.
    pub fn ingest(&mut self, obs: Observation) {
        let offset = obs.sensor_timestamp.duration_since(self.epoch).unwrap_or_default();
        let window_idx = offset.as_secs() / self.window_size.as_secs().max(1);
        let window_end = self.epoch + self.window_size * (window_idx + 1) as u32;
        self.windows.entry(window_end).or_default().push(obs);
    }

    /// Close every window whose end ≤ watermark. Emit result, free state.
    pub async fn close_up_to(&mut self, watermark: SystemTime) -> Result<()> {
        // BTreeMap::split_off gives us O(log N) access to the prefix
        // we want to drain. The remainder stays in self.windows.
        let still_open = self.windows.split_off(&(watermark + Duration::from_nanos(1)));
        let to_close = std::mem::replace(&mut self.windows, still_open);
        for (window_end, observations) in to_close {
            let result = WindowResult { window_end, count: observations.len() };
            self.output.send(result).await
                .map_err(|_| anyhow::anyhow!("window result downstream dropped"))?;
        }
        Ok(())
    }

    /// For diagnostics: number of windows currently held in state.
    pub fn pending_window_count(&self) -> usize { self.windows.len() }
}

#[derive(Debug, Clone)]
pub struct WindowResult {
    pub window_end: SystemTime,
    pub count: usize,
}

The BTreeMap-keyed-on-window-end choice is load-bearing for this operator. The hot path on watermark advance is "find every window whose end ≤ watermark," which is a prefix range query; BTreeMap does this in O(log N) via split_off. A HashMap-keyed-on-window-id would require a full scan on every watermark advance — fine for a few windows, untenable when the operator holds hundreds. The cost of BTreeMap inserts (O(log N) instead of HashMap's O(1) amortized) is paid back many times over by the cheap range query on close. The pattern generalizes: for any operator whose hot path is "find everything ≤ X," choose a data structure that gives you that operation cheaply. A naive HashMap.iter().filter() is O(N) and quietly catastrophic at scale.

Sliding Window Operator (Per-Event)

The conjunction-risk operator's foundation. A VecDeque<Observation> per key. New events are appended to the back; on each new event, evict from the front any events whose sensor_timestamp is older than event_time - W. The deque always represents the current sliding window for that key.

use std::collections::{HashMap, VecDeque};
use std::time::Duration;

/// Per-key sliding window operator. Each new event for a key emits
/// the set of other events within W of its event_time as a candidate
/// match list. The conjunction correlator (capstone project) builds
/// on this primitive.
pub struct SlidingWindow {
    deques: HashMap<ObjectId, VecDeque<Observation>>,
    window: Duration,
}

impl SlidingWindow {
    pub fn new(window: Duration) -> Self {
        Self { deques: HashMap::new(), window }
    }

    /// Process an observation. Evict expired entries, append the new
    /// observation, return the current contents of the window for
    /// downstream matching.
    pub fn step(&mut self, obs: Observation) -> &VecDeque<Observation> {
        let deque = self.deques.entry(obs.target_object_id.clone()).or_default();
        let cutoff = obs.sensor_timestamp - self.window;
        // Evict from front while the front's event_time is older than
        // the cutoff. Front is the oldest; the deque is event-time
        // ordered by construction (we only append to the back, and
        // each appended event has a sensor_timestamp ≥ all existing
        // ones, modulo out-of-order arrival within the window).
        while let Some(front) = deque.front() {
            if front.sensor_timestamp < cutoff {
                deque.pop_front();
            } else {
                break;
            }
        }
        deque.push_back(obs);
        // Return a reference; caller decides what matching to compute.
        &*deque
    }

    /// Drop a key's deque entirely — used when the watermark indicates
    /// no more events will arrive for this key (e.g., a satellite
    /// has decayed). Not used in the steady-state hot path.
    pub fn close_key(&mut self, key: &ObjectId) -> Option<VecDeque<Observation>> {
        self.deques.remove(key)
    }

    pub fn pending_keys(&self) -> usize { self.deques.len() }
}

The eviction-on-each-event pattern keeps the deque always sized to the current window for that key — no separate "garbage collect old entries" pass needed. Two design choices are subtle. The eviction cutoff is obs.sensor_timestamp - self.window, not now - self.window; we are doing event-time windowing, so the cutoff is in event time, not wall-clock. The deque is event-time ordered by the assumption that within a single key, observations arrive in roughly event-time order — which is true for a single-source single-key stream and approximately true for a multi-source stream (out-of-order is possible within tens of milliseconds). Strict event-time ordering is not required; the eviction loop simply advances while the front is older than the cutoff and stops. A late event whose sensor_timestamp is older than the cutoff is silently dropped — that is the late event problem that Lesson 4 covers in full.
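Usage shape in the correlator, sketched; the pairwise risk computation itself is the capstone's job:

use std::time::Duration;

let mut window = SlidingWindow::new(Duration::from_secs(30));

// On each arriving observation, step() returns every observation for
// this key within 30 s of the new one's event time; the new
// observation itself sits at the back of the deque.
let contents = window.step(obs);
let candidates = contents.len() - 1; // everything except the newest
for _earlier in contents.iter().take(candidates) {
    // capstone work: pairwise conjunction risk between this earlier
    // observation and the newest one.
}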

A Session Window with Hard-Cap Safety

The ISL beacon's per-satellite pass. Events arrive while the satellite is overhead; the session closes when the satellite goes below the horizon (gap exceeds threshold) or, as a safety valve, when the session has been open for longer than a configured maximum.

use std::time::{Duration, Instant};

pub struct SessionWindow {
    session_start: Instant,
    last_event: Instant,
    gap_timeout: Duration,
    max_session: Duration,
    events: Vec<Observation>,
}

impl SessionWindow {
    pub fn new(first: Observation, gap_timeout: Duration, max_session: Duration) -> Self {
        let now = Instant::now();
        Self {
            session_start: now,
            last_event: now,
            gap_timeout,
            max_session,
            events: vec![first],
        }
    }

    /// Add an event to the session if the gap is acceptable. Returns
    /// Err with the new event if the session has timed out and the
    /// caller should start a fresh session.
    pub fn try_add(&mut self, obs: Observation) -> Result<(), Observation> {
        let now = Instant::now();
        let gap_ok = now.duration_since(self.last_event) < self.gap_timeout;
        let max_ok = now.duration_since(self.session_start) < self.max_session;
        if gap_ok && max_ok {
            self.last_event = now;
            self.events.push(obs);
            Ok(())
        } else {
            Err(obs)
        }
    }

    pub fn close(self) -> Vec<Observation> { self.events }

    pub fn open_duration(&self) -> Duration { self.last_event.duration_since(self.session_start) }
}

The double-bound pattern (gap timeout AND max session duration) is the safety property that makes session windows production-safe. A gap-only design hits unbounded memory the first time a source emits without any gap (a stuck-open beacon, a misconfigured emitter); the max-session bound is the safety valve that always fires. The cost of the bound is occasional artificial session breaks for legitimately-long sessions — acceptable for SDA's beacon-pass use case (passes are bounded by orbital mechanics) and adjustable per use case. The session window operator is the third canonical shape; the SDA capstone uses sliding windows (the second shape), but the pattern generalizes.


Key Takeaways

  • Window shape is a correctness decision, not a performance optimization. The four canonical shapes — tumbling (disjoint, fixed-size), hopping (overlapping, fixed-step), sliding (per-event), session (gap-defined) — each fit different question shapes. The conjunction-risk question fits sliding windows; aggregations fit tumbling; emit-on-cadence fits hopping; satellite-pass fits session.
  • The window operator does not decide its own close trigger. It consumes a watermark signal from upstream that says "no event with event_time ≤ X will arrive" and closes any window whose end ≤ X. This decouples the window's accumulation semantics from the close logic; Lesson 3 supplies the watermark.
  • BTreeMap keyed on window end gives the close-up-to-watermark hot path an O(log N) seek plus iteration over only the windows actually being closed. Choose data structures whose hot-path operations are cheap; a HashMap that needs an O(N) scan per watermark is quietly catastrophic at scale.
  • Sliding windows are bounded by the most-active key's event rate × W. Per-key deques with on-each-event front eviction keep memory always sized to the current window; no separate GC pass needed.
  • Session windows need a hard maximum alongside the gap timeout. A gap-only design hits unbounded memory the first time a source streams continuously; the max-session bound is the safety valve.

Lesson 3 — Watermarks

Module: Data Pipelines — M03: Event Time and Watermarks Position: Lesson 3 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 ("Knowing When You're Ready to Receive Events" — the watermark/punctuation discussion); Streaming Data — Andrew Psaltis, Chapter 4 (Out-of-Order Events and the Watermark Mechanism); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Out-of-Sequence Events)


Context

Lesson 2's windowed operator builds up state per active window and waits for a close trigger — the signal that says "this window is done; emit its result and discard its state." We deferred the close trigger to this lesson. The naive answer is wall-clock time: a 5-second window starting at T closes when wall-clock time exceeds T+5. This is wrong in event-time semantics. A late-arriving observation with sensor_timestamp ≤ T+5 may still arrive after wall-clock-time T+5 because the optical archive's polling cadence delayed it. Closing on wall-clock loses that observation; the window's emitted result is wrong. The right close trigger has to be a per-event-time signal, not a per-wall-clock signal.

The mechanism is the watermark: a declaration, made by a source or computed by an operator, that "no event with event_time less than X will arrive at this point in the pipeline after this watermark." A watermark is a guarantee, not an estimate. With a watermark, the windowed operator can close any window whose end ≤ the watermark's value and be confident no more relevant events will arrive. Without a watermark, the operator either holds windows forever (correct but useless — no result is ever emitted) or closes them too early on a wall-clock bound (fast but wrong). The watermark is the necessary third piece of event-time semantics, alongside event-time-on-the-envelope (Lesson 1) and event-time-windowing (Lesson 2); Lesson 4's late-data handling completes the set.

This lesson develops watermarks in three pieces. What a watermark is precisely (and the perfect-vs-heuristic distinction). How sources generate watermarks for their own observations. How operators propagate watermarks through the pipeline (the min-of-inputs rule that the rest of the track depends on). The forward references stay tight. Lesson 4 handles late events that arrive after a watermark has already declared their window closed. The capstone wires watermarks from sources through normalize through the windowed correlator. Module 6 surfaces watermark progress as the master observability metric — "the pipeline is currently complete through event-time T."


Core Concepts

A Watermark, Defined Precisely

A watermark is a value of the form Watermark(t) whose meaning is: no event with event_time < t will arrive after this point. The watermark is a guarantee, not a hope or an estimate. When an operator receives a watermark, it can act on that guarantee — close windows, emit results, evict state — confident that the guarantee will hold.

A watermark's value monotonically advances: each new watermark from a given source has a value greater than or equal to the previous one. (Equality is permitted — an idle source re-emitting its last value is the common case — but carries no new information.) A source that emits a non-monotonic watermark has violated the protocol; the operator may treat this as a bug and either ignore the regressed value or fail loudly. The orchestrator's structured-logging discipline applies — log a structured event for every regressed watermark, alert on persistent regression.
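
A minimal sketch of the ignore-and-log variant of that guard, assuming the tracing crate for the structured event:

use std::time::SystemTime;

/// Accept a watermark only if it does not regress. Returns true if the
/// stored value advanced (or was set for the first time).
fn accept_watermark(current: &mut Option<SystemTime>, incoming: SystemTime) -> bool {
    match *current {
        Some(prev) if incoming < prev => {
            // Protocol violation upstream: log structured, ignore the value.
            tracing::warn!(?prev, ?incoming, "watermark regression ignored");
            false
        }
        _ => {
            *current = Some(incoming);
            true
        }
    }
}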

The watermark is a separate item from observations on the same channel. The convention this track uses is to interleave two kinds of items on the source-to-operator channels: data items (Observation) and control items (Watermark(SystemTime)). Operators consume both, processing data items as they arrive and updating their watermark state on watermark items. The alternative — a separate side channel for watermarks — exists in some streaming frameworks but introduces its own coordination problems (a fast data channel out-pacing a slow watermark channel produces apparent regressions). In-band interleaving keeps the ordering coherent.

Perfect vs Heuristic Watermarks

A perfect watermark is one where the source can prove the guarantee — there is some monotonic property of the source that lets it declare with certainty when no earlier event will arrive. The clearest example is a single-stream source whose events arrive in event-time order: every event is a watermark, because the source can emit "watermark = this event's event_time" and be sure no earlier events will follow.

Most production sources cannot offer perfect watermarks. They offer heuristic watermarks: an estimate of the maximum lateness an event can have, used to compute a bound. The source picks a max-lateness estimate (call it M) and emits, periodically, a watermark of value max_observed_event_time - M — the event-time high-water mark minus the estimate. The estimate is documented per-source based on the source's known properties. If M is too small, late events arrive past the watermark — the watermark's guarantee was wrong, and the late event must be dropped or held in allowed-lateness state (Lesson 4). If M is too large, the watermark advances too slowly and downstream windows close later than necessary, increasing pipeline latency.

For SDA, the per-source max-lateness values are:

Source               Max lateness   Reasoning
Radar (UDP)          100 ms         GPS-locked clocks; fiber path round-trip; no buffering
Optical (HTTP poll)  30 s           Polling cadence is 30 s; an event recorded just after a poll waits one full cycle
ISL beacon (TCP)     10 s           Onboard buffering before downlink; downlink-to-ground propagation

The estimates are conservative: real lateness is typically much less, but the bound covers the tail of the lateness distribution. Lesson 4 covers what happens when the estimate is wrong.

Generation at Sources

A source emits watermarks alongside its observations. The pattern is the same for each source kind, parameterized by the source's max-lateness estimate.

pub enum SourceItem {
    Observation(Observation),
    Watermark(SystemTime),
}

Each source's emit loop interleaves Observation items with periodic Watermark items. The frequency of watermark emission is operationally important: emit too rarely and downstream windows are held open longer than necessary because the operator does not know the watermark has advanced; emit too frequently and the watermark items add bandwidth overhead. A watermark every 1 second of wall-clock time is a reasonable starting point for SDA — well below the optical source's 30-second polling cadence, well above the per-event rate.

The source-side watermark value is max(observed event_times) - max_lateness. A source that has just produced an observation with sensor_timestamp = T emits the watermark T - M (where M is its max-lateness estimate). The watermark trails the source's most recent observed event-time by exactly M, which is the guarantee shape the watermark protocol expects.

Propagation Through Operators

When an operator has multiple inputs (a fan-in normalize, a join, a correlation), it must compute its output watermark from its input watermarks. The rule is the minimum: the output watermark is the minimum of the most recent watermark from each input. The reason for min, not max: we can only guarantee what the worst upstream guarantees. If the radar input has watermark T and the optical input has watermark T-30, we cannot guarantee that no event with event_time < T will arrive — because the optical input might still produce one. The strongest claim we can make is "no event with event_time < T-30 will arrive," so that is the output watermark.

The min-rule has a counterintuitive consequence: the slowest source dominates the downstream watermark. A pipeline with three sources at watermarks T, T+10, and T+20 has a downstream watermark of T — the slow source's value. Improving any of the faster sources does nothing for the downstream watermark; only improving the slowest source does. This is the operational property that makes per-source max-lateness estimates so important: tightening any one estimate lowers that source's watermark trail-time, which lowers the downstream watermark trail-time only if that source was the dominant one.
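
A minimal test making the dominance concrete, using the per-source max-lateness table from above:

#[test]
fn min_rule_is_dominated_by_the_slowest_source() {
    use std::time::{Duration, SystemTime};
    let now = SystemTime::now();
    // Per-source watermarks trailing `now` by each source's max-lateness.
    let radar = now - Duration::from_millis(100);
    let optical = now - Duration::from_secs(30);
    let isl = now - Duration::from_secs(10);
    let downstream = [radar, optical, isl].into_iter().min().unwrap();
    // Tightening radar or ISL changes nothing; only optical moves the min.
    assert_eq!(downstream, optical);
}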

Implementation in code: the operator tracks the most recent watermark per input channel, recomputes the minimum on every watermark item, and emits a new output watermark when the minimum advances.

The Aggressive-vs-Conservative Tradeoff

The watermark designer's main lever is the per-source max-lateness M. Aggressive M (small) → fast watermark → fast window close → low pipeline latency, but late events arriving past the watermark are dropped or pushed into allowed-lateness state. Conservative M (large) → slow watermark → slow window close → higher pipeline latency, but few late events are missed.

The right setting is operational, not theoretical. For the SDA pipeline's 30-second conjunction-detection SLA, the source-side max-lateness values above produce a downstream watermark that trails real time by ~30 seconds (dominated by the optical source) — consuming essentially the whole SLA budget, so windowed correlators must close and emit promptly once the watermark permits. The aggressive-vs-conservative tradeoff is per-pipeline; we set defaults that match SDA and document them in the per-source code.


Code Examples

A Source That Emits Both Observations and Watermarks

The radar UDP source from M1 emits only Observations. We extend it to interleave watermark items on a wall-clock cadence. The pattern is the same for every source; the per-source max-lateness M is the only parameter that changes.

use std::time::{Duration, Instant, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;

pub enum SourceItem {
    Observation(Observation),
    Watermark(SystemTime),
}

/// Wraps an existing source and interleaves periodic watermarks based
/// on observed event-time and the source's documented max-lateness.
pub async fn run_source_with_watermarks<S>(
    mut source: S,
    output: mpsc::Sender<SourceItem>,
    max_lateness: Duration,
    watermark_interval: Duration,
) -> Result<()>
where
    S: ObservationSource,
{
    let mut last_watermark_emit = Instant::now();
    let mut max_observed_event_time = SystemTime::UNIX_EPOCH;

    loop {
        match source.next().await? {
            Some(obs) => {
                if obs.sensor_timestamp > max_observed_event_time {
                    max_observed_event_time = obs.sensor_timestamp;
                }
                output.send(SourceItem::Observation(obs)).await
                    .map_err(|_| anyhow::anyhow!("downstream dropped"))?;

                // Emit a watermark on cadence, regardless of event rate.
                if last_watermark_emit.elapsed() >= watermark_interval {
                    let wm = max_observed_event_time
                        .checked_sub(max_lateness)
                        .unwrap_or(SystemTime::UNIX_EPOCH);
                    output.send(SourceItem::Watermark(wm)).await
                        .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
                    last_watermark_emit = Instant::now();
                }
            }
            None => return Ok(()),
        }
    }
}

The watermark value is max_observed_event_time - max_lateness — the most recent event-time the source has seen, minus the source's documented worst-case lateness. The watermark monotonically advances because max_observed_event_time does and max_lateness is constant. The cadence (watermark_interval) throttles watermark traffic independently of event rate. Note that as written a watermark is only emitted when an observation arrives; a real production source also emits watermarks during idle gaps via a tokio::time::interval — important so a downstream operator's windows do not sit idle waiting for events that are not coming. We elide that here for clarity, but the capstone implementation includes it, and the capstone hints sketch the select-against-interval pattern.

A Fan-In Operator That Computes Min-of-Inputs

The normalize operator from M1 fanned three sources into one channel. With watermarks, the fan-in must compute its output watermark as the minimum of the most recent watermarks from each input. The implementation tracks per-input watermarks in a Vec<Option<SystemTime>> and recomputes the min on each watermark item.

use std::time::SystemTime;
use anyhow::Result;
use tokio::sync::mpsc;

/// Fan-in normalize operator that consumes from N upstream channels
/// (each carrying SourceItem) and emits a single SourceItem stream
/// downstream with a properly-propagated min-of-inputs watermark.
pub async fn normalize_fan_in(
    mut inputs: Vec<mpsc::Receiver<SourceItem>>,
    output: mpsc::Sender<SourceItem>,
) -> Result<()> {
    let n = inputs.len();
    let mut input_watermarks: Vec<Option<SystemTime>> = vec![None; n];
    let mut last_emitted_watermark: Option<SystemTime> = None;

    // Simplified select: real implementation uses select_all from
    // futures::future for arbitrary N. Here we sketch the per-input
    // handling for clarity.
    for input_idx in 0..n {
        // ... in a real implementation, all inputs are polled
        // concurrently via select_all; this loop is illustrative.
        while let Some(item) = inputs[input_idx].recv().await {
            match item {
                SourceItem::Observation(obs) => {
                    let normalized = normalize(obs);
                    output.send(SourceItem::Observation(normalized)).await
                        .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
                }
                SourceItem::Watermark(wm) => {
                    input_watermarks[input_idx] = Some(wm);
                    // Compute min only when every input has reported at
                    // least one watermark. Until then, the operator's
                    // output watermark is undefined.
                    if input_watermarks.iter().all(|w| w.is_some()) {
                        let new_wm = input_watermarks
                            .iter()
                            .map(|w| w.unwrap())
                            .min()
                            .unwrap();
                        if Some(new_wm) > last_emitted_watermark {
                            output.send(SourceItem::Watermark(new_wm)).await
                                .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
                            last_emitted_watermark = Some(new_wm);
                        }
                    }
                }
            }
        }
    }
    Ok(())
}

Three subtle points. First, the output watermark is undefined until every input has reported at least one watermark. A fan-in with three sources where one source has not yet sent a watermark cannot propagate a min — there is no upper bound on what that source's watermark might be once it arrives, so any min computed without it is unsafe. The fix is structural: hold downstream watermark emission until every input is heard from. Second, the operator emits a new output watermark only when the min strictly advances. Re-emitting the same watermark would be correct but wasteful; the strict-advance check throttles the per-event watermark traffic to what is operationally meaningful. Third, the per-input bookkeeping is intentionally simple — a Vec<Option<SystemTime>> indexed by input position, no fancier structure needed. Production code that joins many inputs uses the same pattern with Vec lengths in the dozens; the constant-factor cost of the .min() recomputation on each watermark item is negligible at any realistic input count.

Wiring Watermarks Into the Tumbling Window Operator

Lesson 2's TumblingWindow::close_up_to(watermark) becomes the consumer of watermark items. The operator no longer has its own close logic; it reacts to the watermark stream the upstream produced.

use std::time::SystemTime;
use anyhow::Result;
use tokio::sync::mpsc;

/// Drive a TumblingWindow operator from a single SourceItem stream
/// that interleaves Observations with Watermarks. Observations are
/// ingested into the window state; watermarks trigger close-up-to.
pub async fn run_tumbling_with_watermarks(
    mut window_op: TumblingWindow,
    mut input: mpsc::Receiver<SourceItem>,
) -> Result<()> {
    while let Some(item) = input.recv().await {
        match item {
            SourceItem::Observation(obs) => {
                window_op.ingest(obs);
            }
            SourceItem::Watermark(wm) => {
                window_op.close_up_to(wm).await?;
            }
        }
    }
    Ok(())
}

The pattern is the structural property the lesson promised. The operator does not decide its own close; it consumes a watermark stream that supplies the close trigger. The same shape applies to sliding-window operators (which evict per-key state on watermark advance), session-window operators (which close sessions whose session_end + gap is past the watermark), and any other event-time-windowed operator. The watermark is the universal close trigger.


Key Takeaways

  • A watermark is a per-event-time guarantee: Watermark(t) means "no event with event_time < t will arrive after this point." It is the only correct close trigger for event-time windows; wall-clock-based close drops late events whose event_time precedes the wall-clock cutoff.
  • Heuristic watermarks (the production default) are computed as max_observed_event_time - max_lateness, where max_lateness is a per-source documented bound. Tighter max_lateness → faster watermark → lower pipeline latency, at the cost of more events arriving past the watermark.
  • The min-of-inputs rule propagates watermarks through fan-in operators: the output watermark is the minimum of the most recent watermarks from each input. We can only guarantee what the slowest upstream guarantees. The slowest source dominates the downstream watermark.
  • Watermarks are interleaved with data items on the same channel (enum SourceItem { Observation(_), Watermark(_) }). In-band ordering keeps the watermark's relationship to the data items it bounds coherent; a separate side-channel can produce apparent regressions.
  • The windowed operator does not decide its own close. It consumes a watermark stream and closes any window whose end ≤ the watermark. This decoupling is what makes windowed operators composable across the pipeline and re-usable across window shapes.

Lesson 4 — Late Data

Module: Data Pipelines — M03: Event Time and Watermarks Position: Lesson 4 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Out-of-Order Events and corrections); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Late-Arriving Data); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 14 (Out-of-Sequence Events and Reprocessing)


Context

A watermark is a guarantee, but the guarantee is built on an estimate — the per-source max_lateness. The estimate is calibrated to cover the typical worst case, not every possible case. Eventually a source's actual lateness exceeds its estimate. The optical archive's polling cadence runs slow because the partner's API is degraded; an observation arrives 35 seconds after its event time when max_lateness was set to 30 s. The radar's fiber path takes a 200-ms detour through a ground-station relay; an observation arrives 250 ms after its event time when max_lateness was 100 ms. In each case, the observation arrives after the watermark for its window has already passed. The window has already been closed and emitted. The observation is late.

Three things can happen with a late event. Drop it. Cheap, lossy, the default in many systems. Accumulate it into a re-opened window. Hold the window's state in a holding pattern past its watermark close, accept events into it for a bounded allowed lateness period, re-emit the window's result on each accepted late event. Medium cost, requires the downstream to handle re-emissions. Retract and re-emit. Emit a "negation" of the previously-emitted result, then emit a corrected one. Most expensive, requires the downstream to handle retractions, gives the strongest correctness guarantee.

The choice is operational. For SDA's conjunction-alert pipeline, the cost of a missed alert (a real conjunction not flagged) is collision risk; the cost of a phantom alert (a false conjunction emitted then never corrected) is a needless avoidance maneuver burning satellite fuel. The right choice depends on which cost is being weighed against what. For high-rate dashboards (events-per-second, throughput summaries), drop is fine — individual late events do not change the aggregate. For batched analytics where eventual completeness matters, accumulate-with-bound is the right shape. For human-actionable alerts, retract-and-correct is necessary because a phantom alert that is never withdrawn produces lasting downstream effects. This lesson covers all three, develops the implementation patterns, and closes the module by tying back to the orchestrator's at-least-once-plus-idempotency frame from Module 2.


Core Concepts

The Three Strategies

Drop. When a late event arrives — its sensor_timestamp falls within an already-closed window — discard it. The window's emitted result remains the canonical answer. The lost event contributes nothing. Implementation cost: minimal. Correctness cost: events that should have contributed to the result do not. Best for high-rate streams where individual events are statistically insignificant.

Accumulate (Allowed Lateness). Each window's state is held for a bounded allowed_lateness period past its watermark-driven close. Late events arriving within [window_end, window_end + allowed_lateness] in event time are accepted into the window's state, and the window's result is re-emitted with the additional event included. After the allowed-lateness period expires (the watermark advances past window_end + allowed_lateness), the window's state is finally evicted; events arriving past that point are dropped. Cost: state held longer; downstream must handle re-emission of previously-emitted results. Best for analytics-style pipelines where some delay is acceptable.

Retract. When a late event arrives within a previously-emitted window's lateness period, emit two records: a retraction of the previously-emitted result, then an insertion of the corrected result. The downstream consumer treats retraction-then-insertion as an atomic correction. Cost: highest — requires retraction-aware downstream. Best for human-actionable outputs where a wrong result that is never corrected is worse than a delayed-but-correct one.

The three are a progression: drop is a special case of accumulate-with-zero-allowed-lateness; accumulate is a special case of retract that does not bother distinguishing the previous result from the next; retract is the most general. Most production pipelines use a mix: drop for non-critical metrics, accumulate for batched reporting, retract for human-actionable alerts.

Allowed Lateness, Concretely

The accumulate strategy parameterizes allowed_lateness per operator. A windowed correlator with allowed_lateness = 5s holds each closed window's state for 5 seconds (in event time, not wall clock) past its watermark close. In code, this means: when the watermark advances past window_end, the operator emits the window's result but does not free the window's state; the state stays in a retained state. Late events arriving while the watermark is in [window_end, window_end + allowed_lateness] are added to the retained state and trigger a re-emission. When the watermark advances past window_end + allowed_lateness, the state is finally evicted.

The memory cost of allowed lateness scales with the ratio allowed_lateness / window_size: total state is roughly (1 + allowed_lateness / window_size) × steady-state window memory. A 30-second window with 5-second allowed lateness holds ~1.17× the steady-state window memory; a 30-second window with 30-second allowed lateness doubles it. The per-key cardinality multiplies through the same ratio. For SDA's correlator at typical scales (tens of thousands of orbital objects, 30-second windows, 5-second allowed lateness), the additional memory is bounded and tractable.
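
A quick check of that arithmetic — a throwaway test, not pipeline code:

use std::time::Duration;

/// Total state as a multiple of steady-state window memory.
fn retention_multiplier(window: Duration, allowed: Duration) -> f64 {
    1.0 + allowed.as_secs_f64() / window.as_secs_f64()
}

#[test]
fn allowed_lateness_memory_ratios() {
    let w = Duration::from_secs(30);
    // 5 s of allowed lateness on a 30 s window: ~1.17x.
    assert!((retention_multiplier(w, Duration::from_secs(5)) - 7.0 / 6.0).abs() < 1e-9);
    // 30 s of allowed lateness doubles it.
    assert_eq!(retention_multiplier(w, Duration::from_secs(30)), 2.0);
}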

The operational tuning is per-pipeline. Aggressive allowed-lateness (small, e.g. 1s) → low memory cost, fast window finalization → late events past 1s are silently lost. Conservative allowed-lateness (large, e.g. 60s) → higher memory cost, slow finalization → almost all late events captured. The right choice depends on the lateness distribution of the slowest source. SDA's setting of 5s on top of optical's 30s max_lateness covers the long tail of optical archive delays without doubling memory.

Retractions

A retraction is a downstream-visible "undo." For a window that previously emitted result R1, the operator emits a retraction of R1 followed by an insertion of R2 (the corrected result). The downstream subscriber sees the pair as: invalidate the previously-stated R1; the correct value is now R2.

The implementation requires a sequence number on each emit so the downstream can match retractions to the correct previous emit. The convention this track uses: each window has a (window_id, sequence) pair on every emission, with sequence starting at 0 for the first emit and incrementing on each correction. The downstream uses (window_id, sequence) as the primary key with last-write-wins semantics — at any point in time, the latest sequence for a given window_id is the canonical answer.

The retraction emission shape:

pub enum WindowEmit {
    Insert { window_id: WindowId, sequence: u32, result: WindowResult },
    Retract { window_id: WindowId, sequence: u32 },
}

A retract-then-insert pair on window 17 looks like: Retract { window_id: 17, sequence: 1 } followed by Insert { window_id: 17, sequence: 2, result: corrected }. The downstream's stored state, keyed on window_id, is updated to sequence 2's result. The sequence number prevents a delayed retraction from being applied to a newer result emitted in the meantime: if the retraction is for sequence 1 but the downstream has already stored sequence 2, the retract is a no-op.

The Idempotency Requirement Downstream

Retractions only work when the downstream is idempotent on (window_id, sequence). A SQL sink with ON CONFLICT (window_id) DO UPDATE SET ... WHERE excluded.sequence > window_results.sequence (as in the sink code below) is idempotent. A Kafka topic where the consumer dedups on (window_id, sequence) is idempotent. An HTTP webhook that does not respect any keying is not idempotent — duplicate retractions or out-of-order retract/insert pairs produce wrong final state at the subscriber.

The pattern composes with Module 2's at-least-once-plus-idempotency frame: the pipeline emits at least once (with retries on transient failures) and the downstream is idempotent (on (window_id, sequence)), giving effective exactly-once delivery of the corrected stream. Module 5 covers the full machinery for cross-pipeline exactly-once, including transactional Kafka producers; this lesson establishes the pattern at the operator level.

Choosing the Strategy

A decision table for SDA's pipeline:

Output                                    Strategy         Reasoning
events_per_minute dashboard counter       Drop             Late events negligible at this granularity; dashboard precision is loose
conjunction_risk_summary analytics emit   Accumulate (5s)  Eventual completeness matters; some delay acceptable
conjunction_alert to subscriber           Retract          Phantom alerts cause real-world action (avoidance maneuvers); must be correctable
audit_log of every observation            Drop             The audit log is the input record, not a derived computation; latency is irrelevant

The decision is per-output, not per-pipeline. The same windowed correlator can produce different outputs with different strategies — a retract-emitting alert stream alongside a drop-emitting metrics stream. The implementation factors out the strategy as a per-output configuration that the operator dispatches on.
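
A sketch of what that per-output configuration might look like — the enum name and payloads here are assumptions, not the module's canonical definition:

use std::time::Duration;

/// Hypothetical per-output lateness policy the operator dispatches on.
pub enum LatenessStrategy {
    /// Discard late events; the first emitted result is final.
    Drop,
    /// Hold closed-window state for the budget; re-emit on late events.
    Accumulate { allowed_lateness: Duration },
    /// Like Accumulate, but emit Retract before each corrected Insert.
    Retract { allowed_lateness: Duration },
}

pub struct OutputConfig {
    pub name: &'static str,
    pub strategy: LatenessStrategy,
}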


Code Examples

Allowed-Lateness Window Operator

The L2 tumbling-window operator extended with allowed-lateness retention. Window state is held in two tiers: active (window not yet closed) and retained (window closed by watermark, held for late events for allowed_lateness).

use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};
use anyhow::Result;
use tokio::sync::mpsc;

pub struct AllowedLatenessWindow {
    /// Active windows: state still being accumulated, watermark hasn't passed.
    active: BTreeMap<SystemTime, WindowState>,
    /// Retained windows: emitted once, held for allowed_lateness in case
    /// late events arrive. Will be re-emitted on each accepted late event.
    retained: BTreeMap<SystemTime, WindowState>,
    window_size: Duration,
    allowed_lateness: Duration,
    epoch: SystemTime,
    output: mpsc::Sender<WindowEmit>,
    /// Highest watermark seen so far; gates the too-late drop in ingest.
    last_watermark: Option<SystemTime>,
}

#[derive(Debug, Clone)]
struct WindowState {
    observations: Vec<Observation>,
    sequence: u32,
}

impl AllowedLatenessWindow {
    pub fn new(
        window_size: Duration,
        allowed_lateness: Duration,
        epoch: SystemTime,
        output: mpsc::Sender<WindowEmit>,
    ) -> Self {
        Self {
            active: BTreeMap::new(),
            retained: BTreeMap::new(),
            window_size,
            allowed_lateness,
            epoch,
            output,
            last_watermark: None,
        }
    }

    /// Ingest an observation, dispatching on whether it lands in an
    /// active window or a retained (late) window.
    pub async fn ingest(&mut self, obs: Observation) -> Result<()> {
        let window_end = self.window_end_for(obs.sensor_timestamp);
        if let Some(state) = self.active.get_mut(&window_end) {
            state.observations.push(obs);
            return Ok(());
        }
        if let Some(state) = self.retained.get_mut(&window_end) {
            // Late event into a retained window — accept and re-emit.
            state.observations.push(obs);
            state.sequence += 1;
            self.emit(window_end, state.clone(), EmitKind::Insert).await?;
            return Ok(());
        }
        // Too late: the window's lateness budget has expired and its
        // retained state was already evicted. Drop silently (production
        // code increments a late_events_dropped metric here).
        if let Some(cutoff) = self
            .last_watermark
            .and_then(|wm| wm.checked_sub(self.allowed_lateness))
        {
            if window_end <= cutoff {
                return Ok(());
            }
        }
        // Fresh window.
        self.active.insert(
            window_end,
            WindowState { observations: vec![obs], sequence: 0 },
        );
        Ok(())
    }

    /// Watermark advance: close every active window whose end ≤ watermark
    /// (move to retained); evict every retained window whose end +
    /// allowed_lateness ≤ watermark (final eviction, no more late events
    /// accepted for this window).
    pub async fn on_watermark(&mut self, watermark: SystemTime) -> Result<()> {
        self.last_watermark = Some(watermark);
        // Move closed-by-watermark windows from active to retained, emitting
        // the initial result on the way through.
        let still_active = self.active.split_off(&(watermark + Duration::from_nanos(1)));
        let to_close = std::mem::replace(&mut self.active, still_active);
        for (window_end, state) in to_close {
            self.emit(window_end, state.clone(), EmitKind::Insert).await?;
            self.retained.insert(window_end, state);
        }
        // Evict retained windows whose lateness budget is exhausted.
        let retain_cutoff = watermark.checked_sub(self.allowed_lateness)
            .unwrap_or(SystemTime::UNIX_EPOCH);
        let still_retained = self.retained.split_off(&(retain_cutoff + Duration::from_nanos(1)));
        let to_evict = std::mem::replace(&mut self.retained, still_retained);
        // Evicted windows are dropped for good; later events for them hit
        // the too-late drop path in ingest. The retraction strategy below
        // covers corrections while window state is still retained.
        drop(to_evict);
        Ok(())
    }

    fn window_end_for(&self, ts: SystemTime) -> SystemTime {
        let offset = ts.duration_since(self.epoch).unwrap_or_default();
        let idx = offset.as_secs() / self.window_size.as_secs().max(1);
        self.epoch + self.window_size * (idx + 1) as u32
    }

    /// Accumulate-only emit: always an Insert at the state's current
    /// sequence. The retracting variant below dispatches on the kind.
    async fn emit(&self, window_end: SystemTime, state: WindowState, _kind: EmitKind) -> Result<()> {
        let result = WindowResult {
            window_end,
            count: state.observations.len(),
        };
        self.output.send(WindowEmit::Insert {
            window_id: WindowId(window_end),
            sequence: state.sequence,
            result,
        }).await
            .map_err(|_| anyhow::anyhow!("downstream dropped"))
    }
}

enum EmitKind { Insert, Retract }

The two-tier state — active and retained — is the structural pattern. Active windows accumulate; the watermark advance moves them to retained and triggers their first emit; retained windows can still receive late events and re-emit; final eviction once the watermark passes window_end + allowed_lateness frees the state for good, after which any event for that window is dropped on ingest. The two-tier structure makes the lateness behavior explicit rather than implicit; an operator with a single tier and ad-hoc "late event" handling tends to grow correctness bugs as requirements accumulate. The cost of two tiers is a few extra BTreeMap operations per watermark advance — negligible at any realistic scale.

A Retracting Operator

The retract strategy emits explicit Retract records before each Insert of a corrected result. The downstream is responsible for processing the pair atomically. The operator's emit logic factors slightly differently than the accumulate-only version above.

async fn emit_retract_then_insert(
    output: &mpsc::Sender<WindowEmit>,
    window_end: SystemTime,
    prev_sequence: u32,
    new_state: &WindowState,
) -> Result<()> {
    let window_id = WindowId(window_end);
    // Retract the previously-emitted sequence.
    output.send(WindowEmit::Retract {
        window_id,
        sequence: prev_sequence,
    }).await
        .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
    // Then the corrected result at the new sequence.
    output.send(WindowEmit::Insert {
        window_id,
        sequence: new_state.sequence,
        result: WindowResult {
            window_end,
            count: new_state.observations.len(),
        },
    }).await
        .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
    Ok(())
}

The retraction must be emitted before the corrected insert. Otherwise the downstream sees insert-then-retract — the corrected result lands first, gets retracted, and the downstream's stored state ends up empty. The two-step pattern depends on the channel preserving FIFO ordering, which mpsc::Sender::send does. The sequence on the retract is the previous sequence; the sequence on the insert is the new one. The downstream, keyed on (window_id, sequence), processes the two records in order and ends up with the corrected state. This is the pattern that makes retract-and-correct safe under at-least-once delivery: the downstream's last-write-wins semantics absorb duplicate or out-of-order retract/insert pairs as long as the sequence ordering is respected.

A Retraction-Aware Sink

The sink that consumes the retractor's output. It uses an embedded SQLite database as its idempotent state store. The schema is (window_id, sequence, result_blob), with window_id as the primary key (the UPSERT's ON CONFLICT target); on conflict by window_id, the higher sequence wins.

use rusqlite::{params, Connection};
use std::time::SystemTime;

const UPSERT_SQL: &str = r#"
    INSERT INTO window_results (window_id, sequence, result_blob)
    VALUES (?1, ?2, ?3)
    ON CONFLICT (window_id) DO UPDATE
    SET sequence = excluded.sequence, result_blob = excluded.result_blob
    WHERE excluded.sequence > window_results.sequence
"#;

const RETRACT_SQL: &str = r#"
    DELETE FROM window_results
    WHERE window_id = ?1 AND sequence = ?2
"#;

pub fn apply_emit(conn: &Connection, emit: WindowEmit) -> rusqlite::Result<()> {
    match emit {
        WindowEmit::Insert { window_id, sequence, result } => {
            let blob = serde_json::to_vec(&result).unwrap();
            conn.execute(UPSERT_SQL, params![
                window_id.0.duration_since(SystemTime::UNIX_EPOCH).unwrap().as_nanos() as i64,
                sequence,
                blob,
            ])?;
        }
        WindowEmit::Retract { window_id, sequence } => {
            conn.execute(RETRACT_SQL, params![
                window_id.0.duration_since(SystemTime::UNIX_EPOCH).unwrap().as_nanos() as i64,
                sequence,
            ])?;
        }
    }
    Ok(())
}

The WHERE excluded.sequence > window_results.sequence clause is what makes the UPSERT idempotent: a duplicate insert at sequence N (delivered twice by at-least-once) produces no change because the comparison is strict. A retraction's WHERE clause matches on both window_id and sequence, so a stale retraction (window_id matches but sequence is below the current stored value) deletes nothing — exactly the right behavior. The composition with the at-least-once-with-retries delivery layer (Module 2's with_retry) gives the exactly-once-effective property the lesson promises.
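
The ON CONFLICT (window_id) clause only works if window_id carries a uniqueness constraint. A minimal schema sketch (column types are assumptions) that satisfies it; INTEGER PRIMARY KEY matches the nanosecond key apply_emit writes:

use rusqlite::Connection;

const CREATE_SQL: &str = r#"
    CREATE TABLE IF NOT EXISTS window_results (
        window_id   INTEGER PRIMARY KEY,
        sequence    INTEGER NOT NULL,
        result_blob BLOB    NOT NULL
    )
"#;

/// Run once at sink startup, before the first apply_emit call.
pub fn init_schema(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute(CREATE_SQL, [])?;
    Ok(())
}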


Key Takeaways

  • Late events arrive when a source's actual lateness exceeds its max_lateness estimate. The pipeline has three strategies: drop (cheap, lossy), accumulate-with-allowed-lateness (medium cost, eventual completeness), retract-and-correct (highest cost, strongest correctness guarantee). The choice is per-output, not per-pipeline.
  • Allowed lateness holds window state for a bounded period past the watermark close. The state lives in a retained tier alongside the active tier; late events into retained windows trigger re-emit; final eviction once the watermark passes window_end + allowed_lateness frees the state. Memory scales as (1 + allowed_lateness / window_size) × steady_state_memory.
  • Retractions emit a Retract of the previously-emitted result before each corrected Insert. The downstream is keyed on (window_id, sequence) with last-write-wins semantics; sequence numbers prevent delayed retractions from clobbering newer results. Retract-then-insert ordering is FIFO-channel-dependent and must not be reordered.
  • Retractions only work with idempotent downstream: SQL ON CONFLICT DO UPDATE keyed on window_id with strict-greater sequence comparison, Kafka consumers that dedup on (window_id, sequence), or any sink whose effect on the world is keyed on the same identifier the operator emits. Non-idempotent downstreams produce wrong final state under retry.
  • The lateness machinery composes with Module 2's at-least-once-plus-idempotency frame: the operator emits at least once with retries on transient failures, the downstream is idempotent on (window_id, sequence), giving effective exactly-once delivery of the corrected stream. Module 5 generalizes this pattern to cross-pipeline boundaries.

Capstone Project — Conjunction Window Engine

Module: Data Pipelines — M03: Event Time and Watermarks Estimated effort: 1–2 weeks of focused work Prerequisites: All four lessons in this module passed at ≥70%


Mission Brief

OPS DIRECTIVE — SDA-2026-0142 / Phase 3 Implementation Classification: CORRELATION TIER UPGRADE

The Phase 2 orchestrator from Module 2 is in production and stable. The dedup operator at the top of the correlation tier is processing-time-based, which is correct for ingestion deduplication but wrong for cross-sensor correlation. Internal review of conjunction alerts from the past quarter found a 2.3% rate of cross-sensor mismatches traceable to optical-vs-radar arrival skew straddling the 5-second dedup window. Two of the missed correlations were later determined to be real conjunctions, surfaced only by post-hoc batch reprocessing. The fix is structural: replace processing-time dedup with event-time windowed correlation that respects sensor_timestamp regardless of arrival skew.

Success criteria for Phase 3: cross-sensor correlations are computed by event time using the watermark protocol; the per-source max-lateness values from the M3 lesson (radar 100ms, optical 30s, ISL 10s) are respected; allowed-lateness of 5s captures the long tail of optical archive delays; conjunction alerts emit a retraction when a late observation invalidates a previously-emitted alert.


What You're Building

Replace the dedup operator from M2's pipeline with a windowed event-time correlator that emits ConjunctionRisk envelopes when two orbital objects' observations within the same event-time window indicate a close approach.

The deliverable is:

  1. A WatermarkSource trait extending the M1 ObservationSource to interleave watermarks with observations on the channel
  2. Wrapped versions of M1's three sources (radar, optical, ISL) that emit per-source watermarks per the lesson's max-lateness table
  3. A min-of-inputs watermark-propagating fan-in normalize operator
  4. A WindowedCorrelator operator that holds per-key sliding windows of observations, emits ConjunctionRisk envelopes when pairs cross a configured proximity threshold, and supports allowed-lateness retention
  5. A retraction-aware alert sink that emits WindowEmit::{Insert, Retract} records via a sequence-number-keyed downstream
  6. A test harness that drives the pipeline with synthetic out-of-order events and verifies correctness across replay

The orchestrator from M2 is unchanged; only one operator's implementation changes (dedup → correlator). The OperatorGraph declaration is updated to reflect the new operator. Refresh the operational README to document the new metrics (watermark trail per source, allowed-lateness eviction count, retractions emitted).


Suggested Architecture

                                                          ┌─────────────────┐
   ┌───────────────┐  SourceItem                          │  Alert Sink     │
   │ Radar Source  │═══════════════╗                      │  (retract-aware)│
   │  + watermarks │               ║                      │                 │
   └───────────────┘               ║                      └────────▲────────┘
                                   ║                               │
   ┌───────────────┐  SourceItem   ▼                               │
   │ Optical Src   │═══════>┌────────────┐    SourceItem    ┌──────────────┐
   │  + watermarks │═══════>│  Normalize │═════════════════>│   Windowed   │
   └───────────────┘        │  Fan-In    │                  │  Correlator  │
                            │  min-WM    │                  │  (sliding,   │
   ┌───────────────┐  ═════>└────────────┘                  │   allowed-   │
   │  ISL Source   │                                        │   lateness)  │
   │  + watermarks │                                        └──────────────┘
   └───────────────┘

Each source runs its own task (orchestrated by M2's OperatorGraph). Each source emits enum SourceItem { Observation(_), Watermark(_) } on its outgoing channel. The fan-in normalize operator consumes from all three and emits a single SourceItem stream with min-of-inputs watermark propagation. The windowed correlator consumes that stream, holds per-orbital-object sliding windows, computes pairwise close-approach proximity within each window, and emits to the alert sink. The orchestrator's restart, retry, and circuit-breaker machinery from M2 wraps all of this without modification.


Acceptance Criteria

Functional Requirements

  • WatermarkSource trait with method next() -> Result<Option<SourceItem>>; the existing M1 ObservationSource is wrapped via the lesson's run_source_with_watermarks helper
  • Each source's max_lateness matches the lesson's table: radar 100ms, optical 30s, ISL 10s
  • Watermarks emitted at a wall-clock cadence (default 1 second) AND on-demand whenever the observed event_time advances; idle sources still advance their watermark
  • Fan-in normalize operator computes min-of-inputs watermark; output watermark held until every input has reported at least once
  • Windowed correlator uses sliding windows keyed on (object_a_id, object_b_id); window size 30s, allowed lateness 5s
  • ConjunctionRisk emit triggered when two observations of distinct objects within the same window indicate proximity below threshold
  • Late events into already-emitted windows trigger a retraction-then-insert pair on the alert channel; sequence numbers on every emit
  • Alert sink uses an embedded SQLite (or equivalent) keyed on (window_id, sequence) with strict-greater UPSERT semantics

Quality Requirements

  • Replay test for correctness: deterministic test injecting a fixed event sequence in event-time order, then re-running with the same events injected in random arrival order. The final state of the alert sink must be byte-identical between the two runs.
  • Watermark progress test: a unit test feeds watermarks into the fan-in operator and asserts the output watermark advances per the min-of-inputs rule and not before all inputs have reported
  • Allowed-lateness test: a unit test injects a late event into a retained window and asserts the retraction-then-insert pair is emitted; injects a late event after the lateness budget has expired and asserts it is dropped silently with the corresponding metric incremented
  • Memory bound test: under sustained load, the correlator's per-key window state stays bounded by (window_size + allowed_lateness) × per_key_event_rate; an integration test asserts steady-state memory for a synthetic workload

Operational Requirements

  • /metrics extends M2's: source_watermark_seconds{source} (gauge), pipeline_watermark_seconds (gauge — the min-of-inputs at fan-in), pending_windows{tier} (gauge — split between active and retained), late_events_dropped_total (counter), retractions_emitted_total (counter)
  • Lag dashboard split into source lag and pipeline lag per the M3 L1 framing; the pipeline lag panel makes "is the problem ours or theirs" answerable in seconds
  • Operational runbook section "Reading the Watermark Dashboard" documenting how to interpret a watermark stall (which source's max-lateness is dominant; what the per-source values are; what tightens what)

Self-Assessed Stretch Goals

  • (self-assessed) Sustain 50K observations/sec input with a 30-second window, P99 emit latency < 1 second after watermark advance. Provide a criterion benchmark and a flame graph showing where the per-event cost lives.
  • (self-assessed) Replay-correctness test runs against a corpus of 10 randomly-shuffled arrival orders and produces byte-identical final state in every case
  • (self-assessed) The operational dashboard exposes a "watermark stall" alert that fires when the pipeline watermark fails to advance for > 60 seconds, distinguishing source-side stalls from pipeline-side stalls in the alert text

Hints

How do I generate watermarks for an event-driven UDP source like the radar where there is no natural "tick"?

Two interleaved triggers. On observation: each observation updates the source's max_observed_event_time and (every Nth observation, or whenever wall-clock has advanced past the watermark interval) emits a watermark of max_observed_event_time - max_lateness. On idle: a tokio::time::interval ticking at the watermark interval emits a watermark even when no observations have arrived recently — important during quiet periods so the downstream operator's windows do not stall waiting for the source to wake up.

use tokio::select;

let mut interval = tokio::time::interval(watermark_interval);
loop {
    select! {
        obs = source.next() => {
            // Emit Observation, update max_observed_event_time,
            // possibly emit Watermark.
        }
        _ = interval.tick() => {
            // Emit Watermark even if no observations arrived;
            // the source's clock is what advances the watermark
            // during idle periods.
        }
    }
}

The select-against-interval pattern is the same one Module 2 used for the source-side retry timer. Reusable.

How do I efficiently store retained windows alongside active windows?

Two BTreeMap<SystemTime, WindowState> — one for active, one for retained — and a small dispatch on ingest. The on_watermark step uses BTreeMap::split_off for both maps to do prefix range eviction in O(log N). The combined cost per watermark advance is two prefix splits + one drain over the closed-active set; for hundreds of windows, this is sub-millisecond.

A single BTreeMap with a per-window enum tag (Active or Retained) also works and saves one allocation; both designs are fine. The two-map design makes the operations more obvious in the code and is the one the lesson uses.
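
For concreteness, the single-map alternative looks like this — a sketch; the lesson's code uses the two-map design:

use std::collections::BTreeMap;
use std::time::SystemTime;

/// Single-map layout: one ordered map, each window tagged with its tier.
/// Watermark advance iterates the prefix up to the watermark, flipping
/// Active to Retained, and removes entries whose end is at or below
/// watermark - allowed_lateness.
enum Tier {
    Active,
    Retained,
}

struct TaggedWindow {
    tier: Tier,
    state: WindowState,
}

type Windows = BTreeMap<SystemTime, TaggedWindow>;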

How do I deterministically test replay correctness?

Two ingredients: a fixed input set of observations with known event_times, and a way to inject them in a specific arrival order. The test runs the same input set through the pipeline twice — once in event-time order, once in a deliberately scrambled order — and asserts the final state of the alert sink is identical between the two runs.

#[tokio::test]
async fn replay_correctness_under_scrambled_arrival() {
    let observations = build_test_observations();  // 1000 observations, event-time ordered
    let scrambled = {
        let mut o = observations.clone();
        // Shuffle deterministically with a fixed seed.
        use rand::seq::SliceRandom;
        use rand::rngs::StdRng;
        use rand::SeedableRng;
        o.shuffle(&mut StdRng::seed_from_u64(42));
        o
    };
    let result_ordered = run_pipeline_to_completion(&observations).await;
    let result_scrambled = run_pipeline_to_completion(&scrambled).await;
    assert_eq!(result_ordered, result_scrambled,
               "pipeline output should be invariant to arrival order");
}

The run_pipeline_to_completion helper drains the alert sink at the end and returns the stored state (the SQLite contents serialized to a comparable form). The assertion is the correctness property the watermark protocol is supposed to give you; if it fails, the bug is in the operator's allowed-lateness logic or the retract-then-insert ordering.

How small can the safety margin be on retained-window eviction?

The eviction happens at watermark - allowed_lateness. With watermark = max_observed_event_time - max_lateness, the eviction occurs when the latest event-time has advanced past window_end + allowed_lateness + max_lateness. So the real event-time-domain retention is allowed_lateness + max_lateness, not just allowed_lateness. The pipeline lag adds a third term: allowed_lateness + max_lateness + pipeline_lag.

For SDA's defaults: 5 s allowed lateness + 30 s optical max-lateness + ~1 s pipeline lag = 36 seconds of total state retention. A budget of ~40 s for the correlator's worst-case per-window memory is a reasonable plan-for-the-tail value. Operations should monitor the actual eviction lag (now - retained_window_evict_event_time) and alert when it grows past, say, 120 seconds — that signals either pipeline lag growth or a stuck source.

How do I avoid duplicate retractions when the operator restarts?

The operator's emit history is in-memory; on restart, it has no idea what it already emitted. M2's at-least-once-plus-idempotent composition saves us here: a duplicate emit (the same window_id, sequence) is absorbed by the sink's strict-greater UPSERT comparison. The restart re-emits the same sequence numbers it emitted before; the sink stores the result the same way it was already stored; no observable change. A late event arriving for a previously-retained window after the operator restarts and that window is no longer in retained state is silently dropped — same as if the operator had not restarted but the lateness had simply expired.

For the full crash-safety story (where the restart should resume from a checkpoint of in-flight window state), see Module 5. This module's pipeline survives restarts but loses some allowed-lateness windows during the restart window — acceptable for SDA's tolerance.


Getting Started

Recommended order:

  1. SourceItem enum and the watermark-emitting source wrapper. Implement once; reuse across all three sources. Unit-test by feeding fixed observations and asserting the watermarks emitted match the documented formula.
  2. Min-of-inputs fan-in. Implement against three synthetic input channels in tests; do the unit test for the "hold until all inputs report" case explicitly.
  3. Sliding-window state per key. Reuse the L2 SlidingWindow primitive; add the per-key dispatch in the correlator. Test eviction with manual event-time scenarios.
  4. Allowed-lateness retention. Add the retained tier; wire watermark advance to move active→retained and to evict expired retained.
  5. Conjunction-risk computation. The actual proximity logic — given two observations of distinct objects within the same window, decide whether they form a ConjunctionRisk. The math is out of scope for this curriculum; a stub compute_proximity(obs_a, obs_b) -> f64 returning a deterministic number based on inputs is sufficient for testing the pipeline structure.
  6. Retract-and-correct emit logic. When a late event lands in a retained window, emit Retract { sequence: N } followed by Insert { sequence: N+1, result: corrected }.
  7. Retraction-aware sink. Embedded SQLite with the UPSERT pattern from L4. Test replay-after-restart correctness by killing and restarting the pipeline mid-stream.
  8. Replay-correctness integration test. The byte-identical-output-across-arrival-orders test from the hint.
  9. Refresh the operational README and the dashboard.

Aim for the sliding-window correlator with min-of-inputs watermark propagation working end-to-end by day 7; allowed-lateness and retractions can land in days 8-10. The replay-correctness test is the canary that catches every windowing bug; if it fails, stop and diagnose before adding features.


What This Module Sets Up

In Module 4 you will harden the channel boundaries against burst load. The bounded-channel-per-edge invariant from M2 plus the watermark machinery from M3 are both inputs to that work — burst load affects watermark advance, and watermark stall affects late-event handling. The flow-policy machinery developed in M4 plugs in upstream of the correlator without changing its windowing logic.

In Module 5 you will make the correlator's window state crash-safe. The two-tier (active/retained) state structure is exactly what the checkpoint code serializes. The sequence numbers on emits are exactly what the at-least-once-with-checkpoint recovery uses to replay safely. M3's correctness foundation is what M5's durability layer rests on.

In Module 6 you will surface the watermark progress as the master observability metric. The per-source watermark gauges, the pipeline watermark gauge, and the retained-window count are the diagnostic dashboard's first-row panels. The lag-distinct-from-watermark distinction is the framing that makes the dashboard usable.

The correlator built here is the canonical event-time-windowed operator. Every subsequent module's project either extends it directly or applies the same patterns to a different operator.

Module 04 — Backpressure and Flow Control

Track: Data Pipelines — Space Domain Awareness Fusion Position: Module 4 of 6 Source material: Async Rust — Maxwell Flitton & Caroline Morton, the chapters on tokio::sync::mpsc semantics and channel patterns; Network Programming with Rust — Abhishek Chanda, sections on application-level flow control and TCP windowing as transport-level backpressure; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (Producer Configuration and max.in.flight.requests.per.connection); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Buffering and Pushback) Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

OPS ALERT — SDA-2026-0188 Classification: BURST-LOAD HARDENING Subject: Replace uniform load shed with priority-aware FlowPolicy

Last week's anti-satellite test added 1,800 newly tracked debris objects to the catalog within 90 seconds. The Phase 3 pipeline survived but with twelve minutes of catch-up time and four dropped conjunction alerts during the spike. The postmortem traced the dropped alerts to a single edge — the alert-emitter's incoming channel — where the buffer was sized for nominal traffic, the upstream operator was IO-bound and could not slow further, and the downstream had no mechanism to triage which observations to drop. The pipeline's response to the burst was uniform load shed without policy. The four critical alerts that got dropped were no more important to the system than the four hundred non-critical observations dropped alongside them.

The pipeline at the start of this module is correct under steady-state load and tolerates transient downstream slowdowns. It is not yet correct under burst load. The mechanism for backpressure (send().await on bounded channels) propagates the slowdown but does not give the operator any policy choice about what to do when the channel fills up — the upstream just slows, regardless of what the data being slowed represents. For the post-Cosmos burst, that uniform behavior was exactly the wrong shape: the system needed to triage during the spike, preserving the high-priority alerts that could not afford the latency hit while shedding the low-priority redundant samples that contribute little under load.

This module installs the discipline. Per-edge FlowPolicy (Backpressure / Shed / Timed) makes the policy explicit at every edge. A priority classifier distinguishes high-priority from low-priority observations. The audit script becomes a CI test that fails the build on new unbounded channels or undocumented FlowPolicy choices. The burst simulator becomes the regression-detection canary. The patterns generalize to any streaming pipeline that must survive bursts; the module's specifics are where the techniques meet SDA's actual workload.

The mental model the module installs is the four-piece backpressure discipline: (1) every edge has an explicit per-edge FlowPolicy chosen for its operational role, (2) buffer sizing is documented per-edge with a BurstProfile rather than picked by reflex, (3) credit-based flow is reached for specifically when the receiver needs to pause without occupying a slot (checkpoint flushes, cross-runtime edges, graceful drains), (4) the pressure chain is continuously audited and the per-channel occupancy gradient is the first-look dashboard panel.


Learning Outcomes

After completing this module, you will be able to:

  1. Choose between send().await, try_send(), and send_timeout() based on the operator's load-shedding policy, and encode the choice as a per-edge FlowPolicy
  2. Size channel buffers using a documented BurstProfile rather than by reflex, with the math (rate gap × duration × safety factor) made explicit per edge
  3. Implement a load-shedding sink with priority-aware sub-channels and a biased select that gives high-priority data deterministic preference under shed conditions
  4. Reach for credit-based flow control specifically when the receiver needs to pause without occupying an in-flight slot — checkpoint flushes, cross-runtime edges, graceful drains
  5. Diagnose backpressure-chain breakage from the per-channel occupancy gradient: the rightmost persistently-full edge points at the slowest operator
  6. Recognize the canonical pressure-chain breakage patterns (per-event tokio::spawn, unbounded_channel, fire-and-forget logging, unbounded Vec::push accumulators, drop-and-recreate task patterns) and their structural fixes
  7. Reason about backpressure across boundary cases: TCP windowing as transport-level chain link; Kafka's deliberate decoupling that requires explicit consumer-lag monitoring as the substitute pressure signal

Lesson Summary

Lesson 1 — Bounded Channels

The three send semantics — send().await for backpressure, try_send for explicit load shed (always paired with a drop counter), send_timeout for SLO-bound edges where blocking past a deadline is worse than dropping — encoded as a per-edge FlowPolicy enum. Buffer sizing as documented BurstProfile rather than reflexive 1024; the math made explicit. Why unbounded_channel is a footgun for any data-path edge.

Key question: The conjunction-emitter has a 200ms SLO from observation-arrival to alert-emit, and the alert subscriber occasionally returns 503 during deploys. Which FlowPolicy is the right choice for the emitter's outgoing edge, and why?

Lesson 2 — Credit-Based Flow Control

Credit-based flow as the alternative to bounded-channel-plus-await for cases where the receiver needs to pause the upstream without occupying any in-flight slot. The structural difference (decoupled credit signal and data channel), the production protocols that use it (HTTP/2 windows, AMQP prefetch, Kafka's max.in.flight.requests.per.connection as a single-credit degenerate case), the SDA cases (checkpoint flushes, cross-runtime edges, graceful drains), and the in-flight-bounded property the AFTER-processing return preserves.

Key question: The CreditHandle has a Drop impl that warns when the handle is dropped without return_credit() being called. What canonical failure mode does that warning surface, and why does it matter operationally?

Lesson 3 — End-to-End Backpressure Propagation

The pressure chain as a contiguous sequence of bounded buffers from source to sink. The five canonical breakage patterns (per-event tokio::spawn, unbounded_channel, fire-and-forget unbounded logging, Vec::push accumulators, drop-and-recreate task patterns) and their structural fixes. Reading the per-channel occupancy gradient as first-look diagnostic. TCP windowing as transport-level chain link. Kafka's deliberate decoupling that requires explicit consumer-lag monitoring as the substitute pressure signal.

Key question: The dashboard shows three persistently-full edges across a six-edge topology. Which operator is the bottleneck, and why does the gradient-reading discipline give an unambiguous answer?


Capstone Project — Backpressure-Aware Fusion Broker

Harden the Phase 3 pipeline against the post-Cosmos-1408 burst failure mode. Every edge gets an explicit FlowPolicy; a priority classifier distinguishes High from Low observations; a PriorityShedSink with biased select gives High deterministic preference and drops Low first under shed conditions. The audit script becomes a cargo test that fails the build on new unbounded channels or undocumented policies. The BurstSimulator drives 10x rate for 5 simulated minutes and asserts zero High-priority drops. Acceptance criteria, suggested architecture, and the full project brief are in project-backpressure-broker.md.

The orchestrator from M2 and the windowed correlator from M3 are unchanged in structure. Only the edges' policies change, and a new operator (the priority classifier) sits between the normalizer and the correlator.
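
A minimal sketch of the biased-select drain inside the PriorityShedSink, with Observation as the module's envelope type; the function name and loop body here are illustrative, and the authoritative acceptance criteria live in the project brief:

use anyhow::Result;
use tokio::sync::mpsc;

/// Drain two priority sub-channels with deterministic preference.
/// `biased;` fixes tokio::select!'s polling order, so High is always
/// checked before Low; under shed conditions Low waits.
pub async fn drain_with_priority(
    mut high_rx: mpsc::Receiver<Observation>,
    mut low_rx: mpsc::Receiver<Observation>,
    out: mpsc::Sender<Observation>,
) -> Result<()> {
    loop {
        tokio::select! {
            biased;
            Some(obs) = high_rx.recv() => {
                out.send(obs).await.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
            }
            Some(obs) = low_rx.recv() => {
                out.send(obs).await.map_err(|_| anyhow::anyhow!("downstream dropped"))?;
            }
            // Both channels closed: the upstream classifier is done.
            else => return Ok(()),
        }
    }
}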


File Index

module-04-backpressure-and-flow-control/
├── README.md                                  ← this file
├── lesson-01-bounded-channels.md              ← Bounded channels and FlowPolicy
├── lesson-01-quiz.toml                        ← Quiz (5 questions)
├── lesson-02-credit-based-flow.md             ← Credit-based flow control
├── lesson-02-quiz.toml                        ← Quiz (5 questions)
├── lesson-03-end-to-end-propagation.md        ← End-to-end backpressure propagation
├── lesson-03-quiz.toml                        ← Quiz (5 questions)
└── project-backpressure-broker.md             ← Capstone project brief

Prerequisites

  • Modules 1, 2, and 3 completed — the Observation envelope, the orchestrator's OperatorGraph and supervisor, and the watermark-aware windowed correlator are all assumed
  • Foundation Track completed — async Rust, channels, scheduling intuitions
  • Familiarity with tokio::sync::mpsc::{Sender, Receiver} semantics and tokio::time::{timeout, sleep, pause}
  • Comfort reading Prometheus-style metric names (channel_occupancy{edge=...}) and reasoning about per-label counters and gauges

What Comes Next

Module 5 (Delivery Guarantees and Fault Tolerance) makes the windowed correlator's state crash-safe via checkpointing. The credit-based-flow primitive from this module's Lesson 2 is the mechanism that pauses the upstream during the checkpoint flush. The bounded-channel-everywhere invariant the audit enforces is what lets the checkpointed state size be bounded and predictable. The exactly-once-via-idempotency frame from M2 plus the retract-aware sink from M3 plus the priority shedding from M4 compose into the pipeline M5 will harden against process restarts.

Lesson 1 — Bounded Channels

Module: Data Pipelines — M04: Backpressure and Flow Control
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, the chapters on tokio::sync::mpsc semantics and channel patterns; Network Programming with Rust — Abhishek Chanda, sections on application-level flow control over TCP; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (Producer Configuration: buffer.memory, max.block.ms, acks)


Context

Module 1 introduced mpsc::Sender::send().await as the foundation of backpressure: a full bounded channel suspends the sending future until the consumer makes capacity, propagating the slowdown upstream. Module 2's orchestrator structurally enforced "every edge in the operator graph is a bounded channel." Module 3 added watermarks and windowing on top of that channel structure. The pipeline that exists at the start of this module is correct under steady-state load and tolerates transient downstream slowdowns.

It does not tolerate burst load. Last quarter's Cosmos-1408 anti-satellite test added 1,800 newly tracked debris objects to the catalog within 90 seconds. The pipeline survived in the sense that no operator panicked, but with twelve minutes of catch-up time and four dropped conjunction alerts during the spike. The postmortem traced the dropped alerts to a single channel where the buffer was sized for nominal traffic, the upstream operator was IO-bound and could not slow further, and the downstream operator had no mechanism to triage which observations to drop and which to preserve. The pipeline's response to the burst was uniform load shed without policy — the four critical alerts that got dropped were no more important to the system than the four hundred non-critical observations dropped alongside them.

This module is the response. The orchestrator's bounded-channel invariant is correct but not sufficient; what to do when a bounded channel fills up is a design decision that has been left implicit, and the explicit answer is a per-edge FlowPolicy with three named choices. This lesson establishes the choices, develops the channel semantics behind each, and builds the discipline of sizing channel buffers deliberately rather than by reflex. Lesson 2 covers credit-based flow control as the alternative shape for cases where awaiting is not enough. Lesson 3 audits the full pipeline for places where the backpressure chain breaks. The capstone hardens the M3 pipeline against a 10x burst simulation with explicit flow policies and load-shedding.


Core Concepts

Bounded vs Unbounded Channels

tokio::sync::mpsc::channel(N) produces a bounded channel: at most N items can be in flight between the sender and the receiver. Beyond N, the next send suspends until the receiver consumes. tokio::sync::mpsc::unbounded_channel() produces an unbounded channel: items are buffered without limit, growing memory as long as the sender produces faster than the receiver consumes.

Unbounded channels are the wrong default. The reason is structural: any unbounded buffer in the pipeline is a place where backpressure does not propagate. The upstream sender writes into the unbounded channel, never suspends, and never observes the downstream's slowness. The buffer grows. The process's resident memory grows. Either a higher-level resource limit (the OOM killer; a container's memory cap; a tokio runtime memory budget) eventually intervenes, or the process keeps growing until something else gives. None of these outcomes are graceful; all are surprising at runtime, and all are diagnosed only after the symptom shows up.

The legitimate use cases for unbounded channels are narrow. In-process notification of singleton events — the orchestrator's "shutdown signal" that fires once and is consumed once. Tests where the test harness controls both ends and the bound is implicit in the test's structure. Bounded-by-construction sources where the application can prove the channel cannot fill. None of these apply to the data path of a streaming pipeline. The orchestrator's audit script in Lesson 3 asserts that no unbounded_channel appears anywhere in the operator graph; this lesson is the conceptual justification for that assertion.
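
A minimal illustration of the distinction, with Observation standing in for the module's envelope type:

use tokio::sync::mpsc;

fn wiring_sketch() {
    // Data-path edge: bounded, always. The capacity is the slack
    // budget; past it, send().await suspends the producer.
    let (_data_tx, _data_rx) = mpsc::channel::<Observation>(1024);

    // The narrow legitimate case: a singleton shutdown signal, fired
    // once and consumed once; bounded by construction even though
    // the channel type is not.
    let (_shutdown_tx, _shutdown_rx) = mpsc::unbounded_channel::<()>();
}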

send().await Semantics

send(item).await is the default and the right choice for most operator-pair edges. The semantics: when the channel is full, the future returned by send does not resolve until there is capacity for the item; the calling task is suspended (cooperatively, in the M2 L1 sense) and another task on the worker thread can run. The send is cancel-safe — dropping the future at any point is well-defined: either the item was inserted into the channel (Ok(()) returned) or it was not (the future was dropped before the channel had capacity). No third state.

The operational consequence is that the channel's capacity IS the backpressure mechanism. A channel of capacity 1024 is a slack budget: the upstream operator can produce 1024 items ahead of the downstream's processing before backpressure begins to apply. Within the budget, the upstream runs at its own rate; past the budget, the upstream's rate is capped at the downstream's rate. The relationship is exactly the dataflow-model contract: no operator runs faster than its slowest downstream.

The choice of send().await over try_send is the choice of "applied backpressure" over "load shed." Any pipeline whose default behavior under load should be "the upstream slows down" uses send().await. The radar source operator that calls send(observation).await on a full channel ends up suspended at that point; the next time it polls its UDP socket, the kernel buffer has had time to fill or drop frames at its layer. This is the right behavior for a UDP source: kernel-level drop preserves the pipeline's invariants while applying the right kind of pressure.

try_send() Semantics

try_send(item) returns immediately. On a full channel, it returns Err(TrySendError::Full(item)) — the item is handed back to the caller, untouched. The caller decides what to do: drop it, log it, route it to a DLQ, retry it later. The semantics are explicit load-shed: the channel's capacity is a hard limit, and an attempt to exceed it does not block.

The right use case is operator-side load shedding when the data being shed has lower marginal value than the cost of blocking the upstream. The metrics-export operator is the canonical case: a metric sample that did not get published is a small, recoverable loss; blocking the upstream operator on the metrics channel would impose a larger cost than the benefit of every metric reaching its destination. try_send with a counter!("metrics_drops_total") increment is the right shape.

The trap that the lesson called out at the top is try_send with no logging. An operator that calls try_send and discards the Err(Full) produces silent drops that are invisible until aggregate output deficits show up downstream. The discipline is uniform: every try_send site has a counter increment and a structured log entry on Full. The orchestrator's metrics endpoint exposes the counter; an alert fires when the drop rate per second exceeds a threshold. Without the counter, try_send is a footgun; with the counter, it is a load-shedding tool.

send_timeout() Semantics

send_timeout(item, dur) is the third primitive. It suspends like send().await but resolves with Err(SendTimeoutError::Timeout(item)) after dur has elapsed without capacity becoming available. The caller decides what to do with the timed-out item: drop, DLQ, retry on a different channel.

This is the right primitive for operators with a real-time deadline. The conjunction-alert emitter has a 200 ms SLO from observation-arrival to alert-emit. If the alert subscriber's HTTP endpoint is too slow to drain the alert channel within 200 ms, send_timeout lets the operator make an explicit choice — drop the alert (with metrics and DLQ), route to a slow-path archive, or whatever the operational decision is — rather than blocking past the SLO. The deadline-bound choice fits naturally between send().await (no deadline, may block forever) and try_send (no wait at all).

The defaults across the SDA pipeline:

| Edge | Policy | Reasoning |
|---|---|---|
| Source → normalize | send().await | Apply backpressure to source; UDP drops at kernel are acceptable |
| Normalize → correlator | send().await | No deadline; correlator is the natural-rate consumer |
| Correlator → alert sink | send_timeout(200ms) | SLO-bound; timed-out alerts go to DLQ |
| Pipeline → metrics export | try_send + drop counter | Metrics are sheddable; observability of drops is what matters |

The discipline is per-edge, documented in the operator graph declaration with a brief comment about the choice. The orchestrator's structured log emits the policy at startup so runbooks can confirm what is configured.

Buffer Sizing

The capacity of a bounded channel is the slack budget between the producer and the consumer — how much burst the channel absorbs before backpressure begins to apply. The right sizing is operational, not magical. Three considerations.

Sustained rate vs burst rate: a channel sized for the sustained rate has effectively zero slack and applies backpressure constantly under nominal load, which is wrong. A channel sized for the burst rate has too much slack and adds latency under load (every item in the channel is one waiting in front of the next one). The right size is bounded by the expected burst duration × the rate gap between producer and consumer, with a 2x safety factor for headroom.

Per-item processing time at the consumer: a channel sized for 1024 items where each item takes 100 µs to process represents 100 ms of latency at the consumer side under steady-state full-channel conditions. If the operator's SLO budget is 200 ms, that 100 ms of channel-induced latency might be more than the budget allows, and the right answer is a smaller channel.

Memory cost per slot: each slot in the channel holds one Observation (or whatever the channel's item type is). For envelopes of a few kilobytes, a channel of 1024 is negligible. For envelopes that carry larger payloads, the per-slot memory matters and the channel should be sized accordingly.

The default for SDA's pipeline is 1024 for source-to-normalize edges, 4096 for normalize-to-correlator (the correlator's per-event work is heavier and the slack absorbs more of the burst), and 256 for the alert-emit edge (low rate, tight latency). Each edge is documented with a comment in the graph declaration explaining the choice. Sizing by reflex (everything is 1024, "because that's what we used last time") is the pattern the post-Cosmos postmortem identified as having contributed to the dropped alerts.


Code Examples

Three Sinks with Three Policies

The same logical sink role with three different flow policies, each appropriate for a different edge in the topology.

use anyhow::Result;
use std::time::Duration;
use tokio::sync::mpsc;

/// `BackpressureSink` applies upstream backpressure on a full channel.
/// The right choice for an edge where the upstream should slow rather
/// than the data should be dropped.
pub struct BackpressureSink {
    tx: mpsc::Sender<Observation>,
}

impl BackpressureSink {
    pub async fn write(&self, obs: Observation) -> Result<()> {
        // send().await suspends until capacity. Cancel-safe.
        self.tx.send(obs).await
            .map_err(|_| anyhow::anyhow!("downstream receiver dropped"))?;
        Ok(())
    }
}

/// `SheddingSink` drops on a full channel rather than block. The right
/// choice for sheddable data; ALWAYS pair with a drop counter.
pub struct SheddingSink {
    tx: mpsc::Sender<Observation>,
    drops: metrics::Counter,
}

impl SheddingSink {
    pub fn new(tx: mpsc::Sender<Observation>) -> Self {
        Self {
            tx,
            drops: metrics::counter!("shedding_sink_drops_total"),
        }
    }

    /// Never awaits in practice (try_send is non-blocking); async only
    /// so the signature matches the other two sinks.
    pub async fn write(&self, obs: Observation) -> Result<()> {
        match self.tx.try_send(obs) {
            Ok(()) => Ok(()),
            Err(mpsc::error::TrySendError::Full(_obs)) => {
                self.drops.increment(1);
                tracing::debug!("shedding sink dropped observation; channel full");
                Ok(())
            }
            Err(mpsc::error::TrySendError::Closed(_)) => {
                Err(anyhow::anyhow!("downstream receiver dropped"))
            }
        }
    }
}

/// `TimedSink` writes with a deadline. After the deadline, the item is
/// returned to the caller; production code routes it to a DLQ or
/// archive sink.
pub struct TimedSink {
    tx: mpsc::Sender<Observation>,
    deadline: Duration,
    dropped_with_deadline: metrics::Counter,
}

impl TimedSink {
    pub fn new(tx: mpsc::Sender<Observation>, deadline: Duration) -> Self {
        Self {
            tx,
            deadline,
            dropped_with_deadline: metrics::counter!("timed_sink_deadline_drops_total"),
        }
    }

    pub async fn write(&self, obs: Observation) -> Result<()> {
        // send_timeout suspends like send().await but resolves with
        // Err(SendTimeoutError::Timeout(obs)) once the deadline passes;
        // the item comes back to the caller for a routing decision.
        match self.tx.send_timeout(obs, self.deadline).await {
            Ok(()) => Ok(()),
            Err(mpsc::error::SendTimeoutError::Timeout(_obs)) => {
                self.dropped_with_deadline.increment(1);
                // Production: route `_obs` to a slow-path archive or
                // DLQ. Here: log and drop.
                tracing::warn!("timed sink missed deadline; dropping");
                Ok(())
            }
            Err(mpsc::error::SendTimeoutError::Closed(_)) => {
                Err(anyhow::anyhow!("downstream receiver dropped"))
            }
        }
    }
}

The three sinks are interchangeable in shape — same write(obs) -> Result<()> signature — but operationally different. The orchestrator's edge wiring chooses one per edge; the operator that consumes the sink does not need to know which policy is in effect. The metrics surfaced by each (shedding_sink_drops_total, timed_sink_deadline_drops_total) are the operator-visibility property the lesson keeps flagging — silent drops are the bug, instrumented drops are the tool.

A FlowPolicy Enum for Per-Edge Configuration

The lesson's discipline is per-edge. Encoding the policy in an enum makes the choice visible in the operator graph declaration and lets a single sink implementation dispatch to the right semantics.

use anyhow::Result;
use std::time::Duration;
use tokio::sync::mpsc;

#[derive(Debug, Clone, Copy)]
pub enum FlowPolicy {
    /// Apply backpressure: the upstream slows down rather than items
    /// being dropped. The default for most pipeline edges.
    Backpressure,
    /// Drop items on a full channel; log via the `dropped_total` metric.
    /// For sheddable data: metrics export, optional logging, etc.
    Shed,
    /// Wait up to `deadline` for capacity; drop on timeout. For SLO-bound
    /// edges where blocking past the deadline is worse than dropping.
    Timed(Duration),
}

pub struct ConfigurableSink {
    tx: mpsc::Sender<Observation>,
    policy: FlowPolicy,
}

impl ConfigurableSink {
    pub fn new(tx: mpsc::Sender<Observation>, policy: FlowPolicy) -> Self {
        Self { tx, policy }
    }

    pub async fn write(&self, obs: Observation) -> Result<()> {
        match self.policy {
            FlowPolicy::Backpressure => self.tx.send(obs).await
                .map_err(|_| anyhow::anyhow!("receiver dropped")),
            FlowPolicy::Shed => match self.tx.try_send(obs) {
                Ok(()) => Ok(()),
                Err(mpsc::error::TrySendError::Full(_)) => {
                    metrics::counter!("flow_drops_total", "policy" => "shed").increment(1);
                    Ok(())
                }
                Err(mpsc::error::TrySendError::Closed(_)) =>
                    Err(anyhow::anyhow!("receiver dropped")),
            },
            FlowPolicy::Timed(deadline) => match self.tx.send_timeout(obs, deadline).await {
                Ok(()) => Ok(()),
                Err(mpsc::error::SendTimeoutError::Timeout(_)) => {
                    metrics::counter!("flow_drops_total", "policy" => "timed").increment(1);
                    Ok(())
                }
                Err(mpsc::error::SendTimeoutError::Closed(_)) =>
                    Err(anyhow::anyhow!("receiver dropped")),
            },
        }
    }
}

Two design points. The metric label policy is what makes the metric useful operationally: a single counter with the policy label tells you which kind of drop is happening, and the dashboards filter by it. A separate counter per policy works too but produces dashboard duplication. Second, the dispatch on self.policy is per-call; the cost is a single match against a copy of an enum, which is sub-nanosecond in the hot path. The expressiveness gain over three separate sink types is worth that cost.

Buffer Sizing With Documented Reasoning

A small helper that captures the reasoning behind a buffer size as data the operator graph carries forward into structured logs and runbook references.


/// The expected burst characteristics of a channel and the resulting
/// recommended buffer size. Documented per-edge in the topology
/// declaration; the orchestrator emits a startup log with the values.
#[derive(Debug, Clone, Copy)]
pub struct BurstProfile {
    pub peak_rate_per_s: u64,
    pub peak_duration_s: u64,
    pub processing_rate_per_s: u64,
    pub safety_factor: f32,
}

impl BurstProfile {
    /// Recommended buffer size: how many items the channel needs to
    /// hold to absorb the burst without applying backpressure for its
    /// duration. Past the size, backpressure begins to apply normally.
    pub fn recommended_buffer(&self) -> usize {
        let rate_gap = self.peak_rate_per_s.saturating_sub(self.processing_rate_per_s);
        let burst_items = rate_gap * self.peak_duration_s;
        ((burst_items as f32) * self.safety_factor) as usize
    }
}

// Example: source→normalize edge for the radar source.
//   peak_rate_per_s: 5000 (during a fragmentation event)
//   peak_duration_s: 60 (the burst absorbs about a minute)
//   processing_rate_per_s: 4500 (the normalizer's measured throughput)
//   safety_factor: 2.0
//   recommended_buffer() = (5000 - 4500) * 60 * 2.0 = 60,000 items
//
// 60,000 items at ~200 bytes per Observation envelope = 12 MB.
// Acceptable cost for the burst-absorption property; documented in
// the graph declaration with this comment.
const RADAR_SOURCE_BURST: BurstProfile = BurstProfile {
    peak_rate_per_s: 5000,
    peak_duration_s: 60,
    processing_rate_per_s: 4500,
    safety_factor: 2.0,
};

The math is not magic but it is also not "1024 because we always use 1024." Each per-edge buffer size has a BurstProfile constant declaring the assumptions, and the orchestrator's startup log emits the (edge_name, buffer_size, profile) triple so the runbook can reference it. The numbers are operational — they come from load tests and production observation, and they evolve as the workload changes. The discipline this lesson installs is making the assumptions visible rather than baked into magic numbers.
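
A usage sketch of the wiring, assuming a hypothetical edge name for the startup log:

use tokio::sync::mpsc;

fn wire_radar_edge() -> (mpsc::Sender<Observation>, mpsc::Receiver<Observation>) {
    let size = RADAR_SOURCE_BURST.recommended_buffer();
    // Startup log carries the (edge_name, buffer_size, profile)
    // triple so the runbook can reference it.
    tracing::info!(
        edge = "radar->normalize",
        buffer_size = size,
        profile = ?RADAR_SOURCE_BURST,
        "channel sized from BurstProfile"
    );
    mpsc::channel(size)
}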


Key Takeaways

  • Bounded channels are the default in any production pipeline. unbounded_channel is a footgun for data-path edges; use it only for singleton-event signals or test scaffolding. The orchestrator's audit asserts none appear in the operator graph.
  • The three send semantics are operationally distinct. send().await applies backpressure (upstream slows). try_send load-sheds on full (always pair with a drop counter). send_timeout(dur) is for SLO-bound edges where blocking past a deadline is worse than dropping. Encode the choice as a per-edge FlowPolicy.
  • The right primitive depends on the question being asked. Should the upstream slow down? send().await. Is this data sheddable under load? try_send + counter. Does this edge have a real-time deadline? send_timeout + DLQ/archive.
  • Buffer sizing is per-edge and documented. The math: (peak_rate - processing_rate) × peak_duration × safety_factor. The default of 1024 for everything is the pattern that the post-Cosmos-1408 postmortem identified as a contributing cause of dropped alerts.
  • Silent drops are the bug; instrumented drops are the tool. Every load-shedding site has a counter, a label, and a dashboard panel. An undocumented try_send is a bug waiting to be diagnosed; an instrumented one is operationally legible.

Lesson 2 — Credit-Based Flow Control

Module: Data Pipelines — M04: Backpressure and Flow Control
Position: Lesson 2 of 3
Source: Network Programming with Rust — Abhishek Chanda, the chapter on application-level flow control over TCP and the AMQP/HTTP-2 patterns; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (max.in.flight.requests.per.connection as a degenerate single-credit scheme); Async Rust — Maxwell Flitton & Caroline Morton, advanced channel patterns


Context

Lesson 1 established send().await on a bounded channel as the foundation of backpressure. The mechanism is implicit credit: the channel's capacity is the credit pool; every successful send consumes one slot; the consumer's recv returns a slot to the pool. The sender does not manage credits explicitly; it just calls send and gets backpressure for free. This is the right shape for almost every operator-pair edge in the SDA pipeline and the right default for nearly all dataflow systems.

There is one shape it does not fit well: when the receiver wants to pause ingestion entirely without holding an in-flight item in a buffer. The canonical case is a checkpoint flush — the receiver needs to durably persist its current state before accepting new events, and during that window it cannot afford to even buffer the next event because the in-flight buffer is exactly what the checkpoint is trying to capture. The bounded-channel-plus-await design has no clean way to express "stop sending, but don't fill the slot you'd have used." The receiver can just not call recv for a while, but the next item the sender produces fills the channel, occupies a slot, and is now part of the in-flight state the checkpoint must serialize.

Credit-based flow control is the alternative shape. The receiver issues credits explicitly; the sender consumes credits on each send and refuses to send when out of credits. The credit return path is a separate channel from the data path, which means the receiver can pause issuing credits without touching the data channel at all — the sender's credit counter goes to zero, the sender stops, no in-flight item is created. This decoupling is what makes credit-based flow control the right primitive for pause-without-buffer-fill operations and for pipelines where data and control are physically separated (HTTP/2 stream flow control; AMQP prefetch; Kafka's max.in.flight.requests.per.connection as a single-credit degenerate case).

This lesson develops credit-based flow control as the operational alternative to bounded-channel-plus-await, identifies precisely when each fits, and develops the implementation pattern as a wrapper around the M2 channel primitives. The capstone uses credit-based flow control upstream of the windowed correlator's checkpoint flush in Module 5; this lesson installs the machinery now so that work has the primitive available.


Core Concepts

Credits, Defined

A credit is permission to send one item. The receiver has a finite pool of credits at any moment; the sender consumes one credit per send; sending without a credit is forbidden. The receiver replenishes the pool by sending credit-return messages back to the sender, indicating that some prior items have been consumed and N new credits are available. The sender's local credit counter starts at the receiver's initial grant, decrements on each send, and increments on each credit-return.

When credits = 0 the sender stops. It does not buffer. It does not block on a channel send. It returns to its caller (or sits at its own await point waiting for credit returns) without producing the next item. This is the mechanism that makes the receiver's pause real: by withholding credits, the receiver creates an upstream stop without occupying a single channel slot.

The shape resembles backpressure-via-await in the steady state — when credits flow normally, the sender produces at the receiver's rate, just like it would on a bounded channel — but differs in two structural ways. First, the credit signal is on a separate channel, decoupled from the data path. Second, the sender's behavior on "out of credits" is its own decision (block, drop, route elsewhere) rather than the channel's.

Credit-Based vs Backpressure-via-Await

Both produce upstream slowdowns under sustained downstream pressure. The differences are operational.

Backpressure-via-await binds the credit signal to the data channel — the channel's capacity IS the credit pool. To pause the sender, the receiver must not call recv, which means the channel fills, which means the next in-flight item occupies a slot. The receiver cannot pause without buffering at least one more item.

Credit-based decouples them. The receiver pauses by withholding credits on the side channel; the data channel remains empty (or at whatever steady state it was in). The sender's local credit counter goes to zero; the sender stops without producing the next item.

For most pipeline edges this difference does not matter. The data channel filling with one extra item before the sender suspends is not a problem; it does not change the pipeline's correctness. For the checkpoint case it does matter — the in-flight buffer being empty is the property the checkpoint depends on. Credit-based is the right primitive when the receiver needs that empty-buffer property.

There is a secondary difference. Credit-based supports batched grants: the receiver can issue 100 credits at once, and the sender can fire 100 sends without coordination. Backpressure-via-await supports the same pattern only via the channel's capacity, which couples the burst size to the persistent slack. Credits let burst size and steady-state slack be independent, which is occasionally useful (a receiver that wants 10 in-flight messages steady-state but allows occasional 100-message bursts).

The Credit Return Path

Two channels: data and credit-return. The data channel is the same mpsc::Sender<T> / mpsc::Receiver<T> pair we have used throughout. The credit-return channel is mpsc::Sender<u32> / mpsc::Receiver<u32>, where each message is a credit count being returned. The receiver's per-event behavior:

  1. recv an item from the data channel.
  2. Process the item.
  3. Send 1 (or some larger batch count) on the credit-return channel.

The sender's per-event behavior:

  1. Drain the credit-return channel of any pending returns; increment the local counter by the sum.
  2. If the local counter is 0, await a credit-return on the credit-return channel.
  3. Decrement the counter; send the item on the data channel.

The data channel itself does not need to be bounded — credits bound the in-flight count, which is what the bounded channel was for. In practice, the data channel is given a small bound (matching the maximum credit grant) to keep the implementation defensive against credit-counter bugs.

Where Production Uses Credits

HTTP/2 has per-stream and per-connection flow-control windows (RFC 7540 Section 5.2). The receiver's WINDOW_UPDATE frame is exactly a credit-return: it tells the sender how many more bytes it may send on a given stream. The use case is multiplexing many streams on a single TCP connection — each stream needs its own backpressure that does not affect the others, and credits on a side channel give that.

AMQP (RabbitMQ, ActiveMQ) uses per-channel prefetch limits. The consumer declares its prefetch count; the broker delivers up to that many unacknowledged messages; the consumer's ack returns a credit. The mechanism is identical to the lesson's pattern, just with the broker and consumer in the producer/consumer roles.

Kafka's max.in.flight.requests.per.connection is a degenerate single-credit case. The producer can have at most N in-flight requests to a given broker; each completed request returns one credit. With N=1 (the strongest setting), the producer is effectively serial-pipelined. With N=5 (the default), small bursts are allowed.

The pattern is widespread in production protocols. It is less commonly used in single-process pipelines because backpressure-via-await is sufficient most of the time; credit-based shows up specifically where the additional decoupling is needed.

When to Reach for Credits in SDA

Three concrete cases.

Checkpoint flush (Module 5). The windowed correlator's state is being durably written to disk. During the write, the operator cannot afford to even buffer the next event. The flush operator pauses ingestion by withholding credits; the upstream correlator sender stops cleanly without occupying a slot.

Cross-runtime edges (advanced bulkheading from M2 L4). When an operator in one runtime sends to an operator in a different runtime (the propagator pool versus the main pipeline), the bounded channel between them does not propagate backpressure cleanly because the runtimes have independent schedulers. Credit-based flow gives an explicit mechanism that the receiving runtime controls.

Operator handoff during a graceful drain. During shutdown, the orchestrator wants the upstream to stop producing while in-flight items finish flowing. The orchestrator (or a control-plane operator) withholds credits on the affected edges; the upstream halts; downstream drains; the pipeline shuts down cleanly without losing in-flight items.

For everything else, prefer the simpler send().await pattern from Lesson 1. Credit-based flow is a heavier mechanism with more coordination overhead, and using it where it is not needed adds operational surface area.


Code Examples

A CreditChannel Wrapper

The wrapper is two channels and a small bookkeeping struct. The sender side checks credits before each send; the receiver side issues credit returns after each successful processing.

use anyhow::Result;
use tokio::sync::mpsc;

/// One end of a credit-based flow channel. The sender consumes credits
/// from its local counter on each send; when the counter is zero, it
/// awaits a credit return.
pub struct CreditSender<T> {
    data_tx: mpsc::Sender<T>,
    credit_rx: mpsc::Receiver<u32>,
    local_credits: u32,
}

impl<T> CreditSender<T> {
    pub async fn send(&mut self, item: T) -> Result<()> {
        // Drain any pending credit returns first.
        while let Ok(returned) = self.credit_rx.try_recv() {
            self.local_credits = self.local_credits.saturating_add(returned);
        }
        // If we are out of credits, block on a return.
        while self.local_credits == 0 {
            match self.credit_rx.recv().await {
                Some(returned) => {
                    self.local_credits = self.local_credits.saturating_add(returned);
                }
                None => return Err(anyhow::anyhow!("credit return channel closed")),
            }
        }
        self.local_credits -= 1;
        self.data_tx.send(item).await
            .map_err(|_| anyhow::anyhow!("data channel receiver dropped"))?;
        Ok(())
    }

    /// Current local credit count — useful for diagnostics.
    pub fn credits(&self) -> u32 { self.local_credits }
}

/// The receiver end. Reading an item produces a `CreditHandle` that
/// MUST be returned (via return_credit) after the item is processed.
/// Forgetting to return credits is the canonical credit-leak bug.
pub struct CreditReceiver<T> {
    data_rx: mpsc::Receiver<T>,
    credit_tx: mpsc::Sender<u32>,
}

pub struct CreditHandle<'a> {
    credit_tx: &'a mpsc::Sender<u32>,
    returned: bool,
}

impl<T> CreditReceiver<T> {
    pub async fn recv(&mut self) -> Option<(T, CreditHandle<'_>)> {
        let item = self.data_rx.recv().await?;
        let handle = CreditHandle {
            credit_tx: &self.credit_tx,
            returned: false,
        };
        Some((item, handle))
    }
}

impl<'a> CreditHandle<'a> {
    /// Return one credit to the sender. Should be called after the
    /// associated item has been processed.
    pub async fn return_credit(mut self) -> Result<()> {
        self.returned = true;
        self.credit_tx.send(1).await
            .map_err(|_| anyhow::anyhow!("credit return channel sender dropped"))?;
        Ok(())
    }
}

impl<'a> Drop for CreditHandle<'a> {
    fn drop(&mut self) {
        if !self.returned {
            // Forgotten credit return — log it. This is a programming bug,
            // not a recoverable condition; production code should alert.
            tracing::error!("CreditHandle dropped without returning credit; credit leaked");
        }
    }
}

/// Pair constructor: returns matched sender + receiver with the given
/// initial credit grant.
pub fn credit_channel<T: Send + 'static>(
    data_capacity: usize,
    initial_credits: u32,
) -> (CreditSender<T>, CreditReceiver<T>) {
    let (data_tx, data_rx) = mpsc::channel(data_capacity);
    let (credit_tx, credit_rx) = mpsc::channel(data_capacity);
    (
        CreditSender { data_tx, credit_rx, local_credits: initial_credits },
        CreditReceiver { data_rx, credit_tx },
    )
}

The CreditHandle with its Drop-logs-error pattern is a useful safety net: forgetting to return a credit is the canonical bug in this pattern, and the log at Drop makes the bug visible in logs rather than mysterious in production. Production code can push the check earlier with a #[must_use] annotation on the handle, which makes discarding the handle a compiler warning (though it cannot catch a bound handle that is never returned); for clarity here we rely on the runtime warning. The data_capacity parameter sets the data channel's bound; it should equal or exceed the maximum credit grant ever issued so the data channel itself is never the limiting factor.
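
A usage sketch under a 64-credit grant; process_observation is a stand-in for real operator work:

use anyhow::Result;

async fn usage_sketch(observations: Vec<Observation>) -> Result<()> {
    let (mut tx, mut rx) = credit_channel::<Observation>(64, 64);

    let producer = tokio::spawn(async move {
        for obs in observations {
            // Suspends whenever the local credit counter hits zero.
            if tx.send(obs).await.is_err() {
                break;
            }
        }
        // Dropping `tx` here closes the data channel; recv() below
        // returns None once the channel drains.
    });

    while let Some((obs, credit)) = rx.recv().await {
        process_observation(obs).await?;
        // Return AFTER processing to keep the in-flight bound tight.
        credit.return_credit().await?;
    }
    producer.await?;
    Ok(())
}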

A Receiver That Pauses by Withholding Credits

The case the lesson called out as the primary use: pausing the upstream during a checkpoint flush without occupying any in-flight slots.

use anyhow::Result;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::sleep;

/// Operator that periodically checkpoints. During the checkpoint
/// window it withholds credits, pausing the upstream sender cleanly.
pub async fn run_checkpointing_operator<T>(
    mut input: CreditReceiver<T>,
    output: mpsc::Sender<T>,
    checkpoint_interval: Duration,
) -> Result<()>
where
    T: Send + 'static,
{
    let mut last_checkpoint = std::time::Instant::now();

    loop {
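        // Note: this elapsed check runs only when the loop iterates;
        // with zero traffic, recv().await below parks the task and the
        // checkpoint waits for the next item. A production operator
        // would select! over an interval timer as well.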
        if last_checkpoint.elapsed() >= checkpoint_interval {
            // Time to checkpoint. The CRITICAL property: by NOT calling
            // input.recv() during the flush, we both stop accepting new
            // items AND withhold credit returns to the upstream. The
            // upstream's local credit counter drains; the upstream stops
            // producing without occupying any in-flight slot.
            tracing::info!("starting checkpoint flush");
            do_checkpoint_flush().await?;
            last_checkpoint = std::time::Instant::now();
            tracing::info!("checkpoint flush complete; resuming");
            // Back in the recv loop below, per-item credit returns
            // resume and the upstream restarts.
            continue;
        }

        match input.recv().await {
            Some((item, credit)) => {
                // Process the item.
                output.send(item).await
                    .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
                // Return the credit AFTER processing. Returning before
                // processing would defeat the bounded-in-flight property
                // (the upstream could fire another item while this one
                // is still in flight at the operator).
                credit.return_credit().await?;
            }
            None => return Ok(()),
        }
    }
}

async fn do_checkpoint_flush() -> Result<()> {
    // Module 5 develops checkpointing in full. Here: stand-in.
    sleep(Duration::from_millis(150)).await;
    Ok(())
}

Two design points worth dwelling on. The credit return happens after item processing, not before. Returning before processing breaks the in-flight-bounded property: the upstream sees the credit, fires the next item, and now there are two items in flight at this operator — one being processed, one in the data channel. With the AFTER ordering, the bound is exactly the initial credit grant: at most that many items are in flight at any moment.

The second is structural: the operator's "pause during checkpoint" mechanism is not calling recv. There is no explicit "pause" or "resume" message; the credit-return mechanic falls out of the recv-loop's natural shape. When the operator is in the checkpoint branch, no recv happens, no credit is returned, the upstream's counter drains. When the operator returns to the recv loop, credits flow again, the upstream resumes. The implementation is small precisely because the mechanism is doing the work.

Fairness Across Multiple Senders

When multiple senders share a credit pool with a single receiver, the issuance policy decides who gets what share. Two strategies, each with a different fairness profile.

use anyhow::Result;
use std::collections::VecDeque;
use tokio::sync::mpsc;

/// Round-robin credit issuance: cycle through senders, granting one
/// credit per sender per round. Fair share regardless of demand.
pub struct RoundRobinCreditIssuer {
    senders: Vec<mpsc::Sender<u32>>,
    cursor: usize,
}

impl RoundRobinCreditIssuer {
    pub async fn grant_one(&mut self) -> Result<()> {
        let target = self.cursor % self.senders.len();
        self.senders[target].send(1).await
            .map_err(|_| anyhow::anyhow!("sender's credit channel closed"))?;
        self.cursor += 1;
        Ok(())
    }
}

/// First-asker-wins issuance: senders queue up requests; the issuer
/// satisfies in arrival order. Greedy senders dominate.
pub struct FifoCreditIssuer {
    request_queue: VecDeque<usize>,  // sender indices in arrival order
    senders: Vec<mpsc::Sender<u32>>,
}

impl FifoCreditIssuer {
    pub fn enqueue_request(&mut self, sender_idx: usize) {
        self.request_queue.push_back(sender_idx);
    }

    pub async fn grant_one(&mut self) -> Result<()> {
        if let Some(target) = self.request_queue.pop_front() {
            self.senders[target].send(1).await
                .map_err(|_| anyhow::anyhow!("sender's credit channel closed"))?;
        }
        Ok(())
    }
}

Round-robin gives every sender a predictable share regardless of their per-sender rate. FIFO gives faster senders a larger share because they request more. The choice is operational. SDA's three sources have very different per-source rates (radar at thousands per second; ISL beacons at tens per second), and FIFO would let the radar dominate the credit pool — possibly correct if "throughput" is the optimization, definitely wrong if "fair representation" is. Round-robin is the SDA default. Production credit-issuance schemes can be more sophisticated still (weighted round-robin, deficit weighted, fair queueing); the framework above is the starting point that the rest builds on.


Key Takeaways

  • Credit-based flow control is the alternative to bounded-channel-plus-await for cases where the receiver needs to pause the upstream without occupying any in-flight slot. Credits flow on a separate channel from data; the receiver's pause is "stop returning credits."
  • The structural difference vs send().await: backpressure-via-await binds the credit signal to the channel capacity; credit-based decouples them. Use credit-based when decoupling matters — checkpoint flushes, cross-runtime edges, graceful-drain pauses. Use bounded-channel-plus-await for everything else.
  • Production protocols use credits widely. HTTP/2 flow control windows. AMQP prefetch. Kafka's max.in.flight.requests.per.connection as a single-credit degenerate case. The pattern is well-established in distributed systems; this lesson brings it into the single-process pipeline for the cases that benefit.
  • The implementation is small: data channel + credit-return channel + per-sender credit counter. The credit return happens after item processing, not before — the AFTER ordering is what makes the in-flight bound exactly equal to the credit grant.
  • Credit-issuance fairness is operational. Round-robin gives predictable per-sender share regardless of demand; FIFO lets greedy senders dominate. SDA defaults to round-robin; the heterogeneity of per-source rates would otherwise let one source starve the others.

Lesson 3 — End-to-End Backpressure Propagation

Module: Data Pipelines — M04: Backpressure and Flow Control
Position: Lesson 3 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Buffering and Pushback in stream processing); Network Programming with Rust — Abhishek Chanda, sections on TCP windowing as transport-level backpressure; Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 3 (Producer max.block.ms and the producer-side buffer)


Context

Module 1 introduced the bounded-channel-plus-await chain that propagates backpressure from sink to source. Module 2 wrapped that chain in an orchestrator. Lesson 1 of this module established per-edge FlowPolicy. Lesson 2 added credit-based flow as a sharper tool for cases where the bounded-channel pattern is not enough. The pipeline at this point is well-equipped to apply backpressure correctly if the chain is intact end-to-end.

The chain is rarely intact end-to-end. Engineers add small "convenience" pieces — a tokio::spawn for "fire-and-forget" logging, an unbounded_channel for "this side channel cannot fill," a Vec::push into an in-process collection — each one a place where the backpressure traversal stops. The post-Cosmos-1408 incident from this module's mission framing took two hours to diagnose because the pressure chain was structurally broken at a single tokio::spawn inside the correlator's per-event loop. The spawn looked harmless in code review, did not block, did not fill any visible buffer, and silently amplified any downstream slowdown into unbounded task accumulation. The fix was three lines; the diagnosis was the hard part.

This lesson is the audit. It identifies the canonical patterns that break the pressure chain, develops the diagnostic approach (read the channel-occupancy gradient: the slowest stage shows up as a persistently full channel immediately upstream of itself), and discusses the two boundary cases where backpressure does not propagate naturally: across a Kafka topic between two pipeline halves, and through retry/loop topologies. The capstone integrates the audit as a CI test against the operator graph and a BurstSimulator that drives the M3 pipeline at 10x normal rate to verify end-to-end propagation under burst load.


Core Concepts

The Pressure Chain

A single contiguous chain of bounded buffers from source to sink. Every adjacent pair of operators connected by a bounded channel; every operator's emit using send().await (or its FlowPolicy equivalent) on its outgoing channel; no detached tasks per item; no in-process unbounded buffers. With those conditions, a slowdown at the sink propagates: the sink's incoming channel fills, the upstream operator's send().await suspends, that upstream's incoming channel fills, that upstream's upstream suspends, all the way back to the source. The source either suspends on its own producing primitive (the UDP recv_from, the HTTP client.get) or, for sources that cannot suspend (a UDP feed that produces whether anyone is listening or not), the kernel-level buffer fills and the kernel drops at its layer.

The chain has a shape: operators are the links, channels are the connections. Breaking the chain means inserting something between two operators that does not propagate the suspend signal. The next subsection enumerates the canonical breakage shapes.

Where Pressure Breaks: Anti-Patterns

Five patterns the lesson identifies as the recurring pressure-chain breakers.

tokio::spawn per-event. An operator's hot loop does tokio::spawn(async move { handle_event(e).await }). The spawn returns immediately; the operator continues processing the next event without waiting for the spawned task. Under steady-state load the spawned tasks complete fast enough that the count stays bounded. Under any sustained slowdown, spawned tasks accumulate without limit. The operator's outgoing channel never fills (because the spawned tasks do the work asynchronously) and never propagates pressure. This is the M1 lesson 3 anti-pattern, and it is the single most common pressure-chain breaker because it looks harmless in code review.

mpsc::unbounded_channel. Lesson 1's footgun. Any unbounded buffer is a hole in the pressure chain: the upstream's send always succeeds, so the upstream never observes the downstream's slowness. The buffer grows in proportion to the sustained rate gap.

Fire-and-forget logging via channels. A common pattern: emit a structured log via an mpsc::Sender to a separate logging task. If that channel is unbounded (or if the operator uses try_send and discards the Err), logging events pile up or vanish under load without anyone noticing. The fix is not to make logging block the hot path; it is to make the shed explicit: a bounded channel with the try_send-plus-counter pattern from Lesson 1, or a lossy bounded writer such as tracing-appender's non-blocking appender, which drops on a full buffer by design.

Vec::push into ever-growing collections. An operator accumulates events into a Vec for a deferred batch operation. The accumulation has no bound; under load, the Vec grows without limit. The pattern is structurally identical to unbounded_channel and has the same fix: the operator must bound the accumulation by size or time, and apply backpressure or load-shed when the bound is reached.

Drop-and-recreate task patterns. A supervisor that, on every event, drops the current operator task and spawns a fresh one. The motivation is usually "stateless restart for cleanliness," but the effect is that the channel between this operator and its downstream is being reconstructed faster than it can drain — the new task starts with an empty channel, the old task's in-flight items are dropped or orphaned, pressure does not propagate because the channel does not persist.

The canonical fix in every case is structural: replace the breaking pattern with bounded-and-suspending equivalents. The tokio::spawn per-event becomes an inline await. The unbounded_channel becomes channel(N). The Vec::push accumulator becomes a sized VecDeque with explicit eviction or backpressure. The drop-and-recreate becomes a long-lived supervised operator (Module 2 L4).
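
A minimal before/after sketch of the most common breaker, with Event and handle_event standing in for the operator's real per-event work:

use tokio::sync::mpsc;

pub struct Event;

async fn handle_event(e: Event) -> Event {
    e // stand-in for real per-event processing
}

/// BROKEN: the spawn returns immediately, so the loop never slows.
/// Under sustained slowdown, detached tasks accumulate without bound
/// and the outgoing channel never applies pressure upstream.
async fn operator_broken(mut rx: mpsc::Receiver<Event>, tx: mpsc::Sender<Event>) {
    while let Some(event) = rx.recv().await {
        let tx = tx.clone();
        tokio::spawn(async move {
            let out = handle_event(event).await;
            let _ = tx.send(out).await;
        });
    }
}

/// FIXED: inline await. The loop's rate is coupled to handle_event
/// and to the outgoing channel; a full channel suspends the loop and
/// the pressure propagates upstream.
async fn operator_fixed(mut rx: mpsc::Receiver<Event>, tx: mpsc::Sender<Event>) {
    while let Some(event) = rx.recv().await {
        let out = handle_event(event).await;
        if tx.send(out).await.is_err() {
            return; // downstream dropped
        }
    }
}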

Reading the Pressure Gradient

When a pipeline is correctly chained but slow somewhere, the channel-occupancy gradient identifies the slow stage. The slowest operator's incoming channel is consistently full. Upstream of that operator, channels are partially filled (the stages between the source and the slow operator are running at the pipeline's bottleneck rate, with channels carrying some slack). Downstream of the slow operator, channels are mostly empty (the downstream is faster than the upstream is producing).

The pattern looks like a step function in the per-channel occupancy gauges: 100% at and just before the bottleneck, decreasing toward 0% as you move downstream. The ops engineer reading the dashboard finds the bottleneck by looking for the rightmost persistently-full channel: its consumer is the slowest stage, and everything downstream of that consumer is starving.

The lesson develops the audit as a diagnostic operator that exports per-channel occupancy as a Prometheus gauge. Module 6 generalizes this into the operational dashboard's primary panel; this lesson installs the foundation.
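
A sketch of the per-channel occupancy export, using the metrics crate as in Lesson 1; the edge label and one-second sampling period are illustrative:

use std::time::Duration;
use tokio::sync::mpsc;

/// Samples one channel's occupancy into a labeled gauge. Caveat:
/// holding this Sender clone keeps the channel open, so the sampler
/// must be shut down along with the edge it watches.
pub async fn export_occupancy<T>(edge: &'static str, tx: mpsc::Sender<T>) {
    let gauge = metrics::gauge!("channel_occupancy", "edge" => edge);
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;
        // max_capacity is the configured bound; capacity is the free
        // slots right now; the difference is the occupancy.
        let occupied = tx.max_capacity() - tx.capacity();
        gauge.set(occupied as f64);
    }
}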

TCP Windowing as Transport-Level Backpressure

Module 1 introduced TCP windowing as the kernel-level mechanism that propagates backpressure from a slow application back to the producer over the network. The receiver's TCP stack advertises a receive window — how many more bytes it can buffer; as the application reads, the window grows; when the application stops reading, the window shrinks. The sender's TCP stack respects the advertised window and pauses sending when the window is zero.

This works only if the application reads synchronously from the socket and processes each read before reading the next one. An application that reads as fast as possible into an in-process buffer and processes asynchronously breaks TCP-level backpressure: the application drains the kernel buffer as fast as the network can fill it, the receive window stays advertised at maximum, and the sender does not slow regardless of how slow the application's processing is. The backpressure chain ends at the in-process buffer, which is unbounded by definition.

The discipline is to drive the read loop and the channel send from the same task. The radar source from Module 1 does this: recv_from().await followed by sink.write(obs).await. When the sink's downstream channel is full, send().await suspends, the next iteration of the loop is delayed, the next socket read is delayed, and the kernel receive buffer fills. For a TCP feed, the shrinking receive window then slows the remote sender's TCP stack; for the UDP radar feed, the kernel drops datagrams at its layer, which is the intended shed point from Lesson 1. Every link in the chain (application, channel, kernel, network) propagates the pressure; breaking any one link breaks the whole.
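
A minimal sketch of the same-task loop for a TCP ground-station feed; decode_frame is a stand-in, and real code would use length-delimited framing rather than treating each read as one frame:

use anyhow::Result;
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;
use tokio::sync::mpsc;

pub async fn ground_station_source(
    mut stream: TcpStream,
    tx: mpsc::Sender<Observation>,
) -> Result<()> {
    let mut buf = vec![0u8; 4096];
    loop {
        let n = stream.read(&mut buf).await?;
        if n == 0 {
            return Ok(()); // peer closed
        }
        let obs = decode_frame(&buf[..n])?;
        // Same task, no detached work: when this send suspends, the
        // next read is delayed, the kernel receive buffer fills, and
        // the advertised TCP window shrinks.
        tx.send(obs).await
            .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
    }
}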

Backpressure Across Kafka

The pipeline often has a Kafka topic between two halves: the ingestion half writes to a topic, a consumer half reads from it. Backpressure across the topic boundary does not work the same way as within a single process.

The producer's view: a Kafka producer maintains a producer-side buffer (buffer.memory, default 32 MB). When the brokers acknowledge slowly, the producer-side buffer fills. With max.block.ms at its default (60 seconds), the producer's send blocks waiting for buffer space, which propagates backpressure into the producer-side application. With max.block.ms = 0, a send against a full buffer fails immediately, and the application's error handling decides whether to drop or retry. The producer-side configuration determines the boundary behavior.

The consumer's view: the consumer's lag (the gap between the topic's high-watermark and the consumer's committed offset) grows when the consumer is slow. Backpressure does not propagate to the producer instantaneously — Kafka decouples the two halves intentionally. The producer keeps writing (up to its broker's retention limits) regardless of consumer lag; the consumer falls behind silently until lag is observed via metrics. The pipeline operator's responsibility is to monitor consumer lag explicitly and alert when it grows past a threshold. The implicit pressure-chain that works within a single process becomes an explicit observability discipline at the Kafka boundary.

For the SDA pipeline, this is mostly future work — Module 5 introduces Kafka as a checkpoint persistence layer and Module 6 develops the lag monitoring discipline. This lesson surfaces the boundary so the audit script does not flag Kafka producer/consumer pairs as pressure-chain breaks (they ARE breaks within a single process; they ARE intended at the cross-pipeline boundary; the monitoring is what restores the missing signal).
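
Module 6 develops the monitoring fully; as a preview, a hedged sketch of the lag computation using the rdkafka consumer APIs. The helper name is an assumption, and a production version would export the value as a gauge and alert on its growth rate. Note that committed() only reports partitions in the consumer's current assignment.

use anyhow::Result;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Offset;
use std::time::Duration;

pub fn consumer_lag(consumer: &BaseConsumer, topic: &str, partition: i32) -> Result<i64> {
    // High watermark: one past the last offset the brokers hold for the partition.
    let (_low, high) = consumer.fetch_watermarks(topic, partition, Duration::from_secs(5))?;
    // The group's committed offset for the same partition, if any.
    let committed = consumer.committed(Duration::from_secs(5))?;
    let committed_offset = committed
        .elements()
        .iter()
        .find(|e| e.topic() == topic && e.partition() == partition)
        .map(|e| match e.offset() {
            Offset::Offset(o) => o,
            _ => 0, // nothing committed yet: the whole log counts as lag
        })
        .unwrap_or(0);
    Ok(high - committed_offset)
}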


Code Examples

A Pressure-Chain Audit Script

The audit walks the operator graph from M2 and flags edges that are unbounded, operators that have detached tokio::spawn calls in their hot path (a heuristic check against the source code), and channels that lack a documented FlowPolicy. Failing edges produce a CI error.

use anyhow::{anyhow, Result};

#[derive(Debug)]
pub struct AuditFinding {
    pub edge_or_operator: String,
    pub category: AuditCategory,
    pub detail: String,
}

#[derive(Debug)]
pub enum AuditCategory {
    UnboundedChannel,
    NoFlowPolicy,
    DetachedSpawnSuspected,  // heuristic: source-grep for tokio::spawn inside operator
}

/// Audit an OperatorGraph for backpressure-chain integrity. Returns
/// the list of findings; an empty list means the audit passed.
pub fn audit(graph: &OperatorGraph) -> Vec<AuditFinding> {
    let mut findings = Vec::new();

    for edge in graph.edges() {
        if !edge.is_bounded() {
            findings.push(AuditFinding {
                edge_or_operator: format!("{} -> {}", edge.from_name, edge.to_name),
                category: AuditCategory::UnboundedChannel,
                detail: "edge uses unbounded_channel; pressure does not propagate".into(),
            });
        }
        if edge.flow_policy().is_none() {
            findings.push(AuditFinding {
                edge_or_operator: format!("{} -> {}", edge.from_name, edge.to_name),
                category: AuditCategory::NoFlowPolicy,
                detail: "edge has no documented FlowPolicy; default of Backpressure assumed but should be explicit".into(),
            });
        }
    }

    findings
}

/// CI helper: convert findings into a Result that fails the test.
pub fn audit_or_fail(graph: &OperatorGraph) -> Result<()> {
    let findings = audit(graph);
    if findings.is_empty() { return Ok(()); }
    let summary: Vec<String> = findings
        .iter()
        .map(|f| format!("[{:?}] {}: {}", f.category, f.edge_or_operator, f.detail))
        .collect();
    Err(anyhow!("backpressure audit failed:\n{}", summary.join("\n")))
}

The audit is intentionally conservative: an unflagged graph is probably correct but the audit cannot prove it. The DetachedSpawnSuspected heuristic is the weakest part — a source-grep for tokio::spawn inside operator bodies catches the obvious cases but misses cases where the spawn is hidden inside a helper function. Production audit tooling extends to AST-level inspection or annotation-based pattern matching; the lesson's version is sufficient as a CI canary that catches the regressions the postmortem identified.
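
A hedged sketch of that grep-level heuristic. OperatorGraph::operators() and the per-operator source_path()/name() accessors are assumptions about the M2 graph's shape.

pub fn detached_spawn_findings(graph: &OperatorGraph) -> Vec<AuditFinding> {
    let mut findings = Vec::new();
    for op in graph.operators() {
        let Ok(src) = std::fs::read_to_string(op.source_path()) else {
            continue; // generated or external operator; nothing to grep
        };
        if src.contains("tokio::spawn") {
            findings.push(AuditFinding {
                edge_or_operator: op.name().to_string(),
                category: AuditCategory::DetachedSpawnSuspected,
                detail: "tokio::spawn appears in operator source; verify it is not per-event".into(),
            });
        }
    }
    findings
}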

A Per-Channel Occupancy Gauge

The diagnostic operator that exports the channel-occupancy gradient. Drops in transparently between any two operators by wrapping the channel.

use anyhow::Result;
use std::sync::Arc;
use tokio::sync::mpsc;

/// A wrapper around mpsc::Sender that exports the channel's current
/// occupancy as a metric on every send. Used between operators where
/// occupancy needs to be observable for the pressure-gradient diagnostic.
pub struct InstrumentedSender<T> {
    inner: mpsc::Sender<T>,
    capacity: usize,
    edge_label: String,
}

impl<T> InstrumentedSender<T> {
    pub fn new(inner: mpsc::Sender<T>, capacity: usize, edge_label: impl Into<String>) -> Self {
        Self { inner, capacity, edge_label: edge_label.into() }
    }

    pub async fn send(&self, item: T) -> Result<()> {
        // Exporting occupancy as a Prometheus gauge labeled by edge.
        // The dashboard's primary panel filters this by edge_label
        // and shows the per-edge gradient.
        let used = self.capacity - self.inner.capacity();
        metrics::gauge!("channel_occupancy", "edge" => self.edge_label.clone())
            .set(used as f64);
        self.inner.send(item).await
            .map_err(|_| anyhow::anyhow!("downstream dropped"))
    }
}

The mpsc::Sender::capacity() method returns the remaining capacity (slots free), so used = total - remaining. The gauge update adds per-send overhead; at SDA's volumes (tens of thousands of sends per second) the cost is negligible: sub-microsecond per send. Higher-throughput pipelines would sample instead (every Nth send, as sketched below) at the cost of dashboard responsiveness. Module 6 generalizes this pattern into a structured metric for every operator-pair edge.
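
A sketch of that sampled variant, under the same assumptions as InstrumentedSender; only the counter and the modulo check are new.

use anyhow::Result;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

pub struct SampledSender<T> {
    inner: mpsc::Sender<T>,
    capacity: usize,
    edge_label: String,
    sends: AtomicU64,
    sample_every: u64,
}

impl<T> SampledSender<T> {
    pub async fn send(&self, item: T) -> Result<()> {
        // Relaxed ordering is fine: the counter only gates sampling.
        if self.sends.fetch_add(1, Ordering::Relaxed) % self.sample_every == 0 {
            let used = self.capacity - self.inner.capacity();
            metrics::gauge!("channel_occupancy", "edge" => self.edge_label.clone())
                .set(used as f64);
        }
        self.inner.send(item).await
            .map_err(|_| anyhow::anyhow!("downstream dropped"))
    }
}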

A BurstSimulator for End-to-End Pressure Verification

The integration test that drives a synthetic 10x burst through the pipeline and asserts the per-channel occupancy gradient stabilizes at the expected bottleneck. The simulator's value is not the simulation itself but the assertion structure: under burst load, the slowest operator's incoming channel should be persistently full, every other channel should be measurably below full.

use anyhow::Result;
// tokio::time's Instant (not std::time's), so that tokio::time::pause()
// can fast-forward the simulation deterministically in CI; see the
// capstone's determinism hint.
use tokio::time::{Duration, Instant};

pub struct BurstSimulator {
    target_rate_per_s: u64,
    duration: Duration,
}

impl BurstSimulator {
    pub fn new(target_rate_per_s: u64, duration: Duration) -> Self {
        Self { target_rate_per_s, duration }
    }

    /// Drive `target_rate_per_s` synthetic observations into the
    /// pipeline's source for `duration`, sampling channel occupancy
    /// at 10 Hz. Returns the per-edge occupancy time series.
    pub async fn drive(&self, source: impl SyntheticSource) -> Result<OccupancyReport> {
        let start = Instant::now();
        // Micro-resolution: at burst rates above 1,000/s a millisecond
        // interval truncates to zero and the throttle disappears.
        let interval = Duration::from_micros(1_000_000 / self.target_rate_per_s.max(1));
        let mut sample_at = start + Duration::from_millis(100);
        let mut samples: Vec<OccupancySample> = Vec::new();

        while start.elapsed() < self.duration {
            source.emit_observation().await?;
            tokio::time::sleep(interval).await;
            if Instant::now() >= sample_at {
                samples.push(sample_all_edges());
                sample_at = Instant::now() + Duration::from_millis(100);
            }
        }
        Ok(OccupancyReport { samples })
    }
}

#[derive(Debug)]
pub struct OccupancyReport {
    pub samples: Vec<OccupancySample>,
}

impl OccupancyReport {
    /// Identify the bottleneck edge: the rightmost (most-downstream) edge
    /// that is persistently full (>= 95% average occupancy in the final
    /// third of the simulation). Channels just upstream of the bottleneck
    /// can also read full, so with edge_names() in topological
    /// source-to-sink order, the last match is the bottleneck's own
    /// incoming channel. That edge's downstream operator is the bottleneck.
    pub fn identify_bottleneck(&self) -> Option<String> {
        let final_third_start = self.samples.len() * 2 / 3;
        let final_samples = &self.samples[final_third_start..];
        if final_samples.is_empty() {
            return None;
        }
        let mut bottleneck = None;
        for edge_name in self.edge_names() {
            let avg_occupancy: f64 = final_samples.iter()
                .map(|s| s.edge_occupancy(&edge_name))
                .sum::<f64>() / final_samples.len() as f64;
            if avg_occupancy >= 0.95 {
                bottleneck = Some(edge_name);
            }
        }
        bottleneck
    }

    fn edge_names(&self) -> Vec<String> { /* ... */ vec![] }
}

The simulator is more useful as a CI canary than as a load-testing tool — its value is the bottleneck-identification assertion, not the absolute throughput numbers. A regression that pushes the bottleneck from where ops expects it to be (the correlator) to somewhere else (a normalize that just got slower because of an unrelated change) is exactly the kind of thing the burst simulator catches before it becomes a production incident. The capstone wires this into CI.


Key Takeaways

  • The pressure chain is a contiguous sequence of bounded buffers from source to sink with send().await (or FlowPolicy equivalent) on every edge. A slowdown at any operator propagates upstream all the way to the source's producing primitive. The chain breaks at any unbounded buffer or detached tokio::spawn per event.
  • The canonical breakage patterns are: per-event tokio::spawn (M1's anti-pattern revisited), mpsc::unbounded_channel, fire-and-forget logging on unbounded channels, Vec::push into unbounded accumulators, and drop-and-recreate task patterns. The fix in every case is structural: replace with bounded-and-suspending equivalents.
  • Reading the pressure gradient identifies the slowest stage. The slowest operator's incoming channel is persistently full; channels upstream are partially full; channels downstream are mostly empty. The dashboard panel for per-channel occupancy is the primary diagnostic.
  • TCP windowing as backpressure works only when the application reads synchronously from the socket and processes inline. An async-buffer pattern that reads as fast as possible breaks the kernel-level chain just like an unbounded Vec breaks the application-level chain.
  • Backpressure across Kafka does not propagate synchronously. Kafka decouples producer and consumer intentionally. The pipeline's discipline is to monitor consumer lag explicitly as the substitute for the within-process pressure signal. Module 5 develops Kafka as a checkpoint store; Module 6 develops the lag monitoring.

Capstone Project — Backpressure-Aware Fusion Broker

Module: Data Pipelines — M04: Backpressure and Flow Control
Estimated effort: 1–2 weeks of focused work
Prerequisites: All three lessons in this module passed at ≥70%


Mission Brief

OPS DIRECTIVE — SDA-2026-0188 / Phase 4 Implementation
Classification: BURST-LOAD HARDENING

Last week's anti-satellite test added 1,800 newly tracked debris objects to the catalog within 90 seconds. The Phase 3 pipeline survived but with twelve minutes of catch-up time and four dropped conjunction alerts during the spike. The postmortem traced the dropped alerts to a single edge — the alert-emitter's incoming channel — where the buffer was sized for nominal traffic, the upstream correlator was IO-bound and could not slow further, and the alert-emitter had no mechanism to triage which observations to drop and which to preserve. The pipeline's response to the burst was uniform load shed without policy: the four critical alerts that got dropped were no more important to the system than the four hundred non-critical observations dropped alongside them.

Phase 4 hardens the pipeline against this failure mode. Every edge in the operator graph gets an explicit FlowPolicy. A priority classifier distinguishes high-priority observations (previously-unseen objects, sustained-trajectory updates) from low-priority ones (redundant samples for objects already being tracked). Under shed conditions, low-priority observations are dropped before high-priority. The audit script becomes a CI test that fails the build if a new edge is added without a documented FlowPolicy. A burst simulator drives the pipeline at 10x normal rate for five minutes and asserts no high-priority alerts are dropped during the spike.

Success criteria for Phase 4: a deliberate 10x burst test passes with zero high-priority alerts dropped, with the pipeline memory bounded throughout the spike, and with the operational dashboard identifying the bottleneck operator within 30 seconds of the spike's onset.


What You're Building

Harden the Phase 3 pipeline (M3's conjunction window engine running on M2's orchestrator) against burst load. The deliverable is:

  1. Every channel in the topology has an explicit FlowPolicy set at construction time and documented with a one-line comment
  2. A priority classifier (fn classify_priority(obs: &Observation) -> Priority) that distinguishes High and Low based on whether the observation is a fresh sighting, a sustained-trajectory update, or a redundant sample
  3. A PriorityShedSink that maintains separate sub-channels per priority and drops Low-priority items first when the shared channel approaches capacity
  4. The audit script from L3 wired into the cargo test suite as a CI gate that fails on any new unbounded channel or undocumented FlowPolicy
  5. The BurstSimulator from L3 running as an integration test that drives 10x rate for 5 minutes and asserts the high-priority retention property
  6. Per-channel occupancy gauges (the InstrumentedSender from L3) on every edge, plus a flow_policy_drops_total{policy, priority} counter
  7. An updated operational README documenting how to read the channel-occupancy gradient, what each FlowPolicy means, and what the burst simulator's pass/fail criteria are

The orchestrator from M2 and the windowed correlator from M3 are unchanged in structure. Only the channels' policies change, and a new operator (the priority classifier) sits between the normalize and the correlator.


Suggested Architecture

   ┌───────┐   FP=Backpressure                                FP=Backpressure
   │ Radar │═════════>┌────────┐                            ┌──────────────┐
   │  Src  │═════════>│ Norm   │   FP=Backpressure          │   Windowed   │
   └───────┘          │ FanIn  │═════>┌─────────────┐═════>│  Correlator  │
   ┌───────┐  ═══════>│        │      │  Priority   │═════>│  (M3)        │
   │Optical│          └────────┘      │ Classifier  │      └──────┬───────┘
   │  Src  │                          │  + Shed     │             │ FP=Timed(200ms)
   └───────┘                          └─────────────┘             ▼
   ┌───────┐                                                ┌──────────────┐
   │  ISL  │                                                │  Alert Sink  │
   │  Src  │                                                │ (priority    │
   └───────┘                                                │  preserving) │
                                                            └──────────────┘
                                                                  │
                                                                  ▼ shed_drop
                                                            ┌──────────────┐
                                                            │     DLQ      │
                                                            └──────────────┘

   Plus a sidecar:
   ┌─────────────────────────────────────┐
   │  /metrics endpoint exporting:       │
   │   - channel_occupancy{edge}         │
   │   - flow_policy_drops_total{policy} │
   │   - per-priority counters           │
   └─────────────────────────────────────┘

The priority classifier operator sits between the fan-in normalize and the correlator. Its outgoing edge is a PriorityShedSink — a sink that holds two sub-channels (one per priority) and feeds the correlator from a select! that prefers High when both have items. Under shed conditions (the underlying correlator's incoming channel is filling), Low items are dropped first while High items still flow.


Acceptance Criteria

Functional Requirements

  • Every channel in the operator graph has a FlowPolicy set and documented with a comment naming the choice and the reasoning
  • No unbounded_channel anywhere in the data path; the audit script fails CI if one is introduced
  • No tokio::spawn inside any operator's per-event hot path; the audit script's heuristic check fails CI on new occurrences
  • The priority classifier returns Priority::{High, Low} based on the documented rules; the rules are encoded in code with a // Reason: ... comment per branch
  • The PriorityShedSink maintains separate sub-channels and drops Low first when the shared underlying channel is over a configured high-water mark (e.g., 80% capacity)
  • The alert-emitter uses FlowPolicy::Timed(200ms) and routes timed-out alerts to a DLQ rather than blocking past the SLO

Quality Requirements

  • Audit script test: a unit test runs the audit on a synthetic graph that contains an unbounded channel, asserts the audit fails with a recognizable error message
  • Burst simulator test: an integration test drives the pipeline at 10x normal rate for 5 minutes; asserts (a) zero High-priority alerts dropped, (b) memory plateau reached within 30 seconds (no unbounded growth), (c) flow_policy_drops_total{priority="low"} increments while flow_policy_drops_total{priority="high"} stays at zero
  • Bottleneck identification test: a deterministic test injects a synthetic slow operator, runs the simulator, asserts the gradient-reading helper identifies the right operator as the bottleneck within 30 seconds
  • All channel buffer sizes have a documented BurstProfile in the topology declaration; the orchestrator's startup log emits the (edge, buffer_size, profile) triple

Operational Requirements

  • /metrics endpoint extends Phase 3's with: channel_occupancy{edge} (gauge per edge), flow_policy_drops_total{policy, priority} (counter), priority_classifier_decisions_total{priority} (counter)
  • Operational runbook section "Diagnosing a Backpressure Incident" documenting the gradient-reading discipline, with a worked example using the burst simulator's output as the canonical case
  • The audit script runs as part of cargo test --release; failing it fails the CI build

Self-Assessed Stretch Goals

  • (self-assessed) The 10x burst test holds P99 high-priority alert latency under 5 seconds throughout the spike (the 30-second SLA is comfortably met). Provide a histogram from a real 5-minute run.
  • (self-assessed) A circuit breaker (Module 2 L4) wraps the optical-archive call: if the archive's failure rate exceeds 50% over 30 seconds, the breaker opens for 30 seconds; the dashboard shows breaker state. Demonstrated with a wiremock simulating intermittent 503 responses.
  • (self-assessed) The pipeline's checkpoint operator (foreshadowed for M5) uses the Lesson 2 CreditChannel to pause ingestion during the flush; the burst simulator runs through a checkpoint and continues without observable High-priority alert loss.

Hints

How should I encode the priority classification rules?

A small enum and a function with explicit branches. Each branch carries a comment naming the operational rationale.

use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority { High, Low }

pub fn classify_priority(obs: &Observation, recent_objects: &RecentSet) -> Priority {
    // Reason: previously-unseen objects are time-critical; their first
    // observation defines the orbital track and a missed alert can mean
    // a missed conjunction.
    if !recent_objects.contains(&obs.target_object_id) {
        return Priority::High;
    }

    // Reason: sustained-trajectory updates from the high-cadence radar
    // refine existing tracks but redundant samples (>4 within 30s) add
    // little to track quality and are sheddable.
    if obs.source_kind == SourceKind::Radar
       && recent_objects.sample_count(&obs.target_object_id, Duration::from_secs(30)) > 4
    {
        return Priority::Low;
    }

    Priority::High
}

The RecentSet is a small auxiliary structure that the classifier owns; it tracks the last N seconds of (object_id, sensor_timestamp) pairs and supports contains and sample_count lookups. Bound it by both time and count (the L2 sliding-window pattern from M3 applies here).
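
A sketch of that structure under the stated double bound. The u64 object IDs and the use of Instant (rather than the observation's sensor timestamp) are simplifications of the real envelope.

use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

pub struct RecentSet {
    window: Duration,
    max_entries: usize,
    entries: VecDeque<(u64, Instant)>, // (object_id, seen_at), oldest first
    counts: HashMap<u64, usize>,
}

impl RecentSet {
    pub fn new(window: Duration, max_entries: usize) -> Self {
        Self { window, max_entries, entries: VecDeque::new(), counts: HashMap::new() }
    }

    pub fn insert(&mut self, object_id: u64) {
        let now = Instant::now();
        self.evict_expired(now);
        // Count bound: evict the oldest entry even if it is still in-window.
        if self.entries.len() >= self.max_entries {
            self.pop_oldest();
        }
        self.entries.push_back((object_id, now));
        *self.counts.entry(object_id).or_insert(0) += 1;
    }

    pub fn contains(&self, object_id: &u64) -> bool {
        self.counts.get(object_id).is_some_and(|&c| c > 0)
    }

    pub fn sample_count(&self, object_id: &u64, within: Duration) -> usize {
        // Linear scan is fine at this size; the set holds seconds of data.
        let now = Instant::now();
        self.entries.iter()
            .filter(|(id, t)| id == object_id && now.duration_since(*t) <= within)
            .count()
    }

    fn evict_expired(&mut self, now: Instant) {
        // Time bound: drop entries older than the window.
        while matches!(self.entries.front(), Some((_, t)) if now.duration_since(*t) > self.window) {
            self.pop_oldest();
        }
    }

    fn pop_oldest(&mut self) {
        if let Some((id, _)) = self.entries.pop_front() {
            if let Some(c) = self.counts.get_mut(&id) {
                *c -= 1;
                if *c == 0 { self.counts.remove(&id); }
            }
        }
    }
}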

How should the PriorityShedSink select between sub-channels?

A tokio::select! over two recv futures, with a bias toward High when both are ready. The biased mode of select! gives deterministic precedence — without it, the macro picks randomly among ready arms.

use anyhow::{anyhow, Result};
use tokio::sync::mpsc;

pub async fn run_priority_sink(
    mut high_rx: mpsc::Receiver<Observation>,
    mut low_rx: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
) -> Result<()> {
    let mut low_open = true;
    loop {
        tokio::select! {
            biased;
            recv = high_rx.recv() => match recv {
                Some(obs) => output.send(obs).await
                    .map_err(|_| anyhow!("downstream dropped"))?,
                None => break,
            },
            recv = low_rx.recv(), if low_open => match recv {
                Some(obs) => output.send(obs).await
                    .map_err(|_| anyhow!("downstream dropped"))?,
                // A closed Low channel would return None on every poll and
                // busy-loop the select; disarm the branch instead.
                None => low_open = false,
            },
        }
    }
    Ok(())
}

The biased; directive is what gives High deterministic preference: with it, the arms are polled in the order written. Without it, select! polls the arms in random order on every iteration, so when both channels hold items the High channel wins only about half the time. Documented in the operator's comment.

How do I make the burst simulator deterministic?

Drive synthetic events with a fixed-seed RNG; sample channel occupancy at deterministic wall-clock intervals; assert against the resulting time series rather than against transient peaks. The key insight: real-world burst behavior is not deterministic, but the simulator's value is the regression-detection property — same inputs produce same outputs, so a regression is visible as a change in the time series.

Use tokio::time::pause() and tokio::time::advance() to drive the simulation in fast-forward without real wall-clock waits; this is why the simulator uses tokio::time::Instant rather than std::time::Instant, since only the former respects the paused clock. The simulator runs in milliseconds rather than minutes, fits in CI, and runs deterministically every time.
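
The shape of that fast-forward test, assuming tokio's test-util feature is enabled; test_source() and the edge name are illustrative. With start_paused = true, the runtime's clock is paused at startup and sleeps auto-advance it whenever all tasks are idle.

use tokio::time::Duration;

#[tokio::test(start_paused = true)]
async fn burst_simulation_fast_forwards() {
    let sim = BurstSimulator::new(50_000, Duration::from_secs(300));
    let report = sim.drive(test_source()).await.unwrap();
    // Five simulated minutes complete in milliseconds of wall-clock time.
    assert_eq!(report.identify_bottleneck().as_deref(), Some("correlator_in"));
}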

How do I size the High and Low sub-channels?

Total capacity should match the underlying channel's capacity — say, 4096 in total. Split: 75% High (3072), 25% Low (1024). The High side is sized to the steady-state High rate plus burst headroom; the Low side is sized small because it is the first to shed under load and additional capacity does not add value.

The split is documented in the BurstProfile for the edge and visible at startup in the orchestrator's structured log. Operational tuning revisits the split if dashboard data shows the High side hitting capacity under normal load (then it is undersized) or the Low side being persistently empty (then it is oversized and the capacity could be reallocated).

How do I integrate the audit script with cargo test?

A #[test] function that builds the production graph and runs audit_or_fail on it. The test fails the CI build whenever a new edge or operator violates the audit rules.

#[test]
fn pipeline_passes_backpressure_audit() {
    let graph = build_production_pipeline_graph();
    let result = audit_or_fail(&graph);
    assert!(
        result.is_ok(),
        "backpressure audit failed:\n{}",
        result.unwrap_err()
    );
}

build_production_pipeline_graph is the same function the binary calls for its actual topology — sharing the construction code between binary and test ensures the test exercises what production runs, not a stale fixture.


Getting Started

Recommended order:

  1. FlowPolicy enum and ConfigurableSink. Lesson 1's primitive; wire it through every existing edge in the topology with FlowPolicy::Backpressure as the explicit choice on most edges.
  2. Audit script. Lesson 3's primitive; encode it as a cargo test function. It will pass at this stage; the value is preventing future regressions.
  3. InstrumentedSender + per-channel occupancy gauge. Lesson 3's primitive; wrap every mpsc::Sender in the production graph. Verify the dashboard shows occupancy values at startup.
  4. Priority classifier. Encode the classification rules; unit-test them against synthetic inputs.
  5. PriorityShedSink with sub-channel split. The biased-select pattern; integration-test it with a synthetic load mix.
  6. BurstSimulator integration test. Drive 10x rate for 5 simulated minutes; assert High-priority retention.
  7. Operational runbook + structured log emit at startup. The "what each policy means" reference for ops.
  8. CI integration: audit script in cargo test, burst simulator in nightly CI.

Aim for the 10x burst simulator passing with the priority classifier in place by day 7. The audit script and runbook are finishing work that pays back later.


What This Module Sets Up

In Module 5 you will make the windowed correlator's state crash-safe via checkpointing. The credit-based-flow primitive from this module's Lesson 2 is the mechanism that pauses the upstream during the checkpoint flush. The bounded-channel-everywhere invariant the audit enforces is what lets the checkpointed state size be bounded and predictable. The exactly-once-via-idempotency frame from M2 plus the retract-aware sink from M3 plus the priority shedding from M4 compose into a pipeline that is correct, hardenable under load, and crash-safe under restart — the full M5 deliverable.

In Module 6 you will surface the per-channel occupancy gradient and the per-priority drop counters as the operational dashboard's primary panels. The audit script becomes part of the deploy gate; the BurstSimulator becomes part of the SLO compliance test. The work in this module is the operational foundation for that observability stack.

The patterns the module installs — explicit per-edge FlowPolicy, priority-aware load shedding, audit-as-CI-test — generalize beyond SDA. They are the standard streaming-pipeline hardening techniques for any system that must survive bursts; the module's specifics are where the techniques meet a real workload.

Module 05 — Delivery Guarantees and Fault Tolerance

Track: Data Pipelines — Space Domain Awareness Fusion
Position: Module 5 of 6
Source material: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery), Chapter 8 (Exactly-Once Semantics), Chapter 9 (Failure Handling and Reprocessing); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Fault Tolerance, Microbatching and Checkpointing); Database Internals — Alex Petrov, Chapter 5 (Checkpointing in Recovery); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Error Handling and Dead-Letter Queues, Late-Arriving Data)
Quiz pass threshold: 70% on all four lessons to unlock the project


Mission Context

OPS ALERT — SDA-2026-0207
Classification: RESTART-SAFETY HARDENING
Subject: Make the windowed correlator crash-safe and the alert path exactly-once-effective

Two months ago, a maintenance window required restarting the pipeline to apply a security patch. The orchestrator's graceful-drain logic worked correctly — every operator drained its incoming channel before exiting — but the alert subscriber had already received fourteen alerts that the new pipeline did not know about, and the new pipeline emitted six alerts that the subscriber had already acted on. Two false-positive collision-avoidance maneuvers were executed as a consequence. The postmortem identified two missing pieces: durable state on the producer side (so restart resumes from where the previous instance left off), and idempotent processing on the consumer side (so duplicate deliveries do not produce duplicate effects). This module installs both.

The pipeline at the start of this module handles steady-state load (M4), produces correct event-time results (M3), and is correctly orchestrated and supervised (M2). It has one structural blind spot: it loses data on a process restart. The windowed correlator's state is in process memory; the supervisor restarts a panicked operator with a fresh empty state; in-flight observations between operators are buffered in tokio channels that do not survive a process exit. Every restart loses some non-trivial amount of work, and the SDA-2026-0207 incident's failure mode is what happens when that lost work straddles a real-world action boundary like an alert subscriber.

Module 5 is the response. At-least-once delivery at the transport layer (Kafka producer with acks=all + retries; consumer with process-then-commit) gives the property "every observation reaches the consumer at least once, with duplicates as the operational cost." Idempotency at the application layer (sink-side dedup keyed on observation_id, idempotent SQL UPSERT, Kafka's idempotent producer) gives the property "duplicate deliveries produce identical effects on the world." The two together are effective exactly-once — the pipeline's net effect is exactly-once even though the underlying transport admits duplicates. Checkpointing captures the windowed correlator's state durably so restart resumes from a saved snapshot rather than rebuilding from scratch. Dead-letter queues route permanent errors to a separate sink with metadata so engineers can investigate and re-inject after fixes.

The mental model the module installs is the four-piece reliability discipline: (1) at-least-once at every transport boundary, (2) idempotency at every state-modifying boundary, (3) checkpointed state on every stateful operator, (4) DLQ for every permanent-error path with explicit re-processing tooling. Every streaming pipeline in production combines these four; the module's specifics are where the patterns meet SDA's actual workload.


Learning Outcomes

After completing this module, you will be able to:

  1. Distinguish at-most-once, at-least-once, and exactly-once delivery semantics rigorously, and configure the Kafka producer/consumer pair for at-least-once
  2. Compose at-least-once delivery with application-layer idempotency to produce effective exactly-once processing, and recognize where the guarantee holds versus where boundary owners must implement their own dedup
  3. Implement bounded sliding-window dedup sets keyed on natural or derived idempotency keys, with the production-safety double-bound (time AND count)
  4. Configure Kafka's idempotent producer (enable.idempotence=true) and reason about its partition-scoped guarantee versus transactional Kafka's cross-partition guarantee
  5. Implement checkpointing of stateful operators with the State+Offset recovery contract, atomic temp-file + rename writes, and the pause-snapshot-resume protocol via credit-withholding
  6. Choose between aligned (Flink-style barriers) and per-operator checkpoints based on the pipeline's idempotency machinery and operational tradeoffs
  7. Classify operator errors into transient/permanent/discardable and route each to the appropriate destination (retry/DLQ/drop with counter), with DLQ entries carrying schema-versioned metadata for re-processing tools
  8. Recognize the discard-bucket anti-pattern and apply the operational disciplines (alert on growth rate, periodic review, bounded retention) that prevent it

Lesson Summary

Lesson 1 — At-Least-Once Delivery

The three levels of delivery guarantee (at-most-once, at-least-once, exactly-once) and the operational tradeoffs each implies. Producer-side at-least-once via acks=all + retries + max.in.flight (with the ordering caveat). Consumer-side at-least-once via process-then-commit and enable.auto.commit=false. Three sources of duplicates (producer retries after partial success, consumer crashes between process and commit, rebalances during processing). The cost of duplicates is proportional to the duplicate rate × per-event downstream cost; for SDA's alert-triggering subscriber, this drives the discipline.

Key question: A consumer is configured with enable.auto.commit=true. The application crashes after auto-commit fired but before the application processed the messages whose offsets it committed. What happens, and what is the lesson's recommended fix?

Lesson 2 — Exactly-Once via Idempotency

Idempotency as the application-layer property that composes with at-least-once delivery into effective exactly-once. Natural keys (observation_id) versus derived keys (sorted content-addressable hash). Bounded dedup sets with the double-bound (time AND count) for production safety. Kafka's enable.idempotence=true as broker-side PID + sequence dedup, partition-scoped. Where the guarantee holds (within the pipeline, at idempotent SQL/Kafka/HTTP-with-key boundaries) versus where boundary owners must implement dedup themselves (alerts that trigger external actions).

Key question: The pipeline emits ConjunctionRisk events to a non-idempotent webhook that triggers an email to a human operator. Does the at-least-once + idempotency composition give effective exactly-once at the email side, and what is the framework for thinking about it?

Lesson 3 — Checkpointing

Checkpointing as the durable-state mechanism that lets a restarted operator resume from a saved State+Offset rather than rebuilding from scratch. The pause-snapshot-resume protocol via credit-withholding from M4 L2. Aligned (Flink barriers) versus per-operator checkpoints — SDA uses per-operator with idempotency-driven recovery for global consistency. Storage tiers: a fast local NVMe primary, a durable remote object-storage replica, hybrid as the production default. Atomic temp-file + rename for all-or-nothing durability (sketched below).
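
A minimal sketch of that write path using tokio::fs. The paths are illustrative, and a production version would also fsync the file and its directory around the rename.

use anyhow::Result;
use std::path::Path;

pub async fn write_checkpoint_atomically(dir: &Path, state: &[u8]) -> Result<()> {
    let tmp = dir.join("checkpoint.bin.tmp");
    let fin = dir.join("checkpoint.bin");
    tokio::fs::write(&tmp, state).await?;
    // rename is atomic on POSIX filesystems: a reader (or a recovering
    // operator) sees either the old checkpoint or the new one, never a
    // partially written file.
    tokio::fs::rename(&tmp, &fin).await?;
    Ok(())
}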

Key question: A teammate proposes storing only the operator's serialized state in the checkpoint, omitting the offset on the grounds that 'the consumer's committed offset is already durable.' Why does the lesson reject this and insist on the State+Offset pair?

Lesson 4 — Dead Letter Queues

The DLQ pattern. Three error categories (transient/retry, permanent/DLQ, discardable/drop). DLQ metadata as the debug-tool foundation: timestamp, operator, error_kind, error_message, retry_count, original_payload, schema_version. Poison pills as the canonical case the DLQ exists for. The DLQ as a re-processing source after underlying issues are fixed, absorbed by L2's idempotency machinery. Three operational disciplines that prevent the discard-bucket anti-pattern: alert on growth rate, periodic review cadence, bounded retention.

Key question: Operations sees a sudden 40x spike in dlq_entries_total{error_kind="Deserialization"} from a single operator. What does the DLQ's schema design support as a first-look diagnosis, and what is the typical resolution path?


Capstone Project — Exactly-Once Conjunction Alert Pipeline

Make the M4 hardened pipeline crash-safe and exactly-once-effective. The windowed correlator becomes a CheckpointingOperator with 30-second cadence, atomic writes, and local-first/remote-second/fresh-third recovery. The alert sink uses the L2 DedupSet keyed on alert_id. Kafka consumer reconfigures for process-then-commit. Permanent errors route to a DLQ with the L4 schema; an sda-reprocess CLI tool re-injects DLQ entries with filter and dry-run support. Three crash tests (mid-process, mid-checkpoint, mid-emit) assert post-restart correctness. Acceptance criteria, suggested architecture, deterministic crash-test patterns, and the full project brief are in project-exactly-once-alerts.md.

The orchestrator from M2, the windowed correlator from M3, and the priority-aware shedding from M4 are all unchanged in structure. The new operators wrap or extend the existing pieces; the operator graph grows by a few nodes (DLQ sink, alert subscriber boundary state).


File Index

module-05-delivery-guarantees-and-fault-tolerance/
├── README.md                                         ← this file
├── lesson-01-at-least-once.md                        ← At-least-once delivery
├── lesson-01-quiz.toml                               ← Quiz (5 questions)
├── lesson-02-exactly-once-idempotency.md             ← Exactly-once via idempotency
├── lesson-02-quiz.toml                               ← Quiz (5 questions)
├── lesson-03-checkpointing.md                        ← Checkpointing
├── lesson-03-quiz.toml                               ← Quiz (5 questions)
├── lesson-04-dead-letter-queues.md                   ← Dead letter queues
├── lesson-04-quiz.toml                               ← Quiz (5 questions)
└── project-exactly-once-alerts.md                    ← Capstone project brief

Prerequisites

  • Modules 1 through 4 completed — the Observation envelope, the OperatorGraph with supervisor, the watermark-aware windowed correlator, and the per-edge FlowPolicy discipline are all assumed
  • Foundation Track completed — async Rust, channels, runtime intuitions
  • Familiarity with tokio::sync::mpsc, tokio::fs (for atomic-rename writes), serde (for state serialization), bincode (for compact binary checkpoint format), and the rdkafka Rust client's producer/consumer APIs
  • Comfort reading Kafka's producer configuration documentation (acks, retries, enable.idempotence, max.in.flight.requests.per.connection) and consumer configuration (enable.auto.commit, commit modes)

What Comes Next

Module 6 (Observability and Lineage) makes the pipeline's correctness visible to operations. The new metrics from this module — checkpoint_age_seconds, dlq_entries_total, recovery_path_total — become panels on the resilience dashboard. The runbook discipline established here (per-error-kind playbooks, re-processing protocols) becomes part of the on-call rotation's standard procedure. The patterns this module installed — at-least-once + idempotent + checkpointed + DLQ'd — generalize beyond SDA to any streaming system that must survive restarts; M6 develops the observability stack that makes those patterns operationally legible.

The pipeline at the end of this module is correct under load, correct across restart, correct in event time, and correctly orchestrated. Module 6 turns "correct" into "correct AND visible," which is the difference between a system that works and a system that operations can trust.

Lesson 1 — At-Least-Once Delivery

Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance
Position: Lesson 1 of 4
Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 7 (Reliable Data Delivery — acks, enable.idempotence, retries, in-flight requests, consumer commit ordering); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Fault tolerance in stream processing)


Context

Modules 1 through 4 produced a pipeline that handles steady-state load, propagates backpressure cleanly, computes correct event-time results, and survives most operational failures. It has one structural blind spot: it loses data on a process restart. The windowed correlator from Module 3 holds in-process state. The orchestrator's supervisor restarts a panicked operator with a fresh Task, which means a fresh empty state. In-flight observations between the source and the correlator are buffered in tokio channels that do not survive a process exit. When a deploy rolls a new pipeline binary, the previous binary's in-flight buffer is gone and the windows it had been accumulating are gone.

The mission framing for this module is the SDA-2026-0207 incident two months ago. A maintenance window required restarting the pipeline to apply a security patch. The orchestrator's graceful-drain logic worked correctly — every operator drained its incoming channel before exiting — but the alert subscriber had already received fourteen alerts that the new pipeline did not know about, and the new pipeline emitted six alerts that the subscriber had already acted on. Two false-positive collision-avoidance maneuvers were executed as a consequence. The postmortem identified two missing pieces: durable state on the producer side (so restart resumes from where the previous instance left off), and idempotent processing on the consumer side (so duplicate deliveries do not produce duplicate effects).

This module installs both pieces. Lesson 1 establishes the delivery-semantics vocabulary — the three levels of guarantee (at-most-once, at-least-once, exactly-once) — and the producer-and-consumer-side configuration choices that produce at-least-once. Lesson 2 develops idempotency as the application-layer property that composes with at-least-once delivery to produce effective exactly-once processing. Lesson 3 introduces checkpointing — the durable state mechanism that lets a restarted operator resume without losing window state. Lesson 4 covers dead-letter queues for events that cannot be processed regardless of how many times they are retried. The capstone wires all four into the M4 pipeline; by the end of M5, the alert pipeline is correct under load AND across restarts.


Core Concepts

The Three Levels

Every message-delivery system makes one of three guarantees about a given message.

At-most-once. The message is delivered zero or one times. Loss is possible (the message can be dropped); duplication is not. The simplest configuration: send and forget. UDP without retries falls in this category.

At-least-once. The message is delivered one or more times. Loss is not possible (every message reaches the consumer at least once); duplication is possible (the same message can be delivered multiple times under retry). The default for any system that retries on failure. Kafka's standard producer/consumer pair, configured with acks=all and retries, falls in this category.

Exactly-once. The message is delivered exactly one time. No loss, no duplication. The strongest guarantee, and the most expensive — true exactly-once delivery is impossible without coordination between producer and consumer (two-phase commit, transactional Kafka, or strong assumptions about the network). Most systems labeled "exactly-once" are actually "at-least-once delivery + idempotent processing = effective exactly-once at the application layer."

The choice is operational and determined by what the consumer's failure mode looks like under each. A throughput counter is fine with at-most-once — losing 0.1% of events does not change the per-minute aggregate. An audit log requires at-least-once — every observation must be recorded, even at the cost of duplicates that the audit reader can dedupe. A conjunction alert that triggers a real-world action requires effective exactly-once — neither a missed alert (collision risk) nor a phantom alert (unnecessary maneuver) is acceptable.

At-Least-Once on the Producer Side

The Kafka producer's reliability is configured by three settings working together.

acks. Controls when the producer considers a send "successful." acks=0 means "send and assume success" (at-most-once: a network drop is silently lost). acks=1 means "wait for the partition leader to acknowledge" (still loses if the leader dies before replicating). acks=all means "wait for the full in-sync replica set to acknowledge" (durable on the broker side; the message will not be lost barring catastrophic broker failures). For at-least-once, acks=all is required.

retries. How many times the producer retries a failed send. Failures here are transport-level: network errors, broker timeouts, leader-elections-in-progress. With retries=0 and acks=all you get at-most-once with strong durability when delivery succeeds; with retries > 0 you get at-least-once. Production setting is typically a high number (Integer.MAX_VALUE is common — the producer retries until either success or delivery.timeout.ms elapses). For at-least-once, retries must be enabled.

max.in.flight.requests.per.connection. How many sends the producer can have outstanding to a given broker at once. With idempotence disabled, values > 1 risk re-ordering on retry: send A is in flight, send B is in flight; A fails and is retried; B succeeds before A's retry; the broker sees B-then-A. For ordered at-least-once, set this to 1 (degenerate single-credit case from Module 4 L2). For unordered at-least-once, higher values give more throughput at the cost of order.

The combination acks=all + retries > 0 produces at-least-once. Duplicates are possible because the producer might retry a send that actually succeeded but whose acknowledgment was lost in transit; the broker sees the same message twice. The application must tolerate that — the lesson's title is the guarantee, not "exactly-once."

At-Least-Once on the Consumer Side

The consumer's reliability is about the ordering of processing and committing. The consumer reads a batch of messages, processes them, then commits the offset back to the broker. If the consumer crashes between reading and processing, the next consumer instance starts at the previously-committed offset and re-reads (and re-processes) the unprocessed batch — duplicate processing, but no loss.

The critical ordering is process-then-commit. The consumer must finish processing a message before committing its offset. The wrong ordering (commit-then-process) loses messages: if the consumer commits and then crashes before processing, the next instance starts at the committed offset and never re-reads the messages that the first instance had committed but not yet processed. The messages are silently lost.

The Kafka consumer's enable.auto.commit=true configuration is the canonical version of the wrong ordering. Auto-commit fires on a timer, regardless of whether the application has actually processed the messages whose offsets it commits. For any reliability discipline beyond the loosest, enable.auto.commit=false and explicit commitSync after processing is required.

The duplicate-on-restart property is acceptable because the application is idempotent on observation_id (the topic of Lesson 2). Messages re-read after a restart go through the dedup logic and are silently dropped. The combination of producer-side at-least-once + consumer-side process-then-commit + sink-side idempotency is the effective-exactly-once shape this module is building toward.

Where Duplicates Come From

Three sources of duplication under at-least-once.

Producer-side retries after partial success. The producer sends; the broker writes the message to its log; the broker's acknowledgment is lost in transit; the producer's retry timer fires; the producer sends again; the broker writes the message again. The same logical message is now in the log twice with different offsets. Consumers see both copies. This is the duplicate that the producer-side idempotent producer (enable.idempotence=true, Lesson 2) is designed to prevent — using a producer ID + sequence number that the broker uses to dedup.

Consumer-side crashes between process and commit. The consumer processes a batch; the consumer crashes before committing; the next consumer instance reads the same batch and processes it again. The application's effect on the world has been applied twice. Idempotency on the application's effect (e.g., UPSERT-by-natural-key in the sink) is what absorbs this.

Rebalances during processing. A consumer group rebalance reassigns partitions among consumer instances. If a partition is reassigned mid-batch (the original consumer was processing but had not committed when the rebalance fired), the new consumer reads from the previously-committed offset and re-processes. Same shape as the crash case.

The lesson's framing: duplicates are not a bug, they are a configurable cost. Rare under good operational conditions, frequent during deploys or partition migrations, always possible. The application is responsible for handling them.

Operational Cost of Duplicates

The cost is proportional to the duplicate rate × the per-event cost of duplicate processing downstream.

For a sink that does an UPSERT keyed on observation_id with a strict-greater check (the M3 L4 retract-aware shape): the duplicate is absorbed by the comparison, the cost is one wasted SQL round-trip per duplicate. At a typical duplicate rate of 0.1% during a healthy steady-state, this is invisible at SDA's scale.
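
The shape of that statement, with illustrative table and column names rather than the M3 L4 schema: a duplicate delivery carries a revision no greater than the stored one, the WHERE clause fails, and the statement is a no-op.

/// Illustrative strict-greater UPSERT (Postgres syntax). A duplicate is
/// absorbed by the revision comparison at the cost of one SQL round-trip.
const UPSERT_OBSERVATION: &str = r#"
    INSERT INTO observations (observation_id, revision, payload)
    VALUES ($1, $2, $3)
    ON CONFLICT (observation_id) DO UPDATE
        SET revision = EXCLUDED.revision,
            payload  = EXCLUDED.payload
        WHERE observations.revision < EXCLUDED.revision
"#;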

For a sink that increments a counter without an idempotency check: every duplicate is a miscount. The counter ends up high by the duplicate rate × window count. The audit dashboard reports inflated numbers. This is the canonical "we have at-least-once + non-idempotent sink" failure mode and the reason Module 2 L3 made idempotency a first-class topic.

For a sink that triggers an external action (an alert subscriber that fires a satellite-avoidance maneuver): every duplicate is a real-world wrong action. The cost is real fuel burn or a real hardware adjustment. This is the case the SDA-2026-0207 incident reflected, and the case where exactly-once-effective via idempotency on alert ID is necessary, not optional.

The cost shape determines how aggressively the application tightens the at-least-once bound (idempotent producer, smaller in-flight, faster commit cadence) and how robust the sink's idempotency must be.


Code Examples

A Kafka Producer Configured for At-Least-Once

The rdkafka crate's producer configuration. The settings encode the lesson's at-least-once shape exactly.

use anyhow::Result;
use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::time::Duration;

pub fn build_at_least_once_producer(brokers: &str) -> Result<FutureProducer> {
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        // acks=all: wait for the full in-sync replica set to ack.
        // The message is durable on the broker side before send returns.
        .set("acks", "all")
        // retries: keep retrying transient transport failures.
        // i32::MAX is the conventional "retry until delivery.timeout.ms".
        .set("retries", "2147483647")
        // delivery.timeout.ms: how long the producer keeps retrying
        // before giving up. 2 minutes is reasonable for the SDA pipeline;
        // longer encourages the producer to ride out longer broker
        // transient failures.
        .set("delivery.timeout.ms", "120000")
        // enable.idempotence is OFF deliberately for this lesson — we
        // want raw at-least-once. Lesson 2 turns it on for the
        // exactly-once-effective producer.
        .set("enable.idempotence", "false")
        // max.in.flight.requests: 1 forces the strongest ordering
        // guarantee at the cost of throughput. Production might use
        // 5 (the broker-enforced max for idempotent mode) when ordering
        // can be reconstructed via observation_id at the consumer.
        .set("max.in.flight.requests.per.connection", "1")
        .create()?;
    Ok(producer)
}

pub async fn send_observation(
    producer: &FutureProducer,
    topic: &str,
    obs: &Observation,
) -> Result<()> {
    let payload = serde_json::to_vec(obs)?;
    let key = obs.observation_id.to_string();
    let record = FutureRecord::to(topic)
        .key(&key)
        .payload(&payload);

    // The future resolves when the full in-sync replica set has acked.
    // On any broker-acknowledged-but-network-lost case, the producer
    // retries automatically per the configuration above; the resolution
    // happens on the eventual successful retry.
    producer.send(record, Duration::from_secs(30)).await
        .map_err(|(e, _)| anyhow::anyhow!("send failed after retries: {e}"))?;
    Ok(())
}

The enable.idempotence=false is deliberate here — we are demonstrating raw at-least-once. The producer's retry-on-failure is what gives the at-least guarantee; duplicates are the cost. Lesson 2 turns idempotence on and explains the producer-ID + sequence-number mechanism that the broker uses to dedup at the producer side. This lesson's pipeline accepts the producer-side duplicates and absorbs them at the consumer-side sink.
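
For contrast, the Lesson 2 variant in sketch form. With enable.idempotence=true, librdkafka enforces acks=all, requires retries, and caps in-flight requests at five; the broker dedups on the producer's per-partition sequence numbers.

use anyhow::Result;
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;

pub fn build_idempotent_producer(brokers: &str) -> Result<FutureProducer> {
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("acks", "all")
        // PID + sequence-number dedup on the broker side; partition-scoped.
        .set("enable.idempotence", "true")
        .create()?;
    Ok(producer)
}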

A Process-Then-Commit Consumer

The Kafka consumer pattern that gives at-least-once on the consumer side. enable.auto.commit=false, explicit commit_message after the batch is processed.

use anyhow::Result;
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{CommitMode, Consumer, StreamConsumer};
use rdkafka::message::{BorrowedMessage, Message};
use tokio_stream::StreamExt;

pub fn build_at_least_once_consumer(brokers: &str, group: &str) -> Result<StreamConsumer> {
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("group.id", group)
        // The critical setting: commits must be explicit, not auto.
        // Auto-commit fires on a timer regardless of processing
        // progress, which is the canonical 'commit-then-process'
        // bug shape.
        .set("enable.auto.commit", "false")
        .set("auto.offset.reset", "earliest")
        .create()?;
    Ok(consumer)
}

pub async fn run_consumer_loop(
    consumer: &StreamConsumer,
    sink: &impl ObservationSink,
) -> Result<()> {
    let mut stream = consumer.stream();
    while let Some(result) = stream.next().await {
        let message: BorrowedMessage = result?;
        let payload = message.payload()
            .ok_or_else(|| anyhow::anyhow!("empty payload"))?;
        let obs: Observation = serde_json::from_slice(payload)?;

        // PROCESS first: feed the sink. The sink's idempotency
        // (Lesson 2) absorbs duplicates from rare retries.
        sink.write(obs).await?;

        // COMMIT only after process succeeds. A crash between process
        // and commit causes the next consumer instance to re-read
        // and re-process; the sink dedups.
        consumer.commit_message(&message, CommitMode::Sync)?;
    }
    Ok(())
}

The commit_message(..., CommitMode::Sync) blocks until the broker confirms the commit — the next message is not read until the previous commit is durable. The async variant (CommitMode::Async) commits in the background, which is faster but introduces a window where a crash between the async commit's queue and its broker confirmation can lose the commit; for SDA's reliability budget we use sync commits. The duplicate window is exactly the time between sink.write returning and commit_message returning — typically sub-millisecond.

A Crash-Injection Test Harness

The test harness that verifies the at-least-once guarantee under deliberate crash conditions. The same harness is used in Lesson 2 to verify exactly-once-effective when idempotency is added.

use anyhow::Result;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

/// A test-only sink that counts writes and panics on the Nth call.
pub struct CrashingSink {
    writes: Arc<AtomicU32>,
    panic_at: u32,
}

impl CrashingSink {
    pub fn new(panic_at: u32) -> Self {
        Self {
            writes: Arc::new(AtomicU32::new(0)),
            panic_at,
        }
    }

    pub fn writes(&self) -> u32 { self.writes.load(Ordering::SeqCst) }
}

#[async_trait::async_trait]
impl ObservationSink for CrashingSink {
    async fn write(&self, _obs: Observation) -> Result<()> {
        let n = self.writes.fetch_add(1, Ordering::SeqCst) + 1;
        if n == self.panic_at {
            // Simulate process crash mid-write. In a real test this
            // would be a tokio::process restart or similar; for
            // illustration, a panic captured by the orchestrator.
            panic!("simulated crash at write #{n}");
        }
        Ok(())
    }
}

#[tokio::test]
async fn crash_between_process_and_commit_replays_messages() {
    // Setup: Kafka producer feeds 10 observations to topic. Consumer
    // processes them with the CrashingSink that panics on write #5.
    let producer = build_at_least_once_producer("localhost:9092").unwrap();
    for i in 0..10 {
        send_observation(&producer, "test-topic", &test_obs(i)).await.unwrap();
    }

    // First consumer instance: panics at write 5. Writes 1-4 succeeded
    // and were committed (commit happens after each process); write 5
    // panicked before commit.
    let sink_a = CrashingSink::new(5);
    let _ = run_consumer_with_sink("test-group", &sink_a).await;
    assert_eq!(sink_a.writes(), 5); // 4 succeeded + the panicked one

    // Second consumer instance: starts from offset 4 (last committed).
    // Re-reads message 5 and processes it, plus 6-10. Total writes
    // for the system: 5 (from instance A) + 6 (from instance B) = 11.
    // The duplicate is the at-least-once cost this module accepts.
    let sink_b = CrashingSink::new(0); // does not crash this time
    let _ = run_consumer_with_sink("test-group", &sink_b).await;
    assert_eq!(sink_b.writes(), 6);  // 5 through 10
}

The test asserts the at-least-once contract: every observation reaches a sink at least once, and message 5 reaches a sink exactly twice. With idempotent processing in Lesson 2, the duplicate's effect on the world is unchanged from a single processing — but the raw write count at the sink is still 11. The lesson's framing: at-least-once delivery is the transport guarantee; idempotency at the application layer is what turns it into exactly-once-effective.


Key Takeaways

  • The three levels of delivery guarantee are at-most-once, at-least-once, and exactly-once. True exactly-once delivery requires coordination beyond what most systems implement; pragmatic exactly-once is achieved by composing at-least-once delivery with idempotent processing at the application layer.
  • At-least-once on the producer side requires acks=all (wait for the full in-sync replica set), retries > 0 (retry transient failures until delivery.timeout.ms elapses), and a deliberate choice on max.in.flight.requests.per.connection (1 for strict ordering; higher for throughput at the cost of order under retry).
  • At-least-once on the consumer side requires process-then-commit ordering with enable.auto.commit=false. Commit-then-process loses messages on a crash between commit and process. Auto-commit's timer-based behavior is the canonical version of the wrong ordering.
  • Three sources of duplicates under at-least-once: producer retries after partial success, consumer crashes between process and commit, rebalances during processing. All three are absorbed by sink-side idempotency (Lesson 2's topic).
  • The cost of duplicates is proportional to duplicate rate × per-event cost of duplicate processing downstream. For UPSERT-by-natural-key sinks the cost is one wasted SQL round-trip; for counter-style sinks it is silent miscounting; for sinks that trigger external actions it is real-world wrong actions. The cost determines how tightly the at-least-once bound is held.

Lesson 2 — Exactly-Once via Idempotency

Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Position: Lesson 2 of 4 Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 8 (Exactly-Once Semantics — Idempotent Producer, Transactions); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Idempotent Operations and Atomicity)


Context

Lesson 1 established at-least-once delivery — every message reaches the consumer at least once, with duplicates as the operational cost. This lesson is the second half of the composition. The application makes its operations idempotent: processing the same message twice produces the same effect on the world as processing it once. The pair — at-least-once delivery + idempotent processing — gives effective exactly-once semantics. The pipeline's net effect is exactly-once even though the underlying transport admits duplicates. This is the standard production approach to exactly-once; true transport-level exactly-once requires coordination (transactional Kafka, two-phase commit) that costs more than the application-layer pattern in throughput, complexity, and failure modes.

The pattern is not new to this module. Module 2 L3 introduced idempotency keys carried on the envelope. Module 3 L4 added retract-aware sinks with strict-greater UPSERT semantics. This lesson develops the topic in depth: what makes an operation idempotent, where the natural keys come from, how to bound the dedup state, what Kafka's idempotent producer does at the broker level, and where the exactly-once guarantee holds versus where it doesn't. The capstone in this module composes all of it — at-least-once delivery from L1, idempotent processing from this lesson, checkpointing from L3, dead-letter queues from L4 — into a pipeline that survives the SDA-2026-0207 incident's failure mode without dropping or duplicating alerts.


Core Concepts

Idempotency, Defined

A function f is idempotent if f(f(x)) = f(x) — applying it twice with the same input produces the same result as applying it once. The classic example is setting a field to a specific value: setting x = 5 is idempotent; setting x += 1 is not. The streaming-system version is about operator effects on durable state: an operator that writes "the value of window 17 is result" is idempotent on the (window_id, result) pair; an operator that writes "increment the count of window 17" is not.
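
A minimal sketch of the distinction in the streaming shape; WindowStore and both methods are hypothetical stand-ins for an operator's durable state:

use std::collections::HashMap;

/// Hypothetical durable state for a windowed operator.
#[derive(Default)]
struct WindowStore {
    results: HashMap<u64, f64>, // window_id -> result
    counts: HashMap<u64, u64>,  // window_id -> count
}

impl WindowStore {
    /// Idempotent: "the value of window `id` is `result`". Applying this
    /// twice with the same pair leaves the store in the same state.
    fn set_result(&mut self, id: u64, result: f64) {
        self.results.insert(id, result);
    }

    /// Not idempotent: every duplicate delivery inflates the count.
    fn bump_count(&mut self, id: u64) {
        *self.counts.entry(id).or_insert(0) += 1;
    }
}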

Idempotency is a property of the operation, not of the framework or the pipeline. The pipeline can be at-least-once at the transport layer; the application layer is what determines whether duplicates produce identical effects or accumulate. Each operator's effect on the world has its own idempotency story, and the system's overall behavior depends on every operator's individual choice.

The discipline this lesson installs: every operator that writes to durable state — a database, a topic, an external service — must declare in its design which operations are idempotent and on what key. The operator graph from M2 carries forward an explicit idempotency_key_field per stage, documented in the operator's specification. The capstone uses this metadata to assert at startup that every sink is configured idempotently against its expected duplicate-source.

Natural Keys and Derived Keys

The natural key for SDA observations is the envelope's observation_id — a UUID generated at the source, carried unchanged through every stage, present on every observation. All operator-level dedup logic keys on it. The orchestrator from M2 enforces that operators preserve observation_id across transformations (a normalize that re-generates the UUID is a bug; the supervisor's audit asserts this).

For derived events — the ConjunctionRisk that the M3 correlator emits from two observations — there is no natural ID. The standard derivation is a deterministic hash of the contributing inputs:

derived_id = hash(left_observation_id || right_observation_id || window_id)

Sorting the input IDs before hashing makes the result symmetric (same two observations produce the same derived_id regardless of order). The hash is content-addressable and reproducible: any two operator instances correlating the same pair of observations produce the same derived_id, which is what makes downstream dedup work.
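
A sketch of the derivation, assuming the sha2 crate and a u64 window_id; the helper name and the truncate-to-128-bits choice are illustrative:

use sha2::{Digest, Sha256};
use uuid::Uuid;

/// Deterministic, order-independent ID for a correlated pair. Sorting
/// the inputs makes the result symmetric; truncating the SHA-256 digest
/// to 128 bits yields a UUID-shaped key for downstream dedup.
pub fn derived_id(left: Uuid, right: Uuid, window_id: u64) -> Uuid {
    let (a, b) = if left <= right { (left, right) } else { (right, left) };
    let mut hasher = Sha256::new();
    hasher.update(a.as_bytes());
    hasher.update(b.as_bytes());
    hasher.update(window_id.to_be_bytes());
    let digest = hasher.finalize();
    Uuid::from_slice(&digest[..16]).expect("digest is 32 bytes")
}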

For events derived from a sliding window over many inputs (analytics aggregates), the natural derived key is (window_id, sequence) — Module 3 L4's retract sequence numbering. The window is the deterministic identifier; the sequence number distinguishes the original emit from corrected emits.

Bounded Dedup State

The sink-side dedup is implemented by a seen-set: a record of recently-seen IDs that the sink consults before applying each operation. A new ID is processed; a previously-seen ID is silently dropped. The set must be bounded — an unbounded seen-set is the silent-OOM pattern Module 4's audit catches.

The bound is by time or by count, ideally by both. Time-based: keep IDs seen in the last N seconds; evict older entries on each insert. Count-based: keep at most M IDs; evict the oldest when at capacity. The double-bound is the production-safety pattern: time alone fails when a burst causes more entries to land in the window than memory permits; count alone fails when a slow stream has its old entries evicted before the duplicates that would be deduped against them.

The window size is operationally determined: it must be larger than the maximum re-delivery window (the longest gap between an original send and a duplicate retry). For Kafka with default settings, this is typically minutes; for SDA's pipeline with 30-second consumer commit cadence and bounded checkpoint duration, 5 minutes is comfortable. Set the window too narrow and duplicates leak through; set it too wide and memory grows. Module 4's BurstProfile pattern applies here — document the chosen window with the rationale.

Kafka's Idempotent Producer

Kafka offers a producer-side idempotency mechanism that prevents duplicates from producer retries. With enable.idempotence=true, the producer attaches a producer ID (PID) and a sequence number to every message. The broker tracks the highest sequence number seen per PID per partition; on a retry that arrives with a sequence number it has already accepted, the broker drops the duplicate silently. The producer continues to retry on transient errors, but the broker dedups before persisting.

The mechanism is single-partition, single-producer scoped — the dedup is per (PID, partition). Across partitions, the producer's idempotence does not provide a cross-partition consistency guarantee; that requires Kafka's transactional producer (a separate mechanism with transactional.id set). For SDA's pipeline, partition-scoped idempotence is sufficient because every observation is keyed on observation_id and routed to a partition by hash(observation_id), so duplicate retries land on the same partition where they get deduped.

The throughput impact is small (<5%) and the configuration is straightforward. Production pipelines should enable enable.idempotence=true by default; L1's at-least-once producer deliberately left it off to demonstrate the bare semantics. Enabling it complements the application-layer dedup: producer-side dedup catches retries before they reach the broker; application-layer dedup catches duplicates from other sources (consumer crashes, rebalances). The two layers are belt-and-suspenders, and both are worth having.
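
A configuration sketch, assuming the same rdkafka crate the L1 consumer uses; build_idempotent_producer is a hypothetical counterpart to L1's build_at_least_once_producer:

use anyhow::Result;
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;

pub fn build_idempotent_producer(brokers: &str) -> Result<FutureProducer> {
    let producer = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        // PID + sequence numbers: the broker drops retried duplicates
        // per (PID, partition) before persisting.
        .set("enable.idempotence", "true")
        // Idempotence requires acks=all; set it explicitly for clarity.
        .set("acks", "all")
        .create()?;
    Ok(producer)
}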

Where the Guarantee Holds

Effective exactly-once via at-least-once + idempotency holds at three boundaries.

Within the pipeline. Every operator that produces output to a downstream operator is implicitly part of the dedup chain because every operator preserves observation_id. The sink-side dedup at the end of the pipeline catches every duplicate that traveled from any source. This works as long as the chain is intact — no operator silently regenerates IDs, no operator emits a fresh ID for a copy of an existing observation.

At external sink boundaries. SQL writes use INSERT ... ON CONFLICT (observation_id) DO NOTHING (or DO UPDATE per Module 3's retract-aware shape). Kafka producers use enable.idempotence=true. HTTP requests carry an Idempotency-Key header (the standard HTTP convention) that the downstream service uses to dedup. Every external write has its own idempotency story, and the operator's responsibility is to know what it is and configure it.
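
For the HTTP boundary, a sketch assuming reqwest with its json feature; post_alert and the choice of alert ID as the key are illustrative:

use anyhow::Result;
use reqwest::Client;
use uuid::Uuid;

/// Carries the Idempotency-Key header; the downstream service (assumed
/// to honor the convention) dedups retried deliveries on the key.
pub async fn post_alert(
    client: &Client,
    url: &str,
    alert_id: Uuid,
    body: &serde_json::Value,
) -> Result<()> {
    client
        .post(url)
        .header("Idempotency-Key", alert_id.to_string())
        .json(body)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}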

Where the guarantee does NOT hold. Operations whose effect on the world is inherently non-idempotent — sending an email, charging a credit card, firing a satellite avoidance maneuver. For these, idempotency must be added at the boundary by storing recently-seen IDs and ignoring duplicates at the consumer side of the boundary, not just within the pipeline. The conjunction-alert subscriber from M5 L1's example is exactly this case; the capstone wires up the subscriber's seen-set as an integrated piece of the alert delivery path.

The lesson's framing is precise: the pipeline can guarantee effective exactly-once delivery TO a boundary; whether the action AT the boundary is itself exactly-once depends on the boundary's idempotency. Boundary owners must implement their own dedup; the pipeline's responsibility is to deliver the keys correctly and document the contract.


Code Examples

A Sliding-Window Dedup Set with Double-Bound Eviction

The sink-side primitive that absorbs duplicates from at-least-once delivery. Bounded by both time (seen IDs older than window are evicted) and count (no more than capacity IDs held at once).

use std::collections::{BTreeSet, VecDeque};
use std::time::{Duration, SystemTime};
use anyhow::Result;
use uuid::Uuid;

/// Bounded seen-set keyed on observation_id. Maintains:
///   - seen: a BTreeSet for O(log N) lookup of duplicate-or-not
///   - order: a VecDeque<(Uuid, SystemTime)> for FIFO eviction
///
/// Invariants:
///   - len(seen) == len(order)
///   - order is strictly time-ordered by insert time
///   - on every insert, expired entries are evicted from order's front
pub struct DedupSet {
    seen: BTreeSet<Uuid>,
    order: VecDeque<(Uuid, SystemTime)>,
    window: Duration,
    capacity: usize,
}

impl DedupSet {
    pub fn new(window: Duration, capacity: usize) -> Self {
        Self {
            seen: BTreeSet::new(),
            order: VecDeque::with_capacity(capacity),
            window,
            capacity,
        }
    }

    /// Record an observation; return true if the ID was novel (caller
    /// should process it), false if the ID is a duplicate (caller
    /// should drop).
    pub fn record(&mut self, id: Uuid, now: SystemTime) -> bool {
        // Evict by time first.
        let cutoff = now.checked_sub(self.window).unwrap_or(SystemTime::UNIX_EPOCH);
        while let Some(&(front_id, front_ts)) = self.order.front() {
            if front_ts < cutoff || self.order.len() >= self.capacity {
                self.order.pop_front();
                self.seen.remove(&front_id);
            } else {
                break;
            }
        }
        // Now check: if seen, return false (duplicate).
        if self.seen.contains(&id) { return false; }
        // Otherwise insert.
        self.seen.insert(id);
        self.order.push_back((id, now));
        true
    }

    pub fn len(&self) -> usize { self.seen.len() }
    pub fn is_empty(&self) -> bool { self.seen.is_empty() }
}

The record method is the only public mutation; the boolean return is the dispatch signal for the sink ("true → write, false → skip"). The order-of-operations matters: time-eviction first, then capacity-eviction, then duplicate check. Doing the duplicate check before eviction would leave time-expired entries in the seen set briefly, which is harmless but wastes memory. Doing eviction-then-insert means the seen set's size is always bounded by capacity after any single insert, regardless of how the inserts came in. The pattern is the L4 retract-aware sink's ancestor: same shape, same eviction discipline, different check direction (this records true-once-then-false; that one writes-once-then-overwrites).
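
A usage sketch of the contract (assumes the uuid crate's v4 feature for the test ID):

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn duplicate_in_window_dropped_then_forgotten() {
        let mut set = DedupSet::new(Duration::from_secs(300), 1024);
        let id = Uuid::new_v4();
        let now = SystemTime::now();
        assert!(set.record(id, now));   // novel: caller processes
        assert!(!set.record(id, now));  // duplicate: caller drops
        // Past the window the entry is evicted and the ID reads as novel
        // again, by design: the window exceeds the re-delivery gap.
        let later = now + Duration::from_secs(301);
        assert!(set.record(id, later));
        assert_eq!(set.len(), 1);
    }
}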

A SQL Sink with Idempotent UPSERT

The boundary-side idempotency. The sink writes observations to a Postgres table; the UPSERT with DO NOTHING is the idempotent operation.

use anyhow::Result;
use sqlx::{Pool, Postgres};

const UPSERT_OBSERVATION: &str = r#"
    INSERT INTO observations (observation_id, source_kind, sensor_timestamp, payload)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (observation_id) DO NOTHING
"#;

pub struct PostgresSink {
    pool: Pool<Postgres>,
}

impl PostgresSink {
    pub async fn write(&self, obs: &Observation) -> Result<()> {
        // The SQL statement is idempotent on observation_id (the primary key
        // with the ON CONFLICT clause). A duplicate row at the same
        // observation_id is silently ignored — the existing row is
        // preserved unchanged. The sink never produces a duplicate
        // effect on the world even under heavy at-least-once duplication.
        let payload = serde_json::to_value(obs)?;
        sqlx::query(UPSERT_OBSERVATION)
            .bind(obs.observation_id)
            .bind(format!("{:?}", obs.source_kind))
            .bind(obs.sensor_timestamp)
            .bind(payload)
            .execute(&self.pool)
            .await?;
        Ok(())
    }
}

The ON CONFLICT DO NOTHING is Postgres's idempotent-insert idiom. A duplicate row at the same observation_id is silently rejected; the sink's write returns Ok regardless of whether the row was new or pre-existing. The duplicate cost is one network round-trip plus a brief lock on the existing row — cheap, and well within the at-least-once duplicate budget. Alternative shapes: DO UPDATE SET ... WHERE EXCLUDED.sequence > observations.sequence for the M3 L4 retract-aware case; DO UPDATE SET ... WHERE EXCLUDED.value > observations.value for last-write-wins on a different field. The pattern at every external sink: choose the operation, document the idempotency, configure the sink correctly.
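
A sketch of the retract-aware alternative (table and column names are hypothetical; EXCLUDED is Postgres's name for the proposed row, and the table name refers to the existing row):

const UPSERT_RETRACT_AWARE: &str = r#"
    INSERT INTO window_results (window_id, sequence, value)
    VALUES ($1, $2, $3)
    ON CONFLICT (window_id) DO UPDATE
    SET sequence = EXCLUDED.sequence, value = EXCLUDED.value
    WHERE EXCLUDED.sequence > window_results.sequence
"#;

The strict-greater guard makes replays of older sequences no-ops, which is what absorbs duplicated retract emissions.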

A Non-Idempotent Operation Made Idempotent

Increment is the canonical non-idempotent operation. The pattern to make it idempotent is to wrap with a seen-set check and only increment on first sight.

use std::collections::HashSet;
use std::sync::Mutex;
use uuid::Uuid;

pub struct IdempotentCounter {
    count: Mutex<u64>,
    seen: Mutex<HashSet<Uuid>>,
}

impl IdempotentCounter {
    pub fn new() -> Self {
        Self {
            count: Mutex::new(0),
            seen: Mutex::new(HashSet::new()),
        }
    }

    /// Increment if the observation_id has not been seen before;
    /// no-op for duplicates. Returns the post-increment count.
    pub fn record_unique(&self, id: Uuid) -> u64 {
        let mut seen = self.seen.lock().unwrap();
        if !seen.insert(id) {
            // Already seen; return current count without incrementing.
            return *self.count.lock().unwrap();
        }
        let mut count = self.count.lock().unwrap();
        *count += 1;
        *count
    }

    pub fn count(&self) -> u64 {
        *self.count.lock().unwrap()
    }
}

The record_unique pattern is the standard wrapping. The seen.insert(id) call returns false if the ID was already present, in which case the function returns early; otherwise it increments. The two locks are taken in a fixed order (seen before count), which prevents deadlock; in production code you would typically put both fields under a single Mutex to make the locking simpler (sketched below). The seen-set must be bounded (per the previous example) — an unbounded seen-set in a long-running counter eventually OOMs the process. The capstone's counter-style metrics use the bounded seen-set pattern at every sink that does increment-style operations; the pipeline as a whole is exactly-once-effective despite at-least-once delivery.
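
The single-lock variant mentioned above, sketched with hypothetical names:

use std::collections::HashSet;
use std::sync::Mutex;
use uuid::Uuid;

#[derive(Default)]
struct CounterState {
    count: u64,
    seen: HashSet<Uuid>,
}

/// Both fields live under one Mutex, so there is no lock ordering to
/// get wrong. The seen-set still needs bounding, per the text above.
pub struct CombinedCounter {
    state: Mutex<CounterState>,
}

impl CombinedCounter {
    pub fn record_unique(&self, id: Uuid) -> u64 {
        let mut st = self.state.lock().unwrap();
        if st.seen.insert(id) {
            st.count += 1;
        }
        st.count
    }
}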


Key Takeaways

  • Idempotency is a property of the operation, not the framework. Every operator that writes to durable state must declare which operations are idempotent and on what key. The orchestrator's metadata carries the per-stage idempotency_key_field; the audit asserts it at startup.
  • Natural keys come from the envelope (the SDA pipeline uses observation_id end-to-end). Derived keys are deterministic hashes of contributing inputs (sorted, content-addressable). Every operator preserves the natural key and computes derived keys reproducibly.
  • The bounded dedup set uses time AND count for production safety. Double-bound is the pattern: time alone fails under bursts, count alone fails under slow streams. Window size is operationally determined by maximum re-delivery window plus margin.
  • Kafka's idempotent producer (enable.idempotence=true) catches duplicates from producer retries at the broker level via PID + sequence numbers. Partition-scoped, sub-5% throughput cost. Default-on for production pipelines.
  • The at-least-once + idempotent = exactly-once-effective composition holds within the pipeline and at idempotent boundaries (SQL UPSERT, idempotent producers, HTTP Idempotency-Key headers). At non-idempotent boundaries (alerts that trigger external actions), the boundary's owner must implement dedup; the pipeline's responsibility is to deliver the keys correctly and document the contract.

Lesson 3 — Checkpointing

Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Position: Lesson 3 of 4 Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (Fault Tolerance — Microbatching and Checkpointing); Database Internals — Alex Petrov, Chapter 5 (Checkpointing in Recovery — the same conceptual machinery applied to streaming state)


Context

Lessons 1 and 2 produced a pipeline that delivers messages at-least-once and processes them with effective exactly-once semantics via idempotency. Both rely on durable state at the boundary — the SQL UPSERT, the Kafka idempotent producer's PID+sequence state, the consumer's committed offset. The state inside the pipeline — the windowed correlator's per-key sliding windows, the M3 retract-aware sink's in-flight retained windows, the M2 supervisor's restart history — lives in process memory and disappears on restart.

For most operators that does not matter. A stateless map operator like normalize restarts cleanly and resumes by reading from the consumer offset. The orchestrator's supervisor restarts a panicked operator with a fresh Task, which means a fresh, empty state; that is exactly right for stateless operators because their state is empty by definition. For stateful operators — the windowed correlator is the canonical case — restart means losing the current windows in flight. The pipeline restarts; the input replay from the consumer offset begins; the windows that had been accumulating before the crash are gone, and the replay rebuilds them from scratch.

The cost is real. A 30-second window's worth of in-flight events takes 30 seconds of replay to reconstruct. During that 30 seconds, the pipeline is producing alerts whose source data is incomplete — the correlator has not yet seen all the observations that should be in its window because they are in the past, before the consumer offset's restart point. Module 5's mission framing (the SDA-2026-0207 incident) had exactly this shape: the restart's replay window straddled an active conjunction event, the windows were rebuilt without the early observations of that event, and the resulting alerts were emitted with degraded confidence.

This lesson installs checkpointing — the durable-state mechanism that lets a restarted operator resume from a saved state plus the input offset that produced it, without replaying from scratch. Database engines use checkpointing in WAL recovery (Petrov Chapter 5); Spark and Flink use it for streaming-state recovery; the SDA pipeline uses it for the same purpose at the operator level. The pattern is the same in every system: pause the operator briefly, snapshot its state to durable storage, record the input offset that the snapshot reflects, resume. On restart, load the latest snapshot, set the input offset, resume processing. The window of data lost on restart shrinks from "since process start" to "since last checkpoint."


Core Concepts

State + Offset = Checkpoint

A checkpoint is the durable record of an operator's recoverable progress. It has two components.

State: the operator's in-memory state at the moment the checkpoint was taken. For the windowed correlator, this is the per-key sliding windows. For the retract-aware sink, this is the recently-emitted (window_id, sequence) records. For a stateful aggregator, this is the running aggregations. The state is whatever the operator needs to resume processing without losing data.

Offset: the position in the input stream that the checkpointed state reflects. For a Kafka consumer, this is the broker offset of the last fully-processed message. For a file-based source, this is the byte offset. The offset is the link between the state and the input — it answers "if I restart with this state, where do I start reading from?"

The two together are the recovery contract: load the state, set the offset, resume. Either one alone is insufficient. A state without an offset is a snapshot of indeterminate vintage; resuming from it produces incorrect output because the input position is unknown. An offset without a state is a position to resume reading from, but the operator's running aggregations are gone — the pipeline replays correctly only if the operator is stateless. The capstone enforces the State+Offset invariant structurally: every checkpoint write atomically includes both, and every checkpoint read returns both or fails.

Pause-Snapshot-Resume Protocol

The simplest checkpoint protocol: pause the operator's input for the duration of the checkpoint, write the state to durable storage along with the current offset, resume input. The pause is what guarantees the State+Offset pair is consistent — no new input arrives during the snapshot, so the state and the offset reflect the same point in the stream.

The credit-based-flow primitive from Module 4 L2 is the natural mechanism for the pause. The operator withholds credits from its upstream during the snapshot; the upstream's local credit counter drains; the upstream stops producing without occupying any in-flight slot. The snapshot completes; the operator returns credits; the upstream resumes. The duration of the pause is the snapshot's wall-clock cost, typically 50-200ms for SDA-scale operator state.
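
A stand-in sketch of the pause mechanism using a tokio semaphore (M4 L2's actual credit primitive is not reproduced here; CreditGate is a hypothetical name):

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Upstream takes one credit per in-flight item (dropping the permit
/// returns the credit once downstream acknowledges). The checkpoint
/// withholds every credit, so the upstream stalls without busy-waiting.
pub struct CreditGate {
    credits: Arc<Semaphore>,
    total: u32,
}

impl CreditGate {
    pub fn new(total: u32) -> Self {
        Self { credits: Arc::new(Semaphore::new(total as usize)), total }
    }

    /// Upstream side: acquire a credit before sending one item.
    pub async fn acquire_credit(&self) -> OwnedSemaphorePermit {
        self.credits.clone().acquire_owned().await.expect("gate never closed")
    }

    /// Checkpoint side: drain all credits for the snapshot's duration.
    /// Dropping the returned permit hands every credit back at once.
    pub async fn pause_all(&self) -> OwnedSemaphorePermit {
        self.credits
            .clone()
            .acquire_many_owned(self.total)
            .await
            .expect("gate never closed")
    }
}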

The cost of the pause is end-to-end latency. During the pause, no observations flow through the operator; downstream operators see a brief gap in their input. For SDA's 30-second alert SLO, a 200ms pause is 0.7% of the budget — acceptable. For tighter SLOs the pause must be tighter; production checkpoints write to local fast storage (NVMe, RAM disk) to keep the pause sub-millisecond. The discipline: choose checkpoint frequency and pause duration to fit the SLO budget.

For a single-operator checkpoint the pause-snapshot-resume protocol is sufficient. For a multi-operator pipeline, the operators' checkpoints must be aligned — the saved state across operators must reflect the same point in the input stream — or the recovery produces inconsistent results (operator A resumes from input position X, operator B resumes from input position X+10, the pipeline's overall state is incoherent).

Flink's solution is barrier alignment. The orchestrator injects a special checkpoint barrier marker into the source streams. The barrier flows through the pipeline as a normal event; when an operator receives a barrier on every input, it snapshots its current state and forwards the barrier to its downstream. The barrier reaches the next operator, which waits until it has the barrier on every input, snapshots, forwards. The structure produces a consistent cut across the pipeline: every operator's snapshot reflects the same set of input events.

The cost is operationally real. The slowest input's barrier delay determines the alignment time — operators wait at the barrier until every input has reported. A slow input stalls the entire checkpoint. For SDA's pipeline with three sources at very different rates (radar 5000/sec, ISL 100/sec, optical 50/sec) the alignment is dominated by the slow inputs. Flink offers an unaligned mode that trades consistency for speed; the SDA pipeline uses aligned checkpoints because the consistency property is what the recovery contract relies on.

Per-Operator (Unaligned) Checkpoints

The simpler alternative: each operator checkpoints on its own schedule, with no cross-operator alignment. Each operator's checkpoint reflects only its own state and the position in its input stream. Recovery is then per-operator: each operator independently loads its latest checkpoint and resumes from its checkpointed offset.

The simplification has a real cost: cross-operator state is no longer consistent. If operator A checkpointed at position X and operator B (downstream of A) checkpointed at position Y > X, then after a restart, A starts at X but B starts at Y. A re-emits messages in (X, Y]; B has already processed them and dedups via idempotency. This works correctly under the at-least-once + idempotent composition from Lessons 1 and 2 — duplicates are absorbed.

For SDA's pipeline, per-operator checkpointing is the right shape because the idempotency machinery is already in place. The pipeline does not need the strong consistency that aligned checkpoints provide; it needs each operator to recover its own state, and the global consistency is recovered via dedup at every boundary. The capstone uses per-operator checkpoints; the lesson covers aligned checkpoints for completeness because they show up in production streaming systems and the framing matters when those systems are introduced.

Checkpoint Storage

The state must go somewhere durable. Three tiers, each with its own cost profile.

Local disk (NVMe / SSD). Fastest write latency (~100µs for small writes). The natural choice for the checkpoint's hot-path destination. Limitation: a host failure loses the checkpoint along with the host. For pipeline restart-without-host-loss (the common case), local-disk checkpoints suffice.

Remote object storage (S3, GCS). Slower write latency (10-100ms). Survives host loss because the storage is redundant across availability zones. The natural choice for durable checkpoints that must survive worst-case failures.

Hybrid: local-then-async-replicated-to-remote. The standard production pattern. Write to local first (fast, on the hot path); replicate asynchronously to remote (durable, off the hot path). The pause-snapshot-resume cost is bounded by the local write; the remote-replication catches up in the background. On restart, prefer the local checkpoint if it exists (fast recovery); fall back to remote if it doesn't (host loss recovery).

The choice is operational. For SDA's pipeline, the hybrid approach with a 1-second local checkpoint cadence and 60-second remote replication is the production default — adjustable per operator based on the state size and the recovery time budget. The capstone exposes the cadence and replication parameters per operator; ops tuning happens in the orchestrator's startup configuration rather than in code.
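
A sketch of the async replication leg; upload_to_remote stands in for the object-storage client, and the signatures are illustrative:

use anyhow::Result;
use std::path::{Path, PathBuf};
use std::time::Duration;

/// Background task: copy the newest local checkpoint to remote storage
/// on a fixed cadence, off the checkpoint hot path.
pub fn spawn_replicator(
    local: PathBuf,
    remote_uri: String,
    cadence: Duration,
) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(cadence);
        loop {
            tick.tick().await;
            if let Err(e) = upload_to_remote(&local, &remote_uri).await {
                // Not fatal: the local checkpoint still covers normal
                // restarts. Log and retry on the next tick.
                tracing::warn!(error = %e, "remote checkpoint replication failed");
            }
        }
    })
}

async fn upload_to_remote(_local: &Path, _remote_uri: &str) -> Result<()> {
    // Elided: object-storage PUT of the checkpoint file.
    Ok(())
}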


Code Examples

A Periodic-Checkpoint Operator with Local Disk

The simplest implementation: every N seconds, pause the input via credit-withholding, serialize the state, write to disk, resume. On startup, look for the latest checkpoint and restore.

use anyhow::Result;
use serde::{Deserialize, Serialize};
use std::path::PathBuf;
use std::time::Duration;
use tokio::fs;

#[derive(Debug, Serialize, Deserialize)]
pub struct OperatorCheckpoint<S> {
    /// The operator's serialized state at checkpoint time.
    pub state: S,
    /// The input-stream offset that the state reflects.
    pub offset: u64,
    /// When the checkpoint was taken (for diagnostics).
    pub taken_at_unix_ms: u64,
}

/// A checkpointing wrapper around any stateful operator. The operator
/// implements a Checkpointable trait that exposes serialize/restore
/// hooks; the wrapper handles the pause-snapshot-resume protocol and
/// the disk I/O.
pub struct CheckpointingOperator<S> {
    state: S,
    last_offset: u64,
    interval: Duration,
    checkpoint_dir: PathBuf,
    operator_name: String,
}

impl<S> CheckpointingOperator<S>
where
    S: Serialize + for<'de> Deserialize<'de> + Default,
{
    /// Build a fresh operator OR restore from the latest checkpoint
    /// if one exists. The constructor does the recovery.
    pub async fn new_or_restore(
        operator_name: impl Into<String>,
        interval: Duration,
        checkpoint_dir: PathBuf,
    ) -> Result<Self> {
        let operator_name = operator_name.into();
        let path = checkpoint_dir.join(format!("{operator_name}.bin"));
        let (state, last_offset) = if path.exists() {
            let bytes = fs::read(&path).await?;
            let cp: OperatorCheckpoint<S> = bincode::deserialize(&bytes)?;
            tracing::info!(
                operator = %operator_name,
                offset = cp.offset,
                "restored from checkpoint"
            );
            (cp.state, cp.offset)
        } else {
            tracing::info!(operator = %operator_name, "no checkpoint found; starting fresh");
            (S::default(), 0)
        };
        Ok(Self {
            state,
            last_offset,
            interval,
            checkpoint_dir,
            operator_name,
        })
    }

    /// Take a checkpoint NOW. Caller is responsible for pausing input
    /// (e.g., via credit-withholding) before invocation; this method
    /// is just the snapshot and write.
    pub async fn checkpoint(&self) -> Result<()> {
        let path = self.checkpoint_dir.join(format!("{}.bin", self.operator_name));
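        // Note: `state: &self.state` below builds OperatorCheckpoint<&S>,
        // whose serialized bytes match the owned form that new_or_restore
        // reads back; the borrow avoids cloning the state for the write.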
        let cp = OperatorCheckpoint {
            state: &self.state,
            offset: self.last_offset,
            taken_at_unix_ms: std::time::SystemTime::now()
                .duration_since(std::time::UNIX_EPOCH)
                .map(|d| d.as_millis() as u64)
                .unwrap_or(0),
        };
        // Write to a temp file first, then atomically rename. The rename
        // is the durability barrier — the file appears at its final path
        // only when the write is complete, so a crash during write
        // leaves the previous (consistent) checkpoint intact.
        let tmp_path = path.with_extension("bin.tmp");
        let bytes = bincode::serialize(&cp)?;
        fs::write(&tmp_path, bytes).await?;
        fs::rename(&tmp_path, &path).await?;
        Ok(())
    }
}

The atomic-rename pattern is the durability discipline. Writing directly to the final path leaves the file in a partial state if the process crashes mid-write; the next startup reads a corrupt file and either fails to deserialize or (worse) deserializes an inconsistent State+Offset pair. The rename is atomic at the filesystem level (POSIX rename is atomic within a single directory), so the file is either the previous good checkpoint or the new good checkpoint, never a partial state. Production code calls sync_data (fsync) on the temp file before the rename, and ideally syncs the directory afterward, to ensure the write is durable past a power failure; we elide that for clarity.

A Barrier-Based Coordinator (Aligned Checkpoint)

For aligned checkpoints, the orchestrator injects a barrier marker into source streams. Each operator forwards the barrier after snapshotting; the coordinator waits for all operators' barrier acknowledgments to declare the checkpoint complete.

use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use anyhow::Result;
use tokio::sync::{mpsc, Notify};

/// A barrier marker that flows through the pipeline as a control
/// item. Operators forward it after taking their checkpoint.
#[derive(Debug, Clone, Copy)]
pub struct CheckpointBarrier(pub u64); // checkpoint id

/// The orchestrator's coordinator. Tracks barrier acknowledgments from
/// every operator; the checkpoint is complete when every operator has
/// acked the barrier with the same checkpoint id.
pub struct CheckpointCoordinator {
    expected_acks: usize,
    received_acks: Arc<Mutex<HashMap<u64, HashSet<String>>>>,
    completion: Arc<Notify>,
}

impl CheckpointCoordinator {
    /// Inject a barrier into the source streams and wait for all
    /// operators to ack.
    pub async fn run_checkpoint(
        &self,
        cp_id: u64,
        sources: &[mpsc::Sender<SourceItem>],
    ) -> Result<()> {
        // Reset ack state for this checkpoint.
        self.received_acks.lock().unwrap().entry(cp_id).or_default();
        // Inject the barrier into every source stream.
        for src in sources {
            src.send(SourceItem::Barrier(cp_id)).await?;
        }
        // Wait for all operators to ack. ack() uses notify_one, which
        // stores a permit if this task is not yet waiting, so a wakeup
        // that races the check below is not lost.
        loop {
            self.completion.notified().await;
            let acks = self.received_acks.lock().unwrap();
            if let Some(set) = acks.get(&cp_id) {
                if set.len() >= self.expected_acks {
                    return Ok(());
                }
            }
        }
    }

    /// Operator-side ack: called when an operator has snapshotted in
    /// response to a barrier and is forwarding it downstream.
    pub fn ack(&self, cp_id: u64, operator_name: &str) {
        let mut acks = self.received_acks.lock().unwrap();
        acks.entry(cp_id).or_default().insert(operator_name.to_string());
        drop(acks);
        // notify_one stores a permit when the coordinator is mid-check;
        // notify_waiters would drop that wakeup and risk a hang.
        self.completion.notify_one();
    }
}

The barrier mechanism is more complex than the per-operator pattern but produces the consistent-cut property the framework guarantees. The cost is operational: the coordinator is a centralized component that the supervisor must keep alive, the barrier protocol must be implemented in every operator, and the slowest input stalls the entire checkpoint. Production systems that use aligned checkpointing (Flink's defaults) accept this complexity in exchange for cross-operator consistency. The lesson covers it for completeness; SDA's pipeline uses the simpler per-operator approach.

Recovery from a Checkpoint at Startup

The startup path that calls new_or_restore for every checkpointing operator. The orchestrator's bootstrap logic looks for local checkpoints first, falls back to remote, defaults to fresh state if neither exists.

pub async fn bootstrap_pipeline(
    config: &PipelineConfig,
) -> Result<RunningPipeline> {
    let local_dir = &config.checkpoint_dir;
    let remote_dir = &config.remote_checkpoint_uri;
    let mut operators: Vec<CheckpointingOperator<_>> = Vec::new();

    for op_spec in &config.operators {
        // Try local first. If nothing exists locally, try remote.
        // If nothing exists in either, start fresh.
        let local_path = local_dir.join(format!("{}.bin", op_spec.name));
        let restored = if local_path.exists() {
            CheckpointingOperator::new_or_restore(
                &op_spec.name,
                op_spec.checkpoint_interval,
                local_dir.clone(),
            ).await?
        } else if remote_exists(remote_dir, &op_spec.name).await? {
            tracing::info!(operator = %op_spec.name, "local missing; pulling from remote");
            pull_remote_to_local(remote_dir, local_dir, &op_spec.name).await?;
            CheckpointingOperator::new_or_restore(
                &op_spec.name,
                op_spec.checkpoint_interval,
                local_dir.clone(),
            ).await?
        } else {
            tracing::warn!(operator = %op_spec.name, "no checkpoint anywhere; starting fresh");
            CheckpointingOperator::new_or_restore(
                &op_spec.name,
                op_spec.checkpoint_interval,
                local_dir.clone(),
            ).await?
        };
        operators.push(restored);
    }

    Ok(RunningPipeline { operators, /* ... */ })
}

The local-then-remote-then-fresh hierarchy is what gives the pipeline different recovery profiles for different failure modes. A normal restart (process exit, redeploy) recovers from local and is fast (sub-second). A host-failure recovery (the host went down) recovers from remote and is slower (tens of seconds for the pull, plus the deserialize time). A first-time start has no checkpoint and starts fresh. Each path is logged at the structured-log level the orchestrator's monitoring expects; the recovery route taken is itself a metric (pipeline_recovery_path{path="local|remote|fresh"} counter) that surfaces interesting events for ops to diagnose.


Key Takeaways

  • A checkpoint is State + Offset: the operator's serialized state plus the input-stream offset that the state reflects. Either alone is insufficient; both together are the recovery contract. The atomic-rename pattern ensures the file on disk is either the previous good checkpoint or the new good checkpoint, never a partial state.
  • The pause-snapshot-resume protocol is the simplest implementation: withhold credits from upstream, serialize, write to disk, return credits. The pause duration is end-to-end latency cost during the snapshot — typically 50-200ms for SDA-scale state, tunable via storage choice.
  • Aligned checkpoints (Flink-style barriers) produce consistent cuts across operators at the cost of slowest-input stall time. Per-operator checkpoints trade the consistency property for simplicity and rely on at-least-once + idempotency to recover global consistency via dedup. SDA uses per-operator.
  • Storage tiers are local disk (fast, host-bound), remote object storage (slower, durable across host loss), and the hybrid (local-first, async-replicated-to-remote) which is the standard production pattern. The hot-path cost is bounded by local; the durability comes from remote.
  • Recovery hierarchy is local → remote → fresh. Normal restart recovers from local in sub-second; host failure pulls from remote in tens of seconds; cold start has no checkpoint and starts at offset 0. Each path is logged as a structured metric so ops can see the recovery profile of every restart.

Lesson 4 — Dead Letter Queues

Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Position: Lesson 4 of 4 Source: Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 7 (Error Handling and Dead-Letter Queues); Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 9 (Failure Handling and Reprocessing)


Context

The retry policy from Module 2 L3 covers transient errors — the network blip, the broker leader election, the partner API hiccup. The idempotency machinery from M5 L2 absorbs the duplicates that retries inevitably produce. The checkpoint machinery from L3 lets the pipeline recover its state across restart. The combination handles every recoverable failure mode the SDA pipeline encounters.

It does not handle permanent failures. Some events cannot be processed regardless of how many times they are retried. A frame from the radar source whose binary payload cannot be deserialized — the wire format does not match the protocol the operator was built against, possibly because the radar's firmware was upgraded without coordination. An observation referencing a satellite catalog ID that has been decommissioned for orbit burn — the record is internally consistent but references state that no longer exists. An ISL beacon's state vector with an impossible-physics value (radius below Earth's surface, velocity above c) that violates the operator's input invariants. Retrying these does not help; every retry produces the same error, and the operator's retry-budget eventually exhausts. Dropping them silently violates the audit requirement that every observation is accounted for.

The third path is the dead-letter queue — a separate sink, distinct from the main pipeline, that receives events the operator cannot process. Each entry carries the original event plus enough metadata for an engineer to investigate: the error kind, the operator name, the timestamp, the retry attempts that were tried. The DLQ is not a discard bucket; it is a debug tool and a re-processing source. Engineers inspect it during incident response; engineers re-inject from it when the underlying issue is fixed. The pattern is universal in production streaming systems; this lesson develops it for SDA's pipeline.


Core Concepts

The Retry-Disposition Decision Tree

Every error from an operator's hot path classifies into one of three buckets.

Transient (Retry). Network errors, 5xx responses, timeouts, broker leader-elections. Will resolve on retry given enough time. The retry wrapper from M2 L3 handles these with decorrelated-jitter backoff. The operator's retry budget bounds the time spent retrying.

Permanent (DLQ). Deserialization errors, schema-mismatch errors, validation failures, references to non-existent state. Will NOT resolve on retry — every attempt produces the same error. Routes to the DLQ for human-investigable handling. Does not consume retry budget.

Discardable (drop). Invariant violations that should be dropped silently without operational attention. The radar frame whose wire format declares a length larger than the maximum permitted — a clear bug at the source, not worth investigating, not worth re-processing. Drops with a metric increment.

The classification is the operator's responsibility. M2 L3 introduced RetryDisposition::{Retry, Permanent, Discard}; this lesson extends Permanent to mean "DLQ" — the operator hands the event to the DLQ sink rather than just propagating the error. The lesson's discipline: every operator's error path classifies explicitly, every classification is documented in code, and the default for unknown errors is Permanent (DLQ) so they surface for operational attention rather than getting silently retried or dropped. The disposition type is sketched below.
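
For reference, a sketch of the disposition type as this lesson's routing code (below) consumes it; M2 L3's exact shape may differ, and the Ok arm is how the wrapper receives successful results:

pub enum RetryDisposition<T> {
    /// Processing succeeded; hand the value to the hot loop.
    Ok(T),
    /// Transient: retry with decorrelated-jitter backoff.
    Retry(anyhow::Error),
    /// Permanent: route the event to the DLQ and move on.
    Permanent(anyhow::Error),
    /// Discardable: drop silently, incrementing a metric.
    Discard,
}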

DLQ Metadata

The DLQ entry is the event plus context. The context is what makes the DLQ a debug tool rather than a dump.

pub struct DlqEntry {
    /// Wall-clock time when the error occurred.
    pub timestamp: SystemTime,
    /// The operator that produced the error.
    pub operator: String,
    /// The kind of error (deserialization, validation, processing exception).
    pub error_kind: DlqErrorKind,
    /// Free-form error message for human investigators.
    pub error_message: String,
    /// Number of retry attempts before giving up (typically 0 for
    /// permanent errors, > 0 for transient errors that exceeded the
    /// retry budget).
    pub retry_count: u32,
    /// The original event. Stored as the raw bytes (or the deserialized
    /// envelope when available) so re-processing can replay it.
    pub original_payload: Vec<u8>,
    /// Schema version of the metadata format itself. Important for
    /// re-processing tools that span DLQ entries from different
    /// pipeline versions.
    pub schema_version: u32,
}

pub enum DlqErrorKind {
    Deserialization,
    SchemaMismatch,
    ValidationFailed,
    ProcessingException,
    RetryBudgetExhausted,
}

The schema_version is the often-overlooked field. The DLQ accumulates entries over a long time horizon; the metadata format will evolve as the pipeline evolves. A re-processing tool reading entries from six months ago needs to know what fields the entry carried when it was written. The version field is the migration mechanism — a future tool reads schema_version=1 entries with the v1 format and schema_version=2 entries with the v2 format. Without the version, future migrations require either guessing or losing entries.

Poison Pills

A poison pill is an event that causes errors every retry. The poison-pill scenario is what distinguishes DLQ-bearing pipelines from drop-only ones. Without a DLQ, a poison pill blocks the pipeline: the operator retries, fails, retries, fails, exhausts its retry budget, the supervisor restarts the operator, the operator reads the same poison pill from the consumer offset, fails again. The pipeline makes no progress past the poison pill.

With a DLQ, the poison pill is quarantined. The operator classifies the error as Permanent on the first attempt, hands the event to the DLQ, and continues with the next event. The pipeline stays healthy; the poison pill sits in the DLQ for investigation. The operational discipline: a metric for dlq_entries_total{error_kind} and an alert when the rate exceeds a threshold. A spike in DLQ entries is a signal — a partner's API change, a schema migration that did not roll out everywhere, a bug in the operator's deserializer.

The threshold for the alert is tuned per error kind. A handful of Deserialization errors per day during steady-state is normal (occasional malformed wire packets). A spike to hundreds per minute signals a partner change or a rollback. RetryBudgetExhausted errors should be rare; a spike means a downstream is degraded longer than the retry budget covers, and operations should investigate.

The DLQ as Re-processing Source

The DLQ is also a stream. Once an underlying issue is fixed (a partner's API change is reverted, a schema migration completes, a bug fix deploys), the DLQ's entries can be re-processed through the pipeline. A re-processing tool reads from the DLQ, reconstructs the original events, and pushes them back into the pipeline's input topic. The operators process them as if they were freshly arrived, the dedup logic from L2 absorbs the duplicates from any prior partial processing, and the events end up in the right downstream state.

The re-processing tool is its own piece of code, separate from the live pipeline. It has access to the DLQ's stored entries, knows the schema_version migration story, can filter by error_kind or operator or time range. The tool is an operational lever: when an issue is fixed, ops runs the tool against the affected DLQ window to recover the affected events. Without the tool, the events are stuck in the DLQ forever; with the tool, the DLQ's role is "temporarily holding events while we figure out what to do," which is exactly the right framing.

The tool's correctness depends on the pipeline's idempotency. Re-injecting from the DLQ produces the same events the pipeline processed before; without idempotency, re-processing produces duplicate effects. The L2 machinery (sink-side dedup, idempotent UPSERT, Idempotency-Key headers) absorbs the re-injection; the re-processing is safe by construction. This is the L2-L4 composition the capstone exercises end-to-end.

The Discard-Bucket Anti-Pattern

A DLQ that nobody reads is worse than no DLQ. Entries pile up; ops loses track of what they mean; a real incident produces a small spike that gets lost in the noise of historical entries. This is the anti-pattern the lesson identifies and the discipline to prevent it.

Three operational practices distinguish a useful DLQ from a discard bucket.

Alert on growth rate. A DLQ growing at >10x its baseline rate for >5 minutes fires an alert. The alert text names the dominant error_kind in the recent window so on-call has an immediate hypothesis to investigate.

Periodic review. The DLQ has a weekly review cadence. Ops walks through new entries since last review, classifies the dominant patterns, decides on remediation per error_kind. Some patterns become permanent fixes (the operator's classifier is updated to handle the case). Some become re-processing tasks (the underlying issue was fixed; re-inject the affected window). Some become explicit non-events (the entry represents a known-and-accepted failure mode).

Bounded retention. The DLQ does not grow forever. Entries older than the retention window (typically 30 days for SDA) are evicted. The retention is a forcing function for operational discipline — the team cannot ignore the DLQ indefinitely; entries must be addressed before they age out. The retention is documented and the eviction is logged so the team knows what they're losing.

The DLQ's value is proportional to the discipline applied to it. A well-managed DLQ catches partner API changes within minutes; a discard bucket catches nothing in particular.


Code Examples

A DLQ Sink with Structured Metadata

The DLQ sink writes JSON-Lines to disk in the SDA pipeline's local filesystem; production deployments write to a Kafka topic with longer retention so re-processing tools can read from there. The local-disk version is sufficient for the lesson's purposes and matches the M3 L4 retract-aware sink's storage choice.

use anyhow::Result;
use serde::{Deserialize, Serialize};
use std::path::PathBuf;
use std::time::SystemTime;
use tokio::fs::OpenOptions;
use tokio::io::AsyncWriteExt;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum DlqErrorKind {
    Deserialization,
    SchemaMismatch,
    ValidationFailed,
    ProcessingException,
    RetryBudgetExhausted,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DlqEntry {
    pub schema_version: u32,
    pub timestamp_unix_ms: u64,
    pub operator: String,
    pub error_kind: DlqErrorKind,
    pub error_message: String,
    pub retry_count: u32,
    pub original_payload: Vec<u8>,
}

pub struct DlqSink {
    file_path: PathBuf,
    operator_name: String,
}

impl DlqSink {
    pub fn new(file_path: PathBuf, operator_name: impl Into<String>) -> Self {
        Self { file_path, operator_name: operator_name.into() }
    }

    /// Write a DLQ entry. Each entry is one JSON-Lines record.
    /// Append-only; never overwrites existing entries.
    pub async fn write(
        &self,
        kind: DlqErrorKind,
        error_message: impl Into<String>,
        retry_count: u32,
        original_payload: Vec<u8>,
    ) -> Result<()> {
        let entry = DlqEntry {
            schema_version: 1,
            timestamp_unix_ms: SystemTime::now()
                .duration_since(SystemTime::UNIX_EPOCH)
                .map(|d| d.as_millis() as u64)
                .unwrap_or(0),
            operator: self.operator_name.clone(),
            error_kind: kind,
            error_message: error_message.into(),
            retry_count,
            original_payload,
        };
        let mut line = serde_json::to_vec(&entry)?;
        line.push(b'\n');

        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(&self.file_path)
            .await?;
        file.write_all(&line).await?;
        // For SDA's reliability: sync after each write. Production with
        // higher DLQ rates batches multiple writes per fsync.
        file.sync_data().await?;
        Ok(())
    }
}

The append-only + per-write fsync is the durability discipline. A DLQ entry that is buffered in the kernel's page cache and not yet on disk is at risk if the process crashes. For the SDA pipeline's DLQ rates (single-digit entries per minute under steady state) the per-write sync cost is negligible. For higher-rate DLQs the cost matters and batching is appropriate; the standard pattern is "sync at most every N entries or every M milliseconds, whichever comes first" (sketched below). JSON-Lines as the format makes the file human-inspectable (tail -f /path/to/dlq.jsonl | jq) and machine-readable in one pass — useful for both ad-hoc investigation and the re-processing tool.
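
A sketch of that batching policy under hypothetical names; the thresholds are the N-entries / M-milliseconds bounds from the text:

use anyhow::Result;
use std::time::{Duration, Instant};
use tokio::fs::File;

pub struct BatchedSyncPolicy {
    max_entries: u32,
    max_age: Duration,
    pending: u32,
    last_sync: Instant,
}

impl BatchedSyncPolicy {
    pub fn new(max_entries: u32, max_age: Duration) -> Self {
        Self { max_entries, max_age, pending: 0, last_sync: Instant::now() }
    }

    /// Call after each appended entry; syncs when either bound trips.
    pub async fn after_write(&mut self, file: &mut File) -> Result<()> {
        self.pending += 1;
        if self.pending >= self.max_entries || self.last_sync.elapsed() >= self.max_age {
            file.sync_data().await?;
            self.pending = 0;
            self.last_sync = Instant::now();
        }
        Ok(())
    }
}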

An Operator That Routes to DLQ Based on RetryDisposition

The operator's error path. M2 L3's retry wrapper is extended: RetryDisposition::Permanent(e) triggers a DLQ write before the error propagates; Discard drops with a counter; Retry goes through the wrapper's backoff machinery as before.

use anyhow::Result;

pub async fn run_operator_with_dlq<F, Fut, T>(
    mut op: F,
    dlq: &DlqSink,
    policy: RetryPolicy,
    payload: Vec<u8>,  // raw payload bytes for DLQ
) -> Result<Option<T>>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = RetryDisposition<T>>,
{
    use rand::Rng;
    use std::time::Duration;
    use tokio::time::sleep;

    let mut attempt = 0u32;
    let mut prev_delay = policy.initial;
    loop {
        attempt += 1;
        match op().await {
            RetryDisposition::Ok(v) => return Ok(Some(v)),
            RetryDisposition::Discard => {
                metrics::counter!("operator_discards_total").increment(1);
                return Ok(None);
            }
            RetryDisposition::Permanent(e) => {
                // Permanent — route to DLQ.
                dlq.write(
                    DlqErrorKind::ValidationFailed,
                    e.to_string(),
                    attempt - 1,
                    payload,
                ).await?;
                return Ok(None);  // operator continues to next event
            }
            RetryDisposition::Retry(e) if attempt >= policy.max_attempts => {
                // Retry budget exhausted — also DLQ.
                dlq.write(
                    DlqErrorKind::RetryBudgetExhausted,
                    e.to_string(),
                    attempt,
                    payload,
                ).await?;
                return Ok(None);
            }
            RetryDisposition::Retry(_) => {
                // Decorrelated jitter from M2 L3: sleep a random duration
                // in [initial, prev_delay * 3], capped at policy.cap.
                let base = policy.initial.as_millis() as u64;
                let upper = (prev_delay.as_millis() as u64)
                    .saturating_mul(3)
                    .max(base);
                let jittered = rand::thread_rng().gen_range(base..=upper);
                let delay = Duration::from_millis(jittered).min(policy.cap);
                prev_delay = delay;
                sleep(delay).await;
            }
        }
    }
}

Two things to notice. The Permanent arm and the RetryBudgetExhausted arm both DLQ but with different error_kind labels — the DLQ entry distinguishes "this is a permanent error type" from "this was transient but we couldn't make it work." Operations dashboards split on the label to prioritize different remediation patterns. The operator's Result<Option<T>> return makes "go to next event" explicit at the type level: Ok(None) means "this event was discarded or DLQ'd, move on"; Ok(Some(v)) means "process this value"; Err(e) means "operator-level error, propagate to supervisor." The structure makes the operator's hot loop easy to read: match op_result { Some(v) => process(v), None => continue }.

A Re-Processing Tool

The CLI tool that reads from the DLQ and re-injects events into the pipeline's input topic. It is a separate binary from the pipeline itself, designed for operational use after an underlying issue has been fixed.

use anyhow::Result;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::path::PathBuf;
use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, BufReader};

pub async fn reprocess(
    dlq_path: PathBuf,
    pipeline_input_topic: &str,
    producer: &FutureProducer,
    filter: ReprocessFilter,
) -> Result<ReprocessReport> {
    let mut report = ReprocessReport::default();
    let file = File::open(&dlq_path).await?;
    let reader = BufReader::new(file);
    let mut lines = reader.lines();
    while let Some(line) = lines.next_line().await? {
        let entry: DlqEntry = serde_json::from_str(&line)?;
        if !filter.matches(&entry) {
            report.skipped += 1;
            continue;
        }
        // Push the original payload back into the pipeline's input.
        // The pipeline's idempotency machinery from L2 absorbs any
        // duplicates from prior partial processing.
        // No key is set on the record, so the key type parameter needs
        // a turbofish annotation to satisfy inference.
        producer.send(
            FutureRecord::<(), _>::to(pipeline_input_topic)
                .payload(&entry.original_payload),
            std::time::Duration::from_secs(30),
        ).await
            .map_err(|(e, _)| anyhow::anyhow!("send failed: {e}"))?;
        report.reprocessed += 1;
    }
    Ok(report)
}

#[derive(Debug)]
pub struct ReprocessFilter {
    pub error_kinds: Vec<DlqErrorKind>,
    pub operators: Vec<String>,
    pub since_unix_ms: Option<u64>,
    pub until_unix_ms: Option<u64>,
}

impl ReprocessFilter {
    pub fn matches(&self, entry: &DlqEntry) -> bool {
        // Filter on error_kind, operator, time range. Empty filters
        // match everything. Comparing discriminants avoids requiring
        // PartialEq on DlqErrorKind; every variant is a unit variant,
        // so the comparison is exact.
        let kind_ok = self.error_kinds.is_empty() ||
            self.error_kinds.iter().any(|k| {
                std::mem::discriminant(k) == std::mem::discriminant(&entry.error_kind)
            });
        let op_ok = self.operators.is_empty() || self.operators.contains(&entry.operator);
        let since_ok = self.since_unix_ms.map_or(true, |s| entry.timestamp_unix_ms >= s);
        let until_ok = self.until_unix_ms.map_or(true, |u| entry.timestamp_unix_ms <= u);
        kind_ok && op_ok && since_ok && until_ok
    }
}

#[derive(Debug, Default)]
pub struct ReprocessReport {
    pub reprocessed: usize,
    pub skipped: usize,
}

The filter API matters operationally. A typical re-processing run targets "all Deserialization errors from operator radar-ingest between 14:00 and 15:30 yesterday" — the time when the partner's wire format change rolled out, and the time it was reverted. The filter narrows the re-processing to the affected window, avoiding re-injecting unrelated DLQ entries that might now succeed and produce unwanted side effects. The L2 idempotency is the safety net that makes the re-injection correct under any duplicate; the filter is the operational discipline that limits re-injection to what is intended.
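
That run, expressed against the filter API; the timestamps and the topic name are placeholders.

// Hypothetical incident window; replace the millisecond bounds with the
// real window before running.
let filter = ReprocessFilter {
    error_kinds: vec![DlqErrorKind::Deserialization],
    operators: vec!["radar-ingest".to_string()],
    since_unix_ms: Some(1_770_000_000_000),
    until_unix_ms: Some(1_770_005_400_000), // 90 minutes later
};
let report = reprocess(dlq_path, "sda-input", &producer, filter).await?;
println!("reprocessed={} skipped={}", report.reprocessed, report.skipped);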


Key Takeaways

  • Three error categories: transient (retry with backoff), permanent (DLQ), discardable (drop with metric). The classification is the operator's responsibility; the default for unknown errors is permanent (DLQ) so they surface for operational attention.
  • DLQ entries carry metadata: timestamp, operator, error_kind, error_message, retry_count, original_payload, schema_version. The schema_version field is the migration mechanism for future re-processing tools that span DLQ entries from different pipeline versions.
  • Poison pills are the case the DLQ exists for. Without a DLQ, a poison pill blocks the pipeline; the operator retries forever and makes no progress. With a DLQ, the pill is quarantined and the pipeline stays healthy.
  • The DLQ is also a re-processing source. After an underlying issue is fixed, a re-processing tool reads from the DLQ and re-injects events into the pipeline. The L2 idempotency machinery absorbs any duplicates from prior partial processing.
  • The discard-bucket anti-pattern is what makes a DLQ useless. The operational disciplines that prevent it: alert on growth rate, weekly review cadence, bounded retention. A well-managed DLQ catches partner API changes within minutes; a discard bucket catches nothing.

Capstone Project — Exactly-Once Conjunction Alert Pipeline

Module: Data Pipelines — M05: Delivery Guarantees and Fault Tolerance Estimated effort: 1–2 weeks of focused work Prerequisites: All four lessons in this module passed at ≥70%


Mission Brief

OPS DIRECTIVE — SDA-2026-0207 / Phase 5 Implementation Classification: RESTART-SAFETY HARDENING

Two months ago, a maintenance window required restarting the SDA pipeline to apply a security patch. The orchestrator's graceful-drain logic worked correctly — every operator drained its incoming channel before exiting — but the alert subscriber had already received fourteen alerts that the new pipeline did not know about, and the new pipeline emitted six alerts that the subscriber had already acted on. Two false-positive collision-avoidance maneuvers were executed as a consequence. The postmortem identified two missing pieces: durable state on the producer side (so restart resumes from where the previous instance left off), and idempotent processing on the consumer side (so duplicate deliveries do not produce duplicate effects).

Phase 5 installs both. The windowed correlator from M3 becomes crash-safe via periodic checkpoints. The alert-emit path becomes idempotent end-to-end via the M5 L2 dedup machinery extended to the subscriber boundary. Permanent errors route to a DLQ with metadata sufficient for re-processing after underlying issues are fixed.

Success criteria for Phase 5: a deliberate kill -9 of the pipeline at three different points (mid-process, mid-checkpoint, mid-emit) followed by restart produces an alert log with every alert exactly once. The 30-second alert SLO is held throughout the test. The DLQ captures permanent errors with full metadata; the re-processing tool re-injects DLQ entries without producing duplicate alerts.


What You're Building

Make the M4 hardened pipeline crash-safe and exactly-once-effective.

  1. The windowed correlator from M3 becomes a CheckpointingOperator (L3 pattern): periodic checkpoints write its sliding-window state plus the consumer offset to local NVMe, async-replicated to S3
  2. The alert sink uses the L2 DedupSet keyed on alert_id with a 5-minute window and 100K capacity bound
  3. The Kafka clients are configured for at-least-once (L1: producer acks=all with enable.idempotence=true; consumer enable.auto.commit=false with process-then-commit; see the config sketch below)
  4. The alert subscriber boundary stores recently-seen alert_ids in a small embedded SQLite (durable across the subscriber's own restarts)
  5. Operators classify errors per L4: transient → retry, permanent → DLQ, invariant-violation → discard
  6. A DLQ sink writes JSON-Lines to local disk with the L4 metadata schema
  7. A re-processing CLI tool (sda-reprocess) reads from the DLQ and re-injects events into the pipeline's input topic

The orchestrator from M2, the windowed correlator from M3, and the priority-aware shedding from M4 are all unchanged in structure. The new operators wrap or extend the existing pieces; the operator graph declaration grows by a few nodes (DLQ sink, alert subscriber boundary state).
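
A minimal config sketch for the client settings in item 3, using rdkafka's ClientConfig; the broker address and group id are placeholders, not the production values.

use anyhow::Result;
use rdkafka::config::ClientConfig;
use rdkafka::consumer::StreamConsumer;
use rdkafka::producer::FutureProducer;

pub fn build_kafka_clients() -> Result<(StreamConsumer, FutureProducer)> {
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "kafka:9092") // placeholder broker
        .set("group.id", "sda-fusion")          // placeholder group id
        .set("enable.auto.commit", "false")     // commit only after processing
        .create()?;
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "kafka:9092")
        .set("acks", "all")
        .set("enable.idempotence", "true")
        .set("max.in.flight.requests.per.connection", "5")
        .create()?;
    Ok((consumer, producer))
}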


Suggested Architecture

                  ┌─────────────────────┐
                  │  Kafka input topic  │
                  │  (consumer offset   │
                  │   committed         │
                  │   process-then-     │
                  │   commit)           │
                  └──────────┬──────────┘
                             │
                             ▼
              ┌──────────────────────────────┐
              │ Source operators (radar,     │
              │ optical, ISL) wrapped with   │
              │ retry + DLQ classifier       │
              └──────────────┬───────────────┘
                             │
                             ▼
              ┌──────────────────────────────┐
              │ Normalize fan-in (M3 L1)     │
              └──────────────┬───────────────┘
                             │ ──── credit channel from L2
                             ▼
              ┌──────────────────────────────┐
              │ Windowed Correlator          │
              │ (M3) wrapped as              │
              │ CheckpointingOperator        │
              │ - state: sliding windows     │
              │ - offset: consumer commit    │
              │ - cadence: 30s               │
              │ - storage: local + S3        │
              └──────────────┬───────────────┘
                             │
                             ▼
              ┌──────────────────────────────┐
              │ Alert Sink (M2 L3 dedup +    │
              │ M3 L4 retract-aware)         │
              │ + 5-min DedupSet 100K bound  │
              └──────────────┬───────────────┘
                             │ alerts
                             ▼
              ┌──────────────────────────────┐
              │ Alert Subscriber Boundary    │
              │ (embedded SQLite seen-set)   │
              └──────────────────────────────┘

   Side paths:                                  Out-of-band:
   ┌──────────┐    ┌───────────┐                ┌──────────────┐
   │   DLQ    │◄───│ Operator  │                │ sda-reprocess│
   │  Sink    │    │ classifier│                │  CLI tool    │
   └──────────┘    └───────────┘                └──────┬───────┘
                                                       │
                                                       ▼
                                              re-inject into Kafka
                                              input topic

Acceptance Criteria

Functional Requirements

  • Kafka clients configured per L1: consumer enable.auto.commit=false; producer acks=all, enable.idempotence=true, max.in.flight.requests.per.connection=5 (the idempotent producer makes the higher in-flight count safe)
  • Process-then-commit ordering with explicit commit_message(..., CommitMode::Sync) after each batch
  • Source-side internal log: every observation is durably recorded in a per-source append-only file with its consumer offset before being emitted downstream
  • Sink-side DedupSet (5-minute time bound, 100K capacity bound) on alert_id; duplicates absorbed silently
  • CheckpointingOperator wrapping the windowed correlator: 30-second cadence, atomic temp-file + rename writes, local NVMe primary + S3 async replicate
  • On restart: load the latest local checkpoint if present; fall back to S3 if local is missing; fall back to fresh state if neither exists (a recovery sketch follows this list)
  • DLQ sink with the L4 schema (schema_version=1); per-error-kind classification by every operator
  • sda-reprocess CLI tool with filter args (--error-kind, --operator, --since, --until) and a dry-run mode that prints what would be re-injected without sending
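
A sketch of the restart hierarchy from the criteria above; fetch_from_s3 is a hypothetical helper, and the fresh-state fallback assumes CorrelatorState implements Default. The recovery_path_total counter matches the metric named under Operational Requirements.

use std::path::Path;

async fn recover_state(local: &Path, s3_key: &str) -> anyhow::Result<CorrelatorState> {
    if let Ok(bytes) = tokio::fs::read(local).await {
        metrics::counter!("recovery_path_total", "path" => "local").increment(1);
        return Ok(bincode::deserialize(&bytes)?);
    }
    if let Ok(bytes) = fetch_from_s3(s3_key).await { // hypothetical helper
        metrics::counter!("recovery_path_total", "path" => "remote").increment(1);
        return Ok(bincode::deserialize(&bytes)?);
    }
    metrics::counter!("recovery_path_total", "path" => "fresh").increment(1);
    Ok(CorrelatorState::default()) // assumes Default
}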

Quality Requirements

  • Three crash tests in the integration test suite, each with kill -9 at a different point: (a) mid-process (between consumer recv and sink write), (b) mid-checkpoint (during the checkpoint flush), (c) mid-emit (after sink write but before commit). Each test asserts the post-restart alert log contains every alert exactly once.
  • Checkpoint pause duration measured per snapshot; the histogram's P99 is below 200ms (the SLO budget). Performance test asserts this on representative load.
  • DLQ schema versioned at write; the re-processing tool dispatches on schema_version and supports the historical versions documented in the codebase
  • No .unwrap() or .expect() in non-startup code paths; all errors propagate to the operator's classifier

Operational Requirements

  • /metrics extends M4's with: checkpoint_age_seconds (gauge per operator), checkpoint_size_bytes (gauge), checkpoint_pause_duration_ms (histogram), dlq_entries_total{operator, error_kind} (counter), recovery_path_total{path} (counter for local/remote/fresh on each startup)
  • Alert when checkpoint_age_seconds > 2 × cadence (a stalled checkpoint indicates a problem)
  • Alert when dlq_entries_total{error_kind="Deserialization"} rate > 10× baseline for >5 minutes (partner schema change canary)
  • DLQ runbook: per-error-kind playbook documenting investigation steps and remediation patterns (entry, hypothesis, validation, fix, re-processing decision)

Self-Assessed Stretch Goals

  • (self-assessed) Recovery time from a 100MB checkpoint is under 5 seconds end-to-end (load + deserialize + resume + first event emitted)
  • (self-assessed) The re-processing tool handles 10K events without producing any new alerts (per the L4 lesson's exemplary "zero new effects" outcome). Demonstrated via a synthetic DLQ window from the integration tests.
  • (self-assessed) The pipeline's restart-recovery test runs as a chaos-engineering integration test in CI, killing the pipeline at random points across 100 iterations; asserts every iteration ends with a consistent alert log

Hints

How do I serialize the windowed correlator's per-key sliding windows efficiently?

The natural representation is a BTreeMap<KeyType, VecDeque<Observation>>. bincode::serialize on this produces a compact binary format suitable for the checkpoint write. Pre-allocating the serialization buffer with a reasonable size hint avoids re-allocation during the encode. For the 30-second window at SDA's load (~10K observations/sec), the serialized state is in the tens of megabytes — well within the 200ms pause budget when written to NVMe.

use serde::{Deserialize, Serialize};
use std::collections::{BTreeMap, VecDeque};
use std::path::Path;
use tokio::fs;

#[derive(Serialize, Deserialize)]
struct CorrelatorState {
    windows: BTreeMap<ObjectIdPair, VecDeque<Observation>>,
    last_offset: u64,
}

async fn write_checkpoint(state: &CorrelatorState, tmp: &Path, fin: &Path) -> anyhow::Result<()> {
    let bytes = bincode::serialize(state)?;  // typically <50MB
    fs::write(tmp, &bytes).await?;
    fs::rename(tmp, fin).await?;  // atomic publish
    Ok(())
}

The serialization can be parallelized for very large states: split the BTreeMap into chunks, serialize each chunk in parallel via rayon, and write the chunks length-prefixed so recovery can reassemble the map. For SDA's scale this is unnecessary; bincode serializes at gigabytes per second, so a state in the tens of megabytes encodes in low single-digit milliseconds.

How do I inject crash points deterministically in tests?

The L1 test harness pattern with the CrashingSink is the foundation. Extend it: the operator's hot loop has a #[cfg(test)]-gated crash_after: Option<u32> field that panics after N successful events. The integration test sets crash_after = Some(N), runs the pipeline, asserts the panic was caught by the supervisor, then runs a second instance from the same Kafka topic and asserts the recovery completed correctly.

async fn process_event(&mut self, ev: Event) -> Result<()> {
    self.events_processed += 1;
    // The crash hook compiles away outside test builds; only the
    // counter check is gated, not the method itself.
    #[cfg(test)]
    if let Some(crash_at) = self.crash_after {
        if self.events_processed == crash_at {
            panic!("test-injected crash at event {}", self.events_processed);
        }
    }
    // ... actual processing
}

Combine with tokio::time::pause() (M2 L4 pattern) so the test runs in fast-forward without real wall-clock delays. The whole crash-recover-verify cycle should run in under a second for CI suitability.
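
The shape of such a test, sketched with hypothetical harness names (TestPipeline, assert_exactly_once); start_paused requires tokio's test-util feature.

// Harness names are placeholders for the L1 pattern extended as above.
#[tokio::test(start_paused = true)]
async fn crash_mid_process_recovers_exactly_once() {
    let mut pipeline = TestPipeline::new().with_crash_after(500);
    pipeline.run_until_panic().await; // supervisor catches the injected panic
    let recovered = TestPipeline::resume_from_same_topic().await;
    assert_exactly_once(&recovered.alert_log());
}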

How do I version the DLQ schema with serde tagged unions?

The simplest pattern is to use serde's #[serde(tag = "schema_version")] with a sum-type enum that wraps each version's struct. The reader dispatches on the tag automatically:

#[derive(Serialize, Deserialize)]
#[serde(tag = "schema_version")]
enum DlqEntryAnyVersion {
    #[serde(rename = "1")]
    V1(DlqEntryV1),
    #[serde(rename = "2")]
    V2(DlqEntryV2),
    // ... future versions
}

The writer always emits the latest version. The reader's serde_json::from_str::<DlqEntryAnyVersion>(line) automatically dispatches on the schema_version field in the JSON. Note that serde's internally-tagged representation serializes the tag as a JSON string ("1", not the number 1), so any hand-rolled writer must emit the string form. New versions add a variant; old code that doesn't know about the new version returns an error on deserialization, which can be handled gracefully (skip with metric), as in the sketch below.
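
A reader-side sketch of that graceful handling; handle_v1 and handle_v2 are hypothetical dispatch targets.

// Unknown or future schema versions are skipped with a metric rather
// than aborting the whole re-processing run.
match serde_json::from_str::<DlqEntryAnyVersion>(&line) {
    Ok(DlqEntryAnyVersion::V1(entry)) => handle_v1(entry)?,
    Ok(DlqEntryAnyVersion::V2(entry)) => handle_v2(entry)?,
    Err(e) => {
        metrics::counter!("dlq_unknown_schema_total").increment(1);
        tracing::warn!(?e, "skipping DLQ entry with unrecognized schema_version");
    }
}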

The DlqEntryV2 struct evolves backwards-compatibly — keep field names where possible, add new fields as Option<T> for graceful upgrade. Production tooling should provide a migrate-old-to-new tool that reads V1 entries and writes V2 entries; the re-processing tool reads either.

How do I size the dedup window and capacity for the alert sink?

The window must exceed the maximum re-delivery window. For SDA's pipeline, the dominant re-delivery source is checkpoint replay during restart: a checkpoint at cadence 30s means the pipeline can replay up to 30s of events (the time between the latest checkpoint and the crash). Add a safety factor — 5 minutes is comfortable. The capacity bound is the safety valve for bursts; during the 10x burst test from M4, the sink's incoming rate peaks at ~50K alerts/sec briefly, so a 100K capacity covers a 2-second peak comfortably.

The numbers are operational: tune based on actual observed re-delivery rates and burst characteristics. Document the chosen values with a BurstProfile-style comment in the topology declaration.
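
A sketch of the resulting declaration; DedupSet::new's signature is assumed from the L2 pattern, and both values are operational choices to re-tune against observed behavior.

use std::time::Duration;

// Assumed constructor shape: DedupSet::new(time_bound, capacity_bound).
let alert_dedup = DedupSet::new(
    Duration::from_secs(300), // time bound: 30s max replay window plus safety factor
    100_000,                  // capacity bound: ~2s of the ~50K/sec burst peak
);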

How do I test the re-processor without polluting the live pipeline?

A --target-topic test-mode-input argument that lets the tool point at a test topic instead of production. The integration test uses this flag; production runs use the default (the live input topic). The test asserts events landed in the test topic (via a test-mode consumer) without affecting the production state.

For dry-run validation in production, the --dry-run flag has the tool print what it would re-inject without actually sending. Operations uses dry-run before any production re-injection to confirm the filter is targeting the right window.
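
The argument surface, sketched with clap's derive API; the flag names mirror the acceptance criteria, and the default topic is a placeholder.

use clap::Parser;

/// sda-reprocess: re-inject DLQ entries after the underlying fix lands.
#[derive(Parser)]
struct Args {
    /// Repeatable: --error-kind Deserialization --error-kind SchemaMismatch
    #[arg(long = "error-kind")]
    error_kinds: Vec<String>,
    /// Repeatable: --operator radar-ingest
    #[arg(long = "operator")]
    operators: Vec<String>,
    /// Unix ms, inclusive window start.
    #[arg(long)]
    since: Option<u64>,
    /// Unix ms, inclusive window end.
    #[arg(long)]
    until: Option<u64>,
    /// Placeholder default; not the production topic name.
    #[arg(long, default_value = "sda-input")]
    target_topic: String,
    /// Print what would be re-injected without sending.
    #[arg(long)]
    dry_run: bool,
}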


Getting Started

Recommended order:

  1. Source-side internal log. A per-source append-only file that records every observation with its Kafka offset before emitting downstream. The recovery story works only when this is durable.
  2. Sink-side DedupSet on alert_id. L2's pattern; double-bound (time + count). Verify with a test that injects the same alert_id twice and asserts only one downstream emit.
  3. Kafka consumer reconfiguration. Switch from auto-commit to process-then-commit. Verify with the L1 crash test: kill between process and commit, restart, observe the redelivery.
  4. CheckpointingOperator wrapping the windowed correlator. L3's pattern; atomic temp-file + rename; 30-second cadence to start.
  5. Recovery routine on startup. Local-first, S3-second, fresh-third hierarchy.
  6. DLQ sink + per-operator classifier. L4's pattern; classify each error explicitly; route to DLQ with metadata.
  7. sda-reprocess CLI tool. Read from DLQ, filter, re-inject. Test against a synthetic DLQ.
  8. Three crash tests in the integration suite. Mid-process, mid-checkpoint, mid-emit. Assert every alert lands exactly once.

Aim for the first crash test passing by day 5 (consumer reconfig + checkpoint + recovery). The DLQ and re-processor land in the second week along with the operational runbook. The chaos-engineering stretch goal is an end-of-week-2 polish if time permits.


What This Module Sets Up

In Module 6 you will surface the new metrics — checkpoint_age, dlq_entries_total, recovery_path — as the operational dashboard's resilience panels. The runbook discipline you establish here (per-error-kind playbooks, re-processing protocols) becomes part of the on-call rotation's standard procedure. The audit script from M4 extends to verify every operator has a documented retry-disposition classifier and DLQ wiring.

The pipeline at the end of this module is correct under load (M4), correct across restart (M5), correct in event time (M3), and correctly orchestrated (M2). It produces output that downstream subscribers can trust to be exactly-once-effective. M6's work is making that correctness visible to operations — the dashboards, lineage, distributed tracing, and SLO monitoring that turn a correct pipeline into an operationally-legible one.

This is the module where the SDA Fusion Service crosses from "works under happy paths" to "works through real production failure modes." The patterns generalize beyond SDA to any streaming system that must survive restarts: at-least-once + idempotent + checkpointed + DLQ'd is the canonical streaming-pipeline reliability stack.

Module 06 — Observability and Lineage

Track: Data Pipelines — Space Domain Awareness Fusion Position: Module 6 of 6 (final module of the track) Source material: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 13 (Monitoring Kafka — Service-Level Objectives, Lag Monitoring, Diagnosing Cluster Problems); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (operational considerations, provenance and audit trails); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 (DataOps and Data Lineage as Undercurrents of the Data Engineering Lifecycle) Quiz pass threshold: 70% on all three lessons to unlock the project


Mission Context

OPS ALERT — SDA-2026-0245 Classification: OBSERVABILITY STAND-UP Subject: Make pipeline correctness visible to operations during incidents

The Phase 5 pipeline is correct under load (M4), correct across restart (M5), correct in event time (M3), and correctly orchestrated (M2). It produces alerts the subscriber can trust. Last week's lag-detection incident took 3 hours to diagnose because the dashboard had pipeline-level metrics but not stage-level metrics, and the on-call engineer had to instrument the pipeline live to find the slow stage. Operations cannot observe the pipeline's correctness during an incident; the pipeline IS correct, but its correctness is invisible at exactly the moments when visibility matters most.

This module is the final piece. Metrics supply the aggregate signals that summarize behavior — the four golden signals plus lag, the SLI/SLO/SLA tracking discipline, the per-stage breakdowns that turn "the pipeline is slow" into "stage 4 is the bottleneck." Lineage supplies per-event traceability — given a wrong output, walk backward to find the contributing inputs; given a known-bad input, walk forward to find the affected outputs. Distributed tracing supplies per-event explanations — for a specific event, what was each operator doing when it processed it. The three together complete the observability stack: metrics for "is something wrong?", lineage for "which events?", tracing for "why?".

The pipeline at the end of this module is correct, hardenable, restart-safe, and operationally legible. Operations can diagnose any of the three common incident patterns within minutes; the dashboard surfaces the right signals at the right granularity; the runbook documents the standard procedures; the CLI tooling supports both real-time investigation and post-incident analysis. This is the production-quality data pipeline the SDA Fusion Service has been building toward for six modules.

The mental model the module installs is the three-component observability stack: (1) metrics emit aggregate signals at the right scope (per-stage, pipeline-level), (2) lineage emits per-event ancestry both backward-queryable (from output to inputs) and forward-queryable (from input to affected outputs), (3) tracing emits per-event-per-operator spans linked into traces via a propagated trace_id. Each component answers a different operational question; together they cover every diagnostic need that arises during an incident.


Learning Outcomes

After completing this module, you will be able to:

  1. Define the golden signals for streaming pipelines (throughput, latency, errors, and saturation, plus lag as the streaming-specific fifth) and choose the right metric type (counter, gauge, histogram) for each measurement
  2. Distinguish SLI, SLO, and SLA, and configure proactive alerting on SLO violations before they become SLA breaches
  3. Implement per-event lineage tagging on the envelope with deterministic-hash sampling and top-K truncation at fan-in operators
  4. Reason about lineage's two query directions — backward for post-incident analysis, forward for impact assessment — and build the indices that make each query efficient
  5. Add distributed tracing spans to operators with tracing::instrument, propagate trace_id across operator boundaries via the envelope, and choose between head-based and tail-based sampling per the operational profile of each path
  6. Inject canary observations as a pipeline-wide regression detector that catches what aggregate metrics smooth over
  7. Apply the three diagnostic patterns (rising lag, wrong alert, DLQ spike) using metrics + lineage + tracing in a fixed reading order during incidents

Lesson Summary

Lesson 1 — Pipeline Metrics

The four golden signals for streaming pipelines (throughput, latency, errors, saturation) plus lag as the streaming-specific master diagnostic. SLI/SLO/SLA as three layered quality measurements. The three metric types matched to question shapes — wrong-type bugs produce misleading dashboards. Per-stage and pipeline-level metrics both required: pipeline-level for first-look "is something wrong?", per-stage for "which operator?". Lag split into source-lag and pipeline-lag answers "is it ours or theirs."

Key question: The dashboard shows nominal throughput, latency, and errors but lag is at 240s and rising. Per the lesson's diagnostic order, what is the on-call engineer's first action?

Lesson 2 — Data Lineage

Event-level lineage as a first-class pipeline output. The Option<LineageTrace> envelope extension; per-operator append on emit; the storage cost (2^N at fan-in depth N) and three management strategies (1% sampling, top-K truncation, externalization to a graph database). Deterministic hash sampling for cross-replica reproducibility. Two query directions: backward for post-incident analysis, forward for impact assessment, with the forward index making impact queries O(K). OpenLineage as the complementary cross-system standard at the dataset-and-job level.

Key question: A radar later found to have a calibration bug between 14:00-15:00 yesterday produced ~6,000 observations. The pipeline emitted ~80 alerts during the same window. Which alerts were affected, and which lineage query direction answers this efficiently?

Lesson 3 — Debugging Under Load

Distributed tracing as the third layer of observability. Spans per-event-per-operator; trace_id propagation via the envelope; head-based sampling (deterministic hash, 1% default) and tail-based sampling (buffers spans, decides at sink). Canary observations as the regression detector for pipeline-wide subtle changes that aggregate metrics smooth over. The three diagnostic patterns the runbook documents: rising lag (split, per-stage, channel gradient, span tree); wrong alert (lineage backward + per-operator trace); DLQ spike (metadata, partner-side hypothesis, lineage forward, sda-reprocess after fix).

Key question: The on-call engineer is paged at 02:00 AM with rising lag. Per the diagnostic discipline, what should the first action be, and why is reading the metrics layer the right starting point?


Capstone Project — Fusion Pipeline Observability Stack

Wrap the M5 pipeline in the complete observability stack. Per-stage metrics for the four golden signals plus lag; SLO compliance calculator with proactive alerting; lineage tagging at 1% sampling with top-K truncation; distributed tracing with head-based + tail-based sampling; canary injector and watcher; a Grafana dashboard JSON; an operational runbook documenting the three diagnostic patterns. The deliverable is the production-quality SDA Fusion Service: ingest → orchestrate → window → backpressure → exactly-once → observe. Acceptance criteria, suggested architecture, runbook structure, and the full project brief are in project-observability-stack.md.

The orchestrator from M2, the windowed correlator from M3, the priority-aware shedding from M4, and the resilience tooling from M5 are all unchanged in structure. The new components are wrappers and sidecars that observe the running pipeline; the operator graph grows by a few nodes (canary injector, canary watcher, tail-sampler).


File Index

module-06-observability-and-lineage/
├── README.md                                  ← this file
├── lesson-01-pipeline-metrics.md              ← Pipeline metrics
├── lesson-01-quiz.toml                        ← Quiz (5 questions)
├── lesson-02-data-lineage.md                  ← Data lineage
├── lesson-02-quiz.toml                        ← Quiz (5 questions)
├── lesson-03-debugging-under-load.md          ← Debugging under load
├── lesson-03-quiz.toml                        ← Quiz (5 questions)
└── project-observability-stack.md             ← Capstone project brief

Prerequisites

  • Modules 1 through 5 completed — the Observation envelope, the orchestrator with supervisor, the watermark-aware windowed correlator, the per-edge FlowPolicy discipline, and the at-least-once + idempotent + checkpointed + DLQ'd pipeline are all assumed
  • Foundation Track completed — async Rust, channels, runtime intuitions
  • Familiarity with the tracing crate's instrument macro and the metrics crate's counter/gauge/histogram primitives
  • Working familiarity with Prometheus metric types and PromQL-style rate/percentile queries; comfort reading a Grafana dashboard JSON

What Comes Next

This is the final module of the Data Pipelines track. The pipeline at the end of M6 is correct, hardenable, restart-safe, and operationally legible — every piece of the production-streaming-pipeline stack is in place. The patterns the six modules installed generalize beyond SDA: every streaming pipeline that operates at scale (financial trading, ad-tech, IoT telemetry, distributed log processing) uses some combination of these techniques. The Meridian Space Academy curriculum's framing was specific; the techniques are universal.

The next track in the curriculum (Data Lakes — Artemis Base Cold Archive) takes the M6 pipeline's emitted alerts and builds the durable archive layer beneath them. The track after that (Distributed Systems — Constellation Network) takes the single-process pipeline and distributes it across the 48-satellite compute grid. Both build on the foundation this track establishes.

The SDA Fusion Service is now a production-quality data pipeline. The Data Pipelines track is complete.

Lesson 1 — Pipeline Metrics

Module: Data Pipelines — M06: Observability and Lineage Position: Lesson 1 of 3 Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 13 (Monitoring Kafka — Service-Level Objectives, Lag Monitoring, Metric Basics); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 (DataOps as an Undercurrent of the Data Engineering Lifecycle)


Context

The pipeline at the start of this module is correct under load (M4), correct across restarts (M5), correct in event time (M3), and correctly orchestrated (M2). It produces output that downstream subscribers can trust. There is one remaining gap: the pipeline's correctness is invisible to operations. Every internal property the previous modules built — the watermark progress, the per-channel occupancy gradient, the per-operator latency, the duplicate-absorption rate at the dedup sinks, the DLQ growth rate — is a property the engineers know exists but cannot easily see during incident response. The SDA-2026-0245 incident from this module's mission framing took three hours to diagnose because the operational dashboard had pipeline-level metrics (throughput, error rate) but not stage-level metrics (per-operator latency, channel occupancy), and the on-call engineer had to instrument the pipeline live to find the slow stage.

The fix is observability. Metrics (this lesson) supply the aggregate signals the dashboard summarizes — the four golden signals adapted for streaming pipelines, the SLI/SLO/SLA discipline applied to event-time-aware operators, and the per-stage breakdowns that turn "the pipeline is slow" into "stage 4 is the bottleneck." Lineage (Lesson 2) supplies the per-event traceability — given a wrong output, walk backward through the pipeline to find the contributing inputs and the operator that introduced the error. Debugging under load (Lesson 3) is the discipline that turns metrics and lineage into actionable diagnosis during incidents — sampling strategies for tracing, canary observations for regression detection, and the dashboard reading patterns for the three common failure shapes.

The pipeline is the same pipeline; this module wraps it in a layer that makes its correctness operationally legible. The capstone integrates all three pieces into the operational dashboard and the runbook that ops uses during incidents.


Core Concepts

The Four Golden Signals (for Streaming Pipelines)

Google's Site Reliability Engineering book identifies four golden signals for any user-facing service: latency, traffic, errors, saturation. The streaming pipeline analogues are similar but with one important addition: lag as a fifth signal that subsumes the unique time dimension of event-time pipelines.

  • Throughput (traffic). Events per second crossing a stage. Per-stage and per-pipeline. Counter type. The simplest signal — it confirms the pipeline is running and flags a bottleneck when a downstream stage's throughput falls below an upstream's.
  • Latency (latency). Per-event processing time, measured as emit_time - enqueue_time per stage. Histogram type so percentiles (P50, P95, P99) are queryable. The signal that catches operator-internal slowdowns.
  • Errors (errors). Errors per second per operator, split by error_kind to match the L4 DLQ classifier from M5. Counter type with a label per kind. A sudden change in the error rate is operationally meaningful — typically a partner-side change or a deploy regression.
  • Lag (the streaming-specific signal). The gap between event time and processing time. lag = now - sensor_timestamp for an in-flight observation. Gauge type per source plus an aggregated pipeline-level value. The master diagnostic for "is the pipeline keeping up with real time" — a question that throughput alone cannot answer.
  • Saturation (saturation). Per-channel occupancy from M4 L3. Gauge type per edge. The signal that identifies the pressure-gradient bottleneck.

The five together cover every operational question the SDA pipeline gets asked during an incident. The dashboard's primary panels show all five at once; the runbook reads them in a fixed order.

SLI / SLO / SLA

The acronyms are precise and the distinction matters operationally.

SLI — Service-Level Indicator. The metric you measure. For SDA's conjunction-alert path, the canonical SLI is P99 of (alert_emit_time - sensor_timestamp) over the last hour. A number, not a target.

SLO — Service-Level Objective. The target the SLI must meet. For SDA, the SLO is P99 alert latency < 30 seconds, 99.9% of the time over a 30-day window. A target, internally agreed.

SLA — Service-Level Agreement. The contractual obligation to downstream consumers. For SDA's pipeline, the alert subscriber's SLA is "alerts arrive within 60 seconds of event time, with 99.5% reliability." Less aggressive than the SLO, by design — the SLO is the internal goal; the SLA is the external promise; the gap is the buffer that protects the SLA when the SLO momentarily slips.

The discipline is to track all three and alert on SLO violations before they become SLA violations. The SLO calculator from this lesson's code examples computes the SLI's value over the rolling window and compares against the SLO target; an alert fires when the SLO is at risk (SLI trending toward the target's edge with non-trivial probability of crossing). Module 5's checkpoint-age and DLQ-rate metrics from L3-L4 contribute to the SLO compliance picture; if checkpoints stall or DLQ entries spike, the alert latency SLI is at risk.

Metric Types: Counter, Gauge, Histogram

Choosing the right metric type matters because a wrong choice produces a misleading dashboard. The three primary types in Prometheus-style metrics:

Counter. Monotonically increasing. Reset on process restart. Used for cumulative event counts: events_total, errors_total{kind=...}, dlq_entries_total, retractions_emitted_total. The dashboard derives rates as rate(counter[window]) — the per-second rate over the window.

Gauge. Instantaneous value, can go up or down. Used for current state: channel_occupancy{edge=...}, pending_windows{operator=...}, pipeline_watermark_seconds, consumer_lag_seconds. The dashboard shows the current value or a recent average.

Histogram. Distribution of values, used for percentile queries: latency_seconds{stage=...}, checkpoint_pause_duration_ms, source_lag_seconds{source=...}. The dashboard queries P50, P95, P99 from the histogram's buckets.

Wrong-type bugs show up as misleading dashboards. A counter for "current backlog" is wrong because the dashboard shows ever-growing values that don't reflect the actual current state. A gauge for "events processed this minute" is wrong because the gauge does not survive process restarts and the rate calculation is broken across them. A histogram for an integer enum (like priority_classifier_decision) is wrong because the buckets cannot meaningfully represent the values. Choose the type that matches the question being asked.
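
The mapping in code, using the metrics facade the pipeline exports through; names and label values are illustrative.

// Counter: cumulative, rate-derived on the dashboard.
metrics::counter!("events_total", "stage" => "normalize").increment(1);
// Gauge: instantaneous state, read directly.
metrics::gauge!("channel_occupancy", "edge" => "correlator->alert_sink").set(42.0);
// Histogram: distribution, queried at P50/P95/P99.
metrics::histogram!("latency_seconds", "stage" => "correlator").record(0.012);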

Per-Stage vs Pipeline-Level

Two scopes of metrics, both required.

Pipeline-level metrics summarize end-to-end behavior. pipeline_throughput_total, pipeline_lag_seconds, slo_compliance_ratio. Useful for ops dashboards aimed at "is the pipeline healthy?" Aggregate signals; tell you something is wrong but not where.

Per-stage metrics localize to specific operators. operator_throughput_total{stage=normalize}, operator_latency_seconds{stage=correlator}, channel_occupancy{edge=correlator->alert_sink}. Useful during diagnosis aimed at "which operator is slow?" Localize the problem to the smallest component.

The two complement each other. The dashboard's primary panels are pipeline-level (the on-call engineer's first look). Drill-down panels are per-stage (the on-call engineer's investigation tool). Both are required because pipeline-level alone tells you something is wrong but does not tell you where; per-stage alone is too granular for the first-look summary.

Lag as the Master Diagnostic

Lag — now - sensor_timestamp for the most recent emission — answers the single most important operational question for an event-time pipeline: is the pipeline keeping up with real time? Throughput tells you how many events per second the pipeline is processing; latency tells you how long each event takes; neither tells you whether the pipeline is current with the world it is observing.

A pipeline can have high throughput and low latency and still be hours behind in lag. The shape: a partner API outage causes the source to back up; the source's input queue grows; when the partner recovers, the source drains the backlog at full speed (high throughput) and processes each event quickly (low latency) — but the events being processed are hours old in event time. Lag is the metric that surfaces this.

For SDA's pipeline, lag is split into source lag (ingest_timestamp - sensor_timestamp per source) and pipeline lag (now - ingest_timestamp for in-flight events) per Module 3 L1. The dashboard shows both, distinguishing "is the source slow?" from "is the pipeline slow?" The split is what makes the lag diagnostic actionable rather than just descriptive.


Code Examples

Instrumenting an Operator for Prometheus

Every operator in the pipeline gets per-stage metrics for the four golden signals. The code below uses the metrics facade crate with a Prometheus exporter (for example, metrics-exporter-prometheus): the orchestrator's startup code installs the exporter once, and each operator creates cheap metric handles against the shared recorder.

use anyhow::Result;
use std::time::Instant;
use tokio::sync::mpsc;

pub async fn run_operator_instrumented(
    name: &str,
    mut input: mpsc::Receiver<Observation>,
    output: mpsc::Sender<Observation>,
) -> Result<()> {
    // Counter for throughput — increments on every successful process.
    let events_total = metrics::counter!("operator_events_total", "stage" => name.to_string());
    // Counter for errors, labeled by error kind.
    let errors_total = metrics::counter!("operator_errors_total", "stage" => name.to_string());
    // Histogram for per-event latency (Prometheus default buckets work for SDA).
    let latency_seconds = metrics::histogram!("operator_latency_seconds", "stage" => name.to_string());

    while let Some(obs) = input.recv().await {
        let started = Instant::now();
        match process(obs.clone()).await {
            Ok(processed) => {
                output.send(processed).await
                    .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
                events_total.increment(1);
                latency_seconds.record(started.elapsed().as_secs_f64());
            }
            Err(e) => {
                errors_total.increment(1);
                tracing::warn!(?e, stage = %name, "operator error");
                // ... DLQ classification per M5 L4 ...
            }
        }
    }
    Ok(())
}

Three things to notice. The metric handles are created at the top of the function and reused on every event; the metrics::counter! macro is cheap (single hashmap lookup) but not free, and creating per-event handles would dominate the hot path. The latency histogram records elapsed().as_secs_f64() rather than millis — the Prometheus convention is seconds with float precision, which lets the histogram's buckets (default: 0.005s, 0.01s, ..., 10s) cover the realistic latency range of any pipeline operator. The errors counter is incremented in the Err arm, which is the single place where error counting belongs; the DLQ classification logic from M5 L4 hooks into the same arm via the operator's classifier function.

A Lag-Computing Sink Operator

The lag operator sits at or near the pipeline's sink end. It computes both source lag and pipeline lag for every emitted event and exports them as histograms.

use std::time::{Duration, SystemTime};

pub fn observe_lag(obs: &Observation) {
    let now = SystemTime::now();

    // Source lag: the gap between when the sensor recorded the event
    // and when the pipeline received it. Driven by the source's
    // upstream behavior (network paths, partner APIs).
    let source_lag = obs.ingest_timestamp
        .duration_since(obs.sensor_timestamp)
        .unwrap_or(Duration::ZERO);

    // Pipeline lag: the gap between when the pipeline received the
    // event and right now. Driven by the pipeline's own processing
    // time across all stages.
    let pipeline_lag = now.duration_since(obs.ingest_timestamp)
        .unwrap_or(Duration::ZERO);

    let kind = format!("{:?}", obs.source_kind);
    metrics::histogram!("source_lag_seconds", "source" => kind.clone())
        .record(source_lag.as_secs_f64());
    metrics::histogram!("pipeline_lag_seconds", "source" => kind)
        .record(pipeline_lag.as_secs_f64());
}

The split is the operationally important part. When the dashboard panel for source_lag_seconds{source="optical"} shows a 30-second P99 spike while pipeline_lag_seconds is unchanged, the on-call engineer knows the partner-side optical archive is the cause; pipeline operators are processing fine, the events are just arriving late. Conversely, a pipeline_lag_seconds spike with stable source lag points at the pipeline itself — a slow operator, a full channel, a stalled checkpoint. The split answers "is it ours or theirs" without ambiguity.

An SLO Compliance Calculator

The SLO calculator computes the SLI's current value from the histogram and compares against the target. It runs as a small auxiliary task that emits a gauge for slo_compliance_ratio — a value between 0.0 and 1.0 indicating what fraction of the rolling window's events met the SLO target.

use anyhow::Result;
use std::time::Duration;

#[derive(Debug, Clone, Copy)]
pub struct SloDefinition {
    /// The target — e.g., Duration::from_secs(30) for the alert latency SLO.
    pub target: Duration,
    /// The percentile being measured — typically 0.99 for P99 SLOs.
    pub percentile: f64,
    /// The rolling window — e.g., Duration::from_secs(3600) for hourly SLO.
    pub window: Duration,
}

pub struct SloCompliance {
    name: String,
    definition: SloDefinition,
    /// In-process histogram of recently-observed values for the SLI.
    /// Production code uses Prometheus's histogram and queries it via
    /// the metrics registry; we use an in-memory histogram for clarity.
    histogram: Histogram,
}

impl SloCompliance {
    pub fn new(name: impl Into<String>, definition: SloDefinition) -> Self {
        Self {
            name: name.into(),
            definition,
            histogram: Histogram::new(),
        }
    }

    /// Record an SLI sample (e.g., one observation's emit-to-sensor lag).
    pub fn record(&mut self, value: Duration) {
        self.histogram.record(value);
    }

    /// Compute the SLO compliance ratio: fraction of window's events
    /// that met the SLO target. Used by the alerting rule and the
    /// dashboard's SLO panel.
    pub fn compliance(&self) -> f64 {
        let total = self.histogram.count_in_window(self.definition.window);
        if total == 0 { return 1.0; }
        let breaching = self.histogram.count_above(
            self.definition.target,
            self.definition.window,
        );
        let compliant = total - breaching;
        compliant as f64 / total as f64
    }

    /// Emit the compliance ratio as a gauge for the dashboard.
    pub fn export(&self) {
        let ratio = self.compliance();
        metrics::gauge!("slo_compliance_ratio", "name" => self.name.clone()).set(ratio);
    }
}

// (Histogram is a stand-in for the production Prometheus histogram
// query; the real implementation uses prometheus::Histogram.)
struct Histogram { /* ... */ }
impl Histogram {
    fn new() -> Self { Histogram {} }
    fn record(&mut self, _value: Duration) { /* ... */ }
    fn count_in_window(&self, _window: Duration) -> u64 { 0 }
    fn count_above(&self, _threshold: Duration, _window: Duration) -> u64 { 0 }
}

The compliance ratio (e.g., 0.999 for "99.9% of events met the SLO") is exactly the number the SLO target compares against — the SLO is "compliance ≥ 0.999 over rolling 30 days," and the alert fires when the gauge drops below the threshold for a sustained period. The SLO calculator's value is operational: the dashboard panel showing the compliance ratio over time is what the ops engineer reads to confirm the pipeline is meeting its commitments without drilling into latency histograms directly. The SLI/SLO/SLA discipline becomes legible when the compliance ratio is on the dashboard alongside the raw latency.
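
Hypothetical wiring for the exporter: the sink records samples through a shared handle, and an auxiliary task publishes the gauge every ten seconds.

use std::sync::Arc;
use tokio::sync::Mutex;

let slo = Arc::new(Mutex::new(SloCompliance::new(
    "alert_latency",
    SloDefinition {
        target: Duration::from_secs(30),
        percentile: 0.99,
        window: Duration::from_secs(3600),
    },
)));

// Sink side, per emitted alert: slo.lock().await.record(emit_to_sensor_lag);

let exporter = Arc::clone(&slo);
tokio::spawn(async move {
    let mut tick = tokio::time::interval(Duration::from_secs(10));
    loop {
        tick.tick().await;
        exporter.lock().await.export();
    }
});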


Key Takeaways

  • The five signals for streaming pipelines are throughput, latency, errors, lag, and saturation. Lag is the streaming-specific signal that subsumes the event-time dimension; it is the master diagnostic for "is the pipeline keeping up with real time?"
  • SLI / SLO / SLA are the three layers of service-quality measurement. SLI is what you measure (a number), SLO is the target you commit to internally, SLA is the contract you make externally. Track all three; alert on SLO violations before they become SLA violations.
  • The three metric types match different question shapes. Counter for cumulative counts (events, errors). Gauge for current state (occupancy, lag, watermark). Histogram for percentile queries (latency, pause duration). Wrong-type bugs produce misleading dashboards.
  • Per-stage and pipeline-level metrics are both required. Pipeline-level for the first-look "is something wrong?" check; per-stage for the diagnosis "which operator?" question. The dashboard is structured to show pipeline-level prominently and per-stage as drill-down.
  • Lag split into source-lag and pipeline-lag answers "is the problem ours or theirs?" without ambiguity. The split is the M3 L1 framing made operational; the dashboard panel that distinguishes them is what turns the lag diagnostic from descriptive to actionable.

Lesson 2 — Data Lineage

Module: Data Pipelines — M06: Observability and Lineage Position: Lesson 2 of 3 Source: Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 (Data Management — Data Lineage as an Undercurrent); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (provenance and audit trails in stream processing)


Context

Metrics from Lesson 1 answer aggregate questions: how many alerts per minute, what's the P99 latency, where is the saturation gradient. They do not answer per-event questions. Why was this specific conjunction alert emitted? What observations contributed to it? Which sensor's data was the dominant signal? These are the questions the on-call engineer asks during an incident — when an alert turned out to be a false positive, when the alert subscriber asks "where did this come from," when ops needs to confirm a partner's data quality regression.

The mission framing for this module's predecessor (M5) was the SDA-2026-0207 incident where two false-positive avoidance maneuvers were executed. The post-incident analysis required tracing each phantom alert backward through the pipeline — the alert came from a ConjunctionRisk event, which was emitted by the correlator from two Observation envelopes, which came from the radar source and the optical source, which... at each step the analysis required pulling logs from the operator that produced the output, finding the matching input by timestamp, querying the upstream operator, and so on. Two days of investigation produced the answer (a single radar with bad calibration produced one of the inputs). With lineage — the per-event traceability the pipeline emits as part of every output — the same diagnosis would have taken minutes.

This lesson develops lineage as a first-class pipeline output. The pattern: every event the pipeline emits carries a lineage field listing the events that contributed to it. Map and filter operators copy the lineage forward unchanged. Aggregations and joins (the windowed correlator from M3 is the canonical case) merge their inputs' lineages. The lineage is queryable both backward (given a bad output, find the contributing inputs) and forward (given a bad input, find the affected outputs). The cost is per-event memory and serialization overhead; the benefit is the diagnostic capability that turns post-incident analysis from days into minutes.


Core Concepts

Event-Level Lineage

The simplest form of lineage is per-event: every output carries a list of parent event IDs that contributed to it. For a stateless map operator (the radar source's normalize step), the lineage is just the input's ID — one parent. For an aggregating operator (the windowed correlator that produces a ConjunctionRisk from two observations), the lineage is two parents. For a join over many inputs (a future operator that fuses N sensors' observations into a track), the lineage grows.

The data structure is small but load-bearing.

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LineageStep {
    /// The operator that produced this step.
    pub operator: String,
    /// When the step happened (the operator's emit time).
    pub timestamp: SystemTime,
    /// The IDs of the events that contributed.
    pub parent_ids: Vec<Uuid>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LineageTrace {
    pub steps: Vec<LineageStep>,
}

A LineageTrace carries the entire history of an event's path through the pipeline. The first step lists the source's parent IDs (typically just the source's own observation_id); each subsequent step adds the operator that processed it. The trace can be walked backward from any output to find every contributing event at every stage.

The Lineage Tag on the Envelope

Lineage extends the M1 Observation envelope as an optional field. The Option makes the change backwards-compatible — old envelopes without lineage continue to deserialize correctly; new envelopes carry the trace.

pub struct Observation {
    // ... existing fields from M1 ...
    pub observation_id: Uuid,
    pub source_id: SourceId,
    // ...
    /// Optional lineage trace. None means "lineage is not being tracked
    /// for this event," typically because of sampling (see below).
    pub lineage: Option<LineageTrace>,
}

The same shape applies to derived events like ConjunctionRisk: the envelope grows by one optional LineageTrace field. Every operator's emit logic appends a LineageStep to the trace if it exists, leaves it as None if it does not. The pattern is identical at every operator; the implementation is small and easy to instrument uniformly.

Storage Cost

Lineage is not free. Every event carries the trace; the trace grows with each step the event passes through; for joins and aggregations the trace's parent_ids list grows with the fan-in. For a 10-stage pipeline with average fan-in of 2, lineage doubles ten times — 2^10 = 1024 IDs per output event, plus the per-step metadata. At SDA's volumes (10K events/sec across the pipeline, 100 bytes per LineageStep, 16 bytes per UUID), that is roughly 17 KB of lineage per event and ~170 MB/sec of lineage data — clearly impractical to carry on every event.

Three strategies for managing the cost.

Sampling. Only some events carry lineage; the rest carry lineage: None. The sample rate is configurable per pipeline; the SDA default is 1%, so roughly one event in a hundred has its lineage tracked. Investigations rely on the sampled events being representative; with 1% sampling at SDA's volumes, the on-call engineer has hundreds of lineage-bearing events per minute to investigate, which is plenty for diagnostic purposes.

Truncation. Lineage is bounded at top-K parents per fan-in operator. The correlator's emit lists only the most-contributing 4 parents (by uncertainty weight or similar) rather than every observation in the window. Truncation loses some diagnostic detail but keeps the trace size bounded regardless of fan-in.

Externalization. Lineage is written to a separate store (a Kafka topic, a graph database, an object-storage bucket); the envelope carries only a lineage_id reference. Investigation queries the external store. Externalization decouples lineage size from envelope size at the cost of an additional system to operate.

For SDA's pipeline, sampling at 1% is the default; truncation is configured per operator; externalization is the path forward when the pipeline scales 10x or when lineage volume becomes a bottleneck. The lesson develops sampling and truncation; externalization is a forward reference for production scale.

OpenLineage

Production lineage tooling has converged on the OpenLineage standard — a JSON-Schema-defined event format that describes datasets, jobs, and their relationships. Tools like Marquez, Datakin, and DataHub consume OpenLineage events from various pipelines (Airflow, Spark, dbt, custom) and provide a unified lineage browser.

The SDA pipeline can emit OpenLineage events alongside its in-envelope lineage; the in-envelope version is for per-event traceability (the diagnostic case the lesson focuses on), and the OpenLineage version is for cross-system visibility (the dataset-level case where lineage is part of the broader data engineering graph). The two are complementary; this lesson focuses on the in-envelope shape because it is the operationally most useful for the SDA team, but the OpenLineage emission is documented as an operational extension.

Lineage as a Debugging Tool

The on-call engineer's investigation pattern is the same in every incident.

Backward walk — given a wrong output, trace its lineage backward to find the contributing inputs. The phantom alert from M5's incident: lineage shows the two observations that fed the correlator, the radar's observation_id and the optical's observation_id. The radar's lineage shows the upstream UDP frame. The optical's lineage shows the HTTP poll response. The investigation lands at the source within minutes; the next question (which radar? which time?) is answerable from the lineage's metadata.

Forward walk — given a known-bad input, trace forward to find every output that incorporates it. A radar that turned out to have a calibration bug between 14:00-15:00 yesterday: lineage queries find every ConjunctionRisk event whose parent_ids include any observation from that radar in that time window. The investigation produces the list of affected alerts; the alert subscriber's owner is notified to invalidate them.

Both directions are valuable. The backward walk is the canonical post-incident analysis; the forward walk is the canonical impact assessment. Both require the lineage to be queryable as a graph rather than a stream, which the externalization or in-pipeline indexing supports. The capstone wires this into a small sda-lineage CLI tool that performs both walks against the sampled lineage data.


Code Examples

Adding Lineage to the Envelope

The minimal envelope change. The Option<LineageTrace> is backwards-compatible with existing event handling; the lineage-aware operators populate it, the lineage-unaware operators ignore it.

use serde::{Deserialize, Serialize};
use std::time::SystemTime;
use uuid::Uuid;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LineageStep {
    pub operator: String,
    pub timestamp_unix_ms: u64,
    pub parent_ids: Vec<Uuid>,
}

#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct LineageTrace {
    pub steps: Vec<LineageStep>,
}

impl LineageTrace {
    /// Append a step for an operator that processed `parents` and
    /// emitted a new event derived from them.
    pub fn append(&mut self, operator: impl Into<String>, parents: Vec<Uuid>) {
        self.steps.push(LineageStep {
            operator: operator.into(),
            timestamp_unix_ms: SystemTime::now()
                .duration_since(SystemTime::UNIX_EPOCH)
                .map(|d| d.as_millis() as u64)
                .unwrap_or(0),
            parent_ids: parents,
        });
    }

    /// All distinct ancestor IDs across the whole trace. Used by the
    /// forward-walk query: 'is this event_id an ancestor of this output?'
    pub fn all_ancestors(&self) -> std::collections::HashSet<Uuid> {
        let mut set = std::collections::HashSet::new();
        for step in &self.steps {
            for id in &step.parent_ids {
                set.insert(*id);
            }
        }
        set
    }
}

The append method is the per-operator instrumentation point. Each operator's emit logic — after producing a derived event from its inputs — calls lineage.append("operator-name", parents) to record the step. The all_ancestors method enables the forward-walk query: given an output, does its lineage include any of the IDs in a known-bad set? If yes, the output is affected. The Default impl makes constructing an empty trace easy for source operators.
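
For a fan-in operator like the correlator, the emit path merges before appending. A sketch, assuming the parents arrive as (id, optional trace) pairs; concatenating steps before appending is one reasonable merge policy, not the only one.

use uuid::Uuid;

/// Fan-in lineage handling: carry the inputs' histories forward, then
/// record the step that produced the derived event.
fn merge_lineage(operator: &str, inputs: &[(Uuid, Option<LineageTrace>)]) -> LineageTrace {
    let mut merged = LineageTrace::default();
    // Aggregating operators merge their inputs' traces...
    for (_, trace) in inputs {
        if let Some(t) = trace {
            merged.steps.extend(t.steps.iter().cloned());
        }
    }
    // ...then append the step for the derived event itself.
    let parents: Vec<Uuid> = inputs.iter().map(|(id, _)| *id).collect();
    merged.append(operator, parents);
    merged
}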

A Sampling Lineage Policy

The sampling decision is a hash on the observation_id, which makes it deterministic and reproducible. Across replicas, the same observation_id always samples to the same decision — useful for cross-replica comparisons during debugging.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use uuid::Uuid;

pub struct LineageSampler {
    sample_rate: f64,  // 0.0..=1.0; 0.01 = 1% sampling
}

impl LineageSampler {
    pub fn new(sample_rate: f64) -> Self {
        Self { sample_rate: sample_rate.clamp(0.0, 1.0) }
    }

    /// Should this observation carry lineage? Hash-based deterministic
    /// decision — same observation_id always produces the same answer.
    pub fn should_sample(&self, observation_id: &Uuid) -> bool {
        let mut h = DefaultHasher::new();
        observation_id.hash(&mut h);
        let hashed = h.finish();
        // Map the 64-bit hash to a value in [0, 1.0).
        let bucket = (hashed % 10_000) as f64 / 10_000.0;
        bucket < self.sample_rate
    }
}

The deterministic-hash approach beats random sampling in two operationally important ways. First, the same input produces the same lineage decision regardless of which replica processed it; debugging a specific event_id always gets the same lineage answer, not a flip of a coin. Second, the realized sampling rate tracks the configured rate closely (assuming the hash distributes uniformly) and does not vary run to run the way an RNG-based sample does. The cost is microseconds per event for the hash, well below noise in any pipeline operator's hot path.
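
A sketch of a unit test that pins down both properties; the rate bounds are loose enough to hold for any reasonably uniform hash.

#[cfg(test)]
mod tests {
    use super::*;
    use uuid::Uuid;

    #[test]
    fn sampling_is_deterministic_and_near_rate() {
        let sampler = LineageSampler::new(0.01);
        let ids: Vec<Uuid> = (0..100_000).map(|_| Uuid::new_v4()).collect();
        let sampled: Vec<bool> = ids.iter().map(|id| sampler.should_sample(id)).collect();
        // Same id, same answer: the reproducibility property.
        for (id, first) in ids.iter().zip(&sampled) {
            assert_eq!(sampler.should_sample(id), *first);
        }
        // Realized rate is close to the configured 1%.
        let hits = sampled.iter().filter(|s| **s).count();
        assert!((800..=1_200).contains(&hits), "got {hits} of 100000");
    }
}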

A Lineage Query Tool

The CLI tool that performs backward and forward walks against a corpus of sampled lineage data. Production deployments would query a graph database or the externalized lineage store; for SDA's scale and the lesson's scope, an in-memory query against a JSON-Lines file of sampled events is sufficient.

use anyhow::Result;
use std::collections::HashMap;
use std::path::Path;
use uuid::Uuid;

pub struct LineageStore {
    /// Map from event_id to its lineage trace.
    traces: HashMap<Uuid, LineageTrace>,
    /// Reverse index: parent_id → list of event_ids that include it as an ancestor.
    forward_index: HashMap<Uuid, Vec<Uuid>>,
}

impl LineageStore {
    /// Load sampled events from a JSON-Lines file (one event per line,
    /// each line a (event_id, lineage) tuple).
    pub async fn load(path: impl AsRef<Path>) -> Result<Self> {
        let mut traces = HashMap::new();
        let mut forward_index: HashMap<Uuid, Vec<Uuid>> = HashMap::new();
        let content = tokio::fs::read_to_string(path).await?;
        for line in content.lines() {
            let (event_id, trace): (Uuid, LineageTrace) = serde_json::from_str(line)?;
            for ancestor in trace.all_ancestors() {
                forward_index.entry(ancestor).or_default().push(event_id);
            }
            traces.insert(event_id, trace);
        }
        Ok(Self { traces, forward_index })
    }

    /// Backward walk: given an event_id, return its full lineage.
    pub fn backward(&self, event_id: Uuid) -> Option<&LineageTrace> {
        self.traces.get(&event_id)
    }

    /// Forward walk: given an event_id (typically a known-bad input),
    /// return all event_ids whose lineage includes it as an ancestor.
    pub fn forward(&self, ancestor_id: Uuid) -> Vec<Uuid> {
        self.forward_index.get(&ancestor_id).cloned().unwrap_or_default()
    }
}

The forward index is the operationally critical data structure for the impact-assessment use case. Querying "every event affected by ancestor X" is an O(1) lookup if the index is precomputed at load time; without the index, the query is O(N) over every event in the corpus, which makes incident response slower than it needs to be. The cost is memory (the index roughly doubles the lineage corpus's in-memory footprint), which is typically negligible next to the diagnostic value. Production code uses a graph database (Neo4j, JanusGraph) or a specialized lineage store (Marquez); the in-memory version above is the SDA pipeline's starting point.
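
A sketch of the CLI shell around LineageStore, assuming the clap crate for argument parsing; the capstone requires only the two subcommands, and the flag layout here is illustrative.

use clap::{Parser, Subcommand};
use uuid::Uuid;

#[derive(Parser)]
#[command(name = "sda-lineage")]
struct Cli {
    /// Path to the sampled JSON-Lines lineage corpus.
    #[arg(long)]
    corpus: std::path::PathBuf,
    #[command(subcommand)]
    cmd: Cmd,
}

#[derive(Subcommand)]
enum Cmd {
    /// Backward walk: print the full trace for one output event.
    Backward { event_id: Uuid },
    /// Forward walk: print every output that includes this ancestor.
    Forward { ancestor_id: Uuid },
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let cli = Cli::parse();
    let store = LineageStore::load(&cli.corpus).await?;
    match cli.cmd {
        Cmd::Backward { event_id } => println!("{:#?}", store.backward(event_id)),
        Cmd::Forward { ancestor_id } => {
            for id in store.forward(ancestor_id) {
                println!("{id}");
            }
        }
    }
    Ok(())
}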


Key Takeaways

  • Event-level lineage carries each output's parent IDs back through the pipeline. Map operators copy the trace forward; aggregating operators merge their inputs' traces; the trace can be walked backward from any output to find every contributing event at every stage.
  • The storage cost of lineage is real: full lineage grows multiplicatively with fan-in, roughly f^d for average fan-in f and pipeline depth d. For SDA's volumes, 1% sampling is the default; truncation at top-K parents bounds per-event cost; externalization to a graph database is the path to production scale.
  • Deterministic hash sampling beats random sampling: same input always produces the same lineage decision, so debugging a specific event_id is reproducible across replicas. Cost is microseconds per event for the hash; well below noise.
  • The two query directions are operationally distinct. Backward walk: given a bad output, find the contributing inputs (post-incident analysis). Forward walk: given a known-bad input, find the affected outputs (impact assessment). Both require lineage to be queryable as a graph; the forward index makes the second query O(1).
  • OpenLineage is the production cross-system standard; the in-envelope lineage from this lesson is the per-event traceability that powers diagnostic queries. The two are complementary — emit OpenLineage events for cross-pipeline visibility, carry in-envelope lineage for per-event investigation.

Lesson 3 — Debugging Under Load

Module: Data Pipelines — M06: Observability and Lineage
Position: Lesson 3 of 3
Source: Kafka: The Definitive Guide — Shapira, Palino, Sivaram, Petty, Chapter 13 (Diagnosing Cluster Problems, The Art of Under-Replicated Partitions); Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11 (operational considerations); Fundamentals of Data Engineering — Joe Reis & Matt Housley, Chapter 2 (DataOps as an Undercurrent)


Context

Metrics from Lesson 1 tell you the pipeline is broken. Lineage from Lesson 2 tells you which events were affected. Neither tells you why — for a specific event, what was the operator doing at the moment it produced the wrong output? What was on the worker thread? What did the operator's internal state look like when it processed this input?

Distributed tracing is the answer. Spans — discrete intervals of work with a name, start time, end time, and structured attributes — captured per event as it flows through operators. Trace context — a correlation ID propagated across operator boundaries via the envelope so the same event's spans across multiple operators link into a single trace. Sampling — the policy that decides which events get traced (because tracing every event at SDA's volumes is the same volume problem as full lineage from L2).

Together with metrics and lineage, tracing completes the operational debugging stack: metrics for "is something wrong?", lineage for "which events?", and tracing for "why?" The capstone wires all three into the operational dashboard and the runbook. This lesson establishes the tracing piece and the discipline that combines the three into an actionable diagnostic process during incidents. It closes the Data Pipelines track; M6 is the final module, and the SDA Fusion Service curriculum is complete.


Core Concepts

Distributed Tracing for Pipelines

A span is a discrete unit of work. The streaming-pipeline analogue is per-event-per-operator: when an operator processes an event, the operator opens a span, does its work, closes the span. The span has a name (the operator's name), a start_time and end_time, and a set of structured attributes (the event_id, any operator-specific values like the window the event landed in). At the end of the operator's processing, the span is exported — sent to a tracing backend (Jaeger, Zipkin, OpenTelemetry collector) where it can be queried.

The Rust ecosystem's tracing crate is the de facto standard. Each operator's processing function uses #[tracing::instrument] to wrap its body in a span automatically; the span's name defaults to the function name; attributes can be added via the macro's parameters or at runtime via Span::record.

#[tracing::instrument(skip(obs), fields(observation_id = %obs.observation_id))]
async fn process_observation(obs: Observation) -> Result<ConjunctionRisk> {
    // ... operator body ...
}

The skip directive omits the entire obs payload from the span (it would balloon span size with the payload contents); the fields directive adds the specific attribute we care about for the trace. Production code is conservative about which attributes to record — too few and the trace is opaque; too many and the trace volume becomes the dominant operational cost.
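
When an attribute's value is only known mid-body, the pattern is to declare the field as empty and record it later. A sketch; the window helper and envelope types are assumed from earlier modules.

use tracing::{field, instrument, Span};

#[instrument(
    skip(obs),
    fields(observation_id = %obs.observation_id, window_start = field::Empty),
)]
async fn assign_window(obs: Observation) -> WindowedObservation {
    let window = window_for(obs.sensor_timestamp);
    // Fill in the field declared Empty above, now that the value is known.
    Span::current().record("window_start", window.start_unix_ms);
    WindowedObservation { obs, window }
}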

Trace Context Propagation

A trace covers a single event's path across multiple operators. For the spans across operators to link into one trace, the trace context (a unique 128-bit trace_id) must be propagated as the event flows through. The pattern is to attach the trace_id to the envelope:

pub struct Observation {
    // ... existing fields ...
    /// Optional trace_id for distributed tracing. None means
    /// 'this event is not being traced' (sampling decision).
    pub trace_id: Option<TraceId>,
}

When operator A processes an event with trace_id: Some(t), it creates a span with that trace_id; the span is associated with the existing trace rather than a new one. When A emits to operator B, it copies the trace_id forward; B's span joins the same trace. The result is a single trace covering every operator's per-event work, queryable in the tracing backend as one tree.

The integration with tracing::instrument works through parent-span linkage: if the operator parents its span to the upstream span, the spans link correctly. Production OpenTelemetry tooling handles this via context propagation across async boundaries; the SDA pipeline's wiring uses tracing::Span::current() and explicit parent: arguments to the span! macro.
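
A sketch of that wiring; lookup_parent_span is assumed plumbing that resolves the envelope's trace context to a live Span handle.

use tracing::{span, Level, Span};

/// Open a per-event span explicitly parented to the upstream span.
/// `lookup_parent_span: fn(&TraceId) -> Option<Span>` is assumed plumbing.
fn operator_span(obs: &Observation) -> Span {
    match obs.trace_id.as_ref().and_then(lookup_parent_span) {
        Some(parent) => span!(parent: &parent, Level::INFO, "normalize",
                              observation_id = %obs.observation_id),
        // Untraced events still get a local root span.
        None => span!(Level::INFO, "normalize",
                      observation_id = %obs.observation_id),
    }
}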

Sampling Strategies

Tracing every event at SDA's volumes is impractical (the same volume problem as full lineage). Two strategies for managing the cost.

Head-based sampling. The decision to trace is made at the source operator, before the event flows through the pipeline. The same hash-based deterministic approach from L2's lineage sampling applies: the source uses hash(observation_id) % 100 < sample_pct to decide. Cheap (single hash per event), reproducible (same event_id always produces the same decision), and produces a uniform sample. The downside: errors that occur mid-pipeline don't get extra tracing — the decision was made before the error appeared.

Tail-based sampling. The decision is made at the sink, after the event's processing is complete and the operator knows whether the path was interesting (errored, slow, or otherwise notable). Tail-based sampling captures more of the interesting events at the cost of buffering all events' spans until the decision is made. For SDA's scale, tail-based sampling requires a separate buffering component (the OpenTelemetry collector's tail-sampling processor) that holds spans and applies policies at egress. More complex, but it catches errors that head-based misses.

The SDA pipeline uses head-based sampling at 1% by default, tail-based for the alert path specifically (where slow or errored alerts are operationally meaningful and head-based might miss them). The combination — head-based for general visibility, tail-based for SLO-relevant paths — is the production pattern.

Canary Observations

Synthetic observations injected at the source on a regular cadence. The canary has a known observation_id and known properties; it flows through the pipeline like any other event; the canary-watcher at the sink confirms it arrived within an expected latency. The pattern catches three problems that other observability mechanisms miss.

Pipeline-wide regression detection. A code change that breaks the pipeline's overall correctness (an operator that drops events; a serialization mismatch; a wiring error) shows up as canaries not arriving. The metric canary_arrived_total counter falling behind canary_emitted_total is the regression signal.

End-to-end latency under load. Real events have variable processing time depending on the input; the canary's known content makes its expected processing time known. A spike in canary latency without a corresponding spike in real-event latency is informative — it points at the processing path itself rather than at variable input characteristics.

Cold-start verification after deploys. After a deploy, the first canaries through the new pipeline confirm the deploy succeeded. The canary's first arrival is the deploy-success signal.

For SDA, canaries are emitted every 30 seconds at the source. Each carries deterministic content (a synthetic radar observation with a known timestamp and known orbital coordinates). The canary-watcher at the alert sink emits canary_arrival_latency_seconds as a histogram and canary_missed_total as a counter; alerts fire when the latency exceeds 60 seconds (twice the SLO) or when the missed count grows.

Diagnostic Patterns

The on-call engineer reads metrics, lineage, and tracing in a fixed order during an incident.

Symptom: "lag is rising." Investigation:

  1. Check pipeline lag and source lag separately (M6 L1's split). If source lag is the dominant component, the cause is upstream of ingestion — investigate partner-side health.
  2. If pipeline lag is the dominant component, check per-stage latency histograms (M6 L1's per-stage metrics). The slowest stage's P99 stands out.
  3. Check the slow stage's incoming channel occupancy (M4 L3's gradient). A persistently full channel just upstream of the slow stage confirms the diagnosis.
  4. Open the tracing UI for a recent slow-path event; the span tree shows where the per-event time is being spent.

Symptom: "alert subscriber reports a wrong alert." Investigation:

  1. Get the alert's event_id from the subscriber.
  2. Backward-walk the lineage (M6 L2) to find contributing inputs.
  3. Open each contributing input's trace; the per-operator spans show what each operator did with it.
  4. Compare against the canary's expected behavior — is the difference in the input or in the operator's processing?

Symptom: "DLQ entries spiking from operator X." Investigation:

  1. Read the latest DLQ entries' error_kind distribution (M5 L4 metadata).
  2. Pull a few original_payload samples; deserialize manually to confirm the partner-side change hypothesis.
  3. Trace forward (M6 L2) from the partner's last-known-good observation_id to find when the change started.
  4. Notify partner; once fixed, run sda-reprocess against the affected DLQ window.

The discipline is to have the order memorized and the tools at hand. The dashboard is structured to support this workflow; the runbook documents each pattern with concrete pointers to which dashboard panels and which queries to run.


Code Examples

Adding Tracing Spans to Operators

The minimal instrumentation. #[tracing::instrument] wraps the operator's per-event work in a span. The span's attributes include the event_id (for correlation with lineage and DLQ entries) and any operator-specific structured fields the engineer wants visible in the trace.

use anyhow::Result;
use tokio::sync::mpsc;
use tracing::{info, instrument};

#[instrument(
    skip(obs, output),
    fields(
        observation_id = %obs.observation_id,
        source_kind = ?obs.source_kind,
    ),
)]
async fn correlator_process(
    obs: Observation,
    output: &mpsc::Sender<ConjunctionRisk>,
) -> Result<()> {
    // The span is automatically created by the macro; it covers the
    // entire body of this function.
    info!("correlator received observation");
    let risk = compute_conjunction_risk(&obs).await?;
    output.send(risk).await?;
    info!("correlator emitted risk");
    Ok(())
}

The skip directive omits the bulky payload (obs and the output sender); fields adds the lightweight identifier. The info! calls inside the span add log lines that are correlated with the span automatically (the tracing crate's structured-log integration). Production code uses info! sparingly on the hot path — every log line is a serialization cost; debug! calls are the right verbosity for inner-loop instrumentation, with the log level configurable per operator at startup.
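
The per-operator level control can be wired through tracing-subscriber's EnvFilter. A sketch, with an assumed module path for the correlator's target:

use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Default everything to info; turn the correlator's inner loop up to
    // debug. The module path used as the target here is an assumption
    // about how the pipeline's crates are laid out.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,sda_fusion::correlator=debug"));
    tracing_subscriber::fmt().with_env_filter(filter).init();
}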

A Head-Based Sampling Layer

The sampling decision is made at the source operator. The lesson's hash-based approach from L2 applies here unchanged: the same observation_id deterministically samples to the same decision, which makes the trace data reproducible across replicas.

use anyhow::Result;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use tokio::sync::mpsc;
use uuid::Uuid;

pub struct TracingSampler {
    sample_rate_pct: u8,  // 0..=100
}

impl TracingSampler {
    pub fn new(sample_rate_pct: u8) -> Self {
        Self { sample_rate_pct: sample_rate_pct.min(100) }
    }

    pub fn should_trace(&self, observation_id: &Uuid) -> bool {
        let mut h = DefaultHasher::new();
        observation_id.hash(&mut h);
        let bucket = (h.finish() % 100) as u8;
        bucket < self.sample_rate_pct
    }
}

pub async fn run_source_with_tracing(
    sampler: &TracingSampler,
    output: mpsc::Sender<Observation>,
) -> Result<()> {
    loop {
        let mut obs = read_next_observation().await?;
        // Decide whether this event will be traced; if yes, generate a
        // trace_id and attach it. If no, leave trace_id as None.
        if sampler.should_trace(&obs.observation_id) {
            obs.trace_id = Some(TraceId::new_v4());
        }
        output.send(obs).await?;
    }
}

The deterministic hash means that for any given event_id, the same trace decision is made every time, across every replica and across pipeline restarts. Investigators can reliably ask "do we have a trace for event X?" and get a stable answer: "yes, here it is" or "no, that event was not sampled." Random sampling produces a different answer per call.

A Canary Injector and Watcher

The canary system is two operators: an injector at the source that emits a synthetic observation every 30 seconds, and a watcher at the sink that confirms each canary arrived within the expected latency.

use anyhow::Result;
use std::time::{Duration, SystemTime};
use tokio::sync::mpsc;
use tokio::time::interval;
use uuid::Uuid;

pub async fn run_canary_injector(
    output: mpsc::Sender<Observation>,
    cadence: Duration,
) -> Result<()> {
    let mut tick = interval(cadence);
    loop {
        tick.tick().await;
        let canary = build_canary_observation();
        let canary_id = canary.observation_id;
        metrics::counter!("canary_emitted_total").increment(1);
        output.send(canary).await
            .map_err(|_| anyhow::anyhow!("downstream dropped"))?;
        tracing::info!(canary_id = %canary_id, "canary emitted");
    }
}

pub async fn run_canary_watcher(
    mut input: mpsc::Receiver<Observation>,
    expected_max_latency: Duration,
) -> Result<()> {
    while let Some(obs) = input.recv().await {
        if !is_canary(&obs) { continue; }
        let now = SystemTime::now();
        let latency = now.duration_since(obs.sensor_timestamp)
            .unwrap_or(Duration::ZERO);
        metrics::histogram!("canary_arrival_latency_seconds")
            .record(latency.as_secs_f64());
        if latency > expected_max_latency {
            metrics::counter!("canary_late_total").increment(1);
            tracing::warn!(
                canary_id = %obs.observation_id,
                latency_secs = latency.as_secs_f64(),
                "canary arrived late",
            );
        }
    }
    Ok(())
}

fn build_canary_observation() -> Observation {
    // The canary has deterministic content with a recent timestamp.
    // The orbital coordinates are a fixed test track that never
    // produces real conjunctions, so the canary cannot pollute alerts.
    Observation {
        observation_id: Uuid::new_v4(),
        sensor_timestamp: SystemTime::now(),
        source_kind: SourceKind::Canary,
        // ... deterministic test values for the rest of the envelope ...
        ingest_timestamp: SystemTime::now(),
        target: ObservationTarget::test_canary_target(),
        uncertainty: Uncertainty { sigma: 0.1 },
        trace_id: None,  // canaries are tagged separately, not via the sampler
        lineage: None,
    }
}

fn is_canary(obs: &Observation) -> bool {
    matches!(obs.source_kind, SourceKind::Canary)
}

The canary uses its own SourceKind::Canary variant so downstream operators can distinguish canaries from real observations — useful when an operator's behavior should differ for canaries (e.g., the alert sink should drop canary-derived alerts rather than emit them). The is_canary function provides a single point for that filtering. The metrics emitted (canary_emitted_total, canary_arrival_latency_seconds, canary_late_total) feed the dashboard's canary panel and the alert that fires when canaries fall behind.


Key Takeaways

  • Distributed tracing completes the observability stack alongside metrics and lineage. Spans (per-event-per-operator) capture the what and when; trace_id propagation links spans across operators into one trace per event; the tracing backend (Jaeger, Zipkin, OpenTelemetry collector) makes per-event behavior queryable.
  • Head-based sampling (decide at source, deterministic hash, 1% default) is the standard. Tail-based sampling catches more interesting events at the cost of buffering; the SDA pipeline uses tail-based for SLO-relevant paths specifically.
  • Canary observations are synthetic events injected every 30 seconds with known content; the canary-watcher at the sink confirms expected arrival. Catches pipeline-wide regressions, end-to-end-latency-under-load issues, and cold-start verification after deploys.
  • The diagnostic patterns match three common symptom shapes: rising lag (split source lag vs pipeline lag, find the bottleneck stage, drill into spans), wrong alert (backward lineage walk + per-operator trace inspection), DLQ spikes (read DLQ metadata, partner-side hypothesis, forward lineage walk to find when the change started, sda-reprocess after fix).
  • The complete stack — metrics for is-it-wrong, lineage for which-events, tracing for why — is what turns operational observability from descriptive to actionable. The dashboard is structured to support the read-in-fixed-order workflow; the runbook documents each pattern with concrete dashboard pointers.

Capstone Project — Fusion Pipeline Observability Stack

Module: Data Pipelines — M06: Observability and Lineage
Estimated effort: 1–2 weeks of focused work
Prerequisites: All three lessons in this module passed at ≥70%


Mission Brief

OPS DIRECTIVE — SDA-2026-0245 / Phase 6 Implementation
Classification: OBSERVABILITY STAND-UP — FINAL TRACK MILESTONE

The pipeline at the start of Phase 6 is correct under load (M4), correct across restart (M5), correct in event time (M3), and correctly orchestrated (M2). It produces alerts the subscriber can trust to be exactly-once-effective. There is one remaining gap: the pipeline's correctness is invisible to operations during incidents. Last week's lag-detection incident took 3 hours to diagnose because the operational dashboard had pipeline-level metrics but not per-stage breakdowns, and the on-call engineer had to instrument the pipeline live to find the slow stage. The post-incident review concluded that observability — metrics, lineage, tracing — is the remaining missing piece.

Phase 6 installs the complete observability stack. The four golden signals plus lag, structured per-stage; SLI/SLO/SLA tracking with proactive alerting on SLO violations; per-event lineage at 1% sampling with backward and forward query support; distributed tracing with head-based + tail-based sampling; canary observations as the regression detector; an operational runbook that documents the diagnostic patterns for the three common symptom shapes.

Success criteria for Phase 6: the dashboard's primary panels show all five signals (throughput, latency, errors, lag, saturation) per stage and pipeline-level. The SLO compliance ratio is queryable as a gauge. A trained on-call engineer can diagnose any of the three incident patterns (rising lag, wrong alert, DLQ spike) within 5 minutes of a page using only the dashboard, lineage CLI, and tracing UI. Canaries fire on any pipeline-wide regression. This is the final module of the Data Pipelines track.


What You're Building

The complete operational observability stack for the SDA Fusion Service.

  1. Per-stage metrics instrumented on every operator: operator_events_total, operator_latency_seconds, operator_errors_total{kind=...}. The L1 patterns applied to every operator in the pipeline graph.
  2. Pipeline-level lag measured at the sink with the L1 split into source-lag and pipeline-lag, exported as source_lag_seconds{source=...} and pipeline_lag_seconds{source=...} histograms.
  3. SLO compliance calculator: tracks the alert latency SLI against the 30-second / 99.9% target over a rolling 30-day window. Emits slo_compliance_ratio as a gauge; alerts fire when the gauge drops below 0.999.
  4. Lineage tagging on every operator per L2: each operator's emit appends a LineageStep to the trace if sampling is enabled. Sampling rate 1% via deterministic hash; truncation at top-K parents at fan-in operators.
  5. Distributed tracing per L3: every operator wrapped in #[tracing::instrument(skip(payload), fields(observation_id))]; trace_id propagated on the envelope; head-based sampling at 1% via deterministic hash.
  6. Tail-based sampling for the alert path specifically: a tail-sampler buffers alert-path spans and emits all of them when the path errored or exceeded P99 latency.
  7. Canary injector + watcher: synthetic observations every 30 seconds; canary-watcher at the sink reports canary_arrival_latency_seconds, canary_late_total. Alerts fire on canaries falling behind the 60-second budget.
  8. Operational dashboard JSON committed to the repo: Grafana-format dashboard showing the four golden signals + lag per stage, SLO compliance, channel occupancy gradient, canary panel, DLQ growth rate.
  9. Operational runbook in docs/runbook-sda-pipeline.md covering the three diagnostic patterns from L3 with concrete dashboard panel references and CLI tool commands.
  10. sda-lineage CLI tool for backward and forward lineage queries against the sampled lineage corpus, supporting the L2 query directions.

The orchestrator from M2, all the operators from M3-M5, and the resilience tooling are unchanged in structure. The new components are wrappers and sidecars that observe the running pipeline; the operator graph grows by a few nodes (canary injector, canary watcher, tail-sampler).


Suggested Architecture

                    Pipeline (M2-M5, unchanged in structure)
   ┌────────────────────────────────────────────────────────────────┐
   │  Source ─→ Normalize ─→ Correlator ─→ Alert Sink ─→ Subscriber │
   └─────┬────────┬────────────┬───────────────┬────────────────────┘
         │        │            │               │
         │ instrumented  instrumented  instrumented  instrumented
         │ (events_total, latency, errors per stage)
         ▼        ▼            ▼               ▼
   ┌─────────────────────────────────────────────────────────────────┐
   │            Prometheus exporter (/metrics endpoint)              │
   └────────────────────────┬────────────────────────────────────────┘
                            ▼
                  ┌─────────────────────┐
                  │ Grafana dashboard:  │
                  │  - 5 golden signals │
                  │  - SLO compliance   │
                  │  - Channel gradient │
                  │  - Canary panel     │
                  │  - DLQ growth       │
                  └─────────────────────┘

   Trace flow:                          Lineage flow:
   ┌──────────┐                         ┌──────────────┐
   │ TraceId  │                         │ LineageStep  │
   │ on env   │                         │ appended per │
   │ if 1%-   │                         │ operator if  │
   │ sampled  │                         │ 1%-sampled   │
   └────┬─────┘                         └──────┬───────┘
        ▼                                      ▼
   ┌──────────┐    ┌──────────────┐    ┌──────────────┐
   │ tracing  │───▶│ Tail-sampler │    │ JSON-Lines   │
   │ spans    │    │ (alert path) │    │ corpus       │
   └────┬─────┘    └──────┬───────┘    └──────┬───────┘
        ▼                 ▼                   ▼
   ┌──────────────────────────────────┐   ┌──────────────┐
   │ Jaeger / OpenTelemetry collector │   │ sda-lineage  │
   └──────────────────────────────────┘   │ CLI tool     │
                                          └──────────────┘

   Canary flow:
   ┌────────┐                                    ┌────────┐
   │ Canary │ every 30s ─────────────────────────│ Canary │
   │ inj.   │ flows through pipeline normally    │ watcher│
   └────────┘                                    └────┬───┘
                                                      │
                                                 ┌─────▼───────┐
                                                 │ canary_*    │
                                                 │ metrics     │
                                                 └─────────────┘

Acceptance Criteria

Functional Requirements

  • Every operator instrumented with the L1 metrics: operator_events_total{stage}, operator_latency_seconds{stage} (histogram), operator_errors_total{stage, error_kind}
  • Sink-side lag operator emits source_lag_seconds{source} and pipeline_lag_seconds{source} per L1
  • SloCompliance calculator continuously emits slo_compliance_ratio{name="alert_latency"} as a gauge over a 30-day rolling window
  • Lineage tagging on every operator's emit; Option<LineageTrace> on the envelope; deterministic 1% hash sampling; top-K=4 truncation at fan-in operators
  • #[tracing::instrument] on every operator's per-event function; skip for the bulky payload, fields for observation_id and source_kind
  • Trace context propagation via Option<TraceId> on the envelope; head-based 1% sampling decision at the source operator
  • Tail-based sampler for the alert path: buffers alert-path spans for up to 30 seconds; emits all spans of any path that errored or exceeded its SLO latency target
  • Canary injector at every source emitting one canary every 30 seconds; canary-watcher at the alert sink emitting the L3 metrics
  • sda-lineage CLI tool with backward <event_id> and forward <ancestor_id> subcommands

Quality Requirements

  • Metric instrumentation overhead test: a benchmark comparing per-operator latency with and without instrumentation; the instrumented version must be within 5% of the non-instrumented baseline
  • SLO compliance correctness test: a unit test that feeds known latency values into the compliance calculator and asserts the computed ratio matches the analytical answer
  • Lineage round-trip test: an integration test that runs the pipeline against a known event sequence and asserts the sampled lineage corpus contains the expected event_id → ancestor_ids relationships
  • Canary regression-detection test: an integration test that injects a deliberate slowdown in the correlator and asserts canary_late_total increments while real-event metrics remain nominal — the canary's value-add property
  • All tail-sampler decisions are documented per-policy; the policy logic is tested independently from the rest of the pipeline

Operational Requirements

  • Grafana dashboard JSON committed to dashboards/sda-pipeline.json. Panels: top row shows pipeline-level (throughput, lag, error rate, SLO compliance); second row per-stage golden signals; third row channel-occupancy gradient + canary; fourth row DLQ growth + checkpoint age + recovery path
  • Runbook in docs/runbook-sda-pipeline.md covering: (1) reading the dashboard during an incident, (2) lag-rising diagnostic pattern, (3) wrong-alert diagnostic pattern, (4) DLQ-spike diagnostic pattern, (5) using the sda-lineage and sda-reprocess CLIs, (6) escalation paths
  • Per-error-kind documentation in the runbook: for each DlqErrorKind variant from M5 L4, a paragraph documenting the typical hypothesis and remediation
  • On-call training material: a 1-page summary of the dashboard layout and the diagnostic pattern decision tree, suitable for printing for the on-call rotation

Self-Assessed Stretch Goals

  • (self-assessed) End-to-end trace from radar source through correlator to alert subscriber visible in a tracing UI (Jaeger or similar) for any sampled event. Demonstrate with a screenshot showing the per-operator span tree.
  • (self-assessed) SLO compliance > 99.9% over a 24-hour soak test at sustained nominal load with periodic 10x bursts (the M4 burst-simulator test pattern). Provide the compliance gauge time series.
  • (self-assessed) The runbook's three diagnostic patterns are validated by tabletop exercise: an engineer who has not seen the runbook is given a synthetic incident scenario and the dashboard, and reaches the correct diagnosis within 5 minutes. Document the exercise's results.

Hints

How do I integrate with Prometheus efficiently?

The metrics crate (the facade) plus metrics-exporter-prometheus (the backend) is the standard. Initialize the exporter in main; the metrics::counter! and metrics::histogram! macros become cheap operations that write to a thread-local registry. The exporter exposes the registry's contents at /metrics in Prometheus text format.

use anyhow::Result;
use metrics_exporter_prometheus::PrometheusBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9100))
        .install()?;
    // ... rest of pipeline init ...
    Ok(())
}

Per-event metric emissions cost on the order of 50 ns each, well below noise in any operator. The 5% overhead requirement in the acceptance criteria is comfortably met.
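
A sketch of the overhead benchmark the quality requirements call for, assuming the criterion crate and two test shims (process_one_bare, process_one_instrumented) wrapping the correlator's per-event body; test_observation is an assumed fixture.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Compare the correlator's per-event body with and without the metric
// emissions; the 5% bound from the acceptance criteria is checked
// against the two reported means.
fn instrumentation_overhead(c: &mut Criterion) {
    let obs = test_observation();
    c.bench_function("correlator_bare", |b| {
        b.iter(|| process_one_bare(black_box(&obs)))
    });
    c.bench_function("correlator_instrumented", |b| {
        b.iter(|| process_one_instrumented(black_box(&obs)))
    });
}

criterion_group!(benches, instrumentation_overhead);
criterion_main!(benches);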

How do I bound lineage size at fan-in operators?

Top-K truncation. At a fan-in operator with N parent observations, sort the parents by some ranking criterion (uncertainty weight, contribution score, or simply the first K in event-time order), keep the top K, drop the rest. K=4 is a reasonable starting point for SDA's correlator (most conjunctions are determined by 2-4 dominant observations).

fn truncate_to_top_k(parents: Vec<(Uuid, f64)>, k: usize) -> Vec<Uuid> {
    let mut p = parents;
    p.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    p.into_iter().take(k).map(|(id, _)| id).collect()
}

The truncation is documented per-operator with a comment; investigators reading lineage from a truncated source need to know the top-K bound is in effect. The M6 dashboard's lineage panel surfaces this metadata.

How do I implement deterministic tail-based sampling?

The OpenTelemetry collector's tail-sampling-processor is the production reference. The pattern: every span is buffered in memory until its trace's full path is known (the OpenTelemetry collector uses a rolling window typically 30 seconds — long enough that any span will have completed but short enough that memory stays bounded). When a trace's path is complete, the policies are evaluated: if any policy matches, all spans of the trace are exported; otherwise, all are dropped.

For SDA's alert path, the policies are:

  • "any span had error" (matches whenever an operator errored)
  • "p99 latency exceeded" (matches when the trace's total latency exceeds the SLO)
  • "explicit sample" (matches when the head-based sampler tagged it)

# tail-sampling-processor configuration
policies:
  - name: error_traces
    type: status_code
    status_code: { status_codes: [ERROR] }
  - name: slow_traces
    type: latency
    latency: { threshold_ms: 30000 }
  - name: head_sampled
    type: boolean_attribute
    boolean_attribute: { key: head_sampled, value: true }

The configuration matches "any" rather than "all" — a trace is exported if any policy matches. This is the production default and the right shape for SLO-relevant paths.
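
For placement, the policies block lives under the collector's tail_sampling processor and is wired into the traces pipeline. A sketch; the receiver and exporter entries are illustrative.

receivers:
  otlp:
    protocols: { grpc: {} }
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: error_traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_traces
        type: latency
        latency: { threshold_ms: 30000 }
      - name: head_sampled
        type: boolean_attribute
        boolean_attribute: { key: head_sampled, value: true }
exporters:
  otlp:
    endpoint: jaeger:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]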

How do I structure the runbook?

The pattern that works in incidents:

  1. Top-of-page table of contents with the three symptom shapes as direct anchor links: rising lag, wrong alert, DLQ spike.
  2. Per-symptom diagnostic flowchart with concrete actions: which dashboard panel to open, what value range is normal, what the next step is for each branch of the diagnosis.
  3. Per-error-kind documentation for the DLQ: each DlqErrorKind variant gets a paragraph with the typical hypothesis, the verification step, and the remediation pattern.
  4. CLI tool reference with the canonical commands for sda-lineage and sda-reprocess — what flags to use for the impact-assessment, post-incident-analysis, and dry-run scenarios.
  5. Escalation paths at the bottom: when does the on-call engineer escalate, to whom, with what context.

The runbook is meant to be readable during an incident at 02:00. Avoid prose; favor numbered steps and direct links to dashboard panels. Every CLI command is copy-pastable with concrete example arguments.

How do I run the canary correctness test deterministically?

The same tokio::time::pause() pattern from M2 L4 and M5. The test pauses the runtime, advances time deterministically, drives the pipeline with synthetic events plus a deliberate slowdown injection, asserts the canary metrics. Combine with a fixed-seed RNG for any randomness in the pipeline (the priority classifier from M4, the dedup set's eviction policy if it has any) so the test is reproducible.

#[tokio::test(start_paused = true)]
async fn canary_detects_slowdown() -> anyhow::Result<()> {
    let mut pipeline = build_test_pipeline().await?;
    pipeline.inject_slowdown("correlator", 5).await;
    pipeline.run_with_canaries(Duration::from_secs(30)).await;
    tokio::time::advance(Duration::from_secs(300)).await;
    let m = pipeline.metrics();
    assert!(m.get("canary_late_total") > 0,
            "canary system should detect the deliberate slowdown");
    let real = m.get_histogram("operator_latency_seconds")
        .quantile(0.99);
    assert!(real < Duration::from_secs(5),
            "real-event P99 stays nominal because real traffic is sparse \
             in this test; the canary is what surfaces the regression");
    Ok(())
}

The test asserts the canary catches the regression that real-event metrics would have missed — the lesson's stated value-add of the canary system.


Getting Started

Recommended order:

  1. Per-operator metrics instrumentation. L1's pattern; wire metrics::counter!, metrics::histogram! into every operator. Verify the dashboard shows non-zero values at startup.
  2. Lag operator at the sink. L1's split-lag pattern; emit both source and pipeline lag per source.
  3. SLO compliance calculator. L1's per-window percentile + threshold check; emit slo_compliance_ratio.
  4. Lineage tagging. Add Option<LineageTrace> to the envelope; instrument every operator's emit logic; implement deterministic-hash sampling at the source.
  5. sda-lineage CLI tool. Read JSON-Lines lineage corpus, build forward index, expose backward and forward queries.
  6. Distributed tracing. Add Option<TraceId> to the envelope; #[tracing::instrument] on every operator; configure tracing-subscriber to emit OpenTelemetry-format spans to a collector.
  7. Tail-sampling for the alert path. Configure the OpenTelemetry collector's tail-sampling-processor with the three policies; verify by injecting a deliberate slow alert and confirming the trace is exported.
  8. Canary injector + watcher. L3's pattern; emit canaries every 30 seconds at every source; watcher at sink reports the metrics.
  9. Grafana dashboard JSON. Commit a complete dashboard with the panels described in the acceptance criteria. Test by importing into a local Grafana and confirming all panels render.
  10. Operational runbook. Document the three diagnostic patterns with concrete dashboard panel references and CLI commands. Run a tabletop exercise to validate the runbook works in practice.

Aim for the metrics + dashboard combo working end-to-end by day 5; lineage and tracing land in the second week along with the runbook polish. The tabletop-exercise stretch goal is end-of-week-2 finishing work and produces the strongest validation of the whole stack.


What This Module Completes

This is the final module of the Data Pipelines track. The pipeline at the end of M6 is correct, resilient under load, restart-safe, and operationally legible; every piece of the production-streaming-pipeline stack is in place. Operations can diagnose any of the three common incident patterns within minutes; the dashboard surfaces the right signals at the right granularity; the runbook documents the standard procedures; the CLI tooling supports both real-time investigation and post-incident analysis.

The patterns generalize beyond SDA. Every streaming pipeline that operates at scale — financial trading, ad-tech, IoT telemetry, distributed log processing — uses some combination of these six modules' techniques. The Meridian Space Academy curriculum's framing was specific (the SDA Fusion Service against debris fields and conjunction risk) but the techniques are universal. An engineer who has built the M2-M6 capstones has built every layer of a production streaming pipeline and can apply the same patterns to any other domain.

The next track in the Meridian Space curriculum (Data Lakes — Artemis Base Cold Archive) takes the M6 pipeline's emitted alerts and builds the durable archive layer beneath them. The track after that (Distributed Systems — Constellation Network) takes the single-process pipeline and distributes it across the 48-satellite compute grid. Both build on the foundation this track establishes.

The SDA Fusion Service is now a production-quality data pipeline. Phase 6 closes the engineering work.